Diffstat (limited to 'Documentation')
424 files changed, 16587 insertions, 9323 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-uids b/Documentation/ABI/removed/sysfs-kernel-uids index 4182b7061816..dc4463f190a7 100644 --- a/Documentation/ABI/testing/sysfs-kernel-uids +++ b/Documentation/ABI/removed/sysfs-kernel-uids @@ -1,5 +1,5 @@ What: /sys/kernel/uids/<uid>/cpu_shares -Date: December 2007 +Date: December 2007, finally removed in kernel v2.6.34-rc1 Contact: Dhaval Giani <dhaval@linux.vnet.ibm.com> Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Description: diff --git a/Documentation/ABI/testing/configfs-most b/Documentation/ABI/testing/configfs-most new file mode 100644 index 000000000000..ed67a4d9f6d6 --- /dev/null +++ b/Documentation/ABI/testing/configfs-most @@ -0,0 +1,196 @@ +What: /sys/kernel/config/most_<component> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: Interface is used to configure and connect device channels + to component drivers. + + Attributes are visible only when configfs is mounted. To mount + configfs in /sys/kernel/config directory use: + # mount -t configfs none /sys/kernel/config/ + + +What: /sys/kernel/config/most_cdev/<link> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by 
some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_video/<link> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. 
+ + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_net/<link> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link + +What: /sys/kernel/config/most_sound/<card> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + create_card write '1' to this attribute to trigger the + registration of the sound card with the ALSA + subsystem. 
+ +What: /sys/kernel/config/most_sound/<card>/<link> +Date: March 8, 2019 +KernelVersion: 5.2 +Description: + The attributes: + + buffer_size configure the buffer size for this channel + + subbuffer_size configure the sub-buffer size for this channel + (needed for synchronous and isochrnous data) + + + num_buffers configure number of buffers used for this + channel + + datatype configure type of data that will travel over + this channel + + direction configure whether this link will be an input + or output + + dbr_size configure DBR data buffer size (this is used + for MediaLB communication only) + + packets_per_xact + configure the number of packets that will be + collected from the network before being + transmitted via USB (this is used for USB + communication only) + + device name of the device the link is to be attached to + + channel name of the channel the link is to be attached to + + comp_params pass parameters needed by some components + + create_link write '1' to this attribute to trigger the + creation of the link. In case of speculative + configuration, the creation is post-poned until + a physical device is being attached to the bus. + + destroy_link write '1' to this attribute to destroy an + active link diff --git a/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 b/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 index 46b1f33b2fce..eac32180c40d 100644 --- a/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 +++ b/Documentation/ABI/testing/sysfs-bus-counter-104-quad-8 @@ -1,3 +1,28 @@ +What: /sys/bus/counter/devices/counterX/signalY/cable_fault +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Read-only attribute that indicates whether a differential + encoder cable fault (not connected or loose wires) is detected + for the respective channel of Signal Y. Valid attribute values + are boolean. Detection must first be enabled via the + corresponding cable_fault_enable attribute. 
+ +What: /sys/bus/counter/devices/counterX/signalY/cable_fault_enable +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Whether detection of differential encoder cable faults for the + respective channel of Signal Y is enabled. Valid attribute + values are boolean. + +What: /sys/bus/counter/devices/counterX/signalY/filter_clock_prescaler +KernelVersion: 5.7 +Contact: linux-iio@vger.kernel.org +Description: + Filter clock factor for input Signal Y. This prescaler value + affects the inputs of both quadrature pair signals. + What: /sys/bus/counter/devices/counterX/signalY/index_polarity KernelVersion: 5.2 Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 index 7627d3be08f5..f8315202c8f0 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 +++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7192 @@ -2,17 +2,22 @@ What: /sys/bus/iio/devices/iio:deviceX/ac_excitation_en KernelVersion: Contact: linux-iio@vger.kernel.org Description: - Reading gives the state of AC excitation. - Writing '1' enables AC excitation. + This attribute, if available, is used to enable the AC + excitation mode found on some converters. In ac excitation mode, + the polarity of the excitation voltage is reversed on + alternate cycles, to eliminate DC errors. What: /sys/bus/iio/devices/iio:deviceX/bridge_switch_en KernelVersion: Contact: linux-iio@vger.kernel.org Description: - This bridge switch is used to disconnect it when there is a - need to minimize the system current consumption. - Reading gives the state of the bridge switch. - Writing '1' enables the bridge switch. + This attribute, if available, is used to close or open the + bridge power down switch found on some converters. + In bridge applications, such as strain gauges and load cells, + the bridge itself consumes the majority of the current in the + system. 
To minimize the current consumption of the system, + the bridge can be disconnected (when it is not being used + using the bridge_switch_en attribute. What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration KernelVersion: @@ -21,6 +26,13 @@ Description: Initiates the system calibration procedure. This is done on a single channel at a time. Write '1' to start the calibration. +What: /sys/bus/iio/devices/iio:deviceX/in_voltage2-voltage2_shorted_raw +KernelVersion: +Contact: linux-iio@vger.kernel.org +Description: + Measure voltage from AIN2 pin connected to AIN(+) + and AIN(-) shorted. + What: /sys/bus/iio/devices/iio:deviceX/in_voltagex_sys_calibration_mode_available KernelVersion: Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-most b/Documentation/ABI/testing/sysfs-bus-most new file mode 100644 index 000000000000..6b1d06e3285e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-most @@ -0,0 +1,295 @@ +What: /sys/bus/most/devices/.../description +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Provides information about the interface type and the physical + location of the device. Hardware attached via USB, for instance, + might return <1-1.1:1.0> +Users: + +What: /sys/bus/most/devices/.../interface +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the type of peripheral interface the device uses. +Users: + +What: /sys/bus/most/devices/.../dci +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + If the network interface controller is attached via USB, a dci + directory is created that allows applications to read and + write the controller's DCI registers. 
+Users: + +What: /sys/bus/most/devices/.../dci/arb_address +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to set an arbitrary DCI register address an + application wants to read from or write to. +Users: + +What: /sys/bus/most/devices/.../dci/arb_value +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to read and write the DCI register whose address + is stored in arb_address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_hi +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_lo +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_eui48_mi +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MAC address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_filter +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MEP filter address. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash0 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash1 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MEP hash table. 
+Users: + +What: /sys/bus/most/devices/.../dci/mep_hash2 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/mep_hash3 +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to check and configure the MEP hash table. +Users: + +What: /sys/bus/most/devices/.../dci/ni_state +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the current network interface state. +Users: + +What: /sys/bus/most/devices/.../dci/node_address +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the current node address. +Users: + +What: /sys/bus/most/devices/.../dci/node_position +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the current node position. +Users: + +What: /sys/bus/most/devices/.../dci/packet_bandwidth +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the configured packet bandwidth. +Users: + +What: /sys/bus/most/devices/.../dci/sync_ep +Date: June 2016 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Triggers the controller's synchronization process for a certain + endpoint. +Users: + +What: /sys/bus/most/devices/.../<channel>/ +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + For every channel of the device a directory is created, whose + name is dictated by the HDM. This enables an application to + collect information about the channel's capabilities and + configure it. 
+Users: + +What: /sys/bus/most/devices/.../<channel>/available_datatypes +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the data types the current channel can transport. +Users: + +What: /sys/bus/most/devices/.../<channel>/available_directions +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the directions the current channel is capable of. +Users: + +What: /sys/bus/most/devices/.../<channel>/number_of_packet_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the number of packet buffers the current channel can + handle. +Users: + +What: /sys/bus/most/devices/.../<channel>/number_of_stream_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the number of streaming buffers the current channel can + handle. +Users: + +What: /sys/bus/most/devices/.../<channel>/size_of_packet_buffer +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the size of a packet buffer the current channel can + handle. +Users: + +What: /sys/bus/most/devices/.../<channel>/size_of_stream_buffer +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates the size of a streaming buffer the current channel can + handle. +Users: + +What: /sys/bus/most/devices/.../<channel>/set_number_of_buffers +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the number of buffers of the current channel. 
+Users: + +What: /sys/bus/most/devices/.../<channel>/set_buffer_size +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the size of a buffer of the current channel. +Users: + +What: /sys/bus/most/devices/.../<channel>/set_direction +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the direction of the current channel. + The following strings will be accepted: + 'dir_tx', + 'dir_rx' +Users: + +What: /sys/bus/most/devices/.../<channel>/set_datatype +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the data type of the current channel. + The following strings will be accepted: + 'control', + 'async', + 'sync', + 'isoc_avp' +Users: + +What: /sys/bus/most/devices/.../<channel>/set_subbuffer_size +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the subbuffer size of the current channel. +Users: + +What: /sys/bus/most/devices/.../<channel>/set_packets_per_xact +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is to configure the number of packets per transaction of + the current channel. This is only needed network interface + controller is attached via USB. +Users: + +What: /sys/bus/most/devices/.../<channel>/channel_starving +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + Indicates whether current channel ran out of buffers. +Users: + +What: /sys/bus/most/drivers/most_core/components +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to retrieve a list of registered components. 
+Users: + +What: /sys/bus/most/drivers/most_core/links +Date: March 2017 +KernelVersion: 4.15 +Contact: Christian Gromm <christian.gromm@microchip.com> +Description: + This is used to retrieve a list of established links. +Users: diff --git a/Documentation/ABI/testing/sysfs-class-typec b/Documentation/ABI/testing/sysfs-class-typec index d7647b258c3c..b834671522d6 100644 --- a/Documentation/ABI/testing/sysfs-class-typec +++ b/Documentation/ABI/testing/sysfs-class-typec @@ -20,13 +20,13 @@ Date: April 2017 Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com> Description: The supported power roles. This attribute can be used to request - power role swap on the port when the port supports USB Power - Delivery. Swapping is supported as synchronous operation, so - write(2) to the attribute will not return until the operation - has finished. The attribute is notified about role changes so - that poll(2) on the attribute wakes up. Change on the role will - also generate uevent KOBJ_CHANGE. The current role is show in - brackets, for example "[source] sink" when in source mode. + power role swap on the port. Swapping is supported as + synchronous operation, so write(2) to the attribute will not + return until the operation has finished. The attribute is + notified about role changes so that poll(2) on the attribute + wakes up. Change on the role will also generate uevent + KOBJ_CHANGE. The current role is show in brackets, for example + "[source] sink" when in source mode. Valid values: source, sink @@ -108,6 +108,15 @@ Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com> Description: Revision number of the supported USB Type-C specification. +What: /sys/class/typec/<port>/orientation +Date: February 2020 +Contact: Badhri Jagan Sridharan <badhri@google.com> +Description: + Indicates the active orientation of the Type-C connector. + Valid values: + - "normal": CC1 orientation + - "reverse": CC2 orientation + - "unknown": Orientation cannot be determined. 
USB Type-C partner devices (eg. /sys/class/typec/port0-partner/) diff --git a/Documentation/EDID/1024x768.S b/Documentation/EDID/1024x768.S deleted file mode 100644 index 4aed3f9ab88a..000000000000 --- a/Documentation/EDID/1024x768.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1024x768.S: EDID data set for standard 1024x768 60 Hz monitor - - Copyright (C) 2011 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 65000 /* kHz */ -#define XPIX 1024 -#define YPIX 768 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 320 -#define YBLANK 38 -#define XOFFSET 8 -#define XPULSE 144 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux XGA" -#define ESTABLISHED_TIMING2_BITS 0x08 /* Bit 3 -> 1024x768 @60 Hz */ -#define HSYNC_POL 0 -#define VSYNC_POL 0 - -#include "edid.S" diff --git a/Documentation/EDID/1280x1024.S b/Documentation/EDID/1280x1024.S deleted file mode 100644 index b26dd424cad7..000000000000 --- a/Documentation/EDID/1280x1024.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1280x1024.S: EDID data set for standard 1280x1024 60 Hz monitor - - Copyright (C) 2011 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 108000 /* kHz */ -#define XPIX 1280 -#define YPIX 1024 -#define XY_RATIO XY_RATIO_5_4 -#define XBLANK 408 -#define YBLANK 42 -#define XOFFSET 48 -#define XPULSE 112 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1600x1200.S b/Documentation/EDID/1600x1200.S deleted file mode 100644 index 0d091b282768..000000000000 --- a/Documentation/EDID/1600x1200.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1600x1200.S: EDID data set for standard 1600x1200 60 Hz monitor - - Copyright (C) 2013 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 162000 /* kHz */ -#define XPIX 1600 -#define YPIX 1200 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 560 -#define YBLANK 50 -#define XOFFSET 64 -#define XPULSE 192 -#define YOFFSET 1 -#define YPULSE 3 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux UXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1680x1050.S b/Documentation/EDID/1680x1050.S deleted file mode 100644 index 7dfed9a33eab..000000000000 --- a/Documentation/EDID/1680x1050.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1680x1050.S: EDID data set for standard 1680x1050 60 Hz monitor - - Copyright (C) 2012 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 146250 /* kHz */ -#define XPIX 1680 -#define YPIX 1050 -#define XY_RATIO XY_RATIO_16_10 -#define XBLANK 560 -#define YBLANK 39 -#define XOFFSET 104 -#define XPULSE 176 -#define YOFFSET 3 -#define YPULSE 6 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux WSXGA" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/1920x1080.S b/Documentation/EDID/1920x1080.S deleted file mode 100644 index d6ffbba28e95..000000000000 --- a/Documentation/EDID/1920x1080.S +++ /dev/null @@ -1,43 +0,0 @@ -/* - 1920x1080.S: EDID data set for standard 1920x1080 60 Hz monitor - - Copyright (C) 2012 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 148500 /* kHz */ -#define XPIX 1920 -#define YPIX 1080 -#define XY_RATIO XY_RATIO_16_9 -#define XBLANK 280 -#define YBLANK 45 -#define XOFFSET 88 -#define XPULSE 44 -#define YOFFSET 4 -#define YPULSE 5 -#define DPI 96 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux FHD" -/* No ESTABLISHED_TIMINGx_BITS */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/800x600.S b/Documentation/EDID/800x600.S deleted file mode 100644 index a5616588de08..000000000000 --- a/Documentation/EDID/800x600.S +++ /dev/null @@ -1,40 +0,0 @@ -/* - 800x600.S: EDID data set for standard 800x600 60 Hz monitor - - Copyright (C) 2011 Carsten Emde <C.Emde@osadl.org> - Copyright (C) 2014 Linaro Limited - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. 
-*/ - -/* EDID */ -#define VERSION 1 -#define REVISION 3 - -/* Display */ -#define CLOCK 40000 /* kHz */ -#define XPIX 800 -#define YPIX 600 -#define XY_RATIO XY_RATIO_4_3 -#define XBLANK 256 -#define YBLANK 28 -#define XOFFSET 40 -#define XPULSE 128 -#define YOFFSET 1 -#define YPULSE 4 -#define DPI 72 -#define VFREQ 60 /* Hz */ -#define TIMING_NAME "Linux SVGA" -#define ESTABLISHED_TIMING1_BITS 0x01 /* Bit 0: 800x600 @ 60Hz */ -#define HSYNC_POL 1 -#define VSYNC_POL 1 - -#include "edid.S" diff --git a/Documentation/EDID/Makefile b/Documentation/EDID/Makefile deleted file mode 100644 index 85a927dfab02..000000000000 --- a/Documentation/EDID/Makefile +++ /dev/null @@ -1,37 +0,0 @@ - -SOURCES := $(wildcard [0-9]*x[0-9]*.S) - -BIN := $(patsubst %.S, %.bin, $(SOURCES)) - -IHEX := $(patsubst %.S, %.bin.ihex, $(SOURCES)) - -CODE := $(patsubst %.S, %.c, $(SOURCES)) - -all: $(BIN) $(IHEX) $(CODE) - -clean: - @rm -f *.o *.bin.ihex *.bin *.c - -%.o: %.S - @cc -c $^ - -%.bin.nocrc: %.o - @objcopy -Obinary $^ $@ - -%.crc: %.bin.nocrc - @list=$$(for i in `seq 1 127`; do head -c$$i $^ | tail -c1 \ - | hexdump -v -e '/1 "%02X+"'; done); \ - echo "ibase=16;100-($${list%?})%100" | bc >$@ - -%.p: %.crc %.S - @cc -c -DCRC="$$(cat $*.crc)" -o $@ $*.S - -%.bin: %.p - @objcopy -Obinary $^ $@ - -%.bin.ihex: %.p - @objcopy -Oihex $^ $@ - @dos2unix $@ 2>/dev/null - -%.c: %.bin - @echo "{" >$@; hexdump -f hex $^ >>$@; echo "};" >>$@ diff --git a/Documentation/EDID/edid.S b/Documentation/EDID/edid.S deleted file mode 100644 index c3d13815526d..000000000000 --- a/Documentation/EDID/edid.S +++ /dev/null @@ -1,274 +0,0 @@ -/* - edid.S: EDID data template - - Copyright (C) 2012 Carsten Emde <C.Emde@osadl.org> - - This program is free software; you can redistribute it and/or - modify it under the terms of the GNU General Public License - as published by the Free Software Foundation; either version 2 - of the License, or (at your option) any later version. 
- - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. -*/ - - -/* Manufacturer */ -#define MFG_LNX1 'L' -#define MFG_LNX2 'N' -#define MFG_LNX3 'X' -#define SERIAL 0 -#define YEAR 2012 -#define WEEK 5 - -/* EDID 1.3 standard definitions */ -#define XY_RATIO_16_10 0b00 -#define XY_RATIO_4_3 0b01 -#define XY_RATIO_5_4 0b10 -#define XY_RATIO_16_9 0b11 - -/* Provide defaults for the timing bits */ -#ifndef ESTABLISHED_TIMING1_BITS -#define ESTABLISHED_TIMING1_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING2_BITS -#define ESTABLISHED_TIMING2_BITS 0x00 -#endif -#ifndef ESTABLISHED_TIMING3_BITS -#define ESTABLISHED_TIMING3_BITS 0x00 -#endif - -#define mfgname2id(v1,v2,v3) \ - ((((v1-'@')&0x1f)<<10)+(((v2-'@')&0x1f)<<5)+((v3-'@')&0x1f)) -#define swap16(v1) ((v1>>8)+((v1&0xff)<<8)) -#define lsbs2(v1,v2) (((v1&0x0f)<<4)+(v2&0x0f)) -#define msbs2(v1,v2) ((((v1>>8)&0x0f)<<4)+((v2>>8)&0x0f)) -#define msbs4(v1,v2,v3,v4) \ - ((((v1>>8)&0x03)<<6)+(((v2>>8)&0x03)<<4)+\ - (((v3>>4)&0x03)<<2)+((v4>>4)&0x03)) -#define pixdpi2mm(pix,dpi) ((pix*25)/dpi) -#define xsize pixdpi2mm(XPIX,DPI) -#define ysize pixdpi2mm(YPIX,DPI) - - .data - -/* Fixed header pattern */ -header: .byte 0x00,0xff,0xff,0xff,0xff,0xff,0xff,0x00 - -mfg_id: .hword swap16(mfgname2id(MFG_LNX1, MFG_LNX2, MFG_LNX3)) - -prod_code: .hword 0 - -/* Serial number. 32 bits, little endian. */ -serial_number: .long SERIAL - -/* Week of manufacture */ -week: .byte WEEK - -/* Year of manufacture, less 1990. 
(1990-2245) - If week=255, it is the model year instead */ -year: .byte YEAR-1990 - -version: .byte VERSION /* EDID version, usually 1 (for 1.3) */ -revision: .byte REVISION /* EDID revision, usually 3 (for 1.3) */ - -/* If Bit 7=1 Digital input. If set, the following bit definitions apply: - Bits 6-1 Reserved, must be 0 - Bit 0 Signal is compatible with VESA DFP 1.x TMDS CRGB, - 1 pixel per clock, up to 8 bits per color, MSB aligned, - If Bit 7=0 Analog input. If clear, the following bit definitions apply: - Bits 6-5 Video white and sync levels, relative to blank - 00=+0.7/-0.3 V; 01=+0.714/-0.286 V; - 10=+1.0/-0.4 V; 11=+0.7/0 V - Bit 4 Blank-to-black setup (pedestal) expected - Bit 3 Separate sync supported - Bit 2 Composite sync (on HSync) supported - Bit 1 Sync on green supported - Bit 0 VSync pulse must be serrated when somposite or - sync-on-green is used. */ -video_parms: .byte 0x6d - -/* Maximum horizontal image size, in centimetres - (max 292 cm/115 in at 16:9 aspect ratio) */ -max_hor_size: .byte xsize/10 - -/* Maximum vertical image size, in centimetres. - If either byte is 0, undefined (e.g. projector) */ -max_vert_size: .byte ysize/10 - -/* Display gamma, minus 1, times 100 (range 1.00-3.5 */ -gamma: .byte 120 - -/* Bit 7 DPMS standby supported - Bit 6 DPMS suspend supported - Bit 5 DPMS active-off supported - Bits 4-3 Display type: 00=monochrome; 01=RGB colour; - 10=non-RGB multicolour; 11=undefined - Bit 2 Standard sRGB colour space. Bytes 25-34 must contain - sRGB standard values. - Bit 1 Preferred timing mode specified in descriptor block 1. - Bit 0 GTF supported with default parameter values. */ -dsp_features: .byte 0xea - -/* Chromaticity coordinates. 
*/ -/* Red and green least-significant bits - Bits 7-6 Red x value least-significant 2 bits - Bits 5-4 Red y value least-significant 2 bits - Bits 3-2 Green x value lst-significant 2 bits - Bits 1-0 Green y value least-significant 2 bits */ -red_green_lsb: .byte 0x5e - -/* Blue and white least-significant 2 bits */ -blue_white_lsb: .byte 0xc0 - -/* Red x value most significant 8 bits. - 0-255 encodes 0-0.996 (255/256); 0-0.999 (1023/1024) with lsbits */ -red_x_msb: .byte 0xa4 - -/* Red y value most significant 8 bits */ -red_y_msb: .byte 0x59 - -/* Green x and y value most significant 8 bits */ -green_x_y_msb: .byte 0x4a,0x98 - -/* Blue x and y value most significant 8 bits */ -blue_x_y_msb: .byte 0x25,0x20 - -/* Default white point x and y value most significant 8 bits */ -white_x_y_msb: .byte 0x50,0x54 - -/* Established timings */ -/* Bit 7 720x400 @ 70 Hz - Bit 6 720x400 @ 88 Hz - Bit 5 640x480 @ 60 Hz - Bit 4 640x480 @ 67 Hz - Bit 3 640x480 @ 72 Hz - Bit 2 640x480 @ 75 Hz - Bit 1 800x600 @ 56 Hz - Bit 0 800x600 @ 60 Hz */ -estbl_timing1: .byte ESTABLISHED_TIMING1_BITS - -/* Bit 7 800x600 @ 72 Hz - Bit 6 800x600 @ 75 Hz - Bit 5 832x624 @ 75 Hz - Bit 4 1024x768 @ 87 Hz, interlaced (1024x768) - Bit 3 1024x768 @ 60 Hz - Bit 2 1024x768 @ 72 Hz - Bit 1 1024x768 @ 75 Hz - Bit 0 1280x1024 @ 75 Hz */ -estbl_timing2: .byte ESTABLISHED_TIMING2_BITS - -/* Bit 7 1152x870 @ 75 Hz (Apple Macintosh II) - Bits 6-0 Other manufacturer-specific display mod */ -estbl_timing3: .byte ESTABLISHED_TIMING3_BITS - -/* Standard timing */ -/* X resolution, less 31, divided by 8 (256-2288 pixels) */ -std_xres: .byte (XPIX/8)-31 -/* Y resolution, X:Y pixel ratio - Bits 7-6 X:Y pixel ratio: 00=16:10; 01=4:3; 10=5:4; 11=16:9. - Bits 5-0 Vertical frequency, less 60 (60-123 Hz) */ -std_vres: .byte (XY_RATIO<<6)+VFREQ-60 - .fill 7,2,0x0101 /* Unused */ - -descriptor1: -/* Pixel clock in 10 kHz units. 
(0.-655.35 MHz, little-endian) */ -clock: .hword CLOCK/10 - -/* Horizontal active pixels 8 lsbits (0-4095) */ -x_act_lsb: .byte XPIX&0xff -/* Horizontal blanking pixels 8 lsbits (0-4095) - End of active to start of next active. */ -x_blk_lsb: .byte XBLANK&0xff -/* Bits 7-4 Horizontal active pixels 4 msbits - Bits 3-0 Horizontal blanking pixels 4 msbits */ -x_msbs: .byte msbs2(XPIX,XBLANK) - -/* Vertical active lines 8 lsbits (0-4095) */ -y_act_lsb: .byte YPIX&0xff -/* Vertical blanking lines 8 lsbits (0-4095) */ -y_blk_lsb: .byte YBLANK&0xff -/* Bits 7-4 Vertical active lines 4 msbits - Bits 3-0 Vertical blanking lines 4 msbits */ -y_msbs: .byte msbs2(YPIX,YBLANK) - -/* Horizontal sync offset pixels 8 lsbits (0-1023) From blanking start */ -x_snc_off_lsb: .byte XOFFSET&0xff -/* Horizontal sync pulse width pixels 8 lsbits (0-1023) */ -x_snc_pls_lsb: .byte XPULSE&0xff -/* Bits 7-4 Vertical sync offset lines 4 lsbits (0-63) - Bits 3-0 Vertical sync pulse width lines 4 lsbits (0-63) */ -y_snc_lsb: .byte lsbs2(YOFFSET, YPULSE) -/* Bits 7-6 Horizontal sync offset pixels 2 msbits - Bits 5-4 Horizontal sync pulse width pixels 2 msbits - Bits 3-2 Vertical sync offset lines 2 msbits - Bits 1-0 Vertical sync pulse width lines 2 msbits */ -xy_snc_msbs: .byte msbs4(XOFFSET,XPULSE,YOFFSET,YPULSE) - -/* Horizontal display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -x_dsp_size: .byte xsize&0xff - -/* Vertical display size, mm, 8 lsbits (0-4095 mm, 161 in) */ -y_dsp_size: .byte ysize&0xff - -/* Bits 7-4 Horizontal display size, mm, 4 msbits - Bits 3-0 Vertical display size, mm, 4 msbits */ -dsp_size_mbsb: .byte msbs2(xsize,ysize) - -/* Horizontal border pixels (each side; total is twice this) */ -x_border: .byte 0 -/* Vertical border lines (each side; total is twice this) */ -y_border: .byte 0 - -/* Bit 7 Interlaced - Bits 6-5 Stereo mode: 00=No stereo; other values depend on bit 0: - Bit 0=0: 01=Field sequential, sync=1 during right; 10=similar, - sync=1 during left; 11=4-way 
interleaved stereo - Bit 0=1 2-way interleaved stereo: 01=Right image on even lines; - 10=Left image on even lines; 11=side-by-side - Bits 4-3 Sync type: 00=Analog composite; 01=Bipolar analog composite; - 10=Digital composite (on HSync); 11=Digital separate - Bit 2 If digital separate: Vertical sync polarity (1=positive) - Other types: VSync serrated (HSync during VSync) - Bit 1 If analog sync: Sync on all 3 RGB lines (else green only) - Digital: HSync polarity (1=positive) - Bit 0 2-way line-interleaved stereo, if bits 4-3 are not 00. */ -features: .byte 0x18+(VSYNC_POL<<2)+(HSYNC_POL<<1) - -descriptor2: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xff /* Descriptor is monitor serial number (text) */ - .byte 0 /* Must be zero */ -start1: .ascii "Linux #0" -end1: .byte 0x0a /* End marker */ - .fill 12-(end1-start1), 1, 0x20 /* Padded spaces */ -descriptor3: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfd /* Descriptor is monitor range limits */ - .byte 0 /* Must be zero */ -start2: .byte VFREQ-1 /* Minimum vertical field rate (1-255 Hz) */ - .byte VFREQ+1 /* Maximum vertical field rate (1-255 Hz) */ - .byte (CLOCK/(XPIX+XBLANK))-1 /* Minimum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/(XPIX+XBLANK))+1 /* Maximum horizontal line rate - (1-255 kHz) */ - .byte (CLOCK/10000)+1 /* Maximum pixel clock rate, rounded up - to 10 MHz multiple (10-2550 MHz) */ - .byte 0 /* No extended timing information type */ -end2: .byte 0x0a /* End marker */ - .fill 12-(end2-start2), 1, 0x20 /* Padded spaces */ -descriptor4: .byte 0,0 /* Not a detailed timing descriptor */ - .byte 0 /* Must be zero */ - .byte 0xfc /* Descriptor is text */ - .byte 0 /* Must be zero */ -start3: .ascii TIMING_NAME -end3: .byte 0x0a /* End marker */ - .fill 12-(end3-start3), 1, 0x20 /* Padded spaces */ -extensions: .byte 0 /* Number of extensions to follow */ -checksum: .byte CRC /* Sum of all bytes must be 0 */ 
diff --git a/Documentation/EDID/hex b/Documentation/EDID/hex deleted file mode 100644 index 8873ebb618af..000000000000 --- a/Documentation/EDID/hex +++ /dev/null @@ -1 +0,0 @@ -"\t" 8/1 "0x%02x, " "\n" diff --git a/Documentation/Makefile b/Documentation/Makefile index d77bb607aea4..79ecee62d597 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -13,7 +13,7 @@ endif SPHINXBUILD = sphinx-build SPHINXOPTS = SPHINXDIRS = . -_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)) +_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst))) SPHINX_CONF = conf.py PAPER = BUILDDIR = $(obj)/output diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst index 6864f9a70f5f..8c016d8c9862 100644 --- a/Documentation/PCI/pci.rst +++ b/Documentation/PCI/pci.rst @@ -239,7 +239,7 @@ from the PCI device config space. Use the values in the pci_dev structure as the PCI "bus address" might have been remapped to a "host physical" address by the arch/chip-set specific kernel support. -See Documentation/io-mapping.txt for how to access device registers +See Documentation/driver-api/io-mapping.rst for how to access device registers or device memory. The device driver needs to call pci_request_region() to verify diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst index 1a8b129cfc04..83ae3b79a643 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -4,7 +4,7 @@ A Tour Through TREE_RCU's Grace-Period Memory Ordering August 8, 2017 -This article was contributed by Paul E. McKenney +This article was contributed by Paul E. 
McKenney Introduction ============ @@ -48,7 +48,7 @@ Tree RCU Grace Period Memory Ordering Building Blocks The workhorse for RCU's grace-period memory ordering is the critical section for the ``rcu_node`` structure's -``->lock``. These critical sections use helper functions for lock +``->lock``. These critical sections use helper functions for lock acquisition, including ``raw_spin_lock_rcu_node()``, ``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, @@ -102,9 +102,9 @@ lock-acquisition and lock-release functions:: 23 r3 = READ_ONCE(x); 24 } 25 - 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); -The ``WARN_ON()`` is evaluated at “the end of time”, +The ``WARN_ON()`` is evaluated at "the end of time", after all changes have propagated throughout the system. Without the ``smp_mb__after_unlock_lock()`` provided by the acquisition functions, this ``WARN_ON()`` could trigger, for example diff --git a/Documentation/RCU/listRCU.rst b/Documentation/RCU/listRCU.rst index 7956ff33042b..2a643e293fb4 100644 --- a/Documentation/RCU/listRCU.rst +++ b/Documentation/RCU/listRCU.rst @@ -4,12 +4,61 @@ Using RCU to Protect Read-Mostly Linked Lists ============================================= One of the best applications of RCU is to protect read-mostly linked lists -("struct list_head" in list.h). One big advantage of this approach +(``struct list_head`` in list.h). One big advantage of this approach is that all of the required memory barriers are included for you in the list macros. This document describes several applications of RCU, with the best fits first. -Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates + +Example 1: Read-mostly list: Deferred Destruction +------------------------------------------------- + +A widely used usecase for RCU lists in the kernel is lockless iteration over +all processes in the system. 
``task_struct::tasks`` represents the list node that +links all the processes. The list can be traversed in parallel to any list +additions or removals. + +The traversal of the list is done using ``for_each_process()`` which is defined +by the 2 macros:: + + #define next_task(p) \ + list_entry_rcu((p)->tasks.next, struct task_struct, tasks) + + #define for_each_process(p) \ + for (p = &init_task ; (p = next_task(p)) != &init_task ; ) + +The code traversing the list of all processes typically looks like:: + + rcu_read_lock(); + for_each_process(p) { + /* Do something with p */ + } + rcu_read_unlock(); + +The simplified code for removing a process from a task list is:: + + void release_task(struct task_struct *p) + { + write_lock(&tasklist_lock); + list_del_rcu(&p->tasks); + write_unlock(&tasklist_lock); + call_rcu(&p->rcu, delayed_put_task_struct); + } + +When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under +``tasklist_lock`` writer lock protection, to remove the task from the list of +all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals +from corrupting the list. Readers using ``for_each_process()`` are not protected +with the ``tasklist_lock``. To prevent readers from noticing changes in the list +pointers, the ``task_struct`` object is freed only after one or more grace +periods elapse (with the help of call_rcu()). This deferring of destruction +ensures that any readers traversing the list will see valid ``p->tasks.next`` +pointers and deletion/freeing can happen in parallel with traversal of the list. +This pattern is also called an **existence lock**, since RCU pins the object in +memory until all existing readers finish. + + +Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates ---------------------------------------------------------------------- The best applications are cases where, if reader-writer locking were @@ -26,7 +75,7 @@ added or deleted, rather than being modified in place. 
A straightforward example of this use of RCU may be found in the system-call auditing support. For example, a reader-writer locked -implementation of audit_filter_task() might be as follows:: +implementation of ``audit_filter_task()`` might be as follows:: static enum audit_state audit_filter_task(struct task_struct *tsk) { @@ -34,7 +83,7 @@ implementation of audit_filter_task() might be as follows:: enum audit_state state; read_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry(e, &audit_tsklist, list) { if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { read_unlock(&auditsc_lock); @@ -58,7 +107,7 @@ This means that RCU can be easily applied to the read side, as follows:: enum audit_state state; rcu_read_lock(); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry_rcu(e, &audit_tsklist, list) { if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { rcu_read_unlock(); @@ -69,18 +118,18 @@ This means that RCU can be easily applied to the read side, as follows:: return AUDIT_BUILD_CONTEXT; } -The read_lock() and read_unlock() calls have become rcu_read_lock() +The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock() and rcu_read_unlock(), respectively, and the list_for_each_entry() has -become list_for_each_entry_rcu(). The _rcu() list-traversal primitives +become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives insert the read-side memory barriers that are required on DEC Alpha CPUs. -The changes to the update side are also straightforward. A reader-writer -lock might be used as follows for deletion and insertion:: +The changes to the update side are also straightforward. 
A reader-writer lock +might be used as follows for deletion and insertion:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; write_lock(&auditsc_lock); list_for_each_entry(e, list, list) { @@ -113,9 +162,9 @@ Following are the RCU equivalents for these two functions:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; - /* Do not use the _rcu iterator here, since this is the only + /* No need to use the _rcu iterator here, since this is the only * deletion routine. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -139,45 +188,45 @@ Following are the RCU equivalents for these two functions:: return 0; } -Normally, the write_lock() and write_unlock() would be replaced by -a spin_lock() and a spin_unlock(), but in this case, all callers hold -audit_netlink_sem, so no additional locking is required. The auditsc_lock -can therefore be eliminated, since use of RCU eliminates the need for -writers to exclude readers. Normally, the write_lock() calls would -be converted into spin_lock() calls. +Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a +spin_lock() and a spin_unlock(). But in this case, all callers hold +``audit_filter_mutex``, so no additional locking is required. The +``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the +need for writers to exclude readers. The list_del(), list_add(), and list_add_tail() primitives have been replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). -The _rcu() list-manipulation primitives add memory barriers that are -needed on weakly ordered CPUs (most of them!). The list_del_rcu() -primitive omits the pointer poisoning debug-assist code that would -otherwise cause concurrent readers to fail spectacularly. 
+The **_rcu()** list-manipulation primitives add memory barriers that are needed on +weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the +pointer poisoning debug-assist code that would otherwise cause concurrent +readers to fail spectacularly. -So, when readers can tolerate stale data and when entries are either added -or deleted, without in-place modification, it is very easy to use RCU! +So, when readers can tolerate stale data and when entries are either added or +deleted, without in-place modification, it is very easy to use RCU! -Example 2: Handling In-Place Updates + +Example 3: Handling In-Place Updates ------------------------------------ -The system-call auditing code does not update auditing rules in place. -However, if it did, reader-writer-locked code to do so might look as -follows (presumably, the field_count is only permitted to decrease, -otherwise, the added fields would need to be filled in):: +The system-call auditing code does not update auditing rules in place. However, +if it did, the reader-writer-locked code to do so might look as follows +(assuming only ``field_count`` is updated, otherwise, the added fields would +need to be filled in):: static inline int audit_upd_rule(struct audit_rule *rule, struct list_head *list, __u32 newaction, __u32 newfield_count) { - struct audit_entry *e; - struct audit_newentry *ne; + struct audit_entry *e; + struct audit_entry *ne; write_lock(&auditsc_lock); - /* Note: audit_netlink_sem held by caller. */ + /* Note: audit_filter_mutex held by caller. */ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { e->rule.action = newaction; - e->rule.file_count = newfield_count; + e->rule.field_count = newfield_count; write_unlock(&auditsc_lock); return 0; } @@ -188,16 +237,16 @@ otherwise, the added fields would need to be filled in):: The RCU version creates a copy, updates the copy, then replaces the old entry with the newly updated entry. 
This sequence of actions, allowing -concurrent reads while doing a copy to perform an update, is what gives -RCU ("read-copy update") its name. The RCU code is as follows:: +concurrent reads while making a copy to perform an update, is what gives +RCU (*read-copy update*) its name. The RCU code is as follows:: static inline int audit_upd_rule(struct audit_rule *rule, struct list_head *list, __u32 newaction, __u32 newfield_count) { - struct audit_entry *e; - struct audit_newentry *ne; + struct audit_entry *e; + struct audit_entry *ne; list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -206,7 +255,7 @@ RCU ("read-copy update") its name. The RCU code is as follows:: return -ENOMEM; audit_copy_rule(&ne->rule, &e->rule); ne->rule.action = newaction; - ne->rule.file_count = newfield_count; + ne->rule.field_count = newfield_count; list_replace_rcu(&e->list, &ne->list); call_rcu(&e->rcu, audit_free_rule); return 0; @@ -215,34 +264,45 @@ RCU ("read-copy update") its name. The RCU code is as follows:: return -EFAULT; /* No matching rule */ } -Again, this assumes that the caller holds audit_netlink_sem. Normally, -the reader-writer lock would become a spinlock in this sort of code. +Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the +writer lock would become a spinlock in this sort of code. -Example 3: Eliminating Stale Data +Another use of this pattern can be found in the openvswitch driver's *connection +tracking table* code in ``ct_limit_set()``. The table holds connection tracking +entries and has a limit on the maximum entries. There is one such table +per-zone and hence one *limit* per zone. The zones are mapped to their limits +through a hashtable using an RCU-managed hlist for the hash chains. When a new +limit is set, a new limit object is allocated and ``ct_limit_set()`` is called +to replace the old limit object with the new one using list_replace_rcu().
+The old limit object is then freed after a grace period using kfree_rcu(). + + +Example 4: Eliminating Stale Data --------------------------------- -The auditing examples above tolerate stale data, as do most algorithms +The auditing example above tolerates stale data, as do most algorithms that are tracking external state. Because there is a delay from the time the external state changes before Linux becomes aware of the change, -additional RCU-induced staleness is normally not a problem. +additional RCU-induced staleness is generally not a problem. However, there are many examples where stale data cannot be tolerated. -One example in the Linux kernel is the System V IPC (see the ipc_lock() -function in ipc/util.c). This code checks a "deleted" flag under a -per-entry spinlock, and, if the "deleted" flag is set, pretends that the +One example in the Linux kernel is the System V IPC (see the shm_lock() +function in ipc/shm.c). This code checks a *deleted* flag under a +per-entry spinlock, and, if the *deleted* flag is set, pretends that the entry does not exist. For this to be helpful, the search function must -return holding the per-entry spinlock, as ipc_lock() does in fact do. +return holding the per-entry spinlock, as shm_lock() does in fact do. + +.. _quick_quiz: Quick Quiz: - Why does the search function need to return holding the per-entry lock for - this deleted-flag technique to be helpful? + For the deleted-flag technique to be helpful, why is it necessary + to hold the per-entry lock while returning from the search function? 
-:ref:`Answer to Quick Quiz <answer_quick_quiz_list>` +:ref:`Answer to Quick Quiz <quick_quiz_answer>` -If the system-call audit module were to ever need to reject stale data, -one way to accomplish this would be to add a "deleted" flag and a "lock" -spinlock to the audit_entry structure, and modify audit_filter_task() -as follows:: +If the system-call audit module were to ever need to reject stale data, one way +to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the +audit_entry structure, and modify ``audit_filter_task()`` as follows:: static enum audit_state audit_filter_task(struct task_struct *tsk) { @@ -267,20 +327,20 @@ as follows:: } Note that this example assumes that entries are only added and deleted. -Additional mechanism is required to deal correctly with the -update-in-place performed by audit_upd_rule(). For one thing, -audit_upd_rule() would need additional memory barriers to ensure -that the list_add_rcu() was really executed before the list_del_rcu(). +Additional mechanism is required to deal correctly with the update-in-place +performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would +need additional memory barriers to ensure that the list_add_rcu() was really +executed before the list_del_rcu(). -The audit_del_rule() function would need to set the "deleted" -flag under the spinlock as follows:: +The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the +spinlock as follows:: static inline int audit_del_rule(struct audit_rule *rule, struct list_head *list) { - struct audit_entry *e; + struct audit_entry *e; - /* Do not need to use the _rcu iterator here, since this + /* No need to use the _rcu iterator here, since this * is the only deletion routine. 
*/ list_for_each_entry(e, list, list) { if (!audit_compare_rule(rule, &e->rule)) { @@ -295,6 +355,91 @@ flag under the spinlock as follows:: return -EFAULT; /* No matching rule */ } +This too assumes that the caller holds ``audit_filter_mutex``. + + +Example 5: Skipping Stale Objects +--------------------------------- + +For some use cases, reader performance can be improved by skipping stale objects +during read-side list traversal if the object in concern is pending destruction +after one or more grace periods. One such example can be found in the timerfd +subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed - for example due to +setting of the system time, then all programmed timerfds that depend on this +clock get triggered and processes waiting on them to expire are woken up in +advance of their scheduled expiry. To facilitate this, all such timers are added +to an RCU-managed ``cancel_list`` when they are set up in +``timerfd_setup_cancel()``:: + + static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags) + { + spin_lock(&ctx->cancel_lock); + if (ctx->clockid == CLOCK_REALTIME && + (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) { + if (!ctx->might_cancel) { + ctx->might_cancel = true; + spin_lock(&cancel_lock); + list_add_rcu(&ctx->clist, &cancel_list); + spin_unlock(&cancel_lock); + } + } + spin_unlock(&ctx->cancel_lock); + } + +When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the +timerfd object is cleared, the object removed from the ``cancel_list`` and +destroyed:: + + int timerfd_release(struct inode *inode, struct file *file) + { + struct timerfd_ctx *ctx = file->private_data; + + spin_lock(&ctx->cancel_lock); + if (ctx->might_cancel) { + ctx->might_cancel = false; + spin_lock(&cancel_lock); + list_del_rcu(&ctx->clist); + spin_unlock(&cancel_lock); + } + spin_unlock(&ctx->cancel_lock); + + hrtimer_cancel(&ctx->t.tmr); + kfree_rcu(ctx, rcu); + return 0; + } + +If the ``CLOCK_REALTIME`` clock is
set, for example by a time server, the +hrtimer framework calls ``timerfd_clock_was_set()`` which walks the +``cancel_list`` and wakes up processes waiting on the timerfd. While iterating +the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale +objects:: + + void timerfd_clock_was_set(void) + { + struct timerfd_ctx *ctx; + unsigned long flags; + + rcu_read_lock(); + list_for_each_entry_rcu(ctx, &cancel_list, clist) { + if (!ctx->might_cancel) + continue; + spin_lock_irqsave(&ctx->wqh.lock, flags); + if (ctx->moffs != ktime_mono_to_real(0)) { + ctx->moffs = KTIME_MAX; + ctx->ticks++; + wake_up_locked_poll(&ctx->wqh, EPOLLIN); + } + spin_unlock_irqrestore(&ctx->wqh.lock, flags); + } + rcu_read_unlock(); + } + +The key point here is, because RCU-traversal of the ``cancel_list`` happens +while objects are being added and removed to the list, sometimes the traversal +can step on an object that has been removed from the list. In this example, it +is seen that it is better to skip such objects using a flag. + + Summary ------- @@ -303,19 +448,21 @@ the most amenable to use of RCU. The simplest case is where entries are either added or deleted from the data structure (or atomically modified in place), but non-atomic in-place modifications can be handled by making a copy, updating the copy, then replacing the original with the copy. -If stale data cannot be tolerated, then a "deleted" flag may be used +If stale data cannot be tolerated, then a *deleted* flag may be used in conjunction with a per-entry spinlock in order to allow the search function to reject newly deleted data. -.. _answer_quick_quiz_list: +.. _quick_quiz_answer: Answer to Quick Quiz: - Why does the search function need to return holding the per-entry - lock for this deleted-flag technique to be helpful? + For the deleted-flag technique to be helpful, why is it necessary + to hold the per-entry lock while returning from the search function? 
If the search function drops the per-entry lock before returning, then the caller will be processing stale data in any case. If it is really OK to be processing stale data, then you don't need a - "deleted" flag. If processing stale data really is a problem, + *deleted* flag. If processing stale data really is a problem, then you need to hold the per-entry lock across all of the code that uses the value that was returned. + +:ref:`Back to Quick Quiz <quick_quiz>` diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst index 8dfb437dacc3..0e03c6ef3147 100644 --- a/Documentation/RCU/rcu.rst +++ b/Documentation/RCU/rcu.rst @@ -11,8 +11,8 @@ must be long enough that any readers accessing the item being deleted have since dropped their references. For example, an RCU-protected deletion from a linked list would first remove the item from the list, wait for a grace period to elapse, then free the element. See the -Documentation/RCU/listRCU.rst file for more information on using RCU with -linked lists. +:ref:`Documentation/RCU/listRCU.rst <list_rcu_doc>` for more information on +using RCU with linked lists. Frequently Asked Questions -------------------------- @@ -50,7 +50,7 @@ Frequently Asked Questions - If I am running on a uniprocessor kernel, which can only do one thing at a time, why should I wait for a grace period? - See the Documentation/RCU/UP.rst file for more information. + See :ref:`Documentation/RCU/UP.rst <up_doc>` for more information. - How can I see where RCU is currently used in the Linux kernel? @@ -68,18 +68,18 @@ Frequently Asked Questions - Why the name "RCU"? - "RCU" stands for "read-copy update". The file Documentation/RCU/listRCU.rst - has more information on where this name came from, search for - "read-copy update" to find it. + "RCU" stands for "read-copy update". + :ref:`Documentation/RCU/listRCU.rst <list_rcu_doc>` has more information on where + this name came from, search for "read-copy update" to find it. 
- I hear that RCU is patented? What is with that? Yes, it is. There are several known patents related to RCU, - search for the string "Patent" in RTFP.txt to find them. + search for the string "Patent" in Documentation/RCU/RTFP.txt to find them. Of these, one was allowed to lapse by the assignee, and the others have been contributed to the Linux kernel under GPL. There are now also LGPL implementations of user-level RCU - available (http://liburcu.org/). + available (https://liburcu.org/). - I hear that RCU needs work in order to support realtime kernels? @@ -88,5 +88,5 @@ Frequently Asked Questions - Where can I find more information on RCU? - See the RTFP.txt file in this directory. + See the Documentation/RCU/RTFP.txt file. Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/). diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index a41a0384d20c..af712a3c5b6a 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -124,9 +124,14 @@ using a dynamically allocated srcu_struct (hence "srcud-" rather than debugging. The final "T" entry contains the totals of the counters. -USAGE +USAGE ON SPECIFIC KERNEL BUILDS -The following script may be used to torture RCU: +It is sometimes desirable to torture RCU on a specific kernel build, +for example, when preparing to put that kernel build into production. +In that case, the kernel should be built with CONFIG_RCU_TORTURE_TEST=m +so that the test can be started using modprobe and terminated using rmmod. + +For example, the following script may be used to torture RCU: #!/bin/sh @@ -142,8 +147,136 @@ checked for such errors. The "rmmod" command forces a "SUCCESS", two are self-explanatory, while the last indicates that while there were no RCU failures, CPU-hotplug problems were detected. -However, the tools/testing/selftests/rcutorture/bin/kvm.sh script -provides better automation, including automatic failure analysis. 
-It assumes a qemu/kvm-enabled platform, and runs guest OSes out of initrd. -See tools/testing/selftests/rcutorture/doc/initrd.txt for instructions -on setting up such an initrd. + +USAGE ON MAINLINE KERNELS + +When using rcutorture to test changes to RCU itself, it is often +necessary to build a number of kernels in order to test that change +across a broad range of combinations of the relevant Kconfig options +and of the relevant kernel boot parameters. In this situation, use +of modprobe and rmmod can be quite time-consuming and error-prone. + +Therefore, the tools/testing/selftests/rcutorture/bin/kvm.sh +script is available for mainline testing for x86, arm64, and +powerpc. By default, it will run the series of tests specified by +tools/testing/selftests/rcutorture/configs/rcu/CFLIST, with each test +running for 30 minutes within a guest OS using a minimal userspace +supplied by an automatically generated initrd. After the tests are +complete, the resulting build products and console output are analyzed +for errors and the results of the runs are summarized. + +On larger systems, rcutorture testing can be accelerated by passing the +--cpus argument to kvm.sh. For example, on a 64-CPU system, "--cpus 43" +would use up to 43 CPUs to run tests concurrently, which as of v5.4 would +complete all the scenarios in two batches, reducing the time to complete +from about eight hours to about one hour (not counting the time to build +the sixteen kernels). The "--dryrun sched" argument will not run tests, +but rather tell you how the tests would be scheduled into batches. This +can be useful when working out how many CPUs to specify in the --cpus +argument. + +Not all changes require that all scenarios be run. For example, a change +to Tree SRCU might run only the SRCU-N and SRCU-P scenarios using the +--configs argument to kvm.sh as follows: "--configs 'SRCU-N SRCU-P'". 
+Large systems can run multiple copies of the full set of scenarios, +for example, a system with 448 hardware threads can run five instances +of the full set concurrently. To make this happen: + + kvm.sh --cpus 448 --configs '5*CFLIST' + +Alternatively, such a system can run 56 concurrent instances of a single +eight-CPU scenario: + + kvm.sh --cpus 448 --configs '56*TREE04' + +Or 28 concurrent instances of each of two eight-CPU scenarios: + + kvm.sh --cpus 448 --configs '28*TREE03 28*TREE04' + +Of course, each concurrent instance will use memory, which can be +limited using the --memory argument, which defaults to 512M. Small +values for memory may require disabling the callback-flooding tests +using the --bootargs parameter discussed below. + +Sometimes additional debugging is useful, and in such cases the --kconfig +parameter to kvm.sh may be used, for example, "--kconfig 'CONFIG_KASAN=y'". + +Kernel boot arguments can also be supplied, for example, to control +rcutorture's module parameters. For example, to test a change to RCU's +CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'". +This will of course result in the scripting reporting a failure, namely +the resulting RCU CPU stall warning. As noted above, reducing memory may +require disabling rcutorture's callback-flooding tests: + + kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \ + --bootargs 'rcutorture.fwd_progress=0' + +Sometimes all that is needed is a full set of kernel builds. This is +what the --buildonly argument does. + +Finally, the --trust-make argument allows each kernel build to reuse what +it can from the previous kernel build. + +There are additional more arcane arguments that are documented in the +source code of the kvm.sh script. + +If a run contains failures, the number of buildtime and runtime failures +is listed at the end of the kvm.sh output, which you really should redirect +to a file.
The build products and console output of each run are kept in +tools/testing/selftests/rcutorture/res in timestamped directories. A +given directory can be supplied to kvm-find-errors.sh in order to have +it cycle you through summaries of errors and full error logs. For example: + + tools/testing/selftests/rcutorture/bin/kvm-find-errors.sh \ + tools/testing/selftests/rcutorture/res/2020.01.20-15.54.23 + +However, it is often more convenient to access the files directly. +Files pertaining to all scenarios in a run reside in the top-level +directory (2020.01.20-15.54.23 in the example above), while per-scenario +files reside in a subdirectory named after the scenario (for example, +"TREE04"). If a given scenario ran more than once (as in "--configs +'56*TREE04'" above), the directories corresponding to the second and +subsequent runs of that scenario include a sequence number, for example, +"TREE04.2", "TREE04.3", and so on. + +The most frequently used file in the top-level directory is testid.txt. +If the test ran in a git repository, then this file contains the commit +that was tested and any uncommitted changes in diff format. + +The most frequently used files in each per-scenario-run directory are: + +.config: This file contains the Kconfig options. + +Make.out: This contains build output for a specific scenario. + +console.log: This contains the console output for a specific scenario. + This file may be examined once the kernel has booted, but + it might not exist if the build failed. + +vmlinux: This contains the kernel, which can be useful with tools like + objdump and gdb. + +A number of additional files are available, but are less frequently used. +Many are intended for debugging of rcutorture itself or of its scripting.
+ +As of v5.4, a successful run with the default set of scenarios produces +the following summary at the end of the run on a 12-CPU system: + +SRCU-N ------- 804233 GPs (148.932/s) [srcu: g10008272 f0x0 ] +SRCU-P ------- 202320 GPs (37.4667/s) [srcud: g1809476 f0x0 ] +SRCU-t ------- 1122086 GPs (207.794/s) [srcu: g0 f0x0 ] +SRCU-u ------- 1111285 GPs (205.794/s) [srcud: g1 f0x0 ] +TASKS01 ------- 19666 GPs (3.64185/s) [tasks: g0 f0x0 ] +TASKS02 ------- 20541 GPs (3.80389/s) [tasks: g0 f0x0 ] +TASKS03 ------- 19416 GPs (3.59556/s) [tasks: g0 f0x0 ] +TINY01 ------- 836134 GPs (154.84/s) [rcu: g0 f0x0 ] n_max_cbs: 34198 +TINY02 ------- 850371 GPs (157.476/s) [rcu: g0 f0x0 ] n_max_cbs: 2631 +TREE01 ------- 162625 GPs (30.1157/s) [rcu: g1124169 f0x0 ] +TREE02 ------- 333003 GPs (61.6672/s) [rcu: g2647753 f0x0 ] n_max_cbs: 35844 +TREE03 ------- 306623 GPs (56.782/s) [rcu: g2975325 f0x0 ] n_max_cbs: 1496497 +CPU count limited from 16 to 12 +TREE04 ------- 246149 GPs (45.5831/s) [rcu: g1695737 f0x0 ] n_max_cbs: 434961 +TREE05 ------- 314603 GPs (58.2598/s) [rcu: g2257741 f0x2 ] n_max_cbs: 193997 +TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732 +CPU count limited from 16 to 12 +TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011 diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst index 621111ce5740..f2b3439edcc2 100644 --- a/Documentation/accounting/psi.rst +++ b/Documentation/accounting/psi.rst @@ -1,3 +1,5 @@ +.. 
_psi: + ================================ PSI - Pressure Stall Information ================================ diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst index 21d233ca50d8..98fe5c333121 100644 --- a/Documentation/admin-guide/acpi/fan_performance_states.rst +++ b/Documentation/admin-guide/acpi/fan_performance_states.rst @@ -18,7 +18,7 @@ may look as follows:: $ ls -l /sys/bus/acpi/devices/INT3404:00/ total 0 -... + ... -r--r--r-- 1 root root 4096 Dec 13 20:38 state0 -r--r--r-- 1 root root 4096 Dec 13 20:38 state1 -r--r--r-- 1 root root 4096 Dec 13 20:38 state10 @@ -38,7 +38,7 @@ where each of the "state*" files represents one performance state of the fan and contains a colon-separated list of 5 integer numbers (fields) with the following interpretation:: -control_percent:trip_point_index:speed_rpm:noise_level_mdb:power_mw + control_percent:trip_point_index:speed_rpm:noise_level_mdb:power_mw * ``control_percent``: The percent value to be used to set the fan speed to a specific level using the _FSL object (0-100). diff --git a/Documentation/admin-guide/binfmt-misc.rst b/Documentation/admin-guide/binfmt-misc.rst index 97b0d7927078..95c93bbe408a 100644 --- a/Documentation/admin-guide/binfmt-misc.rst +++ b/Documentation/admin-guide/binfmt-misc.rst @@ -1,5 +1,5 @@ -Kernel Support for miscellaneous (your favourite) Binary Formats v1.1 -===================================================================== +Kernel Support for miscellaneous Binary Formats (binfmt_misc) +============================================================= This Kernel feature allows you to invoke almost (for restrictions see below) every program by simply typing its name in the shell. 
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 27c77d853028..a6fd1f9b5faf 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -251,8 +251,6 @@ line of text and contains the following stats separated by whitespace: ================ ============================================================= orig_data_size uncompressed size of data stored in this disk. - This excludes same-element-filled pages (same_pages) since - no memory is allocated for them. Unit: bytes compr_data_size compressed size of data stored in this disk mem_used_total the amount of memory allocated for this disk. This diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index b342a6796392..d6b3b77a4129 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -23,7 +23,7 @@ of dot-connected-words, and key and value are connected by ``=``. The value has to be terminated by semi-colon (``;``) or newline (``\n``). For array value, array entries are separated by comma (``,``). :: -KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. @@ -62,6 +62,30 @@ Or more shorter, written as following:: In both styles, same key words are automatically merged when parsing it at boot time. So you can append similar trees or key-values. +Same-key Values +--------------- + +It is prohibited that two or more values or arrays share a same-key. +For example,:: + + foo = bar, baz + foo = qux # !ERROR! we can not re-define same key + +If you want to append the value to existing key as an array member, +you can use ``+=`` operator. For example:: + + foo = bar, baz + foo += qux + +In this case, the key ``foo`` has ``bar``, ``baz`` and ``qux``. 
+ +However, a sub-key and a value can not co-exist under a parent key. +For example, following config is NOT allowed.:: + + foo = value1 + foo.bar = value2 # !ERROR! subkey "bar" and value "value1" can NOT co-exist + + Comments -------- @@ -102,9 +126,13 @@ Boot Kernel With a Boot Config ============================== Since the boot configuration file is loaded with initrd, it will be added -to the end of the initrd (initramfs) image file. The Linux kernel decodes -the last part of the initrd image in memory to get the boot configuration -data. +to the end of the initrd (initramfs) image file with size, checksum and +12-byte magic word as below. + +[initrd][bootconfig][size(u32)][checksum(u32)][#BOOTCONFIG\n] + +The Linux kernel decodes the last part of the initrd image in memory to +get the boot configuration data. Because of this "piggyback" method, there is no need to change or update the boot loader and the kernel image itself. diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst index 10bf48bae0b0..226f64473e8e 100644 --- a/Documentation/admin-guide/cgroup-v1/index.rst +++ b/Documentation/admin-guide/cgroup-v1/index.rst @@ -1,3 +1,5 @@ +.. _cgroup-v1: + ======================== Control Groups version 1 ======================== diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3f801461f0f3..fbb111616705 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and conventions of cgroup v2. It describes all userland-visible aspects of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for -v1 is available under Documentation/admin-guide/cgroup-v1/. +v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`. .. 
CONTENTS @@ -1023,7 +1023,7 @@ All time durations are in microseconds. A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for CPU. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. cpu.uclamp.min A read-write single value file which exists on non-root cgroups. @@ -1103,7 +1103,7 @@ PAGE_SIZE multiple when read back. proportionally to the overage, reducing reclaim pressure for smaller overages. - Effective min boundary is limited by memory.min values of + Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get @@ -1313,53 +1313,41 @@ PAGE_SIZE multiple when read back. Number of major page faults incurred workingset_refault - Number of refaults of previously evicted pages workingset_activate - Number of refaulted pages that were immediately activated workingset_nodereclaim - Number of times a shadow node has been reclaimed pgrefill - Amount of scanned pages (in an active LRU list) pgscan - Amount of scanned pages (in an inactive LRU list) pgsteal - Amount of reclaimed pages pgactivate - Amount of pages moved to the active LRU list pgdeactivate - Amount of pages moved to the inactive LRU list pglazyfree - Amount of pages postponed to be freed under memory pressure pglazyfreed - Amount of reclaimed lazyfree pages thp_fault_alloc - Number of transparent hugepages which were allocated to satisfy a page fault, including COW faults. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. thp_collapse_alloc - Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set. @@ -1403,7 +1391,7 @@ PAGE_SIZE multiple when read back. 
A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for memory. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. Usage Guidelines @@ -1478,7 +1466,7 @@ IO Interface Files dios Number of discard IOs ====== ===================== - An example read output follows: + An example read output follows:: 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 @@ -1643,7 +1631,7 @@ IO Interface Files A read-only nested-key file which exists on non-root cgroups. Shows pressure stall information for IO. See - Documentation/accounting/psi.rst for details. + :ref:`Documentation/accounting/psi.rst <psi>` for details. Writeback @@ -1853,7 +1841,7 @@ Cpuset Interface Files from the requested CPUs. The CPU numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.cpus 0-4,6,8-10 @@ -1892,7 +1880,7 @@ Cpuset Interface Files from the requested memory nodes. The memory node numbers are comma-separated numbers or ranges. - For example: + For example:: # cat cpuset.mems 0-1,3 diff --git a/Documentation/driver-api/edid.rst b/Documentation/admin-guide/edid.rst index b1b5acd501ed..80deeb21a265 100644 --- a/Documentation/driver-api/edid.rst +++ b/Documentation/admin-guide/edid.rst @@ -11,11 +11,13 @@ Today, with the advent of Kernel Mode Setting, a graphics board is either correctly working because all components follow the standards - or the computer is unusable, because the screen remains dark after booting or it displays the wrong area. Cases when this happens are: + - The graphics board does not recognize the monitor. - The graphics board is unable to detect any EDID data. - The graphics board incorrectly forwards EDID data to the driver. - The monitor sends no or bogus EDID data. - A KVM sends its own EDID data instead of querying the connected monitor. 
+ Adding the kernel parameter "nomodeset" helps in most cases, but causes restrictions later on. @@ -32,7 +34,7 @@ individual data for a specific misbehaving monitor, commented sources and a Makefile environment are given here. To create binary EDID and C source code files from the existing data -material, simply type "make". +material, simply type "make" in tools/edid/. If you want to create your own EDID file, copy the file 1024x768.S, replace the settings with your own data and add a new target to the diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst index af6865b822d2..68d96f0e9c95 100644 --- a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst +++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst @@ -136,8 +136,6 @@ enables the mitigation by default. The mitigation can be controlled at boot time via a kernel command line option. See :ref:`taa_mitigation_control_command_line`. -.. _virt_mechanism: - Virtualization mitigation ^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index f1d0ccffbe72..5a6269fb8593 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -75,6 +75,7 @@ configure specific aspects of kernel behavior to your liking. cputopology dell_rbu device-mapper/index + edid efi-stub ext4 nfs/index diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index df5b8345c41d..9b14b0c2c9c4 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -100,7 +100,7 @@ Field 10 -- # of milliseconds spent doing I/Os (unsigned int) Since 5.0 this field counts jiffies when at least one request was started or completed. If request runs more than 2 jiffies then some - I/O time will not be accounted unless there are other requests. + I/O time might be not accounted in case of concurrent requests. 
Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int) This field is incremented at each I/O start, I/O completion, I/O @@ -143,6 +143,9 @@ are summed (possibly overflowing the unsigned long variable they are summed to) and the result given to the user. There is no convenient user interface for accessing the per-CPU counters themselves. +Since 4.19 request times are measured with nanoseconds precision and +truncated to milliseconds before showing in this interface. + Disks vs Partitions ------------------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ff1428d69b2d..ed73df5f1369 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -136,6 +136,10 @@ dynamic table installation which will install SSDT tables to /sys/firmware/acpi/tables/dynamic. + acpi_no_watchdog [HW,ACPI,WDT] + Ignore the ACPI-based watchdog interface (WDAT) and let + a native driver control the watchdog device instead. + acpi_rsdp= [ACPI,EFI,KEXEC] Pass the RSDP address to the kernel, mostly used on machines running EFI runtime service to boot the @@ -446,6 +450,9 @@ bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. + bgrt_disable [ACPI][X86] + Disable BGRT to avoid flickering OEM logo. + bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) bttv.radio= Most important insmod options are available as kernel args too. @@ -1096,6 +1103,12 @@ A valid base address must be provided, and the serial port must already be setup and configured. + ec_imx21,<addr> + ec_imx6q,<addr> + Start an early, polled-mode, output-only console on the + Freescale i.MX UART at the specified address. The UART + must already be setup and configured. 
+ ar3700_uart,<addr> Start an early, polled-mode console on the Armada 3700 serial port at the specified @@ -1351,6 +1364,24 @@ can be changed at run time by the max_graph_depth file in the tracefs tracing directory. default: 0 (no limit) + fw_devlink= [KNL] Create device links between consumer and supplier + devices by scanning the firmware to infer the + consumer/supplier relationships. This feature is + especially useful when drivers are loaded as modules as + it ensures proper ordering of tasks like device probing + (suppliers first, then consumers), supplier boot state + clean up (only after all consumers have probed), + suspend/resume & runtime PM (consumers first, then + suppliers). + Format: { off | permissive | on | rpm } + off -- Don't create device links from firmware info. + permissive -- Create device links from firmware info + but use it only for ordering boot state clean + up (sync_state() calls). + on -- Create device links from firmware info and use it + to enforce probe and suspend/resume ordering. + rpm -- Like "on", but also use to order runtime PM. + gamecon.map[2|3]= [HW,JOY] Multisystem joystick and NES/SNES/PSX pad support via parallel port (up to 5 devices per port) @@ -1776,7 +1807,7 @@ provided by tboot because it makes the system vulnerable to DMA attacks. nobounce [Default off] - Disable bounce buffer for unstrusted devices such as + Disable bounce buffer for untrusted devices such as the Thunderbolt devices. This will treat the untrusted devices as the trusted ones, hence might expose security risks of DMA attacks. @@ -1880,7 +1911,7 @@ No delay ip= [IP_PNP] - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. ipcmni_extend [KNL] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. @@ -2792,7 +2823,7 @@ <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] mtdparts= [MTD] - See drivers/mtd/cmdlinepart.c. 
+ See drivers/mtd/parsers/cmdlinepart.c multitce=off [PPC] This parameter disables the use of the pSeries firmware feature for updating multiple TCE entries @@ -2850,13 +2881,13 @@ Default value is 0. nfsaddrs= [NFS] Deprecated. Use ip= instead. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsroot= [NFS] nfs root filesystem for disk-less boxes. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfsrootdebug [NFS] enable nfsroot debugging messages. - See Documentation/filesystems/nfs/nfsroot.txt. + See Documentation/admin-guide/nfs/nfsroot.rst. nfs.callback_nr_threads= [NFSv4] set the total number of threads that the @@ -3171,7 +3202,7 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM,ARM64] Disable paravirtualized steal time + no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time accounting. steal time is computed, but won't influence scheduler behaviour @@ -3282,12 +3313,6 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. - of_devlink [OF, KNL] Create device links between consumer and - supplier devices by scanning the devictree to infer the - consumer/supplier relationships. A consumer device - will not be probed until all the supplier devices have - probed successfully. - ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more info. @@ -3981,6 +4006,15 @@ Set threshold of queued RCU callbacks below which batch limiting is re-enabled. + rcutree.qovld= [KNL] + Set threshold of queued RCU callbacks beyond which + RCU's force-quiescent-state scan will aggressively + enlist help from cond_resched() and sched IPIs to + help CPUs more quickly reach quiescent states. 
+ Set to less than zero to make this be set based + on rcutree.qhimark at boot time and to zero to + disable more aggressive help enlistment. + rcutree.rcu_idle_gp_delay= [KNL] Set wakeup interval for idle CPUs that have RCU callbacks (RCU_FAST_NO_HZ=y). @@ -4196,6 +4230,12 @@ rcupdate.rcu_cpu_stall_suppress= [KNL] Suppress RCU CPU stall warning messages. + rcupdate.rcu_cpu_stall_suppress_at_boot= [KNL] + Suppress RCU CPU stall warning messages and + rcutorture writer stall warnings that occur + during early boot, that is, during the time + before the init task is spawned. + rcupdate.rcu_cpu_stall_timeout= [KNL] Set timeout for RCU CPU stall warning messages. @@ -4389,6 +4429,22 @@ incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. + sched_thermal_decay_shift= + [KNL, SMP] Set a decay shift for scheduler thermal + pressure signal. Thermal pressure signal follows the + default decay period of other scheduler pelt + signals(usually 32 ms but configurable). Setting + sched_thermal_decay_shift will left shift the decay + period for the thermal pressure signal by the shift + value. + i.e. with the default pelt decay period of 32 ms + sched_thermal_decay_shift thermal pressure decay pr + 1 64 ms + 2 128 ms + and so on. + Format: integer between 0 and 10 + Default is 0. + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. @@ -4511,10 +4567,10 @@ Format: <integer> A nonzero value instructs the soft-lockup detector - to panic the machine when a soft-lockup occurs. This - is also controlled by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC - which is the respective build-time switch to that - functionality. + to panic the machine when a soft-lockup occurs. 
It is + also controlled by the kernel.softlockup_panic sysctl + and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the + respective build-time switch to that functionality. softlockup_all_cpu_backtrace= [KNL] Should the soft-lockup detector generate @@ -4656,6 +4712,28 @@ spia_pedr= spia_peddr= + split_lock_detect= + [X86] Enable split lock detection + + When enabled (and if hardware support is present), atomic + instructions that access data across cache line + boundaries will result in an alignment check exception. + + off - not enabled + + warn - the kernel will emit rate limited warnings + about applications triggering the #AC + exception. This mode is the default on CPUs + that support split lock detection. + + fatal - the kernel will send SIGBUS to applications + that trigger the #AC exception. + + If an #AC exception is hit in the kernel or in + firmware (i.e. not while executing in user mode) + the kernel will oops in either "warn" or "fatal" + mode. + srcutree.counter_wrap_check [KNL] Specifies how frequently to check for grace-period sequence counter wrap for the @@ -4868,6 +4946,10 @@ topology updates sent by the hypervisor to this LPAR. + torture.disable_onoff_at_boot= [KNL] + Prevent the CPU-hotplug component of torturing + until after init has spawned. + tp720= [HW,PS2] tpm_suspend_pcr=[HW,TPM] diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index baeeba8762ae..21818aca4708 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -234,7 +234,7 @@ To reduce its OS jitter, do any of the following: Such a workqueue can be confined to a given subset of the CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs files. The set of WQ_SYSFS workqueues can be displayed using - "ls sys/devices/virtual/workqueue". That said, the workqueues + "ls /sys/devices/virtual/workqueue".
That said, the workqueues maintainer would like to caution people against indiscriminately sprinkling WQ_SYSFS across all the workqueues. The reason for caution is that it is easy to add WQ_SYSFS, but because sysfs is diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst index 3726a10a03ba..f05f56c73b7d 100644 --- a/Documentation/admin-guide/perf/imx-ddr.rst +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -43,7 +43,8 @@ value 1 for supported. AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter. When non-masked bits are matching corresponding AXI_ID bits then counter is - incremented. Perf counter is incremented if + incremented. Perf counter is incremented if:: + AxID && AXI_MASKING == AXI_ID && AXI_MASKING This filter doesn't support filter different AXI ID for axid-read and axid-write diff --git a/Documentation/admin-guide/pm/cpufreq_drivers.rst b/Documentation/admin-guide/pm/cpufreq_drivers.rst new file mode 100644 index 000000000000..9a134ae65803 --- /dev/null +++ b/Documentation/admin-guide/pm/cpufreq_drivers.rst @@ -0,0 +1,274 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================================= +Legacy Documentation of CPU Performance Scaling Drivers +======================================================= + +Included below are historic documents describing assorted +:doc:`CPU performance scaling <cpufreq>` drivers. They are reproduced verbatim, +with the original white space formatting and indentation preserved, except for +the added leading space character in every line of text. + + +AMD PowerNow! Drivers +===================== + +:: + + PowerNow! and Cool'n'Quiet are AMD names for frequency + management capabilities in AMD processors. As the hardware + implementation changes in new generations of the processors, + there is a different cpu-freq driver for each generation. 
+ + Note that the driver's will not load on the "wrong" hardware, + so it is safe to try each driver in turn when in doubt as to + which is the correct driver. + + Note that the functionality to change frequency (and voltage) + is not available in all processors. The drivers will refuse + to load on processors without this capability. The capability + is detected with the cpuid instruction. + + The drivers use BIOS supplied tables to obtain frequency and + voltage information appropriate for a particular platform. + Frequency transitions will be unavailable if the BIOS does + not supply these tables. + + 6th Generation: powernow-k6 + + 7th Generation: powernow-k7: Athlon, Duron, Geode. + + 8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron. + Documentation on this functionality in 8th generation processors + is available in the "BIOS and Kernel Developer's Guide", publication + 26094, in chapter 9, available for download from www.amd.com. + + BIOS supplied data, for powernow-k7 and for powernow-k8, may be + from either the PSB table or from ACPI objects. The ACPI support + is only available if the kernel config sets CONFIG_ACPI_PROCESSOR. + The powernow-k8 driver will attempt to use ACPI if so configured, + and fall back to PST if that fails. + The powernow-k7 driver will try to use the PSB support first, and + fall back to ACPI if the PSB support fails. A module parameter, + acpi_force, is provided to force ACPI support to be used instead + of PSB support. + + +``cpufreq-nforce2`` +=================== + +:: + + The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms. + + This works better than on other platforms, because the FSB of the CPU + can be controlled independently from the PCI/AGP clock. + + The module has two options: + + fid: multiplier * 10 (for example 8.5 = 85) + min_fsb: minimum FSB + + If not set, fid is calculated from the current CPU speed and the FSB. + min_fsb defaults to FSB at boot time - 50 MHz. 
+ + IMPORTANT: The available range is limited downwards! + Also the minimum available FSB can differ, for systems + booting with 200 MHz, 150 should always work. + + +``pcc-cpufreq`` +=============== + +:: + + /* + * pcc-cpufreq.txt - PCC interface documentation + * + * Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com> + * Copyright (C) 2009 Hewlett-Packard Development Company, L.P. + * Nagananda Chumbalkar <nagananda.chumbalkar@hp.com> + */ + + + Processor Clocking Control Driver + --------------------------------- + + Contents: + --------- + 1. Introduction + 1.1 PCC interface + 1.1.1 Get Average Frequency + 1.1.2 Set Desired Frequency + 1.2 Platforms affected + 2. Driver and /sys details + 2.1 scaling_available_frequencies + 2.2 cpuinfo_transition_latency + 2.3 cpuinfo_cur_freq + 2.4 related_cpus + 3. Caveats + + 1. Introduction: + ---------------- + Processor Clocking Control (PCC) is an interface between the platform + firmware and OSPM. It is a mechanism for coordinating processor + performance (ie: frequency) between the platform firmware and the OS. + + The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC + interface. + + OS utilizes the PCC interface to inform platform firmware what frequency the + OS wants for a logical processor. The platform firmware attempts to achieve + the requested frequency. If the request for the target frequency could not be + satisfied by platform firmware, then it usually means that power budget + conditions are in place, and "power capping" is taking place. + + 1.1 PCC interface: + ------------------ + The complete PCC specification is available here: + https://acpica.org/sites/acpica/files/Processor-Clocking-Control-v1p0.pdf + + PCC relies on a shared memory region that provides a channel for communication + between the OS and platform firmware. PCC also implements a "doorbell" that + is used by the OS to inform the platform firmware that a command has been + sent. 
+ + The ACPI PCCH() method is used to discover the location of the PCC shared + memory region. The shared memory region header contains the "command" and + "status" interface. PCCH() also contains details on how to access the platform + doorbell. + + The following commands are supported by the PCC interface: + * Get Average Frequency + * Set Desired Frequency + + The ACPI PCCP() method is implemented for each logical processor and is + used to discover the offsets for the input and output buffers in the shared + memory region. + + When PCC mode is enabled, the platform will not expose processor performance + or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. Therefore, + the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for + AMD) will not load. + + However, OSPM remains in control of policy. The governor (eg: "ondemand") + computes the required performance for each processor based on server workload. + The PCC driver fills in the command interface, and the input buffer and + communicates the request to the platform firmware. The platform firmware is + responsible for delivering the requested performance. + + Each PCC command is "global" in scope and can affect all the logical CPUs in + the system. Therefore, PCC is capable of performing "group" updates. With PCC + the OS is capable of getting/setting the frequency of all the logical CPUs in + the system with a single call to the BIOS. + + 1.1.1 Get Average Frequency: + ---------------------------- + This command is used by the OSPM to query the running frequency of the + processor since the last time this command was completed. The output buffer + indicates the average unhalted frequency of the logical processor expressed as + a percentage of the nominal (ie: maximum) CPU frequency. The output buffer + also signifies if the CPU frequency is limited by a power budget condition. 
+ + 1.1.2 Set Desired Frequency: + ---------------------------- + This command is used by the OSPM to communicate to the platform firmware the + desired frequency for a logical processor. The output buffer is currently + ignored by OSPM. The next invocation of "Get Average Frequency" will inform + OSPM if the desired frequency was achieved or not. + + 1.2 Platforms affected: + ----------------------- + The PCC driver will load on any system where the platform firmware: + * supports the PCC interface, and the associated PCCH() and PCCP() methods + * assumes responsibility for managing the hardware clocking controls in order + to deliver the requested processor performance + + Currently, certain HP ProLiant platforms implement the PCC interface. On those + platforms PCC is the "default" choice. + + However, it is possible to disable this interface via a BIOS setting. In + such an instance, as is also the case on platforms where the PCC interface + is not implemented, the PCC driver will fail to load silently. + + 2. Driver and /sys details: + --------------------------- + When the driver loads, it merely prints the lowest and the highest CPU + frequencies supported by the platform firmware. + + The PCC driver loads with a message such as: + pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933 + MHz + + This means that the OPSM can request the CPU to run at any frequency in + between the limits (1600 MHz, and 2933 MHz) specified in the message. + + Internally, there is no need for the driver to convert the "target" frequency + to a corresponding P-state. + + The VERSION number for the driver will be of the format v.xy.ab. 
+ eg: 1.00.02 + ----- -- + | | + | -- this will increase with bug fixes/enhancements to the driver + |-- this is the version of the PCC specification the driver adheres to + + + The following is a brief discussion on some of the fields exported via the + /sys filesystem and how their values are affected by the PCC driver: + + 2.1 scaling_available_frequencies: + ---------------------------------- + scaling_available_frequencies is not created in /sys. No intermediate + frequencies need to be listed because the BIOS will try to achieve any + frequency, within limits, requested by the governor. A frequency does not have + to be strictly associated with a P-state. + + 2.2 cpuinfo_transition_latency: + ------------------------------- + The cpuinfo_transition_latency field is 0. The PCC specification does + not include a field to expose this value currently. + + 2.3 cpuinfo_cur_freq: + --------------------- + A) Often cpuinfo_cur_freq will show a value different than what is declared + in the scaling_available_frequencies or scaling_cur_freq, or scaling_max_freq. + This is due to "turbo boost" available on recent Intel processors. If certain + conditions are met the BIOS can achieve a slightly higher speed than requested + by OSPM. An example: + + scaling_cur_freq : 2933000 + cpuinfo_cur_freq : 3196000 + + B) There is a round-off error associated with the cpuinfo_cur_freq value. + Since the driver obtains the current frequency as a "percentage" (%) of the + nominal frequency from the BIOS, sometimes, the values displayed by + scaling_cur_freq and cpuinfo_cur_freq may not match. An example: + + scaling_cur_freq : 1600000 + cpuinfo_cur_freq : 1583000 + + In this example, the nominal frequency is 2933 MHz. The driver obtains the + current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency: + + 54% of 2933 MHz = 1583 MHz + + Nominal frequency is the maximum frequency of the processor, and it usually + corresponds to the frequency of the P0 P-state. 
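The round-off in case B) of section 2.3 above can be reproduced with plain integer arithmetic. This is only a sketch: the helper name `reported_cur_freq_khz` is hypothetical, and the order of operations (whole MHz first, then conversion to kHz) is an assumption chosen because it matches the 1583000 value quoted in the text.

```python
def reported_cur_freq_khz(nominal_mhz, percent_of_nominal):
    """Convert a whole-number percentage of the nominal frequency to kHz.

    Assumption: the MHz value is truncated to an integer before it is
    scaled to kHz, which is what yields 1583000 rather than 1583820.
    """
    return (nominal_mhz * percent_of_nominal // 100) * 1000


# 54% of 2933 MHz, as in the example from the text
cur_khz = reported_cur_freq_khz(2933, 54)
print(cur_khz)  # 1583000

# The governor asked for 1600 MHz, which is about 54.55% of 2933 MHz,
# so a whole-number percentage cannot represent it exactly.
exact_percent = 1600 * 100 / 2933
print(round(exact_percent, 2))  # 54.55
```

The gap between scaling_cur_freq (1600000) and cpuinfo_cur_freq (1583000) therefore comes from the percentage representation rather than from an actual frequency difference.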
+ + 2.4 related_cpus: + ----------------- + The related_cpus field is identical to affected_cpus. + + affected_cpus : 4 + related_cpus : 4 + + Currently, the PCC driver does not evaluate _PSD. The platforms that support + PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination + to ensure that the same frequency is requested of all dependent CPUs. + + 3. Caveats: + ----------- + The "cpufreq_stats" module in its present form cannot be loaded and + expected to work with the PCC driver. Since the "cpufreq_stats" module + provides information wrt each P-state, it is not applicable to the PCC driver. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 6a06dc473dd6..5605cc6f9560 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -583,20 +583,17 @@ Power Management Quality of Service for CPUs The power management quality of service (PM QoS) framework in the Linux kernel allows kernel code and user space processes to set constraints on various energy-efficiency features of the kernel to prevent performance from dropping -below a required level. The PM QoS constraints can be set globally, in -predefined categories referred to as PM QoS classes, or against individual -devices. +below a required level. CPU idle time management can be affected by PM QoS in two ways, through the -global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the -resume latency constraints for individual CPUs. Kernel code (e.g. device -drivers) can set both of them with the help of special internal interfaces -provided by the PM QoS framework. User space can modify the former by opening -the :file:`cpu_dma_latency` special device file under :file:`/dev/` and writing -a binary value (interpreted as a signed 32-bit integer) to it. 
In turn, the -resume latency constraint for a CPU can be modified by user space by writing a -string (representing a signed 32-bit integer) to the -:file:`power/pm_qos_resume_latency_us` file under +global CPU latency limit and through the resume latency constraints for +individual CPUs. Kernel code (e.g. device drivers) can set both of them with +the help of special internal interfaces provided by the PM QoS framework. User +space can modify the former by opening the :file:`cpu_dma_latency` special +device file under :file:`/dev/` and writing a binary value (interpreted as a +signed 32-bit integer) to it. In turn, the resume latency constraint for a CPU +can be modified from user space by writing a string (representing a signed +32-bit integer) to the :file:`power/pm_qos_resume_latency_us` file under :file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number ``<N>`` is allocated at the system initialization time. Negative values will be rejected in both cases and, also in both cases, the written integer @@ -605,32 +602,34 @@ number will be interpreted as a requested PM QoS constraint in microseconds. The requested value is not automatically applied as a new constraint, however, as it may be less restrictive (greater in this particular case) than another constraint previously requested by someone else. For this reason, the PM QoS -framework maintains a list of requests that have been made so far in each -global class and for each device, aggregates them and applies the effective -(minimum in this particular case) value as the new constraint. +framework maintains a list of requests that have been made so far for the +global CPU latency limit and for each individual CPU, aggregates them and +applies the effective (minimum in this particular case) value as the new +constraint. 
In fact, opening the :file:`cpu_dma_latency` special device file causes a new -PM QoS request to be created and added to the priority list of requests in the -``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the -"open" operation represents that request. If that file descriptor is then -used for writing, the number written to it will be associated with the PM QoS -request represented by it as a new requested constraint value. Next, the -priority list mechanism will be used to determine the new effective value of -the entire list of requests and that effective value will be set as a new -constraint. Thus setting a new requested constraint value will only change the -real constraint if the effective "list" value is affected by it. In particular, -for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if -it is the minimum of the requested constraints in the list. The process holding -a file descriptor obtained by opening the :file:`cpu_dma_latency` special device -file controls the PM QoS request associated with that file descriptor, but it -controls this particular PM QoS request only. +PM QoS request to be created and added to a global priority list of CPU latency +limit requests and the file descriptor coming from the "open" operation +represents that request. If that file descriptor is then used for writing, the +number written to it will be associated with the PM QoS request represented by +it as a new requested limit value. Next, the priority list mechanism will be +used to determine the new effective value of the entire list of requests and +that effective value will be set as a new CPU latency limit. Thus requesting a +new limit value will only change the real limit if the effective "list" value is +affected by it, which is the case if it is the minimum of the requested values +in the list. 
+ +The process holding a file descriptor obtained by opening the +:file:`cpu_dma_latency` special device file controls the PM QoS request +associated with that file descriptor, but it controls this particular PM QoS +request only. Closing the :file:`cpu_dma_latency` special device file or, more precisely, the file descriptor obtained while opening it, causes the PM QoS request associated -with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY`` -class priority list and destroyed. If that happens, the priority list mechanism -will be used, again, to determine the new effective value for the whole list -and that value will become the new real constraint. +with that file descriptor to be removed from the global priority list of CPU +latency limit requests and destroyed. If that happens, the priority list +mechanism will be used again, to determine the new effective value for the whole +list and that value will become the new limit. In turn, for each CPU there is one resume latency PM QoS request associated with the :file:`power/pm_qos_resume_latency_us` file under @@ -647,10 +646,10 @@ CPU in question every time the list of requests is updated this way or another (there may be other requests coming from kernel code in that list). CPU idle time governors are expected to regard the minimum of the global -effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective -resume latency constraint for the given CPU as the upper limit for the exit -latency of the idle states they can select for that CPU. They should never -select any idle states with exit latency beyond that limit. +(effective) CPU latency limit and the effective resume latency constraint for +the given CPU as the upper limit for the exit latency of the idle states that +they are allowed to select for that CPU. They should never select any idle +states with exit latency beyond that limit. 
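The user-space side of the global CPU latency limit described above can be sketched as follows. This is a minimal illustration, not kernel code: the helper names `encode_latency_us`, `effective_limit` and `hold_cpu_latency_limit` are made up for this example. The only interface taken from the text is that writing a binary signed 32-bit integer to :file:`/dev/cpu_dma_latency` registers a request that lasts until the file descriptor is closed. Opening the real device file normally requires suitable privileges.

```python
import os
import struct
from contextlib import contextmanager

CPU_DMA_LATENCY = "/dev/cpu_dma_latency"  # special device file named in the text


def encode_latency_us(latency_us):
    """Pack a requested limit (microseconds) as a signed 32-bit integer."""
    if latency_us < 0:
        # The text states that negative values are rejected.
        raise ValueError("negative latency values are rejected")
    return struct.pack("=i", latency_us)  # 4 bytes, native byte order


def effective_limit(requests_us):
    """Model of the aggregation described above: the minimum request wins."""
    return min(requests_us)


@contextmanager
def hold_cpu_latency_limit(latency_us, path=CPU_DMA_LATENCY):
    """Keep one PM QoS request alive for the duration of the with-block."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency_us(latency_us))
        yield fd
    finally:
        os.close(fd)  # closing the descriptor removes the request
```

A latency-sensitive program would wrap its critical phase in ``with hold_cpu_latency_limit(20): ...``. The request only changes the real limit if 20 is the minimum of all requests in the list, as ``effective_limit([100, 20, 50])`` illustrates.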
Idle States Control Via Kernel Command Line diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 67e414e34f37..ad392f3aee06 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -734,10 +734,10 @@ References ========== .. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*, - http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf + https://events.static.linuxfound.org/sites/events/files/slides/LinuxConEurope_2015.pdf .. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*, - http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html + https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html .. [3] *Advanced Configuration and Power Interface Specification*, https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index 88f717e59a42..0a38cdf39df1 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -11,4 +11,5 @@ Working-State Power Management intel_idle cpufreq intel_pstate + cpufreq_drivers intel_epb diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index def074807cee..335696d3360d 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -2,262 +2,197 @@ Documentation for /proc/sys/kernel/ =================================== -kernel version 2.2.10 +.. 
See scripts/check-sysctl-docs to keep this up to date + Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com> -For general info and legal blurb, please look in index.rst. +For general info and legal blurb, please look in :doc:`index`. ------------------------------------------------------------------------------ This file contains documentation for the sysctl files in -/proc/sys/kernel/ and is valid for Linux kernel version 2.2. +``/proc/sys/kernel/`` and is valid for Linux kernel version 2.2. The files in this directory can be used to tune and monitor miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your +kernel. Since some of the files *can* be used to screw up your system, it is advisable to read both documentation and source before actually making adjustments. Currently, these files might (depending on your configuration) -show up in /proc/sys/kernel: - -- acct -- acpi_video_flags -- auto_msgmni -- bootloader_type [ X86 only ] -- bootloader_version [ X86 only ] -- cap_last_cap -- core_pattern -- core_pipe_limit -- core_uses_pid -- ctrl-alt-del -- dmesg_restrict -- domainname -- hostname -- hotplug -- hardlockup_all_cpu_backtrace -- hardlockup_panic -- hung_task_panic -- hung_task_check_count -- hung_task_timeout_secs -- hung_task_check_interval_secs -- hung_task_warnings -- hyperv_record_panic_msg -- kexec_load_disabled -- kptr_restrict -- l2cr [ PPC only ] -- modprobe ==> Documentation/debugging-modules.txt -- modules_disabled -- msg_next_id [ sysv ipc ] -- msgmax -- msgmnb -- msgmni -- nmi_watchdog -- osrelease -- ostype -- overflowgid -- overflowuid -- panic -- panic_on_oops -- panic_on_stackoverflow -- panic_on_unrecovered_nmi -- panic_on_warn -- panic_print -- panic_on_rcu_stall -- perf_cpu_time_max_percent -- perf_event_paranoid -- perf_event_max_stack -- perf_event_mlock_kb -- perf_event_max_contexts_per_stack -- pid_max -- 
powersave-nap [ PPC only ] -- printk -- printk_delay -- printk_ratelimit -- printk_ratelimit_burst -- pty ==> Documentation/filesystems/devpts.txt -- randomize_va_space -- real-root-dev ==> Documentation/admin-guide/initrd.rst -- reboot-cmd [ SPARC only ] -- rtsig-max -- rtsig-nr -- sched_energy_aware -- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst -- sem -- sem_next_id [ sysv ipc ] -- sg-big-buff [ generic SCSI device (sg) ] -- shm_next_id [ sysv ipc ] -- shm_rmid_forced -- shmall -- shmmax [ sysv ipc ] -- shmmni -- softlockup_all_cpu_backtrace -- soft_watchdog -- stack_erasing -- stop-a [ SPARC only ] -- sysrq ==> Documentation/admin-guide/sysrq.rst -- sysctl_writes_strict -- tainted ==> Documentation/admin-guide/tainted-kernels.rst -- threads-max -- unknown_nmi_panic -- watchdog -- watchdog_thresh -- version - - -acct: -===== +show up in ``/proc/sys/kernel``: + +.. contents:: :local: + + +acct +==== + +:: -highwater lowwater frequency + highwater lowwater frequency If BSD-style process accounting is enabled these values control its behaviour. If free space on filesystem where the log lives -goes below <lowwater>% accounting suspends. If free space gets -above <highwater>% accounting resumes. <Frequency> determines +goes below ``lowwater``% accounting suspends. If free space gets +above ``highwater``% accounting resumes. ``frequency`` determines how often do we check the amount of free space (value is in seconds). Default: -4 2 30 -That is, suspend accounting if there left <= 2% free; resume it -if we got >=4%; consider information about amount of free space -valid for 30 seconds. +:: -acpi_video_flags: -================= + 4 2 30 + +That is, suspend accounting if free space drops below 2%; resume it +if it increases to at least 4%; consider information about amount of +free space valid for 30 seconds. -flags -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. 
+acpi_video_flags +================ +See :doc:`/power/video`. This allows the video resume mode to be set, +in a similar fashion to the ``acpi_sleep`` kernel parameter, by +combining the following values: + += ======= +1 s3_bios +2 s3_mode +4 s3_beep += ======= -auto_msgmni: -============ + +auto_msgmni +=========== This variable has no effect and may be removed in future kernel releases. Reading it always returns 0. -Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni -upon memory add/remove or upon ipc namespace creation/removal. +Up to Linux 3.17, it enabled/disabled automatic recomputing of +`msgmni`_ +upon memory add/remove or upon IPC namespace creation/removal. Echoing "1" into this file enabled msgmni automatic recomputing. -Echoing "0" turned it off. auto_msgmni default value was 1. - +Echoing "0" turned it off. The default value was 1. -bootloader_type: -================ -x86 bootloader identification +bootloader_type (x86 only) +========================== This gives the bootloader type number as indicated by the bootloader, shifted left by 4, and OR'd with the low four bits of the bootloader version. The reason for this encoding is that this used to match the -type_of_loader field in the kernel header; the encoding is kept for +``type_of_loader`` field in the kernel header; the encoding is kept for backwards compatibility. That is, if the full bootloader type number is 0x15 and the full version number is 0x234, this file will contain the value 340 = 0x154. -See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.rst for additional information. - +See the ``type_of_loader`` and ``ext_loader_type`` fields in +:doc:`/x86/boot` for additional information. -bootloader_version: -=================== -x86 bootloader version +bootloader_version (x86 only) +============================= The complete bootloader version number. In the example above, this file will contain the value 564 = 0x234. 
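The encoding used by ``bootloader_type`` can be checked against the example values above with a short sketch. The function name ``bootloader_type_value`` is hypothetical; the calculation follows the description in the text (type shifted left by 4, OR'd with the low four bits of the version).

```python
def bootloader_type_value(full_type, full_version):
    """Combine the type and the low four version bits as described in the text."""
    return (full_type << 4) | (full_version & 0xF)


# Full type 0x15 and full version 0x234, as in the example above
value = bootloader_type_value(0x15, 0x234)
print(value, hex(value))  # 340 0x154

# bootloader_version reports the complete version number unchanged
print(0x234)  # 564
```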
-See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.rst for additional information. +See the ``type_of_loader`` and ``ext_loader_ver`` fields in +:doc:`/x86/boot` for additional information. -cap_last_cap: -============= +cap_last_cap +============ Highest valid capability of the running kernel. Exports -CAP_LAST_CAP from the kernel. +``CAP_LAST_CAP`` from the kernel. -core_pattern: -============= +core_pattern +============ -core_pattern is used to specify a core dumpfile pattern name. +``core_pattern`` is used to specify a core dumpfile pattern name. * max length 127 characters; default value is "core" -* core_pattern is used as a pattern template for the output filename; - certain string patterns (beginning with '%') are substituted with - their actual values. -* backward compatibility with core_uses_pid: +* ``core_pattern`` is used as a pattern template for the output + filename; certain string patterns (beginning with '%') are + substituted with their actual values. +* backward compatibility with ``core_uses_pid``: - If core_pattern does not include "%p" (default does not) - and core_uses_pid is set, then .PID will be appended to + If ``core_pattern`` does not include "%p" (default does not) + and ``core_uses_pid`` is set, then .PID will be appended to the filename. 
-* corename format specifiers:: - - %<NUL> '%' is dropped - %% output one '%' - %p pid - %P global pid (init PID namespace) - %i tid - %I global tid (init PID namespace) - %u uid (in initial user namespace) - %g gid (in initial user namespace) - %d dump mode, matches PR_SET_DUMPABLE and - /proc/sys/fs/suid_dumpable - %s signal number - %t UNIX time of dump - %h hostname - %e executable filename (may be shortened) - %E executable path - %<OTHER> both are dropped +* corename format specifiers + + ======== ========================================== + %<NUL> '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches ``PR_SET_DUMPABLE`` and + ``/proc/sys/fs/suid_dumpable`` + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened) + %E executable path + %c maximum size of core file by resource limit RLIMIT_CORE + %<OTHER> both are dropped + ======== ========================================== * If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be written to the standard input of that program instead of to a file. -core_pipe_limit: -================ +core_pipe_limit +=============== -This sysctl is only applicable when core_pattern is configured to pipe -core files to a user space helper (when the first character of -core_pattern is a '|', see above). When collecting cores via a pipe -to an application, it is occasionally useful for the collecting -application to gather data about the crashing process from its -/proc/pid directory. In order to do this safely, the kernel must wait -for the collecting process to exit, so as not to remove the crashing -processes proc files prematurely. 
This in turn creates the -possibility that a misbehaving userspace collecting process can block -the reaping of a crashed process simply by never exiting. This sysctl -defends against that. It defines how many concurrent crashing -processes may be piped to user space applications in parallel. If -this value is exceeded, then those crashing processes above that value -are noted via the kernel log and their cores are skipped. 0 is a -special value, indicating that unlimited processes may be captured in -parallel, but that no waiting will take place (i.e. the collecting -process is not guaranteed access to /proc/<crashing pid>/). This -value defaults to 0. - - -core_uses_pid: -============== +This sysctl is only applicable when `core_pattern`_ is configured to +pipe core files to a user space helper (when the first character of +``core_pattern`` is a '|', see above). +When collecting cores via a pipe to an application, it is occasionally +useful for the collecting application to gather data about the +crashing process from its ``/proc/pid`` directory. +In order to do this safely, the kernel must wait for the collecting +process to exit, so as not to remove the crashing processes proc files +prematurely. +This in turn creates the possibility that a misbehaving userspace +collecting process can block the reaping of a crashed process simply +by never exiting. +This sysctl defends against that. +It defines how many concurrent crashing processes may be piped to user +space applications in parallel. +If this value is exceeded, then those crashing processes above that +value are noted via the kernel log and their cores are skipped. +0 is a special value, indicating that unlimited processes may be +captured in parallel, but that no waiting will take place (i.e. the +collecting process is not guaranteed access to ``/proc/<crashing +pid>/``). +This value defaults to 0. + + +core_uses_pid +============= The default coredump filename is "core". 
By setting -core_uses_pid to 1, the coredump filename becomes core.PID. -If core_pattern does not include "%p" (default does not) -and core_uses_pid is set, then .PID will be appended to +``core_uses_pid`` to 1, the coredump filename becomes core.PID. +If `core_pattern`_ does not include "%p" (default does not) +and ``core_uses_pid`` is set, then .PID will be appended to the filename. -ctrl-alt-del: -============= +ctrl-alt-del +============ When the value in this file is 0, ctrl-alt-del is trapped and -sent to the init(1) program to handle a graceful restart. +sent to the ``init(1)`` program to handle a graceful restart. When, however, the value is > 0, Linux's reaction to a Vulcan Nerve Pinch (tm) will be an immediate reboot, without even syncing its dirty buffers. @@ -269,21 +204,22 @@ Note: to decide what to do with it. -dmesg_restrict: -=============== +dmesg_restrict +============== This toggle indicates whether unprivileged users are prevented -from using dmesg(8) to view messages from the kernel's log buffer. -When dmesg_restrict is set to (0) there are no restrictions. When -dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use -dmesg(8). +from using ``dmesg(8)`` to view messages from the kernel's log +buffer. +When ``dmesg_restrict`` is set to 0 there are no restrictions. +When ``dmesg_restrict`` is set set to 1, users must have +``CAP_SYSLOG`` to use ``dmesg(8)``. -The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the -default value of dmesg_restrict. +The kernel config option ``CONFIG_SECURITY_DMESG_RESTRICT`` sets the +default value of ``dmesg_restrict``. 
-domainname & hostname: -====================== +domainname & hostname +===================== These files can be used to set the NIS/YP domainname and the hostname of your box in exactly the same way as the commands @@ -302,167 +238,206 @@ hostname "darkstar" and DNS (Internet Domain Name Server) domainname "frop.org", not to be confused with the NIS (Network Information Service) or YP (Yellow Pages) domainname. These two domain names are in general different. For a detailed discussion -see the hostname(1) man page. +see the ``hostname(1)`` man page. -hardlockup_all_cpu_backtrace: -============================= +hardlockup_all_cpu_backtrace +============================ This value controls the hard lockup detector behavior when a hard lockup condition is detected as to whether or not to gather further debug information. If enabled, arch-specific all-CPU stack dumping will be initiated. -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -hardlockup_panic: -================= +hardlockup_panic +================ This parameter can be used to control whether the kernel panics when a hard lockup is detected. - 0 - don't panic on hard lockup - 1 - panic on hard lockup += =========================== +0 Don't panic on hard lockup. +1 Panic on hard lockup. += =========================== -See Documentation/admin-guide/lockup-watchdogs.rst for more information. This can -also be set using the nmi_watchdog kernel parameter. +See :doc:`/admin-guide/lockup-watchdogs` for more information. +This can also be set using the nmi_watchdog kernel parameter. -hotplug: -======== +hotplug +======= Path for the hotplug policy agent. -Default value is "/sbin/hotplug". +Default value is "``/sbin/hotplug``". 
-hung_task_panic: -================ +hung_task_panic +=============== Controls the kernel's behavior when a hung task is detected. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: continue operation. This is the default behavior. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -1: panic immediately. += ================================================= +0 Continue operation. This is the default behavior. +1 Panic immediately. += ================================================= -hung_task_check_count: -====================== +hung_task_check_count +===================== The upper bound on the number of tasks that are checked. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -hung_task_timeout_secs: -======================= +hung_task_timeout_secs +====================== When a task in D state did not get scheduled for more than this value report a warning. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0: means infinite timeout - no checking done. +0 means infinite timeout, no checking is done. -Possible values to set are in range {0..LONG_MAX/HZ}. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. -hung_task_check_interval_secs: -============================== +hung_task_check_interval_secs +============================= Hung task check interval. If hung task checking is enabled -(see hung_task_timeout_secs), the check is done every -hung_task_check_interval_secs seconds. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +(see `hung_task_timeout_secs`_), the check is done every +``hung_task_check_interval_secs`` seconds. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -0 (default): means use hung_task_timeout_secs as checking interval. -Possible values to set are in range {0..LONG_MAX/HZ}. 
+0 (default) means use ``hung_task_timeout_secs`` as checking +interval. +Possible values to set are in range {0:``LONG_MAX``/``HZ``}. -hung_task_warnings: -=================== + +hung_task_warnings +================== The maximum number of warnings to report. During a check interval if a hung task is detected, this value is decreased by 1. When this value reaches 0, no more warnings will be reported. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -1: report an infinite number of warnings. -hyperv_record_panic_msg: -======================== +hyperv_record_panic_msg +======================= Controls whether the panic kmsg data should be reported to Hyper-V. -0: do not report panic kmsg data. += ========================================================= +0 Do not report panic kmsg data. +1 Report the panic kmsg data. This is the default behavior. += ========================================================= -1: report the panic kmsg data. This is the default behavior. +kexec_load_disabled +=================== -kexec_load_disabled: -==================== - -A toggle indicating if the kexec_load syscall has been disabled. This -value defaults to 0 (false: kexec_load enabled), but can be set to 1 -(true: kexec_load disabled). Once true, kexec can no longer be used, and -the toggle cannot be set back to false. This allows a kexec image to be -loaded before disabling the syscall, allowing a system to set up (and -later use) an image without it being altered. Generally used together -with the "modules_disabled" sysctl. +A toggle indicating if the ``kexec_load`` syscall has been disabled. +This value defaults to 0 (false: ``kexec_load`` enabled), but can be +set to 1 (true: ``kexec_load`` disabled). +Once true, kexec can no longer be used, and the toggle cannot be set +back to false. 
+This allows a kexec image to be loaded before disabling the syscall, +allowing a system to set up (and later use) an image without it being +altered. +Generally used together with the `modules_disabled`_ sysctl. -kptr_restrict: -============== +kptr_restrict +============= This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. +exposing kernel addresses via ``/proc`` and other interfaces. + +When ``kptr_restrict`` is set to 0 (the default) the address is hashed +before printing. +(This is the equivalent to %p.) + +When ``kptr_restrict`` is set to 1, kernel pointers printed using the +%pK format specifier will be replaced with 0s unless the user has +``CAP_SYSLOG`` and effective user and group ids are equal to the real +ids. +This is because %pK checks are done at read() time rather than open() +time, so if permissions are elevated between the open() and the read() +(e.g via a setuid binary) then %pK will not leak kernel pointers to +unprivileged users. +Note, this is a temporary solution only. +The correct long-term solution is to do the permission checks at +open() time. +Consider removing world read permissions from files that use %pK, and +using `dmesg_restrict`_ to protect against uses of %pK in ``dmesg(8)`` +if leaking kernel pointer values to unprivileged users is a concern. + +When ``kptr_restrict`` is set to 2, kernel pointers printed using +%pK will be replaced with 0s regardless of privileges. + + +modprobe +======== -When kptr_restrict is set to 0 (the default) the address is hashed before -printing. (This is the equivalent to %p.) +This gives the full path of the modprobe command which the kernel will +use to load modules. This can be used to debug module loading +requests:: -When kptr_restrict is set to (1), kernel pointers printed using the %pK -format specifier will be replaced with 0's unless the user has CAP_SYSLOG -and effective user and group ids are equal to the real ids. 
This is -because %pK checks are done at read() time rather than open() time, so -if permissions are elevated between the open() and the read() (e.g via -a setuid binary) then %pK will not leak kernel pointers to unprivileged -users. Note, this is a temporary solution only. The correct long-term -solution is to do the permission checks at open() time. Consider removing -world read permissions from files that use %pK, and using dmesg_restrict -to protect against uses of %pK in dmesg(8) if leaking kernel pointer -values to unprivileged users is a concern. + echo '#! /bin/sh' > /tmp/modprobe + echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe + echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe + chmod a+x /tmp/modprobe + echo /tmp/modprobe > /proc/sys/kernel/modprobe -When kptr_restrict is set to (2), kernel pointers printed using -%pK will be replaced with 0's regardless of privileges. +This only applies when the *kernel* is requesting that the module be +loaded; it won't have any effect if the module is being loaded +explicitly using ``modprobe`` from userspace. -l2cr: (PPC only) +modules_disabled ================ -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. - - -modules_disabled: -================= - A toggle value indicating if modules are allowed to be loaded in an otherwise modular kernel. This toggle defaults to off (0), but can be set true (1). Once true, modules can be neither loaded nor unloaded, and the toggle cannot be set back -to false. Generally used with the "kexec_load_disabled" toggle. +to false. Generally used with the `kexec_load_disabled`_ toggle. + + +.. _msgmni: + +msgmax, msgmnb, and msgmni +========================== + +``msgmax`` is the maximum size of an IPC message, in bytes. 8192 by +default (``MSGMAX``). +``msgmnb`` is the maximum size of an IPC queue, in bytes. 16384 by +default (``MSGMNB``). 
-msg_next_id, sem_next_id, and shm_next_id: -========================================== +``msgmni`` is the maximum number of IPC queues. 32000 by default +(``MSGMNI``). + + +msg_next_id, sem_next_id, and shm_next_id (System V IPC) +======================================================== These three toggles allows to specify desired id for next allocated IPC object: message, semaphore or shared memory respectively. By default they are equal to -1, which means generic allocation logic. -Possible values to set are in range {0..INT_MAX}. +Possible values to set are in range {0:``INT_MAX``}. Notes: 1) kernel doesn't guarantee, that new object will have desired id. So, @@ -472,15 +447,16 @@ Notes: fails, it is undefined if the value remains unmodified or is reset to -1. -nmi_watchdog: -============= +nmi_watchdog +============ This parameter can be used to control the NMI watchdog (i.e. the hard lockup detector) on x86 systems. -0 - disable the hard lockup detector - -1 - enable the hard lockup detector += ================================= +0 Disable the hard lockup detector. +1 Enable the hard lockup detector. += ================================= The hard lockup detector monitors each CPU for its ability to respond to timer interrupts. The mechanism utilizes CPU performance counter registers @@ -492,11 +468,11 @@ in a KVM virtual machine. This default can be overridden by adding:: nmi_watchdog=1 -to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). +to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`). -numa_balancing: -=============== +numa_balancing +============== Enables/disables automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes @@ -514,9 +490,10 @@ ideally is offset by improved memory locality but there is no universal guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled. 
Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting -faults may be controlled by the numa_balancing_scan_period_min_ms, +faults may be controlled by the `numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. +numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls. + numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb =============================================================================================================================== @@ -542,23 +519,23 @@ workload pattern changes and minimises performance impact due to remote memory accesses. These sysctls control the thresholds for scan delays and the number of pages scanned. -numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to scan a tasks virtual memory. It effectively controls the maximum scanning rate for each task. -numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task when it initially forks. -numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to scan a tasks virtual memory. It effectively controls the minimum scanning rate for each task. -numa_balancing_scan_size_mb is how many megabytes worth of pages are +``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are scanned for a given scan. 
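The four scan tunables described above can be listed together before changing any of them. This is a minimal sketch under stated assumptions: the helper name ``show_numa_scan_tunables`` is hypothetical, and the files are assumed to be exposed under ``/proc/sys/kernel`` as this document describes. On kernels where a file is not present, the loop reports it as such instead of failing.

```shell
#!/bin/sh
# Hypothetical helper: print each NUMA balancing scan tunable and its value.
# The optional argument overrides the directory (assumed default:
# /proc/sys/kernel).
show_numa_scan_tunables() {
    dir=${1:-/proc/sys/kernel}
    for name in \
        numa_balancing_scan_period_min_ms \
        numa_balancing_scan_delay_ms \
        numa_balancing_scan_period_max_ms \
        numa_balancing_scan_size_mb
    do
        if [ -r "$dir/$name" ]; then
            echo "$name = $(cat "$dir/$name")"
        else
            echo "$name = (not present)"
        fi
    done
}

show_numa_scan_tunables
```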
-osrelease, ostype & version: -============================ +osrelease, ostype & version +=========================== :: @@ -569,15 +546,16 @@ osrelease, ostype & version: # cat version #5 Wed Feb 25 21:49:24 MET 1998 -The files osrelease and ostype should be clear enough. Version +The files ``osrelease`` and ``ostype`` should be clear enough. +``version`` needs a little more clarification however. The '#5' means that this is the fifth kernel built from this source base and the date behind it indicates the time the kernel was built. The only way to tune these values is to rebuild the kernel :-) -overflowgid & overflowuid: -========================== +overflowgid & overflowuid +========================= if your architecture did not always support 32-bit UIDs (i.e. arm, i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to @@ -588,108 +566,119 @@ These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. +panic +===== + +The value in this file determines the behaviour of the kernel on a panic: -====== -The value in this file represents the number of seconds the kernel -waits before rebooting on a panic. When you use the software watchdog, -the recommended setting is 60. +* if zero, the kernel will loop forever; +* if negative, the kernel will reboot immediately; +* if positive, the kernel will reboot after the corresponding number + of seconds. +When you use the software watchdog, the recommended setting is 60. -panic_on_io_nmi: -================ + +panic_on_io_nmi +=============== Controls the kernel's behavior when a CPU receives an NMI caused by an IO error. -0: try to continue operation (default) - -1: panic immediately. The IO error triggered an NMI. This indicates a - serious system condition which could result in IO data corruption. - Rather than continuing, panicking might be a better choice. 
Some - servers issue this sort of NMI when the dump button is pushed, - and you can use this option to take a crash dump. += ================================================================== +0 Try to continue operation (default). +1 Panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. += ================================================================== -panic_on_oops: -============== +panic_on_oops +============= Controls the kernel's behaviour when an oops or BUG is encountered. -0: try to continue operation - -1: panic immediately. If the `panic` sysctl is also non-zero then the - machine will be rebooted. += =================================================================== +0 Try to continue operation. +1 Panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. += =================================================================== -panic_on_stackoverflow: -======================= +panic_on_stackoverflow +====================== Controls the kernel's behavior when detecting the overflows of kernel, IRQ and exception stacks except a user stack. -This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. - -0: try to continue operation. +This file shows up if ``CONFIG_DEBUG_STACKOVERFLOW`` is enabled. -1: panic immediately. += ========================== +0 Try to continue operation. +1 Panic immediately. += ========================== -panic_on_unrecovered_nmi: -========================= +panic_on_unrecovered_nmi +======================== The default Linux behaviour on an NMI of either memory or unknown is to continue operation. 
For many environments such as scientific computing it is preferable that the box is taken out and the error dealt with than an uncorrected parity/ECC error get propagated. -A small number of systems do generate NMI's for bizarre random reasons +A small number of systems do generate NMIs for bizarre random reasons such as power management so the default is off. That sysctl works like the existing panic controls already in that directory. -panic_on_warn: -============== +panic_on_warn +============= Calls panic() in the WARN() path when set to 1. This is useful to avoid a kernel rebuild when attempting to kdump at the location of a WARN(). -0: only WARN(), default behaviour. - -1: call panic() after printing out WARN() location. += ================================================ +0 Only WARN(), default behaviour. +1 Call panic() after printing out WARN() location. += ================================================ -panic_print: -============ +panic_print +=========== Bitmask for printing system info when panic happens. User can chose combination of the following bits: -===== ======================================== +===== ============================================ bit 0 print all tasks info bit 1 print system memory info bit 2 print timer info -bit 3 print locks info if CONFIG_LOCKDEP is on +bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -===== ======================================== +===== ============================================ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print -panic_on_rcu_stall: -=================== +panic_on_rcu_stall +================== When set to 1, calls panic() after RCU stall detection messages. This is useful to define the root cause of RCU stalls using a vmcore. -0: do not panic() when RCU stall takes place, default behavior. 
+= ============================================================ +0 Do not panic() when RCU stall takes place, default behavior. +1 panic() after printing RCU stall messages. += ============================================================ -1: panic() after printing RCU stall messages. - -perf_cpu_time_max_percent: -========================== +perf_cpu_time_max_percent +========================= Hints to the kernel how much CPU time it should be allowed to use to handle perf sampling events. If the perf subsystem @@ -702,171 +691,179 @@ unexpectedly take too long to execute, the NMIs can become stacked up next to each other so much that nothing else is allowed to execute. -0: - disable the mechanism. Do not monitor or correct perf's - sampling rate no matter how CPU time it takes. +===== ======================================================== +0 Disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. -1-100: - attempt to throttle perf's sample rate to this - percentage of CPU. Note: the kernel calculates an - "expected" length of each sample event. 100 here means - 100% of that expected length. Even if this is set to - 100, you may still see sample throttling if this - length is exceeded. Set to 0 if you truly do not care - how much CPU is consumed. +1-100 Attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. +===== ======================================================== -perf_event_paranoid: -==================== +perf_event_paranoid +=================== Controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN). The default value is 2. 
=== ================================================================== - -1 Allow use of (almost) all events by all users + -1 Allow use of (almost) all events by all users. - Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK + Ignore mlock limit after perf_event_mlock_kb without + ``CAP_IPC_LOCK``. ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN +>=0 Disallow ftrace function tracepoint by users without + ``CAP_SYS_ADMIN``. - Disallow raw tracepoint access by users without CAP_SYS_ADMIN + Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``. ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN +>=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``. ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +>=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``. === ================================================================== -perf_event_max_stack: -===================== +perf_event_max_stack +==================== -Controls maximum number of stack frames to copy for (attr.sample_type & -PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using -'perf record -g' or 'perf trace --call-graph fp'. +Controls maximum number of stack frames to copy for (``attr.sample_type & +PERF_SAMPLE_CALLCHAIN``) configured events, for instance, when using +'``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. +enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 127. -perf_event_mlock_kb: -==================== +perf_event_mlock_kb +=================== Control size of per-cpu ring buffer not counted agains mlock limit. 
The default value is 512 + 1 page -perf_event_max_contexts_per_stack: -================================== +perf_event_max_contexts_per_stack +================================= Controls maximum number of stack frame context entries for -(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for -instance, when using 'perf record -g' or 'perf trace --call-graph fp'. +(``attr.sample_type & PERF_SAMPLE_CALLCHAIN``) configured events, for +instance, when using '``perf record -g``' or '``perf trace --call-graph fp``'. This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. +enabled, otherwise writing to this file will return ``-EBUSY``. The default value is 8. -pid_max: -======== +pid_max +======= PID allocation wrap value. When the kernel's next PID value reaches this value, it wraps back to a minimum PID value. -PIDs of value pid_max or larger are not allocated. +PIDs of value ``pid_max`` or larger are not allocated. -ns_last_pid: -============ +ns_last_pid +=========== The last pid allocated in the current (the one task using this sysctl lives in) pid namespace. When selecting a pid for a next task on fork kernel tries to allocate a number starting from this one. -powersave-nap: (PPC only) -========================= +powersave-nap (PPC only) +======================== If set, Linux-PPC will use the 'nap' mode of powersaving, otherwise the 'doze' mode will be used. + ============================================================== -printk: -======= +printk +====== -The four values in printk denote: console_loglevel, -default_message_loglevel, minimum_console_loglevel and -default_console_loglevel respectively. +The four values in printk denote: ``console_loglevel``, +``default_message_loglevel``, ``minimum_console_loglevel`` and +``default_console_loglevel`` respectively. These values influence printk() behavior when printing or -logging error messages. 
See 'man 2 syslog' for more info on +logging error messages. See '``man 2 syslog``' for more info on the different loglevels. -- console_loglevel: - messages with a higher priority than - this will be printed to the console -- default_message_loglevel: - messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: - minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: - default value for console_loglevel +======================== ===================================== +console_loglevel messages with a higher priority than + this will be printed to the console +default_message_loglevel messages without an explicit priority + will be printed with this priority +minimum_console_loglevel minimum (highest) value to which + console_loglevel can be set +default_console_loglevel default value for console_loglevel +======================== ===================================== -printk_delay: -============= +printk_delay +============ -Delay each printk message in printk_delay milliseconds +Delay each printk message in ``printk_delay`` milliseconds Value from 0 - 10000 is allowed. -printk_ratelimit: -================= +printk_ratelimit +================ -Some warning messages are rate limited. printk_ratelimit specifies +Some warning messages are rate limited. ``printk_ratelimit`` specifies the minimum length of time between these messages (in seconds). The default value is 5 seconds. A value of 0 will disable rate limiting. -printk_ratelimit_burst: -======================= +printk_ratelimit_burst +====================== -While long term we enforce one message per printk_ratelimit +While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. -printk_ratelimit_burst specifies the number of messages we can +``printk_ratelimit_burst`` specifies the number of messages we can send before ratelimiting kicks in. 
The default value is 10 messages. -printk_devkmsg: -=============== - -Control the logging to /dev/kmsg from userspace: - -ratelimit: - default, ratelimited +printk_devkmsg +============== -on: unlimited logging to /dev/kmsg from userspace +Control the logging to ``/dev/kmsg`` from userspace: -off: logging to /dev/kmsg disabled +========= ============================================= +ratelimit default, ratelimited +on unlimited logging to /dev/kmsg from userspace +off logging to /dev/kmsg disabled +========= ============================================= -The kernel command line parameter printk.devkmsg= overrides this and is +The kernel command line parameter ``printk.devkmsg=`` overrides this and is a one-time setting until next reboot: once set, it cannot be changed by this sysctl interface anymore. +============================================================== -randomize_va_space: -=================== + +pty +=== + +See Documentation/filesystems/devpts.txt. + + +randomize_va_space +================== This option can be used to select the type of process address space randomization that is used in the system, for architectures @@ -881,10 +878,10 @@ that support this feature. This, among other things, implies that shared libraries will be loaded to random addresses. Also for PIE-linked binaries, the location of code start is randomized. This is the default if the - CONFIG_COMPAT_BRK option is enabled. + ``CONFIG_COMPAT_BRK`` option is enabled. 2 Additionally enable heap randomization. This is the default if - CONFIG_COMPAT_BRK is disabled. + ``CONFIG_COMPAT_BRK`` is disabled. There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts @@ -894,31 +891,27 @@ that support this feature. systems it is safe to choose full randomization. 
Systems with ancient and/or broken binaries should be configured - with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + with ``CONFIG_COMPAT_BRK`` enabled, which excludes the heap from process address space randomization. == =========================================================================== -reboot-cmd: (Sparc only) -======================== - -??? This seems to be a way to give an argument to the Sparc -ROM/Flash boot loader. Maybe to tell it what to do after -rebooting. ??? +real-root-dev +============= +See :doc:`/admin-guide/initrd`. -rtsig-max & rtsig-nr: -===================== -The file rtsig-max can be used to tune the maximum number -of POSIX realtime (queued) signals that can be outstanding -in the system. +reboot-cmd (SPARC only) +======================= -rtsig-nr shows the number of RT signals currently queued. +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? -sched_energy_aware: -=================== +sched_energy_aware +================== Enables/disables Energy Aware Scheduling (EAS). EAS starts automatically on platforms where it can run (that is, @@ -928,75 +921,88 @@ requirements for EAS but you do not want to use it, change this value to 0. -sched_schedstats: -================= +sched_schedstats +================ Enables/disables scheduler statistics. Enabling this feature incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. -sg-big-buff: -============ +seccomp +======= + +See :doc:`/userspace-api/seccomp_filter`. + + +sg-big-buff +=========== This file shows the size of the generic SCSI (sg) buffer. You can't tune it just yet, but you could change it on -compile time by editing include/scsi/sg.h and changing -the value of SG_BIG_BUFF. +compile time by editing ``include/scsi/sg.h`` and changing +the value of ``SG_BIG_BUFF``. There shouldn't be any reason to change this value. 
If you can come up with one, you probably know what you are doing anyway :) -shmall: -======= +shmall +====== This parameter sets the total amount of shared memory pages that -can be used system wide. Hence, SHMALL should always be at least -ceil(shmmax/PAGE_SIZE). +can be used system wide. Hence, ``shmall`` should always be at least +``ceil(shmmax/PAGE_SIZE)``. -If you are not sure what the default PAGE_SIZE is on your Linux -system, you can run the following command: +If you are not sure what the default ``PAGE_SIZE`` is on your Linux +system, you can run the following command:: # getconf PAGE_SIZE -shmmax: -======= +shmmax +====== This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. Shared memory segments up to 1Gb are now supported in the -kernel. This value defaults to SHMMAX. +kernel. This value defaults to ``SHMMAX``. -shm_rmid_forced: -================ +shmmni +====== + +This value determines the maximum number of shared memory segments. +4096 by default (``SHMMNI``). + + +shm_rmid_forced +=============== Linux lets you set resource limits, including how much memory one -process can consume, via setrlimit(2). Unfortunately, shared memory +process can consume, via ``setrlimit(2)``. Unfortunately, shared memory segments are allowed to exist without association with any process, and thus might not be counted against any resource limits. If enabled, shared memory segments are automatically destroyed when their attach count becomes zero after a detach or a process termination. It will also destroy segments that were created, but never attached to, on exit -from the process. The only use left for IPC_RMID is to immediately +from the process. The only use left for ``IPC_RMID`` is to immediately destroy an unattached segment. Of course, this breaks the way things are defined, so some applications might stop working. 
Note that this feature will do you no good unless you also configure your resource -limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +limits (in particular, ``RLIMIT_AS`` and ``RLIMIT_NPROC``). Most systems don't need this. Note that if you change this from 0 to 1, already created segments without users and with a dead originative process will be destroyed. -sysctl_writes_strict: -===================== +sysctl_writes_strict +==================== Control how file position affects the behavior of updating sysctl values -via the /proc/sys interface: +via the ``/proc/sys`` interface: == ====================================================================== -1 Legacy per-write sysctl value handling, with no printk warnings. @@ -1013,8 +1019,8 @@ via the /proc/sys interface: == ====================================================================== -softlockup_all_cpu_backtrace: -============================= +softlockup_all_cpu_backtrace +============================ This value controls the soft lockup detector thread's behavior when a soft lockup condition is detected as to whether or not @@ -1024,43 +1030,80 @@ be issued an NMI and instructed to capture stack trace. This feature is only applicable for architectures which support NMI. -0: do nothing. This is the default behavior. += ============================================ +0 Do nothing. This is the default behavior. +1 On detection capture more debug information. += ============================================ -1: on detection capture more debug information. +softlockup_panic +================= -soft_watchdog: -============== +This parameter can be used to control whether the kernel panics +when a soft lockup is detected. -This parameter can be used to control the soft lockup detector. += ============================================ +0 Don't panic on soft lockup. +1 Panic on soft lockup. 
+= ============================================ - 0 - disable the soft lockup detector +This can also be set using the softlockup_panic kernel parameter. - 1 - enable the soft lockup detector + +soft_watchdog +============= + +This parameter can be used to control the soft lockup detector. + += ================================= +0 Disable the soft lockup detector. +1 Enable the soft lockup detector. += ================================= The soft lockup detector monitors CPUs for threads that are hogging the CPUs without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads from running. The mechanism depends on the CPUs ability to respond to timer interrupts which are needed for the 'watchdog/N' threads to be woken up by -the watchdog timer function, otherwise the NMI watchdog - if enabled - can +the watchdog timer function, otherwise the NMI watchdog — if enabled — can detect a hard lockup condition. -stack_erasing: -============== +stack_erasing +============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. +of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. - 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. += ==================================================================== +0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +1 Kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. 
+= ==================================================================== + + +stop-a (SPARC only) +=================== + +Controls Stop-A: + += ==================================== +0 Stop-A has no effect. +1 Stop-A breaks to the PROM (default). += ==================================== + +Stop-A is always enabled on a panic, so that the user can return to +the boot PROM. - 1: kernel stack erasing is enabled (default), it is performed before - returning to the userspace at the end of syscalls. + +sysrq +===== + +See :doc:`/admin-guide/sysrq`. tainted @@ -1090,30 +1133,30 @@ ORed together. The letters are seen in "Tainted" line of Oops reports. 131072 `(T)` The kernel was built with the struct randomization plugin ====== ===== ============================================================== -See Documentation/admin-guide/tainted-kernels.rst for more information. +See :doc:`/admin-guide/tainted-kernels` for more information. -threads-max: -============ +threads-max +=========== This value controls the maximum number of threads that can be created -using fork(). +using ``fork()``. During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages. -The minimum value that can be written to threads-max is 1. +The minimum value that can be written to ``threads-max`` is 1. -The maximum value that can be written to threads-max is given by the -constant FUTEX_TID_MASK (0x3fffffff). +The maximum value that can be written to ``threads-max`` is given by the +constant ``FUTEX_TID_MASK`` (0x3fffffff). -If a value outside of this range is written to threads-max an error -EINVAL occurs. +If a value outside of this range is written to ``threads-max`` an +``EINVAL`` error occurs. -unknown_nmi_panic: -================== +unknown_nmi_panic +================= The value in this file affects behavior of handling NMI. 
When the value is non-zero, unknown NMI is trapped and then panic occurs. At @@ -1123,37 +1166,39 @@ NMI switch that most IA32 servers have fires unknown NMI up, for example. If a system hangs up, try pressing the NMI switch. -watchdog: -========= +watchdog +======== This parameter can be used to disable or enable the soft lockup detector -_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. - - 0 - disable both lockup detectors +*and* the NMI watchdog (i.e. the hard lockup detector) at the same time. - 1 - enable both lockup detectors += ============================== +0 Disable both lockup detectors. +1 Enable both lockup detectors. += ============================== The soft lockup detector and the NMI watchdog can also be disabled or -enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing:: +enabled individually, using the ``soft_watchdog`` and ``nmi_watchdog`` +parameters. +If the ``watchdog`` parameter is read, for example by executing:: cat /proc/sys/kernel/watchdog -the output of this command (0 or 1) shows the logical OR of soft_watchdog -and nmi_watchdog. +the output of this command (0 or 1) shows the logical OR of +``soft_watchdog`` and ``nmi_watchdog``. -watchdog_cpumask: -================= +watchdog_cpumask +================ This value can be used to control on which cpus the watchdog may run. -The default cpumask is all possible cores, but if NO_HZ_FULL is +The default cpumask is all possible cores, but if ``NO_HZ_FULL`` is enabled in the kernel config, and cores are specified with the -nohz_full= boot argument, those cores are excluded by default. +``nohz_full=`` boot argument, those cores are excluded by default. Offline cores can be included in this mask, and if the core is later brought online, the watchdog will be started based on the mask value. 
-Typically this value would only be touched in the nohz_full case +Typically this value would only be touched in the ``nohz_full`` case to re-enable cores that by default were not running the watchdog, if a kernel lockup was suspected on those cores. @@ -1164,12 +1209,12 @@ might say:: echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask -watchdog_thresh: -================ +watchdog_thresh +=============== This value can be used to control the frequency of hrtimer and NMI events and the soft and hard lockup thresholds. The default threshold is 10 seconds. -The softlockup threshold is (2 * watchdog_thresh). Setting this +The softlockup threshold is (``2 * watchdog_thresh``). Setting this tunable to zero will disable lockup detection altogether. diff --git a/Documentation/arm/tcm.rst b/Documentation/arm/tcm.rst index effd9c7bc968..b256f9783883 100644 --- a/Documentation/arm/tcm.rst +++ b/Documentation/arm/tcm.rst @@ -4,18 +4,18 @@ ARM TCM (Tightly-Coupled Memory) handling in Linux Written by Linus Walleij <linus.walleij@stericsson.com> -Some ARM SoC:s have a so-called TCM (Tightly-Coupled Memory). +Some ARM SoCs have a so-called TCM (Tightly-Coupled Memory). This is usually just a few (4-64) KiB of RAM inside the ARM processor. -Due to being embedded inside the CPU The TCM has a +Due to being embedded inside the CPU, the TCM has a Harvard-architecture, so there is an ITCM (instruction TCM) and a DTCM (data TCM). The DTCM can not contain any instructions, but the ITCM can actually contain data. The size of DTCM or ITCM is minimum 4KiB so the typical minimum configuration is 4KiB ITCM and 4KiB DTCM. -ARM CPU:s have special registers to read out status, physical +ARM CPUs have special registers to read out status, physical location and size of TCM memories. arch/arm/include/asm/cputype.h defines a CPUID_TCM register that you can read out from the system control coprocessor. 
Documentation from ARM can be found diff --git a/Documentation/arm64/amu.rst b/Documentation/arm64/amu.rst new file mode 100644 index 000000000000..5057b11100ed --- /dev/null +++ b/Documentation/arm64/amu.rst @@ -0,0 +1,112 @@ +======================================================= +Activity Monitors Unit (AMU) extension in AArch64 Linux +======================================================= + +Author: Ionela Voinescu <ionela.voinescu@arm.com> + +Date: 2019-09-10 + +This document briefly describes the provision of Activity Monitors Unit +support in AArch64 Linux. + + +Architecture overview +--------------------- + +The activity monitors extension is an optional extension introduced by the +ARMv8.4 CPU architecture. + +The activity monitors unit, implemented in each CPU, provides performance +counters intended for system management use. The AMU extension provides a +system register interface to the counter registers and also supports an +optional external memory-mapped interface. + +Version 1 of the Activity Monitors architecture implements a counter group +of four fixed and architecturally defined 64-bit event counters. + - CPU cycle counter: increments at the frequency of the CPU. + - Constant counter: increments at the fixed frequency of the system + clock. + - Instructions retired: increments with every architecturally executed + instruction. + - Memory stall cycles: counts instruction dispatch stall cycles caused by + misses in the last level cache within the clock domain. + +When in WFI or WFE these counters do not increment. + +The Activity Monitors architecture provides space for up to 16 architected +event counters. Future versions of the architecture may use this space to +implement additional architected event counters. + +Additionally, version 1 implements a counter group of up to 16 auxiliary +64-bit event counters. + +On cold reset all counters reset to 0. 
+ + +Basic support +------------- + +The kernel can safely run a mix of CPUs with and without support for the +activity monitors extension. Therefore, when CONFIG_ARM64_AMU_EXTN is +selected we unconditionally enable the capability to allow any late CPU +(secondary or hotplugged) to detect and use the feature. + +When the feature is detected on a CPU, we flag the availability of the +feature but this does not guarantee the correct functionality of the +counters, only the presence of the extension. + +Firmware (code running at higher exception levels, e.g. arm-tf) support is +needed to: + - Enable access for lower exception levels (EL2 and EL1) to the AMU + registers. + - Enable the counters. If not enabled these will read as 0. + - Save/restore the counters before/after the CPU is being put/brought up + from the 'off' power state. + +When using kernels that have this feature enabled but boot with broken +firmware the user may experience panics or lockups when accessing the +counter registers. Even if these symptoms are not observed, the values +returned by the register reads might not correctly reflect reality. Most +commonly, the counters will read as 0, indicating that they are not +enabled. + +If proper support is not provided in firmware it's best to disable +CONFIG_ARM64_AMU_EXTN. To be noted that for security reasons, this does not +bypass the setting of AMUSERENR_EL0 to trap accesses from EL0 (userspace) to +EL1 (kernel). Therefore, firmware should still ensure accesses to AMU registers +are not trapped in EL2/EL3. + +The fixed counters of AMUv1 are accessible through the following system +register definitions: + - SYS_AMEVCNTR0_CORE_EL0 + - SYS_AMEVCNTR0_CONST_EL0 + - SYS_AMEVCNTR0_INST_RET_EL0 + - SYS_AMEVCNTR0_MEM_STALL_EL0 + +Auxiliary platform specific counters can be accessed using +SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15. + +Details can be found in: arch/arm64/include/asm/sysreg.h. 
+ + +Userspace access +---------------- + +Currently, access from userspace to the AMU registers is disabled due to: + - Security reasons: they might expose information about code executed in + secure mode. + - Purpose: AMU counters are intended for system management use. + +Also, the presence of the feature is not visible to userspace. + + +Virtualization +-------------- + +Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM +guest side is disabled due to: + - Security reasons: they might expose information about code executed + by other guests or the host. + +Any attempt to access the AMU registers will result in an UNDEFINED +exception being injected into the guest. diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst index 5d78a6f5b0ae..a3f1a47b6f1c 100644 --- a/Documentation/arm64/booting.rst +++ b/Documentation/arm64/booting.rst @@ -248,6 +248,20 @@ Before jumping into the kernel, the following conditions must be met: - HCR_EL2.APK (bit 40) must be initialised to 0b1 - HCR_EL2.API (bit 41) must be initialised to 0b1 + For CPUs with Activity Monitors Unit v1 (AMUv1) extension present: + - If EL3 is present: + CPTR_EL3.TAM (bit 30) must be initialised to 0b0 + CPTR_EL2.TAM (bit 30) must be initialised to 0b0 + AMCNTENSET0_EL0 must be initialised to 0b1111 + AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. + - If the kernel is entered at EL1: + AMCNTENSET0_EL0 must be initialised to 0b1111 + AMCNTENSET1_EL0 must be initialised to a platform specific value + having 0b1 set for the corresponding bit for each of the auxiliary + counters present. + The requirements described above for CPU mode, caches, MMUs, architected timers, coherency and system registers apply to all CPUs. All CPUs must enter the kernel in the same exception level. 
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst index 5c0c69dc58aa..09cbb4ed2237 100644 --- a/Documentation/arm64/index.rst +++ b/Documentation/arm64/index.rst @@ -6,6 +6,7 @@ ARM64 Architecture :maxdepth: 1 acpi_object_usage + amu arm-acpi booting cpu-feature-registers diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst index 02e02175e6f5..cf03b3290800 100644 --- a/Documentation/arm64/memory.rst +++ b/Documentation/arm64/memory.rst @@ -129,7 +129,7 @@ this logic. As a single binary will need to support both 48-bit and 52-bit VA spaces, the VMEMMAP must be sized large enough for 52-bit VAs and -also must be sized large enought to accommodate a fixed PAGE_OFFSET. +also must be sized large enough to accommodate a fixed PAGE_OFFSET. Most code in the kernel should not need to consider the VA_BITS, for code that does need to know the VA size the variables are diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst index 9120e59578dc..2c08c628febd 100644 --- a/Documentation/arm64/silicon-errata.rst +++ b/Documentation/arm64/silicon-errata.rst @@ -110,6 +110,8 @@ stable kernels. 
+----------------+-----------------+-----------------+-----------------------------+ | Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 | +----------------+-----------------+-----------------+-----------------------------+ +| Cavium | ThunderX GICv3 | #38539 | N/A | ++----------------+-----------------+-----------------+-----------------------------+ | Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 | +----------------+-----------------+-----------------+-----------------------------+ | Cavium | ThunderX Core | #30115 | CAVIUM_ERRATUM_30115 | diff --git a/Documentation/arm64/tagged-address-abi.rst b/Documentation/arm64/tagged-address-abi.rst index d4a85d535bf9..4a9d9c794ee5 100644 --- a/Documentation/arm64/tagged-address-abi.rst +++ b/Documentation/arm64/tagged-address-abi.rst @@ -44,8 +44,15 @@ The AArch64 Tagged Address ABI has two stages of relaxation depending how the user addresses are used by the kernel: 1. User addresses not accessed by the kernel but used for address space - management (e.g. ``mmap()``, ``mprotect()``, ``madvise()``). The use - of valid tagged pointers in this context is always allowed. + management (e.g. ``mprotect()``, ``madvise()``). The use of valid + tagged pointers in this context is allowed with the exception of + ``brk()``, ``mmap()`` and the ``new_address`` argument to + ``mremap()`` as these have the potential to alias with existing + user addresses. + + NOTE: This behaviour changed in v5.6 and so some earlier kernels may + incorrectly accept valid tagged pointers for the ``brk()``, + ``mmap()`` and ``mremap()`` system calls. 2. User addresses accessed by the kernel (e.g. ``write()``). 
This ABI relaxation is disabled by default and the application thread needs to diff --git a/Documentation/block/capability.rst b/Documentation/block/capability.rst index 2cf258d64bbe..160a5148b915 100644 --- a/Documentation/block/capability.rst +++ b/Documentation/block/capability.rst @@ -2,17 +2,9 @@ Generic Block Device Capability =============================== -This file documents the sysfs file block/<disk>/capability +This file documents the sysfs file ``block/<disk>/capability``. -capability is a hex word indicating which capabilities a specific disk -supports. For more information on bits not listed here, see -include/linux/genhd.h +``capability`` is a bitfield, printed in hexadecimal, indicating which +capabilities a specific block device supports: -GENHD_FL_MEDIA_CHANGE_NOTIFY ----------------------------- - -Value: 4 - -When this bit is set, the disk supports Asynchronous Notification -of media change events. These events will be broadcast to user -space via kernel uevent. +.. kernel-doc:: include/linux/genhd.h diff --git a/Documentation/conf.py b/Documentation/conf.py index 3c7bdf4cd31f..9ae8e9abf846 100644 --- a/Documentation/conf.py +++ b/Documentation/conf.py @@ -38,7 +38,11 @@ needs_sphinx = '1.3' # ones. extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig', 'automarkup', - 'maintainers_include'] + 'maintainers_include', 'sphinx.ext.autosectionlabel' ] + +# Ensure that autosectionlabel will produce unique names +autosectionlabel_prefix_document = True +autosectionlabel_maxdepth = 2 # The name of the math extension changed on Sphinx 1.4 if (major == 1 and minor > 3) or (major > 1): diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index a501dc1c90d0..0897ad12c119 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -8,41 +8,81 @@ This is the beginning of a manual for core kernel APIs. 
The conversion Core utilities ============== +This section has general and "core core" documentation. The first is a +massive grab-bag of kerneldoc info left over from the docbook days; it +should really be broken up someday when somebody finds the energy to do +it. + .. toctree:: :maxdepth: 1 kernel-api + workqueue + printk-formats + symbol-namespaces + +Data structures and low-level utilities +======================================= + +Library functionality that is used throughout the kernel. + +.. toctree:: + :maxdepth: 1 + + kobject assoc_array + xarray + idr + circular-buffers + generic-radix-tree + packing + timekeeping + errseq + +Concurrency primitives +====================== + +How Linux keeps everything from happening at the same time. See +:doc:`/locking/index` for more related documentation. + +.. toctree:: + :maxdepth: 1 + atomic_ops - cachetlb refcount-vs-atomic - cpu_hotplug - idr local_ops - workqueue + padata + ../RCU/index + +Low-level hardware management +============================= + +Cache management, managing CPU hotplug, etc. + +.. toctree:: + :maxdepth: 1 + + cachetlb + cpu_hotplug + memory-hotplug genericirq - xarray - librs - genalloc - errseq - packing - printk-formats - circular-buffers - generic-radix-tree + protection-keys + +Memory management +================= + +How to allocate and use memory in the kernel. Note that there is a lot +more memory-management documentation in :doc:`/vm/index`. + +.. toctree:: + :maxdepth: 1 + memory-allocation mm-api + genalloc pin_user_pages - gfp_mask-from-fs-io - timekeeping boot-time-mm - memory-hotplug - protection-keys - ../RCU/index - gcc-plugins - symbol-namespaces - padata - ioctl - + gfp_mask-from-fs-io Interfaces for kernel debugging =============================== @@ -53,6 +93,16 @@ Interfaces for kernel debugging debug-objects tracepoint +Everything else +=============== + +Documents that don't fit elsewhere or which have yet to be categorized. + +.. 
toctree:: + :maxdepth: 1 + + librs + .. only:: subproject and html Indices diff --git a/Documentation/kobject.txt b/Documentation/core-api/kobject.rst index ff4c25098119..1f62d4d7d966 100644 --- a/Documentation/kobject.txt +++ b/Documentation/core-api/kobject.rst @@ -25,7 +25,7 @@ some terms we will be working with. usually embedded within some other structure which contains the stuff the code is really interested in. - No structure should EVER have more than one kobject embedded within it. + No structure should **EVER** have more than one kobject embedded within it. If it does, the reference counting for the object is sure to be messed up and incorrect, and your code will be buggy. So do not do this. @@ -55,7 +55,7 @@ a larger, domain-specific object. To this end, kobjects will be found embedded in other structures. If you are used to thinking of things in object-oriented terms, kobjects can be seen as a top-level, abstract class from which other classes are derived. A kobject implements a set of -capabilities which are not particularly useful by themselves, but which are +capabilities which are not particularly useful by themselves, but are nice to have in other objects. The C language does not allow for the direct expression of inheritance, so other techniques - such as structure embedding - must be used. @@ -65,12 +65,12 @@ this is analogous as to how "list_head" structs are rarely useful on their own, but are invariably found embedded in the larger objects of interest.) -So, for example, the UIO code in drivers/uio/uio.c has a structure that +So, for example, the UIO code in ``drivers/uio/uio.c`` has a structure that defines the memory region associated with a uio device:: struct uio_map { - struct kobject kobj; - struct uio_mem *mem; + struct kobject kobj; + struct uio_mem *mem; }; If you have a struct uio_map structure, finding its embedded kobject is @@ -78,30 +78,30 @@ just a matter of using the kobj member. 
Code that works with kobjects will often have the opposite problem, however: given a struct kobject pointer, what is the pointer to the containing structure? You must avoid tricks (such as assuming that the kobject is at the beginning of the structure) -and, instead, use the container_of() macro, found in <linux/kernel.h>:: +and, instead, use the container_of() macro, found in ``<linux/kernel.h>``:: container_of(pointer, type, member) where: - * "pointer" is the pointer to the embedded kobject, - * "type" is the type of the containing structure, and - * "member" is the name of the structure field to which "pointer" points. + * ``pointer`` is the pointer to the embedded kobject, + * ``type`` is the type of the containing structure, and + * ``member`` is the name of the structure field to which ``pointer`` points. The return value from container_of() is a pointer to the corresponding -container type. So, for example, a pointer "kp" to a struct kobject -embedded *within* a struct uio_map could be converted to a pointer to the -*containing* uio_map structure with:: +container type. So, for example, a pointer ``kp`` to a struct kobject +embedded **within** a struct uio_map could be converted to a pointer to the +**containing** uio_map structure with:: struct uio_map *u_map = container_of(kp, struct uio_map, kobj); -For convenience, programmers often define a simple macro for "back-casting" +For convenience, programmers often define a simple macro for **back-casting** kobject pointers to the containing type. Exactly this happens in the -earlier drivers/uio/uio.c, as you can see here:: +earlier ``drivers/uio/uio.c``, as you can see here:: struct uio_map { - struct kobject kobj; - struct uio_mem *mem; + struct kobject kobj; + struct uio_mem *mem; }; #define to_map(map) container_of(map, struct uio_map, kobj) @@ -125,7 +125,7 @@ must have an associated kobj_type. 
After calling kobject_init(), to register the kobject with sysfs, the function kobject_add() must be called:: int kobject_add(struct kobject *kobj, struct kobject *parent, - const char *fmt, ...); + const char *fmt, ...); This sets up the parent of the kobject and the name for the kobject properly. If the kobject is to be associated with a specific kset, @@ -172,13 +172,13 @@ call to kobject_uevent():: int kobject_uevent(struct kobject *kobj, enum kobject_action action); -Use the KOBJ_ADD action for when the kobject is first added to the kernel. +Use the **KOBJ_ADD** action for when the kobject is first added to the kernel. This should be done only after any attributes or children of the kobject have been initialized properly, as userspace will instantly start to look for them when this call happens. When the kobject is removed from the kernel (details on how to do that are -below), the uevent for KOBJ_REMOVE will be automatically created by the +below), the uevent for **KOBJ_REMOVE** will be automatically created by the kobject core, so the caller does not have to worry about doing that by hand. @@ -238,7 +238,7 @@ Both types of attributes used here, with a kobject that has been created with the kobject_create_and_add(), can be of type kobj_attribute, so no special custom attribute is needed to be created. -See the example module, samples/kobject/kobject-example.c for an +See the example module, ``samples/kobject/kobject-example.c`` for an implementation of a simple kobject and attributes. @@ -270,10 +270,10 @@ such a method has a form like:: void my_object_release(struct kobject *kobj) { - struct my_object *mine = container_of(kobj, struct my_object, kobj); + struct my_object *mine = container_of(kobj, struct my_object, kobj); - /* Perform any additional cleanup on this object, then... */ - kfree(mine); + /* Perform any additional cleanup on this object, then... 
*/ + kfree(mine); } One important point cannot be overstated: every kobject must have a @@ -297,11 +297,11 @@ instead, it is associated with the ktype. So let us introduce struct kobj_type:: struct kobj_type { - void (*release)(struct kobject *kobj); - const struct sysfs_ops *sysfs_ops; - struct attribute **default_attrs; - const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); - const void *(*namespace)(struct kobject *kobj); + void (*release)(struct kobject *kobj); + const struct sysfs_ops *sysfs_ops; + struct attribute **default_attrs; + const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); + const void *(*namespace)(struct kobject *kobj); }; This structure is used to describe a particular type of kobject (or, more @@ -352,8 +352,8 @@ created and never declared statically or on the stack. To create a new kset use:: struct kset *kset_create_and_add(const char *name, - struct kset_uevent_ops *u, - struct kobject *parent); + struct kset_uevent_ops *u, + struct kobject *parent); When you are finished with the kset, call:: @@ -365,16 +365,16 @@ Because other references to the kset may still exist, the release may happen after kset_unregister() returns. An example of using a kset can be seen in the -samples/kobject/kset-example.c file in the kernel tree. +``samples/kobject/kset-example.c`` file in the kernel tree. 
If a kset wishes to control the uevent operations of the kobjects associated with it, it can use the struct kset_uevent_ops to handle it:: struct kset_uevent_ops { - int (*filter)(struct kset *kset, struct kobject *kobj); - const char *(*name)(struct kset *kset, struct kobject *kobj); - int (*uevent)(struct kset *kset, struct kobject *kobj, - struct kobj_uevent_env *env); + int (*filter)(struct kset *kset, struct kobject *kobj); + const char *(*name)(struct kset *kset, struct kobject *kobj); + int (*uevent)(struct kset *kset, struct kobject *kobj, + struct kobj_uevent_env *env); }; @@ -408,8 +408,8 @@ Kobject removal After a kobject has been registered with the kobject core successfully, it must be cleaned up when the code is finished with it. To do that, call kobject_put(). By doing this, the kobject core will automatically clean up -all of the memory allocated by this kobject. If a KOBJ_ADD uevent has been -sent for the object, a corresponding KOBJ_REMOVE uevent will be sent, and +all of the memory allocated by this kobject. If a ``KOBJ_ADD`` uevent has been +sent for the object, a corresponding ``KOBJ_REMOVE`` uevent will be sent, and any other sysfs housekeeping will be handled for the caller properly. If you need to do a two-stage delete of the kobject (say you are not @@ -430,5 +430,5 @@ Example code to copy from ========================= For a more complete example of using ksets and kobjects properly, see the -example programs samples/kobject/{kobject-example.c,kset-example.c}, -which will be built as loadable modules if you select CONFIG_SAMPLE_KOBJECT. +example programs ``samples/kobject/{kobject-example.c,kset-example.c}``, +which will be built as loadable modules if you select ``CONFIG_SAMPLE_KOBJECT``. diff --git a/Documentation/cpu-freq/amd-powernow.txt b/Documentation/cpu-freq/amd-powernow.txt deleted file mode 100644 index 254da155fa47..000000000000 --- a/Documentation/cpu-freq/amd-powernow.txt +++ /dev/null @@ -1,38 +0,0 @@ - -PowerNow! 
and Cool'n'Quiet are AMD names for frequency -management capabilities in AMD processors. As the hardware -implementation changes in new generations of the processors, -there is a different cpu-freq driver for each generation. - -Note that the driver's will not load on the "wrong" hardware, -so it is safe to try each driver in turn when in doubt as to -which is the correct driver. - -Note that the functionality to change frequency (and voltage) -is not available in all processors. The drivers will refuse -to load on processors without this capability. The capability -is detected with the cpuid instruction. - -The drivers use BIOS supplied tables to obtain frequency and -voltage information appropriate for a particular platform. -Frequency transitions will be unavailable if the BIOS does -not supply these tables. - -6th Generation: powernow-k6 - -7th Generation: powernow-k7: Athlon, Duron, Geode. - -8th Generation: powernow-k8: Athlon, Athlon 64, Opteron, Sempron. -Documentation on this functionality in 8th generation processors -is available in the "BIOS and Kernel Developer's Guide", publication -26094, in chapter 9, available for download from www.amd.com. - -BIOS supplied data, for powernow-k7 and for powernow-k8, may be -from either the PSB table or from ACPI objects. The ACPI support -is only available if the kernel config sets CONFIG_ACPI_PROCESSOR. -The powernow-k8 driver will attempt to use ACPI if so configured, -and fall back to PST if that fails. -The powernow-k7 driver will try to use the PSB support first, and -fall back to ACPI if the PSB support fails. A module parameter, -acpi_force, is provided to force ACPI support to be used instead -of PSB support. diff --git a/Documentation/cpu-freq/core.txt b/Documentation/cpu-freq/core.rst index ed577d9c154b..33cb90bd1d8f 100644 --- a/Documentation/cpu-freq/core.txt +++ b/Documentation/cpu-freq/core.rst @@ -1,31 +1,23 @@ - CPU frequency and voltage scaling code in the Linux(TM) kernel +.. 
SPDX-License-Identifier: GPL-2.0 +============================================================= +General description of the CPUFreq core and CPUFreq notifiers +============================================================= - L i n u x C P U F r e q +Authors: + - Dominik Brodowski <linux@brodo.de> + - David Kimdon <dwhedon@debian.org> + - Rafael J. Wysocki <rafael.j.wysocki@intel.com> + - Viresh Kumar <viresh.kumar@linaro.org> - C P U F r e q C o r e +.. Contents: - - Dominik Brodowski <linux@brodo.de> - David Kimdon <dwhedon@debian.org> - Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Viresh Kumar <viresh.kumar@linaro.org> - - - - Clock scaling allows you to change the clock speed of the CPUs on the - fly. This is a nice method to save battery power, because the lower - the clock speed, the less power the CPU consumes. - - -Contents: ---------- -1. CPUFreq core and interfaces -2. CPUFreq notifiers -3. CPUFreq Table Generation with Operating Performance Point (OPP) + 1. CPUFreq core and interfaces + 2. CPUFreq notifiers + 3. CPUFreq Table Generation with Operating Performance Point (OPP) 1. General Information -======================= +====================== The CPUFreq core code is located in drivers/cpufreq/cpufreq.c. This cpufreq code offers a standardized interface for the CPUFreq @@ -63,7 +55,7 @@ The phase is specified in the second argument to the notifier. The phase is CPUFREQ_CREATE_POLICY when the policy is first created and it is CPUFREQ_REMOVE_POLICY when the policy is removed. -The third argument, a void *pointer, points to a struct cpufreq_policy +The third argument, a ``void *pointer``, points to a struct cpufreq_policy consisting of several values, including min, max (the lower and upper frequencies (in kHz) of the new policy). @@ -80,10 +72,13 @@ CPUFREQ_POSTCHANGE. 
The third argument is a struct cpufreq_freqs with the following values: -cpu - number of the affected CPU -old - old frequency -new - new frequency -flags - flags of the cpufreq driver + +===== =========================== +cpu number of the affected CPU +old old frequency +new new frequency +flags flags of the cpufreq driver +===== =========================== 3. CPUFreq Table Generation with Operating Performance Point (OPP) ================================================================== @@ -94,9 +89,12 @@ dev_pm_opp_init_cpufreq_table - the OPP layer's internal information about the available frequencies into a format readily providable to cpufreq. - WARNING: Do not use this function in interrupt context. + .. Warning:: + + Do not use this function in interrupt context. + + Example:: - Example: soc_pm_init() { /* Do things */ @@ -106,7 +104,10 @@ dev_pm_opp_init_cpufreq_table - /* Do other things */ } - NOTE: This function is available only if CONFIG_CPU_FREQ is enabled in - addition to CONFIG_PM_OPP. + .. note:: + + This function is available only if CONFIG_CPU_FREQ is enabled in + addition to CONFIG_PM_OPP. -dev_pm_opp_free_cpufreq_table - Free up the table allocated by dev_pm_opp_init_cpufreq_table +dev_pm_opp_free_cpufreq_table + Free up the table allocated by dev_pm_opp_init_cpufreq_table diff --git a/Documentation/cpu-freq/cpu-drivers.txt b/Documentation/cpu-freq/cpu-drivers.rst index 6e353d00cdc6..a697278ce190 100644 --- a/Documentation/cpu-freq/cpu-drivers.txt +++ b/Documentation/cpu-freq/cpu-drivers.rst @@ -1,35 +1,27 @@ - CPU frequency and voltage scaling code in the Linux(TM) kernel +.. SPDX-License-Identifier: GPL-2.0 +=============================================== +How to Implement a new CPUFreq Processor Driver +=============================================== - L i n u x C P U F r e q +Authors: - C P U D r i v e r s - - information for developers - + - Dominik Brodowski <linux@brodo.de> + - Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> + - Viresh Kumar <viresh.kumar@linaro.org> +.. Contents - Dominik Brodowski <linux@brodo.de> - Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Viresh Kumar <viresh.kumar@linaro.org> - - - - Clock scaling allows you to change the clock speed of the CPUs on the - fly. This is a nice method to save battery power, because the lower - the clock speed, the less power the CPU consumes. - - -Contents: ---------- -1. What To Do? -1.1 Initialization -1.2 Per-CPU Initialization -1.3 verify -1.4 target/target_index or setpolicy? -1.5 target/target_index -1.6 setpolicy -1.7 get_intermediate and target_intermediate -2. Frequency Table Helpers + 1. What To Do? + 1.1 Initialization + 1.2 Per-CPU Initialization + 1.3 verify + 1.4 target/target_index or setpolicy? + 1.5 target/target_index + 1.6 setpolicy + 1.7 get_intermediate and target_intermediate + 2. Frequency Table Helpers @@ -49,7 +41,7 @@ function check whether this kernel runs on the right CPU and the right chipset. If so, register a struct cpufreq_driver with the CPUfreq core using cpufreq_register_driver() -What shall this struct cpufreq_driver contain? +What shall this struct cpufreq_driver contain? .name - The name of this driver. @@ -108,37 +100,42 @@ Whenever a new CPU is registered with the device model, or after the cpufreq driver registers itself, the per-policy initialization function cpufreq_driver.init is called if no cpufreq policy existed for the CPU. Note that the .init() and .exit() routines are called only once for the -policy and not for each CPU managed by the policy. It takes a struct -cpufreq_policy *policy as argument. What to do now? +policy and not for each CPU managed by the policy. It takes a ``struct +cpufreq_policy *policy`` as argument. What to do now? If necessary, activate the CPUfreq support on your CPU. 
Then, the driver must fill in the following values: -policy->cpuinfo.min_freq _and_ -policy->cpuinfo.max_freq - the minimum and maximum frequency - (in kHz) which is supported by - this CPU -policy->cpuinfo.transition_latency the time it takes on this CPU to - switch between two frequencies in - nanoseconds (if appropriate, else - specify CPUFREQ_ETERNAL) - -policy->cur The current operating frequency of - this CPU (if appropriate) -policy->min, -policy->max, -policy->policy and, if necessary, -policy->governor must contain the "default policy" for - this CPU. A few moments later, - cpufreq_driver.verify and either - cpufreq_driver.setpolicy or - cpufreq_driver.target/target_index is called - with these values. -policy->cpus Update this with the masks of the - (online + offline) CPUs that do DVFS - along with this CPU (i.e. that share - clock/voltage rails with it). ++-----------------------------------+--------------------------------------+ +|policy->cpuinfo.min_freq _and_ | | +|policy->cpuinfo.max_freq | the minimum and maximum frequency | +| | (in kHz) which is supported by | +| | this CPU | ++-----------------------------------+--------------------------------------+ +|policy->cpuinfo.transition_latency | the time it takes on this CPU to | +| | switch between two frequencies in | +| | nanoseconds (if appropriate, else | +| | specify CPUFREQ_ETERNAL) | ++-----------------------------------+--------------------------------------+ +|policy->cur | The current operating frequency of | +| | this CPU (if appropriate) | ++-----------------------------------+--------------------------------------+ +|policy->min, | | +|policy->max, | | +|policy->policy and, if necessary, | | +|policy->governor | must contain the "default policy" for| +| | this CPU. A few moments later, | +| | cpufreq_driver.verify and either | +| | cpufreq_driver.setpolicy or | +| | cpufreq_driver.target/target_index is| +| | called with these values. 
| ++-----------------------------------+--------------------------------------+ +|policy->cpus | Update this with the masks of the | +| | (online + offline) CPUs that do DVFS | +| | along with this CPU (i.e. that share| +| | clock/voltage rails with it). | ++-----------------------------------+--------------------------------------+ For setting some of these values (cpuinfo.min[max]_freq, policy->min[max]), the frequency table helpers might be helpful. See the section 2 for more information @@ -151,8 +148,8 @@ on them. When the user decides a new policy (consisting of "policy,governor,min,max") shall be set, this policy must be validated so that incompatible values can be corrected. For verifying these -values cpufreq_verify_within_limits(struct cpufreq_policy *policy, -unsigned int min_freq, unsigned int max_freq) function might be helpful. +values cpufreq_verify_within_limits(``struct cpufreq_policy *policy``, +``unsigned int min_freq``, ``unsigned int max_freq``) function might be helpful. See section 2 for details on frequency table helpers. You need to make sure that at least one valid frequency (or operating @@ -163,7 +160,7 @@ policy->max first, and only if this is no solution, decrease policy->min. 1.4 target or target_index or setpolicy or fast_switch? ------------------------------------------------------- -Most cpufreq drivers or even most cpu frequency scaling algorithms +Most cpufreq drivers or even most cpu frequency scaling algorithms only allow the CPU frequency to be set to predefined fixed values. For these, you use the ->target(), ->target_index() or ->fast_switch() callbacks. @@ -175,8 +172,8 @@ limits on their own. These shall use the ->setpolicy() callback. 1.5. target/target_index ------------------------ -The target_index call has two arguments: struct cpufreq_policy *policy, -and unsigned int index (into the exposed frequency table). 
+The target_index call has two arguments: ``struct cpufreq_policy *policy``, +and ``unsigned int`` index (into the exposed frequency table). The CPUfreq driver must set the new frequency when called here. The actual frequency must be determined by freq_table[index].frequency. @@ -184,9 +181,9 @@ actual frequency must be determined by freq_table[index].frequency. It should always restore to earlier frequency (i.e. policy->restore_freq) in case of errors, even if we switched to intermediate frequency earlier. -Deprecated: +Deprecated ---------- -The target call has three arguments: struct cpufreq_policy *policy, +The target call has three arguments: ``struct cpufreq_policy *policy``, unsigned int target_frequency, unsigned int relation. The CPUfreq driver must set the new frequency when called here. The @@ -210,14 +207,14 @@ Not all drivers are expected to implement it, as sleeping from within this callback isn't allowed. This callback must be highly optimized to do switching as fast as possible. -This function has two arguments: struct cpufreq_policy *policy and -unsigned int target_frequency. +This function has two arguments: ``struct cpufreq_policy *policy`` and +``unsigned int target_frequency``. 1.7 setpolicy ------------- -The setpolicy call only takes a struct cpufreq_policy *policy as +The setpolicy call only takes a ``struct cpufreq_policy *policy`` as argument. You need to set the lower limit of the in-processor or in-chipset dynamic frequency switching to policy->min, the upper limit to policy->max, and -if supported- select a performance-oriented @@ -278,10 +275,10 @@ table. cpufreq_for_each_valid_entry(pos, table) - iterates over all entries, excluding CPUFREQ_ENTRY_INVALID frequencies. -Use arguments "pos" - a cpufreq_frequency_table * as a loop cursor and -"table" - the cpufreq_frequency_table * you want to iterate over. 
+Use arguments "pos" - a ``cpufreq_frequency_table *`` as a loop cursor and +"table" - the ``cpufreq_frequency_table *`` you want to iterate over. -For example: +For example:: struct cpufreq_frequency_table *pos, *driver_freq_table; diff --git a/Documentation/cpu-freq/cpufreq-nforce2.txt b/Documentation/cpu-freq/cpufreq-nforce2.txt deleted file mode 100644 index babce1315026..000000000000 --- a/Documentation/cpu-freq/cpufreq-nforce2.txt +++ /dev/null @@ -1,19 +0,0 @@ - -The cpufreq-nforce2 driver changes the FSB on nVidia nForce2 platforms. - -This works better than on other platforms, because the FSB of the CPU -can be controlled independently from the PCI/AGP clock. - -The module has two options: - - fid: multiplier * 10 (for example 8.5 = 85) - min_fsb: minimum FSB - -If not set, fid is calculated from the current CPU speed and the FSB. -min_fsb defaults to FSB at boot time - 50 MHz. - -IMPORTANT: The available range is limited downwards! - Also the minimum available FSB can differ, for systems - booting with 200 MHz, 150 should always work. - - diff --git a/Documentation/cpu-freq/cpufreq-stats.txt b/Documentation/cpu-freq/cpufreq-stats.rst index 14378cecb172..9ad695b1c7db 100644 --- a/Documentation/cpu-freq/cpufreq-stats.txt +++ b/Documentation/cpu-freq/cpufreq-stats.rst @@ -1,21 +1,23 @@ +.. SPDX-License-Identifier: GPL-2.0 - CPU frequency and voltage scaling statistics in the Linux(TM) kernel +========================================== +General Description of sysfs CPUFreq Stats +========================================== +information for users - L i n u x c p u f r e q - s t a t s d r i v e r - - information for users - +Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> +.. Contents - Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> - -Contents -1. Introduction -2. Statistics Provided (with example) -3. Configuring cpufreq-stats + 1. Introduction + 2. Statistics Provided (with example) + 3. Configuring cpufreq-stats 1. 
Introduction +=============== cpufreq-stats is a driver that provides CPU frequency statistics for each CPU. These statistics are provided in /sysfs as a bunch of read_only interfaces. This @@ -28,8 +30,10 @@ that may be running on your CPU. So, it will work with any cpufreq_driver. 2. Statistics Provided (with example) +===================================== cpufreq stats provides following statistics (explained in detail below). + - time_in_state - total_trans - trans_table @@ -39,53 +43,57 @@ All the statistics will be from the time the stats driver has been inserted statistic is done. Obviously, stats driver will not have any information about the frequency transitions before the stats driver insertion. --------------------------------------------------------------------------------- -<mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # ls -l -total 0 -drwxr-xr-x 2 root root 0 May 14 16:06 . -drwxr-xr-x 3 root root 0 May 14 15:58 .. ---w------- 1 root root 4096 May 14 16:06 reset --r--r--r-- 1 root root 4096 May 14 16:06 time_in_state --r--r--r-- 1 root root 4096 May 14 16:06 total_trans --r--r--r-- 1 root root 4096 May 14 16:06 trans_table --------------------------------------------------------------------------------- - -- reset +:: + + <mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # ls -l + total 0 + drwxr-xr-x 2 root root 0 May 14 16:06 . + drwxr-xr-x 3 root root 0 May 14 15:58 .. + --w------- 1 root root 4096 May 14 16:06 reset + -r--r--r-- 1 root root 4096 May 14 16:06 time_in_state + -r--r--r-- 1 root root 4096 May 14 16:06 total_trans + -r--r--r-- 1 root root 4096 May 14 16:06 trans_table + +- **reset** + Write-only attribute that can be used to reset the stat counters. This can be useful for evaluating system behaviour under different governors without the need for a reboot. -- time_in_state +- **time_in_state** + This gives the amount of time spent in each of the frequencies supported by this CPU. 
The cat output will have "<frequency> <time>" pair in each line, which will mean this CPU spent <time> usertime units of time at <frequency>. Output -will have one line for each of the supported frequencies. usertime units here +will have one line for each of the supported frequencies. usertime units here is 10mS (similar to other time exported in /proc). --------------------------------------------------------------------------------- -<mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat time_in_state -3600000 2089 -3400000 136 -3200000 34 -3000000 67 -2800000 172488 --------------------------------------------------------------------------------- +:: + <mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat time_in_state + 3600000 2089 + 3400000 136 + 3200000 34 + 3000000 67 + 2800000 172488 -- total_trans -This gives the total number of frequency transitions on this CPU. The cat + +- **total_trans** + +This gives the total number of frequency transitions on this CPU. The cat output will have a single count which is the total number of frequency transitions. --------------------------------------------------------------------------------- -<mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat total_trans -20 --------------------------------------------------------------------------------- +:: + + <mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat total_trans + 20 + +- **trans_table** -- trans_table This will give a fine grained information about all the CPU frequency transitions. The cat output here is a two dimensional matrix, where an entry -<i,j> (row i, column j) represents the count of number of transitions from +<i,j> (row i, column j) represents the count of number of transitions from Freq_i to Freq_j. Freq_i rows and Freq_j columns follow the sorting order in which the driver has provided the frequency table initially to the cpufreq core and so can be sorted (ascending or descending) or unsorted. 
The output here @@ -95,26 +103,27 @@ readability. If the transition table is bigger than PAGE_SIZE, reading this will return an -EFBIG error. --------------------------------------------------------------------------------- -<mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat trans_table - From : To - : 3600000 3400000 3200000 3000000 2800000 - 3600000: 0 5 0 0 0 - 3400000: 4 0 2 0 0 - 3200000: 0 1 0 2 0 - 3000000: 0 0 1 0 3 - 2800000: 0 0 0 2 0 --------------------------------------------------------------------------------- +:: + <mysystem>:/sys/devices/system/cpu/cpu0/cpufreq/stats # cat trans_table + From : To + : 3600000 3400000 3200000 3000000 2800000 + 3600000: 0 5 0 0 0 + 3400000: 4 0 2 0 0 + 3200000: 0 1 0 2 0 + 3000000: 0 0 1 0 3 + 2800000: 0 0 0 2 0 3. Configuring cpufreq-stats +============================ + +To configure cpufreq-stats in your kernel:: -To configure cpufreq-stats in your kernel -Config Main Menu - Power management options (ACPI, APM) ---> - CPU Frequency scaling ---> - [*] CPU Frequency scaling - [*] CPU frequency translation statistics + Config Main Menu + Power management options (ACPI, APM) ---> + CPU Frequency scaling ---> + [*] CPU Frequency scaling + [*] CPU frequency translation statistics "CPU Frequency scaling" (CONFIG_CPU_FREQ) should be enabled to configure diff --git a/Documentation/cpu-freq/index.rst b/Documentation/cpu-freq/index.rst new file mode 100644 index 000000000000..aba7831ab1cb --- /dev/null +++ b/Documentation/cpu-freq/index.rst @@ -0,0 +1,39 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================================== +Linux CPUFreq - CPU frequency and voltage scaling code in the Linux(TM) kernel +============================================================================== + +Author: Dominik Brodowski <linux@brodo.de> + + Clock scaling allows you to change the clock speed of the CPUs on the + fly. 
This is a nice method to save battery power, because the lower + the clock speed, the less power the CPU consumes. + + +.. toctree:: + :maxdepth: 1 + + core + cpu-drivers + cpufreq-stats + +Mailing List +------------ +There is a CPU frequency changing CVS commit and general list where +you can report bugs, problems or submit patches. To post a message, +send an email to linux-pm@vger.kernel.org. + +Links +----- +the FTP archives: +* ftp://ftp.linux.org.uk/pub/linux/cpufreq/ + +how to access the CVS repository: +* http://cvs.arm.linux.org.uk/ + +the CPUFreq Mailing list: +* http://vger.kernel.org/vger-lists.html#linux-pm + +Clock and voltage scaling for the SA-1100: +* http://www.lartmaker.nl/projects/scaling diff --git a/Documentation/cpu-freq/index.txt b/Documentation/cpu-freq/index.txt deleted file mode 100644 index c15e75386a05..000000000000 --- a/Documentation/cpu-freq/index.txt +++ /dev/null @@ -1,56 +0,0 @@ - CPU frequency and voltage scaling code in the Linux(TM) kernel - - - L i n u x C P U F r e q - - - - - Dominik Brodowski <linux@brodo.de> - - - - Clock scaling allows you to change the clock speed of the CPUs on the - fly. This is a nice method to save battery power, because the lower - the clock speed, the less power the CPU consumes. - - - -Documents in this directory: ----------------------------- - -amd-powernow.txt - AMD powernow driver specific file. - -core.txt - General description of the CPUFreq core and - of CPUFreq notifiers. - -cpu-drivers.txt - How to implement a new cpufreq processor driver. - -cpufreq-nforce2.txt - nVidia nForce2 platform specific file. - -cpufreq-stats.txt - General description of sysfs cpufreq stats. - -index.txt - File index, Mailing list and Links (this document) - -pcc-cpufreq.txt - PCC cpufreq driver specific file. - - -Mailing List ------------- -There is a CPU frequency changing CVS commit and general list where -you can report bugs, problems or submit patches. 
To post a message, -send an email to linux-pm@vger.kernel.org. - -Links ------ -the FTP archives: -* ftp://ftp.linux.org.uk/pub/linux/cpufreq/ - -how to access the CVS repository: -* http://cvs.arm.linux.org.uk/ - -the CPUFreq Mailing list: -* http://vger.kernel.org/vger-lists.html#linux-pm - -Clock and voltage scaling for the SA-1100: -* http://www.lartmaker.nl/projects/scaling diff --git a/Documentation/cpu-freq/pcc-cpufreq.txt b/Documentation/cpu-freq/pcc-cpufreq.txt deleted file mode 100644 index 9e3c3b33514c..000000000000 --- a/Documentation/cpu-freq/pcc-cpufreq.txt +++ /dev/null @@ -1,207 +0,0 @@ -/* - * pcc-cpufreq.txt - PCC interface documentation - * - * Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com> - * Copyright (C) 2009 Hewlett-Packard Development Company, L.P. - * Nagananda Chumbalkar <nagananda.chumbalkar@hp.com> - * - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; version 2 of the License. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or NON - * INFRINGEMENT. See the GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License along - * with this program; if not, write to the Free Software Foundation, Inc., - * 675 Mass Ave, Cambridge, MA 02139, USA. - * - * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - */ - - - Processor Clocking Control Driver - --------------------------------- - -Contents: ---------- -1. Introduction -1.1 PCC interface -1.1.1 Get Average Frequency -1.1.2 Set Desired Frequency -1.2 Platforms affected -2. 
Driver and /sys details -2.1 scaling_available_frequencies -2.2 cpuinfo_transition_latency -2.3 cpuinfo_cur_freq -2.4 related_cpus -3. Caveats - -1. Introduction: ----------------- -Processor Clocking Control (PCC) is an interface between the platform -firmware and OSPM. It is a mechanism for coordinating processor -performance (ie: frequency) between the platform firmware and the OS. - -The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC -interface. - -OS utilizes the PCC interface to inform platform firmware what frequency the -OS wants for a logical processor. The platform firmware attempts to achieve -the requested frequency. If the request for the target frequency could not be -satisfied by platform firmware, then it usually means that power budget -conditions are in place, and "power capping" is taking place. - -1.1 PCC interface: ------------------- -The complete PCC specification is available here: -http://www.acpica.org/download/Processor-Clocking-Control-v1p0.pdf - -PCC relies on a shared memory region that provides a channel for communication -between the OS and platform firmware. PCC also implements a "doorbell" that -is used by the OS to inform the platform firmware that a command has been -sent. - -The ACPI PCCH() method is used to discover the location of the PCC shared -memory region. The shared memory region header contains the "command" and -"status" interface. PCCH() also contains details on how to access the platform -doorbell. - -The following commands are supported by the PCC interface: -* Get Average Frequency -* Set Desired Frequency - -The ACPI PCCP() method is implemented for each logical processor and is -used to discover the offsets for the input and output buffers in the shared -memory region. - -When PCC mode is enabled, the platform will not expose processor performance -or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. 
Therefore, -the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for -AMD) will not load. - -However, OSPM remains in control of policy. The governor (eg: "ondemand") -computes the required performance for each processor based on server workload. -The PCC driver fills in the command interface, and the input buffer and -communicates the request to the platform firmware. The platform firmware is -responsible for delivering the requested performance. - -Each PCC command is "global" in scope and can affect all the logical CPUs in -the system. Therefore, PCC is capable of performing "group" updates. With PCC -the OS is capable of getting/setting the frequency of all the logical CPUs in -the system with a single call to the BIOS. - -1.1.1 Get Average Frequency: ----------------------------- -This command is used by the OSPM to query the running frequency of the -processor since the last time this command was completed. The output buffer -indicates the average unhalted frequency of the logical processor expressed as -a percentage of the nominal (ie: maximum) CPU frequency. The output buffer -also signifies if the CPU frequency is limited by a power budget condition. - -1.1.2 Set Desired Frequency: ----------------------------- -This command is used by the OSPM to communicate to the platform firmware the -desired frequency for a logical processor. The output buffer is currently -ignored by OSPM. The next invocation of "Get Average Frequency" will inform -OSPM if the desired frequency was achieved or not. - -1.2 Platforms affected: ------------------------ -The PCC driver will load on any system where the platform firmware: -* supports the PCC interface, and the associated PCCH() and PCCP() methods -* assumes responsibility for managing the hardware clocking controls in order -to deliver the requested processor performance - -Currently, certain HP ProLiant platforms implement the PCC interface. On those -platforms PCC is the "default" choice. 
- -However, it is possible to disable this interface via a BIOS setting. In -such an instance, as is also the case on platforms where the PCC interface -is not implemented, the PCC driver will fail to load silently. - -2. Driver and /sys details: ---------------------------- -When the driver loads, it merely prints the lowest and the highest CPU -frequencies supported by the platform firmware. - -The PCC driver loads with a message such as: -pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933 -MHz - -This means that the OPSM can request the CPU to run at any frequency in -between the limits (1600 MHz, and 2933 MHz) specified in the message. - -Internally, there is no need for the driver to convert the "target" frequency -to a corresponding P-state. - -The VERSION number for the driver will be of the format v.xy.ab. -eg: 1.00.02 - ----- -- - | | - | -- this will increase with bug fixes/enhancements to the driver - |-- this is the version of the PCC specification the driver adheres to - - -The following is a brief discussion on some of the fields exported via the -/sys filesystem and how their values are affected by the PCC driver: - -2.1 scaling_available_frequencies: ----------------------------------- -scaling_available_frequencies is not created in /sys. No intermediate -frequencies need to be listed because the BIOS will try to achieve any -frequency, within limits, requested by the governor. A frequency does not have -to be strictly associated with a P-state. - -2.2 cpuinfo_transition_latency: -------------------------------- -The cpuinfo_transition_latency field is 0. The PCC specification does -not include a field to expose this value currently. - -2.3 cpuinfo_cur_freq: ---------------------- -A) Often cpuinfo_cur_freq will show a value different than what is declared -in the scaling_available_frequencies or scaling_cur_freq, or scaling_max_freq. -This is due to "turbo boost" available on recent Intel processors. 
If certain -conditions are met the BIOS can achieve a slightly higher speed than requested -by OSPM. An example: - -scaling_cur_freq : 2933000 -cpuinfo_cur_freq : 3196000 - -B) There is a round-off error associated with the cpuinfo_cur_freq value. -Since the driver obtains the current frequency as a "percentage" (%) of the -nominal frequency from the BIOS, sometimes, the values displayed by -scaling_cur_freq and cpuinfo_cur_freq may not match. An example: - -scaling_cur_freq : 1600000 -cpuinfo_cur_freq : 1583000 - -In this example, the nominal frequency is 2933 MHz. The driver obtains the -current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency: - - 54% of 2933 MHz = 1583 MHz - -Nominal frequency is the maximum frequency of the processor, and it usually -corresponds to the frequency of the P0 P-state. - -2.4 related_cpus: ------------------ -The related_cpus field is identical to affected_cpus. - -affected_cpus : 4 -related_cpus : 4 - -Currently, the PCC driver does not evaluate _PSD. The platforms that support -PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination -to ensure that the same frequency is requested of all dependent CPUs. - -3. Caveats: ------------ -The "cpufreq_stats" module in its present form cannot be loaded and -expected to work with the PCC driver. Since the "cpufreq_stats" module -provides information wrt each P-state, it is not applicable to the PCC driver. diff --git a/Documentation/debugging-modules.txt b/Documentation/debugging-modules.txt deleted file mode 100644 index 172ad4aec493..000000000000 --- a/Documentation/debugging-modules.txt +++ /dev/null @@ -1,22 +0,0 @@ -Debugging Modules after 2.6.3 ------------------------------ - -In almost all distributions, the kernel asks for modules which don't -exist, such as "net-pf-10" or whatever. 
Changing "modprobe -q" to
-"succeed" in this case is hacky and breaks some setups, and also we
-want to know if it failed for the fallback code for old aliases in
-fs/char_dev.c, for example.
-
-In the past a debugging message which would fill people's logs was
-emitted. This debugging message has been removed. The correct way
-of debugging module problems is something like this:
-
-echo '#! /bin/sh' > /tmp/modprobe
-echo 'echo "$@" >> /tmp/modprobe.log' >> /tmp/modprobe
-echo 'exec /sbin/modprobe "$@"' >> /tmp/modprobe
-chmod a+x /tmp/modprobe
-echo /tmp/modprobe > /proc/sys/kernel/modprobe
-
-Note that the above applies only when the *kernel* is requesting
-that the module be loaded -- it won't have any effect if that module
-is being loaded explicitly using "modprobe" from userspace.
diff --git a/Documentation/dev-tools/gcov.rst b/Documentation/dev-tools/gcov.rst
index 46aae52a41d0..7bd013596217 100644
--- a/Documentation/dev-tools/gcov.rst
+++ b/Documentation/dev-tools/gcov.rst
@@ -203,7 +203,7 @@ Cause
     may not correctly copy files from sysfs.

 Solution
-    Use ``cat``' to read ``.gcda`` files and ``cp -d`` to copy links.
+    Use ``cat`` to read ``.gcda`` files and ``cp -d`` to copy links.
     Alternatively use the mechanism shown in Appendix B.


diff --git a/Documentation/dev-tools/kmemleak.rst b/Documentation/dev-tools/kmemleak.rst
index 3a289e8a1d12..fce262883984 100644
--- a/Documentation/dev-tools/kmemleak.rst
+++ b/Documentation/dev-tools/kmemleak.rst
@@ -8,7 +8,8 @@ with the difference that the orphan objects are not freed but only
 reported via /sys/kernel/debug/kmemleak. A similar method is used by the
 Valgrind tool (``memcheck --leak-check``) to detect the memory leaks in
 user-space applications.
-Kmemleak is supported on x86, arm, powerpc, sparc, sh, microblaze, ppc, mips, s390 and tile.
+Kmemleak is supported on x86, arm, arm64, powerpc, sparc, sh, microblaze, mips,
+s390, nds32, arc and xtensa.

 Usage
 -----
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 7cd56a1993b1..607758a66a99 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -551,6 +551,7 @@ options to your ``.config``:
 Once the kernel is built and installed, a simple

 .. code-block:: bash
+
 	modprobe example-test

 ...will run the tests.
diff --git a/Documentation/devicetree/bindings/arm/arm,scmi.txt b/Documentation/devicetree/bindings/arm/arm,scmi.txt
index f493d69e6194..dc102c4e4a78 100644
--- a/Documentation/devicetree/bindings/arm/arm,scmi.txt
+++ b/Documentation/devicetree/bindings/arm/arm,scmi.txt
@@ -102,7 +102,7 @@ Required sub-node properties:
 [1] Documentation/devicetree/bindings/clock/clock-bindings.txt
 [2] Documentation/devicetree/bindings/power/power-domain.yaml
 [3] Documentation/devicetree/bindings/thermal/thermal.txt
-[4] Documentation/devicetree/bindings/sram/sram.txt
+[4] Documentation/devicetree/bindings/sram/sram.yaml
 [5] Documentation/devicetree/bindings/reset/reset.txt

 Example:
diff --git a/Documentation/devicetree/bindings/arm/arm,scpi.txt b/Documentation/devicetree/bindings/arm/arm,scpi.txt
index 7b83ef43b418..dd04d9d9a1b8 100644
--- a/Documentation/devicetree/bindings/arm/arm,scpi.txt
+++ b/Documentation/devicetree/bindings/arm/arm,scpi.txt
@@ -109,7 +109,7 @@ Required properties:
 [0] http://infocenter.arm.com/help/topic/com.arm.doc.dui0922b/index.html
 [1] Documentation/devicetree/bindings/clock/clock-bindings.txt
 [2] Documentation/devicetree/bindings/thermal/thermal.txt
-[3] Documentation/devicetree/bindings/sram/sram.txt
+[3] Documentation/devicetree/bindings/sram/sram.yaml
 [4] Documentation/devicetree/bindings/power/power-domain.yaml

 Example:
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
index b82b6a0ae6f7..8c7a4908a849 100644
---
a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
@@ -62,7 +62,7 @@ Timer node:

 Syscon reboot node:

-See Documentation/devicetree/bindings/power/reset/syscon-reboot.txt for the
+See Documentation/devicetree/bindings/power/reset/syscon-reboot.yaml for the
 detailed list of properties, the two values defined below are specific to the
 BCM6328-style timer:

diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 7a9c3ce2dbef..0d5b61056b10 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -216,7 +216,7 @@ properties:
     $ref: '/schemas/types.yaml#/definitions/phandle-array'
     description: |
       List of phandles to idle state nodes supported
-      by this cpu (see ./idle-states.txt).
+      by this cpu (see ./idle-states.yaml).

   capacity-dmips-mhz:
     $ref: '/schemas/types.yaml#/definitions/uint32'
diff --git a/Documentation/devicetree/bindings/arm/fsl.yaml b/Documentation/devicetree/bindings/arm/fsl.yaml
index a8e0b4a813ed..0e17e1f6fb80 100644
--- a/Documentation/devicetree/bindings/arm/fsl.yaml
+++ b/Documentation/devicetree/bindings/arm/fsl.yaml
@@ -160,7 +160,7 @@ properties:
         items:
           - enum:
               - armadeus,imx6dl-apf6 # APF6 (Solo) SoM
-              - armadeus,imx6dl-apf6dldev # APF6 (Solo) SoM on APF6Dev board
+              - armadeus,imx6dl-apf6dev # APF6 (Solo) SoM on APF6Dev board
               - eckelmann,imx6dl-ci4x10
               - emtrion,emcon-mx6 # emCON-MX6S or emCON-MX6DL SoM
               - emtrion,emcon-mx6-avari # emCON-MX6S or emCON-MX6DL SoM on Avari Base
diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hi3519-sysctrl.txt b/Documentation/devicetree/bindings/arm/hisilicon/hi3519-sysctrl.txt
index 115c5be0bd0b..8defacc44dd5 100644
--- a/Documentation/devicetree/bindings/arm/hisilicon/hi3519-sysctrl.txt
+++ b/Documentation/devicetree/bindings/arm/hisilicon/hi3519-sysctrl.txt
@@ -1,7 +1,7 @@
 * Hisilicon Hi3519 System Controller
Block This bindings use the following binding: -Documentation/devicetree/bindings/mfd/syscon.txt +Documentation/devicetree/bindings/mfd/syscon.yaml Required properties: - compatible: "hisilicon,hi3519-sysctrl". diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt index 06df04cc827a..6ce0b212ec6d 100644 --- a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt +++ b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt @@ -81,4 +81,4 @@ Example: }; }; -[1]. Documentation/devicetree/bindings/arm/idle-states.txt +[1]. Documentation/devicetree/bindings/arm/idle-states.yaml diff --git a/Documentation/devicetree/bindings/arm/omap/mpu.txt b/Documentation/devicetree/bindings/arm/omap/mpu.txt index f301e636fd52..e41490e6979c 100644 --- a/Documentation/devicetree/bindings/arm/omap/mpu.txt +++ b/Documentation/devicetree/bindings/arm/omap/mpu.txt @@ -17,7 +17,7 @@ am335x and am437x only: - pm-sram: Phandles to ocmcram nodes to be used for power management. First should be type 'protect-exec' for the driver to use to copy and run PM functions, second should be regular pool to be used for - data region for code. See Documentation/devicetree/bindings/sram/sram.txt + data region for code. See Documentation/devicetree/bindings/sram/sram.yaml for more details. Examples: diff --git a/Documentation/devicetree/bindings/arm/psci.yaml b/Documentation/devicetree/bindings/arm/psci.yaml index 8ef85420b2ab..5e66934455bb 100644 --- a/Documentation/devicetree/bindings/arm/psci.yaml +++ b/Documentation/devicetree/bindings/arm/psci.yaml @@ -100,13 +100,14 @@ properties: bindings in [1]) must specify this property. [1] Kernel documentation - ARM idle states bindings - Documentation/devicetree/bindings/arm/idle-states.txt - - "#power-domain-cells": - description: - The number of cells in a PM domain specifier as per binding in [3]. - Must be 0 as to represent a single PM domain. 
+ Documentation/devicetree/bindings/arm/idle-states.yaml +patternProperties: + "^power-domain-": + allOf: + - $ref: "../power/power-domain.yaml#" + type: object + description: | ARM systems can have multiple cores, sometimes in an hierarchical arrangement. This often, but not always, maps directly to the processor power topology of the system. Individual nodes in a topology have their @@ -122,14 +123,8 @@ properties: helps to implement support for OSI mode and OS implementations may choose to mandate it. - [3] Documentation/devicetree/bindings/power/power_domain.txt - [4] Documentation/devicetree/bindings/power/domain-idle-state.txt - - power-domains: - $ref: '/schemas/types.yaml#/definitions/phandle-array' - description: - List of phandles and PM domain specifiers, as defined by bindings of the - PM domain provider. + [3] Documentation/devicetree/bindings/power/power-domain.yaml + [4] Documentation/devicetree/bindings/power/domain-idle-state.yaml required: - compatible @@ -199,7 +194,7 @@ examples: CPU0: cpu@0 { device_type = "cpu"; - compatible = "arm,cortex-a53", "arm,armv8"; + compatible = "arm,cortex-a53"; reg = <0x0>; enable-method = "psci"; power-domains = <&CPU_PD0>; @@ -208,7 +203,7 @@ examples: CPU1: cpu@1 { device_type = "cpu"; - compatible = "arm,cortex-a57", "arm,armv8"; + compatible = "arm,cortex-a53"; reg = <0x100>; enable-method = "psci"; power-domains = <&CPU_PD1>; @@ -224,6 +219,9 @@ examples: exit-latency-us = <10>; min-residency-us = <100>; }; + }; + + domain-idle-states { CLUSTER_RET: cluster-retention { compatible = "domain-idle-state"; @@ -247,19 +245,19 @@ examples: compatible = "arm,psci-1.0"; method = "smc"; - CPU_PD0: cpu-pd0 { + CPU_PD0: power-domain-cpu0 { #power-domain-cells = <0>; domain-idle-states = <&CPU_PWRDN>; power-domains = <&CLUSTER_PD>; }; - CPU_PD1: cpu-pd1 { + CPU_PD1: power-domain-cpu1 { #power-domain-cells = <0>; domain-idle-states = <&CPU_PWRDN>; power-domains = <&CLUSTER_PD>; }; - CLUSTER_PD: cluster-pd { + CLUSTER_PD: 
power-domain-cluster { #power-domain-cells = <0>; domain-idle-states = <&CLUSTER_RET>, <&CLUSTER_PWRDN>; }; diff --git a/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml b/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml index 68917bb7c7e8..55f7938c4826 100644 --- a/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml +++ b/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml @@ -52,7 +52,7 @@ required: examples: - | - mlahb: ahb { + mlahb: ahb@38000000 { compatible = "st,mlahb", "simple-bus"; #address-cells = <1>; #size-cells = <1>; diff --git a/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml b/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml index 9fe11ceecdba..80973619342d 100644 --- a/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml +++ b/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml @@ -70,7 +70,6 @@ examples: #size-cells = <0>; pmic@3e3 { - compatible = "..."; reg = <0x3e3>; /* ... 
*/ diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml index 69cfa4a3d562..c604822cda07 100644 --- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml +++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml @@ -40,7 +40,7 @@ additionalProperties: false examples: - | - osc24M: clk@01c20050 { + osc24M: clk@1c20050 { #clock-cells = <0>; compatible = "allwinner,sun4i-a10-osc-clk"; reg = <0x01c20050 0x4>; diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml index 07f38def7dc3..43963c3062c8 100644 --- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml +++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml @@ -41,7 +41,7 @@ additionalProperties: false examples: - | - clk@0600005c { + clk@600005c { #clock-cells = <0>; compatible = "allwinner,sun9i-a80-gt-clk"; reg = <0x0600005c 0x4>; diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml index 17f87178f6b8..3647007f82ca 100644 --- a/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml +++ b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml @@ -42,7 +42,7 @@ properties: be part of GCC and hence the TSENS properties can also be part of the GCC/clock-controller node. 
For more details on the TSENS properties please refer - Documentation/devicetree/bindings/thermal/qcom-tsens.txt + Documentation/devicetree/bindings/thermal/qcom-tsens.yaml nvmem-cell-names: minItems: 1 diff --git a/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml b/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml index 33c7842917f6..8b9a8f337f16 100644 --- a/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml +++ b/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml @@ -23,6 +23,8 @@ properties: - items: - const: allwinner,sun7i-a20-crypto - const: allwinner,sun4i-a10-crypto + - items: + - const: allwinner,sun8i-a33-crypto reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml index 86ad617d2327..5ff9cf26ca38 100644 --- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml +++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml @@ -43,9 +43,13 @@ properties: - enum: - allwinner,sun8i-h3-tcon-tv - allwinner,sun50i-a64-tcon-tv - - allwinner,sun50i-h6-tcon-tv - const: allwinner,sun8i-a83t-tcon-tv + - items: + - enum: + - allwinner,sun50i-h6-tcon-tv + - const: allwinner,sun8i-r40-tcon-tv + reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml index 5d5d39665119..6009324be967 100644 --- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml +++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml @@ -49,11 +49,7 @@ examples: resets = <&tcon_ch0_clk 0>; port { - #address-cells = <1>; - #size-cells = <0>; - - tve0_in_tcon0: endpoint@0 { - reg = <0>; + tve0_in_tcon0: endpoint { remote-endpoint = <&tcon0_out_tve0>; }; }; diff --git 
a/Documentation/devicetree/bindings/display/bridge/anx6345.yaml b/Documentation/devicetree/bindings/display/bridge/anx6345.yaml index 6d72b3d11fbc..c21103869923 100644 --- a/Documentation/devicetree/bindings/display/bridge/anx6345.yaml +++ b/Documentation/devicetree/bindings/display/bridge/anx6345.yaml @@ -79,21 +79,15 @@ examples: #size-cells = <0>; anx6345_in: port@0 { - #address-cells = <1>; - #size-cells = <0>; reg = <0>; - anx6345_in_tcon0: endpoint@0 { - reg = <0>; + anx6345_in_tcon0: endpoint { remote-endpoint = <&tcon0_out_anx6345>; }; }; anx6345_out: port@1 { - #address-cells = <1>; - #size-cells = <0>; reg = <1>; - anx6345_out_panel: endpoint@0 { - reg = <0>; + anx6345_out_panel: endpoint { remote-endpoint = <&panel_in_edp>; }; }; diff --git a/Documentation/devicetree/bindings/display/connector/analog-tv-connector.txt b/Documentation/devicetree/bindings/display/connector/analog-tv-connector.txt index 0c0970c210ab..883bcb2604c7 100644 --- a/Documentation/devicetree/bindings/display/connector/analog-tv-connector.txt +++ b/Documentation/devicetree/bindings/display/connector/analog-tv-connector.txt @@ -6,16 +6,22 @@ Required properties: Optional properties: - label: a symbolic name for the connector +- sdtv-standards: limit the supported TV standards on a connector to the given + ones. If not specified all TV standards are allowed. + Possible TV standards are defined in + include/dt-bindings/display/sdtv-standards.h. 
Required nodes: - Video port for TV input Example ------- +#include <dt-bindings/display/sdtv-standards.h> tv: connector { compatible = "composite-video-connector"; label = "tv"; + sdtv-standards = <(SDTV_STD_PAL | SDTV_STD_NTSC)>; port { tv_connector_in: endpoint { diff --git a/Documentation/devicetree/bindings/display/panel/leadtek,ltk500hd1829.yaml b/Documentation/devicetree/bindings/display/panel/leadtek,ltk500hd1829.yaml index 4ebcea7d0c63..a614644c9849 100644 --- a/Documentation/devicetree/bindings/display/panel/leadtek,ltk500hd1829.yaml +++ b/Documentation/devicetree/bindings/display/panel/leadtek,ltk500hd1829.yaml @@ -37,6 +37,8 @@ examples: dsi@ff450000 { #address-cells = <1>; #size-cells = <0>; + reg = <0xff450000 0x1000>; + panel@0 { compatible = "leadtek,ltk500hd1829"; reg = <0>; diff --git a/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml b/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml index 186e5e1c8fa3..22c91beb0541 100644 --- a/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml +++ b/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml @@ -37,6 +37,8 @@ examples: dsi@ff450000 { #address-cells = <1>; #size-cells = <0>; + reg = <0xff450000 0x1000>; + panel@0 { compatible = "xinpeng,xpp055c272"; reg = <0>; diff --git a/Documentation/devicetree/bindings/display/simple-framebuffer.yaml b/Documentation/devicetree/bindings/display/simple-framebuffer.yaml index 678776b6012a..1db608c9eef5 100644 --- a/Documentation/devicetree/bindings/display/simple-framebuffer.yaml +++ b/Documentation/devicetree/bindings/display/simple-framebuffer.yaml @@ -174,10 +174,6 @@ examples: }; }; - soc@1c00000 { - lcdc0: lcdc@1c0c000 { - compatible = "allwinner,sun4i-a10-lcdc"; - }; - }; + lcdc0: lcdc { }; ... 
diff --git a/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt b/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt index 7bf1bb444812..aac617acb64f 100644 --- a/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt +++ b/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt @@ -37,7 +37,7 @@ Optional nodes: supports a single port with a single endpoint. - See also Documentation/devicetree/bindings/display/tilcdc/panel.txt and - Documentation/devicetree/bindings/display/tilcdc/tfp410.txt for connecting + Documentation/devicetree/bindings/display/bridge/ti,tfp410.txt for connecting tfp410 DVI encoder or lcd panel to lcdc [1] There is an errata about AM335x color wiring. For 16-bit color mode diff --git a/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml b/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml index 8b5c346f23f6..34780d7535b8 100644 --- a/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml +++ b/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml @@ -143,7 +143,7 @@ examples: #size-cells = <2>; dma-coherent; dma-ranges; - ranges; + ranges = <0x0 0x30800000 0x0 0x30800000 0x0 0x05000000>; ti,sci-dev-id = <118>; @@ -169,16 +169,4 @@ examples: ti,sci-rm-range-rflow = <0x6>; /* GP RFLOW */ }; }; - - mcasp0: mcasp@02B00000 { - dmas = <&main_udmap 0xc400>, <&main_udmap 0x4400>; - dma-names = "tx", "rx"; - }; - - crypto: crypto@4E00000 { - compatible = "ti,sa2ul-crypto"; - - dmas = <&main_udmap 0xc000>, <&main_udmap 0x4000>, <&main_udmap 0x4001>; - dma-names = "tx", "rx1", "rx2"; - }; }; diff --git a/Documentation/devicetree/bindings/edac/dmc-520.yaml b/Documentation/devicetree/bindings/edac/dmc-520.yaml new file mode 100644 index 000000000000..9272d2bd8634 --- /dev/null +++ b/Documentation/devicetree/bindings/edac/dmc-520.yaml @@ -0,0 +1,59 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/edac/dmc-520.yaml# +$schema: 
http://devicetree.org/meta-schemas/core.yaml# + +title: ARM DMC-520 EDAC bindings + +maintainers: + - Lei Wang <lewan@microsoft.com> + +description: |+ + DMC-520 node is defined to describe DRAM error detection and correction. + + https://static.docs.arm.com/100000/0200/corelink_dmc520_trm_100000_0200_01_en.pdf + +properties: + compatible: + items: + - const: brcm,dmc-520 + - const: arm,dmc-520 + + reg: + maxItems: 1 + + interrupts: + minItems: 1 + maxItems: 10 + + interrupt-names: + minItems: 1 + maxItems: 10 + items: + enum: + - ram_ecc_errc + - ram_ecc_errd + - dram_ecc_errc + - dram_ecc_errd + - failed_access + - failed_prog + - link_err + - temperature_event + - arch_fsm + - phy_request + +required: + - compatible + - reg + - interrupts + - interrupt-names + +examples: + - | + dmc0: dmc@200000 { + compatible = "brcm,dmc-520", "arm,dmc-520"; + reg = <0x200000 0x80000>; + interrupts = <0x0 0x349 0x4>, <0x0 0x34B 0x4>; + interrupt-names = "dram_ecc_errc", "dram_ecc_errd"; + }; diff --git a/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml b/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml new file mode 100644 index 000000000000..893d81e54caa --- /dev/null +++ b/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml @@ -0,0 +1,36 @@ +# SPDX-License-Identifier: (GPL-2.0-or-later) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/fsi/ibm,fsi2spi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: IBM FSI-attached SPI controllers + +maintainers: + - Eddie James <eajames@linux.ibm.com> + +description: | + This binding describes an FSI CFAM engine called the FSI2SPI. Therefore this + node will always be a child of an FSI CFAM node; see fsi.txt for details on + FSI slave and CFAM nodes. This FSI2SPI engine provides access to a number of + SPI controllers. 
+ +properties: + compatible: + enum: + - ibm,fsi2spi + + reg: + items: + - description: FSI slave address + +required: + - compatible + - reg + +examples: + - | + fsi2spi@1c00 { + compatible = "ibm,fsi2spi"; + reg = <0x1c00 0x400>; + }; diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml index 4ea6a8789699..e8b99adcb1bd 100644 --- a/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml +++ b/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml @@ -84,31 +84,31 @@ examples: gpu_opp_table: opp_table0 { compatible = "operating-points-v2"; - opp@533000000 { + opp-533000000 { opp-hz = /bits/ 64 <533000000>; opp-microvolt = <1250000>; }; - opp@450000000 { + opp-450000000 { opp-hz = /bits/ 64 <450000000>; opp-microvolt = <1150000>; }; - opp@400000000 { + opp-400000000 { opp-hz = /bits/ 64 <400000000>; opp-microvolt = <1125000>; }; - opp@350000000 { + opp-350000000 { opp-hz = /bits/ 64 <350000000>; opp-microvolt = <1075000>; }; - opp@266000000 { + opp-266000000 { opp-hz = /bits/ 64 <266000000>; opp-microvolt = <1025000>; }; - opp@160000000 { + opp-160000000 { opp-hz = /bits/ 64 <160000000>; opp-microvolt = <925000>; }; - opp@100000000 { + opp-100000000 { opp-hz = /bits/ 64 <100000000>; opp-microvolt = <912500>; }; diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml index 36f59b3ade71..8d966f3ff3db 100644 --- a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml +++ b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml @@ -138,31 +138,31 @@ examples: gpu_opp_table: opp_table0 { compatible = "operating-points-v2"; - opp@533000000 { + opp-533000000 { opp-hz = /bits/ 64 <533000000>; opp-microvolt = <1250000>; }; - opp@450000000 { + opp-450000000 { opp-hz = /bits/ 64 <450000000>; opp-microvolt = <1150000>; }; - opp@400000000 { + opp-400000000 { opp-hz = /bits/ 64 <400000000>; 
opp-microvolt = <1125000>; }; - opp@350000000 { + opp-350000000 { opp-hz = /bits/ 64 <350000000>; opp-microvolt = <1075000>; }; - opp@266000000 { + opp-266000000 { opp-hz = /bits/ 64 <266000000>; opp-microvolt = <1025000>; }; - opp@160000000 { + opp-160000000 { opp-hz = /bits/ 64 <160000000>; opp-microvolt = <925000>; }; - opp@100000000 { + opp-100000000 { opp-hz = /bits/ 64 <100000000>; opp-microvolt = <912500>; }; diff --git a/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml b/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml new file mode 100644 index 000000000000..57a240d2d026 --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml @@ -0,0 +1,62 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +# Copyright 2019 Analog Devices Inc. +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/bindings/hwmon/adi,axi-fan-control.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices AXI FAN Control Device Tree Bindings + +maintainers: + - Nuno Sá <nuno.sa@analog.com> + +description: |+ + Bindings for the Analog Devices AXI FAN Control driver. Spefications of the + core can be found in: + + https://wiki.analog.com/resources/fpga/docs/axi_fan_control + +properties: + compatible: + enum: + - adi,axi-fan-control-1.00.a + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + interrupts: + maxItems: 1 + + pulses-per-revolution: + description: + Value specifying the number of pulses per revolution of the controlled + FAN. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + enum: [1, 2, 4] + +required: + - compatible + - reg + - clocks + - interrupts + - pulses-per-revolution + +examples: + - | + fpga_axi: fpga-axi@0 { + #address-cells = <0x2>; + #size-cells = <0x1>; + + axi_fan_control: axi-fan-control@80000000 { + compatible = "adi,axi-fan-control-1.00.a"; + reg = <0x0 0x80000000 0x10000>; + clocks = <&clk 71>; + interrupts = <0 110 0>; + pulses-per-revolution = <2>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/hwmon/adt7475.yaml b/Documentation/devicetree/bindings/hwmon/adt7475.yaml new file mode 100644 index 000000000000..76985034ea73 --- /dev/null +++ b/Documentation/devicetree/bindings/hwmon/adt7475.yaml @@ -0,0 +1,84 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/adt7475.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: ADT7475 hwmon sensor + +maintainers: + - Jean Delvare <jdelvare@suse.com> + +description: | + The ADT7473, ADT7475, ADT7476, and ADT7490 are thermal monitors and multiple + PWN fan controllers. + + They support monitoring and controlling up to four fans (the ADT7490 can only + control up to three). They support reading a single on chip temperature + sensor and two off chip temperature sensors (the ADT7490 additionally + supports measuring up to three current external temperature sensors with + series resistance cancellation (SRC)). + + Datasheets: + https://www.onsemi.com/pub/Collateral/ADT7473-D.PDF + https://www.onsemi.com/pub/Collateral/ADT7475-D.PDF + https://www.onsemi.com/pub/Collateral/ADT7476-D.PDF + https://www.onsemi.com/pub/Collateral/ADT7490-D.PDF + + Description taken from onsemiconductors specification sheets, with minor + rephrasing. 
+ +properties: + compatible: + enum: + - adi,adt7473 + - adi,adt7475 + - adi,adt7476 + - adi,adt7490 + + reg: + maxItems: 1 + +patternProperties: + "^adi,bypass-attenuator-in[0-4]$": + description: | + Configures bypassing the individual voltage input attenuator. If + set to 1 the attenuator is bypassed if set to 0 the attenuator is + not bypassed. If the property is absent then the attenuator + retains it's configuration from the bios/bootloader. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [0, 1] + + "^adi,pwm-active-state$": + description: | + Integer array, represents the active state of the pwm outputs If set to 0 + the pwm uses a logic low output for 100% duty cycle. If set to 1 the pwm + uses a logic high output for 100% duty cycle. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - minItems: 3 + maxItems: 3 + items: + enum: [0, 1] + default: 1 + +required: + - compatible + - reg + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + hwmon@2e { + compatible = "adi,adt7476"; + reg = <0x2e>; + adi,bypass-attenuator-in0 = <1>; + adi,bypass-attenuator-in1 = <0>; + adi,pwm-active-state = <1 0 1>; + }; + }; + diff --git a/Documentation/devicetree/bindings/hwmon/ltc2978.txt b/Documentation/devicetree/bindings/hwmon/ltc2978.txt index b428a70a7cc0..4e7f6215a453 100644 --- a/Documentation/devicetree/bindings/hwmon/ltc2978.txt +++ b/Documentation/devicetree/bindings/hwmon/ltc2978.txt @@ -2,20 +2,30 @@ ltc2978 Required properties: - compatible: should contain one of: + * "lltc,ltc2972" * "lltc,ltc2974" * "lltc,ltc2975" * "lltc,ltc2977" * "lltc,ltc2978" + * "lltc,ltc2979" * "lltc,ltc2980" * "lltc,ltc3880" * "lltc,ltc3882" * "lltc,ltc3883" + * "lltc,ltc3884" * "lltc,ltc3886" * "lltc,ltc3887" + * "lltc,ltc3889" + * "lltc,ltc7880" * "lltc,ltm2987" + * "lltc,ltm4664" * "lltc,ltm4675" * "lltc,ltm4676" + * "lltc,ltm4677" + * "lltc,ltm4678" + * "lltc,ltm4680" * "lltc,ltm4686" + * "lltc,ltm4700" - reg: I2C slave 
address Optional properties: @@ -25,13 +35,17 @@ Optional properties: standard binding for regulators; see regulator.txt. Valid names of regulators depend on number of supplies supported per device: + * ltc2972 vout0 - vout1 * ltc2974, ltc2975 : vout0 - vout3 - * ltc2977, ltc2980, ltm2987 : vout0 - vout7 + * ltc2977, ltc2979, ltc2980, ltm2987 : vout0 - vout7 * ltc2978 : vout0 - vout7 - * ltc3880, ltc3882, ltc3886 : vout0 - vout1 + * ltc3880, ltc3882, ltc3884, ltc3886, ltc3887, ltc3889 : vout0 - vout1 + * ltc7880 : vout0 - vout1 * ltc3883 : vout0 - * ltm4676 : vout0 - vout1 - * ltm4686 : vout0 - vout1 + * ltm4664 : vout0 - vout1 + * ltm4675, ltm4676, ltm4677, ltm4678 : vout0 - vout1 + * ltm4680, ltm4686 : vout0 - vout1 + * ltm4700 : vout0 - vout1 Example: ltc2978@5e { diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7923.yaml b/Documentation/devicetree/bindings/iio/adc/adi,ad7923.yaml new file mode 100644 index 000000000000..a11b918e0016 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7923.yaml @@ -0,0 +1,65 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/adc/adi,ad7923.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices AD7923 and similars with 4 and 8 Channel ADCs. + +maintainers: + - Michael Hennerich <michael.hennerich@analog.com> + - Patrick Vasseur <patrick.vasseur@c-s.fr> + +description: | + Analog Devices AD7904, AD7914, AD7923, AD7924 4 Channel ADCs, and AD7908, + AD7918, AD7928 8 Channels ADCs. 
+ + Specifications about the part can be found at: + https://www.analog.com/media/en/technical-documentation/data-sheets/AD7923.pdf + https://www.analog.com/media/en/technical-documentation/data-sheets/AD7904_7914_7924.pdf + https://www.analog.com/media/en/technical-documentation/data-sheets/AD7908_7918_7928.pdf + +properties: + compatible: + enum: + - adi,ad7904 + - adi,ad7914 + - adi,ad7923 + - adi,ad7924 + - adi,ad7908 + - adi,ad7918 + - adi,ad7928 + + reg: + maxItems: 1 + + refin-supply: + description: | + The regulator supply for ADC reference voltage. + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +required: + - compatible + - reg + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + ad7928: adc@0 { + compatible = "adi,ad7928"; + reg = <0>; + spi-max-frequency = <25000000>; + refin-supply = <&adc_vref>; + + #address-cells = <1>; + #size-cells = <0>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/adc/max1363.txt b/Documentation/devicetree/bindings/iio/adc/max1363.txt deleted file mode 100644 index 94a9011dd860..000000000000 --- a/Documentation/devicetree/bindings/iio/adc/max1363.txt +++ /dev/null @@ -1,63 +0,0 @@ -* Maxim 1x3x/136x/116xx Analog to Digital Converter (ADC) - -The node for this driver must be a child node of a I2C controller, hence -all mandatory properties for your controller must be specified. See directory: - - Documentation/devicetree/bindings/i2c - -for more details. 
- -Required properties: - - compatible: Should be one of - "maxim,max1361" - "maxim,max1362" - "maxim,max1363" - "maxim,max1364" - "maxim,max1036" - "maxim,max1037" - "maxim,max1038" - "maxim,max1039" - "maxim,max1136" - "maxim,max1137" - "maxim,max1138" - "maxim,max1139" - "maxim,max1236" - "maxim,max1237" - "maxim,max1238" - "maxim,max1239" - "maxim,max11600" - "maxim,max11601" - "maxim,max11602" - "maxim,max11603" - "maxim,max11604" - "maxim,max11605" - "maxim,max11606" - "maxim,max11607" - "maxim,max11608" - "maxim,max11609" - "maxim,max11610" - "maxim,max11611" - "maxim,max11612" - "maxim,max11613" - "maxim,max11614" - "maxim,max11615" - "maxim,max11616" - "maxim,max11617" - "maxim,max11644" - "maxim,max11645" - "maxim,max11646" - "maxim,max11647" - - reg: Should contain the ADC I2C address - -Optional properties: - - vcc-supply: phandle to the regulator that provides power to the ADC. - - vref-supply: phandle to the regulator for ADC reference voltage. - - interrupts: IRQ line for the ADC. If not used the driver will use - polling. - -Example: -adc: max11644@36 { - compatible = "maxim,max11644"; - reg = <0x36>; - vref-supply = <&adc_vref>; -}; diff --git a/Documentation/devicetree/bindings/iio/adc/maxim,max1238.yaml b/Documentation/devicetree/bindings/iio/adc/maxim,max1238.yaml new file mode 100644 index 000000000000..a0ebb4680140 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/maxim,max1238.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/adc/maxim,max1238.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Maxim MAX1238 and similar ADCs + +maintainers: + - Jonathan Cameron <jic23@kernel.org> + +description: | + Family of simple ADCs with i2c inteface and internal references. 
+ +properties: + compatible: + enum: + - maxim,max1036 + - maxim,max1037 + - maxim,max1038 + - maxim,max1039 + - maxim,max1136 + - maxim,max1137 + - maxim,max1138 + - maxim,max1139 + - maxim,max1236 + - maxim,max1237 + - maxim,max1238 + - maxim,max1239 + - maxim,max11600 + - maxim,max11601 + - maxim,max11602 + - maxim,max11603 + - maxim,max11604 + - maxim,max11605 + - maxim,max11606 + - maxim,max11607 + - maxim,max11608 + - maxim,max11609 + - maxim,max11610 + - maxim,max11611 + - maxim,max11612 + - maxim,max11613 + - maxim,max11614 + - maxim,max11615 + - maxim,max11616 + - maxim,max11617 + - maxim,max11644 + - maxim,max11645 + - maxim,max11646 + - maxim,max11647 + + reg: + maxItems: 1 + + vcc-supply: true + vref-supply: + description: Optional external reference. If not supplied, internal + reference will be used. + +required: + - compatible + - reg + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + adc@36 { + compatible = "maxim,max1238"; + reg = <0x36>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/iio/adc/maxim,max1363.yaml b/Documentation/devicetree/bindings/iio/adc/maxim,max1363.yaml new file mode 100644 index 000000000000..48377549c39a --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/maxim,max1363.yaml @@ -0,0 +1,50 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/adc/maxim,max1363.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Maxim MAX1363 and similar ADCs + +maintainers: + - Jonathan Cameron <jic23@kernel.org> + +description: | + Family of ADCs with i2c inteface, internal references and threshold + monitoring. + +properties: + compatible: + enum: + - maxim,max1361 + - maxim,max1362 + - maxim,max1363 + - maxim,max1364 + + reg: + maxItems: 1 + + vcc-supply: true + vref-supply: + description: Optional external reference. If not supplied, internal + reference will be used. 
+ + interrupts: + maxItems: 1 + +required: + - compatible + - reg + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + adc@36 { + compatible = "maxim,max1363"; + reg = <0x36>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/iio/adc/nuvoton,npcm-adc.txt b/Documentation/devicetree/bindings/iio/adc/nuvoton,npcm-adc.txt index eb939fe77836..ef8eeec1a997 100644 --- a/Documentation/devicetree/bindings/iio/adc/nuvoton,npcm-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/nuvoton,npcm-adc.txt @@ -6,6 +6,7 @@ Required properties: - compatible: "nuvoton,npcm750-adc" for the NPCM7XX BMC. - reg: specifies physical base address and size of the registers. - interrupts: Contain the ADC interrupt with flags for falling edge. +- resets : phandle to the reset control for this device. Optional properties: - clocks: phandle of ADC reference clock, in case the clock is not @@ -21,4 +22,5 @@ adc: adc@f000c000 { reg = <0xf000c000 0x8>; interrupts = <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clk NPCM7XX_CLK_ADC>; + resets = <&rstc NPCM7XX_RESET_IPSRST1 NPCM7XX_RESET_ADC>; }; diff --git a/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.yaml b/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.yaml index f46de17c0878..cc3c8ea6a894 100644 --- a/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.yaml +++ b/Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.yaml @@ -123,7 +123,7 @@ examples: samsung,syscon-phandle = <&pmu_system_controller>; /* NTC thermistor is a hwmon device */ - ncp15wb473@0 { + ncp15wb473 { compatible = "murata,ncp15wb473"; pullup-uv = <1800000>; pullup-ohm = <47000>; diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt deleted file mode 100644 index 8de933146771..000000000000 --- a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt +++ /dev/null @@ -1,149 +0,0 @@ -STMicroelectronics STM32 ADC device 
driver - -STM32 ADC is a successive approximation analog-to-digital converter. -It has several multiplexed input channels. Conversions can be performed -in single, continuous, scan or discontinuous mode. Result of the ADC is -stored in a left-aligned or right-aligned 32-bit data register. -Conversions can be launched in software or using hardware triggers. - -The analog watchdog feature allows the application to detect if the input -voltage goes beyond the user-defined, higher or lower thresholds. - -Each STM32 ADC block can have up to 3 ADC instances. - -Each instance supports two contexts to manage conversions, each one has its -own configurable sequence and trigger: -- regular conversion can be done in sequence, running in background -- injected conversions have higher priority, and so have the ability to - interrupt regular conversion sequence (either triggered in SW or HW). - Regular sequence is resumed, in case it has been interrupted. - -Contents of a stm32 adc root node: ------------------------------------ -Required properties: -- compatible: Should be one of: - "st,stm32f4-adc-core" - "st,stm32h7-adc-core" - "st,stm32mp1-adc-core" -- reg: Offset and length of the ADC block register set. -- interrupts: One or more interrupts for ADC block. Some parts like stm32f4 - and stm32h7 share a common ADC interrupt line. stm32mp1 has two separate - interrupt lines, one for each ADC within ADC block. -- clocks: Core can use up to two clocks, depending on part used: - - "adc" clock: for the analog circuitry, common to all ADCs. - It's required on stm32f4. - It's optional on stm32h7. - - "bus" clock: for registers access, common to all ADCs. - It's not present on stm32f4. - It's required on stm32h7. -- clock-names: Must be "adc" and/or "bus" depending on part used. -- interrupt-controller: Identifies the controller node as interrupt-parent -- vdda-supply: Phandle to the vdda input analog voltage. -- vref-supply: Phandle to the vref input analog reference voltage. 
-- #interrupt-cells = <1>; -- #address-cells = <1>; -- #size-cells = <0>; - -Optional properties: -- A pinctrl state named "default" for each ADC channel may be defined to set - inX ADC pins in mode of operation for analog input on external pin. -- booster-supply: Phandle to the embedded booster regulator that can be used - to supply ADC analog input switches on stm32h7 and stm32mp1. -- vdd-supply: Phandle to the vdd input voltage. It can be used to supply ADC - analog input switches on stm32mp1. -- st,syscfg: Phandle to system configuration controller. It can be used to - control the analog circuitry on stm32mp1. -- st,max-clk-rate-hz: Allow to specify desired max clock rate used by analog - circuitry. - -Contents of a stm32 adc child node: ------------------------------------ -An ADC block node should contain at least one subnode, representing an -ADC instance available on the machine. - -Required properties: -- compatible: Should be one of: - "st,stm32f4-adc" - "st,stm32h7-adc" - "st,stm32mp1-adc" -- reg: Offset of ADC instance in ADC block (e.g. may be 0x0, 0x100, 0x200). -- clocks: Input clock private to this ADC instance. It's required only on - stm32f4, that has per instance clock input for registers access. -- interrupts: IRQ Line for the ADC (e.g. may be 0 for adc@0, 1 for adc@100 or - 2 for adc@200). -- st,adc-channels: List of single-ended channels muxed for this ADC. - It can have up to 16 channels on stm32f4 or 20 channels on stm32h7, numbered - from 0 to 15 or 19 (resp. for in0..in15 or in0..in19). -- st,adc-diff-channels: List of differential channels muxed for this ADC. - Depending on part used, some channels can be configured as differential - instead of single-ended (e.g. stm32h7). List here positive and negative - inputs pairs as <vinp vinn>, <vinp vinn>,... vinp and vinn are numbered - from 0 to 19 on stm32h7) - Note: At least one of "st,adc-channels" or "st,adc-diff-channels" is required. - Both properties can be used together. 
Some channels can be used as - single-ended and some other ones as differential (mixed). But channels - can't be configured both as single-ended and differential (invalid). -- #io-channel-cells = <1>: See the IIO bindings section "IIO consumers" in - Documentation/devicetree/bindings/iio/iio-bindings.txt - -Optional properties: -- dmas: Phandle to dma channel for this ADC instance. - See ../../dma/dma.txt for details. -- dma-names: Must be "rx" when dmas property is being used. -- assigned-resolution-bits: Resolution (bits) to use for conversions. Must - match device available resolutions: - * can be 6, 8, 10 or 12 on stm32f4 - * can be 8, 10, 12, 14 or 16 on stm32h7 - Default is maximum resolution if unset. -- st,min-sample-time-nsecs: Minimum sampling time in nanoseconds. - Depending on hardware (board) e.g. high/low analog input source impedance, - fine tune of ADC sampling time may be recommended. - This can be either one value or an array that matches 'st,adc-channels' list, - to set sample time resp. for all channels, or independently for each channel. - -Example: - adc: adc@40012000 { - compatible = "st,stm32f4-adc-core"; - reg = <0x40012000 0x400>; - interrupts = <18>; - clocks = <&rcc 0 168>; - clock-names = "adc"; - vref-supply = <&reg_vref>; - interrupt-controller; - pinctrl-names = "default"; - pinctrl-0 = <&adc3_in8_pin>; - - #interrupt-cells = <1>; - #address-cells = <1>; - #size-cells = <0>; - - adc@0 { - compatible = "st,stm32f4-adc"; - #io-channel-cells = <1>; - reg = <0x0>; - clocks = <&rcc 0 168>; - interrupt-parent = <&adc>; - interrupts = <0>; - st,adc-channels = <8>; - dmas = <&dma2 0 0 0x400 0x0>; - dma-names = "rx"; - assigned-resolution-bits = <8>; - }; - ... - other adc child nodes follow... - }; - -Example to setup: -- channel 1 as single-ended -- channels 2 & 3 as differential (with resp. 6 & 7 negative inputs) - - adc: adc@40022000 { - compatible = "st,stm32h7-adc-core"; - ... - adc1: adc@0 { - compatible = "st,stm32h7-adc"; - ...
- st,adc-channels = <1>; - st,adc-diff-channels = <2 6>, <3 7>; - }; - }; diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.yaml b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.yaml new file mode 100644 index 000000000000..933ba37944d7 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.yaml @@ -0,0 +1,458 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/bindings/iio/adc/st,stm32-adc.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: STMicroelectronics STM32 ADC bindings + +description: | + STM32 ADC is a successive approximation analog-to-digital converter. + It has several multiplexed input channels. Conversions can be performed + in single, continuous, scan or discontinuous mode. Result of the ADC is + stored in a left-aligned or right-aligned 32-bit data register. + Conversions can be launched in software or using hardware triggers. + + The analog watchdog feature allows the application to detect if the input + voltage goes beyond the user-defined, higher or lower thresholds. + + Each STM32 ADC block can have up to 3 ADC instances. + +maintainers: + - Fabrice Gasnier <fabrice.gasnier@st.com> + +properties: + compatible: + enum: + - st,stm32f4-adc-core + - st,stm32h7-adc-core + - st,stm32mp1-adc-core + + reg: + maxItems: 1 + + interrupts: + description: | + One or more interrupts for ADC block, depending on part used: + - stm32f4 and stm32h7 share a common ADC interrupt line. + - stm32mp1 has two separate interrupt lines, one for each ADC within + ADC block. + minItems: 1 + maxItems: 2 + + clocks: + description: | + Core can use up to two clocks, depending on part used: + - "adc" clock: for the analog circuitry, common to all ADCs. + It's required on stm32f4. + It's optional on stm32h7 and stm32mp1. + - "bus" clock: for registers access, common to all ADCs. + It's not present on stm32f4. + It's required on stm32h7 and stm32mp1. 
+ + clock-names: true + + st,max-clk-rate-hz: + description: + Allow to specify desired max clock rate used by analog circuitry. + + vdda-supply: + description: Phandle to the vdda input analog voltage. + + vref-supply: + description: Phandle to the vref input analog reference voltage. + + booster-supply: + description: + Phandle to the embedded booster regulator that can be used to supply ADC + analog input switches on stm32h7 and stm32mp1. + + vdd-supply: + description: + Phandle to the vdd input voltage. It can be used to supply ADC analog + input switches on stm32mp1. + + st,syscfg: + description: + Phandle to system configuration controller. It can be used to control the + analog circuitry on stm32mp1. + allOf: + - $ref: "/schemas/types.yaml#/definitions/phandle-array" + + interrupt-controller: true + + '#interrupt-cells': + const: 1 + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +allOf: + - if: + properties: + compatible: + contains: + const: st,stm32f4-adc-core + + then: + properties: + clocks: + maxItems: 1 + + clock-names: + const: adc + + interrupts: + items: + - description: interrupt line common for all ADCs + + st,max-clk-rate-hz: + minimum: 600000 + maximum: 36000000 + default: 36000000 + + booster-supply: false + + vdd-supply: false + + st,syscfg: false + + - if: + properties: + compatible: + contains: + const: st,stm32h7-adc-core + + then: + properties: + clocks: + minItems: 1 + maxItems: 2 + + clock-names: + items: + - const: bus + - const: adc + minItems: 1 + maxItems: 2 + + interrupts: + items: + - description: interrupt line common for all ADCs + + st,max-clk-rate-hz: + minimum: 120000 + maximum: 36000000 + default: 36000000 + + vdd-supply: false + + st,syscfg: false + + - if: + properties: + compatible: + contains: + const: st,stm32mp1-adc-core + + then: + properties: + clocks: + minItems: 1 + maxItems: 2 + + clock-names: + items: + - const: bus + - const: adc + minItems: 1 + maxItems: 2 + + interrupts: + items: + - 
description: interrupt line for ADC1 + - description: interrupt line for ADC2 + + st,max-clk-rate-hz: + minimum: 120000 + maximum: 36000000 + default: 36000000 + +additionalProperties: false + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + - vdda-supply + - vref-supply + - interrupt-controller + - '#interrupt-cells' + - '#address-cells' + - '#size-cells' + +patternProperties: + "^adc@[0-9]+$": + type: object + description: + An ADC block node should contain at least one subnode, representing an + ADC instance available on the machine. + + properties: + compatible: + enum: + - st,stm32f4-adc + - st,stm32h7-adc + - st,stm32mp1-adc + + reg: + description: | + Offset of ADC instance in ADC block. Valid values are: + - 0x0: ADC1 + - 0x100: ADC2 + - 0x200: ADC3 (stm32f4 only) + maxItems: 1 + + '#io-channel-cells': + const: 1 + + interrupts: + description: | + IRQ Line for the ADC instance. Valid values are: + - 0 for adc@0 + - 1 for adc@100 + - 2 for adc@200 (stm32f4 only) + maxItems: 1 + + clocks: + description: + Input clock private to this ADC instance. It's required only on + stm32f4, that has per instance clock input for registers access. + maxItems: 1 + + dmas: + description: RX DMA Channel + maxItems: 1 + + dma-names: + const: rx + + assigned-resolution-bits: + description: | + Resolution (bits) to use for conversions: + - can be 6, 8, 10 or 12 on stm32f4 + - can be 8, 10, 12, 14 or 16 on stm32h7 and stm32mp1 + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + + st,adc-channels: + description: | + List of single-ended channels muxed for this ADC. It can have up to: + - 16 channels, numbered from 0 to 15 (for in0..in15) on stm32f4 + - 20 channels, numbered from 0 to 19 (for in0..in19) on stm32h7 and + stm32mp1. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + + st,adc-diff-channels: + description: | + List of differential channels muxed for this ADC. 
Some channels can + be configured as differential instead of single-ended on stm32h7 and + on stm32mp1. Positive and negative inputs pairs are listed: + <vinp vinn>, <vinp vinn>,... vinp and vinn are numbered from 0 to 19. + + Note: At least one of "st,adc-channels" or "st,adc-diff-channels" is + required. Both properties can be used together. Some channels can be + used as single-ended and some other ones as differential (mixed). But + channels can't be configured both as single-ended and differential. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-matrix + - items: + items: + - description: | + "vinp" indicates positive input number + minimum: 0 + maximum: 19 + - description: | + "vinn" indicates negative input number + minimum: 0 + maximum: 19 + + st,min-sample-time-nsecs: + description: + Minimum sampling time in nanoseconds. Depending on hardware (board) + e.g. high/low analog input source impedance, fine tune of ADC + sampling time may be recommended. This can be either one value or an + array that matches "st,adc-channels" and/or "st,adc-diff-channels" + list, to set sample time resp. for all channels, or independently for + each channel. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + + allOf: + - if: + properties: + compatible: + contains: + const: st,stm32f4-adc + + then: + properties: + reg: + enum: + - 0x0 + - 0x100 + - 0x200 + + interrupts: + minimum: 0 + maximum: 2 + + assigned-resolution-bits: + enum: [6, 8, 10, 12] + default: 12 + + st,adc-channels: + minItems: 1 + maxItems: 16 + items: + minimum: 0 + maximum: 15 + + st,adc-diff-channels: false + + st,min-sample-time-nsecs: + minItems: 1 + maxItems: 16 + items: + minimum: 80 + + required: + - clocks + + - if: + properties: + compatible: + contains: + enum: + - st,stm32h7-adc + - st,stm32mp1-adc + + then: + properties: + reg: + enum: + - 0x0 + - 0x100 + + interrupts: + minimum: 0 + maximum: 1 + + assigned-resolution-bits: + enum: [8, 10, 12, 14, 16] + default: 16 + + st,adc-channels: + minItems: 1 + maxItems: 20 + items: + minimum: 0 + maximum: 19 + + st,min-sample-time-nsecs: + minItems: 1 + maxItems: 20 + items: + minimum: 40 + + additionalProperties: false + + anyOf: + - required: + - st,adc-channels + - required: + - st,adc-diff-channels + + required: + - compatible + - reg + - interrupts + - '#io-channel-cells' + +examples: + - | + // Example 1: with stm32f429, ADC1, single-ended channel 8 + adc123: adc@40012000 { + compatible = "st,stm32f4-adc-core"; + reg = <0x40012000 0x400>; + interrupts = <18>; + clocks = <&rcc 0 168>; + clock-names = "adc"; + st,max-clk-rate-hz = <36000000>; + vdda-supply = <&vdda>; + vref-supply = <&vref>; + interrupt-controller; + #interrupt-cells = <1>; + #address-cells = <1>; + #size-cells = <0>; + adc@0 { + compatible = "st,stm32f4-adc"; + #io-channel-cells = <1>; + reg = <0x0>; + clocks = <&rcc 0 168>; + interrupt-parent = <&adc123>; + interrupts = <0>; + st,adc-channels = <8>; + dmas = <&dma2 0 0 0x400 0x0>; + dma-names = "rx"; + assigned-resolution-bits = <8>; + }; + // ... + // other adc child nodes follow... 
+ }; + + - | + // Example 2: with stm32mp157c to setup ADC1 with: + // - channels 0 & 1 as single-ended + // - channels 2 & 3 as differential (with resp. 6 & 7 negative inputs) + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/stm32mp1-clks.h> + adc12: adc@48003000 { + compatible = "st,stm32mp1-adc-core"; + reg = <0x48003000 0x400>; + interrupts = <GIC_SPI 18 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 90 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&rcc ADC12>, <&rcc ADC12_K>; + clock-names = "bus", "adc"; + booster-supply = <&booster>; + vdd-supply = <&vdd>; + vdda-supply = <&vdda>; + vref-supply = <&vref>; + st,syscfg = <&syscfg>; + interrupt-controller; + #interrupt-cells = <1>; + #address-cells = <1>; + #size-cells = <0>; + adc@0 { + compatible = "st,stm32mp1-adc"; + #io-channel-cells = <1>; + reg = <0x0>; + interrupt-parent = <&adc12>; + interrupts = <0>; + st,adc-channels = <0 1>; + st,adc-diff-channels = <2 6>, <3 7>; + st,min-sample-time-nsecs = <5000>; + dmas = <&dmamux1 9 0x400 0x05>; + dma-names = "rx"; + }; + // ... + // other adc child node follow... + }; + +... diff --git a/Documentation/devicetree/bindings/iio/amplifiers/adi,hmc425a.yaml b/Documentation/devicetree/bindings/iio/amplifiers/adi,hmc425a.yaml new file mode 100644 index 000000000000..1c6d49685e9f --- /dev/null +++ b/Documentation/devicetree/bindings/iio/amplifiers/adi,hmc425a.yaml @@ -0,0 +1,49 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/amplifiers/adi,hmc425a.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: HMC425A 6-bit Digital Step Attenuator + +maintainers: +- Michael Hennerich <michael.hennerich@analog.com> +- Beniamin Bia <beniamin.bia@analog.com> + +description: | + Digital Step Attenuator IIO device with gpio interface. 
+ HMC425A 0.5 dB LSB GaAs MMIC 6-BIT DIGITAL POSITIVE CONTROL ATTENUATOR, 2.2 - 8.0 GHz + https://www.analog.com/media/en/technical-documentation/data-sheets/hmc425A.pdf + +properties: + compatible: + enum: + - adi,hmc425a + + vcc-supply: true + + ctrl-gpios: + description: + Must contain an array of 6 GPIO specifiers, referring to the GPIO pins + connected to the control pins V1-V6. + minItems: 6 + maxItems: 6 + +required: + - compatible + - ctrl-gpios + +examples: + - | + #include <dt-bindings/gpio/gpio.h> + gpio_hmc425a: hmc425a { + compatible = "adi,hmc425a"; + ctrl-gpios = <&gpio 40 GPIO_ACTIVE_HIGH>, + <&gpio 39 GPIO_ACTIVE_HIGH>, + <&gpio 38 GPIO_ACTIVE_HIGH>, + <&gpio 37 GPIO_ACTIVE_HIGH>, + <&gpio 36 GPIO_ACTIVE_HIGH>, + <&gpio 35 GPIO_ACTIVE_HIGH>; + vcc-supply = <&foo>; + }; +... diff --git a/Documentation/devicetree/bindings/iio/chemical/atlas,ec-sm.txt b/Documentation/devicetree/bindings/iio/chemical/atlas,ec-sm.txt deleted file mode 100644 index f4320595b851..000000000000 --- a/Documentation/devicetree/bindings/iio/chemical/atlas,ec-sm.txt +++ /dev/null @@ -1,21 +0,0 @@ -* Atlas Scientific EC-SM OEM sensor - -http://www.atlas-scientific.com/_files/_datasheets/_oem/EC_oem_datasheet.pdf - -Required properties: - - - compatible: must be "atlas,ec-sm" - - reg: the I2C address of the sensor - - interrupts: the sole interrupt generated by the device - - Refer to interrupt-controller/interrupts.txt for generic interrupt client - node bindings. 
- -Example: - -atlas@64 { - compatible = "atlas,ec-sm"; - reg = <0x64>; - interrupt-parent = <&gpio1>; - interrupts = <16 2>; -}; diff --git a/Documentation/devicetree/bindings/iio/chemical/atlas,orp-sm.txt b/Documentation/devicetree/bindings/iio/chemical/atlas,orp-sm.txt deleted file mode 100644 index af1f5a9aa4da..000000000000 --- a/Documentation/devicetree/bindings/iio/chemical/atlas,orp-sm.txt +++ /dev/null @@ -1,21 +0,0 @@ -* Atlas Scientific ORP-SM OEM sensor - -https://www.atlas-scientific.com/_files/_datasheets/_oem/ORP_oem_datasheet.pdf - -Required properties: - - - compatible: must be "atlas,orp-sm" - - reg: the I2C address of the sensor - - interrupts: the sole interrupt generated by the device - - Refer to interrupt-controller/interrupts.txt for generic interrupt client - node bindings. - -Example: - -atlas@66 { - compatible = "atlas,orp-sm"; - reg = <0x66>; - interrupt-parent = <&gpio1>; - interrupts = <16 2>; -}; diff --git a/Documentation/devicetree/bindings/iio/chemical/atlas,ph-sm.txt b/Documentation/devicetree/bindings/iio/chemical/atlas,ph-sm.txt deleted file mode 100644 index 79d90f060327..000000000000 --- a/Documentation/devicetree/bindings/iio/chemical/atlas,ph-sm.txt +++ /dev/null @@ -1,21 +0,0 @@ -* Atlas Scientific pH-SM OEM sensor - -http://www.atlas-scientific.com/_files/_datasheets/_oem/pH_oem_datasheet.pdf - -Required properties: - - - compatible: must be "atlas,ph-sm" - - reg: the I2C address of the sensor - - interrupts: the sole interrupt generated by the device - - Refer to interrupt-controller/interrupts.txt for generic interrupt client - node bindings. 
- -Example: - -atlas@65 { - compatible = "atlas,ph-sm"; - reg = <0x65>; - interrupt-parent = <&gpio1>; - interrupts = <16 2>; -}; diff --git a/Documentation/devicetree/bindings/iio/chemical/atlas,sensor.yaml b/Documentation/devicetree/bindings/iio/chemical/atlas,sensor.yaml new file mode 100644 index 000000000000..edcd2904d50e --- /dev/null +++ b/Documentation/devicetree/bindings/iio/chemical/atlas,sensor.yaml @@ -0,0 +1,53 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/chemical/atlas,sensor.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Atlas Scientific OEM sensors + +maintainers: + - Matt Ranostay <matt.ranostay@konsulko.com> + +description: | + Atlas Scientific OEM sensors connected via I2C + + Datasheets: + http://www.atlas-scientific.com/_files/_datasheets/_oem/DO_oem_datasheet.pdf + http://www.atlas-scientific.com/_files/_datasheets/_oem/EC_oem_datasheet.pdf + http://www.atlas-scientific.com/_files/_datasheets/_oem/ORP_oem_datasheet.pdf + http://www.atlas-scientific.com/_files/_datasheets/_oem/pH_oem_datasheet.pdf + +properties: + compatible: + enum: + - atlas,do-sm + - atlas,ec-sm + - atlas,orp-sm + - atlas,ph-sm + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + +required: + - compatible + - reg + +additionalProperties: false + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + atlas@66 { + compatible = "atlas,orp-sm"; + reg = <0x66>; + interrupt-parent = <&gpio1>; + interrupts = <16 2>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/dac/adi,ad5770r.yaml b/Documentation/devicetree/bindings/iio/dac/adi,ad5770r.yaml new file mode 100644 index 000000000000..d9c25cf4b92f --- /dev/null +++ b/Documentation/devicetree/bindings/iio/dac/adi,ad5770r.yaml @@ -0,0 +1,185 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +# Copyright 2020 Analog Devices Inc. 
+%YAML 1.2 +--- +$id: http://devicetree.org/schemas/bindings/iio/dac/adi,ad5770r.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices AD5770R DAC device driver + +maintainers: + - Mircea Caprioru <mircea.caprioru@analog.com> + +description: | + Bindings for the Analog Devices AD5770R current DAC device. Datasheet can be + found here: + https://www.analog.com/media/en/technical-documentation/data-sheets/AD5770R.pdf + +properties: + compatible: + enum: + - adi,ad5770r + + reg: + maxItems: 1 + + avdd-supply: + description: + AVdd voltage supply. Represents two different supplies in the datasheet + that are in fact the same. + + iovdd-supply: + description: + Voltage supply for the chip interface. + + vref-supply: + description: Specify the voltage of the external reference used. + Available reference options are 1.25 V or 2.5 V. If no + external reference declared then the device will use the + internal reference of 1.25 V. + + adi,external-resistor: + description: Specify if an external 2.5k ohm resistor is used. If not + specified the device will use an internal 2.5k ohm resistor. + The precision resistor is used for reference current generation. + type: boolean + + reset-gpios: + description: GPIO spec for the RESET pin. If specified, it will be + asserted during driver probe. + maxItems: 1 + + channel0: + description: Represents an external channel which are + connected to the DAC. Channel 0 can act both as a current + source and sink. + type: object + + properties: + num: + description: This represents the channel number. + items: + const: 0 + + adi,range-microamp: + description: Output range of the channel. + oneOf: + - $ref: /schemas/types.yaml#/definitions/int32-array + - items: + - enum: [0 300000] + - enum: [-60000 0] + - enum: [-60000 300000] + + channel1: + description: Represents an external channel which are + connected to the DAC. + type: object + + properties: + num: + description: This represents the channel number. 
+ items: + const: 1 + + adi,range-microamp: + description: Output range of the channel. + oneOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - items: + - enum: [0 140000] + - enum: [0 250000] + + channel2: + description: Represents an external channel which are + connected to the DAC. + type: object + + properties: + num: + description: This represents the channel number. + items: + const: 2 + + adi,range-microamp: + description: Output range of the channel. + oneOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - items: + - enum: [0 140000] + - enum: [0 250000] + +patternProperties: + "^channel@([3-5])$": + type: object + description: Represents the external channels which are connected to the DAC. + properties: + num: + description: This represents the channel number. + items: + minimum: 3 + maximum: 5 + + adi,range-microamp: + description: Output range of the channel. + oneOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - items: + - enum: [0 45000] + - enum: [0 100000] + +required: +- reg +- diff-channels +- channel0 +- channel1 +- channel2 +- channel3 +- channel4 +- channel5 + +examples: + - | + spi { + #address-cells = <1>; + #size-cells = <0>; + + ad5770r@0 { + compatible = "ad5770r"; + reg = <0>; + spi-max-frequency = <1000000>; + vref-supply = <&vref>; + adi,external-resistor; + reset-gpios = <&gpio 22 0>; + + channel@0 { + num = <0>; + adi,range-microamp = <(-60000) 300000>; + }; + + channel@1 { + num = <1>; + adi,range-microamp = <0 140000>; + }; + + channel@2 { + num = <2>; + adi,range-microamp = <0 55000>; + }; + + channel@3 { + num = <3>; + adi,range-microamp = <0 45000>; + }; + + channel@4 { + num = <4>; + adi,range-microamp = <0 45000>; + }; + + channel@5 { + num = <5>; + adi,range-microamp = <0 45000>; + }; + }; + }; +... 
diff --git a/Documentation/devicetree/bindings/iio/dac/ltc2632.txt b/Documentation/devicetree/bindings/iio/dac/ltc2632.txt index e0d5fea33031..338c3220f01a 100644 --- a/Documentation/devicetree/bindings/iio/dac/ltc2632.txt +++ b/Documentation/devicetree/bindings/iio/dac/ltc2632.txt @@ -1,4 +1,4 @@ -Linear Technology LTC2632 DAC device driver +Linear Technology LTC2632/2636 DAC Required properties: - compatible: Has to contain one of the following: @@ -8,6 +8,12 @@ Required properties: lltc,ltc2632-h12 lltc,ltc2632-h10 lltc,ltc2632-h8 + lltc,ltc2636-l12 + lltc,ltc2636-l10 + lltc,ltc2636-l8 + lltc,ltc2636-h12 + lltc,ltc2636-h10 + lltc,ltc2636-h8 Property rules described in Documentation/devicetree/bindings/spi/spi-bus.txt apply. In particular, "reg" and "spi-max-frequency" properties must be given. diff --git a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt index c5ee8a20af9f..f2f64749e818 100644 --- a/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt +++ b/Documentation/devicetree/bindings/iio/imu/inv_mpu6050.txt @@ -4,6 +4,7 @@ http://www.invensense.com/mems/gyro/mpu6050.html Required properties: - compatible : should be one of + "invensense,mpu6000" "invensense,mpu6050" "invensense,mpu6500" "invensense,mpu6515" @@ -11,7 +12,11 @@ Required properties: "invensense,mpu9250" "invensense,mpu9255" "invensense,icm20608" + "invensense,icm20609" + "invensense,icm20689" "invensense,icm20602" + "invensense,icm20690" + "invensense,iam20680" - reg : the I2C address of the sensor - interrupts: interrupt mapping for IRQ. 
It should be configured with flags IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or diff --git a/Documentation/devicetree/bindings/iio/light/dynaimage,al3010.yaml b/Documentation/devicetree/bindings/iio/light/dynaimage,al3010.yaml new file mode 100644 index 000000000000..f671edda6641 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/dynaimage,al3010.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/dynaimage,al3010.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Dyna-Image AL3010 sensor + +maintainers: + - David Heidelberg <david@ixit.cz> + +properties: + compatible: + const: dynaimage,al3010 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + vdd-supply: + description: Regulator that provides power to the sensor + +required: + - compatible + - reg + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + + i2c { + #address-cells = <1>; + #size-cells = <0>; + + light-sensor@1c { + compatible = "dynaimage,al3010"; + reg = <0x1c>; + vdd-supply = <&vdd_reg>; + interrupts = <0 99 4>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/light/dynaimage,al3320a.yaml b/Documentation/devicetree/bindings/iio/light/dynaimage,al3320a.yaml new file mode 100644 index 000000000000..497300239d93 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/dynaimage,al3320a.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/dynaimage,al3320a.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Dyna-Image AL3320A sensor + +maintainers: + - David Heidelberg <david@ixit.cz> + +properties: + compatible: + const: dynaimage,al3320a + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + vdd-supply: + description: Regulator that provides power to the sensor + +required: + - compatible + - reg + 
+examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + + i2c { + #address-cells = <1>; + #size-cells = <0>; + + light-sensor@1c { + compatible = "dynaimage,al3320a"; + reg = <0x1c>; + vdd-supply = <&vdd_reg>; + interrupts = <0 99 4>; + }; + }; diff --git a/Documentation/devicetree/bindings/iio/light/sharp,gp2ap002.yaml b/Documentation/devicetree/bindings/iio/light/sharp,gp2ap002.yaml new file mode 100644 index 000000000000..12aa16f24772 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/light/sharp,gp2ap002.yaml @@ -0,0 +1,85 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iio/light/sharp,gp2ap002.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Sharp GP2AP002A00F and GP2AP002S00F proximity and ambient light sensors + +maintainers: + - Linus Walleij <linus.walleij@linaro.org> + +description: | + Proximity and ambient light sensor with IR LED for the proximity + sensing and an analog output for light intensity. The ambient light + sensor output is not available on the GP2AP002S00F variant. + +properties: + compatible: + enum: + - sharp,gp2ap002a00f + - sharp,gp2ap002s00f + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + description: an interrupt for proximity, usually a GPIO line + + vdd-supply: + description: VDD power supply a phandle to a regulator + + vio-supply: + description: VIO power supply a phandle to a regulator + + io-channels: + maxItems: 1 + description: ALSOUT ADC channel to read the ambient light + + io-channel-names: + const: alsout + + sharp,proximity-far-hysteresis: + $ref: /schemas/types.yaml#/definitions/uint8 + description: | + Hysteresis setting for "far" object detection, this setting is + device-unique and adjust the optical setting for proximity detection + of a "far away" object in front of the sensor. 
+ + sharp,proximity-close-hysteresis: + $ref: /schemas/types.yaml#/definitions/uint8 + description: | + Hysteresis setting for "close" object detection, this setting is + device-unique and adjust the optical setting for proximity detection + of a "close" object in front of the sensor. + +required: + - compatible + - reg + - interrupts + - sharp,proximity-far-hysteresis + - sharp,proximity-close-hysteresis + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + + i2c { + #address-cells = <1>; + #size-cells = <0>; + + light-sensor@44 { + compatible = "sharp,gp2ap002a00f"; + reg = <0x44>; + interrupts = <18 IRQ_TYPE_EDGE_FALLING>; + vdd-supply = <&vdd_regulator>; + vio-supply = <&vio_regulator>; + io-channels = <&adc_channel>; + io-channel-names = "alsout"; + sharp,proximity-far-hysteresis = /bits/ 8 <0x2f>; + sharp,proximity-close-hysteresis = /bits/ 8 <0x0f>; + }; + }; + +... diff --git a/Documentation/devicetree/bindings/iio/proximity/devantech-srf04.yaml b/Documentation/devicetree/bindings/iio/proximity/devantech-srf04.yaml index 4e80ea7c1475..8afbac24c34e 100644 --- a/Documentation/devicetree/bindings/iio/proximity/devantech-srf04.yaml +++ b/Documentation/devicetree/bindings/iio/proximity/devantech-srf04.yaml @@ -51,6 +51,24 @@ properties: the time between two interrupts is measured in the driver. maxItems: 1 + power-gpios: + description: + Definition of the GPIO for power management of connected peripheral + (output). + This GPIO can be used by the external hardware for power management. + When the device gets suspended it's switched off and when it resumes + it's switched on again. After some period of inactivity the driver + get suspended automatically (autosuspend feature). + maxItems: 1 + + startup-time-ms: + description: + This is the startup time the device needs after a resume to be up and + running. 
+ minimum: 0 + maximum: 1000 + default: 100 + required: - compatible - trig-gpios diff --git a/Documentation/devicetree/bindings/input/cypress,tm2-touchkey.txt b/Documentation/devicetree/bindings/input/cypress,tm2-touchkey.txt index ef2ae729718f..921172f689b8 100644 --- a/Documentation/devicetree/bindings/input/cypress,tm2-touchkey.txt +++ b/Documentation/devicetree/bindings/input/cypress,tm2-touchkey.txt @@ -5,6 +5,7 @@ Required properties: * "cypress,tm2-touchkey" - for the touchkey found on the tm2 board * "cypress,midas-touchkey" - for the touchkey found on midas boards * "cypress,aries-touchkey" - for the touchkey found on aries boards + * "coreriver,tc360-touchkey" - for the Coreriver TouchCore 360 touchkey - reg: I2C address of the chip. - interrupts: interrupt to which the chip is connected (see interrupt binding[0]). diff --git a/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt b/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt index dc194b2c151a..cdcaa3f52d25 100644 --- a/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt +++ b/Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt @@ -1,9 +1,10 @@ -Ilitek ILI210x/ILI2117/ILI251x touchscreen controller +Ilitek ILI210x/ILI2117/ILI2120/ILI251x touchscreen controller Required properties: - compatible: ilitek,ili210x for ILI210x ilitek,ili2117 for ILI2117 + ilitek,ili2120 for ILI2120 ilitek,ili251x for ILI251x - reg: The I2C address of the device diff --git a/Documentation/devicetree/bindings/input/touchscreen/goodix.yaml b/Documentation/devicetree/bindings/input/touchscreen/goodix.yaml index d7c3262b2494..c99ed3934d7e 100644 --- a/Documentation/devicetree/bindings/input/touchscreen/goodix.yaml +++ b/Documentation/devicetree/bindings/input/touchscreen/goodix.yaml @@ -62,7 +62,7 @@ required: examples: - | - i2c@00000000 { + i2c { #address-cells = <1>; #size-cells = <0>; gt928@5d { diff --git a/Documentation/devicetree/bindings/input/twl4030-pwrbutton.txt 
b/Documentation/devicetree/bindings/input/twl4030-pwrbutton.txt index c864a46cddcf..f5021214edec 100644 --- a/Documentation/devicetree/bindings/input/twl4030-pwrbutton.txt +++ b/Documentation/devicetree/bindings/input/twl4030-pwrbutton.txt @@ -1,7 +1,7 @@ Texas Instruments TWL family (twl4030) pwrbutton module This module is part of the TWL4030. For more details about the whole -chip see Documentation/devicetree/bindings/mfd/twl-familly.txt. +chip see Documentation/devicetree/bindings/mfd/twl-family.txt. This module provides a simple power button event via an Interrupt. diff --git a/Documentation/devicetree/bindings/interrupt-controller/loongson,htpic.yaml b/Documentation/devicetree/bindings/interrupt-controller/loongson,htpic.yaml new file mode 100644 index 000000000000..c8861cbbb8b5 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/loongson,htpic.yaml @@ -0,0 +1,59 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/interrupt-controller/loongson,htpic.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Loongson-3 HyperTransport Interrupt Controller + +maintainers: + - Jiaxun Yang <jiaxun.yang@flygoat.com> + +allOf: + - $ref: /schemas/interrupt-controller.yaml# + +description: | + This interrupt controller is found in the Loongson-3 family of chips to transmit + interrupts from PCH PIC connected on HyperTransport bus. + +properties: + compatible: + const: loongson,htpic-1.0 + + reg: + maxItems: 1 + + interrupts: + minItems: 1 + maxItems: 4 + description: | + Four parent interrupts that receive chained interrupts. 
+ + interrupt-controller: true + + '#interrupt-cells': + const: 1 + +required: + - compatible + - reg + - interrupts + - interrupt-controller + - '#interrupt-cells' + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + htintc: interrupt-controller@1fb000080 { + compatible = "loongson,htpic-1.0"; + reg = <0xfb000080 0x40>; + interrupt-controller; + #interrupt-cells = <1>; + + interrupt-parent = <&liointc>; + interrupts = <24 IRQ_TYPE_LEVEL_HIGH>, + <25 IRQ_TYPE_LEVEL_HIGH>, + <26 IRQ_TYPE_LEVEL_HIGH>, + <27 IRQ_TYPE_LEVEL_HIGH>; + }; +... diff --git a/Documentation/devicetree/bindings/interrupt-controller/loongson,liointc.yaml b/Documentation/devicetree/bindings/interrupt-controller/loongson,liointc.yaml new file mode 100644 index 000000000000..9c6b91fee477 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/loongson,liointc.yaml @@ -0,0 +1,93 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/interrupt-controller/loongson,liointc.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Loongson Local I/O Interrupt Controller + +maintainers: + - Jiaxun Yang <jiaxun.yang@flygoat.com> + +description: | + This interrupt controller is found in the Loongson-3 family of chips as the primary + package interrupt controller which can route local I/O interrupts to interrupt lines + of cores. + +allOf: + - $ref: /schemas/interrupt-controller.yaml# + +properties: + compatible: + oneOf: + - const: loongson,liointc-1.0 + - const: loongson,liointc-1.0a + + reg: + maxItems: 1 + + interrupt-controller: true + + interrupts: + description: + Interrupt source of the CPU interrupts. + minItems: 1 + maxItems: 4 + + interrupt-names: + description: List of names for the parent interrupts. 
+ items: + - const: int0 + - const: int1 + - const: int2 + - const: int3 + minItems: 1 + maxItems: 4 + + '#interrupt-cells': + const: 2 + + 'loongson,parent_int_map': + description: | + This property describes how the child interrupts are mapped onto CPU + interrupt lines. Each cell refers to a parent interrupt line from 0 to 3 + and each bit in the cell refers to a child interrupt from 0 to 31. + If a CPU interrupt line is not connected to liointc, then keep its + cell as zero. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32-array + - items: + minItems: 4 + maxItems: 4 + + +required: + - compatible + - reg + - interrupts + - interrupt-controller + - '#interrupt-cells' + - 'loongson,parent_int_map' + + +examples: + - | + iointc: interrupt-controller@3ff01400 { + compatible = "loongson,liointc-1.0"; + reg = <0x3ff01400 0x64>; + + interrupt-controller; + #interrupt-cells = <2>; + + interrupt-parent = <&cpuintc>; + interrupts = <2>, <3>; + interrupt-names = "int0", "int1"; + + loongson,parent_int_map = <0xf0ffffff>, /* int0 */ + <0x0f000000>, /* int1 */ + <0x00000000>, /* int2 */ + <0x00000000>; /* int3 */ + + }; + +... 
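Editorial note on the liointc binding above: the 'loongson,parent_int_map' encoding (one 32-bit cell per parent line int0..int3, one bit per child interrupt 0..31) can be sketched as follows. This is a minimal illustration only, assuming nothing beyond the binding text; the helper name decode_parent_int_map is hypothetical and is not taken from the driver.

```python
# Hypothetical helper, not part of the kernel: decodes the four cells of a
# 'loongson,parent_int_map' property as described in the binding text.

def decode_parent_int_map(cells):
    """Return {parent_line: [child_irq, ...]} for a 4-cell map."""
    if len(cells) != 4:
        raise ValueError("parent_int_map must contain exactly 4 cells")
    routing = {}
    for parent, cell in enumerate(cells):
        if not 0 <= cell <= 0xFFFFFFFF:
            raise ValueError("cell %d does not fit in 32 bits" % parent)
        # Bit N set in this cell means child interrupt N is routed to this parent line.
        routing[parent] = [bit for bit in range(32) if cell & (1 << bit)]
    return routing


# Cell values copied from the example node in the binding.
EXAMPLE_MAP = [0xF0FFFFFF, 0x0F000000, 0x00000000, 0x00000000]

routing = decode_parent_int_map(EXAMPLE_MAP)
for parent, children in routing.items():
    print("int%d <- %s" % (parent, children if children else "unused"))
```

For the example values, int0 receives child interrupts 0-23 and 28-31, int1 receives 24-27, and the zero cells leave int2 and int3 unused, which matches the two entries in interrupt-names.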
diff --git a/Documentation/devicetree/bindings/leds/common.yaml b/Documentation/devicetree/bindings/leds/common.yaml index d97d099b87e5..c60b994fe116 100644 --- a/Documentation/devicetree/bindings/leds/common.yaml +++ b/Documentation/devicetree/bindings/leds/common.yaml @@ -85,7 +85,7 @@ properties: # LED will act as a back-light, controlled by the framebuffer system - backlight # LED will turn on (but for leds-gpio see "default-state" property in - # Documentation/devicetree/bindings/leds/leds-gpio.txt) + # Documentation/devicetree/bindings/leds/leds-gpio.yaml) - default-on # LED "double" flashes at a load average based rate - heartbeat diff --git a/Documentation/devicetree/bindings/leds/register-bit-led.txt b/Documentation/devicetree/bindings/leds/register-bit-led.txt index cf1ea403ba7a..c7af6f70a97b 100644 --- a/Documentation/devicetree/bindings/leds/register-bit-led.txt +++ b/Documentation/devicetree/bindings/leds/register-bit-led.txt @@ -5,7 +5,7 @@ where single bits in a certain register can turn on/off a single LED. The register bit LEDs appear as children to the syscon device, with the proper compatible string. For the syscon bindings see: -Documentation/devicetree/bindings/mfd/syscon.txt +Documentation/devicetree/bindings/mfd/syscon.yaml Each LED is represented as a sub-node of the syscon device. Each node's name represents the name of the corresponding LED. 
diff --git a/Documentation/devicetree/bindings/media/allwinner,sun4i-a10-csi.yaml b/Documentation/devicetree/bindings/media/allwinner,sun4i-a10-csi.yaml index 9af873b43acd..8453ee340b9f 100644 --- a/Documentation/devicetree/bindings/media/allwinner,sun4i-a10-csi.yaml +++ b/Documentation/devicetree/bindings/media/allwinner,sun4i-a10-csi.yaml @@ -33,24 +33,40 @@ properties: maxItems: 1 clocks: - minItems: 2 - maxItems: 3 - items: - - description: The CSI interface clock - - description: The CSI ISP clock - - description: The CSI DRAM clock + oneOf: + - items: + - description: The CSI interface clock + - description: The CSI DRAM clock + + - items: + - description: The CSI interface clock + - description: The CSI ISP clock + - description: The CSI DRAM clock clock-names: - minItems: 2 - maxItems: 3 - items: - - const: bus - - const: isp - - const: ram + oneOf: + - items: + - const: bus + - const: ram + + - items: + - const: bus + - const: isp + - const: ram resets: maxItems: 1 + # FIXME: This should be made required eventually once every SoC will + # have the MBUS declared. + interconnects: + maxItems: 1 + + # FIXME: This should be made required eventually once every SoC will + # have the MBUS declared. 
+ interconnect-names: + const: dma-mem + # See ./video-interfaces.txt for details port: type: object diff --git a/Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml b/Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml new file mode 100644 index 000000000000..75196d11da58 --- /dev/null +++ b/Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml @@ -0,0 +1,70 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/media/allwinner,sun8i-a83t-de2-rotate.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Allwinner A83T DE2 Rotate Device Tree Bindings + +maintainers: + - Jernej Skrabec <jernej.skrabec@siol.net> + - Chen-Yu Tsai <wens@csie.org> + - Maxime Ripard <mripard@kernel.org> + +description: |- + The Allwinner A83T and A64 have a rotation core used for + rotating and flipping images. + +properties: + compatible: + oneOf: + - const: allwinner,sun8i-a83t-de2-rotate + - items: + - const: allwinner,sun50i-a64-de2-rotate + - const: allwinner,sun8i-a83t-de2-rotate + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Rotate interface clock + - description: Rotate module clock + + clock-names: + items: + - const: bus + - const: mod + + resets: + maxItems: 1 + +required: + - compatible + - reg + - interrupts + - clocks + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/sun8i-de2.h> + #include <dt-bindings/reset/sun8i-de2.h> + + rotate: rotate@1020000 { + compatible = "allwinner,sun8i-a83t-de2-rotate"; + reg = <0x1020000 0x10000>; + interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&display_clocks CLK_BUS_ROT>, + <&display_clocks CLK_ROT>; + clock-names = "bus", + "mod"; + resets = <&display_clocks RST_ROT>; + }; + +... 
diff --git a/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml b/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml index 2e40f700e84f..8707df613f6c 100644 --- a/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml +++ b/Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml @@ -17,7 +17,11 @@ description: |- properties: compatible: - const: allwinner,sun8i-h3-deinterlace + oneOf: + - const: allwinner,sun8i-h3-deinterlace + - items: + - const: allwinner,sun50i-a64-deinterlace + - const: allwinner,sun8i-h3-deinterlace reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/media/aspeed-video.txt b/Documentation/devicetree/bindings/media/aspeed-video.txt index ce2894506e1f..d2ca32512272 100644 --- a/Documentation/devicetree/bindings/media/aspeed-video.txt +++ b/Documentation/devicetree/bindings/media/aspeed-video.txt @@ -1,11 +1,12 @@ * Device tree bindings for Aspeed Video Engine -The Video Engine (VE) embedded in the Aspeed AST2400 and AST2500 SOCs can +The Video Engine (VE) embedded in the Aspeed AST2400/2500/2600 SOCs can capture and compress video data from digital or analog sources. 
Required properties: - compatible: "aspeed,ast2400-video-engine" or - "aspeed,ast2500-video-engine" + "aspeed,ast2500-video-engine" or + "aspeed,ast2600-video-engine" - reg: contains the offset and length of the VE memory region - clocks: clock specifiers for the syscon clocks associated with the VE (ordering must match the clock-names property) diff --git a/Documentation/devicetree/bindings/media/i2c/imx219.yaml b/Documentation/devicetree/bindings/media/i2c/imx219.yaml new file mode 100644 index 000000000000..32d6b693274f --- /dev/null +++ b/Documentation/devicetree/bindings/media/i2c/imx219.yaml @@ -0,0 +1,114 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/media/i2c/imx219.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Sony 1/4.0-Inch 8Mpixel CMOS Digital Image Sensor + +maintainers: + - Dave Stevenson <dave.stevenson@raspberrypi.com> + +description: |- + The Sony imx219 is a 1/4.0-inch CMOS active pixel digital image sensor + with an active array size of 3280H x 2464V. It is programmable through + I2C interface. The I2C address is fixed to 0x10 as per sensor data sheet. + Image data is sent through MIPI CSI-2, which is configured as either 2 or + 4 data lanes. + +properties: + compatible: + const: sony,imx219 + + reg: + description: I2C device address + maxItems: 1 + + clocks: + maxItems: 1 + + VDIG-supply: + description: + Digital I/O voltage supply, 1.8 volts + + VANA-supply: + description: + Analog voltage supply, 2.8 volts + + VDDL-supply: + description: + Digital core voltage supply, 1.2 volts + + reset-gpios: + description: |- + Reference to the GPIO connected to the xclr pin, if any. + Must be released (set high) after all supplies are applied. + + # See ../video-interfaces.txt for more details + port: + type: object + properties: + endpoint: + type: object + properties: + data-lanes: + description: |- + The sensor supports either two-lane, or four-lane operation. 
+ If this property is omitted four-lane operation is assumed. + For two-lane operation the property must be set to <1 2>. + items: + - const: 1 + - const: 2 + + clock-noncontinuous: + type: boolean + description: |- + MIPI CSI-2 clock is non-continuous if this property is present, + otherwise it's continuous. + + link-frequencies: + allOf: + - $ref: /schemas/types.yaml#/definitions/uint64-array + description: + Allowed data bus frequencies. + + required: + - link-frequencies + +required: + - compatible + - reg + - clocks + - VANA-supply + - VDIG-supply + - VDDL-supply + - port + +additionalProperties: false + +examples: + - | + i2c0 { + #address-cells = <1>; + #size-cells = <0>; + + imx219: sensor@10 { + compatible = "sony,imx219"; + reg = <0x10>; + clocks = <&imx219_clk>; + VANA-supply = <&imx219_vana>; /* 2.8v */ + VDIG-supply = <&imx219_vdig>; /* 1.8v */ + VDDL-supply = <&imx219_vddl>; /* 1.2v */ + + port { + imx219_0: endpoint { + remote-endpoint = <&csi1_ep>; + data-lanes = <1 2>; + clock-noncontinuous; + link-frequencies = /bits/ 64 <456000000>; + }; + }; + }; + }; + +... diff --git a/Documentation/devicetree/bindings/media/i2c/tvp5150.txt b/Documentation/devicetree/bindings/media/i2c/tvp5150.txt index 8c0fc1a26bf0..6c88ce858d08 100644 --- a/Documentation/devicetree/bindings/media/i2c/tvp5150.txt +++ b/Documentation/devicetree/bindings/media/i2c/tvp5150.txt @@ -5,38 +5,150 @@ The TVP5150 and TVP5151 are video decoders that convert baseband NTSC and PAL with discrete syncs or 8-bit ITU-R BT.656 with embedded syncs output formats. Required Properties: -- compatible: value must be "ti,tvp5150" -- reg: I2C slave address +==================== +- compatible: Value must be "ti,tvp5150". +- reg: I2C slave address. Optional Properties: -- pdn-gpios: phandle for the GPIO connected to the PDN pin, if any. -- reset-gpios: phandle for the GPIO connected to the RESETB pin, if any. +==================== +- pdn-gpios: Phandle for the GPIO connected to the PDN pin, if any. 
+- reset-gpios: Phandle for the GPIO connected to the RESETB pin, if any. -The device node must contain one 'port' child node for its digital output -video port, in accordance with the video interface bindings defined in -Documentation/devicetree/bindings/media/video-interfaces.txt. +The device node must contain one 'port' child node per device physical input +and output port, in accordance with the video interface bindings defined in +Documentation/devicetree/bindings/media/video-interfaces.txt. The port nodes +are numbered as follows -Required Endpoint Properties for parallel synchronization: + Name Type Port + -------------------------------------- + AIP1A sink 0 + AIP1B sink 1 + Y-OUT src 2 -- hsync-active: active state of the HSYNC signal. Must be <1> (HIGH). -- vsync-active: active state of the VSYNC signal. Must be <1> (HIGH). -- field-even-active: field signal level during the even field data - transmission. Must be <0>. +The device node must contain at least one sink port and the src port. Each input +port must be linked to an endpoint defined in [1]. The port/connector layout is +as follows -If none of hsync-active, vsync-active and field-even-active is specified, -the endpoint is assumed to use embedded BT.656 synchronization. +tvp-5150 port@0 (AIP1A) + endpoint@0 -----------> Comp0-Con port + endpoint@1 ------+----> Svideo-Con port +tvp-5150 port@1 (AIP1B) | + endpoint@1 ------+ + endpoint@0 -----------> Comp1-Con port +tvp-5150 port@2 + endpoint (video bitstream output at YOUT[0-7] parallel bus) -Example: +Required Endpoint Properties for parallel synchronization on output port: +========================================================================= + +- hsync-active: Active state of the HSYNC signal. Must be <1> (HIGH). +- vsync-active: Active state of the VSYNC signal. Must be <1> (HIGH). +- field-even-active: Field signal level during the even field data + transmission. Must be <0>. 
+ +Note: Do not specify any of these properties if you want to use the embedded + BT.656 synchronization. + +Optional Connector Properties: +============================== + +- sdtv-standards: Set the possible signals to which the hardware tries to lock + instead of using the autodetection mechanism. Please look at + [1] for more information. + +[1] Documentation/devicetree/bindings/display/connector/analog-tv-connector.txt. + +Example - three input sources: +#include <dt-bindings/display/sdtv-standards.h> + +comp_connector_0 { + compatible = "composite-video-connector"; + label = "Composite0"; + sdtv-standards = <SDTV_STD_PAL_M>; /* limit to pal-m signals */ + + port { + composite0_to_tvp5150: endpoint { + remote-endpoint = <&tvp5150_to_composite0>; + }; + }; +}; + +comp_connector_1 { + compatible = "composite-video-connector"; + label = "Composite1"; + sdtv-standards = <SDTV_STD_NTSC_M>; /* limit to ntsc-m signals */ + + port { + composite1_to_tvp5150: endpoint { + remote-endpoint = <&tvp5150_to_composite1>; + }; + }; +}; + +svideo_connector { + compatible = "svideo-connector"; + label = "S-Video"; + + port { + #address-cells = <1>; + #size-cells = <0>; + + svideo_luma_to_tvp5150: endpoint@0 { + reg = <0>; + remote-endpoint = <&tvp5150_to_svideo_luma>; + }; + + svideo_chroma_to_tvp5150: endpoint@1 { + reg = <1>; + remote-endpoint = <&tvp5150_to_svideo_chroma>; + }; + }; +}; &i2c2 { - ... 
tvp5150@5c { compatible = "ti,tvp5150"; reg = <0x5c>; pdn-gpios = <&gpio4 30 GPIO_ACTIVE_LOW>; reset-gpios = <&gpio6 7 GPIO_ACTIVE_LOW>; + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + #address-cells = <1>; + #size-cells = <0>; + reg = <0>; + + tvp5150_to_composite0: endpoint@0 { + reg = <0>; + remote-endpoint = <&composite0_to_tvp5150>; + }; + + tvp5150_to_svideo_luma: endpoint@1 { + reg = <1>; + remote-endpoint = <&svideo_luma_to_tvp5150>; + }; + }; + + port@1 { + #address-cells = <1>; + #size-cells = <0>; + reg = <1>; + + tvp5150_to_composite1: endpoint@0 { + reg = <0>; + remote-endpoint = <&composite1_to_tvp5150>; + }; + + tvp5150_to_svideo_chroma: endpoint@1 { + reg = <1>; + remote-endpoint = <&svideo_chroma_to_tvp5150>; + }; + }; + + port@2 { + reg = <2>; - port { tvp5150_1: endpoint { remote-endpoint = <&ccdc_ep>; }; diff --git a/Documentation/devicetree/bindings/media/nxp,imx8mq-vpu.yaml b/Documentation/devicetree/bindings/media/nxp,imx8mq-vpu.yaml new file mode 100644 index 000000000000..a2d1cd77c1e2 --- /dev/null +++ b/Documentation/devicetree/bindings/media/nxp,imx8mq-vpu.yaml @@ -0,0 +1,77 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/nxp,imx8mq-vpu.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Hantro G1/G2 VPU codecs implemented on i.MX8MQ SoCs + +maintainers: + - Philipp Zabel <p.zabel@pengutronix.de> + +description: + Hantro G1/G2 video decode accelerators present on i.MX8MQ SoCs. 
+ +properties: + compatible: + const: nxp,imx8mq-vpu + + reg: + maxItems: 3 + + reg-names: + items: + - const: g1 + - const: g2 + - const: ctrl + + interrupts: + maxItems: 2 + + interrupt-names: + items: + - const: g1 + - const: g2 + + clocks: + maxItems: 3 + + clock-names: + items: + - const: g1 + - const: g2 + - const: bus + + power-domains: + maxItems: 1 + +required: + - compatible + - reg + - reg-names + - interrupts + - interrupt-names + - clocks + - clock-names + +examples: + - | + #include <dt-bindings/clock/imx8mq-clock.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + + vpu: video-codec@38300000 { + compatible = "nxp,imx8mq-vpu"; + reg = <0x38300000 0x10000>, + <0x38310000 0x10000>, + <0x38320000 0x10000>; + reg-names = "g1", "g2", "ctrl"; + interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "g1", "g2"; + clocks = <&clk IMX8MQ_CLK_VPU_G1_ROOT>, + <&clk IMX8MQ_CLK_VPU_G2_ROOT>, + <&clk IMX8MQ_CLK_VPU_DEC_ROOT>; + clock-names = "g1", "g2", "bus"; + power-domains = <&pgc_vpu>; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,msm8916-venus.yaml b/Documentation/devicetree/bindings/media/qcom,msm8916-venus.yaml new file mode 100644 index 000000000000..f9606df02d70 --- /dev/null +++ b/Documentation/devicetree/bindings/media/qcom,msm8916-venus.yaml @@ -0,0 +1,119 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/qcom,msm8916-venus.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Venus video encode and decode accelerators + +maintainers: + - Stanimir Varbanov <stanimir.varbanov@linaro.org> + +description: | + The Venus IP is a video encode and decode accelerator present + on Qualcomm platforms + +properties: + compatible: + const: qcom,msm8916-venus + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + power-domains: + maxItems: 1 + + clocks: + maxItems: 3 + + clock-names: + items: + - 
const: core + - const: iface + - const: bus + + iommus: + maxItems: 1 + + memory-region: + maxItems: 1 + + video-decoder: + type: object + + properties: + compatible: + const: "venus-decoder" + + required: + - compatible + + additionalProperties: false + + video-encoder: + type: object + + properties: + compatible: + const: "venus-encoder" + + required: + - compatible + + additionalProperties: false + + video-firmware: + type: object + + description: | + Firmware subnode is needed when the platform does not + have TrustZone. + + properties: + iommus: + maxItems: 1 + + required: + - iommus + +required: + - compatible + - reg + - interrupts + - power-domains + - clocks + - clock-names + - iommus + - memory-region + - video-decoder + - video-encoder + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/qcom,gcc-msm8916.h> + + video-codec@1d00000 { + compatible = "qcom,msm8916-venus"; + reg = <0x01d00000 0xff000>; + interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&gcc GCC_VENUS0_VCODEC0_CLK>, + <&gcc GCC_VENUS0_AHB_CLK>, + <&gcc GCC_VENUS0_AXI_CLK>; + clock-names = "core", "iface", "bus"; + power-domains = <&gcc VENUS_GDSC>; + iommus = <&apps_iommu 5>; + memory-region = <&venus_mem>; + + video-decoder { + compatible = "venus-decoder"; + }; + + video-encoder { + compatible = "venus-encoder"; + }; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,msm8996-venus.yaml b/Documentation/devicetree/bindings/media/qcom,msm8996-venus.yaml new file mode 100644 index 000000000000..fa0dc6c47f1d --- /dev/null +++ b/Documentation/devicetree/bindings/media/qcom,msm8996-venus.yaml @@ -0,0 +1,172 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/qcom,msm8996-venus.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Venus video encode and decode accelerators + +maintainers: + - Stanimir Varbanov 
<stanimir.varbanov@linaro.org> + +description: | + The Venus IP is a video encode and decode accelerator present + on Qualcomm platforms + +properties: + compatible: + const: qcom,msm8996-venus + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + power-domains: + maxItems: 1 + + clocks: + maxItems: 4 + + clock-names: + items: + - const: core + - const: iface + - const: bus + - const: mbus + + iommus: + maxItems: 20 + + memory-region: + maxItems: 1 + + video-decoder: + type: object + + properties: + compatible: + const: venus-decoder + + clocks: + maxItems: 1 + + clock-names: + items: + - const: core + + power-domains: + maxItems: 1 + + required: + - compatible + - clocks + - clock-names + - power-domains + + additionalProperties: false + + video-encoder: + type: object + + properties: + compatible: + const: venus-encoder + + clocks: + maxItems: 1 + + clock-names: + items: + - const: core + + power-domains: + maxItems: 1 + + required: + - compatible + - clocks + - clock-names + - power-domains + + additionalProperties: false + + video-firmware: + type: object + + description: | + Firmware subnode is needed when the platform does not + have TrustZone. 
+ + properties: + iommus: + maxItems: 1 + + required: + - iommus + +required: + - compatible + - reg + - interrupts + - power-domains + - clocks + - clock-names + - iommus + - memory-region + - video-decoder + - video-encoder + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/qcom,mmcc-msm8996.h> + + video-codec@c00000 { + compatible = "qcom,msm8996-venus"; + reg = <0x00c00000 0xff000>; + interrupts = <GIC_SPI 287 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&mmcc VIDEO_CORE_CLK>, + <&mmcc VIDEO_AHB_CLK>, + <&mmcc VIDEO_AXI_CLK>, + <&mmcc VIDEO_MAXI_CLK>; + clock-names = "core", "iface", "bus", "mbus"; + power-domains = <&mmcc VENUS_GDSC>; + iommus = <&venus_smmu 0x00>, + <&venus_smmu 0x01>, + <&venus_smmu 0x0a>, + <&venus_smmu 0x07>, + <&venus_smmu 0x0e>, + <&venus_smmu 0x0f>, + <&venus_smmu 0x08>, + <&venus_smmu 0x09>, + <&venus_smmu 0x0b>, + <&venus_smmu 0x0c>, + <&venus_smmu 0x0d>, + <&venus_smmu 0x10>, + <&venus_smmu 0x11>, + <&venus_smmu 0x21>, + <&venus_smmu 0x28>, + <&venus_smmu 0x29>, + <&venus_smmu 0x2b>, + <&venus_smmu 0x2c>, + <&venus_smmu 0x2d>, + <&venus_smmu 0x31>; + memory-region = <&venus_mem>; + + video-decoder { + compatible = "venus-decoder"; + clocks = <&mmcc VIDEO_SUBCORE0_CLK>; + clock-names = "core"; + power-domains = <&mmcc VENUS_CORE0_GDSC>; + }; + + video-encoder { + compatible = "venus-encoder"; + clocks = <&mmcc VIDEO_SUBCORE1_CLK>; + clock-names = "core"; + power-domains = <&mmcc VENUS_CORE1_GDSC>; + }; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,sc7180-venus.yaml b/Documentation/devicetree/bindings/media/qcom,sc7180-venus.yaml new file mode 100644 index 000000000000..764affa4877e --- /dev/null +++ b/Documentation/devicetree/bindings/media/qcom,sc7180-venus.yaml @@ -0,0 +1,140 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/qcom,sc7180-venus.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" 
+ +title: Qualcomm Venus video encode and decode accelerators + +maintainers: + - Stanimir Varbanov <stanimir.varbanov@linaro.org> + +description: | + The Venus IP is a video encode and decode accelerator present + on Qualcomm platforms + +properties: + compatible: + const: qcom,sc7180-venus + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + power-domains: + maxItems: 2 + + power-domain-names: + items: + - const: venus + - const: vcodec0 + + clocks: + maxItems: 5 + + clock-names: + items: + - const: core + - const: iface + - const: bus + - const: vcodec0_core + - const: vcodec0_bus + + iommus: + maxItems: 1 + + memory-region: + maxItems: 1 + + interconnects: + maxItems: 2 + + interconnect-names: + items: + - const: video-mem + - const: cpu-cfg + + video-decoder: + type: object + + properties: + compatible: + const: venus-decoder + + required: + - compatible + + additionalProperties: false + + video-encoder: + type: object + + properties: + compatible: + const: venus-encoder + + required: + - compatible + + additionalProperties: false + + video-firmware: + type: object + + description: | + Firmware subnode is needed when the platform does not + have TrustZone. 
+ + properties: + iommus: + maxItems: 1 + + required: + - iommus + +required: + - compatible + - reg + - interrupts + - power-domains + - power-domain-names + - clocks + - clock-names + - iommus + - memory-region + - video-decoder + - video-encoder + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/qcom,videocc-sc7180.h> + + venus: video-codec@aa00000 { + compatible = "qcom,sc7180-venus"; + reg = <0 0x0aa00000 0 0xff000>; + interrupts = <GIC_SPI 174 IRQ_TYPE_LEVEL_HIGH>; + power-domains = <&videocc VENUS_GDSC>, + <&videocc VCODEC0_GDSC>; + power-domain-names = "venus", "vcodec0"; + clocks = <&videocc VIDEO_CC_VENUS_CTL_CORE_CLK>, + <&videocc VIDEO_CC_VENUS_AHB_CLK>, + <&videocc VIDEO_CC_VENUS_CTL_AXI_CLK>, + <&videocc VIDEO_CC_VCODEC0_CORE_CLK>, + <&videocc VIDEO_CC_VCODEC0_AXI_CLK>; + clock-names = "core", "iface", "bus", + "vcodec0_core", "vcodec0_bus"; + iommus = <&apps_smmu 0x0c00 0x60>; + memory-region = <&venus_mem>; + + video-decoder { + compatible = "venus-decoder"; + }; + + video-encoder { + compatible = "venus-encoder"; + }; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,sdm845-venus-v2.yaml b/Documentation/devicetree/bindings/media/qcom,sdm845-venus-v2.yaml new file mode 100644 index 000000000000..8552f4ab907e --- /dev/null +++ b/Documentation/devicetree/bindings/media/qcom,sdm845-venus-v2.yaml @@ -0,0 +1,140 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/qcom,sdm845-venus-v2.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Venus video encode and decode accelerators + +maintainers: + - Stanimir Varbanov <stanimir.varbanov@linaro.org> + +description: | + The Venus IP is a video encode and decode accelerator present + on Qualcomm platforms + +properties: + compatible: + const: qcom,sdm845-venus-v2 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + power-domains: + maxItems: 3 + 
+ power-domain-names: + items: + - const: venus + - const: vcodec0 + - const: vcodec1 + + clocks: + maxItems: 7 + + clock-names: + items: + - const: core + - const: iface + - const: bus + - const: vcodec0_core + - const: vcodec0_bus + - const: vcodec1_core + - const: vcodec1_bus + + iommus: + maxItems: 2 + + memory-region: + maxItems: 1 + + video-core0: + type: object + + properties: + compatible: + const: venus-decoder + + required: + - compatible + + additionalProperties: false + + video-core1: + type: object + + properties: + compatible: + const: venus-encoder + + required: + - compatible + + additionalProperties: false + + video-firmware: + type: object + + description: | + Firmware subnode is needed when the platform does not + have TrustZone. + + properties: + iommus: + maxItems: 1 + + required: + - iommus + +required: + - compatible + - reg + - interrupts + - power-domains + - power-domain-names + - clocks + - clock-names + - iommus + - memory-region + - video-core0 + - video-core1 + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/qcom,videocc-sdm845.h> + + video-codec@aa00000 { + compatible = "qcom,sdm845-venus-v2"; + reg = <0 0x0aa00000 0 0xff000>; + interrupts = <GIC_SPI 174 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&videocc VIDEO_CC_VENUS_CTL_CORE_CLK>, + <&videocc VIDEO_CC_VENUS_AHB_CLK>, + <&videocc VIDEO_CC_VENUS_CTL_AXI_CLK>, + <&videocc VIDEO_CC_VCODEC0_CORE_CLK>, + <&videocc VIDEO_CC_VCODEC0_AXI_CLK>, + <&videocc VIDEO_CC_VCODEC1_CORE_CLK>, + <&videocc VIDEO_CC_VCODEC1_AXI_CLK>; + clock-names = "core", "iface", "bus", + "vcodec0_core", "vcodec0_bus", + "vcodec1_core", "vcodec1_bus"; + power-domains = <&videocc VENUS_GDSC>, + <&videocc VCODEC0_GDSC>, + <&videocc VCODEC1_GDSC>; + power-domain-names = "venus", "vcodec0", "vcodec1"; + iommus = <&apps_smmu 0x10a0 0x8>, + <&apps_smmu 0x10b0 0x0>; + memory-region = <&venus_mem>; + + video-core0 { + compatible = "venus-decoder"; + }; + + video-core1 { + 
compatible = "venus-encoder"; + }; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,sdm845-venus.yaml b/Documentation/devicetree/bindings/media/qcom,sdm845-venus.yaml new file mode 100644 index 000000000000..05cabe4e893a --- /dev/null +++ b/Documentation/devicetree/bindings/media/qcom,sdm845-venus.yaml @@ -0,0 +1,156 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/media/qcom,sdm845-venus.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Venus video encode and decode accelerators + +maintainers: + - Stanimir Varbanov <stanimir.varbanov@linaro.org> + +description: | + The Venus IP is a video encode and decode accelerator present + on Qualcomm platforms + +properties: + compatible: + const: qcom,sdm845-venus + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + power-domains: + maxItems: 1 + + clocks: + maxItems: 3 + + clock-names: + items: + - const: core + - const: iface + - const: bus + + iommus: + maxItems: 2 + + memory-region: + maxItems: 1 + + video-core0: + type: object + + properties: + compatible: + const: venus-decoder + + clocks: + maxItems: 2 + + clock-names: + items: + - const: core + - const: bus + + power-domains: + maxItems: 1 + + required: + - compatible + - clocks + - clock-names + - power-domains + + additionalProperties: false + + video-core1: + type: object + + properties: + compatible: + const: venus-encoder + + clocks: + maxItems: 2 + + clock-names: + items: + - const: core + - const: bus + + power-domains: + maxItems: 1 + + required: + - compatible + - clocks + - clock-names + - power-domains + + additionalProperties: false + + video-firmware: + type: object + + description: | + Firmware subnode is needed when the platform does not + have TrustZone. 
+ + properties: + iommus: + maxItems: 1 + + required: + - iommus + +required: + - compatible + - reg + - interrupts + - power-domains + - clocks + - clock-names + - iommus + - memory-region + - video-core0 + - video-core1 + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/qcom,videocc-sdm845.h> + + video-codec@aa00000 { + compatible = "qcom,sdm845-venus"; + reg = <0 0x0aa00000 0 0xff000>; + interrupts = <GIC_SPI 174 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&videocc VIDEO_CC_VENUS_CTL_CORE_CLK>, + <&videocc VIDEO_CC_VENUS_AHB_CLK>, + <&videocc VIDEO_CC_VENUS_CTL_AXI_CLK>; + clock-names = "core", "iface", "bus"; + power-domains = <&videocc VENUS_GDSC>; + iommus = <&apps_smmu 0x10a0 0x8>, + <&apps_smmu 0x10b0 0x0>; + memory-region = <&venus_mem>; + + video-core0 { + compatible = "venus-decoder"; + clocks = <&videocc VIDEO_CC_VCODEC0_CORE_CLK>, + <&videocc VIDEO_CC_VCODEC0_AXI_CLK>; + clock-names = "core", "bus"; + power-domains = <&videocc VCODEC0_GDSC>; + }; + + video-core1 { + compatible = "venus-encoder"; + clocks = <&videocc VIDEO_CC_VCODEC1_CORE_CLK>, + <&videocc VIDEO_CC_VCODEC1_AXI_CLK>; + clock-names = "core", "bus"; + power-domains = <&videocc VCODEC1_GDSC>; + }; + }; diff --git a/Documentation/devicetree/bindings/media/qcom,venus.txt b/Documentation/devicetree/bindings/media/qcom,venus.txt deleted file mode 100644 index b602c4c025e7..000000000000 --- a/Documentation/devicetree/bindings/media/qcom,venus.txt +++ /dev/null @@ -1,120 +0,0 @@ -* Qualcomm Venus video encoder/decoder accelerators - -- compatible: - Usage: required - Value type: <stringlist> - Definition: Value should contain one of: - - "qcom,msm8916-venus" - - "qcom,msm8996-venus" - - "qcom,sdm845-venus" -- reg: - Usage: required - Value type: <prop-encoded-array> - Definition: Register base address and length of the register map. -- interrupts: - Usage: required - Value type: <prop-encoded-array> - Definition: Should contain interrupt line number. 
-- clocks: - Usage: required - Value type: <prop-encoded-array> - Definition: A List of phandle and clock specifier pairs as listed - in clock-names property. -- clock-names: - Usage: required for msm8916 - Value type: <stringlist> - Definition: Should contain the following entries: - - "core" Core video accelerator clock - - "iface" Video accelerator AHB clock - - "bus" Video accelerator AXI clock -- clock-names: - Usage: required for msm8996 - Value type: <stringlist> - Definition: Should contain the following entries: - - "core" Core video accelerator clock - - "iface" Video accelerator AHB clock - - "bus" Video accelerator AXI clock - - "mbus" Video MAXI clock -- power-domains: - Usage: required - Value type: <prop-encoded-array> - Definition: A phandle and power domain specifier pairs to the - power domain which is responsible for collapsing - and restoring power to the peripheral. -- iommus: - Usage: required - Value type: <prop-encoded-array> - Definition: A list of phandle and IOMMU specifier pairs. -- memory-region: - Usage: required - Value type: <phandle> - Definition: reference to the reserved-memory for the firmware - memory region. - -* Subnodes -The Venus video-codec node must contain two subnodes representing -video-decoder and video-encoder, and one optional firmware subnode. -Firmware subnode is needed when the platform does not have TrustZone. - -Every of video-encoder or video-decoder subnode should have: - -- compatible: - Usage: required - Value type: <stringlist> - Definition: Value should contain "venus-decoder" or "venus-encoder" -- clocks: - Usage: required for msm8996 - Value type: <prop-encoded-array> - Definition: A List of phandle and clock specifier pairs as listed - in clock-names property. 
-- clock-names: - Usage: required for msm8996 - Value type: <stringlist> - Definition: Should contain the following entries: - - "core" Subcore video accelerator clock - -- power-domains: - Usage: required for msm8996 - Value type: <prop-encoded-array> - Definition: A phandle and power domain specifier pairs to the - power domain which is responsible for collapsing - and restoring power to the subcore. - -The firmware subnode must have: - -- iommus: - Usage: required - Value type: <prop-encoded-array> - Definition: A list of phandle and IOMMU specifier pairs. - -* An Example - video-codec@1d00000 { - compatible = "qcom,msm8916-venus"; - reg = <0x01d00000 0xff000>; - interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&gcc GCC_VENUS0_VCODEC0_CLK>, - <&gcc GCC_VENUS0_AHB_CLK>, - <&gcc GCC_VENUS0_AXI_CLK>; - clock-names = "core", "iface", "bus"; - power-domains = <&gcc VENUS_GDSC>; - iommus = <&apps_iommu 5>; - memory-region = <&venus_mem>; - - video-decoder { - compatible = "venus-decoder"; - clocks = <&mmcc VIDEO_SUBCORE0_CLK>; - clock-names = "core"; - power-domains = <&mmcc VENUS_CORE0_GDSC>; - }; - - video-encoder { - compatible = "venus-encoder"; - clocks = <&mmcc VIDEO_SUBCORE1_CLK>; - clock-names = "core"; - power-domains = <&mmcc VENUS_CORE1_GDSC>; - }; - - video-firmware { - iommus = <&apps_iommu 0x10b2 0x0>; - }; - }; diff --git a/Documentation/devicetree/bindings/media/rc.yaml b/Documentation/devicetree/bindings/media/rc.yaml index a64ee038d235..b27c9385d490 100644 --- a/Documentation/devicetree/bindings/media/rc.yaml +++ b/Documentation/devicetree/bindings/media/rc.yaml @@ -143,6 +143,7 @@ properties: - rc-videomate-k100 - rc-videomate-s350 - rc-videomate-tv-pvr + - rc-videostrong-kii-pro - rc-wetek-hub - rc-wetek-play2 - rc-winfast diff --git a/Documentation/devicetree/bindings/media/rockchip-rga.txt b/Documentation/devicetree/bindings/media/rockchip-rga.txt index fd5276abfad6..c53a8e5133f6 100644 --- 
a/Documentation/devicetree/bindings/media/rockchip-rga.txt +++ b/Documentation/devicetree/bindings/media/rockchip-rga.txt @@ -6,8 +6,9 @@ BitBLT, alpha blending and image blur/sharpness. Required properties: - compatible: value should be one of the following - "rockchip,rk3288-rga"; - "rockchip,rk3399-rga"; + "rockchip,rk3228-rga", "rockchip,rk3288-rga": for Rockchip RK3228 + "rockchip,rk3288-rga": for Rockchip RK3288 + "rockchip,rk3399-rga": for Rockchip RK3399 - interrupts: RGA interrupt specifier. diff --git a/Documentation/devicetree/bindings/media/ti,cal.yaml b/Documentation/devicetree/bindings/media/ti,cal.yaml index 1ea784179536..5e066629287d 100644 --- a/Documentation/devicetree/bindings/media/ti,cal.yaml +++ b/Documentation/devicetree/bindings/media/ti,cal.yaml @@ -177,7 +177,7 @@ examples: }; }; - i2c5: i2c@4807c000 { + i2c { clock-frequency = <400000>; #address-cells = <1>; #size-cells = <0>; diff --git a/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra124-emc.yaml b/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra124-emc.yaml index dd1843489ad1..3e0a8a92d652 100644 --- a/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra124-emc.yaml +++ b/Documentation/devicetree/bindings/memory-controllers/nvidia,tegra124-emc.yaml @@ -347,6 +347,7 @@ examples: interrupts = <GIC_SPI 77 IRQ_TYPE_LEVEL_HIGH>; #iommu-cells = <1>; + #reset-cells = <1>; }; external-memory-controller@7001b000 { @@ -363,20 +364,23 @@ examples: timing-0 { clock-frequency = <12750000>; - nvidia,emc-zcal-cnt-long = <0x00000042>; - nvidia,emc-auto-cal-interval = <0x001fffff>; - nvidia,emc-ctt-term-ctrl = <0x00000802>; - nvidia,emc-cfg = <0x73240000>; - nvidia,emc-cfg-2 = <0x000008c5>; - nvidia,emc-sel-dpd-ctrl = <0x00040128>; - nvidia,emc-bgbias-ctl0 = <0x00000008>; nvidia,emc-auto-cal-config = <0xa1430000>; nvidia,emc-auto-cal-config2 = <0x00000000>; nvidia,emc-auto-cal-config3 = <0x00000000>; - nvidia,emc-mode-reset = <0x80001221>; + 
nvidia,emc-auto-cal-interval = <0x001fffff>; + nvidia,emc-bgbias-ctl0 = <0x00000008>; + nvidia,emc-cfg = <0x73240000>; + nvidia,emc-cfg-2 = <0x000008c5>; + nvidia,emc-ctt-term-ctrl = <0x00000802>; nvidia,emc-mode-1 = <0x80100003>; nvidia,emc-mode-2 = <0x80200008>; nvidia,emc-mode-4 = <0x00000000>; + nvidia,emc-mode-reset = <0x80001221>; + nvidia,emc-mrs-wait-cnt = <0x000e000e>; + nvidia,emc-sel-dpd-ctrl = <0x00040128>; + nvidia,emc-xm2dqspadctrl2 = <0x0130b118>; + nvidia,emc-zcal-cnt-long = <0x00000042>; + nvidia,emc-zcal-interval = <0x00000000>; nvidia,emc-configuration = < 0x00000000 /* EMC_RC */ diff --git a/Documentation/devicetree/bindings/memory-controllers/ti/emif.txt b/Documentation/devicetree/bindings/memory-controllers/ti/emif.txt index 44d71469c914..63f674ffeb4f 100644 --- a/Documentation/devicetree/bindings/memory-controllers/ti/emif.txt +++ b/Documentation/devicetree/bindings/memory-controllers/ti/emif.txt @@ -32,7 +32,7 @@ Required only for "ti,emif-am3352" and "ti,emif-am4372": - sram : Phandles for generic sram driver nodes, first should be type 'protect-exec' for the driver to use to copy and run PM functions, second should be regular pool to be used for - data region for code. See Documentation/devicetree/bindings/sram/sram.txt + data region for code. See Documentation/devicetree/bindings/sram/sram.yaml for more details. 
Optional properties: diff --git a/Documentation/devicetree/bindings/mfd/max77650.yaml b/Documentation/devicetree/bindings/mfd/max77650.yaml index 4a70f875a6eb..480385789394 100644 --- a/Documentation/devicetree/bindings/mfd/max77650.yaml +++ b/Documentation/devicetree/bindings/mfd/max77650.yaml @@ -97,14 +97,14 @@ examples: regulators { compatible = "maxim,max77650-regulator"; - max77650_ldo: regulator@0 { + max77650_ldo: regulator-ldo { regulator-compatible = "ldo"; regulator-name = "max77650-ldo"; regulator-min-microvolt = <1350000>; regulator-max-microvolt = <2937500>; }; - max77650_sbb0: regulator@1 { + max77650_sbb0: regulator-sbb0 { regulator-compatible = "sbb0"; regulator-name = "max77650-sbb0"; regulator-min-microvolt = <800000>; diff --git a/Documentation/devicetree/bindings/mfd/qcom-rpm.txt b/Documentation/devicetree/bindings/mfd/qcom-rpm.txt index 3c91ad430eea..b823b8625243 100644 --- a/Documentation/devicetree/bindings/mfd/qcom-rpm.txt +++ b/Documentation/devicetree/bindings/mfd/qcom-rpm.txt @@ -61,6 +61,7 @@ Regulator nodes are identified by their compatible: "qcom,rpm-pm8901-regulators" "qcom,rpm-pm8921-regulators" "qcom,rpm-pm8018-regulators" + "qcom,rpm-smb208-regulators" - vdd_l0_l1_lvs-supply: - vdd_l2_l11_l12-supply: @@ -171,6 +172,9 @@ pm8018: s1, s2, s3, s4, s5, , l1, l2, l3, l4, l5, l6, l7, l8, l9, l10, l11, l12, l14, lvs1 +smb208: + s1a, s1b, s2a, s2b + The content of each sub-node is defined by the standard binding for regulators - see regulator.txt - with additional custom properties described below: diff --git a/Documentation/devicetree/bindings/mfd/tps65910.txt b/Documentation/devicetree/bindings/mfd/tps65910.txt index 4f62143afd24..a5ced46bbde9 100644 --- a/Documentation/devicetree/bindings/mfd/tps65910.txt +++ b/Documentation/devicetree/bindings/mfd/tps65910.txt @@ -26,8 +26,8 @@ Required properties: ldo6, ldo7, ldo8 - xxx-supply: Input voltage supply regulator. - These entries are require if regulators are enabled for a device. 
Missing of these - properties can cause the regulator registration fails. + These entries are required if regulators are enabled for a device. Missing these + properties can cause the regulator registration to fail. If some of input supply is powered through battery or always-on supply then also it is require to have these parameters with proper node handle of always on power supply. diff --git a/Documentation/devicetree/bindings/mfd/twl-familly.txt b/Documentation/devicetree/bindings/mfd/twl-family.txt index 56f244b5d8a4..56f244b5d8a4 100644 --- a/Documentation/devicetree/bindings/mfd/twl-familly.txt +++ b/Documentation/devicetree/bindings/mfd/twl-family.txt diff --git a/Documentation/devicetree/bindings/mfd/zii,rave-sp.txt b/Documentation/devicetree/bindings/mfd/zii,rave-sp.txt index 088eff9ddb78..e0f901edc063 100644 --- a/Documentation/devicetree/bindings/mfd/zii,rave-sp.txt +++ b/Documentation/devicetree/bindings/mfd/zii,rave-sp.txt @@ -20,7 +20,7 @@ RAVE SP consists of the following sub-devices: Device Description ------ ----------- rave-sp-wdt : Watchdog -rave-sp-nvmem : Interface to onborad EEPROM +rave-sp-nvmem : Interface to onboard EEPROM rave-sp-backlight : Display backlight rave-sp-hwmon : Interface to onboard hardware sensors rave-sp-leds : Interface to onboard LEDs diff --git a/Documentation/devicetree/bindings/mips/loongson/devices.yaml b/Documentation/devicetree/bindings/mips/loongson/devices.yaml new file mode 100644 index 000000000000..74ed4e397a78 --- /dev/null +++ b/Documentation/devicetree/bindings/mips/loongson/devices.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/mips/loongson/devices.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Loongson based Platforms Device Tree Bindings + +maintainers: + - Jiaxun Yang <jiaxun.yang@flygoat.com> +description: | + Devices with a Loongson CPU shall have the following properties. 
+ +properties: + $nodename: + const: '/' + compatible: + oneOf: + + - description: Generic Loongson3 Quad Core + RS780E + items: + - const: loongson,loongson3-4core-rs780e + + - description: Generic Loongson3 Octa Core + RS780E + items: + - const: loongson,loongson3-8core-rs780e +... diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt index bb7e896cb644..9134e9bcca56 100644 --- a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt +++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt @@ -26,7 +26,7 @@ For generic IOMMU bindings, see Documentation/devicetree/bindings/iommu/iommu.txt. For arm-smmu binding, see: -Documentation/devicetree/bindings/iommu/arm,smmu.txt. +Documentation/devicetree/bindings/iommu/arm,smmu.yaml. Required properties: diff --git a/Documentation/devicetree/bindings/mmc/mmc-controller.yaml b/Documentation/devicetree/bindings/mmc/mmc-controller.yaml index 3c0df4016a12..8fded83c519a 100644 --- a/Documentation/devicetree/bindings/mmc/mmc-controller.yaml +++ b/Documentation/devicetree/bindings/mmc/mmc-controller.yaml @@ -370,6 +370,7 @@ examples: mmc3: mmc@1c12000 { #address-cells = <1>; #size-cells = <0>; + reg = <0x1c12000 0x200>; pinctrl-names = "default"; pinctrl-0 = <&mmc3_pins_a>; vmmc-supply = <&reg_vmmc3>; diff --git a/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt b/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt index 19f5508a7569..4a9145ef15d6 100644 --- a/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt +++ b/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt @@ -124,7 +124,7 @@ not every application needs SDIO irq, e.g. MMC cards. pinctrl-1 = <&mmc1_idle>; pinctrl-2 = <&mmc1_sleep>; ... 
- interrupts-extended = <&intc 64 &gpio2 28 GPIO_ACTIVE_LOW>; + interrupts-extended = <&intc 64 &gpio2 28 IRQ_TYPE_LEVEL_LOW>; }; mmc1_idle : pinmux_cirq_pin { diff --git a/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt b/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt index f3893c4d3c6a..d2eada5044b2 100644 --- a/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt +++ b/Documentation/devicetree/bindings/mtd/cadence-nand-controller.txt @@ -27,7 +27,7 @@ Required properties of NAND chips: - reg: shall contain the native Chip Select ids from 0 to max supported by the cadence nand flash controller -See Documentation/devicetree/bindings/mtd/nand.txt for more details on +See Documentation/devicetree/bindings/mtd/nand-controller.yaml for more details on generic bindings. Example: diff --git a/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt b/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt index 48a7f916c5e4..88b57b0ca1f4 100644 --- a/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt +++ b/Documentation/devicetree/bindings/net/brcm,bcm7445-switch-v4.0.txt @@ -45,7 +45,7 @@ Optional properties: switch queue - resets: a single phandle and reset identifier pair. See - Documentation/devicetree/binding/reset/reset.txt for details. + Documentation/devicetree/bindings/reset/reset.txt for details. - reset-names: If the "reset" property is specified, this property should have the value "switch" to denote the switch reset line. 
diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt b/Documentation/devicetree/bindings/net/fsl-fman.txt index 250f8d8cdce4..c00fb0d22c7b 100644 --- a/Documentation/devicetree/bindings/net/fsl-fman.txt +++ b/Documentation/devicetree/bindings/net/fsl-fman.txt @@ -110,6 +110,13 @@ PROPERTIES Usage: required Definition: See soc/fsl/qman.txt and soc/fsl/bman.txt +- fsl,erratum-a050385 + Usage: optional + Value type: boolean + Definition: A boolean property. Indicates the presence of the + erratum A050385 which indicates that DMA transactions that are + split can result in a FMan lock. + ============================================================================= FMan MURAM Node diff --git a/Documentation/devicetree/bindings/net/mdio.yaml b/Documentation/devicetree/bindings/net/mdio.yaml index 5d08d2ffd4eb..50c3397a82bc 100644 --- a/Documentation/devicetree/bindings/net/mdio.yaml +++ b/Documentation/devicetree/bindings/net/mdio.yaml @@ -56,7 +56,6 @@ patternProperties: examples: - | davinci_mdio: mdio@5c030000 { - compatible = "ti,davinci_mdio"; reg = <0x5c030000 0x1000>; #address-cells = <1>; #size-cells = <0>; diff --git a/Documentation/devicetree/bindings/nvmem/nvmem.yaml b/Documentation/devicetree/bindings/nvmem/nvmem.yaml index b43c6c65294e..65980224d550 100644 --- a/Documentation/devicetree/bindings/nvmem/nvmem.yaml +++ b/Documentation/devicetree/bindings/nvmem/nvmem.yaml @@ -76,6 +76,8 @@ examples: qfprom: eeprom@700000 { #address-cells = <1>; #size-cells = <1>; + reg = <0x00700000 0x100000>; + wp-gpios = <&gpio1 3 GPIO_ACTIVE_HIGH>; /* ... 
*/ diff --git a/Documentation/devicetree/bindings/opp/qcom-nvmem-cpufreq.txt b/Documentation/devicetree/bindings/opp/qcom-nvmem-cpufreq.txt index 4751029b9b74..64f07417ecfb 100644 --- a/Documentation/devicetree/bindings/opp/qcom-nvmem-cpufreq.txt +++ b/Documentation/devicetree/bindings/opp/qcom-nvmem-cpufreq.txt @@ -19,7 +19,8 @@ In 'cpu' nodes: In 'operating-points-v2' table: - compatible: Should be - - 'operating-points-v2-kryo-cpu' for apq8096 and msm8996. + - 'operating-points-v2-kryo-cpu' for apq8096, msm8996, msm8974, + apq8064, ipq8064, msm8960 and ipq8074. Optional properties: -------------------- diff --git a/Documentation/devicetree/bindings/phy/allwinner,sun4i-a10-usb-phy.yaml b/Documentation/devicetree/bindings/phy/allwinner,sun4i-a10-usb-phy.yaml index 020ef9e4c411..94ac23687b7e 100644 --- a/Documentation/devicetree/bindings/phy/allwinner,sun4i-a10-usb-phy.yaml +++ b/Documentation/devicetree/bindings/phy/allwinner,sun4i-a10-usb-phy.yaml @@ -86,7 +86,7 @@ examples: #include <dt-bindings/clock/sun4i-a10-ccu.h> #include <dt-bindings/reset/sun4i-a10-ccu.h> - usbphy: phy@01c13400 { + usbphy: phy@1c13400 { #phy-cells = <1>; compatible = "allwinner,sun4i-a10-usb-phy"; reg = <0x01c13400 0x10>, <0x01c14800 0x4>, <0x01c1c800 0x4>; diff --git a/Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb2-phy.yaml b/Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb2-phy.yaml index 57d8603076bd..9e32cb43fb21 100644 --- a/Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb2-phy.yaml +++ b/Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb2-phy.yaml @@ -14,6 +14,7 @@ properties: compatible: enum: - amlogic,meson-g12a-usb2-phy + - amlogic,meson-a1-usb2-phy reg: maxItems: 1 @@ -49,6 +50,19 @@ required: - reset-names - "#phy-cells" +if: + properties: + compatible: + enum: + - amlogic,meson-a1-usb-ctrl + +then: + properties: + power-domains: + maxItems: 1 + required: + - power-domains + examples: - | phy@36000 { diff --git 
a/Documentation/devicetree/bindings/phy/phy-cadence-dp.txt b/Documentation/devicetree/bindings/phy/phy-cadence-dp.txt deleted file mode 100644 index 7f49fd54ebc1..000000000000 --- a/Documentation/devicetree/bindings/phy/phy-cadence-dp.txt +++ /dev/null @@ -1,30 +0,0 @@ -Cadence MHDP DisplayPort SD0801 PHY binding -=========================================== - -This binding describes the Cadence SD0801 PHY hardware included with -the Cadence MHDP DisplayPort controller. - -------------------------------------------------------------------------------- -Required properties (controller (parent) node): -- compatible : Should be "cdns,dp-phy" -- reg : Defines the following sets of registers in the parent - mhdp device: - - Offset of the DPTX PHY configuration registers - - Offset of the SD0801 PHY configuration registers -- #phy-cells : from the generic PHY bindings, must be 0. - -Optional properties: -- num_lanes : Number of DisplayPort lanes to use (1, 2 or 4) -- max_bit_rate : Maximum DisplayPort link bit rate to use, in Mbps (2160, - 2430, 2700, 3240, 4320, 5400 or 8100) -------------------------------------------------------------------------------- - -Example: - dp_phy: phy@f0fb030a00 { - compatible = "cdns,dp-phy"; - reg = <0xf0 0xfb030a00 0x0 0x00000040>, - <0xf0 0xfb500000 0x0 0x00100000>; - num_lanes = <4>; - max_bit_rate = <8100>; - #phy-cells = <0>; - }; diff --git a/Documentation/devicetree/bindings/phy/phy-cadence-torrent.yaml b/Documentation/devicetree/bindings/phy/phy-cadence-torrent.yaml new file mode 100644 index 000000000000..c779a3c7d87a --- /dev/null +++ b/Documentation/devicetree/bindings/phy/phy-cadence-torrent.yaml @@ -0,0 +1,143 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/phy/phy-cadence-torrent.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Cadence Torrent SD0801 PHY binding for DisplayPort + +description: + This binding describes the Cadence SD0801 
PHY (also known as Torrent PHY) + hardware included with the Cadence MHDP DisplayPort controller. + +maintainers: + - Swapnil Jakhade <sjakhade@cadence.com> + - Yuti Amonkar <yamonkar@cadence.com> + +properties: + compatible: + enum: + - cdns,torrent-phy + - ti,j721e-serdes-10g + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + + clocks: + maxItems: 1 + description: + PHY reference clock. Must contain an entry in clock-names. + + clock-names: + const: refclk + + reg: + minItems: 1 + maxItems: 2 + items: + - description: Offset of the Torrent PHY configuration registers. + - description: Offset of the DPTX PHY configuration registers. + + reg-names: + minItems: 1 + maxItems: 2 + items: + - const: torrent_phy + - const: dptx_phy + + resets: + maxItems: 1 + description: + Torrent PHY reset. + See Documentation/devicetree/bindings/reset/reset.txt + +patternProperties: + '^phy@[0-7]+$': + type: object + description: + Each group of PHY lanes with a single master lane should be represented as a sub-node. + properties: + reg: + description: + The master lane number. This is the lowest numbered lane in the lane group. + + resets: + minItems: 1 + maxItems: 4 + description: + Contains list of resets, one per lane, to get all the link lanes out of reset. + + "#phy-cells": + const: 0 + + cdns,phy-type: + description: + Specifies the type of PHY for which the group of PHY lanes is used. + Refer include/dt-bindings/phy/phy.h. Constants from the header should be used. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [1, 2, 3, 4, 5, 6] + + cdns,num-lanes: + description: + Number of DisplayPort lanes. 
+ allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [1, 2, 4] + default: 4 + + cdns,max-bit-rate: + description: + Maximum DisplayPort link bit rate to use, in Mbps + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - enum: [2160, 2430, 2700, 3240, 4320, 5400, 8100] + default: 8100 + + required: + - reg + - resets + - "#phy-cells" + - cdns,phy-type + + additionalProperties: false + +required: + - compatible + - "#address-cells" + - "#size-cells" + - clocks + - clock-names + - reg + - reg-names + - resets + +additionalProperties: false + +examples: + - | + #include <dt-bindings/phy/phy.h> + torrent_phy: torrent-phy@f0fb500000 { + compatible = "cdns,torrent-phy"; + reg = <0xf0 0xfb500000 0x0 0x00100000>, + <0xf0 0xfb030a00 0x0 0x00000040>; + reg-names = "torrent_phy", "dptx_phy"; + resets = <&phyrst 0>; + clocks = <&ref_clk>; + clock-names = "refclk"; + #address-cells = <1>; + #size-cells = <0>; + torrent_phy_dp: phy@0 { + reg = <0>; + resets = <&phyrst 1>, <&phyrst 2>, + <&phyrst 3>, <&phyrst 4>; + #phy-cells = <0>; + cdns,phy-type = <PHY_TYPE_DP>; + cdns,num-lanes = <4>; + cdns,max-bit-rate = <8100>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt b/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt index a5f7a4f0dbc1..dd75b676b71d 100644 --- a/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt +++ b/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt @@ -13,10 +13,16 @@ Required properties (controller (parent) node): "mediatek,mt8173-u3phy"; make use of "mediatek,generic-tphy-v1" on mt2701 instead and "mediatek,generic-tphy-v2" on mt2712 instead. - - clocks : (deprecated, use port's clocks instead) a list of phandle + - clock-specifier pairs, one for each entry in clock-names - - clock-names : (deprecated, use port's one instead) must contain - "u3phya_ref": for reference clock of usb3.0 analog phy. + +- #address-cells: the number of cells used to represent physical + base addresses. 
+- #size-cells: the number of cells used to represent the size of an address. +- ranges: the address mapping relationship to the parent, defined with + - empty value: if optional 'reg' is used. + - non-empty value: if optional 'reg' is not used. should set + the child's base address to 0, the physical address + within parent's address space, and the length of + the address map. Required nodes : a sub-node is required for each port the controller provides. Address range information including the usual @@ -34,12 +40,6 @@ Optional properties (controller (parent) node): Required properties (port (child) node): - reg : address and length of the register set for the port. -- clocks : a list of phandle + clock-specifier pairs, one for each - entry in clock-names -- clock-names : must contain - "ref": 48M reference clock for HighSpeed analog phy; and 26M - reference clock for SuperSpeed analog phy, sometimes is - 24M, 25M or 27M, depended on platform. - #phy-cells : should be 1 (See second example) cell after port phandle is phy type from: - PHY_TYPE_USB2 @@ -48,10 +48,22 @@ Required properties (port (child) node): - PHY_TYPE_SATA Optional properties (PHY_TYPE_USB2 port (child) node): +- clocks : a list of phandle + clock-specifier pairs, one for each + entry in clock-names +- clock-names : may contain + "ref": 48M reference clock for HighSpeed (digital) phy; and 26M + reference clock for SuperSpeed (digital) phy, sometimes is + 24M, 25M or 27M, depended on platform. + "da_ref": the reference clock of analog phy, used if the clocks + of analog and digital phys are separated, otherwise uses + "ref" clock only if needed. 
+ - mediatek,eye-src : u32, the value of slew rate calibrate - mediatek,eye-vrt : u32, the selection of VRT reference voltage - mediatek,eye-term : u32, the selection of HS_TX TERM reference voltage - mediatek,bc12 : bool, enable BC12 of u2phy if support it +- mediatek,discth : u32, the selection of disconnect threshold +- mediatek,intr : u32, the selection of internal R (resistance) Example: diff --git a/Documentation/devicetree/bindings/phy/qcom,qusb2-phy.yaml b/Documentation/devicetree/bindings/phy/qcom,qusb2-phy.yaml new file mode 100644 index 000000000000..144ae29e7141 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/qcom,qusb2-phy.yaml @@ -0,0 +1,185 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/phy/qcom,qusb2-phy.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm QUSB2 phy controller + +maintainers: + - Manu Gautam <mgautam@codeaurora.org> + +description: + QUSB2 controller supports LS/FS/HS usb connectivity on Qualcomm chipsets. + +properties: + compatible: + oneOf: + - items: + - enum: + - qcom,msm8996-qusb2-phy + - qcom,msm8998-qusb2-phy + - items: + - enum: + - qcom,sc7180-qusb2-phy + - qcom,sdm845-qusb2-phy + - const: qcom,qusb2-v2-phy + reg: + maxItems: 1 + + "#phy-cells": + const: 0 + + clocks: + minItems: 2 + maxItems: 3 + items: + - description: phy config clock + - description: 19.2 MHz ref clk + - description: phy interface clock (Optional) + + clock-names: + minItems: 2 + maxItems: 3 + items: + - const: cfg_ahb + - const: ref + - const: iface + + vdda-pll-supply: + description: + Phandle to 1.8V regulator supply to PHY refclk pll block. + + vdda-phy-dpdm-supply: + description: + Phandle to 3.1V regulator supply to Dp/Dm port signals. + + resets: + maxItems: 1 + description: + Phandle to reset to phy block. + + nvmem-cells: + maxItems: 1 + description: + Phandle to nvmem cell that contains 'HS Tx trim' + tuning parameter value for qusb2 phy. 
+ + qcom,tcsr-syscon: + description: + Phandle to TCSR syscon register region. + $ref: /schemas/types.yaml#/definitions/phandle + +if: + properties: + compatible: + contains: + const: qcom,qusb2-v2-phy +then: + properties: + qcom,imp-res-offset-value: + description: + It is a 6 bit value that specifies offset to be + added to PHY refgen RESCODE via IMP_CTRL1 register. It is a PHY + tuning parameter that may vary for different boards of same SOC. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 63 + default: 0 + + qcom,bias-ctrl-value: + description: + It is a 6 bit value that specifies bias-ctrl-value. It is a PHY + tuning parameter that may vary for different boards of same SOC. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 63 + default: 0 + + qcom,charge-ctrl-value: + description: + It is a 2 bit value that specifies charge-ctrl-value. It is a PHY + tuning parameter that may vary for different boards of same SOC. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 3 + default: 0 + + qcom,hstx-trim-value: + description: + It is a 4 bit value that specifies tuning for HSTX + output current. + Possible range is - 15mA to 24mA (stepsize of 600 uA). + See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 15 + default: 3 + + qcom,preemphasis-level: + description: + It is a 2 bit value that specifies pre-emphasis level. + Possible range is 0 to 15% (stepsize of 5%). + See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 3 + default: 2 + + qcom,preemphasis-width: + description: + It is a 1 bit value that specifies how long the HSTX + pre-emphasis (specified using qcom,preemphasis-level) must be in + effect. Duration could be half-bit of full-bit. 
+ See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 1 + default: 0 + + qcom,hsdisc-trim-value: + description: + It is a 2 bit value tuning parameter that control disconnect + threshold and may vary for different boards of same SOC. + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - minimum: 0 + maximum: 3 + default: 0 + +required: + - compatible + - reg + - "#phy-cells" + - clocks + - clock-names + - vdda-pll-supply + - vdda-phy-dpdm-supply + - resets + + +examples: + - | + #include <dt-bindings/clock/qcom,gcc-msm8996.h> + hsusb_phy: phy@7411000 { + compatible = "qcom,msm8996-qusb2-phy"; + reg = <0x7411000 0x180>; + #phy-cells = <0>; + + clocks = <&gcc GCC_USB_PHY_CFG_AHB2PHY_CLK>, + <&gcc GCC_RX1_USB2_CLKREF_CLK>; + clock-names = "cfg_ahb", "ref"; + + vdda-pll-supply = <&pm8994_l12>; + vdda-phy-dpdm-supply = <&pm8994_l24>; + + resets = <&gcc GCC_QUSB2PHY_PRIM_BCR>; + nvmem-cells = <&qusb2p_hstx_trim>; + }; diff --git a/Documentation/devicetree/bindings/phy/qcom,usb-hs-28nm.yaml b/Documentation/devicetree/bindings/phy/qcom,usb-hs-28nm.yaml new file mode 100644 index 000000000000..ca6a0836b53c --- /dev/null +++ b/Documentation/devicetree/bindings/phy/qcom,usb-hs-28nm.yaml @@ -0,0 +1,90 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/phy/qcom,usb-hs-28nm.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Synopsys DesignWare Core 28nm High-Speed PHY + +maintainers: + - Bryan O'Donoghue <bryan.odonoghue@linaro.org> + +description: | + Qualcomm Low-Speed, Full-Speed, Hi-Speed 28nm USB PHY + +properties: + compatible: + enum: + - qcom,usb-hs-28nm-femtophy + + reg: + maxItems: 1 + + "#phy-cells": + const: 0 + + clocks: + items: + - description: rpmcc ref clock + - description: PHY AHB clock + - description: Rentention clock + + clock-names: + items: + - const: ref + - 
const: ahb + - const: sleep + + resets: + items: + - description: PHY core reset + - description: POR reset + + reset-names: + items: + - const: phy + - const: por + + vdd-supply: + description: phandle to the regulator VDD supply node. + + vdda1p8-supply: + description: phandle to the regulator 1.8V supply node. + + vdda3p3-supply: + description: phandle to the regulator 3.3V supply node. + +required: + - compatible + - reg + - "#phy-cells" + - clocks + - clock-names + - resets + - reset-names + - vdd-supply + - vdda1p8-supply + - vdda3p3-supply + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/qcom,gcc-qcs404.h> + #include <dt-bindings/clock/qcom,rpmcc.h> + usb2_phy_prim: phy@7a000 { + compatible = "qcom,usb-hs-28nm-femtophy"; + reg = <0x0007a000 0x200>; + #phy-cells = <0>; + clocks = <&rpmcc RPM_SMD_LN_BB_CLK>, + <&gcc GCC_USB_HS_PHY_CFG_AHB_CLK>, + <&gcc GCC_USB2A_PHY_SLEEP_CLK>; + clock-names = "ref", "ahb", "sleep"; + resets = <&gcc GCC_USB_HS_PHY_CFG_AHB_BCR>, + <&gcc GCC_USB2A_PHY_BCR>; + reset-names = "phy", "por"; + vdd-supply = <&vreg_l4_1p2>; + vdda1p8-supply = <&vreg_l5_1p8>; + vdda3p3-supply = <&vreg_l12_3p3>; + }; +... 
diff --git a/Documentation/devicetree/bindings/phy/qcom,usb-ss.yaml b/Documentation/devicetree/bindings/phy/qcom,usb-ss.yaml new file mode 100644 index 000000000000..bd1388d62ce0 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/qcom,usb-ss.yaml @@ -0,0 +1,83 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/phy/qcom,usb-ss.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Qualcomm Synopsys 1.0.0 SuperSpeed USB PHY + +maintainers: + - Bryan O'Donoghue <bryan.odonoghue@linaro.org> + +description: | + Qualcomm Synopsys 1.0.0 SuperSpeed USB PHY + +properties: + compatible: + enum: + - qcom,usb-ss-28nm-phy + + reg: + maxItems: 1 + + "#phy-cells": + const: 0 + + clocks: + items: + - description: rpmcc clock + - description: PHY AHB clock + - description: SuperSpeed pipe clock + + clock-names: + items: + - const: ref + - const: ahb + - const: pipe + + vdd-supply: + description: phandle to the regulator VDD supply node. + + vdda1p8-supply: + description: phandle to the regulator 1.8V supply node. + + resets: + items: + - description: COM reset + - description: PHY reset line + + reset-names: + items: + - const: com + - const: phy + +required: + - compatible + - reg + - "#phy-cells" + - clocks + - clock-names + - vdd-supply + - vdda1p8-supply + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/qcom,gcc-qcs404.h> + #include <dt-bindings/clock/qcom,rpmcc.h> + usb3_phy: usb3-phy@78000 { + compatible = "qcom,usb-ss-28nm-phy"; + reg = <0x78000 0x400>; + #phy-cells = <0>; + clocks = <&rpmcc RPM_SMD_LN_BB_CLK>, + <&gcc GCC_USB_HS_PHY_CFG_AHB_CLK>, + <&gcc GCC_USB3_PHY_PIPE_CLK>; + clock-names = "ref", "ahb", "pipe"; + resets = <&gcc GCC_USB3_PHY_BCR>, + <&gcc GCC_USB3PHY_PHY_BCR>; + reset-names = "com", "phy"; + vdd-supply = <&vreg_l3_1p05>; + vdda1p8-supply = <&vreg_l5_1p8>; + }; +... 
diff --git a/Documentation/devicetree/bindings/phy/qcom-dwc3-usb-phy.txt b/Documentation/devicetree/bindings/phy/qcom-dwc3-usb-phy.txt deleted file mode 100644 index a1697c27aecd..000000000000 --- a/Documentation/devicetree/bindings/phy/qcom-dwc3-usb-phy.txt +++ /dev/null @@ -1,37 +0,0 @@ -Qualcomm DWC3 HS AND SS PHY CONTROLLER --------------------------------------- - -DWC3 PHY nodes are defined to describe on-chip Synopsis Physical layer -controllers. Each DWC3 PHY controller should have its own node. - -Required properties: -- compatible: should contain one of the following: - - "qcom,dwc3-hs-usb-phy" for High Speed Synopsis PHY controller - - "qcom,dwc3-ss-usb-phy" for Super Speed Synopsis PHY controller -- reg: offset and length of the DWC3 PHY controller register set -- #phy-cells: must be zero -- clocks: a list of phandles and clock-specifier pairs, one for each entry in - clock-names. -- clock-names: Should contain "ref" for the PHY reference clock - -Optional clocks: - "xo" External reference clock - -Example: - phy@100f8800 { - compatible = "qcom,dwc3-hs-usb-phy"; - reg = <0x100f8800 0x30>; - clocks = <&gcc USB30_0_UTMI_CLK>; - clock-names = "ref"; - #phy-cells = <0>; - - }; - - phy@100f8830 { - compatible = "qcom,dwc3-ss-usb-phy"; - reg = <0x100f8830 0x30>; - clocks = <&gcc USB30_0_MASTER_CLK>; - clock-names = "ref"; - #phy-cells = <0>; - - }; diff --git a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt index eac9ad3cbbc8..54d6f8d43508 100644 --- a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt +++ b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt @@ -8,10 +8,13 @@ Required properties: - compatible: compatible list, contains: "qcom,ipq8074-qmp-pcie-phy" for PCIe phy on IPQ8074 "qcom,msm8996-qmp-pcie-phy" for 14nm PCIe phy on msm8996, + "qcom,msm8996-qmp-ufs-phy" for 14nm UFS phy on msm8996, "qcom,msm8996-qmp-usb3-phy" for 14nm USB3 phy on msm8996, "qcom,msm8998-qmp-usb3-phy" 
for USB3 QMP V3 phy on msm8998, "qcom,msm8998-qmp-ufs-phy" for UFS QMP phy on msm8998, "qcom,msm8998-qmp-pcie-phy" for PCIe QMP phy on msm8998, + "qcom,sdm845-qhp-pcie-phy" for QHP PCIe phy on sdm845, + "qcom,sdm845-qmp-pcie-phy" for QMP PCIe phy on sdm845, "qcom,sdm845-qmp-usb3-phy" for USB3 QMP V3 phy on sdm845, "qcom,sdm845-qmp-usb3-uni-phy" for USB3 QMP V3 UNI phy on sdm845, "qcom,sdm845-qmp-ufs-phy" for UFS QMP phy on sdm845, @@ -44,6 +47,8 @@ Required properties: For "qcom,ipq8074-qmp-pcie-phy": no clocks are listed. For "qcom,msm8996-qmp-pcie-phy" must contain: "aux", "cfg_ahb", "ref". + For "qcom,msm8996-qmp-ufs-phy" must contain: + "ref". For "qcom,msm8996-qmp-usb3-phy" must contain: "aux", "cfg_ahb", "ref". For "qcom,msm8998-qmp-usb3-phy" must contain: @@ -52,6 +57,10 @@ Required properties: "ref", "ref_aux". For "qcom,msm8998-qmp-pcie-phy" must contain: "aux", "cfg_ahb", "ref". + For "qcom,sdm845-qhp-pcie-phy" must contain: + "aux", "cfg_ahb", "ref", "refgen". + For "qcom,sdm845-qmp-pcie-phy" must contain: + "aux", "cfg_ahb", "ref", "refgen". For "qcom,sdm845-qmp-usb3-phy" must contain: "aux", "cfg_ahb", "ref", "com_aux". For "qcom,sdm845-qmp-usb3-uni-phy" must contain: @@ -72,6 +81,8 @@ Required properties: "phy", "common". For "qcom,msm8996-qmp-pcie-phy" must contain: "phy", "common", "cfg". + For "qcom,msm8996-qmp-ufs-phy": must contain: + "ufsphy". For "qcom,msm8996-qmp-usb3-phy" must contain "phy", "common". For "qcom,msm8998-qmp-usb3-phy" must contain @@ -80,6 +91,10 @@ Required properties: "ufsphy". For "qcom,msm8998-qmp-pcie-phy" must contain: "phy", "common". + For "qcom,sdm845-qhp-pcie-phy" must contain: + "phy". + For "qcom,sdm845-qmp-pcie-phy" must contain: + "phy". For "qcom,sdm845-qmp-usb3-phy" must contain: "phy", "common". 
For "qcom,sdm845-qmp-usb3-uni-phy" must contain: diff --git a/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt deleted file mode 100644 index fe29f9e0af6d..000000000000 --- a/Documentation/devicetree/bindings/phy/qcom-qusb2-phy.txt +++ /dev/null @@ -1,68 +0,0 @@ -Qualcomm QUSB2 phy controller -============================= - -QUSB2 controller supports LS/FS/HS usb connectivity on Qualcomm chipsets. - -Required properties: - - compatible: compatible list, contains - "qcom,msm8996-qusb2-phy" for 14nm PHY on msm8996, - "qcom,msm8998-qusb2-phy" for 10nm PHY on msm8998, - "qcom,sdm845-qusb2-phy" for 10nm PHY on sdm845. - - - reg: offset and length of the PHY register set. - - #phy-cells: must be 0. - - - clocks: a list of phandles and clock-specifier pairs, - one for each entry in clock-names. - - clock-names: must be "cfg_ahb" for phy config clock, - "ref" for 19.2 MHz ref clk, - "iface" for phy interface clock (Optional). - - - vdda-pll-supply: Phandle to 1.8V regulator supply to PHY refclk pll block. - - vdda-phy-dpdm-supply: Phandle to 3.1V regulator supply to Dp/Dm port signals. - - - resets: Phandle to reset to phy block. - -Optional properties: - - nvmem-cells: Phandle to nvmem cell that contains 'HS Tx trim' - tuning parameter value for qusb2 phy. - - - qcom,tcsr-syscon: Phandle to TCSR syscon register region. - - qcom,imp-res-offset-value: It is a 6 bit value that specifies offset to be - added to PHY refgen RESCODE via IMP_CTRL1 register. It is a PHY - tuning parameter that may vary for different boards of same SOC. - This property is applicable to only QUSB2 v2 PHY (sdm845). - - qcom,hstx-trim-value: It is a 4 bit value that specifies tuning for HSTX - output current. - Possible range is - 15mA to 24mA (stepsize of 600 uA). - See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. - This property is applicable to only QUSB2 v2 PHY (sdm845). - Default value is 22.2mA for sdm845. 
- - qcom,preemphasis-level: It is a 2 bit value that specifies pre-emphasis level. - Possible range is 0 to 15% (stepsize of 5%). - See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. - This property is applicable to only QUSB2 v2 PHY (sdm845). - Default value is 10% for sdm845. -- qcom,preemphasis-width: It is a 1 bit value that specifies how long the HSTX - pre-emphasis (specified using qcom,preemphasis-level) must be in - effect. Duration could be half-bit of full-bit. - See dt-bindings/phy/phy-qcom-qusb2.h for applicable values. - This property is applicable to only QUSB2 v2 PHY (sdm845). - Default value is full-bit width for sdm845. - -Example: - hsusb_phy: phy@7411000 { - compatible = "qcom,msm8996-qusb2-phy"; - reg = <0x7411000 0x180>; - #phy-cells = <0>; - - clocks = <&gcc GCC_USB_PHY_CFG_AHB2PHY_CLK>, - <&gcc GCC_RX1_USB2_CLKREF_CLK>, - clock-names = "cfg_ahb", "ref"; - - vdda-pll-supply = <&pm8994_l12>; - vdda-phy-dpdm-supply = <&pm8994_l24>; - - resets = <&gcc GCC_QUSB2PHY_PRIM_BCR>; - nvmem-cells = <&qusb2p_hstx_trim>; - }; diff --git a/Documentation/devicetree/bindings/phy/ti-phy-gmii-sel.txt b/Documentation/devicetree/bindings/phy/ti-phy-gmii-sel.txt index 50ce9ae0f7a5..83b78c1c0644 100644 --- a/Documentation/devicetree/bindings/phy/ti-phy-gmii-sel.txt +++ b/Documentation/devicetree/bindings/phy/ti-phy-gmii-sel.txt @@ -40,6 +40,7 @@ Required properties: "ti,dra7xx-phy-gmii-sel" for dra7xx/am57xx platform "ti,am43xx-phy-gmii-sel" for am43xx platform "ti,dm814-phy-gmii-sel" for dm814x platform + "ti,am654-phy-gmii-sel" for AM654x/J721E platform - reg : Address and length of the register set for the device - #phy-cells : must be 2. 
cell 1 - CPSW port number (starting from 1) diff --git a/Documentation/devicetree/bindings/phy/uniphier-pcie-phy.txt b/Documentation/devicetree/bindings/phy/uniphier-pcie-phy.txt index 1889d3b89d68..3cee372c5742 100644 --- a/Documentation/devicetree/bindings/phy/uniphier-pcie-phy.txt +++ b/Documentation/devicetree/bindings/phy/uniphier-pcie-phy.txt @@ -5,14 +5,19 @@ PCIe controller implemented on Socionext UniPhier SoCs. Required properties: - compatible: Should contain one of the following: + "socionext,uniphier-pro5-pcie-phy" - for Pro5 PHY "socionext,uniphier-ld20-pcie-phy" - for LD20 PHY "socionext,uniphier-pxs3-pcie-phy" - for PXs3 PHY - reg: Specifies offset and length of the register set for the device. - #phy-cells: Must be zero. -- clocks: A phandle to the clock gate for PCIe glue layer including - this phy. -- resets: A phandle to the reset line for PCIe glue layer including - this phy. +- clocks: A list of phandles to the clock gate for PCIe glue layer + including this phy. +- clock-names: For Pro5 only, should contain the following: + "gio", "link" - for Pro5 SoC +- resets: A list of phandles to the reset line for PCIe glue layer + including this phy. +- reset-names: For Pro5 only, should contain the following: + "gio", "link" - for Pro5 SoC Optional properties: - socionext,syscon: A phandle to system control to set configurations diff --git a/Documentation/devicetree/bindings/phy/uniphier-usb3-hsphy.txt b/Documentation/devicetree/bindings/phy/uniphier-usb3-hsphy.txt index e8d8086a7ae9..093d4f08705f 100644 --- a/Documentation/devicetree/bindings/phy/uniphier-usb3-hsphy.txt +++ b/Documentation/devicetree/bindings/phy/uniphier-usb3-hsphy.txt @@ -7,7 +7,7 @@ this describes about High-Speed PHY. 
Required properties: - compatible: Should contain one of the following: - "socionext,uniphier-pro4-usb3-hsphy" - for Pro4 SoC + "socionext,uniphier-pro5-usb3-hsphy" - for Pro5 SoC "socionext,uniphier-pxs2-usb3-hsphy" - for PXs2 SoC "socionext,uniphier-ld20-usb3-hsphy" - for LD20 SoC "socionext,uniphier-pxs3-usb3-hsphy" - for PXs3 SoC @@ -16,13 +16,13 @@ Required properties: - clocks: A list of phandles to the clock gate for USB3 glue layer. According to the clock-names, appropriate clocks are required. - clock-names: Should contain the following: - "gio", "link" - for Pro4 SoC + "gio", "link" - for Pro5 SoC "phy", "phy-ext", "link" - for PXs3 SoC, "phy-ext" is optional. "phy", "link" - for others - resets: A list of phandles to the reset control for USB3 glue layer. According to the reset-names, appropriate resets are required. - reset-names: Should contain the following: - "gio", "link" - for Pro4 SoC + "gio", "link" - for Pro5 SoC "phy", "link" - for others Optional properties: diff --git a/Documentation/devicetree/bindings/phy/uniphier-usb3-ssphy.txt b/Documentation/devicetree/bindings/phy/uniphier-usb3-ssphy.txt index 490b815445e8..9df2bc2f5999 100644 --- a/Documentation/devicetree/bindings/phy/uniphier-usb3-ssphy.txt +++ b/Documentation/devicetree/bindings/phy/uniphier-usb3-ssphy.txt @@ -8,6 +8,7 @@ this describes about Super-Speed PHY. Required properties: - compatible: Should contain one of the following: "socionext,uniphier-pro4-usb3-ssphy" - for Pro4 SoC + "socionext,uniphier-pro5-usb3-ssphy" - for Pro5 SoC "socionext,uniphier-pxs2-usb3-ssphy" - for PXs2 SoC "socionext,uniphier-ld20-usb3-ssphy" - for LD20 SoC "socionext,uniphier-pxs3-usb3-ssphy" - for PXs3 SoC @@ -16,13 +17,13 @@ Required properties: - clocks: A list of phandles to the clock gate for USB3 glue layer. According to the clock-names, appropriate clocks are required. 
- clock-names: - "gio", "link" - for Pro4 SoC + "gio", "link" - for Pro4 and Pro5 SoC "phy", "phy-ext", "link" - for PXs3 SoC, "phy-ext" is optional. "phy", "link" - for others - resets: A list of phandles to the reset control for USB3 glue layer. According to the reset-names, appropriate resets are required. - reset-names: - "gio", "link" - for Pro4 SoC + "gio", "link" - for Pro4 and Pro5 SoC "phy", "link" - for others Optional properties: diff --git a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml index bb690e20c368..135c7dfbc180 100644 --- a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml +++ b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml @@ -17,7 +17,7 @@ description: |+ "aspeed,ast2400-scu", "syscon", "simple-mfd" Refer to the the bindings described in - Documentation/devicetree/bindings/mfd/syscon.txt + Documentation/devicetree/bindings/mfd/syscon.yaml properties: compatible: diff --git a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml index f7f5d57f2c9a..824f7fd1d51b 100644 --- a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml +++ b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml @@ -18,7 +18,7 @@ description: |+ "aspeed,g5-scu", "syscon", "simple-mfd" Refer to the the bindings described in - Documentation/devicetree/bindings/mfd/syscon.txt + Documentation/devicetree/bindings/mfd/syscon.yaml properties: compatible: diff --git a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2600-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2600-pinctrl.yaml index 3749fa233e87..ac8d1c30a8ed 100644 --- a/Documentation/devicetree/bindings/pinctrl/aspeed,ast2600-pinctrl.yaml +++ b/Documentation/devicetree/bindings/pinctrl/aspeed,ast2600-pinctrl.yaml @@ -17,7 +17,7 @@ 
description: |+ "aspeed,ast2600-scu", "syscon", "simple-mfd" Refer to the the bindings described in - Documentation/devicetree/bindings/mfd/syscon.txt + Documentation/devicetree/bindings/mfd/syscon.yaml properties: compatible: diff --git a/Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.yaml b/Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.yaml index 754ea7ab040a..ef4de32cb17c 100644 --- a/Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.yaml +++ b/Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.yaml @@ -248,7 +248,7 @@ examples: }; //Example 3 pin groups - pinctrl@60020000 { + pinctrl { usart1_pins_a: usart1-0 { pins1 { pinmux = <STM32_PINMUX('A', 9, AF7)>; diff --git a/Documentation/devicetree/bindings/power/amlogic,meson-ee-pwrc.yaml b/Documentation/devicetree/bindings/power/amlogic,meson-ee-pwrc.yaml index aab70e8b681e..d3098c924b25 100644 --- a/Documentation/devicetree/bindings/power/amlogic,meson-ee-pwrc.yaml +++ b/Documentation/devicetree/bindings/power/amlogic,meson-ee-pwrc.yaml @@ -18,7 +18,7 @@ description: |+ "amlogic,meson-gx-hhi-sysctrl", "simple-mfd", "syscon" Refer to the the bindings described in - Documentation/devicetree/bindings/mfd/syscon.txt + Documentation/devicetree/bindings/mfd/syscon.yaml properties: compatible: diff --git a/Documentation/devicetree/bindings/power/domain-idle-state.txt b/Documentation/devicetree/bindings/power/domain-idle-state.txt deleted file mode 100644 index eefc7ed22ca2..000000000000 --- a/Documentation/devicetree/bindings/power/domain-idle-state.txt +++ /dev/null @@ -1,33 +0,0 @@ -PM Domain Idle State Node: - -A domain idle state node represents the state parameters that will be used to -select the state when there are no active components in the domain. - -The state node has the following parameters - - -- compatible: - Usage: Required - Value type: <string> - Definition: Must be "domain-idle-state". 
- -- entry-latency-us - Usage: Required - Value type: <prop-encoded-array> - Definition: u32 value representing worst case latency in - microseconds required to enter the idle state. - The exit-latency-us duration may be guaranteed - only after entry-latency-us has passed. - -- exit-latency-us - Usage: Required - Value type: <prop-encoded-array> - Definition: u32 value representing worst case latency - in microseconds required to exit the idle state. - -- min-residency-us - Usage: Required - Value type: <prop-encoded-array> - Definition: u32 value representing minimum residency duration - in microseconds after which the idle state will yield - power benefits after overcoming the overhead in entering -i the idle state. diff --git a/Documentation/devicetree/bindings/power/domain-idle-state.yaml b/Documentation/devicetree/bindings/power/domain-idle-state.yaml new file mode 100644 index 000000000000..dfba1af9abe5 --- /dev/null +++ b/Documentation/devicetree/bindings/power/domain-idle-state.yaml @@ -0,0 +1,64 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/power/domain-idle-state.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: PM Domain Idle States binding description + +maintainers: + - Ulf Hansson <ulf.hansson@linaro.org> + +description: + A domain idle state node represents the state parameters that will be used to + select the state when there are no active components in the PM domain. + +properties: + $nodename: + const: domain-idle-states + +patternProperties: + "^(cpu|cluster|domain)-": + type: object + description: + Each state node represents a domain idle state description. + + properties: + compatible: + const: domain-idle-state + + entry-latency-us: + description: + The worst case latency in microseconds required to enter the idle + state. Note that, the exit-latency-us duration may be guaranteed only + after the entry-latency-us has passed. 
+ + exit-latency-us: + description: + The worst case latency in microseconds required to exit the idle + state. + + min-residency-us: + description: + The minimum residency duration in microseconds after which the idle + state will yield power benefits, after overcoming the overhead while + entering the idle state. + + required: + - compatible + - entry-latency-us + - exit-latency-us + - min-residency-us + +examples: + - | + + domain-idle-states { + domain_retention: domain-retention { + compatible = "domain-idle-state"; + entry-latency-us = <20>; + exit-latency-us = <40>; + min-residency-us = <80>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/power/power-domain.yaml b/Documentation/devicetree/bindings/power/power-domain.yaml index 455b573293ae..6047aacd7766 100644 --- a/Documentation/devicetree/bindings/power/power-domain.yaml +++ b/Documentation/devicetree/bindings/power/power-domain.yaml @@ -25,22 +25,20 @@ description: |+ properties: $nodename: - pattern: "^(power-controller|power-domain)(@.*)?$" + pattern: "^(power-controller|power-domain)([@-].*)?$" domain-idle-states: $ref: /schemas/types.yaml#/definitions/phandle-array - description: - A phandle of an idle-state that shall be soaked into a generic domain - power state. The idle state definitions are compatible with - domain-idle-state specified in - Documentation/devicetree/bindings/power/domain-idle-state.txt - phandles that are not compatible with domain-idle-state will be ignored. - The domain-idle-state property reflects the idle state of this PM domain - and not the idle states of the devices or sub-domains in the PM domain. - Devices and sub-domains have their own idle-states independent - of the parent domain's idle states. In the absence of this property, - the domain would be considered as capable of being powered-on - or powered-off. + description: | + Phandles of idle states that define the available states for the + power-domain provider.
The idle state definitions are compatible with the + domain-idle-state bindings, specified in ./domain-idle-state.yaml. + + Note that, the domain-idle-state property reflects the idle states of this + PM domain and not the idle states of the devices or sub-domains in the PM + domain. Devices and sub-domains have their own idle states independent of + the parent domain's idle states. In the absence of this property, the + domain would be considered as capable of being powered-on or powered-off. operating-points-v2: $ref: /schemas/types.yaml#/definitions/phandle-array diff --git a/Documentation/devicetree/bindings/power/power_domain.txt b/Documentation/devicetree/bindings/power/power_domain.txt index 5b09b2deb483..08497ef26c7a 100644 --- a/Documentation/devicetree/bindings/power/power_domain.txt +++ b/Documentation/devicetree/bindings/power/power_domain.txt @@ -109,4 +109,4 @@ Example: required-opps = <&domain1_opp_1>; }; -[1]. Documentation/devicetree/bindings/power/domain-idle-state.txt +[1]. Documentation/devicetree/bindings/power/domain-idle-state.yaml diff --git a/Documentation/devicetree/bindings/regulator/mp886x.txt b/Documentation/devicetree/bindings/regulator/mp886x.txt new file mode 100644 index 000000000000..551867829459 --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/mp886x.txt @@ -0,0 +1,27 @@ +Monolithic Power Systems MP8867/MP8869 voltage regulator + +Required properties: +- compatible: Must be one of the following. + "mps,mp8867" + "mps,mp8869" +- reg: I2C slave address. +- enable-gpios: enable gpios. +- mps,fb-voltage-divider: An array of two integers containing the resistor + values R1 and R2 of the feedback voltage divider in kilo ohms. + +Any property defined as part of the core regulator binding, defined in +./regulator.txt, can also be used. 
+ +Example: + + vcpu: regulator@62 { + compatible = "mps,mp8869"; + regulator-name = "vcpu"; + regulator-min-microvolt = <700000>; + regulator-max-microvolt = <850000>; + regulator-always-on; + regulator-boot-on; + enable-gpios = <&porta 1 GPIO_ACTIVE_LOW>; + mps,fb-voltage-divider = <80 240>; + reg = <0x62>; + }; diff --git a/Documentation/devicetree/bindings/regulator/mps,mp5416.yaml b/Documentation/devicetree/bindings/regulator/mps,mp5416.yaml new file mode 100644 index 000000000000..f0acce2029fd --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/mps,mp5416.yaml @@ -0,0 +1,78 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/regulator/mps,mp5416.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Monolithic Power System MP5416 PMIC + +maintainers: + - Saravanan Sekar <sravanhome@gmail.com> + +properties: + $nodename: + pattern: "^pmic@[0-9a-f]{1,2}$" + compatible: + enum: + - mps,mp5416 + + reg: + maxItems: 1 + + regulators: + type: object + description: | + list of regulators provided by this controller, must be named + after their hardware counterparts BUCK[1-4] and LDO[1-4] + + patternProperties: + "^buck[1-4]$": + allOf: + - $ref: "regulator.yaml#" + type: object + + "^ldo[1-4]$": + allOf: + - $ref: "regulator.yaml#" + type: object + + additionalProperties: false + additionalProperties: false + +required: + - compatible + - reg + - regulators + +additionalProperties: false + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + pmic@69 { + compatible = "mps,mp5416"; + reg = <0x69>; + + regulators { + + buck1 { + regulator-name = "buck1"; + regulator-min-microvolt = <600000>; + regulator-max-microvolt = <2187500>; + regulator-min-microamp = <3800000>; + regulator-max-microamp = <6800000>; + regulator-boot-on; + }; + + ldo2 { + regulator-name = "ldo2"; + regulator-min-microvolt = <800000>; + regulator-max-microvolt = <3975000>; + }; + }; + }; + }; 
+... diff --git a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt index d126df043403..dea4384f4c03 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.txt @@ -26,6 +26,7 @@ Regulator nodes are identified by their compatible: "qcom,rpm-pm8994-regulators" "qcom,rpm-pm8998-regulators" "qcom,rpm-pma8084-regulators" + "qcom,rpm-pmi8994-regulators" "qcom,rpm-pmi8998-regulators" "qcom,rpm-pms405-regulators" @@ -146,6 +147,15 @@ Regulator nodes are identified by their compatible: - vdd_s1-supply: - vdd_s2-supply: - vdd_s3-supply: +- vdd_bst_byp-supply: + Usage: optional (pmi8994 only) + Value type: <phandle> + Definition: reference to regulator supplying the input pin, as + described in the data sheet + +- vdd_s1-supply: +- vdd_s2-supply: +- vdd_s3-supply: - vdd_s4-supply: - vdd_s5-supply: - vdd_s6-supply: @@ -259,6 +269,9 @@ pma8084: l6, l7, l8, l9, l10, l11, l12, l13, l14, l15, l16, l17, l18, l19, l20, l21, l22, l23, l24, l25, l26, l27, lvs1, lvs2, lvs3, lvs4, 5vs1 +pmi8994: + s1, s2, s3, boost-bypass + pmi8998: bob diff --git a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt index f5cdac8b2847..8b005192f6e8 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt @@ -161,7 +161,7 @@ The regulator node houses sub-nodes for each regulator within the device. Each sub-node is identified using the node's name, with valid values listed for each of the PMICs below. 
-pm8005: +pm8004: s2, s5 pm8005: diff --git a/Documentation/devicetree/bindings/regulator/regulator.yaml b/Documentation/devicetree/bindings/regulator/regulator.yaml index 92ff2e8ad572..91a39a33000b 100644 --- a/Documentation/devicetree/bindings/regulator/regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/regulator.yaml @@ -191,7 +191,7 @@ patternProperties: examples: - | - xyzreg: regulator@0 { + xyzreg: regulator { regulator-min-microvolt = <1000000>; regulator-max-microvolt = <2500000>; regulator-always-on; diff --git a/Documentation/devicetree/bindings/regulator/vqmmc-ipq4019-regulator.yaml b/Documentation/devicetree/bindings/regulator/vqmmc-ipq4019-regulator.yaml new file mode 100644 index 000000000000..d1a79d2ffa1e --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/vqmmc-ipq4019-regulator.yaml @@ -0,0 +1,42 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/regulator/vqmmc-ipq4019-regulator.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm IPQ4019 VQMMC SD LDO regulator + +maintainers: + - Robert Marko <robert.marko@sartura.hr> + +description: | + Qualcomm IPQ4019 SoCs feature a built-in SD/eMMC controller. + To support both 1.8 V and 3 V I/O voltage levels, an LDO + controller is also embedded. + +allOf: + - $ref: "regulator.yaml#" + +properties: + compatible: + const: qcom,vqmmc-ipq4019-regulator + + reg: + maxItems: 1 + +required: + - compatible + - reg + +examples: + - | + regulator@1948000 { + compatible = "qcom,vqmmc-ipq4019-regulator"; + reg = <0x01948000 0x4>; + regulator-name = "vqmmc"; + regulator-min-microvolt = <1500000>; + regulator-max-microvolt = <3000000>; + regulator-always-on; + status = "disabled"; + }; +...
diff --git a/Documentation/devicetree/bindings/reset/intel,rcu-gw.yaml b/Documentation/devicetree/bindings/reset/intel,rcu-gw.yaml index 246dea8a2ec9..8ac437282659 100644 --- a/Documentation/devicetree/bindings/reset/intel,rcu-gw.yaml +++ b/Documentation/devicetree/bindings/reset/intel,rcu-gw.yaml @@ -23,7 +23,11 @@ properties: description: Global reset register offset and bit offset. allOf: - $ref: /schemas/types.yaml#/definitions/uint32-array - - maxItems: 2 + items: + - description: Register offset + - description: Register bit offset + minimum: 0 + maximum: 31 "#reset-cells": minimum: 2 diff --git a/Documentation/devicetree/bindings/reset/st,stm32mp1-rcc.txt b/Documentation/devicetree/bindings/reset/st,stm32mp1-rcc.txt index b4edaf7c7ff3..2880d5dda95e 100644 --- a/Documentation/devicetree/bindings/reset/st,stm32mp1-rcc.txt +++ b/Documentation/devicetree/bindings/reset/st,stm32mp1-rcc.txt @@ -3,4 +3,4 @@ STMicroelectronics STM32MP1 Peripheral Reset Controller The RCC IP is both a reset and a clock controller. -Please see Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.txt +Please see Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml diff --git a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt index 944743dd9212..c42b91e525fa 100644 --- a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt +++ b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt @@ -36,7 +36,7 @@ SAI subnodes required properties: - clock-names: Must contain "sai_ck". Must also contain "MCLK", if SAI shares a master clock, with a SAI set as MCLK clock provider. 
- - dmas: see Documentation/devicetree/bindings/dma/stm32-dma.txt + - dmas: see Documentation/devicetree/bindings/dma/st,stm32-dma.yaml - dma-names: identifier string for each DMA request line "tx": if sai sub-block is configured as playback DAI "rx": if sai sub-block is configured as capture DAI diff --git a/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt b/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt index 33826f2459fa..ca9101777c44 100644 --- a/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt +++ b/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt @@ -10,7 +10,7 @@ Required properties: - clock-names: must contain "kclk" - interrupts: cpu DAI interrupt line - dmas: DMA specifiers for audio data DMA and iec control flow DMA - See STM32 DMA bindings, Documentation/devicetree/bindings/dma/stm32-dma.txt + See STM32 DMA bindings, Documentation/devicetree/bindings/dma/st,stm32-dma.yaml - dma-names: two dmas have to be defined, "rx" and "rx-ctrl" Optional properties: diff --git a/Documentation/devicetree/bindings/spi/amlogic,meson-gx-spicc.yaml b/Documentation/devicetree/bindings/spi/amlogic,meson-gx-spicc.yaml index 49b617c98ae7..9147df29022a 100644 --- a/Documentation/devicetree/bindings/spi/amlogic,meson-gx-spicc.yaml +++ b/Documentation/devicetree/bindings/spi/amlogic,meson-gx-spicc.yaml @@ -22,6 +22,7 @@ properties: enum: - amlogic,meson-gx-spicc # SPICC controller on Amlogic GX and compatible SoCs - amlogic,meson-axg-spicc # SPICC controller on Amlogic AXG and compatible SoCs + - amlogic,meson-g12a-spicc # SPICC controller on Amlogic G12A and compatible SoCs interrupts: maxItems: 1 @@ -40,6 +41,27 @@ properties: items: - const: core +if: + properties: + compatible: + contains: + enum: + - amlogic,meson-g12a-spicc + +then: + properties: + clocks: + contains: + items: + - description: controller register bus clock + - description: baud rate generator and delay control clock + + clock-names: + minItems: 2 + items: + 
- const: core + - const: pclk + required: - compatible - reg diff --git a/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt b/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt index 2d3264140cc5..33bc58f4cf4b 100644 --- a/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt +++ b/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt @@ -10,7 +10,10 @@ Required properties: - "fsl,imx35-cspi" for SPI compatible with the one integrated on i.MX35 - "fsl,imx51-ecspi" for SPI compatible with the one integrated on i.MX51 - "fsl,imx53-ecspi" for SPI compatible with the one integrated on i.MX53 and later Soc - - "fsl,imx8mq-ecspi" for SPI compatible with the one integrated on i.MX8M + - "fsl,imx8mq-ecspi" for SPI compatible with the one integrated on i.MX8MQ + - "fsl,imx8mm-ecspi" for SPI compatible with the one integrated on i.MX8MM + - "fsl,imx8mn-ecspi" for SPI compatible with the one integrated on i.MX8MN + - "fsl,imx8mp-ecspi" for SPI compatible with the one integrated on i.MX8MP - reg : Offset and length of the register set for the device - interrupts : Should contain CSPI/eCSPI interrupt - clocks : Clock specifiers for both ipg and per clocks. 
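Note on the amlogic,meson-gx-spicc.yaml hunk above: the added if/then block appears to require two clocks, named "core" and "pclk", whenever the compatible contains "amlogic,meson-g12a-spicc". A minimal sketch of a node that would satisfy that condition follows. The unit address, register size, interrupt specifier and the &clk_core / &clk_pclk labels are hypothetical placeholders, not values taken from this patch or from any real board file.

```dts
/* Illustrative only: address, size, interrupt and clock labels are assumed. */
spi@13000 {
	compatible = "amlogic,meson-g12a-spicc";
	reg = <0x13000 0x44>;
	interrupts = <0 81 4>;

	/* Two clocks, in the order given by the schema's clock-names items. */
	clocks = <&clk_core>, <&clk_pclk>;
	clock-names = "core", "pclk";

	#address-cells = <1>;
	#size-cells = <0>;
};
```

For the older "amlogic,meson-gx-spicc" and "amlogic,meson-axg-spicc" compatibles, the unchanged part of the schema still lists only the single "core" clock name.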
diff --git a/Documentation/devicetree/bindings/spi/qca,ar934x-spi.yaml b/Documentation/devicetree/bindings/spi/qca,ar934x-spi.yaml new file mode 100644 index 000000000000..2aa766759d59 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/qca,ar934x-spi.yaml @@ -0,0 +1,41 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/qca,ar934x-spi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm Atheros AR934x/QCA95xx SoC SPI controller + +maintainers: + - Chuanhong Guo <gch981213@gmail.com> + +allOf: + - $ref: spi-controller.yaml# + +properties: + compatible: + const: qca,ar934x-spi + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + +required: + - compatible + - reg + - clocks + - '#address-cells' + - '#size-cells' + +examples: + - | + #include <dt-bindings/clock/ath79-clk.h> + spi: spi@1f000000 { + compatible = "qca,ar934x-spi"; + reg = <0x1f000000 0x1c>; + clocks = <&pll ATH79_CLK_AHB>; + #address-cells = <1>; + #size-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/spi/spi-controller.yaml b/Documentation/devicetree/bindings/spi/spi-controller.yaml index 1e0ca6ccf64b..d8e5509a7081 100644 --- a/Documentation/devicetree/bindings/spi/spi-controller.yaml +++ b/Documentation/devicetree/bindings/spi/spi-controller.yaml @@ -52,6 +52,12 @@ properties: description: The SPI controller acts as a slave, instead of a master. +oneOf: + - required: + - "#address-cells" + - required: + - spi-slave + patternProperties: "^slave$": type: object @@ -114,7 +120,7 @@ patternProperties: - enum: [ 1, 2, 4, 8 ] - default: 1 description: - Bus width to the SPI bus used for MISO. + Bus width to the SPI bus used for read transfers. spi-rx-delay-us: description: @@ -126,7 +132,7 @@ patternProperties: - enum: [ 1, 2, 4, 8 ] - default: 1 description: - Bus width to the SPI bus used for MOSI. + Bus width to the SPI bus used for write transfers. 
spi-tx-delay-us: description: diff --git a/Documentation/devicetree/bindings/spi/spi-fsl-dspi.txt b/Documentation/devicetree/bindings/spi/spi-fsl-dspi.txt index 162e024b95a0..30a79da9c039 100644 --- a/Documentation/devicetree/bindings/spi/spi-fsl-dspi.txt +++ b/Documentation/devicetree/bindings/spi/spi-fsl-dspi.txt @@ -1,12 +1,17 @@ ARM Freescale DSPI controller Required properties: -- compatible : "fsl,vf610-dspi", "fsl,ls1021a-v1.0-dspi", - "fsl,ls2085a-dspi" - or - "fsl,ls2080a-dspi" followed by "fsl,ls2085a-dspi" - "fsl,ls1012a-dspi" followed by "fsl,ls1021a-v1.0-dspi" - "fsl,ls1088a-dspi" followed by "fsl,ls1021a-v1.0-dspi" +- compatible : must be one of: + "fsl,vf610-dspi", + "fsl,ls1021a-v1.0-dspi", + "fsl,ls1012a-dspi" (optionally followed by "fsl,ls1021a-v1.0-dspi"), + "fsl,ls1028a-dspi", + "fsl,ls1043a-dspi" (optionally followed by "fsl,ls1021a-v1.0-dspi"), + "fsl,ls1046a-dspi" (optionally followed by "fsl,ls1021a-v1.0-dspi"), + "fsl,ls1088a-dspi" (optionally followed by "fsl,ls1021a-v1.0-dspi"), + "fsl,ls2080a-dspi" (optionally followed by "fsl,ls2085a-dspi"), + "fsl,ls2085a-dspi", + "fsl,lx2160a-dspi", - reg : Offset and length of the register set for the device - interrupts : Should contain SPI controller interrupt - clocks: from common clock binding: handle to dspi clock. @@ -14,11 +19,11 @@ Required properties: - pinctrl-0: pin control group to be used for this controller. - pinctrl-names: must contain a "default" entry. - spi-num-chipselects : the number of the chipselect signals. -- bus-num : the slave chip chipselect signal number. Optional property: - big-endian: If present the dspi device's registers are implemented in big endian mode. +- bus-num : the slave chip chipselect signal number. 
Optional SPI slave node properties: - fsl,spi-cs-sck-delay: a delay in nanoseconds between activating chip diff --git a/Documentation/devicetree/bindings/mtd/mtk-quadspi.txt b/Documentation/devicetree/bindings/spi/spi-mtk-nor.txt index a12e3b5c495d..984ae7fd4f94 100644 --- a/Documentation/devicetree/bindings/mtd/mtk-quadspi.txt +++ b/Documentation/devicetree/bindings/spi/spi-mtk-nor.txt @@ -1,4 +1,4 @@ -* Serial NOR flash controller for MediaTek SoCs +* Serial NOR flash controller for MediaTek ARM SoCs Required properties: - compatible: For mt8173, compatible should be "mediatek,mt8173-nor", @@ -13,6 +13,7 @@ Required properties: "mediatek,mt7629-nor", "mediatek,mt8173-nor" "mediatek,mt8173-nor" - reg: physical base address and length of the controller's register +- interrupts: Interrupt number used by the controller. - clocks: the phandle of the clocks needed by the nor controller - clock-names: the names of the clocks the clocks should be named "spi" and "sf". "spi" is used for spi bus, @@ -22,20 +23,16 @@ Required properties: - #address-cells: should be <1> - #size-cells: should be <0> -The SPI flash must be a child of the nor_flash node and must have a -compatible property. Also see jedec,spi-nor.txt. - -Required properties: -- compatible: May include a device-specific string consisting of the manufacturer - and name of the chip. Must also include "jedec,spi-nor" for any - SPI NOR flash that can be identified by the JEDEC READ ID opcode (0x9F). -- reg : Chip-Select number +There should be only one spi slave device following generic spi bindings. +It's not recommended to use this controller for devices other than SPI NOR +flash due to limited transfer capability of this controller. 
Example: nor_flash: spi@1100d000 { compatible = "mediatek,mt8173-nor"; reg = <0 0x1100d000 0 0xe0>; + interrupts = <&spi_flash_irq>; clocks = <&pericfg CLK_PERI_SPI>, <&topckgen CLK_TOP_SPINFI_IFR_SEL>; clock-names = "spi", "sf"; diff --git a/Documentation/devicetree/bindings/spi/spi-mux.yaml b/Documentation/devicetree/bindings/spi/spi-mux.yaml new file mode 100644 index 000000000000..0ae692dc28b5 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/spi-mux.yaml @@ -0,0 +1,89 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/spi-mux.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Generic SPI Multiplexer + +description: | + This binding describes a SPI bus multiplexer to route the SPI chip select + signals. This can be used when you need more devices than the SPI controller + has chip selects available. An example setup is shown in ASCII art; the actual + setting of the multiplexer to a channel needs to be done by a specific SPI mux + driver. + + MOSI /--------------------------------+--------+--------+--------\ + MISO |/------------------------------+|-------+|-------+|-------\| + SCL ||/----------------------------+||------+||------+||------\|| + ||| ||| ||| ||| ||| + +------------+ ||| ||| ||| ||| + | SoC ||| | +-+++-+ +-+++-+ +-+++-+ +-+++-+ + | ||| | | dev | | dev | | dev | | dev | + | +--+++-+ | CS-X +------+\ +--+--+ +--+--+ +--+--+ +--+--+ + | | SPI +-|-------+ Mux |\\ CS-0 | | | | + | +------+ | +--+---+\\\-------/ CS-1 | | | + | | | \\\----------------/ CS-2 | | + | +------+ | | \\-------------------------/ CS-3 | + | | ? 
+-|----------/ \----------------------------------/ + | +------+ | + +------------+ + +allOf: + - $ref: "/schemas/spi/spi-controller.yaml#" + +maintainers: + - Chris Packham <chris.packham@alliedtelesis.co.nz> + +properties: + compatible: + const: spi-mux + + mux-controls: + maxItems: 1 + +required: + - compatible + - reg + - spi-max-frequency + - mux-controls + +examples: + - | + #include <dt-bindings/gpio/gpio.h> + mux: mux-controller { + compatible = "gpio-mux"; + #mux-control-cells = <0>; + + mux-gpios = <&gpio0 3 GPIO_ACTIVE_HIGH>; + }; + + spi { + #address-cells = <1>; + #size-cells = <0>; + spi@0 { + compatible = "spi-mux"; + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + spi-max-frequency = <100000000>; + + mux-controls = <&mux>; + + spi-flash@0 { + compatible = "jedec,spi-nor"; + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + spi-max-frequency = <40000000>; + }; + + spi-device@1 { + compatible = "lineartechnology,ltc2488"; + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + spi-max-frequency = <10000000>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/spi/spi-nxp-fspi.txt b/Documentation/devicetree/bindings/spi/spi-nxp-fspi.txt index 2cd67eb727d4..7ac60d9fe357 100644 --- a/Documentation/devicetree/bindings/spi/spi-nxp-fspi.txt +++ b/Documentation/devicetree/bindings/spi/spi-nxp-fspi.txt @@ -2,6 +2,9 @@ Required properties: - compatible : Should be "nxp,lx2160a-fspi" + "nxp,imx8qxp-fspi" + "nxp,imx8mm-fspi" + - reg : First contains the register location and length, Second contains the memory mapping address and length - reg-names : Should contain the resource reg names: diff --git a/Documentation/devicetree/bindings/spi/spi-rockchip.txt b/Documentation/devicetree/bindings/spi/spi-rockchip.txt deleted file mode 100644 index a0edac12d8df..000000000000 --- a/Documentation/devicetree/bindings/spi/spi-rockchip.txt +++ /dev/null @@ -1,58 +0,0 @@ -* Rockchip SPI Controller - -The Rockchip SPI controller is used to 
interface with various devices such as flash -and display controllers using the SPI communication interface. - -Required Properties: - -- compatible: should be one of the following. - "rockchip,rv1108-spi" for rv1108 SoCs. - "rockchip,px30-spi", "rockchip,rk3066-spi" for px30 SoCs. - "rockchip,rk3036-spi" for rk3036 SoCS. - "rockchip,rk3066-spi" for rk3066 SoCs. - "rockchip,rk3188-spi" for rk3188 SoCs. - "rockchip,rk3228-spi" for rk3228 SoCS. - "rockchip,rk3288-spi" for rk3288 SoCs. - "rockchip,rk3368-spi" for rk3368 SoCs. - "rockchip,rk3399-spi" for rk3399 SoCs. -- reg: physical base address of the controller and length of memory mapped - region. -- interrupts: The interrupt number to the cpu. The interrupt specifier format - depends on the interrupt controller. -- clocks: Must contain an entry for each entry in clock-names. -- clock-names: Shall be "spiclk" for the transfer-clock, and "apb_pclk" for - the peripheral clock. -- #address-cells: should be 1. -- #size-cells: should be 0. - -Optional Properties: - -- dmas: DMA specifiers for tx and rx dma. See the DMA client binding, - Documentation/devicetree/bindings/dma/dma.txt -- dma-names: DMA request names should include "tx" and "rx" if present. -- rx-sample-delay-ns: nanoseconds to delay after the SCLK edge before sampling - Rx data (may need to be fine tuned for high capacitance lines). - No delay (0) by default. -- pinctrl-names: Names for the pin configuration(s); may be "default" or - "sleep", where the "sleep" configuration may describe the state - the pins should be in during system suspend. See also - pinctrl/pinctrl-bindings.txt. 
- - -Example: - - spi0: spi@ff110000 { - compatible = "rockchip,rk3066-spi"; - reg = <0xff110000 0x1000>; - dmas = <&pdma1 11>, <&pdma1 12>; - dma-names = "tx", "rx"; - rx-sample-delay-ns = <10>; - #address-cells = <1>; - #size-cells = <0>; - interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&cru SCLK_SPI0>, <&cru PCLK_SPI0>; - clock-names = "spiclk", "apb_pclk"; - pinctrl-0 = <&spi1_pins>; - pinctrl-1 = <&spi1_sleep>; - pinctrl-names = "default", "sleep"; - }; diff --git a/Documentation/devicetree/bindings/spi/spi-rockchip.yaml b/Documentation/devicetree/bindings/spi/spi-rockchip.yaml new file mode 100644 index 000000000000..81ad4b761502 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/spi-rockchip.yaml @@ -0,0 +1,107 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/spi-rockchip.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Rockchip SPI Controller + +description: + The Rockchip SPI controller is used to interface with various devices such + as flash and display controllers using the SPI communication interface. 
+ +allOf: + - $ref: "spi-controller.yaml#" + +maintainers: + - Heiko Stuebner <heiko@sntech.de> + +# Everything else is described in the common file +properties: + compatible: + oneOf: + - const: rockchip,rk3036-spi + - const: rockchip,rk3066-spi + - const: rockchip,rk3228-spi + - const: rockchip,rv1108-spi + - items: + - enum: + - rockchip,px30-spi + - rockchip,rk3188-spi + - rockchip,rk3288-spi + - rockchip,rk3308-spi + - rockchip,rk3328-spi + - rockchip,rk3368-spi + - rockchip,rk3399-spi + - const: rockchip,rk3066-spi + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: transfer-clock + - description: peripheral clock + + clock-names: + items: + - const: spiclk + - const: apb_pclk + + dmas: + items: + - description: TX DMA Channel + - description: RX DMA Channel + + dma-names: + items: + - const: tx + - const: rx + + rx-sample-delay-ns: + default: 0 + description: + Nano seconds to delay after the SCLK edge before sampling Rx data + (may need to be fine tuned for high capacitance lines). + If not specified 0 will be used. + + pinctrl-names: + minItems: 1 + items: + - const: default + - const: sleep + description: + Names for the pin configuration(s); may be "default" or "sleep", + where the "sleep" configuration may describe the state + the pins should be in during system suspend. 
+ +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +examples: + - | + #include <dt-bindings/clock/rk3188-cru-common.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/interrupt-controller/irq.h> + spi0: spi@ff110000 { + compatible = "rockchip,rk3066-spi"; + reg = <0xff110000 0x1000>; + interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cru SCLK_SPI0>, <&cru PCLK_SPI0>; + clock-names = "spiclk", "apb_pclk"; + dmas = <&pdma1 11>, <&pdma1 12>; + dma-names = "tx", "rx"; + pinctrl-0 = <&spi1_pins>; + pinctrl-1 = <&spi1_sleep>; + pinctrl-names = "default", "sleep"; + rx-sample-delay-ns = <10>; + #address-cells = <1>; + #size-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml b/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml index f0d979664f07..e49ecbf715ba 100644 --- a/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml +++ b/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml @@ -49,7 +49,7 @@ properties: dmas: description: | DMA specifiers for tx and rx dma. DMA fifo mode must be used. See - the STM32 DMA bindings Documentation/devicetree/bindings/dma/stm32-dma.txt. + the STM32 DMA bindings Documentation/devicetree/bindings/dma/st,stm32-dma.yaml. 
items: - description: rx DMA channel - description: tx DMA channel diff --git a/Documentation/devicetree/bindings/sram/allwinner,sun4i-a10-system-control.yaml b/Documentation/devicetree/bindings/sram/allwinner,sun4i-a10-system-control.yaml index 80bac7a182d5..4b5509436588 100644 --- a/Documentation/devicetree/bindings/sram/allwinner,sun4i-a10-system-control.yaml +++ b/Documentation/devicetree/bindings/sram/allwinner,sun4i-a10-system-control.yaml @@ -125,7 +125,7 @@ examples: #size-cells = <1>; ranges; - sram_a: sram@00000000 { + sram_a: sram@0 { compatible = "mmio-sram"; reg = <0x00000000 0xc000>; #address-cells = <1>; diff --git a/Documentation/devicetree/bindings/thermal/brcm,avs-ro-thermal.yaml b/Documentation/devicetree/bindings/thermal/brcm,avs-ro-thermal.yaml index d9fdf4809a49..f3e68ed03abf 100644 --- a/Documentation/devicetree/bindings/thermal/brcm,avs-ro-thermal.yaml +++ b/Documentation/devicetree/bindings/thermal/brcm,avs-ro-thermal.yaml @@ -17,7 +17,7 @@ description: |+ "brcm,bcm2711-avs-monitor", "syscon", "simple-mfd" Refer to the the bindings described in - Documentation/devicetree/bindings/mfd/syscon.txt + Documentation/devicetree/bindings/mfd/syscon.yaml properties: compatible: diff --git a/Documentation/devicetree/bindings/timer/allwinner,sun4i-a10-timer.yaml b/Documentation/devicetree/bindings/timer/allwinner,sun4i-a10-timer.yaml index 23e989e09766..d918cee100ac 100644 --- a/Documentation/devicetree/bindings/timer/allwinner,sun4i-a10-timer.yaml +++ b/Documentation/devicetree/bindings/timer/allwinner,sun4i-a10-timer.yaml @@ -87,7 +87,7 @@ additionalProperties: false examples: - | - timer { + timer@1c20c00 { compatible = "allwinner,sun4i-a10-timer"; reg = <0x01c20c00 0x400>; interrupts = <22>, diff --git a/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt b/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt index 195792270414..3cb2f4c98d64 100644 --- a/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt +++ 
b/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt @@ -11,6 +11,7 @@ Required properties: "moxa,moxart-timer", "faraday,fttmr010" "aspeed,ast2400-timer" "aspeed,ast2500-timer" + "aspeed,ast2600-timer" - reg : Should contain registers location and length - interrupts : Should contain the three timer interrupts usually with diff --git a/Documentation/devicetree/bindings/timer/ingenic,tcu.txt b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt index 0b63cebc5f45..91f704951845 100644 --- a/Documentation/devicetree/bindings/timer/ingenic,tcu.txt +++ b/Documentation/devicetree/bindings/timer/ingenic,tcu.txt @@ -10,6 +10,7 @@ Required properties: * ingenic,jz4740-tcu * ingenic,jz4725b-tcu * ingenic,jz4770-tcu + * ingenic,x1000-tcu followed by "simple-mfd". - reg: Should be the offset/length value corresponding to the TCU registers - clocks: List of phandle & clock specifiers for clocks external to the TCU. diff --git a/Documentation/devicetree/bindings/trivial-devices.yaml b/Documentation/devicetree/bindings/trivial-devices.yaml index 978de7d37c66..330cab25cc92 100644 --- a/Documentation/devicetree/bindings/trivial-devices.yaml +++ b/Documentation/devicetree/bindings/trivial-devices.yaml @@ -34,14 +34,6 @@ properties: - adi,adt7461 # +/-1C TDM Extended Temp Range I.C - adt7461 - # +/-1C TDM Extended Temp Range I.C - - adi,adt7473 - # +/-1C TDM Extended Temp Range I.C - - adi,adt7475 - # +/-1C TDM Extended Temp Range I.C - - adi,adt7476 - # +/-1C TDM Extended Temp Range I.C - - adi,adt7490 # Three-Axis Digital Accelerometer - adi,adxl345 # Three-Axis Digital Accelerometer (backward-compatibility value "adi,adxl345" must be listed too) @@ -350,6 +342,8 @@ properties: - ti,ads7830 # Temperature Monitoring and Fan Control - ti,amc6821 + # Temperature sensor with 2-wire interface + - ti,lm73 # Temperature sensor with integrated fan control - ti,lm96000 # I2C Touch-Screen Controller diff --git 
a/Documentation/devicetree/bindings/usb/amlogic,meson-g12a-usb-ctrl.yaml b/Documentation/devicetree/bindings/usb/amlogic,meson-g12a-usb-ctrl.yaml index 267fce165994..b0e5e0fe9386 100644 --- a/Documentation/devicetree/bindings/usb/amlogic,meson-g12a-usb-ctrl.yaml +++ b/Documentation/devicetree/bindings/usb/amlogic,meson-g12a-usb-ctrl.yaml @@ -22,10 +22,14 @@ description: | The DWC3 Glue controls the PHY routing and power, an interrupt line is connected to the Glue to serve as OTG ID change detection. + The Amlogic A1 embeds a DWC3 USB IP Core configured for USB2 in + host-only mode. + properties: compatible: enum: - amlogic,meson-g12a-usb-ctrl + - amlogic,meson-a1-usb-ctrl ranges: true @@ -84,6 +88,25 @@ required: - phys - dr_mode +allOf: + - if: + properties: + compatible: + enum: + - amlogic,meson-a1-usb-ctrl + + then: + properties: + clocks: + minItems: 3 + clock-names: + items: + - const: usb_ctrl + - const: usb_bus + - const: xtal_usb_ctrl + required: + - clock-names + examples: - | usb: usb@ffe09000 { diff --git a/Documentation/devicetree/bindings/usb/aspeed,usb-vhub.yaml b/Documentation/devicetree/bindings/usb/aspeed,usb-vhub.yaml new file mode 100644 index 000000000000..06399ba0d9e4 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/aspeed,usb-vhub.yaml @@ -0,0 +1,77 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +# Copyright (c) 2020 Facebook Inc. +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/aspeed,usb-vhub.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: ASPEED USB 2.0 Virtual Hub Controller + +maintainers: + - Benjamin Herrenschmidt <benh@kernel.crashing.org> + +description: |+ + The ASPEED USB 2.0 Virtual Hub Controller implements 1 set of USB Hub + register and several sets of Device and Endpoint registers to support + the Virtual Hub's downstream USB devices. + + Supported number of devices and endpoints vary depending on hardware + revisions. 
AST2400 and AST2500 Virtual Hub supports 5 downstream devices + and 15 generic endpoints, while AST2600 Virtual Hub supports 7 downstream + devices and 21 generic endpoints. + +properties: + compatible: + enum: + - aspeed,ast2400-usb-vhub + - aspeed,ast2500-usb-vhub + - aspeed,ast2600-usb-vhub + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + interrupts: + maxItems: 1 + + aspeed,vhub-downstream-ports: + description: Number of downstream ports supported by the Virtual Hub + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - default: 5 + minimum: 1 + maximum: 7 + + aspeed,vhub-generic-endpoints: + description: Number of generic endpoints supported by the Virtual Hub + allOf: + - $ref: /schemas/types.yaml#/definitions/uint32 + - default: 15 + minimum: 1 + maximum: 21 + +required: + - compatible + - reg + - clocks + - interrupts + - aspeed,vhub-downstream-ports + - aspeed,vhub-generic-endpoints + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/aspeed-clock.h> + vhub: usb-vhub@1e6a0000 { + compatible = "aspeed,ast2500-usb-vhub"; + reg = <0x1e6a0000 0x300>; + interrupts = <5>; + clocks = <&syscon ASPEED_CLK_GATE_USBPORT1CLK>; + aspeed,vhub-downstream-ports = <5>; + aspeed,vhub-generic-endpoints = <15>; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_usb2ad_default>; + }; diff --git a/Documentation/devicetree/bindings/usb/dwc2.yaml b/Documentation/devicetree/bindings/usb/dwc2.yaml index 71cf7ba32237..6baf00e7d0a9 100644 --- a/Documentation/devicetree/bindings/usb/dwc2.yaml +++ b/Documentation/devicetree/bindings/usb/dwc2.yaml @@ -18,27 +18,15 @@ properties: - const: rockchip,rk3066-usb - const: snps,dwc2 - items: - - const: rockchip,px30-usb - - const: rockchip,rk3066-usb - - const: snps,dwc2 - - items: - - const: rockchip,rk3036-usb - - const: rockchip,rk3066-usb - - const: snps,dwc2 - - items: - - const: rockchip,rv1108-usb - - const: rockchip,rk3066-usb - - const: snps,dwc2 - - items: - - const: rockchip,rk3188-usb - - 
const: rockchip,rk3066-usb - - const: snps,dwc2 - - items: - - const: rockchip,rk3228-usb - - const: rockchip,rk3066-usb - - const: snps,dwc2 - - items: - - const: rockchip,rk3288-usb + - enum: + - rockchip,px30-usb + - rockchip,rk3036-usb + - rockchip,rk3188-usb + - rockchip,rk3228-usb + - rockchip,rk3288-usb + - rockchip,rk3328-usb + - rockchip,rk3368-usb + - rockchip,rv1108-usb - const: rockchip,rk3066-usb - const: snps,dwc2 - const: lantiq,arx100-usb diff --git a/Documentation/devicetree/bindings/usb/dwc3.txt b/Documentation/devicetree/bindings/usb/dwc3.txt index 66780a47ad85..9946ff9ba735 100644 --- a/Documentation/devicetree/bindings/usb/dwc3.txt +++ b/Documentation/devicetree/bindings/usb/dwc3.txt @@ -7,7 +7,8 @@ Required properties: - compatible: must be "snps,dwc3" - reg : Address and length of the register set for the device - interrupts: Interrupts used by the dwc3 controller. - - clock-names: should contain "ref", "bus_early", "suspend" + - clock-names: list of clock names. Ideally should be "ref", + "bus_early", "suspend" but may be less or more. - clocks: list of phandle and clock specifier pairs corresponding to entries in the clock-names property. @@ -36,7 +37,7 @@ Optional properties: - phys: from the *Generic PHY* bindings - phy-names: from the *Generic PHY* bindings; supported names are "usb2-phy" or "usb3-phy". - - resets: a single pair of phandle and reset specifier + - resets: set of phandle and reset specifier pairs - snps,usb2-lpm-disable: indicate if we don't want to enable USB2 HW LPM - snps,usb3_lpm_capable: determines if platform is USB3 LPM capable - snps,dis-start-transfer-quirk: when set, disable isoc START TRANSFER command @@ -75,6 +76,8 @@ Optional properties: from P0 to P1/P2/P3 without delay. - snps,dis-tx-ipgap-linecheck-quirk: when set, disable u2mac linestate check during HS transmit. + - snps,parkmode-disable-ss-quirk: when set, all SuperSpeed bus instances in + park mode are disabled. 
- snps,dis_metastability_quirk: when set, disable metastability workaround. CAUTION: use only if you are absolutely sure of it. - snps,is-utmi-l1-suspend: true when DWC3 asserts output signal diff --git a/Documentation/devicetree/bindings/usb/generic.txt b/Documentation/devicetree/bindings/usb/generic.txt index e6790d2a4da9..67c51759a642 100644 --- a/Documentation/devicetree/bindings/usb/generic.txt +++ b/Documentation/devicetree/bindings/usb/generic.txt @@ -35,6 +35,12 @@ Optional properties: the USB data role (USB host or USB device) for a given USB connector, such as Type-C, Type-B(micro). see connector/usb-connector.txt. + - role-switch-default-mode: indicating if usb-role-switch is enabled, the + device default operation mode of controller while usb + role is USB_ROLE_NONE. Valid arguments are "host" and + "peripheral". Defaults to "peripheral" if not + specified. + This is an attribute to a USB controller such as: diff --git a/Documentation/devicetree/bindings/usb/ingenic,jz4740-musb.txt b/Documentation/devicetree/bindings/usb/ingenic,jz4740-musb.txt deleted file mode 100644 index 16808721f3ff..000000000000 --- a/Documentation/devicetree/bindings/usb/ingenic,jz4740-musb.txt +++ /dev/null @@ -1,32 +0,0 @@ -Ingenic JZ4740 MUSB driver - -Required properties: - -- compatible: Must be "ingenic,jz4740-musb" -- reg: Address range of the UDC register set -- interrupts: IRQ number related to the UDC hardware -- interrupt-names: must be "mc" -- clocks: phandle to the "udc" clock -- clock-names: must be "udc" -- phys: phandle to the USB PHY - -Example: - -usb_phy: usb-phy@0 { - compatible = "usb-nop-xceiv"; - #phy-cells = <0>; -}; - -udc: usb@13040000 { - compatible = "ingenic,jz4740-musb"; - reg = <0x13040000 0x10000>; - - interrupt-parent = <&intc>; - interrupts = <24>; - interrupt-names = "mc"; - - clocks = <&cgu JZ4740_CLK_UDC>; - clock-names = "udc"; - - phys = <&usb_phy>; -}; diff --git a/Documentation/devicetree/bindings/usb/ingenic,jz4770-phy.yaml 
b/Documentation/devicetree/bindings/usb/ingenic,jz4770-phy.yaml new file mode 100644 index 000000000000..a81b0b1a2226 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ingenic,jz4770-phy.yaml @@ -0,0 +1,52 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/ingenic,jz4770-phy.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Ingenic JZ4770 USB PHY devicetree bindings + +maintainers: + - Paul Cercueil <paul@crapouillou.net> + +properties: + $nodename: + pattern: '^usb-phy@.*' + + compatible: + enum: + - ingenic,jz4770-phy + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + vcc-supply: + description: VCC power supply + + '#phy-cells': + const: 0 + +required: + - compatible + - reg + - clocks + - vcc-supply + - '#phy-cells' + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/jz4770-cgu.h> + otg_phy: usb-phy@3c { + compatible = "ingenic,jz4770-phy"; + reg = <0x3c 0x10>; + + vcc-supply = <&vcc>; + clocks = <&cgu JZ4770_CLK_OTG_PHY>; + + #phy-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/usb/ingenic,musb.yaml b/Documentation/devicetree/bindings/usb/ingenic,musb.yaml new file mode 100644 index 000000000000..1d6877875077 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/ingenic,musb.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/ingenic,musb.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Ingenic JZ47xx USB IP DT bindings + +maintainers: + - Paul Cercueil <paul@crapouillou.net> + +properties: + $nodename: + pattern: '^usb@.*' + + compatible: + oneOf: + - enum: + - ingenic,jz4770-musb + - ingenic,jz4740-musb + - items: + - const: ingenic,jz4725b-musb + - const: ingenic,jz4740-musb + + reg: + maxItems: 1 + + clocks: + maxItems: 1 + + clock-names: + items: + - const: udc + + interrupts: + maxItems: 1 + + 
interrupt-names: + items: + - const: mc + + phys: + description: PHY specifier for the USB PHY + +required: + - compatible + - reg + - clocks + - clock-names + - interrupts + - interrupt-names + - phys + +additionalProperties: false + +examples: + - | + #include <dt-bindings/clock/jz4740-cgu.h> + usb_phy: usb-phy@0 { + compatible = "usb-nop-xceiv"; + #phy-cells = <0>; + }; + + udc: usb@13040000 { + compatible = "ingenic,jz4740-musb"; + reg = <0x13040000 0x10000>; + + interrupt-parent = <&intc>; + interrupts = <24>; + interrupt-names = "mc"; + + clocks = <&cgu JZ4740_CLK_UDC>; + clock-names = "udc"; + + phys = <&usb_phy>; + }; diff --git a/Documentation/devicetree/bindings/usb/maxim,max3420-udc.yaml b/Documentation/devicetree/bindings/usb/maxim,max3420-udc.yaml new file mode 100644 index 000000000000..4241d38d5864 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/maxim,max3420-udc.yaml @@ -0,0 +1,69 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/usb/maxim,max3420-udc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MAXIM MAX3420/1 USB Peripheral Controller + +maintainers: + - Jassi Brar <jaswinder.singh@linaro.org> + +description: | + The controller provices USB2.0 compliant FullSpeed peripheral + implementation over the SPI interface. 
+ + Specifications about the part can be found at: + http://datasheets.maximintegrated.com/en/ds/MAX3420E.pdf + +properties: + compatible: + enum: + - maxim,max3420-udc + - maxim,max3421-udc + + reg: + maxItems: 1 + + interrupts: + items: + - description: usb irq from max3420 + - description: vbus detection irq + minItems: 1 + maxItems: 2 + + interrupt-names: + items: + - const: udc + - const: vbus + minItems: 1 + maxItems: 2 + + spi-max-frequency: + maximum: 26000000 + +required: + - compatible + - reg + - interrupts + - interrupt-names + +additionalProperties: false + +examples: + - | + #include <dt-bindings/gpio/gpio.h> + #include <dt-bindings/interrupt-controller/irq.h> + spi0 { + #address-cells = <1>; + #size-cells = <0>; + + udc@0 { + compatible = "maxim,max3420-udc"; + reg = <0>; + interrupt-parent = <&gpio>; + interrupts = <0 IRQ_TYPE_EDGE_FALLING>, <10 IRQ_TYPE_EDGE_BOTH>; + interrupt-names = "udc", "vbus"; + spi-max-frequency = <12500000>; + }; + }; diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml index 9e67944bec9c..fba343fa0205 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.yaml +++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml @@ -205,6 +205,8 @@ patternProperties: description: Colorful GRP, Shenzhen Xueyushi Technology Ltd. "^compulab,.*": description: CompuLab Ltd. + "^coreriver,.*": + description: CORERIVER Semiconductor Co.,Ltd. "^corpro,.*": description: Chengdu Corpro Technology Co., Ltd. "^cortina,.*": @@ -267,6 +269,8 @@ patternProperties: description: Dragino Technology Co., Limited "^dserve,.*": description: dServe Technology B.V. 
+ "^dynaimage,.*": + description: Dyna-Image "^ea,.*": description: Embedded Artists AB "^ebs-systart,.*": diff --git a/Documentation/driver-api/80211/mac80211-advanced.rst b/Documentation/driver-api/80211/mac80211-advanced.rst index 9f1c5bb7ac35..24cb64b3b715 100644 --- a/Documentation/driver-api/80211/mac80211-advanced.rst +++ b/Documentation/driver-api/80211/mac80211-advanced.rst @@ -272,8 +272,8 @@ STA information lifetime rules .. kernel-doc:: net/mac80211/sta_info.c :doc: STA information lifetime rules -Aggregation -=========== +Aggregation Functions +===================== .. kernel-doc:: net/mac80211/sta_info.h :functions: sta_ampdu_mlme @@ -284,8 +284,8 @@ Aggregation .. kernel-doc:: net/mac80211/sta_info.h :functions: tid_ampdu_rx -Synchronisation -=============== +Synchronisation Functions +========================= TBD diff --git a/Documentation/driver-api/dmaengine/client.rst b/Documentation/driver-api/dmaengine/client.rst index e5953e7e4bf4..2104830a99ae 100644 --- a/Documentation/driver-api/dmaengine/client.rst +++ b/Documentation/driver-api/dmaengine/client.rst @@ -151,8 +151,8 @@ The details of these operations are: Note that callbacks will always be invoked from the DMA engines tasklet, never from interrupt context. -Optional: per descriptor metadata ---------------------------------- + **Optional: per descriptor metadata** + DMAengine provides two ways for metadata support. DESC_METADATA_CLIENT @@ -199,12 +199,15 @@ Optional: per descriptor metadata DESC_METADATA_CLIENT - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) construct the metadata in the client's buffer 2. use dmaengine_desc_attach_metadata() to attach the buffer to the descriptor 3. submit the transfer + - DMA_DEV_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) 2. 
use dmaengine_desc_attach_metadata() to attach the buffer to the descriptor @@ -215,6 +218,7 @@ Optional: per descriptor metadata DESC_METADATA_ENGINE - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) 2. use dmaengine_desc_get_metadata_ptr() to get the pointer to the engine's metadata area @@ -222,7 +226,9 @@ Optional: per descriptor metadata 4. use dmaengine_desc_set_metadata_len() to tell the DMA engine the amount of data the client has placed into the metadata buffer 5. submit the transfer + - DMA_DEV_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) 2. submit the transfer 3. on transfer completion, use dmaengine_desc_get_metadata_ptr() to get @@ -278,8 +284,8 @@ Optional: per descriptor metadata void dma_async_issue_pending(struct dma_chan *chan); -Further APIs: -------------- +Further APIs +------------ 1. Terminate APIs diff --git a/Documentation/driver-api/dmaengine/index.rst b/Documentation/driver-api/dmaengine/index.rst index b9df904d0a79..bdc45d8b4cfb 100644 --- a/Documentation/driver-api/dmaengine/index.rst +++ b/Documentation/driver-api/dmaengine/index.rst @@ -5,8 +5,8 @@ DMAEngine documentation DMAEngine documentation provides documents for various aspects of DMAEngine framework. -DMAEngine documentation ------------------------ +DMAEngine development documentation +----------------------------------- This book helps with DMAengine internal APIs and guide for DMAEngine device driver writers. diff --git a/Documentation/driver-api/dmaengine/provider.rst b/Documentation/driver-api/dmaengine/provider.rst index 790a15089f1f..56e5833e8a07 100644 --- a/Documentation/driver-api/dmaengine/provider.rst +++ b/Documentation/driver-api/dmaengine/provider.rst @@ -266,11 +266,15 @@ to use. attached (via the dmaengine_desc_attach_metadata() helper to the descriptor. 
From the DMA driver the following is expected for this mode: + - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM + The data from the provided metadata buffer should be prepared for the DMA controller to be sent alongside of the payload data. Either by copying to a hardware descriptor, or highly coupled packet. + - DMA_DEV_TO_MEM + On transfer completion the DMA driver must copy the metadata to the client provided metadata buffer before notifying the client about the completion. After the transfer completion, DMA drivers must not touch the metadata @@ -284,10 +288,14 @@ to use. and dmaengine_desc_set_metadata_len() is provided as helper functions. From the DMA driver the following is expected for this mode: - - get_metadata_ptr + + - get_metadata_ptr() + Should return a pointer for the metadata buffer, the maximum size of the metadata buffer and the currently used / valid (if any) bytes in the buffer. - - set_metadata_len + + - set_metadata_len() + It is called by the clients after it have placed the metadata to the buffer to let the DMA driver know the number of valid bytes provided. diff --git a/Documentation/driver-api/driver-model/driver.rst b/Documentation/driver-api/driver-model/driver.rst index baa6a85c8287..63887b813005 100644 --- a/Documentation/driver-api/driver-model/driver.rst +++ b/Documentation/driver-api/driver-model/driver.rst @@ -210,7 +210,7 @@ probed. While the typical use case for sync_state() is to have the kernel cleanly take over management of devices from the bootloader, the usage of sync_state() is not restricted to that. Use it whenever it makes sense to take an action after -all the consumers of a device have probed. +all the consumers of a device have probed:: int (*remove) (struct device *dev); diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst new file mode 100644 index 000000000000..4fe8abba9fc6 --- /dev/null +++ b/Documentation/driver-api/firmware/efi/index.rst @@ -0,0 +1,11 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============ +UEFI Support +============ + +UEFI stub library functions +=========================== + +.. kernel-doc:: drivers/firmware/efi/libstub/mem.c + :internal: diff --git a/Documentation/driver-api/firmware/fallback-mechanisms.rst b/Documentation/driver-api/firmware/fallback-mechanisms.rst index 8b041d0ab426..036383dad6d6 100644 --- a/Documentation/driver-api/firmware/fallback-mechanisms.rst +++ b/Documentation/driver-api/firmware/fallback-mechanisms.rst @@ -202,3 +202,106 @@ the following file: If you echo 0 into it means MAX_JIFFY_OFFSET will be used. The data type for the timeout is an int. + +EFI embedded firmware fallback mechanism +======================================== + +On some devices the system's EFI code / ROM may contain an embedded copy +of firmware for some of the system's integrated peripheral devices and +the peripheral's Linux device-driver needs to access this firmware. + +Device drivers which need such firmware can use the +firmware_request_platform() function for this, note that this is a +separate fallback mechanism from the other fallback mechanisms and +this does not use the sysfs interface. + +A device driver which needs this can describe the firmware it needs +using an efi_embedded_fw_desc struct: + +.. kernel-doc:: include/linux/efi_embedded_fw.h + :functions: efi_embedded_fw_desc + +The EFI embedded-fw code works by scanning all EFI_BOOT_SERVICES_CODE memory +segments for an eight byte sequence matching prefix; if the prefix is found it +then does a sha256 over length bytes and if that matches makes a copy of length +bytes and adds that to its list with found firmwares. + +To avoid doing this somewhat expensive scan on all systems, dmi matching is +used. Drivers are expected to export a dmi_system_id array, with each entries' +driver_data pointing to an efi_embedded_fw_desc. + +To register this array with the efi-embedded-fw code, a driver needs to: + +1. 
Always be builtin to the kernel or store the dmi_system_id array in a + separate object file which always gets builtin. + +2. Add an extern declaration for the dmi_system_id array to + include/linux/efi_embedded_fw.h. + +3. Add the dmi_system_id array to the embedded_fw_table in + drivers/firmware/efi/embedded-firmware.c wrapped in a #ifdef testing that + the driver is being builtin. + +4. Add "select EFI_EMBEDDED_FIRMWARE if EFI_STUB" to its Kconfig entry. + +The firmware_request_platform() function will always first try to load firmware +with the specified name directly from the disk, so the EFI embedded-fw can +always be overridden by placing a file under /lib/firmware. + +Note that: + +1. The code scanning for EFI embedded-firmware runs near the end + of start_kernel(), just before calling rest_init(). For normal drivers and + subsystems using subsys_initcall() to register themselves this does not + matter. This means that code running earlier cannot use EFI + embedded-firmware. + +2. At the moment the EFI embedded-fw code assumes that firmwares always start at + an offset which is a multiple of 8 bytes, if this is not true for your case + send in a patch to fix this. + +3. At the moment the EFI embedded-fw code only works on x86 because other archs + free EFI_BOOT_SERVICES_CODE before the EFI embedded-fw code gets a chance to + scan it. + +4. The current brute-force scanning of EFI_BOOT_SERVICES_CODE is an ad-hoc + brute-force solution. There has been discussion to use the UEFI Platform + Initialization (PI) spec's Firmware Volume protocol. This has been rejected + because the FV Protocol relies on *internal* interfaces of the PI spec, and: + 1. The PI spec does not define peripheral firmware at all + 2. The internal interfaces of the PI spec do not guarantee any backward + compatibility. Any implementation details in FV may be subject to change, + and may vary system to system. Supporting the FV Protocol would be + difficult as it is purposely ambiguous. 
+ +Example how to check for and extract embedded firmware +------------------------------------------------------ + +To check for, for example Silead touchscreen controller embedded firmware, +do the following: + +1. Boot the system with efi=debug on the kernel commandline + +2. cp /sys/kernel/debug/efi/boot_services_code? to your home dir + +3. Open the boot_services_code? files in a hex-editor, search for the + magic prefix for Silead firmware: F0 00 00 00 02 00 00 00, this gives you + the beginning address of the firmware inside the boot_services_code? file. + +4. The firmware has a specific pattern, it starts with a 8 byte page-address, + typically F0 00 00 00 02 00 00 00 for the first page followed by 32-bit + word-address + 32-bit value pairs. With the word-address incrementing 4 + bytes (1 word) for each pair until a page is complete. A complete page is + followed by a new page-address, followed by more word + value pairs. This + leads to a very distinct pattern. Scroll down until this pattern stops, + this gives you the end of the firmware inside the boot_services_code? file. + +5. "dd if=boot_services_code? of=firmware bs=1 skip=<begin-addr> count=<len>" + will extract the firmware for you. Inspect the firmware file in a + hexeditor to make sure you got the dd parameters correct. + +6. Copy it to /lib/firmware under the expected name to test it. + +7. If the extracted firmware works, you can use the found info to fill an + efi_embedded_fw_desc struct to describe it, run "sha256sum firmware" + to get the sha256sum to put in the sha256 field. 
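The prefix-and-hash matching described in the fallback-mechanisms.rst text above can be tried offline against a copied boot_services_code? file. The following is a minimal Python sketch under stated assumptions, not the kernel's implementation: the function name `find_embedded_fw` and the synthetic data in the demo are hypothetical, and only the 8-byte prefix, the sha256 over `length` bytes, and the 8-byte alignment come from the text.

```python
import hashlib

PREFIX_LEN = 8  # the text describes an eight-byte prefix


def find_embedded_fw(blob, prefix, length, sha256_hex):
    """Return the offset of a matching firmware image in blob, or None.

    Hypothetical helper mirroring the documented scan: look for the
    prefix, then hash `length` bytes and compare against the expected
    sha256 (given as a lowercase hex string).
    """
    if len(prefix) != PREFIX_LEN:
        raise ValueError("prefix must be exactly 8 bytes")
    # Only 8-byte-aligned offsets are checked, matching note 2 in the text.
    for offset in range(0, len(blob) - length + 1, 8):
        if blob[offset:offset + PREFIX_LEN] != prefix:
            continue
        digest = hashlib.sha256(blob[offset:offset + length]).hexdigest()
        if digest == sha256_hex:
            return offset
    return None


if __name__ == "__main__":
    # Synthetic stand-in for a dump; real input would be the bytes of a
    # boot_services_code? file copied from debugfs.
    silead_prefix = bytes.fromhex("F000000002000000")
    fake_fw = silead_prefix + bytes(range(24))
    dump = bytes(16) + fake_fw + bytes(8)
    want = hashlib.sha256(fake_fw).hexdigest()
    print(find_embedded_fw(dump, silead_prefix, len(fake_fw), want))
```

A returned offset and the known length correspond to the skip= and count= values for the dd command in step 5. The length still has to be determined by inspecting the dump as in step 4, since the hash can only be checked once the length is known.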
diff --git a/Documentation/driver-api/firmware/index.rst b/Documentation/driver-api/firmware/index.rst index 29da39ec4b8a..57415d657173 100644 --- a/Documentation/driver-api/firmware/index.rst +++ b/Documentation/driver-api/firmware/index.rst @@ -6,6 +6,7 @@ Linux Firmware API introduction core + efi/index request_firmware other_interfaces diff --git a/Documentation/driver-api/firmware/lookup-order.rst b/Documentation/driver-api/firmware/lookup-order.rst index 88c81739683c..6064672a782e 100644 --- a/Documentation/driver-api/firmware/lookup-order.rst +++ b/Documentation/driver-api/firmware/lookup-order.rst @@ -12,6 +12,8 @@ a driver issues a firmware API call. return it immediately * The ''Direct filesystem lookup'' is performed next, if found we return it immediately +* The ''Platform firmware fallback'' is performed next, but only when + firmware_request_platform() is used, if found we return it immediately * If no firmware has been found and the fallback mechanism was enabled the sysfs interface is created. After this either a kobject uevent is issued or the custom firmware loading is relied upon for firmware diff --git a/Documentation/driver-api/firmware/request_firmware.rst b/Documentation/driver-api/firmware/request_firmware.rst index f62bdcbfed5b..cd076462d235 100644 --- a/Documentation/driver-api/firmware/request_firmware.rst +++ b/Documentation/driver-api/firmware/request_firmware.rst @@ -25,6 +25,11 @@ firmware_request_nowarn .. kernel-doc:: drivers/base/firmware_loader/main.c :functions: firmware_request_nowarn +firmware_request_platform +------------------------- +.. kernel-doc:: drivers/base/firmware_loader/main.c + :functions: firmware_request_platform + request_firmware_direct ----------------------- .. 
kernel-doc:: drivers/base/firmware_loader/main.c diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 0ebe205efd0c..d4e78cb3ef4d 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -17,6 +17,7 @@ available subsections can be seen below. driver-model/index basics infrastructure + ioctl early-userspace/index pm/index clk @@ -74,11 +75,12 @@ available subsections can be seen below. connector console dcdbas - edid eisa ipmb isa isapnp + io-mapping + io_ordering generic-counter lightnvm-pblk memory-devices/index diff --git a/Documentation/io-mapping.txt b/Documentation/driver-api/io-mapping.rst index a966239f04e4..a966239f04e4 100644 --- a/Documentation/io-mapping.txt +++ b/Documentation/driver-api/io-mapping.rst diff --git a/Documentation/io_ordering.txt b/Documentation/driver-api/io_ordering.rst index 2ab303ce9a0d..2ab303ce9a0d 100644 --- a/Documentation/io_ordering.txt +++ b/Documentation/driver-api/io_ordering.rst diff --git a/Documentation/core-api/ioctl.rst b/Documentation/driver-api/ioctl.rst index c455db0e1627..c455db0e1627 100644 --- a/Documentation/core-api/ioctl.rst +++ b/Documentation/driver-api/ioctl.rst diff --git a/Documentation/driver-api/ipmb.rst b/Documentation/driver-api/ipmb.rst index 3ec3baed84c4..209c49e05116 100644 --- a/Documentation/driver-api/ipmb.rst +++ b/Documentation/driver-api/ipmb.rst @@ -71,9 +71,13 @@ b) Example for device tree:: ipmb@10 { compatible = "ipmb-dev"; reg = <0x10>; + i2c-protocol; }; }; +If xmit of data to be done using raw i2c block vs smbus +then "i2c-protocol" needs to be defined as above. 
+ 2) Manually from Linux:: modprobe ipmb-dev-int diff --git a/Documentation/driver-api/usb/typec_bus.rst b/Documentation/driver-api/usb/typec_bus.rst index f47a69bff498..03dfa9c018b7 100644 --- a/Documentation/driver-api/usb/typec_bus.rst +++ b/Documentation/driver-api/usb/typec_bus.rst @@ -53,9 +53,7 @@ in need to reconfigure the pins on the connector, the alternate mode driver needs to notify the bus using :c:func:`typec_altmode_notify()`. The driver passes the negotiated SVID specific pin configuration value to the function as parameter. The bus driver will then configure the mux behind the connector using -that value as the state value for the mux, and also call blocking notification -chain to notify the external drivers about the state of the connector that need -to know it. +that value as the state value for the mux. NOTE: The SVID specific pin configuration values must always start from ``TYPEC_STATE_MODAL``. USB Type-C specification defines two default states for @@ -80,19 +78,6 @@ Helper macro ``TYPEC_MODAL_STATE()`` can also be used:: #define ALTMODEX_CONF_A = TYPEC_MODAL_STATE(0); #define ALTMODEX_CONF_B = TYPEC_MODAL_STATE(1); -Notification chain -~~~~~~~~~~~~~~~~~~ - -The drivers for the components that the alternate modes are designed for need to -get details regarding the results of the negotiation with the partner, and the -pin configuration of the connector. In case of DisplayPort alternate mode for -example, the GPU drivers will need to know those details. In case of -Thunderbolt alternate mode, the thunderbolt drivers will need to know them, and -so on. - -The notification chain is designed for this purpose. The drivers can register -notifiers with :c:func:`typec_altmode_register_notifier()`. - Cable plug alternate modes ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -129,8 +114,3 @@ Cable Plug operations .. kernel-doc:: drivers/usb/typec/bus.c :functions: typec_altmode_get_plug typec_altmode_put_plug - -Notifications -~~~~~~~~~~~~~ -.. 
kernel-doc:: drivers/usb/typec/class.c - :functions: typec_altmode_register_notifier typec_altmode_unregister_notifier diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 2dc5df6a1cf5..3d492a34c8ee 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -23,7 +23,7 @@ | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | riscv: | TODO | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.rst index fec7144e817c..f054d1c45e86 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.rst @@ -1,7 +1,10 @@ - v9fs: Plan 9 Resource Sharing for Linux - ======================================= +.. SPDX-License-Identifier: GPL-2.0 -ABOUT +======================================= +v9fs: Plan 9 Resource Sharing for Linux +======================================= + +About ===== v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. @@ -14,32 +17,34 @@ and Maya Gokhale. 
Additional development by Greg Watson The best detailed explanation of the Linux implementation and applications of the 9p client is available in the form of a USENIX paper: + http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html Other applications are described in the following papers: + * XCPU & Clustering - http://xcpu.org/papers/xcpu-talk.pdf + http://xcpu.org/papers/xcpu-talk.pdf * KVMFS: control file system for KVM - http://xcpu.org/papers/kvmfs.pdf + http://xcpu.org/papers/kvmfs.pdf * CellFS: A New Programming Model for the Cell BE - http://xcpu.org/papers/cellfs-talk.pdf + http://xcpu.org/papers/cellfs-talk.pdf * PROSE I/O: Using 9p to enable Application Partitions - http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf * VirtFS: A Virtualization Aware File System pass-through - http://goo.gl/3WPDg + http://goo.gl/3WPDg -USAGE +Usage ===== -For remote file server: +For remote file server:: mount -t 9p 10.10.1.2 /mnt/9 -For Plan 9 From User Space applications (http://swtch.com/plan9) +For Plan 9 From User Space applications (http://swtch.com/plan9):: mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER -For server running on QEMU host with virtio transport: +For server running on QEMU host with virtio transport:: mount -t 9p -o trans=virtio <mount_tag> /mnt/9 @@ -48,18 +53,22 @@ mount points. Each 9P export is seen by the client as a virtio device with an associated "mount_tag" property. Available mount tags can be seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files. -OPTIONS +Options ======= + ============= =============================================================== trans=name select an alternative transport. 
Valid options are currently: - unix - specifying a named pipe mount point - tcp - specifying a normal TCP/IP connection - fd - used passed file descriptors for connection - (see rfdno and wfdno) - virtio - connect to the next virtio channel available - (from QEMU with trans_virtio module) - rdma - connect to a specified RDMA channel + + ======== ============================================ + unix specifying a named pipe mount point + tcp specifying a normal TCP/IP connection + fd used passed file descriptors for connection + (see rfdno and wfdno) + virtio connect to the next virtio channel available + (from QEMU with trans_virtio module) + rdma connect to a specified RDMA channel + ======== ============================================ uname=name user name to attempt mount as on the remote server. The server may override or ignore this value. Certain user @@ -69,28 +78,36 @@ OPTIONS offering several exported file systems. cache=mode specifies a caching policy. By default, no caches are used. - none = default no cache policy, metadata and data + + none + default no cache policy, metadata and data alike are synchronous. - loose = no attempts are made at consistency, + loose + no attempts are made at consistency, intended for exclusive, read-only mounts - fscache = use FS-Cache for a persistent, read-only + fscache + use FS-Cache for a persistent, read-only cache backend. - mmap = minimal cache that is only used for read-write + mmap + minimal cache that is only used for read-write mmap. Northing else is cached, like cache=none debug=n specifies debug level. The debug level is a bitmask. 
- 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9p trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug - 0x100 = display protocol message debug - 0x200 = display Fid debug - 0x400 = display packet debug - 0x800 = display fscache tracing debug + + ===== ================================ + 0x01 display verbose error messages + 0x02 developer debug (DEBUG_CURRENT) + 0x04 display 9p trace + 0x08 display VFS trace + 0x10 display Marshalling debug + 0x20 display RPC debug + 0x40 display transport debug + 0x80 display allocation debug + 0x100 display protocol message debug + 0x200 display Fid debug + 0x400 display packet debug + 0x800 display fscache tracing debug + ===== ================================ rfdno=n the file descriptor for reading with trans=fd @@ -103,9 +120,12 @@ OPTIONS noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) version=name Select 9P protocol version. Valid options are: - 9p2000 - Legacy mode (same as noextend) - 9p2000.u - Use 9P2000.u protocol - 9p2000.L - Use 9P2000.L protocol + + ======== ============================== + 9p2000 Legacy mode (same as noextend) + 9p2000.u Use 9P2000.u protocol + 9p2000.L Use 9P2000.L protocol + ======== ============================== dfltuid attempt to mount as a particular uid @@ -118,22 +138,27 @@ OPTIONS hosts. This functionality will be expanded in later versions. access there are four access modes. - user = if a user tries to access a file on v9fs + user + if a user tries to access a file on v9fs filesystem for the first time, v9fs sends an attach command (Tattach) for that user. This is the default mode. 
- <uid> = allows only user with uid=<uid> to access + <uid> + allows only user with uid=<uid> to access the files on the mounted filesystem - any = v9fs does single attach and performs all + any + v9fs does single attach and performs all operations as one user - client = ACL based access check on the 9p client + client + ACL based access check on the 9p client side for access validation cachetag cache tag to use the specified persistent cache. cache tags for existing cache sessions can be listed at /sys/fs/9p/caches. (applies only to cache=fscache) + ============= =============================================================== -RESOURCES +Resources ========= Protocol specifications are maintained on github: @@ -158,4 +183,3 @@ http://plan9.bell-labs.com/plan9 For information on Plan 9 from User Space (Plan 9 applications and libraries ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 - diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.rst index 0baa8e8c1fc1..5b22cae38e5e 100644 --- a/Documentation/filesystems/adfs.txt +++ b/Documentation/filesystems/adfs.rst @@ -1,3 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Acorn Disc Filing System - ADFS +=============================== + Filesystems supported by ADFS ----------------------------- @@ -25,6 +31,7 @@ directory updates, specifically updating the access mode and timestamp. Mount options for ADFS ---------------------- + ============ ====================================================== uid=nnn All files in the partition will be owned by user id nnn. Default 0 (root). gid=nnn All files in the partition will be in group @@ -36,22 +43,23 @@ Mount options for ADFS ftsuffix=n When ftsuffix=0, no file type suffix will be applied. When ftsuffix=1, a hexadecimal suffix corresponding to the RISC OS file type will be added. Default 0. 
+ ============ ====================================================== Mapping of ADFS permissions to Linux permissions ------------------------------------------------ ADFS permissions consist of the following: - Owner read - Owner write - Other read - Other write + - Owner read + - Owner write + - Other read + - Other write (In older versions, an 'execute' permission did exist, but this - does not hold the same meaning as the Linux 'execute' permission - and is now obsolete). + does not hold the same meaning as the Linux 'execute' permission + and is now obsolete). - The mapping is performed as follows: + The mapping is performed as follows:: Owner read -> -r--r--r-- Owner write -> --w--w---w @@ -66,17 +74,18 @@ Mapping of ADFS permissions to Linux permissions Possible other mode permissions -> ----rwxrwx Hence, with the default masks, if a file is owner read/write, and - not a UnixExec filetype, then the permissions will be: + not a UnixExec filetype, then the permissions will be:: -rw------- However, if the masks were ownmask=0770,othmask=0007, then this would - be modified to: + be modified to:: + -rw-rw---- There is no restriction on what you can do with these masks. You may wish that either read bits give read access to the file for all, but - keep the default write protection (ownmask=0755,othmask=0577): + keep the default write protection (ownmask=0755,othmask=0577):: -rw-r--r-- diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.rst index 71b63c2b9841..7f1a40dce6d3 100644 --- a/Documentation/filesystems/affs.txt +++ b/Documentation/filesystems/affs.rst @@ -1,9 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= Overview of Amiga Filesystems ============================= Not all varieties of the Amiga filesystems are supported for reading and writing. 
The Amiga currently knows six different filesystems: +============== =============================================================== DOS\0 The old or original filesystem, not really suited for hard disks and normally not used on them, either. Supported read/write. @@ -23,6 +27,7 @@ DOS\4 The original filesystem with directory cache. The directory sense on hard disks. Supported read only. DOS\5 The Fast File System with directory cache. Supported read only. +============== =============================================================== All of the above filesystems allow block sizes from 512 to 32K bytes. Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks @@ -36,14 +41,18 @@ are supported, too. Mount options for the AFFS ========================== -protect If this option is set, the protection bits cannot be altered. +protect + If this option is set, the protection bits cannot be altered. -setuid[=uid] This sets the owner of all files and directories in the file +setuid[=uid] + This sets the owner of all files and directories in the file system to uid or the uid of the current user, respectively. -setgid[=gid] Same as above, but for gid. +setgid[=gid] + Same as above, but for gid. -mode=mode Sets the mode flags to the given (octal) value, regardless +mode=mode + Sets the mode flags to the given (octal) value, regardless of the original permissions. Directories will get an x permission if the corresponding r bit is set. This is useful since most of the plain AmigaOS files @@ -53,33 +62,41 @@ nofilenametruncate The file system will return an error when filename exceeds standard maximum filename length (30 characters). -reserved=num Sets the number of reserved blocks at the start of the +reserved=num + Sets the number of reserved blocks at the start of the partition to num. You should never need this option. Default is 2. -root=block Sets the block number of the root block. This should never +root=block + Sets the block number of the root block. 
This should never be necessary. -bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, +bs=blksize + Sets the blocksize to blksize. Valid block sizes are 512, 1024, 2048 and 4096. Like the root option, this should never be necessary, as the affs can figure it out itself. -quiet The file system will not return an error for disallowed +quiet + The file system will not return an error for disallowed mode changes. -verbose The volume name, file system type and block size will +verbose + The volume name, file system type and block size will be written to the syslog when the filesystem is mounted. -mufs The filesystem is really a muFS, also it doesn't +mufs + The filesystem is really a muFS, also it doesn't identify itself as one. This option is necessary if the filesystem wasn't formatted as muFS, but is used as one. -prefix=path Path will be prefixed to every absolute path name of +prefix=path + Path will be prefixed to every absolute path name of symbolic links on an AFFS partition. Default = "/". (See below.) -volume=name When symbolic links with an absolute path are created +volume=name + When symbolic links with an absolute path are created on an AFFS partition, name will be prepended as the volume name. Default = "" (empty string). (See below.) @@ -119,7 +136,7 @@ The Linux rwxrwxrwx file mode is handled as follows: - All other flags (suid, sgid, ...) are ignored and will not be retained. - + Newly created files and directories will get the user and group ID of the current user and a mode according to the umask. @@ -148,11 +165,13 @@ might be "User", "WB" and "Graphics", the mount points /amiga/User, Examples ======== -Command line: +Command line:: + mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose mount /dev/sda3 /Amiga -t affs -/etc/fstab entry: +/etc/fstab entry:: + /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 IMPORTANT NOTE @@ -170,7 +189,8 @@ before booting Windows! 
If the damage is already done, the following should fix the RDB (where <disk> is the device name). -DO AT YOUR OWN RISK: + +DO AT YOUR OWN RISK:: dd if=/dev/<disk> of=rdb.tmp count=1 cp rdb.tmp rdb.fixed @@ -189,10 +209,14 @@ By default, filenames are truncated to 30 characters without warning. 'nofilenametruncate' mount option can change that behavior. Case is ignored by the affs in filename matching, but Linux shells -do care about the case. Example (with /wb being an affs mounted fs): +do care about the case. Example (with /wb being an affs mounted fs):: + rm /wb/WRONGCASE -will remove /mnt/wrongcase, but + +will remove /mnt/wrongcase, but:: + rm /wb/WR* + will not since the names are matched by the shell. The block allocation is designed for hard disk partitions. If more @@ -219,4 +243,4 @@ due to an incompatibility with the Amiga floppy controller. If you are interested in an Amiga Emulator for Linux, look at -http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/ +http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.rst index 8c6ea7b41048..c4ec39a5966e 100644 --- a/Documentation/filesystems/afs.txt +++ b/Documentation/filesystems/afs.rst @@ -1,8 +1,10 @@ - ==================== - kAFS: AFS FILESYSTEM - ==================== +.. SPDX-License-Identifier: GPL-2.0 -Contents: +==================== +kAFS: AFS FILESYSTEM +==================== + +.. Contents: - Overview. - Usage. @@ -14,8 +16,7 @@ Contents: - The @sys substitution. -======== -OVERVIEW +Overview ======== This filesystem provides a fairly simple secure AFS filesystem driver. It is @@ -35,35 +36,33 @@ It does not yet support the following AFS features: (*) pioctl() system call. 
-=========== -COMPILATION +Compilation =========== The filesystem should be enabled by turning on the kernel configuration -options: +options:: CONFIG_AF_RXRPC - The RxRPC protocol transport CONFIG_RXKAD - The RxRPC Kerberos security handler CONFIG_AFS - The AFS filesystem -Additionally, the following can be turned on to aid debugging: +Additionally, the following can be turned on to aid debugging:: CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled They permit the debugging messages to be turned on dynamically by manipulating -the masks in the following files: +the masks in the following files:: /sys/module/af_rxrpc/parameters/debug /sys/module/kafs/parameters/debug -===== -USAGE +Usage ===== When inserting the driver modules the root cell must be specified along with a -list of volume location server IP addresses: +list of volume location server IP addresses:: modprobe rxrpc modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 @@ -77,14 +76,14 @@ The second module is the kerberos RxRPC security driver, and the third module is the actual filesystem driver for the AFS filesystem. Once the module has been loaded, more modules can be added by the following -procedure: +procedure:: echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells Where the parameters to the "add" command are the name of a cell and a list of volume location servers within that cell, with the latter separated by colons. -Filesystems can be mounted anywhere by commands similar to the following: +Filesystems can be mounted anywhere by commands similar to the following:: mount -t afs "%cambridge.redhat.com:root.afs." /afs mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge @@ -104,8 +103,7 @@ named volume will be looked up in the cell specified during modprobe. Additional cells can be added through /proc (see later section). 
-=========== -MOUNTPOINTS +Mountpoints =========== AFS has a concept of mountpoints. In AFS terms, these are specially formatted @@ -123,42 +121,40 @@ culled first. If all are culled, then the requested volume will also be unmounted, otherwise error EBUSY will be returned. This can be used by the administrator to attempt to unmount the whole AFS tree -mounted on /afs in one go by doing: +mounted on /afs in one go by doing:: umount /afs -============ -DYNAMIC ROOT +Dynamic Root ============ A mount option is available to create a serverless mount that is only usable -for dynamic lookup. Creating such a mount can be done by, for example: +for dynamic lookup. Creating such a mount can be done by, for example:: mount -t afs none /afs -o dyn This creates a mount that just has an empty directory at the root. Attempting to look up a name in this directory will cause a mountpoint to be created that -looks up a cell of the same name, for example: +looks up a cell of the same name, for example:: ls /afs/grand.central.org/ -=============== -PROC FILESYSTEM +Proc Filesystem =============== The AFS modules creates a "/proc/fs/afs/" directory and populates it: (*) A "cells" file that lists cells currently known to the afs module and - their usage counts: + their usage counts:: [root@andromeda ~]# cat /proc/fs/afs/cells USE NAME 3 cambridge.redhat.com (*) A directory per cell that contains files that list volume location - servers, volumes, and active servers known within that cell. 
+ servers, volumes, and active servers known within that cell:: [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers USE ADDR STATE @@ -171,8 +167,7 @@ The AFS modules creates a "/proc/fs/afs/" directory and populates it: 1 Val 20000000 20000001 20000002 root.afs -================= -THE CELL DATABASE +The Cell Database ================= The filesystem maintains an internal database of all the cells it knows and the @@ -181,7 +176,7 @@ the system belongs is added to the database when modprobe is performed by the "rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on the kernel command line. -Further cells can be added by commands similar to the following: +Further cells can be added by commands similar to the following:: echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells @@ -189,8 +184,7 @@ Further cells can be added by commands similar to the following: No other cell database operations are available at this time. -======== -SECURITY +Security ======== Secure operations are initiated by acquiring a key using the klog program. A @@ -198,17 +192,17 @@ very primitive klog program is available at: http://people.redhat.com/~dhowells/rxrpc/klog.c -This should be compiled by: +This should be compiled by:: make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" -And then run as: +And then run as:: ./klog Assuming it's successful, this adds a key of type RxRPC, named for the service and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or -by cat'ing /proc/keys: +by cat'ing /proc/keys:: [root@andromeda ~]# keyctl show Session Keyring @@ -232,20 +226,19 @@ socket), then the operations on the file will be made with key that was used to open the file. 
-===================== -THE @SYS SUBSTITUTION +The @sys Substitution ===================== The list of up to 16 @sys substitutions for the current network namespace can -be configured by writing a list to /proc/fs/afs/sysname: +be configured by writing a list to /proc/fs/afs/sysname:: [root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname -or cleared entirely by writing an empty list: +or cleared entirely by writing an empty list:: [root@andromeda ~]# echo >/proc/fs/afs/sysname -The current list for current network namespace can be retrieved by: +The current list for current network namespace can be retrieved by:: [root@andromeda ~]# cat /proc/fs/afs/sysname foo diff --git a/Documentation/filesystems/autofs-mount-control.txt b/Documentation/filesystems/autofs-mount-control.rst index acc02fc57993..2903aed92316 100644 --- a/Documentation/filesystems/autofs-mount-control.txt +++ b/Documentation/filesystems/autofs-mount-control.rst @@ -1,4 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +==================================================================== Miscellaneous Device control operations for the autofs kernel module ==================================================================== @@ -36,24 +38,24 @@ For example, there are two types of automount maps, direct (in the kernel module source you will see a third type called an offset, which is just a direct mount in disguise) and indirect. -Here is a master map with direct and indirect map entries: +Here is a master map with direct and indirect map entries:: -/- /etc/auto.direct -/test /etc/auto.indirect + /- /etc/auto.direct + /test /etc/auto.indirect -and the corresponding map files: +and the corresponding map files:: -/etc/auto.direct: + /etc/auto.direct: -/automount/dparse/g6 budgie:/autofs/export1 -/automount/dparse/g1 shark:/autofs/export1 -and so on. + /automount/dparse/g6 budgie:/autofs/export1 + /automount/dparse/g1 shark:/autofs/export1 + and so on. 
-/etc/auto.indirect: +/etc/auto.indirect:: -g1 shark:/autofs/export1 -g6 budgie:/autofs/export1 -and so on. + g1 shark:/autofs/export1 + g6 budgie:/autofs/export1 + and so on. For the above indirect map an autofs file system is mounted on /test and mounts are triggered for each sub-directory key by the inode lookup @@ -69,23 +71,23 @@ use the follow_link inode operation to trigger the mount. But, each entry in direct and indirect maps can have offsets (making them multi-mount map entries). -For example, an indirect mount map entry could also be: +For example, an indirect mount map entry could also be:: -g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export1 \ - /s2/ss2 shark:/autofs/export2 + g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 -and a similarly a direct mount map entry could also be: +and a similarly a direct mount map entry could also be:: -/automount/dparse/g1 \ - / shark:/autofs/export5/testing/test \ - /s1 shark:/autofs/export/testing/test/s1 \ - /s2 shark:/autofs/export5/testing/test/s2 \ - /s1/ss1 shark:/autofs/export2 \ - /s2/ss2 shark:/autofs/export2 + /automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 One of the issues with version 4 of autofs was that, when mounting an entry with a large number of offsets, possibly with nesting, we needed @@ -170,32 +172,32 @@ autofs Miscellaneous Device mount control interface The control interface is opening a device node, typically /dev/autofs. 
All the ioctls use a common structure to pass the needed parameter -information and return operation results: - -struct autofs_dev_ioctl { - __u32 ver_major; - __u32 ver_minor; - __u32 size; /* total size of data passed in - * including this struct */ - __s32 ioctlfd; /* automount command fd */ - - /* Command parameters */ - union { - struct args_protover protover; - struct args_protosubver protosubver; - struct args_openmount openmount; - struct args_ready ready; - struct args_fail fail; - struct args_setpipefd setpipefd; - struct args_timeout timeout; - struct args_requester requester; - struct args_expire expire; - struct args_askumount askumount; - struct args_ismountpoint ismountpoint; - }; - - char path[0]; -}; +information and return operation results:: + + struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + /* Command parameters */ + union { + struct args_protover protover; + struct args_protosubver protosubver; + struct args_openmount openmount; + struct args_ready ready; + struct args_fail fail; + struct args_setpipefd setpipefd; + struct args_timeout timeout; + struct args_requester requester; + struct args_expire expire; + struct args_askumount askumount; + struct args_ismountpoint ismountpoint; + }; + + char path[0]; + }; The ioctlfd field is a mount point file descriptor of an autofs mount point. It is returned by the open call and is used by all calls except @@ -212,7 +214,7 @@ is used account for the increased structure length when translating the structure sent from user space. This structure can be initialized before setting specific fields by using -the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). +the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``). 
All of the ioctls perform a copy of this structure from user space to kernel space and return -EINVAL if the size parameter is smaller than diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.rst index da45e6c842b8..79f9740d76ff 100644 --- a/Documentation/filesystems/befs.txt +++ b/Documentation/filesystems/befs.rst @@ -1,48 +1,54 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= BeOS filesystem for Linux +========================= Document last updated: Dec 6, 2001 -WARNING +Warning ======= Make sure you understand that this is alpha software. This means that the -implementation is neither complete nor well-tested. +implementation is neither complete nor well-tested. I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! -LICENSE -===== -This software is covered by the GNU General Public License. +License +======= +This software is covered by the GNU General Public License. See the file COPYING for the complete text of the license. Or the GNU website: <http://www.gnu.org/licenses/licenses.html> -AUTHOR -===== +Author +====== The largest part of the code written by Will Dyson <will_dyson@pobox.com> He has been working on the code since Aug 13, 2001. See the changelog for details. Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> + His original code can still be found at: <http://hp.vector.co.jp/authors/VA008030/bfs/> + Does anyone know of a more current email address for Makoto? He doesn't respond to the address given above... This filesystem doesn't have a maintainer. -WHAT IS THIS DRIVER? -================== -This module implements the native filesystem of BeOS http://www.beincorporated.com/ +What is this Driver? +==================== +This module implements the native filesystem of BeOS http://www.beincorporated.com/ for the linux 2.4.1 and later kernels. Currently it is a read-only implementation. Which is it, BFS or BEFS? 
-================ -Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". +========================= +Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". But Unixware Boot Filesystem is called bfs, too. And they are already in the kernel. Because of this naming conflict, on Linux the BeOS filesystem is called befs. -HOW TO INSTALL +How to Install ============== step 1. Install the BeFS patch into the source code tree of linux. @@ -54,16 +60,16 @@ is called patch-befs-xxx, you would do the following: patch -p1 < /path/to/patch-befs-xxx if the patching step fails (i.e. there are rejected hunks), you can try to -figure it out yourself (it shouldn't be hard), or mail the maintainer +figure it out yourself (it shouldn't be hard), or mail the maintainer (Will Dyson <will_dyson@pobox.com>) for help. step 2. Configuration & make kernel The linux kernel has many compile-time options. Most of them are beyond the scope of this document. I suggest the Kernel-HOWTO document as a good general -reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html +reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html -However, to use the BeFS module, you must enable it at configure time. +However, to use the BeFS module, you must enable it at configure time:: cd /foo/bar/linux make menuconfig (or xconfig) @@ -82,35 +88,40 @@ step 3. Install See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for instructions on this critical step. -USING BFS +Using BFS ========= To use the BeOS filesystem, use filesystem type 'befs'. -ex) +ex:: + mount -t befs /dev/fd0 /beos -MOUNT OPTIONS +Mount Options ============= + +============= =========================================================== uid=nnn All files in the partition will be owned by user id nnn. gid=nnn All files in the partition will be in group nnn. iocharset=xxx Use xxx as the name of the NLS translation table. 
debug The driver will output debugging information to the syslog. +============= =========================================================== -HOW TO GET LASTEST VERSION +How to Get Lastest Version ========================== The latest version is currently available at: <http://befs-driver.sourceforge.net/> -ANY KNOWN BUGS? -=========== +Any Known Bugs? +=============== As of Jan 20, 2002: - + None -SPECIAL THANKS +Special Thanks ============== Dominic Giampalo ... Writing "Practical file system design with Be filesystem" + Hiroyuki Yamada ... Testing LinuxPPC. diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.rst index 843ce91a2e40..ce14b9018807 100644 --- a/Documentation/filesystems/bfs.txt +++ b/Documentation/filesystems/bfs.rst @@ -1,4 +1,7 @@ -BFS FILESYSTEM FOR LINUX +.. SPDX-License-Identifier: GPL-2.0 + +======================== +BFS Filesystem for Linux ======================== The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which @@ -9,22 +12,22 @@ In order to access /stand partition under Linux you obviously need to know the partition number and the kernel must support UnixWare disk slices (CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not depend on having UnixWare disklabel support because one can also mount -BFS filesystem via loopback: +BFS filesystem via loopback:: -# losetup /dev/loop0 stand.img -# mount -t bfs /dev/loop0 /mnt/stand + # losetup /dev/loop0 stand.img + # mount -t bfs /dev/loop0 /mnt/stand -where stand.img is a file containing the image of BFS filesystem. +where stand.img is a file containing the image of BFS filesystem. 
When you have finished using it and umounted you need to also deallocate -/dev/loop0 device by: +/dev/loop0 device by:: -# losetup -d /dev/loop0 + # losetup -d /dev/loop0 -You can simplify mounting by just typing: +You can simplify mounting by just typing:: -# mount -t bfs -o loop stand.img /mnt/stand + # mount -t bfs -o loop stand.img /mnt/stand -this will allocate the first available loopback device (and load loop.o +this will allocate the first available loopback device (and load loop.o kernel module if necessary) automatically. If the loopback driver is not loaded automatically, make sure that you have compiled the module and that modprobe is functioning. Beware that umount will not deallocate @@ -33,21 +36,21 @@ that modprobe is functioning. Beware that umount will not deallocate losetup(8). Read losetup(8) manpage for more info. To create the BFS image under UnixWare you need to find out first which -slice contains it. The command prtvtoc(1M) is your friend: +slice contains it. The command prtvtoc(1M) is your friend:: -# prtvtoc /dev/rdsk/c0b0t0d0s0 + # prtvtoc /dev/rdsk/c0b0t0d0s0 (assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you look for the slice with tag "STAND", which is usually slice 10. With this -information you can use dd(1) to create the BFS image: +information you can use dd(1) to create the BFS image:: -# umount /stand -# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 + # umount /stand + # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 Just in case, you can verify that you have done the right thing by checking -the magic number: +the magic number:: -# od -Ad -tx4 stand.img | more + # od -Ad -tx4 stand.img | more The first 4 bytes should be 0x1badface. diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.rst index f9dad22d95ce..d0904f602819 100644 --- a/Documentation/filesystems/btrfs.txt +++ b/Documentation/filesystems/btrfs.rst @@ -1,3 +1,6 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +===== BTRFS ===== diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.rst index b19b6a03f91c..b46a7218248f 100644 --- a/Documentation/filesystems/ceph.txt +++ b/Documentation/filesystems/ceph.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ Ceph Distributed File System ============================ @@ -15,6 +18,7 @@ Basic features include: * Easy deployment: most FS components are userspace daemons Also, + * Flexible snapshots (on any directory) * Recursive accounting (nested files, directories, bytes) @@ -63,7 +67,7 @@ no 'du' or similar recursive scan of the file system is required. Finally, Ceph also allows quotas to be set on any directory in the system. The quota can restrict the number of bytes or the number of files stored beneath that point in the directory hierarchy. Quotas can be set using -extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg: +extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg:: setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir getfattr -n ceph.quota.max_bytes /some/dir @@ -76,7 +80,7 @@ from writing as much data as it needs. Mount Syntax ============ -The basic mount syntax is: +The basic mount syntax is:: # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt @@ -84,7 +88,7 @@ You only need to specify a single monitor, as the client will get the full list when it connects. (However, if the monitor you specify happens to be down, the mount won't succeed.) The port can be left off if the monitor is using the default. So if the monitor is at -1.2.3.4, +1.2.3.4:: # mount -t ceph 1.2.3.4:/ /mnt/ceph @@ -163,14 +167,14 @@ Mount Options available modes are "no" and "clean". The default is "no". * no: never attempt to reconnect when client detects that it has been - blacklisted. Operations will generally fail after being blacklisted. + blacklisted. 
Operations will generally fail after being blacklisted. * clean: client reconnects to the ceph cluster automatically when it - detects that it has been blacklisted. During reconnect, client drops - dirty data/metadata, invalidates page caches and writable file handles. - After reconnect, file locks become stale because the MDS loses track - of them. If an inode contains any stale file locks, read/write on the - inode is not allowed until applications release all stale file locks. + detects that it has been blacklisted. During reconnect, client drops + dirty data/metadata, invalidates page caches and writable file handles. + After reconnect, file locks become stale because the MDS loses track + of them. If an inode contains any stale file locks, read/write on the + inode is not allowed until applications release all stale file locks. More Information ================ @@ -179,8 +183,8 @@ For more information on Ceph, see the home page at https://ceph.com/ The Linux kernel client source tree is available at - https://github.com/ceph/ceph-client.git - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git + - https://github.com/ceph/ceph-client.git + - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git and the source for the full system is at https://github.com/ceph/ceph.git diff --git a/Documentation/filesystems/cifs/cifsroot.txt b/Documentation/filesystems/cifs/cifsroot.txt index 0fa1a2c36a40..947b7ec6ce9e 100644 --- a/Documentation/filesystems/cifs/cifsroot.txt +++ b/Documentation/filesystems/cifs/cifsroot.txt @@ -13,7 +13,7 @@ network by utilizing SMB or CIFS protocol. In order to mount, the network stack will also need to be set up by using 'ip=' config option. For more details, see -Documentation/filesystems/nfs/nfsroot.txt. +Documentation/admin-guide/nfs/nfsroot.rst. A CIFS root mount currently requires the use of SMB1+UNIX Extensions which is only supported by the Samba server. 
SMB1 is the older diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.rst index 8e19a53d648b..afbdbde98bd2 100644 --- a/Documentation/filesystems/cramfs.txt +++ b/Documentation/filesystems/cramfs.rst @@ -1,12 +1,15 @@ +.. SPDX-License-Identifier: GPL-2.0 - Cramfs - cram a filesystem onto a small ROM +=========================================== +Cramfs - cram a filesystem onto a small ROM +=========================================== -cramfs is designed to be simple and small, and to compress things well. +cramfs is designed to be simple and small, and to compress things well. It uses the zlib routines to compress a file one page at a time, and allows random page access. The meta-data is not compressed, but is expressed in a very terse representation to make it use much less -diskspace than traditional filesystems. +diskspace than traditional filesystems. You can't write to a cramfs filesystem (making it compressible and compact also makes it _very_ hard to update on-the-fly), so you have to @@ -28,9 +31,9 @@ issue. Hard links are supported, but hard linked files will still have a link count of 1 in the cramfs image. -Cramfs directories have no `.' or `..' entries. Directories (like +Cramfs directories have no ``.`` or ``..`` entries. Directories (like every other file on cramfs) always have a link count of 1. (There's -no need to use -noleaf in `find', btw.) +no need to use -noleaf in ``find``, btw.) No timestamps are stored in a cramfs, so these default to the epoch (1970 GMT). Recently-accessed files may have updated timestamps, but @@ -70,9 +73,9 @@ MTD drivers are cfi_cmdset_0001 (Intel/Sharp CFI flash) or physmap (Flash device in physical memory map). MTD partitions based on such devices are fine too. Then that device should be specified with the "mtd:" prefix as the mount device argument. 
For example, to mount the MTD device named -"fs_partition" on the /mnt directory: +"fs_partition" on the /mnt directory:: -$ mount -t cramfs mtd:fs_partition /mnt + $ mount -t cramfs mtd:fs_partition /mnt To boot a kernel with this as root filesystem, suffice to specify something like "root=mtd:fs_partition" on the kernel command line. @@ -90,6 +93,7 @@ https://github.com/npitre/cramfs-tools For /usr/share/magic -------------------- +===== ======================= ======================= 0 ulelong 0x28cd3d45 Linux cramfs offset 0 >4 ulelong x size %d >8 ulelong x flags 0x%x @@ -110,6 +114,7 @@ For /usr/share/magic >552 ulelong x fsid.blocks %d >556 ulelong x fsid.files %d >560 string >\0 name "%.16s" +===== ======================= ======================= Hacker Notes diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.rst index dc497b96fa4f..db9ea0854040 100644 --- a/Documentation/filesystems/debugfs.txt +++ b/Documentation/filesystems/debugfs.rst @@ -1,4 +1,11 @@ -Copyright 2009 Jonathan Corbet <corbet@lwn.net> +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +======= +DebugFS +======= + +Copyright |copy| 2009 Jonathan Corbet <corbet@lwn.net> Debugfs exists as a simple way for kernel developers to make information available to user space. Unlike /proc, which is only meant for information @@ -6,11 +13,11 @@ about a process, or sysfs, which has strict one-value-per-file rules, debugfs has no rules at all. Developers can put any information they want there. The debugfs filesystem is also intended to not serve as a stable ABI to user space; in theory, there are no stability constraints placed on -files exported there. The real world is not always so simple, though [1]; +files exported there. The real world is not always so simple, though [1]_; even debugfs interfaces are best designed with the idea that they will need to be maintained forever. 
-Debugfs is typically mounted with a command like: +Debugfs is typically mounted with a command like:: mount -t debugfs none /sys/kernel/debug @@ -23,7 +30,7 @@ Note that the debugfs API is exported GPL-only to modules. Code using debugfs should include <linux/debugfs.h>. Then, the first order of business will be to create at least one directory to hold a set of -debugfs files: +debugfs files:: struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); @@ -36,7 +43,7 @@ something went wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the kernel has been built without debugfs support and none of the functions described below will work. -The most general way to create a file within a debugfs directory is with: +The most general way to create a file within a debugfs directory is with:: struct dentry *debugfs_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, @@ -53,12 +60,12 @@ ERR_PTR(-ERROR) on error, or ERR_PTR(-ENODEV) if debugfs support is missing. Create a file with an initial size, the following function can be used -instead: +instead:: - struct dentry *debugfs_create_file_size(const char *name, umode_t mode, - struct dentry *parent, void *data, - const struct file_operations *fops, - loff_t file_size); + void debugfs_create_file_size(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops, + loff_t file_size); file_size is the initial file size. The other parameters are the same as the function debugfs_create_file. @@ -66,7 +73,7 @@ as the function debugfs_create_file. In a number of cases, the creation of a set of file operations is not actually necessary; the debugfs code provides a number of helper functions for simple situations. 
Files containing a single integer value can be -created with any of: +created with any of:: void debugfs_create_u8(const char *name, umode_t mode, struct dentry *parent, u8 *value); @@ -80,7 +87,7 @@ created with any of: These files support both reading and writing the given value; if a specific file should not be written to, simply set the mode bits accordingly. The values in these files are in decimal; if hexadecimal is more appropriate, -the following functions can be used instead: +the following functions can be used instead:: void debugfs_create_x8(const char *name, umode_t mode, struct dentry *parent, u8 *value); @@ -94,7 +101,7 @@ the following functions can be used instead: These functions are useful as long as the developer knows the size of the value to be exported. Some types can have different widths on different architectures, though, complicating the situation somewhat. There are -functions meant to help out in such special cases: +functions meant to help out in such special cases:: void debugfs_create_size_t(const char *name, umode_t mode, struct dentry *parent, size_t *value); @@ -103,7 +110,7 @@ As might be expected, this function will create a debugfs file to represent a variable of type size_t. Similarly, there are helpers for variables of type unsigned long, in decimal -and hexadecimal: +and hexadecimal:: struct dentry *debugfs_create_ulong(const char *name, umode_t mode, struct dentry *parent, @@ -111,7 +118,7 @@ and hexadecimal: void debugfs_create_xul(const char *name, umode_t mode, struct dentry *parent, unsigned long *value); -Boolean values can be placed in debugfs with: +Boolean values can be placed in debugfs with:: struct dentry *debugfs_create_bool(const char *name, umode_t mode, struct dentry *parent, bool *value); @@ -120,7 +127,7 @@ A read on the resulting file will yield either Y (for non-zero values) or N, followed by a newline. If written to, it will accept either upper- or lower-case values, or 1 or 0. 
Any other input will be silently ignored. -Also, atomic_t values can be placed in debugfs with: +Also, atomic_t values can be placed in debugfs with:: void debugfs_create_atomic_t(const char *name, umode_t mode, struct dentry *parent, atomic_t *value) @@ -129,7 +136,7 @@ A read of this file will get atomic_t values, and a write of this file will set atomic_t values. Another option is exporting a block of arbitrary binary data, with -this structure and function: +this structure and function:: struct debugfs_blob_wrapper { void *data; @@ -151,7 +158,7 @@ If you want to dump a block of registers (something that happens quite often during development, even if little such code reaches mainline. Debugfs offers two functions: one to make a registers-only file, and another to insert a register block in the middle of another sequential -file. +file:: struct debugfs_reg32 { char *name; @@ -164,9 +171,9 @@ file. void __iomem *base; }; - struct dentry *debugfs_create_regset32(const char *name, umode_t mode, - struct dentry *parent, - struct debugfs_regset32 *regset); + debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, int nregs, void __iomem *base, char *prefix); @@ -175,7 +182,7 @@ The "base" argument may be 0, but you may want to build the reg32 array using __stringify, and a number of register names (macros) are actually byte offsets over a base for the register block. -If you want to dump an u32 array in debugfs, you can create file with: +If you want to dump an u32 array in debugfs, you can create file with:: void debugfs_create_u32_array(const char *name, umode_t mode, struct dentry *parent, @@ -185,7 +192,7 @@ The "array" argument provides data, and the "elements" argument is the number of elements in the array. Note: Once array is created its size can not be changed. 
-There is a helper function to create device related seq_file: +There is a helper function to create device related seq_file:: struct dentry *debugfs_create_devm_seqfile(struct device *dev, const char *name, @@ -197,14 +204,14 @@ The "dev" argument is the device related to this debugfs file, and the "read_fn" is a function pointer which to be called to print the seq_file content. -There are a couple of other directory-oriented helper functions: +There are a couple of other directory-oriented helper functions:: - struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry, - struct dentry *new_dir, + struct dentry *new_dir, const char *new_name); - struct dentry *debugfs_create_symlink(const char *name, + struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent, const char *target); @@ -219,7 +226,7 @@ module is unloaded without explicitly removing debugfs entries, the result will be a lot of stale pointers and no end of highly antisocial behavior. So all debugfs users - at least those which can be built as modules - must be prepared to remove all files and directories they create there. A file -can be removed with: +can be removed with:: void debugfs_remove(struct dentry *dentry); @@ -229,7 +236,7 @@ be removed. Once upon a time, debugfs users were required to remember the dentry pointer for every debugfs file they created so that all files could be cleaned up. We live in more civilized times now, though, and debugfs users -can call: +can call:: void debugfs_remove_recursive(struct dentry *dentry); @@ -237,5 +244,4 @@ If this function is passed a pointer for the dentry corresponding to the top-level directory, the entire hierarchy below that directory will be removed. -Notes: - [1] http://lwn.net/Articles/309298/ +.. 
[1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.rst index fcf4d509d118..68daaa7facf9 100644 --- a/Documentation/filesystems/dlmfs.txt +++ b/Documentation/filesystems/dlmfs.rst @@ -1,20 +1,25 @@ -dlmfs -================== +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +===== +DLMFS +===== + A minimal DLM userspace interface implemented via a virtual file system. dlmfs is built with OCFS2 as it requires most of its infrastructure. -Project web page: http://ocfs2.wiki.kernel.org -Tools web page: https://github.com/markfasheh/ocfs2-tools -OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ +:Project web page: http://ocfs2.wiki.kernel.org +:Tools web page: https://github.com/markfasheh/ocfs2-tools +:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ All code copyright 2005 Oracle except when otherwise noted. -CREDITS +Credits ======= -Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds +Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds and Transmeta Corp. Mark Fasheh <mark.fasheh@oracle.com> @@ -96,14 +101,19 @@ operation. If the lock succeeds, you'll get an fd. open(2) with O_CREAT to ensure the resource inode is created - dlmfs does not automatically create inodes for existing lock resources. +============ =========================== Open Flag Lock Request Type ---------- ----------------- +============ =========================== O_RDONLY Shared Read O_RDWR Exclusive +============ =========================== + +============ =========================== Open Flag Resulting Locking Behavior ---------- -------------------------- +============ =========================== O_NONBLOCK Trylock operation +============ =========================== You must provide exactly one of O_RDONLY or O_RDWR. 
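The open-flag tables in the dlmfs hunk above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code from dlmfs or ocfs2-tools: the helper names `dlmfs_open_flags()` and `dlmfs_lock()` and the 0600 mode are hypothetical, and the caller is assumed to pass a path inside an already mounted dlmfs domain directory. Only the flag mapping itself (O_RDONLY for shared read, O_RDWR for exclusive, O_NONBLOCK for trylock, O_CREAT to make sure the resource inode exists) is taken from the text.

```c
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (an assumption for illustration, not a dlmfs API):
 * translate a lock request into open(2) flags per the tables above.
 */
static int dlmfs_open_flags(int exclusive, int trylock)
{
	/* O_CREAT: dlmfs does not create inodes for existing lock resources. */
	int flags = O_CREAT;

	/* Exactly one of O_RDONLY (shared read) or O_RDWR (exclusive). */
	flags |= exclusive ? O_RDWR : O_RDONLY;

	/* O_NONBLOCK turns the request into a trylock operation. */
	if (trylock)
		flags |= O_NONBLOCK;

	return flags;
}

/*
 * Hypothetical wrapper: request the lock by opening the resource file.
 * Returns the file descriptor on success, or -1 with errno set by open(2).
 * The 0600 mode is an arbitrary choice for this sketch.
 */
static int dlmfs_lock(const char *resource_path, int exclusive, int trylock)
{
	return open(resource_path, dlmfs_open_flags(exclusive, trylock), 0600);
}
```

A caller would pass the path of a lock resource inside a domain directory of the mounted dlmfs and keep the returned descriptor for as long as the lock is needed.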
diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.rst index 01d8a08351ac..1f2edef4c57a 100644 --- a/Documentation/filesystems/ecryptfs.txt +++ b/Documentation/filesystems/ecryptfs.rst @@ -1,14 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== eCryptfs: A stacked cryptographic filesystem for Linux +====================================================== eCryptfs is free software. Please see the file COPYING for details. For documentation, please see the files in the doc/ subdirectory. For building and installation instructions please see the INSTALL file. -Maintainer: Phillip Hellewell -Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com> -Developers: Michael C. Thompson - Kent Yoder -Web Site: http://ecryptfs.sf.net +:Maintainer: Phillip Hellewell +:Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com> +:Developers: Michael C. Thompson + Kent Yoder +:Web Site: http://ecryptfs.sf.net This software is currently undergoing development. Make sure to maintain a backup copy of any data you write into eCryptfs. @@ -19,34 +23,36 @@ SourceForge site: http://sourceforge.net/projects/ecryptfs/ Userspace requirements include: - - David Howells' userspace keyring headers and libraries (version - 1.0 or higher), obtainable from - http://people.redhat.com/~dhowells/keyutils/ - - Libgcrypt + +- David Howells' userspace keyring headers and libraries (version + 1.0 or higher), obtainable from + http://people.redhat.com/~dhowells/keyutils/ +- Libgcrypt -NOTES +.. note:: -In the beta/experimental releases of eCryptfs, when you upgrade -eCryptfs, you should copy the files to an unencrypted location and -then copy the files back into the new eCryptfs mount to migrate the -files. + In the beta/experimental releases of eCryptfs, when you upgrade + eCryptfs, you should copy the files to an unencrypted location and + then copy the files back into the new eCryptfs mount to migrate the + files. 
-MOUNT-WIDE PASSPHRASE +Mount-wide Passphrase +===================== Create a new directory into which eCryptfs will write its encrypted files (i.e., /root/crypt). Then, create the mount point directory -(i.e., /mnt/crypt). Now it's time to mount eCryptfs: +(i.e., /mnt/crypt). Now it's time to mount eCryptfs:: -mount -t ecryptfs /root/crypt /mnt/crypt + mount -t ecryptfs /root/crypt /mnt/crypt You should be prompted for a passphrase and a salt (the salt may be blank). -Try writing a new file: +Try writing a new file:: -echo "Hello, World" > /mnt/crypt/hello.txt + echo "Hello, World" > /mnt/crypt/hello.txt The operation will complete. Notice that there is a new file in /root/crypt that is at least 12288 bytes in size (depending on your @@ -59,10 +65,13 @@ keyctl clear @u Then umount /mnt/crypt and mount again per the instructions given above. -cat /mnt/crypt/hello.txt +:: + + cat /mnt/crypt/hello.txt -NOTES +Notes +===== eCryptfs version 0.1 should only be mounted on (1) empty directories or (2) directories containing files only created by eCryptfs. If you diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.rst index 686a64bba775..90ac65683e7e 100644 --- a/Documentation/filesystems/efivarfs.txt +++ b/Documentation/filesystems/efivarfs.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 +======================================= efivarfs - a (U)EFI variable filesystem +======================================= The efivarfs filesystem was created to address the shortcomings of using entries in sysfs to maintain EFI variables. The old sysfs EFI @@ -11,7 +14,7 @@ than a single page, sysfs isn't the best interface for this. Variables can be created, deleted and modified with the efivarfs filesystem. 
-efivarfs is typically mounted like this, +efivarfs is typically mounted like this:: mount -t efivarfs none /sys/firmware/efi/efivars diff --git a/Documentation/filesystems/erofs.txt b/Documentation/filesystems/erofs.rst index db6d39c3ae71..bf145171c2bf 100644 --- a/Documentation/filesystems/erofs.txt +++ b/Documentation/filesystems/erofs.rst @@ -1,3 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +Enhanced Read-Only File System - EROFS +====================================== + Overview ======== @@ -6,6 +12,7 @@ from other read-only file systems, it aims to be designed for flexibility, scalability, but be kept simple and high performance. It is designed as a better filesystem solution for the following scenarios: + - read-only storage media or - part of a fully trusted read-only solution, which means it needs to be @@ -17,6 +24,7 @@ It is designed as a better filesystem solution for the following scenarios: for those embedded devices with limited memory (ex, smartphone); Here is the main features of EROFS: + - Little endian on-disk design; - Currently 4KB block size (nobh) and therefore maximum 16TB address space; @@ -24,13 +32,17 @@ Here is the main features of EROFS: - Metadata & data could be mixed by design; - 2 inode versions for different requirements: + + ===================== ============ ===================================== compact (v1) extended (v2) - Inode metadata size: 32 bytes 64 bytes - Max file size: 4 GB 16 EB (also limited by max. vol size) - Max uids/gids: 65536 4294967296 - File change time: no yes (64 + 32-bit timestamp) - Max hardlinks: 65536 4294967296 - Metadata reserved: 4 bytes 14 bytes + ===================== ============ ===================================== + Inode metadata size 32 bytes 64 bytes + Max file size 4 GB 16 EB (also limited by max. 
vol size) + Max uids/gids 65536 4294967296 + File change time no yes (64 + 32-bit timestamp) + Max hardlinks 65536 4294967296 + Metadata reserved 4 bytes 14 bytes + ===================== ============ ===================================== - Support extended attributes (xattrs) as an option; @@ -43,29 +55,36 @@ Here is the main features of EROFS: The following git tree provides the file system user-space tools under development (ex, formatting tool mkfs.erofs): ->> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git + +- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git Bugs and patches are welcome, please kindly help us and send to the following linux-erofs mailing list: ->> linux-erofs mailing list <linux-erofs@lists.ozlabs.org> + +- linux-erofs mailing list <linux-erofs@lists.ozlabs.org> Mount options ============= +=================== ========================================================= (no)user_xattr Setup Extended User Attributes. Note: xattr is enabled by default if CONFIG_EROFS_FS_XATTR is selected. (no)acl Setup POSIX Access Control List. Note: acl is enabled by default if CONFIG_EROFS_FS_POSIX_ACL is selected. cache_strategy=%s Select a strategy for cached decompression from now on: - disabled: In-place I/O decompression only; - readahead: Cache the last incomplete compressed physical + + ========== ============================================= + disabled In-place I/O decompression only; + readahead Cache the last incomplete compressed physical cluster for further reading. It still does in-place I/O decompression for the rest compressed physical clusters; - readaround: Cache the both ends of incomplete compressed + readaround Cache the both ends of incomplete compressed physical clusters for further reading. It still does in-place I/O decompression for the rest compressed physical clusters. 
+ ========== ============================================= +=================== ========================================================= On-disk details =============== @@ -73,7 +92,7 @@ On-disk details Summary ------- Different from other read-only file systems, an EROFS volume is designed -to be as simple as possible: +to be as simple as possible:: |-> aligned with the block size ____________________________________________________________ @@ -83,41 +102,45 @@ to be as simple as possible: All data areas should be aligned with the block size, but metadata areas may not. All metadatas can be now observed in two different spaces (views): + 1. Inode metadata space + Each valid inode should be aligned with an inode slot, which is a fixed value (32 bytes) and designed to be kept in line with compact inode size. Each inode can be directly found with the following formula: inode offset = meta_blkaddr * block_size + 32 * nid - |-> aligned with 8B - |-> followed closely - + meta_blkaddr blocks |-> another slot - _____________________________________________________________________ - | ... | inode | xattrs | extents | data inline | ... | inode ... - |________|_______|(optional)|(optional)|__(optional)_|_____|__________ - |-> aligned with the inode slot size - . . - . . - . . - . . - . . - . . - .____________________________________________________|-> aligned with 4B - | xattr_ibody_header | shared xattrs | inline xattrs | - |____________________|_______________|_______________| - |-> 12 bytes <-|->x * 4 bytes<-| . - . . . - . . . - . . . - ._______________________________.______________________. - | id | id | id | id | ... | id | ent | ... | ent| ... | - |____|____|____|____|______|____|_____|_____|____|_____| - |-> aligned with 4B - |-> aligned with 4B + :: + + |-> aligned with 8B + |-> followed closely + + meta_blkaddr blocks |-> another slot + _____________________________________________________________________ + | ... | inode | xattrs | extents | data inline | ... 
| inode ... + |________|_______|(optional)|(optional)|__(optional)_|_____|__________ + |-> aligned with the inode slot size + . . + . . + . . + . . + . . + . . + .____________________________________________________|-> aligned with 4B + | xattr_ibody_header | shared xattrs | inline xattrs | + |____________________|_______________|_______________| + |-> 12 bytes <-|->x * 4 bytes<-| . + . . . + . . . + . . . + ._______________________________.______________________. + | id | id | id | id | ... | id | ent | ... | ent| ... | + |____|____|____|____|______|____|_____|_____|____|_____| + |-> aligned with 4B + |-> aligned with 4B Inode could be 32 or 64 bytes, which can be distinguished from a common - field which all inode versions have -- i_format: + field which all inode versions have -- i_format:: __________________ __________________ | i_format | | i_format | @@ -132,16 +155,19 @@ may not. All metadatas can be now observed in two different spaces (views): proper alignment, and they could be optional for different data mappings. _currently_ total 4 valid data mappings are supported: + == ==================================================================== 0 flat file data without data inline (no extent); 1 fixed-sized output data compression (with non-compacted indexes); 2 flat file data with tail packing data inline (no extent); 3 fixed-sized output data compression (with compacted indexes, v5.3+). + == ==================================================================== The size of the optional xattrs is indicated by i_xattr_count in inode header. Large xattrs or xattrs shared by many different files can be stored in shared xattrs metadata rather than inlined right after inode. 2. Shared xattrs metadata space + Shared xattrs space is similar to the above inode space, started with a specific block indicated by xattr_blkaddr, organized one by one with proper align. @@ -149,11 +175,13 @@ may not. 
All metadatas can be now observed in two different spaces (views): Each share xattr can also be directly found by the following formula: xattr offset = xattr_blkaddr * block_size + 4 * xattr_id - |-> aligned by 4 bytes - + xattr_blkaddr blocks |-> aligned with 4 bytes - _________________________________________________________________________ - | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... - |________|_____________|_____________|_____|______________|_______________ + :: + + |-> aligned by 4 bytes + + xattr_blkaddr blocks |-> aligned with 4 bytes + _________________________________________________________________________ + | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... + |________|_____________|_____________|_____|______________|_______________ Directories ----------- @@ -163,19 +191,21 @@ random file lookup, and all directory entries are _strictly_ recorded in alphabetical order in order to support improved prefix binary search algorithm (could refer to the related source code). - ___________________________ - / | - / ______________|________________ - / / | nameoff1 | nameoffN-1 - ____________.______________._______________v________________v__________ -| dirent | dirent | ... | dirent | filename | filename | ... | filename | -|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| - \ ^ - \ | * could have - \ | trailing '\0' - \________________________| nameoff0 +:: + + ___________________________ + / | + / ______________|________________ + / / | nameoff1 | nameoffN-1 + ____________.______________._______________v________________v__________ + | dirent | dirent | ... | dirent | filename | filename | ... 
| filename | + |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| + \ ^ + \ | * could have + \ | trailing '\0' + \________________________| nameoff0 - Directory block + Directory block Note that apart from the offset of the first filename, nameoff0 also indicates the total number of directory entries in this block since it is no need to @@ -184,28 +214,27 @@ introduce another on-disk field at all. Compression ----------- Currently, EROFS supports 4KB fixed-sized output transparent file compression, -as illustrated below: - - |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- - clusterofs clusterofs clusterofs - | | | logical data -_________v_______________________________v_____________________v_______________ -... | . | | . | | . | ... -____|____.________|_____________|________.____|_____________|__.__________|____ - |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| - size size size size size - . . . . - . . . . - . . . . - _______._____________._____________._____________._____________________ - ... | | | | ... physical data - _______|_____________|_____________|_____________|_____________________ - |-> cluster <-|-> cluster <-|-> cluster <-| - size size size +as illustrated below:: + + |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- + clusterofs clusterofs clusterofs + | | | logical data + _________v_______________________________v_____________________v_______________ + ... | . | | . | | . | ... + ____|____.________|_____________|________.____|_____________|__.__________|____ + |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| + size size size size size + . . . . + . . . . + . . . . + _______._____________._____________._____________._____________________ + ... | | | | ... 
physical data + _______|_____________|_____________|_____________|_____________________ + |-> cluster <-|-> cluster <-|-> cluster <-| + size size size Currently each on-disk physical cluster can contain 4KB (un)compressed data at most. For each logical cluster, there is a corresponding on-disk index to describe its cluster type, physical cluster address, etc. See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. - diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.rst index 94c2cf0292f5..d83dbbb162e2 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + The Second Extended Filesystem ============================== @@ -14,8 +16,9 @@ Options Most defaults are determined by the filesystem superblock, and can be set using tune2fs(8). Kernel-determined defaults are indicated by (*). -bsddf (*) Makes `df' act like BSD. -minixdf Makes `df' act like Minix. +==================== === ================================================ +bsddf (*) Makes ``df`` act like BSD. +minixdf Makes ``df`` act like Minix. check=none, nocheck (*) Don't do extra checking of bitmaps on mount (check=normal and check=strict options removed) @@ -62,6 +65,7 @@ quota, usrquota Enable user disk quota support grpquota Enable group disk quota support (requires CONFIG_QUOTA). +==================== === ================================================ noquota option ls silently ignored by ext2. @@ -294,9 +298,9 @@ respective fsck programs. 
If you're exceptionally paranoid, there are 3 ways of making metadata writes synchronous on ext2: -per-file if you have the program source: use the O_SYNC flag to open() -per-file if you don't have the source: use "chattr +S" on the file -per-filesystem: add the "sync" option to mount (or in /etc/fstab) +- per-file if you have the program source: use the O_SYNC flag to open() +- per-file if you don't have the source: use "chattr +S" on the file +- per-filesystem: add the "sync" option to mount (or in /etc/fstab) the first and last are not ext2 specific but do force the metadata to be written synchronously. See also Journaling below. @@ -316,10 +320,12 @@ Most of these limits could be overcome with slight changes in the on-disk format and using a compatibility flag to signal the format change (at the expense of some compatibility). -Filesystem block size: 1kB 2kB 4kB 8kB - -File size limit: 16GB 256GB 2048GB 2048GB -Filesystem size limit: 2047GB 8192GB 16384GB 32768GB +===================== ======= ======= ======= ======== +Filesystem block size 1kB 2kB 4kB 8kB +===================== ======= ======= ======= ======== +File size limit 16GB 256GB 2048GB 2048GB +Filesystem size limit 2047GB 8192GB 16384GB 32768GB +===================== ======= ======= ======= ======== There is a 2.4 kernel limit of 2048GB for a single block device, so no filesystem larger than that can be created at this time. There is also @@ -370,19 +376,24 @@ ext4 and journaling. 
References ========== +======================= =============================================== The kernel source file:/usr/src/linux/fs/ext2/ e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ Filesystem Resizing http://ext2resize.sourceforge.net/ -Compression (*) http://e2compr.sourceforge.net/ +Compression [1]_ http://e2compr.sourceforge.net/ +======================= =============================================== Implementations for: + +======================= =========================================================== Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs -Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2 -DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +Windows 95 [1]_ http://www.yipton.net/content.html#FSDEXT2 +DOS client [1]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +OS/2 [2]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ +======================= =========================================================== -(*) no longer actively developed/supported (as of Apr 2001) -(+) no longer actively developed/supported (as of Mar 2009) +.. [1] no longer actively developed/supported (as of Apr 2001) +.. [2] no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.rst index 58758fbef9e0..c06cec3a8fdc 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.rst @@ -1,4 +1,6 @@ +.. 
SPDX-License-Identifier: GPL-2.0 +=============== Ext3 Filesystem =============== diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.rst index 4eb3e2ddd00e..d681203728d7 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.rst @@ -1,6 +1,8 @@ -================================================================================ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== WHAT IS Flash-Friendly File System (F2FS)? -================================================================================ +========================================== NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have been equipped on a variety systems ranging from mobile to server systems. Since @@ -20,14 +22,15 @@ layout, but also for selecting allocation and cleaning algorithms. The following git tree provides the file system formatting tool (mkfs.f2fs), a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). ->> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git + +- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git For reporting bugs and sending patches, please use the following mailing list: ->> linux-f2fs-devel@lists.sourceforge.net -================================================================================ -BACKGROUND AND DESIGN ISSUES -================================================================================ +- linux-f2fs-devel@lists.sourceforge.net + +Background and Design issues +============================ Log-structured File System (LFS) -------------------------------- @@ -61,6 +64,7 @@ needs to reclaim these obsolete blocks seamlessly to users. This job is called as a cleaning process. The process consists of three operations as follows. + 1. A victim segment is selected through referencing segment usage table. 2. 
It loads parent index structures of all the data in the victim identified by segment summary blocks. @@ -71,9 +75,8 @@ This cleaning job may cause unexpected long delays, so the most important goal is to hide the latencies to users. And also definitely, it should reduce the amount of valid data to be moved, and move them quickly as well. -================================================================================ -KEY FEATURES -================================================================================ +Key Features +============ Flash Awareness --------------- @@ -94,10 +97,11 @@ Cleaning Overhead - Support multi-head logs for static/dynamic hot and cold data separation - Introduce adaptive logging for efficient block allocation -================================================================================ -MOUNT OPTIONS -================================================================================ +Mount Options +============= + +====================== ============================================================ background_gc=%s Turn on/off cleaning operations, namely garbage collection, triggered in background when I/O subsystem is idle. If background_gc=on, it will turn on the garbage @@ -167,7 +171,10 @@ fault_injection=%d Enable fault injection in all supported types with fault_type=%d Support configuring fault injection type, should be enabled with fault_injection option, fault type value is shown below, it supports single or combined type. + + =================== =========== Type_Name Type_Value + =================== =========== FAULT_KMALLOC 0x000000001 FAULT_KVMALLOC 0x000000002 FAULT_PAGE_ALLOC 0x000000004 @@ -183,6 +190,7 @@ fault_type=%d Support configuring fault injection type, should be FAULT_CHECKPOINT 0x000001000 FAULT_DISCARD 0x000002000 FAULT_WRITE_IO 0x000004000 + =================== =========== mode=%s Control block allocation mode which supports "adaptive" and "lfs". 
In "lfs" mode, there should be no random writes towards main area. @@ -219,7 +227,7 @@ fsync_mode=%s Control the policy of fsync. Currently supports "posix", non-atomic files likewise "nobarrier" mount option. test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt context. The fake fscrypt context is used by xfstests. -checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" +checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" to reenable checkpointing. Is enabled by default. While disabled, any unmounting or unexpected shutdowns will cause the filesystem contents to appear as they did when the @@ -246,22 +254,22 @@ compress_extension=%s Support adding specified extension, so that f2fs can enab on compression extension list and enable compression on these file by default rather than to enable it via ioctl. For other files, we can still enable compression via ioctl. +====================== ============================================================ -================================================================================ -DEBUGFS ENTRIES -================================================================================ +Debugfs Entries +=============== /sys/kernel/debug/f2fs/ contains information about all the partitions mounted as f2fs. Each file shows the whole f2fs information. /sys/kernel/debug/f2fs/status includes: + - major file system information managed by f2fs currently - average SIT information about whole segments - current memory footprint consumed by f2fs. -================================================================================ -SYSFS ENTRIES -================================================================================ +Sysfs Entries +============= Information about mounted f2fs file systems can be found in /sys/fs/f2fs. Each mounted filesystem will have a directory in @@ -271,22 +279,24 @@ The files in each per-device directory are shown in table below. 
Files in /sys/fs/f2fs/<devname> (see also Documentation/ABI/testing/sysfs-fs-f2fs) -================================================================================ -USAGE -================================================================================ +Usage +===== 1. Download userland tools and compile them. 2. Skip, if f2fs was compiled statically inside kernel. - Otherwise, insert the f2fs.ko module. - # insmod f2fs.ko + Otherwise, insert the f2fs.ko module:: + + # insmod f2fs.ko -3. Create a directory trying to mount - # mkdir /mnt/f2fs +3. Create a directory trying to mount:: -4. Format the block device, and then mount as f2fs - # mkfs.f2fs -l label /dev/block_device - # mount -t f2fs /dev/block_device /mnt/f2fs + # mkdir /mnt/f2fs + +4. Format the block device, and then mount as f2fs:: + + # mkfs.f2fs -l label /dev/block_device + # mount -t f2fs /dev/block_device /mnt/f2fs mkfs.f2fs --------- @@ -294,18 +304,26 @@ The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, which builds a basic on-disk layout. The options consist of: --l [label] : Give a volume label, up to 512 unicode name. --a [0 or 1] : Split start location of each area for heap-based allocation. - 1 is set by default, which performs this. --o [int] : Set overprovision ratio in percent over volume size. - 5 is set by default. --s [int] : Set the number of segments per section. - 1 is set by default. --z [int] : Set the number of sections per zone. - 1 is set by default. --e [str] : Set basic extension list. e.g. "mp3,gif,mov" --t [0 or 1] : Disable discard command or not. - 1 is set by default, which conducts discard. + +=============== =========================================================== +``-l [label]`` Give a volume label, up to 512 unicode name. +``-a [0 or 1]`` Split start location of each area for heap-based allocation. + + 1 is set by default, which performs this. +``-o [int]`` Set overprovision ratio in percent over volume size. + + 5 is set by default. 
+``-s [int]`` Set the number of segments per section. + + 1 is set by default. +``-z [int]`` Set the number of sections per zone. + + 1 is set by default. +``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov" +``-t [0 or 1]`` Disable discard command or not. + + 1 is set by default, which conducts discard. +=============== =========================================================== fsck.f2fs --------- @@ -314,7 +332,8 @@ partition, which examines whether the filesystem metadata and user-made data are cross-referenced correctly or not. Note that, initial version of the tool does not fix any inconsistency. -The options consist of: +The options consist of:: + -d debug level [default:0] dump.f2fs @@ -327,20 +346,21 @@ It shows on-disk inode information recognized by a given inode number, and is able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and ./dump_sit respectively. -The options consist of: +The options consist of:: + -d debug level [default:0] -i inode no (hex) -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] -Examples: -# dump.f2fs -i [ino] /dev/sdx -# dump.f2fs -s 0~-1 /dev/sdx (SIT dump) -# dump.f2fs -a 0~-1 /dev/sdx (SSA dump) +Examples:: + + # dump.f2fs -i [ino] /dev/sdx + # dump.f2fs -s 0~-1 /dev/sdx (SIT dump) + # dump.f2fs -a 0~-1 /dev/sdx (SSA dump) -================================================================================ -DESIGN -================================================================================ +Design +====== On-disk Layout -------------- @@ -351,7 +371,7 @@ consists of a set of sections. By default, section and zone sizes are set to one segment size identically, but users can easily modify the sizes by mkfs. F2FS splits the entire volume into six areas, and all the areas except superblock -consists of multiple segments as described below. 
+consists of multiple segments as described below:: align with the zone size <-| |-> align with the segment size @@ -373,28 +393,28 @@ consists of multiple segments as described below. |__zone__| - Superblock (SB) - : It is located at the beginning of the partition, and there exist two copies + It is located at the beginning of the partition, and there exist two copies to avoid file system crash. It contains basic partition information and some default parameters of f2fs. - Checkpoint (CP) - : It contains file system information, bitmaps for valid NAT/SIT sets, orphan + It contains file system information, bitmaps for valid NAT/SIT sets, orphan inode lists, and summary entries of current active segments. - Segment Information Table (SIT) - : It contains segment information such as valid block count and bitmap for the + It contains segment information such as valid block count and bitmap for the validity of all the blocks. - Node Address Table (NAT) - : It is composed of a block address table for all the node blocks stored in + It is composed of a block address table for all the node blocks stored in Main area. - Segment Summary Area (SSA) - : It contains summary entries which contains the owner information of all the + It contains summary entries which contains the owner information of all the data and node blocks stored in Main area. - Main Area - : It contains file and directory data including their indices. + It contains file and directory data including their indices. In order to avoid misalignment between file system and flash-based storage, F2FS aligns the start block address of CP with the segment size. Also, it aligns the @@ -414,7 +434,7 @@ One of them always indicates the last valid data, which is called as shadow copy mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. For file system consistency, each CP points to which NAT and SIT copies are -valid, as shown as below. 
+valid, as shown as below:: +--------+----------+---------+ | CP | SIT | NAT | @@ -438,7 +458,7 @@ indirect node. F2FS assigns 4KB to an inode block which contains 923 data block indices, two direct node pointers, two indirect node pointers, and one double indirect node pointer as described below. One direct node block contains 1018 data blocks, and one indirect node block contains also 1018 node blocks. Thus, -one inode block (i.e., a file) covers: +one inode block (i.e., a file) covers:: 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. @@ -473,6 +493,8 @@ A dentry block consists of 214 dentry slots and file names. Therein a bitmap is used to represent whether each dentry is valid or not. A dentry block occupies 4KB with the following composition. +:: + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + dentries(11 * 214 bytes) + file name (8 * 214 bytes) @@ -498,23 +520,25 @@ F2FS implements multi-level hash tables for directory structure. Each level has a hash table with dedicated number of hash buckets as shown below. Note that "A(2B)" means a bucket includes 2 data blocks. ----------------------- -A : bucket -B : block -N : MAX_DIR_HASH_DEPTH ----------------------- +:: + + ---------------------- + A : bucket + B : block + N : MAX_DIR_HASH_DEPTH + ---------------------- -level #0 | A(2B) - | -level #1 | A(2B) - A(2B) - | -level #2 | A(2B) - A(2B) - A(2B) - A(2B) - . | . . . . -level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) - . | . . . . -level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) + level #0 | A(2B) + | + level #1 | A(2B) - A(2B) + | + level #2 | A(2B) - A(2B) - A(2B) - A(2B) + . | . . . . + level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) + . | . . . . + level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... 
- A(4B) -The number of blocks and buckets are determined by, +The number of blocks and buckets are determined by:: ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, # of blocks in level #n = | @@ -532,7 +556,7 @@ dentry consisting of the file name and its inode number. If not found, F2FS scans the next hash table in level #1. In this way, F2FS scans hash tables in each levels incrementally from 1 to N. In each levels F2FS needs to scan only one bucket determined by the following equation, which shows O(log(# of files)) -complexity. +complexity:: bucket number to scan in level #n = (hash value) % (# of buckets in level #n) @@ -540,7 +564,8 @@ In the case of file creation, F2FS finds empty consecutive slots that cover the file name. F2FS searches the empty slots in the hash tables of whole levels from 1 to N in the same way as the lookup operation. -The following figure shows an example of two cases holding children. +The following figure shows an example of two cases holding children:: + --------------> Dir <-------------- | | child child @@ -611,14 +636,15 @@ Write-hint Policy 2) whint_mode=user-based. F2FS tries to pass down hints given by users. +===================== ======================== =================== User F2FS Block ----- ---- ----- +===================== ======================== =================== META WRITE_LIFE_NOT_SET HOT_NODE " WARM_NODE " COLD_NODE " -*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME -*extension list " " +ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME +extension list " " -- buffered io WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME @@ -635,11 +661,13 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== 3) whint_mode=fs-based. F2FS passes down hints with its policy. 
+===================== ======================== =================== User F2FS Block ----- ---- ----- +===================== ======================== =================== META WRITE_LIFE_MEDIUM; HOT_NODE WRITE_LIFE_NOT_SET WARM_NODE " @@ -662,6 +690,7 @@ WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_LONG " WRITE_LIFE_LONG +===================== ======================== =================== Fallocate(2) Policy ------------------- @@ -681,6 +710,7 @@ Allocating disk space However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having zero or random data, which is useful to the below scenario where: + 1. create(fd) 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE) 3. fallocate(fd, 0, 0, size) @@ -692,39 +722,41 @@ Compression implementation -------------------------- - New term named cluster is defined as basic unit of compression, file can -be divided into multiple clusters logically. One cluster includes 4 << n -(n >= 0) logical pages, compression size is also cluster size, each of -cluster can be compressed or not. + be divided into multiple clusters logically. One cluster includes 4 << n + (n >= 0) logical pages, compression size is also cluster size, each of + cluster can be compressed or not. - In cluster metadata layout, one special block address is used to indicate -cluster is compressed one or normal one, for compressed cluster, following -metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs -stores data including compress header and compressed data. + cluster is compressed one or normal one, for compressed cluster, following + metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs + stores data including compress header and compressed data. 
- In order to eliminate write amplification during overwrite, F2FS only -support compression on write-once file, data can be compressed only when -all logical blocks in file are valid and cluster compress ratio is lower -than specified threshold. + support compression on write-once file, data can be compressed only when + all logical blocks in file are valid and cluster compress ratio is lower + than specified threshold. - To enable compression on regular inode, there are three ways: -* chattr +c file -* chattr +c dir; touch dir/file -* mount w/ -o compress_extension=ext; touch file.ext - -Compress metadata layout: - [Dnode Structure] - +-----------------------------------------------+ - | cluster 1 | cluster 2 | ......... | cluster N | - +-----------------------------------------------+ - . . . . - . . . . - . Compressed Cluster . . Normal Cluster . -+----------+---------+---------+---------+ +---------+---------+---------+---------+ -|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | -+----------+---------+---------+---------+ +---------+---------+---------+---------+ - . . - . . - . . - +-------------+-------------+----------+----------------------------+ - | data length | data chksum | reserved | compressed data | - +-------------+-------------+----------+----------------------------+ + + * chattr +c file + * chattr +c dir; touch dir/file + * mount w/ -o compress_extension=ext; touch file.ext + +Compress metadata layout:: + + [Dnode Structure] + +-----------------------------------------------+ + | cluster 1 | cluster 2 | ......... | cluster N | + +-----------------------------------------------+ + . . . . + . . . . + . Compressed Cluster . . Normal Cluster . + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 | + +----------+---------+---------+---------+ +---------+---------+---------+---------+ + . . + . . 
+ . . + +-------------+-------------+----------+----------------------------+ + | data length | data chksum | reserved | compressed data | + +-------------+-------------+----------+----------------------------+ diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index bd9932344804..aa072112cfff 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -633,6 +633,17 @@ from a passphrase or other low-entropy user credential. FS_IOC_GET_ENCRYPTION_PWSALT is deprecated. Instead, prefer to generate and manage any needed salt(s) in userspace. +Getting a file's encryption nonce +--------------------------------- + +Since Linux v5.7, the ioctl FS_IOC_GET_ENCRYPTION_NONCE is supported. +On encrypted files and directories it gets the inode's 16-byte nonce. +On unencrypted files and directories, it fails with ENODATA. + +This ioctl can be useful for automated tests which verify that the +encryption is being done correctly. It is not needed for normal use +of fscrypt. + Adding keys ----------- diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse.rst index 8e455065ce9e..cd717f9bf940 100644 --- a/Documentation/filesystems/fuse.rst +++ b/Documentation/filesystems/fuse.rst @@ -1,7 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -============== + +==== FUSE -============== +==== Definitions =========== diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.rst index 19a19ebebc34..f162a2c76c69 100644 --- a/Documentation/filesystems/gfs2-uevents.txt +++ b/Documentation/filesystems/gfs2-uevents.rst @@ -1,14 +1,18 @@ - uevents and GFS2 - ================== +.. SPDX-License-Identifier: GPL-2.0 + +================ +uevents and GFS2 +================ During the lifetime of a GFS2 mount, a number of uevents are generated. This document explains what the events are and what they are used for (by gfs_controld in gfs2-utils). 
A list of GFS2 uevents ------------------------ +====================== 1. ADD +------ The ADD event occurs at mount time. It will always be the first uevent generated by the newly created filesystem. If the mount @@ -21,6 +25,7 @@ with no journal assigned), and read-only (with journal assigned) status of the filesystem respectively. 2. ONLINE +--------- The ONLINE uevent is generated after a successful mount or remount. It has the same environment variables as the ADD uevent. The ONLINE @@ -29,6 +34,7 @@ RDONLY are a relatively recent addition (2.6.32-rc+) and will not be generated by older kernels. 3. CHANGE +--------- The CHANGE uevent is used in two places. One is when reporting the successful mount of the filesystem by the first node (FIRSTMOUNT=Done). @@ -52,6 +58,7 @@ cluster. For this reason the ONLINE uevent was used when adding a new uevent for a successful mount or remount. 4. OFFLINE +---------- The OFFLINE uevent is only generated due to filesystem errors and is used as part of the "withdraw" mechanism. Currently this doesn't give any @@ -59,6 +66,7 @@ information about what the error is, which is something that needs to be fixed. 5. REMOVE +--------- The REMOVE uevent is generated at the end of an unsuccessful mount or at the end of a umount of the filesystem. All REMOVE uevents will @@ -68,9 +76,10 @@ kobject subsystem. Information common to all GFS2 uevents (uevent environment variables) ----------------------------------------------------------------------- +===================================================================== 1. LOCKTABLE= +-------------- The LOCKTABLE is a string, as supplied on the mount command line (locktable=) or via fstab. It is used as a filesystem label @@ -78,6 +87,7 @@ as well as providing the information for a lock_dlm mount to be able to join the cluster. 2. LOCKPROTO= +------------- The LOCKPROTO is a string, and its value depends on what is set on the mount command line, or via fstab. 
It will be either @@ -85,12 +95,14 @@ lock_nolock or lock_dlm. In the future other lock managers may be supported. 3. JOURNALID= +------------- If a journal is in use by the filesystem (journals are not assigned for spectator mounts) then this will give the numeric journal id in all GFS2 uevents. 4. UUID= +-------- With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID into the filesystem superblock. If it exists, this will diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.rst index cc4f2306609e..8d1ab589ce18 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== Global File System ------------------- +================== https://fedorahosted.org/cluster/wiki/HomePage @@ -14,16 +17,18 @@ on one machine show up immediately on all other machines in the cluster. GFS uses interchangeable inter-node locking mechanisms, the currently supported mechanisms are: - lock_nolock -- allows gfs to be used as a local file system + lock_nolock + - allows gfs to be used as a local file system - lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking - The dlm is found at linux/fs/dlm/ + lock_dlm + - uses a distributed lock manager (dlm) for inter-node locking. + The dlm is found at linux/fs/dlm/ Lock_dlm depends on user space cluster management systems found at the URL above. To use gfs as a local file system, no external clustering systems are -needed, simply: +needed, simply:: $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device $ mount -t gfs2 /dev/block_device /dir @@ -37,9 +42,12 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it is pretty close. 
The following man pages can be found at the URL above: + + ============ ============================================= fsck.gfs2 to repair a filesystem gfs2_grow to expand a filesystem online gfs2_jadd to add journals to a filesystem online tunegfs2 to manipulate, examine and tune a filesystem - gfs2_convert to convert a gfs filesystem to gfs2 in-place + gfs2_convert to convert a gfs filesystem to gfs2 in-place mkfs.gfs2 to make a filesystem + ============ ============================================= diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.rst index d096df6db07a..ab17a005e9b1 100644 --- a/Documentation/filesystems/hfs.txt +++ b/Documentation/filesystems/hfs.rst @@ -1,11 +1,16 @@ -Note: This filesystem doesn't have a maintainer. +.. SPDX-License-Identifier: GPL-2.0 +================================== Macintosh HFS Filesystem for Linux ================================== -HFS stands for ``Hierarchical File System'' and is the filesystem used + +.. Note:: This filesystem doesn't have a maintainer. + + +HFS stands for ``Hierarchical File System`` and is the filesystem used by the Mac Plus and all later Macintosh models. Earlier Macintosh -models used MFS (``Macintosh File System''), which is not supported, +models used MFS (``Macintosh File System``), which is not supported, MacOS 8.1 and newer support a filesystem called HFS+ that's similar to HFS but is extended in various areas. Use the hfsplus filesystem driver to access such filesystems from Linux. @@ -49,25 +54,25 @@ Writing to HFS Filesystems HFS is not a UNIX filesystem, thus it does not have the usual features you'd expect: - o You can't modify the set-uid, set-gid, sticky or executable bits or the uid + * You can't modify the set-uid, set-gid, sticky or executable bits or the uid and gid of files. - o You can't create hard- or symlinks, device files, sockets or FIFOs. + * You can't create hard- or symlinks, device files, sockets or FIFOs. 
HFS does on the other have the concepts of multiple forks per file. These non-standard forks are represented as hidden additional files in the normal filesystems namespace which is kind of a cludge and makes the semantics for the a little strange: - o You can't create, delete or rename resource forks of files or the + * You can't create, delete or rename resource forks of files or the Finder's metadata. - o They are however created (with default values), deleted and renamed + * They are however created (with default values), deleted and renamed along with the corresponding data fork or directory. - o Copying files to a different filesystem will loose those attributes + * Copying files to a different filesystem will loose those attributes that are essential for MacOS to work. Creating HFS filesystems -=================================== +======================== The hfsutils package from Robert Leslie contains a program called hformat that can be used to create HFS filesystem. See diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.rst index 59f7569fc9ed..f02f4f5fc020 100644 --- a/Documentation/filesystems/hfsplus.txt +++ b/Documentation/filesystems/hfsplus.rst @@ -1,4 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +====================================== Macintosh HFSPlus Filesystem for Linux ====================================== diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.rst index 74630bd504fb..0db152278572 100644 --- a/Documentation/filesystems/hpfs.txt +++ b/Documentation/filesystems/hpfs.rst @@ -1,13 +1,21 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +==================== Read/Write HPFS 2.09 +==================== + 1998-2004, Mikulas Patocka -email: mikulas@artax.karlin.mff.cuni.cz -homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi +:email: mikulas@artax.karlin.mff.cuni.cz +:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi -CREDITS: +Credits +======= Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file is taken from it + Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) + Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion Mount options @@ -50,6 +58,7 @@ timeshift=(-)nnn (default 0) File names +========== As in OS/2, filenames are case insensitive. However, shell thinks that names are case sensitive, so for example when you create a file FOO, you can use @@ -64,6 +73,7 @@ access it under names 'a.', 'a..', 'a . . . ' etc. Extended attributes +=================== On HPFS partitions, OS/2 can associate to each file a special information called extended attributes. Extended attributes are pairs of (key,value) where key is @@ -88,6 +98,7 @@ values doesn't work. Symlinks +======== You can do symlinks on HPFS partition, symlinks are achieved by setting extended attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and @@ -101,6 +112,7 @@ to analyze or change OS2SYS.INI. Codepages +========= HPFS can contain several uppercasing tables for several codepages and each file has a pointer to codepage its name is in. However OS/2 was created in @@ -128,6 +140,7 @@ this codepage - if you don't try to do what I described above :-) Known bugs +========== HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client should work. If you have OS/2 server, use only read-only mode. I don't know how @@ -152,7 +165,8 @@ would result in directory tree splitting, that takes disk space. 
Workaround is to delete other files that are leaf (probability that the file is non-leaf is about 1/50) or to truncate file first to make some space. You encounter this problem only if you have many directories so that -preallocated directory band is full i.e. +preallocated directory band is full i.e.:: + number_of_directories / size_of_filesystem_in_mb > 4. You can't delete open directories. @@ -174,6 +188,7 @@ anybody know what does it mean? What does "unbalanced tree" message mean? +========================================= Old versions of this driver created sometimes unbalanced dnode trees. OS/2 chkdsk doesn't scream if the tree is unbalanced (and sometimes creates @@ -187,6 +202,7 @@ whole created by this driver, it is BUG - let me know about it. Bugs in OS/2 +============ When you have two (or more) lost directories pointing each to other, chkdsk locks up when repairing filesystem. @@ -199,98 +215,139 @@ File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and marks them as short (and writes "minor fs error corrected"). This bug is not in HPFS386. -Codepage bugs described above. +Codepage bugs described above +============================= If you don't install fixpacks, there are many, many more... 
History +======= + +====== ========================================================================= +0.90 First public release +0.91 Fixed bug that caused shooting to memory when write_inode was called on + open inode (rarely happened) +0.92 Fixed a little memory leak in freeing directory inodes +0.93 Fixed bug that locked up the machine when there were too many filenames + with first 15 characters same + Fixed write_file to zero file when writing behind file end +0.94 Fixed a little memory leak when trying to delete busy file or directory +0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files +1.90 First version for 2.1.1xx kernels +1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk + Fixed a race-condition when write_inode is called while deleting file + Fixed a bug that could possibly happen (with very low probability) when + using 0xff in filenames. + + Rewritten locking to avoid race-conditions + + Mount option 'eas' now works + + Fsync no longer returns error + + Files beginning with '.' 
are marked hidden + + Remount support added + + Alloc is not so slow when filesystem becomes full + + Atimes are no more updated because it slows down operation + + Code cleanup (removed all commented debug prints) +1.92 Corrected a bug when sync was called just before closing file +1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it + works with previous versions + + Fixed a possible problem with disks > 64G (but I don't have one, so I can't + test it) + + Fixed a file overflow at 2G + + Added new option 'timeshift' + + Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in + read-only mode + + Fixed a bug that slowed down alloc and prevented allocating 100% space + (this bug was not destructive) +1.94 Added workaround for one bug in Linux + + Fixed one buffer leak + + Fixed some incompatibilities with large extended attributes (but it's still + not 100% ok, I have no info on it and OS/2 doesn't want to create them) + + Rewritten allocation -0.90 First public release -0.91 Fixed bug that caused shooting to memory when write_inode was called on - open inode (rarely happened) -0.92 Fixed a little memory leak in freeing directory inodes -0.93 Fixed bug that locked up the machine when there were too many filenames - with first 15 characters same - Fixed write_file to zero file when writing behind file end -0.94 Fixed a little memory leak when trying to delete busy file or directory -0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files -1.90 First version for 2.1.1xx kernels -1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk - Fixed a race-condition when write_inode is called while deleting file - Fixed a bug that could possibly happen (with very low probability) when - using 0xff in filenames - Rewritten locking to avoid race-conditions - Mount option 'eas' now works - Fsync no longer returns error - Files beginning with '.' 
are marked hidden - Remount support added - Alloc is not so slow when filesystem becomes full - Atimes are no more updated because it slows down operation - Code cleanup (removed all commented debug prints) -1.92 Corrected a bug when sync was called just before closing file -1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it - works with previous versions - Fixed a possible problem with disks > 64G (but I don't have one, so I can't - test it) - Fixed a file overflow at 2G - Added new option 'timeshift' - Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in - read-only mode - Fixed a bug that slowed down alloc and prevented allocating 100% space - (this bug was not destructive) -1.94 Added workaround for one bug in Linux - Fixed one buffer leak - Fixed some incompatibilities with large extended attributes (but it's still - not 100% ok, I have no info on it and OS/2 doesn't want to create them) - Rewritten allocation - Fixed a bug with i_blocks (du sometimes didn't display correct values) - Directories have no longer archive attribute set (some programs don't like - it) - Fixed a bug that it set badly one flag in large anode tree (it was not - destructive) -1.95 Fixed one buffer leak, that could happen on corrupted filesystem - Fixed one bug in allocation in 1.94 -1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported - error sometimes when opening directories in PMSHELL) - Fixed a possible bitmap race - Fixed possible problem on large disks - You can now delete open files - Fixed a nondestructive race in rename -1.97 Support for HPFS v3 (on large partitions) - Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) + Fixed a bug with i_blocks (du sometimes didn't display correct values) + + Directories have no longer archive attribute set (some programs don't like + it) + + Fixed a bug that it set badly one flag in large anode tree (it was not + destructive) +1.95 Fixed one buffer 
leak, that could happen on corrupted filesystem + + Fixed one bug in allocation in 1.94 +1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported + error sometimes when opening directories in PMSHELL) + + Fixed a possible bitmap race + + Fixed possible problem on large disks + + You can now delete open files + + Fixed a nondestructive race in rename +1.97 Support for HPFS v3 (on large partitions) + + Fixed a bug that it didn't allow creation of files > 128M + (it should be 2G) 1.97.1 Changed names of global symbols + Fixed a bug when chmoding or chowning root directory -1.98 Fixed a deadlock when using old_readdir - Better directory handling; workaround for "unbalanced tree" bug in OS/2 -1.99 Corrected a possible problem when there's not enough space while deleting - file - Now it tries to truncate the file if there's not enough space when deleting - Removed a lot of redundant code -2.00 Fixed a bug in rename (it was there since 1.96) - Better anti-fragmentation strategy -2.01 Fixed problem with directory listing over NFS - Directory lseek now checks for proper parameters - Fixed race-condition in buffer code - it is in all filesystems in Linux; - when reading device (cat /dev/hda) while creating files on it, files - could be damaged -2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond - end of partition -2.03 Char, block devices and pipes are correctly created - Fixed non-crashing race in unlink (Alexander Viro) - Now it works with Japanese version of OS/2 -2.04 Fixed error when ftruncate used to extend file -2.05 Fixed crash when got mount parameters without = - Fixed crash when allocation of anode failed due to full disk - Fixed some crashes when block io or inode allocation failed -2.06 Fixed some crash on corrupted disk structures - Better allocation strategy - Reschedule points added so that it doesn't lock CPU long time - It should work in read-only mode on Warp Server -2.07 More fixes for Warp Server.
Now it really works -2.08 Creating new files is not so slow on large disks - An attempt to sync deleted file does not generate filesystem error -2.09 Fixed error on extremely fragmented files - - - vim: set textwidth=80: +1.98 Fixed a deadlock when using old_readdir + Better directory handling; workaround for "unbalanced tree" bug in OS/2 +1.99 Corrected a possible problem when there's not enough space while deleting + file + + Now it tries to truncate the file if there's not enough space when + deleting + + Removed a lot of redundant code +2.00 Fixed a bug in rename (it was there since 1.96) + Better anti-fragmentation strategy +2.01 Fixed problem with directory listing over NFS + + Directory lseek now checks for proper parameters + + Fixed race-condition in buffer code - it is in all filesystems in Linux; + when reading device (cat /dev/hda) while creating files on it, files + could be damaged +2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond + end of partition +2.03 Char, block devices and pipes are correctly created + + Fixed non-crashing race in unlink (Alexander Viro) + + Now it works with Japanese version of OS/2 +2.04 Fixed error when ftruncate used to extend file +2.05 Fixed crash when got mount parameters without = + + Fixed crash when allocation of anode failed due to full disk + + Fixed some crashes when block io or inode allocation failed +2.06 Fixed some crash on corrupted disk structures + + Better allocation strategy + + Reschedule points added so that it doesn't lock CPU long time + + It should work in read-only mode on Warp Server +2.07 More fixes for Warp Server. 
Now it really works +2.08 Creating new files is not so slow on large disks + + An attempt to sync deleted file does not generate filesystem error +2.09 Fixed error on extremely fragmented files +====== ========================================================================= diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 386eaad008b2..e7b46dac7079 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -1,3 +1,5 @@ +.. _filesystems_index: + =============================== Filesystems in the Linux kernel =============================== @@ -46,8 +48,53 @@ Documentation for filesystem implementations. .. toctree:: :maxdepth: 2 + 9p + adfs + affs + afs autofs + autofs-mount-control + befs + bfs + btrfs + ceph + cramfs + debugfs + dlmfs + ecryptfs + efivarfs + erofs + ext2 + ext3 + f2fs + gfs2 + gfs2-uevents + hfs + hfsplus + hpfs fuse + inotify + isofs + nilfs2 + nfs/index + ntfs + ocfs2 + ocfs2-online-filecheck + omfs + orangefs overlayfs + proc + qnx6 + ramfs-rootfs-initramfs + relay + romfs + squashfs + sysfs + sysv-fs + tmpfs + ubifs + ubifs-authentication.rst + udf virtiofs vfat + zonefs diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.rst index 51f61db787fb..7f7ef8af0e1e 100644 --- a/Documentation/filesystems/inotify.txt +++ b/Documentation/filesystems/inotify.rst @@ -1,27 +1,36 @@ - inotify - a powerful yet simple file change notification system +.. SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Inotify - A Powerful yet Simple File Change Notification System +=============================================================== Document started 15 Mar 2005 by Robert Love <rml@novell.com> + Document updated 4 Jan 2015 by Zhang Zhen <zhenzhang.zhang@huawei.com> - --Deleted obsoleted interface, just refer to manpages for user interface. 
+ + - Deleted obsoleted interface, just refer to manpages for user interface. (i) Rationale -Q: What is the design decision behind not tying the watch to the open fd of +Q: + What is the design decision behind not tying the watch to the open fd of the watched object? -A: Watches are associated with an open inotify device, not an open file. +A: + Watches are associated with an open inotify device, not an open file. This solves the primary problem with dnotify: keeping the file open pins the file and thus, worse, pins the mount. Dnotify is therefore infeasible for use on a desktop system with removable media as the media cannot be unmounted. Watching a file should not require that it be open. -Q: What is the design decision behind using an-fd-per-instance as opposed to +Q: + What is the design decision behind using an-fd-per-instance as opposed to an fd-per-watch? -A: An fd-per-watch quickly consumes more file descriptors than are allowed, +A: + An fd-per-watch quickly consumes more file descriptors than are allowed, more fd's than are feasible to manage, and more fd's than are optimally select()-able. Yes, root can bump the per-process fd limit and yes, users can use epoll, but requiring both is a silly and extraneous requirement. @@ -29,8 +38,8 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed, spaces is thus sensible. The current design is what user-space developers want: Users initialize inotify, once, and add n watches, requiring but one fd and no twiddling with fd limits. Initializing an inotify instance two - thousand times is silly. If we can implement user-space's preferences - cleanly--and we can, the idr layer makes stuff like this trivial--then we + thousand times is silly. If we can implement user-space's preferences + cleanly--and we can, the idr layer makes stuff like this trivial--then we should. There are other good arguments. 
With a single fd, there is a single @@ -65,9 +74,11 @@ A: An fd-per-watch quickly consumes more file descriptors than are allowed, need not be a one-fd-per-process mapping; it is one-fd-per-queue and a process can easily want more than one queue. -Q: Why the system call approach? +Q: + Why the system call approach? -A: The poor user-space interface is the second biggest problem with dnotify. +A: + The poor user-space interface is the second biggest problem with dnotify. Signals are a terrible, terrible interface for file notification. Or for anything, for that matter. The ideal solution, from all perspectives, is a file descriptor-based one that allows basic file I/O and poll/select. diff --git a/Documentation/filesystems/isofs.rst b/Documentation/filesystems/isofs.rst new file mode 100644 index 000000000000..08fd469091d4 --- /dev/null +++ b/Documentation/filesystems/isofs.rst @@ -0,0 +1,64 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +ISO9660 Filesystem +================== + +Mount options that are the same as for msdos and vfat partitions. + + ========= ======================================================== + gid=nnn All files in the partition will be in group nnn. + uid=nnn All files in the partition will be owned by user id nnn. + umask=nnn The permission mask (see umask(1)) for the partition. + ========= ======================================================== + +Mount options that are the same as vfat partitions. These are only useful +when using discs encoded using Microsoft's Joliet extensions. + + ============== ============================================================= + iocharset=name Character set to use for converting from Unicode to + ASCII. Joliet filenames are stored in Unicode format, but + Unix for the most part doesn't know how to deal with Unicode. + There is also an option of doing UTF-8 translations with the + utf8 option. + utf8 Encode Unicode names in UTF-8 format. Default is no. 
+ ============== ============================================================= + +Mount options unique to the isofs filesystem. + + ================= ============================================================ + block=512 Set the block size for the disk to 512 bytes + block=1024 Set the block size for the disk to 1024 bytes + block=2048 Set the block size for the disk to 2048 bytes + check=relaxed Matches filenames with different cases + check=strict Matches only filenames with the exact same case + cruft Try to handle badly formatted CDs. + map=off Do not map non-Rock Ridge filenames to lower case + map=normal Map non-Rock Ridge filenames to lower case + map=acorn As map=normal but also apply Acorn extensions if present + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. + nojoliet Ignore Joliet extensions if they are present. + norock Ignore Rock Ridge extensions if they are present. + hide Completely strip hidden files from the file system. + showassoc Show files marked with the 'associated' bit + unhide Deprecated; showing hidden files is now default; + If given, it is a synonym for 'showassoc' which will + recreate previous unhide behavior + session=x Select number of session on multisession CD + sbsector=xxx Session begins from sector xxx + ================= ============================================================ + +Recommended documents about ISO 9660 standard are located at: + +- http://www.y-adagio.com/ +- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf + +Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically +identical with ISO 9660.", so it is a valid and gratis substitute of the +official ISO specification. 
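Reviewer note on the isofs.rst option tables above: a minimal sketch, assuming a caller wants to assemble the comma-separated option string that is passed to mount with `-o`. The helper name `build_isofs_options` and its parameters are hypothetical and are not part of the kernel or of any mount tooling; only the option names and allowed values come from the tables in this patch.

```python
# Hypothetical helper for illustration only; not a kernel or util-linux API.
# Option names and allowed values are taken from the isofs tables above.

BLOCK_SIZES = (512, 1024, 2048)
CHECK_MODES = ("relaxed", "strict")
MAP_MODES = ("off", "normal", "acorn")
FLAGS = (
    "cruft", "overriderockperm", "nojoliet", "norock",
    "hide", "showassoc", "unhide", "utf8",
)


def build_isofs_options(block=None, check=None, map_mode=None, flags=()):
    """Return a comma-separated isofs option string, validating each value."""
    parts = []
    if block is not None:
        if block not in BLOCK_SIZES:
            raise ValueError(f"unsupported block size: {block}")
        parts.append(f"block={block}")
    if check is not None:
        if check not in CHECK_MODES:
            raise ValueError(f"unsupported check mode: {check}")
        parts.append(f"check={check}")
    if map_mode is not None:
        if map_mode not in MAP_MODES:
            raise ValueError(f"unsupported map mode: {map_mode}")
        parts.append(f"map={map_mode}")
    for flag in flags:
        if flag not in FLAGS:
            raise ValueError(f"unknown flag: {flag}")
        parts.append(flag)
    return ",".join(parts)


if __name__ == "__main__":
    # Build an option string that ignores Rock Ridge and keeps names unmapped.
    print(build_isofs_options(block=2048, map_mode="off", flags=["norock"]))
```

The sketch only validates values that the tables enumerate explicitly; options with free-form arguments (`mode=xxx`, `dmode=xxx`, `session=x`, `sbsector=xxx`, `iocharset=name`) are left out because the document does not spell out their accepted formats.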
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt deleted file mode 100644 index ba0a93384de0..000000000000 --- a/Documentation/filesystems/isofs.txt +++ /dev/null @@ -1,48 +0,0 @@ -Mount options that are the same as for msdos and vfat partitions. - - gid=nnn All files in the partition will be in group nnn. - uid=nnn All files in the partition will be owned by user id nnn. - umask=nnn The permission mask (see umask(1)) for the partition. - -Mount options that are the same as vfat partitions. These are only useful -when using discs encoded using Microsoft's Joliet extensions. - iocharset=name Character set to use for converting from Unicode to - ASCII. Joliet filenames are stored in Unicode format, but - Unix for the most part doesn't know how to deal with Unicode. - There is also an option of doing UTF-8 translations with the - utf8 option. - utf8 Encode Unicode names in UTF-8 format. Default is no. - -Mount options unique to the isofs filesystem. - block=512 Set the block size for the disk to 512 bytes - block=1024 Set the block size for the disk to 1024 bytes - block=2048 Set the block size for the disk to 2048 bytes - check=relaxed Matches filenames with different cases - check=strict Matches only filenames with the exact same case - cruft Try to handle badly formatted CDs. - map=off Do not map non-Rock Ridge filenames to lower case - map=normal Map non-Rock Ridge filenames to lower case - map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx unless Rock Ridge - extensions set the permissions otherwise - dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge - extensions set the permissions otherwise - overriderockperm Set permissions on files and directories according to - 'mode' and 'dmode' even though Rock Ridge extensions are - present. - nojoliet Ignore Joliet extensions if they are present. 
- norock Ignore Rock Ridge extensions if they are present. - hide Completely strip hidden files from the file system. - showassoc Show files marked with the 'associated' bit - unhide Deprecated; showing hidden files is now default; - If given, it is a synonym for 'showassoc' which will - recreate previous unhide behavior - session=x Select number of session on multisession CD - sbsector=xxx Session begins from sector xxx - -Recommended documents about ISO 9660 standard are located at: -http://www.y-adagio.com/ -ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf -Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically -identical with ISO 9660.", so it is a valid and gratis substitute of the -official ISO specification. diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst new file mode 100644 index 000000000000..65805624e39b --- /dev/null +++ b/Documentation/filesystems/nfs/index.rst @@ -0,0 +1,13 @@ +=============================== +NFS +=============================== + + +.. toctree:: + :maxdepth: 1 + + pnfs + rpc-cache + rpc-server-gss + nfs41-server + knfsd-stats diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.rst index 1a5d82180b84..80bcf13550de 100644 --- a/Documentation/filesystems/nfs/knfsd-stats.txt +++ b/Documentation/filesystems/nfs/knfsd-stats.rst @@ -1,7 +1,9 @@ - +============================ Kernel NFS Server Statistics ============================ +:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009 + This document describes the format and semantics of the statistics which the kernel NFS server makes available to userspace. These statistics are available in several text form pseudo files, each of @@ -18,7 +20,7 @@ by parsing routines. All other lines contain a sequence of fields separated by whitespace. 
/proc/fs/nfsd/pool_stats ------------------------- +======================== This file is available in kernels from 2.6.30 onwards, if the /proc/fs/nfsd filesystem is mounted (it almost always should be). @@ -109,15 +111,12 @@ this case), or the transport can be enqueued for later attention (sockets-enqueued counts this case), or the packet can be temporarily deferred because the transport is currently being used by an nfsd thread. This last case is not very interesting and is not explicitly -counted, but can be inferred from the other counters thus: +counted, but can be inferred from the other counters thus:: -packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) More ----- -Descriptions of the other statistics file should go here. - +==== -Greg Banks <gnb@sgi.com> -26 Mar 2009 +Descriptions of the other statistics file should go here. diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst new file mode 100644 index 000000000000..16b5f02f81c3 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.rst @@ -0,0 +1,256 @@ +============================= +NFSv4.1 Server Implementation +============================= + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is enabled by default. +It can be disabled at run time by writing the string "-4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. You can use rpc.nfsd +for this; see rpc.nfsd(8). + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. 
Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on RFC 5661. + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1. The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + +- **pNFS** Parallel NFS +- **FDELG** File Delegations +- **DDELG** Directory Delegations + +The following abbreviations indicate the linux server implementation status. + +- **I** Implemented NFSv4.1 operations. +- **NS** Not Supported. +- **NS\*** Unimplemented optional feature. 
+ +Operations +========== + ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition | ++=======================+======================+=====================+===========================+================+ +| | ACCESS | REQ | | Section 18.1 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BACKCHANNEL_CTL | REQ | | Section 18.33 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CLOSE | REQ | | Section 18.2 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | COMMIT | REQ | | Section 18.3 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | CREATE | REQ | | Section 18.4 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | CREATE_SESSION | REQ | | Section 18.36 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | DELEGRETURN | OPT | FDELG, | Section 18.6 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | DDELG, pNFS | | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | (REQ) | | 
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_CLIENTID | REQ | | Section 18.50 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | DESTROY_SESSION | REQ | | Section 18.37 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | EXCHANGE_ID | REQ | | Section 18.35 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | FREE_STATEID | REQ | | Section 18.38 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETATTR | REQ | | Section 18.7 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | GETFH | REQ | | Section 18.8 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | LAYOUTRETURN | OPT | pNFS 
(REQ) | Section 18.44 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LINK | OPT | | Section 18.9 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCK | REQ | | Section 18.10 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKT | REQ | | Section 18.11 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOCKU | REQ | | Section 18.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUP | REQ | | Section 18.13 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | LOOKUPP | REQ | | Section 18.14 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | NVERIFY | REQ | | Section 18.15 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN | REQ | | Section 18.16 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | OPENATTR | OPT | | Section 18.17 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_CONFIRM | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | OPEN_DOWNGRADE | REQ | | Section 18.18 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTFH | REQ | | Section 18.19 | 
++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTPUBFH | REQ | | Section 18.20 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | PUTROOTFH | REQ | | Section 18.21 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READ | REQ | | Section 18.22 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READDIR | REQ | | Section 18.23 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | READLINK | OPT | | Section 18.24 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RECLAIM_COMPLETE | REQ | | Section 18.51 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RELEASE_LOCKOWNER | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | REMOVE | REQ | | Section 18.25 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENAME | REQ | | Section 18.26 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RENEW | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | RESTOREFH | REQ | | Section 18.27 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SAVEFH | REQ | | Section 18.28 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SECINFO 
| REQ | | Section 18.29 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | | | layout (REQ) | Section 13.12 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | SEQUENCE | REQ | | Section 18.46 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETATTR | REQ | | Section 18.30 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | SETCLIENTID_CONFIRM | MNI | | N/A | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS | SET_SSV | REQ | | Section 18.47 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| I | TEST_STATEID | REQ | | Section 18.48 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | VERIFY | REQ | | Section 18.31 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ +| | WRITE | REQ | | Section 18.32 | ++-----------------------+----------------------+---------------------+---------------------------+----------------+ + + +Callback Operations +=================== 
++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition | ++=======================+=========================+=====================+===========================+===============+ +| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | CB_RECALL | OPT | FDELG, | Section 20.2 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | 
++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS | CB_RECALL_SLOT | REQ | | Section 20.8 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | DDELG, pNFS | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ +| | | | (REQ) | | ++-----------------------+-------------------------+---------------------+---------------------------+---------------+ + + +Implementation notes: +===================== + +SSV: + The spec claims this is mandatory, but we don't actually know of any + implementations, so we're ignoring it for now. 
The server returns + NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. + +GSS on the backchannel: + Again, theoretically required but not widely implemented (in + particular, the current Linux client doesn't request it). We return + NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. + +DELEGPURGE: + mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: + implementation ids are ignored + +CREATE_SESSION: + backchannel attributes are ignored + +SEQUENCE: + no support for dynamic slot table renegotiation (optional) + +Nonstandard compound limitations: + No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. + +See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt deleted file mode 100644 index 682a59fabe3f..000000000000 --- a/Documentation/filesystems/nfs/nfs41-server.txt +++ /dev/null @@ -1,173 +0,0 @@ -NFSv4.1 Server Implementation - -Server support for minorversion 1 can be controlled using the -/proc/fs/nfsd/versions control file. The string output returned -by reading this file will contain either "+4.1" or "-4.1" -correspondingly. - -Currently, server support for minorversion 1 is enabled by default. -It can be disabled at run time by writing the string "-4.1" to -the /proc/fs/nfsd/versions control file. Note that to write this -control file, the nfsd service must be taken down. You can use rpc.nfsd -for this; see rpc.nfsd(8). - -(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and -"-4", respectively. 
Therefore, code meant to work on both new and old -kernels must turn 4.1 on or off *before* turning support for version 4 -on or off; rpc.nfsd does this correctly.) - -The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based -on RFC 5661. - -From the many new features in NFSv4.1 the current implementation -focuses on the mandatory-to-implement NFSv4.1 Sessions, providing -"exactly once" semantics and better control and throttling of the -resources allocated for each client. - -The table below, taken from the NFSv4.1 document, lists -the operations that are mandatory to implement (REQ), optional -(OPT), and NFSv4.0 operations that are required not to implement (MNI) -in minor version 1. The first column indicates the operations that -are not supported yet by the linux server implementation. - -The OPTIONAL features identified and their abbreviations are as follows: - pNFS Parallel NFS - FDELG File Delegations - DDELG Directory Delegations - -The following abbreviations indicate the linux server implementation status. - I Implemented NFSv4.1 operations. - NS Not Supported. - NS* Unimplemented optional feature. 
- -Operations - - +----------------------+------------+--------------+----------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +----------------------+------------+--------------+----------------+ - | ACCESS | REQ | | Section 18.1 | -I | BACKCHANNEL_CTL | REQ | | Section 18.33 | -I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | - | CLOSE | REQ | | Section 18.2 | - | COMMIT | REQ | | Section 18.3 | - | CREATE | REQ | | Section 18.4 | -I | CREATE_SESSION | REQ | | Section 18.36 | -NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | - | DELEGRETURN | OPT | FDELG, | Section 18.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -I | DESTROY_CLIENTID | REQ | | Section 18.50 | -I | DESTROY_SESSION | REQ | | Section 18.37 | -I | EXCHANGE_ID | REQ | | Section 18.35 | -I | FREE_STATEID | REQ | | Section 18.38 | - | GETATTR | REQ | | Section 18.7 | -I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | -NS*| GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | - | GETFH | REQ | | Section 18.8 | -NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | -I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | -I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | -I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | - | LINK | OPT | | Section 18.9 | - | LOCK | REQ | | Section 18.10 | - | LOCKT | REQ | | Section 18.11 | - | LOCKU | REQ | | Section 18.12 | - | LOOKUP | REQ | | Section 18.13 | - | LOOKUPP | REQ | | Section 18.14 | - | NVERIFY | REQ | | Section 18.15 | - | OPEN | REQ | | Section 18.16 | -NS*| OPENATTR | OPT | | Section 18.17 | - | OPEN_CONFIRM | MNI | | N/A | - | OPEN_DOWNGRADE | REQ | | Section 18.18 | - | PUTFH | REQ | | Section 18.19 | - | PUTPUBFH | REQ | | Section 18.20 | - | PUTROOTFH | REQ | | Section 18.21 | - | READ | REQ | | Section 18.22 | - | READDIR | REQ | | Section 18.23 | - | READLINK | OPT | | Section 18.24 | - | RECLAIM_COMPLETE | REQ | | Section 18.51 | - | RELEASE_LOCKOWNER | MNI 
| | N/A | - | REMOVE | REQ | | Section 18.25 | - | RENAME | REQ | | Section 18.26 | - | RENEW | MNI | | N/A | - | RESTOREFH | REQ | | Section 18.27 | - | SAVEFH | REQ | | Section 18.28 | - | SECINFO | REQ | | Section 18.29 | -I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | - | | | layout (REQ) | Section 13.12 | -I | SEQUENCE | REQ | | Section 18.46 | - | SETATTR | REQ | | Section 18.30 | - | SETCLIENTID | MNI | | N/A | - | SETCLIENTID_CONFIRM | MNI | | N/A | -NS | SET_SSV | REQ | | Section 18.47 | -I | TEST_STATEID | REQ | | Section 18.48 | - | VERIFY | REQ | | Section 18.31 | -NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | - | WRITE | REQ | | Section 18.32 | - -Callback Operations - - +-------------------------+-----------+-------------+---------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +-------------------------+-----------+-------------+---------------+ - | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | -I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | -NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | -NS*| CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | -NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | -NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | - | CB_RECALL | OPT | FDELG, | Section 20.2 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS | CB_RECALL_SLOT | REQ | | Section 20.8 | -NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | - | | | (REQ) | | -I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | - | | | DDELG, pNFS | | - | | | (REQ) | | - +-------------------------+-----------+-------------+---------------+ - -Implementation notes: - -SSV: -* The spec claims this is mandatory, but we don't actually know of any - implementations, so 
we're ignoring it for now. The server returns - NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. - -GSS on the backchannel: -* Again, theoretically required but not widely implemented (in - particular, the current Linux client doesn't request it). We return - NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. - -DELEGPURGE: -* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or - CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that - persist across client reboots). Thus we need not implement this for - now. - -EXCHANGE_ID: -* implementation ids are ignored - -CREATE_SESSION: -* backchannel attributes are ignored - -SEQUENCE: -* no support for dynamic slot table renegotiation (optional) - -Nonstandard compound limitations: -* No support for a sessions fore channel RPC compound that requires both a - ca_maxrequestsize request and a ca_maxresponsesize reply, so we may - fail to live up to the promise we made in CREATE_SESSION fore channel - negotiation. - -See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.rst index 80dc0bdc302a..7c470ecdc3a9 100644 --- a/Documentation/filesystems/nfs/pnfs.txt +++ b/Documentation/filesystems/nfs/pnfs.rst @@ -1,15 +1,17 @@ -Reference counting in pnfs: +========================== +Reference counting in pnfs ========================== The are several inter-related caches. We have layouts which can reference multiple devices, each of which can reference multiple data servers. Each data server can be referenced by multiple devices. Each device -can be referenced by multiple layouts. To keep all of this straight, +can be referenced by multiple layouts. To keep all of this straight, we need to reference count. 
struct pnfs_layout_hdr ----------------------- +====================== + The on-the-wire command LAYOUTGET corresponds to struct pnfs_layout_segment, usually referred to by the variable name lseg. Each nfs_inode may hold a pointer to a cache of these layout @@ -25,7 +27,8 @@ the reference count, as the layout is kept around by the lseg that keeps it in the list. deviceid_cache --------------- +============== + lsegs reference device ids, which are resolved per nfs_client and layout driver type. The device ids are held in a RCU cache (struct nfs4_deviceid_cache). The cache itself is referenced across each @@ -38,24 +41,26 @@ justification, but seems reasonable given that we can have multiple deviceid's per filesystem, and multiple filesystems per nfs_client. The hash code is copied from the nfsd code base. A discussion of -hashing and variations of this algorithm can be found at: -http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 +hashing and variations of this algorithm can be found `here. +<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_ data server cache ------------------ +================= + file driver devices refer to data servers, which are kept in a module level cache. Its reference is held over the lifetime of the deviceid pointing to it. lseg ----- +==== + lseg maintains an extra reference corresponding to the NFS_LSEG_VALID bit which holds it in the pnfs_layout_hdr's list. When the final lseg is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED bit is set, preventing any new lsegs from being added. layout drivers --------------- +============== PNFS utilizes what is called layout drivers. The STD defines 4 basic layout types: "files", "objects", "blocks", and "flexfiles". For each @@ -68,6 +73,6 @@ Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. 
directory blocks-layout setup -------------------- +=================== TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.rst index c4dac829db0f..bb164eea969b 100644 --- a/Documentation/filesystems/nfs/rpc-cache.txt +++ b/Documentation/filesystems/nfs/rpc-cache.rst @@ -1,9 +1,14 @@ - This document gives a brief introduction to the caching +========= +RPC Cache +========= + +This document gives a brief introduction to the caching mechanisms in the sunrpc layer that is used, in particular, for NFS authentication. -CACHES +Caches ====== + The caching replaces the old exports table and allows for a wide variety of values to be caches. @@ -12,6 +17,7 @@ quite possibly very different in content and use. There is a corpus of common code for managing these caches. Examples of caches that are likely to be needed are: + - mapping from IP address to client name - mapping from client name and filesystem to export options - mapping from UID to list of GIDs, to work around NFS's limitation @@ -21,6 +27,7 @@ Examples of caches that are likely to be needed are: - mapping from network identify to public key for crypto authentication. The common code handles such things as: + - general cache lookup with correct locking - supporting 'NEGATIVE' as well as positive entries - allowing an EXPIRED time on cache items, and removing @@ -35,60 +42,66 @@ The common code handles such things as: Creating a Cache ---------------- -1/ A cache needs a datum to store. This is in the form of a - structure definition that must contain a - struct cache_head +- A cache needs a datum to store. This is in the form of a + structure definition that must contain a struct cache_head as an element, usually the first. It will also contain a key and some content. Each cache element is reference counted and contains expiry and update times for use in cache management. 
-2/ A cache needs a "cache_detail" structure that +- A cache needs a "cache_detail" structure that describes the cache. This stores the hash table, some parameters for cache management, and some operations detailing how to work with particular cache items. - The operations requires are: - struct cache_head *alloc(void) - This simply allocates appropriate memory and returns - a pointer to the cache_detail embedded within the - structure - void cache_put(struct kref *) - This is called when the last reference to an item is - dropped. The pointer passed is to the 'ref' field - in the cache_head. cache_put should release any - references create by 'cache_init' and, if CACHE_VALID - is set, any references created by cache_update. - It should then release the memory allocated by - 'alloc'. - int match(struct cache_head *orig, struct cache_head *new) - test if the keys in the two structures match. Return - 1 if they do, 0 if they don't. - void init(struct cache_head *orig, struct cache_head *new) - Set the 'key' fields in 'new' from 'orig'. This may - include taking references to shared objects. - void update(struct cache_head *orig, struct cache_head *new) - Set the 'content' fileds in 'new' from 'orig'. - int cache_show(struct seq_file *m, struct cache_detail *cd, - struct cache_head *h) - Optional. Used to provide a /proc file that lists the - contents of a cache. This should show one item, - usually on just one line. - int cache_request(struct cache_detail *cd, struct cache_head *h, - char **bpp, int *blen) - Format a request to be send to user-space for an item - to be instantiated. *bpp is a buffer of size *blen. - bpp should be moved forward over the encoded message, - and *blen should be reduced to show how much free - space remains. Return 0 on success or <0 if not - enough room or other problem. - int cache_parse(struct cache_detail *cd, char *buf, int len) - A message from user space has arrived to fill out a - cache entry. It is in 'buf' of length 'len'. 
- cache_parse should parse this, find the item in the - cache with sunrpc_cache_lookup_rcu, and update the item - with sunrpc_cache_update. - - -3/ A cache needs to be registered using cache_register(). This + + The operations are: + + struct cache_head \*alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + + void cache_put(struct kref \*) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + + int match(struct cache_head \*orig, struct cache_head \*new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + + void init(struct cache_head \*orig, struct cache_head \*new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + + void update(struct cache_head \*orig, struct cache_head \*new) + Set the 'content' fileds in 'new' from 'orig'. + + int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + + int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen) + Format a request to be send to user-space for an item + to be instantiated. \*bpp is a buffer of size \*blen. + bpp should be moved forward over the encoded message, + and \*blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + + int cache_parse(struct cache_detail \*cd, char \*buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. 
+ cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup_rcu, and update the item + with sunrpc_cache_update. + + +- A cache needs to be registered using cache_register(). This includes it on a list of caches that will be regularly cleaned to discard old data. @@ -107,7 +120,7 @@ cache_check will return -ENOENT in the entry is negative or if an up call is needed but not possible, -EAGAIN if an upcall is pending, or 0 if the data is valid; -cache_check can be passed a "struct cache_req *". This structure is +cache_check can be passed a "struct cache_req\*". This structure is typically embedded in the actual request and can be used to create a deferred copy of the request (struct cache_deferred_req). This is done when the found cache item is not uptodate, but the is reason to @@ -139,9 +152,11 @@ The 'channel' works a bit like a datagram socket. Each 'write' is passed as a whole to the cache for parsing and interpretation. Each cache can treat the write requests differently, but it is expected that a message written will contain: + - a key - an expiry time - a content. + with the intention that an item in the cache with the give key should be create or updated to have the given content, and the expiry time should be set on that item. @@ -156,7 +171,8 @@ If there are no more requests to return, read will return EOF, but a select or poll for read will block waiting for another request to be added. -Thus a user-space helper is likely to: +Thus a user-space helper is likely to:: + open the channel. select for readable read a request @@ -175,12 +191,13 @@ Each cache should also define a "cache_request" method which takes a cache item and encodes a request into the buffer provided. -Note: If a cache has no active readers on the channel, and has had not -active readers for more than 60 seconds, further requests will not be -added to the channel but instead all lookups that do not find a valid -entry will fail. 
This is partly for backward compatibility: The -previous nfs exports table was deemed to be authoritative and a -failed lookup meant a definite 'no'. +.. note:: + If a cache has no active readers on the channel, and has had not + active readers for more than 60 seconds, further requests will not be + added to the channel but instead all lookups that do not find a valid + entry will fail. This is partly for backward compatibility: The + previous nfs exports table was deemed to be authoritative and a + failed lookup meant a definite 'no'. request/response format ----------------------- @@ -193,10 +210,11 @@ with precisely one newline character which should be at the end. Fields within the record should be separated by spaces, normally one. If spaces, newlines, or nul characters are needed in a field they much be quoted. two mechanisms are available: -1/ If a field begins '\x' then it must contain an even number of + +- If a field begins '\x' then it must contain an even number of hex digits, and pairs of these digits provide the bytes in the field. -2/ otherwise a \ in the field must be followed by 3 octal digits +- otherwise a \ in the field must be followed by 3 octal digits which give the code for a byte. Other characters are treated as them selves. At the very least, space, newline, nul, and '\' must be quoted in this way. diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.rst index 310bbbaf9080..812754576845 100644 --- a/Documentation/filesystems/nfs/rpc-server-gss.txt +++ b/Documentation/filesystems/nfs/rpc-server-gss.rst @@ -1,4 +1,4 @@ - +========================================= rpcsec_gss support for kernel RPC servers ========================================= @@ -9,14 +9,17 @@ NFSv4.1 and higher don't require the client to act as a server for the purposes of authentication.) 
RPCGSS is specified in a few IETF documents: + - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt + and there is a 3rd version being proposed: + - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt (At draft n. 02 at the time of writing) Background ----------- +========== The RPCGSS Authentication method describes a way to perform GSSAPI Authentication for NFS. Although GSSAPI is itself completely mechanism @@ -29,6 +32,7 @@ depends on GSSAPI extensions that are KRB5 specific. GSSAPI is a complex library, and implementing it completely in kernel is unwarranted. However GSSAPI operations are fundementally separable in 2 parts: + - initial context establishment - integrity/privacy protection (signing and encrypting of individual packets) @@ -41,7 +45,7 @@ kernel, but leave the initial context establishment to userspace. We need upcalls to request userspace to perform context establishment. NFS Server Legacy Upcall Mechanism ----------------------------------- +================================== The classic upcall mechanism uses a custom text based upcall mechanism to talk to a custom daemon called rpc.svcgssd that is provide by the @@ -62,21 +66,20 @@ groups) due to limitation on the size of the buffer that can be send back to the kernel (4KiB). NFS Server New RPC Upcall Mechanism ------------------------------------ +=================================== The newer upcall mechanism uses RPC over a unix socket to a daemon called gss-proxy, implemented by a userspace program called Gssproxy. -The gss_proxy RPC protocol is currently documented here: - - https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation +The gss_proxy RPC protocol is currently documented `here +<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_. This upcall mechanism uses the kernel rpc client and connects to the gssproxy userspace program over a regular unix socket. 
The gssproxy protocol does not suffer from the size limitations of the legacy protocol. Negotiating Upcall Mechanisms ------------------------------ +============================= To provide backward compatibility, the kernel defaults to using the legacy mechanism. To switch to the new mechanism, gss-proxy must bind diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.rst index f2f3f8592a6f..6c49f04e9e0a 100644 --- a/Documentation/filesystems/nilfs2.txt +++ b/Documentation/filesystems/nilfs2.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== NILFS2 ------- +====== NILFS2 is a log-structured file system (LFS) supporting continuous snapshotting. In addition to versioning capability of the entire file @@ -25,9 +28,9 @@ available from the following download page. At least "mkfs.nilfs2", cleaner or garbage collector) are required. Details on the tools are described in the man pages included in the package. -Project web page: https://nilfs.sourceforge.io/ -Download page: https://nilfs.sourceforge.io/en/download.html -List info: http://vger.kernel.org/vger-lists.html#linux-nilfs +:Project web page: https://nilfs.sourceforge.io/ +:Download page: https://nilfs.sourceforge.io/en/download.html +:List info: http://vger.kernel.org/vger-lists.html#linux-nilfs Caveats ======= @@ -47,6 +50,7 @@ Mount options NILFS2 supports the following mount options: (*) == default +======================= ======================================================= barrier(*) This enables/disables the use of write barriers. This nobarrier requires an IO stack which can support barriers, and if nilfs gets an error on a barrier write, it will @@ -79,6 +83,7 @@ discard This enables/disables the use of discard/TRIM commands. nodiscard(*) The discard/TRIM commands are sent to the underlying block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisioned LUNs. 
+======================= ======================================================= Ioctls ====== @@ -87,9 +92,11 @@ There is some NILFS2 specific functionality which can be accessed by application through the system call interfaces. The list of all NILFS2 specific ioctls are shown in the table below. -Table of NILFS2 specific ioctls -.............................................................................. +Table of NILFS2 specific ioctls: + + ============================== =============================================== Ioctl Description + ============================== =============================================== NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between checkpoint and snapshot state. This ioctl is used in chcp and mkcp utilities. @@ -142,11 +149,12 @@ Table of NILFS2 specific ioctls NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and upper limit of segments in bytes. This ioctl is used by nilfs_resize utility. + ============================== =============================================== NILFS2 usage ============ -To use nilfs2 as a local file system, simply: +To use nilfs2 as a local file system, simply:: # mkfs -t nilfs2 /dev/block_device # mount -t nilfs2 /dev/block_device /dir @@ -157,18 +165,20 @@ This will also invoke the cleaner through the mount helper program Checkpoints and snapshots are managed by the following commands. Their manpages are included in the nilfs-utils package above. + ==== =========================================================== lscp list checkpoints or snapshots. mkcp make a checkpoint or a snapshot. chcp change an existing checkpoint to a snapshot or vice versa. rmcp invalidate specified checkpoint(s). + ==== =========================================================== -To mount a snapshot, +To mount a snapshot:: # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir where <cno> is the checkpoint number of the snapshot. 
-To unmount the NILFS2 mount point or snapshot, simply: +To unmount the NILFS2 mount point or snapshot, simply:: # umount /dir @@ -181,7 +191,7 @@ Disk format A nilfs2 volume is equally divided into a number of segments except for the super block (SB) and segment #0. A segment is the container of logs. Each log is composed of summary information blocks, payload -blocks, and an optional super root block (SR): +blocks, and an optional super root block (SR):: ______________________________________________________ | |SB| | Segment | Segment | Segment | ... | Segment | | @@ -200,7 +210,7 @@ blocks, and an optional super root block (SR): |_blocks__|_________________|__| The payload blocks are organized per file, and each file consists of -data blocks and B-tree node blocks: +data blocks and B-tree node blocks:: |<--- File-A --->|<--- File-B --->| _______________________________________________________________ @@ -213,7 +223,7 @@ files without data blocks or B-tree node blocks. The organization of the blocks is recorded in the summary information blocks, which contains a header structure (nilfs_segment_summary), per -file structures (nilfs_finfo), and per block structures (nilfs_binfo): +file structures (nilfs_finfo), and per block structures (nilfs_binfo):: _________________________________________________________________________ | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... @@ -223,7 +233,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo): The logs include regular files, directory files, symbolic link files and several meta data files. The mata data files are the files used to maintain file system meta data. The current version of NILFS2 uses -the following meta data files: +the following meta data files:: 1) Inode file (ifile) -- Stores on-disk inodes 2) Checkpoint file (cpfile) -- Stores checkpoints @@ -232,7 +242,7 @@ the following meta data files: (DAT) block numbers. 
This file serves to make on-disk blocks relocatable. -The following figure shows a typical organization of the logs: +The following figure shows a typical organization of the logs:: _________________________________________________________________________ | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| @@ -250,7 +260,7 @@ three special inodes, inodes for the DAT, cpfile, and sufile. Inodes of regular files, directories, symlinks and other special files, are included in the ifile. The inode of ifile itself is included in the corresponding checkpoint entry in the cpfile. Thus, the hierarchy -among NILFS2 files can be depicted as follows: +among NILFS2 files can be depicted as follows:: Super block (SB) | diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.rst index 553f10d03076..5bb093a26485 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.rst @@ -1,19 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ The Linux NTFS filesystem driver ================================ -Table of contents -================= +.. Table of contents -- Overview -- Web site -- Features -- Supported mount options -- Known bugs and (mis-)features -- Using NTFS volume and stripe sets - - The Device-Mapper driver - - The Software RAID / MD driver - - Limitations when using the MD driver + - Overview + - Web site + - Features + - Supported mount options + - Known bugs and (mis-)features + - Using NTFS volume and stripe sets + - The Device-Mapper driver + - The Software RAID / MD driver + - Limitations when using the MD driver Overview @@ -66,8 +68,10 @@ Features partition by creating a large file while in Windows and then loopback mounting the file while in Linux and creating a Linux filesystem on it that is used to install Linux on it. -- A comparison of the two drivers using: +- A comparison of the two drivers using:: + time find . 
-type f -exec md5sum "{}" \; + run three times in sequence with each driver (after a reboot) on a 1.4GiB NTFS partition, showed the new driver to be 20% faster in total time elapsed (from 9:43 minutes on average down to 7:53). The time spent in user space @@ -104,6 +108,7 @@ In addition to the generic mount options described by the manual page for the mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the following mount options: +======================= ======================================================= iocharset=name Deprecated option. Still supported but please use nls=name in the future. See description for nls=name. @@ -175,16 +180,22 @@ disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse errors=opt What to do when critical filesystem errors are found. Following values can be used for "opt": - continue: DEFAULT, try to clean-up as much as + + ======== ========================================= + continue DEFAULT, try to clean-up as much as possible, e.g. marking a corrupt inode as bad so it is no longer accessed, and then continue. - recover: At present only supported is recovery of + recover At present only supported is recovery of the boot sector from the backup copy. If read-only mount, the recovery is done in memory only and not written to disk. - Note that the options are additive, i.e. specifying: + ======== ========================================= + + Note that the options are additive, i.e. specifying:: + errors=continue,errors=recover + means the driver will attempt to recover and if that fails it will clean-up as much as possible and continue. @@ -202,12 +213,18 @@ mft_zone_multiplier= Set the MFT zone multiplier for the volume (this In general use the default. If you have a lot of small files then use a higher value. 
The values have the following meaning: + + ===== ================================= Value MFT zone size (% of volume size) + ===== ================================= 1 12.5% 2 25% 3 37.5% 4 50% + ===== ================================= + Note this option is irrelevant for read-only mounts. +======================= ======================================================= Known bugs and (mis-)features @@ -252,18 +269,18 @@ To create the table describing your volume you will need to know each of its components and their sizes in sectors, i.e. multiples of 512-byte blocks. For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for -example if one of your partitions is /dev/hda2 you would do: +example if one of your partitions is /dev/hda2 you would do:: -$ fdisk -ul /dev/hda + $ fdisk -ul /dev/hda -Disk /dev/hda: 81.9 GB, 81964302336 bytes -255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors -Units = sectors of 1 * 512 = 512 bytes + Disk /dev/hda: 81.9 GB, 81964302336 bytes + 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors + Units = sectors of 1 * 512 = 512 bytes - Device Boot Start End Blocks Id System - /dev/hda1 * 63 4209029 2104483+ 83 Linux - /dev/hda2 4209030 37768814 16779892+ 86 NTFS - /dev/hda3 37768815 46170809 4200997+ 83 Linux + Device Boot Start End Blocks Id System + /dev/hda1 * 63 4209029 2104483+ 83 Linux + /dev/hda2 4209030 37768814 16779892+ 86 NTFS + /dev/hda3 37768815 46170809 4200997+ 83 Linux And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = 33559785 sectors. @@ -271,15 +288,17 @@ And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = For Win2k and later dynamic disks, you can for example use the ldminfo utility which is part of the Linux LDM tools (the latest version at the time of writing is linux-ldm-0.0.8.tar.bz2). 
You can download it from: + http://www.linux-ntfs.org/ + Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You will find the precompiled (i386) ldminfo utility there. NOTE: You will not be able to compile this yourself easily so use the binary version! -Then you would use ldminfo in dump mode to obtain the necessary information: +Then you would use ldminfo in dump mode to obtain the necessary information:: -$ ./ldminfo --dump /dev/hda + $ ./ldminfo --dump /dev/hda This would dump the LDM database found on /dev/hda which describes all of your dynamic disks and all the volumes on them. At the bottom you will see the @@ -305,42 +324,36 @@ give you the correct information to do this. Assuming you know all your devices and their sizes things are easy. For a linear raid the table would look like this (note all values are in -512-byte sectors): +512-byte sectors):: ---- cut here --- -# Offset into Size of this Raid type Device Start sector -# volume device of device -0 1028161 linear /dev/hda1 0 -1028161 3903762 linear /dev/hdb2 0 -4931923 2103211 linear /dev/hdc1 0 ---- cut here --- + # Offset into Size of this Raid type Device Start sector + # volume device of device + 0 1028161 linear /dev/hda1 0 + 1028161 3903762 linear /dev/hdb2 0 + 4931923 2103211 linear /dev/hdc1 0 For a striped volume, i.e. raid level 0, you will need to know the chunk size you used when creating the volume. Windows uses 64kiB as the default, so it will probably be this unless you changes the defaults when creating the array. 
For a raid level 0 the table would look like this (note all values are in -512-byte sectors): +512-byte sectors):: ---- cut here --- -# Offset Size Raid Number Chunk 1st Start 2nd Start -# into of the type of size Device in Device in -# volume volume stripes device device -0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- + # Offset Size Raid Number Chunk 1st Start 2nd Start + # into of the type of size Device in Device in + # volume volume stripes device device + 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 If there are more than two devices, just add each of them to the end of the line. Finally, for a mirrored volume, i.e. raid level 1, the table would look like -this (note all values are in 512-byte sectors): +this (note all values are in 512-byte sectors):: ---- cut here --- -# Ofs Size Raid Log Number Region Should Number Source Start Target Start -# in of the type type of log size sync? of Device in Device in -# vol volume params mirrors Device Device -0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 ---- cut here --- + # Ofs Size Raid Log Number Region Should Number Source Start Target Start + # in of the type type of log size sync? of Device in Device in + # vol volume params mirrors Device Device + 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 If you are mirroring to multiple devices you can specify further targets at the end of the line. @@ -353,17 +366,17 @@ to the "Target Device" or if you specified multiple target devices to all of them. Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), -and hand it over to dmsetup to work with, like so: +and hand it over to dmsetup to work with, like so:: -$ dmsetup create myvolume1 /etc/ntfsvolume1 + $ dmsetup create myvolume1 /etc/ntfsvolume1 You can obviously replace "myvolume1" with whatever name you like. 
If it all worked, you will now have the device /dev/device-mapper/myvolume1 which you can then just use as an argument to the mount command as usual to -mount the ntfs volume. For example: +mount the ntfs volume. For example:: -$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 + $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 (You need to create the directory /mnt/myvol1 first and of course you can use anything you like instead of /mnt/myvol1 as long as it is an existing @@ -395,18 +408,18 @@ Windows by default uses a stripe chunk size of 64k, so you probably want the "chunk-size 64k" option for each raid-disk, too. For example, if you have a stripe set consisting of two partitions /dev/hda5 -and /dev/hdb1 your /etc/raidtab would look like this: - -raiddev /dev/md0 - raid-level 0 - nr-raid-disks 2 - nr-spare-disks 0 - persistent-superblock 0 - chunk-size 64k - device /dev/hda5 - raid-disk 0 - device /dev/hdb1 - raid-disk 1 +and /dev/hdb1 your /etc/raidtab would look like this:: + + raiddev /dev/md0 + raid-level 0 + nr-raid-disks 2 + nr-spare-disks 0 + persistent-superblock 0 + chunk-size 64k + device /dev/hda5 + raid-disk 0 + device /dev/hdb1 + raid-disk 1 For linear raid, just change the raid-level above to "raid-level linear", for mirrors, change it to "raid-level 1", and for stripe sets with parity, change @@ -427,7 +440,9 @@ Once the raidtab is setup, run for example raid0run -a to start all devices or raid0run /dev/md0 to start a particular md device, in this case /dev/md0. 
Then just use the mount command as usual to mount the ntfs volume using for -example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume +example:: + + mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume It is advisable to do the mount read-only to see if the md volume has been setup correctly to avoid the possibility of causing damage to the data on the diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.rst index 139fab175c8a..2257bb53edc1 100644 --- a/Documentation/filesystems/ocfs2-online-filecheck.txt +++ b/Documentation/filesystems/ocfs2-online-filecheck.rst @@ -1,5 +1,8 @@ - OCFS2 online file check - ----------------------- +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +OCFS2 file system - online file check +===================================== This document will describe OCFS2 online file check feature. @@ -40,7 +43,7 @@ When there are errors in the OCFS2 filesystem, they are usually accompanied by the inode number which caused the error. This inode number would be the input to check/fix the file. -There is a sysfs directory for each OCFS2 file system mounting: +There is a sysfs directory for each OCFS2 file system mounting:: /sys/fs/ocfs2/<devname>/filecheck @@ -50,34 +53,36 @@ communicate with kernel space, tell which file(inode number) will be checked or fixed. Currently, three operations are supported, which includes checking inode, fixing inode and setting the size of result record history. -1. If you want to know what error exactly happened to <inode> before fixing, do +1. 
If you want to know what error exactly happened to <inode> before fixing, do:: + + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check + # cat /sys/fs/ocfs2/<devname>/filecheck/check + +The output is like this:: - # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check - # cat /sys/fs/ocfs2/<devname>/filecheck/check + INO DONE ERROR + 39502 1 GENERATION -The output is like this: - INO DONE ERROR -39502 1 GENERATION + <INO> lists the inode numbers. + <DONE> indicates whether the operation has been finished. + <ERROR> says what kind of errors was found. For the detailed error numbers, + please refer to the file linux/fs/ocfs2/filecheck.h. -<INO> lists the inode numbers. -<DONE> indicates whether the operation has been finished. -<ERROR> says what kind of errors was found. For the detailed error numbers, -please refer to the file linux/fs/ocfs2/filecheck.h. +2. If you determine to fix this inode, do:: -2. If you determine to fix this inode, do + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix + # cat /sys/fs/ocfs2/<devname>/filecheck/fix - # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix - # cat /sys/fs/ocfs2/<devname>/filecheck/fix +The output is like this::: -The output is like this: - INO DONE ERROR -39502 1 SUCCESS + INO DONE ERROR + 39502 1 SUCCESS This time, the <ERROR> column indicates whether this fix is successful or not. 3. The record cache is used to store the history of check/fix results. It's default size is 10, and can be adjust between the range of 10 ~ 100. You can -adjust the size like this: +adjust the size like this:: # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.rst index 4c49e5410595..412386bc6506 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.rst @@ -1,5 +1,9 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================ OCFS2 filesystem -================== +================ + OCFS2 is a general purpose extent based shared disk cluster file system with many similarities to ext3. It supports 64 bit inode numbers, and has automatically extending metadata groups which may @@ -14,22 +18,26 @@ OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ All code copyright 2005 Oracle except when otherwise noted. -CREDITS: +Credits +======= + Lots of code taken from ext3 and other projects. Authors in alphabetical order: -Joel Becker <joel.becker@oracle.com> -Zach Brown <zach.brown@oracle.com> -Mark Fasheh <mfasheh@suse.com> -Kurt Hackel <kurt.hackel@oracle.com> -Tao Ma <tao.ma@oracle.com> -Sunil Mushran <sunil.mushran@oracle.com> -Manish Singh <manish.singh@oracle.com> -Tiger Yang <tiger.yang@oracle.com> + +- Joel Becker <joel.becker@oracle.com> +- Zach Brown <zach.brown@oracle.com> +- Mark Fasheh <mfasheh@suse.com> +- Kurt Hackel <kurt.hackel@oracle.com> +- Tao Ma <tao.ma@oracle.com> +- Sunil Mushran <sunil.mushran@oracle.com> +- Manish Singh <manish.singh@oracle.com> +- Tiger Yang <tiger.yang@oracle.com> Caveats ======= Features which OCFS2 does not support yet: + - Directory change notification (F_NOTIFY) - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) @@ -37,8 +45,10 @@ Mount options ============= OCFS2 supports the following mount options: + (*) == default +======================= ======================================================== barrier=1 This enables/disables barriers. barrier=0 disables it, barrier=1 enables it. errors=remount-ro(*) Remount the filesystem read-only on an error. @@ -104,3 +114,4 @@ journal_async_commit Commit block can be written to disk without waiting for descriptor blocks. If enabled older kernels cannot mount the device. This will enable 'journal_checksum' internally. 
+======================= ======================================================== diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst new file mode 100644 index 000000000000..4c8bb3074169 --- /dev/null +++ b/Documentation/filesystems/omfs.rst @@ -0,0 +1,112 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Optimized MPEG Filesystem (OMFS) +================================ + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. + +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + ============ ======================================== + uid=n make all files owned by specified user + gid=n make all files owned by specified group + umask=xxx set permission umask to xxx + fmask=xxx set umask to xxx for files + dmask=xxx set umask to xxx for directories + ============ ======================================== + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. 
A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. + +Sysblock header information:: + + struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; + }; + +Files and directories are both represented by omfs_inode:: + + struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ + }; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START:: + + struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ + }; + + struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ + }; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. 
The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. + +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt deleted file mode 100644 index 1d0d41ff5c65..000000000000 --- a/Documentation/filesystems/omfs.txt +++ /dev/null @@ -1,106 +0,0 @@ -Optimized MPEG Filesystem (OMFS) - -Overview -======== - -OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR -and Rio Karma MP3 player. The filesystem is extent-based, utilizing -block sizes from 2k to 8k, with hash-based directories. This -filesystem driver may be used to read and write disks from these -devices. - -Note, it is not recommended that this FS be used in place of a general -filesystem for your own streaming media device. Native Linux filesystems -will likely perform better. - -More information is available at: - - http://linux-karma.sf.net/ - -Various utilities, including mkomfs and omfsck, are included with -omfsprogs, available at: - - http://bobcopeland.com/karma/ - -Instructions are included in its README. - -Options -======= - -OMFS supports the following mount-time options: - - uid=n - make all files owned by specified user - gid=n - make all files owned by specified group - umask=xxx - set permission umask to xxx - fmask=xxx - set umask to xxx for files - dmask=xxx - set umask to xxx for directories - -Disk format -=========== - -OMFS discriminates between "sysblocks" and normal data blocks. The sysblock -group consists of super block information, file metadata, directory structures, -and extents. Each sysblock has a header containing CRCs of the entire -sysblock, and may be mirrored in successive blocks on the disk. 
A sysblock may -have a smaller size than a data block, but since they are both addressed by the -same 64-bit block number, any remaining space in the smaller sysblock is -unused. - -Sysblock header information: - -struct omfs_header { - __be64 h_self; /* FS block where this is located */ - __be32 h_body_size; /* size of useful data after header */ - __be16 h_crc; /* crc-ccitt of body_size bytes */ - char h_fill1[2]; - u8 h_version; /* version, always 1 */ - char h_type; /* OMFS_INODE_X */ - u8 h_magic; /* OMFS_IMAGIC */ - u8 h_check_xor; /* XOR of header bytes before this */ - __be32 h_fill2; -}; - -Files and directories are both represented by omfs_inode: - -struct omfs_inode { - struct omfs_header i_head; /* header */ - __be64 i_parent; /* parent containing this inode */ - __be64 i_sibling; /* next inode in hash bucket */ - __be64 i_ctime; /* ctime, in milliseconds */ - char i_fill1[35]; - char i_type; /* OMFS_[DIR,FILE] */ - __be32 i_fill2; - char i_fill3[64]; - char i_name[OMFS_NAMELEN]; /* filename */ - __be64 i_size; /* size of file, in bytes */ -}; - -Directories in OMFS are implemented as a large hash table. Filenames are -hashed then prepended into the bucket list beginning at OMFS_DIR_START. -Lookup requires hashing the filename, then seeking across i_sibling pointers -until a match is found on i_name. Empty buckets are represented by block -pointers with all-1s (~0). - -A file is an omfs_inode structure followed by an extent table beginning at -OMFS_EXTENT_START: - -struct omfs_extent_entry { - __be64 e_cluster; /* start location of a set of blocks */ - __be64 e_blocks; /* number of blocks after e_cluster */ -}; - -struct omfs_extent { - __be64 e_next; /* next extent table location */ - __be32 e_extent_count; /* total # extents in this table */ - __be32 e_fill; - struct omfs_extent_entry e_entry; /* start of extent entries */ -}; - -Each extent holds the block offset followed by number of blocks allocated to -the extent. 
The final extent in each table is a terminator with e_cluster -being ~0 and e_blocks being ones'-complement of the total number of blocks -in the table. - -If this table overflows, a continuation inode is written and pointed to by -e_next. These have a header but lack the rest of the inode structure. - diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.rst index f4ba94950e3f..7d6d4cad73c4 100644 --- a/Documentation/filesystems/orangefs.txt +++ b/Documentation/filesystems/orangefs.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== ORANGEFS ======== @@ -21,25 +24,25 @@ Orangefs features include: * Stateless -MAILING LIST ARCHIVES +Mailing List Archives ===================== http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ -MAILING LIST SUBMISSIONS +Mailing List Submissions ======================== devel@lists.orangefs.org -DOCUMENTATION +Documentation ============= http://www.orangefs.org/documentation/ -USERSPACE FILESYSTEM SOURCE +Userspace Filesystem Source =========================== http://www.orangefs.org/download @@ -48,16 +51,16 @@ Orangefs versions prior to 2.9.3 would not be compatible with the upstream version of the kernel client. -RUNNING ORANGEFS ON A SINGLE SERVER +Running ORANGEFS On a Single Server =================================== OrangeFS is usually run in large installations with multiple servers and clients, but a complete filesystem can be run on a single machine for development and testing. -On Fedora, install orangefs and orangefs-server. +On Fedora, install orangefs and orangefs-server:: -dnf -y install orangefs orangefs-server + dnf -y install orangefs orangefs-server There is an example server configuration file in /etc/orangefs/orangefs.conf. Change localhost to your hostname if @@ -70,29 +73,29 @@ single line. Uncomment it and change the hostname if necessary. This controls clients which use libpvfs2. This does not control the pvfs2-client-core. -Create the filesystem. 
+Create the filesystem:: -pvfs2-server -f /etc/orangefs/orangefs.conf + pvfs2-server -f /etc/orangefs/orangefs.conf -Start the server. +Start the server:: -systemctl start orangefs-server + systemctl start orangefs-server -Test the server. +Test the server:: -pvfs2-ping -m /pvfsmnt + pvfs2-ping -m /pvfsmnt Start the client. The module must be compiled in or loaded before this -point. +point:: -systemctl start orangefs-client + systemctl start orangefs-client -Mount the filesystem. +Mount the filesystem:: -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt -BUILDING ORANGEFS ON A SINGLE SERVER +Building ORANGEFS on a Single Server ==================================== Where OrangeFS cannot be installed from distribution packages, it may be @@ -102,49 +105,51 @@ You can omit --prefix if you don't care that things are sprinkled around in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by default, we will probably be changing the default to LMDB soon. -./configure --prefix=/opt/ofs --with-db-backend=lmdb +:: -make + ./configure --prefix=/opt/ofs --with-db-backend=lmdb -make install + make -Create an orangefs config file. + make install -/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf +Create an orangefs config file:: -Create an /etc/pvfs2tab file. + /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf -echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ - /etc/pvfs2tab +Create an /etc/pvfs2tab file:: -Create the mount point you specified in the tab file if needed. + echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ + /etc/pvfs2tab -mkdir /pvfsmnt +Create the mount point you specified in the tab file if needed:: -Bootstrap the server. + mkdir /pvfsmnt -/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf +Bootstrap the server:: -Start the server. 
+ /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf -/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf +Start the server:: + + /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf Now the server should be running. Pvfs2-ls is a simple -test to verify that the server is running. +test to verify that the server is running:: -/opt/ofs/bin/pvfs2-ls /pvfsmnt + /opt/ofs/bin/pvfs2-ls /pvfsmnt If stuff seems to be working, load the kernel module and -turn on the client core. +turn on the client core:: -/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core + /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core -Mount your filesystem. +Mount your filesystem:: -mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt + mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt -RUNNING XFSTESTS +Running xfstests ================ It is useful to use a scratch filesystem with xfstests. This can be @@ -159,21 +164,23 @@ Then there are two FileSystem sections: orangefs and scratch. This change should be made before creating the filesystem. -pvfs2-server -f /etc/orangefs/orangefs.conf +:: + + pvfs2-server -f /etc/orangefs/orangefs.conf -To run xfstests, create /etc/xfsqa.config. +To run xfstests, create /etc/xfsqa.config:: -TEST_DIR=/orangefs -TEST_DEV=tcp://localhost:3334/orangefs -SCRATCH_MNT=/scratch -SCRATCH_DEV=tcp://localhost:3334/scratch + TEST_DIR=/orangefs + TEST_DEV=tcp://localhost:3334/orangefs + SCRATCH_MNT=/scratch + SCRATCH_DEV=tcp://localhost:3334/scratch -Then xfstests can be run +Then xfstests can be run:: -./check -pvfs2 + ./check -pvfs2 -OPTIONS +Options ======= The following mount options are accepted: @@ -193,32 +200,32 @@ The following mount options are accepted: Distributed locking is being worked on for the future.
-DEBUGGING +Debugging ========= If you want the debug (GOSSIP) statements in a particular -source file (inode.c for example) go to syslog: +source file (inode.c for example) go to syslog:: echo inode > /sys/kernel/debug/orangefs/kernel-debug -No debugging (the default): +No debugging (the default):: echo none > /sys/kernel/debug/orangefs/kernel-debug -Debugging from several source files: +Debugging from several source files:: echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug -All debugging: +All debugging:: echo all > /sys/kernel/debug/orangefs/kernel-debug -Get a list of all debugging keywords: +Get a list of all debugging keywords:: cat /sys/kernel/debug/orangefs/debug-help -PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE +Protocol between Kernel Module and Userspace ============================================ Orangefs is a user space filesystem and an associated kernel module. @@ -234,7 +241,8 @@ The kernel module implements a pseudo device that userspace can read from and write to. Userspace can also manipulate the kernel module through the pseudo device with ioctl. -THE BUFMAP: +The Bufmap +---------- At startup userspace allocates two page-size-aligned (posix_memalign) mlocked memory buffers, one is used for IO and one is used for readdir @@ -250,7 +258,8 @@ copied from user space to kernel space with copy_from_user and is used to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which then contains: - * refcnt - a reference counter + * refcnt + - a reference counter * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's partition size, which represents the filesystem's block size and is used for s_blocksize in super blocks. @@ -259,17 +268,19 @@ then contains: * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. * total_size - the total size of the IO buffer. * page_count - the number of 4096 byte pages in the IO buffer. 
- * page_array - a pointer to page_count * (sizeof(struct page*)) bytes + * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes of kcalloced memory. This memory is used as an array of pointers to each of the pages in the IO buffer through a call to get_user_pages. - * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) + * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))`` bytes of kcalloced memory. This memory is further initialized: user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc structure. user_desc->ptr points to the IO buffer. - pages_per_desc = bufmap->desc_size / PAGE_SIZE - offset = 0 + :: + + pages_per_desc = bufmap->desc_size / PAGE_SIZE + offset = 0 bufmap->desc_array[0].page_array = &bufmap->page_array[offset] bufmap->desc_array[0].array_count = pages_per_desc = 1024 @@ -293,7 +304,8 @@ then contains: * readdir_index_lock - a spinlock to protect readdir_index_array during update. -OPERATIONS: +Operations +---------- The kernel module builds an "op" (struct orangefs_kernel_op_s) when it needs to communicate with userspace. Part of the op contains the "upcall" @@ -308,13 +320,19 @@ in flight at any given time.
Ops are stateful: - * unknown - op was just initialized - * waiting - op is on request_list (upward bound) - * inprogr - op is in progress (waiting for downcall) - * serviced - op has matching downcall; ok - * purged - op has to start a timer since client-core + * unknown + - op was just initialized + * waiting + - op is on request_list (upward bound) + * inprogr + - op is in progress (waiting for downcall) + * serviced + - op has matching downcall; ok + * purged + - op has to start a timer since client-core exited uncleanly before servicing op - * given up - submitter has given up waiting for it + * given up + - submitter has given up waiting for it When some arbitrary userspace program needs to perform a filesystem operation on Orangefs (readdir, I/O, create, whatever) @@ -389,10 +407,15 @@ union of structs, each of which is associated with a particular response type. The several members outside of the union are: - - int32_t type - type of operation. - - int32_t status - return code for the operation. - - int64_t trailer_size - 0 unless readdir operation. - - char *trailer_buf - initialized to NULL, used during readdir operations. + + ``int32_t type`` + - type of operation. + ``int32_t status`` + - return code for the operation. + ``int64_t trailer_size`` + - 0 unless readdir operation. + ``char *trailer_buf`` + - initialized to NULL, used during readdir operations. The appropriate member inside the union is filled out for any particular response. @@ -449,18 +472,20 @@ Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests made by the kernel side. A buffer_list containing: + - a pointer to the prepared response to the request from the kernel (struct pvfs2_downcall_t). - and also, in the case of a readdir request, a pointer to a buffer containing descriptors for the objects in the target directory. + ... is sent to the function (PINT_dev_write_list) which performs the writev. 
PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; The first four elements of io_array are initialized like this for all -responses: +responses:: io_array[0].iov_base = address of local variable "proto_ver" (int32_t) io_array[0].iov_len = sizeof(int32_t) @@ -475,7 +500,7 @@ responses: of global variable vfs_request (vfs_request_t) io_array[3].iov_len = sizeof(pvfs2_downcall_t) -Readdir responses initialize the fifth element io_array like this: +Readdir responses initialize the fifth element io_array like this:: io_array[4].iov_base = contents of member trailer_buf (char *) from out_downcall member of global variable @@ -517,13 +542,13 @@ from a dentry is cheap, obtaining it from userspace is relatively expensive, hence the motivation to use the dentry when possible. The timeout values d_time and getattr_time are jiffy based, and the -code is designed to avoid the jiffy-wrap problem: +code is designed to avoid the jiffy-wrap problem:: -"In general, if the clock may have wrapped around more than once, there -is no way to tell how much time has elapsed. However, if the times t1 -and t2 are known to be fairly close, we can reliably compute the -difference in a way that takes into account the possibility that the -clock may have wrapped between times." + "In general, if the clock may have wrapped around more than once, there + is no way to tell how much time has elapsed. However, if the times t1 + and t2 are known to be fairly close, we can reliably compute the + difference in a way that takes into account the possibility that the + clock may have wrapped between times." - from course notes by instructor Andy Wang +from course notes by instructor Andy Wang diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index f18506083ced..26c093969573 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -850,3 +850,11 @@ business doing so. 
d_alloc_pseudo() is internal-only; uses outside of alloc_file_pseudo() are very suspect (and won't work in modules). Such uses are very likely to be misspelled d_alloc_anon(). + +--- + +**mandatory** + +[should've been added in 2016] stale comment in finish_open() nonwithstanding, +failure exits in ->atomic_open() instances should *NOT* fput() the file, +no matter what. Everything is handled by the caller. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.rst index 99ca040e3f90..38b606991065 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.rst @@ -1,19 +1,20 @@ ------------------------------------------------------------------------------- - T H E /proc F I L E S Y S T E M ------------------------------------------------------------------------------- -/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999 - Bodo Bauer <bb@ricochet.net> +.. SPDX-License-Identifier: GPL-2.0 + +==================== +The /proc Filesystem +==================== + +===================== ======================================= ================ +/proc/sys Terrehon Bowden <terrehon@pacbell.net>, October 7 1999 + Bodo Bauer <bb@ricochet.net> +2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 +move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 +fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 +===================== ======================================= ================ + -2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 -move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 ------------------------------------------------------------------------------- -Version 1.3 Kernel version 2.2.12 - Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------- -fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 -Table of Contents ------------------ +.. 
Table of Contents 0 Preface 0.1 Introduction/Credits @@ -50,9 +51,8 @@ Table of Contents 4 Configuring procfs 4.1 Mount options ------------------------------------------------------------------------------- Preface ------------------------------------------------------------------------------- +======= 0.1 Introduction/Credits ------------------------ @@ -95,20 +95,18 @@ We don't guarantee the correctness of this document, and if you come to us complaining about how you screwed up your system because of incorrect documentation, we won't feel responsible... ------------------------------------------------------------------------------- -CHAPTER 1: COLLECTING SYSTEM INFORMATION ------------------------------------------------------------------------------- +Chapter 1: Collecting System Information +======================================== ------------------------------------------------------------------------------- In This Chapter ------------------------------------------------------------------------------- +--------------- * Investigating the properties of the pseudo file system /proc and its ability to provide information on the running Linux system * Examining /proc's structure * Uncovering various information about the kernel and the processes running on the system ------------------------------------------------------------------------------- +------------------------------------------------------------------------------ The proc file system acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change @@ -134,9 +132,11 @@ never act on any new process that the kernel may, through chance, have also assigned the process ID <pid>. Instead, operations on these FDs usually fail with ESRCH. -Table 1-1: Process specific entries in /proc -.............................................................................. +.. 
table:: Table 1-1: Process specific entries in /proc + + ============= =============================================================== File Content + ============= =============================================================== clear_refs Clears page referenced bits shown in smaps output cmdline Command line arguments cpu Current and last cpu in which it was executed (2.4)(smp) @@ -160,10 +160,10 @@ Table 1-1: Process specific entries in /proc can be derived from smaps, but is faster and more convenient numa_maps An extension based on maps, showing the memory locality and binding policy as well as mem usage (in pages) of each mapping. -.............................................................................. + ============= =============================================================== For example, to get the status information of a process, all you have to do is -read the file /proc/PID/status: +read the file /proc/PID/status:: >cat /proc/self/status Name: cat @@ -222,14 +222,17 @@ contains details information about the process itself. Its fields are explained in Table 1-4. (for SMP CONFIG users) + For making accounting scalable, RSS related information are handled in an asynchronous manner and the value may not be very precise. To see a precise snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table. It's slow but very precise. -Table 1-2: Contents of the status files (as of 4.19) -.............................................................................. +.. 
table:: Table 1-2: Contents of the status files (as of 4.19) + + ========================== =================================================== Field Content + ========================== =================================================== Name filename of the executable Umask file mode creation mask State state (R is running, S is sleeping, D is sleeping @@ -254,7 +257,8 @@ Table 1-2: Contents of the status files (as of 4.19) VmPin pinned memory size VmHWM peak resident set size ("high water mark") VmRSS size of memory portions. It contains the three - following parts (VmRSS = RssAnon + RssFile + RssShmem) + following parts + (VmRSS = RssAnon + RssFile + RssShmem) RssAnon size of resident anonymous memory RssFile size of resident file mappings RssShmem size of resident shmem memory (includes SysV shm, @@ -292,27 +296,32 @@ Table 1-2: Contents of the status files (as of 4.19) Mems_allowed_list Same as previous, but in "list format" voluntary_ctxt_switches number of voluntary context switches nonvoluntary_ctxt_switches number of non voluntary context switches -.............................................................................. + ========================== =================================================== -Table 1-3: Contents of the statm files (as of 2.6.8-rc3) -.............................................................................. + +.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3) + + ======== =============================== ============================== Field Content + ======== =============================== ============================== size total program size (pages) (same as VmSize in status) resident size of memory portions (pages) (same as VmRSS in status) shared number of pages that are shared (i.e. 
backed by a file, same as RssFile+RssShmem in status) trs number of pages that are 'code' (not including libs; broken, - includes data segment) + includes data segment) lrs number of pages of library (always 0 on 2.6) drs number of pages of data/stack (including libs; broken, - includes library text) + includes library text) dt number of dirty pages (always 0 on 2.6) -.............................................................................. + ======== =============================== ============================== + +.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7) -Table 1-4: Contents of the stat files (as of 2.6.30-rc7) -.............................................................................. - Field Content + ============= =============================================================== + Field Content + ============= =============================================================== pid process id tcomm filename of the executable state state (R is running, S is sleeping, D is sleeping in an @@ -348,7 +357,8 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) blocked bitmap of blocked signals sigign bitmap of ignored signals sigcatch bitmap of caught signals - 0 (place holder, used to be the wchan address, use /proc/PID/wchan instead) + 0 (place holder, used to be the wchan address, + use /proc/PID/wchan instead) 0 (place holder) 0 (place holder) exit_signal signal to send to parent thread on exit @@ -365,39 +375,40 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) arg_end address below which program command line is placed env_start address above which program environment is placed env_end address below which program environment is placed - exit_code the thread's exit_code in the form reported by the waitpid system call -.............................................................................. 
+ exit_code the thread's exit_code in the form reported by the waitpid + system call + ============= =============================================================== The /proc/PID/maps file contains the currently mapped memory regions and their access permissions. -The format is: - -address perms offset dev inode pathname - -08048000-08049000 r-xp 00000000 03:00 8312 /opt/test -08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test -0804a000-0806b000 rw-p 00000000 00:00 0 [heap] -a7cb1000-a7cb2000 ---p 00000000 00:00 0 -a7cb2000-a7eb2000 rw-p 00000000 00:00 0 -a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 -a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 -a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 -a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 -a800b000-a800e000 rw-p 00000000 00:00 0 -a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 -a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 -a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 -a8024000-a8027000 rw-p 00000000 00:00 0 -a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 -a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 -a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 -aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] -ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] +The format is:: + + address perms offset dev inode pathname + + 08048000-08049000 r-xp 00000000 03:00 8312 /opt/test + 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test + 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] + a7cb1000-a7cb2000 ---p 00000000 00:00 0 + a7cb2000-a7eb2000 rw-p 00000000 00:00 0 + a7eb2000-a7eb3000 ---p 00000000 00:00 0 + a7eb3000-a7ed5000 rw-p 00000000 00:00 0 + a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 + a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 + a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 + a800b000-a800e000 rw-p 00000000 00:00 0 
+ a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 + a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 + a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 + a8024000-a8027000 rw-p 00000000 00:00 0 + a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 + a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 + a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 + aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] + ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] where "address" is the address space in the process that it occupies, "perms" -is a set of permissions: +is a set of permissions:: r = read w = write @@ -411,42 +422,44 @@ with the memory region, as the case would be with BSS (uninitialized data). The "pathname" shows the name associated file for this mapping. If the mapping is not associated with a file: - [heap] = the heap of the program - [stack] = the stack of the main process - [vdso] = the "virtual dynamic shared object", + ======= ==================================== + [heap] the heap of the program + [stack] the stack of the main process + [vdso] the "virtual dynamic shared object", the kernel system call handler + ======= ==================================== or if empty, the mapping is anonymous. The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. 
For each mapping (aka Virtual -Memory Area, or VMA) there is a series of lines such as the following: - -08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash - -Size: 1084 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Rss: 892 kB -Pss: 374 kB -Shared_Clean: 892 kB -Shared_Dirty: 0 kB -Private_Clean: 0 kB -Private_Dirty: 0 kB -Referenced: 892 kB -Anonymous: 0 kB -LazyFree: 0 kB -AnonHugePages: 0 kB -ShmemPmdMapped: 0 kB -Shared_Hugetlb: 0 kB -Private_Hugetlb: 0 kB -Swap: 0 kB -SwapPss: 0 kB -KernelPageSize: 4 kB -MMUPageSize: 4 kB -Locked: 0 kB -THPeligible: 0 -VmFlags: rd ex mr mw me dw +Memory Area, or VMA) there is a series of lines such as the following:: + + 08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash + + Size: 1084 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Rss: 892 kB + Pss: 374 kB + Shared_Clean: 892 kB + Shared_Dirty: 0 kB + Private_Clean: 0 kB + Private_Dirty: 0 kB + Referenced: 892 kB + Anonymous: 0 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + Swap: 0 kB + SwapPss: 0 kB + KernelPageSize: 4 kB + MMUPageSize: 4 kB + Locked: 0 kB + THPeligible: 0 + VmFlags: rd ex mr mw me dw The first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. Following lines show the size of the mapping @@ -461,26 +474,35 @@ The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. + Note that even a page which is part of a MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used by only one process, is accounted as private and not as shared. + "Referenced" indicates the amount of memory currently marked as referenced or accessed. + "Anonymous" shows the amount of memory that does not belong to any file. 
Even a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE and a page is modified, the file page is replaced by a private anonymous copy. + "LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE). The memory isn't freed immediately with madvise(). It's freed in memory pressure if the memory is clean. Please note that the printed value might be lower than the real value due to optimizations used in the current implementation. If this is not desirable please file a bug report. + "AnonHugePages" shows the ammount of memory backed by transparent hugepage. + "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by huge pages. + "Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field. + "Swap" shows how much would-be-anonymous memory is also used, but out on swap. + For shmem mappings, "Swap" includes also the size of the mapped (and not replaced by copy-on-write) part of the underlying shmem object out on swap. "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this @@ -489,36 +511,39 @@ does not take into account swapped out page of underlying shmem objects. "THPeligible" indicates whether the mapping is eligible for allocating THP pages - 1 if true, 0 otherwise. It just shows the current status. -"VmFlags" field deserves a separate description. This member represents the kernel -flags associated with the particular virtual memory area in two letter encoded -manner. 
The codes are the following: - rd - readable - wr - writeable - ex - executable - sh - shared - mr - may read - mw - may write - me - may execute - ms - may share - gd - stack segment growns down - pf - pure PFN range - dw - disabled write to the mapped file - lo - pages are locked in memory - io - memory mapped I/O area - sr - sequential read advise provided - rr - random read advise provided - dc - do not copy area on fork - de - do not expand area on remapping - ac - area is accountable - nr - swap space is not reserved for the area - ht - area uses huge tlb pages - ar - architecture specific flag - dd - do not include area into core dump - sd - soft-dirty flag - mm - mixed map area - hg - huge page advise flag - nh - no-huge page advise flag - mg - mergable advise flag +"VmFlags" field deserves a separate description. This member represents the +kernel flags associated with the particular virtual memory area in two letter +encoded manner. The codes are the following: + + == ======================================= + rd readable + wr writeable + ex executable + sh shared + mr may read + mw may write + me may execute + ms may share + gd stack segment growns down + pf pure PFN range + dw disabled write to the mapped file + lo pages are locked in memory + io memory mapped I/O area + sr sequential read advise provided + rr random read advise provided + dc do not copy area on fork + de do not expand area on remapping + ac area is accountable + nr swap space is not reserved for the area + ht area uses huge tlb pages + ar architecture specific flag + dd do not include area into core dump + sd soft dirty flag + mm mixed map area + hg huge page advise flag + nh no huge page advise flag + mg mergable advise flag + == ======================================= Note that there is no guarantee that every flag and associated mnemonic will be present in all further kernel releases. Things get changed, the flags may @@ -531,6 +556,7 @@ enabled. 
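As an illustration of the two-letter encoding in the VmFlags table above, here is a minimal sketch that decodes one "VmFlags:" line from smaps. The function name `decode_vmflags` and the dictionary `VMFLAG_NAMES` are hypothetical, and only a subset of the table is copied in. Codes that are not in the dictionary are reported as "unknown" rather than rejected, since the set of flags is not guaranteed to stay the same across kernel releases.

```python
# Subset of the two-letter codes from the VmFlags table (descriptions copied as-is).
VMFLAG_NAMES = {
    "rd": "readable",
    "wr": "writeable",
    "ex": "executable",
    "sh": "shared",
    "mr": "may read",
    "mw": "may write",
    "me": "may execute",
    "ms": "may share",
    "dw": "disabled write to the mapped file",
    "sd": "soft dirty flag",
}


def decode_vmflags(line):
    """Decode a 'VmFlags:' line into a list of (code, description) pairs."""
    key, _, rest = line.partition(":")
    if key.strip() != "VmFlags":
        raise ValueError("not a VmFlags line")
    # Codes missing from the (partial) table are kept and marked "unknown".
    return [(code, VMFLAG_NAMES.get(code, "unknown")) for code in rest.split()]


if __name__ == "__main__":
    for code, meaning in decode_vmflags("VmFlags: rd ex mr mw me dw"):
        print(code, "-", meaning)
```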
Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent output can be achieved only in the single read call). + This typically manifests when doing partial reads of these files while the memory map is being modified. Despite the races, we do provide the following guarantees: @@ -544,9 +570,9 @@ The /proc/PID/smaps_rollup file includes the same fields as /proc/PID/smaps, but their values are the sums of the corresponding values for all mappings of the process. Additionally, it contains these fields: -Pss_Anon -Pss_File -Pss_Shmem +- Pss_Anon +- Pss_File +- Pss_Shmem They represent the proportional shares of anonymous, file, and shmem pages, as described for smaps above. These fields are omitted in smaps since each @@ -558,20 +584,25 @@ The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst for details). -To clear the bits for all the pages associated with the process +To clear the bits for all the pages associated with the process:: + > echo 1 > /proc/PID/clear_refs -To clear the bits for the anonymous pages associated with the process +To clear the bits for the anonymous pages associated with the process:: + > echo 2 > /proc/PID/clear_refs -To clear the bits for the file mapped pages associated with the process +To clear the bits for the file mapped pages associated with the process:: + > echo 3 > /proc/PID/clear_refs -To clear the soft-dirty bit +To clear the soft-dirty bit:: + > echo 4 > /proc/PID/clear_refs To reset the peak resident set size ("high water mark") to the process's -current value: +current value:: + > echo 5 > /proc/PID/clear_refs Any other value written to /proc/PID/clear_refs will have no effect. @@ -584,30 +615,33 @@ Documentation/admin-guide/mm/pagemap.rst. 
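The relationship between smaps and smaps_rollup described above (the rollup values are sums over all mappings) can be sketched as follows. This is an illustrative helper, not part of any kernel tooling: the name `sum_smaps_field` is hypothetical, and it works on text that has already been read, so that the caller can fetch the file in a single read call as the note on races recommends.

```python
def sum_smaps_field(smaps_text, field):
    """Sum one 'Name: <n> kB' field over all mappings in smaps-style text."""
    prefix = field + ":"
    total_kb = 0
    for raw_line in smaps_text.splitlines():
        line = raw_line.strip()
        if not line.startswith(prefix):
            continue
        parts = line[len(prefix):].split()
        # Only "<number> kB" values are summed; lines such as VmFlags are skipped.
        if len(parts) == 2 and parts[1] == "kB":
            total_kb += int(parts[0])
    return total_kb


if __name__ == "__main__":
    # Assumes a Linux system where /proc/self/smaps is readable.
    with open("/proc/self/smaps") as f:
        text = f.read()
    print("Pss total:", sum_smaps_field(text, "Pss"), "kB")
```

Because the trailing colon is part of the prefix, asking for "Pss" does not also pick up "Pss_Anon", and "Swap" does not pick up "SwapPss". The result may still differ slightly from smaps_rollup if the memory map changes between reads.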
The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of each mapping. The output follows a general format where mapping details get -summarized separated by blank spaces, one mapping per each file line: - -address policy mapping details - -00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 -00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 -320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 -320698b000 default file=/lib64/libc-2.12.so -3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 -3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 -3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 -7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 -7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 -7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 -7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 +summarized separated by blank spaces, one mapping per each file line:: + + address policy mapping details + + 00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4 + 00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4 + 320621f000 default file=/lib64/ld-2.12.so 
anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4 + 320698b000 default file=/lib64/libc-2.12.so + 3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4 + 3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4 + 3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4 + 7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4 + 7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048 + 7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4 + 7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4 Where: + "address" is the starting address for the mapping; + "policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); + "mapping details" summarizes mapping data such as mapping type, page usage counters, node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page size, in KB, that is backing the mapping up. @@ -621,81 +655,83 @@ the running kernel. The files used to obtain this information are contained in system. It depends on the kernel configuration and the loaded modules, which files are there, and which are missing. -Table 1-5: Kernel info in /proc -.............................................................................. 
- File Content - apm Advanced power management info - buddyinfo Kernel memory allocator information (see text) (2.5) - bus Directory containing bus specific information - cmdline Kernel command line - cpuinfo Info about the CPU - devices Available devices (block and character) - dma Used DMS channels - filesystems Supported filesystems - driver Various drivers grouped here, currently rtc (2.4) - execdomains Execdomains, related to security (2.4) - fb Frame Buffer devices (2.4) - fs File system parameters, currently nfs/exports (2.4) - ide Directory containing info about the IDE subsystem - interrupts Interrupt usage - iomem Memory map (2.4) - ioports I/O port usage - irq Masks for irq to cpu affinity (2.4)(smp?) - isapnp ISA PnP (Plug&Play) Info (2.4) - kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) - kmsg Kernel messages - ksyms Kernel symbol table - loadavg Load average of last 1, 5 & 15 minutes - locks Kernel locks - meminfo Memory info - misc Miscellaneous - modules List of loaded modules - mounts Mounted filesystems - net Networking info (see text) +.. 
table:: Table 1-5: Kernel info in /proc + + ============ =============================================================== + File Content + ============ =============================================================== + apm Advanced power management info + buddyinfo Kernel memory allocator information (see text) (2.5) + bus Directory containing bus specific information + cmdline Kernel command line + cpuinfo Info about the CPU + devices Available devices (block and character) + dma Used DMS channels + filesystems Supported filesystems + driver Various drivers grouped here, currently rtc (2.4) + execdomains Execdomains, related to security (2.4) + fb Frame Buffer devices (2.4) + fs File system parameters, currently nfs/exports (2.4) + ide Directory containing info about the IDE subsystem + interrupts Interrupt usage + iomem Memory map (2.4) + ioports I/O port usage + irq Masks for irq to cpu affinity (2.4)(smp?) + isapnp ISA PnP (Plug&Play) Info (2.4) + kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) + kmsg Kernel messages + ksyms Kernel symbol table + loadavg Load average of last 1, 5 & 15 minutes + locks Kernel locks + meminfo Memory info + misc Miscellaneous + modules List of loaded modules + mounts Mounted filesystems + net Networking info (see text) pagetypeinfo Additional page allocator information (see text) (2.5) - partitions Table of partitions known to the system - pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, - decoupled by lspci (2.4) - rtc Real time clock - scsi SCSI info (see text) - slabinfo Slab pool info - softirqs softirq usage - stat Overall statistics - swaps Swap space utilization - sys See chapter 2 - sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) - tty Info of tty drivers - uptime Wall clock since boot, combined idle time of all cpus - version Kernel version - video bttv info of video resources (2.4) - vmallocinfo Show vmalloced areas 
-.............................................................................. + partitions Table of partitions known to the system + pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, + decoupled by lspci (2.4) + rtc Real time clock + scsi SCSI info (see text) + slabinfo Slab pool info + softirqs softirq usage + stat Overall statistics + swaps Swap space utilization + sys See chapter 2 + sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) + tty Info of tty drivers + uptime Wall clock since boot, combined idle time of all cpus + version Kernel version + video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas + ============ =============================================================== You can, for example, check which interrupts are currently in use and what -they are used for by looking in the file /proc/interrupts: - - > cat /proc/interrupts - CPU0 - 0: 8728810 XT-PIC timer - 1: 895 XT-PIC keyboard - 2: 0 XT-PIC cascade - 3: 531695 XT-PIC aha152x - 4: 2014133 XT-PIC serial - 5: 44401 XT-PIC pcnet_cs - 8: 2 XT-PIC rtc - 11: 8 XT-PIC i82365 - 12: 182918 XT-PIC PS/2 Mouse - 13: 1 XT-PIC fpu - 14: 1232265 XT-PIC ide0 - 15: 7 XT-PIC ide1 - NMI: 0 +they are used for by looking in the file /proc/interrupts:: + + > cat /proc/interrupts + CPU0 + 0: 8728810 XT-PIC timer + 1: 895 XT-PIC keyboard + 2: 0 XT-PIC cascade + 3: 531695 XT-PIC aha152x + 4: 2014133 XT-PIC serial + 5: 44401 XT-PIC pcnet_cs + 8: 2 XT-PIC rtc + 11: 8 XT-PIC i82365 + 12: 182918 XT-PIC PS/2 Mouse + 13: 1 XT-PIC fpu + 14: 1232265 XT-PIC ide0 + 15: 7 XT-PIC ide1 + NMI: 0 In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the -output of a SMP machine): +output of a SMP machine):: - > cat /proc/interrupts + > cat /proc/interrupts - CPU0 CPU1 + CPU0 CPU1 0: 1243498 1214548 IO-APIC-edge timer 1: 8949 8958 IO-APIC-edge keyboard 2: 0 0 XT-PIC cascade @@ -708,8 +744,8 @@ output of a SMP machine): 15: 2183 2415 IO-APIC-edge ide1 17: 30564 30414 
IO-APIC-level eth0 18: 177 164 IO-APIC-level bttv - NMI: 2457961 2457959 - LOC: 2457882 2457881 + NMI: 2457961 2457959 + LOC: 2457882 2457881 ERR: 2155 NMI is incremented in this case because every timer interrupt generates a NMI @@ -726,21 +762,25 @@ In 2.6.2* /proc/interrupts was expanded again. This time the goal was for /proc/interrupts to display every IRQ vector in use by the system, not just those considered 'most important'. The new vectors are: - THR -- interrupt raised when a machine check threshold counter +THR + interrupt raised when a machine check threshold counter (typically counting ECC corrected errors of memory or cache) exceeds a configurable threshold. Only available on some systems. - TRM -- a thermal event interrupt occurs when a temperature threshold +TRM + a thermal event interrupt occurs when a temperature threshold has been exceeded for the CPU. This interrupt may also be generated when the temperature drops back to normal. - SPU -- a spurious interrupt is some interrupt that was raised then lowered +SPU + a spurious interrupt is some interrupt that was raised then lowered by some IO device before it could be fully processed by the APIC. Hence the APIC sees the interrupt but does not know what device it came from. For this case the APIC will generate the interrupt with a IRQ vector of 0xff. This might also be generated by chipset bugs. - RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are +RES, CAL, TLB + rescheduling, call and TLB flush interrupts are sent from one CPU to another per the needs of the OS. Typically, their statistics are used by kernel developers and interested users to determine the occurrence of interrupts of the given type. @@ -756,7 +796,8 @@ IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and prof_cpu_mask.
-For example +For example:: + > ls /proc/irq/ 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask 1 11 13 15 17 19 3 5 7 9 default_smp_affinity @@ -764,20 +805,20 @@ For example smp_affinity smp_affinity is a bitmask, in which you can specify which CPUs can handle the -IRQ, you can set it by doing: +IRQ, you can set it by doing:: > echo 1 > /proc/irq/10/smp_affinity This means that only the first CPU will handle the IRQ, but you can also echo 5 which means that only the first and third CPU can handle the IRQ. -The contents of each smp_affinity file is the same by default: +The contents of each smp_affinity file is the same by default:: > cat /proc/irq/0/smp_affinity ffffffff There is an alternate interface, smp_affinity_list which allows specifying -a cpu range instead of a bitmask: +a cpu range instead of a bitmask:: > cat /proc/irq/0/smp_affinity_list 1024-1031 @@ -810,46 +851,46 @@ Linux uses slab pools for memory management above page level in version 2.2. Commonly used objects have their own slab pool (such as network buffers, directory cache, and so on). -.............................................................................. +:: -> cat /proc/buddyinfo + > cat /proc/buddyinfo -Node 0, zone DMA 0 4 5 4 4 3 ... -Node 0, zone Normal 1 0 0 1 101 8 ... -Node 0, zone HighMem 2 0 0 1 1 0 ... + Node 0, zone DMA 0 4 5 4 4 3 ... + Node 0, zone Normal 1 0 0 1 101 8 ... + Node 0, zone HighMem 2 0 0 1 1 0 ... External fragmentation is a problem under some workloads, and buddyinfo is a -useful tool for helping diagnose these problems. Buddyinfo will give you a +useful tool for helping diagnose these problems. Buddyinfo will give you a clue as to how big an area you can safely allocate, or why a previous allocation failed. -Each column represents the number of pages of a certain order which are -available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in -ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE -available in ZONE_NORMAL, etc... 
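The column layout of /proc/buddyinfo described above (column k counts free chunks of 2^k pages) can be turned into a total number of free pages per zone. The following is a minimal sketch under that reading. The function names `parse_buddyinfo_line` and `free_pages` are hypothetical, and the parser simply ignores the "..." used to shorten the sample output.

```python
def parse_buddyinfo_line(line):
    """Split one buddyinfo line into (node, zone, per-order chunk counts)."""
    fields = line.split()
    # Expected shape: "Node <n>, zone <name> <count order 0> <count order 1> ..."
    if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
        raise ValueError("unexpected buddyinfo line: %r" % line)
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    # Keep numeric columns only; this drops the "..." elision from the sample.
    counts = [int(tok) for tok in fields[4:] if tok.isdigit()]
    return node, zone, counts


def free_pages(counts):
    """Total free pages: each chunk at order k contributes 2**k pages."""
    return sum(n << order for order, n in enumerate(counts))


if __name__ == "__main__":
    node, zone, counts = parse_buddyinfo_line(
        "Node 0, zone Normal 1 0 0 1 101 8 ...")
    print(node, zone, counts, free_pages(counts))
```

For the ZONE_NORMAL sample line, the visible columns alone add up to 1*1 + 1*8 + 101*16 + 8*32 = 1881 pages; the elided higher orders would add to that figure.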
+Each column represents the number of pages of a certain order which are +available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE +available in ZONE_NORMAL, etc... More information relevant to external fragmentation can be found in -pagetypeinfo. - -> cat /proc/pagetypeinfo -Page block order: 9 -Pages per block: 512 - -Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 -Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 -Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 -Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 -Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 -Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 -Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 -Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 -Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 -Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 - -Number of blocks type Unmovable Reclaimable Movable Reserve Isolate -Node 0, zone DMA 2 0 5 1 0 -Node 0, zone DMA32 41 6 967 2 0 +pagetypeinfo:: + + > cat /proc/pagetypeinfo + Page block order: 9 + Pages per block: 512 + + Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 + Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 + Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 + Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 + Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 + Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 + Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 + Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 + Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + + Number of blocks type Unmovable Reclaimable 
Movable Reserve Isolate + Node 0, zone DMA 2 0 5 1 0 + Node 0, zone DMA32 41 6 967 2 0 Fragmentation avoidance in the kernel works by grouping pages of different migrate types into the same contiguous regions of memory called page blocks. @@ -870,59 +911,63 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should also be allocatable although a lot of filesystem metadata may have to be reclaimed to achieve this. -.............................................................................. -meminfo: +meminfo +~~~~~~~ Provides information about distribution and utilization of memory. This varies by architecture and compile options. The following is from a 16GB PIII, which has highmem enabled. You may not have all of these fields. -> cat /proc/meminfo - -MemTotal: 16344972 kB -MemFree: 13634064 kB -MemAvailable: 14836172 kB -Buffers: 3656 kB -Cached: 1195708 kB -SwapCached: 0 kB -Active: 891636 kB -Inactive: 1077224 kB -HighTotal: 15597528 kB -HighFree: 13629632 kB -LowTotal: 747444 kB -LowFree: 4432 kB -SwapTotal: 0 kB -SwapFree: 0 kB -Dirty: 968 kB -Writeback: 0 kB -AnonPages: 861800 kB -Mapped: 280372 kB -Shmem: 644 kB -KReclaimable: 168048 kB -Slab: 284364 kB -SReclaimable: 159856 kB -SUnreclaim: 124508 kB -PageTables: 24448 kB -NFS_Unstable: 0 kB -Bounce: 0 kB -WritebackTmp: 0 kB -CommitLimit: 7669796 kB -Committed_AS: 100056 kB -VmallocTotal: 112216 kB -VmallocUsed: 428 kB -VmallocChunk: 111088 kB -Percpu: 62080 kB -HardwareCorrupted: 0 kB -AnonHugePages: 49152 kB -ShmemHugePages: 0 kB -ShmemPmdMapped: 0 kB - - - MemTotal: Total usable ram (i.e. 
physical ram minus a few reserved +:: + + > cat /proc/meminfo + + MemTotal: 16344972 kB + MemFree: 13634064 kB + MemAvailable: 14836172 kB + Buffers: 3656 kB + Cached: 1195708 kB + SwapCached: 0 kB + Active: 891636 kB + Inactive: 1077224 kB + HighTotal: 15597528 kB + HighFree: 13629632 kB + LowTotal: 747444 kB + LowFree: 4432 kB + SwapTotal: 0 kB + SwapFree: 0 kB + Dirty: 968 kB + Writeback: 0 kB + AnonPages: 861800 kB + Mapped: 280372 kB + Shmem: 644 kB + KReclaimable: 168048 kB + Slab: 284364 kB + SReclaimable: 159856 kB + SUnreclaim: 124508 kB + PageTables: 24448 kB + NFS_Unstable: 0 kB + Bounce: 0 kB + WritebackTmp: 0 kB + CommitLimit: 7669796 kB + Committed_AS: 100056 kB + VmallocTotal: 112216 kB + VmallocUsed: 428 kB + VmallocChunk: 111088 kB + Percpu: 62080 kB + HardwareCorrupted: 0 kB + AnonHugePages: 49152 kB + ShmemHugePages: 0 kB + ShmemPmdMapped: 0 kB + +MemTotal + Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code) - MemFree: The sum of LowFree+HighFree -MemAvailable: An estimate of how much memory is available for starting new +MemFree + The sum of LowFree+HighFree +MemAvailable + An estimate of how much memory is available for starting new applications, without swapping. Calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. @@ -930,69 +975,99 @@ MemAvailable: An estimate of how much memory is available for starting new page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. The impact of those factors will vary from system to system. - Buffers: Relatively temporary storage for raw disk blocks +Buffers + Relatively temporary storage for raw disk blocks shouldn't get tremendously large (20MB or so) - Cached: in-memory cache for files read from the disk (the +Cached + in-memory cache for files read from the disk (the pagecache). 
Doesn't include SwapCached - SwapCached: Memory that once was swapped out, is swapped back in but +SwapCached + Memory that once was swapped out, is swapped back in but still also is in the swapfile (if memory is needed it doesn't need to be swapped out AGAIN because it is already in the swapfile. This saves I/O) - Active: Memory that has been used more recently and usually not +Active + Memory that has been used more recently and usually not reclaimed unless absolutely necessary. - Inactive: Memory which has been less recently used. It is more +Inactive + Memory which has been less recently used. It is more eligible to be reclaimed for other purposes - HighTotal: - HighFree: Highmem is all memory above ~860MB of physical memory +HighTotal, HighFree + Highmem is all memory above ~860MB of physical memory Highmem areas are for use by userspace programs, or for the pagecache. The kernel must use tricks to access this memory, making it slower to access than lowmem. - LowTotal: - LowFree: Lowmem is memory which can be used for everything that +LowTotal, LowFree + Lowmem is memory which can be used for everything that highmem can be used for, but it is also available for the kernel's use for its own data structures. Among many other things, it is where everything from the Slab is allocated. Bad things happen when you're out of lowmem. 
- SwapTotal: total amount of swap space available - SwapFree: Memory which has been evicted from RAM, and is temporarily +SwapTotal + total amount of swap space available +SwapFree + Memory which has been evicted from RAM, and is temporarily on the disk - Dirty: Memory which is waiting to get written back to the disk - Writeback: Memory which is actively being written back to the disk - AnonPages: Non-file backed pages mapped into userspace page tables -HardwareCorrupted: The amount of RAM/memory in KB, the kernel identifies as +Dirty + Memory which is waiting to get written back to the disk +Writeback + Memory which is actively being written back to the disk +AnonPages + Non-file backed pages mapped into userspace page tables +HardwareCorrupted + The amount of RAM/memory in KB, the kernel identifies as corrupted. -AnonHugePages: Non-file backed huge pages mapped into userspace page tables - Mapped: files which have been mmaped, such as libraries - Shmem: Total memory used by shared memory (shmem) and tmpfs -ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated +AnonHugePages + Non-file backed huge pages mapped into userspace page tables +Mapped + files which have been mmaped, such as libraries +Shmem + Total memory used by shared memory (shmem) and tmpfs +ShmemHugePages + Memory used by shared memory (shmem) and tmpfs allocated with huge pages -ShmemPmdMapped: Shared memory mapped into userspace with huge pages -KReclaimable: Kernel allocations that the kernel will attempt to reclaim +ShmemPmdMapped + Shared memory mapped into userspace with huge pages +KReclaimable + Kernel allocations that the kernel will attempt to reclaim under memory pressure. Includes SReclaimable (below), and other direct allocations with a shrinker. 
- Slab: in-kernel data structures cache -SReclaimable: Part of Slab, that might be reclaimed, such as caches - SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure - PageTables: amount of memory dedicated to the lowest level of page +Slab + in-kernel data structures cache +SReclaimable + Part of Slab, that might be reclaimed, such as caches +SUnreclaim + Part of Slab, that cannot be reclaimed on memory pressure +PageTables + amount of memory dedicated to the lowest level of page tables. -NFS_Unstable: NFS pages sent to the server, but not yet committed to stable +NFS_Unstable + NFS pages sent to the server, but not yet committed to stable storage - Bounce: Memory used for block device "bounce buffers" -WritebackTmp: Memory used by FUSE for temporary writeback buffers - CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), +Bounce + Memory used for block device "bounce buffers" +WritebackTmp + Memory used by FUSE for temporary writeback buffers +CommitLimit + Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory'). - The CommitLimit is calculated with the following formula: - CommitLimit = ([total RAM pages] - [total huge TLB pages]) * - overcommit_ratio / 100 + [total swap pages] + + The CommitLimit is calculated with the following formula:: + + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] + For example, on a system with 1G of physical RAM and 7G of swap with a `vm.overcommit_ratio` of 30 it would yield a CommitLimit of 7.3G. + For more details, see the memory overcommit documentation in vm/overcommit-accounting. -Committed_AS: The amount of memory presently allocated on the system. +Committed_AS + The amount of memory presently allocated on the system. 
The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet. A process which malloc()'s 1G @@ -1005,21 +1080,25 @@ Committed_AS: The amount of memory presently allocated on the system. This is useful if one needs to guarantee that processes will not fail due to lack of memory once that memory has been successfully allocated. -VmallocTotal: total size of vmalloc memory area - VmallocUsed: amount of vmalloc area which is used -VmallocChunk: largest contiguous block of vmalloc area which is free - Percpu: Memory allocated to the percpu allocator used to back percpu +VmallocTotal + total size of vmalloc memory area +VmallocUsed + amount of vmalloc area which is used +VmallocChunk + largest contiguous block of vmalloc area which is free +Percpu + Memory allocated to the percpu allocator used to back percpu allocations. This stat excludes the cost of metadata. -.............................................................................. - -vmallocinfo: +vmallocinfo +~~~~~~~~~~~ Provides information about vmalloced/vmaped areas. One line per area, containing the virtual address range of the area, size in bytes, caller information of the creator, and optional information depending on the kind of area : + ========== =================================================== pages=nr number of pages phys=addr if a physical address was specified ioremap I/O mapping (ioremap() and friends) @@ -1029,49 +1108,54 @@ on the kind of area : vpages buffer for pages pointers was vmalloced (huge area) N<node>=nr (Only on NUMA kernels) Number of pages allocated on memory node <node> - -> cat /proc/vmallocinfo -0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... - /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 -0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... 
- /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 -0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... - phys=7fee8000 ioremap -0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... - phys=7fee7000 ioremap -0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 -0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... - /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 -0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... - pages=2 vmalloc N1=2 -0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... - /0x130 [x_tables] pages=4 vmalloc N0=4 -0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... - pages=14 vmalloc N2=14 -0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... - pages=4 vmalloc N1=4 -0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... - pages=2 vmalloc N1=2 -0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... - pages=10 vmalloc N0=10 - -.............................................................................. - -softirqs: + ========== =================================================== + +:: + + > cat /proc/vmallocinfo + 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 + 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 + 0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... + phys=7fee8000 ioremap + 0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap + 0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 + 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... 
+ /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 + 0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 + 0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 + 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 + 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 + 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 + 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + + +softirqs +~~~~~~~~ Provides counts of softirq handlers serviced since boot time, for each cpu. -> cat /proc/softirqs - CPU0 CPU1 CPU2 CPU3 - HI: 0 0 0 0 - TIMER: 27166 27120 27097 27034 - NET_TX: 0 0 0 17 - NET_RX: 42 0 0 39 - BLOCK: 0 0 107 1121 - TASKLET: 0 0 0 290 - SCHED: 27035 26983 26971 26746 - HRTIMER: 0 0 0 0 - RCU: 1678 1769 2178 2250 +:: + + > cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 1.3 IDE devices in /proc/ide @@ -1083,7 +1167,7 @@ file drivers and a link for each IDE device, pointing to the device directory in the controller specific subtree. The file drivers contains general information about the drivers used for the -IDE devices: +IDE devices:: > cat /proc/ide/drivers ide-cdrom version 4.53 @@ -1094,57 +1178,61 @@ subdirectories. These are named ide0, ide1 and so on. Each of these directories contains the files shown in table 1-6. -Table 1-6: IDE controller info in /proc/ide/ide? -.............................................................................. 
- File Content - channel IDE channel (0 or 1) - config Configuration (only for PCI/IDE bridge) - mate Mate name - model Type/Chipset of IDE controller -.............................................................................. +.. table:: Table 1-6: IDE controller info in /proc/ide/ide? + + ======= ======================================= + File Content + ======= ======================================= + channel IDE channel (0 or 1) + config Configuration (only for PCI/IDE bridge) + mate Mate name + model Type/Chipset of IDE controller + ======= ======================================= Each device connected to a controller has a separate subdirectory in the controllers directory. The files listed in table 1-7 are contained in these directories. -Table 1-7: IDE device information -.............................................................................. - File Content - cache The cache - capacity Capacity of the medium (in 512Byte blocks) - driver driver and version - geometry physical and logical geometry - identify device identify block - media media type - model device identifier - settings device setup - smart_thresholds IDE disk management thresholds - smart_values IDE disk management values -.............................................................................. - -The most interesting file is settings. This file contains a nice overview of -the drive parameters: - - # cat /proc/ide/ide0/hda/settings - name value min max mode - ---- ----- --- --- ---- - bios_cyl 526 0 65535 rw - bios_head 255 0 255 rw - bios_sect 63 0 63 rw - breada_readahead 4 0 127 rw - bswap 0 0 1 r - file_readahead 72 0 2097151 rw - io_32bit 0 0 3 rw - keepsettings 0 0 1 rw - max_kb_per_request 122 1 127 rw - multcount 0 0 8 rw - nice1 1 0 1 rw - nowerr 0 0 1 rw - pio_mode write-only 0 255 w - slow 0 0 1 rw - unmaskirq 0 0 1 rw - using_dma 0 0 1 rw +.. 
table:: Table 1-7: IDE device information + + ================ ========================================== + File Content + ================ ========================================== + cache The cache + capacity Capacity of the medium (in 512Byte blocks) + driver driver and version + geometry physical and logical geometry + identify device identify block + media media type + model device identifier + settings device setup + smart_thresholds IDE disk management thresholds + smart_values IDE disk management values + ================ ========================================== + +The most interesting file is ``settings``. This file contains a nice +overview of the drive parameters:: + + # cat /proc/ide/ide0/hda/settings + name value min max mode + ---- ----- --- --- ---- + bios_cyl 526 0 65535 rw + bios_head 255 0 255 rw + bios_sect 63 0 63 rw + breada_readahead 4 0 127 rw + bswap 0 0 1 r + file_readahead 72 0 2097151 rw + io_32bit 0 0 3 rw + keepsettings 0 0 1 rw + max_kb_per_request 122 1 127 rw + multcount 0 0 8 rw + nice1 1 0 1 rw + nowerr 0 0 1 rw + pio_mode write-only 0 255 w + slow 0 0 1 rw + unmaskirq 0 0 1 rw + using_dma 0 0 1 rw 1.4 Networking info in /proc/net @@ -1155,67 +1243,70 @@ additional values you get for IP version 6 if you configure the kernel to support this. Table 1-9 lists the files and their meaning. -Table 1-8: IPv6 info in /proc/net -.............................................................................. - File Content - udp6 UDP sockets (IPv6) - tcp6 TCP sockets (IPv6) - raw6 Raw device statistics (IPv6) - igmp6 IP multicast addresses, which this host joined (IPv6) - if_inet6 List of IPv6 interface addresses - ipv6_route Kernel routing table for IPv6 - rt6_stats Global IPv6 routing tables statistics - sockstat6 Socket statistics (IPv6) - snmp6 Snmp data (IPv6) -.............................................................................. 
- - -Table 1-9: Network info in /proc/net -.............................................................................. - File Content - arp Kernel ARP table - dev network devices with statistics +.. table:: Table 1-8: IPv6 info in /proc/net + + ========== ===================================================== + File Content + ========== ===================================================== + udp6 UDP sockets (IPv6) + tcp6 TCP sockets (IPv6) + raw6 Raw device statistics (IPv6) + igmp6 IP multicast addresses, which this host joined (IPv6) + if_inet6 List of IPv6 interface addresses + ipv6_route Kernel routing table for IPv6 + rt6_stats Global IPv6 routing tables statistics + sockstat6 Socket statistics (IPv6) + snmp6 Snmp data (IPv6) + ========== ===================================================== + +.. table:: Table 1-9: Network info in /proc/net + + ============= ================================================================ + File Content + ============= ================================================================ + arp Kernel ARP table + dev network devices with statistics dev_mcast the Layer2 multicast groups a device is listening too (interface index, label, number of references, number of bound - addresses). - dev_stat network device status - ip_fwchains Firewall chain linkage - ip_fwnames Firewall chain names - ip_masq Directory containing the masquerading tables - ip_masquerade Major masquerading table - netstat Network statistics - raw raw device statistics - route Kernel routing table - rpc Directory containing rpc info - rt_cache Routing cache - snmp SNMP data - sockstat Socket statistics - tcp TCP sockets - udp UDP sockets - unix UNIX domain sockets - wireless Wireless interface data (Wavelan etc) - igmp IP multicast addresses, which this host joined - psched Global packet scheduler parameters. 
- netlink List of PF_NETLINK sockets - ip_mr_vifs List of multicast virtual interfaces - ip_mr_cache List of multicast routing cache -.............................................................................. + addresses). + dev_stat network device status + ip_fwchains Firewall chain linkage + ip_fwnames Firewall chain names + ip_masq Directory containing the masquerading tables + ip_masquerade Major masquerading table + netstat Network statistics + raw raw device statistics + route Kernel routing table + rpc Directory containing rpc info + rt_cache Routing cache + snmp SNMP data + sockstat Socket statistics + tcp TCP sockets + udp UDP sockets + unix UNIX domain sockets + wireless Wireless interface data (Wavelan etc) + igmp IP multicast addresses, which this host joined + psched Global packet scheduler parameters. + netlink List of PF_NETLINK sockets + ip_mr_vifs List of multicast virtual interfaces + ip_mr_cache List of multicast routing cache + ============= ================================================================ You can use this information to see which network devices are available in -your system and how much traffic was routed over those devices: - - > cat /proc/net/dev - Inter-|Receive |[... - face |bytes packets errs drop fifo frame compressed multicast|[... - lo: 908188 5596 0 0 0 0 0 0 [... - ppp0:15475140 20721 410 0 0 410 0 0 [... - eth0: 614530 7085 0 0 0 0 0 1 [... - - ...] Transmit - ...] bytes packets errs drop fifo colls carrier compressed - ...] 908188 5596 0 0 0 0 0 0 - ...] 1375103 17405 0 0 0 0 0 0 - ...] 1703981 5535 0 0 0 3 0 0 +your system and how much traffic was routed over those devices:: + + > cat /proc/net/dev + Inter-|Receive |[... + face |bytes packets errs drop fifo frame compressed multicast|[... + lo: 908188 5596 0 0 0 0 0 0 [... + ppp0:15475140 20721 410 0 0 410 0 0 [... + eth0: 614530 7085 0 0 0 0 0 1 [... + + ...] Transmit + ...] bytes packets errs drop fifo colls carrier compressed + ...] 
908188 5596 0 0 0 0 0 0 + ...] 1375103 17405 0 0 0 0 0 0 + ...] 1703981 5535 0 0 0 3 0 0 In addition, each Channel Bond interface has its own directory. For example, the bond0 device will have a directory called /proc/net/bond0/. @@ -1228,62 +1319,62 @@ many times the slaves link has failed. If you have a SCSI host adapter in your system, you'll find a subdirectory named after the driver for this adapter in /proc/scsi. You'll also see a list -of all recognized SCSI devices in /proc/scsi: +of all recognized SCSI devices in /proc/scsi:: - >cat /proc/scsi/scsi - Attached devices: - Host: scsi0 Channel: 00 Id: 00 Lun: 00 - Vendor: IBM Model: DGHS09U Rev: 03E0 - Type: Direct-Access ANSI SCSI revision: 03 - Host: scsi0 Channel: 00 Id: 06 Lun: 00 - Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 - Type: CD-ROM ANSI SCSI revision: 02 + >cat /proc/scsi/scsi + Attached devices: + Host: scsi0 Channel: 00 Id: 00 Lun: 00 + Vendor: IBM Model: DGHS09U Rev: 03E0 + Type: Direct-Access ANSI SCSI revision: 03 + Host: scsi0 Channel: 00 Id: 06 Lun: 00 + Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 + Type: CD-ROM ANSI SCSI revision: 02 The directory named after the driver has one file for each adapter found in the system. These files contain information about the controller, including the used IRQ and the IO address range. The amount of information shown is dependent on the adapter you use. The example shows the output for an Adaptec -AHA-2940 SCSI adapter: - - > cat /proc/scsi/aic7xxx/0 - - Adaptec AIC7xxx driver version: 5.1.19/3.2.4 - Compile Options: - TCQ Enabled By Default : Disabled - AIC7XXX_PROC_STATS : Disabled - AIC7XXX_RESET_DELAY : 5 - Adapter Configuration: - SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter - Ultra Wide Controller - PCI MMAPed I/O Base: 0xeb001000 - Adapter SEEPROM Config: SEEPROM found and used. 
- Adaptec SCSI BIOS: Enabled - IRQ: 10 - SCBs: Active 0, Max Active 2, - Allocated 15, HW 16, Page 255 - Interrupts: 160328 - BIOS Control Word: 0x18b6 - Adapter Control Word: 0x005b - Extended Translation: Enabled - Disconnect Enable Flags: 0xffff - Ultra Enable Flags: 0x0001 - Tag Queue Enable Flags: 0x0000 - Ordered Queue Tag Flags: 0x0000 - Default Tag Queue Depth: 8 - Tagged Queue By Device array for aic7xxx host instance 0: - {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} - Actual queue depth per device for aic7xxx host instance 0: - {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} - Statistics: - (scsi0:0:0:0) - Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 - Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) - Total transfers 160151 (74577 reads and 85574 writes) - (scsi0:0:6:0) - Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 - Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) - Total transfers 0 (0 reads and 0 writes) +AHA-2940 SCSI adapter:: + + > cat /proc/scsi/aic7xxx/0 + + Adaptec AIC7xxx driver version: 5.1.19/3.2.4 + Compile Options: + TCQ Enabled By Default : Disabled + AIC7XXX_PROC_STATS : Disabled + AIC7XXX_RESET_DELAY : 5 + Adapter Configuration: + SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter + Ultra Wide Controller + PCI MMAPed I/O Base: 0xeb001000 + Adapter SEEPROM Config: SEEPROM found and used. 
+ Adaptec SCSI BIOS: Enabled + IRQ: 10 + SCBs: Active 0, Max Active 2, + Allocated 15, HW 16, Page 255 + Interrupts: 160328 + BIOS Control Word: 0x18b6 + Adapter Control Word: 0x005b + Extended Translation: Enabled + Disconnect Enable Flags: 0xffff + Ultra Enable Flags: 0x0001 + Tag Queue Enable Flags: 0x0000 + Ordered Queue Tag Flags: 0x0000 + Default Tag Queue Depth: 8 + Tagged Queue By Device array for aic7xxx host instance 0: + {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} + Actual queue depth per device for aic7xxx host instance 0: + {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} + Statistics: + (scsi0:0:0:0) + Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 + Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) + Total transfers 160151 (74577 reads and 85574 writes) + (scsi0:0:6:0) + Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 + Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) + Total transfers 0 (0 reads and 0 writes) 1.6 Parallel port info in /proc/parport @@ -1296,18 +1387,20 @@ number (0,1,2,...). These directories contain the four files shown in Table 1-10. -Table 1-10: Files in /proc/parport -.............................................................................. - File Content - autoprobe Any IEEE-1284 device ID information that has been acquired. +.. table:: Table 1-10: Files in /proc/parport + + ========= ==================================================================== + File Content + ========= ==================================================================== + autoprobe Any IEEE-1284 device ID information that has been acquired. devices list of the device drivers using that port. A + will appear by the name of the device currently using the port (it might not appear - against any). - hardware Parallel port's base address, IRQ line and DMA channel. + against any). + hardware Parallel port's base address, IRQ line and DMA channel. 
irq IRQ that parport is using for that port. This is in a separate file to allow you to alter it by writing a new value in (IRQ - number or none). -.............................................................................. + number or none). + ========= ==================================================================== 1.7 TTY info in /proc/tty ------------------------- @@ -1317,29 +1410,31 @@ directory /proc/tty.You'll find entries for drivers and line disciplines in this directory, as shown in Table 1-11. -Table 1-11: Files in /proc/tty -.............................................................................. - File Content - drivers list of drivers and their usage - ldiscs registered line disciplines - driver/serial usage statistic and status of single tty lines -.............................................................................. +.. table:: Table 1-11: Files in /proc/tty + + ============= ============================================== + File Content + ============= ============================================== + drivers list of drivers and their usage + ldiscs registered line disciplines + driver/serial usage statistic and status of single tty lines + ============= ============================================== To see which tty's are currently in use, you can simply look into the file -/proc/tty/drivers: - - > cat /proc/tty/drivers - pty_slave /dev/pts 136 0-255 pty:slave - pty_master /dev/ptm 128 0-255 pty:master - pty_slave /dev/ttyp 3 0-255 pty:slave - pty_master /dev/pty 2 0-255 pty:master - serial /dev/cua 5 64-67 serial:callout - serial /dev/ttyS 4 64-67 serial - /dev/tty0 /dev/tty0 4 0 system:vtmaster - /dev/ptmx /dev/ptmx 5 2 system - /dev/console /dev/console 5 1 system:console - /dev/tty /dev/tty 5 0 system:/dev/tty - unknown /dev/tty 4 1-63 console +/proc/tty/drivers:: + + > cat /proc/tty/drivers + pty_slave /dev/pts 136 0-255 pty:slave + pty_master /dev/ptm 128 0-255 pty:master + pty_slave /dev/ttyp 3 0-255 pty:slave + 
pty_master /dev/pty 2 0-255 pty:master + serial /dev/cua 5 64-67 serial:callout + serial /dev/ttyS 4 64-67 serial + /dev/tty0 /dev/tty0 4 0 system:vtmaster + /dev/ptmx /dev/ptmx 5 2 system + /dev/console /dev/console 5 1 system:console + /dev/tty /dev/tty 5 0 system:/dev/tty + unknown /dev/tty 4 1-63 console 1.8 Miscellaneous kernel statistics in /proc/stat @@ -1347,7 +1442,7 @@ To see which tty's are currently in use, you can simply look into the file Various pieces of information about kernel activity are available in the /proc/stat file. All of the numbers reported in this file are aggregates -since the system first booted. For a quick look, simply cat the file: +since the system first booted. For a quick look, simply cat the file:: > cat /proc/stat cpu 2255 34 2290 22625563 6290 127 456 0 0 0 @@ -1372,6 +1467,7 @@ second). The meanings of the columns are as follows, from left to right: - idle: twiddling thumbs - iowait: In a word, iowait stands for waiting for I/O to complete. But there are several problems: + 1. Cpu will not wait for I/O to complete, iowait is the time that a task is waiting for I/O to complete. When cpu goes into idle state for outstanding task io, another task will be scheduled on this CPU. @@ -1379,6 +1475,7 @@ second). The meanings of the columns are as follows, from left to right: on any CPU, so the iowait of each CPU is difficult to calculate. 3. The value of iowait field in /proc/stat will decrease in certain conditions. + So, the iowait is not reliable by reading from /proc/stat. - irq: servicing interrupts - softirq: servicing softirqs @@ -1422,18 +1519,19 @@ Information about mounted ext4 file systems can be found in /proc/fs/ext4/dm-0). The files in each per-device directory are shown in Table 1-12, below. -Table 1-12: Files in /proc/fs/ext4/<devname> -.............................................................................. - File Content +.. 
table:: Table 1-12: Files in /proc/fs/ext4/<devname> + + ============== ========================================================== + File Content mb_groups details of multiblock allocator buddy cache of free blocks -.............................................................................. + ============== ========================================================== 2.0 /proc/consoles ------------------ Shows registered system console lines. To see which character device lines are currently used for the system console -/dev/console, you may simply look into the file /proc/consoles: +/dev/console, you may simply look into the file /proc/consoles:: > cat /proc/consoles tty0 -WU (ECp) 4:7 @@ -1441,41 +1539,45 @@ To see which character device lines are currently used for the system console The columns are: - device name of the device - operations R = can do read operations - W = can do write operations - U = can do unblank - flags E = it is enabled - C = it is preferred console - B = it is primary boot console - p = it is used for printk buffer - b = it is not a TTY but a Braille device - a = it is safe to use when cpu is offline - major:minor major and minor number of the device separated by a colon ++--------------------+-------------------------------------------------------+ +| device | name of the device | ++====================+=======================================================+ +| operations | * R = can do read operations | +| | * W = can do write operations | +| | * U = can do unblank | ++--------------------+-------------------------------------------------------+ +| flags | * E = it is enabled | +| | * C = it is preferred console | +| | * B = it is primary boot console | +| | * p = it is used for printk buffer | +| | * b = it is not a TTY but a Braille device | +| | * a = it is safe to use when cpu is offline | ++--------------------+-------------------------------------------------------+ +| major:minor | major and minor number of the device separated 
by a | +| | colon | ++--------------------+-------------------------------------------------------+ ------------------------------------------------------------------------------- Summary ------------------------------------------------------------------------------- +------- + The /proc file system serves information about the running system. It not only allows access to process data but also allows you to request the kernel status by reading files in the hierarchy. The directory structure of /proc reflects the types of information and makes it easy, if not obvious, where to look for specific data. ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- -CHAPTER 2: MODIFYING SYSTEM PARAMETERS ------------------------------------------------------------------------------- +Chapter 2: Modifying System Parameters +====================================== ------------------------------------------------------------------------------- In This Chapter ------------------------------------------------------------------------------- +--------------- + * Modifying kernel parameters by writing into files found in /proc/sys * Exploring the files which modify certain parameters * Review of the /proc/sys file tree ------------------------------------------------------------------------------- +------------------------------------------------------------------------------ A very interesting part of /proc is the directory /proc/sys. This is not only a source of information, it also allows you to change parameters within the @@ -1503,19 +1605,18 @@ kernels, and became part of it in version 2.2.1 of the Linux kernel. Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these entries. 
------------------------------------------------------------------------------- Summary ------------------------------------------------------------------------------- +------- + Certain aspects of kernel behavior can be modified at runtime, without the need to recompile the kernel, or even to reboot the system. The files in the /proc/sys tree can not only be read, but also modified. You can use the echo command to write value into these files, thereby changing the default settings of the kernel. ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- -CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------- + +Chapter 3: Per-process Parameters +================================= 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score -------------------------------------------------------------------------------- @@ -1588,26 +1689,28 @@ process should be killed in an out-of-memory situation. This file contains IO statistics for each running process Example -------- +~~~~~~~ + +:: -test:/tmp # dd if=/dev/zero of=/tmp/test.dat & -[1] 3828 + test:/tmp # dd if=/dev/zero of=/tmp/test.dat & + [1] 3828 -test:/tmp # cat /proc/3828/io -rchar: 323934931 -wchar: 323929600 -syscr: 632687 -syscw: 632675 -read_bytes: 0 -write_bytes: 323932160 -cancelled_write_bytes: 0 + test:/tmp # cat /proc/3828/io + rchar: 323934931 + wchar: 323929600 + syscr: 632687 + syscw: 632675 + read_bytes: 0 + write_bytes: 323932160 + cancelled_write_bytes: 0 Description ------------ +~~~~~~~~~~~ rchar ------ +^^^^^ I/O counter: chars read The number of bytes which this task has caused to be read from storage. This @@ -1618,7 +1721,7 @@ pagecache) wchar ------ +^^^^^ I/O counter: chars written The number of bytes which this task has caused, or shall cause to be written @@ -1626,7 +1729,7 @@ to disk. 
Similar caveats apply here as with rchar. syscr ------ +^^^^^ I/O counter: read syscalls Attempt to count the number of read I/O operations, i.e. syscalls like read() @@ -1634,7 +1737,7 @@ and pread(). syscw ------ +^^^^^ I/O counter: write syscalls Attempt to count the number of write I/O operations, i.e. syscalls like @@ -1642,7 +1745,7 @@ write() and pwrite(). read_bytes ----------- +^^^^^^^^^^ I/O counter: bytes read Attempt to count the number of bytes which this process really did cause to @@ -1652,7 +1755,7 @@ CIFS at a later time> write_bytes ------------ +^^^^^^^^^^^ I/O counter: bytes written Attempt to count the number of bytes which this process caused to be sent to @@ -1660,7 +1763,7 @@ the storage layer. This is done at page-dirtying time. cancelled_write_bytes ---------------------- +^^^^^^^^^^^^^^^^^^^^^ The big inaccuracy here is truncate. If a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout. But it will have @@ -1673,12 +1776,11 @@ from the truncating task's write_bytes, but there is information loss in doing that. -Note ----- +.. Note:: -At its current implementation state, this is a bit racy on 32-bit machines: if -process A reads process B's /proc/pid/io while process B is updating one of -those 64-bit counters, process A could see an intermediate result. + At its current implementation state, this is a bit racy on 32-bit machines: + if process A reads process B's /proc/pid/io while process B is updating one + of those 64-bit counters, process A could see an intermediate result. More information about this can be found within the taskstats documentation in @@ -1698,12 +1800,13 @@ of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. 
The following 9 memory types are supported: + - (bit 0) anonymous private memory - (bit 1) anonymous shared memory - (bit 2) file-backed private memory - (bit 3) file-backed shared memory - (bit 4) ELF header pages in file-backed private memory areas (it is - effective only if the bit 2 is cleared) + effective only if the bit 2 is cleared) - (bit 5) hugetlb private memory - (bit 6) hugetlb shared memory - (bit 7) DAX private memory @@ -1719,13 +1822,13 @@ The default value of coredump_filter is 0x33; this means all anonymous memory segments, ELF header pages and hugetlb private memory are dumped. If you don't want to dump all shared memory segments attached to pid 1234, -write 0x31 to the process's proc file. +write 0x31 to the process's proc file:: $ echo 0x31 > /proc/1234/coredump_filter When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs. -For example: +For example:: $ echo 0x7 > /proc/self/coredump_filter $ ./some_program @@ -1733,35 +1836,37 @@ For example: 3.5 /proc/<pid>/mountinfo - Information about mounts -------------------------------------------------------- -This file contains lines of the form: +This file contains lines of the form:: -36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue -(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue + (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) -(1) mount ID: unique identifier of the mount (may be reused after umount) -(2) parent ID: ID of parent (or of self for the top of the mount tree) -(3) major:minor: value of st_dev for files on filesystem -(4) root: root of the mount within the filesystem -(5) mount point: mount point relative to the process's root -(6) mount options: per mount options -(7) optional fields: zero or more fields of the form "tag[:value]" -(8) separator: marks the end of the optional fields -(9) 
filesystem type: name of filesystem of the form "type[.subtype]" -(10) mount source: filesystem specific information or "none" -(11) super options: per super block options + (1) mount ID: unique identifier of the mount (may be reused after umount) + (2) parent ID: ID of parent (or of self for the top of the mount tree) + (3) major:minor: value of st_dev for files on filesystem + (4) root: root of the mount within the filesystem + (5) mount point: mount point relative to the process's root + (6) mount options: per mount options + (7) optional fields: zero or more fields of the form "tag[:value]" + (8) separator: marks the end of the optional fields + (9) filesystem type: name of filesystem of the form "type[.subtype]" + (10) mount source: filesystem specific information or "none" + (11) super options: per super block options Parsers should ignore all unrecognised optional fields. Currently the possible optional fields are: -shared:X mount is shared in peer group X -master:X mount is slave to peer group X -propagate_from:X mount is slave and receives propagation from peer group X (*) -unbindable mount is unbindable +================ ============================================================== +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X [#]_ +unbindable mount is unbindable +================ ============================================================== -(*) X is the closest dominant peer group under the process's root. If -X is the immediate master of the mount, or if there's no dominant peer -group under the same root, then only the "master:X" field is present -and not the "propagate_from:X" field. +.. [#] X is the closest dominant peer group under the process's root. If + X is the immediate master of the mount, or if there's no dominant peer + group under the same root, then only the "master:X" field is present + and not the "propagate_from:X" field. 
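The eleven-field layout described above can be sketched as a small parser. This is only an illustration under stated assumptions: the function name `parse_mountinfo_line` and the dictionary keys are hypothetical and not part of any kernel or library interface. It relies on the fact that the separator is the first lone "-" after the six fixed leading fields, and it keeps unrecognised optional fields rather than failing on them.

```python
def parse_mountinfo_line(line):
    """Split one /proc/<pid>/mountinfo line into named fields.

    Hypothetical helper for illustration; the key names are made up here.
    """
    fields = line.split()
    # Field (8): the first "-" after the six fixed fields ends the optional fields.
    sep = fields.index("-", 6)
    optional = {}
    for item in fields[6:sep]:
        # Optional fields have the form "tag[:value]"; unknown tags are kept as-is.
        tag, _, value = item.partition(":")
        optional[tag] = value or None
    major, minor = fields[2].split(":")
    return {
        "mount_id": int(fields[0]),                   # (1)
        "parent_id": int(fields[1]),                  # (2)
        "dev": (int(major), int(minor)),              # (3)
        "root": fields[3],                            # (4)
        "mount_point": fields[4],                     # (5)
        "mount_options": fields[5].split(","),        # (6)
        "optional": optional,                         # (7)
        "fs_type": fields[sep + 1],                   # (9)
        "source": fields[sep + 2],                    # (10)
        "super_options": fields[sep + 3].split(","),  # (11)
    }


if __name__ == "__main__":
    sample = ("36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - "
              "ext3 /dev/root rw,errors=continue")
    print(parse_mountinfo_line(sample))
```

Searching for the separator instead of using fixed positions is what lets the same code handle lines with zero, one, or several optional fields.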
For more information on mount propagation see: @@ -1804,77 +1909,86 @@ created with [see open(2) for details] and 'mnt_id' represents mount ID of the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo for details]. -A typical output is +A typical output is:: pos: 0 flags: 0100002 mnt_id: 19 -All locks associated with a file descriptor are shown in its fdinfo too. +All locks associated with a file descriptor are shown in its fdinfo too:: -lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF + lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags pair provide additional information particular to the objects they represent. - Eventfd files - ~~~~~~~~~~~~~ +Eventfd files +~~~~~~~~~~~~~ + +:: + pos: 0 flags: 04002 mnt_id: 9 eventfd-count: 5a - where 'eventfd-count' is hex value of a counter. +where 'eventfd-count' is hex value of a counter. + +Signalfd files +~~~~~~~~~~~~~~ + +:: - Signalfd files - ~~~~~~~~~~~~~~ pos: 0 flags: 04002 mnt_id: 9 sigmask: 0000000000000200 - where 'sigmask' is hex value of the signal mask associated - with a file. +where 'sigmask' is hex value of the signal mask associated +with a file. + +Epoll files +~~~~~~~~~~~ + +:: - Epoll files - ~~~~~~~~~~~ pos: 0 flags: 02 mnt_id: 9 tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 - where 'tfd' is a target file descriptor number in decimal form, - 'events' is events mask being watched and the 'data' is data - associated with a target [see epoll(7) for more details]. +where 'tfd' is a target file descriptor number in decimal form, +'events' is events mask being watched and the 'data' is data +associated with a target [see epoll(7) for more details]. - The 'pos' is current offset of the target file in decimal form - [see lseek(2)], 'ino' and 'sdev' are inode and device numbers - where target file resides, all in hex format. 
+The 'pos' is current offset of the target file in decimal form +[see lseek(2)], 'ino' and 'sdev' are inode and device numbers +where target file resides, all in hex format. - Fsnotify files - ~~~~~~~~~~~~~~ - For inotify files the format is the following +Fsnotify files +~~~~~~~~~~~~~~ +For inotify files the format is the following:: pos: 0 flags: 02000000 inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d - where 'wd' is a watch descriptor in decimal form, ie a target file - descriptor number, 'ino' and 'sdev' are inode and device where the - target file resides and the 'mask' is the mask of events, all in hex - form [see inotify(7) for more details]. +where 'wd' is a watch descriptor in decimal form, ie a target file +descriptor number, 'ino' and 'sdev' are inode and device where the +target file resides and the 'mask' is the mask of events, all in hex +form [see inotify(7) for more details]. - If the kernel was built with exportfs support, the path to the target - file is encoded as a file handle. The file handle is provided by three - fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex - format. +If the kernel was built with exportfs support, the path to the target +file is encoded as a file handle. The file handle is provided by three +fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex +format. - If the kernel is built without exportfs support the file handle won't be - printed out. +If the kernel is built without exportfs support the file handle won't be +printed out. - If there is no inotify mark attached yet the 'inotify' line will be omitted. +If there is no inotify mark attached yet the 'inotify' line will be omitted. - For fanotify files the format is +For fanotify files the format is:: pos: 0 flags: 02 @@ -1883,20 +1997,22 @@ pair provide additional information particular to the objects they represent. 
fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 - where fanotify 'flags' and 'event-flags' are values used in fanotify_init - call, 'mnt_id' is the mount point identifier, 'mflags' is the value of - flags associated with mark which are tracked separately from events - mask. 'ino', 'sdev' are target inode and device, 'mask' is the events - mask and 'ignored_mask' is the mask of events which are to be ignored. - All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' - does provide information about flags and mask used in fanotify_mark - call [see fsnotify manpage for details]. +where fanotify 'flags' and 'event-flags' are values used in fanotify_init +call, 'mnt_id' is the mount point identifier, 'mflags' is the value of +flags associated with mark which are tracked separately from events +mask. 'ino', 'sdev' are target inode and device, 'mask' is the events +mask and 'ignored_mask' is the mask of events which are to be ignored. +All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' +does provide information about flags and mask used in fanotify_mark +call [see fsnotify manpage for details]. + +While the first three lines are mandatory and always printed, the rest is +optional and may be omitted if no marks created yet. - While the first three lines are mandatory and always printed, the rest is - optional and may be omitted if no marks created yet. +Timerfd files +~~~~~~~~~~~~~ - Timerfd files - ~~~~~~~~~~~~~ +:: pos: 0 flags: 02 @@ -1907,18 +2023,18 @@ pair provide additional information particular to the objects they represent. it_value: (0, 49406829) it_interval: (1, 0) - where 'clockid' is the clock type and 'ticks' is the number of the timer expirations - that have occurred [see timerfd_create(2) for details]. 
'settime flags' are - flags in octal form been used to setup the timer [see timerfd_settime(2) for - details]. 'it_value' is remaining time until the timer exiration. - 'it_interval' is the interval for the timer. Note the timer might be set up - with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' - still exhibits timer's remaining time. +where 'clockid' is the clock type and 'ticks' is the number of the timer expirations +that have occurred [see timerfd_create(2) for details]. 'settime flags' are +flags in octal form that were used to set up the timer [see timerfd_settime(2) for +details]. 'it_value' is the remaining time until the timer expiration. +'it_interval' is the interval for the timer. Note the timer might be set up +with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' +still exhibits timer's remaining time. 3.9 /proc/<pid>/map_files - Information about memory mapped files --------------------------------------------------------------------- This directory contains symbolic links which represent memory mapped files -the process is maintaining. Example output: +the process is maintaining. Example output:: | lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so | lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so @@ -1976,17 +2092,22 @@ When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the architecture specific status of the task. Example -------- +~~~~~~~ + +:: + $ cat /proc/6753/arch_status AVX512_elapsed_ms: 8 Description ------------ +~~~~~~~~~~~ x86 specific entries: ---------------------- - AVX512_elapsed_ms: - ------------------ +~~~~~~~~~~~~~~~~~~~~~ + +AVX512_elapsed_ms: +^^^^^^^^^^^^^^^^^^ + If AVX512 is supported on the machine, this entry shows the milliseconds elapsed since the last time AVX512 usage was recorded. The recording happens on a best effort basis when a task is scheduled out. 
This means @@ -2010,17 +2131,18 @@ x86 specific entries: the task is unlikely an AVX512 user, but depends on the workload and the scheduling scenario, it also could be a false negative mentioned above. ------------------------------------------------------------------------------- Configuring procfs ------------------------------------------------------------------------------- +------------------ 4.1 Mount options --------------------- The following mount options are supported: + ========= ======================================================== hidepid= Set /proc/<pid>/ access mode. gid= Set the group authorized to learn processes information. + ========= ======================================================== hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories (default). diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.rst index 48ea68f15845..b71308314070 100644 --- a/Documentation/filesystems/qnx6.txt +++ b/Documentation/filesystems/qnx6.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== The QNX6 Filesystem =================== @@ -14,10 +17,12 @@ Specification qnx6fs shares many properties with traditional Unix filesystems. It has the concepts of blocks, inodes and directories. + On QNX it is possible to create little endian and big endian qnx6 filesystems. This feature makes it possible to create and use a different endianness fs for the target (QNX is used on quite a range of embedded systems) platform running on a different endianness. + The Linux driver handles endianness transparently. (LE and BE) Blocks @@ -26,6 +31,7 @@ Blocks The space in the device or file is split up into blocks. These are a fixed size of 512, 1024, 2048 or 4096, which is decided when the filesystem is created. 
+ Blockpointers are 32bit, so the maximum space that can be addressed is 2^32 * 4096 bytes or 16TB @@ -50,6 +56,7 @@ Each of these root nodes holds information like total size of the stored data and the addressing levels in that specific tree. If the level value is 0, up to 16 direct blocks can be addressed by each node. + Level 1 adds an additional indirect addressing level where each indirect addressing block holds up to blocksize / 4 bytes pointers to data blocks. Level 2 adds an additional indirect addressing block level (so, already up @@ -57,11 +64,13 @@ to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). Unused block pointers are always set to ~0 - regardless of root node, indirect addressing blocks or inodes. + Data leaves are always on the lowest level. So no data is stored on upper tree levels. The first Superblock is located at 0x2000. (0x2000 is the bootblock size) The Audi MMI 3G first superblock directly starts at byte 0. + Second superblock position can either be calculated from the superblock information (total number of filesystem blocks) or by taking the highest device address, zeroing the last 3 bytes and then subtracting 0x1000 from @@ -84,6 +93,7 @@ Object mode field is POSIX format. (which makes things easier) There are also pointers to the first 16 blocks, if the object data can be addressed with 16 direct blocks. + For more than 16 blocks an indirect addressing in form of another tree is used. (scheme is the same as the one used for the superblock root nodes) @@ -96,13 +106,18 @@ Directories A directory is a filesystem object and has an inode just like a file. It is a specially formatted file containing records which associate each name with an inode number. + '.' inode number points to the directory inode + '..' inode number points to the parent directory inode + Eeach filename record additionally got a filename length field. One special case are long filenames or subdirectory names. 
+ These got set a filename length field of 0xff in the corresponding directory record plus the longfile inode number also stored in that record. + With that longfilename inode number, the longfilename tree can be walked starting with the superblock longfilename root node pointers. @@ -111,6 +126,7 @@ Special files Symbolic links are also filesystem objects with inodes. They got a specific bit in the inode mode field identifying them as symbolic link. + The directory entry file inode pointer points to the target file inode. Hard links got an inode, a directory entry, but a specific mode bit set, @@ -126,9 +142,11 @@ Long filenames Long filenames are stored in a separate addressing tree. The staring point is the longfilename root node in the active superblock. + Each data block (tree leaves) holds one long filename. That filename is limited to 510 bytes. The first two starting bytes are used as length field for the actual filename. + If that structure shall fit for all allowed blocksizes, it is clear why there is a limit of 510 bytes for the actual filename stored. @@ -138,6 +156,7 @@ Bitmap The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap root node in the superblock and each bit in the bitmap represents one filesystem block. + The first block is block 0, which starts 0x1000 after superblock start. So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical address at which block 0 is located. @@ -149,11 +168,14 @@ Bitmap system area ------------------ The bitmap itself is divided into three parts. + First the system area, that is split into two halves. + Then userspace. The requirement for a static, fixed preallocated system area comes from how qnx6fs deals with writes. + Each superblock got it's own half of the system area. So superblock #1 always uses blocks from the lower half while superblock #2 just writes to blocks represented by the upper half bitmap system area bits. 
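The figures quoted in the qnx6 text above (16TB maximum size, 16 * 256 * 256 blocks for a two-level tree, block 0 at 0x3000) can be re-derived with a few lines of arithmetic. This is a minimal sketch, not driver code: the constant and function names are hypothetical, and the 256 pointers per indirect block assume a 1024-byte block size, which is what the "16 * 256 * 256" example in the text implies.

```python
POINTER_SIZE = 4             # block pointers are 32 bit, i.e. 4 bytes each
DIRECT_POINTERS = 16         # pointers held directly by a root node or inode
BOOTBLOCK_SIZE = 0x2000      # first superblock starts here on a normal qnx6fs
BLOCK0_AFTER_SUPER = 0x1000  # block 0 starts this far after the superblock start


def max_blocks(levels, block_size):
    """Blocks addressable by one tree with the given number of indirect levels."""
    pointers_per_block = block_size // POINTER_SIZE
    return DIRECT_POINTERS * pointers_per_block ** levels


def max_addressable_bytes(block_size):
    """Upper bound on addressable space with 32-bit block pointers."""
    return 2 ** 32 * block_size


# Physical address of block 0 on a normal qnx6fs (bootblock + superblock area).
BLOCK0_OFFSET = BOOTBLOCK_SIZE + BLOCK0_AFTER_SUPER

if __name__ == "__main__":
    print(max_blocks(0, 1024))          # 16 direct blocks
    print(max_blocks(2, 1024))          # 16 * 256 * 256 = 1048576
    print(max_addressable_bytes(4096))  # 2^32 * 4096 bytes
    print(hex(BLOCK0_OFFSET))           # 0x3000
```

With a 4096-byte block size each indirect block would hold 1024 pointers instead of 256, so the per-tree limits grow accordingly while the 32-bit pointer bound stays at 2^32 blocks.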
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.rst index 97d42ccaa92d..6c576e241d86 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst @@ -1,5 +1,11 @@ -ramfs, rootfs and initramfs +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +Ramfs, rootfs and initramfs +=========================== + October 17, 2005 + Rob Landley <rob@landley.net> ============================= @@ -99,14 +105,14 @@ out of that. All this differs from the old initrd in several ways: - The old initrd was always a separate file, while the initramfs archive is - linked into the linux kernel image. (The directory linux-*/usr is devoted - to generating this archive during the build.) + linked into the linux kernel image. (The directory ``linux-*/usr`` is + devoted to generating this archive during the build.) - The old initrd file was a gzipped filesystem image (in some file format, such as ext2, that needed a driver built into the kernel), while the new initramfs archive is a gzipped cpio archive (like tar only simpler, - see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The - kernel's cpio extraction code is not only extremely small, it's also + see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). + The kernel's cpio extraction code is not only extremely small, it's also __init text and data that can be discarded during the boot process. - The program run by the old initrd (which was called /initrd, not /init) did @@ -139,7 +145,7 @@ and living in usr/Kconfig) can be used to specify a source for the initramfs archive, which will automatically be incorporated into the resulting binary. 
This option can point to an existing gzipped cpio archive, a directory containing files to be archived, or a text file -specification such as the following example: +specification such as the following example:: dir /dev 755 0 0 nod /dev/console 644 0 0 c 5 1 @@ -175,12 +181,12 @@ or extracting your own preprepared cpio files to feed to the kernel build (instead of a config file or directory). The following command line can extract a cpio image (either by the above script -or by the kernel build) back into its component files: +or by the kernel build) back into its component files:: cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames The following shell script can create a prebuilt cpio archive you can -use in place of the above config file: +use in place of the above config file:: #!/bin/sh @@ -202,14 +208,17 @@ use in place of the above config file: exit 1 fi -Note: The cpio man page contains some bad advice that will break your initramfs -archive if you follow it. It says "A typical way to generate the list -of filenames is with the find command; you should give find the -depth option -to minimize problems with permissions on directories that are unwritable or not -searchable." Don't do this when creating initramfs.cpio.gz images, it won't -work. The Linux kernel cpio extractor won't create files in a directory that -doesn't exist, so the directory entries must go before the files that go in -those directories. The above script gets them in the right order. +.. Note:: + + The cpio man page contains some bad advice that will break your initramfs + archive if you follow it. It says "A typical way to generate the list + of filenames is with the find command; you should give find the -depth + option to minimize problems with permissions on directories that are + unwritable or not searchable." Don't do this when creating + initramfs.cpio.gz images, it won't work. 
The Linux kernel cpio extractor + won't create files in a directory that doesn't exist, so the directory + entries must go before the files that go in those directories. + The above script gets them in the right order. External initramfs images: -------------------------- @@ -236,9 +245,10 @@ An initramfs archive is a complete self-contained root filesystem for Linux. If you don't already understand what shared libraries, devices, and paths you need to get a minimal root filesystem up and running, here are some references: -http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ -http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html -http://www.linuxfromscratch.org/lfs/view/stable/ + +- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +- http://www.linuxfromscratch.org/lfs/view/stable/ The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is designed to be a tiny C library to statically link early userspace @@ -255,7 +265,7 @@ name lookups, even when otherwise statically linked.) A good first step is to get initramfs to run a statically linked "hello world" program as init, and test it under an emulator like qemu (www.qemu.org) or -User Mode Linux, like so: +User Mode Linux, like so:: cat > hello.c << EOF #include <stdio.h> @@ -326,8 +336,8 @@ the above threads) is: explained his reasoning: - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html + - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html and, most importantly, designed and implemented the initramfs code. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.rst index cd709a94d054..04ad083cfe62 100644 --- a/Documentation/filesystems/relay.txt +++ b/Documentation/filesystems/relay.rst @@ -1,3 +1,6 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================================== relay interface (formerly relayfs) ================================== @@ -108,6 +111,7 @@ The relay interface implements basic file operations for user space access to relay channel buffer data. Here are the file operations that are available and some comments regarding their behavior: +=========== ============================================================ open() enables user to open an _existing_ channel buffer. mmap() results in channel buffer being mapped into the caller's @@ -136,13 +140,16 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are close() decrements the channel buffer's refcount. When the refcount reaches 0, i.e. when no process or kernel client has the buffer open, the channel buffer is freed. +=========== ============================================================ In order for a user application to make use of relay files, the -host filesystem must be mounted. For example, +host filesystem must be mounted. For example:: mount -t debugfs debugfs /sys/kernel/debug -NOTE: the host filesystem doesn't need to be mounted for kernel +.. Note:: + + the host filesystem doesn't need to be mounted for kernel clients to create or use channels - it only needs to be mounted when user space applications need access to the buffer data. @@ -154,7 +161,7 @@ The relay interface kernel API Here's a summary of the API the relay interface provides to in-kernel clients: TBD(curr. line MT:/API/) - channel management functions: + channel management functions:: relay_open(base_filename, parent, subbuf_size, n_subbufs, callbacks, private_data) @@ -162,17 +169,17 @@ TBD(curr. 
line MT:/API/) relay_flush(chan) relay_reset(chan) - channel management typically called on instigation of userspace: + channel management typically called on instigation of userspace:: relay_subbufs_consumed(chan, cpu, subbufs_consumed) - write functions: + write functions:: relay_write(chan, data, length) __relay_write(chan, data, length) relay_reserve(chan, length) - callbacks: + callbacks:: subbuf_start(buf, subbuf, prev_subbuf, prev_padding) buf_mapped(buf, filp) @@ -180,7 +187,7 @@ TBD(curr. line MT:/API/) create_buf_file(filename, parent, mode, buf, is_global) remove_buf_file(dentry) - helper functions: + helper functions:: relay_buf_full(buf) subbuf_start_reserve(buf, length) @@ -215,41 +222,41 @@ the file(s) created in create_buf_file() and is called during relay_close(). Here are some typical definitions for these callbacks, in this case -using debugfs: - -/* - * create_buf_file() callback. Creates relay file in debugfs. - */ -static struct dentry *create_buf_file_handler(const char *filename, - struct dentry *parent, - umode_t mode, - struct rchan_buf *buf, - int *is_global) -{ - return debugfs_create_file(filename, mode, parent, buf, - &relay_file_operations); -} - -/* - * remove_buf_file() callback. Removes relay file from debugfs. - */ -static int remove_buf_file_handler(struct dentry *dentry) -{ - debugfs_remove(dentry); - - return 0; -} - -/* - * relay interface callbacks - */ -static struct rchan_callbacks relay_callbacks = -{ - .create_buf_file = create_buf_file_handler, - .remove_buf_file = remove_buf_file_handler, -}; - -And an example relay_open() invocation using them: +using debugfs:: + + /* + * create_buf_file() callback. Creates relay file in debugfs. + */ + static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + umode_t mode, + struct rchan_buf *buf, + int *is_global) + { + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); + } + + /* + * remove_buf_file() callback. 
Removes relay file from debugfs. + */ + static int remove_buf_file_handler(struct dentry *dentry) + { + debugfs_remove(dentry); + + return 0; + } + + /* + * relay interface callbacks + */ + static struct rchan_callbacks relay_callbacks = + { + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, + }; + +And an example relay_open() invocation using them:: chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); @@ -339,23 +346,23 @@ whether or not to actually move on to the next sub-buffer. To implement 'no-overwrite' mode, the userspace client would provide an implementation of the subbuf_start() callback something like the -following: +following:: -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; + static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) + { + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; - if (relay_buf_full(buf)) - return 0; + if (relay_buf_full(buf)) + return 0; - subbuf_start_reserve(buf, sizeof(unsigned int)); + subbuf_start_reserve(buf, sizeof(unsigned int)); - return 1; -} + return 1; + } If the current buffer is full, i.e. all sub-buffers remain unconsumed, the callback returns 0 to indicate that the buffer switch should not @@ -370,20 +377,20 @@ ready sub-buffers will relay_buf_full() return 0, in which case the buffer switch can continue. 
The implementation of the subbuf_start() callback for 'overwrite' mode -would be very similar: +would be very similar:: -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - size_t prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; + static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + size_t prev_padding) + { + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; - subbuf_start_reserve(buf, sizeof(unsigned int)); + subbuf_start_reserve(buf, sizeof(unsigned int)); - return 1; -} + return 1; + } In this case, the relay_buf_full() check is meaningless and the callback always returns 1, causing the buffer switch to occur diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.rst index e2b07cc9120a..465b11efa9be 100644 --- a/Documentation/filesystems/romfs.txt +++ b/Documentation/filesystems/romfs.rst @@ -1,4 +1,8 @@ -ROMFS - ROM FILE SYSTEM +.. SPDX-License-Identifier: GPL-2.0 + +======================= +ROMFS - ROM File System +======================= This is a quite dumb, read only filesystem, mainly for initial RAM disks of installation disks. It has grown up by the need of having @@ -51,9 +55,9 @@ the 16 byte padding for the name and the contents, also 16+14+15 = 45 bytes. This is quite rare however, since most file names are longer than 3 bytes, and shorter than 15 bytes. -The layout of the filesystem is the following: +The layout of the filesystem is the following:: -offset content + offset content +---+---+---+---+ 0 | - | r | o | m | \ @@ -84,9 +88,9 @@ the source. This algorithm was chosen because although it's not quite reliable, it does not require any tables, and it is very simple. The following bytes are now part of the file system; each file header -must begin on a 16 byte boundary. 
+must begin on a 16 byte boundary:: -offset content + offset content +---+---+---+---+ 0 | next filehdr|X| The offset of the next file header @@ -114,7 +118,9 @@ file is user and group 0, this should never be a problem for the intended use. The mapping of the 8 possible values to file types is the following: +== =============== ============================================ mapping spec.info means +== =============== ============================================ 0 hard link link destination [file header] 1 directory first file's header 2 regular file unused, must be zero [MBZ] @@ -123,6 +129,7 @@ the following: 5 char device - " - 6 socket unused, MBZ 7 fifo unused, MBZ +== =============== ============================================ Note that hard links are specifically marked in this filesystem, but they will behave as you can expect (i.e. share the inode number). @@ -158,24 +165,24 @@ to romfs-subscribe@shadow.banki.hu, the content is irrelevant. Pending issues: - Permissions and owner information are pretty essential features of a -Un*x like system, but romfs does not provide the full possibilities. -I have never found this limiting, but others might. + Un*x like system, but romfs does not provide the full possibilities. + I have never found this limiting, but others might. - The file system is read only, so it can be very small, but in case -one would want to write _anything_ to a file system, he still needs -a writable file system, thus negating the size advantages. Possible -solutions: implement write access as a compile-time option, or a new, -similarly small writable filesystem for RAM disks. + one would want to write _anything_ to a file system, he still needs + a writable file system, thus negating the size advantages. Possible + solutions: implement write access as a compile-time option, or a new, + similarly small writable filesystem for RAM disks. 
- Since the files are only required to have alignment on a 16 byte -boundary, it is currently possibly suboptimal to read or execute files -from the filesystem. It might be resolved by reordering file data to -have most of it (i.e. except the start and the end) laying at "natural" -boundaries, thus it would be possible to directly map a big portion of -the file contents to the mm subsystem. + boundary, it is currently possibly suboptimal to read or execute files + from the filesystem. It might be resolved by reordering file data to + have most of it (i.e. except the start and the end) laying at "natural" + boundaries, thus it would be possible to directly map a big portion of + the file contents to the mm subsystem. - Compression might be an useful feature, but memory is quite a -limiting factor in my eyes. + limiting factor in my eyes. - Where it is used? @@ -183,4 +190,5 @@ limiting factor in my eyes. Have fun, + Janos Farkas <chexum@shadow.banki.hu> diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.rst index e5274f84dc56..df42106bae71 100644 --- a/Documentation/filesystems/squashfs.txt +++ b/Documentation/filesystems/squashfs.rst @@ -1,7 +1,11 @@ -SQUASHFS 4.0 FILESYSTEM +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Squashfs 4.0 Filesystem ======================= Squashfs is a compressed read-only filesystem for Linux. + It uses zlib, lz4, lzo, or xz compression to compress files, inodes and directories. Inodes in the system are very small and all blocks are packed to minimise data overhead. Block sizes greater than 4K are supported up to a @@ -15,31 +19,33 @@ needed. Mailing list: squashfs-devel@lists.sourceforge.net Web site: www.squashfs.org -1. FILESYSTEM FEATURES +1. 
Filesystem Features ---------------------- Squashfs filesystem features versus Cramfs: +============================== ========= ========== Squashfs Cramfs - -Max filesystem size: 2^64 256 MiB -Max file size: ~ 2 TiB 16 MiB -Max files: unlimited unlimited -Max directories: unlimited unlimited -Max entries per directory: unlimited unlimited -Max block size: 1 MiB 4 KiB -Metadata compression: yes no -Directory indexes: yes no -Sparse file support: yes no -Tail-end packing (fragments): yes no -Exportable (NFS etc.): yes no -Hard link support: yes no -"." and ".." in readdir: yes no -Real inode numbers: yes no -32-bit uids/gids: yes no -File creation time: yes no -Xattr support: yes no -ACL support: no no +============================== ========= ========== +Max filesystem size 2^64 256 MiB +Max file size ~ 2 TiB 16 MiB +Max files unlimited unlimited +Max directories unlimited unlimited +Max entries per directory unlimited unlimited +Max block size 1 MiB 4 KiB +Metadata compression yes no +Directory indexes yes no +Sparse file support yes no +Tail-end packing (fragments) yes no +Exportable (NFS etc.) yes no +Hard link support yes no +"." and ".." in readdir yes no +Real inode numbers yes no +32-bit uids/gids yes no +File creation time yes no +Xattr support yes no +ACL support no no +============================== ========= ========== Squashfs compresses data, inodes and directories. In addition, inode and directory data are highly compacted, and packed on byte boundaries. Each @@ -47,7 +53,7 @@ compressed inode is on average 8 bytes in length (the exact length varies on file type, i.e. regular file, directory, symbolic link, and block/char device inodes have different sizes). -2. USING SQUASHFS +2. Using Squashfs ----------------- As squashfs is a read-only filesystem, the mksquashfs program must be used to @@ -58,11 +64,11 @@ obtained from this site also. 
 The squashfs-tools development tree is now located on kernel.org
 git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
-3. SQUASHFS FILESYSTEM DESIGN
+3. Squashfs Filesystem Design
 -----------------------------
 A squashfs filesystem consists of a maximum of nine parts, packed together on a
-byte alignment:
+byte alignment::
   ---------------
  | superblock |
@@ -229,15 +235,15 @@ location of the xattr list inside each inode, a 32-bit xattr id is stored.
 This xattr id is mapped into the location of the xattr list using a second
 xattr id lookup table.
-4. TODOS AND OUTSTANDING ISSUES
+4. TODOs and Outstanding Issues
 -------------------------------
-4.1 Todo list
+4.1 TODO list
 -------------
 Implement ACL support.
-4.2 Squashfs internal cache
+4.2 Squashfs Internal Cache
 ---------------------------
 Blocks in Squashfs are compressed. To avoid repeatedly decompressing
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.rst
index ddf15b1b0d5a..290891c3fecb 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.rst
@@ -1,32 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
-sysfs - _The_ filesystem for exporting kernel objects.
+=====================================================
+sysfs - _The_ filesystem for exporting kernel objects
+=====================================================
 Patrick Mochel <mochel@osdl.org>
+
 Mike Murphy <mamurph@cs.clemson.edu>
-Revised: 16 August 2011
-Original: 10 January 2003
+:Revised: 16 August 2011
+:Original: 10 January 2003
 What it is:
 ~~~~~~~~~~~
 sysfs is a ram-based filesystem initially based on ramfs. It provides
-a means to export kernel data structures, their attributes, and the
-linkages between them to userspace.
+a means to export kernel data structures, their attributes, and the
+linkages between them to userspace.
 sysfs is tied inherently to the kobject infrastructure. Please read
 Documentation/kobject.txt for more information concerning the kobject
-interface.
+interface.
 Using sysfs
 ~~~~~~~~~~~
 sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
-it by doing:
+it by doing::
-    mount -t sysfs sysfs /sys
+    mount -t sysfs sysfs /sys
 Directory Creation
@@ -37,7 +41,7 @@ created for it in sysfs. That directory is created as a subdirectory
 of the kobject's parent, expressing internal object hierarchies to
 userspace. Top-level directories in sysfs represent the common
 ancestors of object hierarchies; i.e. the subsystems the objects
-belong to.
+belong to.
 Sysfs internally stores a pointer to the kobject that implements a
 directory in the kernfs_node object associated with the directory. In
@@ -58,63 +62,63 @@ attributes.
 Attributes should be ASCII text files, preferably with only one value
 per file. It is noted that it may not be efficient to contain only one
 value per file, so it is socially acceptable to express an array of
-values of the same type.
+values of the same type.
 Mixing types, expressing multiple lines of data, and doing fancy
 formatting of data is heavily frowned upon. Doing these things may get
-you publicly humiliated and your code rewritten without notice.
+you publicly humiliated and your code rewritten without notice.
-An attribute definition is simply:
+An attribute definition is simply::
-struct attribute {
-        char * name;
-        struct module *owner;
-        umode_t mode;
-};
+    struct attribute {
+        char * name;
+        struct module *owner;
+        umode_t mode;
+    };
-int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
-void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
+    int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
+    void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
 A bare attribute contains no means to read or write the value of the
 attribute. Subsystems are encouraged to define their own attribute
 structure and wrapper functions for adding and removing attributes for
-a specific object type.
+a specific object type.
-For example, the driver model defines struct device_attribute like:
+For example, the driver model defines struct device_attribute like::
-struct device_attribute {
-        struct attribute attr;
-        ssize_t (*show)(struct device *dev, struct device_attribute *attr,
-                        char *buf);
-        ssize_t (*store)(struct device *dev, struct device_attribute *attr,
-                         const char *buf, size_t count);
-};
+    struct device_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+                        char *buf);
+        ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+                         const char *buf, size_t count);
+    };
-int device_create_file(struct device *, const struct device_attribute *);
-void device_remove_file(struct device *, const struct device_attribute *);
+    int device_create_file(struct device *, const struct device_attribute *);
+    void device_remove_file(struct device *, const struct device_attribute *);
-It also defines this helper for defining device attributes:
+It also defines this helper for defining device attributes::
-#define DEVICE_ATTR(_name, _mode, _show, _store) \
-struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
+    #define DEVICE_ATTR(_name, _mode, _show, _store) \
+    struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
-For example, declaring
+For example, declaring::
-static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
+    static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
-is equivalent to doing:
+is equivalent to doing::
-static struct device_attribute dev_attr_foo = {
-        .attr = {
-                .name = "foo",
-                .mode = S_IWUSR | S_IRUGO,
-        },
-        .show = show_foo,
-        .store = store_foo,
-};
+    static struct device_attribute dev_attr_foo = {
+        .attr = {
+                .name = "foo",
+                .mode = S_IWUSR | S_IRUGO,
+        },
+        .show = show_foo,
+        .store = store_foo,
+    };
 Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
 considered a bad idea." so trying to set a sysfs file writable for
@@ -127,15 +131,21 @@ readable. The above case could be shortened to:
 static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
 the list of helpers available to define your wrapper function is:
-__ATTR_RO(name): assumes default name_show and mode 0444
-__ATTR_WO(name): assumes a name_store only and is restricted to mode
+
+__ATTR_RO(name):
+                 assumes default name_show and mode 0444
+__ATTR_WO(name):
+                 assumes a name_store only and is restricted to mode
                  0200 that is root write access only.
-__ATTR_RO_MODE(name, mode): fore more restrictive RO access currently
+__ATTR_RO_MODE(name, mode):
+                 fore more restrictive RO access currently
                  only use case is the EFI System Resource Table
                  (see drivers/firmware/efi/esrt.c)
-__ATTR_RW(name): assumes default name_show, name_store and setting
+__ATTR_RW(name):
+                 assumes default name_show, name_store and setting
                  mode to 0644.
-__ATTR_NULL: which sets the name to NULL and is used as end of list
+__ATTR_NULL:
+                 which sets the name to NULL and is used as end of list
                  indicator (see: kernel/workqueue.c)
 Subsystem-Specific Callbacks
@@ -143,12 +153,12 @@ Subsystem-Specific Callbacks
 When a subsystem defines a new attribute type, it must implement a
 set of sysfs operations for forwarding read and write calls to the
-show and store methods of the attribute owners.
+show and store methods of the attribute owners::
-struct sysfs_ops {
-        ssize_t (*show)(struct kobject *, struct attribute *, char *);
-        ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
-};
+    struct sysfs_ops {
+        ssize_t (*show)(struct kobject *, struct attribute *, char *);
+        ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
+    };
 [ Subsystems should have already defined a struct kobj_type as a
 descriptor for this type, which is where the sysfs_ops pointer is
@@ -157,29 +167,29 @@ stored. See the kobject documentation for more information. ]
 When a file is read or written, sysfs calls the appropriate method
 for the type. The method then translates the generic struct kobject
 and struct attribute pointers to the appropriate pointer types, and
-calls the associated methods.
+calls the associated methods.
-To illustrate:
+To illustrate::
-#define to_dev(obj) container_of(obj, struct device, kobj)
-#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
+    #define to_dev(obj) container_of(obj, struct device, kobj)
+    #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
-static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
-                             char *buf)
-{
-        struct device_attribute *dev_attr = to_dev_attr(attr);
-        struct device *dev = to_dev(kobj);
-        ssize_t ret = -EIO;
+    static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
+                                 char *buf)
+    {
+        struct device_attribute *dev_attr = to_dev_attr(attr);
+        struct device *dev = to_dev(kobj);
+        ssize_t ret = -EIO;
-        if (dev_attr->show)
-                ret = dev_attr->show(dev, dev_attr, buf);
-        if (ret >= (ssize_t)PAGE_SIZE) {
-                printk("dev_attr_show: %pS returned bad count\n",
-                       dev_attr->show);
-        }
-        return ret;
-}
+        if (dev_attr->show)
+                ret = dev_attr->show(dev, dev_attr, buf);
+        if (ret >= (ssize_t)PAGE_SIZE) {
+                printk("dev_attr_show: %pS returned bad count\n",
+                       dev_attr->show);
+        }
+        return ret;
+    }
@@ -188,11 +198,11 @@ Reading/Writing Attribute Data
 To read or write attributes, show() or store() methods must be
 specified when declaring the attribute. The method types should be as
-simple as those defined for device attributes:
+simple as those defined for device attributes::
-ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
-ssize_t (*store)(struct device *dev, struct device_attribute *attr,
-                 const char *buf, size_t count);
+    ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
+    ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+                     const char *buf, size_t count);
 IOW, they should take only an object, an attribute, and a buffer as parameters.
@@ -200,11 +210,11 @@ IOW, they should take only an object, an attribute, and a buffer as parameters.
 sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
 method. Sysfs will call the method exactly once for each read or
 write. This forces the following behavior on the method
-implementations:
+implementations:
-- On read(2), the show() method should fill the entire buffer.
+- On read(2), the show() method should fill the entire buffer.
   Recall that an attribute should only be exporting one value, or an
-  array of similar values, so this shouldn't be that expensive.
+  array of similar values, so this shouldn't be that expensive.
   This allows userspace to do partial reads and forward seeks
   arbitrarily over the entire file at will. If userspace seeks back to
@@ -218,10 +228,10 @@ implementations:
   When writing sysfs files, userspace processes should first read the
   entire file, modify the values it wishes to change, then write the
-  entire buffer back.
+  entire buffer back.
   Attribute method implementations should operate on an identical
-  buffer when reading and writing values.
+  buffer when reading and writing values.
 Other notes:
@@ -229,7 +239,7 @@ Other notes:
   file position.
 - The buffer will always be PAGE_SIZE bytes in length. On i386, this
-  is 4096.
+  is 4096.
 - show() methods should return the number of bytes printed into the
   buffer. This is the return value of scnprintf().
@@ -246,31 +256,31 @@ Other notes:
   through, be sure to return an error.
 - The object passed to the methods will be pinned in memory via sysfs
-  referencing counting its embedded object. However, the physical
-  entity (e.g. device) the object represents may not be present. Be
-  sure to have a way to check this, if necessary.
+  referencing counting its embedded object. However, the physical
+  entity (e.g. device) the object represents may not be present. Be
+  sure to have a way to check this, if necessary.
-A very simple (and naive) implementation of a device attribute is:
+A very simple (and naive) implementation of a device attribute is::
-static ssize_t show_name(struct device *dev, struct device_attribute *attr,
-                         char *buf)
-{
-        return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
-}
+    static ssize_t show_name(struct device *dev, struct device_attribute *attr,
+                             char *buf)
+    {
+        return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+    }
-static ssize_t store_name(struct device *dev, struct device_attribute *attr,
-                          const char *buf, size_t count)
-{
-        snprintf(dev->name, sizeof(dev->name), "%.*s",
-                 (int)min(count, sizeof(dev->name) - 1), buf);
-        return count;
-}
+    static ssize_t store_name(struct device *dev, struct device_attribute *attr,
+                              const char *buf, size_t count)
+    {
+        snprintf(dev->name, sizeof(dev->name), "%.*s",
+                 (int)min(count, sizeof(dev->name) - 1), buf);
+        return count;
+    }
-static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
+    static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
-(Note that the real implementation doesn't allow userspace to set the
+(Note that the real implementation doesn't allow userspace to set the
 name for a device.)
@@ -278,25 +288,25 @@ Top Level Directory Layout
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 The sysfs directory arrangement exposes the relationship of kernel
-data structures.
-The top level sysfs directory looks like: +The top level sysfs directory looks like:: -block/ -bus/ -class/ -dev/ -devices/ -firmware/ -net/ -fs/ + block/ + bus/ + class/ + dev/ + devices/ + firmware/ + net/ + fs/ devices/ contains a filesystem representation of the device tree. It maps directly to the internal kernel device tree, which is a hierarchy of -struct device. +struct device. bus/ contains flat directory layout of the various bus types in the -kernel. Each bus's directory contains two subdirectories: +kernel. Each bus's directory contains two subdirectories:: devices/ drivers/ @@ -331,71 +341,71 @@ Current Interfaces The following interface layers currently exist in sysfs: -- devices (include/linux/device.h) ----------------------------------- -Structure: +devices (include/linux/device.h) +-------------------------------- +Structure:: -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; -Declaring: +Declaring:: -DEVICE_ATTR(_name, _mode, _show, _store); + DEVICE_ATTR(_name, _mode, _show, _store); -Creation/Removal: +Creation/Removal:: -int device_create_file(struct device *dev, const struct device_attribute * attr); -void device_remove_file(struct device *dev, const struct device_attribute * attr); + int device_create_file(struct device *dev, const struct device_attribute * attr); + void device_remove_file(struct device *dev, const struct device_attribute * attr); -- bus drivers (include/linux/device.h) --------------------------------------- -Structure: +bus drivers (include/linux/device.h) 
+------------------------------------
+Structure::
-struct bus_attribute {
-        struct attribute attr;
-        ssize_t (*show)(struct bus_type *, char * buf);
-        ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
-};
+    struct bus_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct bus_type *, char * buf);
+        ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
+    };
-Declaring:
+Declaring::
-static BUS_ATTR_RW(name);
-static BUS_ATTR_RO(name);
-static BUS_ATTR_WO(name);
+    static BUS_ATTR_RW(name);
+    static BUS_ATTR_RO(name);
+    static BUS_ATTR_WO(name);
-Creation/Removal:
+Creation/Removal::
-int bus_create_file(struct bus_type *, struct bus_attribute *);
-void bus_remove_file(struct bus_type *, struct bus_attribute *);
+    int bus_create_file(struct bus_type *, struct bus_attribute *);
+    void bus_remove_file(struct bus_type *, struct bus_attribute *);
-- device drivers (include/linux/device.h)
------------------------------------------
+device drivers (include/linux/device.h)
+---------------------------------------
-Structure:
+Structure::
-struct driver_attribute {
-        struct attribute attr;
-        ssize_t (*show)(struct device_driver *, char * buf);
-        ssize_t (*store)(struct device_driver *, const char * buf,
-                         size_t count);
-};
+    struct driver_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct device_driver *, char * buf);
+        ssize_t (*store)(struct device_driver *, const char * buf,
+                         size_t count);
+    };
-Declaring:
+Declaring::
-DRIVER_ATTR_RO(_name)
-DRIVER_ATTR_RW(_name)
+    DRIVER_ATTR_RO(_name)
+    DRIVER_ATTR_RW(_name)
-Creation/Removal:
+Creation/Removal::
-int driver_create_file(struct device_driver *, const struct driver_attribute *);
-void driver_remove_file(struct device_driver *, const struct driver_attribute *);
+    int driver_create_file(struct device_driver *, const struct driver_attribute *);
+    void driver_remove_file(struct device_driver *, const struct driver_attribute *);
 Documentation
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.rst
index 253b50d1328e..89e40911ad7c 100644
--- a/Documentation/filesystems/sysv-fs.txt
+++ b/Documentation/filesystems/sysv-fs.rst
@@ -1,25 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+SystemV Filesystem
+==================
+
 It implements all of
   - Xenix FS,
   - SystemV/386 FS,
   - Coherent FS.
 To install:
+
 * Answer the 'System V and Coherent filesystem support' question with 'y'
   when configuring the kernel.
-* To mount a disk or a partition, use
+* To mount a disk or a partition, use::
+
     mount [-r] -t sysv device mountpoint
-  The file system type names
+
+  The file system type names::
+
                -t sysv
                -t xenix
                -t coherent
+
   may be used interchangeably, but the last two will eventually disappear.
 Bugs in the present implementation:
+
 - Coherent FS:
+
   - The "free list interleave" n:m is currently ignored.
   - Only file systems with no filesystem name and no pack name are recognized.
-  (See Coherent "man mkfs" for a description of these features.)
+    (See Coherent "man mkfs" for a description of these features.)
+
 - SystemV Release 2 FS:
+
   The superblock is only searched in the blocks 9, 15, 18, which
   corresponds to the beginning of track 1 on floppy disks. No support
   for this FS on hard disk yet.
@@ -28,12 +43,14 @@ Bugs in the present implementation:
 These filesystems are rather similar. Here is a comparison with Minix FS:
 * Linux fdisk reports on partitions
+
   - Minix FS     0x81 Linux/Minix
   - Xenix FS     ??
   - SystemV FS   ??
   - Coherent FS  0x08 AIX bootable
 * Size of a block or zone (data allocation unit on disk)
+
   - Minix FS     1024
   - Xenix FS     1024 (also 512 ??)
   - SystemV FS   1024 (also 512 and 2048)
@@ -45,37 +62,51 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
   all the block numbers (including the super block) are offset by one track.
 * Byte ordering of "short" (16 bit entities) on disk:
+
   - Minix FS     little endian  0 1
   - Xenix FS     little endian  0 1
   - SystemV FS   little endian  0 1
   - Coherent FS  little endian  0 1
+
   Of course, this affects only the file system, not the data of files on it!
 * Byte ordering of "long" (32 bit entities) on disk:
+
   - Minix FS     little endian  0 1 2 3
   - Xenix FS     little endian  0 1 2 3
   - SystemV FS   little endian  0 1 2 3
   - Coherent FS  PDP-11         2 3 0 1
+
   Of course, this affects only the file system, not the data of files on it!
 * Inode on disk: "short", 0 means non-existent, the root dir ino is:
-  - Minix FS                             1
-  - Xenix FS, SystemV FS, Coherent FS    2
+
+  ================================= ==
+  Minix FS                          1
+  Xenix FS, SystemV FS, Coherent FS 2
+  ================================= ==
 * Maximum number of hard links to a file:
-  - Minix FS     250
-  - Xenix FS     ??
-  - SystemV FS   ??
-  - Coherent FS  >=10000
+
+  =========== =========
+  Minix FS    250
+  Xenix FS    ??
+  SystemV FS  ??
+  Coherent FS >=10000
+  =========== =========
 * Free inode management:
-  - Minix FS                             a bitmap
+
+  - Minix FS
+      a bitmap
   - Xenix FS, SystemV FS, Coherent FS
       There is a cache of a certain number of free inodes in the super-block.
       When it is exhausted, new free inodes are found using a linear search.
 * Free block management:
-  - Minix FS                             a bitmap
+
+  - Minix FS
+      a bitmap
   - Xenix FS, SystemV FS, Coherent FS
       Free blocks are organized in a "free list". Maybe a misleading term,
       since it is not true that every free block contains a pointer to
@@ -86,13 +117,18 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
       0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
 * Super-block location:
-  - Minix FS     block 1 = bytes 1024..2047
-  - Xenix FS     block 1 = bytes 1024..2047
-  - SystemV FS   bytes 512..1023
-  - Coherent FS  block 1 = bytes 512..1023
+
+  =========== ==========================
+  Minix FS    block 1 = bytes 1024..2047
+  Xenix FS    block 1 = bytes 1024..2047
+  SystemV FS  bytes 512..1023
+  Coherent FS block 1 = bytes 512..1023
+  =========== ==========================
 * Super-block layout:
-  - Minix FS
+
+  - Minix FS::
+
                     unsigned short s_ninodes;
                     unsigned short s_nzones;
                     unsigned short s_imap_blocks;
@@ -101,7 +137,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
                     unsigned short s_log_zone_size;
                     unsigned long s_max_size;
                     unsigned short s_magic;
-  - Xenix FS, SystemV FS, Coherent FS
+
+  - Xenix FS, SystemV FS, Coherent FS::
+
                     unsigned short s_firstdatazone;
                     unsigned long s_nzones;
                     unsigned short s_fzone_count;
@@ -120,23 +158,33 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
                     unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
                     char s_fname[6];
                     char s_fpack[6];
+
     then they differ considerably:
-        Xenix FS
+
+        Xenix FS::
+
                     char s_clean;
                     char s_fill[371];
                     long s_magic;
                     long s_type;
-        SystemV FS
+
+        SystemV FS::
+
                     long s_fill[12 or 14];
                     long s_state;
                     long s_magic;
                     long s_type;
-        Coherent FS
+
+        Coherent FS::
+
                     unsigned long s_unique;
+
     Note that Coherent FS has no magic.
 * Inode layout:
-  - Minix FS
+
+  - Minix FS::
+
                     unsigned short i_mode;
                     unsigned short i_uid;
                     unsigned long i_size;
@@ -144,7 +192,9 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
                     unsigned char i_gid;
                     unsigned char i_nlinks;
                     unsigned short i_zone[7+1+1];
-  - Xenix FS, SystemV FS, Coherent FS
+
+  - Xenix FS, SystemV FS, Coherent FS::
+
                     unsigned short i_mode;
                     unsigned short i_nlink;
                     unsigned short i_uid;
@@ -155,38 +205,55 @@ These filesystems are rather similar. Here is a comparison with Minix FS:
                     unsigned long i_mtime;
                     unsigned long i_ctime;
+
 * Regular file data blocks are organized as
-  - Minix FS
-             7 direct blocks
-             1 indirect block (pointers to blocks)
-             1 double-indirect block (pointer to pointers to blocks)
-  - Xenix FS, SystemV FS, Coherent FS
-             10 direct blocks
-             1 indirect block (pointers to blocks)
-             1 double-indirect block (pointer to pointers to blocks)
-             1 triple-indirect block (pointer to pointers to pointers to blocks)
-* Inode size, inodes per block
-  - Minix FS     32 32
-  - Xenix FS     64 16
-  - SystemV FS   64 16
-  - Coherent FS  64 8
+  - Minix FS:
+
+    - 7 direct blocks
+    - 1 indirect block (pointers to blocks)
+    - 1 double-indirect block (pointer to pointers to blocks)
+
+  - Xenix FS, SystemV FS, Coherent FS:
+
+    - 10 direct blocks
+    - 1 indirect block (pointers to blocks)
+    - 1 double-indirect block (pointer to pointers to blocks)
+    - 1 triple-indirect block (pointer to pointers to pointers to blocks)
+
+
+  =========== ========== ================
+              Inode size inodes per block
+  =========== ========== ================
+  Minix FS    32         32
+  Xenix FS    64         16
+  SystemV FS  64         16
+  Coherent FS 64         8
+  =========== ========== ================
 * Directory entry on disk
-  - Minix FS
+
+  - Minix FS::
+
                     unsigned short inode;
                     char name[14/30];
-  - Xenix FS, SystemV FS, Coherent FS
+
+  - Xenix FS, SystemV FS, Coherent FS::
+
                     unsigned short inode;
                     char name[14];
-* Dir entry size, dir entries per block
-  - Minix FS     16/32 64/32
-  - Xenix FS     16 64
-  - SystemV FS   16 64
-  - Coherent FS  16 32
+  =========== ============== =====================
+              Dir entry size dir entries per block
+  =========== ============== =====================
+  Minix FS    16/32          64/32
+  Xenix FS    16             64
+  SystemV FS  16             64
+  Coherent FS 16             32
+  =========== ============== =====================
 * How to implement symbolic links such that the host fsck doesn't scream:
+
   - Minix FS     normal
   - Xenix FS     kludge: as regular files with chmod 1000
   - SystemV FS   ??
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.rst
index 5ecbc03e6b2f..4e95929301a5 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+Tmpfs
+=====
+
 Tmpfs is a file system which keeps all files in virtual memory.
@@ -14,7 +20,7 @@ If you compare it to ramfs (which was the template to create tmpfs)
 you gain swapping and limit checking. Another similar thing is the RAM
 disk (/dev/ram*), which simulates a fixed size hard disk in physical
 RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
+cannot swap and you do not have the possibility to resize them.
 Since tmpfs lives completely in the page cache and on swap, all tmpfs
 pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
@@ -26,7 +32,7 @@ tmpfs has the following uses:
 1) There is always a kernel internal mount which you will not see at
    all. This is used for shared anonymous mappings and SYSV shared
-   memory.
+   memory.
    This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
    set, the user visible part of tmpfs is not build. But the internal
@@ -34,7 +40,7 @@ tmpfs has the following uses:
 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
    POSIX shared memory (shm_open, shm_unlink). Adding the following
-   line to /etc/fstab should take care of this:
+   line to /etc/fstab should take care of this::
         tmpfs /dev/shm tmpfs defaults 0 0
@@ -56,15 +62,17 @@ tmpfs has the following uses:
 tmpfs has three mount options for sizing:
-size:      The limit of allocated bytes for this tmpfs instance. The
+=========  ============================================================
+size       The limit of allocated bytes for this tmpfs instance. The
            default is half of your physical RAM without swap. If you
            oversize your tmpfs instances the machine will deadlock
            since the OOM handler will not be able to free that memory.
-nr_blocks: The same as size, but in blocks of PAGE_SIZE.
-nr_inodes: The maximum number of inodes for this instance. The default
+nr_blocks  The same as size, but in blocks of PAGE_SIZE.
+nr_inodes  The maximum number of inodes for this instance. The default
            is half of the number of your physical RAM pages, or (on a
            machine with highmem) the number of lowmem RAM pages,
            whichever is the lower.
+=========  ============================================================
 These parameters accept a suffix k, m or g for kilo, mega and giga and
 can be changed on remount. The size parameter also accepts a suffix %
@@ -82,6 +90,7 @@ tmpfs has a mount option to set the NUMA memory allocation policy for
 all files in that instance (if CONFIG_NUMA is enabled) - which can be
 adjusted on the fly via 'mount -o remount ...'
+======================== ==============================================
 mpol=default             use the process allocation policy
                          (see set_mempolicy(2))
 mpol=prefer:Node         prefers to allocate memory from the given Node
@@ -89,6 +98,7 @@ mpol=bind:NodeList allocates memory only from nodes in NodeList
 mpol=interleave          prefers to allocate from each node in turn
 mpol=interleave:NodeList allocates from each node of NodeList in turn
 mpol=local               prefers to allocate memory from the local node
+======================== ==============================================
 NodeList format is a comma-separated list of decimal numbers and ranges,
 a range being two hyphen-separated decimal numbers, the smallest and
@@ -98,9 +108,9 @@ A memory policy with a valid NodeList will be saved, as specified, for
 use at file creation time. When a task allocates a file in the file
 system, the mount option memory policy will be applied with a NodeList,
 if any, modified by the calling task's cpuset constraints
-[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed
-below. If the resulting NodeLists is the empty set, the effective memory
-policy for the file will revert to "default" policy.
+[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags,
+listed below. If the resulting NodeLists is the empty set, the effective
+memory policy for the file will revert to "default" policy.
 NUMA memory allocation policies have optional flags that can be used in
 conjunction with their modes. These optional flags can be specified
@@ -109,6 +119,8 @@ See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
 all available memory allocation policy mode flags and their effect on
 memory policy.
+::
+
         =static         is equivalent to        MPOL_F_STATIC_NODES
         =relative       is equivalent to        MPOL_F_RELATIVE_NODES
@@ -128,9 +140,11 @@ on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
 To specify the initial root directory you can use the following mount
 options:
-mode: The permissions as an octal number
-uid:  The user id
-gid:  The group id
+==== ==================================
+mode The permissions as an octal number
+uid  The user id
+gid  The group id
+==== ==================================
 These options do not have any effect on remount. You can change these
 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
@@ -141,9 +155,9 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
-Author:
+:Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
-Updated:
+:Updated:
    Hugh Dickins, 4 June 2007
-Updated:
+:Updated:
    KOSAKI Motohiro, 16 Mar 2010
diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst
index 6a9584f6ff46..16efd729bf7c 100644
--- a/Documentation/filesystems/ubifs-authentication.rst
+++ b/Documentation/filesystems/ubifs-authentication.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
 :orphan:
 .. UBIFS Authentication
@@ -92,11 +94,11 @@ UBIFS Index & Tree Node Cache
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types
-of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file
-contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes.
-Almost all types of nodes share a common header (`ubifs_ch`) containing basic
+of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file
+contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes.
+Almost all types of nodes share a common header (``ubifs_ch``) containing basic
 information like node type, node length, a sequence number, etc. (see
-`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT
+``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT
 and some less important node types like padding nodes which are used to pad
 unusable content at the end of LEBs.
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.rst
index acc80442a3bb..e6ee99762534 100644
--- a/Documentation/filesystems/ubifs.txt
+++ b/Documentation/filesystems/ubifs.rst
@@ -1,5 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+UBI File System
+===============
+
 Introduction
-=============
+============
 UBIFS file-system stands for UBI File System. UBI stands for "Unsorted Block
 Images". UBIFS is a flash file system, which means it is designed
@@ -79,6 +85,7 @@ Mount options
 (*) == default.
+==================== =======================================================
 bulk_read            read more in one go to take advantage of flash
                      media that read faster sequentially
 no_bulk_read (*)     do not bulk-read
@@ -98,6 +105,7 @@ auth_key= specify the key used for authenticating the filesystem.
 auth_hash_name=      The hash algorithm used for authentication. Used for
                      both hashing and for creating HMACs. Typical values
                      include "sha256" or "sha512"
+==================== =======================================================
 Quick usage instructions
@@ -107,12 +115,14 @@ The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax,
 where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
 UBI volume name.
-Mount volume 0 on UBI device 0 to /mnt/ubifs:
-$ mount -t ubifs ubi0_0 /mnt/ubifs
+Mount volume 0 on UBI device 0 to /mnt/ubifs::
+
+    $ mount -t ubifs ubi0_0 /mnt/ubifs
 Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
-name):
-$ mount -t ubifs ubi0:rootfs /mnt/ubifs
+name)::
+
+    $ mount -t ubifs ubi0:rootfs /mnt/ubifs
 The following is an example of the kernel boot arguments to attach mtd0
 to UBI and mount volume "rootfs":
@@ -122,5 +132,6 @@ References
 ==========
 UBIFS documentation and FAQ/HOWTO at the MTD web site:
-http://www.linux-mtd.infradead.org/doc/ubifs.html
-http://www.linux-mtd.infradead.org/faq/ubifs.html
+
+- http://www.linux-mtd.infradead.org/doc/ubifs.html
+- http://www.linux-mtd.infradead.org/faq/ubifs.html
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.rst
index e2f2faf32f18..d9badbf285b2 100644
--- a/Documentation/filesystems/udf.txt
+++ b/Documentation/filesystems/udf.rst
@@ -1,6 +1,8 @@
-*
-* Documentation/filesystems/udf.txt
-*
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+UDF file system
+===============
 If you encounter problems with reading UDF discs using this driver,
 please report them according to MAINTAINERS file.
@@ -18,8 +20,10 @@ performance due to very poor read-modify-write support supplied internally
 by drive firmware.
 -------------------------------------------------------------------------------
+
 The following mount options are supported:
+    =========== ======================================
     gid=        Set the default group.
     umask=      Set the default umask.
     mode=       Set the default file permissions.
@@ -34,6 +38,7 @@ The following mount options are supported:
     longad      Use long ad's (default)
     nostrict    Unset strict conformance
     iocharset=  Set the NLS character set
+    =========== ======================================
 The uid= and gid= options need a bit more explaining. They will accept a
 decimal numeric value and all inodes on that mount will then appear as
@@ -47,13 +52,17 @@ the interactive user will always see the files on the disk as belonging to him.
 The remaining are for debugging and disaster recovery:
-    novrs Skip volume sequence recognition
+    ===== ================================
+    novrs Skip volume sequence recognition
+    ===== ================================
 The following expect a offset from 0.
+    ========== =================================================
     session=   Set the CDROM session (default= last session)
     anchor=    Override standard anchor location.
(default= 256) lastblock= Set the last block of the filesystem/ + ========== ================================================= ------------------------------------------------------------------------------- @@ -62,5 +71,5 @@ For the latest version and toolset see: https://github.com/pali/udftools Documentation on UDF and ECMA 167 is available FREE from: - http://www.osta.org/ - http://www.ecma-international.org/ + - http://www.osta.org/ + - http://www.ecma-international.org/ diff --git a/Documentation/filesystems/virtiofs.rst b/Documentation/filesystems/virtiofs.rst index 4f338e3cb3f7..e06e4951cb39 100644 --- a/Documentation/filesystems/virtiofs.rst +++ b/Documentation/filesystems/virtiofs.rst @@ -1,5 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 +.. _virtiofs_index: + =================================================== virtiofs: virtio-fs host<->guest shared file system =================================================== diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.rst index 935bf22031ca..71d845c6a700 100644 --- a/Documentation/filesystems/zonefs.txt +++ b/Documentation/filesystems/zonefs.rst @@ -1,4 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================ ZoneFS - Zone filesystem for Zoned block devices +================================================ Introduction ============ @@ -29,6 +33,7 @@ Zoned block devices Zoned storage devices belong to a class of storage devices with an address space that is divided into zones. A zone is a group of consecutive LBAs and all zones are contiguous (there are no LBA gaps). Zones may have different types. + * Conventional zones: there are no access constraints to LBAs belonging to conventional zones. Any read or write access can be executed, similarly to a regular block device. @@ -134,7 +139,7 @@ Sequential zone files can only be written sequentially, starting from the file end, that is, write operations can only be append writes. 
Zonefs makes no attempt at accepting random writes and will fail any write request that has a start offset not corresponding to the end of the file, or to the end of the last -write issued and still in-flight (for asynchrnous I/O operations). +write issued and still in-flight (for asynchronous I/O operations). Since dirty page writeback by the page cache does not guarantee a sequential write pattern, zonefs prevents buffered writes and writeable shared mappings @@ -142,7 +147,7 @@ on sequential files. Only direct I/O writes are accepted for these files. zonefs relies on the sequential delivery of write I/O requests to the device implemented by the block layer elevator. An elevator implementing the sequential write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) -must be used. This type of elevator (e.g. mq-deadline) is the set by default +must be used. This type of elevator (e.g. mq-deadline) is set by default for zoned block devices on device initialization. There are no restrictions on the type of I/O used for read operations in @@ -158,6 +163,7 @@ Format options -------------- Several optional features of zonefs can be enabled at format time. + * Conventional zone aggregation: ranges of contiguous conventional zones can be aggregated into a single larger file instead of the default one file per zone. * File ownership: The owner UID and GID of zone files is by default 0 (root) @@ -196,7 +202,7 @@ additional conditions that result in I/O errors. may still happen in the case of a partial failure of a very large direct I/O operation split into multiple BIOs/requests or asynchronous I/O operations. If one of the write request within the set of sequential write requests - issued to the device fails, all write requests after queued after it will + issued to the device fails, all write requests queued after it will become unaligned and fail. 
* Delayed write errors: similarly to regular block devices, if the device side @@ -207,7 +213,7 @@ additional conditions that result in I/O errors. causing all data to be dropped after the sector that caused the error. All I/O errors detected by zonefs are notified to the user with an error code -return for the system call that trigered or detected the error. The recovery +return for the system call that triggered or detected the error. The recovery actions taken by zonefs in response to I/O errors depend on the I/O type (read vs write) and on the reason for the error (bad sector, unaligned writes or zone condition change). @@ -222,7 +228,7 @@ condition change). * A zone condition change to read-only or offline also always triggers zonefs I/O error recovery. -Zonefs minimal I/O error recovery may change a file size and a file access +Zonefs minimal I/O error recovery may change a file size and file access permissions. * File size changes: @@ -237,7 +243,7 @@ permissions. A file size may also be reduced to reflect a delayed write error detected on fsync(): in this case, the amount of data effectively written in the zone may be less than originally indicated by the file inode size. After such I/O - error, zonefs always fixes a file inode size to reflect the amount of data + error, zonefs always fixes the file inode size to reflect the amount of data persistently stored in the file zone. * Access permission changes: @@ -249,7 +255,7 @@ permissions. Further action taken by zonefs I/O error recovery can be controlled by the user with the "errors=xxx" mount option. The table below summarizes the result of zonefs I/O error processing depending on the mount option and on the zone -conditions. +conditions:: +--------------+-----------+-----------------------------------------+ | | | Post error state | @@ -258,11 +264,11 @@ conditions. 
| option | condition | size read write read write | +--------------+-----------+-----------------------------------------+ | | good | fixed yes no yes yes | - | remount-ro | read-only | fixed yes no yes no | + | remount-ro | read-only | as is yes no yes no | | (default) | offline | 0 no no no no | +--------------+-----------+-----------------------------------------+ | | good | fixed yes no yes yes | - | zone-ro | read-only | fixed yes no yes no | + | zone-ro | read-only | as is yes no yes no | | | offline | 0 no no no no | +--------------+-----------+-----------------------------------------+ | | good | 0 no no yes yes | @@ -270,22 +276,23 @@ conditions. | | offline | 0 no no no no | +--------------+-----------+-----------------------------------------+ | | good | fixed yes yes yes yes | - | repair | read-only | fixed yes no yes no | + | repair | read-only | as is yes no yes no | | | offline | 0 no no no no | +--------------+-----------+-----------------------------------------+ Further notes: + * The "errors=remount-ro" mount option is the default behavior of zonefs I/O error processing if no errors mount option is specified. * With the "errors=remount-ro" mount option, the change of the file access permissions to read-only applies to all files. The file system is remounted read-only. * Access permission and file size changes due to the device transitioning zones - to the offline condition are permanent. Remounting or reformating the device + to the offline condition are permanent. Remounting or reformatting the device with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good state. * File access permission changes to read-only due to the device transitioning - zones to the read-only condition are permanent. Remounting or reformating + zones to the read-only condition are permanent. Remounting or reformatting the device will not re-enable file write access. 
* File access permission changes implied by the remount-ro, zone-ro and zone-offline mount options are temporary for zones in a good condition. @@ -301,14 +308,23 @@ Mount options zonefs define the "errors=<behavior>" mount option to allow the user to specify zonefs behavior in response to I/O errors, inode size inconsistencies or zone -condition chages. The defined behaviors are as follow: +condition changes. The defined behaviors are as follow: + * remount-ro (default) * zone-ro * zone-offline * repair -The I/O error actions defined for each behavior is detailed in the previous -section. +The run-time I/O error actions defined for each behavior are detailed in the +previous section. Mount time I/O errors will cause the mount operation to fail. +The handling of read-only zones also differs between mount-time and run-time. +If a read-only zone is found at mount time, the zone is always treated in the +same manner as offline zones, that is, all accesses are disabled and the zone +file size set to 0. This is necessary as the write pointer of read-only zones +is defined as invalid by the ZBC and ZAC standards, making it impossible to +discover the amount of data that has been written to the zone. In the case of a +read-only zone discovered at run-time, as indicated in the previous section, +the size of the zone file is left unchanged from its last updated value. Zonefs User Space Tools ======================= @@ -325,78 +341,78 @@ Examples -------- The following formats a 15TB host-managed SMR HDD with 256 MB zones -with the conventional zones aggregation feature enabled. 
+with the conventional zones aggregation feature enabled:: -# mkzonefs -o aggr_cnv /dev/sdX -# mount -t zonefs /dev/sdX /mnt -# ls -l /mnt/ -total 0 -dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv -dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq + # mkzonefs -o aggr_cnv /dev/sdX + # mount -t zonefs /dev/sdX /mnt + # ls -l /mnt/ + total 0 + dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv + dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq The size of the zone files sub-directories indicate the number of files existing for each type of zones. In this example, there is only one conventional zone file (all conventional zones are aggregated under a single -file). +file):: -# ls -l /mnt/cnv -total 137101312 --rw-r----- 1 root root 140391743488 Nov 25 13:23 0 + # ls -l /mnt/cnv + total 137101312 + -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 -This aggregated conventional zone file can be used as a regular file. +This aggregated conventional zone file can be used as a regular file:: -# mkfs.ext4 /mnt/cnv/0 -# mount -o loop /mnt/cnv/0 /data + # mkfs.ext4 /mnt/cnv/0 + # mount -o loop /mnt/cnv/0 /data The "seq" sub-directory grouping files for sequential write zones has in this -example 55356 zones. +example 55356 zones:: -# ls -lv /mnt/seq -total 14511243264 --rw-r----- 1 root root 0 Nov 25 13:23 0 --rw-r----- 1 root root 0 Nov 25 13:23 1 --rw-r----- 1 root root 0 Nov 25 13:23 2 -... --rw-r----- 1 root root 0 Nov 25 13:23 55354 --rw-r----- 1 root root 0 Nov 25 13:23 55355 + # ls -lv /mnt/seq + total 14511243264 + -rw-r----- 1 root root 0 Nov 25 13:23 0 + -rw-r----- 1 root root 0 Nov 25 13:23 1 + -rw-r----- 1 root root 0 Nov 25 13:23 2 + ... + -rw-r----- 1 root root 0 Nov 25 13:23 55354 + -rw-r----- 1 root root 0 Nov 25 13:23 55355 For sequential write zone files, the file size changes as data is appended at -the end of the file, similarly to any regular file system. 
+the end of the file, similarly to any regular file system:: -# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct -1+0 records in -1+0 records out -4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s + # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct + 1+0 records in + 1+0 records out + 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s -# ls -l /mnt/seq/0 --rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 The written file can be truncated to the zone size, preventing any further -write operation. +write operation:: -# truncate -s 268435456 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 + # truncate -s 268435456 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 Truncation to 0 size allows freeing the file zone storage space and restart -append-writes to the file. +append-writes to the file:: -# truncate -s 0 /mnt/seq/0 -# ls -l /mnt/seq/0 --rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 + # truncate -s 0 /mnt/seq/0 + # ls -l /mnt/seq/0 + -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 Since files are statically mapped to zones on the disk, the number of blocks of -a file as reported by stat() and fstat() indicates the size of the file zone. 
- -# stat /mnt/seq/0 - File: /mnt/seq/0 - Size: 0 Blocks: 524288 IO Block: 4096 regular empty file -Device: 870h/2160d Inode: 50431 Links: 1 -Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) -Access: 2019-11-25 13:23:57.048971997 +0900 -Modify: 2019-11-25 13:52:25.553805765 +0900 -Change: 2019-11-25 13:52:25.553805765 +0900 - Birth: - +a file as reported by stat() and fstat() indicates the size of the file zone:: + + # stat /mnt/seq/0 + File: /mnt/seq/0 + Size: 0 Blocks: 524288 IO Block: 4096 regular empty file + Device: 870h/2160d Inode: 50431 Links: 1 + Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) + Access: 2019-11-25 13:23:57.048971997 +0900 + Modify: 2019-11-25 13:52:25.553805765 +0900 + Change: 2019-11-25 13:52:25.553805765 +0900 + Birth: - The number of blocks of the file ("Blocks") in units of 512B blocks gives the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst index e539c42a3e78..cc74e24ca3b5 100644 --- a/Documentation/gpu/i915.rst +++ b/Documentation/gpu/i915.rst @@ -207,10 +207,10 @@ DPIO CSR firmware support for DMC ---------------------------- -.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c :doc: csr support for dmc -.. kernel-doc:: drivers/gpu/drm/i915/intel_csr.c +.. kernel-doc:: drivers/gpu/drm/i915/display/intel_csr.c :internal: Video BIOS Table (VBT) diff --git a/Documentation/hwmon/adm1177.rst b/Documentation/hwmon/adm1177.rst index c81e0b4abd28..471be1e98d6f 100644 --- a/Documentation/hwmon/adm1177.rst +++ b/Documentation/hwmon/adm1177.rst @@ -20,8 +20,7 @@ Usage Notes ----------- This driver does not auto-detect devices. You will have to instantiate the -devices explicitly. Please see Documentation/i2c/instantiating-devices for -details. +devices explicitly. Please see :doc:`/i2c/instantiating-devices` for details. 
Sysfs entries diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst index b24adb67ddca..8ef62fd39787 100644 --- a/Documentation/hwmon/index.rst +++ b/Documentation/hwmon/index.rst @@ -162,6 +162,7 @@ Hardware Monitoring Kernel Drivers tmp421 tmp513 tps40422 + tps53679 twl4030-madc-hwmon ucd9000 ucd9200 diff --git a/Documentation/hwmon/isl68137.rst b/Documentation/hwmon/isl68137.rst index a5a7c8545c9e..cc4b61447b63 100644 --- a/Documentation/hwmon/isl68137.rst +++ b/Documentation/hwmon/isl68137.rst @@ -3,7 +3,7 @@ Kernel driver isl68137 Supported chips: - * Intersil ISL68137 + * Renesas ISL68137 Prefix: 'isl68137' @@ -11,19 +11,405 @@ Supported chips: Datasheet: - Publicly available at the Intersil website - https://www.intersil.com/content/dam/Intersil/documents/isl6/isl68137.pdf + Publicly available at the Renesas website + https://www.renesas.com/us/en/www/doc/datasheet/isl68137.pdf + + * Renesas ISL68220 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68221 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68222 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68223 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68224 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68225 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68226 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - 
+ + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68227 + + Prefix: 'raa_dmpvr2_1rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68229 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68233 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL68239 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69222 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69223 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69224 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69225 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69227 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69228 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69234 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69236 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses 
scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69239 + + Prefix: 'raa_dmpvr2_3rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69242 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69243 + + Prefix: 'raa_dmpvr2_1rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69247 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69248 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69254 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69255 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69256 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69259 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69260 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69268 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69269 + + Prefix: 'raa_dmpvr2_3rail' + + 
Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas ISL69298 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA228000 + + Prefix: 'raa_dmpvr2_hv' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA228004 + + Prefix: 'raa_dmpvr2_hv' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA228006 + + Prefix: 'raa_dmpvr2_hv' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA228228 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA229001 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website + + * Renesas RAA229004 + + Prefix: 'raa_dmpvr2_2rail' + + Addresses scanned: - + + Datasheet: + + Publicly available (after August 2020 launch) at the Renesas website Authors: - Maxim Sloyko <maxims@google.com> - Robert Lippert <rlippert@google.com> - Patrick Venture <venture@google.com> + - Grant Peltier <grant.peltier.jg@renesas.com> Description ----------- -Intersil ISL68137 is a digital output 7-phase configurable PWM -controller with an AVSBus interface. +This driver supports the Renesas ISL68137 and all 2nd generation Renesas +digital multiphase voltage regulators (raa_dmpvr2). The ISL68137 is a digital +output 7-phase configurable PWM controller with an AVSBus interface. 
2nd +generation devices are grouped into 4 distinct configurations: '1rail' for +single-rail devices, '2rail' for dual-rail devices, '3rail' for 3-rail devices, +and 'hv' for high voltage single-rail devices. Consult the individual datasheets +for more information. Usage Notes ----------- @@ -33,10 +419,14 @@ devices explicitly. The ISL68137 AVS operation mode must be enabled/disabled at runtime. -Beyond the normal sysfs pmbus attributes, the driver exposes a control attribute. +Beyond the normal sysfs pmbus attributes, the driver exposes a control attribute +for the ISL68137. + +For 2nd generation Renesas digital multiphase voltage regulators, only the +normal sysfs pmbus attributes are supported. -Additional Sysfs attributes ---------------------------- +ISL68137 sysfs attributes +------------------------- ======================= ==================================== avs(0|1)_enable Controls the AVS state of each rail. @@ -78,3 +468,138 @@ temp[1-3]_crit_alarm Chip temperature critical high alarm temp[1-3]_max Maximum temperature temp[1-3]_max_alarm Chip temperature high alarm ======================= ==================================== + +raa_dmpvr2_1rail/hv sysfs attributes +------------------------------------ + +======================= ========================================== +curr1_label "iin" +curr1_input Measured input current +curr1_crit Critical maximum current +curr1_crit_alarm Current critical high alarm + +curr2_label "iout" +curr2_input Measured output current +curr2_crit Critical maximum current +curr2_crit_alarm Current critical high alarm + +in1_label "vin" +in1_input Measured input voltage +in1_lcrit Critical minimum input voltage +in1_lcrit_alarm Input voltage critical low alarm +in1_crit Critical maximum input voltage +in1_crit_alarm Input voltage critical high alarm + +in2_label "vmon" +in2_input Scaled VMON voltage read from the VMON pin + +in3_label "vout" +in3_input Measured output voltage +in3_lcrit Critical minimum output voltage 
+in3_lcrit_alarm Output voltage critical low alarm +in3_crit Critical maximum output voltage +in3_crit_alarm Output voltage critical high alarm + +power1_label "pin" +power1_input Measured input power +power1_alarm Input power high alarm + +power2_label "pout" +power2_input Measured output power + +temp[1-3]_input Measured temperature +temp[1-3]_crit Critical high temperature +temp[1-3]_crit_alarm Chip temperature critical high alarm +temp[1-3]_max Maximum temperature +temp[1-3]_max_alarm Chip temperature high alarm +======================= ========================================== + +raa_dmpvr2_2rail sysfs attributes +--------------------------------- + +======================= ========================================== +curr[1-2]_label "iin[1-2]" +curr[1-2]_input Measured input current +curr[1-2]_crit Critical maximum current +curr[1-2]_crit_alarm Current critical high alarm + +curr[3-4]_label "iout[1-2]" +curr[3-4]_input Measured output current +curr[3-4]_crit Critical maximum current +curr[3-4]_crit_alarm Current critical high alarm + +in1_label "vin" +in1_input Measured input voltage +in1_lcrit Critical minimum input voltage +in1_lcrit_alarm Input voltage critical low alarm +in1_crit Critical maximum input voltage +in1_crit_alarm Input voltage critical high alarm + +in2_label "vmon" +in2_input Scaled VMON voltage read from the VMON pin + +in[3-4]_label "vout[1-2]" +in[3-4]_input Measured output voltage +in[3-4]_lcrit Critical minimum output voltage +in[3-4]_lcrit_alarm Output voltage critical low alarm +in[3-4]_crit Critical maximum output voltage +in[3-4]_crit_alarm Output voltage critical high alarm + +power[1-2]_label "pin[1-2]" +power[1-2]_input Measured input power +power[1-2]_alarm Input power high alarm + +power[3-4]_label "pout[1-2]" +power[3-4]_input Measured output power + +temp[1-5]_input Measured temperature +temp[1-5]_crit Critical high temperature +temp[1-5]_crit_alarm Chip temperature critical high alarm +temp[1-5]_max Maximum temperature 
+temp[1-5]_max_alarm Chip temperature high alarm +======================= ========================================== + +raa_dmpvr2_3rail sysfs attributes +--------------------------------- + +======================= ========================================== +curr[1-3]_label "iin[1-3]" +curr[1-3]_input Measured input current +curr[1-3]_crit Critical maximum current +curr[1-3]_crit_alarm Current critical high alarm + +curr[4-6]_label "iout[1-3]" +curr[4-6]_input Measured output current +curr[4-6]_crit Critical maximum current +curr[4-6]_crit_alarm Current critical high alarm + +in1_label "vin" +in1_input Measured input voltage +in1_lcrit Critical minimum input voltage +in1_lcrit_alarm Input voltage critical low alarm +in1_crit Critical maximum input voltage +in1_crit_alarm Input voltage critical high alarm + +in2_label "vmon" +in2_input Scaled VMON voltage read from the VMON pin + +in[3-5]_label "vout[1-3]" +in[3-5]_input Measured output voltage +in[3-5]_lcrit Critical minimum output voltage +in[3-5]_lcrit_alarm Output voltage critical low alarm +in[3-5]_crit Critical maximum output voltage +in[3-5]_crit_alarm Output voltage critical high alarm + +power[1-3]_label "pin[1-3]" +power[1-3]_input Measured input power +power[1-3]_alarm Input power high alarm + +power[4-6]_label "pout[1-3]" +power[4-6]_input Measured output power + +temp[1-7]_input Measured temperature +temp[1-7]_crit Critical high temperature +temp[1-7]_crit_alarm Chip temperature critical high alarm +temp[1-7]_max Maximum temperature +temp[1-7]_max_alarm Chip temperature high alarm +======================= ========================================== diff --git a/Documentation/hwmon/k10temp.rst b/Documentation/hwmon/k10temp.rst index 4451d59b9425..8557e26281c3 100644 --- a/Documentation/hwmon/k10temp.rst +++ b/Documentation/hwmon/k10temp.rst @@ -100,9 +100,10 @@ socket type, not the processor's actual capabilities. 
Therefore, if you are using an AM3 processor on an AM2+ mainboard, you can safely use the "force=1" parameter. -There is one temperature measurement value, available as temp1_input in -sysfs. It is measured in degrees Celsius with a resolution of 1/8th degree. -Please note that it is defined as a relative value; to quote the AMD manual:: +For CPUs older than Family 17h, there is one temperature measurement value, +available as temp1_input in sysfs. It is measured in degrees Celsius with a +resolution of 1/8th degree. Please note that it is defined as a relative +value; to quote the AMD manual:: Tctl is the processor temperature control value, used by the platform to control cooling systems. Tctl is a non-physical temperature on an @@ -126,3 +127,25 @@ it. Models from 17h family report relative temperature, the driver aims to compensate and report the real temperature. + +On Family 17h and Family 18h CPUs, additional temperature sensors may report +Core Complex Die (CCD) temperatures. Up to 8 such temperatures are reported +as temp{3..10}_input, labeled Tccd{1..8}. Actual support depends on the CPU +variant. + +Various Family 17h and 18h CPUs report voltage and current telemetry +information. The following attributes may be reported. + +Attribute Label Description +=============== ======= ================ +in0_input Vcore Core voltage +in1_input Vsoc SoC voltage +curr1_input Icore Core current +curr2_input Isoc SoC current +=============== ======= ================ + +Current values are raw (unscaled) as reported by the CPU. Core current is +reported as multiples of 1A / LSB. SoC is reported as multiples of 0.25A +/ LSB. The real current is board specific. Reported currents should be seen +as rough guidance, and should be scaled using sensors3.conf as appropriate +for a given board. 
diff --git a/Documentation/hwmon/ltc2978.rst b/Documentation/hwmon/ltc2978.rst index 01a24fd6d5fe..bc5270e5a477 100644 --- a/Documentation/hwmon/ltc2978.rst +++ b/Documentation/hwmon/ltc2978.rst @@ -3,13 +3,21 @@ Kernel driver ltc2978 Supported chips: + * Linear Technology LTC2972 + + Prefix: 'ltc2972' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltc2972.html + * Linear Technology LTC2974 Prefix: 'ltc2974' Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc2974 + Datasheet: https://www.analog.com/en/products/ltc2974 * Linear Technology LTC2975 @@ -17,7 +25,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc2975 + Datasheet: https://www.analog.com/en/products/ltc2975 * Linear Technology LTC2977 @@ -25,7 +33,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc2977 + Datasheet: https://www.analog.com/en/products/ltc2977 * Linear Technology LTC2978, LTC2978A @@ -33,9 +41,17 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc2978 + Datasheet: https://www.analog.com/en/products/ltc2978 + + https://www.analog.com/en/products/ltc2978a + + * Linear Technology LTC2979 - http://www.linear.com/product/ltc2978a + Prefix: 'ltc2979' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltc2979 * Linear Technology LTC2980 @@ -43,7 +59,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc2980 + Datasheet: https://www.analog.com/en/products/ltc2980 * Linear Technology LTC3880 @@ -51,7 +67,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc3880 + Datasheet: https://www.analog.com/en/products/ltc3880 * Linear Technology LTC3882 @@ -59,7 +75,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc3882 + Datasheet: https://www.analog.com/en/products/ltc3882 * Linear Technology LTC3883 @@ -67,7 +83,15 @@ 
Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc3883 + Datasheet: https://www.analog.com/en/products/ltc3883 + + * Linear Technology LTC3884 + + Prefix: 'ltc3884' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltc3884 * Linear Technology LTC3886 @@ -75,7 +99,7 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc3886 + Datasheet: https://www.analog.com/en/products/ltc3886 * Linear Technology LTC3887 @@ -83,7 +107,23 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltc3887 + Datasheet: https://www.analog.com/en/products/ltc3887 + + * Linear Technology LTC3889 + + Prefix: 'ltc3889' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltc3889 + + * Linear Technology LTC7880 + + Prefix: 'ltc7880' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltc7880 * Linear Technology LTM2987 @@ -91,15 +131,23 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltm2987 + Datasheet: https://www.analog.com/en/products/ltm2987 - * Linear Technology LTM4675 + * Linear Technology LTM4644 + + Prefix: 'ltm4644' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltm4644 + + * Linear Technology LTM4675 Prefix: 'ltm4675' Addresses scanned: - - Datasheet: http://www.linear.com/product/ltm4675 + Datasheet: https://www.analog.com/en/products/ltm4675 * Linear Technology LTM4676 @@ -107,7 +155,31 @@ Supported chips: Addresses scanned: - - Datasheet: http://www.linear.com/product/ltm4676 + Datasheet: https://www.analog.com/en/products/ltm4676 + + * Linear Technology LTM4677 + + Prefix: 'ltm4677' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltm4677 + + * Linear Technology LTM4678 + + Prefix: 'ltm4678' + + Addresses scanned: - + + Datasheet: https://www.analog.com/en/products/ltm4678 + + * Analog Devices LTM4680 + + Prefix: 'ltm4680' + + 
Addresses scanned: - + + Datasheet: http://www.analog.com/ltm4680 * Analog Devices LTM4686 @@ -117,6 +189,15 @@ Supported chips: Datasheet: http://www.analog.com/ltm4686 + * Analog Devices LTM4700 + + Prefix: 'ltm4700' + + Addresses scanned: - + + Datasheet: http://www.analog.com/ltm4700 + + Author: Guenter Roeck <linux@roeck-us.net> @@ -166,13 +247,13 @@ in1_min Minimum input voltage. in1_max Maximum input voltage. - LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, and - LTM2987 only. + LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, + LTC2979 and LTM2987 only. in1_lcrit Critical minimum input voltage. - LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, and - LTM2987 only. + LTC2972, LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, + LTC2979 and LTM2987 only. in1_crit Critical maximum input voltage. @@ -180,29 +261,34 @@ in1_min_alarm Input voltage low alarm. in1_max_alarm Input voltage high alarm. - LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, and - LTM2987 only. + LTC2972, LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, + LTC2979 and LTM2987 only. + in1_lcrit_alarm Input voltage critical low alarm. - LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, and - LTM2987 only. + LTC2972, LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, + LTC2979 and LTM2987 only. + in1_crit_alarm Input voltage critical high alarm. in1_lowest Lowest input voltage. - LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, and - LTM2987 only. + LTC2972, LTC2974, LTC2975, LTC2977, LTC2980, LTC2978, + and LTM2987 only. + in1_highest Highest input voltage. in1_reset_history Reset input voltage history. in[N]_label "vout[1-8]". + - LTC2972: N=2-3 - LTC2974, LTC2975: N=2-5 - - LTC2977, LTC2980, LTM2987: N=2-9 + - LTC2977, LTC2979, LTC2980, LTM2987: N=2-9 - LTC2978: N=2-9 - - LTC3880, LTC3882, LTC23886 LTC3887, LTM4675, LTM4676: - N=2-3 + - LTC3880, LTC3882, LTC3884, LTC23886 LTC3887, LTC3889, + LTC7880, LTM4644, LTM4675, LTM4676, LTM4677, LTM4678, + LTM4680, LTM4700: N=2-3 - LTC3883: N=2 in[N]_input Measured output voltage. 
@@ -225,8 +311,7 @@ in[N]_crit_alarm Output voltage critical high alarm. in[N]_lowest Lowest output voltage. - - LTC2974, LTC2975,and LTC2978 only. + LTC2972, LTC2974, LTC2975,and LTC2978 only. in[N]_highest Highest output voltage. @@ -234,20 +319,24 @@ in[N]_reset_history Reset output voltage history. temp[N]_input Measured temperature. + - On LTC2972, temp[1-2] report external temperatures, + and temp 3 reports the chip temperature. - On LTC2974 and LTC2975, temp[1-4] report external temperatures, and temp5 reports the chip temperature. - - On LTC2977, LTC2980, LTC2978, and LTM2987, only one - temperature measurement is supported and reports - the chip temperature. - - On LTC3880, LTC3882, LTC3887, LTM4675, and LTM4676, - temp1 and temp2 report external temperatures, and - temp3 reports the chip temperature. + - On LTC2977, LTC2979, LTC2980, LTC2978, and LTM2987, + only one temperature measurement is supported and + reports the chip temperature. + - On LTC3880, LTC3882, LTC3886, LTC3887, LTC3889, + LTM4664, LTM4675, LTM4676, LTM4677, LTM4678, LTM4680, + and LTM4700, temp1 and temp2 report external + temperatures, and temp3 reports the chip temperature. - On LTC3883, temp1 reports an external temperature, and temp2 reports the chip temperature. temp[N]_min Mimimum temperature. - LTC2974, LCT2977, LTM2980, LTC2978, and LTM2987 only. + LTC2972, LTC2974, LCT2977, LTM2980, LTC2978, + LTC2979, and LTM2987 only. temp[N]_max Maximum temperature. @@ -257,8 +346,8 @@ temp[N]_crit Critical high temperature. temp[N]_min_alarm Temperature low alarm. - LTC2974, LTC2975, LTC2977, LTM2980, LTC2978, and - LTM2987 only. + LTC2972, LTC2974, LTC2975, LTC2977, LTM2980, LTC2978, + LTC2979, and LTM2987 only. temp[N]_max_alarm Temperature high alarm. @@ -269,8 +358,8 @@ temp[N]_crit_alarm Temperature critical high alarm. temp[N]_lowest Lowest measured temperature. - - LTC2974, LTC2975, LTC2977, LTM2980, LTC2978, and - LTM2987 only. 
+ - LTC2972, LTC2974, LTC2975, LTC2977, LTM2980, LTC2978, + LTC2979, and LTM2987 only. - Not supported for chip temperature sensor on LTC2974 and LTC2975. @@ -290,19 +379,22 @@ power1_input Measured input power. power[N]_label "pout[1-4]". + - LTC2972: N=1-2 - LTC2974, LTC2975: N=1-4 - - LTC2977, LTC2980, LTM2987: Not supported + - LTC2977, LTC2979, LTC2980, LTM2987: Not supported - LTC2978: Not supported - - LTC3880, LTC3882, LTC3886, LTC3887, LTM4675, LTM4676: - N=1-2 + - LTC3880, LTC3882, LTC3884, LTC3886, LTC3887, LTC3889, + LTM4664, LTM4675, LTM4676, LTM4677, LTM4678, LTM4680, + LTM4700: N=1-2 - LTC3883: N=2 power[N]_input Measured output power. curr1_label "iin". - LTC3880, LTC3883, LTC3886, LTC3887, LTM4675, - and LTM4676 only. + LTC3880, LTC3883, LTC3884, LTC3886, LTC3887, LTC3889, + LTM4644, LTM4675, LTM4676, LTM4677, LTM4678, LTM4680, + and LTM4700 only. curr1_input Measured input current. @@ -320,11 +412,13 @@ curr1_reset_history Reset input current history. curr[N]_label "iout[1-4]". + - LTC2972: N-1-2 - LTC2974, LTC2975: N=1-4 - - LTC2977, LTC2980, LTM2987: not supported + - LTC2977, LTC2979, LTC2980, LTM2987: not supported - LTC2978: not supported - - LTC3880, LTC3882, LTC3886, LTC3887, LTM4675, LTM4676: - N=2-3 + - LTC3880, LTC3882, LTC3884, LTC3886, LTC3887, LTC3889, + LTM4664, LTM4675, LTM4676, LTM4677, LTM4678, LTM4680, + LTM4700: N=2-3 - LTC3883: N=2 curr[N]_input Measured output current. @@ -335,7 +429,7 @@ curr[N]_crit Critical high output current. curr[N]_lcrit Critical low output current. - LTC2974 and LTC2975 only. + LTC2972, LTC2974 and LTC2975 only. curr[N]_max_alarm Output current high alarm. @@ -343,11 +437,11 @@ curr[N]_crit_alarm Output current critical high alarm. curr[N]_lcrit_alarm Output current critical low alarm. - LTC2974 and LTC2975 only. + LTC2972, LTC2974 and LTC2975 only. curr[N]_lowest Lowest output current. - LTC2974 and LTC2975 only. + LTC2972, LTC2974 and LTC2975 only. curr[N]_highest Highest output current. 
diff --git a/Documentation/hwmon/pmbus-core.rst b/Documentation/hwmon/pmbus-core.rst index 92515c446fe3..501b37b0610d 100644 --- a/Documentation/hwmon/pmbus-core.rst +++ b/Documentation/hwmon/pmbus-core.rst @@ -162,9 +162,12 @@ Read byte from page <page>, register <reg>. :: - int (*read_word_data)(struct i2c_client *client, int page, int reg); + int (*read_word_data)(struct i2c_client *client, int page, int phase, + int reg); -Read word from page <page>, register <reg>. +Read word from page <page>, phase <pase>, register <reg>. If the chip does not +support multiple phases, the phase parameter can be ignored. If the chip +supports multiple phases, a phase value of 0xff indicates all phases. :: @@ -201,16 +204,21 @@ is mandatory. :: - int pmbus_set_page(struct i2c_client *client, u8 page); + int pmbus_set_page(struct i2c_client *client, u8 page, u8 phase); -Set PMBus page register to <page> for subsequent commands. +Set PMBus page register to <page> and <phase> for subsequent commands. +If the chip does not support multiple phases, the phase parameter is +ignored. Otherwise, a phase value of 0xff selects all phases. :: - int pmbus_read_word_data(struct i2c_client *client, u8 page, u8 reg); + int pmbus_read_word_data(struct i2c_client *client, u8 page, u8 phase, + u8 reg); -Read word data from <page>, <reg>. Similar to i2c_smbus_read_word_data(), but -selects page first. +Read word data from <page>, <phase>, <reg>. Similar to +i2c_smbus_read_word_data(), but selects page and phase first. If the chip does +not support multiple phases, the phase parameter is ignored. Otherwise, a phase +value of 0xff selects all phases. :: diff --git a/Documentation/hwmon/pmbus.rst b/Documentation/hwmon/pmbus.rst index f787984e88a9..2658ddee70eb 100644 --- a/Documentation/hwmon/pmbus.rst +++ b/Documentation/hwmon/pmbus.rst @@ -227,7 +227,9 @@ currX_lcrit_alarm Output current critical low alarm. From IOUT_UC_FAULT status. currX_crit_alarm Current critical high alarm. 
From IIN_OC_FAULT or IOUT_OC_FAULT status. -currX_label "iin" or "ioutY" +currX_label "iin", "iinY", "iinY.Z", "ioutY", or "ioutY.Z", + where Y reflects the page number and Z reflects the + phase. powerX_input Measured power. From READ_PIN or READ_POUT register. powerX_cap Output power cap. From POUT_MAX register. @@ -239,7 +241,9 @@ powerX_alarm Power high alarm. From PIN_OP_WARNING or POUT_OP_WARNING status. powerX_crit_alarm Output power critical high alarm. From POUT_OP_FAULT status. -powerX_label "pin" or "poutY" +powerX_label "pin", "pinY", "pinY.Z", "poutY", or "poutY.Z", + where Y reflects the page number and Z reflects the + phase. tempX_input Measured temperature. From READ_TEMPERATURE_X register. diff --git a/Documentation/hwmon/tps53679.rst b/Documentation/hwmon/tps53679.rst new file mode 100644 index 000000000000..be94cab78967 --- /dev/null +++ b/Documentation/hwmon/tps53679.rst @@ -0,0 +1,178 @@ +Kernel driver tps53679 +====================== + +Supported chips: + + * Texas Instruments TPS53647 + + Prefix: 'tps53647' + + Addresses scanned: - + + Datasheet: http://www.ti.com/lit/gpn/tps53647 + + * Texas Instruments TPS53667 + + Prefix: 'tps53667' + + Addresses scanned: - + + Datasheet: http://www.ti.com/lit/gpn/TPS53667 + + * Texas Instruments TPS53679 + + Prefix: 'tps53679' + + Addresses scanned: - + + Datasheet: http://www.ti.com/lit/gpn/TPS53679 (short version) + + * Texas Instruments TPS53681 + + Prefix: 'tps53681' + + Addresses scanned: - + + Datasheet: http://www.ti.com/lit/gpn/TPS53681 + + * Texas Instruments TPS53688 + + Prefix: 'tps53688' + + Addresses scanned: - + + Datasheet: Available under NDA + + +Authors: + Vadim Pasternak <vadimp@mellanox.com> + Guenter Roeck <linux@roeck-us.net> + + +Description +----------- + +Chips in this series are multi-phase step-down converters with one or two +output channels and up to 8 phases per channel. + + +Usage Notes +----------- + +This driver does not probe for PMBus devices. 
You will have to instantiate +devices explicitly. + +Example: the following commands will load the driver for an TPS53681 at address +0x60 on I2C bus #1:: + + # modprobe tps53679 + # echo tps53681 0x60 > /sys/bus/i2c/devices/i2c-1/new_device + + +Sysfs attributes +---------------- + +======================= ======================================================== +in1_label "vin" + +in1_input Measured input voltage. + +in1_lcrit Critical minimum input voltage + + TPS53679, TPS53681, TPS53688 only. + +in1_lcrit_alarm Input voltage critical low alarm. + + TPS53679, TPS53681, TPS53688 only. + +in1_crit Critical maximum input voltage. + +in1_crit_alarm Input voltage critical high alarm. + +in[N]_label "vout[1-2]" + + - TPS53647, TPS53667: N=2 + - TPS53679, TPS53588: N=2,3 + +in[N]_input Measured output voltage. + +in[N]_lcrit Critical minimum input voltage. + + TPS53679, TPS53681, TPS53688 only. + +in[N]_lcrit_alarm Critical minimum voltage alarm. + + TPS53679, TPS53681, TPS53688 only. + +in[N]_alarm Output voltage alarm. + + TPS53647, TPS53667 only. + +in[N]_crit Critical maximum output voltage. + + TPS53679, TPS53681, TPS53688 only. + +in[N]_crit_alarm Output voltage critical high alarm. + + TPS53679, TPS53681, TPS53688 only. + +temp[N]_input Measured temperature. + + - TPS53647, TPS53667: N=1 + - TPS53679, TPS53681, TPS53588: N=1,2 + +temp[N]_max Maximum temperature. + +temp[N]_crit Critical high temperature. + +temp[N]_max_alarm Temperature high alarm. + +temp[N]_crit_alarm Temperature critical high alarm. + +power1_label "pin". + +power1_input Measured input power. + +power[N]_label "pout[1-2]". + + - TPS53647, TPS53667: N=2 + - TPS53679, TPS53681, TPS53588: N=2,3 + +power[N]_input Measured output power. + +curr1_label "iin". + +curr1_input Measured input current. + +curr1_max Maximum input current. + +curr1_max_alarm Input current high alarm. + +curr1_crit Critical input current. + +curr1_crit_alarm Input current critical alarm. 
+ +curr[N]_label "iout[1-2]" or "iout1.[0-5]". + + The first digit is the output channel, the second + digit is the phase within the channel. Per-phase + telemetry supported on TPS53681 only. + + - TPS53647, TPS53667: N=2 + - TPS53679, TPS53588: N=2,3 + - TPS53681: N=2-9 + +curr[N]_input Measured output current. + +curr[N]_max Maximum output current. + +curr[N]_crit Critical high output current. + +curr[N]_max_alarm Output current high alarm. + +curr[N]_crit_alarm Output current critical high alarm. + + Limit and alarm attributes are only available for + non-phase telemetry (iout1, iout2). + +======================= ======================================================== diff --git a/Documentation/hwmon/xdpe12284.rst b/Documentation/hwmon/xdpe12284.rst index 6b7ae98cc536..67d1f87808e5 100644 --- a/Documentation/hwmon/xdpe12284.rst +++ b/Documentation/hwmon/xdpe12284.rst @@ -24,6 +24,7 @@ This driver implements support for Infineon Multi-phase XDPE122 family dual loop voltage regulators. The family includes XDPE12284 and XDPE12254 devices. The devices from this family complaint with: + - Intel VR13 and VR13HC rev 1.3, IMVP8 rev 1.2 and IMPVP9 rev 1.3 DC-DC converter specification. - Intel SVID rev 1.9. protocol. diff --git a/Documentation/index.rst b/Documentation/index.rst index e99d0bd2589d..9df95bab4de8 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -99,6 +99,7 @@ needed). accounting/index block/index cdrom/index + cpu-freq/index ide/index fb/index fpga/index @@ -131,7 +132,6 @@ needed). 
usb/index PCI/index misc-devices/index - mic/index scheduler/index Architecture-agnostic documentation diff --git a/Documentation/core-api/gcc-plugins.rst b/Documentation/kbuild/gcc-plugins.rst index 8502f24396fb..4b1c10f88e30 100644 --- a/Documentation/core-api/gcc-plugins.rst +++ b/Documentation/kbuild/gcc-plugins.rst @@ -72,6 +72,10 @@ e.g., on Ubuntu for gcc-4.9:: apt-get install gcc-4.9-plugin-dev +Or on Fedora:: + + dnf install gcc-plugin-devel + Enable a GCC plugin based feature in the kernel config:: CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y diff --git a/Documentation/kbuild/index.rst b/Documentation/kbuild/index.rst index 0f144fad99a6..82daf2efcb73 100644 --- a/Documentation/kbuild/index.rst +++ b/Documentation/kbuild/index.rst @@ -19,6 +19,7 @@ Kernel Build System issues reproducible-builds + gcc-plugins .. only:: subproject and html diff --git a/Documentation/kbuild/kbuild.rst b/Documentation/kbuild/kbuild.rst index f1e5dce86af7..510f38d7e78a 100644 --- a/Documentation/kbuild/kbuild.rst +++ b/Documentation/kbuild/kbuild.rst @@ -237,7 +237,7 @@ This is solely useful to speed up test compiles. KBUILD_EXTRA_SYMBOLS -------------------- For modules that use symbols from other modules. -See more details in modules.txt. +See more details in modules.rst. ALLSOURCE_ARCHS --------------- diff --git a/Documentation/kbuild/kconfig-macro-language.rst b/Documentation/kbuild/kconfig-macro-language.rst index 35b3263b7e40..8b413ef9603d 100644 --- a/Documentation/kbuild/kconfig-macro-language.rst +++ b/Documentation/kbuild/kconfig-macro-language.rst @@ -44,7 +44,7 @@ intermediate:: def_bool y Then, Kconfig moves onto the evaluation stage to resolve inter-symbol -dependency as explained in kconfig-language.txt. +dependency as explained in kconfig-language.rst. 
Variables diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst index 0e0eb2c8da7d..04d5c01a2e99 100644 --- a/Documentation/kbuild/makefiles.rst +++ b/Documentation/kbuild/makefiles.rst @@ -765,7 +765,7 @@ is not sufficient this sometimes needs to be explicit. Example:: #arch/x86/boot/Makefile - subdir- := compressed/ + subdir- := compressed The above assignment instructs kbuild to descend down in the directory compressed/ when "make clean" is executed. @@ -924,7 +924,7 @@ When kbuild executes, the following steps are followed (roughly): $(KBUILD_AFLAGS_MODULE) is used to add arch-specific options that are used for assembler. - From commandline AFLAGS_MODULE shall be used (see kbuild.txt). + From commandline AFLAGS_MODULE shall be used (see kbuild.rst). KBUILD_CFLAGS_KERNEL $(CC) options specific for built-in @@ -937,7 +937,7 @@ When kbuild executes, the following steps are followed (roughly): $(KBUILD_CFLAGS_MODULE) is used to add arch-specific options that are used for $(CC). - From commandline CFLAGS_MODULE shall be used (see kbuild.txt). + From commandline CFLAGS_MODULE shall be used (see kbuild.rst). KBUILD_LDFLAGS_MODULE Options for $(LD) when linking modules @@ -945,7 +945,7 @@ When kbuild executes, the following steps are followed (roughly): $(KBUILD_LDFLAGS_MODULE) is used to add arch-specific options used when linking modules. This is often a linker script. - From commandline LDFLAGS_MODULE shall be used (see kbuild.txt). + From commandline LDFLAGS_MODULE shall be used (see kbuild.rst). KBUILD_LDS @@ -1379,9 +1379,6 @@ See subsequent chapter for the syntax of the Kbuild file. in arch/$(ARCH)/include/(uapi/)/asm, Kbuild will automatically generate a wrapper of the asm-generic one. - The convention is to list one subdir per line and - preferably in alphabetic order. 
- 8 Kbuild Variables ================== diff --git a/Documentation/kbuild/modules.rst b/Documentation/kbuild/modules.rst index 69fa48ee93d6..e0b45a257f21 100644 --- a/Documentation/kbuild/modules.rst +++ b/Documentation/kbuild/modules.rst @@ -470,9 +470,9 @@ build. The syntax of the Module.symvers file is:: - <CRC> <Symbol> <Namespace> <Module> <Export Type> + <CRC> <Symbol> <Module> <Export Type> <Namespace> - 0xe1cc2a05 usb_stor_suspend USB_STORAGE drivers/usb/storage/usb-storage EXPORT_SYMBOL_GPL + 0xe1cc2a05 usb_stor_suspend drivers/usb/storage/usb-storage EXPORT_SYMBOL_GPL USB_STORAGE The fields are separated by tabs and values may be empty (e.g. if no namespace is defined for an exported symbol). diff --git a/Documentation/kernel-hacking/hacking.rst b/Documentation/kernel-hacking/hacking.rst index d62aacb2822a..eed2136d847f 100644 --- a/Documentation/kernel-hacking/hacking.rst +++ b/Documentation/kernel-hacking/hacking.rst @@ -601,7 +601,7 @@ Defined in ``include/linux/export.h`` This is the variant of `EXPORT_SYMBOL()` that allows specifying a symbol namespace. Symbol Namespaces are documented in -``Documentation/core-api/symbol-namespaces.rst``. +:doc:`../core-api/symbol-namespaces` :c:func:`EXPORT_SYMBOL_NS_GPL()` -------------------------------- @@ -610,7 +610,7 @@ Defined in ``include/linux/export.h`` This is the variant of `EXPORT_SYMBOL_GPL()` that allows specifying a symbol namespace. Symbol Namespaces are documented in -``Documentation/core-api/symbol-namespaces.rst``. 
+:doc:`../core-api/symbol-namespaces` Routines and Conventions ======================== diff --git a/Documentation/kernel-hacking/locking.rst b/Documentation/kernel-hacking/locking.rst index a8518ac0d31d..6ed806e6061b 100644 --- a/Documentation/kernel-hacking/locking.rst +++ b/Documentation/kernel-hacking/locking.rst @@ -150,17 +150,17 @@ Locking Only In User Context If you have a data structure which is only ever accessed from user context, then you can use a simple mutex (``include/linux/mutex.h``) to protect it. This is the most trivial case: you initialize the mutex. -Then you can call :c:func:`mutex_lock_interruptible()` to grab the -mutex, and :c:func:`mutex_unlock()` to release it. There is also a -:c:func:`mutex_lock()`, which should be avoided, because it will +Then you can call mutex_lock_interruptible() to grab the +mutex, and mutex_unlock() to release it. There is also a +mutex_lock(), which should be avoided, because it will not return if a signal is received. Example: ``net/netfilter/nf_sockopt.c`` allows registration of new -:c:func:`setsockopt()` and :c:func:`getsockopt()` calls, with -:c:func:`nf_register_sockopt()`. Registration and de-registration +setsockopt() and getsockopt() calls, with +nf_register_sockopt(). Registration and de-registration are only done on module load and unload (and boot time, where there is no concurrency), and the list of registrations is only consulted for an -unknown :c:func:`setsockopt()` or :c:func:`getsockopt()` system +unknown setsockopt() or getsockopt() system call. The ``nf_sockopt_mutex`` is perfect to protect this, especially since the setsockopt and getsockopt calls may well sleep. @@ -170,19 +170,19 @@ Locking Between User Context and Softirqs If a softirq shares data with user context, you have two problems. Firstly, the current user context can be interrupted by a softirq, and secondly, the critical region could be entered from another CPU. 
This is -where :c:func:`spin_lock_bh()` (``include/linux/spinlock.h``) is +where spin_lock_bh() (``include/linux/spinlock.h``) is used. It disables softirqs on that CPU, then grabs the lock. -:c:func:`spin_unlock_bh()` does the reverse. (The '_bh' suffix is +spin_unlock_bh() does the reverse. (The '_bh' suffix is a historical reference to "Bottom Halves", the old name for software interrupts. It should really be called spin_lock_softirq()' in a perfect world). -Note that you can also use :c:func:`spin_lock_irq()` or -:c:func:`spin_lock_irqsave()` here, which stop hardware interrupts +Note that you can also use spin_lock_irq() or +spin_lock_irqsave() here, which stop hardware interrupts as well: see `Hard IRQ Context <#hard-irq-context>`__. This works perfectly for UP as well: the spin lock vanishes, and this -macro simply becomes :c:func:`local_bh_disable()` +macro simply becomes local_bh_disable() (``include/linux/interrupt.h``), which protects you from the softirq being run. @@ -216,8 +216,8 @@ Different Tasklets/Timers ~~~~~~~~~~~~~~~~~~~~~~~~~ If another tasklet/timer wants to share data with your tasklet or timer -, you will both need to use :c:func:`spin_lock()` and -:c:func:`spin_unlock()` calls. :c:func:`spin_lock_bh()` is +, you will both need to use spin_lock() and +spin_unlock() calls. spin_lock_bh() is unnecessary here, as you are already in a tasklet, and none will be run on the same CPU. @@ -234,14 +234,14 @@ The same softirq can run on the other CPUs: you can use a per-CPU array going so far as to use a softirq, you probably care about scalable performance enough to justify the extra complexity. -You'll need to use :c:func:`spin_lock()` and -:c:func:`spin_unlock()` for shared data. +You'll need to use spin_lock() and +spin_unlock() for shared data. 
Different Softirqs ~~~~~~~~~~~~~~~~~~ -You'll need to use :c:func:`spin_lock()` and -:c:func:`spin_unlock()` for shared data, whether it be a timer, +You'll need to use spin_lock() and +spin_unlock() for shared data, whether it be a timer, tasklet, different softirq or the same or another softirq: any of them could be running on a different CPU. @@ -259,38 +259,38 @@ If a hardware irq handler shares data with a softirq, you have two concerns. Firstly, the softirq processing can be interrupted by a hardware interrupt, and secondly, the critical region could be entered by a hardware interrupt on another CPU. This is where -:c:func:`spin_lock_irq()` is used. It is defined to disable +spin_lock_irq() is used. It is defined to disable interrupts on that cpu, then grab the lock. -:c:func:`spin_unlock_irq()` does the reverse. +spin_unlock_irq() does the reverse. -The irq handler does not to use :c:func:`spin_lock_irq()`, because +The irq handler does not need to use spin_lock_irq(), because the softirq cannot run while the irq handler is running: it can use -:c:func:`spin_lock()`, which is slightly faster. The only exception +spin_lock(), which is slightly faster. The only exception would be if a different hardware irq handler uses the same lock: -:c:func:`spin_lock_irq()` will stop that from interrupting us. +spin_lock_irq() will stop that from interrupting us. This works perfectly for UP as well: the spin lock vanishes, and this -macro simply becomes :c:func:`local_irq_disable()` +macro simply becomes local_irq_disable() (``include/asm/smp.h``), which protects you from the softirq/tasklet/BH being run. -:c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) is a +spin_lock_irqsave() (``include/linux/spinlock.h``) is a variant which saves whether interrupts were on or off in a flags word, -which is passed to :c:func:`spin_unlock_irqrestore()`. This means +which is passed to spin_unlock_irqrestore(). 
This means that the same code can be used inside an hard irq handler (where interrupts are already off) and in softirqs (where the irq disabling is required). Note that softirqs (and hence tasklets and timers) are run on return -from hardware interrupts, so :c:func:`spin_lock_irq()` also stops -these. In that sense, :c:func:`spin_lock_irqsave()` is the most +from hardware interrupts, so spin_lock_irq() also stops +these. In that sense, spin_lock_irqsave() is the most general and powerful locking function. Locking Between Two Hard IRQ Handlers ------------------------------------- It is rare to have to share data between two IRQ handlers, but if you -do, :c:func:`spin_lock_irqsave()` should be used: it is +do, spin_lock_irqsave() should be used: it is architecture-specific whether all interrupts are disabled inside irq handlers themselves. @@ -304,11 +304,11 @@ Pete Zaitcev gives the following summary: (``copy_from_user*(`` or ``kmalloc(x,GFP_KERNEL)``). - Otherwise (== data can be touched in an interrupt), use - :c:func:`spin_lock_irqsave()` and - :c:func:`spin_unlock_irqrestore()`. + spin_lock_irqsave() and + spin_unlock_irqrestore(). - Avoid holding spinlock for more than 5 lines of code and across any - function call (except accessors like :c:func:`readb()`). + function call (except accessors like readb()). Table of Minimum Requirements ----------------------------- @@ -320,7 +320,7 @@ particular thread can only run on one CPU at a time, but if it needs shares data with another thread, locking is required). Remember the advice above: you can always use -:c:func:`spin_lock_irqsave()`, which is a superset of all other +spin_lock_irqsave(), which is a superset of all other spinlock primitives. ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== @@ -363,13 +363,13 @@ They can be used if you need no access to the data protected with the lock when some other thread is holding the lock. 
You should acquire the lock later if you then need access to the data protected with the lock. -:c:func:`spin_trylock()` does not spin but returns non-zero if it +spin_trylock() does not spin but returns non-zero if it acquires the spinlock on the first try or 0 if not. This function can be -used in all contexts like :c:func:`spin_lock()`: you must have +used in all contexts like spin_lock(): you must have disabled the contexts that might interrupt you and acquire the spin lock. -:c:func:`mutex_trylock()` does not suspend your task but returns +mutex_trylock() does not suspend your task but returns non-zero if it could lock the mutex on the first try or 0 if not. This function cannot be safely used in hardware or software interrupt contexts despite not sleeping. @@ -490,14 +490,14 @@ easy, since we copy the data for the user, and never let them access the objects directly. There is a slight (and common) optimization here: in -:c:func:`cache_add()` we set up the fields of the object before +cache_add() we set up the fields of the object before grabbing the lock. This is safe, as no-one else can access it until we put it in cache. Accessing From Interrupt Context -------------------------------- -Now consider the case where :c:func:`cache_find()` can be called +Now consider the case where cache_find() can be called from interrupt context: either a hardware interrupt or a softirq. An example would be a timer which deletes object from the cache. @@ -566,16 +566,16 @@ which are taken away, and the ``+`` are lines which are added. return ret; } -Note that the :c:func:`spin_lock_irqsave()` will turn off +Note that the spin_lock_irqsave() will turn off interrupts if they are on, otherwise does nothing (if we are already in an interrupt handler), hence these functions are safe to call from any context. -Unfortunately, :c:func:`cache_add()` calls :c:func:`kmalloc()` +Unfortunately, cache_add() calls kmalloc() with the ``GFP_KERNEL`` flag, which is only legal in user context. 
I -have assumed that :c:func:`cache_add()` is still only called in +have assumed that cache_add() is still only called in user context, otherwise this should become a parameter to -:c:func:`cache_add()`. +cache_add(). Exposing Objects Outside This File ---------------------------------- @@ -592,7 +592,7 @@ This makes locking trickier, as it is no longer all in one place. The second problem is the lifetime problem: if another structure keeps a pointer to an object, it presumably expects that pointer to remain valid. Unfortunately, this is only guaranteed while you hold the lock, -otherwise someone might call :c:func:`cache_delete()` and even +otherwise someone might call cache_delete() and even worse, add another object, re-using the same address. As there is only one lock, you can't hold it forever: no-one else would @@ -693,8 +693,8 @@ Here is the code:: We encapsulate the reference counting in the standard 'get' and 'put' functions. Now we can return the object itself from -:c:func:`cache_find()` which has the advantage that the user can -now sleep holding the object (eg. to :c:func:`copy_to_user()` to +cache_find() which has the advantage that the user can +now sleep holding the object (eg. to copy_to_user() to name to userspace). The other point to note is that I said a reference should be held for @@ -710,7 +710,7 @@ number of atomic operations defined in ``include/asm/atomic.h``: these are guaranteed to be seen atomically from all CPUs in the system, so no lock is required. In this case, it is simpler than using spinlocks, although for anything non-trivial using spinlocks is clearer. The -:c:func:`atomic_inc()` and :c:func:`atomic_dec_and_test()` +atomic_inc() and atomic_dec_and_test() are used instead of the standard increment and decrement operators, and the lock is no longer used to protect the reference count itself. 
@@ -802,7 +802,7 @@ name to change, there are three possibilities: - You can make ``cache_lock`` non-static, and tell people to grab that lock before changing the name in any object. -- You can provide a :c:func:`cache_obj_rename()` which grabs this +- You can provide a cache_obj_rename() which grabs this lock and changes the name for the caller, and tell everyone to use that function. @@ -861,11 +861,11 @@ Note that I decide that the popularity count should be protected by the ``cache_lock`` rather than the per-object lock: this is because it (like the :c:type:`struct list_head <list_head>` inside the object) is logically part of the infrastructure. This way, I don't need to grab -the lock of every object in :c:func:`__cache_add()` when seeking +the lock of every object in __cache_add() when seeking the least popular. I also decided that the id member is unchangeable, so I don't need to -grab each object lock in :c:func:`__cache_find()` to examine the +grab each object lock in __cache_find() to examine the id: the object lock is only used by a caller who wants to read or write the name field. @@ -887,7 +887,7 @@ trivial to diagnose: not a stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem. For a slightly more complex case, imagine you have a region shared by a -softirq and user context. If you use a :c:func:`spin_lock()` call +softirq and user context. If you use a spin_lock() call to protect it, it is possible that the user context will be interrupted by the softirq while it holds the lock, and the softirq will then spin forever trying to get the same lock. 
@@ -985,12 +985,12 @@ you might do the following:: Sooner or later, this will crash on SMP, because a timer can have just -gone off before the :c:func:`spin_lock_bh()`, and it will only get -the lock after we :c:func:`spin_unlock_bh()`, and then try to free +gone off before the spin_lock_bh(), and it will only get +the lock after we spin_unlock_bh(), and then try to free the element (which has already been freed!). This can be avoided by checking the result of -:c:func:`del_timer()`: if it returns 1, the timer has been deleted. +del_timer(): if it returns 1, the timer has been deleted. If 0, it means (in this case) that it is currently running, so we can do:: @@ -1012,9 +1012,9 @@ do:: Another common problem is deleting timers which restart themselves (by -calling :c:func:`add_timer()` at the end of their timer function). +calling add_timer() at the end of their timer function). Because this is a fairly common case which is prone to races, you should -use :c:func:`del_timer_sync()` (``include/linux/timer.h``) to +use del_timer_sync() (``include/linux/timer.h``) to handle this case. It returns the number of times the timer had to be deleted before we finally stopped it from adding itself back in. @@ -1086,7 +1086,7 @@ adding ``new`` to a single linked list called ``list``:: list->next = new; -The :c:func:`wmb()` is a write memory barrier. It ensures that the +The wmb() is a write memory barrier. It ensures that the first operation (setting the new element's ``next`` pointer) is complete and will be seen by all CPUs, before the second operation is (putting the new element into the list). This is important, since modern @@ -1097,7 +1097,7 @@ rest of the list. Fortunately, there is a function to do this for standard :c:type:`struct list_head <list_head>` lists: -:c:func:`list_add_rcu()` (``include/linux/list.h``). +list_add_rcu() (``include/linux/list.h``). 
Removing an element from the list is even simpler: we replace the pointer to the old element with a pointer to its successor, and readers @@ -1108,7 +1108,7 @@ will either see it, or skip over it. list->next = old->next; -There is :c:func:`list_del_rcu()` (``include/linux/list.h``) which +There is list_del_rcu() (``include/linux/list.h``) which does this (the normal version poisons the old object, which we don't want). @@ -1116,9 +1116,9 @@ The reader must also be careful: some CPUs can look through the ``next`` pointer to start reading the contents of the next element early, but don't realize that the pre-fetched contents is wrong when the ``next`` pointer changes underneath them. Once again, there is a -:c:func:`list_for_each_entry_rcu()` (``include/linux/list.h``) +list_for_each_entry_rcu() (``include/linux/list.h``) to help you. Of course, writers can just use -:c:func:`list_for_each_entry()`, since there cannot be two +list_for_each_entry(), since there cannot be two simultaneous writers. Our final dilemma is this: when can we actually destroy the removed @@ -1127,14 +1127,14 @@ the list right now: if we free this element and the ``next`` pointer changes, the reader will jump off into garbage and crash. We need to wait until we know that all the readers who were traversing the list when we deleted the element are finished. We use -:c:func:`call_rcu()` to register a callback which will actually +call_rcu() to register a callback which will actually destroy the object once all pre-existing readers are finished. -Alternatively, :c:func:`synchronize_rcu()` may be used to block +Alternatively, synchronize_rcu() may be used to block until all pre-existing are finished. But how does Read Copy Update know when the readers are finished? 
The method is this: firstly, the readers always traverse the list inside -:c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` pairs: +rcu_read_lock()/rcu_read_unlock() pairs: these simply disable preemption so the reader won't go to sleep while reading the list. @@ -1223,12 +1223,12 @@ this is the fundamental idea. } Note that the reader will alter the popularity member in -:c:func:`__cache_find()`, and now it doesn't hold a lock. One +__cache_find(), and now it doesn't hold a lock. One solution would be to make it an ``atomic_t``, but for this usage, we don't really care about races: an approximate result is good enough, so I didn't change it. -The result is that :c:func:`cache_find()` requires no +The result is that cache_find() requires no synchronization with any other functions, so is almost as fast on SMP as it would be on UP. @@ -1240,9 +1240,9 @@ and put the reference count. Now, because the 'read lock' in RCU is simply disabling preemption, a caller which always has preemption disabled between calling -:c:func:`cache_find()` and :c:func:`object_put()` does not +cache_find() and object_put() does not need to actually get and put the reference count: we could expose -:c:func:`__cache_find()` by making it non-static, and such +__cache_find() by making it non-static, and such callers could simply call that. The benefit here is that the reference count is not written to: the @@ -1260,11 +1260,11 @@ counter. Nice and simple. If that was too slow (it's usually not, but if you've got a really big machine to test on and can show that it is), you could instead use a counter for each CPU, then none of them need an exclusive lock. See -:c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` and -:c:func:`put_cpu_var()` (``include/linux/percpu.h``). +DEFINE_PER_CPU(), get_cpu_var() and +put_cpu_var() (``include/linux/percpu.h``). 
Of particular use for simple per-cpu counters is the ``local_t`` type, -and the :c:func:`cpu_local_inc()` and related functions, which are +and the cpu_local_inc() and related functions, which are more efficient than simple code on some architectures (``include/asm/local.h``). @@ -1289,10 +1289,10 @@ irq handler doesn't use a lock, and all other accesses are done as so:: enable_irq(irq); spin_unlock(&lock); -The :c:func:`disable_irq()` prevents the irq handler from running +The disable_irq() prevents the irq handler from running (and waits for it to finish if it's currently running on other CPUs). The spinlock prevents any other accesses happening at the same time. -Naturally, this is slower than just a :c:func:`spin_lock_irq()` +Naturally, this is slower than just a spin_lock_irq() call, so it only makes sense if this type of access happens extremely rarely. @@ -1315,22 +1315,22 @@ from user context, and can sleep. - Accesses to userspace: - - :c:func:`copy_from_user()` + - copy_from_user() - - :c:func:`copy_to_user()` + - copy_to_user() - - :c:func:`get_user()` + - get_user() - - :c:func:`put_user()` + - put_user() -- :c:func:`kmalloc(GFP_KERNEL) <kmalloc>` +- kmalloc(GFP_KERNEL) -- :c:func:`mutex_lock_interruptible()` and - :c:func:`mutex_lock()` +- mutex_lock_interruptible() and + mutex_lock() - There is a :c:func:`mutex_trylock()` which does not sleep. + There is a mutex_trylock() which does not sleep. Still, it must not be used inside interrupt context since its - implementation is not safe for that. :c:func:`mutex_unlock()` + implementation is not safe for that. mutex_unlock() will also never sleep. It cannot be used in interrupt context either since a mutex must be released by the same task that acquired it. @@ -1340,11 +1340,11 @@ Some Functions Which Don't Sleep Some functions are safe to call from any context, or holding almost any lock.
-- :c:func:`printk()` +- printk() -- :c:func:`kfree()` +- kfree() -- :c:func:`add_timer()` and :c:func:`del_timer()` +- add_timer() and del_timer() Mutex API reference =================== @@ -1400,26 +1400,26 @@ preemption bh Bottom Half: for historical reasons, functions with '_bh' in them often - now refer to any software interrupt, e.g. :c:func:`spin_lock_bh()` + now refer to any software interrupt, e.g. spin_lock_bh() blocks any software interrupt on the current CPU. Bottom halves are deprecated, and will eventually be replaced by tasklets. Only one bottom half will be running at any time. Hardware Interrupt / Hardware IRQ - Hardware interrupt request. :c:func:`in_irq()` returns true in a + Hardware interrupt request. in_irq() returns true in a hardware interrupt handler. Interrupt Context Not user context: processing a hardware irq or software irq. Indicated - by the :c:func:`in_interrupt()` macro returning true. + by the in_interrupt() macro returning true. SMP Symmetric Multi-Processor: kernels compiled for multiple-CPU machines. (``CONFIG_SMP=y``). Software Interrupt / softirq - Software interrupt handler. :c:func:`in_irq()` returns false; - :c:func:`in_softirq()` returns true. Tasklets and softirqs both + Software interrupt handler. in_irq() returns false; + in_softirq() returns true. Tasklets and softirqs both fall into the category of 'software interrupts'. Strictly speaking a softirq is one of up to 32 enumerated software diff --git a/Documentation/kref.txt b/Documentation/kref.txt index 3af384156d7e..c61eea6f1bf2 100644 --- a/Documentation/kref.txt +++ b/Documentation/kref.txt @@ -128,6 +128,10 @@ since we already have a valid pointer that we own a refcount for. The put needs no lock because nothing tries to get the data without already holding a pointer. +In the above example, kref_put() will be called 2 times in both success +and error paths. This is necessary because the reference count got +incremented 2 times by kref_init() and kref_get(). 
+ Note that the "before" in rule 1 is very important. You should never do something like:: diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst index 626a463f7e42..5d6800a723dc 100644 --- a/Documentation/locking/index.rst +++ b/Documentation/locking/index.rst @@ -7,6 +7,7 @@ locking .. toctree:: :maxdepth: 1 + locktypes lockdep-design lockstat locktorture diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst new file mode 100644 index 000000000000..09f45ce38d26 --- /dev/null +++ b/Documentation/locking/locktypes.rst @@ -0,0 +1,347 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _kernel_hacking_locktypes: + +========================== +Lock types and their rules +========================== + +Introduction +============ + +The kernel provides a variety of locking primitives which can be divided +into two categories: + + - Sleeping locks + - Spinning locks + +This document conceptually describes these lock types and provides rules +for their nesting, including the rules for use under PREEMPT_RT. + + +Lock categories +=============== + +Sleeping locks +-------------- + +Sleeping locks can only be acquired in preemptible task context. + +Although implementations allow try_lock() from other contexts, it is +necessary to carefully evaluate the safety of unlock() as well as of +try_lock(). Furthermore, it is also necessary to evaluate the debugging +versions of these primitives. In short, don't acquire sleeping locks from +other contexts unless there is no other option. 
+ +Sleeping lock types: + + - mutex + - rt_mutex + - semaphore + - rw_semaphore + - ww_mutex + - percpu_rw_semaphore + +On PREEMPT_RT kernels, these lock types are converted to sleeping locks: + + - spinlock_t + - rwlock_t + +Spinning locks +-------------- + + - raw_spinlock_t + - bit spinlocks + +On non-PREEMPT_RT kernels, these lock types are also spinning locks: + + - spinlock_t + - rwlock_t + +Spinning locks implicitly disable preemption and the lock / unlock functions +can have suffixes which apply further protections: + + =================== ==================================================== + _bh() Disable / enable bottom halves (soft interrupts) + _irq() Disable / enable interrupts + _irqsave/restore() Save and disable / restore interrupt disabled state + =================== ==================================================== + +Owner semantics +=============== + +The aforementioned lock types except semaphores have strict owner +semantics: + + The context (task) that acquired the lock must release it. + +rw_semaphores have a special interface which allows non-owner release for +readers. + + +rtmutex +======= + +RT-mutexes are mutexes with support for priority inheritance (PI). + +PI has limitations on non-PREEMPT_RT kernels due to preemption and +interrupt disabled sections. + +PI clearly cannot preempt preemption-disabled or interrupt-disabled +regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels +execute most such regions of code in preemptible task context, especially +interrupt handlers and soft interrupts. This conversion allows spinlock_t +and rwlock_t to be implemented via RT-mutexes. + + +semaphore +========= + +semaphore is a counting semaphore implementation. + +Semaphores are often used for both serialization and waiting, but new use +cases should instead use separate serialization and wait mechanisms, such +as mutexes and completions. 
+ +semaphores and PREEMPT_RT +---------------------------- + +PREEMPT_RT does not change the semaphore implementation because counting +semaphores have no concept of owners, thus preventing PREEMPT_RT from +providing priority inheritance for semaphores. After all, an unknown +owner cannot be boosted. As a consequence, blocking on semaphores can +result in priority inversion. + + +rw_semaphore +============ + +rw_semaphore is a multiple readers and single writer lock mechanism. + +On non-PREEMPT_RT kernels the implementation is fair, thus preventing +writer starvation. + +rw_semaphore complies by default with the strict owner semantics, but there +exist special-purpose interfaces that allow non-owner release for readers. +These interfaces work independent of the kernel configuration. + +rw_semaphore and PREEMPT_RT +--------------------------- + +PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based +implementation, thus changing the fairness: + + Because an rw_semaphore writer cannot grant its priority to multiple + readers, a preempted low-priority reader will continue holding its lock, + thus starving even high-priority writers. In contrast, because readers + can grant their priority to a writer, a preempted low-priority writer will + have its priority boosted until it releases the lock, thus preventing that + writer from starving readers. + + +raw_spinlock_t and spinlock_t +============================= + +raw_spinlock_t +-------------- + +raw_spinlock_t is a strict spinning lock implementation regardless of the +kernel configuration including PREEMPT_RT enabled kernels. + +raw_spinlock_t is a strict spinning lock implementation in all kernels, +including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical +core code, low-level interrupt handling and places where disabling +preemption or interrupts is required, for example, to safely access +hardware state. 
raw_spinlock_t can sometimes also be used when the +critical section is tiny, thus avoiding RT-mutex overhead. + +spinlock_t +---------- + +The semantics of spinlock_t change with the state of PREEMPT_RT. + +On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has +exactly the same semantics. + +spinlock_t and PREEMPT_RT +------------------------- + +On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation +based on rt_mutex which changes the semantics: + + - Preemption is not disabled. + + - The hard interrupt related suffixes for spin_lock / spin_unlock + operations (_irq, _irqsave / _irqrestore) do not affect the CPU's + interrupt disabled state. + + - The soft interrupt related suffix (_bh()) still disables softirq + handlers. + + Non-PREEMPT_RT kernels disable preemption to get this effect. + + PREEMPT_RT kernels use a per-CPU lock for serialization which keeps + preemption disabled. The lock disables softirq handlers and also + prevents reentrancy due to task preemption. + +PREEMPT_RT kernels preserve all other spinlock_t semantics: + + - Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels + avoid migration by disabling preemption. PREEMPT_RT kernels instead + disable migration, which ensures that pointers to per-CPU variables + remain valid even if the task is preempted. + + - Task state is preserved across spinlock acquisition, ensuring that the + task-state rules apply to all kernel configurations. Non-PREEMPT_RT + kernels leave task state untouched. However, PREEMPT_RT must change + task state if the task blocks during acquisition. 
Therefore, it saves + the current task state before blocking and the corresponding lock wakeup + restores it, as shown below:: + + task->state = TASK_INTERRUPTIBLE + lock() + block() + task->saved_state = task->state + task->state = TASK_UNINTERRUPTIBLE + schedule() + lock wakeup + task->state = task->saved_state + + Other types of wakeups would normally unconditionally set the task state + to RUNNING, but that does not work here because the task must remain + blocked until the lock becomes available. Therefore, when a non-lock + wakeup attempts to awaken a task blocked waiting for a spinlock, it + instead sets the saved state to RUNNING. Then, when the lock + acquisition completes, the lock wakeup sets the task state to the saved + state, in this case setting it to RUNNING:: + + task->state = TASK_INTERRUPTIBLE + lock() + block() + task->saved_state = task->state + task->state = TASK_UNINTERRUPTIBLE + schedule() + non lock wakeup + task->saved_state = TASK_RUNNING + + lock wakeup + task->state = task->saved_state + + This ensures that the real wakeup cannot be lost. + + +rwlock_t +======== + +rwlock_t is a multiple readers and single writer lock mechanism. + +Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the +suffix rules of spinlock_t apply accordingly. The implementation is fair, +thus preventing writer starvation. + +rwlock_t and PREEMPT_RT +----------------------- + +PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based +implementation, thus changing semantics: + + - All the spinlock_t changes also apply to rwlock_t. + + - Because an rwlock_t writer cannot grant its priority to multiple + readers, a preempted low-priority reader will continue holding its lock, + thus starving even high-priority writers. In contrast, because readers + can grant their priority to a writer, a preempted low-priority writer + will have its priority boosted until it releases the lock, thus + preventing that writer from starving readers. 
+ + +PREEMPT_RT caveats +================== + +spinlock_t and rwlock_t +----------------------- + +These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels +have a few implications. For example, on a non-PREEMPT_RT kernel the +following code sequence works as expected:: + + local_irq_disable(); + spin_lock(&lock); + +and is fully equivalent to:: + + spin_lock_irq(&lock); + +The same applies to rwlock_t and the _irqsave() suffix variants. + +On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires a +fully preemptible context. Instead, use spin_lock_irq() or +spin_lock_irqsave() and their unlock counterparts. In cases where the +interrupt disabling and locking must remain separate, PREEMPT_RT offers a +local_lock mechanism. Acquiring the local_lock pins the task to a CPU, +allowing things like per-CPU interrupt disabled locks to be acquired. +However, this approach should be used only where absolutely necessary. + + +raw_spinlock_t +-------------- + +Acquiring a raw_spinlock_t disables preemption and possibly also +interrupts, so the critical section must avoid acquiring a regular +spinlock_t or rwlock_t; for example, the critical section must avoid +allocating memory. Thus, on a non-PREEMPT_RT kernel the following code +works perfectly:: + + raw_spin_lock(&lock); + p = kmalloc(sizeof(*p), GFP_ATOMIC); + +But this code fails on PREEMPT_RT kernels because the memory allocator is +fully preemptible and therefore cannot be invoked from truly atomic +contexts. However, it is perfectly fine to invoke the memory allocator +while holding normal non-raw spinlocks because they do not disable +preemption on PREEMPT_RT kernels:: + + spin_lock(&lock); + p = kmalloc(sizeof(*p), GFP_ATOMIC); + + +bit spinlocks +------------- + +PREEMPT_RT cannot substitute bit spinlocks because a single bit is too +small to accommodate an RT-mutex.
Therefore, the semantics of bit +spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t +caveats also apply to bit spinlocks. + +Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT +using conditional (#ifdef'ed) code changes at the usage site. In contrast, +usage-site changes are not needed for the spinlock_t substitution. +Instead, conditionals in header files and the core locking implementation +enable the compiler to do the substitution transparently. + + +Lock type nesting rules +======================= + +The most basic rules are: + + - Lock types of the same lock category (sleeping, spinning) can nest + arbitrarily as long as they respect the general lock ordering rules to + prevent deadlocks. + + - Sleeping lock types cannot nest inside spinning lock types. + + - Spinning lock types can nest inside sleeping lock types. + +These constraints apply both in PREEMPT_RT and otherwise. + +The fact that PREEMPT_RT changes the lock category of spinlock_t and +rwlock_t from spinning to sleeping means that they cannot be acquired while +holding a raw spinlock. This results in the following nesting ordering: + + 1) Sleeping locks + 2) spinlock_t and rwlock_t + 3) raw_spinlock_t and bit spinlocks + +Lockdep will complain if these constraints are violated, both in +PREEMPT_RT and otherwise. diff --git a/Documentation/media/kapi/csi2.rst b/Documentation/media/kapi/csi2.rst index 030a5c41ec75..e111ff7bfd3d 100644 --- a/Documentation/media/kapi/csi2.rst +++ b/Documentation/media/kapi/csi2.rst @@ -74,7 +74,7 @@ Before the receiver driver may enable the CSI-2 transmitter by using the :c:type:`v4l2_subdev_video_ops`->s_stream(), it must have powered the transmitter up by using the :c:type:`v4l2_subdev_core_ops`->s_power() callback. This may take -place either indirectly by using :c:func:`v4l2_pipeline_pm_use` or +place either indirectly by using :c:func:`v4l2_pipeline_pm_get` or directly.
Formats diff --git a/Documentation/media/kapi/v4l2-controls.rst b/Documentation/media/kapi/v4l2-controls.rst index b20800cae3f2..5129019afb49 100644 --- a/Documentation/media/kapi/v4l2-controls.rst +++ b/Documentation/media/kapi/v4l2-controls.rst @@ -291,8 +291,8 @@ and QUERYMENU. And G/S_CTRL as well as G/TRY/S_EXT_CTRLS are automatically suppo In practice the basic usage as described above is sufficient for most drivers. -Inheriting Controls -------------------- +Inheriting Sub-device Controls +------------------------------ When a sub-device is registered with a V4L2 driver by calling v4l2_device_register_subdev() and the ctrl_handler fields of both v4l2_subdev @@ -757,8 +757,8 @@ attempting to find another control from the same handler will deadlock. It is recommended not to use this function from inside the control ops. -Inheriting Controls -------------------- +Preventing Controls inheritance +------------------------------- When one control handler is added to another using v4l2_ctrl_add_handler, then by default all controls from one are merged to the other. But a subdev might diff --git a/Documentation/media/kapi/v4l2-dev.rst b/Documentation/media/kapi/v4l2-dev.rst index 4c5a15c53dbf..63c064837c00 100644 --- a/Documentation/media/kapi/v4l2-dev.rst +++ b/Documentation/media/kapi/v4l2-dev.rst @@ -185,7 +185,7 @@ This will create the character device for you. .. 
code-block:: c - err = video_register_device(vdev, VFL_TYPE_GRABBER, -1); + err = video_register_device(vdev, VFL_TYPE_VIDEO, -1); if (err) { video_device_release(vdev); /* or kfree(my_vdev); */ return err; @@ -201,7 +201,7 @@ types exist: ========================== ==================== ============================== :c:type:`vfl_devnode_type` Device name Usage ========================== ==================== ============================== -``VFL_TYPE_GRABBER`` ``/dev/videoX`` for video input/output devices +``VFL_TYPE_VIDEO`` ``/dev/videoX`` for video input/output devices ``VFL_TYPE_VBI`` ``/dev/vbiX`` for vertical blank data (i.e. closed captions, teletext) ``VFL_TYPE_RADIO`` ``/dev/radioX`` for radio tuners diff --git a/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst b/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst index a21659d55c6b..6818ddf1495c 100644 --- a/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst +++ b/Documentation/media/uapi/cec/cec-ioc-adap-g-conn-info.rst @@ -44,18 +44,18 @@ is only available if the ``CEC_CAP_CONNECTOR_INFO`` capability is set. .. flat-table:: struct cec_connector_info :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 8 + :widths: 1 1 8 * - __u32 - ``type`` - The type of connector this adapter is associated with. - * - union + * - union { - ``(anonymous)`` - - - * - - - ``struct cec_drm_connector_info`` + * - ``struct cec_drm_connector_info`` - drm - :ref:`cec-drm-connector-info` + * - } + - .. tabularcolumns:: |p{4.4cm}|p{2.5cm}|p{10.6cm}| diff --git a/Documentation/media/uapi/cec/cec-ioc-dqevent.rst b/Documentation/media/uapi/cec/cec-ioc-dqevent.rst index 5e21b1fbfc01..d16b226b1bef 100644 --- a/Documentation/media/uapi/cec/cec-ioc-dqevent.rst +++ b/Documentation/media/uapi/cec/cec-ioc-dqevent.rst @@ -109,35 +109,33 @@ it is guaranteed that the state did change in between the two events. .. 
flat-table:: struct cec_event :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 8 + :widths: 1 1 8 * - __u64 - ``ts`` - - :cspan:`1`\ Timestamp of the event in ns. + - Timestamp of the event in ns. The timestamp has been taken from the ``CLOCK_MONOTONIC`` clock. To access the same clock from userspace use :c:func:`clock_gettime`. * - __u32 - ``event`` - - :cspan:`1` The CEC event type, see :ref:`cec-events`. + - The CEC event type, see :ref:`cec-events`. * - __u32 - ``flags`` - - :cspan:`1` Event flags, see :ref:`cec-event-flags`. - * - union + - Event flags, see :ref:`cec-event-flags`. + * - union { - (anonymous) - - - - - * - - - struct cec_event_state_change + * - struct cec_event_state_change - ``state_change`` - The new adapter state as sent by the :ref:`CEC_EVENT_STATE_CHANGE <CEC-EVENT-STATE-CHANGE>` event. - * - - - struct cec_event_lost_msgs + * - struct cec_event_lost_msgs - ``lost_msgs`` - The number of lost messages as sent by the :ref:`CEC_EVENT_LOST_MSGS <CEC-EVENT-LOST-MSGS>` event. + * - } + - .. tabularcolumns:: |p{5.6cm}|p{0.9cm}|p{11.0cm}| diff --git a/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst b/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst index 6218d9cbdd83..33e2b110145c 100644 --- a/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst +++ b/Documentation/media/uapi/mediactl/media-ioc-enum-entities.rst @@ -64,12 +64,11 @@ id's until they get an error. .. flat-table:: struct media_entity_desc :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 1 8 + :widths: 2 2 1 8 * - __u32 - ``id`` - - - - Entity ID, set by the application. When the ID is or'ed with ``MEDIA_ENT_ID_FLAG_NEXT``, the driver clears the flag and returns the first entity with a larger ID. Do not expect that the ID will @@ -79,79 +78,70 @@ id's until they get an error. * - char - ``name``\ [32] - - - - Entity name as an UTF-8 NULL-terminated string. This name must be unique within the media topology. 
* - __u32 - ``type`` - - - - Entity type, see :ref:`media-entity-functions` for details. * - __u32 - ``revision`` - - - - Entity revision. Always zero (obsolete) * - __u32 - ``flags`` - - - - Entity flags, see :ref:`media-entity-flag` for details. * - __u32 - ``group_id`` - - - - Entity group ID. Always zero (obsolete) * - __u16 - ``pads`` - - - - Number of pads * - __u16 - ``links`` - - - - Total number of outbound links. Inbound links are not counted in this field. * - __u32 - ``reserved[4]`` - - - - Reserved for future extensions. Drivers and applications must set the array to zero. - * - union + * - union { + - (anonymous) - * - - - struct + * - struct - ``dev`` - - Valid for (sub-)devices that create a single device node. * - - - - __u32 - ``major`` - Device node major number. * - - - - __u32 - ``minor`` - Device node minor number. - * - - - __u8 + * - __u8 - ``raw``\ [184] - - + * - } + - Return Value diff --git a/Documentation/media/uapi/v4l/buffer.rst b/Documentation/media/uapi/v4l/buffer.rst index 9149b57728e5..3112300c2fa0 100644 --- a/Documentation/media/uapi/v4l/buffer.rst +++ b/Documentation/media/uapi/v4l/buffer.rst @@ -172,11 +172,10 @@ struct v4l2_buffer .. flat-table:: struct v4l2_buffer :header-rows: 0 :stub-columns: 0 - :widths: 1 2 1 10 + :widths: 1 2 10 * - __u32 - ``index`` - - - Number of the buffer, set by the application except when calling :ref:`VIDIOC_DQBUF <VIDIOC_QBUF>`, then it is set by the driver. This field can range from zero to the number of buffers @@ -186,14 +185,12 @@ struct v4l2_buffer :ref:`VIDIOC_CREATE_BUFS` minus one. * - __u32 - ``type`` - - - Type of the buffer, same as struct :c:type:`v4l2_format` ``type`` or struct :c:type:`v4l2_requestbuffers` ``type``, set by the application. See :c:type:`v4l2_buf_type` * - __u32 - ``bytesused`` - - - The number of bytes occupied by the data in the buffer. It depends on the negotiated data format and may change with each buffer for compressed variable size data like JPEG images. 
Drivers must set @@ -205,18 +202,15 @@ struct v4l2_buffer ``planes`` pointer is used instead. * - __u32 - ``flags`` - - - Flags set by the application or driver, see :ref:`buffer-flags`. * - __u32 - ``field`` - - - Indicates the field order of the image in the buffer, see :c:type:`v4l2_field`. This field is not used when the buffer contains VBI data. Drivers must set it when ``type`` refers to a capture stream, applications when it refers to an output stream. * - struct timeval - ``timestamp`` - - - For capture streams this is time when the first data byte was captured, as returned by the :c:func:`clock_gettime()` function for the relevant clock id; see ``V4L2_BUF_FLAG_TIMESTAMP_*`` in @@ -229,7 +223,6 @@ struct v4l2_buffer stream. * - struct :c:type:`v4l2_timecode` - ``timecode`` - - - When the ``V4L2_BUF_FLAG_TIMECODE`` flag is set in ``flags``, this structure contains a frame timecode. In :c:type:`V4L2_FIELD_ALTERNATE <v4l2_field>` mode the top and @@ -239,10 +232,9 @@ struct v4l2_buffer independent of the ``timestamp`` and ``sequence`` fields. * - __u32 - ``sequence`` - - - Set by the driver, counting the frames (not fields!) in sequence. This field is set for both input and output devices. - * - :cspan:`3` + * - :cspan:`2` In :c:type:`V4L2_FIELD_ALTERNATE <v4l2_field>` mode the top and bottom field have the same sequence number. The count starts at @@ -262,13 +254,11 @@ struct v4l2_buffer * - __u32 - ``memory`` - - - This field must be set by applications and/or drivers in accordance with the selected I/O method. See :c:type:`v4l2_memory` - * - union + * - union { - ``m`` - * - - - __u32 + * - __u32 - ``offset`` - For the single-planar API and when ``memory`` is ``V4L2_MEMORY_MMAP`` this is the offset of the buffer from the @@ -276,29 +266,27 @@ struct v4l2_buffer and apart of serving as parameter to the :ref:`mmap() <func-mmap>` function not useful for applications. 
See :ref:`mmap` for details - * - - - unsigned long + * - unsigned long - ``userptr`` - For the single-planar API and when ``memory`` is ``V4L2_MEMORY_USERPTR`` this is a pointer to the buffer (casted to unsigned long type) in virtual memory, set by the application. See :ref:`userp` for details. - * - - - struct v4l2_plane + * - struct v4l2_plane - ``*planes`` - When using the multi-planar API, contains a userspace pointer to an array of struct :c:type:`v4l2_plane`. The size of the array should be put in the ``length`` field of this struct :c:type:`v4l2_buffer` structure. - * - - - int + * - int - ``fd`` - For the single-plane API and when ``memory`` is ``V4L2_MEMORY_DMABUF`` this is the file descriptor associated with a DMABUF buffer. + * - } + - * - __u32 - ``length`` - - - Size of the buffer (not the payload) in bytes for the single-planar API. This is set by the driver based on the calls to :ref:`VIDIOC_REQBUFS` and/or @@ -308,12 +296,10 @@ struct v4l2_buffer actual number of valid elements in that array. * - __u32 - ``reserved2`` - - - A place holder for future extensions. Drivers and applications must set this to 0. * - __u32 - ``request_fd`` - - - The file descriptor of the request to queue the buffer to. If the flag ``V4L2_BUF_FLAG_REQUEST_FD`` is set, then the buffer will be queued to this request. If the flag is not set, then this field will @@ -344,11 +330,10 @@ struct v4l2_plane .. flat-table:: :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u32 - ``bytesused`` - - - The number of bytes occupied by data in the plane (its payload). Drivers must set this field when ``type`` refers to a capture stream, applications when it refers to an output stream. If the @@ -362,40 +347,35 @@ struct v4l2_plane which may not be 0. * - __u32 - ``length`` - - - Size in bytes of the plane (not its payload). This is set by the driver based on the calls to :ref:`VIDIOC_REQBUFS` and/or :ref:`VIDIOC_CREATE_BUFS`. 
- * - union + * - union { - ``m`` - - - - - * - - - __u32 + * - __u32 - ``mem_offset`` - When the memory type in the containing struct :c:type:`v4l2_buffer` is ``V4L2_MEMORY_MMAP``, this is the value that should be passed to :ref:`mmap() <func-mmap>`, similar to the ``offset`` field in struct :c:type:`v4l2_buffer`. - * - - - unsigned long + * - unsigned long - ``userptr`` - When the memory type in the containing struct :c:type:`v4l2_buffer` is ``V4L2_MEMORY_USERPTR``, this is a userspace pointer to the memory allocated for this plane by an application. - * - - - int + * - int - ``fd`` - When the memory type in the containing struct :c:type:`v4l2_buffer` is ``V4L2_MEMORY_DMABUF``, this is a file descriptor associated with a DMABUF buffer, similar to the ``fd`` field in struct :c:type:`v4l2_buffer`. + * - } + - * - __u32 - ``data_offset`` - - - Offset in bytes to video data in the plane. Drivers must set this field when ``type`` refers to a capture stream, applications when it refers to an output stream. @@ -407,7 +387,6 @@ struct v4l2_plane at offset ``data_offset`` from the start of the plane. * - __u32 - ``reserved[11]`` - - - Reserved for future use. Should be zeroed by drivers and applications. diff --git a/Documentation/media/uapi/v4l/dev-sliced-vbi.rst b/Documentation/media/uapi/v4l/dev-sliced-vbi.rst index e86346f66017..7b2d38dd402a 100644 --- a/Documentation/media/uapi/v4l/dev-sliced-vbi.rst +++ b/Documentation/media/uapi/v4l/dev-sliced-vbi.rst @@ -478,33 +478,30 @@ struct v4l2_mpeg_vbi_fmt_ivtv .. flat-table:: :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u8 - ``magic``\ [4] - - - A "magic" constant from :ref:`v4l2-mpeg-vbi-fmt-ivtv-magic` that indicates this is a valid sliced VBI data payload and also indicates which member of the anonymous union, ``itv0`` or ``ITV0``, to use for the payload data. 
- * - union + * - union { - (anonymous) - * - - - struct :c:type:`v4l2_mpeg_vbi_itv0` + * - struct :c:type:`v4l2_mpeg_vbi_itv0` - ``itv0`` - The primary form of the sliced VBI data payload that contains anywhere from 1 to 35 lines of sliced VBI data. Line masks are provided in this form of the payload indicating which VBI lines are provided. - * - - - struct :ref:`v4l2_mpeg_vbi_ITV0 <v4l2-mpeg-vbi-itv0-1>` + * - struct :ref:`v4l2_mpeg_vbi_ITV0 <v4l2-mpeg-vbi-itv0-1>` - ``ITV0`` - An alternate form of the sliced VBI data payload used when 36 lines of sliced VBI data are present. No line masks are provided in this form of the payload; all valid line mask bits are implcitly set. - - + * - } + - .. _v4l2-mpeg-vbi-fmt-ivtv-magic: diff --git a/Documentation/media/uapi/v4l/ext-ctrls-codec.rst b/Documentation/media/uapi/v4l/ext-ctrls-codec.rst index 28313c0f4e7c..d4fc5f25aa14 100644 --- a/Documentation/media/uapi/v4l/ext-ctrls-codec.rst +++ b/Documentation/media/uapi/v4l/ext-ctrls-codec.rst @@ -2028,6 +2028,22 @@ enum v4l2_mpeg_video_h264_hierarchical_coding_type - * - ``V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM`` - 0x00000004 - The DPB entry is a long term reference frame + * - ``V4L2_H264_DPB_ENTRY_FLAG_FIELD`` + - 0x00000008 + - The DPB entry is a field reference, which means only one of the field + will be used when decoding the new frame/field. When not set the DPB + entry is a frame reference (both fields will be used). Note that this + flag does not say anything about the number of fields contained in the + reference frame, it just describes the one used to decode the new + field/frame + * - ``V4L2_H264_DPB_ENTRY_FLAG_BOTTOM_FIELD`` + - 0x00000010 + - The DPB entry is a bottom field reference (only the bottom field of the + reference frame is needed to decode the new frame/field). Only valid if + V4L2_H264_DPB_ENTRY_FLAG_FIELD is set. 
When + V4L2_H264_DPB_ENTRY_FLAG_FIELD is set but + V4L2_H264_DPB_ENTRY_FLAG_BOTTOM_FIELD is not, that means the + DPB entry is a top field reference ``V4L2_CID_MPEG_VIDEO_H264_DECODE_MODE (enum)`` Specifies the decoding mode to use. Currently exposes slice-based and diff --git a/Documentation/media/uapi/v4l/pixfmt-bayer.rst b/Documentation/media/uapi/v4l/pixfmt-bayer.rst index cfa2f4e3e114..807ab34ba93b 100644 --- a/Documentation/media/uapi/v4l/pixfmt-bayer.rst +++ b/Documentation/media/uapi/v4l/pixfmt-bayer.rst @@ -34,5 +34,6 @@ orders. See also `the Wikipedia article on Bayer filter pixfmt-srggb10-ipu3 pixfmt-srggb12 pixfmt-srggb12p + pixfmt-srggb14 pixfmt-srggb14p pixfmt-srggb16 diff --git a/Documentation/media/uapi/v4l/pixfmt-srggb14.rst b/Documentation/media/uapi/v4l/pixfmt-srggb14.rst new file mode 100644 index 000000000000..3420d4d1825e --- /dev/null +++ b/Documentation/media/uapi/v4l/pixfmt-srggb14.rst @@ -0,0 +1,82 @@ +.. Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GFDL-1.1-or-later WITH no-invariant-sections + +.. _V4L2-PIX-FMT-SRGGB14: +.. _v4l2-pix-fmt-sbggr14: +.. _v4l2-pix-fmt-sgbrg14: +.. 
_v4l2-pix-fmt-sgrbg14:
+
+
+***************************************************************************************************************************
+V4L2_PIX_FMT_SRGGB14 ('RG14'), V4L2_PIX_FMT_SGRBG14 ('GR14'), V4L2_PIX_FMT_SGBRG14 ('GB14'), V4L2_PIX_FMT_SBGGR14 ('BG14'),
+***************************************************************************************************************************
+
+
+14-bit Bayer formats expanded to 16 bits
+
+
+Description
+===========
+
+These four pixel formats are raw sRGB / Bayer formats with 14 bits per
+colour. Each sample is stored in a 16-bit word, with two unused high
+bits filled with zeros. Each n-pixel row contains n/2 green samples
+and n/2 blue or red samples, with alternating red and blue rows. Bytes
+are stored in memory in little endian order. They are conventionally
+described as GRGR... BGBG..., RGRG... GBGB..., etc. Below is an
+example of a small V4L2_PIX_FMT_SBGGR14 image:
+
+**Byte Order.**
+Each cell is one byte, the two most significant bits in the high bytes are
+zero.
+
+
+
+.. flat-table::
+    :header-rows: 0
+    :stub-columns: 0
+    :widths: 2 1 1 1 1 1 1 1 1
+
+
+    * - start + 0:
+      - B\ :sub:`00low`
+      - B\ :sub:`00high`
+      - G\ :sub:`01low`
+      - G\ :sub:`01high`
+      - B\ :sub:`02low`
+      - B\ :sub:`02high`
+      - G\ :sub:`03low`
+      - G\ :sub:`03high`
+    * - start + 8:
+      - G\ :sub:`10low`
+      - G\ :sub:`10high`
+      - R\ :sub:`11low`
+      - R\ :sub:`11high`
+      - G\ :sub:`12low`
+      - G\ :sub:`12high`
+      - R\ :sub:`13low`
+      - R\ :sub:`13high`
+    * - start + 16:
+      - B\ :sub:`20low`
+      - B\ :sub:`20high`
+      - G\ :sub:`21low`
+      - G\ :sub:`21high`
+      - B\ :sub:`22low`
+      - B\ :sub:`22high`
+      - G\ :sub:`23low`
+      - G\ :sub:`23high`
+    * - start + 24:
+      - G\ :sub:`30low`
+      - G\ :sub:`30high`
+      - R\ :sub:`31low`
+      - R\ :sub:`31high`
+      - G\ :sub:`32low`
+      - G\ :sub:`32high`
+      - R\ :sub:`33low`
+      - R\ :sub:`33high`
diff --git a/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst b/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
index db43dda5aafb..054275c0dfc1 100644
--- a/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
+++ b/Documentation/media/uapi/v4l/pixfmt-v4l2-mplane.rst
@@ -100,7 +100,8 @@ describing all planes of that format.
     * - __u8
       - ``flags``
       - Flags set by the application or driver, see :ref:`format-flags`.
-    * - :cspan:`2` union { (anonymous)
+    * - union {
+      - (anonymous)
     * - __u8
       - ``ycbcr_enc``
       - Y'CbCr encoding, from enum :c:type:`v4l2_ycbcr_encoding`.
@@ -113,7 +114,8 @@ describing all planes of that format.
         This information supplements the ``colorspace`` and must be set by
         the driver for capture streams and by the application for output
         streams, see :ref:`colorspaces`.
-    * - :cspan:`2` }
+    * - }
+      -
     * - __u8
       - ``quantization``
       - Quantization range, from enum :c:type:`v4l2_quantization`.
diff --git a/Documentation/media/uapi/v4l/pixfmt-v4l2.rst b/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
index a8321c348bf8..a993b861bf75 100644
--- a/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
+++ b/Documentation/media/uapi/v4l/pixfmt-v4l2.rst
@@ -143,7 +143,6 @@ Single-planar format structure
       - Flags set by the application or driver, see :ref:`format-flags`.
     * - union {
       - (anonymous)
-      -
     * - __u32
       - ``ycbcr_enc``
       - Y'CbCr encoding, from enum :c:type:`v4l2_ycbcr_encoding`.
@@ -158,7 +157,6 @@ Single-planar format structure
         streams, see :ref:`colorspaces`.
     * - }
       -
-      -
     * - __u32
       - ``quantization``
       - Quantization range, from enum :c:type:`v4l2_quantization`.
diff --git a/Documentation/media/uapi/v4l/pixfmt-y14.rst b/Documentation/media/uapi/v4l/pixfmt-y14.rst
new file mode 100644
index 000000000000..5c260f8da088
--- /dev/null
+++ b/Documentation/media/uapi/v4l/pixfmt-y14.rst
@@ -0,0 +1,72 @@
+.. Permission is granted to copy, distribute and/or modify this
+.. document under the terms of the GNU Free Documentation License,
+.. Version 1.1 or any later version published by the Free Software
+.. Foundation, with no Invariant Sections, no Front-Cover Texts
+.. and no Back-Cover Texts. A copy of the license is included at
+.. Documentation/media/uapi/fdl-appendix.rst.
+..
+.. TODO: replace it to GFDL-1.1-or-later WITH no-invariant-sections
+
+.. _V4L2-PIX-FMT-Y14:
+
+*************************
+V4L2_PIX_FMT_Y14 ('Y14 ')
+*************************
+
+
+Grey-scale image
+
+
+Description
+===========
+
+This is a grey-scale image with a depth of 14 bits per pixel. Pixels are
+stored in 16-bit words with unused high bits padded with 0. The least
+significant byte is stored at lower memory addresses (little-endian).
+
+**Byte Order.**
+Each cell is one byte.
+
+
+
+
+.. flat-table::
+    :header-rows: 0
+    :stub-columns: 0
+
+    * - start + 0:
+      - Y'\ :sub:`00low`
+      - Y'\ :sub:`00high`
+      - Y'\ :sub:`01low`
+      - Y'\ :sub:`01high`
+      - Y'\ :sub:`02low`
+      - Y'\ :sub:`02high`
+      - Y'\ :sub:`03low`
+      - Y'\ :sub:`03high`
+    * - start + 8:
+      - Y'\ :sub:`10low`
+      - Y'\ :sub:`10high`
+      - Y'\ :sub:`11low`
+      - Y'\ :sub:`11high`
+      - Y'\ :sub:`12low`
+      - Y'\ :sub:`12high`
+      - Y'\ :sub:`13low`
+      - Y'\ :sub:`13high`
+    * - start + 16:
+      - Y'\ :sub:`20low`
+      - Y'\ :sub:`20high`
+      - Y'\ :sub:`21low`
+      - Y'\ :sub:`21high`
+      - Y'\ :sub:`22low`
+      - Y'\ :sub:`22high`
+      - Y'\ :sub:`23low`
+      - Y'\ :sub:`23high`
+    * - start + 24:
+      - Y'\ :sub:`30low`
+      - Y'\ :sub:`30high`
+      - Y'\ :sub:`31low`
+      - Y'\ :sub:`31high`
+      - Y'\ :sub:`32low`
+      - Y'\ :sub:`32high`
+      - Y'\ :sub:`33low`
+      - Y'\ :sub:`33high`
diff --git a/Documentation/media/uapi/v4l/subdev-formats.rst b/Documentation/media/uapi/v4l/subdev-formats.rst
index 15e11f27b4c8..17bfb2beaa6a 100644
--- a/Documentation/media/uapi/v4l/subdev-formats.rst
+++ b/Documentation/media/uapi/v4l/subdev-formats.rst
@@ -5792,6 +5792,43 @@ the following codes.
       - u\ :sub:`2`
       - u\ :sub:`1`
       - u\ :sub:`0`
+    * .. _MEDIA-BUS-FMT-Y14-1X14:
+
+      - MEDIA_BUS_FMT_Y14_1X14
+      - 0x202d
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      -
+      - y\ :sub:`13`
+      - y\ :sub:`12`
+      - y\ :sub:`11`
+      - y\ :sub:`10`
+      - y\ :sub:`9`
+      - y\ :sub:`8`
+      - y\ :sub:`7`
+      - y\ :sub:`6`
+      - y\ :sub:`5`
+      - y\ :sub:`4`
+      - y\ :sub:`3`
+      - y\ :sub:`2`
+      - y\ :sub:`1`
+      - y\ :sub:`0`
     * .. _MEDIA-BUS-FMT-UYVY8-1X16:
 
       - MEDIA_BUS_FMT_UYVY8_1X16
diff --git a/Documentation/media/uapi/v4l/vidioc-dbg-g-chip-info.rst b/Documentation/media/uapi/v4l/vidioc-dbg-g-chip-info.rst
index a1cf20181cf1..d38031dbe4e4 100644
--- a/Documentation/media/uapi/v4l/vidioc-dbg-g-chip-info.rst
+++ b/Documentation/media/uapi/v4l/vidioc-dbg-g-chip-info.rst
@@ -91,23 +91,23 @@ instructions.
 ..
flat-table:: struct v4l2_dbg_match :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u32 - ``type`` - See :ref:`name-chip-match-types` for a list of possible types. - * - union + * - union { - (anonymous) - * - - - __u32 + * - __u32 - ``addr`` - Match a chip by this number, interpreted according to the ``type`` field. - * - - - char + * - char - ``name[32]`` - Match a chip by this name, interpreted according to the ``type`` field. Currently unused. + * - } + - diff --git a/Documentation/media/uapi/v4l/vidioc-dbg-g-register.rst b/Documentation/media/uapi/v4l/vidioc-dbg-g-register.rst index 29e1d4fc4f52..112597c6cad2 100644 --- a/Documentation/media/uapi/v4l/vidioc-dbg-g-register.rst +++ b/Documentation/media/uapi/v4l/vidioc-dbg-g-register.rst @@ -100,23 +100,23 @@ instructions. .. flat-table:: struct v4l2_dbg_match :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u32 - ``type`` - See :ref:`chip-match-types` for a list of possible types. - * - union + * - union { - (anonymous) - * - - - __u32 + * - __u32 - ``addr`` - Match a chip by this number, interpreted according to the ``type`` field. - * - - - char + * - char - ``name[32]`` - Match a chip by this name, interpreted according to the ``type`` field. Currently unused. + * - } + - diff --git a/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst b/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst index f1a504836f31..784c5980da8d 100644 --- a/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst +++ b/Documentation/media/uapi/v4l/vidioc-decoder-cmd.rst @@ -77,32 +77,25 @@ introduced in Linux 3.3. They are, however, mandatory for stateful mem2mem decod .. flat-table:: struct v4l2_decoder_cmd :header-rows: 0 :stub-columns: 0 - :widths: 11 24 12 16 106 + :widths: 1 1 1 3 * - __u32 - ``cmd`` - - - - The decoder command, see :ref:`decoder-cmds`. * - __u32 - ``flags`` - - - - Flags to go with the command. 
If no flags are defined for this command, drivers and applications must set this field to zero. - * - union + * - union { - (anonymous) - - - - - - - * - - - struct + * - struct - ``start`` - - Structure containing additional data for the ``V4L2_DEC_CMD_START`` command. * - - - - __s32 - ``speed`` - Playback speed and direction. The playback speed is defined as @@ -113,7 +106,6 @@ introduced in Linux 3.3. They are, however, mandatory for stateful mem2mem decod of 1 steps just one frame forward, a speed of -1 steps just one frame back. * - - - - __u32 - ``format`` - Format restrictions. This field is set by the driver, not the @@ -124,30 +116,26 @@ introduced in Linux 3.3. They are, however, mandatory for stateful mem2mem decod GOPs, which it can then play in reverse order. So to implement reverse playback the application must feed the decoder the last GOP in the video file, then the GOP before that, etc. etc. - * - - - struct + * - struct - ``stop`` - - Structure containing additional data for the ``V4L2_DEC_CMD_STOP`` command. * - - - - __u64 - ``pts`` - Stop playback at this ``pts`` or immediately if the playback is already past that timestamp. Leave to 0 if you want to stop after the last frame was decoded. - * - - - struct + * - struct - ``raw`` - - - - * - - - - __u32 - ``data``\ [16] - Reserved for future extensions. Drivers and applications must set the array to zero. + * - } + - diff --git a/Documentation/media/uapi/v4l/vidioc-dqevent.rst b/Documentation/media/uapi/v4l/vidioc-dqevent.rst index 42659a3d1705..2f37d255352a 100644 --- a/Documentation/media/uapi/v4l/vidioc-dqevent.rst +++ b/Documentation/media/uapi/v4l/vidioc-dqevent.rst @@ -55,66 +55,54 @@ call. .. flat-table:: struct v4l2_event :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 * - __u32 - ``type`` - - - Type of the event, see :ref:`event-type`. 
- * - union + * - union { - ``u`` - - - - - * - - - struct :c:type:`v4l2_event_vsync` + * - struct :c:type:`v4l2_event_vsync` - ``vsync`` - Event data for event ``V4L2_EVENT_VSYNC``. - * - - - struct :c:type:`v4l2_event_ctrl` + * - struct :c:type:`v4l2_event_ctrl` - ``ctrl`` - Event data for event ``V4L2_EVENT_CTRL``. - * - - - struct :c:type:`v4l2_event_frame_sync` + * - struct :c:type:`v4l2_event_frame_sync` - ``frame_sync`` - Event data for event ``V4L2_EVENT_FRAME_SYNC``. - * - - - struct :c:type:`v4l2_event_motion_det` + * - struct :c:type:`v4l2_event_motion_det` - ``motion_det`` - Event data for event V4L2_EVENT_MOTION_DET. - * - - - struct :c:type:`v4l2_event_src_change` + * - struct :c:type:`v4l2_event_src_change` - ``src_change`` - Event data for event V4L2_EVENT_SOURCE_CHANGE. - * - - - __u8 + * - __u8 - ``data``\ [64] - Event data. Defined by the event type. The union should be used to define easily accessible type for events. + * - } + - * - __u32 - ``pending`` - - - Number of pending events excluding this one. * - __u32 - ``sequence`` - - - Event sequence number. The sequence number is incremented for every subscribed event that takes place. If sequence numbers are not contiguous it means that events have been lost. * - struct timespec - ``timestamp`` - - - Event timestamp. The timestamp has been taken from the ``CLOCK_MONOTONIC`` clock. To access the same clock outside V4L2, use :c:func:`clock_gettime`. * - u32 - ``id`` - - - The ID associated with the event source. If the event does not have an associated ID (this depends on the event type), then this is 0. * - __u32 - ``reserved``\ [8] - - - Reserved for future extensions. Drivers must set the array to zero. @@ -233,54 +221,45 @@ call. .. flat-table:: struct v4l2_event_ctrl :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 * - __u32 - ``changes`` - - - A bitmask that tells what has changed. See :ref:`ctrl-changes-flags`. * - __u32 - ``type`` - - - The type of the control. 
See enum :c:type:`v4l2_ctrl_type`. - * - union (anonymous) - - - - - - - * - - - __s32 + * - union { + - (anonymous) + * - __s32 - ``value`` - The 32-bit value of the control for 32-bit control types. This is 0 for string controls since the value of a string cannot be passed using :ref:`VIDIOC_DQEVENT`. - * - - - __s64 + * - __s64 - ``value64`` - The 64-bit value of the control for 64-bit control types. + * - } + - * - __u32 - ``flags`` - - - The control flags. See :ref:`control-flags`. * - __s32 - ``minimum`` - - - The minimum value of the control. See struct :ref:`v4l2_queryctrl <v4l2-queryctrl>`. * - __s32 - ``maximum`` - - - The maximum value of the control. See struct :ref:`v4l2_queryctrl <v4l2-queryctrl>`. * - __s32 - ``step`` - - - The step value of the control. See struct :ref:`v4l2_queryctrl <v4l2-queryctrl>`. * - __s32 - ``default_value`` - - - The default value value of the control. See struct :ref:`v4l2_queryctrl <v4l2-queryctrl>`. diff --git a/Documentation/media/uapi/v4l/vidioc-dv-timings-cap.rst b/Documentation/media/uapi/v4l/vidioc-dv-timings-cap.rst index e62d45d37072..1d0acbf14c4f 100644 --- a/Documentation/media/uapi/v4l/vidioc-dv-timings-cap.rst +++ b/Documentation/media/uapi/v4l/vidioc-dv-timings-cap.rst @@ -112,7 +112,7 @@ that doesn't support them will return an ``EINVAL`` error code. .. flat-table:: struct v4l2_dv_timings_cap :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 * - __u32 - ``type`` @@ -127,16 +127,14 @@ that doesn't support them will return an ``EINVAL`` error code. - Reserved for future extensions. Drivers and applications must set the array to zero. - * - union - - - - - * - - - struct :c:type:`v4l2_bt_timings_cap` + * - union { + - (anonymous) + * - struct :c:type:`v4l2_bt_timings_cap` - ``bt`` - BT.656/1120 timings capabilities of the hardware. - * - - - __u32 + * - __u32 - ``raw_data``\ [32] + * - } - .. 
tabularcolumns:: |p{7.0cm}|p{10.5cm}| diff --git a/Documentation/media/uapi/v4l/vidioc-enum-frameintervals.rst b/Documentation/media/uapi/v4l/vidioc-enum-frameintervals.rst index 2c69f26b165d..563a67cddeca 100644 --- a/Documentation/media/uapi/v4l/vidioc-enum-frameintervals.rst +++ b/Documentation/media/uapi/v4l/vidioc-enum-frameintervals.rst @@ -138,36 +138,31 @@ application should zero out all members except for the *IN* fields. * - __u32 - ``index`` - - - IN: Index of the given frame interval in the enumeration. * - __u32 - ``pixel_format`` - - - IN: Pixel format for which the frame intervals are enumerated. * - __u32 - ``width`` - - - IN: Frame width for which the frame intervals are enumerated. * - __u32 - ``height`` - - - IN: Frame height for which the frame intervals are enumerated. * - __u32 - ``type`` - - - OUT: Frame interval type the device supports. - * - union - - - - + * - union { + - (anonymous) - OUT: Frame interval with the given index. - * - - - struct :c:type:`v4l2_fract` + * - struct :c:type:`v4l2_fract` - ``discrete`` - Frame interval [s]. - * - - - struct :c:type:`v4l2_frmival_stepwise` + * - struct :c:type:`v4l2_frmival_stepwise` - ``stepwise`` - + * - } + - + - * - __u32 - ``reserved[2]`` - diff --git a/Documentation/media/uapi/v4l/vidioc-enum-framesizes.rst b/Documentation/media/uapi/v4l/vidioc-enum-framesizes.rst index cf31f548826f..cd97546a7122 100644 --- a/Documentation/media/uapi/v4l/vidioc-enum-framesizes.rst +++ b/Documentation/media/uapi/v4l/vidioc-enum-framesizes.rst @@ -155,31 +155,27 @@ application should zero out all members except for the *IN* fields. * - __u32 - ``index`` - - - IN: Index of the given frame size in the enumeration. * - __u32 - ``pixel_format`` - - - IN: Pixel format for which the frame sizes are enumerated. * - __u32 - ``type`` - - - OUT: Frame size type the device supports. - * - union - - - - + * - union { + - (anonymous) - OUT: Frame size with the given index. 
- * - - - struct :c:type:`v4l2_frmsize_discrete` + * - struct :c:type:`v4l2_frmsize_discrete` - ``discrete`` - - * - - - struct :c:type:`v4l2_frmsize_stepwise` + * - struct :c:type:`v4l2_frmsize_stepwise` - ``stepwise`` - + * - } + - + - * - __u32 - ``reserved[2]`` - - - Reserved space for future use. Must be zeroed by drivers and applications. diff --git a/Documentation/media/uapi/v4l/vidioc-g-dv-timings.rst b/Documentation/media/uapi/v4l/vidioc-g-dv-timings.rst index 5c675cbac4cf..e36dd2622857 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-dv-timings.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-dv-timings.rst @@ -179,23 +179,21 @@ EBUSY .. flat-table:: struct v4l2_dv_timings :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 * - __u32 - ``type`` - - - Type of DV timings as listed in :ref:`dv-timing-types`. - * - union - - - - - * - - - struct :c:type:`v4l2_bt_timings` + * - union { + - (anonymous) + * - struct :c:type:`v4l2_bt_timings` - ``bt`` - Timings defined by BT.656/1120 specifications - * - - - __u32 + * - __u32 - ``reserved``\ [32] - + * - } + - .. tabularcolumns:: |p{4.4cm}|p{4.4cm}|p{8.7cm}| diff --git a/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst b/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst index 271cac18afbb..cdb2a2a512d6 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-ext-ctrls.rst @@ -136,15 +136,13 @@ still cause this situation. .. flat-table:: struct v4l2_ext_control :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u32 - ``id`` - - - Identifies the control, set by the application. * - __u32 - ``size`` - - - The total size in bytes of the payload of this control. This is normally 0, but for pointer controls this should be set to the size of the memory containing the payload, or that will receive @@ -161,55 +159,48 @@ still cause this situation. *length* of the string may well be much smaller. 
* - __u32 - ``reserved2``\ [1] - - - Reserved for future extensions. Drivers and applications must set the array to zero. - * - union + * - union { - (anonymous) - * - - - __s32 + * - __s32 - ``value`` - New value or current value. Valid if this control is not of type ``V4L2_CTRL_TYPE_INTEGER64`` and ``V4L2_CTRL_FLAG_HAS_PAYLOAD`` is not set. - * - - - __s64 + * - __s64 - ``value64`` - New value or current value. Valid if this control is of type ``V4L2_CTRL_TYPE_INTEGER64`` and ``V4L2_CTRL_FLAG_HAS_PAYLOAD`` is not set. - * - - - char * + * - char * - ``string`` - A pointer to a string. Valid if this control is of type ``V4L2_CTRL_TYPE_STRING``. - * - - - __u8 * + * - __u8 * - ``p_u8`` - A pointer to a matrix control of unsigned 8-bit values. Valid if this control is of type ``V4L2_CTRL_TYPE_U8``. - * - - - __u16 * + * - __u16 * - ``p_u16`` - A pointer to a matrix control of unsigned 16-bit values. Valid if this control is of type ``V4L2_CTRL_TYPE_U16``. - * - - - __u32 * + * - __u32 * - ``p_u32`` - A pointer to a matrix control of unsigned 32-bit values. Valid if this control is of type ``V4L2_CTRL_TYPE_U32``. - * - - - :c:type:`v4l2_area` * + * - :c:type:`v4l2_area` * - ``p_area`` - A pointer to a struct :c:type:`v4l2_area`. Valid if this control is of type ``V4L2_CTRL_TYPE_AREA``. - * - - - void * + * - void * - ``ptr`` - A pointer to a compound type which can be an N-dimensional array and/or a compound type (the control's type is >= ``V4L2_CTRL_COMPOUND_TYPES``). Valid if ``V4L2_CTRL_FLAG_HAS_PAYLOAD`` is set for this control. + * - } + - .. tabularcolumns:: |p{4.0cm}|p{2.2cm}|p{2.1cm}|p{8.2cm}| @@ -221,12 +212,11 @@ still cause this situation. .. flat-table:: struct v4l2_ext_controls :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 - * - union + * - union { - (anonymous) - * - - - __u32 + * - __u32 - ``ctrl_class`` - The control class to which all controls belong, see :ref:`ctrl-class`. 
Drivers that use a kernel framework for @@ -235,8 +225,7 @@ still cause this situation. support this can be tested by setting ``ctrl_class`` to 0 and calling :ref:`VIDIOC_TRY_EXT_CTRLS <VIDIOC_G_EXT_CTRLS>` with a ``count`` of 0. If that succeeds, then the driver supports this feature. - * - - - __u32 + * - __u32 - ``which`` - Which value of the control to get/set/try. ``V4L2_CTRL_WHICH_CUR_VAL`` will return the current value of the @@ -261,6 +250,8 @@ still cause this situation. by setting ctrl_class to ``V4L2_CTRL_WHICH_CUR_VAL`` and calling VIDIOC_TRY_EXT_CTRLS with a count of 0. If that fails, then the driver does not support ``V4L2_CTRL_WHICH_CUR_VAL``. + * - } + - * - __u32 - ``count`` - The number of controls in the controls array. May also be zero. diff --git a/Documentation/media/uapi/v4l/vidioc-g-fmt.rst b/Documentation/media/uapi/v4l/vidioc-g-fmt.rst index e35a9caff652..1e69bfc46e8d 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-fmt.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-fmt.rst @@ -103,51 +103,44 @@ The format as returned by :ref:`VIDIOC_TRY_FMT <VIDIOC_G_FMT>` must be identical * - __u32 - ``type`` - - - Type of the data stream, see :c:type:`v4l2_buf_type`. - * - union + * - union { - ``fmt`` - * - - - struct :c:type:`v4l2_pix_format` + * - struct :c:type:`v4l2_pix_format` - ``pix`` - Definition of an image format, see :ref:`pixfmt`, used by video capture and output devices. - * - - - struct :c:type:`v4l2_pix_format_mplane` + * - struct :c:type:`v4l2_pix_format_mplane` - ``pix_mp`` - Definition of an image format, see :ref:`pixfmt`, used by video capture and output devices that support the :ref:`multi-planar version of the API <planar-apis>`. - * - - - struct :c:type:`v4l2_window` + * - struct :c:type:`v4l2_window` - ``win`` - Definition of an overlaid image, see :ref:`overlay`, used by video overlay devices. - * - - - struct :c:type:`v4l2_vbi_format` + * - struct :c:type:`v4l2_vbi_format` - ``vbi`` - Raw VBI capture or output parameters. 
This is discussed in more detail in :ref:`raw-vbi`. Used by raw VBI capture and output devices. - * - - - struct :c:type:`v4l2_sliced_vbi_format` + * - struct :c:type:`v4l2_sliced_vbi_format` - ``sliced`` - Sliced VBI capture or output parameters. See :ref:`sliced` for details. Used by sliced VBI capture and output devices. - * - - - struct :c:type:`v4l2_sdr_format` + * - struct :c:type:`v4l2_sdr_format` - ``sdr`` - Definition of a data format, see :ref:`pixfmt`, used by SDR capture and output devices. - * - - - struct :c:type:`v4l2_meta_format` + * - struct :c:type:`v4l2_meta_format` - ``meta`` - Definition of a metadata format, see :ref:`meta-formats`, used by metadata capture devices. - * - - - __u8 + * - __u8 - ``raw_data``\ [200] - Place holder for future extensions. + * - } + - Return Value diff --git a/Documentation/media/uapi/v4l/vidioc-g-parm.rst b/Documentation/media/uapi/v4l/vidioc-g-parm.rst index d9d5d97848d3..044a459e073f 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-parm.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-parm.rst @@ -69,33 +69,29 @@ union holding separate parameters for input and output devices. .. flat-table:: struct v4l2_streamparm :header-rows: 0 :stub-columns: 0 - :widths: 1 1 1 2 + :widths: 1 1 2 * - __u32 - ``type`` - - - The buffer (stream) type, same as struct :c:type:`v4l2_format` ``type``, set by the application. See :c:type:`v4l2_buf_type`. - * - union + * - union { - ``parm`` - - - - - * - - - struct :c:type:`v4l2_captureparm` + * - struct :c:type:`v4l2_captureparm` - ``capture`` - Parameters for capture devices, used when ``type`` is ``V4L2_BUF_TYPE_VIDEO_CAPTURE`` or ``V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE``. - * - - - struct :c:type:`v4l2_outputparm` + * - struct :c:type:`v4l2_outputparm` - ``output`` - Parameters for output devices, used when ``type`` is ``V4L2_BUF_TYPE_VIDEO_OUTPUT`` or ``V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE``. - * - - - __u8 + * - __u8 - ``raw_data``\ [200] - A place holder for future extensions. 
+ * - } + - diff --git a/Documentation/media/uapi/v4l/vidioc-queryctrl.rst b/Documentation/media/uapi/v4l/vidioc-queryctrl.rst index 6690928e657b..8971f4cfb16e 100644 --- a/Documentation/media/uapi/v4l/vidioc-queryctrl.rst +++ b/Documentation/media/uapi/v4l/vidioc-queryctrl.rst @@ -290,34 +290,29 @@ See also the examples in :ref:`control`. .. flat-table:: struct v4l2_querymenu :header-rows: 0 :stub-columns: 0 - :widths: 1 1 2 1 + :widths: 1 1 2 * - __u32 - - - ``id`` - Identifies the control, set by the application from the respective struct :ref:`v4l2_queryctrl <v4l2-queryctrl>` ``id``. * - __u32 - - - ``index`` - Index of the menu item, starting at zero, set by the application. - * - union - - - - - - - * - - - __u8 + * - union { + - (anonymous) + * - __u8 - ``name``\ [32] - Name of the menu item, a NUL-terminated ASCII string. This information is intended for the user. This field is valid for ``V4L2_CTRL_TYPE_MENU`` type controls. - * - - - __s64 + * - __s64 - ``value`` - Value of the integer menu item. This field is valid for ``V4L2_CTRL_TYPE_INTEGER_MENU`` type controls. - * - __u32 + * - } - + * - __u32 - ``reserved`` - Reserved for future extensions. Drivers must set the array to zero. @@ -378,7 +373,7 @@ See also the examples in :ref:`control`. - 0 - 0 - A control which performs an action when set. Drivers must ignore - the value passed with ``VIDIOC_S_CTRL`` and return an ``EINVAL`` error + the value passed with ``VIDIOC_S_CTRL`` and return an ``EACCES`` error code on a ``VIDIOC_G_CTRL`` attempt. * - ``V4L2_CTRL_TYPE_INTEGER64`` - any diff --git a/Documentation/media/uapi/v4l/yuv-formats.rst b/Documentation/media/uapi/v4l/yuv-formats.rst index 867470e5f9e1..3b259e31b7a1 100644 --- a/Documentation/media/uapi/v4l/yuv-formats.rst +++ b/Documentation/media/uapi/v4l/yuv-formats.rst @@ -35,6 +35,7 @@ to brightness information. 
pixfmt-grey pixfmt-y10 pixfmt-y12 + pixfmt-y14 pixfmt-y10b pixfmt-y10p pixfmt-y16 diff --git a/Documentation/media/v4l-drivers/ipu3.rst b/Documentation/media/v4l-drivers/ipu3.rst index e4904ab44e60..a694f49491f9 100644 --- a/Documentation/media/v4l-drivers/ipu3.rst +++ b/Documentation/media/v4l-drivers/ipu3.rst @@ -311,10 +311,13 @@ Down Scaler and GDC blocks should be configured with the supported resolutions as each hardware block has its own alignment requirement. You must configure the output resolution of the hardware blocks smartly to meet -the hardware requirement along with keeping the maximum field of view. -The intermediate resolutions can be generated by specific tool and this -information can be obtained by looking at the following IPU3 ImgU configuration -table. +the hardware requirement along with keeping the maximum field of view. The +intermediate resolutions can be generated by specific tool - + +https://github.com/intel/intel-ipu3-pipecfg + +This tool can be used to generate intermediate resolutions. More information can +be obtained by looking at the following IPU3 ImgU configuration table. https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master diff --git a/Documentation/media/v4l-drivers/vivid.rst b/Documentation/media/v4l-drivers/vivid.rst index 7082fec4075d..52e57b773f07 100644 --- a/Documentation/media/v4l-drivers/vivid.rst +++ b/Documentation/media/v4l-drivers/vivid.rst @@ -4,9 +4,9 @@ The Virtual Video Test Driver (vivid) ===================================== This driver emulates video4linux hardware of various types: video capture, video -output, vbi capture and output, radio receivers and transmitters and a software -defined radio receiver. In addition a simple framebuffer device is available for -testing capture and output overlays. +output, vbi capture and output, metadata capture and output, radio receivers and +transmitters, touch capture and a software defined radio receiver. 
In addition a +simple framebuffer device is available for testing capture and output overlays. Up to 64 vivid instances can be created, each with up to 16 inputs and 16 outputs. @@ -36,6 +36,8 @@ This document describes the features implemented by this driver: - Radio receiver and transmitter support, including RDS support - Software defined radio (SDR) support - Capture and output overlay support +- Metadata capture and output support +- Touch capture support These features will be described in more detail below. @@ -69,6 +71,9 @@ all configurable using the following module options: - bit 10-11: VBI Output node: 0 = none, 1 = raw vbi, 2 = sliced vbi, 3 = both - bit 12: Radio Transmitter node - bit 16: Framebuffer for testing overlays + - bit 17: Metadata Capture node + - bit 18: Metadata Output node + - bit 19: Touch Capture node So to create four instances, the first two with just one video capture device, the second two with just one video output device you would pass @@ -175,6 +180,21 @@ all configurable using the following module options: give the desired swradioX start number for each SDR capture device. The default is -1 which will just take the first free number. +- meta_cap_nr: + + give the desired videoX start number for each metadata capture device. + The default is -1 which will just take the first free number. + +- meta_out_nr: + + give the desired videoX start number for each metadata output device. + The default is -1 which will just take the first free number. + +- touch_cap_nr: + + give the desired v4l-touchX start number for each touch capture device. + The default is -1 which will just take the first free number. + - ccs_cap_mode: specify the allowed video capture crop/compose/scaling combination @@ -547,6 +567,33 @@ The generated data contains the In-phase and Quadrature components of a 1 kHz tone that has an amplitude of sqrt(2). +Metadata Capture +---------------- + +The Metadata capture generates UVC format metadata. 
The PTS and SCR are +transmitted based on the values set in vivid controls. + +The Metadata device will only work for the Webcam input; it will give +back an error for all other inputs. + + +Metadata Output +--------------- + +The Metadata output can be used to set brightness, contrast, saturation and hue. + +The Metadata device will only work for the Webcam output; it will give +back an error for all other outputs. + + +Touch Capture +------------- + +The Touch capture generates touch patterns simulating single tap, double tap, +triple tap, move from left to right, zoom in, zoom out, palm press (simulating +a large area being pressed on a touchpad), and simulating 16 simultaneous +touch points. + Controls -------- @@ -1049,6 +1096,16 @@ FM Radio Modulator Controls to pass the RDS blocks to the driver, or "Controls" where the RDS data is Provided by the RDS controls mentioned above. +Metadata Capture Controls +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- Generate PTS + + if set, then the generated metadata stream contains Presentation timestamp. + +- Generate SCR + + if set, then the generated metadata stream contains Source Clock information. Video, VBI and RDS Looping -------------------------- diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 7146da061693..e1c355e84edd 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -185,7 +185,7 @@ As a further example, consider this sequence of events: =============== =============== { A == 1, B == 2, C == 3, P == &A, Q == &C } B = 4; Q = P; - P = &B D = *Q; + P = &B; D = *Q; There is an obvious data dependency here, as the value loaded into D depends on the address retrieved from P by CPU 2.
At the end of the sequence, any of the @@ -569,7 +569,7 @@ following sequence of events: { A == 1, B == 2, C == 3, P == &A, Q == &C } B = 4; <write barrier> - WRITE_ONCE(P, &B) + WRITE_ONCE(P, &B); Q = READ_ONCE(P); D = *Q; @@ -1721,7 +1721,7 @@ of optimizations: and WRITE_ONCE() are more selective: With READ_ONCE() and WRITE_ONCE(), the compiler need only forget the contents of the indicated memory locations, while with barrier() the compiler must - discard the value of all memory locations that it has currented + discard the value of all memory locations that it has currently cached in any machine registers. Of course, the compiler must also respect the order in which the READ_ONCE()s and WRITE_ONCE()s occur, though the CPU of course need not do so. @@ -1833,7 +1833,7 @@ Aside: In the case of data dependencies, the compiler would be expected to issue the loads in the correct order (eg. `a[b]` would have to load the value of b before loading a[b]), however there is no guarantee in the C specification that the compiler may not speculate the value of b -(eg. is equal to 1) and load a before b (eg. tmp = a[1]; if (b != 1) +(eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the problem of a compiler reloading b after having loaded a[b], thus having a newer copy of b than a[b]. A consensus has not yet been reached about these problems, however the READ_ONCE() diff --git a/Documentation/mips/au1xxx_ide.rst b/Documentation/mips/au1xxx_ide.rst deleted file mode 100644 index 2f9c2cff6738..000000000000 --- a/Documentation/mips/au1xxx_ide.rst +++ /dev/null @@ -1,130 +0,0 @@ -.. include:: <isonum.txt> - -====================== -MIPS AU1XXX IDE driver -====================== - -Released 2005-07-15 - -About -===== - -This file describes the 'drivers/ide/au1xxx-ide.c', related files and the -services they provide. 
- -If you are short in patience and just want to know how to add your hard disc to -the white or black list, go to the 'ADD NEW HARD DISC TO WHITE OR BLACK LIST' -section. - - -License -======= - -:Copyright: |copy| 2003-2005 AMD, Personal Connectivity Solutions - -This program is free software; you can redistribute it and/or modify it under -the terms of the GNU General Public License as published by the Free Software -Foundation; either version 2 of the License, or (at your option) any later -version. - -THIS SOFTWARE IS PROVIDED ``AS IS`` AND ANY EXPRESS OR IMPLIED WARRANTIES, -INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND -FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR -BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. - -You should have received a copy of the GNU General Public License along with -this program; if not, write to the Free Software Foundation, Inc., -675 Mass Ave, Cambridge, MA 02139, USA. - -Note: - for more information, please refer "AMD Alchemy Au1200/Au1550 IDE - Interface and Linux Device Driver" Application Note. 
- - -Files, Configs and Compatibility -================================ - -Two files are introduced: - - a) 'arch/mips/include/asm/mach-au1x00/au1xxx_ide.h' - contains : struct _auide_hwif - - - timing parameters for PIO mode 0/1/2/3/4 - - timing parameters for MWDMA 0/1/2 - - b) 'drivers/ide/mips/au1xxx-ide.c' - contains the functionality of the AU1XXX IDE driver - -Following extra configs variables are introduced: - - CONFIG_BLK_DEV_IDE_AU1XXX_PIO_DBDMA - - enable the PIO+DBDMA mode - CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA - - enable the MWDMA mode - - -Supported IDE Modes -=================== - -The AU1XXX IDE driver supported all PIO modes - PIO mode 0/1/2/3/4 - and all -MWDMA modes - MWDMA 0/1/2 -. There is no support for SWDMA and UDMA mode. - -To change the PIO mode use the program hdparm with option -p, e.g. -'hdparm -p0 [device]' for PIO mode 0. To enable the MWDMA mode use the option --X, e.g. 'hdparm -X32 [device]' for MWDMA mode 0. - - -Performance Configurations -========================== - -If the used system doesn't need USB support enable the following kernel -configs:: - - CONFIG_IDE=y - CONFIG_BLK_DEV_IDE=y - CONFIG_IDE_GENERIC=y - CONFIG_BLK_DEV_IDEPCI=y - CONFIG_BLK_DEV_GENERIC=y - CONFIG_BLK_DEV_IDEDMA_PCI=y - CONFIG_BLK_DEV_IDE_AU1XXX=y - CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y - CONFIG_BLK_DEV_IDEDMA=y - -Also define 'IDE_AU1XXX_BURSTMODE' in 'drivers/ide/mips/au1xxx-ide.c' to enable -the burst support on DBDMA controller. - -If the used system need the USB support enable the following kernel configs for -high IDE to USB throughput. - -:: - - CONFIG_IDE_GENERIC=y - CONFIG_BLK_DEV_IDEPCI=y - CONFIG_BLK_DEV_GENERIC=y - CONFIG_BLK_DEV_IDEDMA_PCI=y - CONFIG_BLK_DEV_IDE_AU1XXX=y - CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y - CONFIG_BLK_DEV_IDEDMA=y - -Also undefine 'IDE_AU1XXX_BURSTMODE' in 'drivers/ide/mips/au1xxx-ide.c' to -disable the burst support on DBDMA controller. 
- - -Acknowledgments -=============== - -These drivers wouldn't have been done without the base of kernel 2.4.x AU1XXX -IDE driver from AMD. - -Additional input also from: -Matthias Lenk <matthias.lenk@amd.com> - -Happy hacking! - -Enrico Walther <enrico.walther@amd.com> diff --git a/Documentation/mips/index.rst b/Documentation/mips/index.rst index a93c2f65884c..d5ad8c00f0bd 100644 --- a/Documentation/mips/index.rst +++ b/Documentation/mips/index.rst @@ -10,8 +10,6 @@ MIPS-specific Documentation ingenic-tcu - au1xxx_ide - .. only:: subproject and html Indices diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst index f11c5daeada5..c1dcd2628911 100644 --- a/Documentation/misc-devices/index.rst +++ b/Documentation/misc-devices/index.rst @@ -20,4 +20,5 @@ fit into other categories. isl29003 lis3lv02d max6875 + mic/index xilinx_sdfec diff --git a/Documentation/mic/index.rst b/Documentation/misc-devices/mic/index.rst index 3a8d06367ef1..3a8d06367ef1 100644 --- a/Documentation/mic/index.rst +++ b/Documentation/misc-devices/mic/index.rst diff --git a/Documentation/mic/mic_overview.rst b/Documentation/misc-devices/mic/mic_overview.rst index 17d956bdaf7c..17d956bdaf7c 100644 --- a/Documentation/mic/mic_overview.rst +++ b/Documentation/misc-devices/mic/mic_overview.rst diff --git a/Documentation/mic/scif_overview.rst b/Documentation/misc-devices/mic/scif_overview.rst index 4c8ad9e43706..4c8ad9e43706 100644 --- a/Documentation/mic/scif_overview.rst +++ b/Documentation/misc-devices/mic/scif_overview.rst diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst index 1a7683e7acb2..8b46e8591fe0 100644 --- a/Documentation/networking/devlink/devlink-region.rst +++ b/Documentation/networking/devlink/devlink-region.rst @@ -40,9 +40,6 @@ example usage # Delete a snapshot using: $ devlink region del pci/0000:00:05.0/cr-space snapshot 1 - # Trigger (request) a snapshot be taken: - $ devlink 
region trigger pci/0000:00:05.0/cr-space - # Dump a snapshot: $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst index 06c97dcb57ca..e143ab79a960 100644 --- a/Documentation/networking/net_failover.rst +++ b/Documentation/networking/net_failover.rst @@ -8,9 +8,9 @@ Overview ======== The net_failover driver provides an automated failover mechanism via APIs -to create and destroy a failover master netdev and mananges a primary and +to create and destroy a failover master netdev and manages a primary and standby slave netdevs that get registered via the generic failover -infrastructrure. +infrastructure. The failover netdev acts a master device and controls 2 slave devices. The original paravirtual interface is registered as 'standby' slave netdev and @@ -29,7 +29,7 @@ virtio-net accelerated datapath: STANDBY mode ============================================= net_failover enables hypervisor controlled accelerated datapath to virtio-net -enabled VMs in a transparent manner with no/minimal guest userspace chanages. +enabled VMs in a transparent manner with no/minimal guest userspace changes. To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY feature on the virtio-net interface and assign the same MAC address to both diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 1e4735cc0553..256106054c8c 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -487,8 +487,9 @@ phy_register_fixup_for_id():: The stubs set one of the two matching criteria, and set the other one to match anything. -When phy_register_fixup() or \*_for_uid()/\*_for_id() is called at module, -unregister fixup and free allocate memory are required. 
+When phy_register_fixup() or \*_for_uid()/\*_for_id() is called at module load +time, the module needs to unregister the fixup and free allocated memory when +it's unloaded. Call one of following function before unloading module:: diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index f2a0147c933d..eec61694e894 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -159,7 +159,7 @@ Socket Interface set SO_RDS_TRANSPORT on a socket for which the transport has been previously attached explicitly (by SO_RDS_TRANSPORT) or implicitly (via bind(2)) will return an error of EOPNOTSUPP. - An attempt to set SO_RDS_TRANSPPORT to RDS_TRANS_NONE will + An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will always return EINVAL. RDMA for RDS diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst index 38a4edc4522b..10e11099e74a 100644 --- a/Documentation/networking/snmp_counter.rst +++ b/Documentation/networking/snmp_counter.rst @@ -908,8 +908,8 @@ A TLP probe packet is sent. A packet loss is detected and recovered by TLP. -TCP Fast Open -============= +TCP Fast Open description +========================= TCP Fast Open is a technology which allows data transfer before the 3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a general description. 
diff --git a/Documentation/power/index.rst b/Documentation/power/index.rst index 002e42745263..ced8a8007434 100644 --- a/Documentation/power/index.rst +++ b/Documentation/power/index.rst @@ -13,7 +13,6 @@ Power Management drivers-testing energy-model freezing-of-tasks - interface opp pci pm_qos_interface diff --git a/Documentation/power/pm_qos_interface.rst b/Documentation/power/pm_qos_interface.rst index 0d62d506caf0..69b0fe3e2542 100644 --- a/Documentation/power/pm_qos_interface.rst +++ b/Documentation/power/pm_qos_interface.rst @@ -7,86 +7,78 @@ performance expectations by drivers, subsystems and user space applications on one of the parameters. Two different PM QoS frameworks are available: -1. PM QoS classes for cpu_dma_latency -2. The per-device PM QoS framework provides the API to manage the + * CPU latency QoS. + * The per-device PM QoS framework provides the API to manage the per-device latency constraints and PM QoS flags. -Each parameters have defined units: - - * latency: usec - * timeout: usec - * throughput: kbs (kilo bit / sec) - * memory bandwidth: mbs (mega bit / sec) +The latency unit used in the PM QoS framework is the microsecond (usec). 1. PM QoS framework =================== -The infrastructure exposes multiple misc device nodes one per implemented -parameter. The set of parameters implement is defined by pm_qos_power_init() -and pm_qos_params.h. This is done because having the available parameters -being runtime configurable or changeable from a driver was seen as too easy to -abuse. - -For each parameter a list of performance requests is maintained along with -an aggregated target value. The aggregated target value is updated with -changes to the request list or elements of the list. Typically the -aggregated target value is simply the max or min of the request values held -in the parameter list elements. +A global list of CPU latency QoS requests is maintained along with an aggregated +(effective) target value. 
The aggregated target value is updated with changes +to the request list or elements of the list. For CPU latency QoS, the +aggregated target value is simply the min of the request values held in the list +elements. + Note: the aggregated target value is implemented as an atomic variable so that reading the aggregated value does not require any locking mechanism. +From kernel space the use of this interface is simple: -From kernel mode the use of this interface is simple: - -void pm_qos_add_request(handle, param_class, target_value): - Will insert an element into the list for that identified PM QoS class with the - target value. Upon change to this list the new target is recomputed and any - registered notifiers are called only if the target value is now different. - Clients of pm_qos need to save the returned handle for future use in other - pm_qos API functions. +void cpu_latency_qos_add_request(handle, target_value): + Will insert an element into the CPU latency QoS list with the target value. + Upon change to this list the new target is recomputed and any registered + notifiers are called only if the target value is now different. + Clients of PM QoS need to save the returned handle for future use in other + PM QoS API functions. -void pm_qos_update_request(handle, new_target_value): +void cpu_latency_qos_update_request(handle, new_target_value): Will update the list element pointed to by the handle with the new target value and recompute the new aggregated target, calling the notification tree if the target is changed. -void pm_qos_remove_request(handle): +void cpu_latency_qos_remove_request(handle): Will remove the element. After removal it will update the aggregate target and call the notification tree if the target was changed as a result of removing the request. -int pm_qos_request(param_class): - Returns the aggregated value for a given PM QoS class. +int cpu_latency_qos_limit(): + Returns the aggregated value for the CPU latency QoS. 
+ +int cpu_latency_qos_request_active(handle): + Returns if the request is still active, i.e. it has not been removed from the + CPU latency QoS list. -int pm_qos_request_active(handle): - Returns if the request is still active, i.e. it has not been removed from a - PM QoS class constraints list. +int cpu_latency_qos_add_notifier(notifier): + Adds a notification callback function to the CPU latency QoS. The callback is + called when the aggregated value for the CPU latency QoS is changed. -int pm_qos_add_notifier(param_class, notifier): - Adds a notification callback function to the PM QoS class. The callback is - called when the aggregated value for the PM QoS class is changed. +int cpu_latency_qos_remove_notifier(notifier): + Removes the notification callback function from the CPU latency QoS. -int pm_qos_remove_notifier(int param_class, notifier): - Removes the notification callback function for the PM QoS class. +From user space: -From user mode: +The infrastructure exposes one device node, /dev/cpu_dma_latency, for the CPU +latency QoS. -Only processes can register a pm_qos request. To provide for automatic +Only processes can register a PM QoS request. To provide for automatic cleanup of a process, the interface requires the process to register its -parameter requests in the following way: +parameter requests as follows. -To register the default pm_qos target for the specific parameter, the process -must open /dev/cpu_dma_latency +To register the default PM QoS target for the CPU latency QoS, the process must +open /dev/cpu_dma_latency. As long as the device node is held open that process has a registered request on the parameter. -To change the requested target value the process needs to write an s32 value to -the open device node. Alternatively the user mode program could write a hex -string for the value using 10 char long format e.g. "0x12345678". This -translates to a pm_qos_update_request call. 
+To change the requested target value, the process needs to write an s32 value to +the open device node. Alternatively, it can write a hex string for the value +using the 10 char long format e.g. "0x12345678". This translates to a +cpu_latency_qos_update_request() call. To remove the user mode request for a target value simply close the device node. diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst index ab8406c84254..0553008b6279 100644 --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -382,6 +382,12 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: nonzero, increment the counter and return 1; otherwise return 0 without changing the counter + `int pm_runtime_get_if_active(struct device *dev, bool ign_usage_count);` + - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the + runtime PM status is RPM_ACTIVE, and either ign_usage_count is true + or the device's usage_count is non-zero, increment the counter and + return 1; otherwise return 0 without changing the counter + `void pm_runtime_put_noidle(struct device *dev);` - decrement the device's usage counter diff --git a/Documentation/power/userland-swsusp.rst b/Documentation/power/userland-swsusp.rst index a0fa51bb1a4d..1cf62d80a9ca 100644 --- a/Documentation/power/userland-swsusp.rst +++ b/Documentation/power/userland-swsusp.rst @@ -69,11 +69,13 @@ SNAPSHOT_PREF_IMAGE_SIZE SNAPSHOT_GET_IMAGE_SIZE return the actual size of the hibernation image + (the last argument should be a pointer to a loff_t variable that + will contain the result if the call is successful) SNAPSHOT_AVAIL_SWAP_SIZE - return the amount of available swap in bytes (the - last argument should be a pointer to an unsigned int variable that will - contain the result if the call is successful). 
+ return the amount of available swap in bytes + (the last argument should be a pointer to a loff_t variable that + will contain the result if the call is successful) SNAPSHOT_ALLOC_SWAP_PAGE allocate a swap page from the resume partition diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst index 363736d7fd36..df136c8f91fa 100644 --- a/Documentation/powerpc/ultravisor.rst +++ b/Documentation/powerpc/ultravisor.rst @@ -8,8 +8,8 @@ Protected Execution Facility .. contents:: :depth: 3 -Protected Execution Facility -############################ +Introduction +############ Protected Execution Facility (PEF) is an architectural change for POWER 9 that enables Secure Virtual Machines (SVMs). DD2.3 chips diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst index ae020d84d7c4..b21b5b245d13 100644 --- a/Documentation/process/2.Process.rst +++ b/Documentation/process/2.Process.rst @@ -18,18 +18,18 @@ major kernel release happening every two or three months. The recent release history looks like this: ====== ================= - 4.11 April 30, 2017 - 4.12 July 2, 2017 - 4.13 September 3, 2017 - 4.14 November 12, 2017 - 4.15 January 28, 2018 - 4.16 April 1, 2018 + 5.0 March 3, 2019 + 5.1 May 5, 2019 + 5.2 July 7, 2019 + 5.3 September 15, 2019 + 5.4 November 24, 2019 + 5.5 January 26, 2020 ====== ================= -Every 4.x release is a major kernel release with new features, internal -API changes, and more. A typical 4.x release contain about 13,000 -changesets with changes to several hundred thousand lines of code. 4.x is -thus the leading edge of Linux kernel development; the kernel uses a +Every 5.x release is a major kernel release with new features, internal +API changes, and more. A typical release can contain about 13,000 +changesets with changes to several hundred thousand lines of code.
5.x is +the leading edge of Linux kernel development; the kernel uses a rolling development model which is continually integrating major changes. A relatively straightforward discipline is followed with regard to the @@ -48,9 +48,9 @@ detail later on). The merge window lasts for approximately two weeks. At the end of this time, Linus Torvalds will declare that the window is closed and release the -first of the "rc" kernels. For the kernel which is destined to be 2.6.40, +first of the "rc" kernels. For the kernel which is destined to be 5.6, for example, the release which happens at the end of the merge window will -be called 2.6.40-rc1. The -rc1 release is the signal that the time to +be called 5.6-rc1. The -rc1 release is the signal that the time to merge new features has passed, and that the time to stabilize the next kernel has begun. @@ -67,22 +67,23 @@ add at any time). As fixes make their way into the mainline, the patch rate will slow over time. Linus releases new -rc kernels about once a week; a normal series will get up to somewhere between -rc6 and -rc9 before the kernel is -considered to be sufficiently stable and the final 2.6.x release is made. +considered to be sufficiently stable and the final release is made. At that point the whole process starts over again. 
-As an example, here is how the 4.16 development cycle went (all dates in -2018): +As an example, here is how the 5.4 development cycle went (all dates in +2019): ============== =============================== - January 28 4.15 stable release - February 11 4.16-rc1, merge window closes - February 18 4.16-rc2 - February 25 4.16-rc3 - March 4 4.16-rc4 - March 11 4.16-rc5 - March 18 4.16-rc6 - March 25 4.16-rc7 - April 1 4.16 stable release + September 15 5.3 stable release + September 30 5.4-rc1, merge window closes + October 6 5.4-rc2 + October 13 5.4-rc3 + October 20 5.4-rc4 + October 27 5.4-rc5 + November 3 5.4-rc6 + November 10 5.4-rc7 + November 17 5.4-rc8 + November 24 5.4 stable release ============== =============================== How do the developers decide when to close the development cycle and create @@ -98,43 +99,44 @@ release is made. In the real world, this kind of perfection is hard to achieve; there are just too many variables in a project of this size. There comes a point where delaying the final release just makes the problem worse; the pile of changes waiting for the next merge window will grow -larger, creating even more regressions the next time around. So most 4.x +larger, creating even more regressions the next time around. So most 5.x kernels go out with a handful of known regressions though, hopefully, none of them are serious. Once a stable release is made, its ongoing maintenance is passed off to the -"stable team," currently consisting of Greg Kroah-Hartman. The stable team -will release occasional updates to the stable release using the 4.x.y -numbering scheme. To be considered for an update release, a patch must (1) -fix a significant bug, and (2) already be merged into the mainline for the -next development kernel. Kernels will typically receive stable updates for -a little more than one development cycle past their initial release. So, -for example, the 4.13 kernel's history looked like: +"stable team," currently Greg Kroah-Hartman. 
The stable team will release +occasional updates to the stable release using the 5.x.y numbering scheme. +To be considered for an update release, a patch must (1) fix a significant +bug, and (2) already be merged into the mainline for the next development +kernel. Kernels will typically receive stable updates for a little more +than one development cycle past their initial release. So, for example, the +5.2 kernel's history looked like this (all dates in 2019): ============== =============================== - September 3 4.13 stable release - September 13 4.13.1 - September 20 4.13.2 - September 27 4.13.3 - October 5 4.13.4 - October 12 4.13.5 + July 7 5.2 stable release + July 14 5.2.1 + July 21 5.2.2 + July 26 5.2.3 + July 28 5.2.4 + July 31 5.2.5 ... ... - November 24 4.13.16 + October 11 5.2.21 ============== =============================== -4.13.16 was the final stable update of the 4.13 release. +5.2.21 was the final stable update of the 5.2 release. Some kernels are designated "long term" kernels; they will receive support for a longer period. As of this writing, the current long term kernels and their maintainers are: - ====== ====================== ============================== - 3.16 Ben Hutchings (very long-term stable kernel) - 4.1 Sasha Levin - 4.4 Greg Kroah-Hartman (very long-term stable kernel) - 4.9 Greg Kroah-Hartman - 4.14 Greg Kroah-Hartman - ====== ====================== ============================== + ====== ================================ ======================= + 3.16 Ben Hutchings (very long-term kernel) + 4.4 Greg Kroah-Hartman & Sasha Levin (very long-term kernel) + 4.9 Greg Kroah-Hartman & Sasha Levin + 4.14 Greg Kroah-Hartman & Sasha Levin + 4.19 Greg Kroah-Hartman & Sasha Levin + 5.4 Greg Kroah-Hartman & Sasha Levin + ====== ================================ ======================= The selection of a kernel for long-term support is purely a matter of a maintainer having the need and the time to maintain that release.
There @@ -215,12 +217,12 @@ How patches get into the Kernel ------------------------------- There is exactly one person who can merge patches into the mainline kernel -repository: Linus Torvalds. But, of the over 9,500 patches which went -into the 2.6.38 kernel, only 112 (around 1.3%) were directly chosen by Linus -himself. The kernel project has long since grown to a size where no single -developer could possibly inspect and select every patch unassisted. The -way the kernel developers have addressed this growth is through the use of -a lieutenant system built around a chain of trust. +repository: Linus Torvalds. But, for example, of the over 9,500 patches +which went into the 2.6.38 kernel, only 112 (around 1.3%) were directly +chosen by Linus himself. The kernel project has long since grown to a size +where no single developer could possibly inspect and select every patch +unassisted. The way the kernel developers have addressed this growth is +through the use of a lieutenant system built around a chain of trust. The kernel code base is logically broken down into a set of subsystems: networking, specific architecture support, memory management, video diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index edb296c52f61..acb2f1b36350 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -284,9 +284,9 @@ context lines. 4) Naming --------- -C is a Spartan language, and so should your naming be. Unlike Modula-2 -and Pascal programmers, C programmers do not use cute names like -ThisVariableIsATemporaryCounter. A C programmer would call that +C is a Spartan language, and your naming conventions should follow suit. +Unlike Modula-2 and Pascal programmers, C programmers do not use cute +names like ThisVariableIsATemporaryCounter. A C programmer would call that variable ``tmp``, which is much easier to write, and not the least more difficult to understand. 
@@ -300,9 +300,9 @@ that counts the number of active users, you should call that ``count_active_users()`` or similar, you should **not** call it ``cntusr()``. Encoding the type of a function into the name (so-called Hungarian -notation) is brain damaged - the compiler knows the types anyway and can -check those, and it only confuses the programmer. No wonder MicroSoft -makes buggy programs. +notation) is asinine - the compiler knows the types anyway and can check +those, and it only confuses the programmer. No wonder Microsoft makes buggy +programs. LOCAL variable names should be short, and to the point. If you have some random integer loop counter, it should probably be called ``i``. @@ -806,9 +806,9 @@ covers RTL which is used frequently with assembly language in the kernel. ---------------------------- Kernel developers like to be seen as literate. Do mind the spelling -of kernel messages to make a good impression. Do not use crippled -words like ``dont``; use ``do not`` or ``don't`` instead. Make the messages -concise, clear, and unambiguous. +of kernel messages to make a good impression. Do not use incorrect +contractions like ``dont``; use ``do not`` or ``don't`` instead. Make the +messages concise, clear, and unambiguous. Kernel messages do not have to be terminated with a period. diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst index 179f2a5625a0..652e2aa02a66 100644 --- a/Documentation/process/deprecated.rst +++ b/Documentation/process/deprecated.rst @@ -29,6 +29,28 @@ a header file, it isn't the full solution. Such interfaces must either be fully removed from the kernel, or added to this file to discourage others from using them in the future. +BUG() and BUG_ON() +------------------ +Use WARN() and WARN_ON() instead, and handle the "impossible" +error condition as gracefully as possible. 
While the BUG()-family +of APIs were originally designed to act as an "impossible situation" +assert and to kill a kernel thread "safely", they turn out to just be +too risky. (e.g. "In what order do locks need to be released? Have +various states been restored?") Very commonly, using BUG() will +destabilize a system or entirely break it, which makes it impossible +to debug or even get viable crash reports. Linus has `very strong +<https://lore.kernel.org/lkml/CA+55aFy6jNLsywVYdGp83AMrXBo_P-pkjkphPGrO=82SPKCpLQ@mail.gmail.com/>`_ +feelings `about this +<https://lore.kernel.org/lkml/CAHk-=whDHsbK3HTOpTF=ue_o04onRwTEaK_ZoJp_fjbqq4+=Jw@mail.gmail.com/>`_. + +Note that the WARN()-family should only be used for "expected to +be unreachable" situations. If you want to warn about "reachable +but undesirable" situations, please use the pr_warn()-family of +functions. System owners may have set the *panic_on_warn* sysctl, +to make sure their systems do not continue running in the face of +"unreachable" conditions. (For example, see commits like `this one +<https://git.kernel.org/linus/d4689846881d160a4d12a514e991a740bcb5d65a>`_.) + open-coded arithmetic in allocator arguments -------------------------------------------- Dynamic size calculations (especially multiplication) should not be @@ -63,51 +85,73 @@ Instead, use the helper:: header = kzalloc(struct_size(header, item, count), GFP_KERNEL); -See :c:func:`array_size`, :c:func:`array3_size`, and :c:func:`struct_size`, -for more details as well as the related :c:func:`check_add_overflow` and -:c:func:`check_mul_overflow` family of functions. +See array_size(), array3_size(), and struct_size(), +for more details as well as the related check_add_overflow() and +check_mul_overflow() family of functions. 
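As a rough user-space illustration of the overflow-checked size calculation that the deprecated.rst text above recommends, the sketch below mimics the behaviour of array_size(): multiply, and saturate to SIZE_MAX if the product wraps, so that the following allocation fails instead of coming up short. The names `checked_array_size` and `alloc_array` are hypothetical, not kernel functions, and `__builtin_mul_overflow` is the GCC/Clang builtin, assumed here as a stand-in for check_mul_overflow().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the kernel's array_size(): returns
 * count * size, or SIZE_MAX if the multiplication overflowed.
 */
static size_t checked_array_size(size_t count, size_t size)
{
	size_t bytes;

	/* GCC/Clang builtin: returns true when the result wrapped. */
	if (__builtin_mul_overflow(count, size, &bytes))
		return SIZE_MAX;
	return bytes;
}

/*
 * Hypothetical allocator wrapper: refuses any request whose size
 * calculation saturated, instead of allocating a too-small buffer.
 */
static void *alloc_array(size_t count, size_t size)
{
	size_t bytes = checked_array_size(count, size);

	if (bytes == SIZE_MAX)
		return NULL;
	return calloc(1, bytes);
}
```

The point of saturating rather than returning an error code is that the result can be passed straight into an allocator, which is how the `kzalloc(struct_size(...), GFP_KERNEL)` form in the hunk above is used.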
simple_strtol(), simple_strtoll(), simple_strtoul(), simple_strtoull() ---------------------------------------------------------------------- -The :c:func:`simple_strtol`, :c:func:`simple_strtoll`, -:c:func:`simple_strtoul`, and :c:func:`simple_strtoull` functions +The simple_strtol(), simple_strtoll(), +simple_strtoul(), and simple_strtoull() functions explicitly ignore overflows, which may lead to unexpected results -in callers. The respective :c:func:`kstrtol`, :c:func:`kstrtoll`, -:c:func:`kstrtoul`, and :c:func:`kstrtoull` functions tend to be the +in callers. The respective kstrtol(), kstrtoll(), +kstrtoul(), and kstrtoull() functions tend to be the correct replacements, though note that those require the string to be NUL or newline terminated. strcpy() -------- -:c:func:`strcpy` performs no bounds checking on the destination +strcpy() performs no bounds checking on the destination buffer. This could result in linear overflows beyond the end of the buffer, leading to all kinds of misbehaviors. While `CONFIG_FORTIFY_SOURCE=y` and various compiler flags help reduce the risk of using this function, there is no good reason to add new uses of -this function. The safe replacement is :c:func:`strscpy`. +this function. The safe replacement is strscpy(). strncpy() on NUL-terminated strings ----------------------------------- -Use of :c:func:`strncpy` does not guarantee that the destination buffer +Use of strncpy() does not guarantee that the destination buffer will be NUL terminated. This can lead to various linear read overflows and other misbehavior due to the missing termination. It also NUL-pads the destination buffer if the source contents are shorter than the destination buffer size, which may be a needless performance penalty for callers using -only NUL-terminated strings. The safe replacement is :c:func:`strscpy`. -(Users of :c:func:`strscpy` still needing NUL-padding will need an -explicit :c:func:`memset` added.) +only NUL-terminated strings. 
The safe replacement is strscpy(). +(Users of strscpy() still needing NUL-padding should instead +use strscpy_pad().) -If a caller is using non-NUL-terminated strings, :c:func:`strncpy()` can +If a caller is using non-NUL-terminated strings, strncpy() can still be used, but destinations should be marked with the `__nonstring <https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html>`_ attribute to avoid future compiler warnings. strlcpy() --------- -:c:func:`strlcpy` reads the entire source buffer first, possibly exceeding +strlcpy() reads the entire source buffer first, possibly exceeding the given limit of bytes to copy. This is inefficient and can lead to linear read overflows if a source string is not NUL-terminated. The -safe replacement is :c:func:`strscpy`. +safe replacement is strscpy(). + +%p format specifier +------------------- +Traditionally, using "%p" in format strings would lead to regular address +exposure flaws in dmesg, proc, sysfs, etc. Instead of leaving these to +be exploitable, all "%p" uses in the kernel are being printed as a hashed +value, rendering them unusable for addressing. New uses of "%p" should not +be added to the kernel. For text addresses, using "%pS" is likely better, +as it produces the more useful symbol name instead. For nearly everything +else, just do not add "%p" at all. + +Paraphrasing Linus's current `guidance <https://lore.kernel.org/lkml/CA+55aFwQEd_d40g4mUCSsVRZzrFPUJt74vc6PPpb675hYNXcKw@mail.gmail.com/>`_: + +- If the hashed "%p" value is pointless, ask yourself whether the pointer + itself is important. Maybe it should be removed entirely? +- If you really think the true pointer value is important, why is some + system state or user privilege level considered "special"? If you think + you can justify it (in comments and commit log) well enough to stand + up to Linus's scrutiny, maybe you can use "%px", along with making sure + you have sensible permissions. 
+ +And finally, know that a toggle for "%p" hashing will `not be accepted <https://lore.kernel.org/lkml/CA+55aFwieC1-nAs+NFq9RTwaR8ef9hWa4MjNBWL41F-8wM49eA@mail.gmail.com/>`_. Variable Length Arrays (VLAs) ----------------------------- @@ -122,27 +166,37 @@ memory adjacent to the stack (when built without `CONFIG_VMAP_STACK=y`) Implicit switch case fall-through --------------------------------- -The C language allows switch cases to "fall-through" when a "break" statement -is missing at the end of a case. This, however, introduces ambiguity in the -code, as it's not always clear if the missing break is intentional or a bug. +The C language allows switch cases to fall through to the next case +when a "break" statement is missing at the end of a case. This, however, +introduces ambiguity in the code, as it's not always clear if the missing +break is intentional or a bug. For example, it's not obvious just from +looking at the code if `STATE_ONE` is intentionally designed to fall +through into `STATE_TWO`:: + + switch (value) { + case STATE_ONE: + do_something(); + case STATE_TWO: + do_other(); + break; + default: + WARN(1, "unknown state"); + } As there has been a long list of flaws `due to missing "break" statements <https://cwe.mitre.org/data/definitions/484.html>`_, we no longer allow -"implicit fall-through". - -In order to identify intentional fall-through cases, we have adopted a -pseudo-keyword macro 'fallthrough' which expands to gcc's extension -__attribute__((__fallthrough__)). `Statement Attributes -<https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html>`_ - -When the C17/C18 [[fallthrough]] syntax is more commonly supported by +implicit fall-through. In order to identify intentional fall-through +cases, we have adopted a pseudo-keyword macro "fallthrough" which +expands to gcc's extension `__attribute__((__fallthrough__)) +<https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html>`_. 
+(When the C17/C18 `[[fallthrough]]` syntax is more commonly supported by C compilers, static analyzers, and IDEs, we can switch to using that syntax -for the macro pseudo-keyword. +for the macro pseudo-keyword.) All switch/case blocks must end in one of: - break; - fallthrough; - continue; - goto <label>; - return [expression]; +* break; +* fallthrough; +* continue; +* goto <label>; +* return [expression]; diff --git a/Documentation/process/email-clients.rst b/Documentation/process/email-clients.rst index 5273d06c8ff6..c9e4ce2613c0 100644 --- a/Documentation/process/email-clients.rst +++ b/Documentation/process/email-clients.rst @@ -237,9 +237,9 @@ using Mutt to send patches through Gmail:: The Mutt docs have lots more information: - http://dev.mutt.org/trac/wiki/UseCases/Gmail + https://gitlab.com/muttmua/mutt/-/wikis/UseCases/Gmail - http://dev.mutt.org/doc/manual.html + http://www.mutt.org/doc/manual/ Pine (TUI) ********** diff --git a/Documentation/process/embargoed-hardware-issues.rst b/Documentation/process/embargoed-hardware-issues.rst index 33edae654599..a19d084f9b2c 100644 --- a/Documentation/process/embargoed-hardware-issues.rst +++ b/Documentation/process/embargoed-hardware-issues.rst @@ -244,23 +244,23 @@ disclosure of a particular issue, unless requested by a response team or by an involved disclosed party. 
The current ambassadors list: ============= ======================================================== - ARM + ARM Grant Likely <grant.likely@arm.com> AMD Tom Lendacky <tom.lendacky@amd.com> IBM Intel Tony Luck <tony.luck@intel.com> Qualcomm Trilok Soni <tsoni@codeaurora.org> - Microsoft Sasha Levin <sashal@kernel.org> + Microsoft James Morris <jamorris@linux.microsoft.com> VMware Xen Andrew Cooper <andrew.cooper3@citrix.com> - Canonical Tyler Hicks <tyhicks@canonical.com> + Canonical John Johansen <john.johansen@canonical.com> Debian Ben Hutchings <ben@decadent.org.uk> Oracle Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Red Hat Josh Poimboeuf <jpoimboe@redhat.com> SUSE Jiri Kosina <jkosina@suse.cz> - Amazon Peter Bowen <pzb@amzn.com> + Amazon Google Kees Cook <keescook@chromium.org> ============= ======================================================== diff --git a/Documentation/process/howto.rst b/Documentation/process/howto.rst index b6f5a379ad6c..70791e153de1 100644 --- a/Documentation/process/howto.rst +++ b/Documentation/process/howto.rst @@ -243,10 +243,10 @@ branches. These different branches are: Mainline tree ~~~~~~~~~~~~~ -Mainline tree are maintained by Linus Torvalds, and can be found at +The mainline tree is maintained by Linus Torvalds, and can be found at https://kernel.org or in the repo. Its development process is as follows: - - As soon as a new kernel is released a two weeks window is open, + - As soon as a new kernel is released a two week window is open, during this period of time maintainers can submit big diffs to Linus, usually the patches that have already been included in the linux-next for a few weeks. The preferred way to submit big changes @@ -281,8 +281,9 @@ Various stable trees with multiple major numbers Kernels with 3-part versions are -stable kernels. 
They contain relatively small and critical fixes for security problems or significant -regressions discovered in a given major mainline release, with the first -2-part of version number are the same correspondingly. +regressions discovered in a given major mainline release. Each release +in a major stable series increments the third part of the version +number, keeping the first two parts the same. This is the recommended branch for users who want the most recent stable kernel and are not interested in helping test development/experimental @@ -359,10 +360,10 @@ Managing bug reports One of the best ways to put into practice your hacking skills is by fixing bugs reported by other people. Not only you will help to make the kernel -more stable, you'll learn to fix real world problems and you will improve -your skills, and other developers will be aware of your presence. Fixing -bugs is one of the best ways to get merits among other developers, because -not many people like wasting time fixing other people's bugs. +more stable, but you'll also learn to fix real world problems and you will +improve your skills, and other developers will be aware of your presence. +Fixing bugs is one of the best ways to get merits among other developers, +because not many people like wasting time fixing other people's bugs. To work in the already reported bug reports, go to https://bugzilla.kernel.org. diff --git a/Documentation/process/kernel-docs.rst b/Documentation/process/kernel-docs.rst index 7a45a8e36ea7..9d6d0ac4fca9 100644 --- a/Documentation/process/kernel-docs.rst +++ b/Documentation/process/kernel-docs.rst @@ -313,7 +313,7 @@ On-line docs :URL: http://www.linuxjournal.com/article.php?sid=2391 :Date: 1997 :Keywords: RAID, MD driver. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. 
:Abstract: *A description of the implementation of the RAID-1, RAID-4 and RAID-5 personalities of the MD device driver in the Linux kernel, providing users with high performance and reliable, @@ -338,7 +338,7 @@ On-line docs :Date: 1996 :Keywords: device driver, module, loading/unloading modules, allocating resources. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This is the first of a series of four articles co-authored by Alessandro Rubini and Georg Zezchwitz which present a practical approach to writing Linux device drivers as kernel @@ -354,7 +354,7 @@ On-line docs :Keywords: character driver, init_module, clean_up module, autodetection, major number, minor number, file operations, open(), close(). - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This article, the second of four, introduces part of the actual code to create custom module implementing a character device driver. It describes the code for module initialization and @@ -367,7 +367,7 @@ On-line docs :Date: 1996 :Keywords: read(), write(), select(), ioctl(), blocking/non blocking mode, interrupt handler. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This article, the third of four on writing character device drivers, introduces concepts of reading, writing, and using ioctl-calls*. @@ -378,7 +378,7 @@ On-line docs :URL: http://www.linuxjournal.com/article.php?sid=1222 :Date: 1996 :Keywords: interrupts, irqs, DMA, bottom halves, task queues. - :Description: Linux Journal Kernel Korner article. Here is its + :Description: Linux Journal Kernel Korner article. :Abstract: *This is the fourth in a series of articles about writing character device drivers as loadable kernel modules. This month, we further investigate the field of interrupt handling. 
diff --git a/Documentation/process/management-style.rst b/Documentation/process/management-style.rst index 186753ff3d2d..dfbc69bf49d4 100644 --- a/Documentation/process/management-style.rst +++ b/Documentation/process/management-style.rst @@ -227,7 +227,7 @@ incompetence will grudgingly admit that you at least didn't try to weasel out of it. Then make the developer who really screwed up (if you can find them) know -**in_private** that they screwed up. Not just so they can avoid it in the +**in private** that they screwed up. Not just so they can avoid it in the future, but so that they know they owe you one. And, perhaps even more importantly, they're also likely the person who can fix it. Because, let's face it, it sure ain't you. diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/robust-futex-ABI.txt index 8a5d34abf726..f24904f1c16f 100644 --- a/Documentation/robust-futex-ABI.txt +++ b/Documentation/robust-futex-ABI.txt @@ -61,8 +61,8 @@ setup that list. address of the associated 'lock entry', plus or minus, of what will be called the 'lock word', from that 'lock entry'. The 'lock word' is always a 32 bit word, unlike the other words above. The 'lock - word' holds 3 flag bits in the upper 3 bits, and the thread id (TID) - of the thread holding the lock in the bottom 29 bits. See further + word' holds 2 flag bits in the upper 2 bits, and the thread id (TID) + of the thread holding the lock in the bottom 30 bits. See further below for a description of the flag bits. The third word, called 'list_op_pending', contains transient copy of @@ -128,7 +128,7 @@ that thread's robust_futex linked lock list a given time. A given futex lock structure in a user shared memory region may be held at different times by any of the threads with access to that region. The thread currently holding such a lock, if any, is marked with the threads -TID in the lower 29 bits of the 'lock word'. +TID in the lower 30 bits of the 'lock word'. 
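The corrected 'lock word' layout in the robust-futex hunks above (2 flag bits on top, TID in the bottom 30 bits) can be sketched in user-space C as follows. The three constants carry the same values as FUTEX_WAITERS, FUTEX_OWNER_DIED and FUTEX_TID_MASK in `<linux/futex.h>`; they are redefined here under hypothetical `LOCK_WORD_*` names, and the helper functions are likewise illustrative, so that the sketch builds without kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; values match the futex constants in <linux/futex.h>. */
#define LOCK_WORD_WAITERS	0x80000000u	/* bit 31: flag */
#define LOCK_WORD_OWNER_DIED	0x40000000u	/* bit 30: flag */
#define LOCK_WORD_TID_MASK	0x3fffffffu	/* bits 0-29: owner TID */

/* The two flag bits and the TID field together cover all 32 bits. */
static_assert((LOCK_WORD_WAITERS | LOCK_WORD_OWNER_DIED |
	       LOCK_WORD_TID_MASK) == 0xffffffffu,
	      "lock word fields must cover 32 bits");

/* Extract the owning thread id from the bottom 30 bits. */
static uint32_t lock_word_tid(uint32_t lock_word)
{
	return lock_word & LOCK_WORD_TID_MASK;
}

/* True if the lock word names 'tid' as owner, whatever the flag bits say. */
static bool lock_word_owned_by(uint32_t lock_word, uint32_t tid)
{
	return lock_word_tid(lock_word) == tid;
}
```

Masking before comparing matters because either flag bit may be set while the lock is held; comparing the raw word against a TID would then miss the owner.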
When adding or removing a lock from its list of held locks, in order for the kernel to correctly handle lock cleanup regardless of when the task @@ -141,7 +141,7 @@ On insertion: 1) set the 'list_op_pending' word to the address of the 'lock entry' to be inserted, 2) acquire the futex lock, - 3) add the lock entry, with its thread id (TID) in the bottom 29 bits + 3) add the lock entry, with its thread id (TID) in the bottom 30 bits of the 'lock word', to the linked list starting at 'head', and 4) clear the 'list_op_pending' word. @@ -155,7 +155,7 @@ On removal: On exit, the kernel will consider the address stored in 'list_op_pending' and the address of each 'lock word' found by walking -the list starting at 'head'. For each such address, if the bottom 29 +the list starting at 'head'. For each such address, if the bottom 30 bits of the 'lock word' at offset 'offset' from that address equals the exiting threads TID, then the kernel will do two things: @@ -180,7 +180,5 @@ any point: future kernel configuration changes) elements. When the kernel sees a list entry whose 'lock word' doesn't have the -current threads TID in the lower 29 bits, it does nothing with that +current threads TID in the lower 30 bits, it does nothing with that entry, and goes on to the next entry. - -Bit 29 (0x20000000) of the 'lock word' is reserved for future use. diff --git a/Documentation/scsi/scsi_mid_low_api.txt b/Documentation/scsi/scsi_mid_low_api.txt index 2a4be1c3e6db..537f04728487 100644 --- a/Documentation/scsi/scsi_mid_low_api.txt +++ b/Documentation/scsi/scsi_mid_low_api.txt @@ -299,7 +299,6 @@ Summary: scsi_host_alloc - return a new scsi_host instance whose refcount==1 scsi_host_get - increments Scsi_Host instance's refcount scsi_host_put - decrements Scsi_Host instance's refcount (free if 0) - scsi_partsize - parse partition table into cylinders, heads + sectors scsi_register - create and register a scsi host adapter instance. 
scsi_remove_device - detach and remove a SCSI device scsi_remove_host - detach and remove all SCSI devices owned by host @@ -473,26 +472,6 @@ void scsi_host_put(struct Scsi_Host *shost) /** - * scsi_partsize - parse partition table into cylinders, heads + sectors - * @buf: pointer to partition table - * @capacity: size of (total) disk in 512 byte sectors - * @cyls: outputs number of cylinders calculated via this pointer - * @hds: outputs number of heads calculated via this pointer - * @secs: outputs number of sectors calculated via this pointer - * - * Returns 0 on success, -1 on failure - * - * Might block: no - * - * Notes: Caller owns memory returned (free with kfree() ) - * - * Defined in: drivers/scsi/scsicam.c - **/ -int scsi_partsize(unsigned char *buf, unsigned long capacity, - unsigned int *cyls, unsigned int *hds, unsigned int *secs) - - -/** * scsi_register - create and register a scsi host adapter instance. * @sht: pointer to scsi host template * @privsize: extra bytes to allocate in hostdata array (which is the diff --git a/Documentation/security/siphash.rst b/Documentation/security/siphash.rst index 9965821ab333..4eba68cdf0a1 100644 --- a/Documentation/security/siphash.rst +++ b/Documentation/security/siphash.rst @@ -128,8 +128,8 @@ then when you can be absolutely certain that the outputs will never be transmitted out of the kernel. This is only remotely useful over `jhash` as a means of mitigating hashtable flooding denial of service attacks. -Generating a key -================ +Generating a HalfSipHash key +============================ Keys should always be generated from a cryptographically secure source of random numbers, either using get_random_bytes or get_random_once: @@ -139,8 +139,8 @@ get_random_bytes(&key, sizeof(key)); If you're not deriving your key from here, you're doing it wrong. 
-Using the functions -=================== +Using the HalfSipHash functions +=============================== There are two variants of the function, one that takes a list of integers, and one that takes a buffer:: diff --git a/Documentation/sphinx/parallel-wrapper.sh b/Documentation/sphinx/parallel-wrapper.sh index 7daf5133bdd3..e54c44ce117d 100644 --- a/Documentation/sphinx/parallel-wrapper.sh +++ b/Documentation/sphinx/parallel-wrapper.sh @@ -30,4 +30,4 @@ if [ -n "$parallel" ] ; then parallel="-j$parallel" fi -exec "$sphinx" "$parallel" "$@" +exec "$sphinx" $parallel "$@" diff --git a/Documentation/target/tcmu-design.rst b/Documentation/target/tcmu-design.rst index a7b426707bf6..e47047e32e27 100644 --- a/Documentation/target/tcmu-design.rst +++ b/Documentation/target/tcmu-design.rst @@ -5,7 +5,7 @@ TCM Userspace Design .. Contents: - 1) TCM Userspace Design + 1) Design a) Background b) Benefits c) Design constraints @@ -23,8 +23,8 @@ TCM Userspace Design 3) A final note -TCM Userspace Design -==================== +Design +====== TCM is another name for LIO, an in-kernel iSCSI target (server). Existing TCM targets run in the kernel. TCMU (TCM in Userspace) diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst index 2ef318962e29..f45bf11fa88d 100644 --- a/Documentation/trace/events-power.rst +++ b/Documentation/trace/events-power.rst @@ -75,16 +75,6 @@ The PM QoS events are used for QoS add/update/remove request and for target/flags update. :: - pm_qos_add_request "pm_qos_class=%s value=%d" - pm_qos_update_request "pm_qos_class=%s value=%d" - pm_qos_remove_request "pm_qos_class=%s value=%d" - pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld" - -The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY"). -The second parameter is value to be added/updated/removed. -The third parameter is timeout value in usec. 
-:: - pm_qos_update_target "action=%s prev_value=%d curr_value=%d" pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x" @@ -92,7 +82,7 @@ The first parameter gives the QoS action name (e.g. "ADD_REQ"). The second parameter is the previous QoS value. The third parameter is the current QoS value to update. -And, there are also events used for device PM QoS add/update/remove request. +There are also events used for device PM QoS add/update/remove request. :: dev_pm_qos_add_request "device=%s type=%s new_value=%d" @@ -103,3 +93,12 @@ The first parameter gives the device name which tries to add/update/remove QoS requests. The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY"). The third parameter is value to be added/updated/removed. + +And, there are events used for CPU latency QoS add/update/remove request. +:: + + pm_qos_add_request "value=%d" + pm_qos_update_request "value=%d" + pm_qos_remove_request "value=%d" + +The parameter is the value to be added/updated/removed. diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst index ed79b220bd07..4a2ebe0bd19b 100644 --- a/Documentation/trace/events.rst +++ b/Documentation/trace/events.rst @@ -342,7 +342,8 @@ section of Documentation/trace/ftrace.rst), but there are major differences and the implementation isn't currently tied to it in any way, so beware about making generalizations between the two. -Note: Writing into trace_marker (See Documentation/trace/ftrace.rst) +.. Note:: + Writing into trace_marker (See Documentation/trace/ftrace.rst) can also enable triggers that are written into /sys/kernel/tracing/events/ftrace/print/trigger @@ -569,14 +570,14 @@ The first creates the event in one step, using synth_event_create(). In this method, the name of the event to create and an array defining the fields is supplied to synth_event_create(). If successful, a synthetic event with that name and fields will exist following that -call. 
For example, to create a new "schedtest" synthetic event: +call. For example, to create a new "schedtest" synthetic event:: ret = synth_event_create("schedtest", sched_fields, ARRAY_SIZE(sched_fields), THIS_MODULE); The sched_fields param in this example points to an array of struct synth_field_desc, each of which describes an event field by type and -name: +name:: static struct synth_field_desc sched_fields[] = { { .type = "pid_t", .name = "next_pid_field" }, @@ -615,7 +616,7 @@ synth_event_gen_cmd_array_start(), the user should create and initialize a dynevent_cmd object using synth_event_cmd_init(). For example, to create a new "schedtest" synthetic event with two -fields: +fields:: struct dynevent_cmd cmd; char *buf; @@ -631,7 +632,7 @@ fields: "u64", "ts_ns"); Alternatively, using an array of struct synth_field_desc fields -containing the same information: +containing the same information:: ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE, fields, n_fields); @@ -640,7 +641,7 @@ Once the synthetic event object has been created, it can then be populated with more fields. Fields are added one by one using synth_event_add_field(), supplying the dynevent_cmd object, a field type, and a field name. For example, to add a new int field named -"intfield", the following call should be made: +"intfield", the following call should be made:: ret = synth_event_add_field(&cmd, "int", "intfield"); @@ -649,7 +650,7 @@ the field is considered to be an array. A group of fields can also be added all at once using an array of synth_field_desc with add_synth_fields(). For example, this would add -just the first four sched_fields: +just the first four sched_fields:: ret = synth_event_add_fields(&cmd, sched_fields, 4); @@ -658,7 +659,7 @@ synth_event_add_field_str() can be used to add it as-is; it will also automatically append a ';' to the string. 
Once all the fields have been added, the event should be finalized and -registered by calling the synth_event_gen_cmd_end() function: +registered by calling the synth_event_gen_cmd_end() function:: ret = synth_event_gen_cmd_end(&cmd); @@ -691,7 +692,7 @@ trace array)), along with an variable number of u64 args, one for each synthetic event field, and the number of values being passed. So, to trace an event corresponding to the synthetic event definition -above, code like the following could be used: +above, code like the following could be used:: ret = synth_event_trace(create_synth_test, 7, /* number of values */ 444, /* next_pid_field */ @@ -715,7 +716,7 @@ trace array)), along with an array of u64, one for each synthetic event field. To trace an event corresponding to the synthetic event definition -above, code like the following could be used: +above, code like the following could be used:: u64 vals[7]; @@ -739,7 +740,7 @@ In order to trace a synthetic event, a pointer to the trace event file is needed. The trace_get_event_file() function can be used to get it - it will find the file in the given trace instance (in this case NULL since the top trace array is being used) while at the same time -preventing the instance containing it from going away: +preventing the instance containing it from going away:: schedtest_event_file = trace_get_event_file(NULL, "synthetic", "schedtest"); @@ -751,31 +752,31 @@ To enable a synthetic event from the kernel, trace_array_set_clr_event() can be used (which is not specific to synthetic events, so does need the "synthetic" system name to be specified explicitly). 
-To enable the event, pass 'true' to it: +To enable the event, pass 'true' to it:: trace_array_set_clr_event(schedtest_event_file->tr, "synthetic", "schedtest", true); -To disable it pass false: +To disable it pass false:: trace_array_set_clr_event(schedtest_event_file->tr, "synthetic", "schedtest", false); Finally, synth_event_trace_array() can be used to actually trace the -event, which should be visible in the trace buffer afterwards: +event, which should be visible in the trace buffer afterwards:: ret = synth_event_trace_array(schedtest_event_file, vals, ARRAY_SIZE(vals)); To remove the synthetic event, the event should be disabled, and the -trace instance should be 'put' back using trace_put_event_file(): +trace instance should be 'put' back using trace_put_event_file():: trace_array_set_clr_event(schedtest_event_file->tr, "synthetic", "schedtest", false); trace_put_event_file(schedtest_event_file); If those have been successful, synth_event_delete() can be called to -remove the event: +remove the event:: ret = synth_event_delete("schedtest"); @@ -784,7 +785,7 @@ remove the event: To trace a synthetic using the piecewise method described above, the synth_event_trace_start() function is used to 'open' the synthetic -event trace: +event trace:: struct synth_trace_state trace_state; @@ -809,7 +810,7 @@ along with the value to set the next field in the event. After each field is set, the 'cursor' points to the next field, which will be set by the subsequent call, continuing until all the fields have been set in order. The same sequence of calls as in the above examples using -this method would be (without error-handling code): +this method would be (without error-handling code):: /* next_pid_field */ ret = synth_event_add_next_val(777, &trace_state); @@ -837,7 +838,7 @@ used. Each call is passed the same synth_trace_state object used in the synth_event_trace_start(), along with the field name of the field to set and the value to set it to. 
The same sequence of calls as in
 the above examples using this method would be (without error-handling
-code):
+code)::
 
        ret = synth_event_add_val("next_pid_field", 777, &trace_state);
        ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
@@ -855,7 +856,7 @@ can be used but not both at the same time.
 
 Finally, the event won't be actually traced until it's 'closed',
 which is done using synth_event_trace_end(), which takes only the
-struct synth_trace_state object used in the previous calls:
+struct synth_trace_state object used in the previous calls::
 
        ret = synth_event_trace_end(&trace_state);
 
@@ -878,7 +879,7 @@ function.  Before calling kprobe_event_gen_cmd_start(), the user
 should create and initialize a dynevent_cmd object using
 kprobe_event_cmd_init().
 
-For example, to create a new "schedtest" kprobe event with two fields:
+For example, to create a new "schedtest" kprobe event with two fields::
 
   struct dynevent_cmd cmd;
   char *buf;
@@ -900,18 +901,18 @@ Once the kprobe event object has been created, it can then be
 populated with more fields.  Fields can be added using
 kprobe_event_add_fields(), supplying the dynevent_cmd object along
 with a variable arg list of probe fields.  For example, to add a
-couple additional fields, the following call could be made:
+couple additional fields, the following call could be made::
 
   ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
 
 Once all the fields have been added, the event should be finalized and
 registered by calling the kprobe_event_gen_cmd_end() or
 kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
-or kretprobe command was started:
+or kretprobe command was started::
 
   ret = kprobe_event_gen_cmd_end(&cmd);
 
-or
+or::
 
   ret = kretprobe_event_gen_cmd_end(&cmd);
 
@@ -920,13 +921,13 @@ events.
 
 Similarly, a kretprobe event can be created using
 kretprobe_event_gen_cmd_start() with a probe name and location and
-additional params such as $retval:
+additional params such as $retval::
 
   ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
                                       "do_sys_open", "$retval");
 
 Similar to the synthetic event case, code like the following can be
-used to enable the newly created kprobe event:
+used to enable the newly created kprobe event::
 
   gen_kprobe_test = trace_get_event_file(NULL, "kprobes",
                                          "gen_kprobe_test");
@@ -934,7 +935,7 @@ used to enable the newly created kprobe event:
                                   "kprobes", "gen_kprobe_test", true);
 
 Finally, also similar to synthetic events, the following code can be
-used to give the kprobe event file back and delete the event:
+used to give the kprobe event file back and delete the event::
 
   trace_put_event_file(gen_kprobe_test);
 
@@ -963,7 +964,7 @@ are described below.
 
 The first step in building a new command string is to create and
 initialize an instance of a dynevent_cmd.  Here, for instance, we
-create a dynevent_cmd on the stack and initialize it:
+create a dynevent_cmd on the stack and initialize it::
 
   struct dynevent_cmd cmd;
   char *buf;
@@ -989,7 +990,7 @@ calls to argument-adding functions.
 To add a single argument, define and initialize a struct dynevent_arg
 or struct dynevent_arg_pair object.  Here's an example of the simplest
 possible arg addition, which is simply to append the given string as
-a whitespace-separated argument to the command:
+a whitespace-separated argument to the command::
 
   struct dynevent_arg arg;
 
@@ -1007,7 +1008,7 @@ the arg.
 Here's another more complicated example using an 'arg pair', which is
 used to create an argument that consists of a couple components added
 together as a unit, for example, a 'type field_name;' arg or a simple
-expression arg e.g.
+expression arg e.g.
'flags=%cx'::
 
   struct dynevent_arg_pair arg_pair;
 
@@ -1031,7 +1032,7 @@ Any number of dynevent_*_add() calls can be made to build up the string
 (until its length surpasses cmd->maxlen).  When all the arguments have
 been added and the command string is complete, the only thing left to
 do is run the command, which happens by simply calling
-dynevent_create():
+dynevent_create()::
 
   ret = dynevent_create(&cmd);
 
diff --git a/Documentation/translations/it_IT/networking/netdev-FAQ.rst b/Documentation/translations/it_IT/networking/netdev-FAQ.rst
index 8489ead7cff1..7e2456bb7d92 100644
--- a/Documentation/translations/it_IT/networking/netdev-FAQ.rst
+++ b/Documentation/translations/it_IT/networking/netdev-FAQ.rst
@@ -1,6 +1,6 @@
 .. include:: ../disclaimer-ita.rst
 
-:Original: :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+:Original: :ref:`Documentation/networking/netdev-FAQ.rst <netdev-FAQ>`
 
 .. _it_netdev-FAQ:
 
diff --git a/Documentation/translations/it_IT/process/programming-language.rst b/Documentation/translations/it_IT/process/programming-language.rst
index f4b006395849..c4fc9d394c29 100644
--- a/Documentation/translations/it_IT/process/programming-language.rst
+++ b/Documentation/translations/it_IT/process/programming-language.rst
@@ -8,26 +8,26 @@
 Linguaggio di programmazione
 ============================
 
-Il kernel è scritto nel linguaggio di programmazione C [c-language]_.
-Più precisamente, il kernel viene compilato con ``gcc`` [gcc]_ usando
-l'opzione ``-std=gnu89`` [gcc-c-dialect-options]_: il dialetto GNU
+Il kernel è scritto nel linguaggio di programmazione C [it-c-language]_.
+Più precisamente, il kernel viene compilato con ``gcc`` [it-gcc]_ usando
+l'opzione ``-std=gnu89`` [it-gcc-c-dialect-options]_: il dialetto GNU
 dello standard ISO C90 (con l'aggiunta di alcune funzionalità da C99)
 
-Questo dialetto contiene diverse estensioni al linguaggio [gnu-extensions]_,
+Questo dialetto contiene diverse estensioni al linguaggio [it-gnu-extensions]_,
 e molte di queste vengono usate sistematicamente dal kernel.
 
 Il kernel offre un certo livello di supporto per la compilazione con ``clang``
-[clang]_ e ``icc`` [icc]_ su diverse architetture, tuttavia in questo momento
+[it-clang]_ e ``icc`` [it-icc]_ su diverse architetture, tuttavia in questo momento
 il supporto non è completo e richiede delle patch aggiuntive.
 
 Attributi
 ---------
 
 Una delle estensioni più comuni e usate nel kernel sono gli attributi
-[gcc-attribute-syntax]_. Gli attributi permettono di aggiungere una semantica,
+[it-gcc-attribute-syntax]_. Gli attributi permettono di aggiungere una semantica,
 definita dell'implementazione, alle entità del linguaggio (come le variabili,
 le funzioni o i tipi) senza dover fare importanti modifiche sintattiche al
-linguaggio stesso (come l'aggiunta di nuove parole chiave) [n2049]_.
+linguaggio stesso (come l'aggiunta di nuove parole chiave) [it-n2049]_.
 
 In alcuni casi, gli attributi sono opzionali (ovvero un compilatore che non
 dovesse supportarli dovrebbe produrre comunque codice corretto, anche se
@@ -41,11 +41,11 @@ possono usare e/o per accorciare il codice.
 Per maggiori informazioni consultate il file d'intestazione
 ``include/linux/compiler_attributes.h``.
 
-.. [c-language] http://www.open-std.org/jtc1/sc22/wg14/www/standards
-.. [gcc] https://gcc.gnu.org
-.. [clang] https://clang.llvm.org
-.. [icc] https://software.intel.com/en-us/c-compilers
-.. [gcc-c-dialect-options] https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html
-.. [gnu-extensions] https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html
-..
[gcc-attribute-syntax] https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html
-.. [n2049] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2049.pdf
+.. [it-c-language] http://www.open-std.org/jtc1/sc22/wg14/www/standards
+.. [it-gcc] https://gcc.gnu.org
+.. [it-clang] https://clang.llvm.org
+.. [it-icc] https://software.intel.com/en-us/c-compilers
+.. [it-gcc-c-dialect-options] https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html
+.. [it-gnu-extensions] https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html
+.. [it-gcc-attribute-syntax] https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html
+.. [it-n2049] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2049.pdf
diff --git a/Documentation/translations/zh_CN/filesystems/index.rst b/Documentation/translations/zh_CN/filesystems/index.rst
new file mode 100644
index 000000000000..14f155edaf69
--- /dev/null
+++ b/Documentation/translations/zh_CN/filesystems/index.rst
@@ -0,0 +1,27 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: ../disclaimer-zh_CN.rst
+
+:Original: :ref:`Documentation/filesystems/index.rst <filesystems_index>`
+:Translator: Wang Wenhu <wenhu.wang@vivo.com>
+
+.. _cn_filesystems_index:
+
+========================
+Linux Kernel中的文件系统
+========================
+
+这份正在开发的手册或许在未来某个辉煌的日子里以易懂的形式将Linux虚拟\
+文件系统(VFS)层以及基于其上的各种文件系统如何工作呈现给大家。当前\
+可以看到下面的内容。
+
+文件系统
+========
+
+文件系统实现文档。
+
+.. toctree::
+   :maxdepth: 2
+
+   virtiofs
+
diff --git a/Documentation/translations/zh_CN/filesystems/virtiofs.rst b/Documentation/translations/zh_CN/filesystems/virtiofs.rst
new file mode 100644
index 000000000000..09bc9e012e2a
--- /dev/null
+++ b/Documentation/translations/zh_CN/filesystems/virtiofs.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+..
include:: ../disclaimer-zh_CN.rst
+
+:Original: :ref:`Documentation/filesystems/virtiofs.rst <virtiofs_index>`
+
+译者
+::
+
+	中文版维护者: 王文虎 Wang Wenhu <wenhu.wang@vivo.com>
+	中文版翻译者: 王文虎 Wang Wenhu <wenhu.wang@vivo.com>
+	中文版校译者: 王文虎 Wang Wenhu <wenhu.wang@vivo.com>
+
+===========================================
+virtiofs: virtio-fs 主机<->客机共享文件系统
+===========================================
+
+- Copyright (C) 2020 Vivo Communication Technology Co. Ltd.
+
+介绍
+====
+Linux的virtiofs文件系统实现了一个半虚拟化VIRTIO类型“virtio-fs”设备的驱动,通过该\
+类型设备实现客机<->主机文件系统共享。它允许客机挂载一个已经导出到主机的目录。
+
+客机通常需要访问主机或者远程系统上的文件。使用场景包括:在新客机安装时让文件对其\
+可见;从主机上的根文件系统启动;对无状态或临时客机提供持久存储和在客机之间共享目录。
+
+尽管在某些任务可能通过使用已有的网络文件系统完成,但是却需要非常难以自动化的配置\
+步骤,且将存储网络暴露给客机。而virtio-fs设备通过提供不经过网络的文件系统访问文件\
+的设计方式解决了这些问题。
+
+另外,virto-fs设备发挥了主客机共存的优点提高了性能,并且提供了网络文件系统所不具备
+的一些语义功能。
+
+用法
+====
+以``myfs``标签将文件系统挂载到``/mnt``:
+
+.. code-block:: sh
+
+  guest# mount -t virtiofs myfs /mnt
+
+请查阅 https://virtio-fs.gitlab.io/ 了解配置QEMU和virtiofsd守护程序的详细信息。
+
+内幕
+====
+由于virtio-fs设备将FUSE协议用于文件系统请求,因此Linux的virtiofs文件系统与FUSE文\
+件系统客户端紧密集成在一起。客机充当FUSE客户端而主机充当FUSE服务器,内核与用户空\
+间之间的/dev/fuse接口由virtio-fs设备接口代替。
+
+FUSE请求被置于虚拟队列中由主机处理。主机填充缓冲区中的响应部分,而客机处理请求的完成部分。
+
+将/dev/fuse映射到虚拟队列需要解决/dev/fuse和虚拟队列之间语义上的差异。每次读取\
+/dev/fuse设备时,FUSE客户端都可以选择要传输的请求,从而可以使某些请求优先于其他\
+请求。虚拟队列有其队列语义,无法更改已入队请求的顺序。在虚拟队列已满的情况下尤
+其关键,因为此时不可能加入高优先级的请求。为了解决此差异,virtio-fs设备采用“hiprio”\
+(高优先级)虚拟队列,专门用于有别于普通请求的高优先级请求。
+
diff --git a/Documentation/translations/zh_CN/index.rst b/Documentation/translations/zh_CN/index.rst
index d3165535ec9e..76850a5dd982 100644
--- a/Documentation/translations/zh_CN/index.rst
+++ b/Documentation/translations/zh_CN/index.rst
@@ -14,6 +14,7 @@
    :maxdepth: 2
 
    process/index
+   filesystems/index
 
 目录和表格
 ----------
diff --git a/Documentation/translations/zh_CN/io_ordering.txt b/Documentation/translations/zh_CN/io_ordering.txt
index 1f8127bdd415..7bb3086227ae 100644
--- a/Documentation/translations/zh_CN/io_ordering.txt
+++
b/Documentation/translations/zh_CN/io_ordering.txt
@@ -1,4 +1,4 @@
-Chinese translated version of Documentation/io_ordering.txt
+Chinese translated version of Documentation/driver-api/io_ordering.rst
 
 If you have any comment or update to the content, please contact the
 original document maintainer directly.  However, if you have a problem
@@ -8,7 +8,7 @@ or if there is a problem with the translation.
 
 Chinese maintainer: Lin Yongting <linyongting@gmail.com>
 ---------------------------------------------------------------------
-Documentation/io_ordering.txt 的中文翻译
+Documentation/driver-api/io_ordering.rst 的中文翻译
 
 如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
 交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
diff --git a/Documentation/translations/zh_CN/process/5.Posting.rst b/Documentation/translations/zh_CN/process/5.Posting.rst
index 41aba21ff050..9ff9945f918c 100644
--- a/Documentation/translations/zh_CN/process/5.Posting.rst
+++ b/Documentation/translations/zh_CN/process/5.Posting.rst
@@ -5,7 +5,7 @@
 
 .. _cn_development_posting:
 
-发送补丁
+发布补丁
 ========
 
 迟早,当您的工作准备好提交给社区进行审查,并最终包含到主线内核中时。不出所料,
diff --git a/Documentation/translations/zh_CN/process/embargoed-hardware-issues.rst b/Documentation/translations/zh_CN/process/embargoed-hardware-issues.rst
index b93f1af68261..88273ebe7823 100644
--- a/Documentation/translations/zh_CN/process/embargoed-hardware-issues.rst
+++ b/Documentation/translations/zh_CN/process/embargoed-hardware-issues.rst
@@ -183,7 +183,7 @@ CVE分配
   VMware
   Xen           Andrew Cooper <andrew.cooper3@citrix.com>
 
-  Canonical     Tyler Hicks <tyhicks@canonical.com>
+  Canonical     John Johansen <john.johansen@canonical.com>
   Debian        Ben Hutchings <ben@decadent.org.uk>
   Oracle        Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
   Red Hat       Josh Poimboeuf <jpoimboe@redhat.com>
diff --git a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt
index 66c7c568bd86..9c39ee58ea50 100644
---
a/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt
+++ b/Documentation/translations/zh_CN/video4linux/v4l2-framework.txt
@@ -649,7 +649,7 @@ video_device注册
 
 接下来你需要注册视频设备:这会为你创建一个字符设备。
 
-	err = video_register_device(vdev, VFL_TYPE_GRABBER, -1);
+	err = video_register_device(vdev, VFL_TYPE_VIDEO, -1);
 	if (err) {
 		video_device_release(vdev); /* or kfree(my_vdev); */
 		return err;
@@ -660,7 +660,7 @@ video_device注册
 
 注册哪种设备是根据类型(type)参数。存在以下类型:
 
-VFL_TYPE_GRABBER: 用于视频输入/输出设备的 videoX
+VFL_TYPE_VIDEO: 用于视频输入/输出设备的 videoX
 VFL_TYPE_VBI: 用于垂直消隐数据的 vbiX (例如,隐藏式字幕,图文电视)
 VFL_TYPE_RADIO: 用于广播调谐器的 radioX
 
diff --git a/Documentation/usb/index.rst b/Documentation/usb/index.rst
index 36b6ebd9a9d9..b656c9be23ed 100644
--- a/Documentation/usb/index.rst
+++ b/Documentation/usb/index.rst
@@ -22,6 +22,7 @@ USB support
     misc_usbsevseg
     mtouchusb
     ohci
+    raw-gadget
     usbip_protocol
     usbmon
     usb-serial
diff --git a/Documentation/usb/raw-gadget.rst b/Documentation/usb/raw-gadget.rst
new file mode 100644
index 000000000000..9e78cb858f86
--- /dev/null
+++ b/Documentation/usb/raw-gadget.rst
@@ -0,0 +1,61 @@
+==============
+USB Raw Gadget
+==============
+
+USB Raw Gadget is a kernel module that provides a userspace interface for
+the USB Gadget subsystem. Essentially it allows to emulate USB devices
+from userspace. Enabled with CONFIG_USB_RAW_GADGET. Raw Gadget is
+currently a strictly debugging feature and shouldn't be used in
+production, use GadgetFS instead.
+
+Comparison to GadgetFS
+~~~~~~~~~~~~~~~~~~~~~~
+
+Raw Gadget is similar to GadgetFS, but provides a more low-level and
+direct access to the USB Gadget layer for the userspace. The key
+differences are:
+
+1. Every USB request is passed to the userspace to get a response, while
+   GadgetFS responds to some USB requests internally based on the provided
+   descriptors. However note, that the UDC driver might respond to some
+   requests on its own and never forward them to the Gadget layer.
+
+2.
GadgetFS performs some sanity checks on the provided USB descriptors,
+   while Raw Gadget allows you to provide arbitrary data as responses to
+   USB requests.
+
+3. Raw Gadget provides a way to select a UDC device/driver to bind to,
+   while GadgetFS currently binds to the first available UDC.
+
+4. Raw Gadget uses predictable endpoint names (handles) across different
+   UDCs (as long as UDCs have enough endpoints of each required transfer
+   type).
+
+5. Raw Gadget has ioctl-based interface instead of a filesystem-based one.
+
+Userspace interface
+~~~~~~~~~~~~~~~~~~~
+
+To create a Raw Gadget instance open /dev/raw-gadget. Multiple raw-gadget
+instances (bound to different UDCs) can be used at the same time. The
+interaction with the opened file happens through the ioctl() calls, see
+comments in include/uapi/linux/usb/raw_gadget.h for details.
+
+The typical usage of Raw Gadget looks like:
+
+1. Open Raw Gadget instance via /dev/raw-gadget.
+2. Initialize the instance via USB_RAW_IOCTL_INIT.
+3. Launch the instance with USB_RAW_IOCTL_RUN.
+4. In a loop issue USB_RAW_IOCTL_EVENT_FETCH calls to receive events from
+   Raw Gadget and react to those depending on what kind of USB device
+   needs to be emulated.
+
+Potential future improvements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Implement ioctl's for setting/clearing halt status on endpoints.
+
+- Reporting more events (suspend, resume, etc.) through
+  USB_RAW_IOCTL_EVENT_FETCH.
+
+- Support O_NONBLOCK I/O.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 2e91370dc159..f759edafd938 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -266,7 +266,6 @@ Code  Seq#    Include File                                           Comments
 'o'   01-A1  `linux/dvb/*.h`                                         DVB
 'p'   00-0F  linux/phantom.h                                         conflict! (OpenHaptics needs this)
 'p'   00-1F  linux/rtc.h                                             conflict!
-'p'   00-3F  linux/mc146818rtc.h                                     conflict!
 'p'   40-7F  linux/nvram.h
 'p'   80-9F  linux/ppdev.h                                           user-space parport
                                                                      <mailto:tim@cyberelk.net>
diff --git a/Documentation/virtual/guest-halt-polling.txt b/Documentation/virt/guest-halt-polling.rst
index b3a2a294532d..b4e747942417 100644
--- a/Documentation/virtual/guest-halt-polling.txt
+++ b/Documentation/virt/guest-halt-polling.rst
@@ -1,9 +1,11 @@
+==================
 Guest halt polling
 ==================
 
 The cpuidle_haltpoll driver, with the haltpoll governor, allows
 the guest vcpus to poll for a specified amount of time before
 halting.
+
 This provides the following benefits to host side polling:
 
 	1) The POLL flag is set while polling is performed, which allows
@@ -29,18 +31,21 @@ Module Parameters
 The haltpoll governor has 5 tunable module parameters:
 
 1) guest_halt_poll_ns:
+
 Maximum amount of time, in nanoseconds, that polling is
 performed before halting.
 
 Default: 200000
 
 2) guest_halt_poll_shrink:
+
 Division factor used to shrink per-cpu guest_halt_poll_ns when
 wakeup event occurs after the global guest_halt_poll_ns.
 
 Default: 2
 
 3) guest_halt_poll_grow:
+
 Multiplication factor used to grow per-cpu guest_halt_poll_ns
 when event occurs after per-cpu guest_halt_poll_ns
 but before global guest_halt_poll_ns.
@@ -48,6 +53,7 @@ but before global guest_halt_poll_ns.
 Default: 2
 
 4) guest_halt_poll_grow_start:
+
 The per-cpu guest_halt_poll_ns eventually reaches zero
 in case of an idle system. This value sets the initial
 per-cpu guest_halt_poll_ns when growing. This can
@@ -66,7 +72,7 @@ high once achieves global guest_halt_poll_ns value).
 
 Default: Y
 
-The module parameters can be set from the debugfs files in:
+The module parameters can be set from the debugfs files in::
 
 	/sys/module/haltpoll/parameters/
 
@@ -74,5 +80,5 @@ Further Notes
 =============
 
 - Care should be taken when setting the guest_halt_poll_ns parameter as a
-large value has the potential to drive the cpu usage to 100% on a machine which
-would be almost entirely idle otherwise.
+  large value has the potential to drive the cpu usage to 100% on a machine
+  which would be almost entirely idle otherwise.
diff --git a/Documentation/virt/index.rst b/Documentation/virt/index.rst
index 062ffb527043..de1ab81df958 100644
--- a/Documentation/virt/index.rst
+++ b/Documentation/virt/index.rst
@@ -8,7 +8,9 @@ Linux Virtualization Support
    :maxdepth: 2
 
    kvm/index
+   uml/user_mode_linux
    paravirt_ops
+   guest-halt-polling
 
 .. only:: html and subproject
 
diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index d18c97b4e140..c3129b9ba5cb 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -53,6 +53,29 @@ key management interface to perform common hypervisor activities such as
 encrypting bootstrap code, snapshot, migrating and debugging the guest. For more
 information, see the SEV Key Management spec [api-spec]_
 
+The main ioctl to access SEV is KVM_MEM_ENCRYPT_OP.  If the argument
+to KVM_MEM_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled
+and ``ENOTTY` if it is disabled (on some older versions of Linux,
+the ioctl runs normally even with a NULL argument, and therefore will
+likely return ``EFAULT``).  If non-NULL, the argument to KVM_MEM_ENCRYPT_OP
+must be a struct kvm_sev_cmd::
+
+       struct kvm_sev_cmd {
+               __u32 id;
+               __u64 data;
+               __u32 error;
+               __u32 sev_fd;
+       };
+
+
+The ``id`` field contains the subcommand, and the ``data`` field points to
+another struct containing arguments specific to command.  The ``sev_fd``
+should point to a file descriptor that is opened on the ``/dev/sev``
+device, if needed (see individual commands).
+
+On output, ``error`` is zero on success, or an error code.  Error codes
+are defined in ``<linux/psp-dev.h>`.
+
 KVM implements the following commands to support common lifecycle events of SEV
 guests, such as launching, running, snapshotting, migrating and decommissioning.
 
@@ -90,6 +113,8 @@ Returns: 0 on success, -negative on error
 
 On success, the 'handle' field contains a new handle and on error, a negative value.
 
+KVM_SEV_LAUNCH_START requires the ``sev_fd`` field to be valid.
+
 For more details, see SEV spec Section 6.2.
 
 3. KVM_SEV_LAUNCH_UPDATE_DATA
diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.rst
index c6e1ce5d40de..ebd383fba939 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.rst
@@ -1,8 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
 The Definitive KVM (Kernel-based Virtual Machine) API Documentation
 ===================================================================
 
 1. General description
-----------------------
+======================
 
 The kvm API is a set of ioctls that are issued to control various aspects
 of a virtual machine.  The ioctls belong to the following classes:
@@ -33,7 +36,7 @@ of a virtual machine.  The ioctls belong to the following classes:
    was used to create the VM.
 
 2. File descriptors
--------------------
+===================
 
 The kvm API is centered around file descriptors.  An initial
 open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
@@ -70,7 +73,7 @@ the VM is shut down.
 
 
 3. Extensions
--------------
+=============
 
 As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
 incompatible change are allowed.  However, there is an extension
@@ -84,13 +87,14 @@ set of ioctls is available for application use.
 
 
 4. API description
-------------------
+==================
 
 This section describes ioctls that can be used to control kvm guests.
 For each ioctl, the following information is provided along with a
 description:
 
-  Capability: which KVM extension provides this ioctl.  Can be 'basic',
+  Capability:
+      which KVM extension provides this ioctl.  Can be 'basic',
Can be 'basic', which means that is will be provided by any kernel that supports API version 12 (see section 4.1), a KVM_CAP_xyz constant, which means availability needs to be checked with KVM_CHECK_EXTENSION @@ -99,24 +103,29 @@ description: availability: for kernels that don't support the ioctl, the ioctl returns -ENOTTY. - Architectures: which instruction set architectures provide this ioctl. + Architectures: + which instruction set architectures provide this ioctl. x86 includes both i386 and x86_64. - Type: system, vm, or vcpu. + Type: + system, vm, or vcpu. - Parameters: what parameters are accepted by the ioctl. + Parameters: + what parameters are accepted by the ioctl. - Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) + Returns: + the return value. General error numbers (EBADF, ENOMEM, EINVAL) are not detailed, but errors with specific meanings are. 4.1 KVM_GET_API_VERSION +----------------------- -Capability: basic -Architectures: all -Type: system ioctl -Parameters: none -Returns: the constant KVM_API_VERSION (=12) +:Capability: basic +:Architectures: all +:Type: system ioctl +:Parameters: none +:Returns: the constant KVM_API_VERSION (=12) This identifies the API version as the stable kvm API. It is not expected that this number will change. However, Linux 2.6.20 and @@ -127,12 +136,13 @@ described as 'basic' will be available. 4.2 KVM_CREATE_VM +----------------- -Capability: basic -Architectures: all -Type: system ioctl -Parameters: machine type identifier (KVM_VM_*) -Returns: a VM fd that can be used to control the new virtual machine. +:Capability: basic +:Architectures: all +:Type: system ioctl +:Parameters: machine type identifier (KVM_VM_*) +:Returns: a VM fd that can be used to control the new virtual machine. The new VM has no virtual cpus and no memory. You probably want to use 0 as machine type. @@ -155,17 +165,17 @@ identifier, where IPA_Bits is the maximum width of any physical address used by the VM. 
The IPA_Bits is encoded in bits[7-0] of the
 machine type identifier.
 
-e.g, to configure a guest to use 48bit physical address size :
+e.g, to configure a guest to use 48bit physical address size::
 
     vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48));
 
-The requested size (IPA_Bits) must be :
-  0 - Implies default size, 40bits (for backward compatibility)
+The requested size (IPA_Bits) must be:
 
-  or
-
-  N - Implies N bits, where N is a positive integer such that,
+ ==   =========================================================
+  0   Implies default size, 40bits (for backward compatibility)
+  N   Implies N bits, where N is a positive integer such that,
       32 <= N <= Host_IPA_Limit
+ ==   =========================================================
 
 Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and
 is dependent on the CPU capability and the kernel configuration. The limit can
@@ -179,21 +189,28 @@ host physical address translations).
 
 
 4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST
+----------------------------------------------------------
+
+:Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
+:Architectures: x86
+:Type: system ioctl
+:Parameters: struct kvm_msr_list (in/out)
+:Returns: 0 on success; -1 on error
 
-Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST
-Architectures: x86
-Type: system ioctl
-Parameters: struct kvm_msr_list (in/out)
-Returns: 0 on success; -1 on error
 Errors:
-  EFAULT:    the msr index list cannot be read from or written to
-  E2BIG:     the msr index list is to be to fit in the array specified by
+
+  ======     ============================================================
+  EFAULT     the msr index list cannot be read from or written to
+  E2BIG      the msr index list is to be to fit in the array specified by
              the user.
+  ======     ============================================================
 
-struct kvm_msr_list {
+::
+
+  struct kvm_msr_list {
 	__u32 nmsrs; /* number of msrs in entries */
 	__u32 indices[0];
-};
+  };
 
 The user fills in the size of the indices array in nmsrs, and in return
 kvm adjusts nmsrs to reflect the actual number of msrs and fills in the
@@ -214,12 +231,13 @@ otherwise.
 
 
 4.4 KVM_CHECK_EXTENSION
+-----------------------
 
-Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
-Architectures: all
-Type: system ioctl, vm ioctl
-Parameters: extension identifier (KVM_CAP_*)
-Returns: 0 if unsupported; 1 (or some other positive integer) if supported
+:Capability: basic, KVM_CAP_CHECK_EXTENSION_VM for vm ioctl
+:Architectures: all
+:Type: system ioctl, vm ioctl
+:Parameters: extension identifier (KVM_CAP_*)
+:Returns: 0 if unsupported; 1 (or some other positive integer) if supported
 
 The API allows the application to query about extensions to the core
 kvm API.  Userspace passes an extension identifier (an integer) and
@@ -232,12 +250,13 @@ It is thus encouraged to use the vm ioctl to query for capabilities (available
 with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
 
 4.5 KVM_GET_VCPU_MMAP_SIZE
+--------------------------
 
-Capability: basic
-Architectures: all
-Type: system ioctl
-Parameters: none
-Returns: size of vcpu mmap area, in bytes
+:Capability: basic
+:Architectures: all
+:Type: system ioctl
+:Parameters: none
+:Returns: size of vcpu mmap area, in bytes
 
 The KVM_RUN ioctl (cf.) communicates with userspace via a shared
 memory region.  This ioctl returns the size of that region.  See the
@@ -245,23 +264,25 @@ KVM_RUN documentation for details.
 
 
 4.6 KVM_SET_MEMORY_REGION
+-------------------------
 
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_memory_region (in)
-Returns: 0 on success, -1 on error
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_memory_region (in)
+:Returns: 0 on success, -1 on error
 
 This ioctl is obsolete and has been removed.
 
 
 4.7 KVM_CREATE_VCPU
+-------------------
 
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: vcpu id (apic id on x86)
-Returns: vcpu fd on success, -1 on error
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: vcpu id (apic id on x86)
+:Returns: vcpu fd on success, -1 on error
 
 This API adds a vcpu to a virtual machine. No more than max_vcpus may
 be added. The vcpu id is an integer in the range [0, max_vcpu_id).
@@ -302,22 +323,25 @@ cpu's hardware control block.
 
 
 4.8 KVM_GET_DIRTY_LOG (vm ioctl)
+--------------------------------
 
-Capability: basic
-Architectures: all
-Type: vm ioctl
-Parameters: struct kvm_dirty_log (in/out)
-Returns: 0 on success, -1 on error
+:Capability: basic
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_dirty_log (in/out)
+:Returns: 0 on success, -1 on error
 
-/* for KVM_GET_DIRTY_LOG */
-struct kvm_dirty_log {
+::
+
+  /* for KVM_GET_DIRTY_LOG */
+  struct kvm_dirty_log {
 	__u32 slot;
 	__u32 padding;
 	union {
 		void __user *dirty_bitmap; /* one bit per page */
 		__u64 padding;
 	};
-};
+  };
 
 Given a memory slot, return a bitmap containing any pages dirtied
 since the last call to this ioctl.  Bit 0 is the first page in the
@@ -334,25 +358,31 @@ KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.  For more information,
 see the description of the capability.
 
 4.9 KVM_SET_MEMORY_ALIAS
+------------------------
 
-Capability: basic
-Architectures: x86
-Type: vm ioctl
-Parameters: struct kvm_memory_alias (in)
-Returns: 0 (success), -1 (error)
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_alias (in)
+:Returns: 0 (success), -1 (error)
 
 This ioctl is obsolete and has been removed.
 
 
 4.10 KVM_RUN
+------------
+
+:Capability: basic
+:Architectures: all
+:Type: vcpu ioctl
+:Parameters: none
+:Returns: 0 on success, -1 on error
 
-Capability: basic
-Architectures: all
-Type: vcpu ioctl
-Parameters: none
-Returns: 0 on success, -1 on error
 Errors:
-  EINTR:     an unmasked signal is pending
+
+  =====      =============================
+  EINTR      an unmasked signal is pending
+  =====      =============================
 
 This ioctl is used to run a guest virtual cpu.  While there are no
 explicit parameters, there is an implicit parameter block that can be
@@ -362,42 +392,46 @@ kvm_run' (see below).
 
 
 4.11 KVM_GET_REGS
+-----------------
 
-Capability: basic
-Architectures: all except ARM, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_regs (out)
-Returns: 0 on success, -1 on error
+:Capability: basic
+:Architectures: all except ARM, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_regs (out)
+:Returns: 0 on success, -1 on error
 
 Reads the general purpose registers from the vcpu.
 
-/* x86 */
-struct kvm_regs {
+::
+
+  /* x86 */
+  struct kvm_regs {
 	/* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
 	__u64 rax, rbx, rcx, rdx;
 	__u64 rsi, rdi, rsp, rbp;
 	__u64 r8,  r9,  r10, r11;
 	__u64 r12, r13, r14, r15;
 	__u64 rip, rflags;
-};
+  };
 
-/* mips */
-struct kvm_regs {
+  /* mips */
+  struct kvm_regs {
 	/* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
 	__u64 gpr[32];
 	__u64 hi;
 	__u64 lo;
 	__u64 pc;
-};
+  };
 
 
 4.12 KVM_SET_REGS
+-----------------
 
-Capability: basic
-Architectures: all except ARM, arm64
-Type: vcpu ioctl
-Parameters: struct kvm_regs (in)
-Returns: 0 on success, -1 on error
+:Capability: basic
+:Architectures: all except ARM, arm64
+:Type: vcpu ioctl
+:Parameters: struct kvm_regs (in)
+:Returns: 0 on success, -1 on error
 
 Writes the general purpose registers into the vcpu.
 
@@ -405,17 +439,20 @@ See KVM_GET_REGS for the data structure.
 
 
 4.13 KVM_GET_SREGS
+------------------
 
-Capability: basic
-Architectures: x86, ppc
-Type: vcpu ioctl
-Parameters: struct kvm_sregs (out)
-Returns: 0 on success, -1 on error
+:Capability: basic
+:Architectures: x86, ppc
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs (out)
+:Returns: 0 on success, -1 on error
 
 Reads special registers from the vcpu.
 
-/* x86 */
-struct kvm_sregs {
+::
+
+  /* x86 */
+  struct kvm_sregs {
 	struct kvm_segment cs, ds, es, fs, gs, ss;
 	struct kvm_segment tr, ldt;
 	struct kvm_dtable gdt, idt;
@@ -423,9 +460,9 @@ struct kvm_sregs {
 	__u64 efer;
 	__u64 apic_base;
 	__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
-};
+  };
 
-/* ppc -- see arch/powerpc/include/uapi/asm/kvm.h */
+  /* ppc -- see arch/powerpc/include/uapi/asm/kvm.h */
 
 interrupt_bitmap is a bitmap of pending external interrupts.  At most
 one bit may be set.  This interrupt has been acknowledged by the APIC
@@ -433,29 +470,33 @@ but not yet injected into the cpu core.
4.14 KVM_SET_SREGS +------------------ -Capability: basic -Architectures: x86, ppc -Type: vcpu ioctl -Parameters: struct kvm_sregs (in) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: x86, ppc +:Type: vcpu ioctl +:Parameters: struct kvm_sregs (in) +:Returns: 0 on success, -1 on error Writes special registers into the vcpu. See KVM_GET_SREGS for the data structures. 4.15 KVM_TRANSLATE +------------------ -Capability: basic -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_translation (in/out) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_translation (in/out) +:Returns: 0 on success, -1 on error Translates a virtual address according to the vcpu's current address translation mode. -struct kvm_translation { +:: + + struct kvm_translation { /* in */ __u64 linear_address; @@ -465,59 +506,68 @@ struct kvm_translation { __u8 writeable; __u8 usermode; __u8 pad[5]; -}; + }; 4.16 KVM_INTERRUPT +------------------ -Capability: basic -Architectures: x86, ppc, mips -Type: vcpu ioctl -Parameters: struct kvm_interrupt (in) -Returns: 0 on success, negative on failure. +:Capability: basic +:Architectures: x86, ppc, mips +:Type: vcpu ioctl +:Parameters: struct kvm_interrupt (in) +:Returns: 0 on success, negative on failure. Queues a hardware interrupt vector to be injected. 
-/* for KVM_INTERRUPT */ -struct kvm_interrupt { +:: + + /* for KVM_INTERRUPT */ + struct kvm_interrupt { /* in */ __u32 irq; -}; + }; X86: +^^^^ + +:Returns: -Returns: 0 on success, - -EEXIST if an interrupt is already enqueued - -EINVAL the the irq number is invalid - -ENXIO if the PIC is in the kernel - -EFAULT if the pointer is invalid + ========= =================================== + 0 on success, + -EEXIST if an interrupt is already enqueued + -EINVAL the the irq number is invalid + -ENXIO if the PIC is in the kernel + -EFAULT if the pointer is invalid + ========= =================================== Note 'irq' is an interrupt vector, not an interrupt pin or line. This ioctl is useful if the in-kernel PIC is not used. PPC: +^^^^ Queues an external interrupt to be injected. This ioctl is overleaded with 3 different irq values: a) KVM_INTERRUPT_SET - This injects an edge type external interrupt into the guest once it's ready - to receive interrupts. When injected, the interrupt is done. + This injects an edge type external interrupt into the guest once it's ready + to receive interrupts. When injected, the interrupt is done. b) KVM_INTERRUPT_UNSET - This unsets any pending interrupt. + This unsets any pending interrupt. - Only available with KVM_CAP_PPC_UNSET_IRQ. + Only available with KVM_CAP_PPC_UNSET_IRQ. c) KVM_INTERRUPT_SET_LEVEL - This injects a level type external interrupt into the guest context. The - interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET - is triggered. + This injects a level type external interrupt into the guest context. The + interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET + is triggered. - Only available with KVM_CAP_PPC_IRQ_LEVEL. + Only available with KVM_CAP_PPC_IRQ_LEVEL. Note that any value for 'irq' other than the ones stated above is invalid and incurs unexpected behavior. @@ -525,6 +575,7 @@ and incurs unexpected behavior. 
This is an asynchronous vcpu ioctl and can be invoked from any thread. MIPS: +^^^^^ Queues an external interrupt to be injected into the virtual CPU. A negative interrupt number dequeues the interrupt. @@ -533,24 +584,26 @@ This is an asynchronous vcpu ioctl and can be invoked from any thread. 4.17 KVM_DEBUG_GUEST +-------------------- -Capability: basic -Architectures: none -Type: vcpu ioctl -Parameters: none) -Returns: -1 on error +:Capability: basic +:Architectures: none +:Type: vcpu ioctl +:Parameters: none) +:Returns: -1 on error Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. 4.18 KVM_GET_MSRS +----------------- -Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system) -Architectures: x86 -Type: system ioctl, vcpu ioctl -Parameters: struct kvm_msrs (in/out) -Returns: number of msrs successfully returned; - -1 on error +:Capability: basic (vcpu), KVM_CAP_GET_MSR_FEATURES (system) +:Architectures: x86 +:Type: system ioctl, vcpu ioctl +:Parameters: struct kvm_msrs (in/out) +:Returns: number of msrs successfully returned; + -1 on error When used as a system ioctl: Reads the values of MSR-based features that are available for the VM. This @@ -562,18 +615,20 @@ When used as a vcpu ioctl: Reads model-specific registers from the vcpu. Supported msr indices can be obtained using KVM_GET_MSR_INDEX_LIST in a system ioctl. -struct kvm_msrs { +:: + + struct kvm_msrs { __u32 nmsrs; /* number of msrs in entries */ __u32 pad; struct kvm_msr_entry entries[0]; -}; + }; -struct kvm_msr_entry { + struct kvm_msr_entry { __u32 index; __u32 reserved; __u64 data; -}; + }; Application code should set the 'nmsrs' member (which indicates the size of the entries array) and the 'index' member of each array entry. @@ -581,12 +636,13 @@ kvm will fill in the 'data' member. 
4.19 KVM_SET_MSRS +----------------- -Capability: basic -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_msrs (in) -Returns: number of msrs successfully set (see below), -1 on error +:Capability: basic +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_msrs (in) +:Returns: number of msrs successfully set (see below), -1 on error Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the data structures. @@ -602,41 +658,44 @@ MSRs that have been set successfully. 4.20 KVM_SET_CPUID +------------------ -Capability: basic -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_cpuid (in) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_cpuid (in) +:Returns: 0 on success, -1 on error Defines the vcpu responses to the cpuid instruction. Applications should use the KVM_SET_CPUID2 ioctl if available. +:: -struct kvm_cpuid_entry { + struct kvm_cpuid_entry { __u32 function; __u32 eax; __u32 ebx; __u32 ecx; __u32 edx; __u32 padding; -}; + }; -/* for KVM_SET_CPUID */ -struct kvm_cpuid { + /* for KVM_SET_CPUID */ + struct kvm_cpuid { __u32 nent; __u32 padding; struct kvm_cpuid_entry entries[0]; -}; + }; 4.21 KVM_SET_SIGNAL_MASK +------------------------ -Capability: basic -Architectures: all -Type: vcpu ioctl -Parameters: struct kvm_signal_mask (in) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: all +:Type: vcpu ioctl +:Parameters: struct kvm_signal_mask (in) +:Returns: 0 on success, -1 on error Defines which signals are blocked during execution of KVM_RUN. This signal mask temporarily overrides the threads signal mask. Any @@ -646,25 +705,30 @@ their traditional behaviour) will cause KVM_RUN to return with -EINTR. Note the signal will only be delivered if not blocked by the original signal mask. 
-/* for KVM_SET_SIGNAL_MASK */ -struct kvm_signal_mask { +:: + + /* for KVM_SET_SIGNAL_MASK */ + struct kvm_signal_mask { __u32 len; __u8 sigset[0]; -}; + }; 4.22 KVM_GET_FPU +---------------- -Capability: basic -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_fpu (out) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_fpu (out) +:Returns: 0 on success, -1 on error Reads the floating point state from the vcpu. -/* for KVM_GET_FPU and KVM_SET_FPU */ -struct kvm_fpu { +:: + + /* for KVM_GET_FPU and KVM_SET_FPU */ + struct kvm_fpu { __u8 fpr[8][16]; __u16 fcw; __u16 fsw; @@ -676,21 +740,24 @@ struct kvm_fpu { __u8 xmm[16][16]; __u32 mxcsr; __u32 pad2; -}; + }; 4.23 KVM_SET_FPU +---------------- -Capability: basic -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_fpu (in) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_fpu (in) +:Returns: 0 on success, -1 on error Writes the floating point state to the vcpu. -/* for KVM_GET_FPU and KVM_SET_FPU */ -struct kvm_fpu { +:: + + /* for KVM_GET_FPU and KVM_SET_FPU */ + struct kvm_fpu { __u8 fpr[8][16]; __u16 fcw; __u16 fsw; @@ -702,16 +769,17 @@ struct kvm_fpu { __u8 xmm[16][16]; __u32 mxcsr; __u32 pad2; -}; + }; 4.24 KVM_CREATE_IRQCHIP +----------------------- -Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390) -Architectures: x86, ARM, arm64, s390 -Type: vm ioctl -Parameters: none -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP, KVM_CAP_S390_IRQCHIP (s390) +:Architectures: x86, ARM, arm64, s390 +:Type: vm ioctl +:Parameters: none +:Returns: 0 on success, -1 on error Creates an interrupt controller model in the kernel. On x86, creates a virtual ioapic, a virtual PIC (two PICs, nested), and sets up @@ -727,12 +795,13 @@ before KVM_CREATE_IRQCHIP can be used. 
4.25 KVM_IRQ_LINE +----------------- -Capability: KVM_CAP_IRQCHIP -Architectures: x86, arm, arm64 -Type: vm ioctl -Parameters: struct kvm_irq_level -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP +:Architectures: x86, arm, arm64 +:Type: vm ioctl +:Parameters: struct kvm_irq_level +:Returns: 0 on success, -1 on error Sets the level of a GSI input to the interrupt controller model in the kernel. On some architectures it is required that an interrupt controller model has @@ -756,16 +825,20 @@ of course). ARM/arm64 can signal an interrupt either at the CPU level, or at the in-kernel irqchip (GIC), and for in-kernel irqchip can tell the GIC to use PPIs designated for specific cpus. The irq field is interpreted -like this: +like this:: bits: | 31 ... 28 | 27 ... 24 | 23 ... 16 | 15 ... 0 | field: | vcpu2_index | irq_type | vcpu_index | irq_id | The irq_type field has the following values: -- irq_type[0]: out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ -- irq_type[1]: in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.) + +- irq_type[0]: + out-of-kernel GIC: irq_id 0 is IRQ, irq_id 1 is FIQ +- irq_type[1]: + in-kernel GIC: SPI, irq_id between 32 and 1019 (incl.) (the vcpu_index field is ignored) -- irq_type[2]: in-kernel GIC: PPI, irq_id between 16 and 31 (incl.) +- irq_type[2]: + in-kernel GIC: PPI, irq_id between 16 and 31 (incl.) (The irq_id field thus corresponds nicely to the IRQ ID in the ARM GIC specs) @@ -779,27 +852,32 @@ Note that on arm/arm64, the KVM_CAP_IRQCHIP capability only conditions injection of interrupts for the in-kernel irqchip. KVM_IRQ_LINE can always be used for a userspace interrupt controller. 
-struct kvm_irq_level { +:: + + struct kvm_irq_level { union { __u32 irq; /* GSI */ __s32 status; /* not used for KVM_IRQ_LEVEL */ }; __u32 level; /* 0 or 1 */ -}; + }; 4.26 KVM_GET_IRQCHIP +-------------------- -Capability: KVM_CAP_IRQCHIP -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_irqchip (in/out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_irqchip (in/out) +:Returns: 0 on success, -1 on error Reads the state of a kernel interrupt controller created with KVM_CREATE_IRQCHIP into a buffer provided by the caller. -struct kvm_irqchip { +:: + + struct kvm_irqchip { __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ __u32 pad; union { @@ -807,21 +885,24 @@ struct kvm_irqchip { struct kvm_pic_state pic; struct kvm_ioapic_state ioapic; } chip; -}; + }; 4.27 KVM_SET_IRQCHIP +-------------------- -Capability: KVM_CAP_IRQCHIP -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_irqchip (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_irqchip (in) +:Returns: 0 on success, -1 on error Sets the state of a kernel interrupt controller created with KVM_CREATE_IRQCHIP from a buffer provided by the caller. 
-struct kvm_irqchip { +:: + + struct kvm_irqchip { __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */ __u32 pad; union { @@ -829,16 +910,17 @@ struct kvm_irqchip { struct kvm_pic_state pic; struct kvm_ioapic_state ioapic; } chip; -}; + }; 4.28 KVM_XEN_HVM_CONFIG +----------------------- -Capability: KVM_CAP_XEN_HVM -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_xen_hvm_config (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_XEN_HVM +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_xen_hvm_config (in) +:Returns: 0 on success, -1 on error Sets the MSR that the Xen HVM guest uses to initialize its hypercall page, and provides the starting address and size of the hypercall @@ -846,7 +928,9 @@ blobs in userspace. When the guest writes the MSR, kvm copies one page of a blob (32- or 64-bit, depending on the vcpu mode) to guest memory. -struct kvm_xen_hvm_config { +:: + + struct kvm_xen_hvm_config { __u32 flags; __u32 msr; __u64 blob_addr_32; @@ -854,16 +938,17 @@ struct kvm_xen_hvm_config { __u8 blob_size_32; __u8 blob_size_64; __u8 pad2[30]; -}; + }; 4.29 KVM_GET_CLOCK +------------------ -Capability: KVM_CAP_ADJUST_CLOCK -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_clock_data (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_ADJUST_CLOCK +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_clock_data (out) +:Returns: 0 on success, -1 on error Gets the current timestamp of kvmclock as seen by the current guest. In conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios @@ -880,47 +965,56 @@ with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock, but the exact value read by each VCPU could differ, because the host TSC is not stable. 
-struct kvm_clock_data { +:: + + struct kvm_clock_data { __u64 clock; /* kvmclock current value */ __u32 flags; __u32 pad[9]; -}; + }; 4.30 KVM_SET_CLOCK +------------------ -Capability: KVM_CAP_ADJUST_CLOCK -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_clock_data (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_ADJUST_CLOCK +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_clock_data (in) +:Returns: 0 on success, -1 on error Sets the current timestamp of kvmclock to the value specified in its parameter. In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios such as migration. -struct kvm_clock_data { +:: + + struct kvm_clock_data { __u64 clock; /* kvmclock current value */ __u32 flags; __u32 pad[9]; -}; + }; 4.31 KVM_GET_VCPU_EVENTS +------------------------ -Capability: KVM_CAP_VCPU_EVENTS -Extended by: KVM_CAP_INTR_SHADOW -Architectures: x86, arm, arm64 -Type: vcpu ioctl -Parameters: struct kvm_vcpu_event (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_VCPU_EVENTS +:Extended by: KVM_CAP_INTR_SHADOW +:Architectures: x86, arm, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_vcpu_event (out) +:Returns: 0 on success, -1 on error X86: +^^^^ Gets currently pending exceptions, interrupts, and NMIs as well as related states of the vcpu. -struct kvm_vcpu_events { +:: + + struct kvm_vcpu_events { struct { __u8 injected; __u8 nr; @@ -951,7 +1045,7 @@ struct kvm_vcpu_events { __u8 reserved[27]; __u8 exception_has_payload; __u64 exception_payload; -}; + }; The following bits are defined in the flags field: @@ -967,6 +1061,7 @@ The following bits are defined in the flags field: KVM_CAP_EXCEPTION_PAYLOAD is enabled. 
ARM/ARM64: +^^^^^^^^^^ If the guest accesses a device that is being emulated by the host kernel in such a way that a real device would generate a physical SError, KVM may make @@ -1006,8 +1101,9 @@ It is not possible to read back a pending external abort (injected via KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered directly to the virtual CPU). +:: -struct kvm_vcpu_events { + struct kvm_vcpu_events { struct { __u8 serror_pending; __u8 serror_has_esr; @@ -1017,18 +1113,20 @@ struct kvm_vcpu_events { __u64 serror_esr; } exception; __u32 reserved[12]; -}; + }; 4.32 KVM_SET_VCPU_EVENTS +------------------------ -Capability: KVM_CAP_VCPU_EVENTS -Extended by: KVM_CAP_INTR_SHADOW -Architectures: x86, arm, arm64 -Type: vcpu ioctl -Parameters: struct kvm_vcpu_event (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_VCPU_EVENTS +:Extended by: KVM_CAP_INTR_SHADOW +:Architectures: x86, arm, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_vcpu_event (in) +:Returns: 0 on success, -1 on error X86: +^^^^ Set pending exceptions, interrupts, and NMIs as well as related states of the vcpu. @@ -1040,9 +1138,11 @@ from the update. These fields are nmi.pending, sipi_vector, smi.smm, smi.pending. Keep the corresponding bits in the flags field cleared to suppress overwriting the current in-kernel state. The bits are: -KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel -KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector -KVM_VCPUEVENT_VALID_SMM - transfer the smi sub-struct. +=============================== ================================== +KVM_VCPUEVENT_VALID_NMI_PENDING transfer nmi.pending to the kernel +KVM_VCPUEVENT_VALID_SIPI_VECTOR transfer sipi_vector +KVM_VCPUEVENT_VALID_SMM transfer the smi sub-struct. 
+=============================== ================================== If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in the flags field to signal that interrupt.shadow contains a valid state and @@ -1056,6 +1156,7 @@ exception_has_payload, exception_payload, and exception.pending fields contain a valid state and shall be written into the VCPU. ARM/ARM64: +^^^^^^^^^^ User space may need to inject several types of events to the guest. @@ -1078,31 +1179,35 @@ See KVM_GET_VCPU_EVENTS for the data structure. 4.33 KVM_GET_DEBUGREGS +---------------------- -Capability: KVM_CAP_DEBUGREGS -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_debugregs (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_DEBUGREGS +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_debugregs (out) +:Returns: 0 on success, -1 on error Reads debug registers from the vcpu. -struct kvm_debugregs { +:: + + struct kvm_debugregs { __u64 db[4]; __u64 dr6; __u64 dr7; __u64 flags; __u64 reserved[9]; -}; + }; 4.34 KVM_SET_DEBUGREGS +---------------------- -Capability: KVM_CAP_DEBUGREGS -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_debugregs (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_DEBUGREGS +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_debugregs (in) +:Returns: 0 on success, -1 on error Writes debug registers into the vcpu. @@ -1111,24 +1216,27 @@ yet and must be cleared on entry. 
4.35 KVM_SET_USER_MEMORY_REGION +------------------------------- -Capability: KVM_CAP_USER_MEMORY -Architectures: all -Type: vm ioctl -Parameters: struct kvm_userspace_memory_region (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_USER_MEMORY +:Architectures: all +:Type: vm ioctl +:Parameters: struct kvm_userspace_memory_region (in) +:Returns: 0 on success, -1 on error -struct kvm_userspace_memory_region { +:: + + struct kvm_userspace_memory_region { __u32 slot; __u32 flags; __u64 guest_phys_addr; __u64 memory_size; /* bytes */ __u64 userspace_addr; /* start of the userspace allocated memory */ -}; + }; -/* for kvm_memory_region::flags */ -#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) -#define KVM_MEM_READONLY (1UL << 1) + /* for kvm_memory_region::flags */ + #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) + #define KVM_MEM_READONLY (1UL << 1) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1174,12 +1282,13 @@ allocation and is deprecated. 4.36 KVM_SET_TSS_ADDR +--------------------- -Capability: KVM_CAP_SET_TSS_ADDR -Architectures: x86 -Type: vm ioctl -Parameters: unsigned long tss_address (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_SET_TSS_ADDR +:Architectures: x86 +:Type: vm ioctl +:Parameters: unsigned long tss_address (in) +:Returns: 0 on success, -1 on error This ioctl defines the physical address of a three-page region in the guest physical address space. The region must be within the first 4GB of the @@ -1193,21 +1302,24 @@ documentation when it pops into existence). 
4.37 KVM_ENABLE_CAP +------------------- + +:Capability: KVM_CAP_ENABLE_CAP +:Architectures: mips, ppc, s390 +:Type: vcpu ioctl +:Parameters: struct kvm_enable_cap (in) +:Returns: 0 on success; -1 on error -Capability: KVM_CAP_ENABLE_CAP -Architectures: mips, ppc, s390 -Type: vcpu ioctl -Parameters: struct kvm_enable_cap (in) -Returns: 0 on success; -1 on error +:Capability: KVM_CAP_ENABLE_CAP_VM +:Architectures: all +:Type: vcpu ioctl +:Parameters: struct kvm_enable_cap (in) +:Returns: 0 on success; -1 on error -Capability: KVM_CAP_ENABLE_CAP_VM -Architectures: all -Type: vcpu ioctl -Parameters: struct kvm_enable_cap (in) -Returns: 0 on success; -1 on error +.. note:: -+Not all extensions are enabled by default. Using this ioctl the application -can enable an extension, making it available to the guest. + Not all extensions are enabled by default. Using this ioctl the application + can enable an extension, making it available to the guest. On systems that do not support this ioctl, it always fails. On systems that do support it, it only works for extensions that are supported for enablement. @@ -1215,76 +1327,91 @@ do support it, it only works for extensions that are supported for enablement. To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should be used. -struct kvm_enable_cap { +:: + + struct kvm_enable_cap { /* in */ __u32 cap; The capability that is supposed to get enabled. +:: + __u32 flags; A bitfield indicating future enhancements. Has to be 0 for now. +:: + __u64 args[4]; Arguments for enabling a feature. If a feature needs initial values to function properly, this is the place to put them. +:: + __u8 pad[64]; -}; + }; The vcpu ioctl should be used for vcpu-specific capabilities, the vm ioctl for vm-wide capabilities. 
4.38 KVM_GET_MP_STATE +--------------------- + +:Capability: KVM_CAP_MP_STATE +:Architectures: x86, s390, arm, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_mp_state (out) +:Returns: 0 on success; -1 on error -Capability: KVM_CAP_MP_STATE -Architectures: x86, s390, arm, arm64 -Type: vcpu ioctl -Parameters: struct kvm_mp_state (out) -Returns: 0 on success; -1 on error +:: -struct kvm_mp_state { + struct kvm_mp_state { __u32 mp_state; -}; + }; Returns the vcpu's current "multiprocessing state" (though also valid on uniprocessor guests). Possible values are: - - KVM_MP_STATE_RUNNABLE: the vcpu is currently running [x86,arm/arm64] - - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP) + ========================== =============================================== + KVM_MP_STATE_RUNNABLE the vcpu is currently running [x86,arm/arm64] + KVM_MP_STATE_UNINITIALIZED the vcpu is an application processor (AP) which has not yet received an INIT signal [x86] - - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is + KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is now ready for a SIPI [x86] - - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and + KVM_MP_STATE_HALTED the vcpu has executed a HLT instruction and is waiting for an interrupt [x86] - - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector + KVM_MP_STATE_SIPI_RECEIVED the vcpu has just received a SIPI (vector accessible via KVM_GET_VCPU_EVENTS) [x86] - - KVM_MP_STATE_STOPPED: the vcpu is stopped [s390,arm/arm64] - - KVM_MP_STATE_CHECK_STOP: the vcpu is in a special error state [s390] - - KVM_MP_STATE_OPERATING: the vcpu is operating (running or halted) + KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64] + KVM_MP_STATE_CHECK_STOP the vcpu is in a special error state [s390] + KVM_MP_STATE_OPERATING the vcpu is operating (running or halted) [s390] - - KVM_MP_STATE_LOAD: the vcpu is in a special load/startup state + 
KVM_MP_STATE_LOAD the vcpu is in a special load/startup state [s390] + ========================== =============================================== On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on these architectures. For arm/arm64: +^^^^^^^^^^^^^^ The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not. 4.39 KVM_SET_MP_STATE +--------------------- -Capability: KVM_CAP_MP_STATE -Architectures: x86, s390, arm, arm64 -Type: vcpu ioctl -Parameters: struct kvm_mp_state (in) -Returns: 0 on success; -1 on error +:Capability: KVM_CAP_MP_STATE +:Architectures: x86, s390, arm, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_mp_state (in) +:Returns: 0 on success; -1 on error Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for arguments. @@ -1294,17 +1421,19 @@ in-kernel irqchip, the multiprocessing state must be maintained by userspace on these architectures. For arm/arm64: +^^^^^^^^^^^^^^ The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not. 4.40 KVM_SET_IDENTITY_MAP_ADDR +------------------------------ -Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR -Architectures: x86 -Type: vm ioctl -Parameters: unsigned long identity (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR +:Architectures: x86 +:Type: vm ioctl +:Parameters: unsigned long identity (in) +:Returns: 0 on success, -1 on error This ioctl defines the physical address of a one-page region in the guest physical address space. The region must be within the first 4GB of the @@ -1322,12 +1451,13 @@ documentation when it pops into existence). Fails if any VCPU has already been created. 
4.41 KVM_SET_BOOT_CPU_ID +------------------------ -Capability: KVM_CAP_SET_BOOT_CPU_ID -Architectures: x86 -Type: vm ioctl -Parameters: unsigned long vcpu_id -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_SET_BOOT_CPU_ID +:Architectures: x86 +:Type: vm ioctl +:Parameters: unsigned long vcpu_id +:Returns: 0 on success, -1 on error Define which vcpu is the Bootstrap Processor (BSP). Values are the same as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default @@ -1335,102 +1465,119 @@ is vcpu 0. 4.42 KVM_GET_XSAVE +------------------ -Capability: KVM_CAP_XSAVE -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_xsave (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_XSAVE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_xsave (out) +:Returns: 0 on success, -1 on error -struct kvm_xsave { + +:: + + struct kvm_xsave { __u32 region[1024]; -}; + }; This ioctl would copy current vcpu's xsave struct to the userspace. 4.43 KVM_SET_XSAVE +------------------ + +:Capability: KVM_CAP_XSAVE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_xsave (in) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_XSAVE -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_xsave (in) -Returns: 0 on success, -1 on error +:: -struct kvm_xsave { + + struct kvm_xsave { __u32 region[1024]; -}; + }; This ioctl would copy userspace's xsave struct to the kernel. 
4.44 KVM_GET_XCRS +----------------- + +:Capability: KVM_CAP_XCRS +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_xcrs (out) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_XCRS -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_xcrs (out) -Returns: 0 on success, -1 on error +:: -struct kvm_xcr { + struct kvm_xcr { __u32 xcr; __u32 reserved; __u64 value; -}; + }; -struct kvm_xcrs { + struct kvm_xcrs { __u32 nr_xcrs; __u32 flags; struct kvm_xcr xcrs[KVM_MAX_XCRS]; __u64 padding[16]; -}; + }; This ioctl would copy current vcpu's xcrs to the userspace. 4.45 KVM_SET_XCRS +----------------- + +:Capability: KVM_CAP_XCRS +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_xcrs (in) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_XCRS -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_xcrs (in) -Returns: 0 on success, -1 on error +:: -struct kvm_xcr { + struct kvm_xcr { __u32 xcr; __u32 reserved; __u64 value; -}; + }; -struct kvm_xcrs { + struct kvm_xcrs { __u32 nr_xcrs; __u32 flags; struct kvm_xcr xcrs[KVM_MAX_XCRS]; __u64 padding[16]; -}; + }; This ioctl would set vcpu's xcr to the value userspace specified. 
4.46 KVM_GET_SUPPORTED_CPUID +---------------------------- -Capability: KVM_CAP_EXT_CPUID -Architectures: x86 -Type: system ioctl -Parameters: struct kvm_cpuid2 (in/out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_EXT_CPUID +:Architectures: x86 +:Type: system ioctl +:Parameters: struct kvm_cpuid2 (in/out) +:Returns: 0 on success, -1 on error -struct kvm_cpuid2 { +:: + + struct kvm_cpuid2 { __u32 nent; __u32 padding; struct kvm_cpuid_entry2 entries[0]; -}; + }; -#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0) -#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) -#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) + #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0) + #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) + #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) -struct kvm_cpuid_entry2 { + struct kvm_cpuid_entry2 { __u32 function; __u32 index; __u32 flags; @@ -1439,7 +1586,7 @@ struct kvm_cpuid_entry2 { __u32 ecx; __u32 edx; __u32 padding[3]; -}; + }; This ioctl returns x86 cpuid features which are supported by both the hardware and kvm in its default configuration. Userspace can use the @@ -1467,10 +1614,16 @@ with unknown or unsupported features masked out. Some features (for example, x2apic), may not be present in the host cpu, but are exposed by kvm if it can emulate them efficiently. The fields in each entry are defined as follows: - function: the eax value used to obtain the entry - index: the ecx value used to obtain the entry (for entries that are + function: + the eax value used to obtain the entry + + index: + the ecx value used to obtain the entry (for entries that are affected by ecx) - flags: an OR of zero or more of the following: + + flags: + an OR of zero or more of the following: + KVM_CPUID_FLAG_SIGNIFCANT_INDEX: if the index field is valid KVM_CPUID_FLAG_STATEFUL_FUNC: @@ -1480,12 +1633,14 @@ emulate them efficiently. 
The fields in each entry are defined as follows: KVM_CPUID_FLAG_STATE_READ_NEXT: for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is the first entry to be read by a cpu - eax, ebx, ecx, edx: the values returned by the cpuid instruction for + + eax, ebx, ecx, edx: + the values returned by the cpuid instruction for this function/index combination The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC -support. Instead it is reported via +support. Instead it is reported via:: ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER) @@ -1494,18 +1649,21 @@ feature in userspace, then you can enable the feature for KVM_SET_CPUID2. 4.47 KVM_PPC_GET_PVINFO +----------------------- -Capability: KVM_CAP_PPC_GET_PVINFO -Architectures: ppc -Type: vm ioctl -Parameters: struct kvm_ppc_pvinfo (out) -Returns: 0 on success, !0 on error +:Capability: KVM_CAP_PPC_GET_PVINFO +:Architectures: ppc +:Type: vm ioctl +:Parameters: struct kvm_ppc_pvinfo (out) +:Returns: 0 on success, !0 on error -struct kvm_ppc_pvinfo { +:: + + struct kvm_ppc_pvinfo { __u32 flags; __u32 hcall[4]; __u8 pad[108]; -}; + }; This ioctl fetches PV specific information that need to be passed to the guest using the device tree or other means from vm context. @@ -1515,33 +1673,39 @@ The hcall array defines 4 instructions that make up a hypercall. If any additional field gets added to this structure later on, a bit for that additional piece of information will be set in the flags bitmap. 
-The flags bitmap is defined as: +The flags bitmap is defined as:: /* the host supports the ePAPR idle hcall #define KVM_PPC_PVINFO_FLAGS_EV_IDLE (1<<0) 4.52 KVM_SET_GSI_ROUTING +------------------------ -Capability: KVM_CAP_IRQ_ROUTING -Architectures: x86 s390 arm arm64 -Type: vm ioctl -Parameters: struct kvm_irq_routing (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQ_ROUTING +:Architectures: x86 s390 arm arm64 +:Type: vm ioctl +:Parameters: struct kvm_irq_routing (in) +:Returns: 0 on success, -1 on error Sets the GSI routing table entries, overwriting any previously set entries. On arm/arm64, GSI routing has the following limitation: + - GSI routing does not apply to KVM_IRQ_LINE but only to KVM_IRQFD. -struct kvm_irq_routing { +:: + + struct kvm_irq_routing { __u32 nr; __u32 flags; struct kvm_irq_routing_entry entries[0]; -}; + }; No flags are specified so far, the corresponding field must be set to zero. -struct kvm_irq_routing_entry { +:: + + struct kvm_irq_routing_entry { __u32 gsi; __u32 type; __u32 flags; @@ -1553,15 +1717,16 @@ struct kvm_irq_routing_entry { struct kvm_irq_routing_hv_sint hv_sint; __u32 pad[8]; } u; -}; + }; -/* gsi routing entry types */ -#define KVM_IRQ_ROUTING_IRQCHIP 1 -#define KVM_IRQ_ROUTING_MSI 2 -#define KVM_IRQ_ROUTING_S390_ADAPTER 3 -#define KVM_IRQ_ROUTING_HV_SINT 4 + /* gsi routing entry types */ + #define KVM_IRQ_ROUTING_IRQCHIP 1 + #define KVM_IRQ_ROUTING_MSI 2 + #define KVM_IRQ_ROUTING_S390_ADAPTER 3 + #define KVM_IRQ_ROUTING_HV_SINT 4 flags: + - KVM_MSI_VALID_DEVID: used along with KVM_IRQ_ROUTING_MSI routing entry type, specifies that the devid field contains a valid value. The per-VM KVM_CAP_MSI_DEVID capability advertises the requirement to provide @@ -1569,12 +1734,14 @@ flags: never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail. 
- zero otherwise -struct kvm_irq_routing_irqchip { +:: + + struct kvm_irq_routing_irqchip { __u32 irqchip; __u32 pin; -}; + }; -struct kvm_irq_routing_msi { + struct kvm_irq_routing_msi { __u32 address_lo; __u32 address_hi; __u32 data; @@ -1582,7 +1749,7 @@ struct kvm_irq_routing_msi { __u32 pad; __u32 devid; }; -}; + }; If KVM_MSI_VALID_DEVID is set, devid contains a unique device identifier for the device that wrote the MSI message. For PCI, this is usually a @@ -1593,39 +1760,43 @@ feature of KVM_CAP_X2APIC_API capability is enabled. If it is enabled, address_hi bits 31-8 provide bits 31-8 of the destination id. Bits 7-0 of address_hi must be zero. -struct kvm_irq_routing_s390_adapter { +:: + + struct kvm_irq_routing_s390_adapter { __u64 ind_addr; __u64 summary_addr; __u64 ind_offset; __u32 summary_offset; __u32 adapter_id; -}; + }; -struct kvm_irq_routing_hv_sint { + struct kvm_irq_routing_hv_sint { __u32 vcpu; __u32 sint; -}; + }; 4.55 KVM_SET_TSC_KHZ +-------------------- -Capability: KVM_CAP_TSC_CONTROL -Architectures: x86 -Type: vcpu ioctl -Parameters: virtual tsc_khz -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_TSC_CONTROL +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: virtual tsc_khz +:Returns: 0 on success, -1 on error Specifies the tsc frequency for the virtual machine. The unit of the frequency is KHz. 4.56 KVM_GET_TSC_KHZ +-------------------- -Capability: KVM_CAP_GET_TSC_KHZ -Architectures: x86 -Type: vcpu ioctl -Parameters: none -Returns: virtual tsc-khz on success, negative value on error +:Capability: KVM_CAP_GET_TSC_KHZ +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: none +:Returns: virtual tsc-khz on success, negative value on error Returns the tsc frequency of the guest. The unit of the return value is KHz. If the host has unstable tsc this ioctl returns -EIO instead as an @@ -1633,17 +1804,20 @@ error. 
4.57 KVM_GET_LAPIC +------------------ -Capability: KVM_CAP_IRQCHIP -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_lapic_state (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_lapic_state (out) +:Returns: 0 on success, -1 on error -#define KVM_APIC_REG_SIZE 0x400 -struct kvm_lapic_state { +:: + + #define KVM_APIC_REG_SIZE 0x400 + struct kvm_lapic_state { char regs[KVM_APIC_REG_SIZE]; -}; + }; Reads the Local APIC registers and copies them into the input argument. The data format and layout are the same as documented in the architecture manual. @@ -1661,17 +1835,20 @@ always uses xAPIC format. 4.58 KVM_SET_LAPIC +------------------ -Capability: KVM_CAP_IRQCHIP -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_lapic_state (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQCHIP +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_lapic_state (in) +:Returns: 0 on success, -1 on error -#define KVM_APIC_REG_SIZE 0x400 -struct kvm_lapic_state { +:: + + #define KVM_APIC_REG_SIZE 0x400 + struct kvm_lapic_state { char regs[KVM_APIC_REG_SIZE]; -}; + }; Copies the input argument into the Local APIC registers. The data format and layout are the same as documented in the architecture manual. @@ -1682,35 +1859,38 @@ See the note in KVM_GET_LAPIC. 4.59 KVM_IOEVENTFD +------------------ -Capability: KVM_CAP_IOEVENTFD -Architectures: all -Type: vm ioctl -Parameters: struct kvm_ioeventfd (in) -Returns: 0 on success, !0 on error +:Capability: KVM_CAP_IOEVENTFD +:Architectures: all +:Type: vm ioctl +:Parameters: struct kvm_ioeventfd (in) +:Returns: 0 on success, !0 on error This ioctl attaches or detaches an ioeventfd to a legal pio/mmio address within the guest. A guest write in the registered address will signal the provided event instead of triggering an exit. 
-struct kvm_ioeventfd { +:: + + struct kvm_ioeventfd { __u64 datamatch; __u64 addr; /* legal pio/mmio address */ __u32 len; /* 0, 1, 2, 4, or 8 bytes */ __s32 fd; __u32 flags; __u8 pad[36]; -}; + }; For the special case of virtio-ccw devices on s390, the ioevent is matched to a subchannel/virtqueue tuple instead. -The following flags are defined: +The following flags are defined:: -#define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch) -#define KVM_IOEVENTFD_FLAG_PIO (1 << kvm_ioeventfd_flag_nr_pio) -#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign) -#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \ + #define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch) + #define KVM_IOEVENTFD_FLAG_PIO (1 << kvm_ioeventfd_flag_nr_pio) + #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign) + #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \ (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify) If datamatch flag is set, the event will be signaled only if the written value @@ -1725,17 +1905,20 @@ The speedup may only apply to specific architectures, but the ioeventfd will work anyway. 4.60 KVM_DIRTY_TLB +------------------ + +:Capability: KVM_CAP_SW_TLB +:Architectures: ppc +:Type: vcpu ioctl +:Parameters: struct kvm_dirty_tlb (in) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_SW_TLB -Architectures: ppc -Type: vcpu ioctl -Parameters: struct kvm_dirty_tlb (in) -Returns: 0 on success, -1 on error +:: -struct kvm_dirty_tlb { + struct kvm_dirty_tlb { __u64 bitmap; __u32 num_dirty; -}; + }; This must be called whenever userspace has changed an entry in the shared TLB, prior to calling KVM_RUN on the associated vcpu. @@ -1758,23 +1941,26 @@ be set to the number of set bits in the bitmap. 
4.62 KVM_CREATE_SPAPR_TCE +------------------------- -Capability: KVM_CAP_SPAPR_TCE -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_create_spapr_tce (in) -Returns: file descriptor for manipulating the created TCE table +:Capability: KVM_CAP_SPAPR_TCE +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_create_spapr_tce (in) +:Returns: file descriptor for manipulating the created TCE table This creates a virtual TCE (translation control entry) table, which is an IOMMU for PAPR-style virtual I/O. It is used to translate logical addresses used in virtual I/O into guest physical addresses, and provides a scatter/gather capability for PAPR virtual I/O. -/* for KVM_CAP_SPAPR_TCE */ -struct kvm_create_spapr_tce { +:: + + /* for KVM_CAP_SPAPR_TCE */ + struct kvm_create_spapr_tce { __u64 liobn; __u32 window_size; -}; + }; The liobn field gives the logical IO bus number for which to create a TCE table. The window_size field specifies the size of the DMA window @@ -1794,12 +1980,13 @@ circumstances. 4.63 KVM_ALLOCATE_RMA +--------------------- -Capability: KVM_CAP_PPC_RMA -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_allocate_rma (out) -Returns: file descriptor for mapping the allocated RMA +:Capability: KVM_CAP_PPC_RMA +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_allocate_rma (out) +:Returns: file descriptor for mapping the allocated RMA This allocates a Real Mode Area (RMA) from the pool allocated at boot time by the kernel. An RMA is a physically-contiguous, aligned region @@ -1808,10 +1995,12 @@ will be accessed by real-mode (MMU off) accesses in a KVM guest. POWER processors support a set of sizes for the RMA that usually includes 64MB, 128MB, 256MB and some larger powers of two. 
-/* for KVM_ALLOCATE_RMA */ -struct kvm_allocate_rma { +:: + + /* for KVM_ALLOCATE_RMA */ + struct kvm_allocate_rma { __u64 rma_size; -}; + }; The return value is a file descriptor which can be passed to mmap(2) to map the allocated RMA into userspace. The mapped area can then be @@ -1827,12 +2016,13 @@ because it supports the Virtual RMA (VRMA) facility. 4.64 KVM_NMI +------------ -Capability: KVM_CAP_USER_NMI -Architectures: x86 -Type: vcpu ioctl -Parameters: none -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_USER_NMI +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: none +:Returns: 0 on success, -1 on error Queues an NMI on the thread's vcpu. Note this is well defined only when KVM_CREATE_IRQCHIP has not been called, since this is an interface @@ -1853,14 +2043,16 @@ debugging. 4.65 KVM_S390_UCAS_MAP +---------------------- -Capability: KVM_CAP_S390_UCONTROL -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_ucas_mapping (in) -Returns: 0 in case of success +:Capability: KVM_CAP_S390_UCONTROL +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_ucas_mapping (in) +:Returns: 0 in case of success + +The parameter is defined like this:: -The parameter is defined like this: struct kvm_s390_ucas_mapping { __u64 user_addr; __u64 vcpu_addr; @@ -1873,14 +2065,16 @@ be aligned by 1 megabyte. 4.66 KVM_S390_UCAS_UNMAP +------------------------ -Capability: KVM_CAP_S390_UCONTROL -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_ucas_mapping (in) -Returns: 0 in case of success +:Capability: KVM_CAP_S390_UCONTROL +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_ucas_mapping (in) +:Returns: 0 in case of success + +The parameter is defined like this:: -The parameter is defined like this: struct kvm_s390_ucas_mapping { __u64 user_addr; __u64 vcpu_addr; @@ -1893,12 +2087,13 @@ All parameters need to be aligned by 1 megabyte. 
4.67 KVM_S390_VCPU_FAULT +------------------------ -Capability: KVM_CAP_S390_UCONTROL -Architectures: s390 -Type: vcpu ioctl -Parameters: vcpu absolute address (in) -Returns: 0 in case of success +:Capability: KVM_CAP_S390_UCONTROL +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: vcpu absolute address (in) +:Returns: 0 in case of success This call creates a page table entry on the virtual cpu's address space (for user controlled virtual machines) or the virtual machine's address @@ -1910,23 +2105,31 @@ prior to calling the KVM_RUN ioctl. 4.68 KVM_SET_ONE_REG +-------------------- + +:Capability: KVM_CAP_ONE_REG +:Architectures: all +:Type: vcpu ioctl +:Parameters: struct kvm_one_reg (in) +:Returns: 0 on success, negative value on failure -Capability: KVM_CAP_ONE_REG -Architectures: all -Type: vcpu ioctl -Parameters: struct kvm_one_reg (in) -Returns: 0 on success, negative value on failure Errors: - ENOENT: no such register - EINVAL: invalid register ID, or no such register - EPERM: (arm64) register access not allowed before vcpu finalization + + ====== ============================================================ + ENOENT no such register + EINVAL invalid register ID, or no such register + EPERM (arm64) register access not allowed before vcpu finalization + ====== ============================================================ + (These error codes are indicative only: do not rely on a specific error code being returned in a specific situation.) -struct kvm_one_reg { +:: + + struct kvm_one_reg { __u64 id; __u64 addr; -}; + }; Using this ioctl, a single vcpu register can be set to a specific value defined by user space with the passed in struct kvm_one_reg, where id @@ -1936,217 +2139,226 @@ and architecture specific registers. Each have their own range of operation and their own constants and width. 
To keep track of the implemented registers, find a list below: - Arch | Register | Width (bits) - | | - PPC | KVM_REG_PPC_HIOR | 64 - PPC | KVM_REG_PPC_IAC1 | 64 - PPC | KVM_REG_PPC_IAC2 | 64 - PPC | KVM_REG_PPC_IAC3 | 64 - PPC | KVM_REG_PPC_IAC4 | 64 - PPC | KVM_REG_PPC_DAC1 | 64 - PPC | KVM_REG_PPC_DAC2 | 64 - PPC | KVM_REG_PPC_DABR | 64 - PPC | KVM_REG_PPC_DSCR | 64 - PPC | KVM_REG_PPC_PURR | 64 - PPC | KVM_REG_PPC_SPURR | 64 - PPC | KVM_REG_PPC_DAR | 64 - PPC | KVM_REG_PPC_DSISR | 32 - PPC | KVM_REG_PPC_AMR | 64 - PPC | KVM_REG_PPC_UAMOR | 64 - PPC | KVM_REG_PPC_MMCR0 | 64 - PPC | KVM_REG_PPC_MMCR1 | 64 - PPC | KVM_REG_PPC_MMCRA | 64 - PPC | KVM_REG_PPC_MMCR2 | 64 - PPC | KVM_REG_PPC_MMCRS | 64 - PPC | KVM_REG_PPC_SIAR | 64 - PPC | KVM_REG_PPC_SDAR | 64 - PPC | KVM_REG_PPC_SIER | 64 - PPC | KVM_REG_PPC_PMC1 | 32 - PPC | KVM_REG_PPC_PMC2 | 32 - PPC | KVM_REG_PPC_PMC3 | 32 - PPC | KVM_REG_PPC_PMC4 | 32 - PPC | KVM_REG_PPC_PMC5 | 32 - PPC | KVM_REG_PPC_PMC6 | 32 - PPC | KVM_REG_PPC_PMC7 | 32 - PPC | KVM_REG_PPC_PMC8 | 32 - PPC | KVM_REG_PPC_FPR0 | 64 - ... - PPC | KVM_REG_PPC_FPR31 | 64 - PPC | KVM_REG_PPC_VR0 | 128 - ... - PPC | KVM_REG_PPC_VR31 | 128 - PPC | KVM_REG_PPC_VSR0 | 128 - ... 
- PPC | KVM_REG_PPC_VSR31 | 128 - PPC | KVM_REG_PPC_FPSCR | 64 - PPC | KVM_REG_PPC_VSCR | 32 - PPC | KVM_REG_PPC_VPA_ADDR | 64 - PPC | KVM_REG_PPC_VPA_SLB | 128 - PPC | KVM_REG_PPC_VPA_DTL | 128 - PPC | KVM_REG_PPC_EPCR | 32 - PPC | KVM_REG_PPC_EPR | 32 - PPC | KVM_REG_PPC_TCR | 32 - PPC | KVM_REG_PPC_TSR | 32 - PPC | KVM_REG_PPC_OR_TSR | 32 - PPC | KVM_REG_PPC_CLEAR_TSR | 32 - PPC | KVM_REG_PPC_MAS0 | 32 - PPC | KVM_REG_PPC_MAS1 | 32 - PPC | KVM_REG_PPC_MAS2 | 64 - PPC | KVM_REG_PPC_MAS7_3 | 64 - PPC | KVM_REG_PPC_MAS4 | 32 - PPC | KVM_REG_PPC_MAS6 | 32 - PPC | KVM_REG_PPC_MMUCFG | 32 - PPC | KVM_REG_PPC_TLB0CFG | 32 - PPC | KVM_REG_PPC_TLB1CFG | 32 - PPC | KVM_REG_PPC_TLB2CFG | 32 - PPC | KVM_REG_PPC_TLB3CFG | 32 - PPC | KVM_REG_PPC_TLB0PS | 32 - PPC | KVM_REG_PPC_TLB1PS | 32 - PPC | KVM_REG_PPC_TLB2PS | 32 - PPC | KVM_REG_PPC_TLB3PS | 32 - PPC | KVM_REG_PPC_EPTCFG | 32 - PPC | KVM_REG_PPC_ICP_STATE | 64 - PPC | KVM_REG_PPC_VP_STATE | 128 - PPC | KVM_REG_PPC_TB_OFFSET | 64 - PPC | KVM_REG_PPC_SPMC1 | 32 - PPC | KVM_REG_PPC_SPMC2 | 32 - PPC | KVM_REG_PPC_IAMR | 64 - PPC | KVM_REG_PPC_TFHAR | 64 - PPC | KVM_REG_PPC_TFIAR | 64 - PPC | KVM_REG_PPC_TEXASR | 64 - PPC | KVM_REG_PPC_FSCR | 64 - PPC | KVM_REG_PPC_PSPB | 32 - PPC | KVM_REG_PPC_EBBHR | 64 - PPC | KVM_REG_PPC_EBBRR | 64 - PPC | KVM_REG_PPC_BESCR | 64 - PPC | KVM_REG_PPC_TAR | 64 - PPC | KVM_REG_PPC_DPDES | 64 - PPC | KVM_REG_PPC_DAWR | 64 - PPC | KVM_REG_PPC_DAWRX | 64 - PPC | KVM_REG_PPC_CIABR | 64 - PPC | KVM_REG_PPC_IC | 64 - PPC | KVM_REG_PPC_VTB | 64 - PPC | KVM_REG_PPC_CSIGR | 64 - PPC | KVM_REG_PPC_TACR | 64 - PPC | KVM_REG_PPC_TCSCR | 64 - PPC | KVM_REG_PPC_PID | 64 - PPC | KVM_REG_PPC_ACOP | 64 - PPC | KVM_REG_PPC_VRSAVE | 32 - PPC | KVM_REG_PPC_LPCR | 32 - PPC | KVM_REG_PPC_LPCR_64 | 64 - PPC | KVM_REG_PPC_PPR | 64 - PPC | KVM_REG_PPC_ARCH_COMPAT | 32 - PPC | KVM_REG_PPC_DABRX | 32 - PPC | KVM_REG_PPC_WORT | 64 - PPC | KVM_REG_PPC_SPRG9 | 64 - PPC | KVM_REG_PPC_DBSR | 32 - PPC | KVM_REG_PPC_TIDR | 
64 - PPC | KVM_REG_PPC_PSSCR | 64 - PPC | KVM_REG_PPC_DEC_EXPIRY | 64 - PPC | KVM_REG_PPC_PTCR | 64 - PPC | KVM_REG_PPC_TM_GPR0 | 64 - ... - PPC | KVM_REG_PPC_TM_GPR31 | 64 - PPC | KVM_REG_PPC_TM_VSR0 | 128 - ... - PPC | KVM_REG_PPC_TM_VSR63 | 128 - PPC | KVM_REG_PPC_TM_CR | 64 - PPC | KVM_REG_PPC_TM_LR | 64 - PPC | KVM_REG_PPC_TM_CTR | 64 - PPC | KVM_REG_PPC_TM_FPSCR | 64 - PPC | KVM_REG_PPC_TM_AMR | 64 - PPC | KVM_REG_PPC_TM_PPR | 64 - PPC | KVM_REG_PPC_TM_VRSAVE | 64 - PPC | KVM_REG_PPC_TM_VSCR | 32 - PPC | KVM_REG_PPC_TM_DSCR | 64 - PPC | KVM_REG_PPC_TM_TAR | 64 - PPC | KVM_REG_PPC_TM_XER | 64 - | | - MIPS | KVM_REG_MIPS_R0 | 64 - ... - MIPS | KVM_REG_MIPS_R31 | 64 - MIPS | KVM_REG_MIPS_HI | 64 - MIPS | KVM_REG_MIPS_LO | 64 - MIPS | KVM_REG_MIPS_PC | 64 - MIPS | KVM_REG_MIPS_CP0_INDEX | 32 - MIPS | KVM_REG_MIPS_CP0_ENTRYLO0 | 64 - MIPS | KVM_REG_MIPS_CP0_ENTRYLO1 | 64 - MIPS | KVM_REG_MIPS_CP0_CONTEXT | 64 - MIPS | KVM_REG_MIPS_CP0_CONTEXTCONFIG| 32 - MIPS | KVM_REG_MIPS_CP0_USERLOCAL | 64 - MIPS | KVM_REG_MIPS_CP0_XCONTEXTCONFIG| 64 - MIPS | KVM_REG_MIPS_CP0_PAGEMASK | 32 - MIPS | KVM_REG_MIPS_CP0_PAGEGRAIN | 32 - MIPS | KVM_REG_MIPS_CP0_SEGCTL0 | 64 - MIPS | KVM_REG_MIPS_CP0_SEGCTL1 | 64 - MIPS | KVM_REG_MIPS_CP0_SEGCTL2 | 64 - MIPS | KVM_REG_MIPS_CP0_PWBASE | 64 - MIPS | KVM_REG_MIPS_CP0_PWFIELD | 64 - MIPS | KVM_REG_MIPS_CP0_PWSIZE | 64 - MIPS | KVM_REG_MIPS_CP0_WIRED | 32 - MIPS | KVM_REG_MIPS_CP0_PWCTL | 32 - MIPS | KVM_REG_MIPS_CP0_HWRENA | 32 - MIPS | KVM_REG_MIPS_CP0_BADVADDR | 64 - MIPS | KVM_REG_MIPS_CP0_BADINSTR | 32 - MIPS | KVM_REG_MIPS_CP0_BADINSTRP | 32 - MIPS | KVM_REG_MIPS_CP0_COUNT | 32 - MIPS | KVM_REG_MIPS_CP0_ENTRYHI | 64 - MIPS | KVM_REG_MIPS_CP0_COMPARE | 32 - MIPS | KVM_REG_MIPS_CP0_STATUS | 32 - MIPS | KVM_REG_MIPS_CP0_INTCTL | 32 - MIPS | KVM_REG_MIPS_CP0_CAUSE | 32 - MIPS | KVM_REG_MIPS_CP0_EPC | 64 - MIPS | KVM_REG_MIPS_CP0_PRID | 32 - MIPS | KVM_REG_MIPS_CP0_EBASE | 64 - MIPS | KVM_REG_MIPS_CP0_CONFIG | 32 - MIPS | 
KVM_REG_MIPS_CP0_CONFIG1 | 32 - MIPS | KVM_REG_MIPS_CP0_CONFIG2 | 32 - MIPS | KVM_REG_MIPS_CP0_CONFIG3 | 32 - MIPS | KVM_REG_MIPS_CP0_CONFIG4 | 32 - MIPS | KVM_REG_MIPS_CP0_CONFIG5 | 32 - MIPS | KVM_REG_MIPS_CP0_CONFIG7 | 32 - MIPS | KVM_REG_MIPS_CP0_XCONTEXT | 64 - MIPS | KVM_REG_MIPS_CP0_ERROREPC | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH1 | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH2 | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH3 | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH4 | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH5 | 64 - MIPS | KVM_REG_MIPS_CP0_KSCRATCH6 | 64 - MIPS | KVM_REG_MIPS_CP0_MAAR(0..63) | 64 - MIPS | KVM_REG_MIPS_COUNT_CTL | 64 - MIPS | KVM_REG_MIPS_COUNT_RESUME | 64 - MIPS | KVM_REG_MIPS_COUNT_HZ | 64 - MIPS | KVM_REG_MIPS_FPR_32(0..31) | 32 - MIPS | KVM_REG_MIPS_FPR_64(0..31) | 64 - MIPS | KVM_REG_MIPS_VEC_128(0..31) | 128 - MIPS | KVM_REG_MIPS_FCR_IR | 32 - MIPS | KVM_REG_MIPS_FCR_CSR | 32 - MIPS | KVM_REG_MIPS_MSA_IR | 32 - MIPS | KVM_REG_MIPS_MSA_CSR | 32 + ======= =============================== ============ + Arch Register Width (bits) + ======= =============================== ============ + PPC KVM_REG_PPC_HIOR 64 + PPC KVM_REG_PPC_IAC1 64 + PPC KVM_REG_PPC_IAC2 64 + PPC KVM_REG_PPC_IAC3 64 + PPC KVM_REG_PPC_IAC4 64 + PPC KVM_REG_PPC_DAC1 64 + PPC KVM_REG_PPC_DAC2 64 + PPC KVM_REG_PPC_DABR 64 + PPC KVM_REG_PPC_DSCR 64 + PPC KVM_REG_PPC_PURR 64 + PPC KVM_REG_PPC_SPURR 64 + PPC KVM_REG_PPC_DAR 64 + PPC KVM_REG_PPC_DSISR 32 + PPC KVM_REG_PPC_AMR 64 + PPC KVM_REG_PPC_UAMOR 64 + PPC KVM_REG_PPC_MMCR0 64 + PPC KVM_REG_PPC_MMCR1 64 + PPC KVM_REG_PPC_MMCRA 64 + PPC KVM_REG_PPC_MMCR2 64 + PPC KVM_REG_PPC_MMCRS 64 + PPC KVM_REG_PPC_SIAR 64 + PPC KVM_REG_PPC_SDAR 64 + PPC KVM_REG_PPC_SIER 64 + PPC KVM_REG_PPC_PMC1 32 + PPC KVM_REG_PPC_PMC2 32 + PPC KVM_REG_PPC_PMC3 32 + PPC KVM_REG_PPC_PMC4 32 + PPC KVM_REG_PPC_PMC5 32 + PPC KVM_REG_PPC_PMC6 32 + PPC KVM_REG_PPC_PMC7 32 + PPC KVM_REG_PPC_PMC8 32 + PPC KVM_REG_PPC_FPR0 64 + ... 
+ PPC KVM_REG_PPC_FPR31 64 + PPC KVM_REG_PPC_VR0 128 + ... + PPC KVM_REG_PPC_VR31 128 + PPC KVM_REG_PPC_VSR0 128 + ... + PPC KVM_REG_PPC_VSR31 128 + PPC KVM_REG_PPC_FPSCR 64 + PPC KVM_REG_PPC_VSCR 32 + PPC KVM_REG_PPC_VPA_ADDR 64 + PPC KVM_REG_PPC_VPA_SLB 128 + PPC KVM_REG_PPC_VPA_DTL 128 + PPC KVM_REG_PPC_EPCR 32 + PPC KVM_REG_PPC_EPR 32 + PPC KVM_REG_PPC_TCR 32 + PPC KVM_REG_PPC_TSR 32 + PPC KVM_REG_PPC_OR_TSR 32 + PPC KVM_REG_PPC_CLEAR_TSR 32 + PPC KVM_REG_PPC_MAS0 32 + PPC KVM_REG_PPC_MAS1 32 + PPC KVM_REG_PPC_MAS2 64 + PPC KVM_REG_PPC_MAS7_3 64 + PPC KVM_REG_PPC_MAS4 32 + PPC KVM_REG_PPC_MAS6 32 + PPC KVM_REG_PPC_MMUCFG 32 + PPC KVM_REG_PPC_TLB0CFG 32 + PPC KVM_REG_PPC_TLB1CFG 32 + PPC KVM_REG_PPC_TLB2CFG 32 + PPC KVM_REG_PPC_TLB3CFG 32 + PPC KVM_REG_PPC_TLB0PS 32 + PPC KVM_REG_PPC_TLB1PS 32 + PPC KVM_REG_PPC_TLB2PS 32 + PPC KVM_REG_PPC_TLB3PS 32 + PPC KVM_REG_PPC_EPTCFG 32 + PPC KVM_REG_PPC_ICP_STATE 64 + PPC KVM_REG_PPC_VP_STATE 128 + PPC KVM_REG_PPC_TB_OFFSET 64 + PPC KVM_REG_PPC_SPMC1 32 + PPC KVM_REG_PPC_SPMC2 32 + PPC KVM_REG_PPC_IAMR 64 + PPC KVM_REG_PPC_TFHAR 64 + PPC KVM_REG_PPC_TFIAR 64 + PPC KVM_REG_PPC_TEXASR 64 + PPC KVM_REG_PPC_FSCR 64 + PPC KVM_REG_PPC_PSPB 32 + PPC KVM_REG_PPC_EBBHR 64 + PPC KVM_REG_PPC_EBBRR 64 + PPC KVM_REG_PPC_BESCR 64 + PPC KVM_REG_PPC_TAR 64 + PPC KVM_REG_PPC_DPDES 64 + PPC KVM_REG_PPC_DAWR 64 + PPC KVM_REG_PPC_DAWRX 64 + PPC KVM_REG_PPC_CIABR 64 + PPC KVM_REG_PPC_IC 64 + PPC KVM_REG_PPC_VTB 64 + PPC KVM_REG_PPC_CSIGR 64 + PPC KVM_REG_PPC_TACR 64 + PPC KVM_REG_PPC_TCSCR 64 + PPC KVM_REG_PPC_PID 64 + PPC KVM_REG_PPC_ACOP 64 + PPC KVM_REG_PPC_VRSAVE 32 + PPC KVM_REG_PPC_LPCR 32 + PPC KVM_REG_PPC_LPCR_64 64 + PPC KVM_REG_PPC_PPR 64 + PPC KVM_REG_PPC_ARCH_COMPAT 32 + PPC KVM_REG_PPC_DABRX 32 + PPC KVM_REG_PPC_WORT 64 + PPC KVM_REG_PPC_SPRG9 64 + PPC KVM_REG_PPC_DBSR 32 + PPC KVM_REG_PPC_TIDR 64 + PPC KVM_REG_PPC_PSSCR 64 + PPC KVM_REG_PPC_DEC_EXPIRY 64 + PPC KVM_REG_PPC_PTCR 64 + PPC KVM_REG_PPC_TM_GPR0 64 + ... 
+ PPC KVM_REG_PPC_TM_GPR31 64 + PPC KVM_REG_PPC_TM_VSR0 128 + ... + PPC KVM_REG_PPC_TM_VSR63 128 + PPC KVM_REG_PPC_TM_CR 64 + PPC KVM_REG_PPC_TM_LR 64 + PPC KVM_REG_PPC_TM_CTR 64 + PPC KVM_REG_PPC_TM_FPSCR 64 + PPC KVM_REG_PPC_TM_AMR 64 + PPC KVM_REG_PPC_TM_PPR 64 + PPC KVM_REG_PPC_TM_VRSAVE 64 + PPC KVM_REG_PPC_TM_VSCR 32 + PPC KVM_REG_PPC_TM_DSCR 64 + PPC KVM_REG_PPC_TM_TAR 64 + PPC KVM_REG_PPC_TM_XER 64 + + MIPS KVM_REG_MIPS_R0 64 + ... + MIPS KVM_REG_MIPS_R31 64 + MIPS KVM_REG_MIPS_HI 64 + MIPS KVM_REG_MIPS_LO 64 + MIPS KVM_REG_MIPS_PC 64 + MIPS KVM_REG_MIPS_CP0_INDEX 32 + MIPS KVM_REG_MIPS_CP0_ENTRYLO0 64 + MIPS KVM_REG_MIPS_CP0_ENTRYLO1 64 + MIPS KVM_REG_MIPS_CP0_CONTEXT 64 + MIPS KVM_REG_MIPS_CP0_CONTEXTCONFIG 32 + MIPS KVM_REG_MIPS_CP0_USERLOCAL 64 + MIPS KVM_REG_MIPS_CP0_XCONTEXTCONFIG 64 + MIPS KVM_REG_MIPS_CP0_PAGEMASK 32 + MIPS KVM_REG_MIPS_CP0_PAGEGRAIN 32 + MIPS KVM_REG_MIPS_CP0_SEGCTL0 64 + MIPS KVM_REG_MIPS_CP0_SEGCTL1 64 + MIPS KVM_REG_MIPS_CP0_SEGCTL2 64 + MIPS KVM_REG_MIPS_CP0_PWBASE 64 + MIPS KVM_REG_MIPS_CP0_PWFIELD 64 + MIPS KVM_REG_MIPS_CP0_PWSIZE 64 + MIPS KVM_REG_MIPS_CP0_WIRED 32 + MIPS KVM_REG_MIPS_CP0_PWCTL 32 + MIPS KVM_REG_MIPS_CP0_HWRENA 32 + MIPS KVM_REG_MIPS_CP0_BADVADDR 64 + MIPS KVM_REG_MIPS_CP0_BADINSTR 32 + MIPS KVM_REG_MIPS_CP0_BADINSTRP 32 + MIPS KVM_REG_MIPS_CP0_COUNT 32 + MIPS KVM_REG_MIPS_CP0_ENTRYHI 64 + MIPS KVM_REG_MIPS_CP0_COMPARE 32 + MIPS KVM_REG_MIPS_CP0_STATUS 32 + MIPS KVM_REG_MIPS_CP0_INTCTL 32 + MIPS KVM_REG_MIPS_CP0_CAUSE 32 + MIPS KVM_REG_MIPS_CP0_EPC 64 + MIPS KVM_REG_MIPS_CP0_PRID 32 + MIPS KVM_REG_MIPS_CP0_EBASE 64 + MIPS KVM_REG_MIPS_CP0_CONFIG 32 + MIPS KVM_REG_MIPS_CP0_CONFIG1 32 + MIPS KVM_REG_MIPS_CP0_CONFIG2 32 + MIPS KVM_REG_MIPS_CP0_CONFIG3 32 + MIPS KVM_REG_MIPS_CP0_CONFIG4 32 + MIPS KVM_REG_MIPS_CP0_CONFIG5 32 + MIPS KVM_REG_MIPS_CP0_CONFIG7 32 + MIPS KVM_REG_MIPS_CP0_XCONTEXT 64 + MIPS KVM_REG_MIPS_CP0_ERROREPC 64 + MIPS KVM_REG_MIPS_CP0_KSCRATCH1 64 + MIPS KVM_REG_MIPS_CP0_KSCRATCH2 64 + MIPS 
KVM_REG_MIPS_CP0_KSCRATCH3 64 + MIPS KVM_REG_MIPS_CP0_KSCRATCH4 64 + MIPS KVM_REG_MIPS_CP0_KSCRATCH5 64 + MIPS KVM_REG_MIPS_CP0_KSCRATCH6 64 + MIPS KVM_REG_MIPS_CP0_MAAR(0..63) 64 + MIPS KVM_REG_MIPS_COUNT_CTL 64 + MIPS KVM_REG_MIPS_COUNT_RESUME 64 + MIPS KVM_REG_MIPS_COUNT_HZ 64 + MIPS KVM_REG_MIPS_FPR_32(0..31) 32 + MIPS KVM_REG_MIPS_FPR_64(0..31) 64 + MIPS KVM_REG_MIPS_VEC_128(0..31) 128 + MIPS KVM_REG_MIPS_FCR_IR 32 + MIPS KVM_REG_MIPS_FCR_CSR 32 + MIPS KVM_REG_MIPS_MSA_IR 32 + MIPS KVM_REG_MIPS_MSA_CSR 32 + ======= =============================== ============ ARM registers are mapped using the lower 32 bits. The upper 16 of that is the register group type, or coprocessor number: -ARM core registers have the following id bit patterns: +ARM core registers have the following id bit patterns:: + 0x4020 0000 0010 <index into the kvm_regs struct:16> -ARM 32-bit CP15 registers have the following id bit patterns: +ARM 32-bit CP15 registers have the following id bit patterns:: + 0x4020 0000 000F <zero:1> <crn:4> <crm:4> <opc1:4> <opc2:3> -ARM 64-bit CP15 registers have the following id bit patterns: +ARM 64-bit CP15 registers have the following id bit patterns:: + 0x4030 0000 000F <zero:1> <zero:4> <crm:4> <opc1:4> <zero:3> -ARM CCSIDR registers are demultiplexed by CSSELR value: +ARM CCSIDR registers are demultiplexed by CSSELR value:: + 0x4020 0000 0011 00 <csselr:8> -ARM 32-bit VFP control registers have the following id bit patterns: +ARM 32-bit VFP control registers have the following id bit patterns:: + 0x4020 0000 0012 1 <regno:12> -ARM 64-bit FP registers have the following id bit patterns: +ARM 64-bit FP registers have the following id bit patterns:: + 0x4030 0000 0012 0 <regno:12> -ARM firmware pseudo-registers have the following bit pattern: +ARM firmware pseudo-registers have the following bit pattern:: + 0x4030 0000 0014 <regno:16> @@ -2156,15 +2368,18 @@ that is the register group type, or coprocessor number: arm64 core/FP-SIMD registers have the 
following id bit patterns. Note that the size of the access is variable, as the kvm_regs structure contains elements ranging from 32 to 128 bits. The index is a 32bit -value in the kvm_regs structure seen as a 32bit array. +value in the kvm_regs structure seen as a 32bit array:: + 0x60x0 0000 0010 <index into the kvm_regs struct:16> Specifically: + +======================= ========= ===== ======================================= Encoding Register Bits kvm_regs member ----------------------------------------------------------------- +======================= ========= ===== ======================================= 0x6030 0000 0010 0000 X0 64 regs.regs[0] 0x6030 0000 0010 0002 X1 64 regs.regs[1] - ... + ... 0x6030 0000 0010 003c X30 64 regs.regs[30] 0x6030 0000 0010 003e SP 64 regs.sp 0x6030 0000 0010 0040 PC 64 regs.pc @@ -2176,27 +2391,31 @@ Specifically: 0x6030 0000 0010 004c SPSR_UND 64 spsr[KVM_SPSR_UND] 0x6030 0000 0010 004e SPSR_IRQ 64 spsr[KVM_SPSR_IRQ] 0x6060 0000 0010 0050 SPSR_FIQ 64 spsr[KVM_SPSR_FIQ] - 0x6040 0000 0010 0054 V0 128 fp_regs.vregs[0] (*) - 0x6040 0000 0010 0058 V1 128 fp_regs.vregs[1] (*) - ... - 0x6040 0000 0010 00d0 V31 128 fp_regs.vregs[31] (*) + 0x6040 0000 0010 0054 V0 128 fp_regs.vregs[0] [1]_ + 0x6040 0000 0010 0058 V1 128 fp_regs.vregs[1] [1]_ + ... + 0x6040 0000 0010 00d0 V31 128 fp_regs.vregs[31] [1]_ 0x6020 0000 0010 00d4 FPSR 32 fp_regs.fpsr 0x6020 0000 0010 00d5 FPCR 32 fp_regs.fpcr +======================= ========= ===== ======================================= + +.. [1] These encodings are not accepted for SVE-enabled vcpus. See + KVM_ARM_VCPU_INIT. -(*) These encodings are not accepted for SVE-enabled vcpus. See - KVM_ARM_VCPU_INIT. + The equivalent register content can be accessed via bits [127:0] of + the corresponding SVE Zn registers instead for vcpus that have SVE + enabled (see below). 
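As an illustration of the arm64 core register encoding table above (not part of the patch itself), here is a minimal C sketch that rebuilds a few of the listed IDs. All `EX_`/`ex_` names are hypothetical and exist only for this example; the constant values are copied from the table, and real code would normally take the corresponding definitions from the kernel's uapi headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are read off the encoding table above. */
#define EX_ARM64_PREFIX  0x6000000000000000ULL /* leading 0x60 of every arm64 ID */
#define EX_SIZE_U32      0x0020000000000000ULL /* 0x6020 ...: 32-bit access  */
#define EX_SIZE_U64      0x0030000000000000ULL /* 0x6030 ...: 64-bit access  */
#define EX_SIZE_U128     0x0040000000000000ULL /* 0x6040 ...: 128-bit access */
#define EX_GROUP_CORE    0x0000000000100000ULL /* 0x0010 group field */

/* Combine size, core group and the 16-bit index into kvm_regs
 * (kvm_regs viewed as an array of 32-bit words). */
static inline uint64_t ex_core_reg_id(uint64_t size, uint16_t index)
{
    return EX_ARM64_PREFIX | size | EX_GROUP_CORE | index;
}

/* X0..X30 are 64 bits wide, so each one spans two 32-bit words. */
static inline uint64_t ex_x_reg_id(unsigned int n)
{
    assert(n <= 30);
    return ex_core_reg_id(EX_SIZE_U64, (uint16_t)(2 * n));
}

/* V0..V31 start at index 0x54 in the table and span four words each. */
static inline uint64_t ex_v_reg_id(unsigned int n)
{
    assert(n <= 31);
    return ex_core_reg_id(EX_SIZE_U128, (uint16_t)(0x54 + 4 * n));
}
```

With these helpers, `ex_x_reg_id(1)` gives the X1 encoding from the table and `ex_v_reg_id(31)` gives the V31 encoding. The resulting value is what would go into the `id` field of `struct kvm_one_reg`.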
- The equivalent register content can be accessed via bits [127:0] of - the corresponding SVE Zn registers instead for vcpus that have SVE - enabled (see below). +arm64 CCSIDR registers are demultiplexed by CSSELR value:: -arm64 CCSIDR registers are demultiplexed by CSSELR value: 0x6020 0000 0011 00 <csselr:8> -arm64 system registers have the following id bit patterns: +arm64 system registers have the following id bit patterns:: + 0x6030 0000 0013 <op0:2> <op1:3> <crn:4> <crm:4> <op2:3> -WARNING: +.. warning:: + Two system register IDs do not follow the specified pattern. These are KVM_REG_ARM_TIMER_CVAL and KVM_REG_ARM_TIMER_CNT, which map to system registers CNTV_CVAL_EL0 and CNTVCT_EL0 respectively. These @@ -2205,10 +2424,12 @@ WARNING: derived from the register encoding for CNTV_CVAL_EL0. As this is API, it must remain this way. -arm64 firmware pseudo-registers have the following bit pattern: +arm64 firmware pseudo-registers have the following bit pattern:: + 0x6030 0000 0014 <regno:16> -arm64 SVE registers have the following bit patterns: +arm64 SVE registers have the following bit patterns:: + 0x6080 0000 0015 00 <n:5> <slice:5> Zn bits[2048*slice + 2047 : 2048*slice] 0x6050 0000 0015 04 <n:4> <slice:5> Pn bits[256*slice + 255 : 256*slice] 0x6050 0000 0015 060 <slice:5> FFR bits[256*slice + 255 : 256*slice] @@ -2216,7 +2437,7 @@ arm64 SVE registers have the following bit patterns: Access to register IDs where 2048 * slice >= 128 * max_vq will fail with ENOENT. max_vq is the vcpu's maximum supported vector length in 128-bit -quadwords: see (**) below. +quadwords: see [2]_ below. These registers are only accessible on vcpus for which SVE is enabled. See KVM_ARM_VCPU_INIT for details. @@ -2231,21 +2452,21 @@ lengths supported by the vcpu to be discovered and configured by userspace. 
When transferred to or from user memory via KVM_GET_ONE_REG or KVM_SET_ONE_REG, the value of this register is of type __u64[KVM_ARM64_SVE_VLS_WORDS], and encodes the set of vector lengths as -follows: +follows:: -__u64 vector_lengths[KVM_ARM64_SVE_VLS_WORDS]; + __u64 vector_lengths[KVM_ARM64_SVE_VLS_WORDS]; -if (vq >= SVE_VQ_MIN && vq <= SVE_VQ_MAX && - ((vector_lengths[(vq - KVM_ARM64_SVE_VQ_MIN) / 64] >> + if (vq >= SVE_VQ_MIN && vq <= SVE_VQ_MAX && + ((vector_lengths[(vq - KVM_ARM64_SVE_VQ_MIN) / 64] >> ((vq - KVM_ARM64_SVE_VQ_MIN) % 64)) & 1)) /* Vector length vq * 16 bytes supported */ -else + else /* Vector length vq * 16 bytes not supported */ -(**) The maximum value vq for which the above condition is true is -max_vq. This is the maximum vector length available to the guest on -this vcpu, and determines which register slices are visible through -this ioctl interface. +.. [2] The maximum value vq for which the above condition is true is + max_vq. This is the maximum vector length available to the guest on + this vcpu, and determines which register slices are visible through + this ioctl interface. (See Documentation/arm64/sve.rst for an explanation of the "vq" nomenclature.) @@ -2270,11 +2491,13 @@ write this register will fail with EPERM. MIPS registers are mapped using the lower 32 bits. The upper 16 of that is the register group type: -MIPS core registers (see above) have the following id bit patterns: +MIPS core registers (see above) have the following id bit patterns:: + 0x7030 0000 0000 <reg:16> MIPS CP0 registers (see KVM_REG_MIPS_CP0_* above) have the following id bit -patterns depending on whether they're 32-bit or 64-bit registers: +patterns depending on whether they're 32-bit or 64-bit registers:: + 0x7020 0000 0001 00 <reg:5> <sel:3> (32-bit) 0x7030 0000 0001 00 <reg:5> <sel:3> (64-bit) @@ -2285,10 +2508,12 @@ with the RI and XI bits (if they exist) in bits 63 and 62 respectively, and the PFNX field starting at bit 30. 
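The MIPS CP0 id bit patterns above can be sketched in C as follows. This is an illustrative reading of the `<reg:5> <sel:3>` layout, not part of the patch; the `EX_`/`ex_` names are hypothetical, and the example assumes the usual MIPS numbering in which CP0 Status is register 12, select 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are read off the bit patterns above. */
#define EX_MIPS_PREFIX     0x7000000000000000ULL /* leading 0x70 of every MIPS ID */
#define EX_MIPS_SIZE_U32   0x0020000000000000ULL /* 0x7020 ...: 32-bit register */
#define EX_MIPS_SIZE_U64   0x0030000000000000ULL /* 0x7030 ...: 64-bit register */
#define EX_MIPS_GROUP_CP0  0x0000000000010000ULL /* 0x0001 group field */

/* Build a CP0 register ID: the low byte holds <reg:5> <sel:3>,
 * and the byte above it is zero. */
static inline uint64_t ex_mips_cp0_id(int is_64bit, unsigned int reg,
                                      unsigned int sel)
{
    assert(reg < 32 && sel < 8);
    return EX_MIPS_PREFIX
         | (is_64bit ? EX_MIPS_SIZE_U64 : EX_MIPS_SIZE_U32)
         | EX_MIPS_GROUP_CP0
         | (uint64_t)((reg << 3) | sel);
}
```

For example, `ex_mips_cp0_id(0, 12, 0)` yields the 32-bit ID for CP0 Status, and `ex_mips_cp0_id(1, 2, 0)` yields the 64-bit ID for EntryLo0.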
MIPS MAARs (see KVM_REG_MIPS_CP0_MAAR(*) above) have the following id bit -patterns: +patterns:: + 0x7030 0000 0001 01 <reg:8> -MIPS KVM control registers (see above) have the following id bit patterns: +MIPS KVM control registers (see above) have the following id bit patterns:: + 0x7030 0000 0002 <reg:16> MIPS FPU registers (see KVM_REG_MIPS_FPR_{32,64}() above) have the following @@ -2297,31 +2522,40 @@ always accessed according to the current guest FPU mode (Status.FR and Config5.FRE), i.e. as the guest would see them, and they become unpredictable if the guest FPU mode is changed. MIPS SIMD Architecture (MSA) vector registers (see KVM_REG_MIPS_VEC_128() above) have similar patterns as they -overlap the FPU registers: +overlap the FPU registers:: + 0x7020 0000 0003 00 <0:3> <reg:5> (32-bit FPU registers) 0x7030 0000 0003 00 <0:3> <reg:5> (64-bit FPU registers) 0x7040 0000 0003 00 <0:3> <reg:5> (128-bit MSA vector registers) MIPS FPU control registers (see KVM_REG_MIPS_FCR_{IR,CSR} above) have the -following id bit patterns: +following id bit patterns:: + 0x7020 0000 0003 01 <0:3> <reg:5> MIPS MSA control registers (see KVM_REG_MIPS_MSA_{IR,CSR} above) have the -following id bit patterns: +following id bit patterns:: + 0x7020 0000 0003 02 <0:3> <reg:5> 4.69 KVM_GET_ONE_REG +-------------------- + +:Capability: KVM_CAP_ONE_REG +:Architectures: all +:Type: vcpu ioctl +:Parameters: struct kvm_one_reg (in and out) +:Returns: 0 on success, negative value on failure -Capability: KVM_CAP_ONE_REG -Architectures: all -Type: vcpu ioctl -Parameters: struct kvm_one_reg (in and out) -Returns: 0 on success, negative value on failure Errors include: - ENOENT: no such register - EINVAL: invalid register ID, or no such register - EPERM: (arm64) register access not allowed before vcpu finalization + + ======== ============================================================ + ENOENT no such register + EINVAL invalid register ID, or no such register + EPERM (arm64) register access not 
allowed before vcpu finalization + ======== ============================================================ + (These error codes are indicative only: do not rely on a specific error code being returned in a specific situation.) @@ -2335,12 +2569,13 @@ list in 4.68. 4.70 KVM_KVMCLOCK_CTRL +---------------------- -Capability: KVM_CAP_KVMCLOCK_CTRL -Architectures: Any that implement pvclocks (currently x86 only) -Type: vcpu ioctl -Parameters: None -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_KVMCLOCK_CTRL +:Architectures: Any that implement pvclocks (currently x86 only) +:Type: vcpu ioctl +:Parameters: None +:Returns: 0 on success, -1 on error This signals to the host kernel that the specified guest is being paused by userspace. The host will set a flag in the pvclock structure that is checked @@ -2356,26 +2591,30 @@ after pausing the vcpu, but before it is resumed. 4.71 KVM_SIGNAL_MSI +------------------- -Capability: KVM_CAP_SIGNAL_MSI -Architectures: x86 arm arm64 -Type: vm ioctl -Parameters: struct kvm_msi (in) -Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error +:Capability: KVM_CAP_SIGNAL_MSI +:Architectures: x86 arm arm64 +:Type: vm ioctl +:Parameters: struct kvm_msi (in) +:Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error Directly inject a MSI message. Only valid with in-kernel irqchip that handles MSI messages. -struct kvm_msi { +:: + + struct kvm_msi { __u32 address_lo; __u32 address_hi; __u32 data; __u32 flags; __u32 devid; __u8 pad[12]; -}; + }; -flags: KVM_MSI_VALID_DEVID: devid contains a valid value. The per-VM +flags: + KVM_MSI_VALID_DEVID: devid contains a valid value. The per-VM KVM_CAP_MSI_DEVID capability advertises the requirement to provide the device ID. If this capability is not available, userspace should never set the KVM_MSI_VALID_DEVID flag as the ioctl might fail. @@ -2391,30 +2630,31 @@ address_hi must be zero. 
4.71 KVM_CREATE_PIT2 +-------------------- -Capability: KVM_CAP_PIT2 -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_pit_config (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PIT2 +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_pit_config (in) +:Returns: 0 on success, -1 on error Creates an in-kernel device model for the i8254 PIT. This call is only valid after enabling in-kernel irqchip support via KVM_CREATE_IRQCHIP. The following -parameters have to be passed: +parameters have to be passed:: -struct kvm_pit_config { + struct kvm_pit_config { __u32 flags; __u32 pad[15]; -}; + }; -Valid flags are: +Valid flags are:: -#define KVM_PIT_SPEAKER_DUMMY 1 /* emulate speaker port stub */ + #define KVM_PIT_SPEAKER_DUMMY 1 /* emulate speaker port stub */ PIT timer interrupts may use a per-VM kernel thread for injection. If it -exists, this thread will have a name of the following pattern: +exists, this thread will have a name of the following pattern:: -kvm-pit/<owner-process-pid> + kvm-pit/<owner-process-pid> When running a guest with elevated priorities, the scheduling parameters of this thread may have to be adjusted accordingly. @@ -2423,37 +2663,39 @@ This IOCTL replaces the obsolete KVM_CREATE_PIT. 4.72 KVM_GET_PIT2 +----------------- -Capability: KVM_CAP_PIT_STATE2 -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_pit_state2 (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PIT_STATE2 +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_pit_state2 (out) +:Returns: 0 on success, -1 on error Retrieves the state of the in-kernel PIT model. Only valid after -KVM_CREATE_PIT2. The state is returned in the following structure: +KVM_CREATE_PIT2. 
The state is returned in the following structure:: -struct kvm_pit_state2 { + struct kvm_pit_state2 { struct kvm_pit_channel_state channels[3]; __u32 flags; __u32 reserved[9]; -}; + }; -Valid flags are: +Valid flags are:: -/* disable PIT in HPET legacy mode */ -#define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001 + /* disable PIT in HPET legacy mode */ + #define KVM_PIT_FLAGS_HPET_LEGACY 0x00000001 This IOCTL replaces the obsolete KVM_GET_PIT. 4.73 KVM_SET_PIT2 +----------------- -Capability: KVM_CAP_PIT_STATE2 -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_pit_state2 (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PIT_STATE2 +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_pit_state2 (in) +:Returns: 0 on success, -1 on error Sets the state of the in-kernel PIT model. Only valid after KVM_CREATE_PIT2. See KVM_GET_PIT2 for details on struct kvm_pit_state2. @@ -2462,12 +2704,13 @@ This IOCTL replaces the obsolete KVM_SET_PIT. 4.74 KVM_PPC_GET_SMMU_INFO +-------------------------- -Capability: KVM_CAP_PPC_GET_SMMU_INFO -Architectures: powerpc -Type: vm ioctl -Parameters: None -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PPC_GET_SMMU_INFO +:Architectures: powerpc +:Type: vm ioctl +:Parameters: None +:Returns: 0 on success, -1 on error This populates and returns a structure describing the features of the "Server" class MMU emulation supported by KVM. @@ -2475,7 +2718,7 @@ This can in turn be used by userspace to generate the appropriate device-tree properties for the guest operating system. The structure contains some global information, followed by an -array of supported segment page sizes: +array of supported segment page sizes:: struct kvm_ppc_smmu_info { __u64 flags; @@ -2503,7 +2746,7 @@ The "slb_size" field indicates how many SLB entries are supported The "sps" array contains 8 entries indicating the supported base page sizes for a segment in increasing order. 
Each entry is defined -as follow: +as follow:: struct kvm_ppc_one_seg_page_size { __u32 page_shift; /* Base page shift of segment (or 0) */ @@ -2524,7 +2767,7 @@ size provides the list of supported actual page sizes (which can be only larger or equal to the base page size), along with the corresponding encoding in the hash PTE. Similarly, the array is 8 entries sorted by increasing sizes and an entry with a "0" shift -is an empty entry and a terminator: +is an empty entry and a terminator:: struct kvm_ppc_one_page_size { __u32 page_shift; /* Page shift (or 0) */ @@ -2536,12 +2779,13 @@ PTE's RPN field (ie, it needs to be shifted left by 12 to OR it into the hash PTE second double word). 4.75 KVM_IRQFD +-------------- -Capability: KVM_CAP_IRQFD -Architectures: x86 s390 arm arm64 -Type: vm ioctl -Parameters: struct kvm_irqfd (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_IRQFD +:Architectures: x86 s390 arm arm64 +:Type: vm ioctl +:Parameters: struct kvm_irqfd (in) +:Returns: 0 on success, -1 on error Allows setting an eventfd to directly trigger a guest interrupt. kvm_irqfd.fd specifies the file descriptor to use as the eventfd and @@ -2565,6 +2809,7 @@ irqfd. The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment and need not be specified with KVM_IRQFD_FLAG_DEASSIGN. On arm/arm64, gsi routing being supported, the following can happen: + - in case no routing entry is associated to this gsi, injection fails - in case the gsi is associated to an irqchip routing entry, irqchip.pin + 32 corresponds to the injected SPI ID. @@ -2573,12 +2818,13 @@ On arm/arm64, gsi routing being supported, the following can happen: to GICv3 ITS in-kernel emulation). 
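The KVM_IRQFD flag rules above (KVM_IRQFD_FLAG_RESAMPLE only matters on assignment, and is not needed together with KVM_IRQFD_FLAG_DEASSIGN) can be sketched as two small request builders. The helper names are hypothetical; struct kvm_irqfd, its resamplefd field and the flag macros are assumed to come from <linux/kvm.h>.

```c
#include <assert.h>
#include <string.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: build an assignment request.
 * Pass resample_fd < 0 for a plain irqfd without resampling.
 */
static inline void irqfd_assign(struct kvm_irqfd *req, int fd,
				unsigned int gsi, int resample_fd)
{
	assert(req != NULL);

	memset(req, 0, sizeof(*req));
	req->fd = (__u32)fd;
	req->gsi = gsi;
	if (resample_fd >= 0) {
		req->flags |= KVM_IRQFD_FLAG_RESAMPLE;
		req->resamplefd = (__u32)resample_fd;
	}
}

/*
 * Hypothetical helper: build the matching de-assignment request.
 * The resample flag is deliberately left clear here.
 */
static inline void irqfd_deassign(struct kvm_irqfd *req, int fd,
				  unsigned int gsi)
{
	assert(req != NULL);

	memset(req, 0, sizeof(*req));
	req->fd = (__u32)fd;
	req->gsi = gsi;
	req->flags = KVM_IRQFD_FLAG_DEASSIGN;
}
```

Either request would be handed to ioctl(vm_fd, KVM_IRQFD, &req), with fd typically obtained from eventfd(2); vm_fd is an assumed VM file descriptor.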
4.76 KVM_PPC_ALLOCATE_HTAB +-------------------------- -Capability: KVM_CAP_PPC_ALLOC_HTAB -Architectures: powerpc -Type: vm ioctl -Parameters: Pointer to u32 containing hash table order (in/out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PPC_ALLOC_HTAB +:Architectures: powerpc +:Type: vm ioctl +:Parameters: Pointer to u32 containing hash table order (in/out) +:Returns: 0 on success, -1 on error This requests the host kernel to allocate an MMU hash table for a guest using the PAPR paravirtualization interface. This only does @@ -2609,75 +2855,88 @@ real-mode area (VRMA) facility, the kernel will re-create the VMRA HPTEs on the next KVM_RUN of any vcpu. 4.77 KVM_S390_INTERRUPT +----------------------- -Capability: basic -Architectures: s390 -Type: vm ioctl, vcpu ioctl -Parameters: struct kvm_s390_interrupt (in) -Returns: 0 on success, -1 on error +:Capability: basic +:Architectures: s390 +:Type: vm ioctl, vcpu ioctl +:Parameters: struct kvm_s390_interrupt (in) +:Returns: 0 on success, -1 on error Allows to inject an interrupt to the guest. Interrupts can be floating (vm ioctl) or per cpu (vcpu ioctl), depending on the interrupt type. 
-Interrupt parameters are passed via kvm_s390_interrupt: +Interrupt parameters are passed via kvm_s390_interrupt:: -struct kvm_s390_interrupt { + struct kvm_s390_interrupt { __u32 type; __u32 parm; __u64 parm64; -}; + }; type can be one of the following: -KVM_S390_SIGP_STOP (vcpu) - sigp stop; optional flags in parm -KVM_S390_PROGRAM_INT (vcpu) - program check; code in parm -KVM_S390_SIGP_SET_PREFIX (vcpu) - sigp set prefix; prefix address in parm -KVM_S390_RESTART (vcpu) - restart -KVM_S390_INT_CLOCK_COMP (vcpu) - clock comparator interrupt -KVM_S390_INT_CPU_TIMER (vcpu) - CPU timer interrupt -KVM_S390_INT_VIRTIO (vm) - virtio external interrupt; external interrupt - parameters in parm and parm64 -KVM_S390_INT_SERVICE (vm) - sclp external interrupt; sclp parameter in parm -KVM_S390_INT_EMERGENCY (vcpu) - sigp emergency; source cpu in parm -KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external call; source cpu in parm -KVM_S390_INT_IO(ai,cssid,ssid,schid) (vm) - compound value to indicate an - I/O interrupt (ai - adapter interrupt; cssid,ssid,schid - subchannel); - I/O interruption parameters in parm (subchannel) and parm64 (intparm, - interruption subclass) -KVM_S390_MCHK (vm, vcpu) - machine check interrupt; cr 14 bits in parm, - machine check interrupt code in parm64 (note that - machine checks needing further payload are not - supported by this ioctl) +KVM_S390_SIGP_STOP (vcpu) + - sigp stop; optional flags in parm +KVM_S390_PROGRAM_INT (vcpu) + - program check; code in parm +KVM_S390_SIGP_SET_PREFIX (vcpu) + - sigp set prefix; prefix address in parm +KVM_S390_RESTART (vcpu) + - restart +KVM_S390_INT_CLOCK_COMP (vcpu) + - clock comparator interrupt +KVM_S390_INT_CPU_TIMER (vcpu) + - CPU timer interrupt +KVM_S390_INT_VIRTIO (vm) + - virtio external interrupt; external interrupt + parameters in parm and parm64 +KVM_S390_INT_SERVICE (vm) + - sclp external interrupt; sclp parameter in parm +KVM_S390_INT_EMERGENCY (vcpu) + - sigp emergency; source cpu in parm 
+KVM_S390_INT_EXTERNAL_CALL (vcpu) + - sigp external call; source cpu in parm +KVM_S390_INT_IO(ai,cssid,ssid,schid) (vm) + - compound value to indicate an + I/O interrupt (ai - adapter interrupt; cssid,ssid,schid - subchannel); + I/O interruption parameters in parm (subchannel) and parm64 (intparm, + interruption subclass) +KVM_S390_MCHK (vm, vcpu) + - machine check interrupt; cr 14 bits in parm, machine check interrupt + code in parm64 (note that machine checks needing further payload are not + supported by this ioctl) This is an asynchronous vcpu ioctl and can be invoked from any thread. 4.78 KVM_PPC_GET_HTAB_FD +------------------------ -Capability: KVM_CAP_PPC_HTAB_FD -Architectures: powerpc -Type: vm ioctl -Parameters: Pointer to struct kvm_get_htab_fd (in) -Returns: file descriptor number (>= 0) on success, -1 on error +:Capability: KVM_CAP_PPC_HTAB_FD +:Architectures: powerpc +:Type: vm ioctl +:Parameters: Pointer to struct kvm_get_htab_fd (in) +:Returns: file descriptor number (>= 0) on success, -1 on error This returns a file descriptor that can be used either to read out the entries in the guest's hashed page table (HPT), or to write entries to initialize the HPT. The returned fd can only be written to if the KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and can only be read if that bit is clear. The argument struct looks like -this: +this:: -/* For KVM_PPC_GET_HTAB_FD */ -struct kvm_get_htab_fd { + /* For KVM_PPC_GET_HTAB_FD */ + struct kvm_get_htab_fd { __u64 flags; __u64 start_index; __u64 reserved[2]; -}; + }; -/* Values for kvm_get_htab_fd.flags */ -#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) -#define KVM_GET_HTAB_WRITE ((__u64)0x2) + /* Values for kvm_get_htab_fd.flags */ + #define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1) + #define KVM_GET_HTAB_WRITE ((__u64)0x2) -The `start_index' field gives the index in the HPT of the entry at +The 'start_index' field gives the index in the HPT of the entry at which to start reading. 
It is ignored when writing. Reads on the fd will initially supply information about all @@ -2692,29 +2951,34 @@ Data read or written is structured as a header (8 bytes) followed by a series of valid HPT entries (16 bytes) each. The header indicates how many valid HPT entries there are and how many invalid entries follow the valid entries. The invalid entries are not represented explicitly -in the stream. The header format is: +in the stream. The header format is:: -struct kvm_get_htab_header { + struct kvm_get_htab_header { __u32 index; __u16 n_valid; __u16 n_invalid; -}; + }; Writes to the fd create HPT entries starting at the index given in the -header; first `n_valid' valid entries with contents from the data -written, then `n_invalid' invalid entries, invalidating any previously +header; first 'n_valid' valid entries with contents from the data +written, then 'n_invalid' invalid entries, invalidating any previously valid entries found. 4.79 KVM_CREATE_DEVICE +---------------------- + +:Capability: KVM_CAP_DEVICE_CTRL +:Type: vm ioctl +:Parameters: struct kvm_create_device (in/out) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_DEVICE_CTRL -Type: vm ioctl -Parameters: struct kvm_create_device (in/out) -Returns: 0 on success, -1 on error Errors: - ENODEV: The device type is unknown or unsupported - EEXIST: Device already created, and this type of device may not + + ====== ======================================================= + ENODEV The device type is unknown or unsupported + EEXIST Device already created, and this type of device may not be instantiated multiple times + ====== ======================================================= Other error conditions may be defined by individual device types or have their standard meanings. @@ -2730,25 +2994,32 @@ Individual devices should not define flags. Attributes should be used for specifying any behavior that is not implied by the device type number. 
-struct kvm_create_device { +:: + + struct kvm_create_device { __u32 type; /* in: KVM_DEV_TYPE_xxx */ __u32 fd; /* out: device handle */ __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ -}; + }; 4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR +-------------------------------------------- + +:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device, + KVM_CAP_VCPU_ATTRIBUTES for vcpu device +:Type: device ioctl, vm ioctl, vcpu ioctl +:Parameters: struct kvm_device_attr +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device, - KVM_CAP_VCPU_ATTRIBUTES for vcpu device -Type: device ioctl, vm ioctl, vcpu ioctl -Parameters: struct kvm_device_attr -Returns: 0 on success, -1 on error Errors: - ENXIO: The group or attribute is unknown/unsupported for this device + + ===== ============================================================= + ENXIO The group or attribute is unknown/unsupported for this device or hardware support is missing. - EPERM: The attribute cannot (currently) be accessed this way + EPERM The attribute cannot (currently) be accessed this way (e.g. read-only attribute, or attribute that only makes sense when the device is in a different state) + ===== ============================================================= Other error conditions may be defined by individual device types. @@ -2757,23 +3028,30 @@ semantics are device-specific. See individual device documentation in the "devices" directory. As with ONE_REG, the size of the data transferred is defined by the particular attribute. 
-struct kvm_device_attr { +:: + + struct kvm_device_attr { __u32 flags; /* no flags currently defined */ __u32 group; /* device-defined */ __u64 attr; /* group-defined */ __u64 addr; /* userspace address of attr data */ -}; + }; 4.81 KVM_HAS_DEVICE_ATTR +------------------------ + +:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device, + KVM_CAP_VCPU_ATTRIBUTES for vcpu device +:Type: device ioctl, vm ioctl, vcpu ioctl +:Parameters: struct kvm_device_attr +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device, - KVM_CAP_VCPU_ATTRIBUTES for vcpu device -Type: device ioctl, vm ioctl, vcpu ioctl -Parameters: struct kvm_device_attr -Returns: 0 on success, -1 on error Errors: - ENXIO: The group or attribute is unknown/unsupported for this device + + ===== ============================================================= + ENXIO The group or attribute is unknown/unsupported for this device or hardware support is missing. + ===== ============================================================= Tests whether a device supports a particular attribute. A successful return indicates the attribute is implemented. It does not necessarily @@ -2781,15 +3059,20 @@ indicate that the attribute can be read or written in the device's current state. "addr" is ignored. 4.82 KVM_ARM_VCPU_INIT +---------------------- + +:Capability: basic +:Architectures: arm, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_vcpu_init (in) +:Returns: 0 on success; -1 on error -Capability: basic -Architectures: arm, arm64 -Type: vcpu ioctl -Parameters: struct kvm_vcpu_init (in) -Returns: 0 on success; -1 on error Errors: - EINVAL: the target is unknown, or the combination of features is invalid. - ENOENT: a features bit specified is unknown. + + ====== ================================================================= + EINVAL the target is unknown, or the combination of features is invalid. + ENOENT a features bit specified is unknown. 
+ ====== ================================================================= This tells KVM what type of CPU to present to the guest, and what optional features it should have. This will cause a reset of the cpu @@ -2805,6 +3088,7 @@ state. All calls to this function after the initial call must use the same target and same set of feature flags, otherwise EINVAL will be returned. Possible features: + - KVM_ARM_VCPU_POWER_OFF: Starts the CPU in a power-off state. Depends on KVM_CAP_ARM_PSCI. If not set, the CPU will be powered on and execute guest code when KVM_RUN is called. @@ -2861,14 +3145,19 @@ Possible features: no longer be written using KVM_SET_ONE_REG. 4.83 KVM_ARM_PREFERRED_TARGET +----------------------------- + +:Capability: basic +:Architectures: arm, arm64 +:Type: vm ioctl +:Parameters: struct struct kvm_vcpu_init (out) +:Returns: 0 on success; -1 on error -Capability: basic -Architectures: arm, arm64 -Type: vm ioctl -Parameters: struct struct kvm_vcpu_init (out) -Returns: 0 on success; -1 on error Errors: - ENODEV: no preferred target available for the host + + ====== ========================================== + ENODEV no preferred target available for the host + ====== ========================================== This queries KVM for preferred CPU target type which can be emulated by KVM on underlying host. @@ -2885,43 +3174,57 @@ in VCPU matching underlying host. 
4.84 KVM_GET_REG_LIST +--------------------- + +:Capability: basic +:Architectures: arm, arm64, mips +:Type: vcpu ioctl +:Parameters: struct kvm_reg_list (in/out) +:Returns: 0 on success; -1 on error -Capability: basic -Architectures: arm, arm64, mips -Type: vcpu ioctl -Parameters: struct kvm_reg_list (in/out) -Returns: 0 on success; -1 on error Errors: - E2BIG: the reg index list is too big to fit in the array specified by + + ===== ============================================================== + E2BIG the reg index list is too big to fit in the array specified by the user (the number required will be written into n). + ===== ============================================================== + +:: -struct kvm_reg_list { + struct kvm_reg_list { __u64 n; /* number of registers in reg[] */ __u64 reg[0]; -}; + }; This ioctl returns the guest registers that are supported for the KVM_GET_ONE_REG/KVM_SET_ONE_REG calls. 4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated) +----------------------------------------- + +:Capability: KVM_CAP_ARM_SET_DEVICE_ADDR +:Architectures: arm, arm64 +:Type: vm ioctl +:Parameters: struct kvm_arm_device_address (in) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_ARM_SET_DEVICE_ADDR -Architectures: arm, arm64 -Type: vm ioctl -Parameters: struct kvm_arm_device_address (in) -Returns: 0 on success, -1 on error Errors: - ENODEV: The device id is unknown - ENXIO: Device not supported on current system - EEXIST: Address already set - E2BIG: Address outside guest physical address space - EBUSY: Address overlaps with other device range -struct kvm_arm_device_addr { + ====== ============================================ + ENODEV The device id is unknown + ENXIO Device not supported on current system + EEXIST Address already set + E2BIG Address outside guest physical address space + EBUSY Address overlaps with other device range + ====== ============================================ + +:: + + struct kvm_arm_device_addr { __u64 id; __u64 addr; -}; + }; 
Specify a device address in the guest's physical address space where guests can access emulated or directly exposed devices, which the host kernel needs @@ -2929,7 +3232,7 @@ to know about. The id field is an architecture specific identifier for a specific device. ARM/arm64 divides the id field into two parts, a device id and an -address type id specific to the individual device. +address type id specific to the individual device:: bits: | 63 ... 32 | 31 ... 16 | 15 ... 0 | field: | 0x00000000 | device id | addr type id | @@ -2947,12 +3250,13 @@ should be used instead. 4.86 KVM_PPC_RTAS_DEFINE_TOKEN +------------------------------ -Capability: KVM_CAP_PPC_RTAS -Architectures: ppc -Type: vm ioctl -Parameters: struct kvm_rtas_token_args -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PPC_RTAS +:Architectures: ppc +:Type: vm ioctl +:Parameters: struct kvm_rtas_token_args +:Returns: 0 on success, -1 on error Defines a token value for a RTAS (Run Time Abstraction Services) service in order to allow it to be handled in the kernel. The @@ -2966,18 +3270,21 @@ calls by the guest for that service will be passed to userspace to be handled. 4.87 KVM_SET_GUEST_DEBUG +------------------------ -Capability: KVM_CAP_SET_GUEST_DEBUG -Architectures: x86, s390, ppc, arm64 -Type: vcpu ioctl -Parameters: struct kvm_guest_debug (in) -Returns: 0 on success; -1 on error +:Capability: KVM_CAP_SET_GUEST_DEBUG +:Architectures: x86, s390, ppc, arm64 +:Type: vcpu ioctl +:Parameters: struct kvm_guest_debug (in) +:Returns: 0 on success; -1 on error -struct kvm_guest_debug { +:: + + struct kvm_guest_debug { __u32 control; __u32 pad; struct kvm_guest_debug_arch arch; -}; + }; Set up the processor specific debug registers and configure vcpu for handling guest debug events. There are two parts to the structure, the @@ -3019,26 +3326,31 @@ KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run structure containing architecture specific debug information. 
4.88 KVM_GET_EMULATED_CPUID +--------------------------- + +:Capability: KVM_CAP_EXT_EMUL_CPUID +:Architectures: x86 +:Type: system ioctl +:Parameters: struct kvm_cpuid2 (in/out) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_EXT_EMUL_CPUID -Architectures: x86 -Type: system ioctl -Parameters: struct kvm_cpuid2 (in/out) -Returns: 0 on success, -1 on error +:: -struct kvm_cpuid2 { + struct kvm_cpuid2 { __u32 nent; __u32 flags; struct kvm_cpuid_entry2 entries[0]; -}; + }; The member 'flags' is used for passing flags from userspace. -#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0) -#define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) -#define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) +:: -struct kvm_cpuid_entry2 { + #define KVM_CPUID_FLAG_SIGNIFCANT_INDEX BIT(0) + #define KVM_CPUID_FLAG_STATEFUL_FUNC BIT(1) + #define KVM_CPUID_FLAG_STATE_READ_NEXT BIT(2) + + struct kvm_cpuid_entry2 { __u32 function; __u32 index; __u32 flags; @@ -3047,7 +3359,7 @@ struct kvm_cpuid_entry2 { __u32 ecx; __u32 edx; __u32 padding[3]; -}; + }; This ioctl returns x86 cpuid features which are emulated by kvm.Userspace can use the information returned by this ioctl to query @@ -3072,10 +3384,14 @@ emulated efficiently and thus not included here. 
The fields in each entry are defined as follows: - function: the eax value used to obtain the entry - index: the ecx value used to obtain the entry (for entries that are + function: + the eax value used to obtain the entry + index: + the ecx value used to obtain the entry (for entries that are affected by ecx) - flags: an OR of zero or more of the following: + flags: + an OR of zero or more of the following: + KVM_CPUID_FLAG_SIGNIFCANT_INDEX: if the index field is valid KVM_CPUID_FLAG_STATEFUL_FUNC: @@ -3085,24 +3401,28 @@ The fields in each entry are defined as follows: KVM_CPUID_FLAG_STATE_READ_NEXT: for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is the first entry to be read by a cpu - eax, ebx, ecx, edx: the values returned by the cpuid instruction for + + eax, ebx, ecx, edx: + + the values returned by the cpuid instruction for this function/index combination 4.89 KVM_S390_MEM_OP +-------------------- -Capability: KVM_CAP_S390_MEM_OP -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_mem_op (in) -Returns: = 0 on success, - < 0 on generic error (e.g. -EFAULT or -ENOMEM), - > 0 if an exception occurred while walking the page tables +:Capability: KVM_CAP_S390_MEM_OP +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_mem_op (in) +:Returns: = 0 on success, + < 0 on generic error (e.g. -EFAULT or -ENOMEM), + > 0 if an exception occurred while walking the page tables Read or write data from/to the logical (virtual) memory of a VCPU. -Parameters are specified via the following structure: +Parameters are specified via the following structure:: -struct kvm_s390_mem_op { + struct kvm_s390_mem_op { __u64 gaddr; /* the guest address */ __u64 flags; /* flags */ __u32 size; /* amount of bytes */ @@ -3110,7 +3430,7 @@ struct kvm_s390_mem_op { __u64 buf; /* buffer in userspace */ __u8 ar; /* the access register number */ __u8 reserved[31]; /* should be set to 0 */ -}; + }; The type of operation is specified in the "op" field. 
It is either KVM_S390_MEMOP_LOGICAL_READ for reading from logical memory space or @@ -3137,24 +3457,25 @@ The "reserved" field is meant for future extensions. It is not used by KVM with the currently defined set of flags. 4.90 KVM_S390_GET_SKEYS +----------------------- -Capability: KVM_CAP_S390_SKEYS -Architectures: s390 -Type: vm ioctl -Parameters: struct kvm_s390_skeys -Returns: 0 on success, KVM_S390_GET_KEYS_NONE if guest is not using storage - keys, negative value on error +:Capability: KVM_CAP_S390_SKEYS +:Architectures: s390 +:Type: vm ioctl +:Parameters: struct kvm_s390_skeys +:Returns: 0 on success, KVM_S390_GET_KEYS_NONE if guest is not using storage + keys, negative value on error This ioctl is used to get guest storage key values on the s390 -architecture. The ioctl takes parameters via the kvm_s390_skeys struct. +architecture. The ioctl takes parameters via the kvm_s390_skeys struct:: -struct kvm_s390_skeys { + struct kvm_s390_skeys { __u64 start_gfn; __u64 count; __u64 skeydata_addr; __u32 flags; __u32 reserved[9]; -}; + }; The start_gfn field is the number of the first guest frame whose storage keys you want to get. @@ -3168,12 +3489,13 @@ The skeydata_addr field is the address to a buffer large enough to hold count bytes. This buffer will be filled with storage key data by the ioctl. 4.91 KVM_S390_SET_SKEYS +----------------------- -Capability: KVM_CAP_S390_SKEYS -Architectures: s390 -Type: vm ioctl -Parameters: struct kvm_s390_skeys -Returns: 0 on success, negative value on error +:Capability: KVM_CAP_S390_SKEYS +:Architectures: s390 +:Type: vm ioctl +:Parameters: struct kvm_s390_skeys +:Returns: 0 on success, negative value on error This ioctl is used to set guest storage key values on the s390 architecture. The ioctl takes parameters via the kvm_s390_skeys struct. @@ -3195,21 +3517,27 @@ Note: If any architecturally invalid key value is found in the given data then the ioctl will return -EINVAL. 
4.92 KVM_S390_IRQ +----------------- + +:Capability: KVM_CAP_S390_INJECT_IRQ +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_irq (in) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_S390_INJECT_IRQ -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_irq (in) -Returns: 0 on success, -1 on error Errors: - EINVAL: interrupt type is invalid - type is KVM_S390_SIGP_STOP and flag parameter is invalid value + + + ====== ================================================================= + EINVAL interrupt type is invalid + type is KVM_S390_SIGP_STOP and flag parameter is invalid value, type is KVM_S390_INT_EXTERNAL_CALL and code is bigger - than the maximum of VCPUs - EBUSY: type is KVM_S390_SIGP_SET_PREFIX and vcpu is not stopped - type is KVM_S390_SIGP_STOP and a stop irq is already pending + than the maximum of VCPUs + EBUSY type is KVM_S390_SIGP_SET_PREFIX and vcpu is not stopped, + type is KVM_S390_SIGP_STOP and a stop irq is already pending, type is KVM_S390_INT_EXTERNAL_CALL and an external call interrupt - is already pending + is already pending + ====== ================================================================= Allows to inject an interrupt to the guest. @@ -3217,9 +3545,9 @@ Using struct kvm_s390_irq as a parameter allows to inject additional payload which is not possible via KVM_S390_INTERRUPT. 
-Interrupt parameters are passed via kvm_s390_irq: +Interrupt parameters are passed via kvm_s390_irq:: -struct kvm_s390_irq { + struct kvm_s390_irq { __u64 type; union { struct kvm_s390_io_info io; @@ -3232,44 +3560,45 @@ struct kvm_s390_irq { struct kvm_s390_mchk_info mchk; char reserved[64]; } u; -}; + }; type can be one of the following: -KVM_S390_SIGP_STOP - sigp stop; parameter in .stop -KVM_S390_PROGRAM_INT - program check; parameters in .pgm -KVM_S390_SIGP_SET_PREFIX - sigp set prefix; parameters in .prefix -KVM_S390_RESTART - restart; no parameters -KVM_S390_INT_CLOCK_COMP - clock comparator interrupt; no parameters -KVM_S390_INT_CPU_TIMER - CPU timer interrupt; no parameters -KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg -KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall -KVM_S390_MCHK - machine check interrupt; parameters in .mchk +- KVM_S390_SIGP_STOP - sigp stop; parameter in .stop +- KVM_S390_PROGRAM_INT - program check; parameters in .pgm +- KVM_S390_SIGP_SET_PREFIX - sigp set prefix; parameters in .prefix +- KVM_S390_RESTART - restart; no parameters +- KVM_S390_INT_CLOCK_COMP - clock comparator interrupt; no parameters +- KVM_S390_INT_CPU_TIMER - CPU timer interrupt; no parameters +- KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg +- KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall +- KVM_S390_MCHK - machine check interrupt; parameters in .mchk This is an asynchronous vcpu ioctl and can be invoked from any thread. 
4.94 KVM_S390_GET_IRQ_STATE +--------------------------- -Capability: KVM_CAP_S390_IRQ_STATE -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_irq_state (out) -Returns: >= number of bytes copied into buffer, - -EINVAL if buffer size is 0, - -ENOBUFS if buffer size is too small to fit all pending interrupts, - -EFAULT if the buffer address was invalid +:Capability: KVM_CAP_S390_IRQ_STATE +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_irq_state (out) +:Returns: >= number of bytes copied into buffer, + -EINVAL if buffer size is 0, + -ENOBUFS if buffer size is too small to fit all pending interrupts, + -EFAULT if the buffer address was invalid This ioctl allows userspace to retrieve the complete state of all currently pending interrupts in a single buffer. Use cases include migration and introspection. The parameter structure contains the address of a -userspace buffer and its length: +userspace buffer and its length:: -struct kvm_s390_irq_state { + struct kvm_s390_irq_state { __u64 buf; __u32 flags; /* will stay unused for compatibility reasons */ __u32 len; __u32 reserved[4]; /* will stay unused for compatibility reasons */ -}; + }; Userspace passes in the above struct and for each pending interrupt a struct kvm_s390_irq is copied to the provided buffer. @@ -3283,29 +3612,30 @@ If -ENOBUFS is returned the buffer provided was too small and userspace may retry with a bigger buffer. 
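The buffer handling for KVM_S390_GET_IRQ_STATE described above can be sketched with a few pure helpers: size the request for a number of entries, turn the byte count returned on success into an entry count, and pick a larger capacity for a retry. The two structures are local stand-ins that only mirror the layouts quoted in the text (an 8-byte type followed by a union padded by reserved[64]); real code should use the kernel's own definitions. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in with the same size as the struct kvm_s390_irq shown above. */
struct irq_sketch {
	uint64_t type;
	char u[64];
};

/* Local stand-in mirroring struct kvm_s390_irq_state as quoted above. */
struct irq_state_sketch {
	uint64_t buf;
	uint32_t flags;		/* left at zero */
	uint32_t len;
	uint32_t reserved[4];	/* left at zero */
};

/* Describe a buffer with room for `capacity` pending interrupts. */
static inline void irq_state_request(struct irq_state_sketch *st,
				     struct irq_sketch *buf,
				     uint32_t capacity)
{
	assert(st != NULL && buf != NULL);

	memset(st, 0, sizeof(*st));
	st->buf = (uint64_t)(uintptr_t)buf;
	st->len = capacity * (uint32_t)sizeof(*buf);
}

/* Convert a successful return value (bytes copied) into an entry count. */
static inline long irq_state_entries(long ret)
{
	if (ret < 0)
		return -1;	/* error return, nothing to count */
	return ret / (long)sizeof(struct irq_sketch);
}

/* Capacity to try next after a "buffer too small" failure. */
static inline uint32_t irq_state_grow(uint32_t capacity)
{
	return capacity ? capacity * 2 : 1;
}
```

With the real definitions, the request would be passed to ioctl(vcpu_fd, KVM_S390_GET_IRQ_STATE, &st). From userspace a too-small buffer would typically surface as -1 with errno set to ENOBUFS, at which point the buffer can be enlarged using irq_state_grow() and the call retried.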
4.95 KVM_S390_SET_IRQ_STATE - -Capability: KVM_CAP_S390_IRQ_STATE -Architectures: s390 -Type: vcpu ioctl -Parameters: struct kvm_s390_irq_state (in) -Returns: 0 on success, - -EFAULT if the buffer address was invalid, - -EINVAL for an invalid buffer length (see below), - -EBUSY if there were already interrupts pending, - errors occurring when actually injecting the +--------------------------- + +:Capability: KVM_CAP_S390_IRQ_STATE +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: struct kvm_s390_irq_state (in) +:Returns: 0 on success, + -EFAULT if the buffer address was invalid, + -EINVAL for an invalid buffer length (see below), + -EBUSY if there were already interrupts pending, + errors occurring when actually injecting the interrupt. See KVM_S390_IRQ. This ioctl allows userspace to set the complete state of all cpu-local interrupts currently pending for the vcpu. It is intended for restoring interrupt state after a migration. The input parameter is a userspace buffer -containing a struct kvm_s390_irq_state: +containing a struct kvm_s390_irq_state:: -struct kvm_s390_irq_state { + struct kvm_s390_irq_state { __u64 buf; __u32 flags; /* will stay unused for compatibility reasons */ __u32 len; __u32 reserved[4]; /* will stay unused for compatibility reasons */ -}; + }; The restrictions for flags and reserved apply as well. (see KVM_S390_GET_IRQ_STATE) @@ -3320,20 +3650,22 @@ and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq), which is the maximum number of possibly pending cpu-local interrupts. 4.96 KVM_SMI +------------ -Capability: KVM_CAP_X86_SMM -Architectures: x86 -Type: vcpu ioctl -Parameters: none -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_X86_SMM +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: none +:Returns: 0 on success, -1 on error Queues an SMI on the thread's vcpu. 
4.97 KVM_CAP_PPC_MULTITCE +------------------------- -Capability: KVM_CAP_PPC_MULTITCE -Architectures: ppc -Type: vm +:Capability: KVM_CAP_PPC_MULTITCE +:Architectures: ppc +:Type: vm This capability means the kernel is capable of handling hypercalls H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user @@ -3355,26 +3687,27 @@ an implementation for these despite the in kernel acceleration. This capability is always enabled. 4.98 KVM_CREATE_SPAPR_TCE_64 +---------------------------- -Capability: KVM_CAP_SPAPR_TCE_64 -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_create_spapr_tce_64 (in) -Returns: file descriptor for manipulating the created TCE table +:Capability: KVM_CAP_SPAPR_TCE_64 +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_create_spapr_tce_64 (in) +:Returns: file descriptor for manipulating the created TCE table This is an extension for KVM_CAP_SPAPR_TCE which only supports 32bit windows, described in 4.62 KVM_CREATE_SPAPR_TCE -This capability uses extended struct in ioctl interface: +This capability uses extended struct in ioctl interface:: -/* for KVM_CAP_SPAPR_TCE_64 */ -struct kvm_create_spapr_tce_64 { + /* for KVM_CAP_SPAPR_TCE_64 */ + struct kvm_create_spapr_tce_64 { __u64 liobn; __u32 page_shift; __u32 flags; __u64 offset; /* in pages */ __u64 size; /* in pages */ -}; + }; The aim of extension is to support an additional bigger DMA window with a variable page size. @@ -3387,12 +3720,13 @@ of IOMMU pages. The rest of functionality is identical to KVM_CREATE_SPAPR_TCE. 
4.99 KVM_REINJECT_CONTROL +------------------------- -Capability: KVM_CAP_REINJECT_CONTROL -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_reinject_control (in) -Returns: 0 on success, +:Capability: KVM_CAP_REINJECT_CONTROL +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_reinject_control (in) +:Returns: 0 on success, -EFAULT if struct kvm_reinject_control cannot be read, -ENXIO if KVM_CREATE_PIT or KVM_CREATE_PIT2 didn't succeed earlier. @@ -3402,21 +3736,24 @@ vector(s) that i8254 injects. Reinject mode dequeues a tick and injects its interrupt whenever there isn't a pending interrupt from i8254. !reinject mode injects an interrupt as soon as a tick arrives. -struct kvm_reinject_control { +:: + + struct kvm_reinject_control { __u8 pit_reinject; __u8 reserved[31]; -}; + }; pit_reinject = 0 (!reinject mode) is recommended, unless running an old operating system that uses the PIT for timing (e.g. Linux 2.4.x). 4.100 KVM_PPC_CONFIGURE_V3_MMU +------------------------------ -Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3 -Architectures: ppc -Type: vm ioctl -Parameters: struct kvm_ppc_mmuv3_cfg (in) -Returns: 0 on success, +:Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3 +:Architectures: ppc +:Type: vm ioctl +:Parameters: struct kvm_ppc_mmuv3_cfg (in) +:Returns: 0 on success, -EFAULT if struct kvm_ppc_mmuv3_cfg cannot be read, -EINVAL if the configuration is invalid @@ -3424,10 +3761,12 @@ This ioctl controls whether the guest will use radix or HPT (hashed page table) translation, and sets the pointer to the process table for the guest. -struct kvm_ppc_mmuv3_cfg { +:: + + struct kvm_ppc_mmuv3_cfg { __u64 flags; __u64 process_table; -}; + }; There are two bits that can be set in flags; KVM_PPC_MMUV3_RADIX and KVM_PPC_MMUV3_GTSE. KVM_PPC_MMUV3_RADIX, if set, configures the guest @@ -3442,12 +3781,13 @@ as the second doubleword of the partition table entry, as defined in the Power ISA V3.00, Book III section 5.7.6.1. 
4.101 KVM_PPC_GET_RMMU_INFO +--------------------------- -Capability: KVM_CAP_PPC_RADIX_MMU -Architectures: ppc -Type: vm ioctl -Parameters: struct kvm_ppc_rmmu_info (out) -Returns: 0 on success, +:Capability: KVM_CAP_PPC_RADIX_MMU +:Architectures: ppc +:Type: vm ioctl +:Parameters: struct kvm_ppc_rmmu_info (out) +:Returns: 0 on success, -EFAULT if struct kvm_ppc_rmmu_info cannot be written, -EINVAL if no useful information can be returned @@ -3456,14 +3796,16 @@ containing supported radix tree geometries, and (b) a list that maps page sizes to put in the "AP" (actual page size) field for the tlbie (TLB invalidate entry) instruction. -struct kvm_ppc_rmmu_info { +:: + + struct kvm_ppc_rmmu_info { struct kvm_ppc_radix_geom { __u8 page_shift; __u8 level_bits[4]; __u8 pad[3]; } geometries[8]; __u32 ap_encodings[8]; -}; + }; The geometries[] field gives up to 8 supported geometries for the radix page table, in terms of the log base 2 of the smallest page @@ -3476,19 +3818,54 @@ encodings, encoded with the AP value in the top 3 bits and the log base 2 of the page size in the bottom 6 bits. 
4.102 KVM_PPC_RESIZE_HPT_PREPARE +-------------------------------- -Capability: KVM_CAP_SPAPR_RESIZE_HPT -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_ppc_resize_hpt (in) -Returns: 0 on successful completion, +:Capability: KVM_CAP_SPAPR_RESIZE_HPT +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_ppc_resize_hpt (in) +:Returns: 0 on successful completion, >0 if a new HPT is being prepared, the value is an estimated - number of milliseconds until preparation is complete + number of milliseconds until preparation is complete, -EFAULT if struct kvm_reinject_control cannot be read, - -EINVAL if the supplied shift or flags are invalid - -ENOMEM if unable to allocate the new HPT - -ENOSPC if there was a hash collision when moving existing - HPT entries to the new HPT + -EINVAL if the supplied shift or flags are invalid, + -ENOMEM if unable to allocate the new HPT, + -ENOSPC if there was a hash collision + +:: + + struct kvm_ppc_rmmu_info { + struct kvm_ppc_radix_geom { + __u8 page_shift; + __u8 level_bits[4]; + __u8 pad[3]; + } geometries[8]; + __u32 ap_encodings[8]; + }; + +The geometries[] field gives up to 8 supported geometries for the +radix page table, in terms of the log base 2 of the smallest page +size, and the number of bits indexed at each level of the tree, from +the PTE level up to the PGD level in that order. Any unused entries +will have 0 in the page_shift field. + +The ap_encodings gives the supported page sizes and their AP field +encodings, encoded with the AP value in the top 3 bits and the log +base 2 of the page size in the bottom 6 bits. 
+ +4.102 KVM_PPC_RESIZE_HPT_PREPARE +-------------------------------- + +:Capability: KVM_CAP_SPAPR_RESIZE_HPT +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_ppc_resize_hpt (in) +:Returns: 0 on successful completion, + >0 if a new HPT is being prepared, the value is an estimated + number of milliseconds until preparation is complete, + -EFAULT if struct kvm_reinject_control cannot be read, + -EINVAL if the supplied shift or flags are invalid,when moving existing + HPT entries to the new HPT, -EIO on other error conditions Used to implement the PAPR extension for runtime resizing of a guest's @@ -3506,6 +3883,7 @@ requested in the parameters, discards the existing pending HPT and creates a new one as above. If called when there is a pending HPT of the size requested, will: + * If preparation of the pending HPT is already complete, return 0 * If preparation of the pending HPT has failed, return an error code, then discard the pending HPT. @@ -3522,26 +3900,29 @@ Normally this will be called repeatedly with the same parameters until it returns <= 0. The first call will initiate preparation, subsequent ones will monitor preparation until it completes or fails. 
-struct kvm_ppc_resize_hpt { +:: + + struct kvm_ppc_resize_hpt { __u64 flags; __u32 shift; __u32 pad; -}; + }; 4.103 KVM_PPC_RESIZE_HPT_COMMIT +------------------------------- -Capability: KVM_CAP_SPAPR_RESIZE_HPT -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_ppc_resize_hpt (in) -Returns: 0 on successful completion, +:Capability: KVM_CAP_SPAPR_RESIZE_HPT +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_ppc_resize_hpt (in) +:Returns: 0 on successful completion, -EFAULT if struct kvm_reinject_control cannot be read, - -EINVAL if the supplied shift or flags are invalid + -EINVAL if the supplied shift or flags are invalid, -ENXIO is there is no pending HPT, or the pending HPT doesn't - have the requested size - -EBUSY if the pending HPT is not fully prepared + have the requested size, + -EBUSY if the pending HPT is not fully prepared, -ENOSPC if there was a hash collision when moving existing - HPT entries to the new HPT + HPT entries to the new HPT, -EIO on other error conditions Used to implement the PAPR extension for runtime resizing of a guest's @@ -3564,31 +3945,35 @@ HPT and the previous HPT will be discarded. On failure, the guest will still be operating on its previous HPT. -struct kvm_ppc_resize_hpt { +:: + + struct kvm_ppc_resize_hpt { __u64 flags; __u32 shift; __u32 pad; -}; + }; 4.104 KVM_X86_GET_MCE_CAP_SUPPORTED +----------------------------------- -Capability: KVM_CAP_MCE -Architectures: x86 -Type: system ioctl -Parameters: u64 mce_cap (out) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_MCE +:Architectures: x86 +:Type: system ioctl +:Parameters: u64 mce_cap (out) +:Returns: 0 on success, -1 on error Returns supported MCE capabilities. The u64 mce_cap parameter has the same format as the MSR_IA32_MCG_CAP register. Supported capabilities will have the corresponding bits set. 
4.105 KVM_X86_SETUP_MCE +----------------------- -Capability: KVM_CAP_MCE -Architectures: x86 -Type: vcpu ioctl -Parameters: u64 mcg_cap (in) -Returns: 0 on success, +:Capability: KVM_CAP_MCE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: u64 mcg_cap (in) +:Returns: 0 on success, -EFAULT if u64 mcg_cap cannot be read, -EINVAL if the requested number of banks is invalid, -EINVAL if requested MCE capability is not supported. @@ -3601,20 +3986,21 @@ checking for KVM_CAP_MCE. The supported capabilities can be retrieved with KVM_X86_GET_MCE_CAP_SUPPORTED. 4.106 KVM_X86_SET_MCE +--------------------- -Capability: KVM_CAP_MCE -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_x86_mce (in) -Returns: 0 on success, +:Capability: KVM_CAP_MCE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_x86_mce (in) +:Returns: 0 on success, -EFAULT if struct kvm_x86_mce cannot be read, -EINVAL if the bank number is invalid, -EINVAL if VAL bit is not set in status field. Inject a machine check error (MCE) into the guest. The input -parameter is: +parameter is:: -struct kvm_x86_mce { + struct kvm_x86_mce { __u64 status; __u64 addr; __u64 misc; @@ -3622,7 +4008,7 @@ struct kvm_x86_mce { __u8 bank; __u8 pad1[7]; __u64 pad2[3]; -}; + }; If the MCE being reported is an uncorrected error, KVM will inject it as an MCE exception into the guest. If the guest @@ -3634,15 +4020,17 @@ store it in the corresponding bank (provided this bank is not holding a previously reported uncorrected error). 
4.107 KVM_S390_GET_CMMA_BITS +---------------------------- -Capability: KVM_CAP_S390_CMMA_MIGRATION -Architectures: s390 -Type: vm ioctl -Parameters: struct kvm_s390_cmma_log (in, out) -Returns: 0 on success, a negative value on error +:Capability: KVM_CAP_S390_CMMA_MIGRATION +:Architectures: s390 +:Type: vm ioctl +:Parameters: struct kvm_s390_cmma_log (in, out) +:Returns: 0 on success, a negative value on error This ioctl is used to get the values of the CMMA bits on the s390 architecture. It is meant to be used in two scenarios: + - During live migration to save the CMMA values. Live migration needs to be enabled via the KVM_REQ_START_MIGRATION VM property. - To non-destructively peek at the CMMA values, with the flag @@ -3652,9 +4040,12 @@ The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired values are written to a buffer whose location is indicated via the "values" member in the kvm_s390_cmma_log struct. The values in the input struct are also updated as needed. + Each CMMA value takes up one byte. -struct kvm_s390_cmma_log { +:: + + struct kvm_s390_cmma_log { __u64 start_gfn; __u32 count; __u32 flags; @@ -3663,7 +4054,7 @@ struct kvm_s390_cmma_log { __u64 mask; }; __u64 values; -}; + }; start_gfn is the number of the first guest frame whose CMMA values are to be retrieved, @@ -3724,12 +4115,13 @@ KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with present for the addresses (e.g. when using hugepages). 4.108 KVM_S390_SET_CMMA_BITS +---------------------------- -Capability: KVM_CAP_S390_CMMA_MIGRATION -Architectures: s390 -Type: vm ioctl -Parameters: struct kvm_s390_cmma_log (in) -Returns: 0 on success, a negative value on error +:Capability: KVM_CAP_S390_CMMA_MIGRATION +:Architectures: s390 +:Type: vm ioctl +:Parameters: struct kvm_s390_cmma_log (in) +:Returns: 0 on success, a negative value on error This ioctl is used to set the values of the CMMA bits on the s390 architecture. 
It is meant to be used during live migration to restore @@ -3737,16 +4129,18 @@ the CMMA values, but there are no restrictions on its use. The ioctl takes parameters via the kvm_s390_cmma_values struct. Each CMMA value takes up one byte. -struct kvm_s390_cmma_log { +:: + + struct kvm_s390_cmma_log { __u64 start_gfn; __u32 count; __u32 flags; union { __u64 remaining; __u64 mask; - }; + }; __u64 values; -}; + }; start_gfn indicates the starting guest frame number, @@ -3769,26 +4163,27 @@ or if no page table is present for the addresses (e.g. when using hugepages). 4.109 KVM_PPC_GET_CPU_CHAR +-------------------------- -Capability: KVM_CAP_PPC_GET_CPU_CHAR -Architectures: powerpc -Type: vm ioctl -Parameters: struct kvm_ppc_cpu_char (out) -Returns: 0 on successful completion +:Capability: KVM_CAP_PPC_GET_CPU_CHAR +:Architectures: powerpc +:Type: vm ioctl +:Parameters: struct kvm_ppc_cpu_char (out) +:Returns: 0 on successful completion, -EFAULT if struct kvm_ppc_cpu_char cannot be written This ioctl gives userspace information about certain characteristics of the CPU relating to speculative execution of instructions and possible information leakage resulting from speculative execution (see CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754). The information is -returned in struct kvm_ppc_cpu_char, which looks like this: +returned in struct kvm_ppc_cpu_char, which looks like this:: -struct kvm_ppc_cpu_char { + struct kvm_ppc_cpu_char { __u64 character; /* characteristics of the CPU */ __u64 behaviour; /* recommended software behaviour */ __u64 character_mask; /* valid bits in character */ __u64 behaviour_mask; /* valid bits in behaviour */ -}; + }; For extensibility, the character_mask and behaviour_mask fields indicate which bits of character and behaviour have been filled in by @@ -3815,12 +4210,13 @@ These fields use the same bit definitions as the new H_GET_CPU_CHARACTERISTICS hypercall. 
4.110 KVM_MEMORY_ENCRYPT_OP +--------------------------- -Capability: basic -Architectures: x86 -Type: system -Parameters: an opaque platform specific structure (in/out) -Returns: 0 on success; -1 on error +:Capability: basic +:Architectures: x86 +:Type: system +:Parameters: an opaque platform specific structure (in/out) +:Returns: 0 on success; -1 on error If the platform supports creating encrypted VMs then this ioctl can be used for issuing platform-specific memory encryption commands to manage those @@ -3831,12 +4227,13 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization Documentation/virt/kvm/amd-memory-encryption.rst. 4.111 KVM_MEMORY_ENCRYPT_REG_REGION +----------------------------------- -Capability: basic -Architectures: x86 -Type: system -Parameters: struct kvm_enc_region (in) -Returns: 0 on success; -1 on error +:Capability: basic +:Architectures: x86 +:Type: system +:Parameters: struct kvm_enc_region (in) +:Returns: 0 on success; -1 on error This ioctl can be used to register a guest memory region which may contain encrypted data (e.g. guest RAM, SMRAM etc). @@ -3854,60 +4251,71 @@ swap or migrate (move) ciphertext pages. Hence, for now we pin the guest memory region registered with the ioctl. 4.112 KVM_MEMORY_ENCRYPT_UNREG_REGION +------------------------------------- -Capability: basic -Architectures: x86 -Type: system -Parameters: struct kvm_enc_region (in) -Returns: 0 on success; -1 on error +:Capability: basic +:Architectures: x86 +:Type: system +:Parameters: struct kvm_enc_region (in) +:Returns: 0 on success; -1 on error This ioctl can be used to unregister the guest memory region registered with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above. 
4.113 KVM_HYPERV_EVENTFD +------------------------ -Capability: KVM_CAP_HYPERV_EVENTFD -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_hyperv_eventfd (in) +:Capability: KVM_CAP_HYPERV_EVENTFD +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_hyperv_eventfd (in) This ioctl (un)registers an eventfd to receive notifications from the guest on the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without causing a user exit. SIGNAL_EVENT hypercall with non-zero event flag number (bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit. -struct kvm_hyperv_eventfd { +:: + + struct kvm_hyperv_eventfd { __u32 conn_id; __s32 fd; __u32 flags; __u32 padding[3]; -}; + }; -The conn_id field should fit within 24 bits: +The conn_id field should fit within 24 bits:: -#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff + #define KVM_HYPERV_CONN_ID_MASK 0x00ffffff -The acceptable values for the flags field are: +The acceptable values for the flags field are:: -#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) + #define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) -Returns: 0 on success, - -EINVAL if conn_id or flags is outside the allowed range - -ENOENT on deassign if the conn_id isn't registered - -EEXIST on assign if the conn_id is already registered +:Returns: 0 on success, + -EINVAL if conn_id or flags is outside the allowed range, + -ENOENT on deassign if the conn_id isn't registered, + -EEXIST on assign if the conn_id is already registered 4.114 KVM_GET_NESTED_STATE +-------------------------- + +:Capability: KVM_CAP_NESTED_STATE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_nested_state (in/out) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_NESTED_STATE -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_nested_state (in/out) -Returns: 0 on success, -1 on error Errors: - E2BIG: the total state size exceeds the value of 'size' specified by + + ===== ============================================================= 
+ E2BIG the total state size exceeds the value of 'size' specified by the user; the size required will be written into size. + ===== ============================================================= + +:: -struct kvm_nested_state { + struct kvm_nested_state { __u16 flags; __u16 format; __u32 size; @@ -3924,33 +4332,33 @@ struct kvm_nested_state { struct kvm_vmx_nested_state_data vmx[0]; struct kvm_svm_nested_state_data svm[0]; } data; -}; + }; -#define KVM_STATE_NESTED_GUEST_MODE 0x00000001 -#define KVM_STATE_NESTED_RUN_PENDING 0x00000002 -#define KVM_STATE_NESTED_EVMCS 0x00000004 + #define KVM_STATE_NESTED_GUEST_MODE 0x00000001 + #define KVM_STATE_NESTED_RUN_PENDING 0x00000002 + #define KVM_STATE_NESTED_EVMCS 0x00000004 -#define KVM_STATE_NESTED_FORMAT_VMX 0 -#define KVM_STATE_NESTED_FORMAT_SVM 1 + #define KVM_STATE_NESTED_FORMAT_VMX 0 + #define KVM_STATE_NESTED_FORMAT_SVM 1 -#define KVM_STATE_NESTED_VMX_VMCS_SIZE 0x1000 + #define KVM_STATE_NESTED_VMX_VMCS_SIZE 0x1000 -#define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE 0x00000001 -#define KVM_STATE_NESTED_VMX_SMM_VMXON 0x00000002 + #define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE 0x00000001 + #define KVM_STATE_NESTED_VMX_SMM_VMXON 0x00000002 -struct kvm_vmx_nested_state_hdr { + struct kvm_vmx_nested_state_hdr { __u64 vmxon_pa; __u64 vmcs12_pa; struct { __u16 flags; } smm; -}; + }; -struct kvm_vmx_nested_state_data { + struct kvm_vmx_nested_state_data { __u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE]; __u8 shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE]; -}; + }; This ioctl copies the vcpu's nested virtualization state from the kernel to userspace. @@ -3959,24 +4367,26 @@ The maximum size of the state can be retrieved by passing KVM_CAP_NESTED_STATE to the KVM_CHECK_EXTENSION ioctl(). 
4.115 KVM_SET_NESTED_STATE +-------------------------- -Capability: KVM_CAP_NESTED_STATE -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_nested_state (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_NESTED_STATE +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_nested_state (in) +:Returns: 0 on success, -1 on error This copies the vcpu's kvm_nested_state struct from userspace to the kernel. For the definition of struct kvm_nested_state, see KVM_GET_NESTED_STATE. 4.116 KVM_(UN)REGISTER_COALESCED_MMIO +------------------------------------- -Capability: KVM_CAP_COALESCED_MMIO (for coalesced mmio) - KVM_CAP_COALESCED_PIO (for coalesced pio) -Architectures: all -Type: vm ioctl -Parameters: struct kvm_coalesced_mmio_zone -Returns: 0 on success, < 0 on error +:Capability: KVM_CAP_COALESCED_MMIO (for coalesced mmio) + KVM_CAP_COALESCED_PIO (for coalesced pio) +:Architectures: all +:Type: vm ioctl +:Parameters: struct kvm_coalesced_mmio_zone +:Returns: 0 on success, < 0 on error Coalesced I/O is a performance optimization that defers hardware register write emulation so that userspace exits are avoided. It is @@ -3998,15 +4408,18 @@ between coalesced mmio and pio except that coalesced pio records accesses to I/O ports. 
4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl) +------------------------------------ -Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 -Architectures: x86, arm, arm64, mips -Type: vm ioctl -Parameters: struct kvm_dirty_log (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 +:Architectures: x86, arm, arm64, mips +:Type: vm ioctl +:Parameters: struct kvm_dirty_log (in) +:Returns: 0 on success, -1 on error -/* for KVM_CLEAR_DIRTY_LOG */ -struct kvm_clear_dirty_log { +:: + + /* for KVM_CLEAR_DIRTY_LOG */ + struct kvm_clear_dirty_log { __u32 slot; __u32 num_pages; __u64 first_page; @@ -4014,7 +4427,7 @@ struct kvm_clear_dirty_log { void __user *dirty_bitmap; /* one bit per page */ __u64 padding; }; -}; + }; The ioctl clears the dirty status of pages in a memory slot, according to the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap @@ -4038,20 +4451,23 @@ However, it can always be used as long as KVM_CHECK_EXTENSION confirms that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present. 4.118 KVM_GET_SUPPORTED_HV_CPUID +-------------------------------- + +:Capability: KVM_CAP_HYPERV_CPUID +:Architectures: x86 +:Type: vcpu ioctl +:Parameters: struct kvm_cpuid2 (in/out) +:Returns: 0 on success, -1 on error -Capability: KVM_CAP_HYPERV_CPUID -Architectures: x86 -Type: vcpu ioctl -Parameters: struct kvm_cpuid2 (in/out) -Returns: 0 on success, -1 on error +:: -struct kvm_cpuid2 { + struct kvm_cpuid2 { __u32 nent; __u32 padding; struct kvm_cpuid_entry2 entries[0]; -}; + }; -struct kvm_cpuid_entry2 { + struct kvm_cpuid_entry2 { __u32 function; __u32 index; __u32 flags; @@ -4060,7 +4476,7 @@ struct kvm_cpuid_entry2 { __u32 ecx; __u32 edx; __u32 padding[3]; -}; + }; This ioctl returns x86 cpuid features leaves related to Hyper-V emulation in KVM. Userspace can use the information returned by this ioctl to construct @@ -4073,13 +4489,13 @@ KVM_GET_SUPPORTED_CPUID ioctl because some of them intersect with KVM feature leaves (0x40000000, 0x40000001). 
Currently, the following list of CPUID leaves are returned: - HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS - HYPERV_CPUID_INTERFACE - HYPERV_CPUID_VERSION - HYPERV_CPUID_FEATURES - HYPERV_CPUID_ENLIGHTMENT_INFO - HYPERV_CPUID_IMPLEMENT_LIMITS - HYPERV_CPUID_NESTED_FEATURES + - HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS + - HYPERV_CPUID_INTERFACE + - HYPERV_CPUID_VERSION + - HYPERV_CPUID_FEATURES + - HYPERV_CPUID_ENLIGHTMENT_INFO + - HYPERV_CPUID_IMPLEMENT_LIMITS + - HYPERV_CPUID_NESTED_FEATURES HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS). @@ -4095,17 +4511,25 @@ number of valid entries in the 'entries' array, which is then filled. userspace should not expect to get any particular value there. 4.119 KVM_ARM_VCPU_FINALIZE +--------------------------- + +:Architectures: arm, arm64 +:Type: vcpu ioctl +:Parameters: int feature (in) +:Returns: 0 on success, -1 on error -Architectures: arm, arm64 -Type: vcpu ioctl -Parameters: int feature (in) -Returns: 0 on success, -1 on error Errors: - EPERM: feature not enabled, needs configuration, or already finalized - EINVAL: feature unknown or not present + + ====== ============================================================== + EPERM feature not enabled, needs configuration, or already finalized + EINVAL feature unknown or not present + ====== ============================================================== Recognised values for feature: + + ===== =========================================== arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE) + ===== =========================================== Finalizes the configuration of the specified vcpu feature. @@ -4129,21 +4553,24 @@ See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization using this ioctl. 
4.120 KVM_SET_PMU_EVENT_FILTER +------------------------------ -Capability: KVM_CAP_PMU_EVENT_FILTER -Architectures: x86 -Type: vm ioctl -Parameters: struct kvm_pmu_event_filter (in) -Returns: 0 on success, -1 on error +:Capability: KVM_CAP_PMU_EVENT_FILTER +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_pmu_event_filter (in) +:Returns: 0 on success, -1 on error -struct kvm_pmu_event_filter { +:: + + struct kvm_pmu_event_filter { __u32 action; __u32 nevents; __u32 fixed_counter_bitmap; __u32 flags; __u32 pad[4]; __u64 events[0]; -}; + }; This ioctl restricts the set of PMU events that the guest can program. The argument holds a list of events which will be allowed or denied. @@ -4154,20 +4581,26 @@ counters are controlled by the fixed_counter_bitmap. No flags are defined yet, the field must be zero. -Valid values for 'action': -#define KVM_PMU_EVENT_ALLOW 0 -#define KVM_PMU_EVENT_DENY 1 +Valid values for 'action':: + + #define KVM_PMU_EVENT_ALLOW 0 + #define KVM_PMU_EVENT_DENY 1 4.121 KVM_PPC_SVM_OFF +--------------------- + +:Capability: basic +:Architectures: powerpc +:Type: vm ioctl +:Parameters: none +:Returns: 0 on successful completion, -Capability: basic -Architectures: powerpc -Type: vm ioctl -Parameters: none -Returns: 0 on successful completion, Errors: - EINVAL: if ultravisor failed to terminate the secure guest - ENOMEM: if hypervisor failed to allocate new radix page tables for guest + + ====== ================================================================ + EINVAL if ultravisor failed to terminate the secure guest + ENOMEM if hypervisor failed to allocate new radix page tables for guest + ====== ================================================================ This ioctl is used to turn off the secure mode of the guest or transition the guest from secure mode to normal mode. 
This is invoked when the guest @@ -4178,35 +4611,38 @@ unpins the VPA pages and releases all the device pages that are used to track the secure pages by hypervisor. 4.122 KVM_S390_NORMAL_RESET +--------------------------- -Capability: KVM_CAP_S390_VCPU_RESETS -Architectures: s390 -Type: vcpu ioctl -Parameters: none -Returns: 0 +:Capability: KVM_CAP_S390_VCPU_RESETS +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: none +:Returns: 0 This ioctl resets VCPU registers and control structures according to the cpu reset definition in the POP (Principles Of Operation). 4.123 KVM_S390_INITIAL_RESET +---------------------------- -Capability: none -Architectures: s390 -Type: vcpu ioctl -Parameters: none -Returns: 0 +:Capability: none +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: none +:Returns: 0 This ioctl resets VCPU registers and control structures according to the initial cpu reset definition in the POP. However, the cpu is not put into ESA mode. This reset is a superset of the normal reset. 4.124 KVM_S390_CLEAR_RESET +-------------------------- -Capability: KVM_CAP_S390_VCPU_RESETS -Architectures: s390 -Type: vcpu ioctl -Parameters: none -Returns: 0 +:Capability: KVM_CAP_S390_VCPU_RESETS +:Architectures: s390 +:Type: vcpu ioctl +:Parameters: none +:Returns: 0 This ioctl resets VCPU registers and control structures according to the clear cpu reset definition in the POP. However, the cpu is not put @@ -4214,7 +4650,7 @@ into ESA mode. This reset is a superset of the initial reset. 5. The kvm_run structure ------------------------- +======================== Application code obtains a pointer to the kvm_run structure by mmap()ing a vcpu fd. From that point, application code can control @@ -4222,13 +4658,17 @@ execution by changing fields in kvm_run prior to calling the KVM_RUN ioctl, and obtain information about the reason KVM_RUN returned by looking up structure members. 
-struct kvm_run { +:: + + struct kvm_run { /* in */ __u8 request_interrupt_window; Request that KVM_RUN return when it becomes possible to inject external interrupts into the guest. Useful in conjunction with KVM_INTERRUPT. +:: + __u8 immediate_exit; This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN @@ -4240,6 +4680,8 @@ a signal handler that sets run->immediate_exit to a non-zero value. This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available. +:: + __u8 padding1[6]; /* out */ @@ -4249,16 +4691,22 @@ When KVM_RUN has returned successfully (return value 0), this informs application code why KVM_RUN has returned. Allowable values for this field are detailed below. +:: + __u8 ready_for_interrupt_injection; If request_interrupt_window has been specified, this field indicates an interrupt can be injected now with KVM_INTERRUPT. +:: + __u8 if_flag; The value of the current interrupt flag. Only valid if in-kernel local APIC is not used. +:: + __u16 flags; More architecture-specific flags detailing state of the VCPU that may @@ -4266,17 +4714,23 @@ affect the device's behavior. The only currently defined flag is KVM_RUN_X86_SMM, which is valid on x86 machines and is set if the VCPU is in system management mode. +:: + /* in (pre_kvm_run), out (post_kvm_run) */ __u64 cr8; The value of the cr8 register. Only valid if in-kernel local APIC is not used. Both input and output. +:: + __u64 apic_base; The value of the APIC BASE msr. Only valid if in-kernel local APIC is not used. Both input and output. +:: + union { /* KVM_EXIT_UNKNOWN */ struct { @@ -4287,6 +4741,8 @@ If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown reasons. Further architecture-specific information is available in hardware_exit_reason. +:: + /* KVM_EXIT_FAIL_ENTRY */ struct { __u64 hardware_entry_failure_reason; @@ -4296,6 +4752,8 @@ If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due to unknown reasons. 
Further architecture-specific information is available in hardware_entry_failure_reason. +:: + /* KVM_EXIT_EXCEPTION */ struct { __u32 exception; @@ -4304,10 +4762,12 @@ available in hardware_entry_failure_reason. Unused. +:: + /* KVM_EXIT_IO */ struct { -#define KVM_EXIT_IO_IN 0 -#define KVM_EXIT_IO_OUT 1 + #define KVM_EXIT_IO_IN 0 + #define KVM_EXIT_IO_OUT 1 __u8 direction; __u8 size; /* bytes */ __u16 port; @@ -4321,6 +4781,8 @@ data_offset describes where the data is located (KVM_EXIT_IO_OUT) or where kvm expects application code to place the data for the next KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array. +:: + /* KVM_EXIT_DEBUG */ struct { struct kvm_debug_exit_arch arch; @@ -4329,6 +4791,8 @@ KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array. If the exit_reason is KVM_EXIT_DEBUG, then a vcpu is processing a debug event for which architecture specific information is returned. +:: + /* KVM_EXIT_MMIO */ struct { __u64 phys_addr; @@ -4346,14 +4810,19 @@ The 'data' member contains, in its first 'len' bytes, the value as it would appear if the VCPU performed a load or store of the appropriate width directly to the byte array. -NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and +.. note:: + + For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and KVM_EXIT_EPR the corresponding + operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations. +:: + /* KVM_EXIT_HYPERCALL */ struct { __u64 nr; @@ -4365,7 +4834,10 @@ pending operations. Unused. This was once used for 'hypercall to userspace'. To implement such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390). -Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO. + +.. 
note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO. + +:: /* KVM_EXIT_TPR_ACCESS */ struct { @@ -4376,6 +4848,8 @@ Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO. To be documented (KVM_TPR_ACCESS_REPORTING). +:: + /* KVM_EXIT_S390_SIEIC */ struct { __u8 icptcode; @@ -4387,16 +4861,20 @@ To be documented (KVM_TPR_ACCESS_REPORTING). s390 specific. +:: + /* KVM_EXIT_S390_RESET */ -#define KVM_S390_RESET_POR 1 -#define KVM_S390_RESET_CLEAR 2 -#define KVM_S390_RESET_SUBSYSTEM 4 -#define KVM_S390_RESET_CPU_INIT 8 -#define KVM_S390_RESET_IPL 16 + #define KVM_S390_RESET_POR 1 + #define KVM_S390_RESET_CLEAR 2 + #define KVM_S390_RESET_SUBSYSTEM 4 + #define KVM_S390_RESET_CPU_INIT 8 + #define KVM_S390_RESET_IPL 16 __u64 s390_reset_flags; s390 specific. +:: + /* KVM_EXIT_S390_UCONTROL */ struct { __u64 trans_exc_code; @@ -4411,6 +4889,8 @@ in the cpu's lowcore are presented here as defined by the z Architecture Principles of Operation Book in the Chapter for Dynamic Address Translation (DAT) +:: + /* KVM_EXIT_DCR */ struct { __u32 dcrn; @@ -4420,6 +4900,8 @@ Principles of Operation Book in the Chapter for Dynamic Address Translation Deprecated - was used for 440 KVM. +:: + /* KVM_EXIT_OSI */ struct { __u64 gprs[32]; @@ -4433,6 +4915,8 @@ Userspace can now handle the hypercall and when it's done modify the gprs as necessary. Upon guest entry all guest GPRs will then be replaced by the values in this struct. +:: + /* KVM_EXIT_PAPR_HCALL */ struct { __u64 nr; @@ -4450,6 +4934,8 @@ The possible hypercalls are defined in the Power Architecture Platform Requirements (PAPR) document available from www.power.org (free developer registration required to access it). +:: + /* KVM_EXIT_S390_TSCH */ struct { __u16 subchannel_id; @@ -4466,6 +4952,8 @@ interrupt for the target subchannel has been dequeued and subchannel_id, subchannel_nr, io_int_parm and io_int_word contain the parameters for that interrupt. ipb is needed for instruction parameter decoding. 
+:: + /* KVM_EXIT_EPR */ struct { __u32 epr; @@ -4485,11 +4973,13 @@ It gets triggered whenever both KVM_CAP_PPC_EPR are enabled and an external interrupt has just been delivered into the guest. User space should put the acknowledged interrupt vector into the 'epr' field. +:: + /* KVM_EXIT_SYSTEM_EVENT */ struct { -#define KVM_SYSTEM_EVENT_SHUTDOWN 1 -#define KVM_SYSTEM_EVENT_RESET 2 -#define KVM_SYSTEM_EVENT_CRASH 3 + #define KVM_SYSTEM_EVENT_SHUTDOWN 1 + #define KVM_SYSTEM_EVENT_RESET 2 + #define KVM_SYSTEM_EVENT_CRASH 3 __u32 type; __u64 flags; } system_event; @@ -4502,18 +4992,21 @@ the system-level event type. The 'flags' field describes architecture specific flags for the system-level event. Valid values for 'type' are: - KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the + + - KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the VM. Userspace is not obliged to honour this, and if it does honour this does not need to destroy the VM synchronously (ie it may call KVM_RUN again before shutdown finally occurs). - KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM. + - KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM. As with SHUTDOWN, userspace can choose to ignore the request, or to schedule the reset to occur in the future and may call KVM_RUN again. - KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest + - KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest has requested a crash condition maintenance. Userspace can choose to ignore the request, or to gather VM memory core dump and/or reset/shutdown of the VM. +:: + /* KVM_EXIT_IOAPIC_EOI */ struct { __u8 vector; @@ -4526,9 +5019,11 @@ the userspace IOAPIC should process the EOI and retrigger the interrupt if it is still asserted. Vector is the LAPIC interrupt vector for which the EOI was received. 
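As a rough illustration of the KVM_EXIT_SYSTEM_EVENT handling described earlier in this section, the sketch below maps the 'type' field to a userspace decision. The three type values are the ones listed in the text. The `vmm_action` enum, the `system_event_action()` helper and the choice to keep running on an unknown type are assumptions made for this example; the documentation leaves the policy to userspace.

```c
#include <stdint.h>

/* Values as listed in the documentation above; the guard only provides a
 * fallback when <linux/kvm.h> has not been included. */
#ifndef KVM_SYSTEM_EVENT_SHUTDOWN
#define KVM_SYSTEM_EVENT_SHUTDOWN 1
#define KVM_SYSTEM_EVENT_RESET    2
#define KVM_SYSTEM_EVENT_CRASH    3
#endif

/* Hypothetical VMM-side decisions; not part of the KVM API. */
enum vmm_action {
	VMM_CONTINUE,	/* ignore the request and call KVM_RUN again */
	VMM_SHUTDOWN,	/* schedule a VM shutdown */
	VMM_RESET,	/* schedule a VM reset */
	VMM_DUMP	/* gather a memory core dump */
};

/* Map system_event.type to an action. Userspace is allowed to ignore
 * any of these requests, so every case here is a policy choice. */
static enum vmm_action system_event_action(uint32_t type)
{
	switch (type) {
	case KVM_SYSTEM_EVENT_SHUTDOWN:
		return VMM_SHUTDOWN;
	case KVM_SYSTEM_EVENT_RESET:
		return VMM_RESET;
	case KVM_SYSTEM_EVENT_CRASH:
		return VMM_DUMP;
	default:
		/* Assumption: unknown types are ignored. */
		return VMM_CONTINUE;
	}
}
```

Because shutdown and reset need not happen synchronously, a VMM built along these lines could record the returned action and still call KVM_RUN again before acting on it.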
+:: + struct kvm_hyperv_exit { -#define KVM_EXIT_HYPERV_SYNIC 1 -#define KVM_EXIT_HYPERV_HCALL 2 + #define KVM_EXIT_HYPERV_SYNIC 1 + #define KVM_EXIT_HYPERV_HCALL 2 __u32 type; union { struct { @@ -4546,14 +5041,20 @@ EOI was received. }; /* KVM_EXIT_HYPERV */ struct kvm_hyperv_exit hyperv; + Indicates that the VCPU exits into userspace to process some tasks related to Hyper-V emulation. + Valid values for 'type' are: - KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about + + - KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about + Hyper-V SynIC state change. Notification is used to remap SynIC event/message pages and to enable/disable SynIC messages/events processing in userspace. +:: + /* KVM_EXIT_ARM_NISV */ struct { __u64 esr_iss; @@ -4587,6 +5088,8 @@ Note that KVM does not skip the faulting instruction as it does for KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state if it decides to decode and emulate the instruction. +:: + /* Fix the size of the union. */ char padding[256]; }; @@ -4611,18 +5114,20 @@ avoid some system call overhead if userspace has to handle the exit. Userspace can query the validity of the structure by checking kvm_valid_regs for specific bits. These bits are architecture specific and usually define the validity of a groups of registers. (e.g. one bit - for general purpose registers) +for general purpose registers) Please note that the kernel is allowed to use the kvm_run structure as the primary storage for certain register types. Therefore, the kernel may use the values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. -}; +:: + + }; 6. Capabilities that can be enabled on vCPUs --------------------------------------------- +============================================ There are certain capabilities that change the behavior of the virtual CPU or the virtual machine when enabled. To enable them, please see section 4.37. 
@@ -4631,23 +5136,28 @@ the virtual machine is when enabling them. The following information is provided along with the description: - Architectures: which instruction set architectures provide this ioctl. + Architectures: + which instruction set architectures provide this ioctl. x86 includes both i386 and x86_64. - Target: whether this is a per-vcpu or per-vm capability. + Target: + whether this is a per-vcpu or per-vm capability. - Parameters: what parameters are accepted by the capability. + Parameters: + what parameters are accepted by the capability. - Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) + Returns: + the return value. General error numbers (EBADF, ENOMEM, EINVAL) are not detailed, but errors with specific meanings are. 6.1 KVM_CAP_PPC_OSI +------------------- -Architectures: ppc -Target: vcpu -Parameters: none -Returns: 0 on success; -1 on error +:Architectures: ppc +:Target: vcpu +:Parameters: none +:Returns: 0 on success; -1 on error This capability enables interception of OSI hypercalls that otherwise would be treated as normal system calls to be injected into the guest. OSI hypercalls @@ -4658,11 +5168,12 @@ When this capability is enabled, KVM_EXIT_OSI can occur. 6.2 KVM_CAP_PPC_PAPR +-------------------- -Architectures: ppc -Target: vcpu -Parameters: none -Returns: 0 on success; -1 on error +:Architectures: ppc +:Target: vcpu +:Parameters: none +:Returns: 0 on success; -1 on error This capability enables interception of PAPR hypercalls. PAPR hypercalls are done using the hypercall instruction "sc 1". @@ -4678,18 +5189,21 @@ When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur. 
6.3 KVM_CAP_SW_TLB +------------------ + +:Architectures: ppc +:Target: vcpu +:Parameters: args[0] is the address of a struct kvm_config_tlb +:Returns: 0 on success; -1 on error -Architectures: ppc -Target: vcpu -Parameters: args[0] is the address of a struct kvm_config_tlb -Returns: 0 on success; -1 on error +:: -struct kvm_config_tlb { + struct kvm_config_tlb { __u64 params; __u64 array; __u32 mmu_type; __u32 array_len; -}; + }; Configures the virtual CPU's TLB array, establishing a shared memory area between userspace and KVM. The "params" and "array" fields are userspace @@ -4708,6 +5222,7 @@ to tell KVM which entries have been changed, prior to calling KVM_RUN again on this vcpu. For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV: + - The "params" field is of type "struct kvm_book3e_206_tlb_params". - The "array" field points to an array of type "struct kvm_book3e_206_tlb_entry". @@ -4721,11 +5236,12 @@ For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV: hardware ignores this value for TLB0. 6.4 KVM_CAP_S390_CSS_SUPPORT +---------------------------- -Architectures: s390 -Target: vcpu -Parameters: none -Returns: 0 on success; -1 on error +:Architectures: s390 +:Target: vcpu +:Parameters: none +:Returns: 0 on success; -1 on error This capability enables support for handling of channel I/O instructions. @@ -4739,11 +5255,12 @@ Note that even though this capability is enabled per-vcpu, the complete virtual machine is affected. 6.5 KVM_CAP_PPC_EPR +------------------- -Architectures: ppc -Target: vcpu -Parameters: args[0] defines whether the proxy facility is active -Returns: 0 on success; -1 on error +:Architectures: ppc +:Target: vcpu +:Parameters: args[0] defines whether the proxy facility is active +:Returns: 0 on success; -1 on error This capability enables or disables the delivery of interrupts through the external proxy facility. @@ -4757,62 +5274,70 @@ When disabled (args[0] == 0), behavior is as if this facility is unsupported. 
When this capability is enabled, KVM_EXIT_EPR can occur. 6.6 KVM_CAP_IRQ_MPIC +-------------------- -Architectures: ppc -Parameters: args[0] is the MPIC device fd - args[1] is the MPIC CPU number for this vcpu +:Architectures: ppc +:Parameters: args[0] is the MPIC device fd; + args[1] is the MPIC CPU number for this vcpu This capability connects the vcpu to an in-kernel MPIC device. 6.7 KVM_CAP_IRQ_XICS +-------------------- -Architectures: ppc -Target: vcpu -Parameters: args[0] is the XICS device fd - args[1] is the XICS CPU number (server ID) for this vcpu +:Architectures: ppc +:Target: vcpu +:Parameters: args[0] is the XICS device fd; + args[1] is the XICS CPU number (server ID) for this vcpu This capability connects the vcpu to an in-kernel XICS device. 6.8 KVM_CAP_S390_IRQCHIP +------------------------ -Architectures: s390 -Target: vm -Parameters: none +:Architectures: s390 +:Target: vm +:Parameters: none This capability enables the in-kernel irqchip for s390. Please refer to "4.24 KVM_CREATE_IRQCHIP" for details. 6.9 KVM_CAP_MIPS_FPU +-------------------- -Architectures: mips -Target: vcpu -Parameters: args[0] is reserved for future use (should be 0). +:Architectures: mips +:Target: vcpu +:Parameters: args[0] is reserved for future use (should be 0). This capability allows the use of the host Floating Point Unit by the guest. It allows the Config1.FP bit to be set to enable the FPU in the guest. Once this is -done the KVM_REG_MIPS_FPR_* and KVM_REG_MIPS_FCR_* registers can be accessed -(depending on the current guest FPU register mode), and the Status.FR, +done the ``KVM_REG_MIPS_FPR_*`` and ``KVM_REG_MIPS_FCR_*`` registers can be +accessed (depending on the current guest FPU register mode), and the Status.FR, Config5.FRE bits are accessible via the KVM API and also from the guest, depending on them being supported by the FPU. 
6.10 KVM_CAP_MIPS_MSA +--------------------- -Architectures: mips -Target: vcpu -Parameters: args[0] is reserved for future use (should be 0). +:Architectures: mips +:Target: vcpu +:Parameters: args[0] is reserved for future use (should be 0). This capability allows the use of the MIPS SIMD Architecture (MSA) by the guest. It allows the Config3.MSAP bit to be set to enable the use of MSA by the guest. -Once this is done the KVM_REG_MIPS_VEC_* and KVM_REG_MIPS_MSA_* registers can be -accessed, and the Config5.MSAEn bit is accessible via the KVM API and also from -the guest. +Once this is done the ``KVM_REG_MIPS_VEC_*`` and ``KVM_REG_MIPS_MSA_*`` +registers can be accessed, and the Config5.MSAEn bit is accessible via the +KVM API and also from the guest. 6.74 KVM_CAP_SYNC_REGS -Architectures: s390, x86 -Target: s390: always enabled, x86: vcpu -Parameters: none -Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register -sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h). +---------------------- + +:Architectures: s390, x86 +:Target: s390: always enabled, x86: vcpu +:Parameters: none +:Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register + sets are supported + (bitfields defined in arch/x86/include/uapi/asm/kvm.h). As described above in the kvm_sync_regs struct info in section 5 (kvm_run): KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers @@ -4825,6 +5350,7 @@ userspace. For s390 specifics, please refer to the source code. For x86: + - the register sets to be copied out to kvm_run are selectable by userspace (rather that all sets being copied out for every exit). - vcpu_events are available in addition to regs and sregs. @@ -4841,23 +5367,26 @@ into the vCPU even if they've been modified. Unused bitfields in the bitarrays must be set to zero. 
-struct kvm_sync_regs { +:: + + struct kvm_sync_regs { struct kvm_regs regs; struct kvm_sregs sregs; struct kvm_vcpu_events events; -}; + }; 6.75 KVM_CAP_PPC_IRQ_XIVE +------------------------- -Architectures: ppc -Target: vcpu -Parameters: args[0] is the XIVE device fd - args[1] is the XIVE CPU number (server ID) for this vcpu +:Architectures: ppc +:Target: vcpu +:Parameters: args[0] is the XIVE device fd; + args[1] is the XIVE CPU number (server ID) for this vcpu This capability connects the vcpu to an in-kernel XIVE device. 7. Capabilities that can be enabled on VMs ------------------------------------------- +========================================== There are certain capabilities that change the behavior of the virtual machine when enabled. To enable them, please see section 4.37. Below @@ -4866,20 +5395,24 @@ is when enabling them. The following information is provided along with the description: - Architectures: which instruction set architectures provide this ioctl. + Architectures: + which instruction set architectures provide this ioctl. x86 includes both i386 and x86_64. - Parameters: what parameters are accepted by the capability. + Parameters: + what parameters are accepted by the capability. - Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL) + Returns: + the return value. General error numbers (EBADF, ENOMEM, EINVAL) are not detailed, but errors with specific meanings are. 7.1 KVM_CAP_PPC_ENABLE_HCALL +---------------------------- -Architectures: ppc -Parameters: args[0] is the sPAPR hcall number - args[1] is 0 to disable, 1 to enable in-kernel handling +:Architectures: ppc +:Parameters: args[0] is the sPAPR hcall number; + args[1] is 0 to disable, 1 to enable in-kernel handling This capability controls whether individual sPAPR hypercalls (hcalls) get handled by the kernel or not. Enabling or disabling in-kernel @@ -4897,13 +5430,15 @@ implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL error. 
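The argument layout for KVM_CAP_PPC_ENABLE_HCALL can be sketched as follows. This is only an illustration: `struct enable_cap_sketch` is a local stand-in that mirrors the field layout of `struct kvm_enable_cap` so that the example compiles without kernel headers, and `ppc_enable_hcall_request()` is a hypothetical helper. Real code would include `<linux/kvm.h>`, use the kernel's structure, and pass the result to the KVM_ENABLE_CAP ioctl.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in with the same field layout as struct kvm_enable_cap;
 * an assumption for this sketch, real code should use <linux/kvm.h>. */
struct enable_cap_sketch {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

/* Hypothetical helper: fill in a request that enables or disables
 * in-kernel handling of one sPAPR hcall. */
static struct enable_cap_sketch
ppc_enable_hcall_request(uint32_t cap_nr, uint64_t hcall_nr, int in_kernel)
{
	struct enable_cap_sketch req;

	memset(&req, 0, sizeof(req));	/* flags and unused args stay zero */
	req.cap = cap_nr;		/* caller passes KVM_CAP_PPC_ENABLE_HCALL */
	req.args[0] = hcall_nr;		/* the sPAPR hcall number */
	req.args[1] = in_kernel ? 1 : 0;	/* 0 disables, 1 enables */
	return req;
}
```

The capability number is taken as a parameter rather than hard-coded so that the sketch does not have to guess the value from the kernel headers.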
7.2 KVM_CAP_S390_USER_SIGP +-------------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none This capability controls which SIGP orders will be handled completely in user space. With this capability enabled, all fast orders will be handled completely in the kernel: + - SENSE - SENSE RUNNING - EXTERNAL CALL @@ -4917,48 +5452,52 @@ in the hardware prior to interception). If this capability is not enabled, the old way of handling SIGP orders is used (partially in kernel and user space). 7.3 KVM_CAP_S390_VECTOR_REGISTERS +--------------------------------- -Architectures: s390 -Parameters: none -Returns: 0 on success, negative value on error +:Architectures: s390 +:Parameters: none +:Returns: 0 on success, negative value on error Allows use of the vector registers introduced with z13 processor, and provides for the synchronization between host and user space. Will return -EINVAL if the machine does not support vectors. 7.4 KVM_CAP_S390_USER_STSI +-------------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none This capability allows post-handlers for the STSI instruction. After initial handling in the kernel, KVM exits to user space with KVM_EXIT_S390_STSI to allow user space to insert further data. Before exiting to userspace, kvm handlers should fill in s390_stsi field of -vcpu->run: -struct { +vcpu->run:: + + struct { __u64 addr; __u8 ar; __u8 reserved; __u8 fc; __u8 sel1; __u16 sel2; -} s390_stsi; + } s390_stsi; -@addr - guest address of STSI SYSIB -@fc - function code -@sel1 - selector 1 -@sel2 - selector 2 -@ar - access register number + @addr - guest address of STSI SYSIB + @fc - function code + @sel1 - selector 1 + @sel2 - selector 2 + @ar - access register number KVM handlers should exit to userspace with rc = -EREMOTE. 
7.5 KVM_CAP_SPLIT_IRQCHIP +------------------------- -Architectures: x86 -Parameters: args[0] - number of routes reserved for userspace IOAPICs -Returns: 0 on success, -1 on error +:Architectures: x86 +:Parameters: args[0] - number of routes reserved for userspace IOAPICs +:Returns: 0 on success, -1 on error Create a local apic for each processor in the kernel. This can be used instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the @@ -4975,24 +5514,26 @@ Fails if VCPU has already been created, or if the irqchip is already in the kernel (i.e. KVM_CREATE_IRQCHIP has already been called). 7.6 KVM_CAP_S390_RI +------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none Allows use of runtime-instrumentation introduced with zEC12 processor. Will return -EINVAL if the machine does not support runtime-instrumentation. Will return -EBUSY if a VCPU has already been created. 7.7 KVM_CAP_X2APIC_API +---------------------- -Architectures: x86 -Parameters: args[0] - features that should be enabled -Returns: 0 on success, -EINVAL when args[0] contains invalid features +:Architectures: x86 +:Parameters: args[0] - features that should be enabled +:Returns: 0 on success, -EINVAL when args[0] contains invalid features -Valid feature flags in args[0] are +Valid feature flags in args[0] are:: -#define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0) -#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1) + #define KVM_X2APIC_API_USE_32BIT_IDS (1ULL << 0) + #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK (1ULL << 1) Enabling KVM_X2APIC_API_USE_32BIT_IDS changes the behavior of KVM_SET_GSI_ROUTING, KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC, @@ -5006,9 +5547,10 @@ without interrupt remapping. This is undesirable in logical mode, where 0xff represents CPUs 0-7 in cluster 0. 
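The args[0] rule for KVM_CAP_X2APIC_API (0 on success, -EINVAL when unknown features are requested) can be mirrored on the userspace side before issuing the ioctl. The sketch below is an illustration, not kernel code: the two flag values are copied from the text above, while the helper name `x2apic_api_check_features()` is an assumption made for this example.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values as listed in the documentation above; the guards only
 * provide a fallback when <linux/kvm.h> has not been included. */
#ifndef KVM_X2APIC_API_USE_32BIT_IDS
#define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
#endif
#ifndef KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK
#define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
#endif

/* Hypothetical pre-check: returns 0 if 'features' only contains the two
 * documented flags, -EINVAL otherwise. */
static int x2apic_api_check_features(uint64_t features)
{
	const uint64_t valid = KVM_X2APIC_API_USE_32BIT_IDS |
			       KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK;

	return (features & ~valid) ? -EINVAL : 0;
}
```

A VMM might call such a check while parsing its configuration, so that an invalid feature mask is reported before any VM state has been created.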
7.8 KVM_CAP_S390_USER_INSTR0 +---------------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none With this capability enabled, all illegal instructions 0x0000 (2 bytes) will be intercepted and forwarded to user space. User space can use this @@ -5020,26 +5562,29 @@ This capability can be enabled dynamically even if VCPUs were already created and are running. 7.9 KVM_CAP_S390_GS +------------------- -Architectures: s390 -Parameters: none -Returns: 0 on success; -EINVAL if the machine does not support - guarded storage; -EBUSY if a VCPU has already been created. +:Architectures: s390 +:Parameters: none +:Returns: 0 on success; -EINVAL if the machine does not support + guarded storage; -EBUSY if a VCPU has already been created. Allows use of guarded storage for the KVM guest. 7.10 KVM_CAP_S390_AIS +--------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none Allow use of adapter-interruption suppression. -Returns: 0 on success; -EBUSY if a VCPU has already been created. +:Returns: 0 on success; -EBUSY if a VCPU has already been created. 7.11 KVM_CAP_PPC_SMT +-------------------- -Architectures: ppc -Parameters: vsmt_mode, flags +:Architectures: ppc +:Parameters: vsmt_mode, flags Enabling this capability on a VM provides userspace with a way to set the desired virtual SMT mode (i.e. the number of virtual CPUs per @@ -5054,9 +5599,10 @@ The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT modes are available. 7.12 KVM_CAP_PPC_FWNMI +---------------------- -Architectures: ppc -Parameters: none +:Architectures: ppc +:Parameters: none With this capability a machine check exception in the guest address space will cause KVM to exit the guest with NMI exit reason. This @@ -5065,17 +5611,18 @@ machine check handling routine. Without this capability KVM will branch to guests' 0x200 interrupt vector. 
7.13 KVM_CAP_X86_DISABLE_EXITS +------------------------------ -Architectures: x86 -Parameters: args[0] defines which exits are disabled -Returns: 0 on success, -EINVAL when args[0] contains invalid exits +:Architectures: x86 +:Parameters: args[0] defines which exits are disabled +:Returns: 0 on success, -EINVAL when args[0] contains invalid exits -Valid bits in args[0] are +Valid bits in args[0] are:: -#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) -#define KVM_X86_DISABLE_EXITS_HLT (1 << 1) -#define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) -#define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) + #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0) + #define KVM_X86_DISABLE_EXITS_HLT (1 << 1) + #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) + #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) Enabling this capability on a VM provides userspace with a way to no longer intercept some instructions for improved latency in some @@ -5087,12 +5634,13 @@ all such vmexits. Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits. 7.14 KVM_CAP_S390_HPAGE_1M +-------------------------- -Architectures: s390 -Parameters: none -Returns: 0 on success, -EINVAL if hpage module parameter was not set - or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL - flag set +:Architectures: s390 +:Parameters: none +:Returns: 0 on success, -EINVAL if hpage module parameter was not set + or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL + flag set With this capability the KVM support for memory backing with 1m pages through hugetlbfs can be enabled for a VM. After the capability is @@ -5104,20 +5652,22 @@ While it is generally possible to create a huge page backed VM without this capability, the VM will not be able to run. 
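Returning to KVM_CAP_X86_DISABLE_EXITS (7.13): that section combines a bitmask check with the advice not to enable KVM_FEATURE_PV_UNHALT when HLT exits are disabled. A userspace sanity check along those lines might look like the sketch below. The bit values are copied from the text; `disable_exits_request_ok()` and its `pv_unhalt_exposed` parameter are assumptions for this example, and treating the PV_UNHALT combination as an error is a VMM policy choice rather than documented kernel behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as listed in section 7.13; the guards only provide a
 * fallback when <linux/kvm.h> has not been included. */
#ifndef KVM_X86_DISABLE_EXITS_MWAIT
#define KVM_X86_DISABLE_EXITS_MWAIT   (1 << 0)
#endif
#ifndef KVM_X86_DISABLE_EXITS_HLT
#define KVM_X86_DISABLE_EXITS_HLT     (1 << 1)
#endif
#ifndef KVM_X86_DISABLE_EXITS_PAUSE
#define KVM_X86_DISABLE_EXITS_PAUSE   (1 << 2)
#endif
#ifndef KVM_X86_DISABLE_EXITS_CSTATE
#define KVM_X86_DISABLE_EXITS_CSTATE  (1 << 3)
#endif

/* Hypothetical check run before KVM_ENABLE_CAP: rejects undocumented
 * bits, and rejects disabling HLT exits while the VMM also plans to
 * expose KVM_FEATURE_PV_UNHALT to the guest. */
static bool disable_exits_request_ok(uint64_t exits, bool pv_unhalt_exposed)
{
	const uint64_t valid = KVM_X86_DISABLE_EXITS_MWAIT |
			       KVM_X86_DISABLE_EXITS_HLT |
			       KVM_X86_DISABLE_EXITS_PAUSE |
			       KVM_X86_DISABLE_EXITS_CSTATE;

	if (exits & ~valid)
		return false;	/* the kernel would return -EINVAL */
	if ((exits & KVM_X86_DISABLE_EXITS_HLT) && pv_unhalt_exposed)
		return false;	/* combination the text advises against */
	return true;
}
```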
7.15 KVM_CAP_MSR_PLATFORM_INFO +------------------------------ -Architectures: x86 -Parameters: args[0] whether feature should be enabled or not +:Architectures: x86 +:Parameters: args[0] whether feature should be enabled or not With this capability, a guest may read the MSR_PLATFORM_INFO MSR. Otherwise, a #GP would be raised when the guest tries to access. Currently, this capability does not enable write permissions of this MSR for the guest. 7.16 KVM_CAP_PPC_NESTED_HV +-------------------------- -Architectures: ppc -Parameters: none -Returns: 0 on success, -EINVAL when the implementation doesn't support - nested-HV virtualization. +:Architectures: ppc +:Parameters: none +:Returns: 0 on success, -EINVAL when the implementation doesn't support + nested-HV virtualization. HV-KVM on POWER9 and later systems allows for "nested-HV" virtualization, which provides a way for a guest VM to run guests that @@ -5127,9 +5677,10 @@ the necessary functionality and on the facility being enabled with a kvm-hv module parameter. 7.17 KVM_CAP_EXCEPTION_PAYLOAD +------------------------------ -Architectures: x86 -Parameters: args[0] whether feature should be enabled or not +:Architectures: x86 +:Parameters: args[0] whether feature should be enabled or not With this capability enabled, CR2 will not be modified prior to the emulated VM-exit when L1 intercepts a #PF exception that occurs in @@ -5140,21 +5691,21 @@ L2. As a result, when KVM_GET_VCPU_EVENTS reports a pending #PF (or faulting address (or the new DR6 bits*) will be reported in the exception_payload field. Similarly, when userspace injects a #PF (or #DB) into L2 using KVM_SET_VCPU_EVENTS, it is expected to set -exception.has_payload and to put the faulting address (or the new DR6 -bits*) in the exception_payload field. +exception.has_payload and to put the faulting address - or the new DR6 +bits\ [#]_ - in the exception_payload field. 
This capability also enables exception.pending in struct kvm_vcpu_events, which allows userspace to distinguish between pending and injected exceptions. -* For the new DR6 bits, note that bit 16 is set iff the #DB exception - will clear DR6.RTM. +.. [#] For the new DR6 bits, note that bit 16 is set iff the #DB exception + will clear DR6.RTM. 7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 -Architectures: x86, arm, arm64, mips -Parameters: args[0] whether feature should be enabled or not +:Architectures: x86, arm, arm64, mips +:Parameters: args[0] whether feature should be enabled or not With this capability enabled, KVM_GET_DIRTY_LOG will not automatically clear and write-protect all pages that are returned as dirty. @@ -5181,14 +5732,15 @@ KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed. Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT. 8. Other capabilities. ----------------------- +====================== This section lists capabilities that give information about other features of the KVM implementation. 8.1 KVM_CAP_PPC_HWRNG +--------------------- -Architectures: ppc +:Architectures: ppc This capability, if KVM_CHECK_EXTENSION indicates that it is available, means that that the kernel has an implementation of the @@ -5197,8 +5749,10 @@ If present, the kernel H_RANDOM handler can be enabled for guest use with the KVM_CAP_PPC_ENABLE_HCALL capability. 8.2 KVM_CAP_HYPERV_SYNIC +------------------------ + +:Architectures: x86 -Architectures: x86 This capability, if KVM_CHECK_EXTENSION indicates that it is available, means that that the kernel has an implementation of the Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is @@ -5210,8 +5764,9 @@ will disable the use of APIC hardware virtualization even if supported by the CPU, as it's incompatible with SynIC auto-EOI behavior. 
8.3 KVM_CAP_PPC_RADIX_MMU +------------------------- -Architectures: ppc +:Architectures: ppc This capability, if KVM_CHECK_EXTENSION indicates that it is available, means that that the kernel can support guests using the @@ -5219,8 +5774,9 @@ radix MMU defined in Power ISA V3.00 (as implemented in the POWER9 processor). 8.4 KVM_CAP_PPC_HASH_MMU_V3 +--------------------------- -Architectures: ppc +:Architectures: ppc This capability, if KVM_CHECK_EXTENSION indicates that it is available, means that that the kernel can support guests using the @@ -5228,8 +5784,9 @@ hashed page table MMU defined in Power ISA V3.00 (as implemented in the POWER9 processor), including in-memory segment tables. 8.5 KVM_CAP_MIPS_VZ +------------------- -Architectures: mips +:Architectures: mips This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that it is available, means that full hardware assisted virtualization capabilities @@ -5247,16 +5804,19 @@ values (see below). All other values are reserved. This is to allow for the possibility of other hardware assisted virtualization implementations which may be incompatible with the MIPS VZ ASE. - 0: The trap & emulate implementation is in use to run guest code in user +== ========================================================================== + 0 The trap & emulate implementation is in use to run guest code in user mode. Guest virtual memory segments are rearranged to fit the guest in the user mode address space. - 1: The MIPS VZ ASE is in use, providing full hardware assisted + 1 The MIPS VZ ASE is in use, providing full hardware assisted virtualization, including standard guest virtual memory segments. 
+== ========================================================================== 8.6 KVM_CAP_MIPS_TE +------------------- -Architectures: mips +:Architectures: mips This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that it is available, means that the trap & emulate implementation is available to @@ -5268,8 +5828,9 @@ If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is available, it means that the VM is using trap & emulate. 8.7 KVM_CAP_MIPS_64BIT +---------------------- -Architectures: mips +:Architectures: mips This capability indicates the supported architecture type of the guest, i.e. the supported register and address width. @@ -5279,22 +5840,26 @@ kvm VM handle correspond roughly to the CP0_Config.AT register field, and should be checked specifically against known values (see below). All other values are reserved. - 0: MIPS32 or microMIPS32. +== ======================================================================== + 0 MIPS32 or microMIPS32. Both registers and addresses are 32-bits wide. It will only be possible to run 32-bit guest code. - 1: MIPS64 or microMIPS64 with access only to 32-bit compatibility segments. + 1 MIPS64 or microMIPS64 with access only to 32-bit compatibility segments. Registers are 64-bits wide, but addresses are 32-bits wide. 64-bit guest code may run but cannot access MIPS64 memory segments. It will also be possible to run 32-bit guest code. - 2: MIPS64 or microMIPS64 with access to all address segments. + 2 MIPS64 or microMIPS64 with access to all address segments. Both registers and addresses are 64-bits wide. It will be possible to run 64-bit or 32-bit guest code. 
+== ======================================================================== 8.9 KVM_CAP_ARM_USER_IRQ +------------------------ + +:Architectures: arm, arm64 -Architectures: arm, arm64 This capability, if KVM_CHECK_EXTENSION indicates that it is available, means that if userspace creates a VM without an in-kernel interrupt controller, it will be notified of changes to the output level of in-kernel emulated devices, @@ -5321,7 +5886,7 @@ If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a number larger than 0 indicating the version of this capability is implemented and thereby which bits in in run->s.regs.device_irq_level can signal values. -Currently the following bits are defined for the device_irq_level bitmap: +Currently the following bits are defined for the device_irq_level bitmap:: KVM_CAP_ARM_USER_IRQ >= 1: @@ -5334,8 +5899,9 @@ indicated by returning a higher number from KVM_CHECK_EXTENSION and will be listed above. 8.10 KVM_CAP_PPC_SMT_POSSIBLE +----------------------------- -Architectures: ppc +:Architectures: ppc Querying this capability returns a bitmap indicating the possible virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N @@ -5343,8 +5909,9 @@ virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N available. 8.11 KVM_CAP_HYPERV_SYNIC2 +-------------------------- -Architectures: x86 +:Architectures: x86 This capability enables a newer version of Hyper-V Synthetic interrupt controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM @@ -5352,8 +5919,9 @@ doesn't clear SynIC message and event flags pages when they are enabled by writing to the respective MSRs. 8.12 KVM_CAP_HYPERV_VP_INDEX +---------------------------- -Architectures: x86 +:Architectures: x86 This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr. Its value is used to denote the target vcpu for a SynIC interrupt. 
For @@ -5361,47 +5929,53 @@ compatibilty, KVM initializes this msr to KVM's internal vcpu index. When this capability is absent, userspace can still query this msr's value. 8.13 KVM_CAP_S390_AIS_MIGRATION +------------------------------- -Architectures: s390 -Parameters: none +:Architectures: s390 +:Parameters: none This capability indicates if the flic device will be able to get/set the AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows to discover this without having to create a flic device. 8.14 KVM_CAP_S390_PSW +--------------------- -Architectures: s390 +:Architectures: s390 This capability indicates that the PSW is exposed via the kvm_run structure. 8.15 KVM_CAP_S390_GMAP +---------------------- -Architectures: s390 +:Architectures: s390 This capability indicates that the user space memory used as guest mapping can be anywhere in the user memory address space, as long as the memory slots are aligned and sized to a segment (1MB) boundary. 8.16 KVM_CAP_S390_COW +--------------------- -Architectures: s390 +:Architectures: s390 This capability indicates that the user space memory used as guest mapping can use copy-on-write semantics as well as dirty pages tracking via read-only page tables. 8.17 KVM_CAP_S390_BPB +--------------------- -Architectures: s390 +:Architectures: s390 This capability indicates that kvm will implement the interfaces to handle reset, migration and nested KVM for branch prediction blocking. The stfle facility 82 should not be provided to the guest without this capability. 8.18 KVM_CAP_HYPERV_TLBFLUSH +---------------------------- -Architectures: x86 +:Architectures: x86 This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush hypercalls: @@ -5409,8 +5983,9 @@ HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressList, HvFlushVirtualAddressListEx. 
8.19 KVM_CAP_ARM_INJECT_SERROR_ESR +---------------------------------- -Architectures: arm, arm64 +:Architectures: arm, arm64 This capability indicates that userspace can specify (via the KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it @@ -5421,16 +5996,20 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using AArch64, this value will be reported in the ISS field of ESR_ELx. See KVM_CAP_VCPU_EVENTS for more details. + 8.20 KVM_CAP_HYPERV_SEND_IPI +---------------------------- -Architectures: x86 +:Architectures: x86 This capability indicates that KVM supports paravirtualized Hyper-V IPI send hypercalls: HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. + 8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH +----------------------------------- -Architecture: x86 +:Architecture: x86 This capability indicates that KVM running on top of Hyper-V hypervisor enables Direct TLB flush for its guests meaning that TLB flush diff --git a/Documentation/virt/kvm/arm/hyp-abi.txt b/Documentation/virt/kvm/arm/hyp-abi.rst index a20a0bee268d..d1fc27d848e9 100644 --- a/Documentation/virt/kvm/arm/hyp-abi.txt +++ b/Documentation/virt/kvm/arm/hyp-abi.rst @@ -1,4 +1,8 @@ -* Internal ABI between the kernel and HYP +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +Internal ABI between the kernel and HYP +======================================= This file documents the interaction between the Linux kernel and the hypervisor layer when running Linux as a hypervisor (for example @@ -19,25 +23,31 @@ and only act on individual CPUs. Unless specified otherwise, any built-in hypervisor must implement these functions (see arch/arm{,64}/include/asm/virt.h): -* r0/x0 = HVC_SET_VECTORS - r1/x1 = vectors +* :: + + r0/x0 = HVC_SET_VECTORS + r1/x1 = vectors Set HVBAR/VBAR_EL2 to 'vectors' to enable a hypervisor. 'vectors' must be a physical address, and respect the alignment requirements of the architecture. 
Only implemented by the initial stubs, not by Linux hypervisors. -* r0/x0 = HVC_RESET_VECTORS +* :: + + r0/x0 = HVC_RESET_VECTORS Turn HYP/EL2 MMU off, and reset HVBAR/VBAR_EL2 to the initials stubs' exception vector value. This effectively disables an existing hypervisor. -* r0/x0 = HVC_SOFT_RESTART - r1/x1 = restart address - x2 = x0's value when entering the next payload (arm64) - x3 = x1's value when entering the next payload (arm64) - x4 = x2's value when entering the next payload (arm64) +* :: + + r0/x0 = HVC_SOFT_RESTART + r1/x1 = restart address + x2 = x0's value when entering the next payload (arm64) + x3 = x1's value when entering the next payload (arm64) + x4 = x2's value when entering the next payload (arm64) Mask all exceptions, disable the MMU, move the arguments into place (arm64 only), and jump to the restart address while at HYP/EL2. This diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst new file mode 100644 index 000000000000..3e2b2aba90fc --- /dev/null +++ b/Documentation/virt/kvm/arm/index.rst @@ -0,0 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=== +ARM +=== + +.. toctree:: + :maxdepth: 2 + + hyp-abi + psci + pvtime diff --git a/Documentation/virt/kvm/arm/psci.txt b/Documentation/virt/kvm/arm/psci.rst index 559586fc9d37..d52c2e83b5b8 100644 --- a/Documentation/virt/kvm/arm/psci.txt +++ b/Documentation/virt/kvm/arm/psci.rst @@ -1,3 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Power State Coordination Interface (PSCI) +========================================= + KVM implements the PSCI (Power State Coordination Interface) specification in order to provide services such as CPU on/off, reset and power-off to the guest. 
@@ -30,32 +36,42 @@ The following register is defined: - Affects the whole VM (even if the register view is per-vcpu) * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1: - Holds the state of the firmware support to mitigate CVE-2017-5715, as - offered by KVM to the guest via a HVC call. The workaround is described - under SMCCC_ARCH_WORKAROUND_1 in [1]. + Holds the state of the firmware support to mitigate CVE-2017-5715, as + offered by KVM to the guest via a HVC call. The workaround is described + under SMCCC_ARCH_WORKAROUND_1 in [1]. + Accepted values are: - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: KVM does not offer + + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_AVAIL: + KVM does not offer firmware support for the workaround. The mitigation status for the guest is unknown. - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: The workaround HVC call is + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_AVAIL: + The workaround HVC call is available to the guest and required for the mitigation. - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: The workaround HVC call + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_1_NOT_REQUIRED: + The workaround HVC call is available to the guest, but it is not needed on this VCPU. * KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2: - Holds the state of the firmware support to mitigate CVE-2018-3639, as - offered by KVM to the guest via a HVC call. The workaround is described - under SMCCC_ARCH_WORKAROUND_2 in [1]. + Holds the state of the firmware support to mitigate CVE-2018-3639, as + offered by KVM to the guest via a HVC call. The workaround is described + under SMCCC_ARCH_WORKAROUND_2 in [1]_. + Accepted values are: - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: A workaround is not + + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_AVAIL: + A workaround is not available. KVM does not offer firmware support for the workaround. - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: The workaround state is + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_UNKNOWN: + The workaround state is unknown. 
KVM does not offer firmware support for the workaround. - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: The workaround is available, + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_AVAIL: + The workaround is available, and can be disabled by a vCPU. If KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_ENABLED is set, it is active for this vCPU. - KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: The workaround is - always active on this vCPU or it is not needed. + KVM_REG_ARM_SMCCC_ARCH_WORKAROUND_2_NOT_REQUIRED: + The workaround is always active on this vCPU or it is not needed. -[1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf +.. [1] https://developer.arm.com/-/media/developer/pdf/ARM_DEN_0070A_Firmware_interfaces_for_mitigating_CVE-2017-5715.pdf diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.txt b/Documentation/virt/kvm/devices/arm-vgic-its.rst index eeaa95b893a8..6c304fd2b1b4 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-its.txt +++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== ARM Virtual Interrupt Translation Service (ITS) =============================================== @@ -12,22 +15,32 @@ There can be multiple ITS controllers per guest, each of them has to have a separate, non-overlapping MMIO region. -Groups: - KVM_DEV_ARM_VGIC_GRP_ADDR +Groups +====== + +KVM_DEV_ARM_VGIC_GRP_ADDR +------------------------- + Attributes: KVM_VGIC_ITS_ADDR_TYPE (rw, 64-bit) Base address in the guest physical address space of the GICv3 ITS control register frame. This address needs to be 64K aligned and the region covers 128K. + Errors: - -E2BIG: Address outside of addressable IPA range - -EINVAL: Incorrectly aligned address - -EEXIST: Address already configured - -EFAULT: Invalid user pointer for attr->addr. - -ENODEV: Incorrect attribute or the ITS is not supported. 
+ ======= ================================================= + -E2BIG Address outside of addressable IPA range + -EINVAL Incorrectly aligned address + -EEXIST Address already configured + -EFAULT Invalid user pointer for attr->addr. + -ENODEV Incorrect attribute or the ITS is not supported. + ======= ================================================= + + +KVM_DEV_ARM_VGIC_GRP_CTRL +------------------------- - KVM_DEV_ARM_VGIC_GRP_CTRL Attributes: KVM_DEV_ARM_VGIC_CTRL_INIT request the initialization of the ITS, no additional parameter in @@ -58,16 +71,21 @@ Groups: "ITS Restore Sequence". Errors: - -ENXIO: ITS not properly configured as required prior to setting + + ======= ========================================================== + -ENXIO ITS not properly configured as required prior to setting this attribute - -ENOMEM: Memory shortage when allocating ITS internal data - -EINVAL: Inconsistent restored data - -EFAULT: Invalid guest ram access - -EBUSY: One or more VCPUS are running - -EACCES: The virtual ITS is backed by a physical GICv4 ITS, and the + -ENOMEM Memory shortage when allocating ITS internal data + -EINVAL Inconsistent restored data + -EFAULT Invalid guest ram access + -EBUSY One or more VCPUS are running + -EACCES The virtual ITS is backed by a physical GICv4 ITS, and the state is not available + ======= ========================================================== + +KVM_DEV_ARM_VGIC_GRP_ITS_REGS +----------------------------- - KVM_DEV_ARM_VGIC_GRP_ITS_REGS Attributes: The attr field of kvm_device_attr encodes the offset of the ITS register, relative to the ITS control frame base address @@ -78,6 +96,7 @@ Groups: be accessed with full length. Writes to read-only registers are ignored by the kernel except for: + - GITS_CREADR. It must be restored otherwise commands in the queue will be re-executed after restoring CWRITER. 
GITS_CREADR must be restored before restoring the GITS_CTLR which is likely to enable the @@ -91,30 +110,36 @@ Groups: For other registers, getting or setting a register has the same effect as reading/writing the register on real hardware. + Errors: - -ENXIO: Offset does not correspond to any supported register - -EFAULT: Invalid user pointer for attr->addr - -EINVAL: Offset is not 64-bit aligned - -EBUSY: one or more VCPUS are running - ITS Restore Sequence: - ------------------------- + ======= ==================================================== + -ENXIO Offset does not correspond to any supported register + -EFAULT Invalid user pointer for attr->addr + -EINVAL Offset is not 64-bit aligned + -EBUSY one or more VCPUS are running + ======= ==================================================== + +ITS Restore Sequence: +--------------------- The following ordering must be followed when restoring the GIC and the ITS: + a) restore all guest memory and create vcpus b) restore all redistributors c) provide the ITS base address (KVM_DEV_ARM_VGIC_GRP_ADDR) d) restore the ITS in the following order: - 1. Restore GITS_CBASER - 2. Restore all other GITS_ registers, except GITS_CTLR! - 3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES) - 4. Restore GITS_CTLR + + 1. Restore GITS_CBASER + 2. Restore all other ``GITS_`` registers, except GITS_CTLR! + 3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES) + 4. Restore GITS_CTLR Then vcpus can be started. - ITS Table ABI REV0: - ------------------- +ITS Table ABI REV0: +------------------- Revision 0 of the ABI only supports the features of a virtual GICv3, and does not support a virtual GICv4 with support for direct injection of virtual @@ -125,12 +150,13 @@ Then vcpus can be started. entries in the collection are listed in no particular order. All entries are 8 bytes. - Device Table Entry (DTE): + Device Table Entry (DTE):: - bits: | 63| 62 ... 49 | 48 ... 5 | 4 ... 
0 | - values: | V | next | ITT_addr | Size | + bits: | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 | + values: | V | next | ITT_addr | Size | + + where: - where; - V indicates whether the entry is valid. If not, other fields are not meaningful. - next: equals to 0 if this entry is the last one; otherwise it @@ -140,32 +166,34 @@ Then vcpus can be started. - Size specifies the supported number of bits for the EventID, minus one - Collection Table Entry (CTE): + Collection Table Entry (CTE):: - bits: | 63| 62 .. 52 | 51 ... 16 | 15 ... 0 | - values: | V | RES0 | RDBase | ICID | + bits: | 63| 62 .. 52 | 51 ... 16 | 15 ... 0 | + values: | V | RES0 | RDBase | ICID | where: + - V indicates whether the entry is valid. If not, other fields are not meaningful. - RES0: reserved field with Should-Be-Zero-or-Preserved behavior. - RDBase is the PE number (GICR_TYPER.Processor_Number semantic), - ICID is the collection ID - Interrupt Translation Entry (ITE): + Interrupt Translation Entry (ITE):: - bits: | 63 ... 48 | 47 ... 16 | 15 ... 0 | - values: | next | pINTID | ICID | + bits: | 63 ... 48 | 47 ... 16 | 15 ... 0 | + values: | next | pINTID | ICID | where: + - next: equals to 0 if this entry is the last one; otherwise it corresponds to the EventID offset to the next ITE capped by 2^16 -1. - pINTID is the physical LPI ID; if zero, it means the entry is not valid and other fields are not meaningful. - ICID is the collection ID - ITS Reset State: - ---------------- +ITS Reset State: +---------------- RESET returns the ITS to the same state that it was when first created and initialized. When the RESET command returns, the following things are diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.txt b/Documentation/virt/kvm/devices/arm-vgic-v3.rst index ff290b43c8e5..5dd3bff51978 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-v3.txt +++ b/Documentation/virt/kvm/devices/arm-vgic-v3.rst @@ -1,9 +1,12 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +============================================================== ARM Virtual Generic Interrupt Controller v3 and later (VGICv3) ============================================================== Device types supported: - KVM_DEV_TYPE_ARM_VGIC_V3 ARM Generic Interrupt Controller v3.0 + - KVM_DEV_TYPE_ARM_VGIC_V3 ARM Generic Interrupt Controller v3.0 Only one VGIC instance may be instantiated through this API. The created VGIC will act as the VM interrupt controller, requiring emulated user-space devices @@ -15,7 +18,8 @@ Creating a guest GICv3 device requires a host GICv3 as well. Groups: KVM_DEV_ARM_VGIC_GRP_ADDR - Attributes: + Attributes: + KVM_VGIC_V3_ADDR_TYPE_DIST (rw, 64-bit) Base address in the guest physical address space of the GICv3 distributor register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3. @@ -29,21 +33,25 @@ Groups: This address needs to be 64K aligned. KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION (rw, 64-bit) - The attribute data pointed to by kvm_device_attr.addr is a __u64 value: - bits: | 63 .... 52 | 51 .... 16 | 15 - 12 |11 - 0 - values: | count | base | flags | index + The attribute data pointed to by kvm_device_attr.addr is a __u64 value:: + + bits: | 63 .... 52 | 51 .... 16 | 15 - 12 |11 - 0 + values: | count | base | flags | index + - index encodes the unique redistributor region index - flags: reserved for future use, currently 0 - base field encodes bits [51:16] of the guest physical base address of the first redistributor in the region. - count encodes the number of redistributors in the region. Must be greater than 0. + There are two 64K pages for each redistributor in the region and redistributors are laid out contiguously within the region. Regions are filled with redistributors in the index order. The sum of all region count fields must be greater than or equal to the number of VCPUs. Redistributor regions must be registered in the incremental index order, starting from index 0. 
+ The characteristics of a specific redistributor region can be read by presetting the index field in the attr data. Only valid for KVM_DEV_TYPE_ARM_VGIC_V3. @@ -52,23 +60,27 @@ Groups: KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attributes. Errors: - -E2BIG: Address outside of addressable IPA range - -EINVAL: Incorrectly aligned address, bad redistributor region + + ======= ============================================================= + -E2BIG Address outside of addressable IPA range + -EINVAL Incorrectly aligned address, bad redistributor region count/index, mixed redistributor region attribute usage - -EEXIST: Address already configured - -ENOENT: Attempt to read the characteristics of a non existing + -EEXIST Address already configured + -ENOENT Attempt to read the characteristics of a non existing redistributor region - -ENXIO: The group or attribute is unknown/unsupported for this device + -ENXIO The group or attribute is unknown/unsupported for this device or hardware support is missing. - -EFAULT: Invalid user pointer for attr->addr. + -EFAULT Invalid user pointer for attr->addr. + ======= ============================================================= + + KVM_DEV_ARM_VGIC_GRP_DIST_REGS, KVM_DEV_ARM_VGIC_GRP_REDIST_REGS + Attributes: - KVM_DEV_ARM_VGIC_GRP_DIST_REGS - KVM_DEV_ARM_VGIC_GRP_REDIST_REGS - Attributes: - The attr field of kvm_device_attr encodes two values: - bits: | 63 .... 32 | 31 .... 0 | - values: | mpidr | offset | + The attr field of kvm_device_attr encodes two values:: + + bits: | 63 .... 32 | 31 .... 0 | + values: | mpidr | offset | All distributor regs are (rw, 32-bit) and kvm_device_attr.addr points to a __u32 value. 64-bit registers must be accessed by separately accessing the @@ -93,7 +105,8 @@ Groups: redistributor is accessed. The mpidr is ignored for the distributor. 
The mpidr encoding is based on the affinity information in the - architecture defined MPIDR, and the field is encoded as follows: + architecture defined MPIDR, and the field is encoded as follows:: + | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 | | Aff3 | Aff2 | Aff1 | Aff0 | @@ -148,24 +161,30 @@ Groups: ignored. Errors: - -ENXIO: Getting or setting this register is not yet supported - -EBUSY: One or more VCPUs are running + + ====== ===================================================== + -ENXIO Getting or setting this register is not yet supported + -EBUSY One or more VCPUs are running + ====== ===================================================== KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS - Attributes: - The attr field of kvm_device_attr encodes two values: - bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 | - values: | mpidr | RES | instr | + Attributes: + + The attr field of kvm_device_attr encodes two values:: + + bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 | + values: | mpidr | RES | instr | The mpidr field encodes the CPU ID based on the affinity information in the - architecture defined MPIDR, and the field is encoded as follows: + architecture defined MPIDR, and the field is encoded as follows:: + | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 | | Aff3 | Aff2 | Aff1 | Aff0 | The instr field encodes the system register to access based on the fields defined in the A64 instruction set encoding for system register access - (RES means the bits are reserved for future use and should be zero): + (RES means the bits are reserved for future use and should be zero):: | 15 ... 14 | 13 ... 11 | 10 ... 7 | 6 ... 3 | 2 ... 0 | | Op 0 | Op1 | CRn | CRm | Op2 | @@ -178,26 +197,35 @@ Groups: CPU interface registers access is not implemented for AArch32 mode. Error -ENXIO is returned when accessed in AArch32 mode. 
+ Errors: - -ENXIO: Getting or setting this register is not yet supported - -EBUSY: VCPU is running - -EINVAL: Invalid mpidr or register value supplied + + ======= ===================================================== + -ENXIO Getting or setting this register is not yet supported + -EBUSY VCPU is running + -EINVAL Invalid mpidr or register value supplied + ======= ===================================================== KVM_DEV_ARM_VGIC_GRP_NR_IRQS - Attributes: + Attributes: + A value describing the number of interrupts (SGI, PPI and SPI) for this GIC instance, ranging from 64 to 1024, in increments of 32. kvm_device_attr.addr points to a __u32 value. Errors: - -EINVAL: Value set is out of the expected range - -EBUSY: Value has already be set. + + ======= ====================================== + -EINVAL Value set is out of the expected range + -EBUSY Value has already be set. + ======= ====================================== KVM_DEV_ARM_VGIC_GRP_CTRL - Attributes: + Attributes: + KVM_DEV_ARM_VGIC_CTRL_INIT request the initialization of the VGIC, no additional parameter in kvm_device_attr.addr. @@ -205,20 +233,26 @@ Groups: save all LPI pending bits into guest RAM pending tables. The first kB of the pending table is not altered by this operation. 
+ Errors: - -ENXIO: VGIC not properly configured as required prior to calling - this attribute - -ENODEV: no online VCPU - -ENOMEM: memory shortage when allocating vgic internal data - -EFAULT: Invalid guest ram access - -EBUSY: One or more VCPUS are running + + ======= ======================================================== + -ENXIO VGIC not properly configured as required prior to calling + this attribute + -ENODEV no online VCPU + -ENOMEM memory shortage when allocating vgic internal data + -EFAULT Invalid guest ram access + -EBUSY One or more VCPUS are running + ======= ======================================================== KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO - Attributes: - The attr field of kvm_device_attr encodes the following values: - bits: | 63 .... 32 | 31 .... 10 | 9 .... 0 | - values: | mpidr | info | vINTID | + Attributes: + + The attr field of kvm_device_attr encodes the following values:: + + bits: | 63 .... 32 | 31 .... 10 | 9 .... 0 | + values: | mpidr | info | vINTID | The vINTID specifies which set of IRQs is reported on. @@ -228,6 +262,7 @@ Groups: VGIC_LEVEL_INFO_LINE_LEVEL: Get/Set the input level of the IRQ line for a set of 32 contiguously numbered interrupts. + vINTID must be a multiple of 32. kvm_device_attr.addr points to a __u32 value which will contain a @@ -243,9 +278,14 @@ Groups: reported with the same value regardless of the mpidr specified. The mpidr field encodes the CPU ID based on the affinity information in the - architecture defined MPIDR, and the field is encoded as follows: + architecture defined MPIDR, and the field is encoded as follows:: + | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 
32 | | Aff3 | Aff2 | Aff1 | Aff0 | + Errors: - -EINVAL: vINTID is not multiple of 32 or - info field is not VGIC_LEVEL_INFO_LINE_LEVEL + + ======= ============================================= + -EINVAL vINTID is not multiple of 32 or info field is + not VGIC_LEVEL_INFO_LINE_LEVEL + ======= ============================================= diff --git a/Documentation/virt/kvm/devices/arm-vgic.txt b/Documentation/virt/kvm/devices/arm-vgic.rst index 97b6518148f8..40bdeea1d86e 100644 --- a/Documentation/virt/kvm/devices/arm-vgic.txt +++ b/Documentation/virt/kvm/devices/arm-vgic.rst @@ -1,8 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================== ARM Virtual Generic Interrupt Controller v2 (VGIC) ================================================== Device types supported: - KVM_DEV_TYPE_ARM_VGIC_V2 ARM Generic Interrupt Controller v2.0 + + - KVM_DEV_TYPE_ARM_VGIC_V2 ARM Generic Interrupt Controller v2.0 Only one VGIC instance may be instantiated through either this API or the legacy KVM_CREATE_IRQCHIP API. The created VGIC will act as the VM interrupt @@ -17,7 +21,8 @@ create both a GICv3 and GICv2 device on the same VM. Groups: KVM_DEV_ARM_VGIC_GRP_ADDR - Attributes: + Attributes: + KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit) Base address in the guest physical address space of the GIC distributor register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2. @@ -27,19 +32,25 @@ Groups: Base address in the guest physical address space of the GIC virtual cpu interface register mappings. Only valid for KVM_DEV_TYPE_ARM_VGIC_V2. This address needs to be 4K aligned and the region covers 4 KByte. 
+ Errors: - -E2BIG: Address outside of addressable IPA range - -EINVAL: Incorrectly aligned address - -EEXIST: Address already configured - -ENXIO: The group or attribute is unknown/unsupported for this device + + ======= ============================================================= + -E2BIG Address outside of addressable IPA range + -EINVAL Incorrectly aligned address + -EEXIST Address already configured + -ENXIO The group or attribute is unknown/unsupported for this device or hardware support is missing. - -EFAULT: Invalid user pointer for attr->addr. + -EFAULT Invalid user pointer for attr->addr. + ======= ============================================================= KVM_DEV_ARM_VGIC_GRP_DIST_REGS - Attributes: - The attr field of kvm_device_attr encodes two values: - bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | - values: | reserved | vcpu_index | offset | + Attributes: + + The attr field of kvm_device_attr encodes two values:: + + bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | + values: | reserved | vcpu_index | offset | All distributor regs are (rw, 32-bit) @@ -58,16 +69,22 @@ Groups: KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS) to ensure the expected behavior. Unless GICD_IIDR has been set from userspace, writes to the interrupt group registers (GICD_IGROUPR) are ignored. + Errors: - -ENXIO: Getting or setting this register is not yet supported - -EBUSY: One or more VCPUs are running - -EINVAL: Invalid vcpu_index supplied + + ======= ===================================================== + -ENXIO Getting or setting this register is not yet supported + -EBUSY One or more VCPUs are running + -EINVAL Invalid vcpu_index supplied + ======= ===================================================== KVM_DEV_ARM_VGIC_GRP_CPU_REGS - Attributes: - The attr field of kvm_device_attr encodes two values: - bits: | 63 .... 40 | 39 .. 32 | 31 .... 
0 | - values: | reserved | vcpu_index | offset | + Attributes: + + The attr field of kvm_device_attr encodes two values:: + + bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | + values: | reserved | vcpu_index | offset | All CPU interface regs are (rw, 32-bit) @@ -101,27 +118,39 @@ Groups: value left by 3 places to obtain the actual priority mask level. Errors: - -ENXIO: Getting or setting this register is not yet supported - -EBUSY: One or more VCPUs are running - -EINVAL: Invalid vcpu_index supplied + + ======= ===================================================== + -ENXIO Getting or setting this register is not yet supported + -EBUSY One or more VCPUs are running + -EINVAL Invalid vcpu_index supplied + ======= ===================================================== KVM_DEV_ARM_VGIC_GRP_NR_IRQS - Attributes: + Attributes: + A value describing the number of interrupts (SGI, PPI and SPI) for this GIC instance, ranging from 64 to 1024, in increments of 32. Errors: - -EINVAL: Value set is out of the expected range - -EBUSY: Value has already be set, or GIC has already been initialized - with default values. + + ======= ============================================================= + -EINVAL Value set is out of the expected range + -EBUSY Value has already be set, or GIC has already been initialized + with default values. + ======= ============================================================= KVM_DEV_ARM_VGIC_GRP_CTRL - Attributes: + Attributes: + KVM_DEV_ARM_VGIC_CTRL_INIT request the initialization of the VGIC or ITS, no additional parameter in kvm_device_attr.addr. 
+ Errors: - -ENXIO: VGIC not properly configured as required prior to calling - this attribute - -ENODEV: no online VCPU - -ENOMEM: memory shortage when allocating vgic internal data + + ======= ========================================================= + -ENXIO VGIC not properly configured as required prior to calling + this attribute + -ENODEV no online VCPU + -ENOMEM memory shortage when allocating vgic internal data + ======= ========================================================= diff --git a/Documentation/virt/kvm/devices/index.rst b/Documentation/virt/kvm/devices/index.rst new file mode 100644 index 000000000000..192cda7405c8 --- /dev/null +++ b/Documentation/virt/kvm/devices/index.rst @@ -0,0 +1,19 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +Devices +======= + +.. toctree:: + :maxdepth: 2 + + arm-vgic-its + arm-vgic + arm-vgic-v3 + mpic + s390_flic + vcpu + vfio + vm + xics + xive diff --git a/Documentation/virt/kvm/devices/mpic.txt b/Documentation/virt/kvm/devices/mpic.rst index 8257397adc3c..55cefe030d41 100644 --- a/Documentation/virt/kvm/devices/mpic.txt +++ b/Documentation/virt/kvm/devices/mpic.rst @@ -1,9 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= MPIC interrupt controller ========================= Device types supported: - KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0 - KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2 + + - KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0 + - KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2 Only one MPIC instance, of any type, may be instantiated. The created MPIC will act as the system interrupt controller, connecting to each @@ -11,7 +15,8 @@ vcpu's interrupt inputs. Groups: KVM_DEV_MPIC_GRP_MISC - Attributes: + Attributes: + KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit) Base address of the 256 KiB MPIC register space. Must be naturally aligned. A value of zero disables the mapping. 
diff --git a/Documentation/virt/kvm/devices/s390_flic.txt b/Documentation/virt/kvm/devices/s390_flic.rst index a4e20a090174..954190da7d04 100644 --- a/Documentation/virt/kvm/devices/s390_flic.txt +++ b/Documentation/virt/kvm/devices/s390_flic.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================== FLIC (floating interrupt controller) ==================================== @@ -31,8 +34,10 @@ Groups: Copies all floating interrupts into a buffer provided by userspace. When the buffer is too small it returns -ENOMEM, which is the indication for userspace to try again with a bigger buffer. + -ENOBUFS is returned when the allocation of a kernelspace buffer has failed. + -EFAULT is returned when copying data to userspace failed. All interrupts remain pending, i.e. are not deleted from the list of currently pending interrupts. @@ -60,38 +65,41 @@ Groups: KVM_DEV_FLIC_ADAPTER_REGISTER Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter - describing the adapter to register: + describing the adapter to register:: -struct kvm_s390_io_adapter { - __u32 id; - __u8 isc; - __u8 maskable; - __u8 swap; - __u8 flags; -}; + struct kvm_s390_io_adapter { + __u32 id; + __u8 isc; + __u8 maskable; + __u8 swap; + __u8 flags; + }; id contains the unique id for the adapter, isc the I/O interruption subclass to use, maskable whether this adapter may be masked (interrupts turned off), swap whether the indicators need to be byte swapped, and flags contains further characteristics of the adapter. + Currently defined values for 'flags' are: + - KVM_S390_ADAPTER_SUPPRESSIBLE: adapter is subject to AIS (adapter-interrupt-suppression) facility. This flag only has an effect if the AIS capability is enabled. + Unknown flag values are ignored. KVM_DEV_FLIC_ADAPTER_MODIFY Modifies attributes of an existing I/O adapter interrupt source. 
Takes - a kvm_s390_io_adapter_req specifying the adapter and the operation: + a kvm_s390_io_adapter_req specifying the adapter and the operation:: -struct kvm_s390_io_adapter_req { - __u32 id; - __u8 type; - __u8 mask; - __u16 pad0; - __u64 addr; -}; + struct kvm_s390_io_adapter_req { + __u32 id; + __u8 type; + __u8 mask; + __u16 pad0; + __u64 addr; + }; id specifies the adapter and type the operation. The supported operations are: @@ -103,8 +111,9 @@ struct kvm_s390_io_adapter_req { perform a gmap translation for the guest address provided in addr, pin a userspace page for the translated address and add it to the list of mappings - Note: A new mapping will be created unconditionally; therefore, - the calling code should avoid making duplicate mappings. + + .. note:: A new mapping will be created unconditionally; therefore, + the calling code should avoid making duplicate mappings. KVM_S390_IO_ADAPTER_UNMAP release a userspace page for the translated address specified in addr @@ -112,16 +121,17 @@ struct kvm_s390_io_adapter_req { KVM_DEV_FLIC_AISM modify the adapter-interruption-suppression mode for a given isc if the - AIS capability is enabled. Takes a kvm_s390_ais_req describing: + AIS capability is enabled. Takes a kvm_s390_ais_req describing:: -struct kvm_s390_ais_req { - __u8 isc; - __u16 mode; -}; + struct kvm_s390_ais_req { + __u8 isc; + __u16 mode; + }; isc contains the target I/O interruption subclass, mode the target adapter-interruption-suppression mode. The following modes are currently supported: + - KVM_S390_AIS_MODE_ALL: ALL-Interruptions Mode, i.e. airq injection is always allowed; - KVM_S390_AIS_MODE_SINGLE: SINGLE-Interruption Mode, i.e. airq @@ -139,12 +149,12 @@ struct kvm_s390_ais_req { KVM_DEV_FLIC_AISM_ALL Gets or sets the adapter-interruption-suppression mode for all ISCs. 
Takes - a kvm_s390_ais_all describing: + a kvm_s390_ais_all describing:: -struct kvm_s390_ais_all { - __u8 simm; /* Single-Interruption-Mode mask */ - __u8 nimm; /* No-Interruption-Mode mask * -}; + struct kvm_s390_ais_all { + __u8 simm; /* Single-Interruption-Mode mask */ + __u8 nimm; /* No-Interruption-Mode mask * + }; simm contains Single-Interruption-Mode mask for all ISCs, nimm contains No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds @@ -159,5 +169,5 @@ ENXIO, as specified in the API documentation). It is not possible to conclude that a FLIC operation is unavailable based on the error code resulting from a usage attempt. -Note: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a zero -schid is specified. +.. note:: The KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl will return EINVAL in case a + zero schid is specified. diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst new file mode 100644 index 000000000000..9963e680770a --- /dev/null +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -0,0 +1,114 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +Generic vcpu interface +====================== + +The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR, +KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct +kvm_device_attr as other devices, but targets VCPU-wide settings and controls. + +The groups and attributes per virtual cpu, if any, are architecture specific. + +1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL +================================== + +:Architectures: ARM64 + +1.1. 
ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ +--------------------------------------- + +:Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a + pointer to an int + +Returns: + + ======= ======================================================== + -EBUSY The PMU overflow interrupt is already set + -ENXIO The overflow interrupt not set when attempting to get it + -ENODEV PMUv3 not supported + -EINVAL Invalid PMU overflow interrupt number supplied or + trying to set the IRQ number without using an in-kernel + irqchip. + ======= ======================================================== + +A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt +number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt +type must be same for each vcpu. As a PPI, the interrupt number is the same for +all vcpus, while as an SPI it must be a separate number per vcpu. + +1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT +--------------------------------------- + +:Parameters: no additional parameter in kvm_device_attr.addr + +Returns: + + ======= ====================================================== + -ENODEV PMUv3 not supported or GIC not initialized + -ENXIO PMUv3 not properly configured or in-kernel irqchip not + configured as required prior to calling this attribute + -EBUSY PMUv3 already initialized + ======= ====================================================== + +Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel +virtual GIC implementation, this must be done after initializing the in-kernel +irqchip. + + +2. GROUP: KVM_ARM_VCPU_TIMER_CTRL +================================= + +:Architectures: ARM, ARM64 + +2.1. 
ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER +----------------------------------------------------------------------------- + +:Parameters: in kvm_device_attr.addr the address for the timer interrupt is a + pointer to an int + +Returns: + + ======= ================================= + -EINVAL Invalid timer interrupt number + -EBUSY One or more VCPUs has already run + ======= ================================= + +A value describing the architected timer interrupt number when connected to an +in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the +attribute overrides the default values (see below). + +============================= ========================================== +KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27) +KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30) +============================= ========================================== + +Setting the same PPI for different timers will prevent the VCPUs from running. +Setting the interrupt number on a VCPU configures all VCPUs created at that +time to use the number provided for a given timer, overwriting any previously +configured values on other VCPUs. Userspace should configure the interrupt +numbers on at least one VCPU after creating all VCPUs and before running any +VCPUs. + +3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL +================================== + +:Architectures: ARM64 + +3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA +-------------------------------------- + +:Parameters: 64-bit base address + +Returns: + + ======= ====================================== + -ENXIO Stolen time not implemented + -EEXIST Base address already set for this VCPU + -EINVAL Base address not 64 byte aligned + ======= ====================================== + +Specifies the base address of the stolen time structure for this VCPU. The +base address must be 64 byte aligned and exist within a valid guest memory +region. 
See Documentation/virt/kvm/arm/pvtime.txt for more information +including the layout of the stolen time structure. diff --git a/Documentation/virt/kvm/devices/vcpu.txt b/Documentation/virt/kvm/devices/vcpu.txt deleted file mode 100644 index 6f3bd64a05b0..000000000000 --- a/Documentation/virt/kvm/devices/vcpu.txt +++ /dev/null @@ -1,76 +0,0 @@ -Generic vcpu interface -==================================== - -The virtual cpu "device" also accepts the ioctls KVM_SET_DEVICE_ATTR, -KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same struct -kvm_device_attr as other devices, but targets VCPU-wide settings and controls. - -The groups and attributes per virtual cpu, if any, are architecture specific. - -1. GROUP: KVM_ARM_VCPU_PMU_V3_CTRL -Architectures: ARM64 - -1.1. ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_IRQ -Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a - pointer to an int -Returns: -EBUSY: The PMU overflow interrupt is already set - -ENXIO: The overflow interrupt not set when attempting to get it - -ENODEV: PMUv3 not supported - -EINVAL: Invalid PMU overflow interrupt number supplied or - trying to set the IRQ number without using an in-kernel - irqchip. - -A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt -number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt -type must be same for each vcpu. As a PPI, the interrupt number is the same for -all vcpus, while as an SPI it must be a separate number per vcpu. - -1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT -Parameters: no additional parameter in kvm_device_attr.addr -Returns: -ENODEV: PMUv3 not supported or GIC not initialized - -ENXIO: PMUv3 not properly configured or in-kernel irqchip not - configured as required prior to calling this attribute - -EBUSY: PMUv3 already initialized - -Request the initialization of the PMUv3. 
If using the PMUv3 with an in-kernel -virtual GIC implementation, this must be done after initializing the in-kernel -irqchip. - - -2. GROUP: KVM_ARM_VCPU_TIMER_CTRL -Architectures: ARM,ARM64 - -2.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER -2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER -Parameters: in kvm_device_attr.addr the address for the timer interrupt is a - pointer to an int -Returns: -EINVAL: Invalid timer interrupt number - -EBUSY: One or more VCPUs has already run - -A value describing the architected timer interrupt number when connected to an -in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the -attribute overrides the default values (see below). - -KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27) -KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30) - -Setting the same PPI for different timers will prevent the VCPUs from running. -Setting the interrupt number on a VCPU configures all VCPUs created at that -time to use the number provided for a given timer, overwriting any previously -configured values on other VCPUs. Userspace should configure the interrupt -numbers on at least one VCPU after creating all VCPUs and before running any -VCPUs. - -3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL -Architectures: ARM64 - -3.1 ATTRIBUTE: KVM_ARM_VCPU_PVTIME_IPA -Parameters: 64-bit base address -Returns: -ENXIO: Stolen time not implemented - -EEXIST: Base address already set for this VCPU - -EINVAL: Base address not 64 byte aligned - -Specifies the base address of the stolen time structure for this VCPU. The -base address must be 64 byte aligned and exist within a valid guest memory -region. See Documentation/virt/kvm/arm/pvtime.txt for more information -including the layout of the stolen time structure. 
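The constraints stated in the vcpu interface text above (timer interrupts must be PPIs with 16 <= intid < 32, the two timers must not share a PPI, and the stolen time base address must be 64 byte aligned) can be checked in userspace before issuing KVM_SET_DEVICE_ATTR. The following is a minimal sketch only; the helper names are hypothetical and are not part of the KVM API or of this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; none of these exist in the kernel. */

/* A timer interrupt must be a PPI: 16 <= intid < 32. */
static bool timer_intid_is_ppi(int intid)
{
	return intid >= 16 && intid < 32;
}

/*
 * Both timers need a valid PPI, and using the same PPI for both
 * timers would prevent the VCPUs from running.
 */
static bool timer_irqs_valid(int vtimer_intid, int ptimer_intid)
{
	return timer_intid_is_ppi(vtimer_intid) &&
	       timer_intid_is_ppi(ptimer_intid) &&
	       vtimer_intid != ptimer_intid;
}

/* The stolen time base address must be 64 byte aligned. */
static bool pvtime_ipa_aligned(uint64_t ipa)
{
	return (ipa & 63) == 0;
}
```

The documented defaults (27 for the virtual timer, 30 for the physical timer) pass this check. The kernel still performs its own validation and may reject values for other reasons, for example when a VCPU has already run.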
diff --git a/Documentation/virt/kvm/devices/vfio.txt b/Documentation/virt/kvm/devices/vfio.rst index 528c77c8022c..2d20dc561069 100644 --- a/Documentation/virt/kvm/devices/vfio.txt +++ b/Documentation/virt/kvm/devices/vfio.rst @@ -1,8 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== VFIO virtual device =================== Device types supported: - KVM_DEV_TYPE_VFIO + + - KVM_DEV_TYPE_VFIO Only one VFIO instance may be created per VM. The created device tracks VFIO groups in use by the VM and features of those groups @@ -23,14 +27,15 @@ KVM_DEV_VFIO_GROUP attributes: for the VFIO group. KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table allocated by sPAPR KVM. - kvm_device_attr.addr points to a struct: + kvm_device_attr.addr points to a struct:: + + struct kvm_vfio_spapr_tce { + __s32 groupfd; + __s32 tablefd; + }; - struct kvm_vfio_spapr_tce { - __s32 groupfd; - __s32 tablefd; - }; + where: - where - @groupfd is a file descriptor for a VFIO group; - @tablefd is a file descriptor for a TCE table allocated via - KVM_CREATE_SPAPR_TCE. + - @groupfd is a file descriptor for a VFIO group; + - @tablefd is a file descriptor for a TCE table allocated via + KVM_CREATE_SPAPR_TCE. diff --git a/Documentation/virt/kvm/devices/vm.txt b/Documentation/virt/kvm/devices/vm.rst index 4ffb82b02468..0aa5b1cfd700 100644 --- a/Documentation/virt/kvm/devices/vm.txt +++ b/Documentation/virt/kvm/devices/vm.rst @@ -1,5 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== Generic vm interface -==================================== +==================== The virtual machine "device" also accepts the ioctls KVM_SET_DEVICE_ATTR, KVM_GET_DEVICE_ATTR, and KVM_HAS_DEVICE_ATTR. The interface uses the same @@ -10,30 +13,38 @@ The groups and attributes per virtual machine, if any, are architecture specific. 1. GROUP: KVM_S390_VM_MEM_CTRL -Architectures: s390 +============================== + +:Architectures: s390 1.1. 
ATTRIBUTE: KVM_S390_VM_MEM_ENABLE_CMMA -Parameters: none -Returns: -EBUSY if a vcpu is already defined, otherwise 0 +------------------------------------------- + +:Parameters: none +:Returns: -EBUSY if a vcpu is already defined, otherwise 0 Enables Collaborative Memory Management Assist (CMMA) for the virtual machine. 1.2. ATTRIBUTE: KVM_S390_VM_MEM_CLR_CMMA -Parameters: none -Returns: -EINVAL if CMMA was not enabled - 0 otherwise +---------------------------------------- + +:Parameters: none +:Returns: -EINVAL if CMMA was not enabled; + 0 otherwise Clear the CMMA status for all guest pages, so any pages the guest marked as unused are again used any may not be reclaimed by the host. 1.3. ATTRIBUTE KVM_S390_VM_MEM_LIMIT_SIZE -Parameters: in attr->addr the address for the new limit of guest memory -Returns: -EFAULT if the given address is not accessible - -EINVAL if the virtual machine is of type UCONTROL - -E2BIG if the given guest memory is to big for that machine - -EBUSY if a vcpu is already defined - -ENOMEM if not enough memory is available for a new shadow guest mapping - 0 otherwise +----------------------------------------- + +:Parameters: in attr->addr the address for the new limit of guest memory +:Returns: -EFAULT if the given address is not accessible; + -EINVAL if the virtual machine is of type UCONTROL; + -E2BIG if the given guest memory is to big for that machine; + -EBUSY if a vcpu is already defined; + -ENOMEM if not enough memory is available for a new shadow guest mapping; + 0 otherwise. Allows userspace to query the actual limit and set a new limit for the maximum guest memory size. The limit will be rounded up to @@ -42,78 +53,92 @@ the number of page table levels. In the case that there is no limit we will set the limit to KVM_S390_NO_MEM_LIMIT (U64_MAX). 2. GROUP: KVM_S390_VM_CPU_MODEL -Architectures: s390 +=============================== + +:Architectures: s390 2.1. 
ATTRIBUTE: KVM_S390_VM_CPU_MACHINE (r/o) +--------------------------------------------- -Allows user space to retrieve machine and kvm specific cpu related information: +Allows user space to retrieve machine and kvm specific cpu related information:: -struct kvm_s390_vm_cpu_machine { + struct kvm_s390_vm_cpu_machine { __u64 cpuid; # CPUID of host __u32 ibc; # IBC level range offered by host __u8 pad[4]; __u64 fac_mask[256]; # set of cpu facilities enabled by KVM __u64 fac_list[256]; # set of cpu facilities offered by host -} + } -Parameters: address of buffer to store the machine related cpu data - of type struct kvm_s390_vm_cpu_machine* -Returns: -EFAULT if the given address is not accessible from kernel space - -ENOMEM if not enough memory is available to process the ioctl - 0 in case of success +:Parameters: address of buffer to store the machine related cpu data + of type struct kvm_s390_vm_cpu_machine* +:Returns: -EFAULT if the given address is not accessible from kernel space; + -ENOMEM if not enough memory is available to process the ioctl; + 0 in case of success. 2.2. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR (r/w) +=============================================== -Allows user space to retrieve or request to change cpu related information for a vcpu: +Allows user space to retrieve or request to change cpu related information for a vcpu:: -struct kvm_s390_vm_cpu_processor { + struct kvm_s390_vm_cpu_processor { __u64 cpuid; # CPUID currently (to be) used by this vcpu __u16 ibc; # IBC level currently (to be) used by this vcpu __u8 pad[6]; __u64 fac_list[256]; # set of cpu facilities currently (to be) used - # by this vcpu -} + # by this vcpu + } KVM does not enforce or limit the cpu model data in any form. Take the information retrieved by means of KVM_S390_VM_CPU_MACHINE as hint for reasonable configuration setups. Instruction interceptions triggered by additionally set facility bits that are not handled by KVM need to by imlemented in the VM driver code. 
-Parameters: address of buffer to store/set the processor related cpu - data of type struct kvm_s390_vm_cpu_processor*. -Returns: -EBUSY in case 1 or more vcpus are already activated (only in write case) - -EFAULT if the given address is not accessible from kernel space - -ENOMEM if not enough memory is available to process the ioctl - 0 in case of success +:Parameters: address of buffer to store/set the processor related cpu + data of type struct kvm_s390_vm_cpu_processor*. +:Returns: -EBUSY in case 1 or more vcpus are already activated (only in write case); + -EFAULT if the given address is not accessible from kernel space; + -ENOMEM if not enough memory is available to process the ioctl; + 0 in case of success. + +.. _KVM_S390_VM_CPU_MACHINE_FEAT: 2.3. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_FEAT (r/o) +-------------------------------------------------- Allows user space to retrieve available cpu features. A feature is available if provided by the hardware and supported by kvm. In theory, cpu features could even be completely emulated by kvm. -struct kvm_s390_vm_cpu_feat { - __u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering -}; +:: -Parameters: address of a buffer to load the feature list from. -Returns: -EFAULT if the given address is not accessible from kernel space. - 0 in case of success. + struct kvm_s390_vm_cpu_feat { + __u64 feat[16]; # Bitmap (1 = feature available), MSB 0 bit numbering + }; + +:Parameters: address of a buffer to load the feature list from. +:Returns: -EFAULT if the given address is not accessible from kernel space; + 0 in case of success. 2.4. ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_FEAT (r/w) +---------------------------------------------------- Allows user space to retrieve or change enabled cpu features for all VCPUs of a VM. Features that are not available cannot be enabled. -See 2.3. for a description of the parameter struct. +See :ref:`KVM_S390_VM_CPU_MACHINE_FEAT` for +a description of the parameter struct. 
-Parameters: address of a buffer to store/load the feature list from. -Returns: -EFAULT if the given address is not accessible from kernel space. - -EINVAL if a cpu feature that is not available is to be enabled. - -EBUSY if at least one VCPU has already been defined. +:Parameters: address of a buffer to store/load the feature list from. +:Returns: -EFAULT if the given address is not accessible from kernel space; + -EINVAL if a cpu feature that is not available is to be enabled; + -EBUSY if at least one VCPU has already been defined; 0 in case of success. +.. _KVM_S390_VM_CPU_MACHINE_SUBFUNC: + 2.5. ATTRIBUTE: KVM_S390_VM_CPU_MACHINE_SUBFUNC (r/o) +----------------------------------------------------- Allows user space to retrieve available cpu subfunctions without any filtering done by a set IBC. These subfunctions are indicated to the guest VCPU via @@ -126,7 +151,9 @@ contained in the returned struct. If the affected instruction indicates subfunctions via a "test bit" mechanism, the subfunction codes are contained in the returned struct in MSB 0 bit numbering. -struct kvm_s390_vm_cpu_subfunc { +:: + + struct kvm_s390_vm_cpu_subfunc { u8 plo[32]; # always valid (ESA/390 feature) u8 ptff[16]; # valid with TOD-clock steering u8 kmac[16]; # valid with Message-Security-Assist @@ -143,13 +170,14 @@ struct kvm_s390_vm_cpu_subfunc { u8 kma[16]; # valid with Message-Security-Assist-Extension 8 u8 kdsa[16]; # valid with Message-Security-Assist-Extension 9 u8 reserved[1792]; # reserved for future instructions -}; + }; -Parameters: address of a buffer to load the subfunction blocks from. -Returns: -EFAULT if the given address is not accessible from kernel space. +:Parameters: address of a buffer to load the subfunction blocks from. +:Returns: -EFAULT if the given address is not accessible from kernel space; 0 in case of success. 2.6. 
ATTRIBUTE: KVM_S390_VM_CPU_PROCESSOR_SUBFUNC (r/w) +------------------------------------------------------- Allows user space to retrieve or change cpu subfunctions to be indicated for all VCPUs of a VM. This attribute will only be available if kernel and @@ -164,107 +192,125 @@ As long as no data has been written, a read will fail. The IBC will be used to determine available subfunctions in this case, this will guarantee backward compatibility. -See 2.5. for a description of the parameter struct. +See :ref:`KVM_S390_VM_CPU_MACHINE_SUBFUNC` for a +description of the parameter struct. -Parameters: address of a buffer to store/load the subfunction blocks from. -Returns: -EFAULT if the given address is not accessible from kernel space. - -EINVAL when reading, if there was no write yet. - -EBUSY if at least one VCPU has already been defined. +:Parameters: address of a buffer to store/load the subfunction blocks from. +:Returns: -EFAULT if the given address is not accessible from kernel space; + -EINVAL when reading, if there was no write yet; + -EBUSY if at least one VCPU has already been defined; 0 in case of success. 3. GROUP: KVM_S390_VM_TOD -Architectures: s390 +========================= + +:Architectures: s390 3.1. ATTRIBUTE: KVM_S390_VM_TOD_HIGH +------------------------------------ Allows user space to set/get the TOD clock extension (u8) (superseded by KVM_S390_VM_TOD_EXT). -Parameters: address of a buffer in user space to store the data (u8) to -Returns: -EFAULT if the given address is not accessible from kernel space +:Parameters: address of a buffer in user space to store the data (u8) to +:Returns: -EFAULT if the given address is not accessible from kernel space; -EINVAL if setting the TOD clock extension to != 0 is not supported 3.2. ATTRIBUTE: KVM_S390_VM_TOD_LOW +----------------------------------- Allows user space to set/get bits 0-63 of the TOD clock register as defined in the POP (u64). 
-Parameters: address of a buffer in user space to store the data (u64) to -Returns: -EFAULT if the given address is not accessible from kernel space +:Parameters: address of a buffer in user space to store the data (u64) to +:Returns: -EFAULT if the given address is not accessible from kernel space 3.3. ATTRIBUTE: KVM_S390_VM_TOD_EXT +----------------------------------- + Allows user space to set/get bits 0-63 of the TOD clock register as defined in the POP (u64). If the guest CPU model supports the TOD clock extension (u8), it also allows user space to get/set it. If the guest CPU model does not support it, it is stored as 0 and not allowed to be set to a value != 0. -Parameters: address of a buffer in user space to store the data - (kvm_s390_vm_tod_clock) to -Returns: -EFAULT if the given address is not accessible from kernel space +:Parameters: address of a buffer in user space to store the data + (kvm_s390_vm_tod_clock) to +:Returns: -EFAULT if the given address is not accessible from kernel space; -EINVAL if setting the TOD clock extension to != 0 is not supported 4. GROUP: KVM_S390_VM_CRYPTO -Architectures: s390 +============================ + +:Architectures: s390 4.1. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_AES_KW (w/o) +------------------------------------------------------ Allows user space to enable aes key wrapping, including generating a new wrapping key. -Parameters: none -Returns: 0 +:Parameters: none +:Returns: 0 4.2. ATTRIBUTE: KVM_S390_VM_CRYPTO_ENABLE_DEA_KW (w/o) +------------------------------------------------------ Allows user space to enable dea key wrapping, including generating a new wrapping key. -Parameters: none -Returns: 0 +:Parameters: none +:Returns: 0 4.3. ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_AES_KW (w/o) +------------------------------------------------------- Allows user space to disable aes key wrapping, clearing the wrapping key. -Parameters: none -Returns: 0 +:Parameters: none +:Returns: 0 4.4. 
ATTRIBUTE: KVM_S390_VM_CRYPTO_DISABLE_DEA_KW (w/o) +------------------------------------------------------- Allows user space to disable dea key wrapping, clearing the wrapping key. -Parameters: none -Returns: 0 +:Parameters: none +:Returns: 0 5. GROUP: KVM_S390_VM_MIGRATION -Architectures: s390 +=============================== + +:Architectures: s390 5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o) +------------------------------------------------ Allows userspace to stop migration mode, needed for PGSTE migration. Setting this attribute when migration mode is not active will have no effects. -Parameters: none -Returns: 0 +:Parameters: none +:Returns: 0 5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o) +------------------------------------------------- Allows userspace to start migration mode, needed for PGSTE migration. Setting this attribute when migration mode is already active will have no effects. -Parameters: none -Returns: -ENOMEM if there is not enough free memory to start migration mode - -EINVAL if the state of the VM is invalid (e.g. no memory defined) +:Parameters: none +:Returns: -ENOMEM if there is not enough free memory to start migration mode; + -EINVAL if the state of the VM is invalid (e.g. no memory defined); 0 in case of success. 5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o) +-------------------------------------------------- Allows userspace to query the status of migration mode. -Parameters: address of a buffer in user space to store the data (u64) to; - the data itself is either 0 if migration mode is disabled or 1 - if it is enabled -Returns: -EFAULT if the given address is not accessible from kernel space +:Parameters: address of a buffer in user space to store the data (u64) to; + the data itself is either 0 if migration mode is disabled or 1 + if it is enabled +:Returns: -EFAULT if the given address is not accessible from kernel space; 0 in case of success. 
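The feature bitmap in struct kvm_s390_vm_cpu_feat above uses MSB 0 bit numbering, so feature number 0 is the most significant bit of feat[0]. The following sketch shows one way to test and set a feature bit under that convention. It assumes each array element is read as a native 64-bit value, as it would be on s390; the helper names and the FEAT_WORDS constant are hypothetical and only mirror the `__u64 feat[16]` declaration.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEAT_WORDS 16	/* mirrors __u64 feat[16]; name is hypothetical */

/* Test feature bit nr in MSB 0 numbering (bit 0 = MSB of feat[0]). */
static bool feat_test_msb0(const uint64_t feat[FEAT_WORDS], unsigned int nr)
{
	if (nr >= FEAT_WORDS * 64)
		return false;
	return (feat[nr / 64] >> (63 - (nr % 64))) & 1;
}

/* Set feature bit nr in MSB 0 numbering; out-of-range numbers are ignored. */
static void feat_set_msb0(uint64_t feat[FEAT_WORDS], unsigned int nr)
{
	if (nr < FEAT_WORDS * 64)
		feat[nr / 64] |= UINT64_C(1) << (63 - (nr % 64));
}
```

A buffer filled this way could then be passed via kvm_device_attr.addr for KVM_S390_VM_CPU_PROCESSOR_FEAT; only features reported as available by KVM_S390_VM_CPU_MACHINE_FEAT can be enabled.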
diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.rst index 423332dda7bc..2d6927e0b776 100644 --- a/Documentation/virt/kvm/devices/xics.txt +++ b/Documentation/virt/kvm/devices/xics.rst @@ -1,20 +1,31 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= XICS interrupt controller +========================= Device type supported: KVM_DEV_TYPE_XICS Groups: 1. KVM_DEV_XICS_GRP_SOURCES - Attributes: One per interrupt source, indexed by the source number. + Attributes: + One per interrupt source, indexed by the source number. 2. KVM_DEV_XICS_GRP_CTRL - Attributes: - 2.1 KVM_DEV_XICS_NR_SERVERS (write only) + Attributes: + + 2.1 KVM_DEV_XICS_NR_SERVERS (write only) + The kvm_device_attr.addr points to a __u32 value which is the number of interrupt server numbers (ie, highest possible vcpu id plus one). + Errors: - -EINVAL: Value greater than KVM_MAX_VCPU_ID. - -EFAULT: Invalid user pointer for attr->addr. - -EBUSY: A vcpu is already connected to the device. + + ======= ========================================== + -EINVAL Value greater than KVM_MAX_VCPU_ID. + -EFAULT Invalid user pointer for attr->addr. + -EBUSY A vcpu is already connected to the device. + ======= ========================================== This device emulates the XICS (eXternal Interrupt Controller Specification) defined in PAPR. The XICS has a set of interrupt @@ -53,24 +64,29 @@ the interrupt source number. The 64 bit state word has the following bitfields, starting from the least-significant end of the word: * Destination (server number), 32 bits + This specifies where the interrupt should be sent, and is the interrupt server number specified for the destination vcpu. * Priority, 8 bits + This is the priority specified for this interrupt source, where 0 is the highest priority and 255 is the lowest. An interrupt with a priority of 255 will never be delivered. 
* Level sensitive flag, 1 bit + This bit is 1 for a level-sensitive interrupt source, or 0 for edge-sensitive (or MSI). * Masked flag, 1 bit + This bit is set to 1 if the interrupt is masked (cannot be delivered regardless of its priority), for example by the ibm,int-off RTAS call, or 0 if it is not masked. * Pending flag, 1 bit + This bit is 1 if the source has a pending interrupt, otherwise 0. Only one XICS instance may be created per VM. diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.rst index f5d1d6b5af61..8bdf3dc38f01 100644 --- a/Documentation/virt/kvm/devices/xive.txt +++ b/Documentation/virt/kvm/devices/xive.rst @@ -1,8 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================================== POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1) -========================================================== +=========================================================== Device types supported: - KVM_DEV_TYPE_XIVE POWER9 XIVE Interrupt Controller generation 1 + - KVM_DEV_TYPE_XIVE POWER9 XIVE Interrupt Controller generation 1 This device acts as a VM interrupt controller. It provides the KVM interface to configure the interrupt sources of a VM in the underlying @@ -64,72 +67,100 @@ the legacy interrupt mode, referred as XICS (POWER7/8). * Groups: - 1. KVM_DEV_XIVE_GRP_CTRL - Provides global controls on the device +1. KVM_DEV_XIVE_GRP_CTRL + Provides global controls on the device + Attributes: 1.1 KVM_DEV_XIVE_RESET (write only) Resets the interrupt controller configuration for sources and event queues. To be used by kexec and kdump. + Errors: none 1.2 KVM_DEV_XIVE_EQ_SYNC (write only) Sync all the sources and queues and mark the EQ pages dirty. This to make sure that a consistent memory state is captured when migrating the VM. 
+ Errors: none 1.3 KVM_DEV_XIVE_NR_SERVERS (write only) The kvm_device_attr.addr points to a __u32 value which is the number of interrupt server numbers (ie, highest possible vcpu id plus one). + Errors: - -EINVAL: Value greater than KVM_MAX_VCPU_ID. - -EFAULT: Invalid user pointer for attr->addr. - -EBUSY: A vCPU is already connected to the device. - 2. KVM_DEV_XIVE_GRP_SOURCE (write only) - Initializes a new source in the XIVE device and mask it. + ======= ========================================== + -EINVAL Value greater than KVM_MAX_VCPU_ID. + -EFAULT Invalid user pointer for attr->addr. + -EBUSY A vCPU is already connected to the device. + ======= ========================================== + +2. KVM_DEV_XIVE_GRP_SOURCE (write only) + Initializes a new source in the XIVE device and mask it. + Attributes: Interrupt source number (64-bit) - The kvm_device_attr.addr points to a __u64 value: - bits: | 63 .... 2 | 1 | 0 - values: | unused | level | type + + The kvm_device_attr.addr points to a __u64 value:: + + bits: | 63 .... 2 | 1 | 0 + values: | unused | level | type + - type: 0:MSI 1:LSI - level: assertion level in case of an LSI. + Errors: - -E2BIG: Interrupt source number is out of range - -ENOMEM: Could not create a new source block - -EFAULT: Invalid user pointer for attr->addr. - -ENXIO: Could not allocate underlying HW interrupt - 3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only) - Configures source targeting + ======= ========================================== + -E2BIG Interrupt source number is out of range + -ENOMEM Could not create a new source block + -EFAULT Invalid user pointer for attr->addr. + -ENXIO Could not allocate underlying HW interrupt + ======= ========================================== + +3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only) + Configures source targeting + Attributes: Interrupt source number (64-bit) - The kvm_device_attr.addr points to a __u64 value: - bits: | 63 .... 33 | 32 | 31 .. 3 | 2 .. 
0 - values: | eisn | mask | server | priority + + The kvm_device_attr.addr points to a __u64 value:: + + bits: | 63 .... 33 | 32 | 31 .. 3 | 2 .. 0 + values: | eisn | mask | server | priority + - priority: 0-7 interrupt priority level - server: CPU number chosen to handle the interrupt - mask: mask flag (unused) - eisn: Effective Interrupt Source Number + Errors: - -ENOENT: Unknown source number - -EINVAL: Not initialized source number - -EINVAL: Invalid priority - -EINVAL: Invalid CPU number. - -EFAULT: Invalid user pointer for attr->addr. - -ENXIO: CPU event queues not configured or configuration of the - underlying HW interrupt failed - -EBUSY: No CPU available to serve interrupt - - 4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write) - Configures an event queue of a CPU + + ======= ======================================================= + -ENOENT Unknown source number + -EINVAL Not initialized source number + -EINVAL Invalid priority + -EINVAL Invalid CPU number. + -EFAULT Invalid user pointer for attr->addr. + -ENXIO CPU event queues not configured or configuration of the + underlying HW interrupt failed + -EBUSY No CPU available to serve interrupt + ======= ======================================================= + +4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write) + Configures an event queue of a CPU + Attributes: EQ descriptor identifier (64-bit) - The EQ descriptor identifier is a tuple (server, priority) : - bits: | 63 .... 32 | 31 .. 3 | 2 .. 0 - values: | unused | server | priority - The kvm_device_attr.addr points to : + + The EQ descriptor identifier is a tuple (server, priority):: + + bits: | 63 .... 32 | 31 .. 3 | 2 .. 0 + values: | unused | server | priority + + The kvm_device_attr.addr points to:: + struct kvm_ppc_xive_eq { __u32 flags; __u32 qshift; @@ -138,8 +169,9 @@ the legacy interrupt mode, referred as XICS (POWER7/8). 
__u32 qindex; __u8 pad[40]; }; + - flags: queue flags - KVM_XIVE_EQ_ALWAYS_NOTIFY (required) + KVM_XIVE_EQ_ALWAYS_NOTIFY (required) forces notification without using the coalescing mechanism provided by the XIVE END ESBs. - qshift: queue size (power of 2) @@ -147,22 +179,31 @@ the legacy interrupt mode, referred as XICS (POWER7/8). - qtoggle: current queue toggle bit - qindex: current queue index - pad: reserved for future use + Errors: - -ENOENT: Invalid CPU number - -EINVAL: Invalid priority - -EINVAL: Invalid flags - -EINVAL: Invalid queue size - -EINVAL: Invalid queue address - -EFAULT: Invalid user pointer for attr->addr. - -EIO: Configuration of the underlying HW failed - - 5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only) - Synchronize the source to flush event notifications + + ======= ========================================= + -ENOENT Invalid CPU number + -EINVAL Invalid priority + -EINVAL Invalid flags + -EINVAL Invalid queue size + -EINVAL Invalid queue address + -EFAULT Invalid user pointer for attr->addr. + -EIO Configuration of the underlying HW failed + ======= ========================================= + +5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only) + Synchronize the source to flush event notifications + Attributes: Interrupt source number (64-bit) + Errors: - -ENOENT: Unknown source number - -EINVAL: Not initialized source number + + ======= ============================= + -ENOENT Unknown source number + -EINVAL Not initialized source number + ======= ============================= * VCPU state @@ -175,11 +216,12 @@ the legacy interrupt mode, referred as XICS (POWER7/8). as it synthesizes the priorities of the pending interrupts. We capture a bit more to report debug information. - KVM_REG_PPC_VP_STATE (2 * 64bits) - bits: | 63 .... 32 | 31 .... 0 | - values: | TIMA word0 | TIMA word1 | - bits: | 127 .......... 64 | - values: | unused | + KVM_REG_PPC_VP_STATE (2 * 64bits):: + + bits: | 63 .... 32 | 31 .... 
0 | + values: | TIMA word0 | TIMA word1 | + bits: | 127 .......... 64 | + values: | unused | * Migration: @@ -196,7 +238,7 @@ the legacy interrupt mode, referred as XICS (POWER7/8). 3. Capture the state of the source targeting, the EQs configuration and the state of thread interrupt context registers. - Restore is similar : + Restore is similar: 1. Restore the EQ configuration. As targeting depends on it. 2. Restore targeting diff --git a/Documentation/virt/kvm/halt-polling.txt b/Documentation/virt/kvm/halt-polling.rst index 4f791b128dd2..4922e4a15f18 100644 --- a/Documentation/virt/kvm/halt-polling.txt +++ b/Documentation/virt/kvm/halt-polling.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== The KVM halt polling system =========================== @@ -68,7 +71,8 @@ steady state polling interval but will only really do a good job for wakeups which come at an approximately constant rate, otherwise there will be constant adjustment of the polling interval. -[0] total block time: the time between when the halt polling function is +[0] total block time: + the time between when the halt polling function is invoked and a wakeup source received (irrespective of whether the scheduler is invoked within that function). @@ -81,31 +85,32 @@ shrunk. These variables are defined in include/linux/kvm_host.h and as module parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the powerpc kvm-hv case. -Module Parameter | Description | Default Value --------------------------------------------------------------------------------- -halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT - | interval which defines | - | the ceiling value of the | - | polling interval for | (per arch value) - | each vcpu. 
| --------------------------------------------------------------------------------- -halt_poll_ns_grow | The value by which the | 2 - | halt polling interval is | - | multiplied in the | - | grow_halt_poll_ns() | - | function. | --------------------------------------------------------------------------------- -halt_poll_ns_grow_start | The initial value to grow | 10000 - | to from zero in the | - | grow_halt_poll_ns() | - | function. | --------------------------------------------------------------------------------- -halt_poll_ns_shrink | The value by which the | 0 - | halt polling interval is | - | divided in the | - | shrink_halt_poll_ns() | - | function. | --------------------------------------------------------------------------------- ++-----------------------+---------------------------+-------------------------+ +|Module Parameter | Description | Default Value | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT| +| | interval which defines | | +| | the ceiling value of the | | +| | polling interval for | (per arch value) | +| | each vcpu. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_grow | The value by which the | 2 | +| | halt polling interval is | | +| | multiplied in the | | +| | grow_halt_poll_ns() | | +| | function. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_grow_start| The initial value to grow | 10000 | +| | to from zero in the | | +| | grow_halt_poll_ns() | | +| | function. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_shrink | The value by which the | 0 | +| | halt polling interval is | | +| | divided in the | | +| | shrink_halt_poll_ns() | | +| | function. 
| | ++-----------------------+---------------------------+-------------------------+ These module parameters can be set from the debugfs files in: @@ -117,20 +122,19 @@ Note: that these module parameters are system wide values and are not able to Further Notes ============= -- Care should be taken when setting the halt_poll_ns module parameter as a -large value has the potential to drive the cpu usage to 100% on a machine which -would be almost entirely idle otherwise. This is because even if a guest has -wakeups during which very little work is done and which are quite far apart, if -the period is shorter than the global max polling interval (halt_poll_ns) then -the host will always poll for the entire block time and thus cpu utilisation -will go to 100%. - -- Halt polling essentially presents a trade off between power usage and latency -and the module parameters should be used to tune the affinity for this. Idle -cpu time is essentially converted to host kernel time with the aim of decreasing -latency when entering the guest. - -- Halt polling will only be conducted by the host when no other tasks are -runnable on that cpu, otherwise the polling will cease immediately and -schedule will be invoked to allow that other task to run. Thus this doesn't -allow a guest to denial of service the cpu. +- Care should be taken when setting the halt_poll_ns module parameter as a large value + has the potential to drive the cpu usage to 100% on a machine which would be almost + entirely idle otherwise. This is because even if a guest has wakeups during which very + little work is done and which are quite far apart, if the period is shorter than the + global max polling interval (halt_poll_ns) then the host will always poll for the + entire block time and thus cpu utilisation will go to 100%. + +- Halt polling essentially presents a trade off between power usage and latency and + the module parameters should be used to tune the affinity for this. 
Idle cpu time is + essentially converted to host kernel time with the aim of decreasing latency when + entering the guest. + +- Halt polling will only be conducted by the host when no other tasks are runnable on + that cpu, otherwise the polling will cease immediately and schedule will be invoked to + allow that other task to run. Thus this doesn't allow a guest to denial of service the + cpu. diff --git a/Documentation/virt/kvm/hypercalls.txt b/Documentation/virt/kvm/hypercalls.rst index 5f6d291bd004..dbaf207e560d 100644 --- a/Documentation/virt/kvm/hypercalls.txt +++ b/Documentation/virt/kvm/hypercalls.rst @@ -1,5 +1,9 @@ -Linux KVM Hypercall: +.. SPDX-License-Identifier: GPL-2.0 + +=================== +Linux KVM Hypercall =================== + X86: KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall instruction. The hypervisor can replace it with instructions that are @@ -20,7 +24,7 @@ S390: For further information on the S390 diagnose call as supported by KVM, refer to Documentation/virt/kvm/s390-diag.txt. - PowerPC: +PowerPC: It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers. Return value is placed in R3. @@ -34,7 +38,8 @@ MIPS: the return value is placed in $2 (v0). KVM Hypercalls Documentation -=========================== +============================ + The template for each hypercall is: 1. Hypercall name. 2. Architecture(s) @@ -43,56 +48,64 @@ The template for each hypercall is: 1. KVM_HC_VAPIC_POLL_IRQ ------------------------ -Architecture: x86 -Status: active -Purpose: Trigger guest exit so that the host can check for pending -interrupts on reentry. + +:Architecture: x86 +:Status: active +:Purpose: Trigger guest exit so that the host can check for pending + interrupts on reentry. 2. KVM_HC_MMU_OP ------------------------- -Architecture: x86 -Status: deprecated. -Purpose: Support MMU operations such as writing to PTE, -flushing TLB, release PT. 
+---------------- + +:Architecture: x86 +:Status: deprecated. +:Purpose: Support MMU operations such as writing to PTE, + flushing TLB, release PT. 3. KVM_HC_FEATURES ------------------------- -Architecture: PPC -Status: active -Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid -used to enumerate which hypercalls are available. On PPC, either device tree -based lookup ( which is also what EPAPR dictates) OR KVM specific enumeration -mechanism (which is this hypercall) can be used. +------------------ + +:Architecture: PPC +:Status: active +:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid + used to enumerate which hypercalls are available. On PPC, either + device tree based lookup ( which is also what EPAPR dictates) + OR KVM specific enumeration mechanism (which is this hypercall) + can be used. 4. KVM_HC_PPC_MAP_MAGIC_PAGE ------------------------- -Architecture: PPC -Status: active -Purpose: To enable communication between the hypervisor and guest there is a -shared page that contains parts of supervisor visible register state. -The guest can map this shared page to access its supervisor register through -memory using this hypercall. +---------------------------- + +:Architecture: PPC +:Status: active +:Purpose: To enable communication between the hypervisor and guest there is a + shared page that contains parts of supervisor visible register state. + The guest can map this shared page to access its supervisor register + through memory using this hypercall. 5. KVM_HC_KICK_CPU ------------------------- -Architecture: x86 -Status: active -Purpose: Hypercall used to wakeup a vcpu from HLT state -Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest -kernel mode for an event to occur (ex: a spinlock to become available) can -execute HLT instruction once it has busy-waited for more than a threshold -time-interval. 
Execution of HLT instruction would cause the hypervisor to put -the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the -same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall, -specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0) -is used in the hypercall for future use. +------------------ + +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to wakeup a vcpu from HLT state +:Usage example: + A vcpu of a paravirtualized guest that is busywaiting in guest + kernel mode for an event to occur (ex: a spinlock to become available) can + execute HLT instruction once it has busy-waited for more than a threshold + time-interval. Execution of HLT instruction would cause the hypervisor to put + the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the + same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall, + specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0) + is used in the hypercall for future use. 6. KVM_HC_CLOCK_PAIRING ------------------------- -Architecture: x86 -Status: active -Purpose: Hypercall used to synchronize host and guest clocks. +----------------------- +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to synchronize host and guest clocks. + Usage: a0: guest physical address where host copies @@ -101,6 +114,8 @@ a0: guest physical address where host copies a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0) is supported (corresponding to the host's CLOCK_REALTIME clock). + :: + struct kvm_clock_pairing { __s64 sec; __s64 nsec; @@ -123,15 +138,16 @@ Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. 6. KVM_HC_SEND_IPI ------------------------- -Architecture: x86 -Status: active -Purpose: Send IPIs to multiple vCPUs. 
+------------------ + +:Architecture: x86 +:Status: active +:Purpose: Send IPIs to multiple vCPUs. -a0: lower part of the bitmap of destination APIC IDs -a1: higher part of the bitmap of destination APIC IDs -a2: the lowest APIC ID in bitmap -a3: APIC ICR +- a0: lower part of the bitmap of destination APIC IDs +- a1: higher part of the bitmap of destination APIC IDs +- a2: the lowest APIC ID in bitmap +- a3: APIC ICR The hypercall lets a guest send multicast IPIs, with at most 128 128 destinations per hypercall in 64-bit mode and 64 vCPUs per @@ -143,12 +159,13 @@ corresponds to the APIC ID a2+1, and so on. Returns the number of CPUs to which the IPIs were delivered successfully. 7. KVM_HC_SCHED_YIELD ------------------------- -Architecture: x86 -Status: active -Purpose: Hypercall used to yield if the IPI target vCPU is preempted +--------------------- + +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to yield if the IPI target vCPU is preempted a0: destination APIC ID -Usage example: When sending a call-function IPI-many to vCPUs, yield if -any of the IPI target vCPUs was preempted. +:Usage example: When sending a call-function IPI-many to vCPUs, yield if + any of the IPI target vCPUs was preempted. diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst index ada224a511fe..774deaebf7fa 100644 --- a/Documentation/virt/kvm/index.rst +++ b/Documentation/virt/kvm/index.rst @@ -7,6 +7,22 @@ KVM .. toctree:: :maxdepth: 2 + api amd-memory-encryption cpuid + halt-polling + hypercalls + locking + mmu + msr + nested-vmx + ppc-pv + s390-diag + timekeeping vcpu-requests + + review-checklist + + arm/index + + devices/index diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst new file mode 100644 index 000000000000..c02291beac3f --- /dev/null +++ b/Documentation/virt/kvm/locking.rst @@ -0,0 +1,243 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +KVM Lock Overview +================= + +1. 
Acquisition Orders
+---------------------
+
+The acquisition orders for mutexes are as follows:
+
+- kvm->lock is taken outside vcpu->mutex
+
+- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
+
+- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
+  them together is quite rare.
+
+On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
+
+Everything else is a leaf: no other lock is taken inside the critical
+sections.
+
+2. Exception
+------------
+
+Fast page fault:
+
+Fast page fault is the fast path which fixes the guest page fault out of
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+   tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
+   restore the saved R/X bits. This is described in more detail later below.
+
+2. Write-Protection: The SPTE is present and the fault is
+   caused by write-protect. That means we just need to change the W bit of
+   the spte.
+
+What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
+SPTE_MMU_WRITEABLE bit on the spte:
+
+- SPTE_HOST_WRITEABLE means the gfn is writable on host.
+- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
+  the gfn is writable on guest mmu and it is not write-protected by shadow
+  page write-protection.
+
+On fast page fault path, we will use cmpxchg to atomically set the spte W
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
+restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
+is safe because whenever changing these bits can be detected by cmpxchg.
+
+But we need carefully check these cases:
+
+1) The mapping from gfn to pfn
+
+The mapping from gfn to pfn may be changed since we can only ensure the pfn
+is not changed during cmpxchg. This is a ABA problem, for example, below case
+will happen:
+
++------------------------------------------------------------------------+
+| At the beginning::                                                     |
+|                                                                        |
+|       gpte = gfn1                                                      |
+|       gfn1 is mapped to pfn1 on host                                   |
+|       spte is the shadow page table entry corresponding with gpte and  |
+|       spte = pfn1                                                      |
++------------------------------------------------------------------------+
+| On fast page fault path:                                               |
++------------------------------------+-----------------------------------+
+| CPU 0:                             | CPU 1:                            |
++------------------------------------+-----------------------------------+
+| ::                                 |                                   |
+|                                    |                                   |
+|   old_spte = *spte;                |                                   |
++------------------------------------+-----------------------------------+
+|                                    | pfn1 is swapped out::             |
+|                                    |                                   |
+|                                    |    spte = 0;                      |
+|                                    |                                   |
+|                                    | pfn1 is re-alloced for gfn2.      |
+|                                    |                                   |
+|                                    | gpte is changed to point to       |
+|                                    | gfn2 by the guest::               |
+|                                    |                                   |
+|                                    |    spte = pfn1;                   |
++------------------------------------+-----------------------------------+
+| ::                                                                     |
+|                                                                        |
+|   if (cmpxchg(spte, old_spte, old_spte+W)                              |
+|       mark_page_dirty(vcpu->kvm, gfn1)                                 |
+|            OOPS!!!                                                     |
++------------------------------------------------------------------------+
+
+We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
+
+For direct sp, we can easily avoid it since the spte of direct sp is fixed
+to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
+to pin gfn to pfn, because after gfn_to_pfn_atomic():
+
+- We have held the refcount of pfn that means the pfn can not be freed and
+  be reused for another gfn.
+- The pfn is writable that means it can not be shared between different gfns
+  by KSM.
+
+Then, we can ensure the dirty bitmaps is correctly set for a gfn.
+
+Currently, to simplify the whole things, we disable fast page fault for
+indirect shadow page.
+
+2) Dirty bit tracking
+
+In the origin code, the spte can be fast updated (non-atomically) if the
+spte is read-only and the Accessed bit has already been set since the
+Accessed bit and Dirty bit can not be lost.
+
+But it is not true after fast page fault since the spte can be marked
+writable between reading spte and updating spte. Like below case:
+
++------------------------------------------------------------------------+
+| At the beginning::                                                     |
+|                                                                        |
+|       spte.W = 0                                                       |
+|       spte.Accessed = 1                                                |
++------------------------------------+-----------------------------------+
+| CPU 0:                             | CPU 1:                            |
++------------------------------------+-----------------------------------+
+| In mmu_spte_clear_track_bits()::   |                                   |
+|                                    |                                   |
+|  old_spte = *spte;                 |                                   |
+|                                    |                                   |
+|                                    |                                   |
+|  /* 'if' condition is satisfied. */|                                   |
+|  if (old_spte.Accessed == 1 &&     |                                   |
+|       old_spte.W == 0)             |                                   |
+|     spte = 0ull;                   |                                   |
++------------------------------------+-----------------------------------+
+|                                    | on fast page fault path::         |
+|                                    |                                   |
+|                                    |    spte.W = 1                     |
+|                                    |                                   |
+|                                    | memory write on the spte::        |
+|                                    |                                   |
+|                                    |    spte.Dirty = 1                 |
++------------------------------------+-----------------------------------+
+| ::                                 |                                   |
+|                                    |                                   |
+|   else                             |                                   |
+|     old_spte = xchg(spte, 0ull)    |                                   |
+|   if (old_spte.Accessed == 1)      |                                   |
+|     kvm_set_pfn_accessed(spte.pfn);|                                   |
+|   if (old_spte.Dirty == 1)         |                                   |
+|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
+|     OOPS!!!                        |                                   |
++------------------------------------+-----------------------------------+
+
+The Dirty bit is lost in this case.
+
+In order to avoid this kind of issue, we always treat the spte as "volatile"
+if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
+the spte is always atomically updated in this case.
+
+3) flush tlbs due to spte updated
+
+If the spte is updated from writable to readonly, we should flush all TLBs,
+otherwise rmap_write_protect will find a read-only spte, even though the
+writable spte might be cached on a CPU's TLB.
+
+As mentioned before, the spte can be updated to writable out of mmu-lock on
+fast page fault path, in order to easily audit the path, we see if TLBs need
+be flushed caused by this reason in mmu_spte_update() since this is a common
+function to update spte (present -> present).
+
+Since the spte is "volatile" if it can be updated out of mmu-lock, we always
+atomically update the spte, the race caused by fast page fault can be avoided,
+See the comments in spte_has_volatile_bits() and mmu_spte_update().
+
+Lockless Access Tracking:
+
+This is used for Intel CPUs that are using EPT but do not support the EPT A/D
+bits. In this case, when the KVM MMU notifier is called to track accesses to a
+page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
+by clearing the RWX bits in the PTE and storing the original R & X bits in
+some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
+PTE (using the ignored bit 62). When the VM tries to access the page later on,
+a fault is generated and the fast page fault mechanism described above is used
+to atomically restore the PTE to a Present state. The W bit is not saved when
+the PTE is marked for access tracking and during restoration to the Present
+state, the W bit is set depending on whether or not it was a write access. If
+it wasn't, then the W bit will remain clear until a write access happens, at
+which time it will be set using the Dirty tracking mechanism described above.
+
+3. Reference
+------------
+
+:Name:          kvm_lock
+:Type:          mutex
+:Arch:          any
+:Protects:      - vm_list
+
+:Name:          kvm_count_lock
+:Type:          raw_spinlock_t
+:Arch:          any
+:Protects:      - hardware virtualization enable/disable
+:Comment:       'raw' because hardware enabling/disabling must be atomic /wrt
+                migration.
+
+:Name:          kvm_arch::tsc_write_lock
+:Type:          raw_spinlock
+:Arch:          x86
+:Protects:      - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
+                - tsc offset in vmcb
+:Comment:       'raw' because updating the tsc offsets must not be preempted.
+
+:Name:          kvm->mmu_lock
+:Type:          spinlock_t
+:Arch:          any
+:Protects:      -shadow page/shadow tlb entry
+:Comment:       it is a spinlock since it is used in mmu notifier.
+
+:Name:          kvm->srcu
+:Type:          srcu lock
+:Arch:          any
+:Protects:      - kvm->memslots
+                - kvm->buses
+:Comment:       The srcu read lock must be held while accessing memslots (e.g.
+                when using gfn_to_* functions) and while accessing in-kernel
+                MMIO/PIO address->device structure mapping (kvm->buses).
+                The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
+                if it is needed by multiple functions.
+
+:Name:          blocked_vcpu_on_cpu_lock
+:Type:          spinlock_t
+:Arch:          x86
+:Protects:      blocked_vcpu_on_cpu
+:Comment:       This is a per-CPU lock and it is used for VT-d posted-interrupts.
+                When VT-d posted-interrupts is supported and the VM has assigned
+                devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
+                protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues
+                wakeup notification event since external interrupts from the
+                assigned devices happens, we will find the vCPU on the list to
+                wakeup.
diff --git a/Documentation/virt/kvm/locking.txt b/Documentation/virt/kvm/locking.txt
deleted file mode 100644
index 635cd6eaf714..000000000000
--- a/Documentation/virt/kvm/locking.txt
+++ /dev/null
@@ -1,215 +0,0 @@
-KVM Lock Overview
-=================
-
-1. Acquisition Orders
----------------------
-
-The acquisition orders for mutexes are as follows:
-
-- kvm->lock is taken outside vcpu->mutex
-
-- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
-
-- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
-  them together is quite rare.
-
-On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
-
-Everything else is a leaf: no other lock is taken inside the critical
-sections.
-
-2: Exception
-------------
-
-Fast page fault:
-
-Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast in one of the
-following two cases:
-
-1. Access Tracking: The SPTE is not present, but it is marked for access
-tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
-restore the saved R/X bits. This is described in more detail later below.
-
-2. Write-Protection: The SPTE is present and the fault is
-caused by write-protect. That means we just need to change the W bit of the
-spte.
-
-What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
-SPTE_MMU_WRITEABLE bit on the spte:
-- SPTE_HOST_WRITEABLE means the gfn is writable on host.
-- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
-  the gfn is writable on guest mmu and it is not write-protected by shadow
-  page write-protection.
-
-On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
-restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
-is safe because whenever changing these bits can be detected by cmpxchg.
-
-But we need carefully check these cases:
-1): The mapping from gfn to pfn
-The mapping from gfn to pfn may be changed since we can only ensure the pfn
-is not changed during cmpxchg. This is a ABA problem, for example, below case
-will happen:
-
-At the beginning:
-gpte = gfn1
-gfn1 is mapped to pfn1 on host
-spte is the shadow page table entry corresponding with gpte and
-spte = pfn1
-
-   VCPU 0                           VCPU0
-on fast page fault path:
-
-   old_spte = *spte;
-                                 pfn1 is swapped out:
-                                    spte = 0;
-
-                                 pfn1 is re-alloced for gfn2.
-
-                                 gpte is changed to point to
-                                 gfn2 by the guest:
-                                    spte = pfn1;
-
-   if (cmpxchg(spte, old_spte, old_spte+W)
-       mark_page_dirty(vcpu->kvm, gfn1)
-            OOPS!!!
-
-We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
-
-For direct sp, we can easily avoid it since the spte of direct sp is fixed
-to gfn. For indirect sp, before we do cmpxchg, we call gfn_to_pfn_atomic()
-to pin gfn to pfn, because after gfn_to_pfn_atomic():
-- We have held the refcount of pfn that means the pfn can not be freed and
-  be reused for another gfn.
-- The pfn is writable that means it can not be shared between different gfns
-  by KSM.
-
-Then, we can ensure the dirty bitmaps is correctly set for a gfn.
-
-Currently, to simplify the whole things, we disable fast page fault for
-indirect shadow page.
-
-2): Dirty bit tracking
-In the origin code, the spte can be fast updated (non-atomically) if the
-spte is read-only and the Accessed bit has already been set since the
-Accessed bit and Dirty bit can not be lost.
-
-But it is not true after fast page fault since the spte can be marked
-writable between reading spte and updating spte. Like below case:
-
-At the beginning:
-spte.W = 0
-spte.Accessed = 1
-
-   VCPU 0                                VCPU0
-In mmu_spte_clear_track_bits():
-
-   old_spte = *spte;
-
-   /* 'if' condition is satisfied. */
-   if (old_spte.Accessed == 1 &&
-        old_spte.W == 0)
-      spte = 0ull;
-                                         on fast page fault path:
-                                            spte.W = 1
-                                         memory write on the spte:
-                                            spte.Dirty = 1
-
-
-   else
-      old_spte = xchg(spte, 0ull)
-
-
-   if (old_spte.Accessed == 1)
-      kvm_set_pfn_accessed(spte.pfn);
-   if (old_spte.Dirty == 1)
-      kvm_set_pfn_dirty(spte.pfn);
-      OOPS!!!
-
-The Dirty bit is lost in this case.
-
-In order to avoid this kind of issue, we always treat the spte as "volatile"
-if it can be updated out of mmu-lock, see spte_has_volatile_bits(), it means,
-the spte is always atomically updated in this case.
-
-3): flush tlbs due to spte updated
-If the spte is updated from writable to readonly, we should flush all TLBs,
-otherwise rmap_write_protect will find a read-only spte, even though the
-writable spte might be cached on a CPU's TLB.
-
-As mentioned before, the spte can be updated to writable out of mmu-lock on
-fast page fault path, in order to easily audit the path, we see if TLBs need
-be flushed caused by this reason in mmu_spte_update() since this is a common
-function to update spte (present -> present).
-
-Since the spte is "volatile" if it can be updated out of mmu-lock, we always
-atomically update the spte, the race caused by fast page fault can be avoided,
-See the comments in spte_has_volatile_bits() and mmu_spte_update().
-
-Lockless Access Tracking:
-
-This is used for Intel CPUs that are using EPT but do not support the EPT A/D
-bits. In this case, when the KVM MMU notifier is called to track accesses to a
-page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
-by clearing the RWX bits in the PTE and storing the original R & X bits in
-some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
-PTE (using the ignored bit 62). When the VM tries to access the page later on,
-a fault is generated and the fast page fault mechanism described above is used
-to atomically restore the PTE to a Present state. The W bit is not saved when
-the PTE is marked for access tracking and during restoration to the Present
-state, the W bit is set depending on whether or not it was a write access. If
-it wasn't, then the W bit will remain clear until a write access happens, at
-which time it will be set using the Dirty tracking mechanism described above.
-
-3. Reference
-------------
-
-Name:           kvm_lock
-Type:           mutex
-Arch:           any
-Protects:       - vm_list
-
-Name:           kvm_count_lock
-Type:           raw_spinlock_t
-Arch:           any
-Protects:       - hardware virtualization enable/disable
-Comment:        'raw' because hardware enabling/disabling must be atomic /wrt
-                migration.
- -Name: kvm_arch::tsc_write_lock -Type: raw_spinlock -Arch: x86 -Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} - - tsc offset in vmcb -Comment: 'raw' because updating the tsc offsets must not be preempted. - -Name: kvm->mmu_lock -Type: spinlock_t -Arch: any -Protects: -shadow page/shadow tlb entry -Comment: it is a spinlock since it is used in mmu notifier. - -Name: kvm->srcu -Type: srcu lock -Arch: any -Protects: - kvm->memslots - - kvm->buses -Comment: The srcu read lock must be held while accessing memslots (e.g. - when using gfn_to_* functions) and while accessing in-kernel - MMIO/PIO address->device structure mapping (kvm->buses). - The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu - if it is needed by multiple functions. - -Name: blocked_vcpu_on_cpu_lock -Type: spinlock_t -Arch: x86 -Protects: blocked_vcpu_on_cpu -Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. - When VT-d posted-interrupts is supported and the VM has assigned - devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu - protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues - wakeup notification event since external interrupts from the - assigned devices happens, we will find the vCPU on the list to - wakeup. diff --git a/Documentation/virt/kvm/mmu.txt b/Documentation/virt/kvm/mmu.rst index dadb29e8738f..60981887d20b 100644 --- a/Documentation/virt/kvm/mmu.txt +++ b/Documentation/virt/kvm/mmu.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== The x86 kvm shadow mmu ====================== @@ -7,27 +10,37 @@ physical addresses to host physical addresses. 
The mmu code attempts to satisfy the following requirements: -- correctness: the guest should not be able to determine that it is running +- correctness: + the guest should not be able to determine that it is running on an emulated mmu except for timing (we attempt to comply with the specification, not emulate the characteristics of a particular implementation such as tlb size) -- security: the guest must not be able to touch host memory not assigned +- security: + the guest must not be able to touch host memory not assigned to it -- performance: minimize the performance penalty imposed by the mmu -- scaling: need to scale to large memory and large vcpu guests -- hardware: support the full range of x86 virtualization hardware -- integration: Linux memory management code must be in control of guest memory +- performance: + minimize the performance penalty imposed by the mmu +- scaling: + need to scale to large memory and large vcpu guests +- hardware: + support the full range of x86 virtualization hardware +- integration: + Linux memory management code must be in control of guest memory so that swapping, page migration, page merging, transparent hugepages, and similar features work without change -- dirty tracking: report writes to guest memory to enable live migration +- dirty tracking: + report writes to guest memory to enable live migration and framebuffer-based displays -- footprint: keep the amount of pinned kernel memory low (most memory +- footprint: + keep the amount of pinned kernel memory low (most memory should be shrinkable) -- reliability: avoid multipage or GFP_ATOMIC allocations +- reliability: + avoid multipage or GFP_ATOMIC allocations Acronyms ======== +==== ==================================================================== pfn host page frame number hpa host physical address hva host virtual address @@ -41,6 +54,7 @@ pte page table entry (used also to refer generically to paging structure gpte guest pte (referring to gfns) spte shadow pte 
(referring to pfns) tdp two dimensional paging (vendor neutral term for NPT and EPT) +==== ==================================================================== Virtual and real hardware supported =================================== @@ -90,11 +104,13 @@ Events The mmu is driven by events, some from the guest, some from the host. Guest generated events: + - writes to control registers (especially cr3) - invlpg/invlpga instruction execution - access to missing or protected translations Host generated events: + - changes in the gpa->hpa translation (either through gpa->hva changes or through hva->hpa changes) - memory pressure (the shrinker) @@ -117,16 +133,19 @@ Leaf ptes point at guest pages. The following table shows translations encoded by leaf ptes, with higher-level translations in parentheses: - Non-nested guests: + Non-nested guests:: + nonpaging: gpa->hpa paging: gva->gpa->hpa paging, tdp: (gva->)gpa->hpa - Nested guests: + + Nested guests:: + non-tdp: ngva->gpa->hpa (*) tdp: (ngva->)ngpa->gpa->hpa -(*) the guest hypervisor will encode the ngva->gpa translation into its page - tables if npt is not present + (*) the guest hypervisor will encode the ngva->gpa translation into its page + tables if npt is not present Shadow pages contain the following information: role.level: @@ -291,28 +310,41 @@ Handling a page fault is performed as follows: - if the RSV bit of the error code is set, the page fault is caused by guest accessing MMIO and cached MMIO information is available. + - walk shadow page table - check for valid generation number in the spte (see "Fast invalidation of MMIO sptes" below) - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn, and call the emulator + - If both P bit and R/W bit of error code are set, this could possibly be handled as a "fast page fault" (fixed without taking the MMU lock). See the description in Documentation/virt/kvm/locking.txt. 
+ - if needed, walk the guest page tables to determine the guest translation (gva->gpa or ngpa->gpa) + - if permissions are insufficient, reflect the fault back to the guest + - determine the host page + - if this is an mmio request, there is no host page; cache the info to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn + - walk the shadow page table to find the spte for the translation, instantiating missing intermediate page tables as necessary + - If this is an mmio request, cache the mmio info to the spte and set some reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask) + - try to unsynchronize the page + - if successful, we can let the guest continue and modify the gpte + - emulate the instruction + - if failed, unshadow the page and let the guest continue + - update any translations that were modified by the instruction invlpg handling: @@ -324,10 +356,12 @@ invlpg handling: Guest control register updates: - mov to cr3 + - look up new shadow roots - synchronize newly reachable shadow pages - mov to cr0/cr4/efer + - set up mmu context for new paging mode - look up new shadow roots - synchronize newly reachable shadow pages @@ -358,6 +392,7 @@ on fault type: (user write faults generate a #PF) In the first case there are two additional complications: + - if CR4.SMEP is enabled: since we've turned the page into a kernel page, the kernel may now execute it. We handle this by also setting spte.nx. If we get a user fetch or read fault, we'll change spte.u=1 and @@ -446,4 +481,3 @@ Further reading - NPT presentation from KVM Forum 2008 http://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf - diff --git a/Documentation/virt/kvm/msr.txt b/Documentation/virt/kvm/msr.rst index df1f4338b3ca..33892036672d 100644 --- a/Documentation/virt/kvm/msr.txt +++ b/Documentation/virt/kvm/msr.rst @@ -1,6 +1,10 @@ -KVM-specific MSRs. 
-Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 -===================================================== +.. SPDX-License-Identifier: GPL-2.0 + +================= +KVM-specific MSRs +================= + +:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 KVM makes use of some custom MSRs to service some requests. @@ -9,34 +13,39 @@ Custom MSRs have a range reserved for them, that goes from but they are deprecated and their use is discouraged. Custom MSR list --------- +--------------- The current supported Custom MSR list is: -MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 +MSR_KVM_WALL_CLOCK_NEW: + 0x4b564d00 - data: 4-byte alignment physical address of a memory area which must be +data: + 4-byte alignment physical address of a memory area which must be in guest RAM. This memory is expected to hold a copy of the following - structure: + structure:: - struct pvclock_wall_clock { + struct pvclock_wall_clock { u32 version; u32 sec; u32 nsec; - } __attribute__((__packed__)); + } __attribute__((__packed__)); whose data will be filled in by the hypervisor. The hypervisor is only guaranteed to update this data at the moment of MSR write. Users that want to reliably query this information more than once have to write more than once to this MSR. Fields have the following meanings: - version: guest has to check version before and after grabbing + version: + guest has to check version before and after grabbing time information and check that they are both equal and even. An odd version indicates an in-progress update. - sec: number of seconds for wallclock at time of boot. + sec: + number of seconds for wallclock at time of boot. - nsec: number of nanoseconds for wallclock at time of boot. + nsec: + number of nanoseconds for wallclock at time of boot. In order to get the current wallclock time, the system_time from MSR_KVM_SYSTEM_TIME_NEW needs to be added. 
@@ -47,13 +56,15 @@ MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid leaf prior to usage. -MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 +MSR_KVM_SYSTEM_TIME_NEW: + 0x4b564d01 - data: 4-byte aligned physical address of a memory area which must be in +data: + 4-byte aligned physical address of a memory area which must be in guest RAM, plus an enable bit in bit 0. This memory is expected to hold - a copy of the following structure: + a copy of the following structure:: - struct pvclock_vcpu_time_info { + struct pvclock_vcpu_time_info { u32 version; u32 pad0; u64 tsc_timestamp; @@ -62,7 +73,7 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 s8 tsc_shift; u8 flags; u8 pad[2]; - } __attribute__((__packed__)); /* 32 bytes */ + } __attribute__((__packed__)); /* 32 bytes */ whose data will be filled in by the hypervisor periodically. Only one write, or registration, is needed for each VCPU. The interval between @@ -72,23 +83,28 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 Fields have the following meanings: - version: guest has to check version before and after grabbing + version: + guest has to check version before and after grabbing time information and check that they are both equal and even. An odd version indicates an in-progress update. - tsc_timestamp: the tsc value at the current VCPU at the time + tsc_timestamp: + the tsc value at the current VCPU at the time of the update of this structure. Guests can subtract this value from current tsc to derive a notion of elapsed time since the structure update. - system_time: a host notion of monotonic time, including sleep + system_time: + a host notion of monotonic time, including sleep time at the time this structure was last updated. Unit is nanoseconds. 
- tsc_to_system_mul: multiplier to be used when converting + tsc_to_system_mul: + multiplier to be used when converting tsc-related quantity to nanoseconds - tsc_shift: shift to be used when converting tsc-related + tsc_shift: + shift to be used when converting tsc-related quantity to nanoseconds. This shift will ensure that multiplication with tsc_to_system_mul does not overflow. A positive value denotes a left shift, a negative value @@ -96,7 +112,7 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 The conversion from tsc to nanoseconds involves an additional right shift by 32 bits. With this information, guests can - derive per-CPU time by doing: + derive per-CPU time by doing:: time = (current_tsc - tsc_timestamp) if (tsc_shift >= 0) @@ -106,29 +122,34 @@ MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 time = (time * tsc_to_system_mul) >> 32 time = time + system_time - flags: bits in this field indicate extended capabilities + flags: + bits in this field indicate extended capabilities coordinated between the guest and the hypervisor. Availability of specific flags has to be checked in 0x40000001 cpuid leaf. 
Current flags are: - flag bit | cpuid bit | meaning - ------------------------------------------------------------- - | | time measures taken across - 0 | 24 | multiple cpus are guaranteed to - | | be monotonic - ------------------------------------------------------------- - | | guest vcpu has been paused by - 1 | N/A | the host - | | See 4.70 in api.txt - ------------------------------------------------------------- + + +-----------+--------------+----------------------------------+ + | flag bit | cpuid bit | meaning | + +-----------+--------------+----------------------------------+ + | | | time measures taken across | + | 0 | 24 | multiple cpus are guaranteed to | + | | | be monotonic | + +-----------+--------------+----------------------------------+ + | | | guest vcpu has been paused by | + | 1 | N/A | the host | + | | | See 4.70 in api.txt | + +-----------+--------------+----------------------------------+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid leaf prior to usage. -MSR_KVM_WALL_CLOCK: 0x11 +MSR_KVM_WALL_CLOCK: + 0x11 - data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. +data and functioning: + same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. This MSR falls outside the reserved KVM range and may be removed in the future. Its usage is deprecated. @@ -136,9 +157,11 @@ MSR_KVM_WALL_CLOCK: 0x11 Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid leaf prior to usage. -MSR_KVM_SYSTEM_TIME: 0x12 +MSR_KVM_SYSTEM_TIME: + 0x12 - data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. +data and functioning: + same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. This MSR falls outside the reserved KVM range and may be removed in the future. Its usage is deprecated. @@ -146,7 +169,7 @@ MSR_KVM_SYSTEM_TIME: 0x12 Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid leaf prior to usage. 
- The suggested algorithm for detecting kvmclock presence is then: + The suggested algorithm for detecting kvmclock presence is then:: if (!kvm_para_available()) /* refer to cpuid.txt */ return NON_PRESENT; @@ -163,8 +186,11 @@ MSR_KVM_SYSTEM_TIME: 0x12 } else return NON_PRESENT; -MSR_KVM_ASYNC_PF_EN: 0x4b564d02 - data: Bits 63-6 hold 64-byte aligned physical address of a +MSR_KVM_ASYNC_PF_EN: + 0x4b564d02 + +data: + Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area which must be in guest RAM and must be zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu 0 when @@ -200,20 +226,22 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02 Currently type 2 APF will be always delivered on the same vcpu as type 1 was, but guest should not rely on that. -MSR_KVM_STEAL_TIME: 0x4b564d03 +MSR_KVM_STEAL_TIME: + 0x4b564d03 - data: 64-byte alignment physical address of a memory area which must be +data: + 64-byte alignment physical address of a memory area which must be in guest RAM, plus an enable bit in bit 0. This memory is expected to - hold a copy of the following structure: + hold a copy of the following structure:: - struct kvm_steal_time { + struct kvm_steal_time { __u64 steal; __u32 version; __u32 flags; __u8 preempted; __u8 u8_pad[3]; __u32 pad[11]; - } + } whose data will be filled in by the hypervisor periodically. Only one write, or registration, is needed for each VCPU. The interval between @@ -224,25 +252,32 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 Fields have the following meanings: - version: a sequence counter. In other words, guest has to check + version: + a sequence counter. In other words, guest has to check this field before and after grabbing time information and make sure they are both equal and even. An odd version indicates an in-progress update. - flags: At this point, always zero. May be used to indicate + flags: + At this point, always zero. 
May be used to indicate changes in this structure in the future. - steal: the amount of time in which this vCPU did not run, in + steal: + the amount of time in which this vCPU did not run, in nanoseconds. Time during which the vcpu is idle, will not be reported as steal time. - preempted: indicate the vCPU who owns this struct is running or + preempted: + indicate the vCPU who owns this struct is running or not. Non-zero values mean the vCPU has been preempted. Zero means the vCPU is not preempted. NOTE, it is always zero if the the hypervisor doesn't support this field. -MSR_KVM_EOI_EN: 0x4b564d04 - data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 +MSR_KVM_EOI_EN: + 0x4b564d04 + +data: + Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 when disabled. Bit 1 is reserved and must be zero. When PV end of interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned physical address of a 4 byte memory area which must be in guest RAM and @@ -274,11 +309,13 @@ MSR_KVM_EOI_EN: 0x4b564d04 clear it using a single CPU instruction, such as test and clear, or compare and exchange. -MSR_KVM_POLL_CONTROL: 0x4b564d05 +MSR_KVM_POLL_CONTROL: + 0x4b564d05 + Control host-side polling. - data: Bit 0 enables (1) or disables (0) host-side HLT polling logic. +data: + Bit 0 enables (1) or disables (0) host-side HLT polling logic. KVM guests can request the host not to poll on HLT, for example if they are performing polling themselves. - diff --git a/Documentation/virt/kvm/nested-vmx.txt b/Documentation/virt/kvm/nested-vmx.rst index 97eb1353e962..592b0ab6970b 100644 --- a/Documentation/virt/kvm/nested-vmx.txt +++ b/Documentation/virt/kvm/nested-vmx.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== Nested VMX ========== @@ -41,9 +44,9 @@ No modifications are required to user space (qemu). 
However, qemu's default emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be explicitly enabled, by giving qemu one of the following options: - -cpu host (emulated CPU has all features of the real CPU) + - cpu host (emulated CPU has all features of the real CPU) - -cpu qemu64,+vmx (add just the vmx feature to a named CPU type) + - cpu qemu64,+vmx (add just the vmx feature to a named CPU type) ABIs @@ -75,6 +78,8 @@ of this structure changes, this can break live migration across KVM versions. VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs is ever changed. +:: + typedef u64 natural_width; struct __packed vmcs12 { /* According to the Intel spec, a VMCS region must start with @@ -220,21 +225,21 @@ Authors ------- These patches were written by: - Abel Gordon, abelg <at> il.ibm.com - Nadav Har'El, nyh <at> il.ibm.com - Orit Wasserman, oritw <at> il.ibm.com - Ben-Ami Yassor, benami <at> il.ibm.com - Muli Ben-Yehuda, muli <at> il.ibm.com + - Abel Gordon, abelg <at> il.ibm.com + - Nadav Har'El, nyh <at> il.ibm.com + - Orit Wasserman, oritw <at> il.ibm.com + - Ben-Ami Yassor, benami <at> il.ibm.com + - Muli Ben-Yehuda, muli <at> il.ibm.com With contributions by: - Anthony Liguori, aliguori <at> us.ibm.com - Mike Day, mdday <at> us.ibm.com - Michael Factor, factor <at> il.ibm.com - Zvi Dubitzky, dubi <at> il.ibm.com + - Anthony Liguori, aliguori <at> us.ibm.com + - Mike Day, mdday <at> us.ibm.com + - Michael Factor, factor <at> il.ibm.com + - Zvi Dubitzky, dubi <at> il.ibm.com And valuable reviews by: - Avi Kivity, avi <at> redhat.com - Gleb Natapov, gleb <at> redhat.com - Marcelo Tosatti, mtosatti <at> redhat.com - Kevin Tian, kevin.tian <at> intel.com - and others. + - Avi Kivity, avi <at> redhat.com + - Gleb Natapov, gleb <at> redhat.com + - Marcelo Tosatti, mtosatti <at> redhat.com + - Kevin Tian, kevin.tian <at> intel.com + - and others. 
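Reviewer note on the msr.rst hunks above: the per-CPU time derivation documented for MSR_KVM_SYSTEM_TIME_NEW can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not code from this patch. The helper names pvclock_tsc_to_ns() and pvclock_version_stable() are hypothetical, the struct uses <stdint.h> types in place of the kernel's u32/u64/s8/u8, the types of system_time and tsc_to_system_mul (elided by the hunk context) are assumed from the documented 32-byte size, and the packed attribute is GCC/Clang syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Layout as documented for MSR_KVM_SYSTEM_TIME_NEW (stdint types assumed). */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;        /* type assumed; elided in the hunk */
	uint32_t tsc_to_system_mul;  /* type assumed; elided in the hunk */
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

/* The documentation annotates the structure as 32 bytes. */
static_assert(sizeof(struct pvclock_vcpu_time_info) == 32,
	      "pvclock_vcpu_time_info is documented as 32 bytes");

/*
 * Hypothetical helper: a snapshot is usable only if the version read
 * before and after copying the fields is equal and even; an odd value
 * indicates an in-progress update.
 */
static int pvclock_version_stable(uint32_t before, uint32_t after)
{
	return before == after && (before & 1) == 0;
}

/*
 * Hypothetical helper: apply the documented formula to a TSC reading.
 *   time = current_tsc - tsc_timestamp
 *   time = shifted left (tsc_shift >= 0) or right (tsc_shift < 0)
 *   time = (time * tsc_to_system_mul) >> 32
 *   time = time + system_time
 */
static uint64_t pvclock_tsc_to_ns(const struct pvclock_vcpu_time_info *ti,
				  uint64_t current_tsc)
{
	uint64_t time = current_tsc - ti->tsc_timestamp;

	if (ti->tsc_shift >= 0)
		time <<= ti->tsc_shift;
	else
		time >>= -ti->tsc_shift;

	/*
	 * Plain 64-bit multiply, relying on the documented statement that
	 * the shift keeps this product from overflowing.
	 */
	time = (time * ti->tsc_to_system_mul) >> 32;

	return time + ti->system_time;
}
```

A guest would typically read version, copy the fields, read version again, and retry until pvclock_version_stable() holds before calling pvclock_tsc_to_ns(). The memory barriers such a loop needs are outside the scope of this sketch.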
diff --git a/Documentation/virt/kvm/ppc-pv.txt b/Documentation/virt/kvm/ppc-pv.rst index e26115ce4258..5fdb907670be 100644 --- a/Documentation/virt/kvm/ppc-pv.txt +++ b/Documentation/virt/kvm/ppc-pv.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= The PPC KVM paravirtual interface ================================= @@ -34,8 +37,9 @@ up the hypercall. To call a hypercall, just call these instructions. The parameters are as follows: + ======== ================ ================ Register IN OUT - + ======== ================ ================ r0 - volatile r3 1st parameter Return code r4 2nd parameter 1st output value @@ -47,6 +51,7 @@ The parameters are as follows: r10 8th parameter 7th output value r11 hypercall number 8th output value r12 - volatile + ======== ================ ================ Hypercall definitions are shared in generic code, so the same hypercall numbers apply for x86 and powerpc alike with the exception that each KVM hypercall @@ -54,11 +59,13 @@ also needs to be ORed with the KVM vendor code which is (42 << 16). Return codes can be as follows: + ==== ========================= Code Meaning - + ==== ========================= 0 Success 12 Hypercall not implemented <0 Error + ==== ========================= The magic page ============== @@ -72,7 +79,7 @@ desired location. The first parameter indicates the effective address when the MMU is enabled. The second parameter indicates the address in real mode, if applicable to the target. For now, we always map the page to -4096. This way we can access it using absolute load and store functions. The following -instruction reads the first field of the magic page: +instruction reads the first field of the magic page:: ld rX, -4096(0) @@ -93,8 +100,10 @@ a bitmap of available features inside the magic page. 
The following enhancements to the magic page are currently available: + ============================ ======================================= KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page KVM_MAGIC_FEAT_MAS0_TO_SPRG7 Maps MASn, ESR, PIR and high SPRGs + ============================ ======================================= For enhanced features in the magic page, please check for the existence of the feature before using them! @@ -121,8 +130,8 @@ when entering the guest or don't have any impact on the hypervisor's behavior. The following bits are safe to be set inside the guest: - MSR_EE - MSR_RI + - MSR_EE + - MSR_RI If any other bit changes in the MSR, please still use mtmsr(d). @@ -138,9 +147,9 @@ guest. Implementing any of those mappings is optional, as the instruction traps also act on the shared page. So calling privileged instructions still works as before. +======================= ================================ From To -==== == - +======================= ================================ mfmsr rX ld rX, magic_page->msr mfsprg rX, 0 ld rX, magic_page->sprg0 mfsprg rX, 1 ld rX, magic_page->sprg1 @@ -173,7 +182,7 @@ mtsrin rX, rY b <special mtsrin section> [BookE only] wrteei [0|1] b <special wrteei section> - +======================= ================================ Some instructions require more logic to determine what's going on than a load or store instruction can deliver. To enable patching of those, we keep some @@ -191,6 +200,7 @@ for example. Hypercall ABIs in KVM on PowerPC ================================= + 1) KVM hypercalls (ePAPR) These are ePAPR compliant hypercall implementation (mentioned above). Even diff --git a/Documentation/virt/kvm/review-checklist.txt b/Documentation/virt/kvm/review-checklist.rst index 499af499e296..1f86a9d3f705 100644 --- a/Documentation/virt/kvm/review-checklist.txt +++ b/Documentation/virt/kvm/review-checklist.rst @@ -1,3 +1,6 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================================ Review checklist for kvm patches ================================ diff --git a/Documentation/virt/kvm/s390-diag.txt b/Documentation/virt/kvm/s390-diag.rst index 7c52e5f8b210..eaac4864d3d6 100644 --- a/Documentation/virt/kvm/s390-diag.txt +++ b/Documentation/virt/kvm/s390-diag.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= The s390 DIAGNOSE call on KVM ============================= @@ -16,12 +19,12 @@ DIAGNOSE calls by the guest cause a mandatory intercept. This implies all supported DIAGNOSE calls need to be handled by either KVM or its userspace. -All DIAGNOSE calls supported by KVM use the RS-a format: +All DIAGNOSE calls supported by KVM use the RS-a format:: --------------------------------------- -| '83' | R1 | R3 | B2 | D2 | --------------------------------------- -0 8 12 16 20 31 + -------------------------------------- + | '83' | R1 | R3 | B2 | D2 | + -------------------------------------- + 0 8 12 16 20 31 The second-operand address (obtained by the base/displacement calculation) is not used to address data. Instead, bits 48-63 of this address specify diff --git a/Documentation/virt/kvm/timekeeping.txt b/Documentation/virt/kvm/timekeeping.rst index 76808a17ad84..21ae7efa29ba 100644 --- a/Documentation/virt/kvm/timekeeping.txt +++ b/Documentation/virt/kvm/timekeeping.rst @@ -1,17 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 - Timekeeping Virtualization for X86-Based Architectures +====================================================== +Timekeeping Virtualization for X86-Based Architectures +====================================================== - Zachary Amsden <zamsden@redhat.com> - Copyright (c) 2010, Red Hat. All rights reserved. +:Author: Zachary Amsden <zamsden@redhat.com> +:Copyright: (c) 2010, Red Hat. All rights reserved. -1) Overview -2) Timing Devices -3) TSC Hardware -4) Virtualization Problems +.. 
Contents -========================================================================= + 1) Overview + 2) Timing Devices + 3) TSC Hardware + 4) Virtualization Problems -1) Overview +1. Overview +=========== One of the most complicated parts of the X86 platform, and specifically, the virtualization of this platform is the plethora of timing devices available @@ -27,15 +31,15 @@ The purpose of this document is to collect data and information relevant to timekeeping which may be difficult to find elsewhere, specifically, information relevant to KVM and hardware-based virtualization. -========================================================================= - -2) Timing Devices +2. Timing Devices +================= First we discuss the basic hardware devices available. TSC and the related KVM clock are special enough to warrant a full exposition and are described in the following section. -2.1) i8254 - PIT +2.1. i8254 - PIT +---------------- One of the first timer devices available is the programmable interrupt timer, or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three @@ -50,13 +54,13 @@ The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done using single or multiple byte access to the I/O ports. There are 6 modes available, but not all modes are available to all timers, as only timer 2 has a connected gate input, required for modes 1 and 5. The gate line is -controlled by port 61h, bit 0, as illustrated in the following diagram. 
+controlled by port 61h, bit 0, as illustrated in the following diagram:: - -------------- ---------------- -| | | | -| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0 -| Clock | | | | - -------------- | +->| GATE TIMER 0 | + -------------- ---------------- + | | | | + | 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0 + | Clock | | | | + -------------- | +->| GATE TIMER 0 | | ---------------- | | ---------------- @@ -70,29 +74,33 @@ controlled by port 61h, bit 0, as illustrated in the following diagram. | | | |------>| CLOCK OUT | ---------> Port 61h, bit 5 | | | -Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____ + Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____ ---------------- _| )--|LPF|---Speaker / *---- \___/ -Port 61h, bit 1 -----------------------------------/ + Port 61h, bit 1 ---------------------------------/ The timer modes are now described. -Mode 0: Single Timeout. This is a one-shot software timeout that counts down +Mode 0: Single Timeout. + This is a one-shot software timeout that counts down when the gate is high (always true for timers 0 and 1). When the count reaches zero, the output goes high. -Mode 1: Triggered One-shot. The output is initially set high. When the gate +Mode 1: Triggered One-shot. + The output is initially set high. When the gate line is set high, a countdown is initiated (which does not stop if the gate is lowered), during which the output is set low. When the count reaches zero, the output goes high. -Mode 2: Rate Generator. The output is initially set high. When the countdown +Mode 2: Rate Generator. + The output is initially set high. When the countdown reaches 1, the output goes low for one count and then returns high. The value is reloaded and the countdown automatically resumes. If the gate line goes low, the count is halted. If the output is low when the gate is lowered, the output automatically goes high (this only affects timer 2). -Mode 3: Square Wave. This generates a high / low square wave. 
The count +Mode 3: Square Wave. + This generates a high / low square wave. The count determines the length of the pulse, which alternates between high and low when zero is reached. The count only proceeds when gate is high and is automatically reloaded on reaching zero. The count is decremented twice at @@ -103,12 +111,14 @@ Mode 3: Square Wave. This generates a high / low square wave. The count values are not observed when reading. This is the intended mode for timer 2, which generates sine-like tones by low-pass filtering the square wave output. -Mode 4: Software Strobe. After programming this mode and loading the counter, +Mode 4: Software Strobe. + After programming this mode and loading the counter, the output remains high until the counter reaches zero. Then the output goes low for 1 clock cycle and returns high. The counter is not reloaded. Counting only occurs when gate is high. -Mode 5: Hardware Strobe. After programming and loading the counter, the +Mode 5: Hardware Strobe. + After programming and loading the counter, the output remains high. When the gate is raised, a countdown is initiated (which does not stop if the gate is lowered). When the counter reaches zero, the output goes low for 1 clock cycle and then returns high. The counter is @@ -118,49 +128,49 @@ In addition to normal binary counting, the PIT supports BCD counting. The command port, 0x43 is used to set the counter and mode for each of the three timers. 
-PIT commands, issued to port 0x43, using the following bit encoding: +PIT commands, issued to port 0x43, using the following bit encoding:: -Bit 7-4: Command (See table below) -Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) -Bit 0 : Binary (0) / BCD (1) + Bit 7-4: Command (See table below) + Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) + Bit 0 : Binary (0) / BCD (1) -Command table: +Command table:: -0000 - Latch Timer 0 count for port 0x40 + 0000 - Latch Timer 0 count for port 0x40 sample and hold the count to be read in port 0x40; additional commands ignored until counter is read; mode bits ignored. -0001 - Set Timer 0 LSB mode for port 0x40 + 0001 - Set Timer 0 LSB mode for port 0x40 set timer to read LSB only and force MSB to zero; mode bits set timer mode -0010 - Set Timer 0 MSB mode for port 0x40 + 0010 - Set Timer 0 MSB mode for port 0x40 set timer to read MSB only and force LSB to zero; mode bits set timer mode -0011 - Set Timer 0 16-bit mode for port 0x40 + 0011 - Set Timer 0 16-bit mode for port 0x40 set timer to read / write LSB first, then MSB; mode bits set timer mode -0100 - Latch Timer 1 count for port 0x41 - as described above -0101 - Set Timer 1 LSB mode for port 0x41 - as described above -0110 - Set Timer 1 MSB mode for port 0x41 - as described above -0111 - Set Timer 1 16-bit mode for port 0x41 - as described above + 0100 - Latch Timer 1 count for port 0x41 - as described above + 0101 - Set Timer 1 LSB mode for port 0x41 - as described above + 0110 - Set Timer 1 MSB mode for port 0x41 - as described above + 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above -1000 - Latch Timer 2 count for port 0x42 - as described above -1001 - Set Timer 2 LSB mode for port 0x42 - as described above -1010 - Set Timer 2 MSB mode for port 0x42 - as described above -1011 - Set Timer 2 16-bit mode for port 0x42 as described above + 1000 - Latch Timer 2 count for port 0x42 - as described above + 1001 - Set Timer 2 LSB mode for 
port 0x42 - as described above + 1010 - Set Timer 2 MSB mode for port 0x42 - as described above + 1011 - Set Timer 2 16-bit mode for port 0x42 as described above -1101 - General counter latch + 1101 - General counter latch Latch combination of counters into corresponding ports Bit 3 = Counter 2 Bit 2 = Counter 1 Bit 1 = Counter 0 Bit 0 = Unused -1110 - Latch timer status + 1110 - Latch timer status Latch combination of counter mode into corresponding ports Bit 3 = Counter 2 Bit 2 = Counter 1 @@ -177,7 +187,8 @@ Command table: Bit 3-1 = Mode Bit 0 = Binary (0) / BCD mode (1) -2.2) RTC +2.2. RTC +-------- The second device which was available in the original PC was the MC146818 real time clock. The original device is now obsolete, and usually emulated by the @@ -201,21 +212,21 @@ in progress, as indicated in the status register. The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be programmed to a 32kHz divider if the RTC is to count seconds. -This is the RAM map originally used for the RTC/CMOS: - -Location Size Description ------------------------------------------- -00h byte Current second (BCD) -01h byte Seconds alarm (BCD) -02h byte Current minute (BCD) -03h byte Minutes alarm (BCD) -04h byte Current hour (BCD) -05h byte Hours alarm (BCD) -06h byte Current day of week (BCD) -07h byte Current day of month (BCD) -08h byte Current month (BCD) -09h byte Current year (BCD) -0Ah byte Register A +This is the RAM map originally used for the RTC/CMOS:: + + Location Size Description + ------------------------------------------ + 00h byte Current second (BCD) + 01h byte Seconds alarm (BCD) + 02h byte Current minute (BCD) + 03h byte Minutes alarm (BCD) + 04h byte Current hour (BCD) + 05h byte Hours alarm (BCD) + 06h byte Current day of week (BCD) + 07h byte Current day of month (BCD) + 08h byte Current month (BCD) + 09h byte Current year (BCD) + 0Ah byte Register A bit 7 = Update in progress bit 6-4 = Divider for clock 000 = 4.194 MHz @@ -234,7 +245,7 @@ 
Location Size Description 1101 = 125 mS 1110 = 250 mS 1111 = 500 mS -0Bh byte Register B + 0Bh byte Register B bit 7 = Run (0) / Halt (1) bit 6 = Periodic interrupt enable bit 5 = Alarm interrupt enable @@ -243,19 +254,20 @@ Location Size Description bit 2 = BCD calendar (0) / Binary (1) bit 1 = 12-hour mode (0) / 24-hour mode (1) bit 0 = 0 (DST off) / 1 (DST enabled) -OCh byte Register C (read only) + OCh byte Register C (read only) bit 7 = interrupt request flag (IRQF) bit 6 = periodic interrupt flag (PF) bit 5 = alarm interrupt flag (AF) bit 4 = update interrupt flag (UF) bit 3-0 = reserved -ODh byte Register D (read only) + ODh byte Register D (read only) bit 7 = RTC has power bit 6-0 = reserved -32h byte Current century BCD (*) + 32h byte Current century BCD (*) (*) location vendor specific and now determined from ACPI global tables -2.3) APIC +2.3. APIC +--------- On Pentium and later processors, an on-board timer is available to each CPU as part of the Advanced Programmable Interrupt Controller. The APIC is @@ -276,7 +288,8 @@ timer is programmed through the LVT (local vector timer) register, is capable of one-shot or periodic operation, and is based on the bus clock divided down by the programmable divider register. -2.4) HPET +2.4. HPET +--------- HPET is quite complex, and was originally intended to replace the PIT / RTC support of the X86 PC. It remains to be seen whether that will be the case, as @@ -297,7 +310,8 @@ indicated through ACPI tables by the BIOS. Detailed specification of the HPET is beyond the current scope of this document, as it is also very well documented elsewhere. -2.5) Offboard Timers +2.5. Offboard Timers +-------------------- Several cards, both proprietary (watchdog boards) and commonplace (e1000) have timing chips built into the cards which may have registers which are accessible @@ -307,9 +321,8 @@ general frowned upon as not playing by the agreed rules of the game. 
Such a timer device would require additional support to be virtualized properly and is not considered important at this time as no known operating system does this. -========================================================================= - -3) TSC Hardware +3. TSC Hardware +=============== The TSC or time stamp counter is relatively simple in theory; it counts instruction cycles issued by the processor, which can be used as a measure of @@ -340,7 +353,8 @@ allows the guest visible TSC to be offset by a constant. Newer implementations promise to allow the TSC to additionally be scaled, but this hardware is not yet widely available. -3.1) TSC synchronization +3.1. TSC synchronization +------------------------ The TSC is a CPU-local clock in most implementations. This means, on SMP platforms, the TSCs of different CPUs may start at different times depending @@ -357,7 +371,8 @@ practice, getting a perfectly synchronized TSC will not be possible unless all values are read from the same clock, which generally only is possible on single socket systems or those with special hardware support. -3.2) TSC and CPU hotplug +3.2. TSC and CPU hotplug +------------------------ As touched on already, CPUs which arrive later than the boot time of the system may not have a TSC value that is synchronized with the rest of the system. @@ -367,7 +382,8 @@ a guarantee. This can have the effect of bringing a system from a state where TSC is synchronized back to a state where TSC synchronization flaws, however small, may be exposed to the OS and any virtualization environment. -3.3) TSC and multi-socket / NUMA +3.3. TSC and multi-socket / NUMA +-------------------------------- Multi-socket systems, especially large multi-socket systems are likely to have individual clocksources rather than a single, universally distributed clock. @@ -385,7 +401,8 @@ standards for telecommunications and computer equipment. 
It is recommended not to trust the TSCs to remain synchronized on NUMA or multiple socket systems for these reasons. -3.4) TSC and C-states +3.4. TSC and C-states +--------------------- C-states, or idling states of the processor, especially C1E and deeper sleep states may be problematic for TSC as well. The TSC may stop advancing in such @@ -396,7 +413,8 @@ based on CPU and chipset identifications. The TSC in such a case may be corrected by catching it up to a known external clocksource. -3.5) TSC frequency change / P-states +3.5. TSC frequency change / P-states +------------------------------------ To make things slightly more interesting, some CPUs may change frequency. They may or may not run the TSC at the same rate, and because the frequency change @@ -416,14 +434,16 @@ other processors. In such cases, the TSC on halted CPUs could advance faster than that of non-halted processors. AMD Turion processors are known to have this problem. -3.6) TSC and STPCLK / T-states +3.6. TSC and STPCLK / T-states +------------------------------ External signals given to the processor may also have the effect of stopping the TSC. This is typically done for thermal emergency power control to prevent an overheating condition, and typically, there is no way to detect that this condition has happened. -3.7) TSC virtualization - VMX +3.7. TSC virtualization - VMX +----------------------------- VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP instructions, which is enough for full virtualization of TSC in any manner. In @@ -431,14 +451,16 @@ addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET field specified in the VMCS. Special instructions must be used to read and write the VMCS field. -3.8) TSC virtualization - SVM +3.8. TSC virtualization - SVM +----------------------------- SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP instructions, which is enough for full virtualization of TSC in any manner. 
In addition, SVM allows passing through the host TSC plus an additional offset field specified in the SVM control block. -3.9) TSC feature bits in Linux +3.9. TSC feature bits in Linux +------------------------------ In summary, there is no way to guarantee the TSC remains in perfect synchronization unless it is explicitly guaranteed by the architecture. Even @@ -448,13 +470,16 @@ despite being locally consistent. The following feature bits are used by Linux to signal various TSC attributes, but they can only be taken to be meaningful for UP or single node systems. -X86_FEATURE_TSC : The TSC is available in hardware -X86_FEATURE_RDTSCP : The RDTSCP instruction is available -X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states -X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states -X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware) +========================= ======================================= +X86_FEATURE_TSC The TSC is available in hardware +X86_FEATURE_RDTSCP The RDTSCP instruction is available +X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states +X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states +X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware) +========================= ======================================= -4) Virtualization Problems +4. Virtualization Problems +========================== Timekeeping is especially problematic for virtualization because a number of challenges arise. The most obvious problem is that time is now shared between @@ -473,7 +498,8 @@ BIOS, but not in such an extreme fashion. However, the fact that SMM mode may cause similar problems to virtualization makes it a good justification for solving many of these problems on bare metal. -4.1) Interrupt clocking +4.1. 
Interrupt clocking +----------------------- One of the most immediate problems that occurs with legacy operating systems is that the system timekeeping routines are often designed to keep track of @@ -502,7 +528,8 @@ thus requires interrupt slewing to keep proper time. It does use a low enough rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in practice. -4.2) TSC sampling and serialization +4.2. TSC sampling and serialization +----------------------------------- As the highest precision time source available, the cycle counter of the CPU has aroused much interest from developers. As explained above, this timer has @@ -524,7 +551,8 @@ it may be necessary for an implementation to guard against "backwards" reads of the TSC as seen from other CPUs, even in an otherwise perfectly synchronized system. -4.3) Timespec aliasing +4.3. Timespec aliasing +---------------------- Additionally, this lack of serialization from the TSC poses another challenge when using results of the TSC when measured against another time source. As @@ -548,7 +576,8 @@ This aliasing requires care in the computation and recalibration of kvmclock and any other values derived from TSC computation (such as TSC virtualization itself). -4.4) Migration +4.4. Migration +-------------- Migration of a virtual machine raises problems for timekeeping in two ways. First, the migration itself may take time, during which interrupts cannot be @@ -566,7 +595,8 @@ always be caught up to the original rate. KVM clock avoids these problems by simply storing multipliers and offsets against the TSC for the guest to convert back into nanosecond resolution values. -4.5) Scheduling +4.5. 
Scheduling +--------------- Since scheduling may be based on precise timing and firing of interrupts, the scheduling algorithms of an operating system may be adversely affected by @@ -579,7 +609,8 @@ In an attempt to work around this, several implementations have provided a paravirtualized scheduler clock, which reveals the true amount of CPU time for which a virtual machine has been running. -4.6) Watchdogs +4.6. Watchdogs +-------------- Watchdog timers, such as the lock detector in Linux may fire accidentally when running under hardware virtualization due to timer interrupts being delayed or @@ -587,7 +618,8 @@ misinterpretation of the passage of real time. Usually, these warnings are spurious and can be ignored, but in some circumstances it may be necessary to disable such detection. -4.7) Delays and precision timing +4.7. Delays and precision timing +-------------------------------- Precise timing and delays may not be possible in a virtualized system. This can happen if the system is controlling physical hardware, or issues delays to @@ -600,7 +632,8 @@ The second issue may cause performance problems, but this is unlikely to be a significant issue. In many cases these delays may be eliminated through configuration or paravirtualization. -4.8) Covert channels and leaks +4.8. Covert channels and leaks +------------------------------ In addition to the above problems, time information will inevitably leak to the guest about the host in anything but a perfect implementation of virtualized diff --git a/Documentation/virt/uml/UserModeLinux-HOWTO.txt b/Documentation/virt/uml/user_mode_linux.rst index 87b80f589e1c..de0f0b2c9d5b 100644 --- a/Documentation/virt/uml/UserModeLinux-HOWTO.txt +++ b/Documentation/virt/uml/user_mode_linux.rst @@ -1,12 +1,17 @@ - User Mode Linux HOWTO - User Mode Linux Core Team - Mon Nov 18 14:16:16 EST 2002 +.. 
SPDX-License-Identifier: GPL-2.0 - This document describes the use and abuse of Jeff Dike's User Mode - Linux: a port of the Linux kernel as a normal Intel Linux process. - ______________________________________________________________________ +===================== +User Mode Linux HOWTO +===================== - Table of Contents +:Author: User Mode Linux Core Team +:Last-updated: Sat Jan 25 16:07:55 CET 2020 + +This document describes the use and abuse of Jeff Dike's User Mode +Linux: a port of the Linux kernel as a normal Intel Linux process. + + +.. Table of Contents 1. Introduction @@ -132,19 +137,19 @@ 15.5 Other contributions - ______________________________________________________________________ - - 1. Introduction +1. Introduction +================ Welcome to User Mode Linux. It's going to be fun. - 1.1. How is User Mode Linux Different? +1.1. How is User Mode Linux Different? +--------------------------------------- Normally, the Linux Kernel talks straight to your hardware (video card, keyboard, hard drives, etc), and any programs which run ask the - kernel to operate the hardware, like so: + kernel to operate the hardware, like so:: @@ -160,10 +165,10 @@ The User Mode Linux Kernel is different; instead of talking to the - hardware, it talks to a `real' Linux kernel (called the `host kernel' + hardware, it talks to a `real` Linux kernel (called the `host kernel` from now on), like any other program. Programs can then run inside User-Mode Linux as if they were running under a normal kernel, like - so: + so:: @@ -181,7 +186,8 @@ - 1.2. Why Would I Want User Mode Linux? +1.2. Why Would I Want User Mode Linux? +--------------------------------------- 1. If User Mode Linux crashes, your host kernel is still fine. @@ -204,83 +210,41 @@ +.. _Compiling_the_kernel_and_modules: - - 2. Compiling the kernel and modules +2. Compiling the kernel and modules +==================================== - 2.1. Compiling the kernel +2.1. 
Compiling the kernel +-------------------------- Compiling the user mode kernel is just like compiling any other - kernel. Let's go through the steps, using 2.4.0-prerelease (current - as of this writing) as an example: - - - 1. Download the latest UML patch from - - the download page <http://user-mode-linux.sourceforge.net/ - - In this example, the file is uml-patch-2.4.0-prerelease.bz2. + kernel. - 2. Download the matching kernel from your favourite kernel mirror, + 1. Download the latest kernel from your favourite kernel mirror, such as: - ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2 - <ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2> - . - - - 3. Make a directory and unpack the kernel into it. - + https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.14.tar.xz + 2. Make a directory and unpack the kernel into it:: host% mkdir ~/uml - - - - - host% cd ~/uml - - - - - - host% - tar -xzvf linux-2.4.0-prerelease.tar.bz2 - - - - - - - 4. Apply the patch using - - - - host% - cd ~/uml/linux - - - host% - bzcat uml-patch-2.4.0-prerelease.bz2 | patch -p1 + tar xvf linux-5.4.14.tar.xz - - - - - 5. Run your favorite config; `make xconfig ARCH=um' is the most - convenient. `make config ARCH=um' and 'make menuconfig ARCH=um' + 3. Run your favorite config; ``make xconfig ARCH=um`` is the most + convenient. ``make config ARCH=um`` and ``make menuconfig ARCH=um`` will work as well. The defaults will give you a useful kernel. If you want to change something, go ahead, it probably won't hurt anything. @@ -288,44 +252,20 @@ Note: If the host is configured with a 2G/2G address space split rather than the usual 3G/1G split, then the packaged UML binaries - will not run. They will immediately segfault. See ``UML on 2G/2G - hosts'' for the scoop on running UML on your system. - - - - 6. Finish with `make linux ARCH=um': the result is a file called - `linux' in the top directory of your source tree. 
- - Make sure that you don't build this kernel in /usr/src/linux. On some - distributions, /usr/include/asm is a link into this pool. The user- - mode build changes the other end of that link, and things that include - <asm/anything.h> stop compiling. - - The sources are also available from cvs at the project's cvs page, - which has directions on getting the sources. You can also browse the - CVS pool from there. + will not run. They will immediately segfault. See + :ref:`UML_on_2G/2G_hosts` for the scoop on running UML on your system. - If you get the CVS sources, you will have to check them out into an - empty directory. You will then have to copy each file into the - corresponding directory in the appropriate kernel pool. - If you don't have the latest kernel pool, you can get the - corresponding user-mode sources with + 4. Finish with ``make linux ARCH=um``: the result is a file called + ``linux`` in the top directory of your source tree. - host% cvs co -r v_2_3_x linux - - - - where 'x' is the version in your pool. Note that you will not get the - bug fixes and enhancements that have gone into subsequent releases. - - - 2.2. Compiling and installing kernel modules +2.2. Compiling and installing kernel modules +--------------------------------------------- UML modules are built in the same way as the native kernel (with the - exception of the 'ARCH=um' that you always need for UML): + exception of the 'ARCH=um' that you always need for UML):: host% make modules ARCH=um @@ -337,12 +277,12 @@ the user-mode pool. Modules from the native kernel won't work. You can install them by using ftp or something to copy them into the - virtual machine and dropping them into /lib/modules/`uname -r`. + virtual machine and dropping them into ``/lib/modules/$(uname -r)``. You can also get the kernel build process to install them as follows: 1. 
with the kernel not booted, mount the root filesystem in the top - level of the kernel pool: + level of the kernel pool:: host% mount root_fs mnt -o loop @@ -352,7 +292,7 @@ - 2. run + 2. run:: host% @@ -363,7 +303,7 @@ - 3. unmount the filesystem + 3. unmount the filesystem:: host% umount mnt @@ -381,27 +321,28 @@ as modules, especially filesystems and network protocols and filters, so most symbols which need to be exported probably already are. However, if you do find symbols that need exporting, let us - <http://user-mode-linux.sourceforge.net/> know, and + know at http://user-mode-linux.sourceforge.net/, and they'll be "taken care of". - 2.3. Compiling and installing uml_utilities +2.3. Compiling and installing uml_utilities +-------------------------------------------- Many features of the UML kernel require a user-space helper program, so a uml_utilities package is distributed separately from the kernel patch which provides these helpers. Included within this is: - o port-helper - Used by consoles which connect to xterms or ports + - port-helper - Used by consoles which connect to xterms or ports - o tunctl - Configuration tool to create and delete tap devices + - tunctl - Configuration tool to create and delete tap devices - o uml_net - Setuid binary for automatic tap device configuration + - uml_net - Setuid binary for automatic tap device configuration - o uml_switch - User-space virtual switch required for daemon + - uml_switch - User-space virtual switch required for daemon transport - The uml_utilities tree is compiled with: + The uml_utilities tree is compiled with:: host# @@ -423,38 +364,42 @@ - 3. Running UML and logging in +3. Running UML and logging in +============================== - 3.1. Running UML +3.1. Running UML +----------------- - It runs on 2.2.15 or later, and all 2.4 kernels. + It runs on 2.2.15 or later, and all kernel versions since 2.4. Booting UML is straightforward. 
Simply run 'linux': it will try to - mount the file `root_fs' in the current directory. You do not need to - run it as root. If your root filesystem is not named `root_fs', then - you need to put a `ubd0=root_fs_whatever' switch on the linux command + mount the file ``root_fs`` in the current directory. You do not need to + run it as root. If your root filesystem is not named ``root_fs``, then + you need to put a ``ubd0=root_fs_whatever`` switch on the linux command line. You will need a filesystem to boot UML from. There are a number - available for download from here <http://user-mode- - linux.sourceforge.net/> . There are also several tools - <http://user-mode-linux.sourceforge.net/> which can be + available for download from http://user-mode-linux.sourceforge.net. + There are also several tools at + http://user-mode-linux.sourceforge.net/ which can be used to generate UML-compatible filesystem images from media. The kernel will boot up and present you with a login prompt. - Note: If the host is configured with a 2G/2G address space split +Note: + If the host is configured with a 2G/2G address space split rather than the usual 3G/1G split, then the packaged UML binaries will - not run. They will immediately segfault. See ``UML on 2G/2G hosts'' + not run. They will immediately segfault. See :ref:`UML_on_2G/2G_hosts` for the scoop on running UML on your system. - 3.2. Logging in +3.2. Logging in +---------------- @@ -468,22 +413,22 @@ There are a couple of other ways to log in: - o On a virtual console + - On a virtual console Each virtual console that is configured (i.e. the device exists in /dev and /etc/inittab runs a getty on it) will come up in its own - xterm. If you get tired of the xterms, read ``Setting up serial - lines and consoles'' to see how to attach the consoles to - something else, like host ptys. + xterm. 
If you get tired of the xterms, read + :ref:`setting_up_serial_lines_and_consoles` to see how to attach + the consoles to something else, like host ptys. - o Over the serial line + - Over the serial line - In the boot output, find a line that looks like: + In the boot output, find a line that looks like:: @@ -493,7 +438,7 @@ Attach your favorite terminal program to the corresponding tty. I.e. - for minicom, the command would be + for minicom, the command would be:: host% minicom -o -p /dev/ttyp1 @@ -503,37 +448,40 @@ - o Over the net + - Over the net If the network is running, then you can telnet to the virtual - machine and log in to it. See ``Setting up the network'' to learn + machine and log in to it. See :ref:`Setting_up_the_network` to learn about setting up a virtual network. When you're done using it, run halt, and the kernel will bring itself down and the process will exit. - 3.3. Examples +3.3. Examples +-------------- Here are some examples of UML in action: - o A login session <http://user-mode-linux.sourceforge.net/login.html> + - A login session http://user-mode-linux.sourceforge.net/old/login.html - o A virtual network <http://user-mode-linux.sourceforge.net/net.html> + - A virtual network http://user-mode-linux.sourceforge.net/old/net.html +.. _UML_on_2G/2G_hosts: +4. UML on 2G/2G hosts +====================== - 4. UML on 2G/2G hosts - - 4.1. Introduction +4.1. Introduction +------------------ Most Linux machines are configured so that the kernel occupies the @@ -546,7 +494,8 @@ - 4.2. The problem +4.2. The problem +----------------- The prebuilt UML binaries on this site will not run on 2G/2G hosts @@ -558,13 +507,14 @@ - 4.3. The solution +4.3. The solution +------------------ The fix for this is to rebuild UML from source after enabling CONFIG_HOST_2G_2G (under 'General Setup'). This will cause UML to load itself in the top .5G of that smaller process address space, - where it will run fine. 
See ``Compiling the kernel and modules'' if + where it will run fine. See :ref:`Compiling_the_kernel_and_modules` if you need help building UML from source. @@ -573,10 +523,11 @@ +.. _setting_up_serial_lines_and_consoles: - - 5. Setting up serial lines and consoles +5. Setting up serial lines and consoles +======================================== It is possible to attach UML serial lines and consoles to many types @@ -584,22 +535,23 @@ You can attach them to host ptys, ttys, file descriptors, and ports. - This allows you to do things like + This allows you to do things like: - o have a UML console appear on an unused host console, + - have a UML console appear on an unused host console, - o hook two virtual machines together by having one attach to a pty + - hook two virtual machines together by having one attach to a pty and having the other attach to the corresponding tty - o make a virtual machine accessible from the net by attaching a + - make a virtual machine accessible from the net by attaching a console to a port on the host. - The general format of the command line option is device=channel. + The general format of the command line option is ``device=channel``. - 5.1. Specifying the device +5.1. Specifying the device +--------------------------- Devices are specified with "con" or "ssl" (console or serial line, respectively), optionally with a device number if you are talking @@ -613,7 +565,7 @@ A specific device name will override a less general "con=" or "ssl=". So, for example, you can assign a pty to each of the serial lines - except for the first two like this: + except for the first two like this:: ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1 @@ -626,13 +578,14 @@ - 5.2. Specifying the channel +5.2. Specifying the channel +---------------------------- There are a number of different types of channels to attach a UML device to, each with a different way of specifying exactly what to attach to. 
- o pseudo-terminals - device=pty pts terminals - device=pts + - pseudo-terminals - device=pty pts terminals - device=pts This will cause UML to allocate a free host pseudo-terminal for the @@ -640,23 +593,23 @@ log. You access it by attaching a terminal program to the corresponding tty: - o screen /dev/pts/n + - screen /dev/pts/n - o screen /dev/ttyxx + - screen /dev/ttyxx - o minicom -o -p /dev/ttyxx - minicom seems not able to handle pts + - minicom -o -p /dev/ttyxx - minicom seems not able to handle pts devices - o kermit - start it up, 'open' the device, then 'connect' + - kermit - start it up, 'open' the device, then 'connect' - o terminals - device=tty:tty device file + - terminals - device=tty:tty device file - This will make UML attach the device to the specified tty (i.e + This will make UML attach the device to the specified tty (i.e:: con1=tty:/dev/tty3 @@ -672,7 +625,7 @@ - o xterms - device=xterm + - xterms - device=xterm UML will run an xterm and the device will be attached to it. @@ -681,12 +634,12 @@ - o Port - device=port:port number + - Port - device=port:port number This will attach the UML devices to the specified host port. Attaching console 1 to the host's port 9000 would be done like - this: + this:: con1=port:9000 @@ -694,7 +647,7 @@ - Attaching all the serial lines to that port would be done similarly: + Attaching all the serial lines to that port would be done similarly:: ssl=port:9000 @@ -702,8 +655,8 @@ - You access these devices by telnetting to that port. Each active tel- - net session gets a different device. If there are more telnets to a + You access these devices by telnetting to that port. Each active + telnet session gets a different device. If there are more telnets to a port than UML devices attached to it, then the extra telnet sessions will block until an existing telnet detaches, or until another device becomes active (i.e. by being activated in /etc/inittab). 
@@ -725,13 +678,13 @@ - o already-existing file descriptors - device=file descriptor + - already-existing file descriptors - device=file descriptor If you set up a file descriptor on the UML command line, you can attach a UML device to it. This is most commonly used to put the main console back on stdin and stdout after assigning all the other - consoles to something else: + consoles to something else:: con0=fd:0,fd:1 con=pts @@ -743,7 +696,7 @@ - o Nothing - device=null + - Nothing - device=null This allows the device to be opened, in contrast to 'none', but @@ -754,7 +707,7 @@ - o None - device=none + - None - device=none This causes the device to disappear. @@ -762,7 +715,7 @@ You can also specify different input and output channels for a device - by putting a comma between them: + by putting a comma between them:: ssl3=tty:/dev/tty2,xterm @@ -785,14 +738,15 @@ - 5.3. Examples +5.3. Examples +-------------- There are a number of interesting things you can do with this capability. First, this is how you get rid of those bleeding console xterms by - attaching them to host ptys: + attaching them to host ptys:: con=pty con0=fd:0,fd:1 @@ -802,7 +756,7 @@ This will make a UML console take over an unused host virtual console, so that when you switch to it, you will see the UML login prompt - rather than the host login prompt: + rather than the host login prompt:: con1=tty:/dev/tty6 @@ -813,7 +767,7 @@ You can attach two virtual machines together with what amounts to a serial line as follows: - Run one UML with a serial line attached to a pty - + Run one UML with a serial line attached to a pty:: ssl1=pty @@ -825,7 +779,7 @@ that it got /dev/ptyp1). Boot the other UML with a serial line attached to the corresponding - tty - + tty:: ssl1=tty:/dev/ttyp1 @@ -838,7 +792,10 @@ prompt of the other virtual machine. - 6. Setting up the network +.. _setting_up_the_network: + +6. 
Setting up the network +========================== @@ -858,19 +815,19 @@ There are currently five transport types available for a UML virtual machine to exchange packets with other hosts: - o ethertap + - ethertap - o TUN/TAP + - TUN/TAP - o Multicast + - Multicast - o a switch daemon + - a switch daemon - o slip + - slip - o slirp + - slirp - o pcap + - pcap The TUN/TAP, ethertap, slip, and slirp transports allow a UML instance to exchange packets with the host. They may be directed @@ -893,28 +850,28 @@ With so many host transports, which one should you use? Here's when you should use each one: - o ethertap - if you want access to the host networking and it is + - ethertap - if you want access to the host networking and it is running 2.2 - o TUN/TAP - if you want access to the host networking and it is + - TUN/TAP - if you want access to the host networking and it is running 2.4. Also, the TUN/TAP transport is able to use a preconfigured device, allowing it to avoid using the setuid uml_net helper, which is a security advantage. 
- o Multicast - if you want a purely virtual network and you don't want + - Multicast - if you want a purely virtual network and you don't want to set up anything but the UML - o a switch daemon - if you want a purely virtual network and you + - a switch daemon - if you want a purely virtual network and you don't mind running the daemon in order to get somewhat better performance - o slip - there is no particular reason to run the slip backend unless + - slip - there is no particular reason to run the slip backend unless ethertap and TUN/TAP are just not available for some reason - o slirp - if you don't have root access on the host to setup + - slirp - if you don't have root access on the host to setup networking, or if you don't want to allocate an IP to your UML - o pcap - not much use for actual network connectivity, but great for + - pcap - not much use for actual network connectivity, but great for monitoring traffic on the host Ethertap is available on 2.4 and works fine. TUN/TAP is preferred @@ -926,7 +883,8 @@ exploit the helper's root privileges. - 6.1. General setup +6.1. General setup +------------------- First, you must have the virtual network enabled in your UML. If are running a prebuilt kernel from this site, everything is already @@ -938,7 +896,7 @@ The next step is to provide a network device to the virtual machine. This is done by describing it on the kernel command line. - The general format is + The general format is:: eth <n> = <transport> , <transport args> @@ -947,7 +905,7 @@ For example, a virtual ethernet device may be attached to a host - ethertap device as follows: + ethertap device as follows:: eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254 @@ -978,7 +936,7 @@ You can also add devices to a UML and remove them at runtime. See the - ``The Management Console'' page for details. + :ref:`The_Management_Console` page for details. The sections below describe this in more detail. @@ -995,7 +953,8 @@ - 6.2. Userspace daemons +6.2. 
Userspace daemons +----------------------- You will likely need the setuid helper, or the switch daemon, or both. They are both installed with the RPM and deb, so if you've installed @@ -1011,7 +970,8 @@ - 6.3. Specifying ethernet addresses +6.3. Specifying ethernet addresses +----------------------------------- Below, you will see that the TUN/TAP, ethertap, and daemon interfaces allow you to specify hardware addresses for the virtual ethernet @@ -1023,21 +983,21 @@ sufficient to guarantee a unique hardware address for the device. A couple of exceptions are: - o Another set of virtual ethernet devices are on the same network and + - Another set of virtual ethernet devices are on the same network and they are assigned hardware addresses using a different scheme which may conflict with the UML IP address-based scheme - o You aren't going to use the device for IP networking, so you don't + - You aren't going to use the device for IP networking, so you don't assign the device an IP address If you let the driver provide the hardware address, you should make sure that the device IP address is known before the interface is - brought up. So, inside UML, this will guarantee that: + brought up. So, inside UML, this will guarantee that:: - UML# - ifconfig eth0 192.168.0.250 up + UML# + ifconfig eth0 192.168.0.250 up @@ -1049,13 +1009,14 @@ - 6.4. UML interface setup +6.4. UML interface setup +------------------------- Once the network devices have been described on the command line, you should boot UML and log in. 
- The first thing to do is bring the interface up:
+ The first thing to do is bring the interface up::
  UML# ifconfig ethn ip-address up
@@ -1067,7 +1028,7 @@
  To reach the rest of the world, you should set a default route to the
- host:
+ host::
  UML# route add default gw host ip
@@ -1075,7 +1036,7 @@
- Again, with host ip of 192.168.0.4:
+ Again, with host ip of 192.168.0.4::
  UML# route add default gw 192.168.0.4
@@ -1097,29 +1058,25 @@
  Note: If you can't communicate with other hosts on your physical
  ethernet, it's probably because of a network route that's
  automatically set up. If you run 'route -n' and see a route that
- looks like this:
+ looks like this::
- Destination Gateway Genmask Flags Metric Ref Use Iface
- 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
+ Destination Gateway Genmask Flags Metric Ref Use Iface
+ 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
  with a mask that's not 255.255.255.255, then replace it with a route
- to your host:
+ to your host::
  UML# route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0
-
-
-
-
  UML# route add -host 192.168.0.4 dev eth0
@@ -1131,7 +1088,8 @@
- 6.5. Multicast
+6.5. Multicast
+---------------
  The simplest way to set up a virtual network between multiple UMLs is
  to use the mcast transport. This was written by Harald Welte and is
@@ -1142,7 +1100,7 @@
  messages when you bring the device up inside UML.
- To use it, run two UMLs with
+ To use it, run two UMLs with::
  eth0=mcast
@@ -1151,16 +1109,12 @@
  on their command lines. Log in, configure the ethernet device in each
- machine with different IP addresses:
+ machine with different IP addresses::
  UML1# ifconfig eth0 192.168.0.254
-
-
-
-
  UML2# ifconfig eth0 192.168.0.253
@@ -1168,7 +1122,7 @@
  and they should be able to talk to each other.
- The full set of command line options for this transport are
+ The full set of command line options for this transport are::
@@ -1177,16 +1131,11 @@
-
- Harald's original README is here <http://user-mode-linux.source-
- forge.net/> and explains these in detail, as well as
- some other issues.
-
  There is also a related point-to-point only "ucast" transport.
  This is useful when your network does not support multicast, and
  all network connections are simple point to point links.
- The full set of command line options for this transport are
+ The full set of command line options for this transport are::
  ethn=ucast,ethernet address,remote address,listen port,remote port
@@ -1194,7 +1143,8 @@
- 6.6. TUN/TAP with the uml_net helper
+6.6. TUN/TAP with the uml_net helper
+-------------------------------------
  TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the
  host. The TUN/TAP backend has been in UML since 2.4.9-3um.
@@ -1216,7 +1166,7 @@
  kernel or as the tun.o module.
  The format of the command line switch to attach a device to a TUN/TAP
- device is
+ device is::
  eth <n> =tuntap,,, <IP address>
@@ -1226,7 +1176,7 @@
  For example, this argument will attach the UML's eth0 to the next
  available tap device and assign an ethernet address to it based on its
- IP address
+ IP address::
  eth0=tuntap,,,192.168.0.254
@@ -1247,10 +1197,10 @@
  There are a couple potential problems with running the TUN/TAP
  transport on a 2.4 host kernel
- o TUN/TAP seems not to work on 2.4.3 and earlier. Upgrade the host
+ - TUN/TAP seems not to work on 2.4.3 and earlier. Upgrade the host
  kernel or use the ethertap transport.
- o With an upgraded kernel, TUN/TAP may fail with
+ - With an upgraded kernel, TUN/TAP may fail with::
  File descriptor in bad state
@@ -1263,13 +1213,12 @@
  make sure that /usr/src/linux points to the headers for the running
  kernel.
- These were pointed out by Tim Robinson <timro at trkr dot net> in
- <http://www.geocrawler.com/> name="this uml-
- user post"> .
+ These were pointed out by Tim Robinson <timro at trkr dot net> in the past.
- 6.7. TUN/TAP with a preconfigured tap device
+6.7. TUN/TAP with a preconfigured tap device
+---------------------------------------------
  If you prefer not to have UML use uml_net (which is somewhat
  insecure), with UML 2.4.17-11, you can set up a TUN/TAP device
@@ -1277,8 +1226,8 @@
  there is no need for root assistance. Setting up the device is done as
  follows:
- o Create the device with tunctl (available from the UML utilities
- tarball)
+ - Create the device with tunctl (available from the UML utilities
+ tarball)::
@@ -1291,8 +1240,8 @@
  where uid is the user id or username that UML will be run as. This
  will tell you what device was created.
- o Configure the device IP (change IP addresses and device name to
- suit)
+ - Configure the device IP (change IP addresses and device name to
+ suit)::
@@ -1303,8 +1252,8 @@
- o Set up routing and arping if desired - this is my recipe, there are
- other ways of doing the same thing
+ - Set up routing and arping if desired - this is my recipe, there are
+ other ways of doing the same thing::
  host#
@@ -1313,19 +1262,9 @@
  host# route add -host 192.168.0.253 dev tap0
-
-
-
-
-
  host# bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp'
-
-
-
-
-
  host# arp -Ds 192.168.0.253 eth0 pub
@@ -1338,76 +1277,43 @@
  utility which reads the information from a config file and sets up
  devices at boot time.
- o Rather than using up two IPs and ARPing for one of them, you can
+ - Rather than using up two IPs and ARPing for one of them, you can
  also provide direct access to your LAN by the UML by using a
- bridge.
+ bridge::
  host# brctl addbr br0
-
-
-
-
  host# ifconfig eth0 0.0.0.0 promisc up
-
-
-
-
  host# ifconfig tap0 0.0.0.0 promisc up
-
-
-
-
  host# ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
-
-
-
-
-
- host#
- brctl stp br0 off
-
-
-
-
+ host#
+ brctl stp br0 off
  host# brctl setfd br0 1
-
-
-
-
  host# brctl sethello br0 1
-
-
-
-
  host# brctl addif br0 eth0
-
-
-
-
  host# brctl addif br0 tap0
@@ -1417,12 +1323,12 @@
  Note that 'br0' should be setup using ifconfig with the existing IP
  address of eth0, as eth0 no longer has its own IP.
- o
+ -
  Also, the /dev/net/tun device must be writable by the user running
  UML in order for the UML to use the device that's been configured
- for it. The simplest thing to do is
+ for it. The simplest thing to do is::
  host# chmod 666 /dev/net/tun
@@ -1438,14 +1344,14 @@
  devices and chgrp /dev/net/tun to that group with mode 664 or 660.
- o Once the device is set up, run UML with 'eth0=tuntap,device name'
+ - Once the device is set up, run UML with 'eth0=tuntap,device name'
  (i.e. 'eth0=tuntap,tap0') on the command line (or do it with the
  mconsole config command).
- o Bring the eth device up in UML and you're in business.
+ - Bring the eth device up in UML and you're in business.
  If you don't want that tap device any more, you can make it non-
- persistent with
+ persistent with::
  host# tunctl -d tap device
@@ -1455,7 +1361,7 @@
  Finally, tunctl has a -b (for brief mode) switch which causes it to
  output only the name of the tap device it created. This makes it
- suitable for capture by a script:
+ suitable for capture by a script::
  host# TAP=`tunctl -u 1000 -b`
@@ -1465,7 +1371,8 @@
- 6.8. Ethertap
+6.8. Ethertap
+--------------
  Ethertap is the general mechanism on 2.2 for userspace processes to
  exchange packets with the kernel.
@@ -1473,7 +1380,7 @@
  To use this transport, you need to describe the virtual network device
- on the UML command line. The general format for this is
+ on the UML command line. The general format for this is::
  eth <n> =ethertap, <device> , <ethernet address> , <tap IP address>
@@ -1481,7 +1388,7 @@
- So, the previous example
+ So, the previous example::
  eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
@@ -1521,7 +1428,7 @@
  If you want to set things up yourself, you need to make sure that the
  appropriate /dev entry exists. If it doesn't, become root and create
- it as follows:
+ it as follows::
  mknod /dev/tap <minor> c 36 <minor> + 16
@@ -1529,7 +1436,7 @@
- For example, this is how to create /dev/tap0:
+ For example, this is how to create /dev/tap0::
  mknod /dev/tap0 c 36 0 + 16
@@ -1539,7 +1446,7 @@
  You also need to make sure that the host kernel has ethertap support.
  If ethertap is enabled as a module, you apparently need to insmod
- ethertap once for each ethertap device you want to enable. So,
+ ethertap once for each ethertap device you want to enable. So,::
  host#
@@ -1549,7 +1456,7 @@
  will give you the tap0 interface. To get the tap1 interface, you need
- to run
+ to run::
  host#
@@ -1561,7 +1468,8 @@
- 6.9. The switch daemon
+6.9. The switch daemon
+-----------------------
  Note: This is the daemon formerly known as uml_router, but which was
  renamed so the network weenies of the world would stop growling at me.
@@ -1577,7 +1485,7 @@
  sockets.
- If you want it to listen on a different pair of sockets, use
+ If you want it to listen on a different pair of sockets, use::
  -unix control socket data socket
@@ -1586,7 +1494,7 @@
- If you want it to act as a hub rather than a switch, use
+ If you want it to act as a hub rather than a switch, use::
  -hub
@@ -1596,7 +1504,7 @@
  If you want the switch to be connected to host networking (allowing
- the umls to get access to the outside world through the host), use
+ the umls to get access to the outside world through the host), use::
  -tap tap0
@@ -1610,7 +1518,7 @@
  device than tap0, specify that instead of tap0.
- uml_switch can be backgrounded as follows
+ uml_switch can be backgrounded as follows::
  host%
@@ -1623,7 +1531,7 @@
  stdin for EOF. When it sees that, it exits.
- The general format of the kernel command line switch is
+ The general format of the kernel command line switch is::
@@ -1639,7 +1547,8 @@
  how to communicate with the daemon. You should only specify them if
  you told the daemon to use different sockets than the default. So, if
  you ran the daemon with no arguments, running the UML on the same
- machine with
+ machine with::
+
  eth0=daemon
@@ -1649,7 +1558,8 @@
- 6.10. Slip
+6.10. Slip
+-----------
  Slip is another, less general, mechanism for a process to communicate
  with the host networking. In contrast to the ethertap interface,
@@ -1658,7 +1568,7 @@
  IP.
- The general format of the command line switch is
+ The general format of the command line switch is::
@@ -1681,7 +1591,8 @@
- 6.11. Slirp
+6.11. Slirp
+------------
  slirp uses an external program, usually /usr/bin/slirp, to provide IP
  only networking connectivity through the host. This is similar to IP
@@ -1691,7 +1602,7 @@
  root access or setuid binaries on the host.
- The general format of the command line switch for slirp is:
+ The general format of the command line switch for slirp is::
@@ -1716,7 +1627,7 @@
  The eth0 interface on UML should be set up with the IP 10.2.0.15,
  although you can use anything as long as it is not used by a network
  you will be connecting to. The default route on UML should be set to
- use
+ use::
  UML#
@@ -1737,10 +1648,11 @@
- 6.12. pcap
+6.12. pcap
+-----------
  The pcap transport is attached to a UML ethernet device on the command
- line or with uml_mconsole with the following syntax:
+ line or with uml_mconsole with the following syntax::
@@ -1762,7 +1674,7 @@
  expression optimizer is used.
- Example:
+ Example::
@@ -1777,7 +1689,8 @@
- 6.13. Setting up the host yourself
+6.13. Setting up the host yourself
+-----------------------------------
  If you don't specify an address for the host side of the ethertap or
  slip device, UML won't do any setup on the host. So this is what is
@@ -1785,19 +1698,15 @@
  192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
  own network):
- o The device needs to be configured with its IP address. Tap devices
+ - The device needs to be configured with its IP address. Tap devices
  are also configured with an mtu of 1484. Slip devices are
  configured with a point-to-point address pointing at the UML ip
- address.
+ address::
  host# ifconfig tap0 arp mtu 1484 192.168.0.251 up
-
-
-
-
  host# ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
@@ -1805,7 +1714,7 @@
- o If a tap device is being set up, a route is set to the UML IP.
+ - If a tap device is being set up, a route is set to the UML IP::
  UML# route add -host 192.168.0.250 gw 192.168.0.251
@@ -1814,8 +1723,8 @@
- o To allow other hosts on your network to see the virtual machine,
- proxy arp is set up for it.
+ - To allow other hosts on your network to see the virtual machine,
+ proxy arp is set up for it::
  host# arp -Ds 192.168.0.250 eth0 pub
@@ -1824,7 +1733,7 @@
- o Finally, the host is set up to route packets.
+ - Finally, the host is set up to route packets::
  host# echo 1 > /proc/sys/net/ipv4/ip_forward
@@ -1838,12 +1747,14 @@
- 7. Sharing Filesystems between Virtual Machines
+7. Sharing Filesystems between Virtual Machines
+================================================
- 7.1. A warning
+7.1. A warning
+---------------
  Don't attempt to share filesystems simply by booting two UMLs from the
  same file. That's the same thing as booting two physical machines
@@ -1851,7 +1762,8 @@
- 7.2. Using layered block devices
+7.2. Using layered block devices
+---------------------------------
  The way to share a filesystem between two virtual machines is to use
  the copy-on-write (COW) layering capability of the ubd block driver.
@@ -1872,7 +1784,7 @@
  To add a copy-on-write layer to an existing block device file, simply
- add the name of the COW file to the appropriate ubd switch:
+ add the name of the COW file to the appropriate ubd switch::
  ubd0=root_fs_cow,root_fs_debian_22
@@ -1883,7 +1795,7 @@
  where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is
  the existing shared filesystem. The COW file need not exist. If it
  doesn't, the driver will create and initialize it. Once the COW file
- has been initialized, it can be used on its own on the command line:
+ has been initialized, it can be used on its own on the command line::
  ubd0=root_fs_cow
@@ -1896,14 +1808,16 @@
- 7.3. Note!
+7.3. Note!
+-----------
  When checking the size of the COW file in order to see the gobs of
  space that you're saving, make sure you use 'ls -ls' to see the actual
  disk consumption rather than the length of the file. The COW file is
  sparse, so the length will be very different from the disk usage.
  Here is a 'ls -l' of a COW file and backing file from one boot and
- shutdown:
+ shutdown::
+
  host% ls -l cow.debian debian2.2
  -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian
  -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2
@@ -1911,7 +1825,7 @@
- Doesn't look like much saved space, does it? Well, here's 'ls -ls':
+ Doesn't look like much saved space, does it? Well, here's 'ls -ls'::
  host% ls -ls cow.debian debian2.2
@@ -1926,7 +1840,8 @@
- 7.4. Another warning
+7.4. Another warning
+---------------------
  Once a filesystem is being used as a readonly backing file for a COW
  file, do not boot directly from it or modify it in any way. Doing so
@@ -1952,7 +1867,8 @@
- 7.5. uml_moo : Merging a COW file with its backing file
+7.5. uml_moo : Merging a COW file with its backing file
+--------------------------------------------------------
  Depending on how you use UML and COW devices, it may be advisable to
  merge the changes in the COW file into the backing file every once in
@@ -1961,7 +1877,7 @@
- The utility that does this is uml_moo. Its usage is
+ The utility that does this is uml_moo. Its usage is::
  host% uml_moo COW file new backing file
@@ -1991,8 +1907,8 @@
  uml_moo is installed with the UML deb and RPM. If you didn't install
  UML from one of those packages, you can also get it from the UML
- utilities <http://user-mode-linux.sourceforge.net/
- utilities> tar file in tools/moo.
+ utilities http://user-mode-linux.sourceforge.net/utilities tar file
+ in tools/moo.
@@ -2001,7 +1917,8 @@
- 8. Creating filesystems
+8. Creating filesystems
+========================
  You may want to create and mount new UML filesystems, either because
@@ -2015,13 +1932,14 @@
  should be easy to translate to the filesystem of your choice.
- 8.1. Create the filesystem file
+8.1. Create the filesystem file
+================================
  dd is your friend. All you need to do is tell dd to create an empty
  file of the appropriate size. I usually make it sparse to save time
  and to avoid allocating disk space until it's actually used. For
  example, the following command will create a sparse 100 meg file full
- of zeroes.
+ of zeroes::
  host%
@@ -2034,9 +1952,9 @@
  8.2. Assign the file to a UML device
- Add an argument like the following to the UML command line:
+ Add an argument like the following to the UML command line::
- ubd4=new_filesystem
+ ubd4=new_filesystem
@@ -2053,7 +1971,7 @@
  etc), then get them into UML by way of the net or hostfs.
- Make the new filesystem on the device assigned to the new file:
+ Make the new filesystem on the device assigned to the new file::
  host# mkreiserfs /dev/ubd/4
@@ -2077,7 +1995,7 @@
- Now, mount it:
+ Now, mount it::
  UML#
@@ -2096,7 +2014,8 @@
- 9. Host file access
+9. Host file access
+====================
  If you want to access files on the host machine from inside UML, you
@@ -2112,10 +2031,11 @@
  files contained in it just as you would on the host.
- 9.1. Using hostfs
+9.1. Using hostfs
+------------------
  To begin with, make sure that hostfs is available inside the virtual
- machine with
+ machine with::
  UML# cat /proc/filesystems
@@ -2127,7 +2047,7 @@
  module and available inside the virtual machine, and insmod it.
- Now all you need to do is run mount:
+ Now all you need to do is run mount::
  UML# mount none /mnt/host -t hostfs
@@ -2139,7 +2059,7 @@
  If you don't want to mount the host root directory, then you can
- specify a subdirectory to mount with the -o switch to mount:
+ specify a subdirectory to mount with the -o switch to mount::
  UML# mount none /mnt/home -t hostfs -o /home
@@ -2151,13 +2071,14 @@
- 9.2. hostfs as the root filesystem
+9.2. hostfs as the root filesystem
+-----------------------------------
  It's possible to boot from a directory hierarchy on the host using
  hostfs rather than using the standard filesystem in a file.
  To start, you need that hierarchy. The easiest way is to loop mount
- an existing root_fs file:
+ an existing root_fs file::
  host# mount root_fs uml_root_dir -o loop
@@ -2166,15 +2087,15 @@
  You need to change the filesystem type of / in etc/fstab to be
- 'hostfs', so that line looks like this:
+ 'hostfs', so that line looks like this::
- /dev/ubd/0 / hostfs defaults 1 1
+ /dev/ubd/0 / hostfs defaults 1 1
  Then you need to chown to yourself all the files in that directory
- that are owned by root. This worked for me:
+ that are owned by root. This worked for me::
  host# find . -uid 0 -exec chown jdike {} \;
@@ -2183,7 +2104,7 @@
  Next, make sure that your UML kernel has hostfs compiled in, not as a
- module. Then run UML with the boot device pointing at that directory:
+ module. Then run UML with the boot device pointing at that directory::
  ubd0=/path/to/uml/root/directory
@@ -2194,41 +2115,35 @@
  UML should then boot as it does normally.
- 9.3. Building hostfs
+9.3. Building hostfs
+---------------------
  If you need to build hostfs because it's not in your kernel, you have
  two choices:
- o Compiling hostfs into the kernel:
+ - Compiling hostfs into the kernel:
  Reconfigure the kernel and set the 'Host filesystem' option under
- o Compiling hostfs as a module:
+ - Compiling hostfs as a module:
  Reconfigure the kernel and set the 'Host filesystem' option under
  be in arch/um/fs/hostfs/hostfs.o. Install that in
- /lib/modules/`uname -r`/fs in the virtual machine, boot it up, and
+ ``/lib/modules/$(uname -r)/fs`` in the virtual machine, boot it up, and::
  UML# insmod hostfs
+.. _The_Management_Console:
-
-
-
-
-
-
-
-
-
- 10. The Management Console
+10. The Management Console
+===========================
@@ -2240,15 +2155,15 @@
  There are a number of things you can do with the mconsole interface:
- o get the kernel version
+ - get the kernel version
- o add and remove devices
+ - add and remove devices
- o halt or reboot the machine
+ - halt or reboot the machine
- o Send SysRq commands
+ - Send SysRq commands
- o Pause and resume the UML
+ - Pause and resume the UML
  You need the mconsole client (uml_mconsole) which is present in CVS
@@ -2257,7 +2172,7 @@
  You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
- When you boot UML, you'll see a line like:
+ When you boot UML, you'll see a line like::
  mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
@@ -2265,7 +2180,7 @@
- If you specify a unique machine id one the UML command line, i.e.
+ If you specify a unique machine id one the UML command line, i.e.::
  umid=debian
@@ -2273,7 +2188,7 @@
- you'll see this
+ you'll see this::
  mconsole initialized on /home/jdike/.uml/debian/mconsole
@@ -2282,7 +2197,7 @@
  That file is the socket that uml_mconsole will use to communicate with
- UML. Run it with either the umid or the full path as its argument:
+ UML. Run it with either the umid or the full path as its argument::
  host% uml_mconsole debian
@@ -2290,7 +2205,7 @@
- or
+ or::
  host% uml_mconsole /home/jdike/.uml/debian/mconsole
@@ -2300,30 +2215,31 @@
  You'll get a prompt, at which you can run one of these commands:
- o version
+ - version
- o halt
+ - halt
- o reboot
+ - reboot
- o config
+ - config
- o remove
+ - remove
- o sysrq
+ - sysrq
- o help
+ - help
- o cad
+ - cad
- o stop
+ - stop
- o go
+ - go
- 10.1. version
+10.1. version
+--------------
- This takes no arguments. It prints the UML version.
+ This takes no arguments. It prints the UML version::
  (mconsole) version
@@ -2342,11 +2258,12 @@
- 10.2. halt and reboot
+10.2. halt and reboot
+----------------------
  These take no arguments. They shut the machine down immediately, with
  no syncing of disks and no clean shutdown of userspace. So, they are
- pretty close to crashing the machine.
+ pretty close to crashing the machine::
  (mconsole) halt
@@ -2357,34 +2274,36 @@
- 10.3. config
+10.3. config
+-------------
  "config" adds a new device to the virtual machine. Currently the ubd
  and network drivers support this. It takes one argument, which is the
- device to add, with the same syntax as the kernel command line.
+ device to add, with the same syntax as the kernel command line::
- (mconsole)
- config ubd3=/home/jdike/incoming/roots/root_fs_debian22
+ (mconsole)
+ config ubd3=/home/jdike/incoming/roots/root_fs_debian22
- OK
- (mconsole) config eth1=mcast
- OK
+ OK
+ (mconsole) config eth1=mcast
+ OK
- 10.4. remove
+10.4. remove
+-------------
  "remove" deletes a device from the system.
  Its argument is just the name of the device to be removed. The device
  must be idle in whatever sense the driver considers necessary. In the
  case of the ubd driver, the removed block device must not be mounted,
  swapped on, or otherwise
- open, and in the case of the network driver, the device must be down.
+ open, and in the case of the network driver, the device must be down::
  (mconsole) remove ubd3
@@ -2397,7 +2316,8 @@
- 10.5. sysrq
+10.5. sysrq
+------------
  This takes one argument, which is a single letter. It calls the
  generic kernel's SysRq driver, which does whatever is called for by
@@ -2407,19 +2327,21 @@
- 10.6. help
+10.6. help
+-----------
  "help" returns a string listing the valid commands and what each one
  does.
- 10.7. cad
+10.7. cad
+----------
  This invokes the Ctl-Alt-Del action on init. What exactly this ends
  up doing is up to /etc/inittab. Normally, it reboots the machine.
  With UML, this is usually not desired, so if a halt would be better,
- then find the section of inittab that looks like this
+ then find the section of inittab that looks like this::
  # What to do when CTRL-ALT-DEL is pressed.
@@ -2432,7 +2354,8 @@
- 10.8. stop
+10.8. stop
+-----------
  This puts the UML in a loop reading mconsole requests until a 'go'
  mconsole command is received. This is very useful for making backups
@@ -2448,7 +2371,8 @@
- 10.9. go
+10.9. go
+---------
  This resumes a UML after being paused by a 'stop' command. Note that
  when the UML has resumed, TCP connections may have timed out and if
@@ -2460,9 +2384,10 @@
+.. _Kernel_debugging:
-
- 11. Kernel debugging
+11. Kernel debugging
+=====================
  Note: The interface that makes debugging, as described here, possible
@@ -2477,15 +2402,16 @@
  In order to debug the kernel, you need build it from source. See
- ``Compiling the kernel and modules'' for information on doing that.
+ :ref:`Compiling_the_kernel_and_modules` for information on doing that.
  Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
- the config. These will compile the kernel with -g, and enable the
+ the config. These will compile the kernel with ``-g``, and enable the
  ptrace proxy so that gdb works with UML, respectively.
- 11.1. Starting the kernel under gdb
+11.1. Starting the kernel under gdb
+------------------------------------
  You can have the kernel running under the control of gdb from the
  beginning by putting 'debug' on the command line. You will get an
@@ -2498,7 +2424,11 @@
  There is a transcript of a debugging session here <debug-
  session.html> , with breakpoints being set in the scheduler and in an
  interrupt handler.
- 11.2. Examining sleeping processes
+
+
+11.2. Examining sleeping processes
+-----------------------------------
+
  Not every bug is evident in the currently running process. Sometimes,
  processes hang in the kernel when they shouldn't because they've
@@ -2516,7 +2446,7 @@
  Now what you do is this:
- o detach from the current thread
+ - detach from the current thread::
  (UML gdb) det
@@ -2525,7 +2455,7 @@
- o attach to the thread you are interested in
+ - attach to the thread you are interested in::
  (UML gdb) att <host pid>
@@ -2534,7 +2464,7 @@
- o look at its stack and anything else of interest
+ - look at its stack and anything else of interest::
  (UML gdb) bt
@@ -2545,18 +2475,14 @@
  Note that you can't do anything at this point that requires that a
  process execute, e.g. calling a function
- o when you're done looking at that process, reattach to the current
- thread and continue it
+ - when you're done looking at that process, reattach to the current
+ thread and continue it::
  (UML gdb) att 1
-
-
-
-
  (UML gdb) c
@@ -2569,12 +2495,13 @@
- 11.3. Running ddd on UML
+11.3. Running ddd on UML
+-------------------------
  ddd works on UML, but requires a special kludge.
  The process goes like this:
- o Start ddd
+ - Start ddd::
  host% ddd linux
@@ -2583,14 +2510,14 @@
- o With ps, get the pid of the gdb that ddd started. You can ask the
+ - With ps, get the pid of the gdb that ddd started. You can ask the
  gdb to tell you, but for some reason that confuses things and
  causes a hang.
- o run UML with 'debug=parent gdb-pid=<pid>' added to the command line
+ - run UML with 'debug=parent gdb-pid=<pid>' added to the command line
  - it will just sit there after you hit return
- o type 'att 1' to the ddd gdb and you will see something like
+ - type 'att 1' to the ddd gdb and you will see something like::
  0xa013dc51 in __kill ()
@@ -2602,12 +2529,14 @@
- o At this point, type 'c', UML will boot up, and you can use ddd just
+ - At this point, type 'c', UML will boot up, and you can use ddd just
  as you do on any other process.
- 11.4. Debugging modules
+11.4. Debugging modules
+------------------------
+
  gdb has support for debugging code which is dynamically loaded into
  the process. This support is what is needed to debug kernel modules
@@ -2629,7 +2558,8 @@
  First, you must tell it where your modules are. There is a list in
- the script that looks like this:
+ the script that looks like this::
+
  set MODULE_PATHS {
  "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
  "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
@@ -2641,9 +2571,7 @@
  You change that to list the names and paths of the modules that you
  are going to debug. Then you run it from the toplevel directory of
- your UML pool and it basically tells you what to do:
-
-
+ your UML pool and it basically tells you what to do::
  ******** GDB pid is 21903 ********
@@ -2666,7 +2594,7 @@
  After you run UML and it sits there doing nothing, you hit return at
- the 'att 1' and continue it:
+ the 'att 1' and continue it::
  Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1
@@ -2678,63 +2606,48 @@
  At this point, you debug normally.
  When you insmod something, the
- expect magic will kick in and you'll see something like:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- *** Module hostfs loaded ***
- Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
- mod_user=0x8070e00) at module.c:349
- 349 char *name, *n_name, *name_tmp = NULL;
- (UML gdb) finish
- Run till exit from #0 sys_init_module (name_user=0x805abb0 "hostfs",
- mod_user=0x8070e00) at module.c:349
- 0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
- 411 else res = EXECUTE_SYSCALL(syscall, regs);
- Value returned is $1 = 0
- (UML gdb)
- p/x (int)module_list + module_list->size_of_struct
-
- $2 = 0xa9021054
- (UML gdb) symbol-file ./linux
- Load new symbol table from "./linux"? (y or n) y
- Reading symbols from ./linux...
- done.
- (UML gdb)
- add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
-
- add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
- .text_addr = 0xa9021054
- (y or n) y
-
- Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
- done.
- (UML gdb) p *module_list
- $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
- size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
- nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
- init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
- ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
- persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
- kallsyms_end = 0x0,
- archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
- archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
- kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
- >> Finished loading symbols for hostfs ...
+ expect magic will kick in and you'll see something like::
+
+
+ *** Module hostfs loaded ***
+ Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
+ mod_user=0x8070e00) at module.c:349
+ 349 char *name, *n_name, *name_tmp = NULL;
+ (UML gdb) finish
+ Run till exit from #0 sys_init_module (name_user=0x805abb0 "hostfs",
+ mod_user=0x8070e00) at module.c:349
+ 0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
+ 411 else res = EXECUTE_SYSCALL(syscall, regs);
+ Value returned is $1 = 0
+ (UML gdb)
+ p/x (int)module_list + module_list->size_of_struct
+
+ $2 = 0xa9021054
+ (UML gdb) symbol-file ./linux
+ Load new symbol table from "./linux"? (y or n) y
+ Reading symbols from ./linux...
+ done.
+ (UML gdb)
+ add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
+
+ add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
+ .text_addr = 0xa9021054
+ (y or n) y
+
+ Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
+ done.
+ (UML gdb) p *module_list
+ $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
+ size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
+ nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
+ init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
+ ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
+ persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
+ kallsyms_end = 0x0,
+ archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
+ archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
+ kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
+ >> Finished loading symbols for hostfs ...
@@ -2744,7 +2657,7 @@
  Boot the kernel under the debugger and load the module with insmod or
- modprobe. With gdb, do:
+ modprobe. With gdb, do::
  (UML gdb) p module_list
@@ -2758,12 +2671,12 @@
  the name fields until find the module you want to debug.
  Take the address of that structure, and add module.size_of_struct
  (which in 2.4.10 kernels is 96 (0x60)) to it. Gdb can make this hard
  addition
- for you :-):
+ for you :-)::
- (UML gdb)
- printf "%#x\n", (int)module_list module_list->size_of_struct
+ (UML gdb)
+ printf "%#x\n", (int)module_list module_list->size_of_struct
@@ -2771,7 +2684,7 @@
  The offset from the module start occasionally changes (before 2.4.0,
  it was module.size_of_struct + 4), so it's a good idea to check the
  init and cleanup addresses once in a while, as describe below. Now
- do:
+ do::
  (UML gdb)
@@ -2786,7 +2699,7 @@
  If there's any doubt that you got the offset right, like breakpoints
  appear not to work, or they're appearing in the wrong place, you can
  check it by looking at the module structure. The init and cleanup
- fields should look like:
+ fields should look like::
  init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs>
@@ -2801,7 +2714,7 @@
  When you want to load in a new version of the module, you need to get
  gdb to forget about the old one. The only way I've found to do that
- is to tell gdb to forget about all symbols that it knows about:
+ is to tell gdb to forget about all symbols that it knows about::
  (UML gdb) symbol-file
@@ -2809,7 +2722,7 @@
- Then reload the symbols from the kernel binary:
+ Then reload the symbols from the kernel binary::
  (UML gdb) symbol-file /path/to/kernel
@@ -2823,17 +2736,19 @@
- 11.5. Attaching gdb to the kernel
+11.5. Attaching gdb to the kernel
+----------------------------------
  If you don't have the kernel running under gdb, you can attach gdb to
  it later by sending the tracing thread a SIGUSR1.
The first line of - the console output identifies its pid: + the console output identifies its pid:: + tracing thread pid = 20093 - When you send it the signal: + When you send it the signal:: host% kill -USR1 20093 @@ -2845,7 +2760,7 @@ If you have the mconsole compiled into UML, then the mconsole client - can be used to start gdb: + can be used to start gdb:: (mconsole) (mconsole) config gdb=xterm @@ -2857,7 +2772,8 @@ - 11.6. Using alternate debuggers +11.6. Using alternate debuggers +-------------------------------- UML has support for attaching to an already running debugger rather than starting gdb itself. This is present in CVS as of 17 Apr 2001. @@ -2886,7 +2802,7 @@ An example of an alternate debugger is strace. You can strace the actual kernel as follows: - o Run the following in a shell + - Run the following in a shell:: host% @@ -2894,13 +2810,13 @@ - o Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out + - Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out by the previous command - o Hit return in the shell, and UML will start running, and strace + - Hit return in the shell, and UML will start running, and strace output will start accumulating in the output file. - Note that this is different from running + Note that this is different from running:: host% strace ./linux @@ -2917,95 +2833,57 @@ - 12. Kernel debugging examples +12. Kernel debugging examples +============================== - 12.1. The case of the hung fsck +12.1. The case of the hung fsck +-------------------------------- When booting up the kernel, fsck failed, and dropped me into a shell - to fix things up. I ran fsck -y, which hung: - - - - - - - - - - - - - - - - - - - - - - - - - + to fix things up. I ran fsck -y, which hung:: + Setting hostname uml [ OK ] + Checking root filesystem + /dev/fhd0 was not cleanly unmounted, check forced. 
+ Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. + /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. + (i.e., without -a or -p options) + [ FAILED ] + *** An error occurred during the file system check. + *** Dropping you to a shell; the system will reboot + *** when you leave the shell. + Give root password for maintenance + (or type Control-D for normal startup): + [root@uml /root]# fsck -y /dev/fhd0 + fsck -y /dev/fhd0 + Parallelizing fsck version 1.14 (9-Jan-1999) + e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 + /dev/fhd0 contains a file system with errors, check forced. + Pass 1: Checking inodes, blocks, and sizes + Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes + Inode 19780, i_blocks is 1548, should be 540. Fix? yes + Pass 2: Checking directory structure + Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes + Directory inode 11858, block 0, offset 0: directory corrupted + Salvage? yes + Missing '.' in directory inode 11858. + Fix? yes - - - Setting hostname uml [ OK ] - Checking root filesystem - /dev/fhd0 was not cleanly unmounted, check forced. - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. - - /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. - (i.e., without -a or -p options) - [ FAILED ] - - *** An error occurred during the file system check. - *** Dropping you to a shell; the system will reboot - *** when you leave the shell. 
- Give root password for maintenance - (or type Control-D for normal startup): - - [root@uml /root]# fsck -y /dev/fhd0 - fsck -y /dev/fhd0 - Parallelizing fsck version 1.14 (9-Jan-1999) - e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 - /dev/fhd0 contains a file system with errors, check forced. - Pass 1: Checking inodes, blocks, and sizes - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes - - Inode 19780, i_blocks is 1548, should be 540. Fix? yes - - Pass 2: Checking directory structure - Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes - - Directory inode 11858, block 0, offset 0: directory corrupted - Salvage? yes - - Missing '.' in directory inode 11858. - Fix? yes - - Missing '..' in directory inode 11858. - Fix? yes - - - + Missing '..' in directory inode 11858. + Fix? yes The standard drill in this sort of situation is to fire up gdb on the signal thread, which, in this case, was pid 1935. In another window, - I run gdb and attach pid 1935. - - + I run gdb and attach pid 1935:: ~/linux/2.3.26/um 1016: gdb linux @@ -3022,11 +2900,7 @@ 0x100756d9 in __wait4 () - - - - - Let's see what's currently running: + Let's see what's currently running:: @@ -3041,7 +2915,7 @@ reason and never woke up. - Let's guess that the last process in the process list is fsck: + Let's guess that the last process in the process list is fsck:: @@ -3052,7 +2926,7 @@ - It is, so let's see what it thinks it's up to: + It is, so let's see what it thinks it's up to:: @@ -3068,8 +2942,6 @@ - - The interesting things here are the fact that its .thread.syscall.id is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or the defines in include/asm-um/arch/unistd.h), and that it never @@ -3081,30 +2953,20 @@ The fact that it never returned from write means that its stack should be fairly interesting. 
Its pid is 1980 (.thread.extern_pid). That process is being ptraced by the signal thread, so it must be detached - before gdb can attach it: - - - - - - + before gdb can attach it:: + (gdb) call detach(1980) - (gdb) call detach(1980) - - Program received signal SIGSEGV, Segmentation fault. - <function called from gdb> - The program being debugged stopped while in a function called from GDB. - When the function (detach) is done executing, GDB will silently - stop (instead of continuing to evaluate the expression containing - the function call). - (gdb) call detach(1980) - $15 = 0 - - - + Program received signal SIGSEGV, Segmentation fault. + <function called from gdb> + The program being debugged stopped while in a function called from GDB. + When the function (detach) is done executing, GDB will silently + stop (instead of continuing to evaluate the expression containing + the function call). + (gdb) call detach(1980) + $15 = 0 The first detach segfaults for some reason, and the second one @@ -3112,7 +2974,7 @@ Now I detach from the signal thread, attach to the fsck thread, and - look at its stack: + look at its stack:: (gdb) det @@ -3152,14 +3014,14 @@ - The interesting things here are : + The interesting things here are: - o There are two segfaults on this stack (frames 9 and 14) + - There are two segfaults on this stack (frames 9 and 14) - o The first faulting address (frame 11) is 0x50000800 + - The first faulting address (frame 11) is 0x50000800:: - (gdb) p (void *)1342179328 - $16 = (void *) 0x50000800 + (gdb) p (void *)1342179328 + $16 = (void *) 0x50000800 @@ -3175,7 +3037,7 @@ However, the more immediate problem is that second segfault and I'm going to concentrate on that. 
First, I want to see where the fault - happened, so I have to go look at the sigcontent struct in frame 8: + happened, so I have to go look at the sigcontent struct in frame 8:: @@ -3211,7 +3073,7 @@ - That's not very useful, so I'll try a more manual method: + That's not very useful, so I'll try a more manual method:: (gdb) p *((struct sigcontext *) (&sig + 1)) @@ -3224,7 +3086,7 @@ - The ip is in handle_mm_fault: + The ip is in handle_mm_fault:: (gdb) p (void *)268480945 @@ -3236,7 +3098,7 @@ - Specifically, it's in pte_alloc: + Specifically, it's in pte_alloc:: (gdb) i line *$20 @@ -3249,7 +3111,7 @@ To find where in handle_mm_fault this is, I'll jump forward in the - code until I see an address in that procedure: + code until I see an address in that procedure:: @@ -3286,21 +3148,21 @@ Something is apparently wrong with the page tables or vma_structs, so - lets go back to frame 11 and have a look at them: + lets go back to frame 11 and have a look at them:: - #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50 - 50 handle_mm_fault(current, vma, address, is_write); - (gdb) call pgd_offset_proc(vma->vm_mm, address) - $22 = (pgd_t *) 0x80a548c + #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50 + 50 handle_mm_fault(current, vma, address, is_write); + (gdb) call pgd_offset_proc(vma->vm_mm, address) + $22 = (pgd_t *) 0x80a548c That's pretty bogus. Page tables aren't supposed to be in process - text or data areas. Let's see what's in the vma: + text or data areas. Let's see what's in the vma:: (gdb) p *vma @@ -3325,12 +3187,9 @@ - - This also pretty bogus. With all of the 0x80xxxxx and 0xaffffxxx addresses, this is looking like a stack was plonked down on top of - these structures. Maybe it's a stack overflow from the next page: - + these structures. 
Maybe it's a stack overflow from the next page:: (gdb) p vma @@ -3338,52 +3197,36 @@ - - That's towards the lower quarter of the page, so that would have to - have been pretty heavy stack overflow: - - - - - - - - - - - - - - - (gdb) x/100x $25 - 0x507d2434: 0x507d2434 0x00000000 0x08048000 0x080a4f8c - 0x507d2444: 0x00000000 0x080a79e0 0x080a8c94 0x080d1000 - 0x507d2454: 0xaffffdb0 0xaffffe63 0xaffffe7a 0xaffffe7a - 0x507d2464: 0xafffffec 0x00000062 0x0000008a 0x00000000 - 0x507d2474: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2484: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2494: 0x00000000 0x00000000 0x507d2fe0 0x00000000 - 0x507d24a4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24b4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24c4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24d4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24e4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24f4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2504: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2514: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2524: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2534: 0x00000000 0x00000000 0x507d25dc 0x00000000 - 0x507d2544: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2554: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2564: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2574: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2584: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2594: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d25a4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d25b4: 0x00000000 0x00000000 0x00000000 0x00000000 - - + have been pretty heavy stack overflow:: + + + (gdb) x/100x $25 + 0x507d2434: 0x507d2434 0x00000000 0x08048000 0x080a4f8c + 0x507d2444: 0x00000000 0x080a79e0 0x080a8c94 0x080d1000 + 0x507d2454: 0xaffffdb0 0xaffffe63 0xaffffe7a 0xaffffe7a + 0x507d2464: 0xafffffec 0x00000062 0x0000008a 0x00000000 + 
0x507d2474: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2484: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2494: 0x00000000 0x00000000 0x507d2fe0 0x00000000 + 0x507d24a4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d24b4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d24c4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d24d4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d24e4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d24f4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2504: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2514: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2524: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2534: 0x00000000 0x00000000 0x507d25dc 0x00000000 + 0x507d2544: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2554: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2564: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2574: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2584: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d2594: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d25a4: 0x00000000 0x00000000 0x00000000 0x00000000 + 0x507d25b4: 0x00000000 0x00000000 0x00000000 0x00000000 @@ -3399,65 +3242,53 @@ on will be somewhat clearer. - 12.2. Episode 2: The case of the hung fsck +12.2. Episode 2: The case of the hung fsck +------------------------------------------- After setting a trap in the SEGV handler for accesses to the signal thread's stack, I reran the kernel. - fsck hung again, this time by hitting the trap: - - + fsck hung again, this time by hitting the trap:: + Setting hostname uml [ OK ] + Checking root filesystem + /dev/fhd0 contains a file system with errors, check forced. + Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. + /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. 
+ (i.e., without -a or -p options) + [ FAILED ] + *** An error occurred during the file system check. + *** Dropping you to a shell; the system will reboot + *** when you leave the shell. + Give root password for maintenance + (or type Control-D for normal startup): + [root@uml /root]# fsck -y /dev/fhd0 + fsck -y /dev/fhd0 + Parallelizing fsck version 1.14 (9-Jan-1999) + e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 + /dev/fhd0 contains a file system with errors, check forced. + Pass 1: Checking inodes, blocks, and sizes + Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes + Pass 2: Checking directory structure + Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes + Directory inode 11858, block 0, offset 0: directory corrupted + Salvage? yes + Missing '.' in directory inode 11858. + Fix? yes + Missing '..' in directory inode 11858. + Fix? yes - - - - Setting hostname uml [ OK ] - Checking root filesystem - /dev/fhd0 contains a file system with errors, check forced. - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. - - /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. - (i.e., without -a or -p options) - [ FAILED ] - - *** An error occurred during the file system check. - *** Dropping you to a shell; the system will reboot - *** when you leave the shell. - Give root password for maintenance - (or type Control-D for normal startup): - - [root@uml /root]# fsck -y /dev/fhd0 - fsck -y /dev/fhd0 - Parallelizing fsck version 1.14 (9-Jan-1999) - e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 - /dev/fhd0 contains a file system with errors, check forced. 
- Pass 1: Checking inodes, blocks, and sizes - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes - - Pass 2: Checking directory structure - Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes - - Directory inode 11858, block 0, offset 0: directory corrupted - Salvage? yes - - Missing '.' in directory inode 11858. - Fix? yes - - Missing '..' in directory inode 11858. - Fix? yes - - Untested (4127) [100fe44c]: trap_kern.c line 31 + Untested (4127) [100fe44c]: trap_kern.c line 31 @@ -3465,7 +3296,7 @@ I need to get the signal thread to detach from pid 4127 so that I can attach to it with gdb. This is done by sending it a SIGUSR1, which is - caught by the signal thread, which detaches the process: + caught by the signal thread, which detaches the process:: kill -USR1 4127 @@ -3474,31 +3305,20 @@ - Now I can run gdb on it: - - - - - - - + Now I can run gdb on it:: - - - - - ~/linux/2.3.26/um 1034: gdb linux - GNU gdb 4.17.0.11 with Linux support - Copyright 1998 Free Software Foundation, Inc. - GDB is free software, covered by the GNU General Public License, and you are - welcome to change it and/or distribute copies of it under certain conditions. - Type "show copying" to see the conditions. - There is absolutely no warranty for GDB. Type "show warranty" for details. - This GDB was configured as "i386-redhat-linux"... - (gdb) att 4127 - Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127 - 0x10075891 in __libc_nanosleep () + ~/linux/2.3.26/um 1034: gdb linux + GNU gdb 4.17.0.11 with Linux support + Copyright 1998 Free Software Foundation, Inc. + GDB is free software, covered by the GNU General Public License, and you are + welcome to change it and/or distribute copies of it under certain conditions. + Type "show copying" to see the conditions. + There is absolutely no warranty for GDB. 
Type "show warranty" for details. + This GDB was configured as "i386-redhat-linux"... + (gdb) att 4127 + Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127 + 0x10075891 in __libc_nanosleep () @@ -3506,7 +3326,7 @@ The backtrace shows that it was in a write and that the fault address (address in frame 3) is 0x50000800, which is right in the middle of - the signal thread's stack page: + the signal thread's stack page:: (gdb) bt @@ -3540,58 +3360,48 @@ - - Going up the stack to the segv_handler frame and looking at where in the code the access happened shows that it happened near line 110 of - block_dev.c: - - - - - - - - - - (gdb) up - #1 0x1007584d in __sleep (seconds=1000000) - at ../sysdeps/unix/sysv/linux/sleep.c:78 - ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory. - (gdb) - #2 0x1006ce9a in stop () at user_util.c:191 - 191 while(1) sleep(1000000); - (gdb) - #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31 - 31 KERN_UNTESTED(); - (gdb) - #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174 - 174 segv(sc->cr2, sc->err & 2); - (gdb) p *sc - $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, - __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484, - esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14, - err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070, - esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0, - cr2 = 1342179328} - (gdb) p (void *)268550834 - $2 = (void *) 0x1001c2b2 - (gdb) i sym $2 - block_write + 1090 in section .text - (gdb) i line *$2 - Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h" - starts at address 0x1001c2a1 <block_write+1073> - and ends at 0x1001c2bf <block_write+1103>. - (gdb) i line *0x1001c2c0 - Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103> - and ends at 0x1001c2e3 <block_write+1139>. 
- - + block_dev.c:: + + + + (gdb) up + #1 0x1007584d in __sleep (seconds=1000000) + at ../sysdeps/unix/sysv/linux/sleep.c:78 + ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory. + (gdb) + #2 0x1006ce9a in stop () at user_util.c:191 + 191 while(1) sleep(1000000); + (gdb) + #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31 + 31 KERN_UNTESTED(); + (gdb) + #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174 + 174 segv(sc->cr2, sc->err & 2); + (gdb) p *sc + $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, + __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484, + esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14, + err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070, + esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0, + cr2 = 1342179328} + (gdb) p (void *)268550834 + $2 = (void *) 0x1001c2b2 + (gdb) i sym $2 + block_write + 1090 in section .text + (gdb) i line *$2 + Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h" + starts at address 0x1001c2a1 <block_write+1073> + and ends at 0x1001c2bf <block_write+1103>. + (gdb) i line *0x1001c2c0 + Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103> + and ends at 0x1001c2e3 <block_write+1139>. Looking at the source shows that the fault happened during a call to - copy_from_user to copy the data into the kernel: + copy_from_user to copy the data into the kernel:: 107 count -= chars; @@ -3601,10 +3411,8 @@ - - p is the pointer which must contain 0x50000800, since buf contains - 0x80b8800 (frame 8 above). It is defined as: + 0x80b8800 (frame 8 above). 
It is defined as:: p = offset + bh->b_data; @@ -3615,24 +3423,22 @@ I need to figure out what bh is, and it just so happens that bh is passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a - few lines later, so I do a little disassembly: - - + few lines later, so I do a little disassembly:: - (gdb) disas 0x1001c2bf 0x1001c2e0 - Dump of assembler code from 0x1001c2bf to 0x1001c2d0: - 0x1001c2bf <block_write+1103>: addl %eax,0xc(%ebp) - 0x1001c2c2 <block_write+1106>: movl 0xfffffdd4(%ebp),%edx - 0x1001c2c8 <block_write+1112>: btsl $0x0,0x18(%edx) - 0x1001c2cd <block_write+1117>: btsl $0x1,0x18(%edx) - 0x1001c2d2 <block_write+1122>: sbbl %ecx,%ecx - 0x1001c2d4 <block_write+1124>: testl %ecx,%ecx - 0x1001c2d6 <block_write+1126>: jne 0x1001c2e3 <block_write+1139> - 0x1001c2d8 <block_write+1128>: pushl $0x0 - 0x1001c2da <block_write+1130>: pushl %edx - 0x1001c2db <block_write+1131>: call 0x1001819c <__mark_buffer_dirty> - End of assembler dump. + (gdb) disas 0x1001c2bf 0x1001c2e0 + Dump of assembler code from 0x1001c2bf to 0x1001c2d0: + 0x1001c2bf <block_write+1103>: addl %eax,0xc(%ebp) + 0x1001c2c2 <block_write+1106>: movl 0xfffffdd4(%ebp),%edx + 0x1001c2c8 <block_write+1112>: btsl $0x0,0x18(%edx) + 0x1001c2cd <block_write+1117>: btsl $0x1,0x18(%edx) + 0x1001c2d2 <block_write+1122>: sbbl %ecx,%ecx + 0x1001c2d4 <block_write+1124>: testl %ecx,%ecx + 0x1001c2d6 <block_write+1126>: jne 0x1001c2e3 <block_write+1139> + 0x1001c2d8 <block_write+1128>: pushl $0x0 + 0x1001c2da <block_write+1130>: pushl %edx + 0x1001c2db <block_write+1131>: call 0x1001819c <__mark_buffer_dirty> + End of assembler dump. 
@@ -3640,7 +3446,7 @@ At that point, bh is in %edx (address 0x1001c2da), which is calculated at 0x1001c2c2 as %ebp + 0xfffffdd4, so I figure exactly what that is, - taking %ebp from the sigcontext_struct above: + taking %ebp from the sigcontext_struct above:: (gdb) p (void *)1342631484 @@ -3657,7 +3463,7 @@ Now, I look at the structure to see what's in it, and particularly, - what its b_data field contains: + what its b_data field contains:: (gdb) p *((struct buffer_head *)0x50100200) @@ -3682,18 +3488,18 @@ The b_page field is a pointer to the page_struct representing the 0x50000000 page. Looking at it shows the kernel's idea of the state - of that page: + of that page:: - (gdb) p *$13.b_page - $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0, - index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = { - next = 0x50008460, prev = 0x50019350}, wait = { - lock = <optimized out or zero length>, task_list = {next = 0x50004024, - prev = 0x50004024}, __magic = 1342193708, __creator = 0}, - pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280, - zone = 0x100c5160} + (gdb) p *$13.b_page + $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0, + index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = { + next = 0x50008460, prev = 0x50019350}, wait = { + lock = <optimized out or zero length>, task_list = {next = 0x50004024, + prev = 0x50004024}, __magic = 1342193708, __creator = 0}, + pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280, + zone = 0x100c5160} @@ -3702,7 +3508,7 @@ Some sanity-checking: the virtual field shows the "virtual" address of this page, which in this kernel is the same as its "physical" address, and the page_struct itself should be mem_map[0], since it represents - the first page of memory: + the first page of memory:: @@ -3719,7 +3525,7 @@ Now to check out the page_struct itself. 
In particular, the flags - field shows whether the page is considered free or not: + field shows whether the page is considered free or not:: (gdb) p (void *)132 @@ -3739,7 +3545,7 @@ In my setup_arch procedure, I have the following code which looks just - fine: + fine:: @@ -3762,7 +3568,7 @@ Stepping into init_bootmem, and looking at bootmem_map before looking - at what it contains shows the following: + at what it contains shows the following:: @@ -3788,18 +3594,20 @@ - 13. What to do when UML doesn't work +13. What to do when UML doesn't work +===================================== - 13.1. Strange compilation errors when you build from source +13.1. Strange compilation errors when you build from source +------------------------------------------------------------ As of test11, it is necessary to have "ARCH=um" in the environment or on the make command line for all steps in building UML, including clean, distclean, or mrproper, config, menuconfig, or xconfig, dep, and linux. If you forget for any of them, the i386 build seems to - contaminate the UML build. If this happens, start from scratch with + contaminate the UML build. If this happens, start from scratch with:: host% @@ -3811,7 +3619,7 @@ and repeat the build process with ARCH=um on all the steps. - See ``Compiling the kernel and modules'' for more details. + See :ref:`Compiling_the_kernel_and_modules` for more details. Another cause of strange compilation errors is building UML in @@ -3824,11 +3632,11 @@ - 13.3. A variety of panics and hangs with /tmp on a reiserfs filesys- - tem +13.3. A variety of panics and hangs with /tmp on a reiserfs filesystem +----------------------------------------------------------------------- I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27. - Panics preceded by + Panics preceded by:: Detaching pid nnnn @@ -3854,17 +3662,19 @@ - 13.5. UML doesn't work when /tmp is an NFS filesystem +13.5. 
UML doesn't work when /tmp is an NFS filesystem +------------------------------------------------------ This seems to be a similar situation with the ReiserFS problem above. Some versions of NFS seems not to handle mmap correctly, which UML depends on. The workaround is have /tmp be a non-NFS directory. - 13.6. UML hangs on boot when compiled with gprof support +13.6. UML hangs on boot when compiled with gprof support +--------------------------------------------------------- If you build UML with gprof support and, early in the boot, it does - this + this:: kernel BUG at page_alloc.c:100! @@ -3878,10 +3688,11 @@ - 13.7. syslogd dies with a SIGTERM on startup +13.7. syslogd dies with a SIGTERM on startup +--------------------------------------------- The exact boot error depends on the distribution that you're booting, - but Debian produces this: + but Debian produces this:: /etc/rc2.d/S10sysklogd: line 49: 93 Terminated @@ -3891,23 +3702,21 @@ This is a syslogd bug. There's a race between a parent process - installing a signal handler and its child sending the signal. See - this uml-devel post <http://www.geocrawler.com/lists/3/Source- - Forge/709/0/6612801> for the details. + installing a signal handler and its child sending the signal. - 13.8. TUN/TAP networking doesn't work on a 2.4 host +13.8. TUN/TAP networking doesn't work on a 2.4 host +---------------------------------------------------- - There are a couple of problems which were - <http://www.geocrawler.com/lists/3/SourceForge/597/0/> name="pointed - out"> by Tim Robinson <timro at trkr dot net> + There are a couple of problems which were reported by + Tim Robinson <timro at trkr dot net> - o It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier. + - It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier. The fix is to upgrade to something more recent and then read the next item. - o If you see + - If you see:: File descriptor in bad state @@ -3921,8 +3730,8 @@ - 13.9. 
You can network to the host but not to other machines on the - net +13.9. You can network to the host but not to other machines on the net +======================================================================= If you can connect to the host, and the host can connect to UML, but you cannot connect to any other machines, then you may need to enable @@ -3930,7 +3739,7 @@ using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML networking, rather than the public address space that your host is connected to. UML does not enable IP Masquerading, so you will need - to create a static rule to enable it: + to create a static rule to enable it:: host% @@ -3944,11 +3753,11 @@ Documentation on IP Masquerading, and SNAT, can be found at - www.netfilter.org <http://www.netfilter.org> . + http://www.netfilter.org. If you can reach the local net, but not the outside Internet, then - that is usually a routing problem. The UML needs a default route: + that is usually a routing problem. The UML needs a default route:: UML# @@ -3972,7 +3781,8 @@ - 13.10. I have no root and I want to scream +13.10. I have no root and I want to scream +=========================================== Thanks to Birgit Wahlich for telling me about this strange one. It turns out that there's a limit of six environment variables on the @@ -3987,14 +3797,16 @@ - 13.11. UML build conflict between ptrace.h and ucontext.h +13.11. UML build conflict between ptrace.h and ucontext.h +========================================================== On some older systems, /usr/include/asm/ptrace.h and /usr/include/sys/ucontext.h define the same names. So, when they're included together, the defines from one completely mess up the parsing - of the other, producing errors like: + of the other, producing errors like:: + /usr/include/sys/ucontext.h:47: parse error before - `10' + `10` @@ -4007,7 +3819,8 @@ - 13.12. The UML BogoMips is exactly half the host's BogoMips +13.12. 
The UML BogoMips is exactly half the host's BogoMips +------------------------------------------------------------ On i386 kernels, there are two ways of running the loop that is used to calculate the BogoMips rating, using the TSC if it's there or using @@ -4019,15 +3832,17 @@ - 13.13. When you run UML, it immediately segfaults +13.13. When you run UML, it immediately segfaults +-------------------------------------------------- If the host is configured with the 2G/2G address space split, that's - why. See ``UML on 2G/2G hosts'' for the details on getting UML to + why. See ref:`UML_on_2G/2G_hosts` for the details on getting UML to run on your host. - 13.14. xterms appear, then immediately disappear +13.14. xterms appear, then immediately disappear +------------------------------------------------- If you're running an up to date kernel with an old release of uml_utilities, the port-helper program will not work properly, so @@ -4039,7 +3854,8 @@ - 13.15. Any other panic, hang, or strange behavior +13.15. Any other panic, hang, or strange behavior +-------------------------------------------------- If you're seeing truly strange behavior, such as hangs or panics that happen in random places, or you try running the debugger to see what's @@ -4057,9 +3873,13 @@ it and that a fix is imminent. - If you want to be super-helpful, read ``Diagnosing Problems'' and + If you want to be super-helpful, read :ref:`Diagnosing_Problems` and follow the instructions contained therein. - 14. Diagnosing Problems + +.. _Diagnosing_Problems: + +14. Diagnosing Problems +======================== If you get UML to crash, hang, or otherwise misbehave, you should @@ -4074,21 +3894,22 @@ For any diagnosis, you're going to need to build a debugging kernel. The binaries from this site aren't debuggable. If you haven't done - this before, read about ``Compiling the kernel and modules'' and - ``Kernel debugging'' UML first. 
+  this before, read about :ref:`Compiling_the_kernel_and_modules` and
+  :ref:`Kernel_debugging` UML first.
-  14.1. Case 1 : Normal kernel panics
+14.1. Case 1 : Normal kernel panics
+------------------------------------
   The most common case is for a normal thread to panic. To debug this,
   you will need to run it under the debugger (add 'debug' to the command
   line). An xterm will start up with gdb running inside it. Continue
-  it when it stops in start_kernel and make it crash. Now ^C gdb and
+  it when it stops in start_kernel and make it crash. Now ``^C gdb`` and
   If the panic was a "Kernel mode fault", then there will be a segv
   frame on the stack and I'm going to want some more information. The
-  stack might look something like this:
+  stack might look something like this::
        (UML gdb) backtrace
@@ -4107,7 +3928,7 @@
   I'm going to want to see the symbol and line information for the value
-  of ip in the segv frame. In this case, you would do the following:
+  of ip in the segv frame. In this case, you would do the following::
        (UML gdb) i sym 268849158
@@ -4115,7 +3936,7 @@
-  and
+  and::
        (UML gdb) i line *268849158
@@ -4128,7 +3949,8 @@
   to get that information from the faulting ip.
-  14.2. Case 2 : Tracing thread panics
+14.2. Case 2 : Tracing thread panics
+-------------------------------------
   The less common and more painful case is when the tracing thread
   panics. In this case, the kernel debugger will be useless because it
@@ -4136,7 +3958,7 @@
   do is get a backtrace from the tracing thread. This is done by
   figuring out what its pid is, firing up gdb, and attaching it to that
   pid. You can figure out the tracing thread pid by looking at the
-  first line of the console output, which will look like this:
+  first line of the console output, which will look like this::
        tracing thread pid = 15851
@@ -4145,7 +3967,7 @@
   or by running ps on the host and finding the line that looks like
-  this:
+  this::
        jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]
@@ -4164,7 +3986,7 @@
   14.3. Case 3 : Tracing thread panics caused by other threads
   However, there are cases where the misbehavior of another thread
-  caused the problem. The most common panic of this type is:
+  caused the problem. The most common panic of this type is::
        wait_for_stop failed to wait for <pid> to stop with <signal number>
@@ -4177,7 +3999,7 @@
   debugger is defunct and without some fancy footwork, another gdb can't
   attach to it. So, this is how the fancy footwork goes:
-  In a shell:
+  In a shell::
        host% kill -STOP pid
@@ -4185,7 +4007,7 @@
-  Run gdb on the tracing thread as described in case 2 and do:
+  Run gdb on the tracing thread as described in case 2 and do::
        (host gdb) call detach(pid)
@@ -4193,7 +4015,7 @@
   If you get a segfault, do it again. It always works the second time.
-  Detach from the tracing thread and attach to that other thread:
+  Detach from the tracing thread and attach to that other thread::
        (host gdb) detach
@@ -4209,7 +4031,7 @@
   If gdb hangs when attaching to that process, go back to a shell and
-  do:
+  do::
        host%
@@ -4218,7 +4040,7 @@
-  And then get the backtrace:
+  And then get the backtrace::
        (host gdb) backtrace
@@ -4227,13 +4049,14 @@
-  14.4. Case 4 : Hangs
+14.4. Case 4 : Hangs
+---------------------
   Hangs seem to be fairly rare, but they sometimes happen. When a hang
   happens, we need a backtrace from the offending process. Run the kernel
   debugger as described in case 1 and get a backtrace. If the current
   process is not the idle thread, then send in the backtrace.
-  You can tell that it's the idle thread if the stack looks like this:
+  You can tell that it's the idle thread if the stack looks like this::
        #0 0x100b1401 in __libc_nanosleep ()
@@ -4257,7 +4080,8 @@
-  15. Thanks
+15. Thanks
+===========
   A number of people have helped this project in various ways, and this
@@ -4274,20 +4098,21 @@
   bookkeeping lapses and I forget about contributions.
-  15.1. Code and Documentation
+15.1. Code and Documentation
+-----------------------------
   Rusty Russell <rusty at linuxcare.com.au> -
-  o wrote the HOWTO <http://user-mode-
-  linux.sourceforge.net/UserModeLinux-HOWTO.html>
+  - wrote the HOWTO
+    http://user-mode-linux.sourceforge.net/old/UserModeLinux-HOWTO.html
-  o prodded me into making this project official and putting it on
+  - prodded me into making this project official and putting it on
     SourceForge
-  o came up with the way cool UML logo <http://user-mode-
-  linux.sourceforge.net/uml-small.png>
+  - came up with the way cool UML logo
+    http://user-mode-linux.sourceforge.net/uml-small.png
-  o redid the config process
+  - redid the config process
   Peter Moulder <reiter at netspace.net.au> - Fixed my config and build
@@ -4296,34 +4121,32 @@
   Bill Stearns <wstearns at pobox.com> -
-  o HOWTO updates
+  - HOWTO updates
-  o lots of bug reports
+  - lots of bug reports
-  o lots of testing
+  - lots of testing
-  o dedicated a box (uml.ists.dartmouth.edu) to support UML development
+  - dedicated a box (uml.ists.dartmouth.edu) to support UML development
-  o wrote the mkrootfs script, which allows bootable filesystems of
+  - wrote the mkrootfs script, which allows bootable filesystems of
     RPM-based distributions to be cranked out
-  o cranked out a large number of filesystems with said script
+  - cranked out a large number of filesystems with said script
   Jim Leu <jleu at mindspring.com> - Wrote the virtual ethernet driver
   and associated usermode tools
-  Lars Brinkhoff <http://lars.nocrew.org/> - Contributed the ptrace
-  proxy from his own project <http://a386.nocrew.org/> to allow easier
-  kernel debugging
+  Lars Brinkhoff http://lars.nocrew.org/ - Contributed the ptrace
+  proxy from his own project to allow easier kernel debugging
   Andrea Arcangeli <andrea at suse.de> - Redid some of the early boot
   code so that it would work on machines with Large File Support
-  Chris Emerson <http://www.chiark.greenend.org.uk/~cemerson/> - Did
-  the first UML port to Linux/ppc
+  Chris Emerson - Did the first UML port to Linux/ppc
   Harald Welte <laforge at gnumonks.org> - Wrote the multicast
@@ -4338,7 +4161,7 @@
   wrote the iomem emulation support
-  Henrik Nordstrom <http://hem.passagen.se/hno/> - Provided a variety
+  Henrik Nordstrom http://hem.passagen.se/hno/ - Provided a variety
   of patches, fixes, and clues
@@ -4373,190 +4196,193 @@
   submitted patches for the slip transport and lots of other things.
-  David Coulson <http://davidcoulson.net> -
+  David Coulson http://davidcoulson.net -
-  o Set up the usermodelinux.org <http://usermodelinux.org> site,
+  - Set up the http://usermodelinux.org site,
     which is a great way of keeping the UML user community on top of
     UML goings-on.
-  o Site documentation and updates
+  - Site documentation and updates
-  o Nifty little UML management daemon UMLd
-  <http://uml.openconsultancy.com/umld/>
+  - Nifty little UML management daemon UMLd
-  o Lots of testing and bug reports
+  - Lots of testing and bug reports
-  15.2. Flushing out bugs
+15.2. Flushing out bugs
+------------------------
-  o Yuri Pudgorodsky
+  - Yuri Pudgorodsky
-  o Gerald Britton
+  - Gerald Britton
-  o Ian Wehrman
+  - Ian Wehrman
-  o Gord Lamb
+  - Gord Lamb
-  o Eugene Koontz
+  - Eugene Koontz
-  o John H. Hartman
+  - John H. Hartman
-  o Anders Karlsson
+  - Anders Karlsson
-  o Daniel Phillips
+  - Daniel Phillips
-  o John Fremlin
+  - John Fremlin
-  o Rainer Burgstaller
+  - Rainer Burgstaller
-  o James Stevenson
+  - James Stevenson
-  o Matt Clay
+  - Matt Clay
-  o Cliff Jefferies
+  - Cliff Jefferies
-  o Geoff Hoff
+  - Geoff Hoff
-  o Lennert Buytenhek
+  - Lennert Buytenhek
-  o Al Viro
+  - Al Viro
-  o Frank Klingenhoefer
+  - Frank Klingenhoefer
-  o Livio Baldini Soares
+  - Livio Baldini Soares
-  o Jon Burgess
+  - Jon Burgess
-  o Petru Paler
+  - Petru Paler
-  o Paul
+  - Paul
-  o Chris Reahard
+  - Chris Reahard
-  o Sverker Nilsson
+  - Sverker Nilsson
-  o Gong Su
+  - Gong Su
-  o johan verrept
+  - johan verrept
-  o Bjorn Eriksson
+  - Bjorn Eriksson
-  o Lorenzo Allegrucci
+  - Lorenzo Allegrucci
-  o Muli Ben-Yehuda
+  - Muli Ben-Yehuda
-  o David Mansfield
+  - David Mansfield
-  o Howard Goff
+  - Howard Goff
-  o Mike Anderson
+  - Mike Anderson
-  o John Byrne
+  - John Byrne
-  o Sapan J. Batia
+  - Sapan J. Batia
-  o Iris Huang
+  - Iris Huang
-  o Jan Hudec
+  - Jan Hudec
-  o Voluspa
+  - Voluspa
-  15.3. Buglets and clean-ups
+15.3. Buglets and clean-ups
+----------------------------
-  o Dave Zarzycki
+  - Dave Zarzycki
-  o Adam Lazur
+  - Adam Lazur
-  o Boria Feigin
+  - Boria Feigin
-  o Brian J. Murrell
+  - Brian J. Murrell
-  o JS
+  - JS
-  o Roman Zippel
+  - Roman Zippel
-  o Wil Cooley
+  - Wil Cooley
-  o Ayelet Shemesh
+  - Ayelet Shemesh
-  o Will Dyson
+  - Will Dyson
-  o Sverker Nilsson
+  - Sverker Nilsson
-  o dvorak
+  - dvorak
-  o v.naga srinivas
+  - v.naga srinivas
-  o Shlomi Fish
+  - Shlomi Fish
-  o Roger Binns
+  - Roger Binns
-  o johan verrept
+  - johan verrept
-  o MrChuoi
+  - MrChuoi
-  o Peter Cleve
+  - Peter Cleve
-  o Vincent Guffens
+  - Vincent Guffens
-  o Nathan Scott
+  - Nathan Scott
-  o Patrick Caulfield
+  - Patrick Caulfield
-  o jbearce
+  - jbearce
-  o Catalin Marinas
+  - Catalin Marinas
-  o Shane Spencer
+  - Shane Spencer
-  o Zou Min
+  - Zou Min
-  o Ryan Boder
+  - Ryan Boder
-  o Lorenzo Colitti
+  - Lorenzo Colitti
-  o Gwendal Grignou
+  - Gwendal Grignou
-  o Andre' Breiler
+  - Andre' Breiler
-  o Tsutomu Yasuda
+  - Tsutomu Yasuda
-  15.4. Case Studies
+15.4. Case Studies
+-------------------
-  o Jon Wright
+  - Jon Wright
-  o William McEwan
+  - William McEwan
-  o Michael Richardson
+  - Michael Richardson
-  15.5. Other contributions
+15.5. Other contributions
+--------------------------
   Bill Carr <Bill.Carr at compaq.com> made the Red Hat mkrootfs script
   work with RH 6.2.
   Michael Jennings <mikejen at hevanet.com> sent in some material which
-  is now gracing the top of the index page <http://user-mode-
-  linux.sourceforge.net/> of this site.
+  is now gracing the top of the index page
+  http://user-mode-linux.sourceforge.net/ of this site.
-  SGI <http://www.sgi.com> (and more specifically Ralf Baechle <ralf at
-  uni-koblenz.de> ) gave me an account on oss.sgi.com
-  <http://www.oss.sgi.com> . The bandwidth there made it possible to
+  SGI (and more specifically Ralf Baechle <ralf at
+  uni-koblenz.de> ) gave me an account on oss.sgi.com.
+  The bandwidth there made it possible to
   produce most of the filesystems available on the project download
   page.
@@ -4573,17 +4399,5 @@
   Chris Reahard built a specialized root filesystem for running a DNS
   server jailed inside UML. It's available from the download
-  <http://user-mode-linux.sourceforge.net/dl-sf.html> page in the Jail
+  http://user-mode-linux.sourceforge.net/old/dl-sf.html page in the Jail
   Filesystems section.
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst
index c9c201596c3e..fa7ddc0428c8 100644
--- a/Documentation/x86/boot.rst
+++ b/Documentation/x86/boot.rst
@@ -490,15 +490,11 @@ Protocol: 2.00+
         kernel) to not write early messages that require
         accessing the display hardware directly.
-  Bit 6 (write): KEEP_SEGMENTS
+  Bit 6 (obsolete): KEEP_SEGMENTS
         Protocol: 2.07+
-        - If 0, reload the segment registers in the 32bit entry point.
-        - If 1, do not reload the segment registers in the 32bit entry point.
-
-          Assume that %cs %ds %ss %es are all set to flat segments with
-          a base of 0 (or the equivalent for their environment).
+        - This flag is obsolete.
   Bit 7 (write): CAN_USE_HEAP
diff --git a/Documentation/x86/exception-tables.rst b/Documentation/x86/exception-tables.rst
index ed6d4b0cf62c..de58110c5ffd 100644
--- a/Documentation/x86/exception-tables.rst
+++ b/Documentation/x86/exception-tables.rst
@@ -257,6 +257,9 @@ the fault, in our case the actual value is c0199ff5:
 the original assembly code: > 3: movl $-14,%eax
 and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
+If the fixup was able to handle the exception, control flow may be returned
+to the instruction after the one that triggered the fault, ie. local label 2b.
+
 The assembly code::
  > .section __ex_table,"a"
@@ -337,10 +340,15 @@ pointer which points to one of:
    entry->insn. It is used to distinguish page faults from
    machine check.
-3) ``int ex_handler_ext(const struct exception_table_entry *fixup)``
-   This case is used for uaccess_err ... we need to set a flag
-   in the task structure. Before the handler functions existed this
-   case was handled by adding a large offset to the fixup to tag
-   it as special.
-
 More functions can easily be added.
+
+CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post
+link of the kernel image, via a host utility scripts/sorttable. It will set the
+symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section
+at boot time. With the exception table sorted, at runtime when an exception
+occurs we can quickly lookup the __ex_table entry via binary search.
+
+This is not just a boot time optimization, some architectures require this
+table to be sorted in order to handle exceptions relatively early in the boot
+process. For example, i386 makes use of this form of exception handling before
+paging support is even enabled!
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index a8de2fbc1caa..265d9e9a093b 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,7 +19,6 @@ x86-specific Documentation
    tlb
    mtrr
    pat
-   intel_mpx
    intel-iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/intel-iommu.rst b/Documentation/x86/intel-iommu.rst
index 9dae6b47e398..099f13d51d5f 100644
--- a/Documentation/x86/intel-iommu.rst
+++ b/Documentation/x86/intel-iommu.rst
@@ -95,9 +95,10 @@ and any RMRR's processed::
 When DMAR is enabled for use, you will notice..
 PCI-DMA: Using DMAR IOMMU
+-------------------------
 Fault reporting
----------------
+^^^^^^^^^^^^^^^
 ::