diff options
Diffstat (limited to 'Documentation/admin-guide')
62 files changed, 1735 insertions, 753 deletions
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst index 8d3a2d045c0a..bb5032a99234 100644 --- a/Documentation/admin-guide/bcache.rst +++ b/Documentation/admin-guide/bcache.rst @@ -204,7 +204,7 @@ For example:: This should present your unmodified backing device data in /dev/loop0 If your cache is in writethrough mode, then you can safely discard the -cache device without loosing data. +cache device without losing data. E) Wiping a cache device diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst index e1ce90af602a..e85ad37cc0e5 100644 --- a/Documentation/admin-guide/blockdev/paride.rst +++ b/Documentation/admin-guide/blockdev/paride.rst @@ -3,6 +3,7 @@ Linux and parallel port IDE devices =================================== PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net> +PATA_PARPORT (c) 2023 Ondrej Zary 1. Introduction =============== @@ -51,27 +52,15 @@ parallel port IDE subsystem, including: as well as most of the clone and no-name products on the market. -To support such a wide range of devices, PARIDE, the parallel port IDE -subsystem, is actually structured in three parts. There is a base -paride module which provides a registry and some common methods for -accessing the parallel ports. The second component is a set of -high-level drivers for each of the different types of supported devices: +To support such a wide range of devices, pata_parport is actually structured +in two parts. There is a base pata_parport module which provides an interface +to kernel libata subsystem, registry and some common methods for accessing +the parallel ports. - === ============= - pd IDE disk - pcd ATAPI CD-ROM - pf ATAPI disk - pt ATAPI tape - pg ATAPI generic - === ============= - -(Currently, the pg driver is only used with CD-R drives). - -The high-level drivers function according to the relevant standards. -The third component of PARIDE is a set of low-level protocol drivers -for each of the parallel port IDE adapter chips. Thanks to the interest -and encouragement of Linux users from many parts of the world, -support is available for almost all known adapter protocols: +The second component is a set of low-level protocol drivers for each of the +parallel port IDE adapter chips. Thanks to the interest and encouragement of +Linux users from many parts of the world, support is available for almost all +known adapter protocols: ==== ====================================== ==== aten ATEN EH-100 (HK) @@ -91,251 +80,87 @@ support is available for almost all known adapter protocols: ==== ====================================== ==== -2. Using the PARIDE subsystem -============================= +2. Using pata_parport subsystem +=============================== While configuring the Linux kernel, you may choose either to build -the PARIDE drivers into your kernel, or to build them as modules. +the pata_parport drivers into your kernel, or to build them as modules. In either case, you will need to select "Parallel port IDE device support" -as well as at least one of the high-level drivers and at least one -of the parallel port communication protocols. If you do not know -what kind of parallel port adapter is used in your drive, you could -begin by checking the file names and any text files on your DOS +and at least one of the parallel port communication protocols. +If you do not know what kind of parallel port adapter is used in your drive, +you could begin by checking the file names and any text files on your DOS installation floppy. Alternatively, you can look at the markings on the adapter chip itself. That's usually sufficient to identify the correct device. -You can actually select all the protocol modules, and allow the PARIDE +You can actually select all the protocol modules, and allow the pata_parport subsystem to try them all for you. For the "brand-name" products listed above, here are the protocol and high-level drivers that you would use: - ================ ============ ====== ======== - Manufacturer Model Driver Protocol - ================ ============ ====== ======== - MicroSolutions CD-ROM pcd bpck - MicroSolutions PD drive pf bpck - MicroSolutions hard-drive pd bpck - MicroSolutions 8000t tape pt bpck - SyQuest EZ, SparQ pd epat - Imation Superdisk pf epat - Maxell Superdisk pf friq - Avatar Shark pd epat - FreeCom CD-ROM pcd frpw - Hewlett-Packard 5GB Tape pt epat - Hewlett-Packard 7200e (CD) pcd epat - Hewlett-Packard 7200e (CD-R) pg epat - ================ ============ ====== ======== - -2.1 Configuring built-in drivers ---------------------------------- - -We recommend that you get to know how the drivers work and how to -configure them as loadable modules, before attempting to compile a -kernel with the drivers built-in. - -If you built all of your PARIDE support directly into your kernel, -and you have just a single parallel port IDE device, your kernel should -locate it automatically for you. If you have more than one device, -you may need to give some command line options to your bootloader -(eg: LILO), how to do that is beyond the scope of this document. - -The high-level drivers accept a number of command line parameters, all -of which are documented in the source files in linux/drivers/block/paride. -By default, each driver will automatically try all parallel ports it -can find, and all protocol types that have been installed, until it finds -a parallel port IDE adapter. Once it finds one, the probe stops. So, -if you have more than one device, you will need to tell the drivers -how to identify them. This requires specifying the port address, the -protocol identification number and, for some devices, the drive's -chain ID. While your system is booting, a number of messages are -displayed on the console. Like all such messages, they can be -reviewed with the 'dmesg' command. Among those messages will be -some lines like:: - - paride: bpck registered as protocol 0 - paride: epat registered as protocol 1 - -The numbers will always be the same until you build a new kernel with -different protocol selections. You should note these numbers as you -will need them to identify the devices. + ================ ============ ======== + Manufacturer Model Protocol + ================ ============ ======== + MicroSolutions CD-ROM bpck + MicroSolutions PD drive bpck + MicroSolutions hard-drive bpck + MicroSolutions 8000t tape bpck + SyQuest EZ, SparQ epat + Imation Superdisk epat + Maxell Superdisk friq + Avatar Shark epat + FreeCom CD-ROM frpw + Hewlett-Packard 5GB Tape epat + Hewlett-Packard 7200e (CD) epat + Hewlett-Packard 7200e (CD-R) epat + ================ ============ ======== + +All parports and all protocol drivers are probed automatically unless probe=0 +parameter is used. So just "modprobe epat" is enough for a Imation SuperDisk +drive to work. + +Manual device creation:: + + # echo "port protocol mode unit delay" >/sys/bus/pata_parport/new_device + +where: + + ======== ================================================ + port parport name (or "auto" for all parports) + protocol protocol name (or "auto" for all protocols) + mode mode number (protocol-specific) or -1 for probe + unit unit number (for backpack only, see below) + delay I/O delay (see troubleshooting section below) + ======== ================================================ If you happen to be using a MicroSolutions backpack device, you will also need to know the unit ID number for each drive. This is usually the last two digits of the drive's serial number (but read MicroSolutions' documentation about this). -As an example, let's assume that you have a MicroSolutions PD/CD drive -with unit ID number 36 connected to the parallel port at 0x378, a SyQuest -EZ-135 connected to the chained port on the PD/CD drive and also an -Imation Superdisk connected to port 0x278. You could give the following -options on your boot command:: - - pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36 - -In the last option, pf.drive1 configures device /dev/pf1, the 0x378 -is the parallel port base address, the 0 is the protocol registration -number and 36 is the chain ID. - -Please note: while PARIDE will work both with and without the -PARPORT parallel port sharing system that is included by the -"Parallel port support" option, PARPORT must be included and enabled -if you want to use chains of devices on the same parallel port. - -2.2 Loading and configuring PARIDE as modules ----------------------------------------------- - -It is much faster and simpler to get to understand the PARIDE drivers -if you use them as loadable kernel modules. - -Note 1: - using these drivers with the "kerneld" automatic module loading - system is not recommended for beginners, and is not documented here. - -Note 2: - if you build PARPORT support as a loadable module, PARIDE must - also be built as loadable modules, and PARPORT must be loaded before - the PARIDE modules. - -To use PARIDE, you must begin by:: - - insmod paride - -this loads a base module which provides a registry for the protocols, -among other tasks. - -Then, load as many of the protocol modules as you think you might need. -As you load each module, it will register the protocols that it supports, -and print a log message to your kernel log file and your console. For -example:: - - # insmod epat - paride: epat registered as protocol 0 - # insmod kbic - paride: k951 registered as protocol 1 - paride: k971 registered as protocol 2 - -Finally, you can load high-level drivers for each kind of device that -you have connected. By default, each driver will autoprobe for a single -device, but you can support up to four similar devices by giving their -individual coordinates when you load the driver. - -For example, if you had two no-name CD-ROM drives both using the -KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc -you could give the following command:: - - # insmod pcd drive0=0x378,1 drive1=0x3bc,1 - -For most adapters, giving a port address and protocol number is sufficient, -but check the source files in linux/drivers/block/paride for more -information. (Hopefully someone will write some man pages one day !). - -As another example, here's what happens when PARPORT is installed, and -a SyQuest EZ-135 is attached to port 0x378:: - - # insmod paride - paride: version 1.0 installed - # insmod epat - paride: epat registered as protocol 0 - # insmod pd - pd: pd version 1.0, major 45, cluster 64, nice 0 - pda: Sharing parport1 at 0x378 - pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1 - pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media - pda: pda1 - -Note that the last line is the output from the generic partition table -scanner - in this case it reports that it has found a disk with one partition. - -2.3 Using a PARIDE device --------------------------- - -Once the drivers have been loaded, you can access PARIDE devices in the -same way as their traditional counterparts. You will probably need to -create the device "special files". Here is a simple script that you can -cut to a file and execute:: - - #!/bin/bash - # - # mkd -- a script to create the device special files for the PARIDE subsystem - # - function mkdev { - mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1 - } - # - function pd { - D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) ) - mkdev pd$D b 45 $[ $1 * 16 ] - for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 - do mkdev pd$D$P b 45 $[ $1 * 16 + $P ] - done - } - # - cd /dev - # - for u in 0 1 2 3 ; do pd $u ; done - for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done - for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done - for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done - for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done - for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done - # - # end of mkd - -With the device files and drivers in place, you can access PARIDE devices -like any other Linux device. For example, to mount a CD-ROM in pcd0, use:: - - mount /dev/pcd0 /cdrom - -If you have a fresh Avatar Shark cartridge, and the drive is pda, you -might do something like:: - - fdisk /dev/pda -- make a new partition table with - partition 1 of type 83 - - mke2fs /dev/pda1 -- to build the file system - - mkdir /shark -- make a place to mount the disk - - mount /dev/pda1 /shark - -Devices like the Imation superdisk work in the same way, except that -they do not have a partition table. For example to make a 120MB -floppy that you could share with a DOS system:: - - mkdosfs /dev/pf0 - mount /dev/pf0 /mnt - - -2.4 The pf driver ------------------- - -The pf driver is intended for use with parallel port ATAPI disk -devices. The most common devices in this category are PD drives -and LS-120 drives. Traditionally, media for these devices are not -partitioned. Consequently, the pf driver does not support partitioned -media. This may be changed in a future version of the driver. - -2.5 Using the pt driver ------------------------- - -The pt driver for parallel port ATAPI tape drives is a minimal driver. -It does not yet support many of the standard tape ioctl operations. -For best performance, a block size of 32KB should be used. You will -probably want to set the parallel port delay to 0, if you can. - -2.6 Using the pg driver ------------------------- - -The pg driver can be used in conjunction with the cdrecord program -to create CD-ROMs. Please get cdrecord version 1.6.1 or later -from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media -your parallel port should ideally be set to EPP mode, and the "port delay" -should be set to 0. With those settings it is possible to record at 2x -speed without any buffer underruns. If you cannot get the driver to work -in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only. +If you omit the parameters from the end, defaults will be used, e.g.: + +Probe all parports with all protocols:: + + # echo auto >/sys/bus/pata_parport/new_device + +Probe parport0 using protocol epat and mode 4 (EPP-16):: + + # echo "parport0 epat 4" >/sys/bus/pata_parport/new_device + +Probe parport0 using all protocols:: + + # echo "parport0 auto" >/sys/bus/pata_parport/new_device + +Probe all parports using protoocol epat:: + + # echo "auto epat" >/sys/bus/pata_parport/new_device + +Deleting devices:: + + # echo pata_parport.0 >/sys/bus/pata_parport/delete_device 3. Troubleshooting @@ -344,9 +169,9 @@ in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only. 3.1 Use EPP mode if you can ---------------------------- -The most common problems that people report with the PARIDE drivers +The most common problems that people report with the pata_parport drivers concern the parallel port CMOS settings. At this time, none of the -PARIDE protocol modules support ECP mode, or any ECP combination modes. +protocol modules support ECP mode, or any ECP combination modes. If you are able to do so, please set your parallel port into EPP mode using your CMOS setup procedure. @@ -354,17 +179,14 @@ using your CMOS setup procedure. ------------------------- Some parallel ports cannot reliably transfer data at full speed. To -offset the errors, the PARIDE protocol modules introduce a "port +offset the errors, the protocol modules introduce a "port delay" between each access to the i/o ports. Each protocol sets a default value for this delay. In most cases, the user can override the default and set it to 0 - resulting in somewhat higher transfer rates. In some rare cases (especially with older 486 systems) the default delays are not long enough. if you experience corrupt data transfers, or unexpected failures, you may wish to increase the -port delay. The delay can be programmed using the "driveN" parameters -to each of the high-level drivers. Please see the notes above, or -read the comments at the beginning of the driver source files in -linux/drivers/block/paride. +port delay. 3.3 Some drives need a printer reset ------------------------------------- @@ -374,66 +196,12 @@ that do not always power up correctly. We have noticed this with some drives based on OnSpec and older Freecom adapters. In these rare cases, the adapter can often be reinitialised by issuing a "printer reset" on the parallel port. As the reset operation is potentially disruptive in -multiple device environments, the PARIDE drivers will not do it +multiple device environments, the pata_parport drivers will not do it automatically. You can however, force a printer reset by doing:: insmod lp reset=1 rmmod lp If you have one of these marginal cases, you should probably build -your paride drivers as modules, and arrange to do the printer reset -before loading the PARIDE drivers. - -3.4 Use the verbose option and dmesg if you need help ------------------------------------------------------- - -While a lot of testing has gone into these drivers to make them work -as smoothly as possible, problems will arise. If you do have problems, -please check all the obvious things first: does the drive work in -DOS with the manufacturer's drivers ? If that doesn't yield any useful -clues, then please make sure that only one drive is hooked to your system, -and that either (a) PARPORT is enabled or (b) no other device driver -is using your parallel port (check in /proc/ioports). Then, load the -appropriate drivers (you can load several protocol modules if you want) -as in:: - - # insmod paride - # insmod epat - # insmod bpck - # insmod kbic - ... - # insmod pd verbose=1 - -(using the correct driver for the type of device you have, of course). -The verbose=1 parameter will cause the drivers to log a trace of their -activity as they attempt to locate your drive. - -Use 'dmesg' to capture a log of all the PARIDE messages (any messages -beginning with paride:, a protocol module's name or a driver's name) and -include that with your bug report. You can submit a bug report in one -of two ways. Either send it directly to the author of the PARIDE suite, -by e-mail to grant@torque.net, or join the linux-parport mailing list -and post your report there. - -3.5 For more information or help ---------------------------------- - -You can join the linux-parport mailing list by sending a mail message -to: - - linux-parport-request@torque.net - -with the single word:: - - subscribe - -in the body of the mail message (not in the subject line). Please be -sure that your mail program is correctly set up when you do this, as -the list manager is a robot that will subscribe you using the reply -address in your mail headers. REMOVE any anti-spam gimmicks you may -have in your mail headers, when sending mail to the list server. - -You might also find some useful information on the linux-parport -web pages (although they are not always up to date) at - - http://web.archive.org/web/%2E/http://www.torque.net/parport/ +your pata_parport drivers as modules, and arrange to do the printer reset +before loading the pata_parport drivers. diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index 9355c525fbe0..91339efdcb54 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -201,6 +201,8 @@ To remove the config from the image, you can use -d option as below:: Then add "bootconfig" on the normal kernel command line to tell the kernel to look for the bootconfig at the end of the initrd file. +Alternatively, build your kernel with the ``CONFIG_BOOT_CONFIG_FORCE`` +Kconfig option selected. Embedding a Boot Config into Kernel ----------------------------------- @@ -217,7 +219,9 @@ path to the bootconfig file from source tree or object tree. The kernel will embed it as the default bootconfig. Just as when attaching the bootconfig to the initrd, you need ``bootconfig`` -option on the kernel command line to enable the embedded bootconfig. +option on the kernel command line to enable the embedded bootconfig, or, +alternatively, build your kernel with the ``CONFIG_BOOT_CONFIG_FORCE`` +Kconfig option selected. Note that even if you set this option, you can override the embedded bootconfig by another bootconfig which attached to the initrd. diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst index 16253eda192e..dabb80cdd25a 100644 --- a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst +++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst @@ -106,7 +106,7 @@ Proportional weight policy files see Documentation/block/bfq-iosched.rst. blkio.bfq.weight_device - Specifes per cgroup per device weights, overriding the default group + Specifies per cgroup per device weights, overriding the default group weight. For more details, see Documentation/block/bfq-iosched.rst. Following is the format:: diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index b0688011ed06..9343148ee993 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -80,6 +80,8 @@ access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rs you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. +.. _cgroups-why-needed: + 1.2 Why are cgroups needed ? ---------------------------- diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 60370f2c67b9..47d1d7d932a8 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -2,18 +2,18 @@ Memory Resource Controller ========================== -NOTE: +.. caution:: This document is hopelessly outdated and it asks for a complete rewrite. It still contains a useful information so we are keeping it here but make sure to check the current code if you need a deeper understanding. -NOTE: +.. note:: The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. -(For editors) In this document: +.. hint:: When we mention a cgroup (cgroupfs's directory) with memory controller, we call it "memory cgroup". When you see git-log and source code, you'll see patch's title and function names tend to use "memcg". @@ -23,7 +23,7 @@ Benefits and Purpose of the memory controller ============================================= The memory controller isolates the memory behaviour of a group of tasks -from the rest of the system. The article on LWN [12] mentions some probable +from the rest of the system. The article on LWN [12]_ mentions some probable uses of the memory controller. The memory controller can be used to a. Isolate an application or a group of applications @@ -55,7 +55,8 @@ Features: - Root cgroup has no limit controls. Kernel memory support is a work in progress, and the current version provides - basically functionality. (See Section 2.7) + basically functionality. (See :ref:`section 2.7 + <cgroup-v1-memory-kernel-extension>`) Brief summary of control files. @@ -86,6 +87,8 @@ Brief summary of control files. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate set/show controls of moving charges + This knob is deprecated and shouldn't be + used. memory.oom_control set/show oom controls. memory.numa_stat show the number of memory usage per numa node @@ -107,16 +110,16 @@ Brief summary of control files. ========== The memory controller has a long history. A request for comments for the memory -controller was posted by Balbir Singh [1]. At the time the RFC was posted +controller was posted by Balbir Singh [1]_. At the time the RFC was posted there were several implementations for memory control. The goal of the RFC was to build consensus and agreement for the minimal features required -for memory control. The first RSS controller was posted by Balbir Singh[2] -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the -RSS controller. At OLS, at the resource management BoF, everyone suggested -that we handle both page cache and RSS together. Another request was raised -to allow user space handling of OOM. The current memory controller is +for memory control. The first RSS controller was posted by Balbir Singh [2]_ +in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions +of the RSS controller. At OLS, at the resource management BoF, everyone +suggested that we handle both page cache and RSS together. Another request was +raised to allow user space handling of OOM. The current memory controller is at version 6; it combines both mapped (RSS) and unmapped Page -Cache Control [11]. +Cache Control [11]_. 2. Memory Control ================= @@ -147,7 +150,8 @@ specific data structure (mem_cgroup) associated with it. 2.2. Accounting --------------- -:: +.. code-block:: + :caption: Figure 1: Hierarchy of Accounting +--------------------+ | mem_cgroup | @@ -167,7 +171,6 @@ specific data structure (mem_cgroup) associated with it. | | | | +---------------+ +---------------+ - (Figure 1: Hierarchy of Accounting) Figure 1 shows the important aspects of the controller @@ -221,8 +224,9 @@ behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in -- this will happen on memory pressure). -But see section 8.2: when moving a task to another cgroup, its pages may -be recharged to the new cgroup, if move_charge_at_immigrate has been chosen. +But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a +task to another cgroup, its pages may be recharged to the new cgroup, if +move_charge_at_immigrate has been chosen. 2.4 Swap Extension -------------------------------------- @@ -244,7 +248,8 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. By using the memsw limit, you can avoid system OOM which can be caused by swap shortage. -**why 'memory+swap' rather than swap** +2.4.1 why 'memory+swap' rather than swap +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of @@ -252,7 +257,8 @@ memory+swap. In other words, when we want to limit the usage of swap without affecting global LRU, memory+swap limit is better than just limiting swap from an OS point of view. -**What happens when a cgroup hits memory.memsw.limit_in_bytes** +2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out in this cgroup. Then, swap-out will not be done by cgroup routine and file @@ -268,26 +274,26 @@ global VM. When a cgroup goes over its limit, we first try to reclaim memory from the cgroup so as to make space for the new pages that the cgroup has touched. If the reclaim is unsuccessful, an OOM routine is invoked to select and kill the bulkiest task in the -cgroup. (See 10. OOM Control below.) +cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.) The reclaim algorithm has not been modified for cgroups, except that pages that are selected for reclaiming come from the per-cgroup LRU list. -NOTE: - Reclaim does not work for the root cgroup, since we cannot set any - limits on the root cgroup. +.. note:: + Reclaim does not work for the root cgroup, since we cannot set any + limits on the root cgroup. -Note2: - When panic_on_oom is set to "2", the whole system will panic. +.. note:: + When panic_on_oom is set to "2", the whole system will panic. When oom event notifier is registered, event will be delivered. -(See oom_control section) +(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section) 2.6 Locking ----------- -Lock order is as follows: +Lock order is as follows:: Page lock (PG_locked bit of page->flags) mm->page_table_lock or split pte_lock @@ -299,6 +305,8 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. +.. _cgroup-v1-memory-kernel-extension: + 2.7 Kernel Memory Extension ----------------------------------------------- @@ -367,10 +375,10 @@ U != 0, K < U: never greater than the total memory, and freely set U at the cost of his QoS. -WARNING: - In the current implementation, memory reclaim will NOT be - triggered for a cgroup when it hits K while staying below U, which makes - this setup impractical. + .. warning:: + In the current implementation, memory reclaim will NOT be triggered for + a cgroup when it hits K while staying below U, which makes this setup + impractical. U != 0, K >= U: Since kmem charges will also be fed to the user counter and reclaim will be @@ -381,45 +389,41 @@ U != 0, K >= U: 3. User Interface ================= -3.0. Configuration ------------------- - -a. Enable CONFIG_CGROUPS -b. Enable CONFIG_MEMCG +To use the user interface: -3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) -------------------------------------------------------------------- - -:: +1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options +2. Prepare the cgroups (see :ref:`Why are cgroups needed? + <cgroups-why-needed>` for the background information):: # mount -t tmpfs none /sys/fs/cgroup # mkdir /sys/fs/cgroup/memory # mount -t cgroup none /sys/fs/cgroup/memory -o memory -3.2. Make the new group and move bash into it:: +3. Make the new group and move bash into it:: # mkdir /sys/fs/cgroup/memory/0 # echo $$ > /sys/fs/cgroup/memory/0/tasks -Since now we're in the 0 cgroup, we can alter the memory limit:: +4. Since now we're in the 0 cgroup, we can alter the memory limit:: # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes -NOTE: - We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, - mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, - Gibibytes.) + The limit can now be queried:: + + # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes + 4194304 -NOTE: - We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. +.. note:: + We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, + mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, + Gibibytes.) -NOTE: - We cannot set limits on the root cgroup any more. +.. note:: + We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. -:: +.. note:: + We cannot set limits on the root cgroup any more. - # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes - 4194304 We can check the usage:: @@ -458,6 +462,8 @@ test because it has noise of shared objects/status. But the above two are testing extreme situations. Trying usual test under memory controller is always helpful. +.. _cgroup-v1-memory-test-troubleshoot: + 4.1 Troubleshooting ------------------- @@ -470,8 +476,11 @@ terminated by the OOM killer. There are several causes for this: A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of some of the pages cached in the cgroup (page cache pages). -To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and -seeing what happens will be helpful. +To know what happens, disabling OOM_Kill as per :ref:`"10. OOM Control" +<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be +helpful. + +.. _cgroup-v1-memory-test-task-migration: 4.2 Task migration ------------------ @@ -482,15 +491,16 @@ remain charged to it, the charge is dropped when the page is freed or reclaimed. You can move charges of a task along with task migration. -See 8. "Move charges at task migration" +See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>` 4.3 Removing a cgroup --------------------- -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a -cgroup might have some charge associated with it, even though all -tasks have migrated away from it. (because we charge against pages, not -against tasks.) +A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1 +<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2 +<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge +associated with it, even though all tasks have migrated away from it. (because +we charge against pages, not against tasks.) We move the stats to parent, and no change on the charge except uncharging from the child. @@ -519,67 +529,66 @@ will be charged as a new owner of it. 5.2 stat file ------------- -memory.stat file includes following statistics - -per-memory cgroup local status -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -=============== =============================================================== -cache # of bytes of page cache memory. -rss # of bytes of anonymous and swap cache memory (includes - transparent hugepages). -rss_huge # of bytes of anonymous transparent hugepages. -mapped_file # of bytes of mapped file (includes tmpfs/shmem) -pgpgin # of charging events to the memory cgroup. The charging - event happens each time a page is accounted as either mapped - anon page(RSS) or cache page(Page Cache) to the cgroup. -pgpgout # of uncharging events to the memory cgroup. The uncharging - event happens each time a page is unaccounted from the cgroup. -swap # of bytes of swap usage -dirty # of bytes that are waiting to get written back to the disk. -writeback # of bytes of file/anon cache that are queued for syncing to - disk. -inactive_anon # of bytes of anonymous and swap cache memory on inactive - LRU list. -active_anon # of bytes of anonymous and swap cache memory on active - LRU list. -inactive_file # of bytes of file-backed memory and MADV_FREE anonymous memory( - LazyFree pages) on inactive LRU list. -active_file # of bytes of file-backed memory on active LRU list. -unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). -=============== =============================================================== - -status considering hierarchy (see memory.use_hierarchy settings) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= =================================================== -hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy - under which the memory cgroup is -hierarchical_memsw_limit # of bytes of memory+swap limit with regard to - hierarchy under which memory cgroup is. - -total_<counter> # hierarchical version of <counter>, which in - addition to the cgroup's own value includes the - sum of all hierarchical children's values of - <counter>, i.e. total_cache -========================= =================================================== - -The following additional stats are dependent on CONFIG_DEBUG_VM -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= ======================================== -recent_rotated_anon VM internal parameter. (see mm/vmscan.c) -recent_rotated_file VM internal parameter. (see mm/vmscan.c) -recent_scanned_anon VM internal parameter. (see mm/vmscan.c) -recent_scanned_file VM internal parameter. (see mm/vmscan.c) -========================= ======================================== - -Memo: +memory.stat file includes following statistics: + + * per-memory cgroup local status + + =============== =============================================================== + cache # of bytes of page cache memory. + rss # of bytes of anonymous and swap cache memory (includes + transparent hugepages). + rss_huge # of bytes of anonymous transparent hugepages. + mapped_file # of bytes of mapped file (includes tmpfs/shmem) + pgpgin # of charging events to the memory cgroup. The charging + event happens each time a page is accounted as either mapped + anon page(RSS) or cache page(Page Cache) to the cgroup. + pgpgout # of uncharging events to the memory cgroup. The uncharging + event happens each time a page is unaccounted from the + cgroup. + swap # of bytes of swap usage + dirty # of bytes that are waiting to get written back to the disk. + writeback # of bytes of file/anon cache that are queued for syncing to + disk. + inactive_anon # of bytes of anonymous and swap cache memory on inactive + LRU list. + active_anon # of bytes of anonymous and swap cache memory on active + LRU list. + inactive_file # of bytes of file-backed memory and MADV_FREE anonymous + memory (LazyFree pages) on inactive LRU list. + active_file # of bytes of file-backed memory on active LRU list. + unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). + =============== =============================================================== + + * status considering hierarchy (see memory.use_hierarchy settings): + + ========================= =================================================== + hierarchical_memory_limit # of bytes of memory limit with regard to + hierarchy + under which the memory cgroup is + hierarchical_memsw_limit # of bytes of memory+swap limit with regard to + hierarchy under which memory cgroup is. + + total_<counter> # hierarchical version of <counter>, which in + addition to the cgroup's own value includes the + sum of all hierarchical children's values of + <counter>, i.e. total_cache + ========================= =================================================== + + * additional vm parameters (depends on CONFIG_DEBUG_VM): + + ========================= ======================================== + recent_rotated_anon VM internal parameter. (see mm/vmscan.c) + recent_rotated_file VM internal parameter. (see mm/vmscan.c) + recent_scanned_anon VM internal parameter. (see mm/vmscan.c) + recent_scanned_file VM internal parameter. (see mm/vmscan.c) + ========================= ======================================== + +.. hint:: recent_rotated means recent frequency of LRU rotation. recent_scanned means recent # of scans to LRU. showing for better debug please see the code for meanings. -Note: +.. note:: Only anonymous and swap cache memory is listed as part of 'rss' stat. This should not be confused with the true 'resident set size' or the amount of physical memory used by the cgroup. @@ -710,15 +719,25 @@ If we want to change this to 1G, we can at any time use:: # echo 1G > memory.soft_limit_in_bytes -NOTE1: +.. note:: Soft limits take effect over a long period of time, since they involve reclaiming memory for balancing between memory cgroups -NOTE2: + +.. note:: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. -8. Move charges at task migration -================================= +.. _cgroup-v1-memory-move-charges: + +8. Move charges at task migration (DEPRECATED!) +=============================================== + +THIS IS DEPRECATED! + +It's expensive and unreliable! It's better practice to launch workload +tasks directly from inside their target cgroup. Use dedicated workload +cgroups to allow fine-grained policy adjustments without having to +move physical pages between control domains. Users can move charges associated with a task along with task migration, that is, uncharge task's pages from the old cgroup and charge them to the new cgroup. @@ -735,23 +754,29 @@ If you want to enable it:: # echo (some positive value) > memory.move_charge_at_immigrate -Note: +.. note:: Each bits of move_charge_at_immigrate has its own meaning about what type - of charges should be moved. See 8.2 for details. -Note: + of charges should be moved. See :ref:`section 8.2 + <cgroup-v1-memory-movable-charges>` for details. + +.. note:: Charges are moved only when you move mm->owner, in other words, a leader of a thread group. -Note: + +.. note:: If we cannot find enough space for the task in the destination cgroup, we try to make space by reclaiming memory. Task migration may fail if we cannot make enough space. -Note: + +.. note:: It can take several seconds if you move charges much. And if you want disable it again:: # echo 0 > memory.move_charge_at_immigrate +.. _cgroup-v1-memory-movable-charges: + 8.2 Type of charges which can be moved -------------------------------------- @@ -801,6 +826,8 @@ threshold in any direction. It's applicable for root and non-root cgroup. +.. _cgroup-v1-memory-oom-control: + 10. OOM Control =============== @@ -956,15 +983,16 @@ commented and discussed quite extensively in the community. References ========== -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ -2. Singh, Balbir. Memory Controller (RSS Control), +.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ +.. [2] Singh, Balbir. Memory Controller (RSS Control), http://lwn.net/Articles/222762/ -3. Emelianov, Pavel. Resource controllers based on process cgroups +.. [3] Emelianov, Pavel. Resource controllers based on process cgroups https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru -4. Emelianov, Pavel. RSS controller based on process cgroups (v2) +.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2) https://lore.kernel.org/r/461A3010.90403@sw.ru -5. Emelianov, Pavel. RSS controller based on process cgroups (v3) +.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3) https://lore.kernel.org/r/465D9739.8070209@openvz.org + 6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ 7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control subsystem (v3), http://lwn.net/Articles/235534/ @@ -974,7 +1002,8 @@ References https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com 10. Singh, Balbir. Memory controller v6 test results, https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop -11. Singh, Balbir. Memory controller introduction (v6), - https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop -12. Corbet, Jonathan, Controlling memory use in cgroups, - http://lwn.net/Articles/243795/ + +.. [11] Singh, Balbir. Memory controller introduction (v6), + https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop +.. [12] Corbet, Jonathan, Controlling memory use in cgroups, + http://lwn.net/Articles/243795/ diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c8ae7c897f14..f67c0829350b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -619,10 +619,12 @@ process migrations. and is an example of this type. +.. _cgroupv2-limits-distributor: + Limits ------ -A child can only consume upto the configured amount of the resource. +A child can only consume up to the configured amount of the resource. Limits can be over-committed - the sum of the limits of children can exceed the amount of resource available to the parent. @@ -635,15 +637,16 @@ process migrations. "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume on an IO device and is an example of this type. +.. _cgroupv2-protections-distributor: Protections ----------- -A cgroup is protected upto the configured amount of the resource +A cgroup is protected up to the configured amount of the resource as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case -only upto the amount available to the parent is protected among +only up to the amount available to the parent is protected among children. Protections are in the range [0, max] and defaults to 0, which is @@ -1076,7 +1079,7 @@ All time durations are in microseconds. $MAX $PERIOD - which indicates that the group may consume upto $MAX in each + which indicates that the group may consume up to $MAX in each $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. @@ -1245,13 +1248,17 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a string which contains the number of bytes to - reclaim. + This file accepts a single key, the number of bytes to reclaim. + No nested keys are currently supported. Example:: echo "1G" > memory.reclaim + The interface can be later extended with nested keys to + configure the reclaim behavior. For example, specify the + type of memory to reclaim from (anon, file, ..). + Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1263,13 +1270,6 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. - This file also allows the user to specify the nodes to reclaim from, - via the 'nodes=' key, for example:: - - echo "1G nodes=0,1" > memory.reclaim - - The above instructs the kernel to reclaim memory from nodes 0,1. - memory.peak A read-only single value file which exists on non-root cgroups. @@ -2289,7 +2289,7 @@ Cpuset Interface Files For a valid partition root with the sibling cpu exclusivity rule enabled, changes made to "cpuset.cpus" that violate the exclusivity rule will invalidate the partition as well as its - sibiling partitions with conflicting cpuset.cpus values. So + sibling partitions with conflicting cpuset.cpus values. So care must be taking in changing "cpuset.cpus". A valid non-root parent partition may distribute out all its CPUs diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index ed3b8dc854ec..2e151cd8c2e4 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -399,7 +399,7 @@ A partial list of the supported mount options follows: sep if first mount option (after the -o), overrides the comma as the separator between the mount - parms. e.g.:: + parameters. e.g.:: -o user=myname,password=mypassword,domain=mydom @@ -765,7 +765,7 @@ cifsFYI If set to non-zero value, additional debug information Some debugging statements are not compiled into the cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the kernel configuration. cifsFYI may be set to one or - nore of the following flags (7 sets them all):: + more of the following flags (7 sets them all):: +-----------------------------------------------+------+ | log cifs informational messages | 0x01 | diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst index b17fe352fc41..13da4d831d46 100644 --- a/Documentation/admin-guide/device-mapper/cache-policies.rst +++ b/Documentation/admin-guide/device-mapper/cache-policies.rst @@ -70,7 +70,7 @@ the entries (each hotspot block covers a larger area than a single cache block). All this means smq uses ~25bytes per cache block. Still a lot of -memory, but a substantial improvement nontheless. +memory, but a substantial improvement nonetheless. Level balancing ^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/device-mapper/dm-ebs.rst b/Documentation/admin-guide/device-mapper/dm-ebs.rst index 534fa38e8862..c09f66db5621 100644 --- a/Documentation/admin-guide/device-mapper/dm-ebs.rst +++ b/Documentation/admin-guide/device-mapper/dm-ebs.rst @@ -31,7 +31,7 @@ Mandatory parameters: Optional parameter: - <underyling sectors>: + <underlying sectors>: Number of sectors defining the logical block size of <dev path>. 2^N supported, e.g. 8 = emulate 8 sectors of 512 bytes = 4KiB. If not provided, the logical block size of <dev path> will be used. diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst index 0fac051caeac..932383fe6e88 100644 --- a/Documentation/admin-guide/device-mapper/dm-zoned.rst +++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst @@ -46,7 +46,7 @@ just like conventional zones. The zones of the device(s) are separated into 2 types: 1) Metadata zones: these are conventional zones used to store metadata. -Metadata zones are not reported as useable capacity to the user. +Metadata zones are not reported as usable capacity to the user. 2) Data zones: all remaining zones, the vast majority of which will be sequential zones used exclusively to store user data. The conventional diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst index 0a8d3eb3f072..5772ccdd1f5f 100644 --- a/Documentation/admin-guide/device-mapper/unstriped.rst +++ b/Documentation/admin-guide/device-mapper/unstriped.rst @@ -35,7 +35,7 @@ An example of undoing an existing dm-stripe This small bash script will setup 4 loop devices and use the existing striped target to combine the 4 devices into one. It then will use -the unstriped target ontop of the striped device to access the +the unstriped target on top of the striped device to access the individual backing loop devices. We write data to the newly exposed unstriped devices and verify the data written matches the correct underlying device on the striped array:: @@ -110,8 +110,8 @@ to get a 92% reduction in read latency using this device mapper target. Example dmsetup usage ===================== -unstriped ontop of Intel NVMe device that has 2 cores ------------------------------------------------------ +unstriped on top of Intel NVMe device that has 2 cores +------------------------------------------------------ :: @@ -124,8 +124,8 @@ respectively:: /dev/mapper/nvmset0 /dev/mapper/nvmset1 -unstriped ontop of striped with 4 drives using 128K chunk size --------------------------------------------------------------- +unstriped on top of striped with 4 drives using 128K chunk size +--------------------------------------------------------------- :: diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index faa22f77847a..8dc668cc1216 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -330,7 +330,7 @@ Examples // boot-args example, with newlines and comments for readability Kernel command line: ... - // see whats going on in dyndbg=value processing + // see what's going on in dyndbg=value processing dynamic_debug.verbose=3 // enable pr_debugs in the btrfs module (can be builtin or loadable) btrfs.dyndbg="+p" diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index d8a90c81b9ee..1cc5567a4bbe 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -123,7 +123,7 @@ Each simulated GPIO chip creates a separate sysfs group under its device directory for each exposed line (e.g. ``/sys/devices/platform/gpio-sim.X/gpiochipY/``). The name of each group is of the form: ``'sim_gpioX'`` where X is the offset of the line. Inside each -group there are two attibutes: +group there are two attributes: ``pull`` - allows to read and set the current simulated pull setting for every line, when writing the value must be one of: ``'pull-up'``, diff --git a/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst new file mode 100644 index 000000000000..875616d675fe --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst @@ -0,0 +1,91 @@ + +.. SPDX-License-Identifier: GPL-2.0 + +Cross-Thread Return Address Predictions +======================================= + +Certain AMD and Hygon processors are subject to a cross-thread return address +predictions vulnerability. When running in SMT mode and one sibling thread +transitions out of C0 state, the other sibling thread could use return target +predictions from the sibling thread that transitioned out of C0. + +The Spectre v2 mitigations protect the Linux kernel, as it fills the return +address prediction entries with safe targets when context switching to the idle +thread. However, KVM does allow a VMM to prevent exiting guest mode when +transitioning out of C0. This could result in a guest-controlled return target +being consumed by the sibling thread. + +Affected processors +------------------- + +The following CPUs are vulnerable: + + - AMD Family 17h processors + - Hygon Family 18h processors + +Related CVEs +------------ + +The following CVE entry is related to this issue: + + ============== ======================================= + CVE-2022-27672 Cross-Thread Return Address Predictions + ============== ======================================= + +Problem +------- + +Affected SMT-capable processors support 1T and 2T modes of execution when SMT +is enabled. In 2T mode, both threads in a core are executing code. For the +processor core to enter 1T mode, it is required that one of the threads +requests to transition out of the C0 state. This can be communicated with the +HLT instruction or with an MWAIT instruction that requests non-C0. +When the thread re-enters the C0 state, the processor transitions back +to 2T mode, assuming the other thread is also still in C0 state. + +In affected processors, the return address predictor (RAP) is partitioned +depending on the SMT mode. For instance, in 2T mode each thread uses a private +16-entry RAP, but in 1T mode, the active thread uses a 32-entry RAP. Upon +transition between 1T/2T mode, the RAP contents are not modified but the RAP +pointers (which control the next return target to use for predictions) may +change. This behavior may result in return targets from one SMT thread being +used by RET predictions in the sibling thread following a 1T/2T switch. In +particular, a RET instruction executed immediately after a transition to 1T may +use a return target from the thread that just became idle. In theory, this +could lead to information disclosure if the return targets used do not come +from trustworthy code. + +Attack scenarios +---------------- + +An attack can be mounted on affected processors by performing a series of CALL +instructions with targeted return locations and then transitioning out of C0 +state. + +Mitigation mechanism +-------------------- + +Before entering idle state, the kernel context switches to the idle thread. The +context switch fills the RAP entries (referred to as the RSB in Linux) with safe +targets by performing a sequence of CALL instructions. + +Prevent a guest VM from directly putting the processor into an idle state by +intercepting HLT and MWAIT instructions. + +Both mitigations are required to fully address this issue. + +Mitigation control on the kernel command line +--------------------------------------------- + +Use existing Spectre v2 mitigations that will fill the RSB on context switch. + +Mitigation control for KVM - module parameter +--------------------------------------------- + +By default, the KVM hypervisor mitigates this issue by intercepting guest +attempts to transition out of C0. A VMM can use the KVM_CAP_X86_DISABLE_EXITS +capability to override those interceptions, but since this is not common, the +mitigation that covers this path is not enabled by default. + +The mitigation for the KVM_CAP_X86_DISABLE_EXITS capability can be turned on +using the boolean module parameter mitigate_smt_rsb, e.g. ``kvm.mitigate_smt_rsb=1``. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 4df436e7c417..e0614760a99e 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -18,3 +18,4 @@ are configurable at compile, boot or run time. core-scheduling.rst l1d_flush.rst processor_mmio_stale_data.rst + cross-thread-rsb.rst diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst index 2d19c9f4c1fe..f491de74ea79 100644 --- a/Documentation/admin-guide/hw-vuln/mds.rst +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -64,8 +64,8 @@ architecture section: :ref:`Documentation/x86/mds.rst <mds>`. Attack scenarios ---------------- -Attacks against the MDS vulnerabilities can be mounted from malicious non -priviledged user space applications running on hosts or guest. Malicious +Attacks against the MDS vulnerabilities can be mounted from malicious non- +privileged user space applications running on hosts or guest. Malicious guest OSes can obviously mount attacks as well. Contrary to other speculation based vulnerabilities the MDS vulnerability diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index c4dcdb3d0d45..3fe6511c5405 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -610,9 +610,9 @@ kernel command line. retpoline,generic Retpolines retpoline,lfence LFENCE; indirect branch retpoline,amd alias for retpoline,lfence - eibrs enhanced IBRS - eibrs,retpoline enhanced IBRS + Retpolines - eibrs,lfence enhanced IBRS + LFENCE + eibrs Enhanced/Auto IBRS + eibrs,retpoline Enhanced/Auto IBRS + Retpolines + eibrs,lfence Enhanced/Auto IBRS + LFENCE ibrs use IBRS to protect kernel Not specifying this option is equivalent to diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 5bfafcbb9562..0ad7e7ec0d27 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -56,6 +56,17 @@ ABI will be found here. sysfs-rules +This is the beginning of a section with information of interest to +application developers and system integrators doing analysis of the +Linux kernel for safety critical applications. Documents supporting +analysis of kernel interactions with applications, and key kernel +subsystems expectations will be found here. + +.. toctree:: + :maxdepth: 1 + + workload-tracing + The rest of this manual consists of various unordered guides on how to configure specific aspects of kernel behavior to your liking. @@ -116,6 +127,7 @@ configure specific aspects of kernel behavior to your liking. svga syscall-user-dispatch sysrq + thermal/index thunderbolt ufs unicode diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 82aecdcae8a6..030de95e3e6b 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -312,10 +312,10 @@ define dmesg set var $prev_flags = $info->flags end - set var $id = ($id + 1) & $id_mask if ($id == $end_id) loop_break end + set var $id = ($id + 1) & $id_mask end end document dmesg diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 959f73a32712..19600c50277b 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -142,7 +142,6 @@ parameter is applicable:: NFS Appropriate NFS support is enabled. OF Devicetree is enabled. PV_OPS A paravirtualized kernel is enabled. - PARIDE The ParIDE (parallel port IDE) subsystem is enabled. PARISC The PA-RISC architecture is enabled. PCI PCI bus support is enabled. PCIE PCI Express support is enabled. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ecdb9530af8a..46268d6baa43 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -378,18 +378,16 @@ autoconf= [IPV6] See Documentation/networking/ipv6.rst. - show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller - Limit apic dumping. The parameter defines the maximal - number of local apics being dumped. Also it is possible - to set it to "all" by meaning -- no limit here. - Format: { 1 (default) | 2 | ... | all }. - The parameter valid if only apic=debug or - apic=verbose is specified. - Example: apic=debug show_lapic=all - apm= [APM] Advanced Power Management See header of arch/x86/kernel/apm_32.c. + apparmor= [APPARMOR] Disable or enable AppArmor at boot time + Format: { "0" | "1" } + See security/apparmor/Kconfig help text + 0 -- disable. + 1 -- enable. + Default value is set via kernel config option. + arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> @@ -480,8 +478,10 @@ See Documentation/block/cmdline-partition.rst boot_delay= Milliseconds to delay each printk during boot. - Values larger than 10 seconds (10000) are changed to - no delay (0). + Only works if CONFIG_BOOT_PRINTK_DELAY is enabled, + and you may also have to specify "lpj=". Boot_delay + values larger than 10 seconds (10000) are assumed + erroneous and ignored. Format: integer bootconfig [KNL] @@ -557,6 +557,7 @@ Format: <string> nosocket -- Disable socket memory accounting. nokmem -- Disable kernel memory accounting. + nobpf -- Disable BPF memory accounting. checkreqprot= [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } @@ -672,7 +673,7 @@ Sets the size of kernel per-numa memory area for contiguous memory allocations. A value of 0 disables per-numa CMA altogether. And If this option is not - specificed, the default value is 0. + specified, the default value is 0. With per-numa CMA enabled, DMA users on node nid will first try to allocate buffer from the pernuma area which is located in node nid, if the allocation fails, @@ -944,7 +945,7 @@ driver code when a CPU writes to (or reads from) a random memory location. Note that there exists a class of memory corruptions problems caused by buggy H/W or - F/W or by drivers badly programing DMA (basically when + F/W or by drivers badly programming DMA (basically when memory is written at bus level and the CPU MMU is bypassed) which are not detectable by CONFIG_DEBUG_PAGEALLOC, hence this option will not help @@ -1045,26 +1046,12 @@ can be useful when debugging issues that require an SLB miss to occur. - stress_slb [PPC] - Limits the number of kernel SLB entries, and flushes - them frequently to increase the rate of SLB faults - on kernel addresses. - - stress_hpt [PPC] - Limits the number of kernel HPT entries in the hash - page table to increase the rate of hash page table - faults on kernel addresses. - disable= [IPV6] See Documentation/networking/ipv6.rst. disable_radix [PPC] Disable RADIX MMU mode on POWER9 - radix_hcall_invalidate=on [PPC/PSERIES] - Disable RADIX GTSE feature and use hcall for TLB - invalidate. - disable_tlbie [PPC] Disable TLBIE instruction. Currently does not work with KVM, with HASH MMU, or with coherent accelerators. @@ -1166,16 +1153,6 @@ Documentation/admin-guide/dynamic-debug-howto.rst for details. - nopku [X86] Disable Memory Protection Keys CPU feature found - in some Intel CPUs. - - <module>.async_probe[=<bool>] [KNL] - If no <bool> value is specified or if the value - specified is not a valid <bool>, enable asynchronous - probe on this module. Otherwise, enable/disable - asynchronous probe on this module as indicated by the - <bool> value. See also: module.async_probe - early_ioremap_debug [KNL] Enable debug messages in early_ioremap support. This is useful for tracking down temporary early mappings @@ -1195,10 +1172,10 @@ specified, the serial port must already be setup and configured. - uart[8250],io,<addr>[,options] - uart[8250],mmio,<addr>[,options] - uart[8250],mmio32,<addr>[,options] - uart[8250],mmio32be,<addr>[,options] + uart[8250],io,<addr>[,options[,uartclk]] + uart[8250],mmio,<addr>[,options[,uartclk]] + uart[8250],mmio32,<addr>[,options[,uartclk]] + uart[8250],mmio32be,<addr>[,options[,uartclk]] uart[8250],0x<addr>[,options] Start an early, polled-mode console on the 8250/16550 UART at the specified I/O port or MMIO address. @@ -1207,7 +1184,9 @@ If none of [io|mmio|mmio32|mmio32be], <addr> is assumed to be equivalent to 'mmio'. 'options' are specified in the same format described for "console=ttyS<n>"; if - unspecified, the h/w is not initialized. + unspecified, the h/w is not initialized. 'uartclk' is + the uart clock frequency; if unspecified, it is set + to 'BASE_BAUD' * 16. pl011,<addr> pl011,mmio32,<addr> @@ -1532,6 +1511,15 @@ boot up that is likely to be overridden by user space start up functionality. + Optionally, the snapshot can also be defined for a tracing + instance that was created by the trace_instance= command + line parameter. + + trace_instance=foo,sched_switch ftrace_boot_snapshot=foo + + The above will cause the "foo" tracing instance to trigger + a snapshot at the end of boot up. + ftrace_dump_on_oops[=orig_cpu] [FTRACE] will dump the trace buffers on oops. If no parameter is passed, ftrace will dump @@ -1752,7 +1740,7 @@ boot-time allocation of gigantic hugepages is skipped. hugetlb_free_vmemmap= - [KNL] Reguires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP + [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled. Control if HugeTLB Vmemmap Optimization (HVO) is enabled. Allows heavy hugetlb users to free up some more @@ -1791,12 +1779,6 @@ which allow the hypervisor to 'idle' the guest on lock contention. - keep_bootcon [KNL] - Do not unregister boot console at start. This is only - useful for debugging when something happens in the window - between unregistering the boot console and initializing - the real console. - i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. @@ -2366,17 +2348,18 @@ js= [HW,JOY] Analog joystick See Documentation/input/joydev/joystick.rst. - nokaslr [KNL] - When CONFIG_RANDOMIZE_BASE is set, this disables - kernel and module base offset ASLR (Address Space - Layout Randomization). - kasan_multi_shot [KNL] Enforce KASAN (Kernel Address Sanitizer) to print report on every invalid memory access. Without this parameter KASAN will print report only for the first invalid access. + keep_bootcon [KNL] + Do not unregister boot console at start. This is only + useful for debugging when something happens in the window + between unregistering the boot console and initializing + the real console. + keepinitrd [HW,ARM] kernelcore= [KNL,X86,IA-64,PPC] @@ -2816,6 +2799,9 @@ * [no]setxfer: Indicate if transfer speed mode setting should be skipped. + * [no]fua: Disable or enable FUA (Force Unit Access) + support for devices supporting this feature. + * dump_id: Dump IDENTIFY data. * disable: Disable this device. @@ -3325,6 +3311,13 @@ For details see: Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + <module>.async_probe[=<bool>] [KNL] + If no <bool> value is specified or if the value + specified is not a valid <bool>, enable asynchronous + probe on this module. Otherwise, enable/disable + asynchronous probe on this module as indicated by the + <bool> value. See also: module.async_probe + module.async_probe=<bool> [KNL] When set to true, modules will use async probing by default. To enable/disable async probing for a @@ -3708,7 +3701,7 @@ implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the sleep(SH) or wfi(ARM,ARM64) instructions do not work - correctly or when doing power measurements to evalute + correctly or when doing power measurements to evaluate the impact of the sleep instructions. This is also useful when using JTAG debugger. @@ -3779,6 +3772,11 @@ nojitter [IA-64] Disables jitter checking for ITC timers. + nokaslr [KNL] + When CONFIG_RANDOMIZE_BASE is set, this disables + kernel and module base offset ASLR (Address Space + Layout Randomization). + no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page @@ -3824,6 +3822,19 @@ nopcid [X86-64] Disable the PCID cpu feature. + nopku [X86] Disable Memory Protection Keys CPU feature found + in some Intel CPUs. + + nopv= [X86,XEN,KVM,HYPER_V,VMWARE] + Disables the PV optimizations forcing the guest to run + as generic guest with no PV drivers. Currently support + XEN HVM, KVM, HYPER_V and VMWARE guest. + + nopvspin [X86,XEN,KVM] + Disables the qspinlock slow path using PV optimizations + which allow the hypervisor to 'idle' the guest on lock + contention. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space @@ -4117,10 +4128,6 @@ pcbit= [HW,ISDN] - pcd. [PARIDE] - See header of drivers/block/paride/pcd.c. - See also Documentation/admin-guide/blockdev/paride.rst. - pci=option[,option...] [PCI] various PCI subsystem options. Some options herein operate on a specific device @@ -4385,9 +4392,6 @@ for debug and development, but should not be needed on a platform with proper driver support. - pd. [PARIDE] - See Documentation/admin-guide/blockdev/paride.rst. - pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at boot time. Format: { 0 | 1 } @@ -4400,12 +4404,6 @@ allocator. This parameter is primarily for debugging and performance comparison. - pf. [PARIDE] - See Documentation/admin-guide/blockdev/paride.rst. - - pg. [PARIDE] - See Documentation/admin-guide/blockdev/paride.rst. - pirq= [SMP,APIC] Manual mp-table setup See Documentation/x86/i386/IO-APIC.rst. @@ -4567,9 +4565,6 @@ pstore.backend= Specify the name of the pstore backend to use - pt. [PARIDE] - See Documentation/admin-guide/blockdev/paride.rst. - pti= [X86-64] Control Page Table Isolation of user and kernel address spaces. Disabling this feature removes hardening, but improves performance of @@ -4593,6 +4588,10 @@ r128= [HW,DRM] + radix_hcall_invalidate=on [PPC/PSERIES] + Disable RADIX GTSE feature and use hcall for TLB + invalidate. + raid= [HW,RAID] See Documentation/admin-guide/md.rst. @@ -5115,6 +5114,17 @@ rcupdate.rcu_cpu_stall_timeout to be used (after conversion from seconds to milliseconds). + rcupdate.rcu_cpu_stall_cputime= [KNL] + Provide statistics on the cputime and count of + interrupts and tasks during the sampling period. For + multiple continuous RCU stalls, all sampling periods + begin at half of the first RCU stall timeout. + + rcupdate.rcu_exp_stall_task_details= [KNL] + Print stack dumps of any tasks blocking the + current expedited RCU grace period during an + expedited RCU CPU stall warning. + rcupdate.rcu_expedited= [KNL] Use expedited grace-period primitives, for example, synchronize_rcu_expedited() instead @@ -5223,7 +5233,7 @@ rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, - mba. + mba, smba, bmec. E.g. to turn on cmt and turn off mba use: rdt=cmt,!mba @@ -5574,13 +5584,6 @@ 1 -- enable. Default value is 1. - apparmor= [APPARMOR] Disable or enable AppArmor at boot time - Format: { "0" | "1" } - See security/apparmor/Kconfig help text - 0 -- disable. - 1 -- enable. - Default value is set via kernel config option. - serialnumber [BUGS=X86-32] sev=option[,option...] [X86-64] See Documentation/x86/x86_64/boot-options.rst @@ -5588,6 +5591,15 @@ shapers= [NET] Maximal number of shapers. + show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller + Limit apic dumping. The parameter defines the maximal + number of local apics being dumped. Also it is possible + to set it to "all" by meaning -- no limit here. + Format: { 1 (default) | 2 | ... | all }. + The parameter valid if only apic=debug or + apic=verbose is specified. + Example: apic=debug show_lapic=all + simeth= [IA-64] simscsi= @@ -5731,9 +5743,9 @@ retpoline,generic - Retpolines retpoline,lfence - LFENCE; indirect branch retpoline,amd - alias for retpoline,lfence - eibrs - enhanced IBRS - eibrs,retpoline - enhanced IBRS + Retpolines - eibrs,lfence - enhanced IBRS + LFENCE + eibrs - Enhanced/Auto IBRS + eibrs,retpoline - Enhanced/Auto IBRS + Retpolines + eibrs,lfence - Enhanced/Auto IBRS + LFENCE ibrs - use IBRS to protect kernel Not specifying this option is equivalent to @@ -6027,6 +6039,16 @@ be used to filter out binaries which have not yet been made aware of AT_MINSIGSTKSZ. + stress_hpt [PPC] + Limits the number of kernel HPT entries in the hash + page table to increase the rate of hash page table + faults on kernel addresses. + + stress_slb [PPC] + Limits the number of kernel SLB entries, and flushes + them frequently to increase the rate of SLB faults + on kernel addresses. + sunrpc.min_resvport= sunrpc.max_resvport= [NFS,SUNRPC] @@ -6274,13 +6296,33 @@ comma-separated list of trace events to enable. See also Documentation/trace/events.rst + trace_instance=[instance-info] + [FTRACE] Create a ring buffer instance early in boot up. + This will be listed in: + + /sys/kernel/tracing/instances + + Events can be enabled at the time the instance is created + via: + + trace_instance=<name>,<system1>:<event1>,<system2>:<event2> + + Note, the "<system*>:" portion is optional if the event is + unique. + + trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall + + will enable the "sched_switch" event (note, the "sched:" is optional, and + the same thing would happen if it was left off). The irq_handler_entry + event, and all events under the "initcall" system. + trace_options=[option-list] [FTRACE] Enable or disable tracer options at boot. The option-list is a comma delimited list of options that can be enabled or disabled just as if you were to echo the option name into - /sys/kernel/debug/tracing/trace_options + /sys/kernel/tracing/trace_options For example, to enable stacktrace option (to dump the stack trace of each event), add to the command line: @@ -6313,7 +6355,7 @@ [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can be enabled again by echoing '1' into the "tracing_on" - file located in /sys/kernel/debug/tracing/ + file located in /sys/kernel/tracing/ This option is useful, as it disables the trace before the WARNING dump is called, which prevents the trace to @@ -6371,6 +6413,16 @@ in situations with strict latency requirements (where interruptions from clocksource watchdog are not acceptable). + [x86] recalibrate: force recalibration against a HW timer + (HPET or PM timer) on systems whose TSC frequency was + obtained from HW or FW using either an MSR or CPUID(0x15). + Warn if the difference is more than 500 ppm. + [x86] watchdog: Use TSC as the watchdog clocksource with + which to check other HW timers (HPET or PM timer), but + only on systems where TSC has been deemed trustworthy. + This will be suppressed by an earlier tsc=nowatchdog and + can be overridden by a later tsc=nowatchdog. A console + message will flag any such suppression or overriding. tsc_early_khz= [X86] Skip early TSC calibration and use the given value instead. Useful when the early TSC frequency discovery @@ -6758,11 +6810,11 @@ functions are at fixed addresses, they make nice targets for exploits that can control RIP. - emulate [default] Vsyscalls turn into traps and are - emulated reasonably safely. The vsyscall - page is readable. + emulate Vsyscalls turn into traps and are emulated + reasonably safely. The vsyscall page is + readable. - xonly Vsyscalls turn into traps and are + xonly [default] Vsyscalls turn into traps and are emulated reasonably safely. The vsyscall page is not readable. @@ -6959,16 +7011,6 @@ fairer and the number of possible event channels is much higher. Default is on (use fifo events). - nopv= [X86,XEN,KVM,HYPER_V,VMWARE] - Disables the PV optimizations forcing the guest to run - as generic guest with no PV drivers. Currently support - XEN HVM, KVM, HYPER_V and VMWARE guest. - - nopvspin [X86,XEN,KVM] - Disables the qspinlock slow path using PV optimizations - which allow the hypervisor to 'idle' the guest on lock - contention. - xirc2ps_cs= [NET,PCMCIA] Format: <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] @@ -7022,3 +7064,10 @@ management firmware translates the requests into actual hardware states (core frequency, data fabric and memory clocks etc.) + active + Use amd_pstate_epp driver instance as the scaling driver, + driver provides a hint to the hardware if software wants + to bias toward performance (0x0) or energy efficiency (0xff) + to the CPPC firmware. then CPPC power algorithm will + calculate the runtime workload and adjust the realtime cores + frequency. diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index e4a5fc26f1a9..993c2a05f5ee 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -25,7 +25,7 @@ References - In order to locate kernel-generated OS jitter on CPU N: - cd /sys/kernel/debug/tracing + cd /sys/kernel/tracing echo 1 > max_graph_depth # Increase the "1" for more detail echo function_graph > current_tracer # run workload diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 475eb0e81e4a..e27a1c3f634e 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -1488,7 +1488,7 @@ Example of command to set keyboard language is mentioned below:: Text corresponding to keyboard layout to be set in sysfs are: be(Belgian), cz(Czech), da(Danish), de(German), en(English), es(Spain), et(Estonian), fr(French), fr-ch(French(Switzerland)), hu(Hungarian), it(Italy), jp (Japan), -nl(Dutch), nn(Norway), pl(Polish), pt(portugese), sl(Slovenian), sv(Sweden), +nl(Dutch), nn(Norway), pl(Polish), pt(portuguese), sl(Slovenian), sv(Sweden), tr(Turkey) WWAN Antenna type diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst index d8fc9a59c086..4ff2cc291d18 100644 --- a/Documentation/admin-guide/md.rst +++ b/Documentation/admin-guide/md.rst @@ -317,7 +317,7 @@ All md devices contain: suspended (not supported yet) All IO requests will block. The array can be reconfigured. - Writing this, if accepted, will block until array is quiessent + Writing this, if accepted, will block until array is quiescent readonly no resync can happen. no superblocks get written. diff --git a/Documentation/admin-guide/media/bttv.rst b/Documentation/admin-guide/media/bttv.rst index 125f6f47123d..58cbaf6df694 100644 --- a/Documentation/admin-guide/media/bttv.rst +++ b/Documentation/admin-guide/media/bttv.rst @@ -909,7 +909,7 @@ DE hat diverse Treiber fuer diese Modelle (Stand 09/2002): - TVPhone98 (Bt878) - AVerTV und TVCapture98 w/VCR (Bt 878) - AVerTVStudio und TVPhone98 w/VCR (Bt878) - - AVerTV GO Serie (Kein SVideo Input) + - AVerTV GO Series (Kein SVideo Input) - AVerTV98 (BT-878 chip) - AVerTV98 mit Fernbedienung (BT-878 chip) - AVerTV/FM98 (BT-878 chip) diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst index 2d660b76caea..a06473429916 100644 --- a/Documentation/admin-guide/media/building.rst +++ b/Documentation/admin-guide/media/building.rst @@ -137,7 +137,7 @@ The ``LIRC user interface`` option adds enhanced functionality when using the from remote controllers. The ``Support for eBPF programs attached to lirc devices`` option allows -the usage of special programs (called eBPF) that would allow aplications +the usage of special programs (called eBPF) that would allow applications to add extra remote controller decoding functionality to the Linux Kernel. The ``Remote controller decoders`` option allows selecting the diff --git a/Documentation/admin-guide/media/davinci-vpbe.rst b/Documentation/admin-guide/media/davinci-vpbe.rst deleted file mode 100644 index 9e6360fd02db..000000000000 --- a/Documentation/admin-guide/media/davinci-vpbe.rst +++ /dev/null @@ -1,65 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -The VPBE V4L2 driver design -=========================== - -Functional partitioning ------------------------ - -Consists of the following: - - 1. V4L2 display driver - - Implements creation of video2 and video3 device nodes and - provides v4l2 device interface to manage VID0 and VID1 layers. - - 2. Display controller - - Loads up VENC, OSD and external encoders such as ths8200. It provides - a set of API calls to V4L2 drivers to set the output/standards - in the VENC or external sub devices. It also provides - a device object to access the services from OSD subdevice - using sub device ops. The connection of external encoders to VENC LCD - controller port is done at init time based on default output and standard - selection or at run time when application change the output through - V4L2 IOCTLs. - - When connected to an external encoder, vpbe controller is also responsible - for setting up the interface between VENC and external encoders based on - board specific settings (specified in board-xxx-evm.c). This allows - interfacing external encoders such as ths8200. The setup_if_config() - is implemented for this as well as configure_venc() (part of the next patch) - API to set timings in VENC for a specific display resolution. As of this - patch series, the interconnection and enabling and setting of the external - encoders is not present, and would be a part of the next patch series. - - 3. VENC subdevice module - - Responsible for setting outputs provided through internal DACs and also - setting timings at LCD controller port when external encoders are connected - at the port or LCD panel timings required. When external encoder/LCD panel - is connected, the timings for a specific standard/preset is retrieved from - the board specific table and the values are used to set the timings in - venc using non-standard timing mode. - - Support LCD Panel displays using the VENC. For example to support a Logic - PD display, it requires setting up the LCD controller port with a set of - timings for the resolution supported and setting the dot clock. So we could - add the available outputs as a board specific entry (i.e add the "LogicPD" - output name to board-xxx-evm.c). A table of timings for various LCDs - supported can be maintained in the board specific setup file to support - various LCD displays.As of this patch a basic driver is present, and this - support for external encoders and displays forms a part of the next - patch series. - - 4. OSD module - - OSD module implements all OSD layer management and hardware specific - features. The VPBE module interacts with the OSD for enabling and - disabling appropriate features of the OSD. - -Current status --------------- - -A fully functional working version of the V4L2 driver is available. This -driver has been tested with NTSC and PAL standards and buffer streaming. diff --git a/Documentation/admin-guide/media/platform-cardlist.rst b/Documentation/admin-guide/media/platform-cardlist.rst index ac73c4166d1e..8ef57cd13dec 100644 --- a/Documentation/admin-guide/media/platform-cardlist.rst +++ b/Documentation/admin-guide/media/platform-cardlist.rst @@ -73,7 +73,6 @@ via-camera VIAFB camera controller video-mux Video Multiplexer vpif_display TI DaVinci VPIF V4L2-Display vpif_capture TI DaVinci VPIF video capture -vpss TI DaVinci VPBE V4L2-Display vsp1 Renesas VSP1 Video Processing Engine xilinx-tpg Xilinx Video Test Pattern Generator xilinx-video Xilinx Video IP (EXPERIMENTAL) diff --git a/Documentation/admin-guide/media/si476x.rst b/Documentation/admin-guide/media/si476x.rst index 87062301d6a1..c8882ee9f208 100644 --- a/Documentation/admin-guide/media/si476x.rst +++ b/Documentation/admin-guide/media/si476x.rst @@ -142,7 +142,7 @@ The drivers exposes following files: indicator 0x18 lassi Signed Low side adjacent Channel Strength indicator - 0x19 hassi ditto fpr High side + 0x19 hassi ditto for High side 0x20 mult Multipath indicator 0x21 dev Frequency deviation 0x24 assi Adjacent channel SSI diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index 90a026ee05c6..734e18c310bd 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -13,7 +13,6 @@ Video4Linux (V4L) driver-specific documentation cafe_ccic cpia2 cx88 - davinci-vpbe fimc imx imx7 diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index 672a8371f6ad..58ac25b2c385 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -580,7 +580,7 @@ Metadata Capture ---------------- The Metadata capture generates UVC format metadata. The PTS and SCR are -transmitted based on the values set in vivid contols. +transmitted based on the values set in vivid controls. The Metadata device will only work for the Webcam input, it will give back an error for all other inputs. diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst index c79f1e336222..e796b0a7e4a5 100644 --- a/Documentation/admin-guide/mm/concepts.rst +++ b/Documentation/admin-guide/mm/concepts.rst @@ -1,5 +1,3 @@ -.. _mm_concepts: - ================= Concepts overview ================= @@ -86,16 +84,15 @@ memory with the huge pages. The first one is `HugeTLB filesystem`, or hugetlbfs. It is a pseudo filesystem that uses RAM as its backing store. For the files created in this filesystem the data resides in the memory and mapped using huge pages. The hugetlbfs is described at -:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`. +Documentation/admin-guide/mm/hugetlbpage.rst. Another, more recent, mechanism that enables use of the huge pages is called `Transparent HugePages`, or THP. Unlike the hugetlbfs that requires users and/or system administrators to configure what parts of the system memory should and can be mapped by the huge pages, THP manages such mappings transparently to the user and hence the -name. See -:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>` -for more details about THP. +name. See Documentation/admin-guide/mm/transhuge.rst for more details +about THP. Zones ===== @@ -125,8 +122,8 @@ processor. Each bank is referred to as a `node` and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters. You can find more details about NUMA in -:ref:`Documentation/mm/numa.rst <numa>` and in -:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`. +Documentation/mm/numa.rst` and in +Documentation/admin-guide/mm/numa_memory_policy.rst. Page cache ========== diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst index c09cace80651..7b0775d281b4 100644 --- a/Documentation/admin-guide/mm/damon/lru_sort.rst +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -54,7 +54,7 @@ that is built with ``CONFIG_DAMON_LRU_SORT=y``. To let sysadmins enable or disable it and tune for the given system, DAMON_LRU_SORT utilizes module parameters. That is, you can put ``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write -proper values to ``/sys/modules/damon_lru_sort/parameters/<parameter>`` files. +proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files. Below are the description of each parameter. @@ -283,7 +283,7 @@ doesn't make progress and therefore the free memory rate becomes lower than 20%, it asks DAMON_LRU_SORT to do nothing again, so that we can fall back to the LRU-list based page granularity reclamation. :: - # cd /sys/modules/damon_lru_sort/parameters + # cd /sys/module/damon_lru_sort/parameters # echo 500 > hot_thres_access_freq # echo 120000000 > cold_min_age # echo 10 > quota_ms diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 4f1479a11e63..343e25b252f4 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -46,7 +46,7 @@ that is built with ``CONFIG_DAMON_RECLAIM=y``. To let sysadmins enable or disable it and tune for the given system, DAMON_RECLAIM utilizes module parameters. That is, you can put ``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write -proper values to ``/sys/modules/damon_reclaim/parameters/<parameter>`` files. +proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files. Below are the description of each parameter. @@ -205,6 +205,15 @@ The end physical address of memory region that DAMON_RECLAIM will do work against. That is, DAMON_RECLAIM will find cold memory regions in this region and reclaims. By default, biggest System RAM is used as the region. +skip_anon +--------- + +Skip anonymous pages reclamation. + +If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous +pages. By default, ``N``. + + kdamond_pid ----------- @@ -251,7 +260,7 @@ therefore the free memory rate becomes lower than 20%, it asks DAMON_RECLAIM to do nothing again, so that we can fall back to the LRU-list based page granularity reclamation. :: - # cd /sys/modules/damon_reclaim/parameters + # cd /sys/module/damon_reclaim/parameters # echo 30000000 > min_age # echo $((1 * 1024 * 1024 * 1024)) > quota_sz # echo 1000 > quota_reset_interval_ms diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 1a5b6b71efa1..9b823fec974d 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -25,10 +25,12 @@ DAMON provides below interfaces for different users. interface provides only simple :ref:`statistics <damos_stats>` for the monitoring results. For detailed monitoring results, DAMON provides a :ref:`tracepoint <tracepoint>`. -- *debugfs interface.* +- *debugfs interface. (DEPRECATED!)* :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface - <sysfs_interface>`. This will be removed after next LTS kernel is released, - so users should move to the :ref:`sysfs interface <sysfs_interface>`. + <sysfs_interface>`. This is deprecated, so users should move to the + :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot + move, please report your usecase to damon@lists.linux.dev and + linux-mm@kvack.org. - *Kernel Space Programming Interface.* :doc:`This </mm/damon/api>` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by @@ -87,6 +89,8 @@ comma (","). :: │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low + │ │ │ │ │ │ │ filters/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,memcg_id │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds │ │ │ │ │ │ │ tried_regions/ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age @@ -151,6 +155,8 @@ number (``N``) to the file creates the number of child directories named as moment, only one context per kdamond is supported, so only ``0`` or ``1`` can be written to the file. +.. _sysfs_contexts: + contexts/<N>/ ------------- @@ -268,21 +274,32 @@ schemes/<N>/ ------------ In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``) -exist. +``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file +(``action``) exist. The ``action`` file is for setting and getting what action you want to apply to memory regions having specific access pattern of the interest. The keywords that can be written to and read from the file and their meaning are as below. - - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED`` - - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD`` - - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` +Note that support of each action depends on the running DAMON operations set +`implementation <sysfs_contexts>`. + + - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``. + Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set. + - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. - ``lru_prio``: Prioritize the region on its LRU lists. + Supported by ``paddr`` operations set. - ``lru_deprio``: Deprioritize the region on its LRU lists. - - ``stat``: Do nothing but count the statistics + Supported by ``paddr`` operations set. + - ``stat``: Do nothing but count the statistics. + Supported by all operations sets. schemes/<N>/access_pattern/ --------------------------- @@ -347,6 +364,46 @@ as below. The ``interval`` should written in microseconds unit. +schemes/<N>/filters/ +-------------------- + +Users could know something more than the kernel for specific types of memory. +In the case, users could do their own management for the memory and hence +doesn't want DAMOS bothers that. Users could limit DAMOS by setting the access +pattern of the scheme and/or the monitoring regions for the purpose, but that +can be inefficient in some cases. In such cases, users could set non-access +pattern driven filters using files in this directory. + +In the beginning, this directory has only one file, ``nr_filters``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each filter. The filters are evaluated +in the numeric order. + +Each filter directory contains three files, namely ``type``, ``matcing``, and +``memcg_path``. You can write one of two special keywords, ``anon`` for +anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of +the memory cgroup filtering, you can specify the memory cgroup of the interest +by writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to +filter out pages that does or does not match to the type, respectively. Then, +the scheme's action will not be applied to the pages that specified to be +filtered out. + +For example, below restricts a DAMOS action to be applied to only non-anonymous +pages of all memory cgroups except ``/having_care_already``.:: + + # echo 2 > nr_filters + # # filter out anonymous pages + echo anon > 0/type + echo Y > 0/matching + # # further filter out all cgroups except one at '/having_care_already' + echo memcg > 1/type + echo /having_care_already > 1/memcg_path + echo N > 1/matching + +Note that filters are currently supported only when ``paddr`` +`implementation <sysfs_contexts>` is being used. + .. _sysfs_schemes_stats: schemes/<N>/stats/ @@ -432,13 +489,17 @@ the files as above. Above is only for an example. .. _debugfs_interface: -debugfs Interface -================= +debugfs Interface (DEPRECATED!) +=============================== .. note:: - DAMON debugfs interface will be removed after next LTS kernel is released, so - users should move to the :ref:`sysfs interface <sysfs_interface>`. + THIS IS DEPRECATED! + + DAMON debugfs interface is deprecated, so users should move to the + :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot + move, please report your usecase to damon@lists.linux.dev and + linux-mm@kvack.org. DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and @@ -574,11 +635,15 @@ The ``<action>`` is a predefined integer for memory management actions, which DAMON will apply to the regions having the target access pattern. The supported numbers and their meanings are as below. - - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED`` - - 1: Call ``madvise()`` for the region with ``MADV_COLD`` - - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` + - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if + ``target`` is ``paddr``. + - 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if + ``target`` is ``paddr``. + - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``. + - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if + ``target`` is ``paddr``. + - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if + ``target`` is ``paddr``. - 5: Do nothing but count the statistics Quota diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 19f27c0d92e0..e4d4b4a8dc97 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -1,5 +1,3 @@ -.. _hugetlbpage: - ============= HugeTLB Pages ============= @@ -86,7 +84,7 @@ by increasing or decreasing the value of ``nr_hugepages``. Note: When the feature of freeing unused vmemmap pages associated with each hugetlb page is enabled, we can fail to free the huge pages triggered by -the user when ths system is under memory pressure. Please try again later. +the user when the system is under memory pressure. Please try again later. Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under @@ -313,7 +311,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: #. Regardless of mempolicy mode [see - :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`], + Documentation/admin-guide/mm/numa_memory_policy.rst], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous @@ -461,13 +459,13 @@ Examples .. _map_hugetlb: ``map_hugetlb`` - see tools/testing/selftests/vm/map_hugetlb.c + see tools/testing/selftests/mm/map_hugetlb.c ``hugepage-shm`` - see tools/testing/selftests/vm/hugepage-shm.c + see tools/testing/selftests/mm/hugepage-shm.c ``hugepage-mmap`` - see tools/testing/selftests/vm/hugepage-mmap.c + see tools/testing/selftests/mm/hugepage-mmap.c The `libhugetlbfs`_ library provides a wide range of userspace tools to help with huge page usability, environment setup, and control. diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst index df9394fb39c2..16fcf38dac56 100644 --- a/Documentation/admin-guide/mm/idle_page_tracking.rst +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -1,5 +1,3 @@ -.. _idle_page_tracking: - ================== Idle Page Tracking ================== @@ -65,14 +63,13 @@ workload one should: are not reclaimable, he or she can filter them out using ``/proc/kpageflags``. -The page-types tool in the tools/vm directory can be used to assist in this. +The page-types tool in the tools/mm directory can be used to assist in this. If the tool is run initially with the appropriate option, it will mark all the queried pages as idle. Subsequent runs of the tool can then show which pages have their idle flag cleared in the interim. -See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more -information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and -``/proc/kpagecgroup``. +See Documentation/admin-guide/mm/pagemap.rst for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. .. _impl_details: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index d1064e0ba34a..1f883abf3f00 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -16,8 +16,7 @@ are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html Linux memory management has its own jargon and if you are not yet -familiar with it, consider reading -:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`. +familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst. Here we document in detail how to interact with various mechanisms in the Linux memory management. diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index fb6ba2002a4b..eed51a910c94 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -1,5 +1,3 @@ -.. _admin_guide_ksm: - ======================= Kernel Samepage Merging ======================= diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index a3c9e8ad8fa0..1b02fe5807cc 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -1,5 +1,3 @@ -.. _admin_guide_memory_hotplug: - ================== Memory Hot(Un)Plug ================== diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index 5a6afecbb0d0..46515ad2337f 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -1,5 +1,3 @@ -.. _numa_memory_policy: - ================== NUMA Memory Policy ================== @@ -246,7 +244,7 @@ MPOL_INTERLEAVED interleaved system default policy works in this mode. MPOL_PREFERRED_MANY - This mode specifices that the allocation should be preferrably + This mode specifies that the allocation should be preferably satisfied from the nodemask specified in the policy. If there is a memory pressure on all nodes in the nodemask, the allocation can fall back to all existing numa nodes. This is effectively @@ -360,7 +358,7 @@ and NUMA nodes. "Usage" here means one of the following: 2) examination of the policy to determine the policy mode and associated node or node lists, if any, for page allocation. This is considered a "hot path". Note that for MPOL_BIND, the "usage" extends across the entire - allocation process, which may sleep during page reclaimation, because the + allocation process, which may sleep during page reclamation, because the BIND policy nodemask is used, by reference, to filter ineligible nodes. We can avoid taking an extra reference during the usages listed above as diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst index 166697325947..90a12b6a8bfc 100644 --- a/Documentation/admin-guide/mm/numaperf.rst +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -1,6 +1,7 @@ -.. _numaperf: +======================= +NUMA Memory Performance +======================= -============= NUMA Locality ============= @@ -61,7 +62,6 @@ that are CPUs and hence suitable for generic task scheduling, and IO initiators such as GPUs and NICs. Unlike access class 0, only nodes containing CPUs are considered. -================ NUMA Performance ================ @@ -96,7 +96,6 @@ for the platform. Access class 1 takes the same form but only includes values for CPU to memory activity. -========== NUMA Cache ========== @@ -170,7 +169,6 @@ The "size" is the number of bytes provided by this cache level. The "write_policy" will be 0 for write-back, and non-zero for write-through caching. -======== See Also ======== diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 6e2e416af783..b5f970dc91e7 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -1,5 +1,3 @@ -.. _pagemap: - ============================= Examining Process Page Tables ============================= @@ -19,10 +17,10 @@ There are four components to pagemap: * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped * Bit 55 pte is soft-dirty (see - :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) + Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see - :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`) + Documentation/admin-guide/mm/userfaultfd.rst) * Bits 58-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped @@ -46,7 +44,7 @@ There are four components to pagemap: * ``/proc/kpagecount``. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. -The page-types tool in the tools/vm directory can be used to query the +The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each @@ -105,8 +103,7 @@ Short descriptions to the page flags A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages - (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`), + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. @@ -128,7 +125,7 @@ Short descriptions to the page flags Zero page for pfn_zero or huge_zero page. 25 - IDLE The page has not been accessed since it was marked idle (see - :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. @@ -173,7 +170,7 @@ LRU related page flags 14 - SWAPBACKED The page is backed by swap/RAM. -The page-types tool in the tools/vm directory can be used to query the +The page-types tool in the tools/mm directory can be used to query the above flags. Using pagemap to do something useful diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst index 3887f0b294fe..c582033bd113 100644 --- a/Documentation/admin-guide/mm/shrinker_debugfs.rst +++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst @@ -1,5 +1,3 @@ -.. _shrinker_debugfs: - ========================== Shrinker Debugfs Interface ========================== diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst index cb0cfd6672fa..aeea936caa44 100644 --- a/Documentation/admin-guide/mm/soft-dirty.rst +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -1,5 +1,3 @@ -.. _soft_dirty: - =============== Soft-Dirty PTEs =============== diff --git a/Documentation/admin-guide/mm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst index e0466f2db8fa..2e630627bcee 100644 --- a/Documentation/admin-guide/mm/swap_numa.rst +++ b/Documentation/admin-guide/mm/swap_numa.rst @@ -1,5 +1,3 @@ -.. _swap_numa: - =========================================== Automatically bind swap device to numa node =========================================== diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 8ee78ec232eb..b0cc8243e093 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -1,5 +1,3 @@ -.. _admin_guide_transhuge: - ============================ Transparent Hugepage Support ============================ diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 83f31919ebb3..7dc823b56ca4 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -1,5 +1,3 @@ -.. _userfaultfd: - =========== Userfaultfd =========== diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index f67de481c7f6..c5c2c7dbb155 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -1,5 +1,3 @@ -.. _zswap: - ===== zswap ===== @@ -70,9 +68,7 @@ e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which means the compression ratio will always be 2:1 or worse (because of half-full zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. However, -zsmalloc does not implement compressed page eviction, so once zswap fills it -cannot evict the oldest page, it can only reject new pages. +storage method, and it can achieve greater storage densities. When a swap page is passed from frontswap to zswap, zswap maintains a mapping of the swap entry, a combination of the swap type and swap offset, to the zpool diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst index 578407e487d6..75a40846d47f 100644 --- a/Documentation/admin-guide/perf/hns3-pmu.rst +++ b/Documentation/admin-guide/perf/hns3-pmu.rst @@ -53,7 +53,7 @@ two events have same value of bits 0~15 of config, that means they are event pair. And the bit 16 of config indicates getting counter 0 or counter 1 of hardware event. -After getting two values of event pair in usersapce, the formula of +After getting two values of event pair in userspace, the formula of computation to calculate real performance data is::: counter 0 / counter 1 diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 5376d53faaa8..6e5298b521b1 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -230,8 +230,8 @@ with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the request from AMD P-States. -User Space Interface in ``sysfs`` -================================== +User Space Interface in ``sysfs`` - Per-policy control +====================================================== ``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to control its functionality at the system level. They are located in the @@ -262,6 +262,25 @@ lowest non-linear performance in `AMD CPPC Performance Capability <perf_cap_>`_.) This attribute is read-only. +``energy_performance_available_preferences`` + +A list of all the supported EPP preferences that could be used for +``energy_performance_preference`` on this system. +These profiles represent different hints that are provided +to the low-level firmware about the user's desired energy vs efficiency +tradeoff. ``default`` represents the epp value is set by platform +firmware. This attribute is read-only. + +``energy_performance_preference`` + +The current energy performance preference can be read from this attribute. +and user can change current preference according to energy or performance needs +Please get all support profiles list from +``energy_performance_available_preferences`` attribute, all the profiles are +integer values defined between 0 to 255 when EPP feature is enabled by platform +firmware, if EPP feature is disabled, driver will ignore the written value +This attribute is read-write. + Other performance and frequency values can be read back from ``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`. @@ -280,8 +299,30 @@ module which supports the new AMD P-States mechanism on most of the future AMD platforms. The AMD P-States mechanism is the more performance and energy efficiency frequency management method on AMD processors. -Kernel Module Options for ``amd-pstate`` -========================================= + +AMD Pstate Driver Operation Modes +================================= + +``amd_pstate`` CPPC has two operation modes: CPPC Autonomous(active) mode and +CPPC non-autonomous(passive) mode. +active mode and passive mode can be chosen by different kernel parameters. +When in Autonomous mode, CPPC ignores requests done in the Desired Performance +Target register and takes into account only the values set to the Minimum requested +performance, Maximum requested performance, and Energy Performance Preference +registers. When Autonomous is disabled, it only considers the Desired Performance Target. + +Active Mode +------------ + +``amd_pstate=active`` + +This is the low-level firmware control mode which is implemented by ``amd_pstate_epp`` +driver with ``amd_pstate=active`` passed to the kernel in the command line. +In this mode, ``amd_pstate_epp`` driver provides a hint to the hardware if software +wants to bias toward performance (0x0) or energy efficiency (0xff) to the CPPC firmware. +then CPPC power algorithm will calculate the runtime workload and adjust the realtime +cores frequency according to the power supply and thermal, core voltage and some other +hardware conditions. Passive Mode ------------ @@ -298,6 +339,35 @@ processor must provide at least nominal performance requested and go higher if c operating conditions allow. +User Space Interface in ``sysfs`` - General +=========================================== + +Global Attributes +----------------- + +``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to +control its functionality at the system level. They are located in the +``/sys/devices/system/cpu/amd-pstate/`` directory and affect all CPUs. + +``status`` + Operation mode of the driver: "active", "passive" or "disable". + + "active" + The driver is functional and in the ``active mode`` + + "passive" + The driver is functional and in the ``passive mode`` + + "disable" + The driver is unregistered and not functional now. + + This attribute can be written to in order to change the driver's + operation mode or to unregister it. The string written to it must be + one of the possible values of it and, if successful, writing one of + these values to the sysfs file will cause the driver to switch over + to the operation mode represented by that string - or to be + unregistered in the "disable" case. + ``cpupower`` tool support for ``amd-pstate`` =============================================== @@ -403,7 +473,7 @@ Unit Tests for amd-pstate * We can introduce more functional or performance tests to align the result together, it will benefit power and performance scale optimization. -1. Test case decriptions +1. Test case descriptions 1). Basic tests diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index d5043cd8d2f5..bf13ad25a32f 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -712,7 +712,7 @@ it works in the `active mode <Active Mode_>`_. The following sequence of shell commands can be used to enable them and see their output (if the kernel is generally configured to support event tracing):: - # cd /sys/kernel/debug/tracing/ + # cd /sys/kernel/tracing/ # echo 1 > events/power/pstate_sample/enable # echo 1 > events/power/cpu_frequency/enable # cat trace @@ -732,7 +732,7 @@ The ``ftrace`` interface can be used for low-level diagnostics of P-state is called, the ``ftrace`` filter can be set to :c:func:`intel_pstate_set_pstate`:: - # cd /sys/kernel/debug/tracing/ + # cd /sys/kernel/tracing/ # cat available_filter_functions | grep -i pstate intel_pstate_set_pstate intel_pstate_cpu_init diff --git a/Documentation/admin-guide/spkguide.txt b/Documentation/admin-guide/spkguide.txt index 1265c1eab31c..74ea7f391942 100644 --- a/Documentation/admin-guide/spkguide.txt +++ b/Documentation/admin-guide/spkguide.txt @@ -1105,8 +1105,8 @@ speakup load Alternatively, you can add the above line to your file ~/.bashrc or ~/.bash_profile. -If your system administrator ran himself the script, all the users will be able -to change from English to the language choosed by root and do directly +If your system administrator himself ran the script, all the users will be able +to change from English to the language chosen by root and do directly speakupconf load (or add this to the ~/.bashrc or ~/.bash_profile file). If there are several languages to handle, the administrator (or every user) will have to run the first steps until speakupconf diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 46e3d62c0eea..4b7bfea28cd7 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -453,9 +453,10 @@ this allows system administrators to override the kexec_load_disabled =================== -A toggle indicating if the ``kexec_load`` syscall has been disabled. -This value defaults to 0 (false: ``kexec_load`` enabled), but can be -set to 1 (true: ``kexec_load`` disabled). +A toggle indicating if the syscalls ``kexec_load`` and +``kexec_file_load`` have been disabled. +This value defaults to 0 (false: ``kexec_*load`` enabled), but can be +set to 1 (true: ``kexec_*load`` disabled). Once true, kexec can no longer be used, and the toggle cannot be set back to false. This allows a kexec image to be loaded before disabling the syscall, @@ -463,6 +464,24 @@ allowing a system to set up (and later use) an image without it being altered. Generally used together with the `modules_disabled`_ sysctl. +kexec_load_limit_panic +====================== + +This parameter specifies a limit to the number of times the syscalls +``kexec_load`` and ``kexec_file_load`` can be called with a crash +image. It can only be set with a more restrictive value than the +current one. + +== ====================================================== +-1 Unlimited calls to kexec. This is the default setting. +N Number of calls left. +== ====================================================== + +kexec_load_limit_reboot +======================= + +Similar functionality as ``kexec_load_limit_panic``, but for a normal +image. kptr_restrict ============= diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 6394f5dc2303..466c560b0c30 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -215,6 +215,12 @@ rmem_max The maximum receive socket buffer size in bytes. +rps_default_mask +---------------- + +The default RPS CPU mask used on newly created network devices. An empty +mask means RPS disabled by default. + tstamp_allow_data ----------------- Allow processes to receive tx timestamps looped together with the original diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 988f6a4c8084..45ba1f4dc004 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -356,7 +356,7 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file:: But, these values are not used directly. The kernel calculates # of protection pages for each zones from them. These are shown as array of protection pages -in /proc/zoneinfo like followings. (This is an example of x86-64 box). +in /proc/zoneinfo like the following. (This is an example of x86-64 box). Each zone has an array of protection pages like this:: Node 0, zone DMA @@ -433,7 +433,7 @@ a 2bit error in a memory module) is detected in the background by hardware that cannot be handled by the kernel. In some cases (like the page still having a valid copy on disk) the kernel will handle the failure transparently without affecting any applications. But if there is -no other uptodate copy of the data it will kill to prevent any data +no other up-to-date copy of the data it will kill to prevent any data corruptions from propagating. 1: Kill all processes that have the corrupted and not reloadable page mapped diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 0a178ef0111d..51906e47327b 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -138,7 +138,7 @@ Command Function ``v`` Forcefully restores framebuffer console ``v`` Causes ETM buffer dump [ARM-specific] -``w`` Dumps tasks that are in uninterruptable (blocked) state. +``w`` Dumps tasks that are in uninterruptible (blocked) state. ``x`` Used by xmon interface on ppc/powerpc platforms. Show global PMU Registers on sparc64. diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst new file mode 100644 index 000000000000..193b7b01a87d --- /dev/null +++ b/Documentation/admin-guide/thermal/index.rst @@ -0,0 +1,8 @@ +================= +Thermal Subsystem +================= + +.. toctree:: + :maxdepth: 1 + + intel_powerclamp diff --git a/Documentation/admin-guide/thermal/intel_powerclamp.rst b/Documentation/admin-guide/thermal/intel_powerclamp.rst new file mode 100644 index 000000000000..08509b978af4 --- /dev/null +++ b/Documentation/admin-guide/thermal/intel_powerclamp.rst @@ -0,0 +1,345 @@ +======================= +Intel Powerclamp Driver +======================= + +By: + - Arjan van de Ven <arjan@linux.intel.com> + - Jacob Pan <jacob.jun.pan@linux.intel.com> + +.. Contents: + + (*) Introduction + - Goals and Objectives + + (*) Theory of Operation + - Idle Injection + - Calibration + + (*) Performance Analysis + - Effectiveness and Limitations + - Power vs Performance + - Scalability + - Calibration + - Comparison with Alternative Techniques + + (*) Usage and Interfaces + - Generic Thermal Layer (sysfs) + - Kernel APIs (TBD) + + (*) Module Parameters + +INTRODUCTION +============ + +Consider the situation where a system’s power consumption must be +reduced at runtime, due to power budget, thermal constraint, or noise +level, and where active cooling is not preferred. Software managed +passive power reduction must be performed to prevent the hardware +actions that are designed for catastrophic scenarios. + +Currently, P-states, T-states (clock modulation), and CPU offlining +are used for CPU throttling. + +On Intel CPUs, C-states provide effective power reduction, but so far +they’re only used opportunistically, based on workload. With the +development of intel_powerclamp driver, the method of synchronizing +idle injection across all online CPU threads was introduced. The goal +is to achieve forced and controllable C-state residency. + +Test/Analysis has been made in the areas of power, performance, +scalability, and user experience. In many cases, clear advantage is +shown over taking the CPU offline or modulating the CPU clock. + + +THEORY OF OPERATION +=================== + +Idle Injection +-------------- + +On modern Intel processors (Nehalem or later), package level C-state +residency is available in MSRs, thus also available to the kernel. + +These MSRs are:: + + #define MSR_PKG_C2_RESIDENCY 0x60D + #define MSR_PKG_C3_RESIDENCY 0x3F8 + #define MSR_PKG_C6_RESIDENCY 0x3F9 + #define MSR_PKG_C7_RESIDENCY 0x3FA + +If the kernel can also inject idle time to the system, then a +closed-loop control system can be established that manages package +level C-state. The intel_powerclamp driver is conceived as such a +control system, where the target set point is a user-selected idle +ratio (based on power reduction), and the error is the difference +between the actual package level C-state residency ratio and the target idle +ratio. + +Injection is controlled by high priority kernel threads, spawned for +each online CPU. + +These kernel threads, with SCHED_FIFO class, are created to perform +clamping actions of controlled duty ratio and duration. Each per-CPU +thread synchronizes its idle time and duration, based on the rounding +of jiffies, so accumulated errors can be prevented to avoid a jittery +effect. Threads are also bound to the CPU such that they cannot be +migrated, unless the CPU is taken offline. In this case, threads +belong to the offlined CPUs will be terminated immediately. + +Running as SCHED_FIFO and relatively high priority, also allows such +scheme to work for both preemptible and non-preemptible kernels. +Alignment of idle time around jiffies ensures scalability for HZ +values. This effect can be better visualized using a Perf timechart. +The following diagram shows the behavior of kernel thread +kidle_inject/cpu. During idle injection, it runs monitor/mwait idle +for a given "duration", then relinquishes the CPU to other tasks, +until the next time interval. + +The NOHZ schedule tick is disabled during idle time, but interrupts +are not masked. Tests show that the extra wakeups from scheduler tick +have a dramatic impact on the effectiveness of the powerclamp driver +on large scale systems (Westmere system with 80 processors). + +:: + + CPU0 + ____________ ____________ + kidle_inject/0 | sleep | mwait | sleep | + _________| |________| |_______ + duration + CPU1 + ____________ ____________ + kidle_inject/1 | sleep | mwait | sleep | + _________| |________| |_______ + ^ + | + | + roundup(jiffies, interval) + +Only one CPU is allowed to collect statistics and update global +control parameters. This CPU is referred to as the controlling CPU in +this document. The controlling CPU is elected at runtime, with a +policy that favors BSP, taking into account the possibility of a CPU +hot-plug. + +In terms of dynamics of the idle control system, package level idle +time is considered largely as a non-causal system where its behavior +cannot be based on the past or current input. Therefore, the +intel_powerclamp driver attempts to enforce the desired idle time +instantly as given input (target idle ratio). After injection, +powerclamp monitors the actual idle for a given time window and adjust +the next injection accordingly to avoid over/under correction. + +When used in a causal control system, such as a temperature control, +it is up to the user of this driver to implement algorithms where +past samples and outputs are included in the feedback. For example, a +PID-based thermal controller can use the powerclamp driver to +maintain a desired target temperature, based on integral and +derivative gains of the past samples. + + + +Calibration +----------- +During scalability testing, it is observed that synchronized actions +among CPUs become challenging as the number of cores grows. This is +also true for the ability of a system to enter package level C-states. + +To make sure the intel_powerclamp driver scales well, online +calibration is implemented. The goals for doing such a calibration +are: + +a) determine the effective range of idle injection ratio +b) determine the amount of compensation needed at each target ratio + +Compensation to each target ratio consists of two parts: + + a) steady state error compensation + + This is to offset the error occurring when the system can + enter idle without extra wakeups (such as external interrupts). + + b) dynamic error compensation + + When an excessive amount of wakeups occurs during idle, an + additional idle ratio can be added to quiet interrupts, by + slowing down CPU activities. + +A debugfs file is provided for the user to examine compensation +progress and results, such as on a Westmere system:: + + [jacob@nex01 ~]$ cat + /sys/kernel/debug/intel_powerclamp/powerclamp_calib + controlling cpu: 0 + pct confidence steady dynamic (compensation) + 0 0 0 0 + 1 1 0 0 + 2 1 1 0 + 3 3 1 0 + 4 3 1 0 + 5 3 1 0 + 6 3 1 0 + 7 3 1 0 + 8 3 1 0 + ... + 30 3 2 0 + 31 3 2 0 + 32 3 1 0 + 33 3 2 0 + 34 3 1 0 + 35 3 2 0 + 36 3 1 0 + 37 3 2 0 + 38 3 1 0 + 39 3 2 0 + 40 3 3 0 + 41 3 1 0 + 42 3 2 0 + 43 3 1 0 + 44 3 1 0 + 45 3 2 0 + 46 3 3 0 + 47 3 0 0 + 48 3 2 0 + 49 3 3 0 + +Calibration occurs during runtime. No offline method is available. +Steady state compensation is used only when confidence levels of all +adjacent ratios have reached satisfactory level. A confidence level +is accumulated based on clean data collected at runtime. Data +collected during a period without extra interrupts is considered +clean. + +To compensate for excessive amounts of wakeup during idle, additional +idle time is injected when such a condition is detected. Currently, +we have a simple algorithm to double the injection ratio. A possible +enhancement might be to throttle the offending IRQ, such as delaying +EOI for level triggered interrupts. But it is a challenge to be +non-intrusive to the scheduler or the IRQ core code. + + +CPU Online/Offline +------------------ +Per-CPU kernel threads are started/stopped upon receiving +notifications of CPU hotplug activities. The intel_powerclamp driver +keeps track of clamping kernel threads, even after they are migrated +to other CPUs, after a CPU offline event. + + +Performance Analysis +==================== +This section describes the general performance data collected on +multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P). + +Effectiveness and Limitations +----------------------------- +The maximum range that idle injection is allowed is capped at 50 +percent. As mentioned earlier, since interrupts are allowed during +forced idle time, excessive interrupts could result in less +effectiveness. The extreme case would be doing a ping -f to generated +flooded network interrupts without much CPU acknowledgement. In this +case, little can be done from the idle injection threads. In most +normal cases, such as scp a large file, applications can be throttled +by the powerclamp driver, since slowing down the CPU also slows down +network protocol processing, which in turn reduces interrupts. + +When control parameters change at runtime by the controlling CPU, it +may take an additional period for the rest of the CPUs to catch up +with the changes. During this time, idle injection is out of sync, +thus not able to enter package C- states at the expected ratio. But +this effect is minor, in that in most cases change to the target +ratio is updated much less frequently than the idle injection +frequency. + +Scalability +----------- +Tests also show a minor, but measurable, difference between the 4P/8P +Ivy Bridge system and the 80P Westmere server under 50% idle ratio. +More compensation is needed on Westmere for the same amount of +target idle ratio. The compensation also increases as the idle ratio +gets larger. The above reason constitutes the need for the +calibration code. + +On the IVB 8P system, compared to an offline CPU, powerclamp can +achieve up to 40% better performance per watt. (measured by a spin +counter summed over per CPU counting threads spawned for all running +CPUs). + +Usage and Interfaces +==================== +The powerclamp driver is registered to the generic thermal layer as a +cooling device. Currently, it’s not bound to any thermal zones:: + + jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . * + cur_state:0 + max_state:50 + type:intel_powerclamp + +cur_state allows user to set the desired idle percentage. Writing 0 to +cur_state will stop idle injection. Writing a value between 1 and +max_state will start the idle injection. Reading cur_state returns the +actual and current idle percentage. This may not be the same value +set by the user in that current idle percentage depends on workload +and includes natural idle. When idle injection is disabled, reading +cur_state returns value -1 instead of 0 which is to avoid confusing +100% busy state with the disabled state. + +Example usage: + +- To inject 25% idle time:: + + $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state + +If the system is not busy and has more than 25% idle time already, +then the powerclamp driver will not start idle injection. Using Top +will not show idle injection kernel threads. + +If the system is busy (spin test below) and has less than 25% natural +idle time, powerclamp kernel threads will do idle injection. Forced +idle time is accounted as normal idle in that common code path is +taken as the idle task. + +In this example, 24.1% idle is shown. This helps the system admin or +user determine the cause of slowdown, when a powerclamp driver is in action:: + + + Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie + Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st + Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers + Swap: 4087804k total, 0k used, 4087804k free, 945336k cached + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND + 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin + 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0 + 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3 + 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1 + 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2 + 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox + 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg + 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz + +Tests have shown that by using the powerclamp driver as a cooling +device, a PID based userspace thermal controller can manage to +control CPU temperature effectively, when no other thermal influence +is added. For example, a UltraBook user can compile the kernel under +certain temperature (below most active trip points). + +Module Parameters +================= + +``cpumask`` (RW) + A bit mask of CPUs to inject idle. The format of the bitmask is same as + used in other subsystems like in /proc/irq/\*/smp_affinity. The mask is + comma separated 32 bit groups. Each CPU is one bit. For example for a 256 + CPU system the full mask is: + ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff + + The rightmost mask is for CPU 0-32. + +``max_idle`` (RW) + Maximum injected idle time to the total CPU time ratio in percent range + from 1 to 100. Even if the cooling device max_state is always 100 (100%), + this parameter allows to add a max idle percent limit. The default is 50, + to match the current implementation of powerclamp driver. Also doesn't + allow value more than 75, if the cpumask includes every CPU present in + the system. diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst new file mode 100644 index 000000000000..b2e254ec8ee8 --- /dev/null +++ b/Documentation/admin-guide/workload-tracing.rst @@ -0,0 +1,606 @@ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) + +====================================================== +Discovering Linux kernel subsystems used by a workload +====================================================== + +:Authors: - Shuah Khan <skhan@linuxfoundation.org> + - Shefali Sharma <sshefali021@gmail.com> +:maintained-by: Shuah Khan <skhan@linuxfoundation.org> + +Key Points +========== + + * Understanding system resources necessary to build and run a workload + is important. + * Linux tracing and strace can be used to discover the system resources + in use by a workload. The completeness of the system usage information + depends on the completeness of coverage of a workload. + * Performance and security of the operating system can be analyzed with + the help of tools such as: + `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_, + `stress-ng <https://www.mankier.com/1/stress-ng>`_, + `paxtest <https://github.com/opntr/paxtest-freebsd>`_. + * Once we discover and understand the workload needs, we can focus on them + to avoid regressions and use it to evaluate safety considerations. + +Methodology +=========== + +`strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ is a +diagnostic, instructional, and debugging tool and can be used to discover +the system resources in use by a workload. Once we discover and understand +the workload needs, we can focus on them to avoid regressions and use it +to evaluate safety considerations. We use strace tool to trace workloads. + +This method of tracing using strace tells us the system calls invoked by +the workload and doesn't include all the system calls that can be invoked +by it. In addition, this tracing method tells us just the code paths within +these system calls that are invoked. As an example, if a workload opens a +file and reads from it successfully, then the success path is the one that +is traced. Any error paths in that system call will not be traced. If there +is a workload that provides full coverage of a workload then the method +outlined here will trace and find all possible code paths. The completeness +of the system usage information depends on the completeness of coverage of a +workload. + +The goal is tracing a workload on a system running a default kernel without +requiring custom kernel installs. + +How do we gather fine-grained system information? +================================================= + +strace tool can be used to trace system calls made by a process and signals +it receives. System calls are the fundamental interface between an +application and the operating system kernel. They enable a program to +request services from the kernel. For instance, the open() system call in +Linux is used to provide access to a file in the file system. strace enables +us to track all the system calls made by an application. It lists all the +system calls made by a process and their resulting output. + +You can generate profiling data combining strace and perf record tools to +record the events and information associated with a process. This provides +insight into the process. "perf annotate" tool generates the statistics of +each instruction of the program. This document goes over the details of how +to gather fine-grained information on a workload's usage of system resources. + +We used strace to trace the perf, stress-ng, paxtest workloads to illustrate +our methodology to discover resources used by a workload. This process can +be applied to trace other workloads. + +Getting the system ready for tracing +==================================== + +Before we can get started we will show you how to get your system ready. +We assume that you have a Linux distribution running on a physical system +or a virtual machine. Most distributions will include strace command. Let’s +install other tools that aren’t usually included to build Linux kernel. +Please note that the following works on Debian based distributions. You +might have to find equivalent packages on other Linux distributions. + +Install tools to build Linux kernel and tools in kernel repository. +scripts/ver_linux is a good way to check if your system already has +the necessary tools:: + + sudo apt-get build-essentials flex bison yacc + sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + +cscope is a good tool to browse kernel sources. Let's install it now:: + + sudo apt-get install cscope + +Install stress-ng and paxtest:: + + apt-get install stress-ng + apt-get install paxtest + +Workload overview +================= + +As mentioned earlier, we used strace to trace perf bench, stress-ng and +paxtest workloads to show how to analyze a workload and identify Linux +subsystems used by these workloads. Let's start with an overview of these +three workloads to get a better understanding of what they do and how to +use them. + +perf bench (all) workload +------------------------- + +The perf bench command contains multiple multi-threaded microkernel +benchmarks for executing different subsystems in the Linux kernel and +system calls. This allows us to easily measure the impact of changes, +which can help mitigate performance regressions. It also acts as a common +benchmarking framework, enabling developers to easily create test cases, +integrate transparently, and use performance-rich tooling subsystems. + +Stress-ng netdev stressor workload +---------------------------------- + +stress-ng is used for performing stress testing on the kernel. It allows +you to exercise various physical subsystems of the computer, as well as +interfaces of the OS kernel, using "stressor-s". They are available for +CPU, CPU cache, devices, I/O, interrupts, file system, memory, network, +operating system, pipelines, schedulers, and virtual machines. Please refer +to the `stress-ng man-page <https://www.mankier.com/1/stress-ng>`_ to +find the description of all the available stressor-s. The netdev stressor +starts specified number (N) of workers that exercise various netdevice +ioctl commands across all the available network devices. + +paxtest kiddie workload +----------------------- + +paxtest is a program that tests buffer overflows in the kernel. It tests +kernel enforcements over memory usage. Generally, execution in some memory +segments makes buffer overflows possible. It runs a set of programs that +attempt to subvert memory usage. It is used as a regression test suite for +PaX, but might be useful to test other memory protection patches for the +kernel. We used paxtest kiddie mode which looks for simple vulnerabilities. + +What is strace and how do we use it? +==================================== + +As mentioned earlier, strace which is a useful diagnostic, instructional, +and debugging tool and can be used to discover the system resources in use +by a workload. It can be used: + + * To see how a process interacts with the kernel. + * To see why a process is failing or hanging. + * For reverse engineering a process. + * To find the files on which a program depends. + * For analyzing the performance of an application. + * For troubleshooting various problems related to the operating system. + +In addition, strace can generate run-time statistics on times, calls, and +errors for each system call and report a summary when program exits, +suppressing the regular output. This attempts to show system time (CPU time +spent running in the kernel) independent of wall clock time. We plan to use +these features to get information on workload system usage. + +strace command supports basic, verbose, and stats modes. strace command when +run in verbose mode gives more detailed information about the system calls +invoked by a process. + +Running strace -c generates a report of the percentage of time spent in each +system call, the total time in seconds, the microseconds per call, the total +number of calls, the count of each system call that has failed with an error +and the type of system call made. + + * Usage: strace <command we want to trace> + * Verbose mode usage: strace -v <command> + * Gather statistics: strace -c <command> + +We used the “-c” option to gather fine-grained run-time statistics in use +by three workloads we have chose for this analysis. + + * perf + * stress-ng + * paxtest + +What is cscope and how do we use it? +==================================== + +Now let’s look at `cscope <https://cscope.sourceforge.net/>`_, a command +line tool for browsing C, C++ or Java code-bases. We can use it to find +all the references to a symbol, global definitions, functions called by a +function, functions calling a function, text strings, regular expression +patterns, files including a file. + +We can use cscope to find which system call belongs to which subsystem. +This way we can find the kernel subsystems used by a process when it is +executed. + +Let’s checkout the latest Linux repository and build cscope database:: + + git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux + cd linux + cscope -R -p10 # builds cscope.out database before starting browse session + cscope -d -p10 # starts browse session on cscope.out database + +Note: Run "cscope -R -p10" to build the database and c"scope -d -p10" to +enter into the browsing session. cscope by default cscope.out database. +To get out of this mode press ctrl+d. -p option is used to specify the +number of file path components to display. -p10 is optimal for browsing +kernel sources. + +What is perf and how do we use it? +================================== + +Perf is an analysis tool based on Linux 2.6+ systems, which abstracts the +CPU hardware difference in performance measurement in Linux, and provides +a simple command line interface. Perf is based on the perf_events interface +exported by the kernel. It is very useful for profiling the system and +finding performance bottlenecks in an application. + +If you haven't already checked out the Linux mainline repository, you can do +so and then build kernel and perf tool:: + + git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux + cd linux + make -j3 all + cd tools/perf + make + +Note: The perf command can be built without building the kernel in the +repository and can be run on older kernels. However matching the kernel +and perf revisions gives more accurate information on the subsystem usage. + +We used "perf stat" and "perf bench" options. For a detailed information on +the perf tool, run "perf -h". + +perf stat +--------- +The perf stat command generates a report of various hardware and software +events. It does so with the help of hardware counter registers found in +modern CPUs that keep the count of these activities. "perf stat cal" shows +stats for cal command. + +Perf bench +---------- +The perf bench command contains multiple multi-threaded microkernel +benchmarks for executing different subsystems in the Linux kernel and +system calls. This allows us to easily measure the impact of changes, +which can help mitigate performance regressions. It also acts as a common +benchmarking framework, enabling developers to easily create test cases, +integrate transparently, and use performance-rich tooling. + +"perf bench all" command runs the following benchmarks: + + * sched/messaging + * sched/pipe + * syscall/basic + * mem/memcpy + * mem/memset + +What is stress-ng and how do we use it? +======================================= + +As mentioned earlier, stress-ng is used for performing stress testing on +the kernel. It allows you to exercise various physical subsystems of the +computer, as well as interfaces of the OS kernel, using stressor-s. They +are available for CPU, CPU cache, devices, I/O, interrupts, file system, +memory, network, operating system, pipelines, schedulers, and virtual +machines. + +The netdev stressor starts N workers that exercise various netdevice ioctl +commands across all the available network devices. The following ioctls are +exercised: + + * SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFFLAGS + * SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, SIOCGIFMTU + * SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN + +The following command runs the stressor:: + + stress-ng --netdev 1 -t 60 --metrics command. + +We can use the perf record command to record the events and information +associated with a process. This command records the profiling data in the +perf.data file in the same directory. + +Using the following commands you can record the events associated with the +netdev stressor, view the generated report perf.data and annotate the to +view the statistics of each instruction of the program:: + + perf record stress-ng --netdev 1 -t 60 --metrics command. + perf report + perf annotate + +What is paxtest and how do we use it? +===================================== + +paxtest is a program that tests buffer overflows in the kernel. It tests +kernel enforcements over memory usage. Generally, execution in some memory +segments makes buffer overflows possible. It runs a set of programs that +attempt to subvert memory usage. It is used as a regression test suite for +PaX, and will be useful to test other memory protection patches for the +kernel. + +paxtest provides kiddie and blackhat modes. The paxtest kiddie mode runs +in normal mode, whereas the blackhat mode tries to get around the protection +of the kernel testing for vulnerabilities. We focus on the kiddie mode here +and combine "paxtest kiddie" run with "perf record" to collect CPU stack +traces for the paxtest kiddie run to see which function is calling other +functions in the performance profile. Then the "dwarf" (DWARF's Call Frame +Information) mode can be used to unwind the stack. + +The following command can be used to view resulting report in call-graph +format:: + + perf record --call-graph dwarf paxtest kiddie + perf report --stdio + +Tracing workloads +================= + +Now that we understand the workloads, let's start tracing them. + +Tracing perf bench all workload +------------------------------- + +Run the following command to trace perf bench all workload:: + + strace -c perf bench all + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+-------------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+=========================+ +| getppid | 10000001 | Process Mgmt | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| clone | 1077 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+-------------------------+ +| prctl | 23 | Process Mgmt. | sys_prctl() | ++-------------------+-----------+-----------------+-------------------------+ +| prlimit64 | 7 | Process Mgmt. | sys_prlimit64() | ++-------------------+-----------+-----------------+-------------------------+ +| getpid | 10 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| uname | 3 | Process Mgmt. | sys_uname() | ++-------------------+-----------+-----------------+-------------------------+ +| sysinfo | 1 | Process Mgmt. | sys_sysinfo() | ++-------------------+-----------+-----------------+-------------------------+ +| getuid | 1 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getgid | 1 | Process Mgmt. | sys_getgid() | ++-------------------+-----------+-----------------+-------------------------+ +| geteuid | 1 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getegid | 1 | Process Mgmt. | sys_getegid | ++-------------------+-----------+-----------------+-------------------------+ +| close | 49951 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+-------------------------+ +| pipe | 604 | Filesystem | sys_pipe() | ++-------------------+-----------+-----------------+-------------------------+ +| openat | 48560 | Filesystem | sys_opennat() | ++-------------------+-----------+-----------------+-------------------------+ +| fstat | 8338 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+-------------------------+ +| stat | 1573 | Filesystem | sys_stat() | ++-------------------+-----------+-----------------+-------------------------+ +| pread64 | 9646 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+-------------------------+ +| getdents64 | 1873 | Filesystem | sys_getdents64() | ++-------------------+-----------+-----------------+-------------------------+ +| access | 3 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+-------------------------+ +| lstat | 1880 | Filesystem | sys_lstat() | ++-------------------+-----------+-----------------+-------------------------+ +| lseek | 6 | Filesystem | sys_lseek() | ++-------------------+-----------+-----------------+-------------------------+ +| ioctl | 3 | Filesystem | sys_ioctl() | ++-------------------+-----------+-----------------+-------------------------+ +| dup2 | 1 | Filesystem | sys_dup2() | ++-------------------+-----------+-----------------+-------------------------+ +| execve | 2 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+-------------------------+ +| fcntl | 8779 | Filesystem | sys_fcntl() | ++-------------------+-----------+-----------------+-------------------------+ +| statfs | 1 | Filesystem | sys_statfs() | ++-------------------+-----------+-----------------+-------------------------+ +| epoll_create | 2 | Filesystem | sys_epoll_create() | ++-------------------+-----------+-----------------+-------------------------+ +| epoll_ctl | 64 | Filesystem | sys_epoll_ctl() | ++-------------------+-----------+-----------------+-------------------------+ +| newfstatat | 8318 | Filesystem | sys_newfstatat() | ++-------------------+-----------+-----------------+-------------------------+ +| eventfd2 | 192 | Filesystem | sys_eventfd2() | ++-------------------+-----------+-----------------+-------------------------+ +| mmap | 243 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+-------------------------+ +| mprotect | 32 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+-------------------------+ +| brk | 21 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+-------------------------+ +| munmap | 128 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+-------------------------+ +| set_mempolicy | 156 | Memory Mgmt. | sys_set_mempolicy() | ++-------------------+-----------+-----------------+-------------------------+ +| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() | ++-------------------+-----------+-----------------+-------------------------+ +| set_robust_list | 1 | Futex | sys_set_robust_list() | ++-------------------+-----------+-----------------+-------------------------+ +| futex | 341 | Futex | sys_futex() | ++-------------------+-----------+-----------------+-------------------------+ +| sched_getaffinity | 79 | Scheduler | sys_sched_getaffinity() | ++-------------------+-----------+-----------------+-------------------------+ +| sched_setaffinity | 223 | Scheduler | sys_sched_setaffinity() | ++-------------------+-----------+-----------------+-------------------------+ +| socketpair | 202 | Network | sys_socketpair() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigprocmask | 21 | Signal | sys_rt_sigprocmask() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigaction | 36 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigreturn | 2 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+-------------------------+ +| wait4 | 889 | Time | sys_wait4() | ++-------------------+-----------+-----------------+-------------------------+ +| clock_nanosleep | 37 | Time | sys_clock_nanosleep() | ++-------------------+-----------+-----------------+-------------------------+ +| capget | 4 | Capability | sys_capget() | ++-------------------+-----------+-----------------+-------------------------+ + +Tracing stress-ng netdev stressor workload +------------------------------------------ + +Run the following command to trace stress-ng netdev stressor workload:: + + strace -c stress-ng --netdev 1 -t 60 --metrics + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+-------------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+=========================+ +| openat | 74 | Filesystem | sys_openat() | ++-------------------+-----------+-----------------+-------------------------+ +| close | 75 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+-------------------------+ +| read | 58 | Filesystem | sys_read() | ++-------------------+-----------+-----------------+-------------------------+ +| fstat | 20 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+-------------------------+ +| flock | 10 | Filesystem | sys_flock() | ++-------------------+-----------+-----------------+-------------------------+ +| write | 7 | Filesystem | sys_write() | ++-------------------+-----------+-----------------+-------------------------+ +| getdents64 | 8 | Filesystem | sys_getdents64() | ++-------------------+-----------+-----------------+-------------------------+ +| pread64 | 8 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+-------------------------+ +| lseek | 1 | Filesystem | sys_lseek() | ++-------------------+-----------+-----------------+-------------------------+ +| access | 2 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+-------------------------+ +| getcwd | 1 | Filesystem | sys_getcwd() | ++-------------------+-----------+-----------------+-------------------------+ +| execve | 1 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+-------------------------+ +| mmap | 61 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+-------------------------+ +| munmap | 3 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+-------------------------+ +| mprotect | 20 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+-------------------------+ +| mlock | 2 | Memory Mgmt. | sys_mlock() | ++-------------------+-----------+-----------------+-------------------------+ +| brk | 3 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigaction | 21 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigprocmask | 1 | Signal | sys_rt_sigprocmask() | ++-------------------+-----------+-----------------+-------------------------+ +| sigaltstack | 1 | Signal | sys_sigaltstack() | ++-------------------+-----------+-----------------+-------------------------+ +| rt_sigreturn | 1 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+-------------------------+ +| getpid | 8 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+-------------------------+ +| prlimit64 | 5 | Process Mgmt. | sys_prlimit64() | ++-------------------+-----------+-----------------+-------------------------+ +| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() | ++-------------------+-----------+-----------------+-------------------------+ +| sysinfo | 2 | Process Mgmt. | sys_sysinfo() | ++-------------------+-----------+-----------------+-------------------------+ +| getuid | 2 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+-------------------------+ +| uname | 1 | Process Mgmt. | sys_uname() | ++-------------------+-----------+-----------------+-------------------------+ +| setpgid | 1 | Process Mgmt. | sys_setpgid() | ++-------------------+-----------+-----------------+-------------------------+ +| getrusage | 1 | Process Mgmt. | sys_getrusage() | ++-------------------+-----------+-----------------+-------------------------+ +| geteuid | 1 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+-------------------------+ +| getppid | 1 | Process Mgmt. | sys_getppid() | ++-------------------+-----------+-----------------+-------------------------+ +| sendto | 3 | Network | sys_sendto() | ++-------------------+-----------+-----------------+-------------------------+ +| connect | 1 | Network | sys_connect() | ++-------------------+-----------+-----------------+-------------------------+ +| socket | 1 | Network | sys_socket() | ++-------------------+-----------+-----------------+-------------------------+ +| clone | 1 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+-------------------------+ +| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() | ++-------------------+-----------+-----------------+-------------------------+ +| wait4 | 2 | Time | sys_wait4() | ++-------------------+-----------+-----------------+-------------------------+ +| alarm | 1 | Time | sys_alarm() | ++-------------------+-----------+-----------------+-------------------------+ +| set_robust_list | 1 | Futex | sys_set_robust_list() | ++-------------------+-----------+-----------------+-------------------------+ + +Tracing paxtest kiddie workload +------------------------------- + +Run the following command to trace paxtest kiddie workload:: + + strace -c paxtest kiddie + +**System Calls made by the workload** + +The below table shows the system calls invoked by the workload, number of +times each system call is invoked, and the corresponding Linux subsystem. + ++-------------------+-----------+-----------------+----------------------+ +| System Call | # calls | Linux Subsystem | System Call (API) | ++===================+===========+=================+======================+ +| read | 3 | Filesystem | sys_read() | ++-------------------+-----------+-----------------+----------------------+ +| write | 11 | Filesystem | sys_write() | ++-------------------+-----------+-----------------+----------------------+ +| close | 41 | Filesystem | sys_close() | ++-------------------+-----------+-----------------+----------------------+ +| stat | 24 | Filesystem | sys_stat() | ++-------------------+-----------+-----------------+----------------------+ +| fstat | 2 | Filesystem | sys_fstat() | ++-------------------+-----------+-----------------+----------------------+ +| pread64 | 6 | Filesystem | sys_pread64() | ++-------------------+-----------+-----------------+----------------------+ +| access | 1 | Filesystem | sys_access() | ++-------------------+-----------+-----------------+----------------------+ +| pipe | 1 | Filesystem | sys_pipe() | ++-------------------+-----------+-----------------+----------------------+ +| dup2 | 24 | Filesystem | sys_dup2() | ++-------------------+-----------+-----------------+----------------------+ +| execve | 1 | Filesystem | sys_execve() | ++-------------------+-----------+-----------------+----------------------+ +| fcntl | 26 | Filesystem | sys_fcntl() | ++-------------------+-----------+-----------------+----------------------+ +| openat | 14 | Filesystem | sys_openat() | ++-------------------+-----------+-----------------+----------------------+ +| rt_sigaction | 7 | Signal | sys_rt_sigaction() | ++-------------------+-----------+-----------------+----------------------+ +| rt_sigreturn | 38 | Signal | sys_rt_sigreturn() | ++-------------------+-----------+-----------------+----------------------+ +| clone | 38 | Process Mgmt. | sys_clone() | ++-------------------+-----------+-----------------+----------------------+ +| wait4 | 44 | Time | sys_wait4() | ++-------------------+-----------+-----------------+----------------------+ +| mmap | 7 | Memory Mgmt. | sys_mmap() | ++-------------------+-----------+-----------------+----------------------+ +| mprotect | 3 | Memory Mgmt. | sys_mprotect() | ++-------------------+-----------+-----------------+----------------------+ +| munmap | 1 | Memory Mgmt. | sys_munmap() | ++-------------------+-----------+-----------------+----------------------+ +| brk | 3 | Memory Mgmt. | sys_brk() | ++-------------------+-----------+-----------------+----------------------+ +| getpid | 1 | Process Mgmt. | sys_getpid() | ++-------------------+-----------+-----------------+----------------------+ +| getuid | 1 | Process Mgmt. | sys_getuid() | ++-------------------+-----------+-----------------+----------------------+ +| getgid | 1 | Process Mgmt. | sys_getgid() | ++-------------------+-----------+-----------------+----------------------+ +| geteuid | 2 | Process Mgmt. | sys_geteuid() | ++-------------------+-----------+-----------------+----------------------+ +| getegid | 1 | Process Mgmt. | sys_getegid() | ++-------------------+-----------+-----------------+----------------------+ +| getppid | 1 | Process Mgmt. | sys_getppid() | ++-------------------+-----------+-----------------+----------------------+ +| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() | ++-------------------+-----------+-----------------+----------------------+ + +Conclusion +========== + +This document is intended to be used as a guide on how to gather fine-grained +information on the resources in use by workloads using strace. + +References +========== + + * `Discovery Linux Kernel Subsystems used by OpenAPS <https://elisa.tech/blog/2022/02/02/discovery-linux-kernel-subsystems-used-by-openaps>`_ + * `ELISA-White-Papers-Discovering Linux kernel subsystems used by a workload <https://github.com/elisa-tech/ELISA-White-Papers/blob/master/Processes/Discovering_Linux_kernel_subsystems_used_by_a_workload.md>`_ + * `strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ + * `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_ + * `paxtest README <https://github.com/opntr/paxtest-freebsd/blob/hardenedbsd/0.9.14-hbsd/README>`_ + * `stress-ng <https://www.mankier.com/1/stress-ng>`_ + * `Monitoring and managing system status and performance <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/index>`_ diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index 8de008c0c5ad..e2561416391c 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -296,7 +296,7 @@ The following sysctls are available for the XFS filesystem: XFS_ERRLEVEL_LOW: 1 XFS_ERRLEVEL_HIGH: 5 - fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256) + fs.xfs.panic_mask (Min: 0 Default: 0 Max: 511) Causes certain error conditions to call BUG(). Value is a bitmask; OR together the tags which represent errors which should cause panics: |