Diffstat (limited to 'Documentation/admin-guide')
19 files changed, 761 insertions, 32 deletions
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst new file mode 100644 index 000000000000..a4b99afbe331 --- /dev/null +++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst @@ -0,0 +1,76 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================== +Collaborative Processor Performance Control (CPPC) +================================================== + +CPPC +==== + +CPPC defined in the ACPI spec describes a mechanism for the OS to manage the +performance of a logical processor on a contiguous and abstract performance +scale. CPPC exposes a set of registers to describe abstract performance scale, +to request performance levels and to measure per-cpu delivered performance. + +For more details on CPPC please refer to the ACPI specification at: + +http://uefi.org/specifications + +Some of the CPPC registers are exposed via sysfs under:: + + /sys/devices/system/cpu/cpuX/acpi_cppc/ + +for each cpu X:: + + $ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/ + /sys/devices/system/cpu/cpu0/acpi_cppc/: + total 0 + -r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs + -r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq + -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf + -r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time + +* highest_perf : Highest performance of this processor (abstract scale). +* nominal_perf : Highest sustained performance of this processor + (abstract scale). +* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear + power savings (abstract scale). +* lowest_perf : Lowest performance of this processor (abstract scale).
+ +* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz). +* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz). + The above frequencies should only be used to report processor performance in + frequency instead of abstract scale. These values should not be used for any + functional decisions. + +* feedback_ctrs : Includes both Reference and delivered performance counter. + Reference counter ticks up proportional to processor's reference performance. + Delivered counter ticks up proportional to processor's delivered performance. +* wraparound_time: Minimum time for the feedback counters to wrap around + (seconds). +* reference_perf : Performance level at which reference performance counter + accumulates (abstract scale). + + +Computing Average Delivered Performance +======================================= + +The steps below compute the average performance delivered by +taking two snapshots of the feedback counters at times T1 and T2. + + T1: Read feedback_ctrs as fbc_t1 + Wait or run some workload + + T2: Read feedback_ctrs as fbc_t2 + +:: + + delivered_counter_delta = fbc_t2[del] - fbc_t1[del] + reference_counter_delta = fbc_t2[ref] - fbc_t1[ref] + + delivered_perf = (reference_perf x delivered_counter_delta) / reference_counter_delta diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst new file mode 100644 index 000000000000..50bd7f194bf4 --- /dev/null +++ b/Documentation/admin-guide/acpi/dsdt-override.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Overriding DSDT +=============== + +Linux supports a method of overriding the BIOS DSDT: + +CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
+ +When to use this method is described in detail on the +Linux/ACPI home page: +https://01.org/linux-acpi/documentation/overriding-dsdt diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst new file mode 100644 index 000000000000..4d13eeea1eca --- /dev/null +++ b/Documentation/admin-guide/acpi/index.rst @@ -0,0 +1,14 @@ +============ +ACPI Support +============ + +Here we document in detail how to interact with various mechanisms in +the Linux ACPI support. + +.. toctree:: + :maxdepth: 1 + + initrd_table_override + dsdt-override + ssdt-overlays + cppc_sysfs diff --git a/Documentation/admin-guide/acpi/initrd_table_override.rst b/Documentation/admin-guide/acpi/initrd_table_override.rst new file mode 100644 index 000000000000..cbd768207631 --- /dev/null +++ b/Documentation/admin-guide/acpi/initrd_table_override.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +Upgrading ACPI tables via initrd +================================ + +What is this about +================== + +If the ACPI_TABLE_UPGRADE compile option is true, it is possible to +upgrade the ACPI execution environment that is defined by the ACPI tables +via upgrading the ACPI tables provided by the BIOS with an instrumented, +modified, more recent version one, or installing brand new ACPI tables. + +When building initrd with kernel in a single image, option +ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD should also be true for this +feature to work. + +For a full list of ACPI tables that can be upgraded/installed, take a look +at the char `*table_sigs[MAX_ACPI_SIGNATURE];` definition in +drivers/acpi/tables.c. + +All ACPI tables iasl (Intel's ACPI compiler and disassembler) knows should +be overridable, except: + + - ACPI_SIG_RSDP (has a signature of 6 bytes) + - ACPI_SIG_FACS (does not have an ordinary ACPI table header) + +Both could get implemented as well. 
+ + +What is this for +================ + +Complain to your platform/BIOS vendor if you find a bug which is so severe +that a workaround is not accepted in the Linux kernel. And this facility +allows you to upgrade the buggy tables before your platform/BIOS vendor +releases an upgraded BIOS binary. + +This facility can be used by platform/BIOS vendors to provide a Linux +compatible environment without modifying the underlying platform firmware. + +This facility also provides a powerful feature to easily debug and test +ACPI BIOS table compatibility with the Linux kernel by modifying old +platform provided ACPI tables or inserting new ACPI tables. + +It can and should be enabled in any kernel because there is no functional +change with not instrumented initrds. + + +How does it work +================ +:: + + # Extract the machine's ACPI tables: + cd /tmp + acpidump >acpidump + acpixtract -a acpidump + # Disassemble, modify and recompile them: + iasl -d *.dat + # For example add this statement into a _PRT (PCI Routing Table) function + # of the DSDT: + Store("HELLO WORLD", debug) + # And increase the OEM Revision. For example, before modification: + DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000000) + # After modification: + DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000001) + iasl -sa dsdt.dsl + # Add the raw ACPI tables to an uncompressed cpio archive. + # They must be put into a /kernel/firmware/acpi directory inside the cpio + # archive. Note that if the table put here matches a platform table + # (similar Table Signature, and similar OEMID, and similar OEM Table ID) + # with a more recent OEM Revision, the platform table will be upgraded by + # this table. If the table put here doesn't match a platform table + # (dissimilar Table Signature, or dissimilar OEMID, or dissimilar OEM Table + # ID), this table will be appended. 
+ mkdir -p kernel/firmware/acpi + cp dsdt.aml kernel/firmware/acpi + # A maximum of "NR_ACPI_INITRD_TABLES (64)" tables are currently allowed + # (see osl.c): + iasl -sa facp.dsl + iasl -sa ssdt1.dsl + cp facp.aml kernel/firmware/acpi + cp ssdt1.aml kernel/firmware/acpi + # The uncompressed cpio archive must be the first. Other, typically + # compressed cpio archives, must be concatenated on top of the uncompressed + # one. Following command creates the uncompressed cpio archive and + # concatenates the original initrd on top: + find kernel | cpio -H newc --create > /boot/instrumented_initrd + cat /boot/initrd >>/boot/instrumented_initrd + # reboot with increased acpi debug level, e.g. boot params: + acpi.debug_level=0x2 acpi.debug_layer=0xFFFFFFFF + # and check your syslog: + [ 1.268089] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] + [ 1.272091] [ACPI Debug] String [0x0B] "HELLO WORLD" + +iasl is able to disassemble and recompile quite a lot different, +also static ACPI tables. + + +Where to retrieve userspace tools +================================= + +iasl and acpixtract are part of Intel's ACPICA project: +http://acpica.org/ + +and should be packaged by distributions (for example in the acpica package +on SUSE). + +acpidump can be found in Len Browns pmtools: +ftp://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools/acpidump + +This tool is also part of the acpica package on SUSE. +Alternatively, used ACPI tables can be retrieved via sysfs in latest kernels: +/sys/firmware/acpi/tables diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst new file mode 100644 index 000000000000..da37455f96c9 --- /dev/null +++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst @@ -0,0 +1,180 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +SSDT Overlays +============= + +In order to support ACPI open-ended hardware configurations (e.g. 
development +boards) we need a way to augment the ACPI configuration provided by the firmware +image. A common example is connecting sensors on I2C / SPI buses on development +boards. + +Although this can be accomplished by creating a kernel platform driver or +recompiling the firmware image with updated ACPI tables, neither is practical: +the former proliferates board specific kernel code while the latter requires +access to firmware tools which are often not publicly available. + +Because ACPI supports external references in AML code a more practical +way to augment firmware ACPI configuration is by dynamically loading +user defined SSDT tables that contain the board specific information. + +For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the +Minnowboard MAX development board exposed via the LSE connector [1], the +following ASL code can be used:: + + DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003) + { + External (\_SB.I2C6, DeviceObj) + + Scope (\_SB.I2C6) + { + Device (STAC) + { + Name (_ADR, Zero) + Name (_HID, "BMA222E") + + Method (_CRS, 0, Serialized) + { + Name (RBUF, ResourceTemplate () + { + I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80, + AddressingMode7Bit, "\\_SB.I2C6", 0x00, + ResourceConsumer, ,) + GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000, + "\\_SB.GPO2", 0x00, ResourceConsumer, , ) + { // Pin list + 0 + } + }) + Return (RBUF) + } + } + } + } + +which can then be compiled to AML binary format:: + + $ iasl minnowmax.asl + + Intel ACPI Component Architecture + ASL Optimizing Compiler version 20140214-64 [Mar 29 2014] + Copyright (c) 2000 - 2014 Intel Corporation + + ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords + AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes + +[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29 + +The resulting AML code can then be loaded by the kernel using one of the methods 
+below. + +Loading ACPI SSDTs from initrd +============================== + +This option allows loading of user defined SSDTs from initrd and it is useful +when the system does not support EFI or when there is not enough EFI storage. + +It works in a similar way with initrd based ACPI tables override/upgrade: SSDT +aml code must be placed in the first, uncompressed, initrd under the +"kernel/firmware/acpi" path. Multiple files can be used and this will translate +in loading multiple tables. Only SSDT and OEM tables are allowed. See +initrd_table_override.txt for more details. + +Here is an example:: + + # Add the raw ACPI tables to an uncompressed cpio archive. + # They must be put into a /kernel/firmware/acpi directory inside the + # cpio archive. + # The uncompressed cpio archive must be the first. + # Other, typically compressed cpio archives, must be + # concatenated on top of the uncompressed one. + mkdir -p kernel/firmware/acpi + cp ssdt.aml kernel/firmware/acpi + + # Create the uncompressed cpio archive and concatenate the original initrd + # on top: + find kernel | cpio -H newc --create > /boot/instrumented_initrd + cat /boot/initrd >>/boot/instrumented_initrd + +Loading ACPI SSDTs from EFI variables +===================================== + +This is the preferred method, when EFI is supported on the platform, because it +allows a persistent, OS independent way of storing the user defined SSDTs. There +is also work underway to implement EFI support for loading user defined SSDTs +and using this method will make it easier to convert to the EFI loading +mechanism when that will arrive. + +In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line +parameter can be used. The argument for the option is the variable name to +use. If there are multiple variables with the same name but with different +vendor GUIDs, all of them will be loaded. + +In order to store the AML code in an EFI variable the efivarfs filesystem can be +used. 
It is enabled and mounted by default in /sys/firmware/efi/efivars in all +recent distributions. + +Creating a new file in /sys/firmware/efi/efivars will automatically create a new +EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI +variable. Please note that the file name needs to be specially formatted as +"Name-GUID" and that the first 4 bytes in the file (little-endian format) +represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in +include/linux/efi.h). Writing to the file must also be done with one write +operation. + +For example, you can use the following shell script to create/update an EFI +variable with the content from a given file:: + + #!/bin/sh -e + + while ! [ -z "$1" ]; do + case "$1" in + "-f") filename="$2"; shift;; + "-g") guid="$2"; shift;; + *) name="$1";; + esac + shift + done + + usage() + { + echo "Syntax: ${0##*/} -f filename [ -g guid ] name" + exit 1 + } + + [ -n "$name" -a -f "$filename" ] || usage + + EFIVARFS="/sys/firmware/efi/efivars" + + [ -d "$EFIVARFS" ] || exit 2 + + if stat -tf $EFIVARFS | grep -q -v de5e81e4; then + mount -t efivarfs none $EFIVARFS + fi + + # try to pick up an existing GUID + [ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-) + + # use a randomly generated GUID + [ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)" + + # efivarfs expects all of the data in one write + tmp=$(mktemp) + /bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp + dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp) + rm $tmp + +Loading ACPI SSDTs from configfs +================================ + +This option allows loading of user defined SSDTs from userspace via the configfs +interface. The CONFIG_ACPI_CONFIGFS option must be selected and configfs must be +mounted. In the following examples, we assume that configfs has been mounted in +/config.
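Before the configfs examples can work, configfs has to be mounted at the assumed location. A minimal sketch of that check follows; the helper name ``configfs_mounted_at`` and the ``MOUNTS_FILE`` override are hypothetical and only there to keep the sketch self-contained, and the mount command it prints is the usual way to mount configfs rather than something this patch prescribes.

```shell
#!/bin/sh
# Sketch: report whether configfs is mounted at /config.
# MOUNTS_FILE is a hypothetical override (useful for testing); by default
# the kernel's mount table in /proc/mounts is consulted.
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"

configfs_mounted_at() {
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$1" \
        '$2 == mp && $3 == "configfs" { found = 1 } END { exit !found }' \
        "$MOUNTS_FILE"
}

if ! configfs_mounted_at /config; then
    # Only print the hint; mounting needs root and is left to the admin.
    echo "configfs is not mounted at /config; as root, run:"
    echo "  mkdir -p /config && mount -t configfs none /config"
fi
```

Many distributions mount configfs elsewhere (commonly under /sys/kernel/config); in that case the paths in the examples need to be adjusted accordingly.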
+ +New tables can be loaded by creating new directories in /config/acpi/table/ and +writing the SSDT aml code in the aml attribute:: + + cd /config/acpi/table + mkdir my_ssdt + cat ~/ssdt.aml > my_ssdt/aml diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index e506d3dae510..059ddcbe769d 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -91,10 +91,48 @@ Currently Available * large block (up to pagesize) support * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force the ordering) +* Case-insensitive file name lookups [1] Filesystems with a block size of 1k may see a limit imposed by the directory hash tree having a maximum depth of two. +case-insensitive file name lookups +====================================================== + +The case-insensitive file name lookup feature is supported on a +per-directory basis, allowing the user to mix case-insensitive and +case-sensitive directories in the same filesystem. It is enabled by +flipping the +F inode attribute of an empty directory. The +case-insensitive string match operation is only defined when we know how +text is encoded in a byte sequence. For that reason, in order to enable +case-insensitive directories, the filesystem must have the +casefold feature, which stores the filesystem-wide encoding +model used. By default, the charset adopted is the latest version of +Unicode (12.1.0, by the time of this writing), encoded in the UTF-8 +form. The comparison algorithm is implemented by normalizing the +strings to the Canonical decomposition form, as defined by Unicode, +followed by a byte per byte comparison. + +The case-awareness is name-preserving on the disk, meaning that the file +name provided by userspace is a byte-per-byte match to what is actually +written in the disk.
The Unicode normalization format used by the +kernel is thus an internal representation, and not exposed to the +userspace nor to the disk, with the important exception of disk hashes, +used on large case-insensitive directories with DX feature. On DX +directories, the hash must be calculated using the casefolded version of +the filename, meaning that the normalization format used actually has an +impact on where the directory entry is stored. + +When we change from viewing filenames as opaque byte sequences to seeing +them as encoded strings we need to address what happens when a program +tries to create a file with an invalid name. The Unicode subsystem +within the kernel leaves the decision of what to do in this case to the +filesystem, which selects its preferred behavior by enabling/disabling +the strict mode. When Ext4 encounters one of those strings and the +filesystem did not require strict mode, it falls back to considering the +entire string as an opaque byte sequence, which still allows the user to +operate on that file, but the case-insensitive lookups won't work. + Options ======= diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 0a491676685e..5b8286fdd91b 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -77,6 +77,7 @@ configure specific aspects of kernel behavior to your liking. LSM/index mm/index perf-security + acpi/index .. only:: subproject and html diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index b8d0bc07ed0a..0124980dca2d 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -88,6 +88,7 @@ parameter is applicable:: APIC APIC support is enabled. APM Advanced Power Management support is enabled. ARM ARM architecture is enabled. + ARM64 ARM64 architecture is enabled. AX25 Appropriate AX.25 support is enabled.
CLK Common clock infrastructure is enabled. CMA Contiguous Memory Area support is enabled. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2b8ee90bb644..a1fe7e8c4f15 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -704,8 +704,11 @@ upon panic. This parameter reserves the physical memory region [offset, offset + size] for that kernel image. If '@offset' is omitted, then a suitable offset - is selected automatically. Check - Documentation/kdump/kdump.txt for further details. + is selected automatically. + [KNL, x86_64] select a region under 4G first, and + fall back to reserve region above 4G when '@offset' + hasn't been specified. + See Documentation/kdump/kdump.txt for further details. crashkernel=range1:size1[,range2:size2,...][@offset] [KNL] Same as above, but depends on the memory @@ -1585,7 +1588,7 @@ Format: { "off" | "enforce" | "fix" | "log" } default: "enforce" - ima_appraise_tcb [IMA] + ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead. The builtin appraise policy appraises all files owned by uid=0. @@ -1612,8 +1615,7 @@ uid=0. The "appraise_tcb" policy appraises the integrity of - all files owned by root. (This is the equivalent - of ima_appraise_tcb.) + all files owned by root. The "secure_boot" policy appraises the integrity of files (eg. kexec kernel image, kernel modules, @@ -2544,6 +2546,40 @@ in the "bleeding edge" mini2440 support kernel at http://repo.or.cz/w/linux-2.6/mini2440.git + mitigations= + [X86,PPC,S390,ARM64] Control optional mitigations for + CPU vulnerabilities. This is a set of curated, + arch-independent options, each of which is an + aggregation of existing arch-specific options. + + off + Disable all optional CPU mitigations. This + improves system performance, but it may also + expose users to several CPU vulnerabilities. 
+ Equivalent to: nopti [X86,PPC] + kpti=0 [ARM64] + nospectre_v1 [PPC] + nobp=0 [S390] + nospectre_v2 [X86,PPC,S390,ARM64] + spectre_v2_user=off [X86] + spec_store_bypass_disable=off [X86,PPC] + ssbd=force-off [ARM64] + l1tf=off [X86] + + auto (default) + Mitigate all CPU vulnerabilities, but leave SMT + enabled, even if it's vulnerable. This is for + users who don't want to be surprised by SMT + getting disabled across kernel upgrades, or who + have other ways of avoiding SMT-based attacks. + Equivalent to: (default behavior) + + auto,nosmt + Mitigate all CPU vulnerabilities, disabling SMT + if needed. This is for users who always want to + be fully mitigated, even if it means losing SMT. + Equivalent to: l1tf=flush,nosmt [X86] + mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -2873,10 +2909,10 @@ check bypass). With this option data leaks are possible in the system. - nospectre_v2 [X86,PPC_FSL_BOOK3E] Disable all mitigations for the Spectre variant 2 - (indirect branch prediction) vulnerability. System may - allow data leaks with this option, which is equivalent - to spectre_v2=off. + nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for + the Spectre variant 2 (indirect branch prediction) + vulnerability. System may allow data leaks with this + option. nospec_store_bypass_disable [HW] Disable all mitigations for the Speculative Store Bypass vulnerability @@ -3394,6 +3430,8 @@ bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group. + force_floating [S390] Force usage of floating interrupts. + nomio [S390] Do not use MIO instructions. pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power Management. @@ -3623,7 +3661,9 @@ see CONFIG_RAS_CEC help text. rcu_nocbs= [KNL] - The argument is a cpu list, as described above. 
+ The argument is a cpu list, as described above, + except that the string "all" can be used to + specify every CPU on the system. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. @@ -4703,6 +4743,10 @@ [x86] unstable: mark the TSC clocksource as unstable, this marks the TSC unconditionally unstable at bootup and avoids any further wobbles once the TSC watchdog notices. + [x86] nowatchdog: disable clocksource watchdog. Used + in situations with strict latency requirements (where + interruptions from clocksource watchdog are not + acceptable). turbografx.map[2|3]= [HW,JOY] TurboGraFX parallel port interface diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst new file mode 100644 index 000000000000..b79f70c04397 --- /dev/null +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -0,0 +1,169 @@ +.. _numaperf: + +============= +NUMA Locality +============= + +Some platforms may have multiple types of memory attached to a compute +node. These disparate memory ranges may share some characteristics, such +as CPU cache coherence, but may have different performance. For example, +different media types and buses affect bandwidth and latency. + +A system supports such heterogeneous memory by grouping each memory type +under different domains, or "nodes", based on locality and performance +characteristics. Some memory may share the same node as a CPU, and others +are provided as memory only nodes. While memory only nodes do not provide +CPUs, they may still be local to one or more compute nodes relative to +other nodes. 
The following diagram shows one such example of two compute +nodes with local memory and a memory only node for each of compute node: + + +------------------+ +------------------+ + | Compute Node 0 +-----+ Compute Node 1 | + | Local Node0 Mem | | Local Node1 Mem | + +--------+---------+ +--------+---------+ + | | + +--------+---------+ +--------+---------+ + | Slower Node2 Mem | | Slower Node3 Mem | + +------------------+ +--------+---------+ + +A "memory initiator" is a node containing one or more devices such as +CPUs or separate memory I/O devices that can initiate memory requests. +A "memory target" is a node containing one or more physical address +ranges accessible from one or more memory initiators. + +When multiple memory initiators exist, they may not all have the same +performance when accessing a given memory target. Each initiator-target +pair may be organized into different ranked access classes to represent +this relationship. The highest performing initiator to a given target +is considered to be one of that target's local initiators, and given +the highest access class, 0. Any given target may have one or more +local initiators, and any given initiator may have multiple local +memory targets. + +To aid applications matching memory targets with their initiators, the +kernel provides symlinks to each other. The following example lists the +relationship for the access class "0" memory initiators and targets:: + + # symlinks -v /sys/devices/system/node/nodeX/access0/targets/ + relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY + + # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/ + relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX + +A memory initiator may have multiple memory targets in the same access +class. The target memory's initiators in a given class indicate the +nodes' access characteristics share the same performance relative to other +linked initiator nodes. 
Each target within an initiator's access class, +though, does not necessarily perform the same as the others. + +================ +NUMA Performance +================ + +Applications may wish to consider which node they want their memory to +be allocated from based on the node's performance characteristics. If +the system provides these attributes, the kernel exports them under the +node sysfs hierarchy by appending the attributes directory under the +memory node's access class 0 initiators as follows:: + + /sys/devices/system/node/nodeY/access0/initiators/ + +These attributes apply only when accessed from nodes that +are linked under this access's initiators. + +The performance characteristics the kernel provides for the local initiators +are exported as follows:: + + # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/ + /sys/devices/system/node/nodeY/access0/initiators/ + |-- read_bandwidth + |-- read_latency + |-- write_bandwidth + `-- write_latency + +The bandwidth attributes are provided in MiB/second. + +The latency attributes are provided in nanoseconds. + +The values reported here correspond to the rated latency and bandwidth +for the platform. + +========== +NUMA Cache +========== + +System memory may be constructed in a hierarchy of elements with various +performance characteristics in order to provide large address space of +slower performing memory cached by a smaller higher performing memory. The +system physical addresses memory initiators are aware of are provided +by the last memory level in the hierarchy. The system meanwhile uses +higher performing memory to transparently cache access to progressively +slower levels. + +The term "far memory" is used to denote the last level memory in the +hierarchy. Each increasing cache level provides higher performing +initiator access, and the term "near memory" represents the fastest +cache provided by the system.
+ +This numbering is different than CPU caches where the cache level (ex: +L1, L2, L3) uses the CPU-side view where each increased level is lower +performing. In contrast, the memory cache level is centric to the last +level memory, so the higher numbered cache level corresponds to memory +nearer to the CPU, and further from far memory. + +The memory-side caches are not directly addressable by software. When +software accesses a system address, the system will return it from the +near memory cache if it is present. If it is not present, the system +accesses the next level of memory until there is either a hit in that +cache level, or it reaches far memory. + +An application does not need to know about caching attributes in order +to use the system. Software may optionally query the memory cache +attributes in order to maximize the performance out of such a setup. +If the system provides a way for the kernel to discover this information, +for example with ACPI HMAT (Heterogeneous Memory Attribute Table), +the kernel will append these attributes to the NUMA node memory target. + +When the kernel first registers a memory cache with a node, the kernel +will create the following directory:: + + /sys/devices/system/node/nodeX/memory_side_cache/ + +If that directory is not present, the system either does not provide +a memory-side cache, or that information is not accessible to the kernel. + +The attributes for each level of cache are provided under its cache +level index:: + + /sys/devices/system/node/nodeX/memory_side_cache/indexA/ + /sys/devices/system/node/nodeX/memory_side_cache/indexB/ + /sys/devices/system/node/nodeX/memory_side_cache/indexC/ + +Each cache level's directory provides its attributes.
For example, the +following shows a single cache level and the attributes available for +software to query:: + + # tree /sys/devices/system/node/node0/memory_side_cache/ + /sys/devices/system/node/node0/memory_side_cache/ + |-- index1 + | |-- indexing + | |-- line_size + | |-- size + | `-- write_policy + +The "indexing" will be 0 if it is a direct-mapped cache, and non-zero +for any other indexed based, multi-way associativity. + +The "line_size" is the number of bytes accessed from the next cache +level on a miss. + +The "size" is the number of bytes provided by this cache level. + +The "write_policy" will be 0 for write-back, and non-zero for +write-through caching. + +======== +See Also +======== +.. [1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf + Section 5.2.27 diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 7eca9026a9ed..0c74a7784964 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + .. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>` .. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>` @@ -5,9 +8,10 @@ CPU Performance Scaling ======================= -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> The Concept of CPU Performance Scaling ====================================== @@ -396,8 +400,8 @@ RT or deadline scheduling classes, the governor will increase the frequency to the allowed maximum (that is, the ``scaling_max_freq`` policy limit).
In turn, if it is invoked by the CFS scheduling class, the governor will use the Per-Entity Load Tracking (PELT) metric for the root control group of the -given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_ -LWN.net article for a description of the PELT mechanism). Then, the new +given CPU as the CPU utilization estimate (see the *Per-entity load tracking* +LWN.net article [1]_ for a description of the PELT mechanism). Then, the new CPU frequency to apply is computed in accordance with the formula f = 1.25 * ``f_0`` * ``util`` / ``max`` @@ -698,4 +702,8 @@ hardware feature (e.g. all Intel ones), even if the :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. -.. _Per-entity load tracking: https://lwn.net/Articles/531853/ +References +========== + +.. [1] Jonathan Corbet, *Per-entity load tracking*, + https://lwn.net/Articles/531853/ diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 9c58b35a81cb..e70b365dbc60 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -1,3 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + .. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>` .. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>` @@ -5,9 +8,10 @@ CPU Idle Time Management ======================== -:: +:Copyright: |copy| 2018 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2018 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> Concepts ======== diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst index 49237ac73442..39f8f9f81e7a 100644 --- a/Documentation/admin-guide/pm/index.rst +++ b/Documentation/admin-guide/pm/index.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + ================ Power Management ================ diff --git a/Documentation/admin-guide/pm/intel_epb.rst b/Documentation/admin-guide/pm/intel_epb.rst new file mode 100644 index 000000000000..005121167af7 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_epb.rst @@ -0,0 +1,41 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +====================================== +Intel Performance and Energy Bias Hint +====================================== + +:Copyright: |copy| 2019 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + + +.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c + :doc: overview + +Intel Performance and Energy Bias Attribute in ``sysfs`` +======================================================== + +The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU +can be checked or updated through a ``sysfs`` attribute (file) under +:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>`` +is allocated at the system initialization time: + +``energy_perf_bias`` + Shows the current EPB value for the CPU in a sliding scale 0 - 15, where + a value of 0 corresponds to a hint preference for highest performance + and a value of 15 corresponds to the maximum energy savings. + + In order to update the EPB value for the CPU, this attribute can be + written to, either with a number in the 0 - 15 sliding scale above, or + with one of the strings: "performance", "balance-performance", "normal", + "balance-power", "power" that represent values reflected by their + meaning. + + This attribute is present for all online CPUs supporting the EPB + feature. + +Note that while the EPB interface to the processor is defined at the logical CPU +level, the physical register backing it may be shared by multiple CPUs (for +example, SMT siblings or cores in one package). 
For this reason, updating the +EPB value for one CPU may cause the EPB values for other CPUs to change. diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index ec0f7c111f65..67e414e34f37 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -1,10 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =============================================== ``intel_pstate`` CPU Performance Scaling Driver =============================================== -:: +:Copyright: |copy| 2017 Intel Corporation - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> General Information @@ -20,11 +23,10 @@ you have not done that yet.] For the processors supported by ``intel_pstate``, the P-state concept is broader than just an operating frequency or an operating performance point (see the -`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more +LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more information about that). For this reason, the representation of P-states used by ``intel_pstate`` internally follows the hardware specification (for details -refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual -Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core +refer to Intel Software Developer’s Manual [2]_). However, the ``CPUFreq`` core uses frequencies for identifying operating performance points of CPUs and frequencies are involved in the user space interface exposed by it, so ``intel_pstate`` maps its internal representation of P-states to frequencies too @@ -561,9 +563,9 @@ or to pin every task potentially sensitive to them to a specific CPU.] 
On the majority of systems supported by ``intel_pstate``, the ACPI tables provided by the platform firmware contain ``_PSS`` objects returning information -that can be used for CPU performance scaling (refer to the `ACPI specification`_ -for details on the ``_PSS`` objects and the format of the information returned -by them). +that can be used for CPU performance scaling (refer to the ACPI specification +[3]_ for details on the ``_PSS`` objects and the format of the information +returned by them). The information returned by the ACPI ``_PSS`` objects is used by the ``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate`` @@ -728,6 +730,14 @@ P-state is called, the ``ftrace`` filter can be set to to <idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func -.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf -.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html -.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf +References +========== + +.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*, + http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf + +.. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*, + http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html + +.. 
[3] *Advanced Configuration and Power Interface Specification*, + https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst index dbf5acd49f35..cd3a28cb81f4 100644 --- a/Documentation/admin-guide/pm/sleep-states.rst +++ b/Documentation/admin-guide/pm/sleep-states.rst @@ -1,10 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =================== System Sleep States =================== -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> Sleep states are global low-power states of the entire system in which user space code cannot be executed and the overall system activity is significantly diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst index afe4d3f831fe..dd0362e32fa5 100644 --- a/Documentation/admin-guide/pm/strategies.rst +++ b/Documentation/admin-guide/pm/strategies.rst @@ -1,10 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + =========================== Power Management Strategies =========================== -:: +:Copyright: |copy| 2017 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> - Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> The Linux kernel supports two major high-level power management strategies. diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst index 0c81e4c5de39..2b1f987b34f0 100644 --- a/Documentation/admin-guide/pm/system-wide.rst +++ b/Documentation/admin-guide/pm/system-wide.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + ============================ System-Wide Power Management ============================ diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index b6cef9b5e961..fc298eb1234b 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ============================== Working-State Power Management ============================== @@ -8,3 +10,4 @@ Working-State Power Management cpuidle cpufreq intel_pstate + intel_epb |
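The memory-side cache attributes documented in the patch above (``indexing``, ``line_size``, ``size``, ``write_policy`` under ``memory_side_cache/indexN/``) can be queried from user space with a few lines of Python. The following is only a minimal sketch, not part of the patch: the function names ``read_memory_side_caches`` and ``describe`` are hypothetical, and it assumes each attribute file holds a single decimal integer. An absent ``memory_side_cache`` directory is treated as "no cache information available", as the documentation describes.

```python
from pathlib import Path

# Attribute file names as listed in the documentation above.
ATTRS = ("indexing", "line_size", "size", "write_policy")


def read_memory_side_caches(node_dir):
    """Return {"indexN": {attr: int}} for one NUMA node directory.

    An empty dict means the memory_side_cache directory is absent, i.e.
    the system has no memory-side cache or the kernel cannot discover it.
    Assumption: every attribute file contains one decimal integer.
    """
    cache_root = Path(node_dir) / "memory_side_cache"
    if not cache_root.is_dir():
        return {}
    levels = {}
    for level_dir in sorted(cache_root.glob("index*")):
        attrs = {}
        for name in ATTRS:
            attr_file = level_dir / name
            if attr_file.is_file():
                attrs[name] = int(attr_file.read_text().strip())
        levels[level_dir.name] = attrs
    return levels


def describe(attrs):
    """Decode the two flag-like attributes as the documentation defines them."""
    indexing = "direct-mapped" if attrs["indexing"] == 0 else "other indexing"
    policy = "write-back" if attrs["write_policy"] == 0 else "write-through"
    return f"{indexing}, {policy}, {attrs['size']} bytes, {attrs['line_size']}-byte lines"


if __name__ == "__main__":
    # node0 is used purely as an example node.
    for level, attrs in read_memory_side_caches("/sys/devices/system/node/node0").items():
        print(level, describe(attrs))
```

Passing the node directory as a parameter keeps the sketch testable against a scratch directory tree instead of a live ``/sys``.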