diff options
Diffstat (limited to 'Documentation/admin-guide')
94 files changed, 3436 insertions, 930 deletions
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index ce63be6d64ad..b44ef68f6e4d 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -48,3 +48,4 @@ subdirectories. Yama SafeSetID ipe + landlock diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst index f93a467db628..dc7088451f9d 100644 --- a/Documentation/admin-guide/LSM/ipe.rst +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -423,7 +423,7 @@ Field descriptions: Event Example:: - type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 errno=0 type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E @@ -433,24 +433,55 @@ This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record Field descriptions: -+----------------+------------+-----------+---------------------------------------------------+ -| Field | Value Type | Optional? | Description of Value | -+================+============+===========+===================================================+ -| policy_name | string | No | The policy_name | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_version | string | No | The policy_version | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_digest | string | No | The policy hash | -+----------------+------------+-----------+---------------------------------------------------+ -| auid | integer | No | The login user ID | -+----------------+------------+-----------+---------------------------------------------------+ -| ses | integer | No | The login session ID | -+----------------+------------+-----------+---------------------------------------------------+ -| lsm | string | No | The lsm name associated with the event | -+----------------+------------+-----------+---------------------------------------------------+ -| res | integer | No | The result of the audited operation(success/fail) | -+----------------+------------+-----------+---------------------------------------------------+ - ++----------------+------------+-----------+-------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+=============================================================+ +| policy_name | string | Yes | The policy_name | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_version | string | Yes | The policy_version | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_digest | string | Yes | The policy hash | ++----------------+------------+-----------+-------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+-------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+-------------------------------------------------------------+ +| errno | integer | No | Error code from policy loading operations (see table below) | ++----------------+------------+-----------+-------------------------------------------------------------+ + +Policy error codes (errno): + +The following table lists the error codes that may appear in the errno field while loading or updating the policy: + ++----------------+--------------------------------------------------------+ +| Error Code | Description | ++================+========================================================+ +| 0 | Success | ++----------------+--------------------------------------------------------+ +| -EPERM | Insufficient permission | ++----------------+--------------------------------------------------------+ +| -EEXIST | Same name policy already deployed | ++----------------+--------------------------------------------------------+ +| -EBADMSG | Policy is invalid | ++----------------+--------------------------------------------------------+ +| -ENOMEM | Out of memory (OOM) | ++----------------+--------------------------------------------------------+ +| -ERANGE | Policy version number overflow | ++----------------+--------------------------------------------------------+ +| -EINVAL | Policy version parsing error | ++----------------+--------------------------------------------------------+ +| -ENOKEY | Key used to sign the IPE policy not found in keyring | ++----------------+--------------------------------------------------------+ +| -EKEYREJECTED | Policy signature verification failed | ++----------------+--------------------------------------------------------+ +| -ESTALE | Attempting to update an IPE policy with older version | ++----------------+--------------------------------------------------------+ +| -ENOENT | Policy was deleted while updating | ++----------------+--------------------------------------------------------+ 1404 AUDIT_MAC_STATUS ^^^^^^^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst new file mode 100644 index 000000000000..9e61607def08 --- /dev/null +++ b/Documentation/admin-guide/LSM/landlock.rst @@ -0,0 +1,158 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. Copyright © 2025 Microsoft Corporation + +================================ +Landlock: system-wide management +================================ + +:Author: Mickaël Salaün +:Date: March 2025 + +Landlock can leverage the audit framework to log events. + +User space documentation can be found here: +Documentation/userspace-api/landlock.rst. + +Audit +===== + +Denied access requests are logged by default for a sandboxed program if `audit` +is enabled. This default behavior can be changed with the +sys_landlock_restrict_self() flags (cf. +Documentation/userspace-api/landlock.rst). Landlock logs can also be masked +thanks to audit rules. Landlock can generate 2 audit record types. + +Record types +------------ + +AUDIT_LANDLOCK_ACCESS + This record type identifies a denied access request to a kernel resource. + The ``domain`` field indicates the ID of the domain that blocked the + request. The ``blockers`` field indicates the cause(s) of this denial + (separated by a comma), and the following fields identify the kernel object + (similar to SELinux). There may be more than one of this record type per + audit event. + + Example with a file link request generating two records in the same event:: + + domain=195ba459b blockers=fs.refer path="/usr/bin" dev="vda2" ino=351 + domain=195ba459b blockers=fs.make_reg,fs.refer path="/usr/local" dev="vda2" ino=365 + +AUDIT_LANDLOCK_DOMAIN + This record type describes the status of a Landlock domain. The ``status`` + field can be either ``allocated`` or ``deallocated``. + + The ``allocated`` status is part of the same audit event and follows + the first logged ``AUDIT_LANDLOCK_ACCESS`` record of a domain. It identifies + Landlock domain information at the time of the sys_landlock_restrict_self() + call with the following fields: + + - the ``domain`` ID + - the enforcement ``mode`` + - the domain creator's ``pid`` + - the domain creator's ``uid`` + - the domain creator's executable path (``exe``) + - the domain creator's command line (``comm``) + + Example:: + + domain=195ba459b status=allocated mode=enforcing pid=300 uid=0 exe="/root/sandboxer" comm="sandboxer" + + The ``deallocated`` status is an event on its own and it identifies a + Landlock domain release. After such event, it is guarantee that the + related domain ID will never be reused during the lifetime of the system. + The ``domain`` field indicates the ID of the domain which is released, and + the ``denials`` field indicates the total number of denied access request, + which might not have been logged according to the audit rules and + sys_landlock_restrict_self()'s flags. + + Example:: + + domain=195ba459b status=deallocated denials=3 + + +Event samples +-------------- + +Here are two examples of log events (see serial numbers). + +In this example a sandboxed program (``kill``) tries to send a signal to the +init process, which is denied because of the signal scoping restriction +(``LL_SCOPED=s``):: + + $ LL_FS_RO=/ LL_FS_RW=/ LL_SCOPED=s LL_FORCE_LOG=1 ./sandboxer kill 1 + +This command generates two events, each identified with a unique serial +number following a timestamp (``msg=audit(1729738800.268:30)``). The first +event (serial ``30``) contains 4 records. The first record +(``type=LANDLOCK_ACCESS``) shows an access denied by the domain `1a6fdc66f`. +The cause of this denial is signal scopping restriction +(``blockers=scope.signal``). The process that would have receive this signal +is the init process (``opid=1 ocomm="systemd"``). + +The second record (``type=LANDLOCK_DOMAIN``) describes (``status=allocated``) +domain `1a6fdc66f`. This domain was created by process ``286`` executing the +``/root/sandboxer`` program launched by the root user. + +The third record (``type=SYSCALL``) describes the syscall, its provided +arguments, its result (``success=no exit=-1``), and the process that called it. + +The fourth record (``type=PROCTITLE``) shows the command's name as an +hexadecimal value. This can be translated with ``python -c +'print(bytes.fromhex("6B696C6C0031"))'``. + +Finally, the last record (``type=LANDLOCK_DOMAIN``) is also the only one from +the second event (serial ``31``). It is not tied to a direct user space action +but an asynchronous one to free resources tied to a Landlock domain +(``status=deallocated``). This can be useful to know that the following logs +will not concern the domain ``1a6fdc66f`` anymore. This record also summarize +the number of requests this domain denied (``denials=1``), whether they were +logged or not. + +.. code-block:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): domain=1a6fdc66f blockers=scope.signal opid=1 ocomm="systemd" + type=LANDLOCK_DOMAIN msg=audit(1729738800.268:30): domain=1a6fdc66f status=allocated mode=enforcing pid=286 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.268:30): arch=c000003e syscall=62 success=no exit=-1 [..] ppid=272 pid=286 auid=0 uid=0 gid=0 [...] comm="kill" [...] + type=PROCTITLE msg=audit(1729738800.268:30): proctitle=6B696C6C0031 + type=LANDLOCK_DOMAIN msg=audit(1729738800.324:31): domain=1a6fdc66f status=deallocated denials=1 + +Here is another example showcasing filesystem access control:: + + $ LL_FS_RO=/ LL_FS_RW=/tmp LL_FORCE_LOG=1 ./sandboxer sh -c "echo > /etc/passwd" + +The related audit logs contains 8 records from 3 different events (serials 33, +34 and 35) created by the same domain `1a6fdc679`:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.221:33): domain=1a6fdc679 blockers=fs.write_file path="/dev/tty" dev="devtmpfs" ino=9 + type=LANDLOCK_DOMAIN msg=audit(1729738800.221:33): domain=1a6fdc679 status=allocated mode=enforcing pid=289 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.221:33): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:33): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_ACCESS msg=audit(1729738800.221:34): domain=1a6fdc679 blockers=fs.write_file path="/etc/passwd" dev="vda2" ino=143821 + type=SYSCALL msg=audit(1729738800.221:34): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:34): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_DOMAIN msg=audit(1729738800.261:35): domain=1a6fdc679 status=deallocated denials=2 + + +Event filtering +--------------- + +If you get spammed with audit logs related to Landlock, this is either an +attack attempt or a bug in the security policy. We can put in place some +filters to limit noise with two complementary ways: + +- with sys_landlock_restrict_self()'s flags if we can fix the sandboxed + programs, +- or with audit rules (see :manpage:`auditctl(8)`). + +Additional documentation +======================== + +* `Linux Audit Documentation`_ +* Documentation/userspace-api/landlock.rst +* Documentation/security/landlock.rst +* https://landlock.io + +.. Links +.. _Linux Audit Documentation: + https://github.com/linux-audit/audit-documentation/wiki diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index f2bebff6a733..05301f03b717 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -165,7 +165,7 @@ Configuring the kernel "make xconfig" Qt based configuration tool. - "make gconfig" GTK+ based configuration tool. + "make gconfig" GTK based configuration tool. "make oldconfig" Default all questions based on the contents of your existing ./.config file and asking about @@ -176,7 +176,7 @@ Configuring the kernel values without prompting. "make defconfig" Create a ./.config file by using the default - symbol values from either arch/$ARCH/defconfig + symbol values from either arch/$ARCH/configs/defconfig or arch/$ARCH/configs/${PLATFORM}_defconfig, depending on the architecture. @@ -259,7 +259,7 @@ Configuring the kernel Compiling the kernel -------------------- - - Make sure you have at least gcc 5.1 available. + - Make sure you have at least gcc 8.1 available. For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. - Do a ``make`` to create a compressed kernel image. It is also possible to do @@ -356,5 +356,5 @@ instructions at 'Documentation/admin-guide/reporting-issues.rst'. Hints on understanding kernel bug reports are in 'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel -with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and -'Documentation/dev-tools/kgdb.rst'. +with gdb is in 'Documentation/process/debugging/gdb-kernel-debugging.rst' and +'Documentation/process/debugging/kgdb.rst'. diff --git a/Documentation/admin-guide/abi-obsolete-files.rst b/Documentation/admin-guide/abi-obsolete-files.rst new file mode 100644 index 000000000000..3061a916b4b5 --- /dev/null +++ b/Documentation/admin-guide/abi-obsolete-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Obsolete ABI Files +================== + +.. kernel-abi:: obsolete + :no-symbols: diff --git a/Documentation/admin-guide/abi-obsolete.rst b/Documentation/admin-guide/abi-obsolete.rst index 594e697aa1b2..640f3903e847 100644 --- a/Documentation/admin-guide/abi-obsolete.rst +++ b/Documentation/admin-guide/abi-obsolete.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI obsolete symbols ==================== @@ -7,5 +9,5 @@ marked to be removed at some later point in time. The description of the interface will document the reason why it is obsolete and when it can be expected to be removed. -.. kernel-abi:: ABI/obsolete - :rst: +.. kernel-abi:: obsolete + :no-files: diff --git a/Documentation/admin-guide/abi-removed-files.rst b/Documentation/admin-guide/abi-removed-files.rst new file mode 100644 index 000000000000..f1bdfadd2ec4 --- /dev/null +++ b/Documentation/admin-guide/abi-removed-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Removed ABI Files +================= + +.. kernel-abi:: removed + :no-symbols: diff --git a/Documentation/admin-guide/abi-removed.rst b/Documentation/admin-guide/abi-removed.rst index f9e000c81828..88832d3eacd6 100644 --- a/Documentation/admin-guide/abi-removed.rst +++ b/Documentation/admin-guide/abi-removed.rst @@ -1,5 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI removed symbols =================== -.. kernel-abi:: ABI/removed - :rst: +.. kernel-abi:: removed + :no-files: diff --git a/Documentation/admin-guide/abi-stable-files.rst b/Documentation/admin-guide/abi-stable-files.rst new file mode 100644 index 000000000000..f867738fc178 --- /dev/null +++ b/Documentation/admin-guide/abi-stable-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Stable ABI Files +================ + +.. kernel-abi:: stable + :no-symbols: diff --git a/Documentation/admin-guide/abi-stable.rst b/Documentation/admin-guide/abi-stable.rst index fc3361d847b1..528c68401f4b 100644 --- a/Documentation/admin-guide/abi-stable.rst +++ b/Documentation/admin-guide/abi-stable.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI stable symbols ================== @@ -10,5 +12,5 @@ for at least 2 years. Most interfaces (like syscalls) are expected to never change and always be available. -.. kernel-abi:: ABI/stable - :rst: +.. kernel-abi:: stable + :no-files: diff --git a/Documentation/admin-guide/abi-testing-files.rst b/Documentation/admin-guide/abi-testing-files.rst new file mode 100644 index 000000000000..1da868e42fdb --- /dev/null +++ b/Documentation/admin-guide/abi-testing-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Testing ABI Files +================= + +.. kernel-abi:: testing + :no-symbols: diff --git a/Documentation/admin-guide/abi-testing.rst b/Documentation/admin-guide/abi-testing.rst index 19767926b344..6153ebd38e2d 100644 --- a/Documentation/admin-guide/abi-testing.rst +++ b/Documentation/admin-guide/abi-testing.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI testing symbols =================== @@ -16,5 +18,5 @@ Programs that use these interfaces are strongly encouraged to add their name to the description of these interfaces, so that the kernel developers can easily notify them if any changes occur. -.. kernel-abi:: ABI/testing - :rst: +.. kernel-abi:: testing + :no-files: diff --git a/Documentation/admin-guide/abi.rst b/Documentation/admin-guide/abi.rst index bcab3ef2597c..c6039359e585 100644 --- a/Documentation/admin-guide/abi.rst +++ b/Documentation/admin-guide/abi.rst @@ -1,7 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + ===================== Linux ABI description ===================== +.. kernel-abi:: README + +ABI symbols +----------- + .. toctree:: :maxdepth: 2 @@ -9,3 +16,14 @@ Linux ABI description abi-testing abi-obsolete abi-removed + +ABI files +--------- + +.. toctree:: + :maxdepth: 2 + + abi-stable-files + abi-testing-files + abi-obsolete-files + abi-removed-files diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 957ccf617797..3262397ebe8f 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -11,6 +11,7 @@ Block Devices nbd paride ramdisk + zoned_loop zram drbd/index diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst new file mode 100644 index 000000000000..9c7aa3b482f3 --- /dev/null +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -0,0 +1,169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Zoned Loop Block Device +======================= + +.. Contents: + + 1) Overview + 2) Creating a Zoned Device + 3) Deleting a Zoned Device + 4) Example + + +1) Overview +----------- + +The zoned loop block device driver (zloop) allows a user to create a zoned block +device using one regular file per zone as backing storage. This driver does not +directly control any hardware and uses read, write and truncate operations to +regular files of a file system to emulate a zoned block device. + +Using zloop, zoned block devices with a configurable capacity, zone size and +number of conventional zones can be created. The storage for each zone of the +device is implemented using a regular file with a maximum size equal to the zone +size. The size of a file backing a conventional zone is always equal to the zone +size. The size of a file backing a sequential zone indicates the amount of data +sequentially written to the file, that is, the size of the file directly +indicates the position of the write pointer of the zone. + +When resetting a sequential zone, its backing file size is truncated to zero. +Conversely, for a zone finish operation, the backing file is truncated to the +zone size. With this, the maximum capacity of a zloop zoned block device created +can be larger configured to be larger than the storage space available on the +backing file system. Of course, for such configuration, writing more data than +the storage space available on the backing file system will result in write +errors. + +The zoned loop block device driver implements a complete zone transition state +machine. That is, zones can be empty, implicitly opened, explicitly opened, +closed or full. The current implementation does not support any limits on the +maximum number of open and active zones. + +No user tools are necessary to create and delete zloop devices. + +2) Creating a Zoned Device +-------------------------- + +Once the zloop module is loaded (or if zloop is compiled in the kernel), the +character device file /dev/zloop-control can be used to add a zloop device. +This is done by writing an "add" command directly to the /dev/zloop-control +device:: + + $ modprobe zloop + $ ls -l /dev/zloop* + crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control + + $ mkdir -p <base directory/<device ID> + $ echo "add [options]" > /dev/zloop-control + +The options available for the add command can be listed by reading the +/dev/zloop-control device:: + + $ cat /dev/zloop-control + add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io + remove id=%d + +In more details, the options that can be used with the "add" command are as +follows. + +================ =========================================================== +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up to + the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower than + the zone size. Default: zone size. +conv_zones Total number of conventioanl zones starting from sector 0. + Default: 8. +base_dir Path to the base directoy where to create the directory + containing the zone files of the device. + Default=/var/local/zloop. + The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This value is + always capped by the number of online CPUs + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +================ =========================================================== + +3) Deleting a Zoned Device +-------------------------- + +Deleting an unused zoned loop block device is done by issuing the "remove" +command to /dev/zloop-control, specifying the ID of the device to remove:: + + $ echo "remove id=X" > /dev/zloop-control + +The remove command does not have any option. + +A zoned device that was removed can be re-added again without any change to the +state of the device zones: the device zones are restored to their last state +before the device was removed. Adding again a zoned device after it was removed +must always be done using the same configuration as when the device was first +added. If a zone configuration change is detected, an error will be returned and +the zoned device will not be created. + +To fully delete a zoned device, after executing the remove operation, the device +base directory containing the backing files of the device zones must be deleted. + +4) Example +---------- + +The following sequence of commands creates a 2GB zoned device with zones of 64 +MB and a zone capacity of 63 MB:: + + $ modprobe zloop + $ mkdir -p /var/local/zloop/0 + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity=63MB" > /dev/zloop-control + +For the device created (/dev/zloop0), the zone backing files are all created +under the default base directory (/var/local/zloop):: + + $ ls -l /var/local/zloop/0 + total 0 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000008 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000009 + ... + +The zoned device created (/dev/zloop0) can then be used normally:: + + $ lsblk -z + NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN + zloop0 host-managed 64M 32 0 0 1M 4K + $ blkzone report /dev/zloop0 + start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + ... + +Deleting this device is done using the command:: + + $ echo "remove id=0" > /dev/zloop-control + +The removed device can be re-added again using the same "add" command as when +the device was first created. To fully delete a zoned device, its backing files +should also be deleted after executing the remove command:: + + $ rm -r /var/local/zloop/0 diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 714a5171bfc0..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -54,7 +54,7 @@ The list of possible return codes: If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: - echo 3 > /sys/block/zram0/max_comp_streams + echo foo > /sys/block/zram0/comp_algorithm if [ $? -ne 0 ]; then handle_error fi @@ -73,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. -2) Set max number of compression streams -======================================== - -Regardless of the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPU - thus -allowing several concurrent compression operations. The number of -allocated compression streams goes down when some of the CPUs -become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or have only 1 CPU online. - -To find out how many streams are currently available:: - - cat /sys/block/zram0/max_comp_streams - -3) Select compression algorithm +2) Select compression algorithm =============================== Using comp_algorithm device attribute one can see available and @@ -107,7 +93,7 @@ Examples:: For the time being, the `comp_algorithm` content shows only compression algorithms that are supported by zram. -4) Set compression algorithm parameters: Optional +3) Set compression algorithm parameters: Optional ================================================= Compression algorithms may support specific parameters which can be @@ -121,14 +107,14 @@ compression algorithm to use external pre-trained dictionary, pass full path to the `dict` along with other parameters:: #pass path to pre-trained zstd dictionary - echo "algo=zstd dict=/etc/dictioary" > /sys/block/zram0/algorithm_params + echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params #same, but using algorithm priority - echo "priority=1 dict=/etc/dictioary" > \ + echo "priority=1 dict=/etc/dictionary" > \ /sys/block/zram0/algorithm_params #pass path to pre-trained zstd dictionary and compression level - echo "algo=zstd level=8 dict=/etc/dictioary" > \ + echo "algo=zstd level=8 dict=/etc/dictionary" > \ /sys/block/zram0/algorithm_params Parameters are algorithm specific: not all algorithms support pre-trained @@ -138,7 +124,7 @@ better the compression ratio, it even can take negatives values for some algorithms), for other algorithms `level` is acceleration level (the higher the value the lower the compression ratio). -5) Set Disksize +4) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -158,7 +144,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -6) Set memory limit: Optional +5) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. @@ -177,7 +163,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -7) Activate +6) Activate =========== :: @@ -188,7 +174,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -8) Add/remove zram devices +7) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -208,7 +194,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -9) Stats +8) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram<id>/ @@ -228,8 +214,6 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature -max_comp_streams RW the number of possible concurrent compress - operations comp_algorithm RW show and change the compression algorithm algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction @@ -310,7 +294,7 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -10) Deactivate +9) Deactivate ============== :: @@ -318,7 +302,7 @@ a single line of text and contains the following stats separated by whitespace: swapoff /dev/zram0 umount /dev/zram1 -11) Reset +10) Reset ========= Write any positive value to 'reset' sysfs node:: @@ -333,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -347,24 +351,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -385,6 +372,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -498,8 +502,6 @@ attempt to recompress::: echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress -Recompression of idle pages requires memory tracking. - During re-compression for every page, that matches re-compression criteria, ZRAM iterates the list of registered alternative compression algorithms in order of their priorities. ZRAM stops either when re-compression was diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst index 18e79337dcfd..153472e93cae 100644 --- a/Documentation/admin-guide/braille-console.rst +++ b/Documentation/admin-guide/braille-console.rst @@ -21,8 +21,8 @@ override the baud rate to 115200, etc. By default, the braille device will just show the last kernel message (console mode). To review previous messages, press the Insert key to switch to the VT review mode. In review mode, the arrow keys permit to browse in the VT content, -:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and -the :kbd:`HOME` key goes back +`PAGE-UP`/`PAGE-DOWN` keys go at the top/bottom of the screen, and +the `HOME` key goes back to the cursor, hence providing very basic screen reviewing facility. Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 1d0f8ceb3075..30858757c9f2 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -196,7 +196,7 @@ will see the assembler code for the routine shown, but if your kernel has debug symbols the C code will also be available. (Debug symbols can be enabled in the kernel hacking menu of the menu configuration.) For example:: - $ objdump -r -S -l --disassemble net/dccp/ipv4.o + $ objdump -r -S -l --disassemble net/ipv4/tcp.o .. note:: @@ -368,12 +368,3 @@ processed by ``klogd``:: Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128] Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3 ---------------------------------------------------------------------------- - -:: - - Dr. G.W. Wettstein Oncology Research Div. Computing Facility - Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com - 820 4th St. N. - Fargo, ND 58122 - Phone: 701-234-7556 diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index a3e2edb3d274..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> -Modified by Christoph Lameter <cl@linux.com> +Modified by Christoph Lameter <cl@gentwo.org> .. CONTENTS: diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index f401af5e2f09..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - Modified by Paul Jackson <pj@sgi.com> -- Modified by Christoph Lameter <cl@linux.com> +- Modified by Christoph Lameter <cl@gentwo.org> - Modified by Paul Menage <menage@google.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst index 582d3427de3f..a964aff373b1 100644 --- a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst +++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst @@ -125,3 +125,7 @@ to unfreeze all tasks in the container:: This is the basic mechanism which should do the right thing for user space task in a simple scenario. + +This freezer implementation is affected by shortcomings (see commit +76f969e8948d8 ("cgroup: cgroup v2 freezer")) and cgroup v2 freezer is +recommended. diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 286d16fc22eb..d6b1db8cc7eb 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -90,6 +90,7 @@ Brief summary of control files. used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) + Per memcg knob does not exist in cgroup v2. memory.move_charge_at_immigrate This knob is deprecated. memory.oom_control set/show oom controls. This knob is deprecated and shouldn't be @@ -609,6 +610,10 @@ memory.stat file includes following statistics: 'rss + mapped_file" will give you resident set size of cgroup. + Note that some kernel configurations might account complete larger + allocations (e.g., THP) towards 'rss' and 'mapped_file', even if + only some, but not all that memory is mapped. + (Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 315ede811c9d..bd98ea3175ec 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files - 5-8. HugeTLB - 5.8-1. HugeTLB Interface Files - 5-9. Misc - 5.9-1 Miscellaneous cgroup Interface Files - 5.9-2 Migration and Ownership - 5-10. Others - 5-10-1. perf_event + 5-8. DMEM + 5-9. HugeTLB + 5.9-1. HugeTLB Interface Files + 5-10. Misc + 5.10-1 Miscellaneous cgroup Interface Files + 5.10-2 Migration and Ownership + 5-11. Others + 5-11-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -1075,33 +1076,53 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 doesn't yet support control of realtime processes. For -a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group -scheduling of realtime processes, the cpu controller can only be enabled -when all RT processes are in the root cgroup. This limitation does -not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system -management software may already have placed RT processes into nonroot -cgroups during the system boot process, and these processes may need -to be moved to the root cgroup before the cpu controller can be enabled -with a CONFIG_RT_GROUP_SCHED enabled kernel. +WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of +realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option +enabled for group scheduling of realtime processes, the cpu controller can only +be enabled when all RT processes are in the root cgroup. Be aware that system +management software may already have placed RT processes into non-root cgroups +during the system boot process, and these processes may need to be moved to the +root cgroup before the cpu controller can be enabled with a +CONFIG_RT_GROUP_SCHED enabled kernel. + +With CONFIG_RT_GROUP_SCHED disabled, this limitation does not apply and some of +the interface files either affect realtime processes or account for them. See +the following section for details. Only the cpu controller is affected by +CONFIG_RT_GROUP_SCHED. Other controllers can be used for the resource control of +realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy and the underlying scheduler. From the point of view of the cpu controller, +processes can be categorized as follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +For details on when a process is under the fair-class scheduler or a BPF scheduler, +check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`. + +For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. - It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1119,6 +1140,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1131,6 +1156,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1143,43 +1172,55 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. Shows pressure stall information for CPU. See :ref:`Documentation/accounting/psi.rst <psi>` for details. + This file accounts for all the processes in the cgroup. + cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. + + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping + + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1191,7 +1232,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ @@ -1293,6 +1334,18 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1310,6 +1363,18 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. @@ -1342,6 +1407,9 @@ The following nested keys are defined. same semantics as vm.swappiness applied to memcg reclaim with all the existing limitations and potential future extensions. + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak A read-write single value file which exists on non-root cgroups. @@ -1439,7 +1507,10 @@ The following nested keys are defined. anon Amount of memory used in anonymous mappings such as - brk(), sbrk(), and mmap(MAP_ANONYMOUS) + brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that + some kernel configurations might account complete larger + allocations (e.g., THP) if only some, but not all the + memory of such an allocation is mapped anymore. file Amount of memory used to cache filesystem data, @@ -1482,7 +1553,10 @@ The following nested keys are defined. Amount of application memory swapped out to zswap. file_mapped - Amount of cached filesystem data mapped with mmap() + Amount of cached filesystem data mapped with mmap(). Note + that some kernel configurations might account complete + larger allocations (e.g., THP) if only some, but not + not all the memory of such an allocation is mapped. file_dirty Amount of cached filesystem data that was modified but @@ -1554,6 +1628,12 @@ The following nested keys are defined. workingset_nodereclaim Number of times a shadow node has been reclaimed + pswpin (npn) + Number of pages swapped into memory + + pswpout (npn) + Number of pages swapped out of memory + pgscan (npn) Amount of scanned pages (in an inactive LRU list) @@ -1569,6 +1649,9 @@ The following nested keys are defined. pgscan_khugepaged (npn) Amount of scanned pages by khugepaged (in an inactive LRU list) + pgscan_proactive (npn) + Amount of scanned pages proactively (in an inactive LRU list) + pgsteal_kswapd (npn) Amount of reclaimed pages by kswapd @@ -1578,6 +1661,9 @@ The following nested keys are defined. pgsteal_khugepaged (npn) Amount of reclaimed pages by khugepaged + pgsteal_proactive (npn) + Amount of reclaimed pages proactively + pgfault (npn) Total number of page faults incurred @@ -1655,6 +1741,9 @@ The following nested keys are defined. pgdemote_khugepaged Number of pages demoted by khugepaged. + pgdemote_proactive + Number of pages demoted by proactively. + hugetlb Amount of memory used by hugetlb pages. This metric only shows up if hugetlb usage is accounted for in memory.current (i.e. @@ -2626,6 +2715,49 @@ RDMA Interface Files mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +DMEM +---- + +The "dmem" controller regulates the distribution and accounting of +device memory regions. Because each memory region may have its own page size, +which does not have to be equal to the system page size, the units are always bytes. + +DMEM Interface Files +~~~~~~~~~~~~~~~~~~~~ + + dmem.max, dmem.min, dmem.low + A readwrite nested-keyed file that exists for all the cgroups + except root that describes current configured resource limit + for a region. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 1073741824 + drm/0000:03:00.0/stolen max + + The semantics are the same as for the memory cgroup controller, and are + calculated in the same way. + + dmem.capacity + A read-only file that describes maximum region capacity. + It only exists on the root cgroup. Not all memory can be + allocated by cgroups, as the kernel reserves some for + internal use. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 8514437120 + drm/0000:03:00.0/stolen 67108864 + + dmem.current + A read-only file that describes current resource usage. + It exists for all the cgroup except root. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 12550144 + drm/0000:03:00.0/stolen 8650752 + HugeTLB ------- @@ -2949,7 +3081,7 @@ Filesystem Support for Writeback -------------------------------- A filesystem can support cgroup writeback by updating -address_space_operations->writepage[s]() to annotate bio's using the +address_space_operations->writepages() to annotate bio's using the following two functions. wbc_init_bio(@wbc, @bio) diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index c09674a75a9e..d989ae5778ba 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -270,6 +270,8 @@ configured for Unix Extensions (and the client has not disabled illegal Windows/NTFS/SMB characters to a remap range (this mount parameter is the default for SMB3). This remap (``mapposix``) range is also compatible with Mac (and "Services for Mac" on some older Windows). +When POSIX Extensions for SMB 3.1.1 are negotiated, remapping is automatically +disabled. CIFS VFS Mount Options ====================== diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index 9f8139ff97d6..4467f6d4b632 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -146,6 +146,11 @@ integrity:<bytes>:<type> integrity for the encrypted device. The additional space is then used for storing authentication tag (and persistent IV if needed). +integrity_key_size:<bytes> + Optionally set the integrity key size if it differs from the digest size. + It allows the use of wrapped key algorithms where the key size is + independent of the cryptographic key size. + sector_size:<bytes> Use <bytes> as the encryption unit instead of 512 bytes sectors. This option can be in range 512 - 4096 bytes and must be power of two. diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index d8a5f14d0e3c..c2e18ecc065c 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -92,6 +92,11 @@ Target arguments: allowed. This mode is useful for data recovery if the device cannot be activated in any of the other standard modes. + I - inline mode - in this mode, dm-integrity will store integrity + data directly in the underlying device sectors. + The underlying device must have an integrity profile that + allows storing user integrity data and provides enough + space for the selected integrity tag. 5. the number of additional arguments diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index a65c1602cb23..8c3f1f967a3c 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -87,6 +87,15 @@ panic_on_corruption Panic the device when a corrupted block is discovered. This option is not compatible with ignore_corruption and restart_on_corruption. +restart_on_error + Restart the system when an I/O error is detected. + This option can be combined with the restart_on_corruption option. + +panic_on_error + Panic the device when an I/O error is detected. This option is + not compatible with the restart_on_error option but can be combined + with the panic_on_corruption option. + ignore_zero_blocks Do not verify blocks that are expected to contain zeroes and always return zeroes instead. This may be useful if the partition contains unused blocks @@ -142,8 +151,15 @@ root_hash_sig_key_desc <key_description> already in the secondary trusted keyring. try_verify_in_tasklet - If verity hashes are in cache, verify data blocks in kernel tasklet instead - of workqueue. This option can reduce IO latency. + If verity hashes are in cache and the IO size does not exceed the limit, + verify data blocks in bottom half instead of workqueue. This option can + reduce IO latency. The size limits can be configured via + /sys/module/dm_verity/parameters/use_bh_bytes. The four parameters + correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, + IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn. + For example: + <none>,<rt>,<be>,<idle> + 4096,4096,4096,4096 Theory of operation =================== diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 2418b0c2d3df..b857eb6ca1b6 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -238,11 +238,10 @@ When mounting an ext4 filesystem, the following option are accepted: configured using tune2fs) data_err=ignore(*) - Just print an error message if an error occurs in a file data buffer in - ordered mode. + Just print an error message if an error occurs in a file data buffer. + data_err=abort - Abort the journal if an error occurs in a file data buffer in ordered - mode. + Abort the journal if an error occurs in a file data buffer. grpid | bsdgroups New objects have the group ID of their parent. diff --git a/Documentation/admin-guide/gpio/gpio-aggregator.rst b/Documentation/admin-guide/gpio/gpio-aggregator.rst index 5cd1e7221756..8374a9df9105 100644 --- a/Documentation/admin-guide/gpio/gpio-aggregator.rst +++ b/Documentation/admin-guide/gpio/gpio-aggregator.rst @@ -69,6 +69,113 @@ write-only attribute files in sysfs. $ echo gpio-aggregator.0 > delete_device +Aggregating GPIOs using Configfs +-------------------------------- + +**Group:** ``/config/gpio-aggregator`` + + This is the root directory of the gpio-aggregator configfs tree. + +**Group:** ``/config/gpio-aggregator/<example-name>`` + + This directory represents a GPIO aggregator device. You can assign any + name to ``<example-name>`` (e.g. ``agg0``), except names starting with + ``_sysfs`` prefix, which are reserved for auto-generated configfs + entries corresponding to devices created via Sysfs. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/live`` + + The ``live`` attribute allows to trigger the actual creation of the device + once it's fully configured. Accepted values are: + + * ``1``, ``yes``, ``true`` : enable the virtual device + * ``0``, ``no``, ``false`` : disable the virtual device + +**Attribute:** ``/config/gpio-aggregator/<example-name>/dev_name`` + + The read-only ``dev_name`` attribute exposes the name of the device as it + will appear in the system on the platform bus (e.g. ``gpio-aggregator.0``). + This is useful for identifying a character device for the newly created + aggregator. If it's ``gpio-aggregator.0``, + ``/sys/devices/platform/gpio-aggregator.0/gpiochipX`` path tells you that the + GPIO device id is ``X``. + +You must create subdirectories for each virtual line you want to +instantiate, named exactly as ``line0``, ``line1``, ..., ``lineY``, when +you want to instantiate ``Y+1`` (Y >= 0) lines. Configure all lines before +activating the device by setting ``live`` to 1. + +**Group:** ``/config/gpio-aggregator/<example-name>/<lineY>/`` + + This directory represents a GPIO line to include in the aggregator. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/key`` + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/offset`` + + The default values after creating the ``<lineY>`` directory are: + + * ``key`` : <empty> + * ``offset`` : -1 + + ``key`` must always be explicitly configured, while ``offset`` depends. + Two configuration patterns exist for each ``<lineY>``: + + (a). For lookup by GPIO line name: + + * Set ``key`` to the line name. + * Ensure ``offset`` remains -1 (the default). + + (b). For lookup by GPIO chip name and the line offset within the chip: + + * Set ``key`` to the chip name. + * Set ``offset`` to the line offset (0 <= ``offset`` < 65535). + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/name`` + + The ``name`` attribute sets a custom name for lineY. If left unset, the + line will remain unnamed. + +Once the configuration is done, the ``'live'`` attribute must be set to 1 +in order to instantiate the aggregator device. It can be set back to 0 to +destroy the virtual device. The module will synchronously wait for the new +aggregator device to be successfully probed and if this doesn't happen, writing +to ``'live'`` will result in an error. This is a different behaviour from the +case when you create it using sysfs ``new_device`` interface. + +.. note:: + + For aggregators created via Sysfs, the configfs entries are + auto-generated and appear as ``/config/gpio-aggregator/_sysfs.<N>/``. You + cannot add or remove line directories with mkdir(2)/rmdir(2). To modify + lines, you must use the "delete_device" interface to tear down the + existing device and reconfigure it from scratch. However, you can still + toggle the aggregator with the ``live`` attribute and adjust the + ``key``, ``offset``, and ``name`` attributes for each line when ``live`` + is set to 0 by hand (i.e. it's not waiting for deferred probe). + +Sample configuration commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + # Create a directory for an aggregator device + $ mkdir /sys/kernel/config/gpio-aggregator/agg0 + + # Configure each line + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line0 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line0/key + $ echo 6 > /sys/kernel/config/gpio-aggregator/agg0/line0/offset + $ echo test0 > /sys/kernel/config/gpio-aggregator/agg0/line0/name + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line1 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line1/key + $ echo 7 > /sys/kernel/config/gpio-aggregator/agg0/line1/offset + $ echo test1 > /sys/kernel/config/gpio-aggregator/agg0/line1/name + + # Activate the aggregator device + $ echo 1 > /sys/kernel/config/gpio-aggregator/agg0/live + + Generic GPIO Driver ------------------- diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index 1cc5567a4bbe..35d49ccd49e0 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -71,7 +71,7 @@ specific lines. The name of those subdirectories must take the form of: ``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be used by the module to assign the config to the specific line at given offset. -Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in order to instantiate the chip. It can be set back to 0 to destroy the simulated chip. The module will synchronously wait for the new simulated device to be successfully probed and if this doesn't happen, writing to ``'live'`` will diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst index 2aca70db9f3b..7e7c0df51640 100644 --- a/Documentation/admin-guide/gpio/gpio-virtuser.rst +++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst @@ -92,7 +92,7 @@ struct. The first two take string values as arguments: Activating GPIO consumers ------------------------- -Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in order to instantiate the consumer. It can be set back to 0 to destroy the virtual device. The module will synchronously wait for the new simulated device to be successfully probed and if this doesn't happen, writing to ``'live'`` will diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst deleted file mode 100644 index 6ee70465c0ea..000000000000 --- a/Documentation/admin-guide/highuid.rst +++ /dev/null @@ -1,80 +0,0 @@ -=================================================== -Notes on the change from 16-bit UIDs to 32-bit UIDs -=================================================== - -:Author: Chris Wing <wingc@umich.edu> -:Last updated: January 11, 2000 - -- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t - when communicating between user and kernel space in an ioctl or data - structure. - -- kernel code should use uid_t and gid_t in kernel-private structures and - code. - -What's left to be done for 32-bit UIDs on all Linux architectures: - -- Disk quotas have an interesting limitation that is not related to the - maximum UID/GID. They are limited by the maximum file size on the - underlying filesystem, because quota records are written at offsets - corresponding to the UID in question. - Further investigation is needed to see if the quota system can cope - properly with huge UIDs. If it can deal with 64-bit file offsets on all - architectures, this should not be a problem. - -- Decide whether or not to keep backwards compatibility with the system - accounting file, or if we should break it as the comments suggest - (currently, the old 16-bit UID and GID are still written to disk, and - part of the former pad space is used to store separate 32-bit UID and - GID) - -- Need to validate that OS emulation calls the 16-bit UID - compatibility syscalls, if the OS being emulated used 16-bit UIDs, or - uses the 32-bit UID system calls properly otherwise. - - This affects at least: - - - iBCS on Intel - - - sparc32 emulation on sparc64 - (need to support whatever new 32-bit UID system calls are added to - sparc32) - -- Validate that all filesystems behave properly. - - At present, 32-bit UIDs _should_ work for: - - - ext2 - - ufs - - isofs - - nfs - - coda - - udf - - Ioctl() fixups have been made for: - - - ncpfs - - smbfs - - Filesystems with simple fixups to prevent 16-bit UID wraparound: - - - minix - - sysv - - qnx4 - - Other filesystems have not been checked yet. - -- The ncpfs and smpfs filesystems cannot presently use 32-bit UIDs in - all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but - more are needed. (as well as new user<->kernel data structures) - -- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k, - sh, and sparc32. Fixing this is probably not that important, but would - require adding a new ELF section. - -- The ioctl()s used to control the in-kernel NFS server only support - 16-bit UIDs on arm, i386, m68k, sh, and sparc32. - -- make sure that the UID mapping feature of AX25 networking works properly - (it should be safe because it's always used a 32-bit integer to - communicate between user and kernel) diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ff0b440ef2dc..09890a8f3ee9 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -22,3 +22,6 @@ are configurable at compile, boot or run time. srso gather_data_sampling reg-file-data-sampling + rsb + old_microcode + indirect-target-selection diff --git a/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst new file mode 100644 index 000000000000..d9ca64108d23 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Indirect Target Selection (ITS) +=============================== + +ITS is a vulnerability in some Intel CPUs that support Enhanced IBRS and were +released before Alder Lake. ITS may allow an attacker to control the prediction +of indirect branches and RETs located in the lower half of a cacheline. + +ITS is assigned CVE-2024-28956 with a CVSS score of 4.7 (Medium). + +Scope of Impact +--------------- +- **eIBRS Guest/Host Isolation**: Indirect branches in KVM/kernel may still be + predicted with unintended target corresponding to a branch in the guest. + +- **Intra-Mode BTI**: In-kernel training such as through cBPF or other native + gadgets. + +- **Indirect Branch Prediction Barrier (IBPB)**: After an IBPB, indirect + branches may still be predicted with targets corresponding to direct branches + executed prior to the IBPB. This is fixed by the IPU 2025.1 microcode, which + should be available via distro updates. Alternatively microcode can be + obtained from Intel's github repository [#f1]_. + +Affected CPUs +------------- +Below is the list of ITS affected CPUs [#f2]_ [#f3]_: + + ======================== ============ ==================== =============== + Common name Family_Model eIBRS Intra-mode BTI + Guest/Host Isolation + ======================== ============ ==================== =============== + SKYLAKE_X (step >= 6) 06_55H Affected Affected + ICELAKE_X 06_6AH Not affected Affected + ICELAKE_D 06_6CH Not affected Affected + ICELAKE_L 06_7EH Not affected Affected + TIGERLAKE_L 06_8CH Not affected Affected + TIGERLAKE 06_8DH Not affected Affected + KABYLAKE_L (step >= 12) 06_8EH Affected Affected + KABYLAKE (step >= 13) 06_9EH Affected Affected + COMETLAKE 06_A5H Affected Affected + COMETLAKE_L 06_A6H Affected Affected + ROCKETLAKE 06_A7H Not affected Affected + ======================== ============ ==================== =============== + +- All affected CPUs enumerate Enhanced IBRS feature. +- IBPB isolation is affected on all ITS affected CPUs, and need a microcode + update for mitigation. +- None of the affected CPUs enumerate BHI_CTRL which was introduced in Golden + Cove (Alder Lake and Sapphire Rapids). This can help guests to determine the + host's affected status. +- Intel Atom CPUs are not affected by ITS. + +Mitigation +---------- +As only the indirect branches and RETs that have their last byte of instruction +in the lower half of the cacheline are vulnerable to ITS, the basic idea behind +the mitigation is to not allow indirect branches in the lower half. + +This is achieved by relying on existing retpoline support in the kernel, and in +compilers. ITS-vulnerable retpoline sites are runtime patched to point to newly +added ITS-safe thunks. These safe thunks consists of indirect branch in the +second half of the cacheline. Not all retpoline sites are patched to thunks, if +a retpoline site is evaluated to be ITS-safe, it is replaced with an inline +indirect branch. + +Dynamic thunks +~~~~~~~~~~~~~~ +From a dynamically allocated pool of safe-thunks, each vulnerable site is +replaced with a new thunk, such that they get a unique address. This could +improve the branch prediction accuracy. Also, it is a defense-in-depth measure +against aliasing. + +Note, for simplicity, indirect branches in eBPF programs are always replaced +with a jump to a static thunk in __x86_indirect_its_thunk_array. If required, +in future this can be changed to use dynamic thunks. + +All vulnerable RETs are replaced with a static thunk, they do not use dynamic +thunks. This is because RETs get their prediction from RSB mostly that does not +depend on source address. RETs that underflow RSB may benefit from dynamic +thunks. But, RETs significantly outnumber indirect branches, and any benefit +from a unique source address could be outweighed by the increased icache +footprint and iTLB pressure. + +Retpoline +~~~~~~~~~ +Retpoline sequence also mitigates ITS-unsafe indirect branches. For this +reason, when retpoline is enabled, ITS mitigation only relocates the RETs to +safe thunks. Unless user requested the RSB-stuffing mitigation. + +RSB Stuffing +~~~~~~~~~~~~ +RSB-stuffing via Call Depth Tracking is a mitigation for Retbleed RSB-underflow +attacks. And it also mitigates RETs that are vulnerable to ITS. + +Mitigation in guests +^^^^^^^^^^^^^^^^^^^^ +All guests deploy ITS mitigation by default, irrespective of eIBRS enumeration +and Family/Model of the guest. This is because eIBRS feature could be hidden +from a guest. One exception to this is when a guest enumerates BHI_DIS_S, which +indicates that the guest is running on an unaffected host. + +To prevent guests from unnecessarily deploying the mitigation on unaffected +platforms, Intel has defined ITS_NO bit(62) in MSR IA32_ARCH_CAPABILITIES. When +a guest sees this bit set, it should not enumerate the ITS bug. Note, this bit +is not set by any hardware, but is **intended for VMMs to synthesize** it for +guests as per the host's affected status. + +Mitigation options +^^^^^^^^^^^^^^^^^^ +The ITS mitigation can be controlled using the "indirect_target_selection" +kernel parameter. The available options are: + + ======== =================================================================== + on (default) Deploy the "Aligned branch/return thunks" mitigation. + If spectre_v2 mitigation enables retpoline, aligned-thunks are only + deployed for the affected RET instructions. Retpoline mitigates + indirect branches. + + off Disable ITS mitigation. + + vmexit Equivalent to "=on" if the CPU is affected by guest/host isolation + part of ITS. Otherwise, mitigation is not deployed. This option is + useful when host userspace is not in the threat model, and only + attacks from guest to host are considered. + + stuff Deploy RSB-fill mitigation when retpoline is also deployed. + Otherwise, deploy the default mitigation. When retpoline mitigation + is enabled, RSB-stuffing via Call-Depth-Tracking also mitigates + ITS. + + force Force the ITS bug and deploy the default mitigation. + ======== =================================================================== + +Sysfs reporting +--------------- + +The sysfs file showing ITS mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/indirect_target_selection + +Note, microcode mitigation status is not reported in this file. + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - Vulnerable + - System is vulnerable and no mitigation has been applied. + * - Vulnerable, KVM: Not affected + - System is vulnerable to intra-mode BTI, but not affected by eIBRS + guest/host isolation. + * - Mitigation: Aligned branch/return thunks + - The mitigation is enabled, affected indirect branches and RETs are + relocated to safe thunks. + * - Mitigation: Retpolines, Stuffing RSB + - The mitigation is enabled using retpoline and RSB stuffing. + +References +---------- +.. [#f1] Microcode repository - https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files + +.. [#f2] Affected Processors list - https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html + +.. [#f3] Affected Processors list (machine readable) - https://github.com/intel/Intel-affected-processor-list diff --git a/Documentation/admin-guide/hw-vuln/old_microcode.rst b/Documentation/admin-guide/hw-vuln/old_microcode.rst new file mode 100644 index 000000000000..6ded8f86b8d0 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/old_microcode.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Old Microcode +============= + +The kernel keeps a table of released microcode. Systems that had +microcode older than this at boot will say "Vulnerable". This means +that the system was vulnerable to some known CPU issue. It could be +security or functional, the kernel does not know or care. + +You should update the CPU microcode to mitigate any exposure. This is +usually accomplished by updating the files in +/lib/firmware/intel-ucode/ via normal distribution updates. Intel also +distributes these files in a github repo: + + https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git + +Just like all the other hardware vulnerabilities, exposure is +determined at boot. Runtime microcode updates do not change the status +of this vulnerability. diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst index 1302fd1b55e8..6dba18dbb9ab 100644 --- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -157,9 +157,7 @@ This is achieved by using the otherwise unused and obsolete VERW instruction in combination with a microcode update. The microcode clears the affected CPU buffers when the VERW instruction is executed. -Kernel reuses the MDS function to invoke the buffer clearing: - - mds_clear_cpu_buffers() +Kernel does the buffer clearing with x86_clear_cpu_buffers(). On MDS affected CPUs, the kernel already invokes CPU buffer clear on kernel/userspace, hypervisor/guest and C-state (idle) transitions. No diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst index 0585d02b9a6c..ad15417d39f9 100644 --- a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst +++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst @@ -29,14 +29,6 @@ Below is the list of affected Intel processors [#f1]_: RAPTORLAKE_S 06_BFH =================== ============ -As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and -RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as -vulnerable in Linux because they share the same family/model with an affected -part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or -CPUID.HYBRID. This information could be used to distinguish between the -affected and unaffected parts, but it is deemed not worth adding complexity as -the reporting is fixed automatically when these parts enumerate RFDS_NO. - Mitigation ========== Intel released a microcode update that enables software to clear sensitive diff --git a/Documentation/admin-guide/hw-vuln/rsb.rst b/Documentation/admin-guide/hw-vuln/rsb.rst new file mode 100644 index 000000000000..21dbf9cf25f8 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/rsb.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +RSB-related mitigations +======================= + +.. warning:: + Please keep this document up-to-date, otherwise you will be + volunteered to update it and convert it to a very long comment in + bugs.c! + +Since 2018 there have been many Spectre CVEs related to the Return Stack +Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or +Return Address Predictor (RAP) on AMD). + +Information about these CVEs and how to mitigate them is scattered +amongst a myriad of microarchitecture-specific documents. + +This document attempts to consolidate all the relevant information in +once place and clarify the reasoning behind the current RSB-related +mitigations. It's meant to be as concise as possible, focused only on +the current kernel mitigations: what are the RSB-related attack vectors +and how are they currently being mitigated? + +It's *not* meant to describe how the RSB mechanism operates or how the +exploits work. More details about those can be found in the references +below. + +Rather, this is basically a glorified comment, but too long to actually +be one. So when the next CVE comes along, a kernel developer can +quickly refer to this as a refresher to see what we're actually doing +and why. + +At a high level, there are two classes of RSB attacks: RSB poisoning +(Intel and AMD) and RSB underflow (Intel only). They must each be +considered individually for each attack vector (and microarchitecture +where applicable). + +---- + +RSB poisoning (Intel and AMD) +============================= + +SpectreRSB +~~~~~~~~~~ + +RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where +an attacker poisons an RSB entry to cause a victim's return instruction +to speculate to an attacker-controlled address. This can happen when +there are unbalanced CALLs/RETs after a context switch or VMEXIT. + +* All attack vectors can potentially be mitigated by flushing out any + poisoned RSB entries using an RSB filling sequence + [#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between + untrusted and trusted domains. But this has a performance impact and + should be avoided whenever possible. + + .. DANGER:: + **FIXME**: Currently we're flushing 32 entries. However, some CPU + models have more than 32 entries. The loop count needs to be + increased for those. More detailed information is needed about RSB + sizes. + +* On context switch, the user->user mitigation requires ensuring the + RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_ + during a context switch: + + * AMD: + On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB. + This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_. + + On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must be + always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is + indicated by X86_BUG_IBPB_NO_RET. + + * Intel: + IBPB always clears the RSB: + + "Software that executed before the IBPB command cannot control + the predicted targets of indirect branches executed after the + command on the same logical processor. The term indirect branch + in this context includes near return instructions, so these + predicted targets may come from the RSB." [#intel-ibpb-rsb]_ + +* On context switch, user->kernel attacks are prevented by SMEP. User + space can only insert user space addresses into the RSB. Even + non-canonical addresses can't be inserted due to the page gap at the + end of the user canonical address space reserved by TASK_SIZE_MAX. + A SMEP #PF at instruction fetch prevents the kernel from speculatively + executing user space. + + * AMD: + "Finally, branches that are predicted as 'ret' instructions get + their predicted targets from the Return Address Predictor (RAP). + AMD recommends software use a RAP stuffing sequence (mitigation + V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP) + to ensure that the addresses in the RAP are safe for + speculation. Collectively, we refer to these mitigations as "RAP + Protection"." [#amd-smep-rsb]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_ + +* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB + mitigation if needed): + + * AMD: + "When Automatic IBRS is enabled, the internal return address + stack used for return address predictions is cleared on VMEXIT." + [#amd-eibrs-vmexit]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits. Processors with + enhanced IBRS still support the usage model where IBRS is set + only in the OS/VMM for OSes that enable SMEP. To do this, such + processors will ensure that guest behavior cannot control the + RSB after a VM exit once IBRS is set, even if IBRS was not set + at the time of the VM exit." [#intel-eibrs-vmexit]_ + + Note that some Intel CPUs are susceptible to Post-barrier Return + Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last + CALL from the guest can be used to predict the first unbalanced RET. + In this case the PBRSB mitigation is needed in addition to eIBRS. + +AMD RETBleed / SRSO / Branch Type Confusion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On AMD, poisoned RSB entries can also be created by the AMD RETBleed +variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack +Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel +protects itself by replacing every RET in the kernel with a branch to a +single safe RET. + +---- + +RSB underflow (Intel only) +========================== + +RSB Alternate (RSBA) ("Intel Retbleed") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some Intel Skylake-generation CPUs are susceptible to the Intel variant +of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow +[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due +to mismatched CALLs/RETs or returning from a deep call stack, the branch +predictor can fall back to using the Branch Target Buffer (BTB). If a +user forces a BTB collision then the RET can speculatively branch to a +user-controlled address. + +* Note that RSB filling doesn't fully mitigate this issue. If there + are enough unbalanced RETs, the RSB may still underflow and fall back + to using a poisoned BTB entry. + +* On context switch, user->user underflow attacks are mitigated by the + conditional IBPB [#cond-ibpb]_ on context switch which effectively + clears the BTB: + + * "The indirect branch predictor barrier (IBPB) is an indirect branch + control mechanism that establishes a barrier, preventing software + that executed before the barrier from controlling the predicted + targets of indirect branches executed after the barrier on the same + logical processor." [#intel-ibpb-btb]_ + +* On context switch and VMEXIT, user->kernel and guest->host RSB + underflows are mitigated by IBRS or eIBRS: + + * "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU" + attack demonstrated by the researchers. As previously documented, + Intel recommends the use of enhanced IBRS, where supported. This + includes any processor that enumerates RRSBA but not RRSBA_DIS_S." + [#intel-rsbu]_ + + However, note that eIBRS and IBRS do not mitigate intra-mode attacks. + Like RRSBA below, this is mitigated by clearing the BHB on kernel + entry. + + As an alternative to classic IBRS, call depth tracking (combined with + retpolines) can be used to track kernel returns and fill the RSB when + it gets close to being empty. + +Restricted RSB Alternate (RRSBA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior, +which, similar to RSBA described above, also falls back to using the BTB +on RSB underflow. The only difference is that the predicted targets are +restricted to the current domain when eIBRS is enabled: + +* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch + predictors to be used by near RET instructions when the RSB is + empty. When eIBRS is enabled, the predicted targets of these + alternate predictors are restricted to those belonging to the + indirect branch predictor entries of the current prediction domain. + [#intel-eibrs-rrsba]_ + +When a CPU with RRSBA is vulnerable to Branch History Injection +[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an +intra-mode BTI attack. This is mitigated by clearing the BHB on +kernel entry. + +However if the kernel uses retpolines instead of eIBRS, it needs to +disable RRSBA: + +* "Where software is using retpoline as a mitigation for BHI or + intra-mode BTI, and the processor both enumerates RRSBA and + enumerates RRSBA_DIS controls, it should disable this behavior." + [#intel-retpoline-rrsba]_ + +---- + +References +========== + +.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_ + +.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_ + +.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_ + +.. [#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt. + +.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD. + +.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_. + +.. [#amd-ibpb-no-rsb] `Spectre Attacks: Exploiting Speculative Execution <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_ + +.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_ + +.. [#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_ + +.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_ + +.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_ + +.. [#intel-rsbu] `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier' <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_ + +.. [#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_ + +.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ + +.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index 2ad1c05b8c88..66af95251a3d 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -104,7 +104,20 @@ The possible values in this file are: (spec_rstack_overflow=ibpb-vmexit) + * 'Mitigation: Reduced Speculation': + This mitigation gets automatically enabled when the above one "IBPB on + VMEXIT" has been selected and the CPU supports the BpSpecReduce bit. + + It gets automatically enabled on machines which have the + SRSO_USER_KERNEL_NO=1 CPUID bit. In that case, the code logic is to switch + to the above =ibpb-vmexit mitigation because the user/kernel boundary is + not affected anymore and thus "safe RET" is not needed. + + After enabling the IBPB on VMEXIT mitigation option, the BpSpecReduce bit + is detected (functionality present on all such machines) and that + practically overrides IBPB on VMEXIT as it has a lot less performance + impact and takes care of the guest->host attack vector too. In order to exploit vulnerability, an attacker needs to: diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index e85b1adf5908..259d79fbeb94 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -7,6 +7,9 @@ added to the kernel over time. There is, as yet, little overall order or organization here — this material was not written to be a single, coherent document! With luck things will improve quickly over time. +General guides to kernel administration +--------------------------------------- + This initial section contains overall information, including the README file describing the kernel as a whole, documentation on kernel parameters, etc. @@ -15,19 +18,44 @@ etc. :maxdepth: 1 README - kernel-parameters devices - sysctl/index - abi features -This section describes CPU vulnerabilities and their mitigations. +A big part of the kernel's administrative interface is the /proc and sysfs +virtual filesystems; these documents describe how to interact with tem + +.. toctree:: + :maxdepth: 1 + + sysfs-rules + sysctl/index + cputopology + abi + +Security-related documentation: .. toctree:: :maxdepth: 1 hw-vuln/index + LSM/index + perf-security + +Booting the kernel +------------------ + +.. toctree:: + :maxdepth: 1 + + bootconfig + kernel-parameters + efi-stub + initrd + + +Tracking down and identifying problems +-------------------------------------- Here is a set of documents aimed at users who are trying to track down problems and bugs in particular. @@ -48,94 +76,119 @@ problems and bugs in particular. kdump/index perf/index pstore-blk + clearing-warn-once + kernel-per-CPU-kthreads + lockup-watchdogs + RAS/index + sysrq -This is the beginning of a section with information of interest to -application developers. Documents covering various aspects of the kernel -ABI will be found here. + +Core-kernel subsystems +---------------------- + +These documents describe core-kernel administration interfaces that are +likely to be of interest on almost any system. .. toctree:: :maxdepth: 1 - sysfs-rules + cgroup-v2 + cgroup-v1/index + cpu-load + mm/index + module-signing + namespaces/index + numastat + pm/index + syscall-user-dispatch -This is the beginning of a section with information of interest to -application developers and system integrators doing analysis of the -Linux kernel for safety critical applications. Documents supporting -analysis of kernel interactions with applications, and key kernel -subsystems expectations will be found here. +Support for non-native binary formats. Note that some of these +documents are ... old ... .. toctree:: :maxdepth: 1 - workload-tracing + binfmt-misc + java + mono -The rest of this manual consists of various unordered guides on how to -configure specific aspects of kernel behavior to your liking. + +Block-layer and filesystem administration +----------------------------------------- .. toctree:: :maxdepth: 1 - acpi/index - aoe/index - auxdisplay/index bcache binderfs - binfmt-misc blockdev/index - bootconfig - braille-console - btmrvl - cgroup-v1/index - cgroup-v2 cifs/index - clearing-warn-once - cpu-load - cputopology - dell_rbu device-mapper/index - edid - efi-stub ext4 filesystem-monitoring nfs/index - gpio/index - highuid - hw_random - initrd iostats - java jfs - kernel-per-CPU-kthreads + md + ufs + xfs + +Device-specific guides +---------------------- + +How to configure your hardware within your Linux system. + +.. toctree:: + :maxdepth: 1 + + acpi/index + aoe/index + auxdisplay/index + braille-console + btmrvl + dell_rbu + edid + gpio/index + hw_random laptops/index lcd-panel-cgram - ldm - lockup-watchdogs - LSM/index - md media/index - mm/index - module-signing - mono - namespaces/index - numastat + nvme-multipath parport - perf-security - pm/index pnp rapidio - RAS/index rtc serial-console svga - syscall-user-dispatch - sysrq thermal/index thunderbolt - ufs - unicode vga-softcursor video-output - xfs + +Workload analysis +----------------- + +This is the beginning of a section with information of interest to +application developers and system integrators doing analysis of the +Linux kernel for safety critical applications. Documents supporting +analysis of kernel interactions with applications, and key kernel +subsystems expectations will be found here. + +.. toctree:: + :maxdepth: 1 + + workload-tracing + +Everything else +--------------- + +A few hard-to-categorize and generally obsolete documents. + +.. toctree:: + :maxdepth: 1 + + ldm + unicode .. only:: subproject and html diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index 609a3201fd4e..9453196ade51 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -2,62 +2,39 @@ I/O statistics fields ===================== -Since 2.4.20 (and some versions before, with patches), and 2.5.45, -more extensive disk statistics have been introduced to help measure disk -activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do -the work for you, but in case you are interested in creating your own -tools, the fields are explained here. - -In 2.4 now, the information is found as additional fields in -``/proc/partitions``. In 2.6 and upper, the same information is found in two -places: one is in the file ``/proc/diskstats``, and the other is within -the sysfs file system, which must be mounted in order to obtain -the information. Throughout this document we'll assume that sysfs -is mounted on ``/sys``, although of course it may be mounted anywhere. -Both ``/proc/diskstats`` and sysfs use the same source for the information -and so should not differ. - -Here are examples of these different formats:: - - 2.4: - 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030 - - 2.6+ sysfs: - 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 35486 38030 38030 38030 - - 2.6+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 hda1 35486 38030 38030 38030 - - 4.18+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0 - -On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have -a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``. - -The advantage of one over the other is that the sysfs choice works well -if you are watching a known, small set of disks. ``/proc/diskstats`` may -be a better choice if you are watching a large number of disks because -you'll avoid the overhead of 50, 100, or 500 or more opens/closes with -each snapshot of your disk statistics. - -In 2.4, the statistics fields are those after the device name. In -the above example, the first field of statistics would be 446216. -By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll -find just the 15 fields, beginning with 446216. If you look at -``/proc/diskstats``, the 15 fields will be preceded by the major and -minor device numbers, and device name. Each of these formats provides -15 fields of statistics, each meaning exactly the same things. -All fields except field 9 are cumulative since boot. Field 9 should -go to zero as I/Os complete; all others only increase (unless they -overflow and wrap). Wrapping might eventually occur on a very busy -or long-lived system; so applications should be prepared to deal with -it. Regarding wrapping, the types of the fields are either unsigned -int (32 bit) or unsigned long (32-bit or 64-bit, depending on your -machine) as noted per-field below. Unless your observations are very -spread in time, these fields should not wrap twice before you notice it. +The kernel exposes disk statistics via ``/proc/diskstats`` and +``/sys/block/<device>/stat``. These stats are usually accessed via tools +such as ``sar`` and ``iostat``. + +Here are examples using a disk with two partitions:: + + /proc/diskstats: + 259 0 nvme0n1 255999 814 12369153 47919 996852 81 36123024 425995 0 301795 580470 0 0 0 0 60602 106555 + 259 1 nvme0n1p1 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + 259 2 nvme0n1p2 255401 1 12343477 47799 996004 0 36014736 425784 0 344336 473584 0 0 0 0 0 0 + + /sys/block/nvme0n1/stat: + 255999 814 12369153 47919 996858 81 36123056 426009 0 301809 580491 0 0 0 0 60605 106562 + + /sys/block/nvme0n1/nvme0n1p1/stat: + 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + +Both files contain the same 17 statistics. ``/sys/block/<device>/stat`` +contains the fields for ``<device>``. In ``/proc/diskstats`` the fields +are prefixed with the major and minor device numbers and the device +name. In the example above, the first stat value for ``nvme0n1`` is +255999 in both files. + +The sysfs ``stat`` file is efficient for monitoring a small, known set +of disks. If you're tracking a large number of devices, +``/proc/diskstats`` is often the better choice since it avoids the +overhead of opening and closing multiple files for each snapshot. + +All fields are cumulative, monotonic counters, except for field 9, which +resets to zero as I/Os complete. The remaining fields reset at boot, on +device reattachment or reinitialization, or when the underlying counter +overflows. Applications reading these counters should detect and handle +resets when comparing stat snapshots. Each set of stats only applies to the indicated device; if you want system-wide stats you'll have to find all the devices and sum them all up. diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 5376890adbeb..20fabdf6567e 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64) 1) On i386, enable high memory support under "Processor type and features":: - CONFIG_HIGHMEM64G=y - - or:: - CONFIG_HIGHMEM4G 2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel @@ -551,6 +547,38 @@ from within add_taint() whenever the value set in this bitmask matches with the bit flag being set by add_taint(). This will cause a kdump to occur at the add_taint()->panic() call. +Write the dump file to encrypted disk volume +============================================ + +CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an +encrypted disk volume (only x86_64 supported for now). User space can interact +with /sys/kernel/config/crash_dm_crypt_keys for setup, + +1. Tell the first kernel what logon keys are needed to unlock the disk volumes, + # Add key #1 + mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 + # Add key #1's description + echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 1 + + # Add key #2 in the same way + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 2 + + # To support CPU/memory hot-plugging, re-use keys already saved to reserved + # memory + echo true > /sys/kernel/config/crash_dm_crypt_key/reuse + +2. Load the dump-capture kernel + +3. After the dump-capture kerne get booted, restore the keys to user keyring + echo yes > /sys/kernel/crash_dm_crypt_keys/restore + Contact ======= diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 0f714fc945ac..8cf4614385b7 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -331,8 +331,8 @@ PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|P Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline) ------------------------------------------------------------------------------ +PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_unaccepted) +------------------------------------------------------------------------------------------------------------------------- More page attributes. These flags are used to filter various unnecessary for dumping pages. diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 59931f21c974..39d0e7ff0965 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -194,8 +194,6 @@ is applicable:: WDT Watchdog support is enabled. X86-32 X86-32, aka i386 architecture is enabled. X86-64 X86-64 architecture is enabled. - More X86-64 boot options can be found in - Documentation/arch/x86/x86_64/boot-options.rst. X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) X86_UV SGI UV support is enabled. XEN Xen support is enabled @@ -213,7 +211,6 @@ Do not modify the syntax of boot loader parameters without extreme need or coordination with <Documentation/arch/x86/boot.rst>. There are also arch-specific kernel-parameters not documented here. -See for example <Documentation/arch/x86/x86_64/boot-options.rst>. Note that ALL kernel parameters listed below are CASE SENSITIVE, and that a trailing = on the name of any parameter states that that parameter will diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3872bc6ec49d..f6d317e1674d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -21,6 +21,10 @@ strictly ACPI specification compliant. rsdt -- prefer RSDT over (default) XSDT copy_dsdt -- copy DSDT to memory + nocmcff -- Disable firmware first mode for corrected + errors. This disables parsing the HEST CMC error + source to check if firmware has set the FF flag. This + may result in duplicate corrected error reports. nospcr -- disable console in ACPI SPCR table as default _serial_ console on ARM64 For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or @@ -405,15 +409,13 @@ not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. + apic [APIC,X86-64] Use IO-APIC. Default. + apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller Change the output verbosity while booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - For X86-32, this can also be used to specify an APIC - driver name. - Format: apic=driver_name - Examples: apic=bigsmp apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting Format: { bsp (default) | all | none } @@ -424,6 +426,10 @@ useful so that a dump capture kernel won't be shot down by NMI + apicpmtimer Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally + broken. + autoconf= [IPV6] See Documentation/networking/ipv6.rst. @@ -452,6 +458,9 @@ arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory Set instructions support + arm64.nompam [ARM64] Unconditionally disable Memory Partitioning And + Monitoring support + arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension support @@ -624,6 +633,14 @@ named mounts. Specifying both "all" and "named" disables all v1 hierarchies. + cgroup_v1_proc= [KNL] Show also missing controllers in /proc/cgroups + Format: { "true" | "false" } + /proc/cgroups lists only v1 controllers by default. + This compatibility option enables listing also v2 + controllers (whose v1 code is not compiled!), so that + semi-legacy software can check this file to decide + about usage of v2 (sic) controllers. + cgroup_favordynmods= [KNL] Enable or Disable favordynmods. Format: { "true" | "false" } Defaults to the value of CONFIG_CGROUP_FAVOR_DYNMODS. @@ -1401,7 +1418,8 @@ earlyprintk=serial[,0x...[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] - earlyprintk=pciserial[,force],bus:device.function[,baudrate] + earlyprintk=mmio32,membase[,{nocfg|baudrate}] + earlyprintk=pciserial[,force],bus:device.function[,{nocfg|baudrate}] earlyprintk=xdbc[xhciController#] earlyprintk=bios @@ -1409,6 +1427,9 @@ the normal console is initialized. It is not enabled by default because it has some cosmetic problems. + Use "nocfg" to skip UART configuration, assume + BIOS/firmware has configured UART correctly. + Append ",keep" to not disable it when the real console takes over. @@ -1726,6 +1747,8 @@ off: Disable GDS mitigation. + gbpages [X86] Use GB pages for kernel direct mappings. + gcov_persist= [GCOV] When non-zero (default), profiling data for kernel modules is saved and remains accessible via debugfs, even when the module is unloaded/reloaded. @@ -1773,7 +1796,9 @@ allocation boundaries as a proactive defense against bounds-checking flaws in the kernel's copy_to_user()/copy_from_user() interface. - on Perform hardened usercopy checks (default). + The default is determined by + CONFIG_HARDENED_USERCOPY_DEFAULT_ON. + on Perform hardened usercopy checks. off Disable hardened usercopy checks. hardlockup_all_cpu_backtrace= @@ -1814,6 +1839,13 @@ lz4: Select LZ4 compression algorithm to compress/decompress hibernation image. + hibernate.pm_test_delay= + [HIBERNATION] + Sets the number of seconds to remain in a hibernation test + mode before resuming the system (see + /sys/power/pm_test). Only available when CONFIG_PM_DEBUG + is set. Default value is 5. + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -1849,7 +1881,7 @@ hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET registers. Default set by CONFIG_HPET_MMAP_DEFAULT. - hugepages= [HW] Number of HugeTLB pages to allocate at boot. + hugepages= [HW,EARLY] Number of HugeTLB pages to allocate at boot. If this follows hugepagesz (below), it specifies the number of pages of hugepagesz to be allocated. If this is the first HugeTLB parameter on the command @@ -1861,15 +1893,24 @@ <node>:<integer>[,<node>:<integer>] hugepagesz= - [HW] The size of the HugeTLB pages. This is used in - conjunction with hugepages (above) to allocate huge - pages of a specific size at boot. The pair - hugepagesz=X hugepages=Y can be specified once for - each supported huge page size. Huge page sizes are - architecture dependent. See also + [HW,EARLY] The size of the HugeTLB pages. This is + used in conjunction with hugepages (above) to + allocate huge pages of a specific size at boot. The + pair hugepagesz=X hugepages=Y can be specified once + for each supported huge page size. Huge page sizes + are architecture dependent. See also Documentation/admin-guide/mm/hugetlbpage.rst. Format: size[KMG] + hugepage_alloc_threads= + [HW] The number of threads that should be used to + allocate hugepages during boot. This option can be + used to improve system bootup time when allocating + a large amount of huge pages. + The default value is 25% of the available hardware threads. + + Note that this parameter only applies to non-gigantic huge pages. + hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation of gigantic hugepages. Or using node format, the size of a CMA area per node can be specified. @@ -1880,6 +1921,13 @@ hugepages using the CMA allocator. If enabled, the boot-time allocation of gigantic hugepages is skipped. + hugetlb_cma_only= + [HW,CMA,EARLY] When allocating new HugeTLB pages, only + try to allocate from the CMA areas. + + This option does nothing if hugetlb_cma= is not also + specified. + hugetlb_free_vmemmap= [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled. @@ -1921,6 +1969,12 @@ which allow the hypervisor to 'idle' the guest on lock contention. + hw_protection= [HW] + Format: reboot | shutdown + + Hardware protection action taken on critical events like + overtemperature or imminent voltage loss. + i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. @@ -2008,12 +2062,21 @@ idle= [X86,EARLY] Format: idle=poll, idle=halt, idle=nomwait - Poll forces a polling idle loop that can slightly - improve the performance of waking up a idle CPU, but - will use a lot of power and make the system run hot. - Not recommended. + + idle=poll: Don't do power saving in the idle loop + using HLT, but poll for rescheduling event. This will + make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor + benchmarks. It also makes some profiling using + performance counters more accurate. Please note that + on systems with MONITOR/MWAIT support (like Intel + EM64T CPUs) this option has no performance advantage + over the normal idle loop. It may also interact badly + with hyperthreading. + idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. + idle=nomwait: Disable mwait for CPU C-states idxd.sva= [HW] @@ -2157,6 +2220,23 @@ different crypto accelerators. This option can be used to achieve best performance for particular HW. + indirect_target_selection= [X86,Intel] Mitigation control for Indirect + Target Selection(ITS) bug in Intel CPUs. Updated + microcode is also required for a fix in IBPB. + + on: Enable mitigation (default). + off: Disable mitigation. + force: Force the ITS bug and deploy default + mitigation. + vmexit: Only deploy mitigation if CPU is affected by + guest/host isolation part of ITS. + stuff: Deploy RSB-fill mitigation when retpoline is + also deployed. Otherwise, deploy the default + mitigation. + + For details see: + Documentation/admin-guide/hw-vuln/indirect-target-selection.rst + init= [KNL] Format: <full_path> Run specified binary instead of /sbin/init as init @@ -2295,6 +2375,9 @@ per_cpu_perf_limits Allow per-logical-CPU P-State performance control limits using cpufreq sysfs interface + no_cas + Do not enable capacity-aware scheduling (CAS) on + hybrid systems intremap= [X86-64,Intel-IOMMU,EARLY] on enable Interrupt Remapping (default) @@ -2311,20 +2394,73 @@ relaxed iommu= [X86,EARLY] + off + Don't initialize and use any kind of IOMMU. + force + Force the use of the hardware IOMMU even when + it is not actually needed (e.g. because < 3 GB + memory). + noforce + Don't force hardware IOMMU usage when it is not + needed. (default). + biomerge panic nopanic merge nomerge + soft - pt [X86] - nopt [X86] - nobypass [PPC/POWERNV] + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + + [X86] + pt + [X86] + nopt + [PPC/POWERNV] + nobypass Disable IOMMU bypass, using IOMMU for PCI devices. + [X86] + AMD Gart HW IOMMU-specific options: + + <size> + Set the size of the remapping area in bytes. + + allowed + Overwrite iommu off workarounds for specific chipsets + + fullflush + Flush IOMMU on each allocation (default). + + nofullflush + Don't use IOMMU fullflush. + + memaper[=<order>] + Allocate an own aperture over RAM with size + 32MB<<order. (default: order=1, i.e. 64MB) + + merge + Do scatter-gather (SG) merging. Implies "force" + (experimental). + + nomerge + Don't do scatter-gather (SG) merging. + + noaperture + Ask the IOMMU not to touch the aperture for AGP. + + noagp + Don't initialize the AGP driver and use full aperture. + + panic + Always panic when IOMMU overflows. + iommu.forcedac= [ARM64,X86,EARLY] Control IOVA allocation for PCI devices. Format: { "0" | "1" } 0 - Try to allocate a 32-bit DMA address first, before @@ -2432,7 +2568,9 @@ specified in the flag list (default: domain): nohz - Disable the tick when a single task runs. + Disable the tick when a single task runs as well as + disabling other kernel noises like having RCU callbacks + offloaded. This is equivalent to the nohz_full parameter. A residual 1Hz tick is offloaded to workqueues, which you need to affine to housekeeping through the global @@ -2622,6 +2760,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. @@ -2695,7 +2858,7 @@ VMs, i.e. on the 0=>1 and 1=>0 transitions of the number of VMs. - Enabling virtualization at module lode avoids potential + Enabling virtualization at module load avoids potential latency for creation of the 0=>1 VM, as KVM serializes virtualization enabling across all online CPUs. The "cost" of enabling virtualization when KVM is loaded, @@ -2748,17 +2911,21 @@ nvhe: Standard nVHE-based mode, without support for protected guests. - protected: nVHE-based mode with support for guests whose - state is kept private from the host. + protected: Mode with support for guests whose state is + kept private from the host, using VHE or + nVHE depending on HW support. nested: VHE-based mode with support for nested - virtualization. Requires at least ARMv8.3 - hardware. + virtualization. Requires at least ARMv8.4 + hardware (with FEAT_NV2). Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation - for the host. "nested" is experimental and should be - used with extreme caution. + for the host. To force nVHE on VHE hardware, add + "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the + command-line. + "nested" is experimental and should be used with + extreme caution. kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 @@ -3036,6 +3203,8 @@ * max_sec_lba48: Set or clear transfer size limit to 65535 sectors. + * external: Mark port as external (hotplug-capable). + * [no]lpm: Enable or disable link power management. * [no]setxfer: Indicate if transfer speed mode setting @@ -3259,9 +3428,77 @@ devices can be requested on-demand with the /dev/loop-control interface. - mce [X86-32] Machine Check Exception + mce= [X86-{32,64}] + + Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. + + off + disable machine check + + no_cmci + disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your + hardware is misbehaving. + + Note that you'll get more problems without CMCI than + with due to the shared banks, i.e. you might get + duplicated error logs. + + dont_log_ce + don't make logs for corrected errors. All events + reported as corrected are silently cleared by OS. This + option will be useful if you have no interest in any + of corrected errors. + + ignore_ce + disable features for corrected errors, e.g. + polling timer and CMCI. All events reported as + corrected are not cleared by OS and remained in its + error banks. + + Usually this disablement is not recommended, however + if there is an agent checking/clearing corrected + errors (e.g. BIOS or hardware monitoring + applications), conflicting with OS's error handling, + and you cannot deactivate the agent, then this option + will be a help. + + no_lmce + do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + + bootlog + enable logging of machine checks left over from + booting. Disabled by default on AMD Fam10h and older + because some BIOS leave bogus ones. + + If your BIOS doesn't do that it's a good idea to + enable though to make sure you log even machine check + events that result in a reboot. On Intel systems it is + enabled by default. + + nobootlog + disable boot machine check logging. + + monarchtimeout (number) + sets the time in us to wait for other CPUs on machine + checks. 0 to disable. + + bios_cmci_threshold + don't overwrite the bios-set CMCI threshold. This boot + option prevents Linux from overwriting the CMCI + threshold set by the bios. Without this option, Linux + always sets the CMCI threshold to 1. Enabling this may + make memory predictive failure analysis less effective + if the bios sets thresholds for memory errors since we + will not see details for all errors. + + recovery + force-enable recoverable machine check code paths + + Everything else is in sysfs now. - mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. @@ -3351,8 +3588,8 @@ [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel config + options. See Documentation/admin-guide/mm/memory-hotplug.rst. memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact @@ -3516,6 +3753,7 @@ expose users to several CPU vulnerabilities. Equivalent to: if nokaslr then kpti=0 [ARM64] gather_data_sampling=off [X86] + indirect_target_selection=off [X86] kvm.nx_huge_pages=off [X86] l1tf=off [X86] mds=off [X86] @@ -3887,6 +4125,8 @@ noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any IOAPICs that may be present in the system. + noapictimer [APIC,X86] Don't set up the APIC timer + noautogroup Disable scheduler automatic task group creation. nocache [ARM,EARLY] @@ -3934,6 +4174,8 @@ register save and restore. The kernel will only save legacy floating-point registers on task switch. + nogbpages [X86] Do not use GB pages for kernel direct mappings. + no_hash_pointers [KNL,EARLY] Force pointers printed to the console or buffers to be @@ -3960,6 +4202,8 @@ the impact of the sleep instructions. This is also useful when using JTAG debugger. + nohpet [X86] Don't use the HPET timer. + nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings. nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings. @@ -4079,10 +4323,10 @@ nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel, and disable the IO APIC. legacy for "maxcpus=0". - nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT). + nosmt [KNL,MIPS,PPC,EARLY] Disable symmetric multithreading (SMT). Equivalent to smt=1. - [KNL,X86,PPC] Disable symmetric multithreading (SMT). + [KNL,X86,PPC,S390] Disable symmetric multithreading (SMT). nosmt=force: Force disable SMT, cannot be undone via the sysfs control file. @@ -4111,8 +4355,10 @@ nosync [HW,M68K] Disables sync negotiation for all devices. - no_timer_check [X86,APIC] Disables the code which tests for - broken timer IRQ sources. + no_timer_check [X86,APIC] Disables the code which tests for broken + timer IRQ sources, i.e., the IO-APIC timer. This can + work around problems with incorrect timer + initialization on some boards. no_uaccess_flush [PPC,EARLY] Don't flush the L1-D cache after accessing user data. @@ -4192,6 +4438,11 @@ If given as an integer followed by 'U', it will divide each physical node into N emulated nodes. + numa=noacpi [X86] Don't parse the SRAT table for NUMA setup + + numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or + soft-reserved memory partitioning. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -4673,7 +4924,7 @@ '1' – force enabled 'x' – unchanged For example, - pci=config_acs=10x + pci=config_acs=10x@pci:0:0 would configure all devices that support ACS to enable P2P Request Redirect, disable Translation Blocking, and leave Source @@ -4856,6 +5107,14 @@ Format: <bool> default: 0 (auto_verbose is enabled) + printk.debug_non_panic_cpus= + Allows storing messages from non-panic CPUs into + the printk log buffer during panic(). They are + flushed to consoles by the panic-CPU on + a best-effort basis. + Format: <bool> (1/Y/y=enable, 0/N/n=disable) + Default: disabled + printk.devkmsg={on,off,ratelimit} Control writing to /dev/kmsg. on - unlimited logging to /dev/kmsg from userspace @@ -5367,7 +5626,42 @@ rcutorture.gp_cond= [KNL] Use conditional/asynchronous update-side - primitives, if available. + normal-grace-period primitives, if available. + + rcutorture.gp_cond_exp= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives, if available. + + rcutorture.gp_cond_full= [KNL] + Use conditional/asynchronous update-side + normal-grace-period primitives that also take + concurrent expedited grace periods into account, + if available. + + rcutorture.gp_cond_exp_full= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives that also take + concurrent normal grace periods into account, + if available. + + rcutorture.gp_cond_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_cond and gp_cond_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_cond_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_cond_exp and gp_cond_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. rcutorture.gp_exp= [KNL] Use expedited update-side primitives, if available. @@ -5376,6 +5670,43 @@ Use normal (non-expedited) asynchronous update-side primitives, if available. + rcutorture.gp_poll= [KNL] + Use polled update-side normal-grace-period + primitives, if available. + + rcutorture.gp_poll_exp= [KNL] + Use polled update-side expedited-grace-period + primitives, if available. + + rcutorture.gp_poll_full= [KNL] + Use polled update-side normal-grace-period + primitives that also take concurrent expedited + grace periods into account, if available. + + rcutorture.gp_poll_exp_full= [KNL] + Use polled update-side expedited-grace-period + primitives that also take concurrent normal + grace periods into account, if available. + + rcutorture.gp_poll_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_poll and gp_poll_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_poll_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_poll_exp and gp_poll_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. + rcutorture.gp_sync= [KNL] Use normal (non-expedited) synchronous update-side primitives, if available. If all @@ -5384,6 +5715,31 @@ are zero, rcutorture acts as if is interpreted they are all non-zero. + rcutorture.gpwrap_lag= [KNL] + Enable grace-period wrap lag testing. Setting + to false prevents the gpwrap lag test from + running. Default is true. + + rcutorture.gpwrap_lag_gps= [KNL] + Set the value for grace-period wrap lag during + active lag testing periods. This controls how many + grace periods differences we tolerate between + rdp and rnp's gp_seq before setting overflow flag. + The default is always set to 8. + + rcutorture.gpwrap_lag_cycle_mins= [KNL] + Set the total cycle duration for gpwrap lag + testing in minutes. This is the total time for + one complete cycle of active and inactive + testing periods. Default is 30 minutes. + + rcutorture.gpwrap_lag_active_mins= [KNL] + Set the duration for which gpwrap lag is active + within each cycle, in minutes. During this time, + the grace-period wrap lag will be set to the + value specified by gpwrap_lag_gps. Default is + 5 minutes. + rcutorture.irqreader= [KNL] Run RCU readers from irq handlers, or, more accurately, from a timer handler. Not all RCU @@ -5429,6 +5785,22 @@ Set time (jiffies) between CPU-hotplug operations, or zero to disable CPU-hotplug testing. + rcutorture.preempt_duration= [KNL] + Set duration (in milliseconds) of preemptions + by a high-priority FIFO real-time task. Set to + zero (the default) to disable. The CPUs to + preempt are selected randomly from the set that + are online at a given point in time. Races with + CPUs going offline are ignored, with that attempt + at preemption skipped. + + rcutorture.preempt_interval= [KNL] + Set interval (in milliseconds, defaulting to one + second) between preemptions by a high-priority + FIFO real-time task. This delay is mediated + by an hrtimer and is further fuzzed to avoid + inadvertent synchronizations. + rcutorture.read_exit_burst= [KNL] The number of times in a given read-then-exit episode that a set of read-then-exit kthreads @@ -5509,6 +5881,11 @@ rcutorture.test_boost_duration= [KNL] Duration (s) of each individual boost test. + rcutorture.test_boost_holdoff= [KNL] + Holdoff time (s) from start of test to the start + of RCU priority-boost testing. Defaults to zero, + that is, no holdoff. + rcutorture.test_boost_interval= [KNL] Interval (s) between each boost test. @@ -5715,6 +6092,55 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not + configured or the ACPI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + bios + Use the CPU reboot vector for warm reset + + cold + Set the cold reboot flag + + default + There are some built-in platform specific "quirks" + - you may see: "reboot: <name> series board detected. + Selecting <type> for reboots." In the case where you + think the quirk is in error (e.g. you have newer BIOS, + or newer board) using this option will ignore the + built-in quirk table, and use the generic default + reboot actions. + + efi + Use efi reset_system runtime service. If EFI is not + configured or the EFI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + force + Don't stop other CPUs on reboot. This can make reboot + more reliable in some cases. + + kbd + Use the keyboard controller. cold reset (default) + + pci + Use a write to the PCI config space register 0xcf9 to + trigger reboot. + + triple + Force a triple fault (init) + + warm + Don't set the cold reboot flag + + Using warm reset will be much faster especially on big + memory systems because the BIOS will not go through + the memory check. Disadvantage is that not all + hardware will be completely reinitialized on reboot so + there may be boot problems on some systems. + + refscale.holdoff= [KNL] Set test-start holdoff period. The purpose of this parameter is to delay the start of the @@ -5784,7 +6210,7 @@ is assumed to be I/O ports; otherwise it is memory. reserve_mem= [RAM] - Format: nn[KNG]:<align>:<label> + Format: nn[KMG]:<align>:<label> Reserve physical memory and label it with a name that other subsystems can use to access it. This is typically used for systems that do not wipe the RAM, and this command @@ -5910,7 +6336,7 @@ port and the regular usb controller gets disabled. root= [KNL] Root filesystem - Usually this a a block device specifier of some kind, + Usually this is a block device specifier of some kind, see the early_lookup_bdev comment in block/early-lookup.c for details. Alternatively this can be "ram" for the legacy initial @@ -5937,6 +6363,11 @@ Memory area to be used by remote processor image, managed by CMA. + rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling + when CONFIG_RT_GROUP_SCHED=y. Defaults to + !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. + Format: <bool> + rw [KNL] Mount root device read-write on boot S [KNL] Run init in single mode @@ -6106,7 +6537,16 @@ serialnumber [BUGS=X86-32] - sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst + sev=option[,option...] [X86-64] + + debug + Enable debug messages. + + nosnp + Do not enable SEV-SNP (applies to host/hypervisor + only). Setting 'nosnp' avoids the RMP check overhead + in memory accesses when users do not want to run + SEV-SNP guests. shapers= [NET] Maximal number of shapers. @@ -6275,6 +6715,8 @@ Selecting 'on' will also enable the mitigation against user space to user space task attacks. + Selecting specific mitigation does not force enable + user mitigations. Selecting 'off' will disable both the kernel and the user space protections. @@ -6858,6 +7300,14 @@ comma-separated list of trace events to enable. See also Documentation/trace/events.rst + To enable modules, use :mod: keyword: + + trace_event=:mod:<module> + + The value before :mod: will only enable specific events + that are part of the module. See the above mentioned + document for more information. + trace_instance=[instance-info] [FTRACE] Create a ring buffer instance early in boot up. This will be listed in: @@ -6926,6 +7376,8 @@ This is just one of many ways that can clear memory. Make sure your system keeps the content of memory across reboots before relying on this option. + NB: Both the mapped address and size must be page aligned for the architecture. + See also Documentation/trace/debugging.rst @@ -6964,6 +7416,15 @@ See also "Event triggers" in Documentation/trace/events.rst + traceoff_after_boot + [FTRACE] Sometimes tracing is used to debug issues + during the boot process. Since the trace buffer has a + limited amount of storage, it may be prudent to + disable tracing after the boot is finished, otherwise + the critical information may be overwritten. With this + option, the main tracing buffer will be turned off at + the end of the boot process. + traceoff_on_warning [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can @@ -6992,6 +7453,13 @@ See Documentation/admin-guide/mm/transhuge.rst for more details. + transparent_hugepage_tmpfs= [KNL] + Format: [always|within_size|advise|never] + Can be used to control the default hugepage allocation policy + for the tmpfs mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + trusted.source= [KEYS] Format: <string> This parameter identifies the trust source as a backend @@ -7028,6 +7496,19 @@ having this key zero'ed is acceptable. E.g. in testing scenarios. + tsa= [X86] Control mitigation for Transient Scheduler + Attacks on AMD CPUs. Search the following in your + favourite search engine for more details: + + "Technical guidance for mitigating transient scheduler + attacks". + + off - disable the mitigation + on - enable the mitigation (default) + user - mitigate only user/kernel transitions + vm - mitigate only guest/host transitions + + tsc= Disable clocksource stability checks for TSC. Format: <string> [x86] reliable: mark tsc clocksource as reliable, this @@ -7155,6 +7636,22 @@ Note that genuine overcurrent events won't be reported either. + unaligned_scalar_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping scalar unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same scalar unaligned access speed. + + unaligned_vector_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping vector unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same vector unaligned access speed. + unknown_nmi_panic [X86] Cause panic on unknown NMI. @@ -7350,13 +7847,6 @@ 16 - SIGBUS faults Example: user_debug=31 - userpte= - [X86,EARLY] Flags controlling user PTE allocations. - - nohigh = do not allocate PTE pages in - HIGHMEM regardless of setting - of CONFIG_HIGHPTE. - vdso= [X86,SH,SPARC] On X86_32, this is an alias for vdso32=. Otherwise: @@ -7474,7 +7964,7 @@ vt.cur_default= [VT] Default cursor shape. Format: 0xCCBBAA, where AA, BB, and CC are the same as the parameters of the <Esc>[?A;B;Cc escape sequence; - see VGA-softcursor.txt. Default: 2 = underline. + see vga-softcursor.rst. Default: 2 = underline. vt.default_blu= [VT] Format: <blue0>,<blue1>,<blue2>,...,<blue15> diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index ea7fa2a8bbf0..ee9a6c94f383 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -278,12 +278,7 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - e. If running on Cell Processor, build your kernel with - CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from - spu_gov_work(). - WARNING: Please check your CPU specifications to - make sure that this is safe on your particular system. - f. If running on PowerMAC, build your kernel with + e. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). diff --git a/Documentation/admin-guide/laptops/alienware-wmi.rst b/Documentation/admin-guide/laptops/alienware-wmi.rst new file mode 100644 index 000000000000..27a32a8057da --- /dev/null +++ b/Documentation/admin-guide/laptops/alienware-wmi.rst @@ -0,0 +1,127 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Alienware WMI Driver +==================== + +Kurt Borja <kuurtb@gmail.com> + +This is a driver for the "WMAX" WMI device, which is found in most Dell gaming +laptops and controls various special features. + +Before the launch of M-Series laptops (~2018), the "WMAX" device controlled +basic RGB lighting, deep sleep mode, HDMI mode and amplifier status. + +Later, this device was completely repurpused. Now it mostly deals with thermal +profiles, sensor monitoring and overclocking. This interface is named "AWCC" and +is known to be used by the AWCC OEM application to control these features. + +The alienware-wmi driver controls both interfaces. + +AWCC Interface +============== + +WMI device documentation: Documentation/wmi/devices/alienware-wmi.rst + +Supported devices +----------------- + +- Alienware M-Series laptops +- Alienware X-Series laptops +- Alienware Aurora Desktops +- Dell G-Series laptops + +If you believe your device supports the AWCC interface and you don't have any of +the features described in this document, try the following alienware-wmi module +parameters: + +- ``force_platform_profile=1``: Forces probing for platform profile support +- ``force_hwmon=1``: Forces probing for HWMON support + +If the module loads successfully with these parameters, consider submitting a +patch adding your model to the ``awcc_dmi_table`` located in +``drivers/platform/x86/dell/alienware-wmi-wmax.c`` or contacting the maintainer +for further guidance. + +Status +------ + +The following features are currently supported: + +- :ref:`Platform Profile <platform-profile>`: + + - Thermal profile control + + - G-Mode toggling + +- :ref:`HWMON <hwmon>`: + + - Sensor monitoring + + - Manual fan control + +.. _platform-profile: + +Platform Profile +---------------- + +The AWCC interface exposes various firmware defined thermal profiles. These are +exposed to user-space through the Platform Profile class interface. Refer to +:ref:`sysfs-class-platform-profile <abi_file_testing_sysfs_class_platform_profile>` +for more information. + +The name of the platform-profile class device exported by this driver is +"alienware-wmi" and it's path can be found with: + +:: + + grep -l "alienware-wmi" /sys/class/platform-profile/platform-profile-*/name | sed 's|/[^/]*$||' + +If the device supports G-Mode, it is also toggled when selecting the +``performance`` profile. + +.. note:: + You may set the ``force_gmode`` module parameter to always try to toggle this + feature, without checking if your model supports it. + +.. _hwmon: + +HWMON +----- + +The AWCC interface also supports sensor monitoring and manual fan control. Both +of these features are exposed to user-space through the HWMON interface. + +The name of the hwmon class device exported by this driver is "alienware_wmi" +and it's path can be found with: + +:: + + grep -l "alienware_wmi" /sys/class/hwmon/hwmon*/name | sed 's|/[^/]*$||' + +Sensor monitoring is done through the standard HWMON interface. Refer to +:ref:`sysfs-class-hwmon <abi_file_testing_sysfs_class_hwmon>` for more +information. + +Manual fan control on the other hand, is not exposed directly by the AWCC +interface. Instead it let's us control a fan `boost` value. This `boost` value +has the following aproximate behavior over the fan pwm: + +:: + + pwm = pwm_base + (fan_boost / 255) * (pwm_max - pwm_base) + +Due to the above behavior, the fan `boost` control is exposed to user-space +through the following, custom hwmon sysfs attribute: + +=============================== ======= ======================================= +Name Perm Description +=============================== ======= ======================================= +fan[1-4]_boost RW Fan boost value. + + Integer value between 0 and 255 +=============================== ======= ======================================= + +.. note:: + In some devices, manual fan control only works reliably if the ``custom`` + platform profile is selected. diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index cd9a1c2695fd..db842b629303 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -7,10 +7,12 @@ Laptop Drivers .. toctree:: :maxdepth: 1 + alienware-wmi asus-laptop disk-shock-protection laptop-mode lg-laptop + samsung-galaxybook sony-laptop sonypi thinkpad-acpi diff --git a/Documentation/admin-guide/laptops/samsung-galaxybook.rst b/Documentation/admin-guide/laptops/samsung-galaxybook.rst new file mode 100644 index 000000000000..752b8f1a4a74 --- /dev/null +++ b/Documentation/admin-guide/laptops/samsung-galaxybook.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +========================== +Samsung Galaxy Book Driver +========================== + +Joshua Grisham <josh@joshuagrisham.com> + +This is a Linux x86 platform driver for Samsung Galaxy Book series notebook +devices which utilizes Samsung's ``SCAI`` ACPI device in order to control +extra features and receive various notifications. + +Supported devices +================= + +Any device with one of the supported ACPI device IDs should be supported. This +covers most of the "Samsung Galaxy Book" series notebooks that are currently +available as of this writing, and could include other Samsung notebook devices +as well. + +Status +====== + +The following features are currently supported: + +- :ref:`Keyboard backlight <keyboard-backlight>` control +- :ref:`Performance mode <performance-mode>` control implemented using the + platform profile interface +- :ref:`Battery charge control end threshold + <battery-charge-control-end-threshold>` (stop charging battery at given + percentage value) implemented as a battery hook +- :ref:`Firmware Attributes <firmware-attributes>` to allow control of various + device settings +- :ref:`Handling of Fn hotkeys <keyboard-hotkey-actions>` for various actions +- :ref:`Handling of ACPI notifications and hotkeys + <acpi-notifications-and-hotkey-actions>` + +Because different models of these devices can vary in their features, there is +logic built within the driver which attempts to test each implemented feature +for a valid response before enabling its support (registering additional devices +or extensions, adding sysfs attributes, etc). Therefore, it can be important to +note that not all features may be supported for your particular device. + +The following features might be possible to implement but will require +additional investigation and are therefore not supported at this time: + +- "Dolby Atmos" mode for the speakers +- "Outdoor Mode" for increasing screen brightness on models with ``SAM0427`` +- "Silent Mode" on models with ``SAM0427`` + +.. _keyboard-backlight: + +Keyboard backlight +================== + +A new LED class named ``samsung-galaxybook::kbd_backlight`` is created which +will then expose the device using the standard sysfs-based LED interface at +``/sys/class/leds/samsung-galaxybook::kbd_backlight``. Brightness can be +controlled by writing the desired value to the ``brightness`` sysfs attribute or +with any other desired userspace utility. + +.. note:: + Most of these devices have an ambient light sensor which also turns + off the keyboard backlight under well-lit conditions. This behavior does not + seem possible to control at this time, but can be good to be aware of. + +.. _performance-mode: + +Performance mode +================ + +This driver implements the +Documentation/userspace-api/sysfs-platform_profile.rst interface for working +with the "performance mode" function of the Samsung ACPI device. + +Mapping of each Samsung "performance mode" to its respective platform profile is +performed dynamically by the driver, as not all models support all of the same +performance modes. Your device might have one or more of the following mappings: + +- "Silent" maps to ``low-power`` +- "Quiet" maps to ``quiet`` +- "Optimized" maps to ``balanced`` +- "High performance" maps to ``performance`` + +The result of the mapping can be printed in the kernel log when the module is +loaded. Supported profiles can also be retrieved from +``/sys/firmware/acpi/platform_profile_choices``, while +``/sys/firmware/acpi/platform_profile`` can be used to read or write the +currently selected profile. + +The ``balanced`` platform profile will be set during module load if no profile +has been previously set. + +.. _battery-charge-control-end-threshold: + +Battery charge control end threshold +==================================== + +This platform driver will add the ability to set the battery's charge control +end threshold, but does not have the ability to set a start threshold. + +This feature is typically called "Battery Saver" by the various Samsung +applications in Windows, but in Linux we have implemented the standardized +"charge control threshold" sysfs interface on the battery device to allow for +controlling this functionality from the userspace. + +The sysfs attribute +``/sys/class/power_supply/BAT1/charge_control_end_threshold`` can be used to +read or set the desired charge end threshold. + +If you wish to maintain interoperability with the Samsung Settings application +in Windows, then you should set the value to 100 to represent "off", or enable +the feature using only one of the following values: 50, 60, 70, 80, or 90. +Otherwise, the driver will accept any value between 1 and 100 as the percentage +that you wish the battery to stop charging at. + +.. note:: + Some devices have been observed as automatically "turning off" the charge + control end threshold if an input value of less than 30 is given. + +.. _firmware-attributes: + +Firmware Attributes +=================== + +The following enumeration-typed firmware attributes are set up by this driver +and should be accessible under +``/sys/class/firmware-attributes/samsung-galaxybook/attributes/`` if your device +supports them: + +- ``power_on_lid_open`` (device should power on when the lid is opened) +- ``usb_charging`` (USB ports can deliver power to connected devices even when + the device is powered off or in a low sleep state) +- ``block_recording`` (blocks access to camera and microphone) + +All of these attributes are simple boolean-like enumeration values which use 0 +to represent "off" and 1 to represent "on". Use the ``current_value`` attribute +to get or change the setting on the device. + +Note that when ``block_recording`` is updated, the input device "Samsung Galaxy +Book Lens Cover" will receive a ``SW_CAMERA_LENS_COVER`` switch event which +reflects the current state. + +.. _keyboard-hotkey-actions: + +Keyboard hotkey actions (i8042 filter) +====================================== + +The i8042 filter will swallow the keyboard events for the Fn+F9 hotkey (Multi- +level keyboard backlight toggle) and Fn+F10 hotkey (Block recording toggle) +and instead execute their actions within the driver itself. + +Fn+F9 will cycle through the brightness levels of the keyboard backlight. A +notification will be sent using ``led_classdev_notify_brightness_hw_changed`` +so that the userspace can be aware of the change. This mimics the behavior of +other existing devices where the brightness level is cycled internally by the +embedded controller and then reported via a notification. + +Fn+F10 will toggle the value of the "block recording" setting, which blocks +or allows usage of the built-in camera and microphone (and generates the same +Lens Cover switch event mentioned above). + +.. _acpi-notifications-and-hotkey-actions: + +ACPI notifications and hotkey actions +===================================== + +ACPI notifications will generate ACPI netlink events under the device class +``samsung-galaxybook`` and bus ID matching the Samsung ACPI device ID found on +your device. The events can be received using userspace tools such as +``acpi_listen`` and ``acpid``. + +The Fn+F11 Performance mode hotkey will be handled by the driver; each keypress +will cycle to the next available platform profile. diff --git a/Documentation/admin-guide/media/c3-isp.dot b/Documentation/admin-guide/media/c3-isp.dot new file mode 100644 index 000000000000..42dc931ee84a --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.dot @@ -0,0 +1,26 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1} | c3-isp-core\n/dev/v4l-subdev0 | {<port2> 2 | <port3> 3 | <port4> 4 | <port5> 5}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n00000008:port0 + n00000001:port4 -> n0000000b:port0 + n00000001:port5 -> n0000000e:port0 + n00000001:port2 -> n00000027 + n00000008 [label="{{<port0> 0} | c3-isp-resizer0\n/dev/v4l-subdev1 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000008:port1 -> n00000016 [style=bold] + n0000000b [label="{{<port0> 0} | c3-isp-resizer1\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n0000001a [style=bold] + n0000000e [label="{{<port0> 0} | c3-isp-resizer2\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000e:port1 -> n00000023 [style=bold] + n00000011 [label="{{<port0> 0} | c3-mipi-adapter\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000011:port1 -> n00000001:port0 [style=bold] + n00000016 [label="c3-isp-cap0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000001a [label="c3-isp-cap1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n0000001e [label="{{<port0> 0} | c3-mipi-csi2\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000001e:port1 -> n00000011:port0 [style=bold] + n00000023 [label="c3-isp-cap2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000027 [label="c3-isp-stats\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n0000002b [label="c3-isp-params\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n0000002b -> n00000001:port1 + n0000003f [label="{{} | imx290 2-001a\n/dev/v4l-subdev6 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000003f:port0 -> n0000001e:port0 [style=bold] +} diff --git a/Documentation/admin-guide/media/c3-isp.rst b/Documentation/admin-guide/media/c3-isp.rst new file mode 100644 index 000000000000..ac508b8c6831 --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR MIT) + +.. include:: <isonum.txt> + +================================================= +Amlogic C3 Image Signal Processing (C3ISP) driver +================================================= + +Introduction +============ + +This file documents the Amlogic C3ISP driver located under +drivers/media/platform/amlogic/c3/isp. + +The current version of the driver supports the C3ISP found on +Amlogic C308L processor. + +The driver implements V4L2, Media controller and V4L2 subdev interfaces. +Camera sensor using V4L2 subdev interface in the kernel is supported. + +The driver has been tested on AW419-C308L-Socket platform. + +Amlogic C3 ISP +============== + +The Camera hardware found on C308L processors and supported by +the driver consists of: + +- 1 MIPI-CSI-2 module: handles the physical layer of the MIPI CSI-2 receiver and + receives data from the connected camera sensor. +- 1 MIPI-ADAPTER module: organizes MIPI data to meet ISP input requirements and + send MIPI data to ISP. +- 1 ISP (Image Signal Processing) module: contains a pipeline of image processing + hardware blocks. The ISP pipeline contains three resizers at the end each of + them connected to a DMA interface which writes the output data to memory. + +A high-level functional view of the C3 ISP is presented below.:: + + +----------+ +-------+ + | Resizer |--->| WRMIF | + +---------+ +------------+ +--------------+ +-------+ |----------+ +-------+ + | Sensor |--->| MIPI CSI-2 |--->| MIPI ADAPTER |--->| ISP |---|----------+ +-------+ + +---------+ +------------+ +--------------+ +-------+ | Resizer |--->| WRMIF | + +----------+ +-------+ + |----------+ +-------+ + | Resizer |--->| WRMIF | + +----------+ +-------+ + +Driver architecture and design +============================== + +With the goal to model the hardware links between the modules and to expose a +clean, logical and usable interface, the driver registers the following V4L2 +sub-devices: + +- 1 `c3-mipi-csi2` sub-device - the MIPI CSI-2 receiver +- 1 `c3-mipi-adapter` sub-device - the MIPI adapter +- 1 `c3-isp-core` sub-device - the ISP core +- 3 `c3-isp-resizer` sub-devices - the ISP resizers + +The `c3-isp-core` sub-device is linked to 2 video device nodes for statistics +capture and parameters programming: + +- the `c3-isp-stats` capture video device node for statistics capture +- the `c3-isp-params` output video device for parameters programming + +Each `c3-isp-resizer` sub-device is linked to a capture video device node where +frames are captured from: + +- `c3-isp-resizer0` is linked to the `c3-isp-cap0` capture video device +- `c3-isp-resizer1` is linked to the `c3-isp-cap1` capture video device +- `c3-isp-resizer2` is linked to the `c3-isp-cap2` capture video device + +The media controller pipeline graph is as follows (with connected a +IMX290 camera sensor): + +.. _isp_topology_graph: + +.. kernel-figure:: c3-isp.dot + :alt: c3-isp.dot + :align: center + + Media pipeline topology + +Implementation +============== + +Runtime configuration of the ISP hardware is performed on the `c3-isp-params` +video device node using the :ref:`V4L2_META_FMT_C3ISP_PARAMS +<v4l2-meta-fmt-c3isp-params>` as data format. The buffer structure is defined by +:c:type:`c3_isp_params_cfg`. + +Statistics are captured from the `c3-isp-stats` video device node using the +:ref:`V4L2_META_FMT_C3ISP_STATS <v4l2-meta-fmt-c3isp-stats>` data format. + +The final picture size and format is configured using the V4L2 video +capture interface on the `c3-isp-cap[0, 2]` video device nodes. + +The Amlogic C3 ISP is supported by `libcamera <https://libcamera.org>`_ with a +dedicated pipeline handler and algorithms that perform run-time image correction +and enhancement. diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst index 92690e1f2183..b2e7a300494a 100644 --- a/Documentation/admin-guide/media/cec.rst +++ b/Documentation/admin-guide/media/cec.rst @@ -451,7 +451,7 @@ configure the CEC devices for HDMI Input and the HDMI Outputs manually. --------------------- A three character manufacturer name that is used in the EDID for the HDMI -Input. If not set, then userspace is reponsible for configuring an EDID. +Input. If not set, then userspace is responsible for configuring an EDID. If set, then the driver will update the EDID automatically based on the resolutions supported by the connected displays, and it will not be possible anymore to manually set the EDID for the HDMI Input. diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst index 83b3cd03b35c..9c190942932e 100644 --- a/Documentation/admin-guide/media/ipu3.rst +++ b/Documentation/admin-guide/media/ipu3.rst @@ -98,7 +98,7 @@ frames in packed raw Bayer format to IPU3 CSI2 receiver. # and that ov5670 sensor is connected to i2c bus 10 with address 0x36 export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036") - # Establish the link for the media devices using media-ctl [#f3]_ + # Establish the link for the media devices using media-ctl media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]" # Set the format for the media devices @@ -589,12 +589,8 @@ preserved. References ========== -.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h - .. [#f1] https://github.com/intel/nvt .. [#f2] http://git.ideasonboard.org/yavta.git -.. [#f3] http://git.ideasonboard.org/?p=media-ctl.git;a=summary - .. [#f4] ImgU limitation requires an additional 16x16 for all input resolutions diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index b9da127c074d..5ac69b833a7a 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -1,8 +1,17 @@ .. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + The mgb4 driver =============== +Copyright |copy| 2023 - 2025 Digiteq Automotive + author: Martin Tůma <martin.tuma@digiteqautomotive.com> + +This is a v4l2 device driver for the Digiteq Automotive FrameGrabber 4, a PCIe +card capable of capturing and generating FPD-Link III and GMSL2/3 video streams +as used in the automotive industry. + sysfs interface --------------- @@ -22,7 +31,9 @@ Global (PCI card) parameters | 0 - No module present | 1 - FPDL3 - | 2 - GMSL + | 2 - GMSL (one serializer, two daisy chained deserializers) + | 3 - GMSL (one serializer, two deserializers) + | 4 - GMSL (two deserializers with two daisy chain outputs) **module_version** (R): Module version number. Zero in case of a missing module. diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst index 7d8e3c8987db..239879634ea5 100644 --- a/Documentation/admin-guide/media/pci-cardlist.rst +++ b/Documentation/admin-guide/media/pci-cardlist.rst @@ -86,7 +86,6 @@ saa7134 Philips SAA7134 saa7164 NXP SAA7164 smipcie SMI PCIe DVBSky cards solo6x10 Bluecherry / Softlogic 6x10 capture cards (MPEG-4/H.264) -sta2x11_vip STA2X11 VIP Video For Linux tw5864 Techwell TW5864 video/audio grabber and encoder tw686x Intersil/Techwell TW686x tw68 Techwell tw68x Video For Linux diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index e8761561b2fe..3bac5165b134 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -10,6 +10,7 @@ Video4Linux (V4L) driver-specific documentation :maxdepth: 2 bttv + c3-isp cafe_ccic cx88 fimc diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 7367e6294ef6..4120e9cb0cd5 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -12,10 +12,16 @@ its CMA name like below: The structure of the files created under that directory is as follows: - - [RO] base_pfn: The base PFN (Page Frame Number) of the zone. + - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area. + This is the same as ranges/0/base_pfn. - [RO] count: Amount of memory in the CMA area. - [RO] order_per_bit: Order of pages represented by one bit. - - [RO] bitmap: The bitmap of page states in the zone. + - [RO] bitmap: The bitmap of allocated pages in the area. + This is the same as ranges/0/base_pfn. + - [RO] ranges/N/base_pfn: The base PFN of contiguous range N + in the CMA area. + - [RO] ranges/N/bitmap: The bit map of allocated pages in + range N in the CMA area. - [WO] alloc: Allocate N pages from that CMA area. For example:: echo 5 > <debugfs>/cma/<cma_name>/alloc diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..bc7e976120e0 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index c4dddf6733cd..ede14b679d02 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -42,32 +42,45 @@ the execution. :: $ git clone https://github.com/sjp38/masim; cd masim; make $ sudo damo start "./masim ./configs/stairs.cfg --quiet" - $ sudo ./damo show - 0 addr [85.541 TiB , 85.541 TiB ) (57.707 MiB ) access 0 % age 10.400 s - 1 addr [85.541 TiB , 85.542 TiB ) (413.285 MiB) access 0 % age 11.400 s - 2 addr [127.649 TiB , 127.649 TiB) (57.500 MiB ) access 0 % age 1.600 s - 3 addr [127.649 TiB , 127.649 TiB) (32.500 MiB ) access 0 % age 500 ms - 4 addr [127.649 TiB , 127.649 TiB) (9.535 MiB ) access 100 % age 300 ms - 5 addr [127.649 TiB , 127.649 TiB) (8.000 KiB ) access 60 % age 0 ns - 6 addr [127.649 TiB , 127.649 TiB) (6.926 MiB ) access 0 % age 1 s - 7 addr [127.998 TiB , 127.998 TiB) (120.000 KiB) access 0 % age 11.100 s - 8 addr [127.998 TiB , 127.998 TiB) (8.000 KiB ) access 40 % age 100 ms - 9 addr [127.998 TiB , 127.998 TiB) (4.000 KiB ) access 0 % age 11 s - total size: 577.590 MiB - $ sudo ./damo stop + $ sudo damo report access + heatmap: 641111111000000000000000000000000000000000000000000000[...]33333333333333335557984444[...]7 + # min/max temperatures: -1,840,000,000, 370,010,000, column size: 3.925 MiB + 0 addr 86.182 TiB size 8.000 KiB access 0 % age 14.900 s + 1 addr 86.182 TiB size 8.000 KiB access 60 % age 0 ns + 2 addr 86.182 TiB size 3.422 MiB access 0 % age 4.100 s + 3 addr 86.182 TiB size 2.004 MiB access 95 % age 2.200 s + 4 addr 86.182 TiB size 29.688 MiB access 0 % age 14.100 s + 5 addr 86.182 TiB size 29.516 MiB access 0 % age 16.700 s + 6 addr 86.182 TiB size 29.633 MiB access 0 % age 17.900 s + 7 addr 86.182 TiB size 117.652 MiB access 0 % age 18.400 s + 8 addr 126.990 TiB size 62.332 MiB access 0 % age 9.500 s + 9 addr 126.990 TiB size 13.980 MiB access 0 % age 5.200 s + 10 addr 126.990 TiB size 9.539 MiB access 100 % age 3.700 s + 11 addr 126.990 TiB size 16.098 MiB access 0 % age 6.400 s + 12 addr 127.987 TiB size 132.000 KiB access 0 % age 2.900 s + total size: 314.008 MiB + $ sudo damo stop The first command of the above example downloads and builds an artificial memory access generator program called ``masim``. The second command asks DAMO -to execute the artificial generator process start via the given command and -make DAMON monitors the generator process. The third command retrieves the -current snapshot of the monitored access pattern of the process from DAMON and -shows the pattern in a human readable format. - -Each line of the output shows which virtual address range (``addr [XX, XX)``) -of the process is how frequently (``access XX %``) accessed for how long time -(``age XX``). For example, the fifth region of ~9 MiB size is being most -frequently accessed for last 300 milliseconds. Finally, the fourth command -stops DAMON. +to start the program via the given command and make DAMON monitors the newly +started process. The third command retrieves the current snapshot of the +monitored access pattern of the process from DAMON and shows the pattern in a +human readable format. + +The first line of the output shows the relative access temperature (hotness) of +the regions in a single row hetmap format. Each column on the heatmap +represents regions of same size on the monitored virtual address space. The +position of the colun on the row and the number on the column represents the +relative location and access temperature of the region. ``[...]`` means +unmapped huge regions on the virtual address spaces. The second line shows +additional information for better understanding the heatmap. + +Each line of the output from the third line shows which virtual address range +(``addr XX size XX``) of the process is how frequently (``access XX %``) +accessed for how long time (``age XX``). For example, the evelenth region of +~9.5 MiB size is being most frequently accessed for last 3.7 seconds. Finally, +the fourth command stops DAMON. Note that DAMON can monitor not only virtual address spaces but multiple types of address spaces including the physical address space. @@ -95,7 +108,7 @@ Visualizing Recorded Patterns You can visualize the pattern in a heatmap, showing which memory region (x-axis) got accessed when (y-axis) and how frequently (number).:: - $ sudo damo report heats --heatmap stdout + $ sudo damo report heatmap 22222222222222222222222222222222222222211111111111111111111111111111111111111100 44444444444444444444444444444444444444434444444444444444444444444444444444443200 44444444444444444444444444444444444444433444444444444444444444444444444444444200 @@ -160,6 +173,6 @@ Data Access Pattern Aware Memory Management Below command makes every memory region of size >=4K that has not accessed for >=60 seconds in your workload to be swapped out. :: - $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \ - --damos_age 60s max --damos_action pageout \ - <pid of your workload> + $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ + --damos_age 60s max --damos_action pageout \ + <pid of your workload> diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index d9be9f7caa7d..d960aba72b82 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -26,12 +26,6 @@ DAMON provides below interfaces for different users. writing kernel space DAMON application programs for you. You can even extend DAMON for various address spaces. For detail, please refer to the interface :doc:`document </mm/damon/api>`. -- *debugfs interface. (DEPRECATED!)* - :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface - <sysfs_interface>`. This is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. .. _sysfs_interface: @@ -70,6 +64,7 @@ comma (","). │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us + │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us │ │ │ │ │ │ nr_regions/min,max │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target @@ -86,13 +81,13 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes - │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age + │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... @@ -138,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file. - ``off``: Stop running. - ``commit``: Read the user inputs in the sysfs files except ``state`` file again. +- ``update_tuned_intervals``: Update the contents of ``sample_us`` and + ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling + interval`` and ``aggregation interval`` for the files. Please refer to + :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>` + for more details. - ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' :ref:`quota goals <sysfs_schemes_quota_goals>`. - ``update_schemes_stats``: Update the contents of stats files for each @@ -219,6 +219,25 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _damon_usage_sysfs_monitoring_intervals_goal: + +contexts/<N>/monitoring_attrs/intervals/intervals_goal/ +------------------------------------------------------- + +Under the ``intervals`` directory, one directory for automated tuning of +``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists. +Under the directory, four files for the auto-tuning control, namely +``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist. +Please refer to the :ref:`design document of the feature +<damon_design_monitoring_intervals_autotuning>` for the internal of the tuning +mechanism. Reading and writing the four files under ``intervals_goal`` +directory shows and updates the tuning parameters that described in the +:ref:design doc <damon_design_monitoring_intervals_autotuning>` with the same +names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The +tuning-applied current values of the two intervals can be read from the +``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to +the ``state`` file. + .. _sysfs_targets: contexts/<N>/targets/ @@ -288,9 +307,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. schemes/<N>/ ------------ -In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files -(``action``, ``target_nid`` and ``apply_interval``) exist. +In each scheme directory, seven directories (``access_pattern``, ``quotas``, +``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and +``tried_regions``) and three files (``action``, ``target_nid`` and +``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read @@ -370,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. @@ -401,70 +421,84 @@ The ``interval`` should written in microseconds unit. .. _sysfs_filters: -schemes/<N>/filters/ --------------------- +schemes/<N>/{core\_,ops\_,}filters/ +----------------------------------- -The directory for the :ref:`filters <damon_design_damos_filters>` of the given +Directories for :ref:`filters <damon_design_damos_filters>` of the given DAMON-based operation scheme. -In the beginning, this directory has only one file, ``nr_filters``. Writing a +``core_filters`` and ``ops_filters`` directories are for the filters handled by +the DAMON core layer and operations set layer, respectively. ``filters`` +directory can be used for installing filters regardless of their handled +layers. Filters that requested by ``core_filters`` and ``ops_filters`` will be +installed before those of ``filters``. All three directories have same files. + +Use of ``filters`` directory can make expecting evaluation orders of given +filters with the files under directory bit confusing. Users are hence +recommended to use ``core_filters`` and ``ops_filters`` directories. The +``filters`` directory could be deprecated in future. + +In the beginning, the directory has only one file, ``nr_filters``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains six files, namely ``type``, ``matcing``, -``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type`` -file, you can write one of five special keywords: ``anon`` for anonymous pages, -``memcg`` for specific memory cgroup, ``young`` for young pages, ``addr`` for -specific address range (an open-ended interval), or ``target`` for specific -DAMON monitoring target filtering. In case of the memory cgroup filtering, you -can specify the memory cgroup of the interest by writing the path of the memory -cgroup from the cgroups mount point to ``memcg_path`` file. In case of the -address range filtering, you can specify the start and end address of the range -to ``addr_start`` and ``addr_end`` files, respectively. For the DAMON -monitoring target filtering, you can specify the index of the target between -the list of the DAMON context's monitoring targets list to ``target_idx`` file. -You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that does -or does not match to the type, respectively. Then, the scheme's action will -not be applied to the pages that specified to be filtered out. +Each filter directory contains nine files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc <damon_design_damos_filters>` for available type +names, their meaning and on what layer those are handled. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. + +You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter +is for memory that matches the ``type``. You can write ``Y`` or ``N`` to +``allow`` file to specify if applying the action to the memory that satisfies +the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: + # cd ops_filters/0/ # echo 2 > nr_filters - # # filter out anonymous pages + # # disallow anonymous pages echo anon > 0/type echo Y > 0/matching + echo N > 0/allow # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path echo Y > 1/matching + echo N > 1/allow -Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation <sysfs_context>` is being used. - -Also, memory regions that are filtered out by ``addr`` or ``target`` filters -are not counted as the scheme has tried to those, while regions that filtered -out by other type filters are counted as the scheme has tried to. The -difference is applied to :ref:`stats <damos_stats>` and -:ref:`tried regions <sysfs_schemes_tried_regions>`. +Refer to the :ref:`DAMOS filters design documentation +<damon_design_damos_filters>` for more details including how multiple filters +of different ``allow`` works, when each of the filters are supported, and +differences on stats. .. _sysfs_schemes_stats: schemes/<N>/stats/ ------------------ -DAMON counts the total number and bytes of regions that each scheme is tried to -be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. +DAMON counts statistics for each scheme. This statistics can be used for +online analysis or tuning of the schemes. Refer to :ref:`design doc +<damon_design_damos_stat>` for more details about the stats. The statistics can be retrieved by reading the files under ``stats`` directory -(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and -``qt_exceeds``), respectively. The files are not updated in real time, so you -should ask DAMON sysfs interface to update the content of the files for the -stats by writing a special keyword, ``update_schemes_stats`` to the relevant -``kdamonds/<N>/state`` file. +(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, +``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not +updated in real time, so you should ask DAMON sysfs interface to update the +content of the files for the stats by writing a special keyword, +``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file. .. _sysfs_schemes_tried_regions: @@ -501,10 +535,10 @@ set the ``access pattern`` as their interested pattern that they want to query. tried_regions/<N>/ ------------------ -In each region directory, you will find four files (``start``, ``end``, -``nr_accesses``, and ``age``). Reading the files will show the start and end -addresses, ``nr_accesses``, and ``age`` of the region that corresponding -DAMON-based operation scheme ``action`` has tried to be applied. +In each region directory, you will find five files (``start``, ``end``, +``nr_accesses``, ``age``, and ``sz_filter_passed``). Reading the files will +show the properties of the region that corresponding DAMON-based operation +scheme ``action`` has tried to be applied. Example ~~~~~~~ @@ -600,306 +634,3 @@ fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``) of the scheme in the list of the contexts of the context's kdamond, the index of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in addition to the output of ``damon_aggregated`` tracepoint. - - -.. _debugfs_interface: - -debugfs Interface (DEPRECATED!) -=============================== - -.. note:: - - THIS IS DEPRECATED! - - DAMON debugfs interface is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. - -DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, -``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, -``mk_contexts`` and ``rm_contexts`` under its debugfs directory, -``<debugfs>/damon/``. - - -``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation -notice. Reading it returns the deprecation notice, as below:: - - # cat DEPRECATED - DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. - - -Attributes ----------- - -Users can get and set the ``sampling interval``, ``aggregation interval``, -``update interval``, and min/max number of monitoring target regions by -reading from and writing to the ``attrs`` file. To know about the monitoring -attributes in detail, please refer to the :doc:`/mm/damon/design`. For -example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and -1000, and then check it again:: - - # cd <debugfs>/damon - # echo 5000 100000 1000000 10 1000 > attrs - # cat attrs - 5000 100000 1000000 10 1000 - - -Target IDs ----------- - -Some types of address spaces supports multiple monitoring target. For example, -the virtual memory address spaces monitoring can have multiple processes as the -monitoring targets. Users can set the targets by writing relevant id values of -the targets to, and get the ids of the current targets by reading from the -``target_ids`` file. In case of the virtual address spaces monitoring, the -values should be pids of the monitoring target processes. For example, below -commands set processes having pids 42 and 4242 as the monitoring targets and -check it again:: - - # cd <debugfs>/damon - # echo 42 4242 > target_ids - # cat target_ids - 42 4242 - -Users can also monitor the physical memory address space of the system by -writing a special keyword, "``paddr\n``" to the file. Because physical address -space monitoring doesn't support multiple targets, reading the file will show a -fake value, ``42``, as below:: - - # cd <debugfs>/damon - # echo paddr > target_ids - # cat target_ids - 42 - -Note that setting the target ids doesn't start the monitoring. - - -Initial Monitoring Target Regions ---------------------------------- - -In case of the virtual address space monitoring, DAMON automatically sets and -updates the monitoring target regions so that entire memory mappings of target -processes can be covered. However, users can want to limit the monitoring -region to specific address ranges, such as the heap, the stack, or specific -file-mapped area. Or, some users can know the initial access pattern of their -workloads and therefore want to set optimal initial regions for the 'adaptive -regions adjustment'. - -In contrast, DAMON do not automatically sets and updates the monitoring target -regions in case of physical memory monitoring. Therefore, users should set the -monitoring target regions by themselves. - -In such cases, users can explicitly set the initial monitoring target regions -as they want, by writing proper values to the ``init_regions`` file. The input -should be a sequence of three integers separated by white spaces that represent -one region in below form.:: - - <target idx> <start address> <end address> - -The ``target idx`` should be the index of the target in ``target_ids`` file, -starting from ``0``, and the regions should be passed in address order. For -example, below commands will set a couple of address ranges, ``1-100`` and -``100-200`` as the initial monitoring target region of pid 42, which is the -first one (index ``0``) in ``target_ids``, and another couple of address -ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one -(index ``1``) in ``target_ids``.:: - - # cd <debugfs>/damon - # cat target_ids - 42 4242 - # echo "0 1 100 \ - 0 100 200 \ - 1 20 40 \ - 1 50 100" > init_regions - -Note that this sets the initial monitoring target regions only. In case of -virtual memory monitoring, DAMON will automatically updates the boundary of the -regions after one ``update interval``. Therefore, users should set the -``update interval`` large enough in this case, if they don't want the -update. - - -Schemes -------- - -Users can get and set the DAMON-based operation :ref:`schemes -<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file. -Reading the file also shows the statistics of each scheme. To the file, each -of the schemes should be represented in each line in below form:: - - <target access pattern> <action> <quota> <watermarks> - -You can disable schemes by simply writing an empty string to the file. - -Target Access Pattern -~~~~~~~~~~~~~~~~~~~~~ - -The target access :ref:`pattern <damon_design_damos_access_pattern>` of the -scheme. The ``<target access pattern>`` is constructed with three ranges in -below form:: - - min-size max-size min-acc max-acc min-age max-age - -Specifically, bytes for the size of regions (``min-size`` and ``max-size``), -number of monitored accesses per aggregate interval for access frequency -(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of -regions (``min-age`` and ``max-age``) are specified. Note that the ranges are -closed interval. - -Action -~~~~~~ - -The ``<action>`` is a predefined integer for memory management :ref:`actions -<damon_design_damos_action>`. The mapping between the ``<action>`` values and -the memory management actions is as below. For the detailed meaning of the -action and DAMON operations set supporting each action, please refer to the -list on :ref:`design doc <damon_design_damos_action>`. - - - 0: ``willneed`` - - 1: ``cold`` - - 2: ``pageout`` - - 3: ``hugepage`` - - 4: ``nohugepage`` - - 5: ``stat`` - -Quota -~~~~~ - -Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme -via the ``<quota>`` in below form:: - - <ms> <sz> <reset interval> <priority weights> - -This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying -the action to memory regions of the ``target access pattern`` within the -``<reset interval>`` milliseconds, and to apply the action to only up to -``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both -``<ms>`` and ``<sz>`` zero disables the quota limits. - -For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users -can set the weights for the three properties in ``<priority weights>`` in below -form:: - - <size weight> <access frequency weight> <age weight> - -Watermarks -~~~~~~~~~~ - -Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the -given scheme via ``<watermarks>`` in below form:: - - <metric> <check interval> <high mark> <middle mark> <low mark> - -``<metric>`` is a predefined integer for the metric to be checked. The -supported numbers and their meanings are as below. - - - 0: Ignore the watermarks - - 1: System's free memory rate (per thousand) - -The value of the metric is checked every ``<check interval>`` microseconds. - -If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the -scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme -is activated. - -.. _damos_stats: - -Statistics -~~~~~~~~~~ - -It also counts the total number and bytes of regions that each scheme is tried -to be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. - -The statistics can be shown by reading the ``schemes`` file. Reading the file -will show each scheme you entered in each line, and the five numbers for the -statistics will be added at the end of each line. - -Example -~~~~~~~ - -Below commands applies a scheme saying "If a memory region of size in [4KiB, -8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate -interval in [10, 20], page out the region. For the paging out, use only up to -10ms per second, and also don't page out more than 1GiB per second. Under the -limitation, page out memory regions having longer age first. Also, check the -free memory rate of the system every 5 seconds, start the monitoring and paging -out when the free memory rate becomes lower than 50%, but stop it if the free -memory rate becomes larger than 60%, or lower than 30%".:: - - # cd <debugfs>/damon - # scheme="4096 8192 0 5 10 20 2" # target access pattern and action - # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas - # scheme+=" 0 0 100" # prioritization weights - # scheme+=" 1 5000000 600 500 300" # watermarks - # echo "$scheme" > schemes - - -Turning On/Off --------------- - -Setting the files as described above doesn't incur effect unless you explicitly -start the monitoring. You can start, stop, and check the current status of the -monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file. -Writing ``on`` to the file starts the monitoring of the targets with the -attributes. Writing ``off`` to the file stops those. DAMON also stops if -every target process is terminated. Below example commands turn on, off, and -check the status of DAMON:: - - # cd <debugfs>/damon - # echo on > monitor_on_DEPRECATED - # echo off > monitor_on_DEPRECATED - # cat monitor_on_DEPRECATED - off - -Please note that you cannot write to the above-mentioned debugfs files while -the monitoring is turned on. If you write to the files while DAMON is running, -an error code such as ``-EBUSY`` will be returned. - - -Monitoring Thread PID ---------------------- - -DAMON does requested monitoring with a kernel thread called ``kdamond``. You -can get the pid of the thread by reading the ``kdamond_pid`` file. When the -monitoring is turned off, reading the file returns ``none``. :: - - # cd <debugfs>/damon - # cat monitor_on_DEPRECATED - off - # cat kdamond_pid - none - # echo on > monitor_on_DEPRECATED - # cat kdamond_pid - 18594 - - -Using Multiple Monitoring Threads ---------------------------------- - -One ``kdamond`` thread is created for each monitoring context. You can create -and remove monitoring contexts for multiple ``kdamond`` required use case using -the ``mk_contexts`` and ``rm_contexts`` files. - -Writing the name of the new context to the ``mk_contexts`` file creates a -directory of the name on the DAMON debugfs directory. The directory will have -DAMON debugfs files for the context. :: - - # cd <debugfs>/damon - # ls foo - # ls: cannot access 'foo': No such file or directory - # echo foo > mk_contexts - # ls foo - # attrs init_regions kdamond_pid schemes target_ids - -If the context is not needed anymore, you can remove it and the corresponding -directory by putting the name of the context to the ``rm_contexts`` file. :: - - # echo foo > rm_contexts - # ls foo - # ls: cannot access 'foo': No such file or directory - -Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files -are in the root directory only. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index f34a0d798d5b..67a941903fd2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -145,7 +145,17 @@ hugepages It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. If the node number is invalid, the parameter will be ignored. +hugepage_alloc_threads + Specify the number of threads that should be used to allocate hugepages + during boot. This parameter can be used to improve system bootup time + when allocating a large amount of huge pages. + The default value is 25% of the available hardware threads. + Example to use 8 allocation threads:: + + hugepage_alloc_threads=8 + + Note that this parameter only applies to non-gigantic huge pages. default_hugepagesz Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..2d2f6c222308 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -42,3 +42,4 @@ the Linux memory management. transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase <kho-finalization-phase>` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index cb2c080f400c..33c886f3d198 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -280,8 +280,8 @@ The following files are currently defined: blocks; configure auto-onlining. The default value depends on the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel configuration + options. See the ``state`` property of memory blocks for details. ``block_size_bytes`` read-only: the size in bytes of a memory block. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index caba0f52dd36..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -37,12 +38,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page page of the larger allocation + is *maybe* mapped in a different process. In some cases, a large allocation + might be treated as "maybe mapped by multiple processes" even though this + is no longer the case. + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. + times each page is mapped, indexed by PFN. Some kernel configurations do + not track the precise number of times a page part of a larger allocation + (e.g., THP) is mapped. In these configurations, the average number of + mappings per page in this larger allocation is returned instead. However, + if any page of the large allocation is mapped, the returned value will + be at least 1. The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. @@ -233,6 +250,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 8872203df088..dff8d5985f0f 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -332,6 +332,12 @@ allocation policy for the internal shmem mount by using the kernel parameter seven valid policies for shmem (``always``, ``within_size``, ``advise``, ``never``, ``deny``, and ``force``). +Similarly to ``transparent_hugepage_shmem``, you can control the default +hugepage allocation policy for the tmpfs mount by using the kernel parameter +``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the +four valid policies for tmpfs (``always``, ``within_size``, ``advise``, +``never``). The tmpfs mount default policy is ``never``. + In the same manner as ``thp_anon`` controls each supported anonymous THP size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` has the same format as ``thp_anon``, but also supports the policy @@ -352,8 +358,21 @@ default to ``never``. Hugepages in tmpfs/shmem ======================== -You can control hugepage allocation policy in tmpfs with mount option -``huge=``. It can have following values: +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today, +it also supports smaller sizes just like anonymous memory, often referred +to as "multi-size THP" (mTHP). Huge pages of any size are commonly +represented in the kernel as "large folios". + +While there is fine control over the huge page sizes to use for the internal +shmem mount (see below), ordinary tmpfs mounts will make use of all available +huge page sizes without any control over the exact sizes, behaving more like +other file systems. + +tmpfs mounts +------------ + +The THP allocation policy for tmpfs mounts can be adjusted using the mount +option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; @@ -363,24 +382,24 @@ never within_size Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate huge pages if requested with fadvise()/madvise(); + Only allocate huge pages if requested with madvise(); + +Remember, that the kernel may use huge pages of all available sizes, and +that no fine control as for the internal tmpfs mount is available. -The default policy is ``never``. +The default policy in the past was ``never``, but it can now be adjusted +using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``. ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting ``huge=never`` will not attempt to break up huge pages at all, just stop more from being allocated. -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: +In addition to policies listed above, the sysfs knob +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the +allocation policy of tmpfs mounts, when set to the following values: deny For use in emergencies, to force the huge option off from @@ -388,13 +407,24 @@ deny force Force the huge option on for all - very useful for testing; -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to -control mTHP allocation: -'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled', -and its value for each mTHP is essentially consistent with the global -setting. An 'inherit' option is added to ensure compatibility with these -global settings. Conversely, the options 'force' and 'deny' are dropped, -which are rather testing artifacts from the old ages. +shmem / internal tmpfs +---------------------- +The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +To control the THP allocation policy for this internal tmpfs mount, the +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs +per THP size in +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled' +can be used. + +The global knob has the same semantics as the ``huge=`` mount options +for tmpfs mounts, except that the different huge page sizes can be controlled +individually, and will only use the setting of the global knob when the +per-size knob is set to 'inherit'. + +The options 'force' and 'deny' are dropped for the individual sizes, which +are rather testing artifacts from the old ages. always Attempt to allocate <size> huge pages every time we need a new page; @@ -408,10 +438,10 @@ never within_size Only allocate <size> huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate <size> huge pages if requested with fadvise()/madvise(); + Only allocate <size> huge pages if requested with madvise(); Need of application restart =========================== @@ -561,6 +591,16 @@ swpin is incremented every time a huge page is swapped in from a non-zswap swap device in one piece. +swpin_fallback + is incremented if swapin fails to allocate or charge a huge page + and instead falls back to using huge pages with lower orders or + small pages. + +swpin_fallback_charge + is incremented if swapin fails to charge a huge page and instead + falls back to using huge pages with lower orders or small pages + even though the allocation was successful. + swpout is incremented every time a huge page is swapped out to a non-zswap swap device in one piece without splitting. diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 3598dcd7dbe7..fd3370aa43fe 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed pages are freed. The pool is not preallocated. By default, a zpool of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs +e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: - echo zbud > /sys/module/zswap/parameters/zpool + echo zsmalloc > /sys/module/zswap/parameters/zpool -The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which -means the compression ratio will always be 2:1 or worse (because of half-full -zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. +The zsmalloc type zpool has a complex compressed page storage method, and it +can achieve great storage densities. When a swap page is passed from swapout to zswap, zswap maintains a mapping of the swap entry, a combination of the swap type and swap offset, to the zpool diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst index 369556e00f0c..553a44803231 100644 --- a/Documentation/admin-guide/namespaces/resource-control.rst +++ b/Documentation/admin-guide/namespaces/resource-control.rst @@ -1,17 +1,17 @@ -=========================== -Namespaces research control -=========================== +==================================== +User namespaces and resource control +==================================== -There are a lot of kinds of objects in the kernel that don't have -individual limits or that have limits that are ineffective when a set -of processes is allowed to switch user ids. With user namespaces -enabled in a kernel for people who don't trust their users or their -users programs to play nice this problems becomes more acute. +The kernel contains many kinds of objects that either don't have +individual limits or that have limits which are ineffective when +a set of processes is allowed to switch their UID. On a system +where the admins don't trust their users or their users' programs, +user namespaces expose the system to potential misuse of resources. -Therefore it is recommended that memory control groups be enabled in -kernels that enable user namespaces, and it is further recommended -that userspace configure memory control groups to limit how much -memory user's they don't trust to play nice can use. +In order to mitigate this, we recommend that admins enable memory +control groups on any system that enables user namespaces. +Furthermore, we recommend that admins configure the memory control +groups to limit the maximum memory usable by any untrusted user. Memory control groups can be configured by installing the libcgroup package present on most distros editing /etc/cgrules.conf, diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst new file mode 100644 index 000000000000..97ca1ccef459 --- /dev/null +++ b/Documentation/admin-guide/nvme-multipath.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Linux NVMe multipath +==================== + +This document describes NVMe multipath and its path selection policies supported +by the Linux NVMe host driver. + + +Introduction +============ + +The NVMe multipath feature in Linux integrates namespaces with the same +identifier into a single block device. Using multipath enhances the reliability +and stability of I/O access while improving bandwidth performance. When a user +sends I/O to this merged block device, the multipath mechanism selects one of +the underlying block devices (paths) according to the configured policy. +Different policies result in different path selections. + + +Policies +======== + +All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning +that when an optimized path is available, it will be chosen over a non-optimized +one. Current the NVMe multipath policies include numa(default), round-robin and +queue-depth. + +To set the desired policy (e.g., round-robin), use one of the following methods: + 1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy + 2. or add the "nvme_core.iopolicy=round-robin" to cmdline. + + +NUMA +---- + +The NUMA policy selects the path closest to the NUMA node of the current CPU for +I/O distribution. This policy maintains the nearest paths to each NUMA node +based on network interface connections. + +When to use the NUMA policy: + 1. Multi-core Systems: Optimizes memory access in multi-core and + multi-processor systems, especially under NUMA architecture. + 2. High Affinity Workloads: Binds I/O processing to the CPU to reduce + communication and data transfer delays across nodes. + + +Round-Robin +----------- + +The round-robin policy distributes I/O requests evenly across all paths to +enhance throughput and resource utilization. Each I/O operation is sent to the +next path in sequence. + +When to use the round-robin policy: + 1. Balanced Workloads: Effective for balanced and predictable workloads with + similar I/O size and type. + 2. Homogeneous Path Performance: Utilizes all paths efficiently when + performance characteristics (e.g., latency, bandwidth) are similar. + + +Queue-Depth +----------- + +The queue-depth policy manages I/O requests based on the current queue depth +of each path, selecting the path with the least number of in-flight I/Os. + +When to use the queue-depth policy: + 1. High load with small I/Os: Effectively balances load across paths when + the load is high, and I/O operations consist of small, relatively + fixed-sized requests. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index 39b8e1fdd0cd..cb376f335f40 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -60,7 +60,7 @@ description of available events and configuration options in sysfs, see The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration templates for all documented events. For example, -"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1". +"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0". The "perf list" command shall list the available events from sysfs, e.g.:: @@ -79,8 +79,8 @@ Example usage of counting PCIe RX TLP data payload (Units of bytes):: The average RX/TX bandwidth can be calculated using the following formula: - PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window - PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window + PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window + PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window Lane Event Usage ------------------------------- diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index 5cc248d18c63..48992a0b8e94 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -35,7 +35,10 @@ e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in SCCL ID #1. The driver also provides a "cpumask" sysfs attribute, which shows the CPU core -ID used to count the uncore PMU event. +ID used to count the uncore PMU event. An "associated_cpus" sysfs attribute is +also provided to show the CPUs associated with this PMU. The "cpumask" indicates +the CPUs to open the events, usually as a hint for userspaces tools like perf. +It only contains one associated CPU from the "associated_cpus". Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index a58bd3f7e190..072b510385c4 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -14,6 +14,8 @@ Performance monitor support qcom_l2_pmu qcom_l3_pmu starfive_starlink_pmu + mrvl-odyssey-ddr-pmu + mrvl-odyssey-tad-pmu arm-ccn arm-cmn arm-ni diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst new file mode 100644 index 000000000000..2e817593a4d9 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst @@ -0,0 +1,80 @@ +=================================================================== +Marvell Odyssey DDR PMU Performance Monitoring Unit (PMU UNCORE) +=================================================================== + +Odyssey DRAM Subsystem supports eight counters for monitoring performance +and software can program those counters to monitor any of the defined +performance events. Supported performance events include those counted +at the interface between the DDR controller and the PHY, interface between +the DDR Controller and the CHI interconnect, or within the DDR Controller. + +Additionally DSS also supports two fixed performance event counters, one +for ddr reads and the other for ddr writes. + +The counter will be operating in either manual or auto mode. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/events/ + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/format/ + +Examples:: + + $ perf list | grep ddr + mrvl_ddr_pmu_<>/ddr_act_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_alloc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_starvation/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_mwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_read/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_write/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_capar_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_crit_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_reads/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_writes/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cmd_is_retry/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cycles/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_parity_poison/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_rd_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_wr_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mpc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_mpsm/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_powerdown/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_selfref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_pri_rdaccess/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rmw_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_load_mode/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_lpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_other/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_rdwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_raw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_uc_ecc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rdwr_transitions/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_refresh/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_retry_fifo_full/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_spec_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_tcr_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_war_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_waw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_rd/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_wr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_trxn_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_write_combine/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqcl/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqlatch/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqstart/ [Kernel PMU event] + + $ perf stat -e ddr_cam_read,ddr_cam_write,ddr_cam_active_access,ddr_cam + rd_or_wr_access,ddr_cam_rd_active_access,ddr_cam_mwr <workload> diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst new file mode 100644 index 000000000000..ad1975b14087 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst @@ -0,0 +1,37 @@ +==================================================================== +Marvell Odyssey LLC-TAD Performance Monitoring Unit (PMU UNCORE) +==================================================================== + +Each TAD provides eight 64-bit counters for monitoring +cache behavior.The driver always configures the same counter for +all the TADs. The user would end up effectively reserving one of +eight counters in every TAD to look across all TADs. +The occurrences of events are aggregated and presented to the user +at the end of running the workload. The driver does not provide a +way for the user to partition TADs so that different TADs are used for +different applications. + +The performance events reflect various internal or interface activities. +By combining the values from multiple performance counters, cache +performance can be measured in terms such as: cache miss rate, cache +allocations, interface retry rate, internal resource occupancy, etc. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/tad/events/ + /sys/bus/event_source/devices/tad/format/ + +Examples:: + + $ perf list | grep tad + tad/tad_alloc_any/ [Kernel PMU event] + tad/tad_alloc_dtg/ [Kernel PMU event] + tad/tad_alloc_ltg/ [Kernel PMU event] + tad/tad_hit_any/ [Kernel PMU event] + tad/tad_hit_dtg/ [Kernel PMU event] + tad/tad_hit_ltg/ [Kernel PMU event] + tad/tad_req_msh_in_exlmn/ [Kernel PMU event] + tad/tad_tag_rd/ [Kernel PMU event] + tad/tad_tot_cycle/ [Kernel PMU event] + + $ perf stat -e tad_alloc_dtg,tad_alloc_ltg,tad_alloc_any,tad_hit_dtg,tad_hit_ltg,tad_hit_any,tad_tag_rd <workload> diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst index 2e0d47cfe7ea..f538ef67e0e8 100644 --- a/Documentation/admin-guide/perf/nvidia-pmu.rst +++ b/Documentation/admin-guide/perf/nvidia-pmu.rst @@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>. Example usage: @@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. Example usage: @@ -86,6 +86,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/ + NVLink-C2C1 PMU ------------------- @@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. Example usage: @@ -116,6 +132,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/ + CNVLink PMU --------------- @@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>. Each SoC socket can be connected to one or more sockets via CNVLink. The user can use "rem_socket" bitmap parameter to select the remote socket(s) to monitor. Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to -socket 1 to 3. -/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket +socket 1 to 3. The PMU will monitor all remote sockets by default if not +specified. +/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket shows the valid bits that can be set in the "rem_socket" parameter. The PMU can not distinguish the remote traffic initiator, therefore it does not @@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>. Each SoC socket can support multiple root ports. The user can use "root_port" bitmap parameter to select the port(s) to monitor, i.e. -"root_port=0xF" corresponds to root port 0 to 3. -/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port +"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root +ports by default if not specified. +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port shows the valid bits that can be set in the "root_port" parameter. Example usage: diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index a21369eba034..2d74af7f0efe 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -231,7 +231,7 @@ are the following: present). The existence of the limit may be a result of some (often unintentional) - BIOS settings, restrictions coming from a service processor or another + BIOS settings, restrictions coming from a service processor or other BIOS/HW-based mechanisms. This does not cover ACPI thermal limitations which can be discovered @@ -248,6 +248,20 @@ are the following: If that frequency cannot be determined, this attribute should not be present. +``cpuinfo_avg_freq`` + An average frequency (in KHz) of all CPUs belonging to a given policy, + derived from a hardware provided feedback and reported on a time frame + spanning at most few milliseconds. + + This is expected to be based on the frequency the hardware actually runs + at and, as such, might require specialised hardware support (such as AMU + extension on ARM). If one cannot be determined, this attribute should + not be present. + + Note that failed attempt to retrieve current frequency for a given + CPU(s) will result in an appropriate error, i.e.: EAGAIN for CPU that + remains idle (raised on ARM). + ``cpuinfo_max_freq`` Maximum possible operating frequency the CPUs belonging to this policy can run at (in kHz). @@ -293,7 +307,8 @@ are the following: Some architectures (e.g. ``x86``) may attempt to provide information more precisely reflecting the current CPU frequency through this attribute, but that still may not be the exact current CPU frequency as - seen by the hardware at the moment. + seen by the hardware at the moment. This behavior though, is only + available via c:macro:``CPUFREQ_ARCH_CUR_FREQ`` option. ``scaling_driver`` The scaling driver currently in use. @@ -484,7 +499,7 @@ This governor exposes the following tunables: represented by it to be 1.5 times as high as the transition latency (the default):: - # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate + # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2))` > ondemand/sampling_rate ``up_threshold`` If the estimated CPU load is above this value (in percent), the governor diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index eb58d7a5affd..0c090b076224 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -275,20 +275,25 @@ values and, when predicting the idle duration next time, it computes the average and variance of them. If the variance is small (smaller than 400 square milliseconds) or it is small relative to the average (the average is greater that 6 times the standard deviation), the average is regarded as the "typical -interval" value. Otherwise, the longest of the saved observed idle duration +interval" value. Otherwise, either the longest or the shortest (depending on +which one is farther from the average) of the saved observed idle duration values is discarded and the computation is repeated for the remaining ones. + Again, if the variance of them is small (in the above sense), the average is taken as the "typical interval" value and so on, until either the "typical -interval" is determined or too many data points are disregarded, in which case -the "typical interval" is assumed to equal "infinity" (the maximum unsigned -integer value). - -If the "typical interval" computed this way is long enough, the governor obtains -the time until the closest timer event with the assumption that the scheduler -tick will be stopped. That time, referred to as the *sleep length* in what follows, -is the upper bound on the time before the next CPU wakeup. It is used to determine -the sleep length range, which in turn is needed to get the sleep length correction -factor. +interval" is determined or too many data points are disregarded. In the latter +case, if the size of the set of data points still under consideration is +sufficiently large, the next idle duration is not likely to be above the largest +idle duration value still in that set, so that value is taken as the predicted +next idle duration. Finally, if the set of data points still under +consideration is too small, no prediction is made. + +If the preliminary prediction of the next idle duration computed this way is +long enough, the governor obtains the time until the closest timer event with +the assumption that the scheduler tick will be stopped. That time, referred to +as the *sleep length* in what follows, is the upper bound on the time before the +next CPU wakeup. It is used to determine the sleep length range, which in turn +is needed to get the sleep length correction factor. The ``menu`` governor maintains an array containing several correction factor values that correspond to different sleep length ranges organized so that each @@ -302,7 +307,7 @@ to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). The sleep length is multiplied by the correction factor for the range that it falls into to obtain an approximation of the predicted idle duration that is compared to the "typical interval" determined previously and the minimum of -the two is taken as the idle duration prediction. +the two is taken as the final idle duration prediction. If the "typical interval" value is small, which means that the CPU is likely to be woken up soon enough, the sleep length computation is skipped as it may diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index 39bd6ecce7de..ed6f055d4b14 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -38,6 +38,27 @@ instruction at all. only way to pass early-configuration-time parameters to it is via the kernel command line. +Sysfs Interface +=============== + +The ``intel_idle`` driver exposes the following ``sysfs`` attributes in +``/sys/devices/system/cpu/cpuidle/``: + +``intel_c1_demotion`` + Enable or disable C1 demotion for all CPUs in the system. This file is + only exposed on platforms that support the C1 demotion feature and where + it was tested. Value 0 means that C1 demotion is disabled, value 1 means + that it is enabled. Write 0 or 1 to disable or enable C1 demotion for + all CPUs. + + The C1 demotion feature involves the platform firmware demoting deep + C-state requests from the OS (e.g., C6 requests) to C1. The idea is that + firmware monitors CPU wake-up rate, and if it is higher than a + platform-specific threshold, the firmware demotes deep C-state requests + to C1. For example, Linux requests C6, but firmware noticed too many + wake-ups per second, and it keeps the CPU in C1. When the CPU stays in + C1 long enough, the platform promotes it back to C6. This may improve + some workloads' performance, but it may also increase power consumption. .. _intel-idle-enumeration-of-states: @@ -192,11 +213,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in Documentation/admin-guide/pm/cpuidle.rst). Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. -The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle`` -if the kernel has been configured with ACPI support) can be set to make the -driver ignore the system's ACPI tables entirely or use them for all of the -recognized processor models, respectively (they both are unset by default and -``use_acpi`` has no effect if ``no_acpi`` is set). +The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are +recognized by ``intel_idle`` if the kernel has been configured with ACPI +support. In the case that ACPI is not configured these flags have no impact +on functionality. + +``no_acpi`` - Do not use ACPI at all. Only native mode is available, no +ACPI mode. + +``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for +C-states on/off status in native mode. + +``no_native`` - Work only in ACPI mode, no native mode available (ignore +all custom tables). The value of the ``states_off`` module parameter (0 by default) represents a list of idle states to be disabled by default in the form of a bitmask. diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index bf13ad25a32f..26e702c7016e 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -329,6 +329,106 @@ information listed above is the same for all of the processors supporting the HWP feature, which is why ``intel_pstate`` works with all of them.] +Support for Hybrid Processors +============================= + +Some processors supported by ``intel_pstate`` contain two or more types of CPU +cores differing by the maximum turbo P-state, performance vs power characteristics, +cache sizes, and possibly other properties. They are commonly referred to as +hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled +and it assumes the HWP performance units to be the same for all CPUs in the +system, so a given HWP performance level always represents approximately the +same physical performance regardless of the core (CPU) type. + +Hybrid Processors with SMT +-------------------------- + +On systems where SMT (Simultaneous Multithreading), also referred to as +HyperThreading (HT) in the context of Intel processors, is enabled on at least +one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely, +the priority of a given CPU reflects its highest HWP performance level which +causes the CPU scheduler to generally prefer more performant CPUs, so the less +performant CPUs are used when the other ones are fully loaded. However, SMT +siblings (that is, logical CPUs sharing one physical core) are treated in a +special way such that if one of them is in use, the effective priority of the +other ones is lowered below the priorities of the CPUs located in the other +physical cores. + +This approach maximizes performance in the majority of cases, but unfortunately +it also leads to excessive energy usage in some important scenarios, like video +playback, which is not generally desirable. While there is no other viable +choice with SMT enabled because the effective capacity and utilization of SMT +siblings are hard to determine, hybrid processors without SMT can be handled in +more energy-efficient ways. + +.. _CAS: + +Capacity-Aware Scheduling Support +--------------------------------- + +The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by +``intel_pstate`` by default on hybrid processors without SMT. CAS generally +causes the scheduler to put tasks on a CPU so long as there is a sufficient +amount of spare capacity on it, and if the utilization of a given task is too +high for it, the task will need to go somewhere else. + +Since CAS takes CPU capacities into account, it does not require CPU +prioritization and it allows tasks to be distributed more symmetrically among +the more performant and less performant CPUs. Once placed on a CPU with enough +capacity to accommodate it, a task may just continue to run there regardless of +whether or not the other CPUs are fully loaded, so on average CAS reduces the +utilization of the more performant CPUs which causes the energy usage to be more +balanced because the more performant CPUs are generally less energy-efficient +than the less performant ones. + +In order to use CAS, the scheduler needs to know the capacity of each CPU in +the system and it needs to be able to compute scale-invariant utilization of +CPUs, so ``intel_pstate`` provides it with the requisite information. + +First of all, the capacity of each CPU is represented by the ratio of its highest +HWP performance level, multiplied by 1024, to the highest HWP performance level +of the most performant CPU in the system, which works because the HWP performance +units are the same for all CPUs. Second, the frequency-invariance computations, +carried out by the scheduler to always express CPU utilization in the same units +regardless of the frequency it is currently running at, are adjusted to take the +CPU capacity into account. All of this happens when ``intel_pstate`` has +registered itself with the ``CPUFreq`` core and it has figured out that it is +running on a hybrid processor without SMT. + +Energy-Aware Scheduling Support +------------------------------- + +If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and +``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling +`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the +Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if +``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` +to operate in the `passive mode <Passive Mode_>`_. + +The Energy Model registered by ``intel_pstate`` is artificial (that is, it is +based on abstract cost values and it does not include any real power numbers) +and it is relatively simple to avoid unnecessary computations in the scheduler. +There is a performance domain in it for every CPU in the system and the cost +values for these performance domains have been chosen so that running a task on +a less performant (small) CPU appears to be always cheaper than running that +task on a more performant (big) CPU. However, for two CPUs of the same type, +the cost difference depends on their current utilization, and the CPU whose +current utilization is higher generally appears to be a more expensive +destination for a given task. This helps to balance the load among CPUs of the +same type. + +Since EAS works on top of CAS, high-utilization tasks are always migrated to +CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization +tasks tend to be placed on the CPUs that look less expensive to the scheduler. +Effectively, this causes the less performant and less loaded CPUs to be +preferred as long as they have enough spare capacity to run the given task +which generally leads to reduced energy usage. + +The Energy Model created by ``intel_pstate`` can be inspected by looking at +the ``energy_model`` directory in ``debugfs`` (typlically mounted on +``/sys/kernel/debug/``). + + User Space Interface in ``sysfs`` ================================= @@ -696,6 +796,9 @@ of them have to be prepended with the ``intel_pstate=`` prefix. Use per-logical-CPU P-State limits (see `Coordination of P-state Limits`_ for details). +``no_cas`` + Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by + default on hybrid systems without SMT. Diagnostics and Tuning ====================== diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5151ec312dc0..d367ba4d744a 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -91,12 +91,22 @@ Attributes in each directory: ``domain_id`` This attribute is used to get the power domain id of this instance. +``die_id`` + This attribute is used to get the Linux die id of this instance. + This attribute is only present for domains with core agents and + when the CPUID leaf 0x1f presents die ID. + ``fabric_cluster_id`` This attribute is used to get the fabric cluster id of this instance. ``package_id`` This attribute is used to get the package id of this instance. +``agent_types`` + This attribute displays all the hardware agents present within the + domain. Each agent has the capability to control one or more hardware + subsystems, which include: core, cache, memory, and I/O. + The other attributes are same as presented at package_*_die_* level. In most of current use cases, the "max_freq_khz" and "min_freq_khz" diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst index 3eda08191d13..24d80e3eb309 100644 --- a/Documentation/admin-guide/pnp.rst +++ b/Documentation/admin-guide/pnp.rst @@ -129,9 +129,6 @@ pnp_put_protocol pnp_register_protocol use this to register a new PnP protocol -pnp_unregister_protocol - use this function to remove a PnP protocol from the Plug and Play Layer - pnp_register_driver adds a PnP driver to the Plug and Play Layer diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index f08149bc53f8..4a5ffb0996a3 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -347,7 +347,7 @@ again. [:ref:`details<uninstall>`] -.. _submit_improvements: +.. _submit_improvements_qbtl: Did you run into trouble following any of the above steps that is not cleared up by the reference section below? Or do you have ideas how to improve the text? @@ -733,7 +733,7 @@ can easily happen that your self-built kernel will lack modules for tasks you did not perform before utilizing this make target. That's because those tasks require kernel modules that are normally autoloaded when you perform that task for the first time; if you didn't perform that task at least once before using -localmodonfig, the latter will thus assume these modules are superfluous and +localmodconfig, the latter will thus assume these modules are superfluous and disable them. You can try to avoid this by performing typical tasks that often will autoload @@ -1070,7 +1070,7 @@ complicated, and harder to follow. That being said: this of course is a balancing act. Hence, if you think an additional use-case is worth describing, suggest it to the maintainers of this -document, as :ref:`described above <submit_improvements>`. +document, as :ref:`described above <submit_improvements_qbtl>`. .. diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst index 2fd5a030235a..9a847506f6ec 100644 --- a/Documentation/admin-guide/reporting-issues.rst +++ b/Documentation/admin-guide/reporting-issues.rst @@ -41,7 +41,7 @@ If you are facing multiple issues with the Linux kernel at once, report each separately. While writing your report, include all information relevant to the issue, like the kernel and the distro used. In case of a regression, CC the regressions mailing list (regressions@lists.linux.dev) to your report. Also try -to pin-point the culprit with a bisection; if you succeed, include its +to pinpoint the culprit with a bisection; if you succeed, include its commit-id and CC everyone in the sign-off-by chain. Once the report is out, answer any questions that come up and help where you @@ -206,7 +206,7 @@ Reporting issues only occurring in older kernel version lines This subsection is for you, if you tried the latest mainline kernel as outlined above, but failed to reproduce your issue there; at the same time you want to see the issue fixed in a still supported stable or longterm series or vendor -kernels regularly rebased on those. If that the case, follow these steps: +kernels regularly rebased on those. If that is the case, follow these steps: * Prepare yourself for the possibility that going through the next few steps might not get the issue solved in older releases: the fix might be too big @@ -312,7 +312,7 @@ small modifications to a kernel based on a recent Linux version; that for example often holds true for the mainline kernels shipped by Debian GNU/Linux Sid or Fedora Rawhide. Some developers will also accept reports about issues with kernels from distributions shipping the latest stable kernel, as long as -its only slightly modified; that for example is often the case for Arch Linux, +it's only slightly modified; that for example is often the case for Arch Linux, regular Fedora releases, and openSUSE Tumbleweed. But keep in mind, you better want to use a mainline Linux and avoid using a stable kernel for this process, as outlined in the section 'Install a fresh kernel for testing' in more diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst index a3dfc2c66e01..1609e7479249 100644 --- a/Documentation/admin-guide/serial-console.rst +++ b/Documentation/admin-guide/serial-console.rst @@ -78,7 +78,9 @@ If no console device is specified, the first device found capable of acting as a system console will be used. At this time, the system first looks for a VGA card and then for a serial port. So if you don't have a VGA card in your system the first serial port will automatically -become the console. +become the console, unless the kernel is configured with the +CONFIG_NULL_TTY_DEFAULT_CONSOLE option, then it will default to using the +ttynull device. You will need to create a new device to use ``/dev/console``. The official ``/dev/console`` is now character device 5,1. diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index f5ec6c9312e1..6c54718c9d04 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -41,7 +41,7 @@ pre-allocation or re-sizing of any kernel data structures. dentry-negative ---------------------------- -Policy for negative dentries. Set to 1 to to always delete the dentry when a +Policy for negative dentries. Set to 1 to always delete the dentry when a file is removed, and 0 to disable it. By default, this behavior is disabled. dentry-state @@ -347,3 +347,28 @@ filesystems: ``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for setting/getting the maximum number of pages that can be used for servicing requests in FUSE. + +``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for +setting/getting the default timeout (in seconds) for a fuse server to +reply to a kernel-issued request in the event where the server did not +specify a timeout at mount. If the server set a timeout, +then default_request_timeout will be ignored. The default +"default_request_timeout" is set to 0. 0 indicates no default timeout. +The maximum value that can be set is 65535. + +``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for +setting/getting the maximum timeout (in seconds) for a fuse server to +reply to a kernel-issued request. A value greater than 0 automatically opts +the server into a timeout that will be set to at most "max_request_timeout", +even if the server did not specify a timeout and default_request_timeout is +set to 0. If max_request_timeout is greater than 0 and the server set a timeout +greater than max_request_timeout or default_request_timeout is set to a value +greater than max_request_timeout, the system will use max_request_timeout as the +timeout. 0 indicates no max request timeout. The maximum value that can be set +is 65535. + +For timeouts, if the server does not respond to the request by the time +the set timeout elapses, then the connection to the fuse server will be aborted. +Please note that the timeouts are not 100% precise (eg you may set 60 seconds but +the timeout may kick in after 70 seconds). The upper margin of error for the +timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds. diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 7a85b6eb884e..dd49a89a62d3 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1555,6 +1555,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff). If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +timer_migration +=============== + +When set to a non-zero value, attempt to migrate timers away from idle cpus to +allow them to remain in low power states longer. + +Default is set (1). traceoff_on_warning =================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..9bef46151d53 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm: - compact_memory - compaction_proactiveness - compact_unevictable_allowed +- defrag_mode - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -74,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -130,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. @@ -145,6 +153,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due to compaction, which would block the task from becoming active until the fault is resolved. +defrag_mode +=========== + +When set to 1, the page allocator tries harder to avoid fragmentation +and maintain the ability to produce huge pages / higher-order pages. + +It is recommended to enable this right after boot, as fragmentation, +once it occurred, can be long-lasting or even permanent. dirty_background_bytes ====================== @@ -1008,19 +1024,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. + +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. watermark_boost_factor ====================== diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index a85b3384d1e7..9c7aa817adc7 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -49,26 +49,26 @@ How do I use the magic SysRq key? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On x86 - You press the key combo :kbd:`ALT-SysRq-<command key>`. + You press the key combo `ALT-SysRq-<command key>`. .. note:: Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot handle so many keys being pressed at the same time, so you might - have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, - release :kbd:`SysRq`, press :kbd:`<command key>`, release everything. + have better luck with press `Alt`, press `SysRq`, + release `SysRq`, press `<command key>`, release everything. On SPARC - You press :kbd:`ALT-STOP-<command key>`, I believe. + You press `ALT-STOP-<command key>`, I believe. On the serial console (PC style standard serial ports only) You send a ``BREAK``, then within 5 seconds a command key. Sending ``BREAK`` twice is interpreted as a normal BREAK. On PowerPC - Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`. - :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice. + Press `ALT - Print Screen` (or `F13`) - `<command key>`. + `Print Screen` (or `F13`) - `<command key>` may suffice. On other If you know of the key combos for other architectures, please @@ -88,7 +88,7 @@ On all echo _reisub > /proc/sysrq-trigger -The :kbd:`<command key>` is case sensitive. +The `<command key>` is case sensitive. What are the 'command' keys? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -225,9 +225,9 @@ Sometimes SysRq seems to get 'stuck' after using it, what can I do? When this happens, try tapping shift, alt and control on both sides of the keyboard, and hitting an invalid sysrq sequence again. (i.e., something like -:kbd:`alt-sysrq-z`). +`alt-sysrq-z`). -Switching to another virtual console (:kbd:`ALT+Fn`) and then back again +Switching to another virtual console (`ALT+Fn`) and then back again should also help. I hit SysRq, but nothing seems to happen, what's wrong? @@ -290,7 +290,7 @@ exception the header line from the sysrq command is passed to all console consumers as if the current loglevel was maximum. If only the header is emitted it is almost certain that the kernel loglevel is too low. Should you require the output on the console channel then you will need -to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or:: +to temporarily up the console loglevel using `alt-sysrq-8` or:: echo 8 > /proc/sysrq-trigger diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index 700aa72eecb1..a0cc017e4424 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted 16 _/X 65536 auxiliary taint, defined for and used by distros 17 _/T 131072 kernel was built with the struct randomization plugin 18 _/N 262144 an in-kernel test has been run + 19 _/J 524288 userspace used a mutating debug operation in fwctl === === ====== ======================================================== Note: The character ``_`` is representing a blank in this table to make reading @@ -184,3 +185,7 @@ More detailed explanation for tainting build time. 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. + + 19) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE + to use the devices debugging features. Device debugging features could + cause the device to malfunction in undefined ways. diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 2ed79f41a411..240fee618e06 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -28,7 +28,7 @@ should be a userspace tool that handles all the low-level details, keeps a database of the authorized devices and prompts users for new connections. More details about the sysfs interface for Thunderbolt devices can be -found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``. +found in Documentation/ABI/testing/sysfs-bus-thunderbolt. Those users who just want to connect any device without any sort of manual work can add following line to @@ -296,6 +296,39 @@ information is missing. To recover from this mode, one needs to flash a valid NVM image to the host controller in the same way it is done in the previous chapter. +Tunneling events +---------------- +The driver sends ``KOBJ_CHANGE`` events to userspace when there is a +tunneling change in the ``thunderbolt_domain``. The notification carries +following environment variables:: + + TUNNEL_EVENT=<EVENT> + TUNNEL_DETAILS=0:12 <-> 1:20 (USB3) + +Possible values for ``<EVENT>`` are: + + activated + The tunnel was activated (created). + + changed + There is a change in this tunnel. For example bandwidth allocation was + changed. + + deactivated + The tunnel was torn down. + + low bandwidth + The tunnel is not getting optimal bandwidth. + + insufficient bandwidth + There is not enough bandwidth for the current tunnel requirements. + +The ``TUNNEL_DETAILS`` is only provided if the tunnel is known. For +example, in case of Firmware Connection Manager this is missing or does +not provide full tunnel information. In case of Software Connection Manager +this includes full tunnel details. The format currently matches what the +driver uses when logging. This may change over time. + Networking over Thunderbolt cable --------------------------------- Thunderbolt technology allows software communication between two hosts diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index 6281eae9e6bc..d8946b084b1e 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -267,7 +267,7 @@ culprit might be known already. For further details on what actually qualifies as a regression check out Documentation/admin-guide/reporting-regressions.rst. If you run into any problems while following this guide or have ideas how to -improve it, :ref:`please let the kernel developers know <submit_improvements>`. +improve it, :ref:`please let the kernel developers know <submit_improvements_vbbr>`. .. _introprep_bissbs: @@ -1055,7 +1055,7 @@ follow these instructions. [:ref:`details <introoptional_bisref>`] -.. _submit_improvements: +.. _submit_improvements_vbbr: Conclusion ---------- @@ -1431,7 +1431,7 @@ can easily happen that your self-built kernels will lack modules for tasks you did not perform at least once before utilizing this make target. That happens when a task requires kernel modules which are only autoloaded when you execute it for the first time. So when you never performed that task since starting your -kernel the modules will not have been loaded -- and from localmodonfig's point +kernel the modules will not have been loaded -- and from localmodconfig's point of view look superfluous, which thus disables them to reduce the amount of code to be compiled. diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index b2e254ec8ee8..d6313890ee41 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -82,8 +82,8 @@ Install tools to build Linux kernel and tools in kernel repository. scripts/ver_linux is a good way to check if your system already has the necessary tools:: - sudo apt-get build-essentials flex bison yacc - sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + sudo apt-get install build-essential flex bison yacc + sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev cscope is a good tool to browse kernel sources. Let's install it now:: diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index b67772cf36d6..a18328a5fb93 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -124,6 +124,14 @@ When mounting an XFS filesystem, the following options are accepted. controls the size of each buffer and so is also relevant to this case. + lifetime (default) or nolifetime + Enable data placement based on write life time hints provided + by the user. This turns on co-allocation of data of similar + life times when statistically favorable to reduce garbage + collection cost. + + These options are only available for zoned rt file systems. + logbsize=value Set the size of each in-memory log buffer. The size may be specified in bytes, or in kilobytes with a "k" suffix. @@ -143,6 +151,25 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + max_atomic_write=value + Set the maximum size of an atomic write. The size may be + specified in bytes, in kilobytes with a "k" suffix, in megabytes + with a "m" suffix, or in gigabytes with a "g" suffix. The size + cannot be larger than the maximum write size, larger than the + size of any allocation group, or larger than the size of a + remapping operation that the log can complete atomically. + + The default value is to set the maximum I/O completion size + to allow each CPU to handle one at a time. + + max_open_zones=value + Specify the max number of zones to keep open for writing on a + zoned rt device. Many open zones aids file data separation + but may impact performance on HDDs. + + If ``max_open_zones`` is not specified, the value is determined + by the capabilities and the size of the zoned rt device. + noalign Data allocations will not be aligned at stripe unit boundaries. This is only relevant to filesystems created @@ -542,3 +569,24 @@ The interesting knobs for XFS workqueues are as follows: nice Relative priority of scheduling the threads. These are the same nice levels that can be applied to userspace processes. ============ =========== + +Zoned Filesystems +================= + +For zoned file systems, the following attributes are exposed in: + + /sys/fs/xfs/<dev>/zoned/ + + max_open_zones (Min: 1 Default: Varies Max: UINTMAX) + This read-only attribute exposes the maximum number of open zones + available for data placement. The value is determined at mount time and + is limited by the capabilities of the backing zoned device, file system + size and the max_open_zones mount option. + + zonegc_low_space (Min: 0 Default: 0 Max: 100) + Define a percentage for how much of the unused space that GC should keep + available for writing. A high value will reclaim more of the space + occupied by unused blocks, creating a larger buffer against write + bursts at the cost of increased write amplification. Regardless + of this value, garbage collection will always aim to free a minimum + amount of blocks to keep max_open_zones open for data placement purposes. |