Diffstat (limited to 'Documentation/admin-guide/mm')
-rw-r--r-- Documentation/admin-guide/mm/cma_debugfs.rst 10
-rw-r--r-- Documentation/admin-guide/mm/damon/index.rst 11
-rw-r--r-- Documentation/admin-guide/mm/damon/usage.rst 99
-rw-r--r-- Documentation/admin-guide/mm/hugetlbpage.rst 10
-rw-r--r-- Documentation/admin-guide/mm/index.rst 1
-rw-r--r-- Documentation/admin-guide/mm/kho.rst 115
-rw-r--r-- Documentation/admin-guide/mm/multigen_lru.rst 5
-rw-r--r-- Documentation/admin-guide/mm/pagemap.rst 22
-rw-r--r-- Documentation/admin-guide/mm/zswap.rst 10
9 files changed, 234 insertions, 49 deletions
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 7367e6294ef6..4120e9cb0cd5 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -12,10 +12,16 @@ its CMA name like below:
The structure of the files created under that directory is as follows:
- - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area.
+ This is the same as ranges/0/base_pfn.
- [RO] count: Amount of memory in the CMA area.
- [RO] order_per_bit: Order of pages represented by one bit.
- - [RO] bitmap: The bitmap of page states in the zone.
+ - [RO] bitmap: The bitmap of allocated pages in the area.
+ This is the same as ranges/0/bitmap.
+ - [RO] ranges/N/base_pfn: The base PFN of contiguous range N
+ in the CMA area.
+ - [RO] ranges/N/bitmap: The bitmap of allocated pages in
+ range N in the CMA area.
- [WO] alloc: Allocate N pages from that CMA area. For example::
echo 5 > <debugfs>/cma/<cma_name>/alloc
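As a minimal sketch of how the per-range files described above might be inspected, the following shell function prints the base PFN of every range of one CMA area. The function name ``cma_list_ranges`` and its argument convention are illustrative assumptions. Only the ``ranges/N/base_pfn`` layout comes from the text above.

```shell
# Hypothetical helper (not a kernel tool): print "range N: base_pfn=<value>"
# for each range of a CMA area.
# $1: path to the CMA area's debugfs directory, e.g. <debugfs>/cma/<cma_name>
cma_list_ranges() {
    area_dir=$1
    for range_dir in "$area_dir"/ranges/*/; do
        # Skip the unexpanded glob when no range directory exists.
        [ -r "${range_dir}base_pfn" ] || continue
        n=$(basename "$range_dir")
        printf 'range %s: base_pfn=%s\n' "$n" "$(cat "${range_dir}base_pfn")"
    done
}
```

Taking the directory as an argument keeps the sketch independent of where debugfs is mounted.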
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
index 33d37bb2fb4e..bc7e976120e0 100644
--- a/Documentation/admin-guide/mm/damon/index.rst
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -1,12 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================
-DAMON: Data Access MONitor
-==========================
+================================================================
+DAMON: Data Access MONitoring and Access-aware System Operations
+================================================================
-:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
-Using DAMON, users can analyze the memory access patterns of their systems and
-optimize those.
+:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data
+access monitoring and access-aware system operations.
.. toctree::
:maxdepth: 2
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 47a44bd348ab..d960aba72b82 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -64,6 +64,7 @@ comma (",").
│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us
│ │ │ │ │ │ nr_regions/min,max
│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
@@ -80,10 +81,10 @@ comma (",").
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
- │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
+ │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
- │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx
+ │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
@@ -132,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``off``: Stop running.
- ``commit``: Read the user inputs in the sysfs files except ``state`` file
again.
+- ``update_tuned_intervals``: Update the ``sample_us`` and ``aggr_us`` files
+ of the kdamond with the auto-tuned sampling interval and aggregation
+ interval. Please refer to the
+ :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>`
+ for more details.
- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
:ref:`quota goals <sysfs_schemes_quota_goals>`.
- ``update_schemes_stats``: Update the contents of stats files for each
@@ -213,6 +219,25 @@ writing to and rading from the files.
For more details about the intervals and monitoring regions range, please refer
to the Design document (:doc:`/mm/damon/design`).
+.. _damon_usage_sysfs_monitoring_intervals_goal:
+
+contexts/<N>/monitoring_attrs/intervals/intervals_goal/
+-------------------------------------------------------
+
+Under the ``intervals`` directory, there is also an ``intervals_goal``
+directory for automated tuning of ``sample_us`` and ``aggr_us``. It contains
+four files for controlling the auto-tuning: ``access_bp``, ``aggrs``,
+``min_sample_us`` and ``max_sample_us``. Please refer to the :ref:`design
+document of the feature <damon_design_monitoring_intervals_autotuning>` for the
+internals of the tuning mechanism. Reading and writing the four files shows
+and updates the tuning parameters that are described in the :ref:`design doc
+<damon_design_monitoring_intervals_autotuning>` under the same names. The
+tuning starts with the user-set ``sample_us`` and ``aggr_us``. The
+tuning-applied current values of the two intervals can be read from the
+``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to
+the ``state`` file.
+
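The workflow described in this section could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names ``damon_set_intervals_goal`` and ``damon_show_tuned_intervals`` are hypothetical, the directories are passed in by the caller (on a real system they would typically live under ``/sys/kernel/mm/damon/admin/kdamonds/``), and suitable values for the four files depend on the workload.

```shell
# Hypothetical helper: write the four auto-tuning parameters of one context.
# $1: context directory, e.g. <kdamond dir>/contexts/0
# $2..$5: values for access_bp, aggrs, min_sample_us, max_sample_us
damon_set_intervals_goal() {
    goal_dir=$1/monitoring_attrs/intervals/intervals_goal
    echo "$2" > "$goal_dir/access_bp" &&
    echo "$3" > "$goal_dir/aggrs" &&
    echo "$4" > "$goal_dir/min_sample_us" &&
    echo "$5" > "$goal_dir/max_sample_us"
}

# Hypothetical helper: ask the kdamond to refresh sample_us and aggr_us with
# the tuned values, then print them.
# $1: kdamond directory (the one that contains the state file)
# $2: context index
damon_show_tuned_intervals() {
    intervals_dir=$1/contexts/$2/monitoring_attrs/intervals
    echo update_tuned_intervals > "$1/state" || return 1
    printf 'sample_us=%s aggr_us=%s\n' \
        "$(cat "$intervals_dir/sample_us")" "$(cat "$intervals_dir/aggr_us")"
}
```

For a kdamond that is already running, the ``commit`` command listed earlier would presumably be needed as well so that the newly written parameters are read.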
.. _sysfs_targets:
contexts/<N>/targets/
@@ -282,9 +307,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------
-In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
-(``action``, ``target_nid`` and ``apply_interval``) exist.
+In each scheme directory, eight directories (``access_pattern``, ``quotas``,
+``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and
+``tried_regions``) and three files (``action``, ``target_nid`` and
+``apply_interval``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
@@ -364,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each goal and current achievement.
Among the multiple feedback, the best one is used.
-Each goal directory contains three files, namely ``target_metric``,
-``target_value`` and ``current_value``. Users can set and get the three
-parameters for the quota auto-tuning goals that specified on the :ref:`design
-doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each
-of the files. Note that users should further write
+Each goal directory contains four files, namely ``target_metric``,
+``target_value``, ``current_value`` and ``nid``. Users can set and get the
+four parameters for the quota auto-tuning goals that are specified in the
+:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
+reading from each of the files. Note that users should further write
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
directory <sysfs_kdamond>` to pass the feedback to DAMON.
@@ -395,33 +421,43 @@ The ``interval`` should written in microseconds unit.
.. _sysfs_filters:
-schemes/<N>/filters/
---------------------
+schemes/<N>/{core\_,ops\_,}filters/
+-----------------------------------
-The directory for the :ref:`filters <damon_design_damos_filters>` of the given
+Directories for :ref:`filters <damon_design_damos_filters>` of the given
DAMON-based operation scheme.
-In the beginning, this directory has only one file, ``nr_filters``. Writing a
+The ``core_filters`` and ``ops_filters`` directories are for the filters
+handled by the DAMON core layer and the operations set layer, respectively.
+The ``filters`` directory can be used for installing filters regardless of the
+layer that handles them. Filters requested via ``core_filters`` and
+``ops_filters`` are installed before those of ``filters``. All three
+directories have the same files.
+
+Use of the ``filters`` directory can make the evaluation order of the filters
+set through its files a bit difficult to predict. Users are hence recommended
+to use the ``core_filters`` and ``ops_filters`` directories. The ``filters``
+directory could be deprecated in the future.
+
+In the beginning, each directory has only one file, ``nr_filters``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
-Each filter directory contains seven files, namely ``type``, ``matching``,
-``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``.
-To ``type`` file, you can write one of five special keywords: ``anon`` for
-anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young
-pages, ``addr`` for specific address range (an open-ended interval), or
-``target`` for specific DAMON monitoring target filtering. Meaning of the
-types are same to the description on the :ref:`design doc
-<damon_design_damos_filters>`.
-
-In case of the memory cgroup filtering, you can specify the memory cgroup of
-the interest by writing the path of the memory cgroup from the cgroups mount
-point to ``memcg_path`` file. In case of the address range filtering, you can
-specify the start and end address of the range to ``addr_start`` and
-``addr_end`` files, respectively. For the DAMON monitoring target filtering,
-you can specify the index of the target between the list of the DAMON context's
-monitoring targets list to ``target_idx`` file.
+Each filter directory contains nine files, namely ``type``, ``matching``,
+``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max``
+and ``target_idx``. You can write the type of the filter to the ``type`` file.
+Refer to :ref:`the design doc <damon_design_damos_filters>` for the available
+type names, their meanings, and the layer on which each is handled.
+
+For the ``memcg`` type, you can specify the memory cgroup of interest by
+writing the path of the memory cgroup from the cgroups mount point to the
+``memcg_path`` file. For the ``addr`` type, you can specify the start and end
+addresses of the range (open-ended interval) in the ``addr_start`` and
+``addr_end`` files, respectively. For the ``hugepage_size`` type, you can
+specify the minimum and maximum size of the range (closed interval) in the
+``min`` and ``max`` files, respectively. For the ``target`` type, you can
+specify the index of the target in the DAMON context's list of monitoring
+targets in the ``target_idx`` file.
You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
@@ -431,6 +467,7 @@ the ``type`` and ``matching`` should be allowed or not.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
+ # cd ops_filters/
# echo 2 > nr_filters
# # disallow anonymous pages
echo anon > 0/type
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b..67a941903fd2 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,17 @@ hugepages
It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
If the node number is invalid, the parameter will be ignored.
+hugepage_alloc_threads
+ Specify the number of threads that should be used to allocate hugepages
+ during boot. This parameter can be used to improve system bootup time
+ when allocating a large amount of huge pages.
+ The default value is 25% of the available hardware threads.
+ Example to use 8 allocation threads::
+
+ hugepage_alloc_threads=8
+
+ Note that this parameter only applies to non-gigantic huge pages.
default_hugepagesz
Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 8b35795b664b..2d2f6c222308 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -42,3 +42,4 @@ the Linux memory management.
transhuge
userfaultfd
zswap
+ kho
diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst
new file mode 100644
index 000000000000..6dc18ed4b886
--- /dev/null
+++ b/Documentation/admin-guide/mm/kho.rst
@@ -0,0 +1,115 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory
+regions, which could contain serialized system states, across kexec.
+
+This document expects that you are familiar with the base KHO
+:ref:`concepts <kho-concepts>`. If you have not read
+them yet, please do so now.
+
+Prerequisites
+=============
+
+KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER``
+set to y. Every KHO producer may have its own config option that you
+need to enable if you would like to preserve its state across kexec.
+
+To use KHO, please boot the kernel with the ``kho=on`` command line
+parameter. You may use the ``kho_scratch`` parameter to define the sizes of
+the scratch regions. For example, ``kho_scratch=16M,512M,256M`` will reserve a
+16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB
+per NUMA node scratch regions on boot.
+
+Perform a KHO kexec
+===================
+
+First, before you perform a KHO kexec, you need to move the system into
+the :ref:`KHO finalization phase <kho-finalization-phase>` ::
+
+ $ echo 1 > /sys/kernel/debug/kho/out/finalize
+
+After this command, the KHO FDT is available in
+``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register
+their own preserved sub FDTs under
+``/sys/kernel/debug/kho/out/sub_fdts/``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+ # kexec -l /path/to/bzImage --initrd /path/to/initrd -s
+ # kexec -e
+
+The new kernel will boot up and contain some of the previous kernel's state.
+
+For example, if you used the ``reserve_mem`` command line parameter to create
+an early memory reservation, the new kernel will have that memory at the
+same physical address as the old kernel.
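The finalization step above lends itself to a small guard before the kexec is attempted. The following is a sketch only: ``kho_finalize`` is a hypothetical helper name, and the optional argument for the debugfs mount point is an assumption added so that the function can be exercised outside a live system.

```shell
# Hypothetical helper: enter the KHO finalization phase and confirm that the
# KHO FDT became available.
# $1: debugfs mount point (optional, defaults to /sys/kernel/debug)
kho_finalize() {
    kho_out=${1:-/sys/kernel/debug}/kho/out
    echo 1 > "$kho_out/finalize" || return 1
    if [ ! -e "$kho_out/fdt" ]; then
        echo "KHO FDT missing after finalize" >&2
        return 1
    fi
    echo "KHO finalized; FDT at $kho_out/fdt"
}
```

On success, the ``kexec -l ... -s`` and ``kexec -e`` commands shown above would follow. On failure, the caller can stop before loading the new kernel.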
+
+Abort a KHO kexec
+=================
+
+You can move the system out of the KHO finalization phase again by calling ::
+
+ $ echo 0 > /sys/kernel/debug/kho/out/finalize
+
+After this command, the KHO FDT is no longer available in
+``/sys/kernel/debug/kho/out/fdt``.
+
+debugfs Interfaces
+==================
+
+Currently KHO creates the following debugfs interfaces. Notice that these
+interfaces may change in the future. They will be moved to sysfs once KHO is
+stabilized.
+
+``/sys/kernel/debug/kho/out/finalize``
+ Kexec HandOver (KHO) allows Linux to transition the state of
+ compatible drivers into the next kexec'ed kernel. To do so,
+ device drivers will instruct KHO to preserve memory regions,
+ which could contain serialized kernel state.
+ While the state is serialized, the drivers are unable to
+ modify any state that was serialized, such as
+ handed over memory allocations.
+
+ When this file contains "1", the system is in the transition
+ state. When it contains "0", it is not. To switch between the
+ two states, echo the respective number into this file.
+
+``/sys/kernel/debug/kho/out/fdt``
+ When the KHO state tree is finalized, the kernel exposes the
+ flattened device tree blob that carries its current KHO
+ state in this file. Kexec user space tooling can use this
+ as the input file for the KHO payload image.
+
+``/sys/kernel/debug/kho/out/scratch_len``
+ Lengths of KHO scratch regions, which are physically contiguous
+ memory regions that will always stay available for future kexec
+ allocations. Kexec user space tools can use this file to determine
+ where they should place their payload images.
+
+``/sys/kernel/debug/kho/out/scratch_phys``
+ Physical locations of KHO scratch regions. Kexec user space tools
+ can use this file in conjunction with scratch_len to determine where
+ they should place their payload images.
+
+``/sys/kernel/debug/kho/out/sub_fdts/``
+ In the KHO finalization phase, KHO producers register their own
+ FDT blob under this directory.
+
+``/sys/kernel/debug/kho/in/fdt``
+ When the kernel was booted with Kexec HandOver (KHO),
+ the state tree that carries metadata about the previous
+ kernel's state is in this file in the format of a flattened
+ device tree. This file may disappear when all of its consumers
+ have finished interpreting their metadata.
+
+``/sys/kernel/debug/kho/in/sub_fdts/``
+ Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs
+ of KHO producers passed from the old kernel.
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..9cb54b4ff5d9 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``.
``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
-overrides the default value in ``/proc/sys/vm/swappiness``.
-``nr_to_reclaim`` limits the number of pages to evict.
+overrides the default value in ``/proc/sys/vm/swappiness``. Valid values are
+0-200 and ``max``, where ``max`` restricts reclamation to anonymous memory
+only. ``nr_to_reclaim`` limits the number of pages to evict.
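A script that builds eviction commands could check the ``swappiness`` argument against the range given above before writing anything. The helper below is a sketch. Its name, ``lru_gen_valid_swappiness``, is hypothetical, and it only mirrors the documented range rather than the kernel's own parsing.

```shell
# Hypothetical helper: succeed if $1 is an acceptable swappiness argument,
# i.e. an integer in 0..200 or the literal string "max".
lru_gen_valid_swappiness() {
    case $1 in
        max) return 0 ;;
        # Reject the empty string and anything containing a non-digit.
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 200 ]
}
```

The function reports its result through the exit status only, so it can be used directly in an ``if`` condition.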
A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36..e60e9211fd9b 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -21,7 +21,8 @@ There are four components to pagemap:
* Bit 56 page exclusively mapped (since 4.2)
* Bit 57 pte is uffd-wp write-protected (since 5.13) (see
Documentation/admin-guide/mm/userfaultfd.rst)
- * Bits 58-60 zero
+ * Bit 58 pte is a guard region (since 6.15) (see the madvise(2) man page)
+ * Bits 59-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
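To illustrate the flag bits listed in this hunk, including the new bit 58, the sketch below decodes one 64-bit pagemap entry with shell arithmetic. The function name ``pagemap_flags`` and the short flag names it prints are illustrative assumptions. It also assumes a shell with 64-bit arithmetic, such as bash, and leaves reading the 8-byte entries from ``/proc/pid/pagemap`` to the caller.

```shell
# Hypothetical helper: print the names of the flag bits set in one 64-bit
# pagemap entry, separated by spaces.
# $1: the entry as a decimal or 0x-prefixed hexadecimal number
pagemap_flags() {
    entry=$(( $1 ))
    flags=
    for spec in 56:exclusive 57:uffd-wp 58:guard 61:file-or-shared-anon \
                62:swapped 63:present; do
        bit=${spec%%:*}     # bit position before the colon
        name=${spec#*:}     # flag name after the colon
        if [ $(( (entry >> bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    # Drop the leading separator before printing.
    echo "${flags# }"
}
```

Keeping the bit positions in a single list makes it easy to compare the sketch against the bit table.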
@@ -37,12 +38,28 @@ There are four components to pagemap:
precisely which pages are mapped (or in swap) and comparing mapped
pages between processes.
+ Traditionally, bit 56 indicates that a page is mapped exactly once and bit
+ 56 is clear when a page is mapped multiple times, even when mapped in the
+ same process multiple times. In some kernel configurations, the semantics
+ for pages that are part of a larger allocation (e.g., THP) can differ: bit 56
+ is set if all pages of the corresponding large allocation are *certainly*
+ mapped in the same process, even if the page is mapped multiple times in that
+ process. Bit 56 is clear when any page of the larger allocation
+ is *maybe* mapped in a different process. In some cases, a large allocation
+ might be treated as "maybe mapped by multiple processes" even though this
+ is no longer the case.
+
Efficient users of this interface will use ``/proc/pid/maps`` to
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
- times each page is mapped, indexed by PFN.
+ times each page is mapped, indexed by PFN. Some kernel configurations do
+ not track the precise number of times a page that is part of a larger
+ allocation (e.g., THP) is mapped. In these configurations, the average
+ number of mappings per page in this larger allocation is returned instead.
+ However, if any page of the large allocation is mapped, the returned value
+ will be at least 1.
The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
@@ -233,6 +250,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
+- ``PAGE_IS_GUARD`` - Page is a part of a guard region
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 3598dcd7dbe7..fd3370aa43fe 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed
pages are freed. The pool is not preallocated. By default, a zpool
of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
but it can be overridden at boot time by setting the ``zpool`` attribute,
-e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs
``zpool`` attribute, e.g.::
- echo zbud > /sys/module/zswap/parameters/zpool
+ echo zsmalloc > /sys/module/zswap/parameters/zpool
-The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
-means the compression ratio will always be 2:1 or worse (because of half-full
-zbud pages). The zsmalloc type zpool has a more complex compressed page
-storage method, and it can achieve greater storage densities.
+The zsmalloc type zpool has a complex compressed page storage method, and it
+can achieve great storage densities.
When a swap page is passed from swapout to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool