diff options
Diffstat (limited to 'Documentation/admin-guide/mm')
-rw-r--r-- | Documentation/admin-guide/mm/cma_debugfs.rst | 10 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/index.rst | 12 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/stat.rst | 69 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 137 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/hugetlbpage.rst | 10 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/index.rst | 2 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/kho.rst | 115 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/multigen_lru.rst | 5 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/pagemap.rst | 22 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/slab.rst | 469 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/transhuge.rst | 19 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/zswap.rst | 10 |
12 files changed, 824 insertions, 56 deletions
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 7367e6294ef6..4120e9cb0cd5 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -12,10 +12,16 @@ its CMA name like below: The structure of the files created under that directory is as follows: - - [RO] base_pfn: The base PFN (Page Frame Number) of the zone. + - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area. + This is the same as ranges/0/base_pfn. - [RO] count: Amount of memory in the CMA area. - [RO] order_per_bit: Order of pages represented by one bit. - - [RO] bitmap: The bitmap of page states in the zone. + - [RO] bitmap: The bitmap of allocated pages in the area. + This is the same as ranges/0/base_pfn. + - [RO] ranges/N/base_pfn: The base PFN of contiguous range N + in the CMA area. + - [RO] ranges/N/bitmap: The bit map of allocated pages in + range N in the CMA area. - [WO] alloc: Allocate N pages from that CMA area. For example:: echo 5 > <debugfs>/cma/<cma_name>/alloc diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..3ce3164480c7 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 @@ -15,3 +14,4 @@ optimize those. usage reclaim lru_sort + stat diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst new file mode 100644 index 000000000000..4c517c2c219a --- /dev/null +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -0,0 +1,69 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Data Access Monitoring Results Stat +=================================== + +Data Access Monitoring Results Stat (DAMON_STAT) is a static kernel module that +is aimed to be used for simple access pattern monitoring. It monitors accesses +on the system's entire physical memory using DAMON, and provides simplified +access monitoring results statistics, namely idle time percentiles and +estimated memory bandwidth. + +Monitoring Accuracy and Overhead +================================ + +DAMON_STAT uses monitoring intervals :ref:`auto-tuning +<damon_design_monitoring_intervals_autotuning>` to make its accuracy high and +overhead minimum. It auto-tunes the intervals aiming 4 % of observable access +events to be captured in each snapshot, while limiting the resulting sampling +events to be 5 milliseconds in minimum and 10 seconds in maximum. On a few +production server systems, it resulted in consuming only 0.x % single CPU time, +while capturing reasonable quality of access patterns. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_STAT=y``. The feature can be enabled by +default at build time, by setting ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` true. + +To let sysadmins enable or disable it at boot and/or runtime, and read the +monitoring results, DAMON_STAT provides module parameters. Following +sections are descriptions of the parameters. + +enabled +------- + +Enable or disable DAMON_STAT. + +You can enable DAMON_STAT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_STAT. The default value is set by +``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. + +estimated_memory_bandwidth +-------------------------- + +Estimated memory bandwidth consumption (bytes per second) of the system. + +DAMON_STAT reads observed access events on the current DAMON results snapshot +and converts it to memory bandwidth consumption estimation in bytes per second. +The resulting metric is exposed to user via this read-only parameter. Because +DAMON uses sampling, this is only an estimation of the access intensity rather +than accurate memory bandwidth. + +memory_idle_ms_percentiles +-------------------------- + +Per-byte idle time (milliseconds) percentiles of the system. + +DAMON_STAT calculates how long each byte of the memory was not accessed until +now (idle time), based on the current DAMON results snapshot. If DAMON found a +region of access frequency (nr_accesses) larger than zero, every byte of the +region gets zero idle time. If a region has zero access frequency +(nr_accesses), how long the region was keeping the zero access frequency (age) +becomes the idle time of every byte of the region. Then, DAMON_STAT exposes +the percentiles of the idle time values via this read-only parameter. Reading +the parameter returns 101 idle time values in milliseconds, separated by comma. +Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle +times. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 47a44bd348ab..ff3a2dda1f02 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -59,11 +59,12 @@ comma (","). :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds - │ │ :ref:`0 <sysfs_kdamond>`/state,pid + │ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us + │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us │ │ │ │ │ │ nr_regions/min,max │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target @@ -80,10 +81,12 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx + │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max + │ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests + │ │ │ │ │ │ │ │ 0/id,weight │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed @@ -120,8 +123,8 @@ kdamond. kdamonds/<N>/ ------------- -In each kdamond directory, two files (``state`` and ``pid``) and one directory -(``contexts``) exist. +In each kdamond directory, three files (``state``, ``pid`` and ``refresh_ms``) +and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or ``off`` if it is not running. @@ -132,6 +135,11 @@ Users can write below commands for the kdamond to the ``state`` file. - ``off``: Stop running. - ``commit``: Read the user inputs in the sysfs files except ``state`` file again. +- ``update_tuned_intervals``: Update the contents of ``sample_us`` and + ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling + interval`` and ``aggregation interval`` for the files. Please refer to + :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>` + for more details. - ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' :ref:`quota goals <sysfs_schemes_quota_goals>`. - ``update_schemes_stats``: Update the contents of stats files for each @@ -153,6 +161,13 @@ Users can write below commands for the kdamond to the ``state`` file. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. +Users can ask the kernel to periodically update files showing auto-tuned +parameters and DAMOS stats instead of manually writing +``update_tuned_intervals`` like keywords to ``state`` file. For this, users +should write the desired update time interval in milliseconds to ``refresh_ms`` +file. If the interval is zero, the periodic update is disabled. Reading the +file shows currently set time interval. + ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. @@ -213,6 +228,25 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _damon_usage_sysfs_monitoring_intervals_goal: + +contexts/<N>/monitoring_attrs/intervals/intervals_goal/ +------------------------------------------------------- + +Under the ``intervals`` directory, one directory for automated tuning of +``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists. +Under the directory, four files for the auto-tuning control, namely +``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist. +Please refer to the :ref:`design document of the feature +<damon_design_monitoring_intervals_autotuning>` for the internal of the tuning +mechanism. Reading and writing the four files under ``intervals_goal`` +directory shows and updates the tuning parameters that described in the +:ref:design doc <damon_design_monitoring_intervals_autotuning>` with the same +names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The +tuning-applied current values of the two intervals can be read from the +``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to +the ``state`` file. + .. _sysfs_targets: contexts/<N>/targets/ @@ -282,9 +316,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. schemes/<N>/ ------------ -In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files -(``action``, ``target_nid`` and ``apply_interval``) exist. +In each scheme directory, eight directories (``access_pattern``, ``quotas``, +``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``, +``stats``, and ``tried_regions``) and three files (``action``, ``target_nid`` +and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read @@ -364,11 +399,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. @@ -395,33 +430,43 @@ The ``interval`` should written in microseconds unit. .. _sysfs_filters: -schemes/<N>/filters/ --------------------- +schemes/<N>/{core\_,ops\_,}filters/ +----------------------------------- -The directory for the :ref:`filters <damon_design_damos_filters>` of the given +Directories for :ref:`filters <damon_design_damos_filters>` of the given DAMON-based operation scheme. -In the beginning, this directory has only one file, ``nr_filters``. Writing a +``core_filters`` and ``ops_filters`` directories are for the filters handled by +the DAMON core layer and operations set layer, respectively. ``filters`` +directory can be used for installing filters regardless of their handled +layers. Filters that requested by ``core_filters`` and ``ops_filters`` will be +installed before those of ``filters``. All three directories have same files. + +Use of ``filters`` directory can make expecting evaluation orders of given +filters with the files under directory bit confusing. Users are hence +recommended to use ``core_filters`` and ``ops_filters`` directories. The +``filters`` directory could be deprecated in future. + +In the beginning, the directory has only one file, ``nr_filters``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains seven files, namely ``type``, ``matching``, -``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. -To ``type`` file, you can write one of five special keywords: ``anon`` for -anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young -pages, ``addr`` for specific address range (an open-ended interval), or -``target`` for specific DAMON monitoring target filtering. Meaning of the -types are same to the description on the :ref:`design doc -<damon_design_damos_filters>`. - -In case of the memory cgroup filtering, you can specify the memory cgroup of -the interest by writing the path of the memory cgroup from the cgroups mount -point to ``memcg_path`` file. In case of the address range filtering, you can -specify the start and end address of the range to ``addr_start`` and -``addr_end`` files, respectively. For the DAMON monitoring target filtering, -you can specify the index of the target between the list of the DAMON context's -monitoring targets list to ``target_idx`` file. +Each filter directory contains nine files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc <damon_design_damos_filters>` for available type +names, their meaning and on what layer those are handled. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter is for memory that matches the ``type``. You can write ``Y`` or ``N`` to @@ -431,6 +476,7 @@ the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: + # cd ops_filters/0/ # echo 2 > nr_filters # # disallow anonymous pages echo anon > 0/type @@ -447,6 +493,29 @@ Refer to the :ref:`DAMOS filters design documentation of different ``allow`` works, when each of the filters are supported, and differences on stats. +.. _damon_sysfs_dests: + +schemes/<N>/dests/ +------------------ + +Directory for specifying the destinations of given DAMON-based operation +scheme's action. This directory is ignored if the action of the given scheme +is not supporting multiple destinations. Only ``DAMOS_MIGRATE_{HOT,COLD}`` +actions are supporting multiple destinations. + +In the beginning, the directory has only one file, ``nr_dests``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each action destination. + +Each destination directory contains two files, namely ``id`` and ``weight``. +Users can write and read the identifier of the destination to ``id`` file. +For ``DAMOS_MIGRATE_{HOT,COLD}`` actions, the migrate destination node's node +id should be written to ``id`` file. Users can write and read the weight of +the destination among the given destinations to the ``weight`` file. The +weight can be an arbitrary integer. When DAMOS apply the action to each entity +of the memory region, it will select the destination of the action based on the +relative weights of the destinations. + .. _sysfs_schemes_stats: schemes/<N>/stats/ diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index f34a0d798d5b..67a941903fd2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -145,7 +145,17 @@ hugepages It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. If the node number is invalid, the parameter will be ignored. +hugepage_alloc_threads + Specify the number of threads that should be used to allocate hugepages + during boot. This parameter can be used to improve system bootup time + when allocating a large amount of huge pages. + The default value is 25% of the available hardware threads. + Example to use 8 allocation threads:: + + hugepage_alloc_threads=8 + + Note that this parameter only applies to non-gigantic huge pages. default_hugepagesz Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..ebc83ca20fdc 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,8 +37,10 @@ the Linux memory management. numaperf pagemap shrinker_debugfs + slab soft-dirty swap_numa transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase <kho-finalization-phase>` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index caba0f52dd36..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -37,12 +38,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page page of the larger allocation + is *maybe* mapped in a different process. In some cases, a large allocation + might be treated as "maybe mapped by multiple processes" even though this + is no longer the case. + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. + times each page is mapped, indexed by PFN. Some kernel configurations do + not track the precise number of times a page part of a larger allocation + (e.g., THP) is mapped. In these configurations, the average number of + mappings per page in this larger allocation is returned instead. However, + if any page of the large allocation is mapped, the returned value will + be at least 1. The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. @@ -233,6 +250,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/slab.rst b/Documentation/admin-guide/mm/slab.rst new file mode 100644 index 000000000000..14429ab90611 --- /dev/null +++ b/Documentation/admin-guide/mm/slab.rst @@ -0,0 +1,469 @@ +======================================== +Short users guide for the slab allocator +======================================== + +The slab allocator includes full debugging support (when built with +CONFIG_SLUB_DEBUG=y) but it is off by default (unless built with +CONFIG_SLUB_DEBUG_ON=y). You can enable debugging only for selected +slabs in order to avoid an impact on overall system performance which +may make a bug more difficult to find. + +In order to switch debugging on one can add an option ``slab_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/mm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. F.e. no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. + +Some more sophisticated uses of slab_debug: +------------------------------------------- + +Parameters may be given to ``slab_debug``. If none is specified then full +debugging is enabled. Format: + +slab_debug=<Debug-Options> + Enable options for all slabs + +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) + +Multiple blocks of options for all slabs or selected slabs can be given, with +blocks of options delimited by ';'. The last of "all slabs" blocks is applied +to all slabs except those that match one of the "select slabs" block. Options +of the first "select slabs" blocks that matches the slab's name are applied. + +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Enable failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slab_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slab_debug=,dentry + +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs:: + + slab_debug=P,kmalloc-*,dentry + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slab_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher likelihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slab_debug=O + +You can apply different options to different list of slab names, using blocks +of options. This will enable red zoning for dentry and user tracking for +kmalloc. All other slabs will not get any debugging enabled:: + + slab_debug=Z,dentry;U,kmalloc-* + +You can also enable options (e.g. sanity checks and poisoning) for all caches +except some that are deemed too performance critical and don't need to be +debugged by specifying global debug options followed by a list of slab names +with "-" as options:: + + slab_debug=FZ;-,zs_handle,zspage + +The state of each debug option for a slab can be found in the respective files +under:: + + /sys/kernel/slab/<slab name>/ + +If the file contains 1, the option is enabled, 0 means disabled. The debug +options from the ``slab_debug`` parameter translate to the following files:: + + F sanity_checks + Z red_zone + P poison + U store_user + T trace + A failslab + +failslab file is writable, so writing 1 or 0 will enable or disable +the option at runtime. Write returns -EINVAL if cache is an alias. +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slab_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. +In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slab_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slab_min_order`` + specifies a minimum order of slabs. A similar effect like + ``slab_min_objects``. + +``slab_max_order`` + specified the order at which ``slab_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slab_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slab_max_order`` to 0, what cause minimum possible order of + slabs allocation. + +``slab_strict_numa`` + Enables the application of memory policies on each + allocation. This results in more accurate placement of + objects which may result in the reduction of accesses + to remote nodes. The default is to only apply memory + policies at the folio level when a new folio is acquired + or a folio is retrieved from the lists. Enabling this + option reduces the fastpath performance of the slab allocator. + +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Right Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005 + Redzone (0xc90f6d28): 00 cc cc cc . + Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slab_debug) then the following output will be dumped +into the syslog: + +1. Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slab_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slab_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slab_debug=F + +This will be generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component will +keep corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache + +I.e.:: + + slab_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slab_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and height + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + + +DebugFS files for SLUB +====================== + +For more information about current state of SLUB caches with the user tracking +debug option enabled, debugfs files are available, typically under +/sys/kernel/debug/slab/<cache>/ (created only for caches with enabled user +tracking). There are 2 types of these files with the following debug +information: + +1. alloc_traces:: + + Prints information about unique allocation traces of the currently + allocated objects. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. + + Example::: + + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + +2. free_traces:: + + Prints information about unique freeing traces of the currently allocated + objects. The freeing traces thus come from the previous life-cycle of the + objects and are reported as not available for objects allocated for the first + time. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, freeing function, minimal/average/maximal jiffies since free, + pid range of the freeing processes, cpu mask of freeing cpus, and stack trace. + + Example::: + + 1980 <not-available> age=4294912290 pid=0 cpus=0 + 51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1 + kfree+0x2db/0x420 + acpi_ut_update_ref_count+0x6a6/0x782 + acpi_ut_update_object_reference+0x1ad/0x234 + acpi_ut_remove_reference+0x7d/0x84 + acpi_rs_get_prt_method_data+0x97/0xd6 + acpi_get_irq_routing_table+0x82/0xc4 + acpi_pci_irq_find_prt_entry+0x8e/0x2e0 + acpi_pci_irq_lookup+0x3a/0x1e0 + acpi_pci_irq_enable+0x77/0x240 + pcibios_enable_device+0x39/0x40 + do_pci_enable_device.part.0+0x5d/0xe0 + pci_enable_device_flags+0xfc/0x120 + pci_enable_device+0x13/0x20 + virtio_pci_probe+0x9e/0x170 + local_pci_probe+0x48/0x80 + pci_device_probe+0x105/0x1c0 + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index dff8d5985f0f..370fba113460 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -107,7 +107,7 @@ sysfs Global THP controls ------------------- -Transparent Hugepage Support for anonymous memory can be entirely disabled +Transparent Hugepage Support for anonymous memory can be disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled system wide. This can be achieved per-supported-THP-size with one of:: @@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of:: where <size> is the hugepage size being addressed, the available sizes for which vary by system. +.. note:: Setting "never" in all sysfs THP controls does **not** disable + Transparent Huge Pages globally. This is because ``madvise(..., + MADV_COLLAPSE)`` ignores these settings and collapses ranges to + PMD-sized huge pages unconditionally. + For example:: echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled @@ -187,7 +192,9 @@ madvise behaviour. never - should be self-explanatory. + should be self-explanatory. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be + obtained even if this mode is specified everywhere. By default kernel tries to use huge, PMD-mappable zero page on read page fault to anonymous mapping. It's possible to disable huge zero @@ -378,7 +385,9 @@ always Attempt to allocate huge pages every time we need a new page; never - Do not allocate huge pages; + Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` + can still cause transparent huge pages to be obtained even if this mode + is specified everywhere; within_size Only allocate huge page if it will be fully within i_size. @@ -434,7 +443,9 @@ inherit have enabled="inherit" and all other hugepage sizes have enabled="never"; never - Do not allocate <size> huge pages; + Do not allocate <size> huge pages. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained + even if this mode is specified everywhere; within_size Only allocate <size> huge page if it will be fully within i_size. diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 3598dcd7dbe7..fd3370aa43fe 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed pages are freed. The pool is not preallocated. By default, a zpool of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs +e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: - echo zbud > /sys/module/zswap/parameters/zpool + echo zsmalloc > /sys/module/zswap/parameters/zpool -The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which -means the compression ratio will always be 2:1 or worse (because of half-full -zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. +The zsmalloc type zpool has a complex compressed page storage method, and it +can achieve great storage densities. When a swap page is passed from swapout to zswap, zswap maintains a mapping of the swap entry, a combination of the swap type and swap offset, to the zpool |