diff options
Diffstat (limited to 'Documentation')
20 files changed, 368 insertions, 118 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index b35649a46a2f..bfa5b8288d8d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -25,12 +25,14 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or stops, respectively. Reading the file returns the keywords based on the current status. Writing 'commit' to this file makes the kdamond reads the user inputs in the sysfs files - except 'state' again. Writing 'update_schemes_stats' to the - file updates contents of schemes stats files of the kdamond. - Writing 'update_schemes_tried_regions' to the file updates - contents of 'tried_regions' directory of every scheme directory - of this kdamond. Writing 'update_schemes_tried_bytes' to the - file updates only '.../tried_regions/total_bytes' files of this + except 'state' again. Writing 'commit_schemes_quota_goals' to + this file makes the kdamond reads the quota goal files again. + Writing 'update_schemes_stats' to the file updates contents of + schemes stats files of the kdamond. Writing + 'update_schemes_tried_regions' to the file updates contents of + 'tried_regions' directory of every scheme directory of this + kdamond. Writing 'update_schemes_tried_bytes' to the file + updates only '.../tried_regions/total_bytes' files of this kdamond. Writing 'clear_schemes_tried_regions' to the file removes contents of the 'tried_regions' directory. @@ -212,6 +214,25 @@ Contact: SeongJae Park <sj@kernel.org> Description: Writing to and reading from this file sets and gets the quotas charge reset interval of the scheme in milliseconds. +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/nr_goals +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing a number 'N' to this file creates the number of + directories for setting automatic tuning of the scheme's + aggressiveness named '0' to 'N-1' under the goals/ directory. + +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the target + value of the goal metric. + +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/current_value +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the current + value of the goal metric. + What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil Date: Mar 2022 Contact: SeongJae Park <sj@kernel.org> diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index e4551579cb12..ee2b0030d416 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -328,7 +328,7 @@ as idle:: From now on, any pages on zram are idle pages. The idle mark will be removed until someone requests access of the block. IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be marked as idle based on how long (in seconds) it's been since they were last accessed:: diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 09e65312d20c..17e6e9565156 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1693,6 +1693,21 @@ PAGE_SIZE multiple when read back. limit, it will refuse to take any more stores before existing entries fault back in or are written out to disk. + memory.zswap.writeback + A read-write single value file. The default value is "1". The + initial value of the root cgroup is 1, and when a new cgroup is + created, it inherits the current value of its parent. + + When this is set to 0, all swapping attempts to swapping devices + are disabled. This included both zswap writebacks, and swapping due + to zswap store failures. If the zswap store failures are recurring + (for e.g if the pages are incompressible), users can observe + reclaim inefficiency after disabling writeback (because the same + pages might be rejected again and again). + + Note that this is subtly different from setting memory.swap.max to + 0, as it still allows for pages to be written to the zswap pool. + memory.pressure A read-only nested-keyed file. diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 78e4d2e7ba14..bced9e4b6e08 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -172,7 +172,7 @@ variables. Offset of the free_list's member. This value is used to compute the number of free pages. -Each zone has a free_area structure array called free_area[MAX_ORDER + 1]. +Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS]. The free_list represents a linked list of free page blocks. (list_head, next|prev) @@ -189,11 +189,11 @@ Offsets of the vmap_area's members. They carry vmalloc-specific information. Makedumpfile gets the start address of the vmalloc region from this. -(zone.free_area, MAX_ORDER + 1) -------------------------------- +(zone.free_area, NR_PAGE_ORDERS) +-------------------------------- Free areas descriptor. User-space tools use this value to iterate the -free_area ranges. MAX_ORDER is used by the zone buddy allocator. +free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator. prb --- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 65731b060e3f..8a01b8112f0b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -970,17 +970,17 @@ buddy allocator. Bigger value increase the probability of catching random memory corruption, but reduce the amount of memory for normal system use. The maximum - possible value is MAX_ORDER/2. Setting this parameter - to 1 or 2 should be enough to identify most random - memory corruption problems caused by bugs in kernel or - driver code when a CPU writes to (or reads from) a - random memory location. Note that there exists a class - of memory corruptions problems caused by buggy H/W or - F/W or by drivers badly programming DMA (basically when - memory is written at bus level and the CPU MMU is - bypassed) which are not detectable by - CONFIG_DEBUG_PAGEALLOC, hence this option will not help - tracking down these problems. + possible value is MAX_PAGE_ORDER/2. Setting this + parameter to 1 or 2 should be enough to identify most + random memory corruption problems caused by bugs in + kernel or driver code when a CPU writes to (or reads + from) a random memory location. Note that there exists + a class of memory corruptions problems caused by buggy + H/W or F/W or by drivers badly programming DMA + (basically when memory is written at bus level and the + CPU MMU is bypassed) which are not detectable by + CONFIG_DEBUG_PAGEALLOC, hence this option will not + help tracking down these problems. debug_pagealloc= [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter @@ -4136,7 +4136,7 @@ [KNL] Minimal page reporting order Format: <integer> Adjust the minimal page reporting order. The page - reporting is disabled when it exceeds MAX_ORDER. + reporting is disabled when it exceeds MAX_PAGE_ORDER. panic= [KNL] Kernel behaviour on panic: delay <timeout> timeout > 0: seconds before rebooting diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index da94feb97ed1..9d23144bf985 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -59,41 +59,47 @@ Files Hierarchy The files hierarchy of DAMON sysfs interface is shown below. In the below figure, parents-children relations are represented with indentations, each directory is having ``/`` suffix, and files in each directory are separated by -comma (","). :: - - /sys/kernel/mm/damon/admin - │ kdamonds/nr_kdamonds - │ │ 0/state,pid - │ │ │ contexts/nr_contexts - │ │ │ │ 0/avail_operations,operations - │ │ │ │ │ monitoring_attrs/ +comma (","). + +.. parsed-literal:: + + :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin + │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds + │ │ :ref:`0 <sysfs_kdamond>`/state,pid + │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts + │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations + │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us │ │ │ │ │ │ nr_regions/min,max - │ │ │ │ │ targets/nr_targets - │ │ │ │ │ │ 0/pid_target - │ │ │ │ │ │ │ regions/nr_regions - │ │ │ │ │ │ │ │ 0/start,end + │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets + │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target + │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions + │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... - │ │ │ │ │ schemes/nr_schemes - │ │ │ │ │ │ 0/action,apply_interval_us - │ │ │ │ │ │ │ access_pattern/ + │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes + │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us + │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max │ │ │ │ │ │ │ │ age/min,max - │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms + │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil - │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ filters/nr_filters + │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals + │ │ │ │ │ │ │ │ │ 0/target_value,current_value + │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low + │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds - │ │ │ │ │ │ │ tried_regions/total_bytes + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... │ │ ... +.. _sysfs_root: + Root ---- @@ -102,6 +108,8 @@ has one directory named ``admin``. The directory contains the files for privileged user space programs' control of DAMON. User space tools or daemons having the root permission could use this directory. +.. _sysfs_kdamonds: + kdamonds/ --------- @@ -113,6 +121,8 @@ details) exists. In the beginning, this directory has only one file, child directories named ``0`` to ``N-1``. Each directory represents each kdamond. +.. _sysfs_kdamond: + kdamonds/<N>/ ------------- @@ -120,29 +130,37 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or -``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be -in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the -user inputs in the sysfs files except ``state`` file again. Writing -``update_schemes_stats`` to ``state`` file updates the contents of stats files -for each DAMON-based operation scheme of the kdamond. For details of the -stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. - -Writing ``update_schemes_tried_regions`` to ``state`` file updates the -DAMON-based operation scheme action tried regions directory for each -DAMON-based operation scheme of the kdamond. Writing -``update_schemes_tried_bytes`` to ``state`` file updates only -``.../tried_regions/total_bytes`` files. Writing -``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based -operating scheme action tried regions directory for each DAMON-based operation -scheme of the kdamond. For details of the DAMON-based operation scheme action -tried regions directory, please refer to :ref:`tried_regions section -<sysfs_schemes_tried_regions>`. +``off`` if it is not running. + +Users can write below commands for the kdamond to the ``state`` file. + +- ``on``: Start running. +- ``off``: Stop running. +- ``commit``: Read the user inputs in the sysfs files except ``state`` file + again. +- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' + :ref:`quota goals <sysfs_schemes_quota_goals>`. +- ``update_schemes_stats``: Update the contents of stats files for each + DAMON-based operation scheme of the kdamond. For details of the stats, + please refer to :ref:`stats section <sysfs_schemes_stats>`. +- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme + action tried regions directory for each DAMON-based operation scheme of the + kdamond. For details of the DAMON-based operation scheme action tried + regions directory, please refer to + :ref:`tried_regions section <sysfs_schemes_tried_regions>`. +- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes`` + files. +- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme + action tried regions directory for each DAMON-based operation scheme of the + kdamond. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. +.. _sysfs_contexts: + kdamonds/<N>/contexts/ ---------------------- @@ -153,7 +171,7 @@ number (``N``) to the file creates the number of child directories named as details). At the moment, only one context per kdamond is supported, so only ``0`` or ``1`` can be written to the file. -.. _sysfs_contexts: +.. _sysfs_context: contexts/<N>/ ------------- @@ -203,6 +221,8 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _sysfs_targets: + contexts/<N>/targets/ --------------------- @@ -210,6 +230,8 @@ In the beginning, this directory has only one file, ``nr_targets``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each monitoring target. +.. _sysfs_target: + targets/<N>/ ------------ @@ -244,6 +266,8 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each initial monitoring target region. +.. _sysfs_region: + regions/<N>/ ------------ @@ -254,6 +278,8 @@ region by writing to and reading from the files, respectively. Each region should not overlap with others. ``end`` of directory ``N`` should be equal or smaller than ``start`` of directory ``N+1``. +.. _sysfs_schemes: + contexts/<N>/schemes/ --------------------- @@ -265,6 +291,8 @@ In the beginning, this directory has only one file, ``nr_schemes``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each DAMON-based operation scheme. +.. _sysfs_scheme: + schemes/<N>/ ------------ @@ -277,7 +305,7 @@ The ``action`` file is for setting and getting the scheme's :ref:`action from the file and their meaning are as below. Note that support of each action depends on the running DAMON operations set -:ref:`implementation <sysfs_contexts>`. +:ref:`implementation <sysfs_context>`. - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Supported by ``vaddr`` and ``fvaddr`` operations set. @@ -299,6 +327,8 @@ Note that support of each action depends on the running DAMON operations set The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval <damon_design_damos>` in microseconds. +.. _sysfs_access_pattern: + schemes/<N>/access_pattern/ --------------------------- @@ -312,6 +342,8 @@ to and reading from the ``min`` and ``max`` files under ``sz``, ``nr_accesses``, and ``age`` directories, respectively. Note that the ``min`` and the ``max`` form a closed interval. +.. _sysfs_quotas: + schemes/<N>/quotas/ ------------------- @@ -319,8 +351,7 @@ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given DAMON-based operation scheme. Under ``quotas`` directory, three files (``ms``, ``bytes``, -``reset_interval_ms``) and one directory (``weights``) having three files -(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist. +``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist. You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and ``reset interval`` in milliseconds by writing the values to the three files, @@ -330,11 +361,37 @@ apply the action to only up to ``bytes`` bytes of memory regions within the ``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the quota limits. -You can also set the :ref:`prioritization weights +Under ``weights`` directory, three files (``sz_permil``, +``nr_accesses_permil``, and ``age_permil``) exist. +You can set the :ref:`prioritization weights <damon_design_damos_quotas_prioritization>` for size, access frequency, and age in per-thousand unit by writing the values to the three files under the ``weights`` directory. +.. _sysfs_schemes_quota_goals: + +schemes/<N>/quotas/goals/ +------------------------- + +The directory for the :ref:`automatic quota tuning goals +<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation +scheme. + +In the beginning, this directory has only one file, ``nr_goals``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each goal and current achievement. +Among the multiple feedback, the best one is used. + +Each goal directory contains two files, namely ``target_value`` and +``current_value``. Users can set and get any number to those files to set the +feedback. User space main workload's latency or throughput, system metrics +like free memory ratio or memory pressure stall time (PSI) could be example +metrics for the values. Note that users should write +``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond +directory <sysfs_kdamond>` to pass the feedback to DAMON. + +.. _sysfs_watermarks: + schemes/<N>/watermarks/ ----------------------- @@ -354,6 +411,8 @@ as below. The ``interval`` should written in microseconds unit. +.. _sysfs_filters: + schemes/<N>/filters/ -------------------- @@ -394,7 +453,7 @@ pages of all memory cgroups except ``/having_care_already``.:: echo N > 1/matching Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation <sysfs_contexts>` is being used. +``paddr`` :ref:`implementation <sysfs_context>` is being used. Also, memory regions that are filtered out by ``addr`` or ``target`` filters are not counted as the scheme has tried to those, while regions that filtered @@ -449,6 +508,8 @@ and query-like efficient data access monitoring results retrievals. For the latter use case, in particular, users can set the ``action`` as ``stat`` and set the ``access pattern`` as their interested pattern that they want to query. +.. _sysfs_schemes_tried_region: + tried_regions/<N>/ ------------------ diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index e59231ac6bb7..a639cac12477 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -80,6 +80,9 @@ pages_to_scan how many pages to scan before ksmd goes to sleep e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + The pages_to_scan value cannot be changed if ``advisor_mode`` has + been set to scan-time. + Default: 100 (chosen for demonstration purposes) sleep_millisecs @@ -164,6 +167,29 @@ smart_scan optimization is enabled. The ``pages_skipped`` metric shows how effective the setting is. +advisor_mode + The ``advisor_mode`` selects the current advisor. Two modes are + supported: none and scan-time. The default is none. By setting + ``advisor_mode`` to scan-time, the scan time advisor is enabled. + The section about ``advisor`` explains in detail how the scan time + advisor works. + +adivsor_max_cpu + specifies the upper limit of the cpu percent usage of the ksmd + background thread. The default is 70. + +advisor_target_scan_time + specifies the target scan time in seconds to scan all the candidate + pages. The default value is 200 seconds. + +advisor_min_pages_to_scan + specifies the lower limit of the ``pages_to_scan`` parameter of the + scan time advisor. The default is 500. + +adivsor_max_pages_to_scan + specifies the upper limit of the ``pages_to_scan`` parameter of the + scan time advisor. The default is 30000. + The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: general_profit @@ -263,6 +289,35 @@ ksm_swpin_copy note that KSM page might be copied when swapping in because do_swap_page() cannot do all the locking needed to reconstitute a cross-anon_vma KSM page. +Advisor +======= + +The number of candidate pages for KSM is dynamic. It can be often observed +that during the startup of an application more candidate pages need to be +processed. Without an advisor the ``pages_to_scan`` parameter needs to be +sized for the maximum number of candidate pages. The scan time advisor can +changes the ``pages_to_scan`` parameter based on demand. + +The advisor can be enabled, so KSM can automatically adapt to changes in the +number of candidate pages to scan. Two advisors are implemented: none and +scan-time. With none, no advisor is enabled. The default is none. + +The scan time advisor changes the ``pages_to_scan`` parameter based on the +observed scan times. The possible values for the ``pages_to_scan`` parameter is +limited by the ``advisor_max_cpu`` parameter. In addition there is also the +``advisor_target_scan_time`` parameter. This parameter sets the target time to +scan all the KSM candidate pages. The parameter ``advisor_target_scan_time`` +decides how aggressive the scan time advisor scans candidate pages. Lower +values make the scan time advisor to scan more aggresively. This is the most +important parameter for the configuration of the scan time advisor. + +The initial value and the maximum value can be changed with +``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default +values are sufficient for most workloads and use cases. + +The ``pages_to_scan`` parameter is re-calculated after a scan has been completed. + + -- Izik Eidus, Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index fe17cf210426..f5f065c67615 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -253,6 +253,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b0cc8243e093..04eb45a2f940 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -45,10 +45,25 @@ components: the two is using hugepages just because of the fact the TLB miss is going to run faster. +Modern kernels support "multi-size THP" (mTHP), which introduces the +ability to allocate memory in blocks that are bigger than a base page +but smaller than traditional PMD-size (as described above), in +increments of a power-of-2 number of pages. mTHP can back anonymous +memory (for example 16K, 32K, 64K, etc). These THPs continue to be +PTE-mapped, but in many cases can still provide similar benefits to +those outlined above: Page faults are significantly reduced (by a +factor of e.g. 4, 8, 16, etc), but latency spikes are much less +prominent because the size of each page isn't as huge as the PMD-sized +variant and there is less memory to clear in each page fault. Some +architectures also employ TLB compression mechanisms to squeeze more +entries in when a set of PTEs are virtually and physically contiguous +and approporiately aligned. In this case, TLB misses will occur less +often. + THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. +collapses sequences of basic pages into PMD-sized huge pages. The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` interface and using madvise(2) and prctl(2) system calls. @@ -95,12 +110,40 @@ Global THP controls Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved with one of:: +system wide. This can be achieved per-supported-THP-size with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + +where <size> is the hugepage size being addressed, the available sizes +for which vary by system. + +For example:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled + +Alternatively it is possible to specify that a given hugepage size +will inherit the top-level "enabled" value:: + + echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + +For example:: + + echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled + +The top-level setting (for use with "inherit") can be set by issuing +one of the following commands:: echo always >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo never >/sys/kernel/mm/transparent_hugepage/enabled +By default, PMD-sized hugepages have enabled="inherit" and all other +hugepage sizes have enabled="never". If enabling multiple hugepage +sizes, the kernel will select the most appropriate enabled size for a +given allocation. + It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise regions or to never try to defrag memory and simply fallback to regular @@ -146,25 +189,34 @@ madvise never should be self-explanatory. -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: +By default kernel tries to use huge, PMD-mappable zero page on read +page fault to anonymous mapping. It's possible to disable huge zero +page by writing 0 or enable it back by writing 1:: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page -Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: +Some userspace (such as a test program, or an optimized memory +allocation library) may want to know the size (in bytes) of a +PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". +khugepaged will be automatically started when one or more hugepage +sizes are enabled (either by directly setting "always" or "madvise", +or by setting "inherit" while the top-level enabled is set to "always" +or "madvise"), and it'll be automatically shutdown when the last +hugepage size is disabled (either by directly setting "never", or by +setting "inherit" while the top-level enabled is set to "never"). Khugepaged controls ------------------- +.. note:: + khugepaged currently only searches for opportunities to collapse to + PMD-sized THP and no attempt is made to collapse to other THP + sizes. + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -282,19 +334,26 @@ force Need of application restart =========================== -The transparent_hugepage/enabled values and tmpfs mount option only affect -future behavior. So to make them effective you need to restart any -application that could have been using hugepages. This also applies to the -regions registered in khugepaged. +The transparent_hugepage/enabled and +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount +option only affect future behavior. So to make them effective you need +to restart any application that could have been using hugepages. This +also applies to the regions registered in khugepaged. Monitoring usage ================ -The number of anonymous transparent huge pages currently used by the +.. note:: + Currently the below counters only record events relating to + PMD-sized THP. Events relating to other THP sizes are not included. + +The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. +To identify what applications are using PMD-sized anonymous transparent huge +pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages +fields for each mapping. (Note that AnonHugePages only applies to traditional +PMD-sized THP for historical reasons and should have been called +AnonHugePmdMapped). The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. @@ -413,7 +472,7 @@ for huge pages. Optimizing the applications =========================== -To be guaranteed that the kernel will map a 2M page immediately in any +To be guaranteed that the kernel will map a THP immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 203e26da5f92..e5cc8848dcb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -113,6 +113,9 @@ events, except page fault notifications, may be generated: areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating support for shmem virtual memory areas. +- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an + existing page contents from userspace. + The userland application should set the feature flags it intends to use when invoking the ``UFFDIO_API`` ioctl, to request that those features be enabled if supported. diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 45b98390e938..b42132969e31 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -153,6 +153,26 @@ attribute, e. g.:: Setting this parameter to 100 will disable the hysteresis. +Some users cannot tolerate the swapping that comes with zswap store failures +and zswap writebacks. Swapping can be disabled entirely (without disabling +zswap itself) on a cgroup-basis as follows: + + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback + +Note that if the store failures are recurring (for e.g if the pages are +incompressible), users can observe reclaim inefficiency after disabling +writeback (because the same pages might be rejected again and again). + +When there is a sizable amount of cold memory residing in the zswap pool, it +can be advantageous to proactively write these cold pages to swap and reclaim +the memory for other use cases. By default, the zswap shrinker is disabled. +User can enable it as follows: + + echo Y > /sys/module/zswap/parameters/shrinker_enabled + +This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is +selected. + A debugfs interface is provided for various statistic about pool size, number of pages stored, same-value filled pages and various counters for the reasons pages are rejected. diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst index 96f3d5f076b5..ccdd1615cf97 100644 --- a/Documentation/core-api/maple_tree.rst +++ b/Documentation/core-api/maple_tree.rst @@ -81,6 +81,9 @@ section. Sometimes it is necessary to ensure the next call to store to a maple tree does not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. +You can use mtree_dup() to duplicate an entire maple tree. It is a more +efficient way than inserting all elements one by one into a new tree. + Finally, you can remove all entries from a maple tree by calling mtree_destroy(). If the maple tree entries are pointers, you may wish to free the entries first. @@ -112,6 +115,7 @@ Takes ma_lock internally: * mtree_insert() * mtree_insert_range() * mtree_erase() + * mtree_dup() * mtree_destroy() * mt_set_in_rcu() * mt_clear_in_rcu() diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 7be2900806c8..421daf837940 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -261,7 +261,7 @@ prototypes:: struct folio *src, enum migrate_mode); int (*launder_folio)(struct folio *); bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count); - int (*error_remove_page)(struct address_space *, struct page *); + int (*error_remove_folio)(struct address_space *, struct folio *); int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_deactivate)(struct file *); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); @@ -287,7 +287,7 @@ direct_IO: migrate_folio: yes (both) launder_folio: yes is_partially_uptodate: yes -error_remove_page: yes +error_remove_folio: yes swap_activate: no swap_deactivate: no swap_rw: yes, unlocks diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 49ef12df631b..104c6d047d9b 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. -It just shows the current status. +"THPeligible" indicates whether the mapping is eligible for allocating +naturally aligned THP pages of any currently enabled size. 1 if true, 0 +otherwise. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 99acc2e98673..dd99ce5912d8 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined: bool (*is_partially_uptodate) (struct folio *, size_t from, size_t count); void (*is_dirty_writeback)(struct folio *, bool *, bool *); - int (*error_remove_page) (struct mapping *mapping, struct page *page); + int (*error_remove_folio)(struct mapping *mapping, struct folio *); int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_deactivate)(struct file *); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); @@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined: VM if a folio should be treated as dirty or writeback for the purposes of stalling. -``error_remove_page`` - normally set to generic_error_remove_page if truncation is ok +``error_remove_folio`` + normally set to generic_error_remove_folio if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased. diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index c82e3ee20e51..2466d3363af7 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -18,8 +18,6 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_same | Tests whether both PTE entries are the same | +---------------------------+--------------------------------------------------+ -| pte_bad | Tests a non-table mapped PTE | -+---------------------------+--------------------------------------------------+ | pte_present | Tests a valid mapped PTE | +---------------------------+--------------------------------------------------+ | pte_young | Tests a young PTE | diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 1f7e0586b5fa..1bb69524a62e 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -5,6 +5,18 @@ Design ====== +.. _damon_design_execution_model_and_data_structures: + +Execution Model and Data Structures +=================================== + +The monitoring-related information including the monitoring request +specification and DAMON-based operation schemes are stored in a data structure +called DAMON ``context``. DAMON executes each context with a kernel thread +called ``kdamond``. Multiple kdamonds could run in parallel, for different +types of monitoring. + + Overall Architecture ==================== @@ -346,6 +358,19 @@ the weight will be respected are up to the underlying prioritization mechanism implementation. +.. _damon_design_damos_quotas_auto_tuning: + +Aim-oriented Feedback-driven Auto-tuning +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Automatic feedback-driven quota tuning. Instead of setting the absolute quota +value, users can repeatedly provide numbers representing how much of their goal +for the scheme is achieved as feedback. DAMOS then automatically tunes the +aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS +is under achieving the goal, DAMOS automatically increases the quota. If DAMOS +is over achieving the goal, it decreases the quota. + + .. _damon_design_damos_watermarks: Watermarks @@ -477,15 +502,3 @@ modules for proactive reclamation and LRU lists manipulation are provided. For more detail, please read the usage documents for those (:doc:`/admin-guide/mm/damon/reclaim` and :doc:`/admin-guide/mm/damon/lru_sort`). - - -.. _damon_design_execution_model_and_data_structures: - -Execution Model and Data Structures -=================================== - -The monitoring-related information including the monitoring request -specification and DAMON-based operation schemes are stored in a data structure -called DAMON ``context``. DAMON executes each context with a kernel thread -called ``kdamond``. Multiple kdamonds could run in parallel, for different -types of monitoring. diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..93c9239b9ebe 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -117,7 +117,7 @@ pages: - map/unmap of a PMD entry for the whole THP increment/decrement folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount + folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement @@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio() Unmapping part of THP (with munmap() or other way) is not going to free memory immediately. Instead, we detect that a subpage of THP is not in use -in page_remove_rmap() and queue the THP for splitting if memory pressure +in folio_remove_rmap_*() and queue the THP for splitting if memory pressure comes. Splitting will free up unused subpages. Splitting the page right away is not an option due to locking context in diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 67f1338440a5..b6a07a26b10d 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing. -For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). @@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages which had been Copied-On-Write from the file pages now being truncated. Mlocked pages can be munlocked and deleted in this way: like with munmap(), -for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index 30a3be3c48f3..dca15d15feaf 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -263,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second argument is "order" or a power of two number of pages, that is (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, order=2 ==> 16384 bytes, etc. The maximum size of a -region allocated by __get_free_pages is determined by the MAX_ORDER macro. More -precisely the limit can be calculated as:: +region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro. +More precisely the limit can be calculated as:: - PAGE_SIZE << MAX_ORDER + PAGE_SIZE << MAX_PAGE_ORDER In a i386 architecture PAGE_SIZE is 4096 bytes - In a 2.4/i386 kernel MAX_ORDER is 10 - In a 2.6/i386 kernel MAX_ORDER is 11 + In a 2.4/i386 kernel MAX_PAGE_ORDER is 10 + In a 2.6/i386 kernel MAX_PAGE_ORDER is 11 So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel respectively, with an i386 architecture. User space programs can include /usr/include/sys/user.h and -/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. +/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations. The pagesize can also be determined dynamically with the getpagesize (2) system call. @@ -324,7 +324,7 @@ Definitions: (see /proc/slabinfo) <pointer size> depends on the architecture -- ``sizeof(void *)`` <page size> depends on the architecture -- PAGE_SIZE or getpagesize (2) -<max-order> is the value defined with MAX_ORDER +<max-order> is the value defined with MAX_PAGE_ORDER <frame size> it's an upper bound of frame's capture size (more on this later) ============== ================================================================ |