summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/mm/transhuge.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide/mm/transhuge.rst')
-rw-r--r--Documentation/admin-guide/mm/transhuge.rst140
1 files changed, 111 insertions, 29 deletions
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index cfdd16a52e39..370fba113460 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -107,7 +107,7 @@ sysfs
Global THP controls
-------------------
-Transparent Hugepage Support for anonymous memory can be entirely disabled
+Transparent Hugepage Support for anonymous memory can be disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved per-supported-THP-size with one of::
@@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of::
where <size> is the hugepage size being addressed, the available sizes
for which vary by system.
+.. note:: Setting "never" in all sysfs THP controls does **not** disable
+ Transparent Huge Pages globally. This is because ``madvise(...,
+ MADV_COLLAPSE)`` ignores these settings and collapses ranges to
+ PMD-sized huge pages unconditionally.
+
For example::
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
@@ -187,7 +192,9 @@ madvise
behaviour.
never
- should be self-explanatory.
+ should be self-explanatory. Note that ``madvise(...,
+ MADV_COLLAPSE)`` can still cause transparent huge pages to be
+ obtained even if this mode is specified everywhere.
By default kernel tries to use huge, PMD-mappable zero page on read
page fault to anonymous mapping. It's possible to disable huge zero
@@ -303,7 +310,7 @@ control by passing the parameter ``transparent_hugepage=always`` or
kernel command line.
Alternatively, each supported anonymous THP size can be controlled by
-passing ``thp_anon=<size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state>``,
+passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
``never`` or ``inherit``.
@@ -326,38 +333,82 @@ PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER
is not defined within a valid ``thp_anon``, its policy will default to
``never``.
+Similarly to ``transparent_hugepage``, you can control the hugepage
+allocation policy for the internal shmem mount by using the kernel parameter
+``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the
+seven valid policies for shmem (``always``, ``within_size``, ``advise``,
+``never``, ``deny``, and ``force``).
+
+Similarly to ``transparent_hugepage_shmem``, you can control the default
+hugepage allocation policy for the tmpfs mount by using the kernel parameter
+``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
+four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
+``never``). The tmpfs mount default policy is ``never``.
+
+In the same manner as ``thp_anon`` controls each supported anonymous THP
+size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
+has the same format as ``thp_anon``, but also supports the policy
+``within_size``.
+
+``thp_shmem=`` may be specified multiple times to configure all THP sizes
+as required. If ``thp_shmem=`` is specified at least once, any shmem THP
+sizes not explicitly configured on the command line are implicitly set to
+``never``.
+
+``transparent_hugepage_shmem`` setting only affects the global toggle. If
+``thp_shmem`` is not specified, PMD_ORDER hugepage will default to
+``inherit``. However, if a valid ``thp_shmem`` setting is provided by the
+user, the PMD_ORDER hugepage policy will be overridden. If the policy for
+PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
+default to ``never``.
+
Hugepages in tmpfs/shmem
========================
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all available
+huge page sizes without any control over the exact sizes, behaving more like
+other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
never
- Do not allocate huge pages;
+ Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)``
+ can still cause transparent huge pages to be obtained even if this mode
+ is specified everywhere;
within_size
Only allocate huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate huge pages if requested with fadvise()/madvise();
+ Only allocate huge pages if requested with madvise();
-The default policy is ``never``.
+Remember, that the kernel may use huge pages of all available sizes, and
+that no fine control as for the internal tmpfs mount is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
For use in emergencies, to force the huge option off from
@@ -365,13 +416,24 @@ deny
force
Force the huge option on for all - very useful for testing;
-Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
-control mTHP allocation:
-'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
-and its value for each mTHP is essentially consistent with the global
-setting. An 'inherit' option is added to ensure compatibility with these
-global settings. Conversely, the options 'force' and 'deny' are dropped,
-which are rather testing artifacts from the old ages.
+shmem / internal tmpfs
+----------------------
+The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes, which
+are rather testing artifacts from the old ages.
always
Attempt to allocate <size> huge pages every time we need a new page;
@@ -381,14 +443,16 @@ inherit
have enabled="inherit" and all other hugepage sizes have enabled="never";
never
- Do not allocate <size> huge pages;
+ Do not allocate <size> huge pages. Note that ``madvise(...,
+ MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained
+ even if this mode is specified everywhere;
within_size
Only allocate <size> huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate <size> huge pages if requested with fadvise()/madvise();
+ Only allocate <size> huge pages if requested with madvise();
Need of application restart
===========================
@@ -413,7 +477,7 @@ AnonHugePmdMapped).
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
-is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
+is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.
Note that reading the smaps file is expensive and reading it
@@ -530,10 +594,28 @@ anon_fault_fallback_charge
instead falls back to using huge pages with lower orders or
small pages even though the allocation was successful.
-swpout
- is incremented every time a huge page is swapped out in one
+zswpout
+ is incremented every time a huge page is swapped out to zswap in one
piece without splitting.
+swpin
+ is incremented every time a huge page is swapped in from a non-zswap
+ swap device in one piece.
+
+swpin_fallback
+ is incremented if swapin fails to allocate or charge a huge page
+ and instead falls back to using huge pages with lower orders or
+ small pages.
+
+swpin_fallback_charge
+ is incremented if swapin fails to charge a huge page and instead
+ falls back to using huge pages with lower orders or small pages
+ even though the allocation was successful.
+
+swpout
+ is incremented every time a huge page is swapped out to a non-zswap
+ swap device in one piece without splitting.
+
swpout_fallback
is incremented if a huge page has to be split before swapout.
Usually because failed to allocate some continuous swap space