author     Linus Torvalds <torvalds@linux-foundation.org>  2022-10-11 03:53:04 +0300
committer  Linus Torvalds <torvalds@linux-foundation.org>  2022-10-11 03:53:04 +0300
commit     27bc50fc90647bbf7b734c3fc306a5e61350da53 (patch)
tree       75fc525fbfec8c07a97a7875a89592317bcad4ca /Documentation
parent     70442fc54e6889a2a77f0e9554e8188a1557f00e (diff)
parent     bbff39cc6cbcb86ccfacb2dcafc79912a9f9df69 (diff)
download   linux-27bc50fc90647bbf7b734c3fc306a5e61350da53.tar.xz
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. A non-overlapping range-based
tree for vmas. It is apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially converted to maple trees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect use-of-uninitialized bugs
down to the single-bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added the ability for a device driver to alter the memory
tiering promotion paths, for optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěna.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added additional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
Diffstat (limited to 'Documentation')
25 files changed, 1136 insertions, 46 deletions
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers new file mode 100644 index 000000000000..45985e411f13 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers @@ -0,0 +1,25 @@ +What: /sys/devices/virtual/memory_tiering/ +Date: August 2022 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: A collection of all the memory tiers allocated. + + Individual memory tier details are contained in subdirectories + named by the abstract distance of the memory tier. + + /sys/devices/virtual/memory_tiering/memory_tierN/ + + +What: /sys/devices/virtual/memory_tiering/memory_tierN/ + /sys/devices/virtual/memory_tiering/memory_tierN/nodes +Date: August 2022 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Directory with details of a specific memory tier + + This is the directory containing information about a particular + memory tier, memtierN, where N is derived based on abstract distance. + + A smaller value of N implies a higher (faster) memory tier in the + hierarchy. + + nodes: NUMA nodes that are part of this memory tier. + diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst index 241d1a87f2cd..7103b62ba6d7 100644 --- a/Documentation/accounting/delay-accounting.rst +++ b/Documentation/accounting/delay-accounting.rst @@ -13,7 +13,7 @@ a) waiting for a CPU (while being runnable) b) completion of synchronous block I/O initiated by the task c) swapping in pages d) memory reclaim -e) thrashing page cache +e) thrashing f) direct compact g) write-protect copy diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 2cc502a75ef6..5b86245450bd 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. -2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) +2.7 Kernel Memory Extension ----------------------------------------------- With the Kernel memory extension, the Memory Controller is able to limit @@ -386,8 +386,6 @@ U != 0, K >= U: a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG -c. Enable CONFIG_MEMCG_SWAP (to use swap extension) -d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) ------------------------------------------------------------------- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1b3565b61b65..69b1533c1f02 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1469,6 +1469,14 @@ Permit 'security.evm' to be updated regardless of current integrity status. + early_page_ext [KNL] Enforces page_ext initialization to earlier + stages so cover more early boot allocations. + Please note that as side effect some optimizations + might be disabled to achieve that (e.g. parallelized + memory initialization is disabled) so the boot process + might take longer, especially on systems with a lot of + memory. Available with CONFIG_PAGE_EXTENSION=y. + failslab= fail_usercopy= fail_page_alloc= @@ -6041,12 +6049,6 @@ This parameter controls use of the Protected Execution Facility on pSeries. 
- swapaccount= [KNL] - Format: [0|1] - Enable accounting of swap in memory resource - controller if no parameter or 1 is given or disable - it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst) - swiotlb= [ARM,IA-64,PPC,MIPS,X86] Format: { <int> [,<int>] | force | noforce } <int> -- Number of I/O TLB slabs diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 4e06ffabd78a..7367e6294ef6 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -5,10 +5,10 @@ CMA Debugfs Interface The CMA debugfs interface is useful to retrieve basic information out of the different CMA areas and to test allocation/release in each of the areas. -Each CMA zone represents a directory under <debugfs>/cma/, indexed by the -kernel's CMA index. So the first CMA zone would be: +Each CMA area represents a directory under <debugfs>/cma/, represented by +its CMA name like below: - <debugfs>/cma/cma-0 + <debugfs>/cma/<cma_name> The structure of the files created under that directory is as follows: @@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows: - [RO] bitmap: The bitmap of page states in the zone. - [WO] alloc: Allocate N pages from that CMA area. For example:: - echo 5 > <debugfs>/cma/cma-2/alloc + echo 5 > <debugfs>/cma/<cma_name>/alloc -would try to allocate 5 pages from the cma-2 area. +would try to allocate 5 pages from the 'cma_name' area. - [WO] free: Free N pages from that CMA area, similar to the above. diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 05500042f777..33d37bb2fb4e 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -======================== -Monitoring Data Accesses -======================== +========================== +DAMON: Data Access MONitor +========================== :doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. Using DAMON, users can analyze the memory access patterns of their systems and diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 4d5ca2c46288..9f88afc734da 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at https://github.com/awslabs/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. -Because DAMO is using the debugfs interface (refer to :doc:`usage` for the -detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as -below:: - - # mount -t debugfs none /sys/kernel/debug/ - -or append the following line to your ``/etc/fstab`` file so that your system -can automatically mount debugfs upon booting:: - - debugfs /sys/kernel/debug debugfs defaults 0 0 +Because DAMO is using the sysfs interface (refer to :doc:`usage` for the +detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is +mounted. Recording Data Access Patterns diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ca91ecc29078..b47b0cbbd491 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -393,6 +393,11 @@ the files as above. Above is only for an example. 
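As a rough illustration of the renamed per-area CMA debugfs layout described
above (not something from this series), the sketch below writes to the alloc
and free files; the area name "reserved" is a placeholder for whatever appears
under /sys/kernel/debug/cma/ on the target system, and the standard debugfs
mount point is assumed::

    /*
     * Illustrative only: allocate and free five pages from a CMA area
     * named "reserved" via the per-area debugfs files.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void cma_write(const char *file, const char *val)
    {
        char path[128];
        int fd;

        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/cma/reserved/%s", file);
        fd = open(path, O_WRONLY);
        if (fd < 0) {
            perror(path);
            return;
        }
        if (write(fd, val, strlen(val)) < 0)
            perror("write");
        close(fd);
    }

    int main(void)
    {
        cma_write("alloc", "5");    /* try to allocate 5 pages */
        cma_write("free", "5");     /* and release 5 pages again */
        return 0;
    }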
debugfs Interface ================= +.. note:: + + DAMON debugfs interface will be removed after next LTS kernel is released, so + users should move to the :ref:`sysfs interface <sysfs_interface>`. + DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1bd11118dfb1..d1064e0ba34a 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + multigen_lru nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index b244f0202a03..fb6ba2002a4b 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must be increased accordingly. +Monitoring KSM profit +===================== + +KSM can save memory by merging identical pages, but also can consume +additional memory, because it needs to generate a number of rmap_items to +save each scanned page's brief rmap information. Some of these pages may +be merged, but some may not be abled to be merged after being checked +several times, which are unprofitable memory consumed. + +1) How to determine whether KSM save memory or consume memory in system-wide + range? Here is a simple approximate calculation for reference:: + + general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * + sizeof(rmap_item); + + where all_rmap_items can be easily obtained by summing ``pages_sharing``, + ``pages_shared``, ``pages_unshared`` and ``pages_volatile``. + +2) The KSM profit inner a single process can be similarly obtained by the + following approximate calculation:: + + process_profit =~ ksm_merging_pages * sizeof(page) - + ksm_rmap_items * sizeof(rmap_item). + + where ksm_merging_pages is shown under the directory ``/proc/<pid>/``, + and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. + +From the perspective of application, a high ratio of ``ksm_rmap_items`` to +``ksm_merging_pages`` means a bad madvise-applied policy, so developers or +administrators have to rethink how to change madvise policy. Giving an example +for reference, a page's size is usually 4K, and the rmap_item's size is +separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture. +so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU +or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped, +because the ksm profit is approximately zero or negative. + Monitoring KSM events ===================== diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst new file mode 100644 index 000000000000..33e068830497 --- /dev/null +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -0,0 +1,162 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. 
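Returning to the KSM profit formulas above: they translate directly into a few
sysfs reads. Below is a hedged userspace sketch of the system-wide estimate;
the 64-byte rmap_item size is the 64-bit rule of thumb quoted in the text, not
a value exported by the kernel, and error handling is minimal::

    /*
     * Illustrative only: approximate the system-wide KSM profit,
     *   general_profit ~= pages_sharing * page_size
     *                     - all_rmap_items * sizeof(rmap_item),
     * using the counters under /sys/kernel/mm/ksm/.
     */
    #include <stdio.h>
    #include <unistd.h>

    static long ksm_counter(const char *name)
    {
        char path[128];
        long val = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
        f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = 0;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        long rmap_item_size = 64;   /* ~32 bytes on 32-bit kernels */
        long sharing = ksm_counter("pages_sharing");
        long all_rmap_items = sharing + ksm_counter("pages_shared") +
                              ksm_counter("pages_unshared") +
                              ksm_counter("pages_volatile");

        printf("approx. general_profit: %ld bytes\n",
               sharing * page_size - all_rmap_items * rmap_item_size);
        return 0;
    }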
It directly impacts the kswapd CPU usage and RAM efficiency. + +Quick start +=========== +Build the kernel with the following configurations. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` + +All set! + +Runtime options +=============== +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the +following subsections. + +Kill switch +----------- +``enabled`` accepts different values to enable or disable the +following components. Its default value depends on +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled +unless some of them have unforeseen side effects. Writing to +``enabled`` has no effect when a component is not supported by the +hardware, and valid values will be accepted even when the main switch +is off. + +====== =============================================================== +Values Components +====== =============================================================== +0x0001 The main switch for the multi-gen LRU. +0x0002 Clearing the accessed bit in leaf page table entries in large + batches, when MMU sets it (e.g., on x86). This behavior can + theoretically worsen lock contention (mmap_lock). If it is + disabled, the multi-gen LRU will suffer a minor performance + degradation for workloads that contiguously map hot pages, + whose accessed bits can be otherwise cleared by fewer larger + batches. +0x0004 Clearing the accessed bit in non-leaf page table entries as + well, when MMU sets it (e.g., on x86). This behavior was not + verified on x86 varieties other than Intel and AMD. If it is + disabled, the multi-gen LRU will suffer a negligible + performance degradation. +[yYnN] Apply to all the components above. +====== =============================================================== + +E.g., +:: + + echo y >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0007 + echo 5 >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0005 + +Thrashing prevention +-------------------- +Personal computers are more sensitive to thrashing because it can +cause janks (lags when rendering UI) and negatively impact user +experience. The multi-gen LRU offers thrashing prevention to the +majority of laptop and desktop users who do not have ``oomd``. + +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of +``N`` milliseconds from getting evicted. The OOM killer is triggered +if this working set cannot be kept in memory. In other words, this +option works as an adjustable pressure relief valve, and when open, it +terminates applications that are hopefully not being used. + +Based on the average human detectable lag (~100ms), ``N=1000`` usually +eliminates intolerable janks due to thrashing. Larger values like +``N=3000`` make janks less noticeable at the risk of premature OOM +kills. + +The default value ``0`` means disabled. + +Experimental features +===================== +``/sys/kernel/debug/lru_gen`` accepts commands described in the +following subsections. Multiple command lines are supported, so does +concatenation with delimiters ``,`` and ``;``. + +``/sys/kernel/debug/lru_gen_full`` provides additional stats for +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from +evicted generations in this file. + +Working set estimation +---------------------- +Working set estimation measures how much memory an application needs +in a given time interval, and it is usually done with little impact on +the performance of the application. 
E.g., data centers want to +optimize job scheduling (bin packing) to improve memory utilizations. +When a new job comes in, the job scheduler needs to find out whether +each server it manages can allocate a certain amount of memory for +this new job before it can pick a candidate. To do so, the job +scheduler needs to estimate the working sets of the existing jobs. + +When it is read, ``lru_gen`` returns a histogram of numbers of pages +accessed over different time intervals for each memcg and node. +``MAX_NR_GENS`` decides the number of bins for each histogram. The +histograms are noncumulative. +:: + + memcg memcg_id memcg_path + node node_id + min_gen_nr age_in_ms nr_anon_pages nr_file_pages + ... + max_gen_nr age_in_ms nr_anon_pages nr_file_pages + +Each bin contains an estimated number of pages that have been accessed +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of +the former is the largest and that of the latter is the smallest. + +Users can write the following command to ``lru_gen`` to create a new +generation ``max_gen_nr+1``: + + ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` + +``can_swap`` defaults to the swap setting and, if it is set to ``1``, +it forces the scan of anon pages when swap is off, and vice versa. +``force_scan`` defaults to ``1`` and, if it is set to ``0``, it +employs heuristics to reduce the overhead, which is likely to reduce +the coverage as well. + +A typical use case is that a job scheduler runs this command at a +certain time interval to create new generations, and it ranks the +servers it manages based on the sizes of their cold pages defined by +this time interval. + +Proactive reclaim +----------------- +Proactive reclaim induces page reclaim when there is no memory +pressure. It usually targets cold pages only. E.g., when a new job +comes in, the job scheduler wants to proactively reclaim cold pages on +the server it selected, to improve the chance of successfully landing +this new job. + +Users can write the following command to ``lru_gen`` to evict +generations less than or equal to ``min_gen_nr``. + + ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` + +``min_gen_nr`` should be less than ``max_gen_nr-1``, since +``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to +the active list) and therefore cannot be evicted. ``swappiness`` +overrides the default value in ``/proc/sys/vm/swappiness``. +``nr_to_reclaim`` limits the number of pages to evict. + +A typical use case is that a job scheduler runs this command before it +tries to land a new job on a server. If it fails to materialize enough +cold pages because of the overestimation, it retries on the next +server according to the ranking result obtained from the working set +estimation step. This less forceful approach limits the impacts on the +existing jobs. 
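For the aging (``+``) and eviction (``-``) commands above, a small userspace
sketch may make the format clearer. It is not part of the patch; the memcg id,
node id, generation numbers, swappiness and page count below are made-up
placeholders that a real job scheduler would derive from a prior read of
/sys/kernel/debug/lru_gen::

    /*
     * Illustrative only: write one aging ('+') and one eviction ('-')
     * command to the experimental lru_gen debugfs file.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void lru_gen_cmd(const char *cmd)
    {
        int fd = open("/sys/kernel/debug/lru_gen", O_WRONLY);

        if (fd < 0) {
            perror("/sys/kernel/debug/lru_gen");
            return;
        }
        if (write(fd, cmd, strlen(cmd)) < 0)
            perror("write");
        close(fd);
    }

    int main(void)
    {
        /* Aging: create generation max_gen_nr+1 for memcg 1, node 0. */
        lru_gen_cmd("+ 1 0 3 1 1");
        /* Eviction: evict generations <= 1, swappiness 0, up to 1000 pages. */
        lru_gen_cmd("- 1 0 1 0 1000");
        return 0;
    }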
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index c9c37f16eef8..8ee78ec232eb 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt:: /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs -The khugepaged progress can be seen in the number of pages collapsed:: +The khugepaged progress can be seen in the number of pages collapsed (note +that this counter may not be an exact count of the number of pages +collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping +being replaced by a PMD mapping, or (2) All 4K physical pages replaced by +one 2M hugepage. Each may happen independently, or together, depending on +the type of memory and the failures that occur. As such, this value should +be interpreted roughly as a sign of progress, and counters in /proc/vmstat +consulted for more accurate accounting):: /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed @@ -366,10 +373,9 @@ thp_split_pmd page table entry. thp_zero_page_alloc - is incremented every time a huge zero page is - successfully allocated. It includes allocations which where - dropped due race with other allocation. Note, it doesn't count - every map of the huge zero page, only its allocation. + is incremented every time a huge zero page used for thp is + successfully allocated. Note, it doesn't count every map of + the huge zero page, only its allocation. thp_zero_page_alloc_failed is incremented if kernel fails to allocate diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 6528036093e1..83f31919ebb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick. Design ====== -Userfaults are delivered and resolved through the ``userfaultfd`` syscall. +Userspace creates a new userfaultfd, initializes it, and registers one or more +regions of virtual memory with it. Then, any page faults which occur within the +region(s) result in a message being delivered to the userfaultfd, notifying +userspace of the fault. The ``userfaultfd`` (aside from registering and unregistering virtual memory ranges) provides two primary functionalities: @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory management of mremap/mprotect is that the userfaults in all their operations never involve heavyweight structures like vmas (in fact the ``userfaultfd`` runtime load never takes the mmap_lock for writing). - Vmas are not suitable for page- (or hugepage) granular fault tracking when dealing with virtual address spaces that could span Terabytes. Too many vmas would be needed for that. -The ``userfaultfd`` once opened by invoking the syscall, can also be +The ``userfaultfd``, once created, can also be passed using unix domain sockets to a manager process, so the same manager process could handle the userfaults of a multitude of different processes without them being aware about what is going on @@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``). 
API === +Creating a userfaultfd +---------------------- + +There are two ways to create a new userfaultfd, each of which provide ways to +restrict access to this functionality (since historically userfaultfds which +handle kernel page faults have been a useful tool for exploiting the kernel). + +The first way, supported since userfaultfd was introduced, is the +userfaultfd(2) syscall. Access to this is controlled in several ways: + +- Any user can always create a userfaultfd which traps userspace page faults + only. Such a userfaultfd can be created using the userfaultfd(2) syscall + with the flag UFFD_USER_MODE_ONLY. + +- In order to also trap kernel page faults for the address space, either the + process needs the CAP_SYS_PTRACE capability, or the system must have + vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd + is set to 0. + +The second way, added to the kernel more recently, is by opening +/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method +yields equivalent userfaultfds to the userfaultfd(2) syscall. + +Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal +filesystem permissions (user/group/mode), which gives fine grained access to +userfaultfd specifically, without also granting other unrelated privileges at +the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access +to /dev/userfaultfd can always create userfaultfds that trap kernel page faults; +vm.unprivileged_userfaultfd is not considered. + +Initializing a userfaultfd +-------------------------- + When first opened the ``userfaultfd`` must be enabled invoking the ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or a later API version) which will specify the ``read/POLLIN`` protocol diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index ee6572b1edad..835c8844bba4 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -635,6 +635,17 @@ different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too. +numa_balancing_promote_rate_limit_MBps +====================================== + +Too high promotion/demotion throughput between different memory types +may hurt application latency. This can be used to rate limit the +promotion throughput. The per-node max promotion throughput in MB/s +will be limited to be no more than the set value. + +A rule of thumb is to set this to less than 1/10 of the PMEM node +write bandwidth. + oops_all_cpu_backtrace ====================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 9b833e439f09..988f6a4c8084 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -926,6 +926,9 @@ calls without any restrictions. The default value is 0. +Another way to control permissions for userfaultfd is to use +/dev/userfaultfd instead of userfaultfd(2). See +Documentation/admin-guide/mm/userfaultfd.rst. user_reserve_kbytes =================== diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index b0e7b4771fff..77eb775b8b42 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -37,6 +37,7 @@ Library functionality that is used throughout the kernel. 
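A brief sketch of the two userfaultfd creation paths described in the new text,
followed by the UFFDIO_API handshake. It prefers /dev/userfaultfd when the node
is accessible and falls back to the syscall in user-mode-only form; it assumes
kernel headers recent enough to define USERFAULTFD_IOC_NEW, and the fallback
order is just one possible choice::

    /*
     * Illustrative only: create a userfaultfd either through the
     * /dev/userfaultfd device node (filesystem permissions apply) or,
     * failing that, through the userfaultfd(2) syscall restricted to
     * userspace faults, then perform the UFFDIO_API handshake.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int open_userfaultfd(void)
    {
        int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
        int uffd;

        if (dev >= 0) {
            uffd = ioctl(dev, USERFAULTFD_IOC_NEW,
                         O_CLOEXEC | O_NONBLOCK);
            close(dev);
        } else {
            uffd = syscall(__NR_userfaultfd,
                           O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
        }
        return uffd;
    }

    int main(void)
    {
        struct uffdio_api api = { .api = UFFD_API };
        int uffd = open_userfaultfd();

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
            perror("userfaultfd");
            return 1;
        }
        printf("userfaultfd ready, features 0x%llx\n",
               (unsigned long long)api.features);
        return 0;
    }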
kref assoc_array xarray + maple_tree idr circular-buffers rbtree diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst new file mode 100644 index 000000000000..45defcf15da7 --- /dev/null +++ b/Documentation/core-api/maple_tree.rst @@ -0,0 +1,217 @@ +.. SPDX-License-Identifier: GPL-2.0+ + + +========== +Maple Tree +========== + +:Author: Liam R. Howlett + +Overview +======== + +The Maple Tree is a B-Tree data type which is optimized for storing +non-overlapping ranges, including ranges of size 1. The tree was designed to +be simple to use and does not require a user written search method. It +supports iterating over a range of entries and going to the previous or next +entry in a cache-efficient manner. The tree can also be put into an RCU-safe +mode of operation which allows reading and writing concurrently. Writers must +synchronize on a lock, which can be the default spinlock, or the user can set +the lock to an external lock of a different type. + +The Maple Tree maintains a small memory footprint and was designed to use +modern processor cache efficiently. The majority of the users will be able to +use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex +scenarios. The most important usage of the Maple Tree is the tracking of the +virtual memory areas. + +The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple +Tree reserves values with the bottom two bits set to '10' which are below 4096 +(ie 2, 6, 10 .. 4094) for internal use. If the entries may use reserved +entries then the users can convert the entries using xa_mk_value() and convert +them back by calling xa_to_value(). If the user needs to use a reserved +value, then the user can convert the value when using the +:ref:`maple-tree-advanced-api`, but are blocked by the normal API. + +The Maple Tree can also be configured to support searching for a gap of a given +size (or larger). + +Pre-allocating of nodes is also supported using the +:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a +successful store operation within a given +code segment when allocating cannot be done. Allocations of nodes are +relatively small at around 256 bytes. + +.. _maple-tree-normal-api: + +Normal API +========== + +Start by initialising a maple tree, either with DEFINE_MTREE() for statically +allocated maple trees or mt_init() for dynamically allocated ones. A +freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0`` +- ``ULONG_MAX``. There are currently two types of maple trees supported: the +allocation tree and the regular tree. The regular tree has a higher branching +factor for internal nodes. The allocation tree has a lower branching factor +but allows the user to search for a gap of a given size or larger from either +``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by +passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree. + +You can then set entries using mtree_store() or mtree_store_range(). +mtree_store() will overwrite any entry with the new entry and return 0 on +success or an error code otherwise. mtree_store_range() works in the same way +but takes a range. mtree_load() is used to retrieve the entry stored at a +given index. You can use mtree_erase() to erase an entire range by only +knowing one value within that range, or mtree_store() call with an entry of +NULL may be used to partially erase a range or many ranges at once. 
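Since maple_tree.rst is new in this pull, a kernel-style sketch of the normal
API it introduces may help. Everything below is illustrative rather than taken
from the series: the function, the stored object and the chosen ranges are
invented, and only calls named in the text are used (mtree_store_range(),
mtree_load(), mt_for_each(), mtree_erase(), mtree_destroy())::

    /*
     * Illustrative only: exercise the normal maple tree API with one
     * made-up entry covering the range [10, 19].
     */
    #include <linux/bug.h>
    #include <linux/maple_tree.h>
    #include <linux/printk.h>
    #include <linux/slab.h>

    static DEFINE_MTREE(example_mt);

    static int example_mtree_use(void)
    {
        void *entry;
        unsigned long index = 0;
        int *item = kmalloc(sizeof(*item), GFP_KERNEL);
        int ret;

        if (!item)
            return -ENOMEM;

        /* Store one entry covering the range [10, 19]. */
        ret = mtree_store_range(&example_mt, 10, 19, item, GFP_KERNEL);
        if (ret)
            goto out;

        /* Any index inside the range returns the same entry. */
        entry = mtree_load(&example_mt, 15);
        WARN_ON(entry != item);

        /* Walk everything currently stored in [0, ULONG_MAX]. */
        mt_for_each(&example_mt, entry, index, ULONG_MAX)
            pr_info("found %p, next search index %lu\n", entry, index);

        /* Erasing via any index inside the range drops the whole range. */
        entry = mtree_erase(&example_mt, 12);
        WARN_ON(entry != item);
    out:
        mtree_destroy(&example_mt);
        kfree(item);
        return ret;
    }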
+ +If you want to only store a new entry to a range (or index) if that range is +currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which +return -EEXIST if the range is not empty. + +You can search for an entry from an index upwards by using mt_find(). + +You can walk each entry within a range by calling mt_for_each(). You must +provide a temporary variable to store a cursor. If you want to walk each +element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If +the caller is going to hold the lock for the duration of the walk then it is +worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api` +section. + +Sometimes it is necessary to ensure the next call to store to a maple tree does +not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. + +Finally, you can remove all entries from a maple tree by calling +mtree_destroy(). If the maple tree entries are pointers, you may wish to free +the entries first. + +Allocating Nodes +---------------- + +The allocations are handled by the internal tree code. See +:ref:`maple-tree-advanced-alloc` for other options. + +Locking +------- + +You do not have to worry about locking. See :ref:`maple-tree-advanced-locks` +for other options. + +The Maple Tree uses RCU and an internal spinlock to synchronise access: + +Takes RCU read lock: + * mtree_load() + * mt_find() + * mt_for_each() + * mt_next() + * mt_prev() + +Takes ma_lock internally: + * mtree_store() + * mtree_store_range() + * mtree_insert() + * mtree_insert_range() + * mtree_erase() + * mtree_destroy() + * mt_set_in_rcu() + * mt_clear_in_rcu() + +If you want to take advantage of the internal lock to protect the data +structures that you are storing in the Maple Tree, you can call mtree_lock() +before calling mtree_load(), then take a reference count on the object you +have found before calling mtree_unlock(). This will prevent stores from +removing the object from the tree between looking up the object and +incrementing the refcount. You can also use RCU to avoid dereferencing +freed memory, but an explanation of that is beyond the scope of this +document. + +.. _maple-tree-advanced-api: + +Advanced API +============ + +The advanced API offers more flexibility and better performance at the +cost of an interface which can be harder to use and has fewer safeguards. +You must take care of your own locking while using the advanced API. +You can use the ma_lock, RCU or an external lock for protection. +You can mix advanced and normal operations on the same array, as long +as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented +in terms of the advanced API. + +The advanced API is based around the ma_state, this is where the 'mas' +prefix originates. The ma_state struct keeps track of tree operations to make +life easier for both internal and external tree users. + +Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`. +Please see above. + +The maple state keeps track of the range start and end in mas->index and +mas->last, respectively. + +mas_walk() will walk the tree to the location of mas->index and set the +mas->index and mas->last according to the range for the entry. + +You can set entries using mas_store(). mas_store() will overwrite any entry +with the new entry and return the first existing entry that is overwritten. +The range is passed in as members of the maple state: index and last. 
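A companion sketch for the advanced API just described: the caller drives the
operation through a struct ma_state and owns the locking, here the tree's
internal lock via mas_lock(). Again purely illustrative, with mas_store_gfp()
return values left unchecked for brevity::

    /*
     * Illustrative only: stores done through the advanced API, with the
     * caller holding the tree's internal lock around the operations.
     */
    #include <linux/gfp.h>
    #include <linux/maple_tree.h>
    #include <linux/printk.h>

    static DEFINE_MTREE(adv_mt);

    static void example_mas_use(void *first, void *second)
    {
        void *entry;
        MA_STATE(mas, &adv_mt, 0, 0);

        mas_lock(&mas);

        /* Store @first over [0, 9]: the range lives in index/last. */
        mas_set_range(&mas, 0, 9);
        mas_store_gfp(&mas, first, GFP_KERNEL);

        /* Store @second over [100, 199]. */
        mas_set_range(&mas, 100, 199);
        mas_store_gfp(&mas, second, GFP_KERNEL);

        /* Iterate every entry while holding the lock. */
        mas_set(&mas, 0);
        mas_for_each(&mas, entry, ULONG_MAX)
            pr_info("[%lu, %lu] -> %p\n", mas.index, mas.last, entry);

        mas_unlock(&mas);
    }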
+ +You can use mas_erase() to erase an entire range by setting index and +last of the maple state to the desired range to erase. This will erase +the first range that is found in that range, set the maple state index +and last as the range that was erased and return the entry that existed +at that location. + +You can walk each entry within a range by using mas_for_each(). If you want +to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as +the range. If the lock needs to be periodically dropped, see the locking +section mas_pause(). + +Using a maple state allows mas_next() and mas_prev() to function as if the +tree was a linked list. With such a high branching factor the amortized +performance penalty is outweighed by cache optimization. mas_next() will +return the next entry which occurs after the entry at index. mas_prev() +will return the previous entry which occurs before the entry at index. + +mas_find() will find the first entry which exists at or above index on +the first call, and the next entry from every subsequent calls. + +mas_find_rev() will find the fist entry which exists at or below the last on +the first call, and the previous entry from every subsequent calls. + +If the user needs to yield the lock during an operation, then the maple state +must be paused using mas_pause(). + +There are a few extra interfaces provided when using an allocation tree. +If you wish to search for a gap within a range, then mas_empty_area() +or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap +starting at the lowest index given up to the maximum of the range. +mas_empty_area_rev() searches for a gap starting at the highest index given +and continues downward to the lower bound of the range. + +.. _maple-tree-advanced-alloc: + +Advanced Allocating Nodes +------------------------- + +Allocations are usually handled internally to the tree, however if allocations +need to occur before a write occurs then calling mas_expected_entries() will +allocate the worst-case number of needed nodes to insert the provided number of +ranges. This also causes the tree to enter mass insertion mode. Once +insertions are complete calling mas_destroy() on the maple state will free the +unused allocations. + +.. _maple-tree-advanced-locks: + +Advanced Locking +---------------- + +The maple tree uses a spinlock by default, but external locks can be used for +tree updates as well. To use an external lock, the tree must be initialized +with the ``MT_FLAGS_LOCK_EXTERN flag``, this is usually done with the +MTREE_INIT_EXT() #define, which takes an external lock as an argument. + +Functions and structures +======================== + +.. kernel-doc:: include/linux/maple_tree.h +.. kernel-doc:: lib/maple_tree.c diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index 1ebcc6c3fafe..f5dde5bceaea 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -19,9 +19,6 @@ User Space Memory Access Memory Allocation Controls ========================== -.. kernel-doc:: include/linux/gfp.h - :internal: - .. 
kernel-doc:: include/linux/gfp_types.h :doc: Page mobility and placement hints diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst index 4621eac290f4..6b0663075dc0 100644 --- a/Documentation/dev-tools/index.rst +++ b/Documentation/dev-tools/index.rst @@ -24,6 +24,7 @@ Documentation/dev-tools/testing-overview.rst kcov gcov kasan + kmsan ubsan kmemleak kcsan diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 1772fd457fed..5c93ab915049 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -111,9 +111,17 @@ parameter can be used to control panic and reporting behaviour: report or also panic the kernel (default: ``report``). The panic happens even if ``kasan_multi_shot`` is enabled. -Hardware Tag-Based KASAN mode (see the section about various modes below) is -intended for use in production as a security mitigation. Therefore, it supports -additional boot parameters that allow disabling KASAN or controlling features: +Software and Hardware Tag-Based KASAN modes (see the section about various +modes below) support altering stack trace collection behavior: + +- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack + traces collection (default: ``on``). +- ``kasan.stack_ring_size=<number of entries>`` specifies the number of entries + in the stack ring (default: ``32768``). + +Hardware Tag-Based KASAN mode is intended for use in production as a security +mitigation. Therefore, it supports additional boot parameters that allow +disabling KASAN altogether or controlling its features: - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``). @@ -132,9 +140,6 @@ additional boot parameters that allow disabling KASAN or controlling features: - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc allocations (default: ``on``). -- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack - traces collection (default: ``on``). - Error reports ~~~~~~~~~~~~~ diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst new file mode 100644 index 000000000000..2a53a801198c --- /dev/null +++ b/Documentation/dev-tools/kmsan.rst @@ -0,0 +1,427 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. Copyright (C) 2022, Google LLC. + +=================================== +The Kernel Memory Sanitizer (KMSAN) +=================================== + +KMSAN is a dynamic error detector aimed at finding uses of uninitialized +values. It is based on compiler instrumentation, and is quite similar to the +userspace `MemorySanitizer tool`_. + +An important note is that KMSAN is not intended for production use, because it +drastically increases kernel memory footprint and slows the whole system down. + +Usage +===== + +Building the kernel +------------------- + +In order to build a kernel with KMSAN you will need a fresh Clang (14.0.6+). +Please refer to `LLVM documentation`_ for the instructions on how to build Clang. + +Now configure and build the kernel with CONFIG_KMSAN enabled. 
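Before the sample report, a sketch of the kind of code that produces one. The
functions below are invented for illustration (loosely modeled on the
mm/kmsan/kmsan_test.c cases the report quotes, not copied from them): an
uninitialized local array is copied out, and KMSAN is then asked to check the
destination with kmsan_check_memory()::

    /*
     * Illustrative only: with CONFIG_KMSAN=y, checking memory that was
     * filled from an uninitialized local triggers an uninit-value report.
     */
    #include <linux/kmsan-checks.h>
    #include <linux/minmax.h>
    #include <linux/string.h>

    static noinline void fill_from_uninit(char *out, size_t len)
    {
        char local[8];  /* deliberately left uninitialized */

        memcpy(out, local, min(len, sizeof(local)));
    }

    static void kmsan_demo(void)
    {
        char buf[8];

        fill_from_uninit(buf, sizeof(buf));
        kmsan_check_memory(buf, sizeof(buf));   /* uninit-value report here */
    }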
+ +Example report +-------------- + +Here is an example of a KMSAN report:: + + ===================================================== + BUG: KMSAN: uninit-value in test_uninit_kmsan_check_memory+0x1be/0x380 [kmsan_test] + test_uninit_kmsan_check_memory+0x1be/0x380 mm/kmsan/kmsan_test.c:273 + kunit_run_case_internal lib/kunit/test.c:333 + kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374 + kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28 + kthread+0x721/0x850 kernel/kthread.c:327 + ret_from_fork+0x1f/0x30 ??:? + + Uninit was stored to memory at: + do_uninit_local_array+0xfa/0x110 mm/kmsan/kmsan_test.c:260 + test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271 + kunit_run_case_internal lib/kunit/test.c:333 + kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374 + kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28 + kthread+0x721/0x850 kernel/kthread.c:327 + ret_from_fork+0x1f/0x30 ??:? + + Local variable uninit created at: + do_uninit_local_array+0x4a/0x110 mm/kmsan/kmsan_test.c:256 + test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271 + + Bytes 4-7 of 8 are uninitialized + Memory access of size 8 starts at ffff888083fe3da0 + + CPU: 0 PID: 6731 Comm: kunit_try_catch Tainted: G B E 5.16.0-rc3+ #104 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 + ===================================================== + +The report says that the local variable ``uninit`` was created uninitialized in +``do_uninit_local_array()``. The third stack trace corresponds to the place +where this variable was created. + +The first stack trace shows where the uninit value was used (in +``test_uninit_kmsan_check_memory()``). The tool shows the bytes which were left +uninitialized in the local variable, as well as the stack where the value was +copied to another memory location before use. + +A use of uninitialized value ``v`` is reported by KMSAN in the following cases: + - in a condition, e.g. ``if (v) { ... }``; + - in an indexing or pointer dereferencing, e.g. ``array[v]`` or ``*v``; + - when it is copied to userspace or hardware, e.g. ``copy_to_user(..., &v, ...)``; + - when it is passed as an argument to a function, and + ``CONFIG_KMSAN_CHECK_PARAM_RETVAL`` is enabled (see below). + +The mentioned cases (apart from copying data to userspace or hardware, which is +a security issue) are considered undefined behavior from the C11 Standard point +of view. + +Disabling the instrumentation +----------------------------- + +A function can be marked with ``__no_kmsan_checks``. Doing so makes KMSAN +ignore uninitialized values in that function and mark its output as initialized. +As a result, the user will not get KMSAN reports related to that function. + +Another function attribute supported by KMSAN is ``__no_sanitize_memory``. +Applying this attribute to a function will result in KMSAN not instrumenting +it, which can be helpful if we do not want the compiler to interfere with some +low-level code (e.g. that marked with ``noinstr`` which implicitly adds +``__no_sanitize_memory``). + +This however comes at a cost: stack allocations from such functions will have +incorrect shadow/origin values, likely leading to false positives. Functions +called from non-instrumented code may also receive incorrect metadata for their +parameters. + +As a rule of thumb, avoid using ``__no_sanitize_memory`` explicitly. + +It is also possible to disable KMSAN for a single file (e.g. 
main.o):: + + KMSAN_SANITIZE_main.o := n + +or for the whole directory:: + + KMSAN_SANITIZE := n + +in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every +function in the file or directory. Most users won't need KMSAN_SANITIZE, unless +their code gets broken by KMSAN (e.g. runs at early boot time). + +Support +======= + +In order for KMSAN to work the kernel must be built with Clang, which so far is +the only compiler that has KMSAN support. The kernel instrumentation pass is +based on the userspace `MemorySanitizer tool`_. + +The runtime library only supports x86_64 at the moment. + +How KMSAN works +=============== + +KMSAN shadow memory +------------------- + +KMSAN associates a metadata byte (also called shadow byte) with every byte of +kernel memory. A bit in the shadow byte is set iff the corresponding bit of the +kernel memory byte is uninitialized. Marking the memory uninitialized (i.e. +setting its shadow bytes to ``0xff``) is called poisoning, marking it +initialized (setting the shadow bytes to ``0x00``) is called unpoisoning. + +When a new variable is allocated on the stack, it is poisoned by default by +instrumentation code inserted by the compiler (unless it is a stack variable +that is immediately initialized). Any new heap allocation done without +``__GFP_ZERO`` is also poisoned. + +Compiler instrumentation also tracks the shadow values as they are used along +the code. When needed, instrumentation code invokes the runtime library in +``mm/kmsan/`` to persist shadow values. + +The shadow value of a basic or compound type is an array of bytes of the same +length. When a constant value is written into memory, that memory is unpoisoned. +When a value is read from memory, its shadow memory is also obtained and +propagated into all the operations which use that value. For every instruction +that takes one or more values the compiler generates code that calculates the +shadow of the result depending on those values and their shadows. + +Example:: + + int a = 0xff; // i.e. 0x000000ff + int b; + int c = a | b; + +In this case the shadow of ``a`` is ``0``, shadow of ``b`` is ``0xffffffff``, +shadow of ``c`` is ``0xffffff00``. This means that the upper three bytes of +``c`` are uninitialized, while the lower byte is initialized. + +Origin tracking +--------------- + +Every four bytes of kernel memory also have a so-called origin mapped to them. +This origin describes the point in program execution at which the uninitialized +value was created. Every origin is associated with either the full allocation +stack (for heap-allocated memory), or the function containing the uninitialized +variable (for locals). + +When an uninitialized variable is allocated on stack or heap, a new origin +value is created, and that variable's origin is filled with that value. When a +value is read from memory, its origin is also read and kept together with the +shadow. For every instruction that takes one or more values, the origin of the +result is one of the origins corresponding to any of the uninitialized inputs. +If a poisoned value is written into memory, its origin is written to the +corresponding storage as well. + +Example 1:: + + int a = 42; + int b; + int c = a + b; + +In this case the origin of ``b`` is generated upon function entry, and is +stored to the origin of ``c`` right before the addition result is written into +memory. + +Several variables may share the same origin address, if they are stored in the +same four-byte chunk. 
In this case every write to either variable updates the +origin for all of them. We have to sacrifice precision in this case, because +storing origins for individual bits (and even bytes) would be too costly. + +Example 2:: + + int combine(short a, short b) { + union ret_t { + int i; + short s[2]; + } ret; + ret.s[0] = a; + ret.s[1] = b; + return ret.i; + } + +If ``a`` is initialized and ``b`` is not, the shadow of the result would be +0xffff0000, and the origin of the result would be the origin of ``b``. +``ret.s[0]`` would have the same origin, but it will never be used, because +that variable is initialized. + +If both function arguments are uninitialized, only the origin of the second +argument is preserved. + +Origin chaining +~~~~~~~~~~~~~~~ + +To ease debugging, KMSAN creates a new origin for every store of an +uninitialized value to memory. The new origin references both its creation stack +and the previous origin the value had. This may cause increased memory +consumption, so we limit the length of origin chains in the runtime. + +Clang instrumentation API +------------------------- + +Clang instrumentation pass inserts calls to functions defined in +``mm/kmsan/nstrumentation.c`` into the kernel code. + +Shadow manipulation +~~~~~~~~~~~~~~~~~~~ + +For every memory access the compiler emits a call to a function that returns a +pair of pointers to the shadow and origin addresses of the given memory:: + + typedef struct { + void *shadow, *origin; + } shadow_origin_ptr_t + + shadow_origin_ptr_t __msan_metadata_ptr_for_load_{1,2,4,8}(void *addr) + shadow_origin_ptr_t __msan_metadata_ptr_for_store_{1,2,4,8}(void *addr) + shadow_origin_ptr_t __msan_metadata_ptr_for_load_n(void *addr, uintptr_t size) + shadow_origin_ptr_t __msan_metadata_ptr_for_store_n(void *addr, uintptr_t size) + +The function name depends on the memory access size. + +The compiler makes sure that for every loaded value its shadow and origin +values are read from memory. When a value is stored to memory, its shadow and +origin are also stored using the metadata pointers. + +Handling locals +~~~~~~~~~~~~~~~ + +A special function is used to create a new origin value for a local variable and +set the origin of that variable to that value:: + + void __msan_poison_alloca(void *addr, uintptr_t size, char *descr) + +Access to per-task data +~~~~~~~~~~~~~~~~~~~~~~~ + +At the beginning of every instrumented function KMSAN inserts a call to +``__msan_get_context_state()``:: + + kmsan_context_state *__msan_get_context_state(void) + +``kmsan_context_state`` is declared in ``include/linux/kmsan.h``:: + + struct kmsan_context_state { + char param_tls[KMSAN_PARAM_SIZE]; + char retval_tls[KMSAN_RETVAL_SIZE]; + char va_arg_tls[KMSAN_PARAM_SIZE]; + char va_arg_origin_tls[KMSAN_PARAM_SIZE]; + u64 va_arg_overflow_size_tls; + char param_origin_tls[KMSAN_PARAM_SIZE]; + depot_stack_handle_t retval_origin_tls; + }; + +This structure is used by KMSAN to pass parameter shadows and origins between +instrumented functions (unless the parameters are checked immediately by +``CONFIG_KMSAN_CHECK_PARAM_RETVAL``). + +Passing uninitialized values to functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Clang's MemorySanitizer instrumentation has an option, +``-fsanitize-memory-param-retval``, which makes the compiler check function +parameters passed by value, as well as function return values. + +The option is controlled by ``CONFIG_KMSAN_CHECK_PARAM_RETVAL``, which is +enabled by default to let KMSAN report uninitialized values earlier. 
+Please refer to the `LKML discussion`_ for more details. + +Because of the way the checks are implemented in LLVM (they are only applied to +parameters marked as ``noundef``), not all parameters are guaranteed to be +checked, so we cannot give up the metadata storage in ``kmsan_context_state``. + +String functions +~~~~~~~~~~~~~~~~ + +The compiler replaces calls to ``memcpy()``/``memmove()``/``memset()`` with the +following functions. These functions are also called when data structures are +initialized or copied, making sure shadow and origin values are copied alongside +with the data:: + + void *__msan_memcpy(void *dst, void *src, uintptr_t n) + void *__msan_memmove(void *dst, void *src, uintptr_t n) + void *__msan_memset(void *dst, int c, uintptr_t n) + +Error reporting +~~~~~~~~~~~~~~~ + +For each use of a value the compiler emits a shadow check that calls +``__msan_warning()`` in the case that value is poisoned:: + + void __msan_warning(u32 origin) + +``__msan_warning()`` causes KMSAN runtime to print an error report. + +Inline assembly instrumentation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +KMSAN instruments every inline assembly output with a call to:: + + void __msan_instrument_asm_store(void *addr, uintptr_t size) + +, which unpoisons the memory region. + +This approach may mask certain errors, but it also helps to avoid a lot of +false positives in bitwise operations, atomics etc. + +Sometimes the pointers passed into inline assembly do not point to valid memory. +In such cases they are ignored at runtime. + + +Runtime library +--------------- + +The code is located in ``mm/kmsan/``. + +Per-task KMSAN state +~~~~~~~~~~~~~~~~~~~~ + +Every task_struct has an associated KMSAN task state that holds the KMSAN +context (see above) and a per-task flag disallowing KMSAN reports:: + + struct kmsan_context { + ... + bool allow_reporting; + struct kmsan_context_state cstate; + ... + } + + struct task_struct { + ... + struct kmsan_context kmsan; + ... + } + +KMSAN contexts +~~~~~~~~~~~~~~ + +When running in a kernel task context, KMSAN uses ``current->kmsan.cstate`` to +hold the metadata for function parameters and return values. + +But in the case the kernel is running in the interrupt, softirq or NMI context, +where ``current`` is unavailable, KMSAN switches to per-cpu interrupt state:: + + DEFINE_PER_CPU(struct kmsan_ctx, kmsan_percpu_ctx); + +Metadata allocation +~~~~~~~~~~~~~~~~~~~ + +There are several places in the kernel for which the metadata is stored. + +1. Each ``struct page`` instance contains two pointers to its shadow and +origin pages:: + + struct page { + ... + struct page *shadow, *origin; + ... + }; + +At boot-time, the kernel allocates shadow and origin pages for every available +kernel page. This is done quite late, when the kernel address space is already +fragmented, so normal data pages may arbitrarily interleave with the metadata +pages. + +This means that in general for two contiguous memory pages their shadow/origin +pages may not be contiguous. Consequently, if a memory access crosses the +boundary of a memory block, accesses to shadow/origin memory may potentially +corrupt other pages or read incorrect values from them. + +In practice, contiguous memory pages returned by the same ``alloc_pages()`` +call will have contiguous metadata, whereas if these pages belong to two +different allocations their metadata pages can be fragmented. + +For the kernel data (``.data``, ``.bss`` etc.) and percpu memory regions +there also are no guarantees on metadata contiguity. 
+ +In the case ``__msan_metadata_ptr_for_XXX_YYY()`` hits the border between two +pages with non-contiguous metadata, it returns pointers to fake shadow/origin regions:: + + char dummy_load_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE))); + char dummy_store_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE))); + +``dummy_load_page`` is zero-initialized, so reads from it always yield zeroes. +All stores to ``dummy_store_page`` are ignored. + +2. For vmalloc memory and modules, there is a direct mapping between the memory +range, its shadow and origin. KMSAN reduces the vmalloc area by 3/4, making only +the first quarter available to ``vmalloc()``. The second quarter of the vmalloc +area contains shadow memory for the first quarter, the third one holds the +origins. A small part of the fourth quarter contains shadow and origins for the +kernel modules. Please refer to ``arch/x86/include/asm/pgtable_64_types.h`` for +more details. + +When an array of pages is mapped into a contiguous virtual memory space, their +shadow and origin pages are similarly mapped into contiguous regions. + +References +========== + +E. Stepanov, K. Serebryany. `MemorySanitizer: fast detector of uninitialized +memory use in C++ +<https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43308.pdf>`_. +In Proceedings of CGO 2015. + +.. _MemorySanitizer tool: https://clang.llvm.org/docs/MemorySanitizer.html +.. _LLVM documentation: https://llvm.org/docs/GettingStarted.html +.. _LKML discussion: https://lore.kernel.org/all/20220614144853.3693273-1-glider@google.com/ diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst index 575ccd40e30c..4aa12b8be278 100644 --- a/Documentation/mm/index.rst +++ b/Documentation/mm/index.rst @@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/mm/ksm.rst b/Documentation/mm/ksm.rst index 9e37add068e6..f83cfbc12f4c 100644 --- a/Documentation/mm/ksm.rst +++ b/Documentation/mm/ksm.rst @@ -26,7 +26,7 @@ tree. If a KSM page is shared between less than ``max_page_sharing`` VMAs, the node of the stable tree that represents such KSM page points to a -list of struct rmap_item and the ``page->mapping`` of the +list of struct ksm_rmap_item and the ``page->mapping`` of the KSM page points to the stable tree node. When the sharing passes this threshold, KSM adds a second dimension to diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst new file mode 100644 index 000000000000..d7062c6a8946 --- /dev/null +++ b/Documentation/mm/multigen_lru.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. 
Generations establish a +(time-based) common frame of reference and therefore help make better +choices, e.g., between different memcgs on a computer or different +computers in a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +sparse to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is necessary. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. 
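As a rough sketch of the mapping just described (made-up helper names; not the in-tree code), the list index and the gen counter value can be derived from a sequence number like this::

    /* Illustrative value only; the kernel's MAX_NR_GENS lives in the mm headers. */
    #define MAX_NR_GENS 4U

    /* Index into lrugen->lists[]: the truncated generation number. */
    static unsigned long sketch_gen_index(unsigned long seq)
    {
            return seq % MAX_NR_GENS;
    }

    /*
     * Value kept in the gen counter in folio->flags while the page is on
     * one of lrugen->lists[]; zero is reserved for "not on any list".
     */
    static unsigned long sketch_gen_counter(unsigned long seq)
    {
            return sketch_gen_index(seq) + 1;
    }

This also shows why ``order_base_2(MAX_NR_GENS+1)`` bits suffice: the stored value never exceeds ``MAX_NR_GENS``, and zero is reserved for pages that are not on any list.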
+ +Each generation is divided into multiple tiers. A page accessed ``N`` +times through file descriptors is in tier ``order_base_2(N)``. Unlike +generations, tiers do not have dedicated ``lrugen->lists[]``. In +contrast to moving across generations, which requires the LRU lock, +moving across tiers only involves atomic operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs, and after each iteration, it increments ``max_seq``. For +the latter, when the eviction walks the rmap and finds a young PTE, +the aging scans the adjacent PTEs. For both, on finding a young PTE, +the aging clears the accessed bit and updates the gen counter of the +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to its gen counter if the aging has found this page +accessed through page tables and updated its gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +this end, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Rmap walks +* Page table walks +* Bloom filters +* PID controller + +The aging and the eviction form a producer-consumer model; +specifically, the latter drives the former by the sliding window over +generations. Within the aging, rmap walks drive page table walks by +inserting hot densely populated page tables into the Bloom filters. +Within the eviction, the PID controller uses refaults as the feedback +to select types to evict and tiers to protect. diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst index f5c954afe97c..f18fd8907049 100644 --- a/Documentation/mm/page_owner.rst +++ b/Documentation/mm/page_owner.rst @@ -94,6 +94,11 @@ Usage Page allocated via order XXX, ... PFN XXX ... // Detailed stack + By default, it will do a full pfn dump. To start with a given pfn, + page_owner supports fseek. 
+ + FILE *fp = fopen("/sys/kernel/debug/page_owner", "r"); + fseek(fp, pfn_start, SEEK_SET); The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows in buf, uses regexp to extract the page order value, counts the times