author     Linus Torvalds <torvalds@linux-foundation.org>  2020-06-04 06:24:15 +0300
committer  Linus Torvalds <torvalds@linux-foundation.org>  2020-06-04 06:24:15 +0300
commit     ee01c4d72adffb7d424535adf630f2955748fa8b (patch)
tree       9ea9f40473e105e936e7477ab7dc7248d899af21 /Documentation
parent     c444eb564fb16645c172d550359cb3d75fe8a040 (diff)
parent     09587a09ada2ed7c39aedfa2681152b5ac5641ee (diff)
download   linux-ee01c4d72adffb7d424535adf630f2955748fa8b.tar.xz
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"More mm/ work, plenty more to come
Subsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"
* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
...
Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/admin-guide/cgroup-v1/memory.rst             | 19
-rw-r--r--  Documentation/admin-guide/kernel-parameters.txt            | 40
-rw-r--r--  Documentation/admin-guide/mm/hugetlbpage.rst               | 35
-rw-r--r--  Documentation/admin-guide/mm/transhuge.rst                 |  7
-rw-r--r--  Documentation/admin-guide/sysctl/vm.rst                    | 23
-rw-r--r--  Documentation/core-api/padata.rst                          | 41
-rw-r--r--  Documentation/features/vm/numa-memblock/arch-support.txt   | 34
-rw-r--r--  Documentation/vm/memory-model.rst                          |  9
-rw-r--r--  Documentation/vm/page_owner.rst                            |  3
9 files changed, 130 insertions, 81 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..12757e63b26c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -199,11 +199,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
 unaccounted when it's removed from radix-tree. Even if RSS pages are fully
 unmapped (by kswapd), they may exist as SwapCache in the system until they
 are really freed. Such SwapCaches are also accounted.
-A swapped-in page is not accounted until it's mapped.
+A swapped-in page is accounted after adding into swapcache.
 
 Note: The kernel does swapin-readahead and reads multiple swaps at once.
-This means swapped-in pages may contain pages for other tasks than a task
-causing page fault. So, we avoid accounting at swap-in I/O.
+Since page's memcg recorded into swap whatever memsw enabled, the page will
+be accounted after swapin.
 
 At page migration, accounting information is kept.
 
@@ -222,18 +222,13 @@ the cgroup that brought it in -- this will happen on memory pressure).
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 
-Exception: If CONFIG_MEMCG_SWAP is not used.
-When you do swapoff and make swapped-out pages of shmem(tmpfs) to
-be backed into memory in force, charges for pages are accounted against the
-caller of swapoff rather than the users of shmem.
-
-2.4 Swap Extension (CONFIG_MEMCG_SWAP)
+2.4 Swap Extension
 --------------------------------------
 
-Swap Extension allows you to record charge for swap. A swapped-in page is
-charged back to original page allocator if possible.
+Swap usage is always recorded for each of cgroup. Swap Extension allows you to
+read and limit it.
 
-When swap is accounted, following files are added.
+When CONFIG_SWAP is enabled, following files are added.
 
 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a76d83ed5262..0b4b9e1e35b6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -834,12 +834,15 @@ See also Documentation/networking/decnet.rst.
 
 	default_hugepagesz=
-			[same as hugepagesz=] The size of the default
-			HugeTLB page size. This is the size represented by
-			the legacy /proc/ hugepages APIs, used for SHM, and
-			default size when mounting hugetlbfs filesystems.
-			Defaults to the default architecture's huge page size
-			if not specified.
+			[HW] The size of the default HugeTLB page. This is
+			the size represented by the legacy /proc/ hugepages
+			APIs. In addition, this is the default hugetlb size
+			used for shmget(), mmap() and mounting hugetlbfs
+			filesystems. If not specified, defaults to the
+			architecture's default huge page size. Huge page
+			sizes are architecture dependent. See also
+			Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: size[KMG]
 
 	deferred_probe_timeout=
 			[KNL] Debugging option to set a timeout in seconds for
@@ -1484,13 +1487,24 @@ hugepages using the cma allocator. If enabled, the
 			boot-time allocation of gigantic hugepages is skipped.
 
-	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
-	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
-			On x86-64 and powerpc, this option can be specified
-			multiple times interleaved with hugepages= to reserve
-			huge pages of different sizes. Valid pages sizes on
-			x86-64 are 2M (when the CPU supports "pse") and 1G
-			(when the CPU supports the "pdpe1gb" cpuinfo flag).
+	hugepages=	[HW] Number of HugeTLB pages to allocate at boot.
+			If this follows hugepagesz (below), it specifies
+			the number of pages of hugepagesz to be allocated.
+			If this is the first HugeTLB parameter on the command
+			line, it specifies the number of pages to allocate for
+			the default huge page size. See also
+			Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: <integer>
+
+	hugepagesz=
+			[HW] The size of the HugeTLB pages. This is used in
+			conjunction with hugepages (above) to allocate huge
+			pages of a specific size at boot. The pair
+			hugepagesz=X hugepages=Y can be specified once for
+			each supported huge page size. Huge page sizes are
+			architecture dependent. See also
+			Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: size[KMG]
 
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 1cc0bc78d10e..5026e58826e2 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -100,6 +100,41 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
 be specified in bytes with optional scale suffix [kKmMgG]. The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.
 
+Hugetlb boot command line parameter semantics
+hugepagesz - Specify a huge page size. Used in conjunction with hugepages
+	parameter to preallocate a number of huge pages of the specified
+	size. Hence, hugepagesz and hugepages are typically specified in
+	pairs such as:
+		hugepagesz=2M hugepages=512
+	hugepagesz can only be specified once on the command line for a
+	specific huge page size. Valid huge page sizes are architecture
+	dependent.
+hugepages - Specify the number of huge pages to preallocate. This typically
+	follows a valid hugepagesz or default_hugepagesz parameter. However,
+	if hugepages is the first or only hugetlb command line parameter it
+	implicitly specifies the number of huge pages of default size to
+	allocate. If the number of huge pages of default size is implicitly
+	specified, it can not be overwritten by a hugepagesz,hugepages
+	parameter pair for the default size.
+	For example, on an architecture with 2M default huge page size:
+		hugepages=256 hugepagesz=2M hugepages=512
+	will result in 256 2M huge pages being allocated and a warning message
+	indicating that the hugepages=512 parameter is ignored. If a hugepages
+	parameter is preceded by an invalid hugepagesz parameter, it will
+	be ignored.
+default_hugepagesz - Specify the default huge page size. This parameter can
+	only be specified once on the command line. default_hugepagesz can
+	optionally be followed by the hugepages parameter to preallocate a
+	specific number of huge pages of default size. The number of default
+	sized huge pages to preallocate can also be implicitly specified as
+	mentioned in the hugepages section above. Therefore, on an
+	architecture with 2M default huge page size:
+		hugepages=256
+		default_hugepagesz=2M hugepages=256
+		hugepages=256 default_hugepagesz=2M
+	will all result in 256 2M huge pages being allocated. Valid default
+	huge page size is architecture dependent.
+
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
 Thus, one can use the following command to dynamically allocate/deallocate
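The pool reserved by these parameters is consumed from userspace through
hugetlbfs, shmget(SHM_HUGETLB), or mmap(MAP_HUGETLB). As a rough sketch of
that last path (illustrative only, not part of this patch; it assumes a 2M
default huge page size and at least one reserved page), a program that maps
and touches one default-sized huge page could look like::

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Assumes the default huge page size is 2M; adjust as needed. */
        size_t len = 2UL * 1024 * 1024;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");   /* fails if no huge pages are reserved */
            return 1;
        }
        memset(p, 0, len);    /* touch the mapping to fault in the huge page */
        munmap(p, len);
        return 0;
    }

If the pool is empty or was never set up, the mmap() call fails with ENOMEM,
which makes this a quick check that the boot parameters above took effect.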
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 2f31de8f7c74..6a233e42be08 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -220,6 +220,13 @@ memory. A lower value can prevent THPs from being collapsed,
 resulting fewer pages being collapsed into THPs, and lower memory
 access performance.
 
+``max_ptes_shared`` specifies how many pages can be shared across multiple
+processes. Exceeding the number would block the collapse::
+
+	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
+
+A higher value may increase memory footprint for some workloads.
+
 Boot parameter
 ==============
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..d46d5b7013c6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -831,14 +831,27 @@ tooling to work, you can do::
 
 swappiness
 ==========
 
-This control is used to define how aggressive the kernel will swap
-memory pages. Higher values will increase aggressiveness, lower values
-decrease the amount of swap. A value of 0 instructs the kernel not to
-initiate swap until the amount of free and file-backed pages is less
-than the high water mark in a zone.
+This control is used to define the rough relative IO cost of swapping
+and filesystem paging, as a value between 0 and 200. At 100, the VM
+assumes equal IO cost and will thus apply memory pressure to the page
+cache and swap-backed pages equally; lower values signify more
+expensive swap IO, higher values indicates cheaper.
+
+Keep in mind that filesystem IO patterns under memory pressure tend to
+be more efficient than swap's random IO. An optimal value will require
+experimentation and will also be workload-dependent.
 
 The default value is 60.
 
+For in-memory swap, like zram or zswap, as well as hybrid setups that
+have swap on faster devices than the filesystem, values beyond 100 can
+be considered. For example, if the random IO against the swap device
+is on average 2x faster than IO from the filesystem, swappiness should
+be 133 (x + 2x = 200, 2x = 133.33).
+
+At 0, the kernel will not initiate swap until the amount of free and
+file-backed pages is less than the high watermark in a zone.
+
 unprivileged_userfaultfd
 ========================
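The 133 in the swappiness text above is just this 0..200 cost split worked
out for one speed ratio. A throwaway helper (illustrative only, not from this
patch) that generalizes the same arithmetic::

    /*
     * swappiness = 200 * swap_speed / (swap_speed + file_speed)
     * e.g. swap IO 2x faster than filesystem IO: 200 * 2 / (2 + 1) = 133
     */
    static int swappiness_from_speeds(double swap_speed, double file_speed)
    {
        return (int)(200.0 * swap_speed / (swap_speed + file_speed));
    }

The resulting value is applied with ``sysctl vm.swappiness=<value>`` or by
writing to /proc/sys/vm/swappiness.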
diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d..0830e5b0e821 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@ The padata parallel execution mechanism
 =======================================
 
-:Date: December 2019
+:Date: May 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering. It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets. The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets. This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
     #include <linux/padata.h>
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished. padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job. First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section. This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread. Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any. Prepare the shared state, which is
+typically allocated on the main thread's stack. Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
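To make those three steps concrete, here is a minimal in-kernel sketch
(illustrative only, not part of this patch; the job and all names are made
up, and padata_do_multithreaded() is an __init-time helper at this point)
that zeroes a buffer in parallel::

    #include <linux/init.h>
    #include <linux/mm.h>
    #include <linux/padata.h>
    #include <linux/string.h>
    #include <linux/types.h>

    /* Shared state for the job, typically on the main thread's stack. */
    struct zero_args {
        u8 *buf;
    };

    /* Thread function: padata calls this once per chunk, for [start, end). */
    static void __init zero_chunk(unsigned long start, unsigned long end,
                                  void *arg)
    {
        struct zero_args *args = arg;

        memset(args->buf + start, 0, end - start);
    }

    static void __init zero_buffer_mt(u8 *buf, unsigned long size)
    {
        struct zero_args args = { .buf = buf };
        struct padata_mt_job job = {
            .thread_fn   = zero_chunk,
            .fn_arg      = &args,
            .start       = 0,
            .size        = size,
            .align       = PAGE_SIZE,  /* chunk boundaries fall on this */
            .min_chunk   = PAGE_SIZE,  /* smallest chunk worth a thread */
            .max_threads = 4,
        };

        /* Returns once the main thread and all helpers have finished. */
        padata_do_multithreaded(&job);
    }

The number of helper threads actually used may be smaller than max_threads,
depending on the job size and min_chunk.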
diff --git a/Documentation/features/vm/numa-memblock/arch-support.txt b/Documentation/features/vm/numa-memblock/arch-support.txt
deleted file mode 100644
index 3004beb0fd71..000000000000
--- a/Documentation/features/vm/numa-memblock/arch-support.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-#
-# Feature name: numa-memblock
-#         Kconfig: HAVE_MEMBLOCK_NODE_MAP
-#         description: arch supports NUMA aware memblocks
-#
-    -----------------------
-    |         arch |status|
-    -----------------------
-    |       alpha: | TODO |
-    |         arc: |  ..  |
-    |         arm: |  ..  |
-    |       arm64: |  ok  |
-    |         c6x: |  ..  |
-    |        csky: |  ..  |
-    |       h8300: |  ..  |
-    |     hexagon: |  ..  |
-    |        ia64: |  ok  |
-    |        m68k: |  ..  |
-    |  microblaze: |  ok  |
-    |        mips: |  ok  |
-    |       nds32: | TODO |
-    |       nios2: |  ..  |
-    |    openrisc: |  ..  |
-    |      parisc: |  ..  |
-    |     powerpc: |  ok  |
-    |       riscv: |  ok  |
-    |        s390: |  ok  |
-    |          sh: |  ok  |
-    |       sparc: |  ok  |
-    |          um: |  ..  |
-    |   unicore32: |  ..  |
-    |         x86: |  ok  |
-    |      xtensa: |  ..  |
-    -----------------------
diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
index 58a12376b7df..91228044ed16 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -46,11 +46,10 @@ maps the entire physical memory. For most architectures, the holes
 have entries in the `mem_map` array. The `struct page` objects
 corresponding to the holes are never fully initialized.
 
-To allocate the `mem_map` array, architecture specific setup code
-should call :c:func:`free_area_init_node` function or its convenience
-wrapper :c:func:`free_area_init`. Yet, the mappings array is not
-usable until the call to :c:func:`memblock_free_all` that hands all
-the memory to the page allocator.
+To allocate the `mem_map` array, architecture specific setup code should
+call :c:func:`free_area_init` function. Yet, the mappings array is not
+usable until the call to :c:func:`memblock_free_all` that hands all the
+memory to the page allocator.
 
 If an architecture enables `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option, it
 may free parts of the `mem_map` array that do not cover the
diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst
index 0ed5ab8c7ab4..079f3f8c4784 100644
--- a/Documentation/vm/page_owner.rst
+++ b/Documentation/vm/page_owner.rst
@@ -83,8 +83,7 @@ Usage
 4) Analyze information from page owner::
 
 	cat /sys/kernel/debug/page_owner > page_owner_full.txt
-	grep -v ^PFN page_owner_full.txt > page_owner.txt
-	./page_owner_sort page_owner.txt sorted_page_owner.txt
+	./page_owner_sort page_owner_full.txt sorted_page_owner.txt
 
    See the result about who allocated each page
    in the ``sorted_page_owner.txt``.