diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2018-10-27 05:33:41 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2018-10-27 05:33:41 +0300 |
commit | 345671ea0f9258f410eb057b9ced9cefbbe5dc78 (patch) | |
tree | fe97ba3d27679789e6aa34e39b002ee64ce25412 /Documentation | |
parent | 4904008165c8a1c48602b8316139691b8c735e6e (diff) | |
parent | 22146c3ce98962436e401f7b7016a6f664c9ffb5 (diff) | |
download | linux-345671ea0f9258f410eb057b9ced9cefbbe5dc78.tar.xz |
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
hugetlbfs: dirty pages as they are added to pagecache
mm: export add_swap_extent()
mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
mm/gup: cache dev_pagemap while pinning pages
Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
mm: return zero_resv_unavail optimization
mm: zero remaining unavailable struct pages
tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
mm/gup_benchmark.c: add additional pinning methods
mm/gup_benchmark.c: time put_page()
mm: don't raise MEMCG_OOM event due to failed high-order allocation
mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/accounting/psi.txt | 73 | ||||
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 22 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 12 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 4 | ||||
-rw-r--r-- | Documentation/vm/slub.rst | 12 | ||||
-rw-r--r-- | Documentation/x86/pat.txt | 4 |
6 files changed, 122 insertions, 5 deletions
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt new file mode 100644 index 000000000000..b8ca28b60215 --- /dev/null +++ b/Documentation/accounting/psi.txt @@ -0,0 +1,73 @@ +================================ +PSI - Pressure Stall Information +================================ + +:Date: April, 2018 +:Author: Johannes Weiner <hannes@cmpxchg.org> + +When CPU, memory or IO devices are contended, workloads experience +latency spikes, throughput losses, and run the risk of OOM kills. + +Without an accurate measure of such contention, users are forced to +either play it safe and under-utilize their hardware resources, or +roll the dice and frequently suffer the disruptions resulting from +excessive overcommit. + +The psi feature identifies and quantifies the disruptions caused by +such resource crunches and the time impact it has on complex workloads +or even entire systems. + +Having an accurate measure of productivity losses caused by resource +scarcity aids users in sizing workloads to hardware--or provisioning +hardware according to workload demand. + +As psi aggregates this information in realtime, systems can be managed +dynamically using techniques such as load shedding, migrating jobs to +other systems or data centers, or strategically pausing or killing low +priority or restartable batch jobs. + +This allows maximizing hardware utilization without sacrificing +workload health or risking major disruptions such as OOM kills. + +Pressure interface +================== + +Pressure information for each resource is exported through the +respective file in /proc/pressure/ -- cpu, memory, and io. + +The format for CPU is as such: + +some avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +and for memory and IO: + +some avg10=0.00 avg60=0.00 avg300=0.00 total=0 +full avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +The "some" line indicates the share of time in which at least some +tasks are stalled on a given resource. + +The "full" line indicates the share of time in which all non-idle +tasks are stalled on a given resource simultaneously. In this state +actual CPU cycles are going to waste, and a workload that spends +extended time in this state is considered to be thrashing. This has +severe impact on performance, and it's useful to distinguish this +situation from a state where some tasks are stalled but the CPU is +still doing productive work. As such, time spent in this subset of the +stall state is tracked separately and exported in the "full" averages. + +The ratios are tracked as recent trends over ten, sixty, and three +hundred second windows, which gives insight into short term events as +well as medium and long term trends. The total absolute stall time is +tracked and exported as well, to allow detection of latency spikes +which wouldn't necessarily make a dent in the time averages, or to +average trends over custom time frames. + +Cgroup2 interface +================= + +In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem +mounted, pressure stall information is also tracked for tasks grouped +into cgroups. Each subdirectory in the cgroupfs mountpoint contains +cpu.pressure, memory.pressure, and io.pressure files; the format is +the same as the /proc/pressure/ files. diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index caf36105a1c7..8384c681a4b2 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -966,6 +966,12 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + cpu.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for CPU. See + Documentation/accounting/psi.txt for details. + Memory ------ @@ -1127,6 +1133,10 @@ PAGE_SIZE multiple when read back. disk readahead. For now OOM in memory cgroup kills tasks iff shortage has happened inside page fault. + This event is not raised if the OOM killer is not + considered as an option, e.g. for failed high-order + allocations. + oom_kill The number of processes belonging to this cgroup killed by any kind of OOM killer. @@ -1271,6 +1281,12 @@ PAGE_SIZE multiple when read back. higher than the limit for an extended period of time. This reduces the impact on the workload and memory management. + memory.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for memory. See + Documentation/accounting/psi.txt for details. + Usage Guidelines ~~~~~~~~~~~~~~~~ @@ -1408,6 +1424,12 @@ IO Interface Files 8:16 rbps=2097152 wbps=max riops=max wiops=max + io.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for IO. See + Documentation/accounting/psi.txt for details. + Writeback ~~~~~~~~~ diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 47ca5cda0eef..b90fe3b6bc6c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4851,6 +4851,18 @@ This is actually a boot loader parameter; the value is passed to the kernel using a special protocol. + vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y. + May slow down system boot speed, especially when + enabled on systems with a large amount of memory. + All options are enabled by default, and this + interface is meant to allow for selectively + enabling or disabling specific virtual memory + debugging features. + + Available options are: + P Enable page structure init time poisoning + - Disable all of the above options + vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase the minimum size (128MB on x86). It can also be used to diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 22b4b00dee31..12a5e6e693b6 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -858,6 +858,7 @@ Writeback: 0 kB AnonPages: 861800 kB Mapped: 280372 kB Shmem: 644 kB +KReclaimable: 168048 kB Slab: 284364 kB SReclaimable: 159856 kB SUnreclaim: 124508 kB @@ -925,6 +926,9 @@ AnonHugePages: Non-file backed huge pages mapped into userspace page tables ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated with huge pages ShmemPmdMapped: Shared memory mapped into userspace with huge pages +KReclaimable: Kernel allocations that the kernel will attempt to reclaim + under memory pressure. Includes SReclaimable (below), and other + direct allocations with a shrinker. Slab: in-kernel data structures cache SReclaimable: Part of Slab, that might be reclaimed, such as caches SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure diff --git a/Documentation/vm/slub.rst b/Documentation/vm/slub.rst index 3a775fd64e2d..195928808bac 100644 --- a/Documentation/vm/slub.rst +++ b/Documentation/vm/slub.rst @@ -36,9 +36,10 @@ debugging is enabled. Format: slub_debug=<Debug-Options> Enable options for all slabs -slub_debug=<Debug-Options>,<slab name> - Enable options only for select slabs +slub_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) Possible debug options are:: @@ -62,7 +63,12 @@ Trying to find an issue in the dentry cache? Try:: slub_debug=,dentry -to only enable debugging on the dentry cache. +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs: + + slub_debug=P,kmalloc-*,dentry Red zoning and tracking may realign the slab. We can just apply sanity checks to the dentry cache with:: diff --git a/Documentation/x86/pat.txt b/Documentation/x86/pat.txt index 2a4ee6302122..481d8d8536ac 100644 --- a/Documentation/x86/pat.txt +++ b/Documentation/x86/pat.txt @@ -90,12 +90,12 @@ pci proc | -- | -- | WC | Advanced APIs for drivers ------------------------- A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range, -vm_insert_pfn +vmf_insert_pfn Drivers wanting to export some pages to userspace do it by using mmap interface and a combination of 1) pgprot_noncached() -2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn() +2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn() With PAT support, a new API pgprot_writecombine is being added. So, drivers can continue to use the above sequence, with either pgprot_noncached() or |