summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/cgroup-v2.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst212
1 files changed, 172 insertions, 40 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 315ede811c9d..bd98ea3175ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-6. Device
5-7. RDMA
5-7-1. RDMA Interface Files
- 5-8. HugeTLB
- 5.8-1. HugeTLB Interface Files
- 5-9. Misc
- 5.9-1 Miscellaneous cgroup Interface Files
- 5.9-2 Migration and Ownership
- 5-10. Others
- 5-10-1. perf_event
+ 5-8. DMEM
+ 5-9. HugeTLB
+ 5.9-1. HugeTLB Interface Files
+ 5-10. Misc
+ 5.10-1 Miscellaneous cgroup Interface Files
+ 5.10-2 Migration and Ownership
+ 5-11. Others
+ 5-11-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1075,33 +1076,53 @@ cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.
-WARNING: cgroup2 doesn't yet support control of realtime processes. For
-a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group
-scheduling of realtime processes, the cpu controller can only be enabled
-when all RT processes are in the root cgroup. This limitation does
-not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system
-management software may already have placed RT processes into nonroot
-cgroups during the system boot process, and these processes may need
-to be moved to the root cgroup before the cpu controller can be enabled
-with a CONFIG_RT_GROUP_SCHED enabled kernel.
+WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of
+realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option
+enabled for group scheduling of realtime processes, the cpu controller can only
+be enabled when all RT processes are in the root cgroup. Be aware that system
+management software may already have placed RT processes into non-root cgroups
+during the system boot process, and these processes may need to be moved to the
+root cgroup before the cpu controller can be enabled with a
+CONFIG_RT_GROUP_SCHED enabled kernel.
+
+With CONFIG_RT_GROUP_SCHED disabled, this limitation does not apply and some of
+the interface files either affect realtime processes or account for them. See
+the following section for details. Only the cpu controller is affected by
+CONFIG_RT_GROUP_SCHED. Other controllers can be used for the resource control of
+realtime processes irrespective of CONFIG_RT_GROUP_SCHED.
CPU Interface Files
~~~~~~~~~~~~~~~~~~~
-All time durations are in microseconds.
+The interaction of a process with the cpu controller depends on its scheduling
+policy and the underlying scheduler. From the point of view of the cpu controller,
+processes can be categorized as follows:
+
+* Processes under the fair-class scheduler
+* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback
+* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler
+ without the ``cgroup_set_weight`` callback
+
+For details on when a process is under the fair-class scheduler or a BPF scheduler,
+check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`.
+
+For each of the following interface files, the above categories
+will be referred to. All time durations are in microseconds.
cpu.stat
A read-only flat-keyed file.
This file exists whether the controller is enabled or not.
- It always reports the following three stats:
+ It always reports the following three stats, which account for all the
+ processes in the cgroup:
- usage_usec
- user_usec
- system_usec
- and the following five when the controller is enabled:
+ and the following five when the controller is enabled, which account for
+ only the processes under the fair-class scheduler:
- nr_periods
- nr_throttled
@@ -1119,6 +1140,10 @@ All time durations are in microseconds.
If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
then the weight will show as a 0.
+ This file affects only processes under the fair-class scheduler and a BPF
+ scheduler with the ``cgroup_set_weight`` callback depending on what the
+ callback actually does.
+
cpu.weight.nice
A read-write single value file which exists on non-root
cgroups. The default is "0".
@@ -1131,6 +1156,10 @@ All time durations are in microseconds.
granularity is coarser for the nice values, the read value is
the closest approximation of the current weight.
+ This file affects only processes under the fair-class scheduler and a BPF
+ scheduler with the ``cgroup_set_weight`` callback depending on what the
+ callback actually does.
+
cpu.max
A read-write two value file which exists on non-root cgroups.
The default is "max 100000".
@@ -1143,43 +1172,55 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ This file affects only processes under the fair-class scheduler.
+
cpu.max.burst
A read-write single value file which exists on non-root
cgroups. The default is "0".
The burst in the range [0, $MAX].
+ This file affects only processes under the fair-class scheduler.
+
cpu.pressure
A read-write nested-keyed file.
Shows pressure stall information for CPU. See
:ref:`Documentation/accounting/psi.rst <psi>` for details.
+ This file accounts for all the processes in the cgroup.
+
cpu.uclamp.min
- A read-write single value file which exists on non-root cgroups.
- The default is "0", i.e. no utilization boosting.
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
- The requested minimum utilization (protection) as a percentage
- rational number, e.g. 12.34 for 12.34%.
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp,
+ including those of realtime processes.
- This interface allows reading and setting minimum utilization clamp
- values similar to the sched_setattr(2). This minimum utilization
- value is used to clamp the task specific minimum utilization clamp.
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
- The requested minimum utilization (protection) is always capped by
- the current value for the maximum utilization (limit), i.e.
- `cpu.uclamp.max`.
+ This file affects all the processes in the cgroup.
cpu.uclamp.max
- A read-write single value file which exists on non-root cgroups.
- The default is "max". i.e. no utilization capping
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max". i.e. no utilization capping
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
- The requested maximum utilization (limit) as a percentage rational
- number, e.g. 98.76 for 98.76%.
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp,
+ including those of realtime processes.
- This interface allows reading and setting maximum utilization clamp
- values similar to the sched_setattr(2). This maximum utilization
- value is used to clamp the task specific maximum utilization clamp.
+ This file affects all the processes in the cgroup.
cpu.idle
A read-write single value file which exists on non-root cgroups.
@@ -1191,7 +1232,7 @@ All time durations are in microseconds.
own relative priorities, but the cgroup itself will be treated as
very low priority relative to its peers.
-
+ This file affects only processes under the fair-class scheduler.
Memory
------
@@ -1293,6 +1334,18 @@ PAGE_SIZE multiple when read back.
monitors the limited cgroup to alleviate heavy reclaim
pressure.
+ If memory.high is opened with O_NONBLOCK then the synchronous
+ reclaim is bypassed. This is useful for admin processes that
+ need to dynamically adjust the job's memory limits without
+ expending their own CPU resources on memory reclamation. The
+ job will trigger the reclaim and/or get throttled on its
+ next charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
+
memory.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
@@ -1310,6 +1363,18 @@ PAGE_SIZE multiple when read back.
Caller could retry them differently, return into userspace
as -ENOMEM or silently ignore in cases like disk readahead.
+ If memory.max is opened with O_NONBLOCK, then the synchronous
+ reclaim and oom-kill are bypassed. This is useful for admin
+ processes that need to dynamically adjust the job's memory limits
+ without expending their own CPU resources on memory reclamation.
+ The job will trigger the reclaim and/or oom-kill on its next
+ charge request.
+
+ Please note that with O_NONBLOCK, there is a chance that the
+ target memory cgroup may take indefinite amount of time to
+ reduce usage below the limit due to delayed charge request or
+ busy-hitting its memory to slow down reclaim.
+
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
@@ -1342,6 +1407,9 @@ The following nested keys are defined.
same semantics as vm.swappiness applied to memcg reclaim with
all the existing limitations and potential future extensions.
+ The valid range for swappiness is [0-200, max], setting
+ swappiness=max exclusively reclaims anonymous memory.
+
memory.peak
A read-write single value file which exists on non-root cgroups.
@@ -1439,7 +1507,10 @@ The following nested keys are defined.
anon
Amount of memory used in anonymous mappings such as
- brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+ brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
+ some kernel configurations might account complete larger
+ allocations (e.g., THP) if only some, but not all the
+ memory of such an allocation is mapped anymore.
file
Amount of memory used to cache filesystem data,
@@ -1482,7 +1553,10 @@ The following nested keys are defined.
Amount of application memory swapped out to zswap.
file_mapped
- Amount of cached filesystem data mapped with mmap()
+ Amount of cached filesystem data mapped with mmap(). Note
+ that some kernel configurations might account complete
+ larger allocations (e.g., THP) if only some, but not
+ not all the memory of such an allocation is mapped.
file_dirty
Amount of cached filesystem data that was modified but
@@ -1554,6 +1628,12 @@ The following nested keys are defined.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
+ pswpin (npn)
+ Number of pages swapped into memory
+
+ pswpout (npn)
+ Number of pages swapped out of memory
+
pgscan (npn)
Amount of scanned pages (in an inactive LRU list)
@@ -1569,6 +1649,9 @@ The following nested keys are defined.
pgscan_khugepaged (npn)
Amount of scanned pages by khugepaged (in an inactive LRU list)
+ pgscan_proactive (npn)
+ Amount of scanned pages proactively (in an inactive LRU list)
+
pgsteal_kswapd (npn)
Amount of reclaimed pages by kswapd
@@ -1578,6 +1661,9 @@ The following nested keys are defined.
pgsteal_khugepaged (npn)
Amount of reclaimed pages by khugepaged
+ pgsteal_proactive (npn)
+ Amount of reclaimed pages proactively
+
pgfault (npn)
Total number of page faults incurred
@@ -1655,6 +1741,9 @@ The following nested keys are defined.
pgdemote_khugepaged
Number of pages demoted by khugepaged.
+ pgdemote_proactive
+ Number of pages demoted by proactively.
+
hugetlb
Amount of memory used by hugetlb pages. This metric only shows
up if hugetlb usage is accounted for in memory.current (i.e.
@@ -2626,6 +2715,49 @@ RDMA Interface Files
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
+DMEM
+----
+
+The "dmem" controller regulates the distribution and accounting of
+device memory regions. Because each memory region may have its own page size,
+which does not have to be equal to the system page size, the units are always bytes.
+
+DMEM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+ dmem.max, dmem.min, dmem.low
+ A readwrite nested-keyed file that exists for all the cgroups
+ except root that describes current configured resource limit
+ for a region.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 1073741824
+ drm/0000:03:00.0/stolen max
+
+ The semantics are the same as for the memory cgroup controller, and are
+ calculated in the same way.
+
+ dmem.capacity
+ A read-only file that describes maximum region capacity.
+ It only exists on the root cgroup. Not all memory can be
+ allocated by cgroups, as the kernel reserves some for
+ internal use.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 8514437120
+ drm/0000:03:00.0/stolen 67108864
+
+ dmem.current
+ A read-only file that describes current resource usage.
+ It exists for all the cgroup except root.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 12550144
+ drm/0000:03:00.0/stolen 8650752
+
HugeTLB
-------
@@ -2949,7 +3081,7 @@ Filesystem Support for Writeback
--------------------------------
A filesystem can support cgroup writeback by updating
-address_space_operations->writepage[s]() to annotate bio's using the
+address_space_operations->writepages() to annotate bio's using the
following two functions.
wbc_init_bio(@wbc, @bio)