diff options
Diffstat (limited to 'Documentation/cgroups/unified-hierarchy.txt')
-rw-r--r-- | Documentation/cgroups/unified-hierarchy.txt | 647 |
1 files changed, 0 insertions, 647 deletions
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt deleted file mode 100644 index 781b1d475bcf..000000000000 --- a/Documentation/cgroups/unified-hierarchy.txt +++ /dev/null @@ -1,647 +0,0 @@ - -Cgroup unified hierarchy - -April, 2014 Tejun Heo <tj@kernel.org> - -This document describes the changes made by unified hierarchy and -their rationales. It will eventually be merged into the main cgroup -documentation. - -CONTENTS - -1. Background -2. Basic Operation - 2-1. Mounting - 2-2. cgroup.subtree_control - 2-3. cgroup.controllers -3. Structural Constraints - 3-1. Top-down - 3-2. No internal tasks -4. Delegation - 4-1. Model of delegation - 4-2. Common ancestor rule -5. Other Changes - 5-1. [Un]populated Notification - 5-2. Other Core Changes - 5-3. Controller File Conventions - 5-3-1. Format - 5-3-2. Control Knobs - 5-4. Per-Controller Changes - 5-4-1. io - 5-4-2. cpuset - 5-4-3. memory -6. Planned Changes - 6-1. CAP for resource control - - -1. Background - -cgroup allows an arbitrary number of hierarchies and each hierarchy -can host any number of controllers. While this seems to provide a -high level of flexibility, it isn't quite useful in practice. - -For example, as there is only one instance of each controller, utility -type controllers such as freezer which can be useful in all -hierarchies can only be used in one. The issue is exacerbated by the -fact that controllers can't be moved around once hierarchies are -populated. Another issue is that all controllers bound to a hierarchy -are forced to have exactly the same view of the hierarchy. It isn't -possible to vary the granularity depending on the specific controller. - -In practice, these issues heavily limit which controllers can be put -on the same hierarchy and most configurations resort to putting each -controller on its own hierarchy. Only closely related ones, such as -the cpu and cpuacct controllers, make sense to put on the same -hierarchy. This often means that userland ends up managing multiple -similar hierarchies repeating the same steps on each hierarchy -whenever a hierarchy management operation is necessary. - -Unfortunately, support for multiple hierarchies comes at a steep cost. -Internal implementation in cgroup core proper is dazzlingly -complicated but more importantly the support for multiple hierarchies -restricts how cgroup is used in general and what controllers can do. - -There's no limit on how many hierarchies there may be, which means -that a task's cgroup membership can't be described in finite length. -The key may contain any varying number of entries and is unlimited in -length, which makes it highly awkward to handle and leads to addition -of controllers which exist only to identify membership, which in turn -exacerbates the original problem. - -Also, as a controller can't have any expectation regarding what shape -of hierarchies other controllers would be on, each controller has to -assume that all other controllers are operating on completely -orthogonal hierarchies. This makes it impossible, or at least very -cumbersome, for controllers to cooperate with each other. - -In most use cases, putting controllers on hierarchies which are -completely orthogonal to each other isn't necessary. What usually is -called for is the ability to have differing levels of granularity -depending on the specific controller. In other words, hierarchy may -be collapsed from leaf towards root when viewed from specific -controllers. For example, a given configuration might not care about -how memory is distributed beyond a certain level while still wanting -to control how CPU cycles are distributed. - -Unified hierarchy is the next version of cgroup interface. It aims to -address the aforementioned issues by having more structure while -retaining enough flexibility for most use cases. Various other -general and controller-specific interface issues are also addressed in -the process. - - -2. Basic Operation - -2-1. Mounting - -Currently, unified hierarchy can be mounted with the following mount -command. Note that this is still under development and scheduled to -change soon. - - mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT - -All controllers which support the unified hierarchy and are not bound -to other hierarchies are automatically bound to unified hierarchy and -show up at the root of it. Controllers which are enabled only in the -root of unified hierarchy can be bound to other hierarchies. This -allows mixing unified hierarchy with the traditional multiple -hierarchies in a fully backward compatible way. - -A controller can be moved across hierarchies only after the controller -is no longer referenced in its current hierarchy. Because per-cgroup -controller states are destroyed asynchronously and controllers may -have lingering references, a controller may not show up immediately on -the unified hierarchy after the final umount of the previous -hierarchy. Similarly, a controller should be fully disabled to be -moved out of the unified hierarchy and it may take some time for the -disabled controller to become available for other hierarchies; -furthermore, due to dependencies among controllers, other controllers -may need to be disabled too. - -While useful for development and manual configurations, dynamically -moving controllers between the unified and other hierarchies is -strongly discouraged for production use. It is recommended to decide -the hierarchies and controller associations before starting using the -controllers. - - -2-2. cgroup.subtree_control - -All cgroups on unified hierarchy have a "cgroup.subtree_control" file -which governs which controllers are enabled on the children of the -cgroup. Let's assume a hierarchy like the following. - - root - A - B - C - \ D - -root's "cgroup.subtree_control" file determines which controllers are -enabled on A. A's on B. B's on C and D. This coincides with the -fact that controllers on the immediate sub-level are used to -distribute the resources of the parent. In fact, it's natural to -assume that resource control knobs of a child belong to its parent. -Enabling a controller in a "cgroup.subtree_control" file declares that -distribution of the respective resources of the cgroup will be -controlled. Note that this means that controller enable states are -shared among siblings. - -When read, the file contains a space-separated list of currently -enabled controllers. A write to the file should contain a -space-separated list of controllers with '+' or '-' prefixed (without -the quotes). Controllers prefixed with '+' are enabled and '-' -disabled. If a controller is listed multiple times, the last entry -wins. The specific operations are executed atomically - either all -succeed or fail. - - -2-3. cgroup.controllers - -Read-only "cgroup.controllers" file contains a space-separated list of -controllers which can be enabled in the cgroup's -"cgroup.subtree_control" file. - -In the root cgroup, this lists controllers which are not bound to -other hierarchies and the content changes as controllers are bound to -and unbound from other hierarchies. - -In non-root cgroups, the content of this file equals that of the -parent's "cgroup.subtree_control" file as only controllers enabled -from the parent can be used in its children. - - -3. Structural Constraints - -3-1. Top-down - -As it doesn't make sense to nest control of an uncontrolled resource, -all non-root "cgroup.subtree_control" files can only contain -controllers which are enabled in the parent's "cgroup.subtree_control" -file. A controller can be enabled only if the parent has the -controller enabled and a controller can't be disabled if one or more -children have it enabled. - - -3-2. No internal tasks - -One long-standing issue that cgroup faces is the competition between -tasks belonging to the parent cgroup and its children cgroups. This -is inherently nasty as two different types of entities compete and -there is no agreed-upon obvious way to handle it. Different -controllers are doing different things. - -The cpu controller considers tasks and cgroups as equivalents and maps -nice levels to cgroup weights. This works for some cases but falls -flat when children should be allocated specific ratios of CPU cycles -and the number of internal tasks fluctuates - the ratios constantly -change as the number of competing entities fluctuates. There also are -other issues. The mapping from nice level to weight isn't obvious or -universal, and there are various other knobs which simply aren't -available for tasks. - -The io controller implicitly creates a hidden leaf node for each -cgroup to host the tasks. The hidden leaf has its own copies of all -the knobs with "leaf_" prefixed. While this allows equivalent control -over internal tasks, it's with serious drawbacks. It always adds an -extra layer of nesting which may not be necessary, makes the interface -messy and significantly complicates the implementation. - -The memory controller currently doesn't have a way to control what -happens between internal tasks and child cgroups and the behavior is -not clearly defined. There have been attempts to add ad-hoc behaviors -and knobs to tailor the behavior to specific workloads. Continuing -this direction will lead to problems which will be extremely difficult -to resolve in the long term. - -Multiple controllers struggle with internal tasks and came up with -different ways to deal with it; unfortunately, all the approaches in -use now are severely flawed and, furthermore, the widely different -behaviors make cgroup as whole highly inconsistent. - -It is clear that this is something which needs to be addressed from -cgroup core proper in a uniform way so that controllers don't need to -worry about it and cgroup as a whole shows a consistent and logical -behavior. To achieve that, unified hierarchy enforces the following -structural constraint: - - Except for the root, only cgroups which don't contain any task may - have controllers enabled in their "cgroup.subtree_control" files. - -Combined with other properties, this guarantees that, when a -controller is looking at the part of the hierarchy which has it -enabled, tasks are always only on the leaves. This rules out -situations where child cgroups compete against internal tasks of the -parent. - -There are two things to note. Firstly, the root cgroup is exempt from -the restriction. Root contains tasks and anonymous resource -consumption which can't be associated with any other cgroup and -requires special treatment from most controllers. How resource -consumption in the root cgroup is governed is up to each controller. - -Secondly, the restriction doesn't take effect if there is no enabled -controller in the cgroup's "cgroup.subtree_control" file. This is -important as otherwise it wouldn't be possible to create children of a -populated cgroup. To control resource distribution of a cgroup, the -cgroup must create children and transfer all its tasks to the children -before enabling controllers in its "cgroup.subtree_control" file. - - -4. Delegation - -4-1. Model of delegation - -A cgroup can be delegated to a less privileged user by granting write -access of the directory and its "cgroup.procs" file to the user. Note -that the resource control knobs in a given directory concern the -resources of the parent and thus must not be delegated along with the -directory. - -Once delegated, the user can build sub-hierarchy under the directory, -organize processes as it sees fit and further distribute the resources -it got from the parent. The limits and other settings of all resource -controllers are hierarchical and regardless of what happens in the -delegated sub-hierarchy, nothing can escape the resource restrictions -imposed by the parent. - -Currently, cgroup doesn't impose any restrictions on the number of -cgroups in or nesting depth of a delegated sub-hierarchy; however, -this may in the future be limited explicitly. - - -4-2. Common ancestor rule - -On the unified hierarchy, to write to a "cgroup.procs" file, in -addition to the usual write permission to the file and uid match, the -writer must also have write access to the "cgroup.procs" file of the -common ancestor of the source and destination cgroups. This prevents -delegatees from smuggling processes across disjoint sub-hierarchies. - -Let's say cgroups C0 and C1 have been delegated to user U0 who created -C00, C01 under C0 and C10 under C1 as follows. - - ~~~~~~~~~~~~~ - C0 - C00 - ~ cgroup ~ \ C01 - ~ hierarchy ~ - ~~~~~~~~~~~~~ - C1 - C10 - -C0 and C1 are separate entities in terms of resource distribution -regardless of their relative positions in the hierarchy. The -resources the processes under C0 are entitled to are controlled by -C0's ancestors and may be completely different from C1. It's clear -that the intention of delegating C0 to U0 is allowing U0 to organize -the processes under C0 and further control the distribution of C0's -resources. - -On traditional hierarchies, if a task has write access to "tasks" or -"cgroup.procs" file of a cgroup and its uid agrees with the target, it -can move the target to the cgroup. In the above example, U0 will not -only be able to move processes in each sub-hierarchy but also across -the two sub-hierarchies, effectively allowing it to violate the -organizational and resource restrictions implied by the hierarchical -structure above C0 and C1. - -On the unified hierarchy, let's say U0 wants to write the pid of a -process which has a matching uid and is currently in C10 into -"C00/cgroup.procs". U0 obviously has write access to the file and -migration permission on the process; however, the common ancestor of -the source cgroup C10 and the destination cgroup C00 is above the -points of delegation and U0 would not have write access to its -"cgroup.procs" and thus be denied with -EACCES. - - -5. Other Changes - -5-1. [Un]populated Notification - -cgroup users often need a way to determine when a cgroup's -subhierarchy becomes empty so that it can be cleaned up. cgroup -currently provides release_agent for it; unfortunately, this mechanism -is riddled with issues. - -- It delivers events by forking and execing a userland binary - specified as the release_agent. This is a long deprecated method of - notification delivery. It's extremely heavy, slow and cumbersome to - integrate with larger infrastructure. - -- There is single monitoring point at the root. There's no way to - delegate management of a subtree. - -- The event isn't recursive. It triggers when a cgroup doesn't have - any tasks or child cgroups. Events for internal nodes trigger only - after all children are removed. This again makes it impossible to - delegate management of a subtree. - -- Events are filtered from the kernel side. A "notify_on_release" - file is used to subscribe to or suppress release events. This is - unnecessarily complicated and probably done this way because event - delivery itself was expensive. - -Unified hierarchy implements "populated" field in "cgroup.events" -interface file which can be used to monitor whether the cgroup's -subhierarchy has tasks in it or not. Its value is 0 if there is no -task in the cgroup and its descendants; otherwise, 1. poll and -[id]notify events are triggered when the value changes. - -This is significantly lighter and simpler and trivially allows -delegating management of subhierarchy - subhierarchy monitoring can -block further propagation simply by putting itself or another process -in the subhierarchy and monitor events that it's interested in from -there without interfering with monitoring higher in the tree. - -In unified hierarchy, the release_agent mechanism is no longer -supported and the interface files "release_agent" and -"notify_on_release" do not exist. - - -5-2. Other Core Changes - -- None of the mount options is allowed. - -- remount is disallowed. - -- rename(2) is disallowed. - -- The "tasks" file is removed. Everything should at process - granularity. Use the "cgroup.procs" file instead. - -- The "cgroup.procs" file is not sorted. pids will be unique unless - they got recycled in-between reads. - -- The "cgroup.clone_children" file is removed. - -- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged - to before exiting. If the cgroup is removed before the zombie is - reaped, " (deleted)" is appeneded to the path. - - -5-3. Controller File Conventions - -5-3-1. Format - -In general, all controller files should be in one of the following -formats whenever possible. - -- Values only files - - VAL0 VAL1...\n - -- Flat keyed files - - KEY0 VAL0\n - KEY1 VAL1\n - ... - -- Nested keyed files - - KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... - KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... - ... - -For a writeable file, the format for writing should generally match -reading; however, controllers may allow omitting later fields or -implement restricted shortcuts for most common use cases. - -For both flat and nested keyed files, only the values for a single key -can be written at a time. For nested keyed files, the sub key pairs -may be specified in any order and not all pairs have to be specified. - - -5-3-2. Control Knobs - -- Settings for a single feature should generally be implemented in a - single file. - -- In general, the root cgroup should be exempt from resource control - and thus shouldn't have resource control knobs. - -- If a controller implements ratio based resource distribution, the - control knob should be named "weight" and have the range [1, 10000] - and 100 should be the default value. The values are chosen to allow - enough and symmetric bias in both directions while keeping it - intuitive (the default is 100%). - -- If a controller implements an absolute resource guarantee and/or - limit, the control knobs should be named "min" and "max" - respectively. If a controller implements best effort resource - gurantee and/or limit, the control knobs should be named "low" and - "high" respectively. - - In the above four control files, the special token "max" should be - used to represent upward infinity for both reading and writing. - -- If a setting has configurable default value and specific overrides, - the default settings should be keyed with "default" and appear as - the first entry in the file. Specific entries can use "default" as - its value to indicate inheritance of the default value. - -- For events which are not very high frequency, an interface file - "events" should be created which lists event key value pairs. - Whenever a notifiable event happens, file modified event should be - generated on the file. - - -5-4. Per-Controller Changes - -5-4-1. io - -- blkio is renamed to io. The interface is overhauled anyway. The - new name is more in line with the other two major controllers, cpu - and memory, and better suited given that it may be used for cgroup - writeback without involving block layer. - -- Everything including stat is always hierarchical making separate - recursive stat files pointless and, as no internal node can have - tasks, leaf weights are meaningless. The operation model is - simplified and the interface is overhauled accordingly. - - io.stat - - The stat file. The reported stats are from the point where - bio's are issued to request_queue. The stats are counted - independent of which policies are enabled. Each line in the - file follows the following format. More fields may later be - added at the end. - - $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS - - io.weight - - The weight setting, currently only available and effective if - cfq-iosched is in use for the target device. The weight is - between 1 and 10000 and defaults to 100. The first line - always contains the default weight in the following format to - use when per-device setting is missing. - - default $WEIGHT - - Subsequent lines list per-device weights of the following - format. - - $MAJ:$MIN $WEIGHT - - Writing "$WEIGHT" or "default $WEIGHT" changes the default - setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight - while "$MAJ:$MIN default" clears it. - - This file is available only on non-root cgroups. - - io.max - - The maximum bandwidth and/or iops setting, only available if - blk-throttle is enabled. The file is of the following format. - - $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS - - ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are - read/write IOs per second. "max" indicates no limit. Writing - to the file follows the same format but the individual - settings may be omitted or specified in any order. - - This file is available only on non-root cgroups. - - -5-4-2. cpuset - -- Tasks are kept in empty cpusets after hotplug and take on the masks - of the nearest non-empty ancestor, instead of being moved to it. - -- A task can be moved into an empty cpuset, and again it takes on the - masks of the nearest non-empty ancestor. - - -5-4-3. memory - -- use_hierarchy is on by default and the cgroup file for the flag is - not created. - -- The original lower boundary, the soft limit, is defined as a limit - that is per default unset. As a result, the set of cgroups that - global reclaim prefers is opt-in, rather than opt-out. The costs - for optimizing these mostly negative lookups are so high that the - implementation, despite its enormous size, does not even provide the - basic desirable behavior. First off, the soft limit has no - hierarchical meaning. All configured groups are organized in a - global rbtree and treated like equal peers, regardless where they - are located in the hierarchy. This makes subtree delegation - impossible. Second, the soft limit reclaim pass is so aggressive - that it not just introduces high allocation latencies into the - system, but also impacts system performance due to overreclaim, to - the point where the feature becomes self-defeating. - - The memory.low boundary on the other hand is a top-down allocated - reserve. A cgroup enjoys reclaim protection when it and all its - ancestors are below their low boundaries, which makes delegation of - subtrees possible. Secondly, new cgroups have no reserve per - default and in the common case most cgroups are eligible for the - preferred reclaim pass. This allows the new low boundary to be - efficiently implemented with just a minor addition to the generic - reclaim code, without the need for out-of-band data structures and - reclaim passes. Because the generic reclaim code considers all - cgroups except for the ones running low in the preferred first - reclaim pass, overreclaim of individual groups is eliminated as - well, resulting in much better overall workload performance. - -- The original high boundary, the hard limit, is defined as a strict - limit that can not budge, even if the OOM killer has to be called. - But this generally goes against the goal of making the most out of - the available memory. The memory consumption of workloads varies - during runtime, and that requires users to overcommit. But doing - that with a strict upper limit requires either a fairly accurate - prediction of the working set size or adding slack to the limit. - Since working set size estimation is hard and error prone, and - getting it wrong results in OOM kills, most users tend to err on the - side of a looser limit and end up wasting precious resources. - - The memory.high boundary on the other hand can be set much more - conservatively. When hit, it throttles allocations by forcing them - into direct reclaim to work off the excess, but it never invokes the - OOM killer. As a result, a high boundary that is chosen too - aggressively will not terminate the processes, but instead it will - lead to gradual performance degradation. The user can monitor this - and make corrections until the minimal memory footprint that still - gives acceptable performance is found. - - In extreme cases, with many concurrent allocations and a complete - breakdown of reclaim progress within the group, the high boundary - can be exceeded. But even then it's mostly better to satisfy the - allocation from the slack available in other groups or the rest of - the system than killing the group. Otherwise, memory.max is there - to limit this type of spillover and ultimately contain buggy or even - malicious applications. - -- The original control file names are unwieldy and inconsistent in - many different ways. For example, the upper boundary hit count is - exported in the memory.failcnt file, but an OOM event count has to - be manually counted by listening to memory.oom_control events, and - lower boundary / soft limit events have to be counted by first - setting a threshold for that value and then counting those events. - Also, usage and limit files encode their units in the filename. - That makes the filenames very long, even though this is not - information that a user needs to be reminded of every time they type - out those names. - - To address these naming issues, as well as to signal clearly that - the new interface carries a new configuration model, the naming - conventions in it necessarily differ from the old interface. - -- The original limit files indicate the state of an unset limit with a - Very High Number, and a configured limit can be unset by echoing -1 - into those files. But that very high number is implementation and - architecture dependent and not very descriptive. And while -1 can - be understood as an underflow into the highest possible value, -2 or - -10M etc. do not work, so it's not consistent. - - memory.low, memory.high, and memory.max will use the string "max" to - indicate and set the highest possible value. - -6. Planned Changes - -6-1. CAP for resource control - -Unified hierarchy will require one of the capabilities(7), which is -yet to be decided, for all resource control related knobs. Process -organization operations - creation of sub-cgroups and migration of -processes in sub-hierarchies may be delegated by changing the -ownership and/or permissions on the cgroup directory and -"cgroup.procs" interface file; however, all operations which affect -resource control - writes to a "cgroup.subtree_control" file or any -controller-specific knobs - will require an explicit CAP privilege. - -This, in part, is to prevent the cgroup interface from being -inadvertently promoted to programmable API used by non-privileged -binaries. cgroup exposes various aspects of the system in ways which -aren't properly abstracted for direct consumption by regular programs. -This is an administration interface much closer to sysctl knobs than -system calls. Even the basic access model, being filesystem path -based, isn't suitable for direct consumption. There's no way to -access "my cgroup" in a race-free way or make multiple operations -atomic against migration to another cgroup. - -Another aspect is that, for better or for worse, the cgroup interface -goes through far less scrutiny than regular interfaces for -unprivileged userland. The upside is that cgroup is able to expose -useful features which may not be suitable for general consumption in a -reasonable time frame. It provides a relatively short path between -internal details and userland-visible interface. Of course, this -shortcut comes with high risk. We go through what we go through for -general kernel APIs for good reasons. It may end up leaking internal -details in a way which can exert significant pain by locking the -kernel into a contract that can't be maintained in a reasonable -manner. - -Also, due to the specific nature, cgroup and its controllers don't -tend to attract attention from a wide scope of developers. cgroup's -short history is already fraught with severely mis-designed -interfaces, unnecessary commitments to and exposing of internal -details, broken and dangerous implementations of various features. - -Keeping cgroup as an administration interface is both advantageous for -its role and imperative given its nature. Some of the cgroup features -may make sense for unprivileged access. If deemed justified, those -must be further abstracted and implemented as a different interface, -be it a system call or process-private filesystem, and survive through -the scrutiny that any interface for general consumption is required to -go through. - -Requiring CAP is not a complete solution but should serve as a -significant deterrent against spraying cgroup usages in non-privileged -programs. |