From 7caa47151ab2e644dd221f741ec7578d9532c9a3 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Wed, 28 Aug 2019 15:05:58 -0700 Subject: blkcg: implement blk-iocost This patchset implements IO cost model based work-conserving proportional controller. While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the lowest latency target cgroup which is suffering is protected at the cost of all others. In many use cases including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity. One challenge of controlling IO resources is the lack of trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern. However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device. This controller distributes IO capacity based on the costs estimated by such model. The more accurate the cost model the better but the controller adapts based on IO completion latency and as long as the relative costs across differents IO patterns are consistent and sensible, it'll adapt to the actual performance of the device. Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models. Please see the top comment in blk-iocost.c and documentation for more details. v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> --- Documentation/admin-guide/cgroup-v2.rst | 94 +++++++++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) (limited to 'Documentation/admin-guide/cgroup-v2.rst') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3b29005aa981..1521c7e554f5 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1435,6 +1435,100 @@ IO Interface Files 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + io.cost.qos + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the Quality of Service of the IO cost + model based controller (CONFIG_BLK_CGROUP_IOCOST) which + currently implements "io.weight" proportional control. Lines + are keyed by $MAJ:$MIN device numbers and not ordered. The + line for a given device is populated on the first write for + the device on "io.cost.qos" or "io.cost.model". The following + nested keys are defined. + + ====== ===================================== + enable Weight-based control enable + ctrl "auto" or "user" + rpct Read latency percentile [0, 100] + rlat Read latency threshold + wpct Write latency percentile [0, 100] + wlat Write latency threshold + min Minimum scaling percentage [1, 10000] + max Maximum scaling percentage [1, 10000] + ====== ===================================== + + The controller is disabled by default and can be enabled by + setting "enable" to 1. "rpct" and "wpct" parameters default + to zero and the controller uses internal device saturation + state to adjust the overall IO rate between "min" and "max". + + When a better control quality is needed, latency QoS + parameters can be configured. For example:: + + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 + + shows that on sdb, the controller is enabled, will consider + the device saturated if the 95th percentile of read completion + latencies is above 75ms or write 150ms, and adjust the overall + IO issue rate between 50% and 150% accordingly. + + The lower the saturation point, the better the latency QoS at + the cost of aggregate bandwidth. The narrower the allowed + adjustment range between "min" and "max", the more conformant + to the cost model the IO behavior. Note that the IO issue + base rate may be far off from 100% and setting "min" and "max" + blindly can lead to a significant loss of device capacity or + control quality. "min" and "max" are useful for regulating + devices which show wide temporary behavior changes - e.g. a + ssd which accepts writes at the line speed for a while and + then completely stalls for multiple seconds. + + When "ctrl" is "auto", the parameters are controlled by the + kernel and may change automatically. Setting "ctrl" to "user" + or setting any of the percentile and latency parameters puts + it into "user" mode and disables the automatic changes. The + automatic mode can be restored by setting "ctrl" to "auto". + + io.cost.model + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the cost model of the IO cost model based + controller (CONFIG_BLK_CGROUP_IOCOST) which currently + implements "io.weight" proportional control. Lines are keyed + by $MAJ:$MIN device numbers and not ordered. The line for a + given device is populated on the first write for the device on + "io.cost.qos" or "io.cost.model". The following nested keys + are defined. + + ===== ================================ + ctrl "auto" or "user" + model The cost model in use - "linear" + ===== ================================ + + When "ctrl" is "auto", the kernel may change all parameters + dynamically. When "ctrl" is set to "user" or any other + parameters are written to, "ctrl" become "user" and the + automatic changes are disabled. + + When "model" is "linear", the following model parameters are + defined. + + ============= ======================================== + [r|w]bps The maximum sequential IO throughput + [r|w]seqiops The maximum 4k sequential IOs per second + [r|w]randiops The maximum 4k random IOs per second + ============= ======================================== + + From the above, the builtin linear model determines the base + costs of a sequential and random IO and the cost coefficient + for the IO size. While simple, this model can cover most + common device classes acceptably. + + The IO cost model isn't expected to be accurate in absolute + sense and is scaled to the device behavior dynamically. + io.weight A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100". -- cgit v1.2.3