author | Shaohua Li <shli@fb.com> | 2017-03-27 20:51:41 +0300 |
---|---|---|
committer | Jens Axboe <axboe@fb.com> | 2017-03-28 17:02:20 +0300 |
commit | 9e234eeafbe17e85908584392f249f0b329b8e1b (patch) | |
tree | 9d822cd38526ecc8132ffd4f4a720bb53a8eef0f /block/bio.c | |
parent | 7394e31fa440ab7cd20cebd233580b360a7e9ecc (diff) | |
download | linux-9e234eeafbe17e85908584392f249f0b329b8e1b.tar.xz |
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup may never dispatch
enough IO to cross the low limit. In such a case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to the low limit. This is unfair to the other cgroups. We should
treat the cgroup as idle and upgrade the state machine out of LIMIT_LOW.
We also have downgrade logic. If the state machine upgrades because a
cgroup is idle (real idle), the state machine will downgrade soon, since
the cgroup is below its low limit. This isn't what we want. A more
complicated case is a cgroup that isn't idle while the queue is in
LIMIT_LOW. But once the queue is upgraded, other cgroups can dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks idle (fake idle). In this case, the queue
should downgrade soon. The key to deciding whether to downgrade is
to detect whether the cgroup is truly idle.
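The real-idle versus fake-idle distinction can be sketched as a small predicate. This is only an illustration of the reasoning above, not code from blk-throttle: the struct `cgroup_sample` and the function `should_downgrade` are hypothetical names, and the sketch assumes some other mechanism has already decided whether the cgroup is truly idle.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one cgroup after the queue has been upgraded. */
struct cgroup_sample {
	bool below_low_limit;	/* cgroup dispatches less than its low limit */
	bool truly_idle;	/* cgroup has no more IO to issue (real idle) */
};

/*
 * Downgrade only for "fake idle": the cgroup is below its low limit
 * even though it still wants to dispatch IO. A truly idle cgroup is
 * also below its low limit, but downgrading for it would be wrong.
 */
static bool should_downgrade(const struct cgroup_sample *s)
{
	return s->below_low_limit && !s->truly_idle;
}
```

With this predicate, a cgroup that is below its limit and truly idle leaves the queue upgraded, while a cgroup that is below its limit but still busy triggers a downgrade.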
Unfortunately it's very hard to determine if a cgroup is really idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
not idle. But the workload can run at higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup as idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is that think time above
the threshold will start to harm performance. HD is much slower so a
longer think time is ok.
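A minimal userspace sketch of a think time check follows, to make the threshold comparison concrete. It is not the patch's implementation. The struct, the function names, and the 7/8 weighted average are assumptions made for illustration; only the 1ms and 100ms defaults come from the text above.

```c
#include <stdbool.h>

/* Default thresholds from the commit message, expressed in microseconds. */
#define IDLE_THRESHOLD_SSD_US	1000UL		/* 1ms */
#define IDLE_THRESHOLD_HD_US	100000UL	/* 100ms */

/* Hypothetical per-cgroup bookkeeping; not the kernel's data structure. */
struct think_time_state {
	unsigned long last_finish_us;	/* when the most recent IO completed */
	unsigned long avg_think_us;	/* smoothed gap before the next IO */
	unsigned long threshold_us;	/* idle threshold for this device */
};

/* Called when an IO completes: remember the completion time. */
static void record_completion(struct think_time_state *s, unsigned long now_us)
{
	s->last_finish_us = now_us;
}

/*
 * Called when the next IO is submitted: the gap since the last
 * completion is the think time. Fold it into a running average
 * (assumed weighting: 7/8 old value, 1/8 new sample).
 */
static void record_submission(struct think_time_state *s, unsigned long now_us)
{
	unsigned long gap = now_us - s->last_finish_us;

	s->avg_think_us = (s->avg_think_us * 7 + gap) >> 3;
}

/* The cgroup looks idle once its average think time exceeds the threshold. */
static bool looks_idle(const struct think_time_state *s)
{
	return s->avg_think_us > s->threshold_us;
}
```

In this sketch, a workload that waits 5ms between a completion and its next submission crosses the SSD threshold after a couple of samples, but stays far below the HD threshold.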
The patch (and the later patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.
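The precision loss from the shift can be seen in a small worked example. The helper names below are hypothetical and exist only to compare the shift against an exact division.

```c
/* Approximate ns -> us conversion: a shift by 10 divides by 1024, not 1000. */
static unsigned long ns_to_us_approx(unsigned long long ns)
{
	return (unsigned long)(ns >> 10);
}

/* Exact conversion, for comparison only. */
static unsigned long ns_to_us_exact(unsigned long long ns)
{
	return (unsigned long)(ns / 1000);
}
```

For a 1ms interval (1000000 ns), the shift yields 976 where the exact division yields 1000, so the approximation reads about 2.3% low.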
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Diffstat (limited to 'block/bio.c')
-rw-r--r-- | block/bio.c | 2 |
1 files changed, 2 insertions, 0 deletions
diff --git a/block/bio.c b/block/bio.c
index 6194a8cf2aab..f1857c0f0826 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 
 #include <trace/events/block.h>
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1845,6 +1846,7 @@ again:
 		goto again;
 	}
 
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }