summaryrefslogtreecommitdiff
path: root/block/blk-iocost.c
AgeCommit message (Collapse)AuthorFilesLines
2022-02-11block: partition include/linux/blk-cgroup.hMing Lei1-1/+1
Partition include/linux/blk-cgroup.h into two parts: one is public part, the other is block layer private part. Suggested by Christoph Hellwig. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-14iocost: Fix divide-by-zero on donation from low hweight cgroupTejun Heo1-1/+8
The donation calculation logic assumes that the donor has non-zero after-donation hweight, so the lowest active hweight a donating cgroup can have is 2 so that it can donate 1 while keeping the other 1 for itself. Earlier, we only donated from cgroups with sizable surpluses so this condition was always true. However, with the precise donation algorithm implemented, f1de2439ec43 ("blk-iocost: revamp donation amount determination") made the donation amount calculation exact enabling even low hweight cgroups to donate. This means that in rare occasions, a cgroup with active hweight of 1 can enter donation calculation triggering the following warning and then a divide-by-zero oops. WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867] ... RIP: 0010:transfer_surpluses.cold+0x0/0x53 Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95 ff 92 ff <0f> 0b 48 c7 c7 30 da ab b5 e8 71 3f fe ff 4c 89 e8 4d 85 ed 74 0 4 ... Call Trace: <IRQ> ioc_timer_fn+0x1043/0x1390 call_timer_fn+0xa1/0x2c0 __run_timers.part.0+0x1ec/0x2e0 run_timer_softirq+0x35/0x70 ... iocg: invalid donation weights in /a/b: active=1 donating=1 after=0 Fix it by excluding cgroups w/ active hweight < 2 from donating. Excluding these extreme low hweight donations shouldn't affect work conservation in any meaningful way. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: f1de2439ec43 ("blk-iocost: revamp donation amount determination") Cc: stable@vger.kernel.org # v5.10+ Link: https://lore.kernel.org/r/Ybfh86iSvpWKxhVM@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: convert the rest of block to bdev_get_queuePavel Begunkov1-6/+6
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached queue pointer and so is faster. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds1-14/+9
Pull block updates from Jens Axboe: "Nothing major in here - lots of good cleanups and tech debt handling, which is also evident in the diffstats. In particular: - Add disk sequence numbers (Matteo) - Discard merge fix (Ming) - Relax disk zoned reporting restrictions (Niklas) - Bio error handling zoned leak fix (Pavel) - Start of proper add_disk() error handling (Luis, Christoph) - blk crypto fix (Eric) - Non-standard GPT location support (Dmitry) - IO priority improvements and cleanups (Damien)o - blk-throtl improvements (Chunguang) - diskstats_show() stack reduction (Abd-Alrhman) - Loop scheduler selection (Bart) - Switch block layer to use kmap_local_page() (Christoph) - Remove obsolete disk_name helper (Christoph) - block_device refcounting improvements (Christoph) - Ensure gendisk always has a request queue reference (Christoph) - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)" * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits) sg: pass the device name to blk_trace_setup block, bfq: cleanup the repeated declaration blk-crypto: fix check for too-large dun_bytes blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN blk-zoned: allow zone management send operations without CAP_SYS_ADMIN block: mark blkdev_fsync static block: refine the disk_live check in del_gendisk mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA mmc: block: Support alternative_gpt_sector() operation partitions/efi: Support non-standard GPT location block: Add alternative_gpt_sector() operation bio: fix page leak bio_add_hw_page failure block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT block: remove a pointless call to MINOR() in device_add_disk null_blk: add error handling support for add_disk() virtio_blk: add error handling support for add_disk() block: add error handling for device_add_disk / add_disk block: return errors from disk_alloc_events block: return errors from blk_integrity_add block: call blk_register_queue earlier in device_add_disk ...
2021-08-16blk-cgroup: stop using seq_get_bufChristoph Hellwig1-14/+9
seq_get_buf is a crutch that undoes all the memory safety of the seq_file interface. Use the normal seq_printf interfaces instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10blk-iocost: fix lockdep warning on blkcg->lockMing Lei1-4/+4
blkcg->lock depends on q->queue_lock which may depend on another driver lock required in irq context, one example is dm-thin: Chain exists of: &pool->lock#3 --> &q->queue_lock --> &blkcg->lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&blkcg->lock); local_irq_disable(); lock(&pool->lock#3); lock(&q->queue_lock); <Interrupt> lock(&pool->lock#3); Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write(). Cc: Tejun Heo <tj@kernel.org> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-28blk-iocost: fix operation ordering in iocg_wake_fn()Tejun Heo1-5/+6
iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it wants the wq_entry to be always removed whether it ended up waking the task or not. finish_wait() tests whether wq_entry needs removal without grabbing the wait_queue lock and expects the waker to use list_del_init_careful() after all waking operations are complete, which iocg_wake_fn() didn't do. The operation order was wrong and the regular list_del_init() was used. The result is that if a waiter wakes up racing the waker, it can free pop the wq_entry off stack before the waker is still looking at it, which can lead to a backtrace like the following. [7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP ... [7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0 ... [7312084.858314] Call Trace: [7312084.863548] _raw_spin_lock_irqsave+0x22/0x30 [7312084.872605] try_to_wake_up+0x4c/0x4f0 [7312084.880444] iocg_wake_fn+0x71/0x80 [7312084.887763] __wake_up_common+0x71/0x140 [7312084.895951] iocg_kick_waitq+0xe8/0x2b0 [7312084.903964] ioc_rqos_throttle+0x275/0x650 [7312084.922423] __rq_qos_throttle+0x20/0x30 [7312084.930608] blk_mq_make_request+0x120/0x650 [7312084.939490] generic_make_request+0xca/0x310 [7312084.957600] submit_bio+0x173/0x200 [7312084.981806] swap_readpage+0x15c/0x240 [7312084.989646] read_swap_cache_async+0x58/0x60 [7312084.998527] swap_cluster_readahead+0x201/0x320 [7312085.023432] swapin_readahead+0x2df/0x450 [7312085.040672] do_swap_page+0x52f/0x820 [7312085.058259] handle_mm_fault+0xa16/0x1420 [7312085.066620] do_page_fault+0x2c6/0x5c0 [7312085.074459] page_fault+0x2f/0x40 Fix it by switching to list_del_init_careful() and putting it at the end. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rik van Riel <riel@surriel.com> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-12blk-iocost: fix weight updates of inner active iocgsTejun Heo1-2/+12
When the weight of an active iocg is updated, weight_updated() is called which in turn calls __propagate_weights() to update the active and inuse weights so that the effective hierarchical weights are update accordingly. The current implementation is incorrect for inner active nodes. For an active leaf iocg, inuse can be any value between 1 and active and the difference represents how much the iocg is donating. When weight is updated, as long as inuse is clamped between 1 and the new weight, we're alright and this is what __propagate_weights() currently implements. However, that's not how an active inner node's inuse is set. An inner node's inuse is solely determined by the ratio between the sums of inuse's and active's of its children - ie. they're results of propagating the leaves' active and inuse weights upwards. __propagate_weights() incorrectly applies the same clamping as for a leaf when an active inner node's weight is updated. Consider a hierarchy which looks like the following with saturating workloads in AA and BB. R / \ A B | | AA BB 1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5. 2. echo 200 > A/io.weight 3. __propagate_weights() update A's active to 200 and leave inuse at 100 as it's already between 1 and the new active, making A:active=200, A:inuse=100. As R's active_sum is updated along with A's active, A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the hwi's remain unchanged at 0.5. 4. The weight of A is now twice that of B but AA and BB still have the same hwi of 0.5 and thus are doing the same amount of IOs. Fix it by making __propgate_weights() always calculate the inuse of an active inner iocg based on the ratio of child_inuse_sum to child_active_sum. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Schatzberg <dschatzberg@fb.com> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-26blk-iocost: don't ignore vrate_min on QD contentionTejun Heo1-4/+0
ioc_adjust_base_vrate() ignored vrate_min when rq_wait_pct indicates that there is QD contention. The reasoning was that QD depletion always reliably indicates device saturation and thus it's safe to override user specified vrate_min. However, this sometimes leads to unnecessary throttling, especially on really fast devices, because vrate adjustments have delays and inertia. It also confuses users because the behavior violates the explicitly specified configuration. This patch drops the special case handling so that vrate_min is always applied. Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/YIIo1HuyNmhDeiNx@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05blk-iocost: fix NULL iocg deref from racing against initializationTejun Heo1-5/+11
When initializing iocost for a queue, its rqos should be registered before the blkcg policy is activated to allow policy data initiailization to lookup the associated ioc. This unfortunately means that the rqos methods can be called on bios before iocgs are attached to all existing blkgs. While the race is theoretically possible on ioc_rqos_throttle(), it mostly happened in ioc_rqos_merge() due to the difference in how they lookup ioc. The former determines it from the passed in @rqos and then bails before dereferencing iocg if the looked up ioc is disabled, which most likely is the case if initialization is still in progress. The latter looked up ioc by dereferencing the possibly NULL iocg making it a lot more prone to actually triggering the bug. * Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look up ioc for consistency. * Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before dereferencing it. * Explain the danger of NULL iocgs in blk_iocost_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jonathan Lemon <bsd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17blk-iocost: Add iocg idle state tracepointBaolin Wang1-0/+3
It will be helpful to trace the iocg's whole state, including active and idle state. And we can easily expand the original iocost_iocg_activate trace event to support a state trace class, including active and idle state tracing. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07blk-iocost: Factor out the base vrate change into a separate functionBaolin Wang1-45/+54
Factor out the base vrate change code into a separate function to fimplify the ioc_timer_fn(). No functional change. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07blk-iocost: Factor out the active iocgs' state check into a separate functionBaolin Wang1-40/+54
Factor out the iocgs' state check into a separate function to simplify the ioc_timer_fn(). No functional change. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07blk-iocost: Move the usage ratio calculation to the correct placeBaolin Wang1-21/+23
We only use the hweight based usage ratio to calculate the new hweight_inuse of the iocg to decide if this iocg can donate some surplus vtime. Thus move the usage ratio calculation to the correct place to avoid unnecessary calculation for some vtime shortage iocgs. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07blk-iocost: Remove unnecessary advance declarationBaolin Wang1-2/+0
Remove unnecessary advance declaration of struct ioc_gq. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07blk-iocost: Fix some typos in commentsBaolin Wang1-6/+6
Fix some typos in comments. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: simplify bdev/disk lookup in blkdev_getChristoph Hellwig1-18/+18
To simplify block device lookup and a few other upcoming areas, make sure that we always have a struct block_device available for each disk and each partition, and only find existing block devices in bdget. The only downside of this is that each device and partition uses a little more memory. The upside will be that a lot of code can be simplified. With that all we need to look up the block device is to lookup the inode and do a few sanity checks on the gendisk, instead of the separate lookup for the gendisk. For blk-cgroup which wants to access a gendisk without opening it, a new blkdev_{get,put}_no_open low-level interface is added to replace the previous get_gendisk use. Note that the change to look up block device directly instead of the two step lookup using struct gendisk causes a subtile change in behavior: accessing a non-existing partition on an existing block device can now cause a call to request_module. That call is harmless, and in practice no recent system will access these nodes as they aren't created by udev and static /dev/ setups are unusual. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09block: use helper function to test queue registerYufen Yu1-1/+1
We have defined common interface blk_queue_registered() to test QUEUE_FLAG_REGISTERED. Just use it. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08block: Remove redundant 'return' statementBaolin Wang1-1/+1
Remove redundant 'return' statement for 'void' functions. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: consider iocgs with active delays for debt forgivenessTejun Heo1-4/+7
An iocg may have 0 debt but non-zero delay. The current debt forgiveness logic doesn't act on such iocgs. This can lead to unexpected behaviors - an iocg with a little bit of debt will have its delay canceled through debt forgiveness but one w/o any debt but active delay will have to wait out until its delay decays out. This patch updates the debt handling logic so that it treats delays the same as debts. If either debt or delay is active, debt forgiveness logic kicks in and acts on both the same way. Also, avoid turning the debt and delay directly to zero as that can confuse state transitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: add iocg_forgive_debt tracepointTejun Heo1-0/+12
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: reimplement debt forgiveness using average usageTejun Heo1-25/+69
Debt forgiveness logic was counting the number of consecutive !busy periods as the trigger condition. While this usually works, it can easily be thrown off by temporary fluctuations especially on configurations w/ short periods. This patch reimplements debt forgiveness so that: * Use the average usage over the forgiveness period instead of counting consecutive periods. * Debt is reduced at around the target rate (1/2 every 100ms) regardless of ioc period duration. * Usage threshold is raised to 50%. Combined with the preceding changes and the switch to average usage, this makes debt forgivness a lot more effective at reducing the amount of unnecessary idleness. * Constants are renamed with DFGV_ prefix. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: recalculate delay after debt reductionTejun Heo1-1/+3
Debt sets the initial delay duration which is decayed over time. The current debt reduction halved the debt but didn't change the delay. It prevented future debts from increasing delay but didn't do anything to lower the existing delay, limiting the mechanism's ability to reduce unnecessary idling. Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq() recalculates new delay value based on the reduced debt amount. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level oneTejun Heo1-8/+5
Debt reduction was blocked if any iocg was short on budget in the past period to avoid reducing debts while some iocgs are saturated. However, this ends up unnecessarily blocking debt reduction due to temporary local imbalances when the device is generally being underutilized, while also failing to block when the underlying device is overwhelmed and the usage becomes low from high latency. Given that debt accumulation mostly happens with swapout bursts which can significantly deteriorate the underlying device's latency response, the current logic is not great. Let's replace it with ioc->busy_level based condition so that we block debt reduction when the underlying device is being saturated. ioc_forgive_debts() call is moved after busy_level determination. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25iocost: factor out ioc_forgive_debts()Tejun Heo1-31/+35
Debt reduction logic is going to be improved and expanded. Factor it out into ioc_forgive_debts() and generalize the comment a bit. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-15iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()Tejun Heo1-3/+9
adjust_inuse_and_calc_cost() is responsible for reducing the amount of donated weights dynamically in period as the budget runs low. Because we don't want to do full donation calculation in period, we keep latching up inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the resulting hweight_inuse is satisfactory. Unfortunately, the adj_step calculation was reading the active weight before acquiring ioc->lock. Because the current thread could have lost race to activate the iocg to another thread before entering this function, it may read the active weight as zero before acquiring ioc->lock. When this happens, the adj_step is calculated as zero and the incremental adjustment loop becomes an infinite one. Fix it by fetching the active weight after acquiring ioc->lock. Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-12blk-iocost: fix divide-by-zero in transfer_surpluses()Tejun Heo1-4/+10
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but all hweight calculations round up and thus it may end up >= WEIGHT_ONE triggering divide-by-zero and other issues. Bound the value to avoid surprises. Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: add three debug stat - cost.wait, indebt and indelayTejun Heo1-5/+72
These are really cheap to collect and can be useful in debugging iocost behavior. Add them as debug stats for now. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: restore inuse update tracepointsTejun Heo1-0/+16
Update and restore the inuse update tracepoints. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: implement vtime loss compensationTejun Heo1-42/+90
When an iocg accumulates too much vtime or gets deactivated, we throw away some vtime, which lowers the overall device utilization. As the exact amount which is being thrown away is known, we can compensate by accelerating the vrate accordingly so that the extra vtime generated in the current period matches what got lost. This significantly improves work conservation when involving high weight cgroups with intermittent and bursty IO patterns. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: halve debts if device stays idleTejun Heo1-1/+48
A low weight iocg can amass a large amount of debt, for example, when anonymous memory gets reclaimed aggressively. If the system has a lot of memory paired with a slow IO device, the debt can span multiple seconds or more. If there are no other subsequent IO issuers, the in-debt iocg may end up blocked paying its debt while the IO device is idle. This patch implements a mechanism to protect against such pathological cases. If the device has been sufficiently idle for a substantial amount of time, the debts are halved. The criteria are on the conservative side as we want to resolve the rare extreme cases without impacting regular operation by forgiving debts too readily. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: implement delay adjustment hysteresisTejun Heo1-50/+69
Curently, iocost syncs the delay duration to the outstanding debt amount, which seemed enough to protect the system from anon memory hogs. However, that was mostly because the delay calcuation was using hweight_inuse which quickly converges towards zero under debt for delay duration calculation, often pusnishing debtors overly harshly for longer than deserved. The previous patch fixed the delay calcuation and now the protection against anonymous memory hogs isn't enough because the effect of delay is indirect and non-linear and a huge amount of future debt can accumulate abruptly while unthrottled. This patch implements delay hysteresis so that delay is decayed exponentially over time instead of getting cleared immediately as debt is paid off. While the overall behavior is similar to the blk-cgroup implementation used by blk-iolatency, a lot of the details are different and due to the empirical nature of the mechanism, it's challenging to adapt the mechanism for one controller without negatively impacting the other. As the delay is gradually decayed now, there's no point in running it from its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and the dedicated hrtimer is removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: revamp debt handlingTejun Heo1-24/+93
Debt handling had several issues. * How much inuse a debtor carries wasn't clearly defined. inuse would be driven down over time from not issuing IOs but it'd be better to clamp it to minimum immediately once in debt. * How much can be paid off was determined by hweight_inuse. As inuse was driven down, the payment amount would fall together regardless of the debtor's active weight. This means that the debtors were punished harshly. * ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after iocg_kick_delay(). This patch revamps debt handling so that * Debt handling owns inuse for iocgs in debt and keeps them at zero. * Payment amount is determined by hweight_active. This is more deterministic and safer than hweight_inuse but still far from ideal in that it doesn't factor in possible donations from other iocgs for debt payments. This likely needs further improvements in the future. * iocg_rqos_merge() now calls blkcg_schedule_throttle() as necessary. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: revamp in-period donation snapbacksTejun Heo1-37/+96
When the margin drops below the minimum on a donating iocg, donation is immediately canceled in full. There are a couple shortcomings with the current behavior. * It's abrupt. A small temporary budget deficit can lead to a wide swing in weight allocation and a large surplus. * It's open coded in the issue path but not implemented for the merge path. A series of merges at a low inuse can make the iocg incur debts and stall incorrectly. This patch reimplements in-period donation snapbacks so that * inuse adjustment and cost calculations are factored into adjust_inuse_and_calc_cost() which is called from both the issue and merge paths. * Snapbacks are more gradual. It occurs in quarter steps. * A snapback triggers if the margin goes below the low threshold and is lower than the budget at the time of the last adjustment. * For the above, __propagate_weights() stores the margin in iocg->saved_margin. Move iocg->last_inuse storing together into __propagate_weights() for consistency. * Full snapback is guaranteed when there are waiters. * With precise donation and gradual snapbacks, inuse adjustments are now a lot more effective and the value of scaling inuse on weight changes isn't clear. Removed inuse scaling from weight_update(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: revamp donation amount determinationTejun Heo1-82/+51
iocost has various safety nets to combat inuse adjustment calculation inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse adjustment calculations are now accurate and we can make donation amount determinations accurate too. * Stop keeping track of past usage history and using the maximum. Act on the immediate usage information. * Remove donation constraints defined by SURPLUS_* constants. Donate whatever isn't used. * Determine the donation amount so that the iocg will end up with MARGIN_TARGET_PCT budget at the end of the coming period assuming the same usage as the previous period. TARGET is set at 50% of period, which is the previous maximum. This provides smooth convergence for most repetitive IO patterns. * Apply donation logic early at 20% budget. There's no risk in doing so as the calculation is based on the delta between the current budget and the target budget at the end of the coming period. * Remove preemptive iocg activation for zero cost IOs. As donation can reach near zero now, the mere activation doesn't provide any protection anymore. In the unlikely case that this becomes a problem, the right solution is assigning appropriate costs for such IOs. This significantly improves the donation determination logic while also simplifying it. Now all donations are immediate, exact and smooth. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: implement Andy's method for donation weight updatesTejun Heo1-8/+244
iocost implements work conservation by reducing iocg->inuse and propagating the adjustment upwards proportionally. However, while I knew the target absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure out how to determine the iocg->inuse adjustment to achieve that and approximated the adjustment by scaling iocg->inuse using the proportion of the needed hweight_inuse changes. When nested, these scalings aren't accurate even when adjusting a single node as the donating node also receives the benefit of the donated portion. When multiple nodes are donating as they often do, they can be wildly wrong. iocost employed various safety nets to combat the inaccuracies. There are ample buffers in determining how much to donate, the adjustments are conservative and gradual. While it can achieve a reasonable level of work conservation in simple scenarios, the inaccuracies can easily add up leading to significant loss of total work. This in turn makes it difficult to closely cap vrate as vrate adjustment is needed to compensate for the loss of work. The combination of inaccurate donation calculations and vrate adjustments can lead to wide fluctuations and clunky overall behaviors. Andy Newell devised a method to calculate the needed ->inuse updates to achieve the target hweight_inuse's. The method is compatible with the proportional inuse adjustment propagation which allows all hot path operations to be local to each iocg. To roughly summarize, Andy's method divides the tree into donating and non-donating parts, calculates global donation rate which is used to determine the target hweight_inuse for each node, and then derives per-level proportions. There's non-trivial amount of math involved. Please refer to the following pdfs for detailed descriptions. https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN This patch implements Andy's method in transfer_surpluses(). This makes the donation calculations accurate per cycle and enables further improvements in other parts of the donation logic. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: restructure surplus donation logicTejun Heo1-76/+103
The way the surplus donation logic is structured isn't great. There are two separate paths for starting/increasing donations and decreasing them making the logic harder to follow and is prone to unnecessary behavior differences. In preparation for improved donation handling, this patch restructures the code so that * All donors - new, increasing and decreasing - are funneled through the same code path. * The target donation calculation is factored into hweight_after_donation() which is called once from the same spot for all possible donors. * Actual inuse adjustment is factored into trasnfer_surpluses(). This change introduces a few behavior differences - e.g. donation amount reduction now uses the max usage of the recent three periods just like new and increasing donations, and inuse now gets adjusted upwards the same way it gets downwards. These differences are unlikely to have severely negative implications and the whole logic will be revamped soon. This patch also removes two tracepoints. The existing TPs don't quite fit the new implementation. A later patch will update and reinstate them. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: decouple vrate adjustment from surplus transfersTejun Heo1-12/+7
Budget donations are inaccurate and could take multiple periods to converge. To prevent triggering vrate adjustments while surplus transfers were catching up, vrate adjustment was suppressed if donations were increasing, which was indicated by non-zero nr_surpluses. This entangling won't be necessary with the scheduled rewrite of donation mechanism which will make it precise and immediate. Let's decouple the two in preparation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: replace iocg->has_surplus with ->surplus_listTejun Heo1-8/+11
Instead of marking iocgs with surplus with a flag and filtering for them while walking all active iocgs, build a surpluses list. This doesn't make much difference now but will help implementing improved donation logic which will iterate iocgs with surplus multiple times. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_usTejun Heo1-27/+45
Currently, iocg->usages[] which are used to guide inuse adjustments are calculated from vtime deltas. This, however, assumes that the hierarchical inuse weight at the time of calculation held for the entire period, which often isn't true and can lead to significant errors. Now that we have absolute usage information collected, we can derive iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment decisions are made based on actual absolute usage. The calculated usage is clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal saturation regardless of the current hierarchical inuse weight. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: add absolute usage statTejun Heo1-6/+149
Currently, iocost doesn't collect or expose any statistics punting off all monitoring duties to drgn based iocost_monitor.py. While it works for some scenarios, there are some usability and data availability challenges. For example, accurate per-cgroup usage information can't be tracked by vtime progression at all and the number available in iocg->usages[] are really short-term snapshots used for control heuristics with possibly significant errors. This patch implements per-cgroup absolute usage stat counter and exposes it through io.stat along with the current vrate. Usage stat collection and flushing employ the same method as cgroup rstat on the active iocg's and the only hot path overhead is preemption toggling and adding to a percpu counter. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: grab ioc->lock for debt handlingTejun Heo1-19/+73
Currently, debt handling requires only iocg->waitq.lock. In the future, we want to adjust and propagate inuse changes depending on debt status. Let's grab ioc->lock in debt handling paths in preparation. * Because ioc->lock nests outside iocg->waitq.lock, the decision to grab ioc->lock needs to be made before entering the critical sections. * Add and use iocg_[un]lock() which handles the conditional double locking. * Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when the caller grabbed both locks. This patch is prepatory and the comments contain references to future changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: streamline vtime margin and timer slack handlingTejun Heo1-29/+38
The margin handling was pretty inconsistent. * ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin thresholds. However, the two are in different units with the former requiring conversion to vtime on use. * iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use different values for the two. This patch cleans up margin and timer slack handling: * vtime margins are now recorded in ioc->margins.{min, max} on period duration changes and used consistently. * Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and used consistently for iocg_kick_waitq() and iocg_kick_delay(). The only functional change is shortening of timer slack. No meaningful visible change is expected. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: make ioc_now->now and ioc->period_at 64bitTejun Heo1-2/+2
They are in microseconds and wrap in around 1.2 hours with u32. While unlikely, confusions from wraparounds are still possible. We aren't saving anything meaningful by keeping these u32. Let's make them u64. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: use WEIGHT_ONE based fixed point number for weightsTejun Heo1-12/+13
To improve weight donations, we want to able to scale inuse with a greater accuracy and down below 1. Let's make non-hierarchical weights to use WEIGHT_ONE based fixed point numbers too like hierarchical ones. This doesn't cause any behavior changes yet. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/gTejun Heo1-10/+10
We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to WEIGHT_ONE. Pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debtTejun Heo1-1/+2
iocg_kick_waitq() is the function which pays debt and iocg_kick_delay() updates the actual delay status accordingly. If iocg_kick_delay() is not called after iocg_kick_delay() updated debt, unnecessarily large delays can be applied temporarily. Let's make sure such conditions don't occur by making iocg_kick_waitq() always call iocg_kick_delay() after paying debt. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: move iocg_kick_delay() above iocg_kick_waitq()Tejun Heo1-60/+60
We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in preparation. This is pure code reorganization. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: clamp inuse and skip noops in __propagate_weights()Tejun Heo1-1/+4
__propagate_weights() currently expects the callers to clamp inuse within [1, active], which is needlessly fragile. The inuse adjustment logic is going to be revamped, in preparation, let's make __propagate_weights() clamp inuse on entry. Also, make it avoid weight updates altogether if neither active or inuse is changed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02blk-iocost: rename propagate_active_weights() to propagate_weights()Tejun Heo1-20/+20
It already propagates two weights - active and inuse - and there will be another soon. Let's drop the confusing misnomers. Rename [__]propagate_active_weights() to [__]propagate_weights() and commit_active_weights() to commit_weights(). This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>