summaryrefslogtreecommitdiff
path: root/block/mq-deadline.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-09block/mq-deadline: use correct way to throttling write requestsZhiguo Niu1-1/+2
The original formula was inaccurate: dd->async_depth = max(1UL, 3 * q->nr_requests / 4); For write requests, when we assign a tags from sched_tags, data->shallow_depth will be passed to sbitmap_find_bit, see the following code: nr = sbitmap_find_bit_in_word(&sb->map[index], min_t (unsigned int, __map_depth(sb, index), depth), alloc_hint, wrap); The smaller of data->shallow_depth and __map_depth(sb, index) will be used as the maximum range when allocating bits. For a mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so according to the previous calculation method, dd->async_depth = data->shallow_depth = 96, and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5, sb.maps[]=32/32/32/32, 32 is smaller than 96, whether it is a read or a write I/O, tags can be allocated to the maximum range each time, which has not throttling effect. In addition, refer to the methods of bfg/kyber I/O scheduler, limit ratiois are calculated base on sched_tags.bitmap_tags.sb.shift. This patch can throttle write requests really. Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-12block/mq-deadline: Fix a bug in deadline_from_pos()Bart Van Assche1-1/+1
A bug was introduced in deadline_from_pos() while implementing the suggestion to use round_down() in the following code: pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos); This patch makes deadline_from_pos() use round_down() such that 'pos' is rounded down. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/ Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Fixes: 0effb390c4ba ("block: mq-deadline: Handle requeued requests correctly") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Fix handling of at-head zoned writesBart Van Assche1-2/+7
Before dispatching a zoned write from the FIFO list, check whether there are any zoned writes in the RB-tree with a lower LBA for the same zone. This patch ensures that zoned writes happen in order even if at_head is set for some writes for a zone and not for others. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-12-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Handle requeued requests correctlyBart Van Assche1-2/+32
Start dispatching from the start of a zone instead of from the starting position of the most recently dispatched request. If a zoned write is requeued with an LBA that is lower than already inserted zoned writes, make sure that it is submitted first. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Track the dispatch positionBart Van Assche1-14/+31
Track the position (sector_t) of the most recently dispatched request instead of tracking a pointer to the next request to dispatch. This patch is the basis for patch "Handle requeued requests correctly". Without this patch it would be significantly more complicated to make sure that zoned writes are dispatched in LBA order per zone. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Reduce lock contentionBart Van Assche1-6/+6
blk_mq_free_requests() calls dd_finish_request() indirectly. Prevent nested locking of dd->lock and dd->zone_lock by moving the code for freeing requests. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Simplify deadline_skip_seq_writes()Bart Van Assche1-6/+3
Make the deadline_skip_seq_writes() code shorter without changing its functionality. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Clean up deadline_check_fifo()Bart Van Assche1-11/+5
Change the return type of deadline_check_fifo() from 'int' into 'bool'. Use time_is_before_eq_jiffies() instead of time_after_eq(). No functionality has been changed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: mq-deadline: Add a word in a source code commentBart Van Assche1-1/+1
Add the missing word "and". Cc: Damien Le Moal <dlemoal@kernel.org> Suggested-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: pass a flags argument to elevator_type->insert_requestsChristoph Hellwig1-4/+5
Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: move blk_mq_sched_insert_request to blk-mq.cChristoph Hellwig1-1/+1
blk_mq_sched_insert_request is the main request insert helper and not directly I/O scheduler related. Move blk_mq_sched_insert_request to blk-mq.c, rename it to blk_mq_insert_request and mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_listChristoph Hellwig1-1/+1
blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: include <linux/blk-mq.h> in block/blk-mq.hChristoph Hellwig1-1/+0
block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: remove blk-mq-tag.hChristoph Hellwig1-1/+0
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-29block: mq-deadline: Rename deadline_is_seq_writes()Damien Le Moal1-2/+2
Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the "s" plural) to more correctly reflect the fact that this function tests a single request, not multiple requests. Fixes: 015d02f48537 ("block: mq-deadline: Do not break sequential write streams to zoned HDDs") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20221126025550.967914-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-24block: mq-deadline: Do not break sequential write streams to zoned HDDsDamien Le Moal1-4/+62
mq-deadline ensures an in order dispatching of write requests to zoned block devices using a per zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone to be issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being writen and not locked). This result in an out of order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard-disks react badly to seeing these out of order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in lba order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \ --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \ --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-24block: mq-deadline: Fix dd_finish_request() for zoned devicesDamien Le Moal1-2/+15
dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorirites does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block/mq-deadline: Use the new blk_opf_t typeBart Van Assche1-2/+2
Use the new blk_opf_t type for an argument that represents a bitwise combination of a request operation and request flags. Rename that argument from 'op' into 'opf'. This patch does not change any functionality. Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16blk-mq: avoid to touch q->elevator without any protectionMing Lei1-0/+3
q->elevator is referred in blk_mq_has_sqsched() without any protection, no .q_usage_counter is held, no queue srcu and rcu read lock is held, so potential use-after-free may be triggered. Fix the issue by adding one queue flag for checking if the elevator uses single queue style dispatch. Meantime the elevator feature flag of ELEVATOR_F_MQ_AWARE isn't needed any more. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-14block/mq-deadline: Set the fifo_time member also if inserting at headBart Van Assche1-0/+1
Before commit 322cff70d46c the fifo_time member of requests on a dispatch list was not used. Commit 322cff70d46c introduces code that reads the fifo_time member of requests on dispatch lists. Hence this patch that sets the fifo_time member when adding a request to a dispatch list. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Fixes: 322cff70d46c ("block/mq-deadline: Prioritize high-priority requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220513171307.32564-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-20block: fix async_depth sysfs interface for mq-deadlineJens Axboe1-2/+2
A previous commit added this feature, but it inadvertently used the wrong variable to show/store the setting from/to, victimized by copy/paste. Fix it up so that the async_depth sysfs interface reads and writes from the right setting. Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485 Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Stop using pointers for blk_mq_tags bitmap tagsJohn Garry1-1/+1
Now that we use shared tags for shared sbitmap support, we don't require the tags sbitmap pointers, so drop them. This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for blk_mq_tags bitmap tags"). Function blk_mq_init_bitmap_tags() is removed also, since it would be only a wrappper for blk_mq_init_bitmaps(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/mq-deadline: Prioritize high-priority requestsBart Van Assche1-4/+73
In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline: Prioritize high-priority requests""), this patch uses 'jiffies' instead of ktime_get() in the code for aging lower priority requests. This patch has been tested as follows: Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: 555 K IOPS. Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler. Result without and with this patch: about 380 K IOPS. Ran the following script: set -e scriptdir=$(dirname "$0") if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi modprobe scsi_debug ndelay=1000000 max_queue=16 sd='' while [ -z "$sd" ]; do sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) done echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire" if [ -e /sys/fs/cgroup/io.prio.class ]; then cd /sys/fs/cgroup echo restrict-to-be >io.prio.class echo +io > cgroup.subtree_control else cd /sys/fs/cgroup/blkio/ echo restrict-to-be >blkio.prio.class fi echo $$ >cgroup.procs mkdir -p hipri cd hipri if [ -e io.prio.class ]; then echo none-to-rt >io.prio.class else echo none-to-rt >blkio.prio.class fi { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & } echo $$ >cgroup.procs "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt Result: * 11000 IOPS for the high-priority job * 40 IOPS for the low-priority job If the prio aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS. The max-iops script is a script that runs fio with the following arguments: --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60 --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j} --iodepth=${arg_d} --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1} --filename=${positional_argument_1} Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org [axboe: @latest -> @latest_start] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/mq-deadline: Stop using per-CPU countersBart Van Assche1-68/+56
Calculating the sum over all CPUs of per-CPU counters frequently is inefficient. Hence switch from per-CPU to individual counters. Three counters are protected by the mq-deadline spinlock since these are only accessed from contexts that already hold that spinlock. The fourth counter is atomic because protecting it with the mq-deadline spinlock would trigger lock contention. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/mq-deadline: Add an invariant checkBart Van Assche1-6/+12
Check a statistics invariant at module unload time. When running blktests, the invariant is verified every time a request queue is removed and hence is verified at least once per test. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/mq-deadline: Improve request accounting furtherBart Van Assche1-7/+10
The scheduler .insert_requests() callback is called when a request is queued for the first time and also when it is requeued. Only count a request the first time it is queued. Additionally, since the mq-deadline scheduler only performs zone locking for requests that have been inserted, skip the zone unlock code for requests that have not been inserted into the mq-deadline scheduler. Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics") Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: move elevator.h to block/Christoph Hellwig1-1/+1
Except for the features passed to blk_queue_required_elevator_features, elevator.h is only needed internally to the block layer. Move the ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to block/, dropping all the spurious includes outside of that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02block/mq-deadline: Move dd_queued() to fix defined but not used warningGeert Uytterhoeven1-6/+6
If CONFIG_BLK_DEBUG_FS=n: block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function] 274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) | ^~~~~~~~~ Fix this by moving dd_queued() just before the sole function that calls it. Fixes: 7b05bf771084ff78 ("Revert "block/mq-deadline: Prioritize high-priority requests"") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+2
Pull block updates from Jens Axboe: "Nothing major in here - lots of good cleanups and tech debt handling, which is also evident in the diffstats. In particular: - Add disk sequence numbers (Matteo) - Discard merge fix (Ming) - Relax disk zoned reporting restrictions (Niklas) - Bio error handling zoned leak fix (Pavel) - Start of proper add_disk() error handling (Luis, Christoph) - blk crypto fix (Eric) - Non-standard GPT location support (Dmitry) - IO priority improvements and cleanups (Damien)o - blk-throtl improvements (Chunguang) - diskstats_show() stack reduction (Abd-Alrhman) - Loop scheduler selection (Bart) - Switch block layer to use kmap_local_page() (Christoph) - Remove obsolete disk_name helper (Christoph) - block_device refcounting improvements (Christoph) - Ensure gendisk always has a request queue reference (Christoph) - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)" * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits) sg: pass the device name to blk_trace_setup block, bfq: cleanup the repeated declaration blk-crypto: fix check for too-large dun_bytes blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN blk-zoned: allow zone management send operations without CAP_SYS_ADMIN block: mark blkdev_fsync static block: refine the disk_live check in del_gendisk mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA mmc: block: Support alternative_gpt_sector() operation partitions/efi: Support non-standard GPT location block: Add alternative_gpt_sector() operation bio: fix page leak bio_add_hw_page failure block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT block: remove a pointless call to MINOR() in device_add_disk null_blk: add error handling support for add_disk() virtio_blk: add error handling support for add_disk() block: add error handling for device_add_disk / add_disk block: return errors from disk_alloc_events block: return errors from blk_integrity_add block: call blk_register_queue earlier in device_add_disk ...
2021-08-26Revert "block/mq-deadline: Prioritize high-priority requests"Jens Axboe1-37/+5
This reverts commit fb926032b3209300f9dc454a36b8299582ae545c. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25mq-deadline: Fix request accountingBart Van Assche1-5/+11
The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents that lower priority I/O is delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-11Revert "block/mq-deadline: Add cgroup support"Tejun Heo1-0/+1130
This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Add cgroup supportBart Van Assche1-1095/+0
Maintain statistics per cgroup and export these to user space. These statistics are essential for verifying whether the proper I/O priorities have been assigned to requests. An example of the statistics data with this patch applied: $ cat /sys/fs/cgroup/io.stat 11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0 Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Track I/O statisticsBart Van Assche1-0/+100
Track I/O statistics per I/O priority and export these statistics to debugfs. These statistics help developers of the deadline scheduler. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Add I/O priority supportBart Van Assche1-114/+228
Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE and IDLE. Maintain statistics for each priority level. Split the debugfs attributes per priority level as follows: $ ls /sys/kernel/debug/block/.../sched/ async_depth dispatch2 read_next_rq write2_fifo_list batching read0_fifo_list starved write_next_rq dispatch0 read1_fifo_list write0_fifo_list dispatch1 read2_fifo_list write1_fifo_list Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Micro-optimize the batching algorithmBart Van Assche1-6/+5
When dispatching the first request of a batch, the deadline_move_request() call clears .next_rq[] for the opposite data direction. .next_rq[] is not restored when changing data direction. Fix this by not clearing .next_rq[] and by keeping track of the data direction of a batch in a variable instead. This patch is a micro-optimization because: - The number of deadline_next_request() calls for the read direction is halved. - The number of times that deadline_next_request() returns NULL is reduced. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-13-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Reserve 25% of scheduler tags for synchronous requestsBart Van Assche1-0/+55
For interactive workloads it is important that synchronous requests are not delayed. Hence reserve 25% of scheduler tags for synchronous requests. This patch still allows asynchronous requests to fill the hardware queues since blk_mq_init_sched() makes sure that the number of scheduler requests is the double of the hardware queue depth. From blk_mq_init_sched(): q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth, BLKDEV_MAX_RQ); Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-12-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Improve the sysfs show and store macrosBart Van Assche1-37/+29
Define separate macros for integers and jiffies to improve readability. Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol(). The former functions are the recommended functions. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Improve compile-time argument checkingBart Van Assche1-47/+49
Modern compilers complain if an out-of-range value is passed to a function argument that has an enumeration type. Let the compiler detect out-of-range data direction arguments instead of verifying the data_dir argument at runtime. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()Bart Van Assche1-4/+4
Change "queue" into "sched" to make the function names reflect better the purpose of these functions. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Remove two local variablesBart Van Assche1-6/+2
Make __dd_dispatch_request() easier to read by removing two local variables. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Add two lockdep_assert_held() statementsBart Van Assche1-0/+4
Document the locking strategy by adding two lockdep_assert_held() statements. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-22block/mq-deadline: Add several commentsBart Van Assche1-0/+18
Make the code easier to read by adding more comments. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-11kyber: fix out of bounds access when preemptedOmar Sandoval1-2/+1
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16bfq/mq-deadline: remove redundant check for passthrough requestLin Feng1-5/+2
Since commit 01e99aeca39796003 'blk-mq: insert passthrough request into hctx->dispatch directly', passthrough request should not appear in IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO schedulers is redundant. (Notes: this patch passes generic IO load test with hdds under SAS controller and hdds under AHCI controller but obviously not covers all. Not sure if passthrough request can still escape into IO scheduler from blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and has lots of indirect callers.) Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22block: get rid of the trace rq insert wrapperChaitanya Kulkarni1-1/+3
Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25Revert "blk-mq, elevator: Count requests per hctx to improve performance"Jan Kara1-6/+0
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-04blk-mq, elevator: Count requests per hctx to improve performanceKashyap Desai1-0/+6
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock contention is possible for mq-deadline and bfq IO schedulers when nr_hw_queues is more than one. It is because kblockd work queue can submit IO from all online CPUs (through blk_mq_run_hw_queues()) even though only one hctx has pending commands. The elevator callback .has_work for mq-deadline and bfq scheduler considers pending work if there are any IOs on request queue but it does not account hctx context. Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering the elevator even though there are no requests queued. [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap] Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: remove the bio argument to ->prepare_requestChristoph Hellwig1-1/+1
None of the I/O schedulers actually needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-06block: Introduce elevator featuresDamien Le Moal1-0/+1
Introduce the definition of elevator features through the elevator_features flags in the elevator_type structure. Each flag can represent a feature supported by an elevator. The first feature defined by this patch is support for zoned block device sequential write constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented by the mq-deadline elevator using zone write locking. Other possible features are IO priorities, write hints, latency targets or single-LUN dual-actuator disks (for which the elevator could maintain one LBA ordered list per actuator). The required_elevator_features field is also added to the request_queue structure to allow a device driver to specify elevator feature flags that an elevator must support for the correct operation of the device (e.g. device drivers for zoned block devices can have the ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature). The helper function blk_queue_required_elevator_features() is defined for setting this new field. With these two new fields in place, the elevator functions elevator_match() and elevator_find() are modified to allow a user to set only an elevator with a set of features that satisfies the device required features. Elevators not matching the device requirements are not shown in the device sysfs queue/scheduler file to prevent their use. The "none" elevator can always be selected as before. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>