summaryrefslogtreecommitdiff
path: root/block/bfq-iosched.c
AgeCommit message (Collapse)AuthorFilesLines
2024-09-11block, bfq: factor out a helper to split bfqq in bfq_init_rq()Yu Kuai1-51/+58
Make code cleaner, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-8-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()Yu Kuai1-21/+16
Now that 'bfqq_already_existing' is only used in one branch, it can be removed. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block, bfq: remove local variable 'split' in bfq_init_rq()Yu Kuai1-15/+8
The local variable is used to call bfq_bfqq_resume_state() later, since 'bfqd->lock' is held, and bfqq status will not change between setting 'split' and calling bfq_bfqq_resume_state(), move forward bfq_bfqq_resume_state() so that 'split' can be removed. There are no functional chagnes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()Yu Kuai1-4/+2
Because bfq_put_cooperator() is always followed by bfq_release_process_ref(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block, bfq: fix procress reference leakage for bfqq in merge chainYu Kuai1-20/+17
Original state: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \--------------\ \-------------\ \-------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()"), if P1 issues a new IO: Without the patch: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \------------------------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 bfqq3 will be used to handle IO from P1, this is not expected, IO should be redirected to bfqq4; With the patch: ------------------------------------------- | | Process 1 Process 2 Process 3 | Process 4 (BIC1) (BIC2) (BIC3) | (BIC4) | | | | \-------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 IO is redirected to bfqq4, however, procress reference of bfqq3 is still 2, while there is only P2 using it. Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify code. Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block, bfq: fix uaf for accessing waker_bfqq after splittingYu Kuai1-3/+28
After commit 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()"), if the current procress is the last holder of bfqq, the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq may in the merge chain of bfqq, hence just recored waker_bfqq is still not safe. Fix the problem by adding a helper bfq_waker_bfqq() to check if bfqq->waker_bfqq is in the merge chain, and current procress is the only holder. Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()Yu Kuai1-2/+2
Instead of open coding it, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: don't break merge chain in bfq_split_bfqq()Yu Kuai1-1/+1
Consider the following scenario: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \-------------\ \-------------\ \--------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq() decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will break the merge chain. Expected result: caller will allocate a new bfqq for BIC1 Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) | | | \-------------\ \--------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 1 3 Since the condition is only used for the last bfqq4 when the previous bfqq2 and bfqq3 are already splited. Fix the problem by checking if bfqq is the last one in the merge chain as well. Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()Yu Kuai1-2/+6
Consider the following merge chain: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \--------------\ \-------------\ \-------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 IO from Process 1 will get bfqf2 from BIC1 first, then bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then handle this IO from bfqq3. However, the merge chain can be much deeper and bfqq3 can be merged to other bfqq as well. Fix this problem by iterating to the last bfqq in bfq_setup_cooperator(). Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: fix possible UAF for bfqq->bic with merge chainYu Kuai1-1/+2
1) initial state, three tasks: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | Λ | Λ | Λ | | | | | | V | V | V | bfqq1 bfqq2 bfqq3 process ref: 1 1 1 2) bfqq1 merged to bfqq2: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | | | Λ \--------------\| | | V V | bfqq1--------->bfqq2 bfqq3 process ref: 0 2 1 3) bfqq2 merged to bfqq3: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) here -> Λ | | \--------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3 process ref: 0 1 3 In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then get bfqq3 through merge chain, and finially handle IO by bfqq3. Howerver, current code will think bfqq2 is owned by BIC1, like initial state, and set bfqq2->bic to BIC1. bfq_insert_request -> by Process 1 bfqq = bfq_init_rq(rq) bfqq = bfq_get_bfqq_handle_split bfqq = bic_to_bfqq -> get bfqq2 from BIC1 bfqq->ref++ rq->elv.priv[0] = bic rq->elv.priv[1] = bfqq if (bfqq_process_refs(bfqq) == 1) bfqq->bic = bic -> record BIC1 to bfqq2 __bfq_insert_request new_bfqq = bfq_setup_cooperator -> get bfqq3 from bfqq2->new_bfqq bfqq_request_freed(bfqq) new_bfqq->ref++ rq->elv.priv[1] = new_bfqq -> handle IO by bfqq3 Fix the problem by checking bfqq is from merge chain fist. And this might fix a following problem reported by our syzkaller(unreproducible): ================================================================== BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595 CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_requeue_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x10d/0x610 mm/kasan/report.c:475 kasan_report+0x8e/0xc0 mm/kasan/report.c:588 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757 bfq_init_rq block/bfq-iosched.c:6876 [inline] bfq_insert_request block/bfq-iosched.c:6254 [inline] bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 </TASK> Allocated by task 20776: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:763 [inline] slab_alloc_node mm/slub.c:3458 [inline] kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503 ioc_create_icq block/blk-ioc.c:370 [inline] ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517 blk_mq_get_new_requests block/blk-mq.c:2940 [inline] blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042 __submit_bio+0x331/0x6f0 block/blk-core.c:624 __submit_bio_noacct_mq block/blk-core.c:703 [inline] submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Freed by task 946: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3786 [inline] kmem_cache_free+0x118/0x6f0 mm/slub.c:3808 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553 Last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Second to last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 The buggy address belongs to the object at ffff888123839d68 which belongs to the cache bfq_io_cq of size 1360 The buggy address is located 336 bytes inside of freed 1360-byte region [ffff888123839d68, ffff88812383a2b8) The buggy address belongs to the physical page: page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838 head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488 raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-17block: BFQ: Refactor bfq_exit_icq() to silence sparse warningJohn Garry1-18/+20
Currently building for C=1 generates the following warning: block/bfq-iosched.c:5498:9: warning: context imbalance in 'bfq_exit_icq' - different lock contexts for basic block Refactor bfq_exit_icq() into a core part which loops for the actuators, and only lock calling this routine when necessary. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240614090345.655716-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05block: add blk_time_get_ns() and blk_time_get() helpersJens Axboe1-14/+14
Convert any user of ktime_get_ns() to use blk_time_get_ns(), and ktime_get() to blk_time_get(), so we have a unified API for querying the current time in nanoseconds or as ktime. No functional changes intended, this patch just wraps ktime_get_ns() and ktime_get() with a block helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-4/+4
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-05-23scsi: block: ioprio: Clean up interface definitionDamien Le Moal1-4/+4
The I/O priority user interface defines the 16-bits ioprio values as the combination of the upper 3-bits for an I/O priority class and the lower 13-bits as priority data. However, the kernel only uses the lower 3-bits of the priority data to define priority levels for the RT and BE priority classes. The data part of an ioprio value is completely ignored for the IDLE and NONE classes. This is enforced by checks done in ioprio_check_cap(), which is called for all paths that allow defining an I/O priority for I/Os: the per-context ioprio_set() system call, aio interface and io_uring interface. Clarify this fact in the uapi ioprio.h header file and introduce the IOPRIO_PRIO_LEVEL_MASK and IOPRIO_PRIO_LEVEL() macros for users to define and get priority levels in an ioprio value. The coarser macro IOPRIO_PRIO_DATA() is retained for backward compatibility with old applications already using it. There is no functional change introduced with this. In-kernel users of the IOPRIO_PRIO_DATA() macro which are explicitly handling I/O priority data as a priority level are modified to use the new IOPRIO_PRIO_LEVEL() macro without any functional change. Since f2fs is the only user of this macro not explicitly using that value as a priority level, it is left unchanged. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-2-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-20block: BFQ: Move an invariant checkBart Van Assche1-1/+1
Check bfqq->dispatched for each BFQ queue instead of checking it for an invalid bfqq pointer. Fixes: 3e49c1e4a615 ("block: BFQ: Add several invariant checks") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519220347.3643295-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19block: BFQ: Add several invariant checksBart Van Assche1-0/+9
If anything goes wrong with the counters that track the number of requests, I/O locks up. Make such scenarios easier to debug by adding invariant checks for the request counters. Additionally, check that BFQ queues are empty before these are freed. Cc: Jan Kara <jack@suse.cz> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16block, bfq: Fix division by zero error on zero wsumColin Ian King1-0/+2
When the weighted sum is zero the calculation of limit causes a division by zero error. Fix this by continuing to the next level. This was discovered by running as root: stress-ng --ioprio 0 Fixes divison by error oops: [ 521.450556] divide error: 0000 [#1] SMP NOPTI [ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1 [ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400 [ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44 [ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046 [ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 [ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978 [ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0 [ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18 [ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970 [ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000 [ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0 [ 521.455491] PKRU: 55555554 [ 521.455619] Call Trace: [ 521.455736] <TASK> [ 521.455837] ? bfq_request_merge+0x3a/0xc0 [ 521.456027] ? elv_merge+0x115/0x140 [ 521.456191] bfq_limit_depth+0xc8/0x240 [ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0 [ 521.456577] blk_mq_submit_bio+0x23c/0x6c0 [ 521.456766] __submit_bio+0xb8/0x140 [ 521.457236] submit_bio_noacct_nocheck+0x212/0x300 [ 521.457748] submit_bio_noacct+0x1a6/0x580 [ 521.458220] submit_bio+0x43/0x80 [ 521.458660] ext4_io_submit+0x23/0x80 [ 521.459116] ext4_do_writepages+0x40a/0xd00 [ 521.459596] ext4_writepages+0x65/0x100 [ 521.460050] do_writepages+0xb7/0x1c0 [ 521.460492] __filemap_fdatawrite_range+0xa6/0x100 [ 521.460979] file_write_and_wait_range+0xbf/0x140 [ 521.461452] ext4_sync_file+0x105/0x340 [ 521.461882] __x64_sys_fsync+0x67/0x100 [ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0 [ 521.462768] do_syscall_64+0x3b/0xc0 [ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4 [ 521.463621] RIP: 0033:0x5640b6c56590 [ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: pass a flags argument to elevator_type->insert_requestsChristoph Hellwig1-8/+8
Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: remove blk-mq-tag.hChristoph Hellwig1-1/+0
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-08block, bfq: fix uaf for 'stable_merge_bfqq'Yu Kuai1-9/+9
Before commit fd571df0ac5b ("block, bfq: turn bfqq_data into an array in bfq_io_cq"), process reference is read before bfq_put_stable_ref(), and it's safe if bfq_put_stable_ref() put the last reference, because process reference will be 0 and 'stable_merge_bfqq' won't be accessed in this case. However, the commit changed the order and will cause uaf for 'stable_merge_bfqq'. In order to emphasize that bfq_put_stable_ref() can drop the last reference, fix the problem by moving bfq_put_stable_ref() to the end of bfq_setup_stable_merge(). Fixes: fd571df0ac5b ("block, bfq: turn bfqq_data into an array in bfq_io_cq") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/linux-block/20230307071448.rzihxbm4jhbf5krj@shindev/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-21Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linuxLinus Torvalds1-212/+417
Pull block updates from Jens Axboe: - NVMe updates via Christoph: - Small improvements to the logging functionality (Amit Engel) - Authentication cleanups (Hannes Reinecke) - Cleanup and optimize the DMA mapping cod in the PCIe driver (Keith Busch) - Work around the command effects for Format NVM (Keith Busch) - Misc cleanups (Keith Busch, Christoph Hellwig) - Fix and cleanup freeing single sgl (Keith Busch) - MD updates via Song: - Fix a rare crash during the takeover process - Don't update recovery_cp when curr_resync is ACTIVE - Free writes_pending in md_stop - Change active_io to percpu - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert) - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide) - Make brd compliant with REQ_NOWAIT (me) - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me) - Fix for REQ_NOWAIT with multiple bios (me) - Fix memory leak in blktrace cleanup (Greg) - Clean up sbitmap and fix a potential hang (Kemeng) - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng) - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng) - ublk updates and fixes: - Add support for unprivileged ublk (Ming) - Improve device deletion handling (Ming) - Misc (Liu, Ziyang) - s390 dasd fixes (Alexander, Qiheng) - Improve utility of request caching and fixes (Anuj, Xiao) - zoned cleanups (Pankaj) - More constification for kobjs (Thomas) - blk-iocost cleanups (Yu) - Remove bio splitting from drivers that don't need it (Christoph) - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph) - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph) - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong) * tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits) brd: use radix_tree_maybe_preload instead of radix_tree_preload block: use proper return value from bio_failfast() block: bio-integrity: Copy flags when bio_integrity_payload is cloned block: Fix io statistics for cgroup in throttle path brd: mark as nowait compatible brd: check for REQ_NOWAIT and set correct page allocation mask brd: return 0/-error from brd_insert_page() block: sync mixed merged request's failfast with 1st bio's Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" Revert "blk-cgroup: pass a gendisk to blkg_lookup" Revert "blk-cgroup: delay blk-cgroup initialization until add_disk" Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" Revert "blk-cgroup: move the cgroup information to struct gendisk" nvme-pci: remove iod use_sgls nvme-pci: fix freeing single sgl block: ublk: check IO buffer based on flag need_get_data s390/dasd: Fix potential memleak in dasd_eckd_init() s390/dasd: sort out physical vs virtual pointers usage block: Remove the ALLOC_CACHE_SLACK constant block: make kobj_type structures constant ...
2023-02-03blk-cgroup: pass a gendisk to blkcg_{de,}activate_policyChristoph Hellwig1-1/+1
Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03blk-wbt: pass a gendisk to wbt_{enable,disable}_defaultChristoph Hellwig1-2/+2
Pass a gendisk to wbt_enable_default and wbt_disable_default to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: remove unused bfq_wr_max_time in struct bfq_dataKemeng Shi1-4/+0
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed. It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqqKemeng Shi1-6/+3
We jump to tag only for returning current rq. Return directly to remove this tag. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: remove redundant check in bfq_put_cooperatorKemeng Shi1-2/+0
We have already avoided a circular list in bfq_setup_merge (see comments in bfq_setup_merge() for details), so bfq_queue will not appear in it's new_bfqq list. Just remove this check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: remove unnecessary dereference to get async_bfqqKemeng Shi1-1/+1
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: use helper macro RQ_BFQQ to get bfqq of requestKemeng Shi1-3/+3
Use helper macro RQ_BFQQ to get bfqq of request. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: initialize bfqq->decrease_time_jif correctlyKemeng Shi1-0/+2
Inject limit is updated or reset when time_is_before_eq_jiffies( decrease_time_jif + several msecs) or think-time state changes. decrease_time_jif is initialized to 0 and will be set to current jiffies when inject limit is updated or reset. If the jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never chages, then the injection will not work as expected for long time. To be more specific: Function bfq_update_inject_limit maybe triggered when jiffies pasts decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting bfqd->wait_dispatch to true. Function bfq_reset_inject_limit are called in two conditions: 1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in function bfq_add_request. 2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or bfq think-time state change from short to long. Fix this by initializing bfqq->decrease_time_jif to current jiffies to trigger service injection soon when service injection conditions are met. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: remove unsed parameter reason in bfq_bfqq_is_slowKemeng Shi1-3/+2
Parameter reason is never used, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injectionKemeng Shi1-6/+4
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit to one request if current inject_limit is 0 before search of the source queue for injection. However the search below will reset inject limit to bfqd->in_service_queue which is zero for raised inject limit. Then the temporarily raised inject limit never works as expected. Assigment limit to bfqd->in_service_queue in search is needed as limit maybe overwriten to min_t(unsigned int, 1, limit) for condition that a large in-flight request is on non-rotational devices in found queue. So we need to reset limit to bfqd->in_service_queue for normal case. Actually, we have already make sure bfqd->rq_in_driver is < limit before search, then -Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int, 1, limit) is always 1. So we can simply check bfqd->rq_in_driver with 1 instead of result of min_t(unsigned int, 1, limit) for larget request in non-rotational device case to avoid overwritting limit and the bug is gone. -For normal case, we have already check bfqd->rq_in_driver is < limit, so we can return found bfqq unconditionally to remove unncessary check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: fix uaf for bfqq in bic_set_bfqq()Yu Kuai1-1/+3
After commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'"), bic->bfqq will be accessed in bic_set_bfqq(), however, in some context bic->bfqq will be freed, and bic_set_bfqq() is called with the freed bic->bfqq. Fix the problem by always freeing bfqq after bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: balance I/O injection among underutilized actuatorsDavide Zini1-5/+13
Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commits breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is however skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: inject I/O to underutilized actuatorsDavide Zini1-35/+101
The main service scheme of BFQ for sync I/O is serving one sync bfq_queue at a time, for a while. In particular, BFQ enforces this scheme when it deems the latter necessary to boost throughput or to preserve service guarantees. Unfortunately, when BFQ enforces this policy, only one actuator at a time gets served for a while, because each bfq_queue contains I/O only for one actuator. The other actuators may remain underutilized. Actually, BFQ may serve (inject) extra I/O, taken from other bfq_queues, in parallel with that of the in-service queue. This injection mechanism may provide the ground for dealing also with the above actuator-underutilization problem. Yet BFQ does not take the actuator load into account when choosing which queue to pick extra I/O from. In addition, BFQ may happen to inject extra I/O only when the in-service queue is temporarily empty. In view of these facts, this commit extends the injection mechanism in such a way that the latter: (1) takes into account also the actuator load; (2) checks such a load on each dispatch, and injects I/O for an underutilized actuator, if there is one and there is I/O for it. To perform the check in (2), this commit introduces a load threshold, currently set to 4. A linear scan of each actuator is performed, until an actuator is found for which the following two conditions hold: the load of the actuator is below the threshold, and there is at least one non-in-service queue that contains I/O for that actuator. If such a pair (actuator, queue) is found, then the head request of that queue is returned for dispatch, instead of the head request of the in-service queue. We have set the threshold, empirically, to the minimum possible value for which an actuator is fully utilized, or close to be fully utilized. By doing so, injected I/O 'steals' as few drive-queue slots as possibile to the in-service queue. This reduces as much as possible the probability that the service of I/O from the in-service bfq_queue gets delayed because of slot exhaustion, i.e., because all the slots of the drive queue are filled with I/O injected from other queues (NCQ provides for 32 slots). This new mechanism also counters actuator underutilization in the case of asymmetric configurations of bfq_queues. Namely if there are few bfq_queues containing I/O for some actuators and many bfq_queues containing I/O for other actuators. Or if the bfq_queues containing I/O for some actuators have lower weights than the other bfq_queues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: retrieve independent access ranges from request queueFederico Gavioli1-8/+51
This patch implements the code to gather the content of the independent_access_ranges structure from the request_queue and copy it into the queue's bfq_data. This copy is done at queue initialization. We copy the access ranges into the bfq_data to avoid taking the queue lock each time we access the ranges. This implementation, however, puts a limit to the maximum independent ranges supported by the scheduler. Such a limit is equal to the constant BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of dynamic memory. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Co-developed-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: split also async bfq_queues on a per-actuator basisDavide Zini1-18/+23
Similarly to sync bfq_queues, also async bfq_queues need to be split on a per-actuator basis. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: turn bfqq_data into an array in bfq_io_cqPaolo Valente1-41/+59
When a bfq_queue Q is merged with another queue, several pieces of information are saved about Q. These pieces are stored in the bfqq_data field in the bfq_io_cq data structure of the process associated with Q. Yet, with a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. Each of these queues may undergo a merge. So, the bfq_io_cq data structure must be able to accommodate the above information for N queues. This commit solves this problem by turning the bfqq_data scalar field into an array of N elements (and by changing code so as to handle this array). This solution is written under the assumption that bfq_queues associated with different actuators cannot be cross-merged. This assumption holds naturally with basic queue merging: the latter is triggered by spatial locality, and sectors for different actuators are not close to each other (apart from the corner case of the last sectors served by a given actuator and the first sectors served by the next actuator). As for stable cross-merging, the assumption here is that it is disabled. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: move io_cq-persistent bfqq data into a dedicated structPaolo Valente1-59/+79
With a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq data structure must be able to accommodate its per-queue persistent information for N queues. Currently it stores this information for just one queue, in several scalar fields. This is a preparatory commit for moving to accommodating persistent information for N queues. In particular, this commit packs all the above scalar fields into a single data structure. Then there is now only one field, in bfq_io_cq, that stores all the above information. This scalar field will then be turned into an array by a following commit. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: forbid stable merging of queues associated with different actuatorsPaolo Valente1-4/+9
If queues associated with different actuators are merged, then control is lost on each actuator. Therefore some actuator may be underutilized, and throughput may decrease. This problem cannot occur with basic queue merging, because the latter is triggered by spatial locality, and sectors for different actuators are not close to each other. Yet it may happen with stable merging. To address this issue, this commit prevents stable merging from occurring among queues associated with different actuators. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-30block, bfq: split sync bfq_queues on a per-actuator basisPaolo Valente1-55/+105
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator SATA drives appear as a single device to the I/O subsystem [1]. Yet they address commands to different actuators internally, as a function of Logical Block Addressing (LBAs). A given sector is reachable by only one of the actuators. For example, Seagate’s Serial Advanced Technology Attachment (SATA) version contains two actuators and maps the lower half of the SATA LBA space to the lower actuator and the upper half to the upper actuator. Evidently, to fully utilize actuators, no actuator must be left idle or underutilized while there is pending I/O for it. The block layer must somehow control the load of each actuator individually. This commit lays the ground for allowing BFQ to provide such a per-actuator control. BFQ associates an I/O-request sync bfq_queue with each process doing synchronous I/O, or with a group of processes, in case of queue merging. Then BFQ serves one bfq_queue at a time. While in service, a bfq_queue is emptied in request-position order. Yet the same process, or group of processes, may generate I/O for different actuators. In this case, different streams of I/O (each for a different actuator) get all inserted into the same sync bfq_queue. So there is basically no individual control on when each stream is served, i.e., on when the I/O requests of the stream are picked from the bfq_queue and dispatched to the drive. This commit enables BFQ to control the service of each actuator individually for synchronous I/O, by simply splitting each sync bfq_queue into N queues, one for each actuator. In other words, a sync bfq_queue is now associated to a pair (process, actuator). As a consequence of this split, the per-queue proportional-share policy implemented by BFQ will guarantee that the sync I/O generated for each actuator, by each process, receives its fair share of service. This is just a preparatory patch. If the I/O of the same process happens to be sent to different queues, then each of these queues may undergo queue merging. To handle this event, the bfq_io_cq data structure must be properly extended. In addition, stable merging must be disabled to avoid loss of control on individual actuators. Finally, also async queues must be split. These issues are described in detail and addressed in next commits. As for this commit, although multiple per-process bfq_queues are provided, the I/O of each process or group of processes is still sent to only one queue, regardless of the actuator the I/O is for. The forwarding to distinct bfq_queues will be enabled after addressing the above issues. [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-26block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqqYu Kuai1-1/+1
Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq() can free bfqq first, and then call bic_set_bfqq(), which will cause uaf. Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHEDYuwei Guan1-0/+2
The 'bfqd->num_groups_with_pending_reqs' is used when CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled. Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: replace 0/1 with false/true in bic apisYu Kuai1-2/+2
Just to make the code a litter cleaner, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: fix possible uaf for 'bfqq->bic'Yu Kuai1-1/+6
Our test report a uaf for 'bfqq->bic' in 5.10: ================================================================== BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30 CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014 Call Trace: bfq_select_queue+0x378/0xa30 bfq_dispatch_request+0xe8/0x130 blk_mq_do_dispatch_sched+0x62/0xb0 __blk_mq_sched_dispatch_requests+0x215/0x2a0 blk_mq_sched_dispatch_requests+0x8f/0xd0 __blk_mq_run_hw_queue+0x98/0x180 __blk_mq_delay_run_hw_queue+0x22b/0x240 blk_mq_run_hw_queue+0xe3/0x190 blk_mq_sched_insert_requests+0x107/0x200 blk_mq_flush_plug_list+0x26e/0x3c0 blk_finish_plug+0x63/0x90 __iomap_dio_rw+0x7b5/0x910 iomap_dio_rw+0x36/0x80 ext4_dio_read_iter+0x146/0x190 [ext4] ext4_file_read_iter+0x1e2/0x230 [ext4] new_sync_read+0x29f/0x400 vfs_read+0x24e/0x2d0 ksys_read+0xd5/0x1b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Commit 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") changes that move process to a new cgroup will allocate a new bfqq to use, however, the old bfqq and new bfqq can point to the same bic: 1) Initial state, two process with io in the same cgroup. Process 1 Process 2 (BIC1) (BIC2) | Λ | Λ | | | | V | V | bfqq1 bfqq2 2) bfqq1 is merged to bfqq2. Process 1 Process 2 (BIC1) (BIC2) | | \-------------\| V bfqq1 bfqq2(coop) 3) Process 1 exit, then issue new io(denoce IOA) from Process 2. (BIC2) | Λ | | V | bfqq2(coop) 4) Before IOA is completed, move Process 2 to another cgroup and issue io. Process 2 (BIC2) Λ |\--------------\ | V bfqq2 bfqq3 Now that BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2. If all the requests are completed, and Process 2 exit, BIC2 will be freed while there is no guarantee that bfqq2 will be freed before BIC2. Fix the problem by clearing bfqq->bic while bfqq is detached from bic. Fixes: 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-09bfq: ignore oom_bfqq in bfq_check_wakerKhazhismel Kumykov1-1/+3
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker detection. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-09bfq: fix waker_bfqq inconsistency crashKhazhismel Kumykov1-2/+7
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02block, bfq: remove dead code for updating 'rq_in_driver'Yu Kuai1-16/+0
Such code are not even compiled since they are inside marco "#if 0". Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: cleanup __bfq_weights_tree_remove()Yu Kuai1-10/+1
It's the same with bfq_weights_tree_remove() now. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: cleanup bfq_weights_tree add/remove apisYu Kuai1-10/+9
The 'bfq_data' and 'rb_root_cached' can both be accessed through 'bfq_queue', thus only pass 'bfq_queue' as parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: do not idle if only one group is activatedYu Kuai1-2/+2
Now that root group is counted into 'num_groups_with_pending_reqs', 'num_groups_with_pending_reqs > 0' is always true in bfq_asymmetric_scenario(). Thus change the condition to '> 1'. On the other hand, this change can enable concurrent sync io if only one group is activated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>