kernel/linux.git/block, branch linux-6.0.y

block: don't allow splitting of a REQ_NOWAIT bio

2023-01-12T11:00:46+00:00

commit 9cea62b2cbabff8ed46f2df17778b624ad9dd25a upstream. If we split a bio marked with REQ_NOWAIT, then we can trigger spurious EAGAIN if constituent parts of that split bio end up failing request allocations. Parts will complete just fine, but just a single failure in one of the chained bios will yield an EAGAIN final result for the parent bio. Return EAGAIN early if we end up needing to split such a bio, which allows for saner recovery handling. Cc: stable@vger.kernel.org # 5.15+ Link: https://github.com/axboe/liburing/issues/766 Reported-by: Michael Kelley Reviewed-by: Keith Busch Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: mq-deadline: Do not break sequential write streams to zoned HDDs

2023-01-07T10:15:54+00:00

commit 015d02f48537cf2d1a65eeac50717566f9db6eec upstream. mq-deadline ensures an in order dispatching of write requests to zoned block devices using a per zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone to be issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being writen and not locked). This result in an out of order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard-disks react badly to seeing these out of order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in lba order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \ --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \ --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: Signed-off-by: Damien Le Moal Reviewed-by: Johannes Thumshirn Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: mq-deadline: Fix dd_finish_request() for zoned devices

2023-01-07T10:15:54+00:00

commit 2820e5d0820ac4daedff1272616a53d9c7682fd2 upstream. dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorirites does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: Signed-off-by: Damien Le Moal Reviewed-by: Johannes Thumshirn Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: Do not reread partition table on exclusively open device

2023-01-04T10:26:31+00:00

commit 36369f46e91785688a5f39d7a5590e3f07981316 upstream. Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") we allow rereading of partition table although there are users of the block device. This has an undesirable consequence that e.g. if sda and sdb are assembled to a RAID1 device md0 with partitions, BLKRRPART ioctl on sda will rescan partition table and create sda1 device. This partition device under a raid device confuses some programs (such as libstorage-ng used for initial partitioning for distribution installation) leading to failures. Fix the problem refusing to rescan partitions if there is another user that has the block device exclusively open. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221130135344.2ul4cyfstfs3znxg@quack3 Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") Signed-off-by: Jan Kara Link: https://lore.kernel.org/r/20221130175653.24299-1-jack@suse.cz [axboe: fold in followup fix] Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq

2023-01-04T10:26:24+00:00

[ Upstream commit 246cf66e300b76099b5dbd3fdd39e9a5dbc53f02 ] Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq() can free bfqq first, and then call bic_set_bfqq(), which will cause uaf. Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-by: Yi Zhang Signed-off-by: Yu Kuai Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-iolatency: Fix memory leak on add_disk() failures

2023-01-04T10:26:23+00:00

[ Upstream commit 813e693023ba10da9e75067780f8378465bf27cc ] When a gendisk is successfully initialized but add_disk() fails such as when a loop device has invalid number of minor device numbers specified, blkcg_init_disk() is called during init and then blkcg_exit_disk() during error handling. Unfortunately, iolatency gets initialized in the former but doesn't get cleaned up in the latter. This is because, in non-error cases, the cleanup is performed by del_gendisk() calling rq_qos_exit(), the assumption being that rq_qos policies, iolatency being one of them, can only be activated once the disk is fully registered and visible. That assumption is true for wbt and iocost, but not so for iolatency as it gets initialized before add_disk() is called. It is desirable to lazy-init rq_qos policies because they are optional features and add to hot path overhead once initialized - each IO has to walk all the registered rq_qos policies. So, we want to switch iolatency to lazy init too. However, that's a bigger change. As a fix for the immediate problem, let's just add an extra call to rq_qos_exit() in blkcg_exit_disk(). This is safe because duplicate calls to rq_qos_exit() become noop's. Signed-off-by: Tejun Heo Reported-by: darklight2357@icloud.com Cc: Josef Bacik Cc: Linus Torvalds Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/Y5TQ5gm3O4HXrXR3@slm.duckdns.org Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: pass a gendisk to blkg_destroy_all

2023-01-04T10:26:23+00:00

[ Upstream commit 00ad6991bbae116b7c83f68754edd6f4d5e65e01 ] Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig Reviewed-by: Andreas Herrmann Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220921180501.1539876-16-hch@lst.de Signed-off-by: Jens Axboe Stable-dep-of: 813e693023ba ("blk-iolatency: Fix memory leak on add_disk() failures") Signed-off-by: Sasha Levin

blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exit

2023-01-04T10:26:22+00:00

[ Upstream commit e13793bae65919cd3e6a7827f8d30f4dbb8584ee ] Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig Reviewed-by: Andreas Herrmann Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220921180501.1539876-13-hch@lst.de Signed-off-by: Jens Axboe Stable-dep-of: 813e693023ba ("blk-iolatency: Fix memory leak on add_disk() failures") Signed-off-by: Sasha Levin

blk-cgroup: pass a gendisk to blkcg_init_queue and blkcg_exit_queue

2023-01-04T10:26:22+00:00

[ Upstream commit 9823538fb7efe66ce987a1e4c0e0f3dc882623c4 ] Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving the blk-cgroup infrastructure to be gendisk based. Also remove the rather pointless kerneldoc comments for these internal functions with a single caller each. Signed-off-by: Christoph Hellwig Reviewed-by: Andreas Herrmann Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de Signed-off-by: Jens Axboe Stable-dep-of: 813e693023ba ("blk-iolatency: Fix memory leak on add_disk() failures") Signed-off-by: Sasha Levin

blk-cgroup: cleanup the blkg_lookup family of functions

2023-01-04T10:26:22+00:00

[ Upstream commit 4a69f325aa43847e0827fbfe4b3623307b0c9baa ] Add a fully inlined blkg_lookup as the extra two checks aren't going to generated a lot more code vs the call to the slowpath routine, and open code the hint update in the two callers that care. Signed-off-by: Christoph Hellwig Reviewed-by: Andreas Herrmann Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220921180501.1539876-5-hch@lst.de Signed-off-by: Jens Axboe Stable-dep-of: 813e693023ba ("blk-iolatency: Fix memory leak on add_disk() failures") Signed-off-by: Sasha Levin