kernel/linux.git/include/linux/blk-mq.h, branch v6.19.11

blk-mq: add blk_rq_nr_bvec() helper

2025-12-04T14:19:26+00:00

Add a new helper function blk_rq_nr_bvec() that returns the number of bvecs in a request. This count represents the number of iterations rq_for_each_bvec() would perform on a request. Drivers need to pre-allocate bvec arrays before iterating through a request's bvecs. Currently, they manually count bvecs using rq_for_each_bvec() in a loop, which is repetitive. The new helper centralizes this logic. This pattern exists in loop and zloop drivers, where multi-bio requests require copying bvecs into a contiguous array before creating an iov_iter for file operations. Update loop and zloop drivers to use the new helper, eliminating duplicate code. This patch also provides a clear API to avoid any potential misuse of blk_nr_phys_segments() for calculating the bvecs since, one bvec can have more than one segments and use of blk_nr_phys_segments() can lead to extra memory allocation :- [ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072 [ 6155.673846] null_blk: #### null_handle_data_transfer:1375 [ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2 [ 6155.674263] null_blk: #### null_handle_data_transfer:1375 [ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1 Reviewed-by: Niklas Cassel Signed-off-by: Chaitanya Kulkarni Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe

blk-mq: fix potential uaf for 'queue_hw_ctx'

2025-11-28T16:09:19+00:00

This is just apply Kuai's patch in [1] with mirror changes. blk_mq_realloc_hw_ctxs() will free the 'queue_hw_ctx'(e.g. undate submit_queues through configfs for null_blk), while it might still be used from other context(e.g. switch elevator to none): t1 t2 elevator_switch blk_mq_unquiesce_queue blk_mq_run_hw_queues queue_for_each_hw_ctx // assembly code for hctx = (q)->queue_hw_ctx[i] mov 0x48(%rbp),%rdx -> read old queue_hw_ctx __blk_mq_update_nr_hw_queues blk_mq_realloc_hw_ctxs hctxs = q->queue_hw_ctx q->queue_hw_ctx = new_hctxs kfree(hctxs) movslq %ebx,%rax mov (%rdx,%rax,8),%rdi ->uaf This problem was found by code review, and I comfirmed that the concurrent scenario do exist(specifically 'q->queue_hw_ctx' can be changed during blk_mq_run_hw_queues()), however, the uaf problem hasn't been repoduced yet without hacking the kernel. Sicne the queue is freezed in __blk_mq_update_nr_hw_queues(), fix the problem by protecting 'queue_hw_ctx' through rcu where it can be accessed without grabbing 'q_usage_counter'. [1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/ Signed-off-by: Yu Kuai Signed-off-by: Fengnan Chang Signed-off-by: Jens Axboe

blk-mq: use array manage hctx map instead of xarray

2025-11-28T16:09:19+00:00

After commit 4e5cc99e1e48 ("blk-mq: manage hctx map via xarray"), we use an xarray instead of array to store hctx, but in poll mode, each time in blk_mq_poll, we need use xa_load to find corresponding hctx, this introduce some costs. In my test, xa_load may cost 3.8% cpu. This patch revert previous change, eliminates the overhead of xa_load and can result in a 3% performance improvement. Signed-off-by: Fengnan Chang Signed-off-by: Jens Axboe

block: accumulate memory segment gaps per bio

2025-11-07T01:11:58+00:00

The blk-mq dma iterator has an optimization for requests that align to the device's iommu merge boundary. This boundary may be larger than the device's virtual boundary, but the code had been depending on that queue limit to know ahead of time if the request is guaranteed to align to that optimization. Rather than rely on that queue limit, which many devices may not report, save the lowest set bit of any boundary gap between each segment in the bio while checking the segments. The request stores the value for merging and quickly checking per io if the request can use iova optimizations. Signed-off-by: Keith Busch Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

blk-mq: Document tags_srcu member in blk_mq_tag_set structure

2025-09-09T13:37:05+00:00

Add missing documentation for the tags_srcu member that was introduced to defer freeing of tags page_list to prevent use-after-free when iterating tags. Fixes htmldocs warning: WARNING: include/linux/blk-mq.h:536 struct member 'tags_srcu' not described in 'blk_mq_tag_set' Reported-by: Stephen Rothwell Signed-off-by: Ming Lei Signed-off-by: Jens Axboe

blk-mq: Defer freeing of tags page_list to SRCU callback

2025-09-08T14:05:32+00:00

Tag iterators can race with the freeing of the request pages(tags->page_list), potentially leading to use-after-free issues. Defer the freeing of the page list and the tags structure itself until after an SRCU grace period has passed. This ensures that any concurrent tag iterators have completed before the memory is released. With this way, we can replace the big tags->lock in tags iterator code path with srcu for solving the issue. This is achieved by: - Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect tag map iteration. - Adding an `rcu_head` to `struct blk_mq_tags` to be used with `call_srcu`. - Moving the page list freeing logic and the `kfree(tags)` call into a new callback function, `blk_mq_free_tags_callback`. - In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new callback for deferred execution. The read-side protection for the tag iterators will be added in a subsequent patch. Reviewed-by: Hannes Reinecke Reviewed-by: Yu Kuai Signed-off-by: Ming Lei Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

blk-mq: add number of queue calc helper

2025-07-01T16:24:19+00:00

Add two variants of helper functions that calculate the correct number of queues to use. Two variants are needed because some drivers base their maximum number of queues on the possible CPU mask, while others use the online CPU mask. Reviewed-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Reviewed-by: Ming Lei Signed-off-by: Daniel Wagner Reviewed-by: Chaitanya Kulkarni Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-2-13923686b54b@kernel.org Signed-off-by: Jens Axboe

block: remove the q argument from blk_rq_map_kern

2025-05-07T13:31:07+00:00

Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Reviewed-by: Hannes Reinecke Reviewed-by: Johannes Thumshirn Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe

block: fix warning on 'make htmldocs'

2025-05-07T11:37:39+00:00

Fix the following warning when running 'make htmldocs': +WARNING: include/linux/blk-mq.h:532 struct member 'update_nr_hwq_lock' not described in 'blk_mq_tag_set' Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: Stephen Rothwell Closes: https://lore.kernel.org/all/20250507163220.00141d77@canb.auug.org.au/ Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20250507092537.3009112-1-ming.lei@redhat.com Signed-off-by: Jens Axboe

block: prevent adding/deleting disk during updating nr_hw_queues

2025-05-06T13:43:42+00:00

Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't allow them in-progress when updating nr_hw_queues, kernel panic and kasan has been reported in [1]. Prevent adding/deleting disk during updating nr_hw_queues by adding rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(), and read lock is acquired when adding/deleting disk. Also mark GFP_NOIO allocation scope for adding/deleting disk because blk_mq_update_nr_hw_queues() is part of some driver's error handler. This way avoids lot of trouble. Reviewed-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Reviewed-by: Nilay Shroff Suggested-by: Nilay Shroff Reported-by: Nilay Shroff Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/ Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com Signed-off-by: Jens Axboe