kernel/linux.git/include/linux/blkdev.h, branch v6.19.11

blk-mq: fix potential uaf for 'queue_hw_ctx'

2025-11-28T16:09:19+00:00

This is just apply Kuai's patch in [1] with mirror changes. blk_mq_realloc_hw_ctxs() will free the 'queue_hw_ctx'(e.g. undate submit_queues through configfs for null_blk), while it might still be used from other context(e.g. switch elevator to none): t1 t2 elevator_switch blk_mq_unquiesce_queue blk_mq_run_hw_queues queue_for_each_hw_ctx // assembly code for hctx = (q)->queue_hw_ctx[i] mov 0x48(%rbp),%rdx -> read old queue_hw_ctx __blk_mq_update_nr_hw_queues blk_mq_realloc_hw_ctxs hctxs = q->queue_hw_ctx q->queue_hw_ctx = new_hctxs kfree(hctxs) movslq %ebx,%rax mov (%rdx,%rax,8),%rdi ->uaf This problem was found by code review, and I comfirmed that the concurrent scenario do exist(specifically 'q->queue_hw_ctx' can be changed during blk_mq_run_hw_queues()), however, the uaf problem hasn't been repoduced yet without hacking the kernel. Sicne the queue is freezed in __blk_mq_update_nr_hw_queues(), fix the problem by protecting 'queue_hw_ctx' through rcu where it can be accessed without grabbing 'q_usage_counter'. [1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/ Signed-off-by: Yu Kuai Signed-off-by: Fengnan Chang Signed-off-by: Jens Axboe

blk-mq: use array manage hctx map instead of xarray

2025-11-28T16:09:19+00:00

After commit 4e5cc99e1e48 ("blk-mq: manage hctx map via xarray"), we use an xarray instead of array to store hctx, but in poll mode, each time in blk_mq_poll, we need use xa_load to find corresponding hctx, this introduce some costs. In my test, xa_load may cost 3.8% cpu. This patch revert previous change, eliminates the overhead of xa_load and can result in a 3% performance improvement. Signed-off-by: Fengnan Chang Signed-off-by: Jens Axboe

block: Remove queue freezing from several sysfs store callbacks

2025-11-18T22:00:11+00:00

Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously. Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need these callbacks. Add the __data_racy annotation to request_queue.rq_timeout to suppress KCSAN data race reports about the rq_timeout reads. This patch may cause a small delay in applying the new settings. For all the attributes affected by this patch, I/O will complete correctly whether the old or the new value of the attribute is used. This patch affects the following sysfs attributes: * io_poll_delay * io_timeout * nomerges * read_ahead_kb * rq_affinity Here is an example of a deadlock triggered by running test srp/002 if this patch is not applied: task:multipathd Call Trace: __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 schedule_preempt_disabled+0x1c/0x30 __mutex_lock+0xb89/0x1650 mutex_lock_nested+0x1f/0x30 dm_table_set_restrictions+0x823/0xdf0 __bind+0x166/0x590 dm_swap_table+0x2a7/0x490 do_resume+0x1b1/0x610 dev_suspend+0x55/0x1a0 ctl_ioctl+0x3a5/0x7e0 dm_ctl_ioctl+0x12/0x20 __x64_sys_ioctl+0x127/0x1a0 x64_sys_call+0xe2b/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 task:(udev-worker) Call Trace: __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 blk_mq_freeze_queue_wait+0xf2/0x140 blk_mq_freeze_queue_nomemsave+0x23/0x30 queue_ra_store+0x14e/0x290 queue_attr_store+0x23e/0x2c0 sysfs_kf_write+0xde/0x140 kernfs_fop_write_iter+0x3b2/0x630 vfs_write+0x4fd/0x1390 ksys_write+0xfd/0x230 __x64_sys_write+0x76/0xc0 x64_sys_call+0x276/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Cc: Christoph Hellwig Cc: Ming Lei Cc: Nilay Shroff Cc: Martin Wilck Cc: Benjamin Marzinski Cc: stable@vger.kernel.org Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Signed-off-by: Bart Van Assche Reviewed-by: Nilay Shroff Signed-off-by: Jens Axboe

block: introduce bdev_zone_start()

2025-11-07T16:28:08+00:00

Introduce the function bdev_zone_start() as a more explicit (and clear) replacement for ALIGN_DOWN() to get the start sector of a zone containing a particular sector of a zoned block device. Use this new helper in blkdev_get_zone_info() and blkdev_report_zones_cached(). Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe

block: fix cached zone reporting after zone append was used

2025-11-06T23:15:27+00:00

No zone plugs are allocated when a zone is opened by calling Zone Append on it. This makes the cached zone reporting report incorrectly empty zones if the file system is unmounted and report zones is called after that, e.g. by xfstests test cases using the scratch device. Fix this by recording if zone append was used on a device, and disable cached reporting for the device until a ZONE_RESET_ALL happens that guarantees all zones are empty. We could probably do even better using a per-zone flag, but the practical use cache for zone reporting after the initial mount are rather limited, so let's keep things simple for now. Fixes: 31f0656a4ab7 ("block: introduce blkdev_report_zones_cached()") Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Reviewed-by: Bart Van Assche Signed-off-by: Jens Axboe

block: introduce blkdev_report_zones_cached()

2025-11-05T15:07:21+00:00

Introduce the function blkdev_report_zones_cached() to provide a fast report zone built using the blkdev_get_zone_info() function, which gets zone information from a disk zones_cond array or zone write plugs. For a large capacity SMR drive, such fast report zone can be completed in a few milliseconds compared to several seconds completion times when the report zone is obtained from the device. The zone report is built in the same manner as with the regular blkdev_report_zones() function, that is, the first zone reported is the one containing the specified start sector and the report is limited to the specified number of zones (nr_zones argument). The information for each zone in the report is obtained using blkdev_get_zone_info(). For zoned devices that do not use zone write plug resources, using blkdev_get_zone_info() is inefficient as the zone report would be very slow, generated one zone at a time. To avoid this, blkdev_report_zones_cached() falls back to calling blkdev_do_report_zones() to execute a regular zone report. In this case, the .report_active field of struct blk_report_zones_args is set to true to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in place of the implicit open, explicit open and closed conditions. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

block: introduce blkdev_get_zone_info()

2025-11-05T15:07:21+00:00

Introduce the function blkdev_get_zone_info() to obtain a single zone information from cached zone data, that is, either from the zone write plug for the target zone if it exists and from the disk zones_cond array otherwise. Since sequential zones that do not have a zone write plug are either full, empty or in a bad state (read-only or offline), the zone write pointer can be inferred from the zone condition cached in the disk zones_cond array. For sequential zones that have a zone write plug, the zone condition and zone write pointer are obtained from the condition and write pointer offset managed with the zone write plug. This allows obtaining the information for a zone much more quickly than having to execute a report zones command on the device. blkdev_get_zone_info() falls back to using a regular zone report if the target zone is flagged as needing an update with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not use zone write plugs (i.e. a device mapper device). In this case, the new function blkdev_report_zone_fallback() is used and the zone condition is reported consistantly with the cahced report, that is, the BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open, explicit open and closed conditions. This is achieved by adding the .report_active field to struct blk_report_zones_args and by having disk_report_zone() sets the correct zone condition if .report_active is true. In preparation for using blkdev_get_zone_info() in upcoming file systems changes, also export this function as a GPL symbol. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

block: use zone condition to determine conventional zones

2025-11-05T15:07:21+00:00

The conv_zones_bitmap field of struct gendisk is used to define a bitmap to identify the conventional zones of a zoned block device. The bit for a zone is set in this bitmap if the zone is a conventional one, that is, if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such zone, this always corresponds to the zone condition BLK_ZONE_COND_NOT_WP. In other words, conv_zones_bitmap tracks a single condition of the zones of a zoned block device. In preparation for tracking more zone conditions, change conv_zones_bitmap into an array of zone conditions, using 1 byte per zone. This increases the memory usage from 1 bit per zone to 1 byte per zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256 MiB zones. This is a trade-off to allow fast cached report zones later on top of this change. Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add a blk_revalidate_zone_cond() function to initialize the zones_cond array of a disk during device scan and to update it on device revalidation. Move the allocation of the zones_cond array to disk_revalidate_zone_resources(), making sure that this array is always allocated, even for devices that do not need zone write plugs (zone resources), to ensure that bdev_zone_is_seq() can be re-implemented to use the zone condition array in place of the conv zones bitmap. Finally, the function bdev_zone_is_seq() is rewritten to use a test on the condition of the target zone. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

block: introduce disk_report_zone()

2025-11-05T15:07:21+00:00

Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a report zones executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug zone write pointer offset after a write error. However, this solution makes following the path of a zone information harder to understand. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data and introduce the helper function disk_report_zone() which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report a zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

block: rename min_segment_size

2025-10-22T13:39:39+00:00

Despite its name, the block layer is fine with segments smaller that the "min_segment_size" limit. The value is an optimization limit indicating the largest segment that can be used without considering boundary limits. Smaller segments can take a fast path, so give it a name that reflects that: max_fast_segment_size. Signed-off-by: Keith Busch Reviewed-by: Christoph Hellwig Reviewed-by: Ming Lei Reviewed-by: Chaitanya Kulkarni Signed-off-by: Jens Axboe