kernel/linux.git/block/blk.h, branch v7.0.10

block: add pgmap check to biovec_phys_mergeable

2026-05-14T13:31:10+00:00

commit 13920e4b7b784b40cf4519ff1f0f3e513476a499 upstream. biovec_phys_mergeable() is used by the request merge, DMA mapping, and integrity merge paths to decide if two physically contiguous bvec segments can be coalesced into one. It currently has no check for whether the segments belong to different dev_pagemaps. When zone device memory is registered in multiple chunks, each chunk gets its own dev_pagemap. A single bio can legitimately contain bvecs from different pgmaps -- iov_iter_extract_bvecs() breaks at pgmap boundaries but the outer loop in bio_iov_iter_get_pages() continues filling the same bio. If such bvecs are physically contiguous, biovec_phys_mergeable() will coalesce them, making it impossible to recover the correct pgmap for the merged segment via page_pgmap(). Add a zone_device_pages_have_same_pgmap() check to prevent merging bvec segments that span different pgmaps. Fixes: 49580e690755 ("block: add check when merging zone device pages") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Naman Jain Link: https://patch.msgid.link/20260410153414.4159050-2-namjain@linux.microsoft.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: relax pgmap check in bio_add_page for compatible zone device pages

2026-05-07T04:13:54+00:00

commit 41c665aae2b5dbecddddcc8ace344caf630cc7a4 upstream. bio_add_page() and bio_integrity_add_page() reject pages from different dev_pagemaps entirely, returning 0 even when those pages have compatible DMA mapping requirements. This forces callers to start a new bio when buffers span pgmap boundaries, even though the pages could safely coexist as separate bvec entries. This matters for guests where memory is registered through devm_memremap_pages() with MEMORY_DEVICE_GENERIC in multiple calls, creating separate dev_pagemaps for each chunk. When a direct I/O buffer spans two such chunks, bio_add_page() rejects the second page, forcing an unnecessary bio split or I/O failure. Introduce zone_device_pages_compatible() in blk.h to check whether two pages can coexist in the same bio as separate bvec entries. The block DMA iterator (blk_dma_map_iter_start) caches the P2PDMA mapping state from the first segment and applies it to all others, so P2PDMA pages from different pgmaps must not be mixed, and neither must P2PDMA and non-P2PDMA pages. All other combinations (MEMORY_DEVICE_GENERIC pages from different pgmaps, or MEMORY_DEVICE_GENERIC with normal RAM) use the same dma_map_phys path and are safe. Replace the blanket zone_device_pages_have_same_pgmap() rejection with zone_device_pages_compatible(), while keeping zone_device_pages_have_same_pgmap() as a merge guard. Pages from different pgmaps can be added as separate bvec entries but must not be coalesced into the same segment, as that would make it impossible to recover the correct pgmap via page_pgmap(). Fixes: 49580e690755 ("block: add check when merging zone device pages") Cc: stable@vger.kernel.org Signed-off-by: Naman Jain Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260410153414.4159050-3-namjain@linux.microsoft.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

blk-mq: use NOIO context to prevent deadlock during debugfs creation

2026-02-16T17:47:25+00:00

Creating debugfs entries can trigger fs reclaim, which can enter back into the block layer request_queue. This can cause deadlock if the queue is frozen. Previously, a WARN_ON_ONCE check was used in debugfs_create_files() to detect this condition, but it was racy since the queue can be frozen from another context at any time. Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave() and blk_debugfs_unlock_nomemrestore() variants for callers that don't need NOIO protection (e.g., debugfs removal or read-only operations). Replace all raw debugfs_mutex lock/unlock pairs with these helpers, using the _nomemsave/_nomemrestore variants where appropriate. Reported-by: Yi Zhang Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/ Reported-by: Shinichiro Kawasaki Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/ Suggested-by: Christoph Hellwig Signed-off-by: Yu Kuai Reviewed-by: Nilay Shroff Signed-off-by: Jens Axboe

Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2026-02-10T02:14:52+00:00

Pull bounce buffer dio for stable pages from Jens Axboe: "This adds support for bounce buffering of dio for stable pages. This was all done by Christoph. In his words: This series tries to address the problem that under I/O pages can be modified during direct I/O, even when the device or file system require stable pages during I/O to calculate checksums, parity or data operations. It does so by adding block layer helpers to bounce buffer an iov_iter into a bio, then wires that up in iomap and ultimately XFS. The reason that the file system even needs to know about it, is because reads need a user context to copy the data back, and the infrastructure to defer ioends to a workqueue currently sits in XFS. I'm going to look into moving that into ioend and enabling it for other file systems. Additionally btrfs already has it's own infrastructure for this, and actually an urgent need to bounce buffer, so this should be useful there and could be wire up easily. In fact the idea comes from patches by Qu that did this in btrfs. This patch fixes all but one xfstests failures on T10 PI capable devices (generic/095 seems to have issues with a mix of mmap and splice still, I'm looking into that separately), and make qemu VMs running Windows, or Linux with swap enabled fine on an XFS file on a device using PI. Performance numbers on my (not exactly state of the art) NVMe PI test setup: Sequential reads using io_uring, QD=16. Bandwidth and CPU usage (usr/sys): | size | zero copy | bounce | +------+--------------------------+--------------------------+ | 4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) | | 64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) | | 1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) | +------+--------------------------+--------------------------+ Sequential writes using io_uring, QD=16. Bandwidth and CPU usage (usr/sys): | size | zero copy | bounce | +------+--------------------------+--------------------------+ | 4k | 882MiB/s (11.83/33.88%) | 750MiB/s (10.53/34.08%) | | 64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) | | 1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) | +------+--------------------------+--------------------------+ Note that the 64k read numbers look really odd to me for the baseline zero copy case, but are reproducible over many repeated runs. The bounce read numbers should further improve when moving the PI validation to the file system and removing the double context switch, which I have patches for that will sent out soon" * tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: xfs: use bounce buffering direct I/O when the device requires stable pages iomap: add a flag to bounce buffer direct I/O iomap: support ioends for direct reads iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED iomap: free the bio before completing the dio iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct iomap: split out the per-bio logic from iomap_dio_bio_iter iomap: simplify iomap_dio_bio_iter iomap: fix submission side handling of completion side errors block: add helpers to bounce buffer an iov_iter into bios block: remove bio_release_page iov_iter: extract a iov_iter_extract_bvecs helper from bio code block: open code bio_add_page and fix handling of mismatching P2P ranges block: refactor get_contig_folio_len block: add a BIO_MAX_SIZE constant and use it

block: decouple secure erase size limit from discard size limit

2026-02-05T03:42:12+00:00

Secure erase should use max_secure_erase_sectors instead of being limited by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from REQ_OP_DISCARD to allow each operation to use its own size limit. Signed-off-by: Luke Wang Reviewed-by: Ulf Hansson Signed-off-by: Jens Axboe

block: remove bio_release_page

2026-01-28T12:16:39+00:00

Merge bio_release_page into the only remaining caller. Signed-off-by: Christoph Hellwig Reviewed-by: Anuj Gupta Reviewed-by: Damien Le Moal Reviewed-by: Johannes Thumshirn Reviewed-by: Darrick J. Wong Tested-by: Anuj Gupta Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe

block: account for bi_bvec_done in bio_may_need_split()

2026-01-10T17:22:54+00:00

When checking if a bio fits in a single segment, bio_may_need_split() compares bi_size against the current bvec's bv_len. However, for partially consumed bvecs (bi_bvec_done > 0), such as in cloned or split bios, the remaining bytes in the current bvec is actually (bv_len - bi_bvec_done), not bv_len. This could cause bio_may_need_split() to incorrectly return false, leading to nr_phys_segments being set to 1 when the bio actually spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg() when the actual mapped segments exceed the expected count. Fix by subtracting bi_bvec_done from bv_len in the comparison. Reported-by: Venkat Rao Bagalkote Close: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/ Repored-and-bisected-by: Christoph Hellwig Tested-by: Venkat Rao Bagalkote Tested-by: Christoph Hellwig Fixes: ee623c892aa5 ("block: use bvec iterator helper for bio_may_need_split()") Cc: Nitesh Shetty Signed-off-by: Ming Lei Signed-off-by: Jens Axboe

block: use bvec iterator helper for bio_may_need_split()

2026-01-07T15:06:33+00:00

bio_may_need_split() uses bi_vcnt to determine if a bio has a single segment, but bi_vcnt is unreliable for cloned bios. Cloned bios share the parent's bi_io_vec array but iterate over a subset via bi_iter, so bi_vcnt may not reflect the actual segment count being iterated. Replace the bi_vcnt check with bvec iterator access via __bvec_iter_bvec(), comparing bi_iter.bi_size against the current bvec's length. This correctly handles both cloned and non-cloned bios. Move bi_io_vec into the first cache line adjacent to bi_iter. This is a sensible layout since bi_io_vec and bi_iter are commonly accessed together throughout the block layer - every bvec iteration requires both fields. This displaces bi_end_io to the second cache line, which is acceptable since bi_end_io and bi_private are always fetched together in bio_endio() anyway. The struct layout change requires bio_reset() to preserve and restore bi_io_vec across the memset, since it now falls within BIO_RESET_BYTES. Nitesh verified that this patch doesn't regress NVMe 512-byte IO perf [1]. Link: https://lore.kernel.org/linux-block/20251220081607.tvnrltcngl3cc2fh@green245.gost/ [1] Signed-off-by: Ming Lei Reviewed-by: Nitesh Shetty Signed-off-by: Jens Axboe

block: unify elevator tags and type xarrays into struct elv_change_ctx

2025-11-13T16:27:49+00:00

Currently, the nr_hw_queues update path manages two disjoint xarrays — one for elevator tags and another for elevator type — both used during elevator switching. Maintaining these two parallel structures for the same purpose adds unnecessary complexity and potential for mismatched state. This patch unifies both xarrays into a single structure, struct elv_change_ctx, which holds all per-queue elevator change context. A single xarray, named elv_tbl, now maps each queue (q->id) in a tagset to its corresponding elv_change_ctx entry, encapsulating the elevator tags, type and name references. This unification simplifies the code, improves maintainability, and clarifies ownership of per-queue elevator state. Reviewed-by: Ming Lei Reviewed-by: Yu Kuai Signed-off-by: Nilay Shroff Signed-off-by: Jens Axboe

block: handle zone management operations completions

2025-11-05T15:07:21+00:00

The functions blk_zone_wplug_handle_reset_or_finish() and blk_zone_wplug_handle_reset_all() both modify the zone write pointer offset of zone write plugs that are the target of a reset, reset all or finish zone management operation. However, these functions do this modification before the BIO is executed. So if the zone operation fails, the modified zone write pointer offsets become invalid. Avoid this by modifying the zone write pointer offset of a zone write plug that is the target of a zone management operation when the operation completes. To do so, modify blk_zone_bio_endio() to call the new function blk_zone_mgmt_bio_endio() which in turn calls the functions blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or blk_zone_finish_bio_endio() depending on the operation of the completed BIO, to modify a zone write plug write pointer offset accordingly. These functions are called only if the BIO execution was successful. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Reviewed-by: Martin K. Petersen Signed-off-by: Jens Axboe