<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/block/blk.h, branch v6.18.33</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.33</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.33'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-05-14T13:30:17+00:00</updated>
<entry>
<title>block: add pgmap check to biovec_phys_mergeable</title>
<updated>2026-05-14T13:30:17+00:00</updated>
<author>
<name>Naman Jain</name>
<email>namjain@linux.microsoft.com</email>
</author>
<published>2026-04-10T15:34:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f17d521075325b8afc42d1baa1c28a5e9aca111f'/>
<id>urn:sha1:f17d521075325b8afc42d1baa1c28a5e9aca111f</id>
<content type='text'>
commit 13920e4b7b784b40cf4519ff1f0f3e513476a499 upstream.

biovec_phys_mergeable() is used by the request merge, DMA mapping,
and integrity merge paths to decide if two physically contiguous
bvec segments can be coalesced into one. It currently has no check
for whether the segments belong to different dev_pagemaps.

When zone device memory is registered in multiple chunks, each chunk
gets its own dev_pagemap. A single bio can legitimately contain
bvecs from different pgmaps -- iov_iter_extract_bvecs() breaks at
pgmap boundaries but the outer loop in bio_iov_iter_get_pages()
continues filling the same bio. If such bvecs are physically
contiguous, biovec_phys_mergeable() will coalesce them, making it
impossible to recover the correct pgmap for the merged segment
via page_pgmap().

Add a zone_device_pages_have_same_pgmap() check to prevent merging
bvec segments that span different pgmaps.

Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Naman Jain &lt;namjain@linux.microsoft.com&gt;
Link: https://patch.msgid.link/20260410153414.4159050-2-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>block: relax pgmap check in bio_add_page for compatible zone device pages</title>
<updated>2026-05-07T04:11:42+00:00</updated>
<author>
<name>Naman Jain</name>
<email>namjain@linux.microsoft.com</email>
</author>
<published>2026-04-10T15:34:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8661370e0865c8670ed367e21c90ca166e7875a7'/>
<id>urn:sha1:8661370e0865c8670ed367e21c90ca166e7875a7</id>
<content type='text'>
commit 41c665aae2b5dbecddddcc8ace344caf630cc7a4 upstream.

bio_add_page() and bio_integrity_add_page() reject pages from different
dev_pagemaps entirely, returning 0 even when those pages have compatible
DMA mapping requirements. This forces callers to start a new bio when
buffers span pgmap boundaries, even though the pages could safely coexist
as separate bvec entries.

This matters for guests where memory is registered through
devm_memremap_pages() with MEMORY_DEVICE_GENERIC in multiple calls,
creating separate dev_pagemaps for each chunk. When a direct I/O buffer
spans two such chunks, bio_add_page() rejects the second page, forcing an
unnecessary bio split or I/O failure.

Introduce zone_device_pages_compatible() in blk.h to check whether two
pages can coexist in the same bio as separate bvec entries. The block DMA
iterator (blk_dma_map_iter_start) caches the P2PDMA mapping state from the
first segment and applies it to all others, so P2PDMA pages from different
pgmaps must not be mixed, and neither must P2PDMA and non-P2PDMA pages.
All other combinations (MEMORY_DEVICE_GENERIC pages from different pgmaps,
or MEMORY_DEVICE_GENERIC with normal RAM) use the same dma_map_phys path
and are safe.

Replace the blanket zone_device_pages_have_same_pgmap() rejection with
zone_device_pages_compatible(), while keeping
zone_device_pages_have_same_pgmap() as a merge guard.
Pages from different pgmaps can be added as separate bvec entries but
must not be coalesced into the same segment, as that would make
it impossible to recover the correct pgmap via page_pgmap().

Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Signed-off-by: Naman Jain &lt;namjain@linux.microsoft.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Link: https://patch.msgid.link/20260410153414.4159050-3-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>block: decouple secure erase size limit from discard size limit</title>
<updated>2026-03-04T12:19:38+00:00</updated>
<author>
<name>Luke Wang</name>
<email>ziniu.wang_1@nxp.com</email>
</author>
<published>2026-02-04T03:40:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f15b4e9286c7fa1417be8519e28352006f3ee549'/>
<id>urn:sha1:f15b4e9286c7fa1417be8519e28352006f3ee549</id>
<content type='text'>
[ Upstream commit ee81212f74a57c5d2b56cf504f40d528dac6faaf ]

Secure erase should use max_secure_erase_sectors instead of being limited
by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from
REQ_OP_DISCARD to allow each operation to use its own size limit.

Signed-off-by: Luke Wang &lt;ziniu.wang_1@nxp.com&gt;
Reviewed-by: Ulf Hansson &lt;ulf.hansson@linaro.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>block: handle zone management operations completions</title>
<updated>2026-01-08T09:16:52+00:00</updated>
<author>
<name>Damien Le Moal</name>
<email>dlemoal@kernel.org</email>
</author>
<published>2025-11-04T21:22:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7c87afd13484f1399cd7a7ac4d9c1e611d0f222f'/>
<id>urn:sha1:7c87afd13484f1399cd7a7ac4d9c1e611d0f222f</id>
<content type='text'>
commit efae226c2ef19528ffd81d29ba0eecf1b0896ca2 upstream.

The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.

Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal &lt;dlemoal@kernel.org&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Chaitanya Kulkarni &lt;kch@nvidia.com&gt;
Reviewed-by: Hannes Reinecke &lt;hare@suse.de&gt;
Reviewed-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>block: unify elevator tags and type xarrays into struct elv_change_ctx</title>
<updated>2026-01-02T11:56:50+00:00</updated>
<author>
<name>Nilay Shroff</name>
<email>nilay@linux.ibm.com</email>
</author>
<published>2025-11-13T08:58:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=22f98de8fa78f5e245046e2714244479e0ffd371'/>
<id>urn:sha1:22f98de8fa78f5e245046e2714244479e0ffd371</id>
<content type='text'>
[ Upstream commit 232143b605387b372dee0ec7830f93b93df5f67d ]

Currently, the nr_hw_queues update path manages two disjoint xarrays —
one for elevator tags and another for elevator type — both used during
elevator switching. Maintaining these two parallel structures for the
same purpose adds unnecessary complexity and potential for mismatched
state.

This patch unifies both xarrays into a single structure, struct
elv_change_ctx, which holds all per-queue elevator change context. A
single xarray, named elv_tbl, now maps each queue (q-&gt;id) in a tagset
to its corresponding elv_change_ctx entry, encapsulating the elevator
tags, type and name references.

This unification simplifies the code, improves maintainability, and
clarifies ownership of per-queue elevator state.

Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Yu Kuai &lt;yukuai@fnnas.com&gt;
Signed-off-by: Nilay Shroff &lt;nilay@linux.ibm.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: 9869d3a6fed3 ("block: fix race between wbt_enable_default and IO submission")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>block: fix ordering of recursive split IO</title>
<updated>2025-09-10T11:23:46+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-09-10T06:30:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b2f5974079d82a4761f002e80601064d4e39a81f'/>
<id>urn:sha1:b2f5974079d82a4761f002e80601064d4e39a81f</id>
<content type='text'>
Currently, split bio will be chained to original bio, and original bio
will be resubmitted to the tail of current-&gt;bio_list, waiting for
split bio to be issued. However, if split bio get split again, the IO
order will be messed up. This problem, on the one hand, will cause
performance degradation, especially for mdraid with large IO size; on
the other hand, will cause write errors for zoned block devices[1].

For example, in raid456 IO will first be split by max_sector from
md_submit_bio(), and then later be split again by chunksize for internal
handling:

For example, assume max_sectors is 1M, and chunksize is 512k

1) issue a 2M IO:

bio issuing: 0+2M
current-&gt;bio_list: NULL

2) md_submit_bio() split by max_sector:

bio issuing: 0+1M
current-&gt;bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current-&gt;bio_list: 1M+1M -&gt; 512k+512k

4) after first bio issued, __submit_bio_noacct() will contuine issuing
next bio:

bio issuing: 1M+1M
current-&gt;bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current-&gt;bio_list: 512k+512k -&gt; 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -&gt; 1M+512k -&gt; 512k+512k -&gt; 1536k+512k

This behaviour will cause large IO read on raid456 endup to be small
discontinuous IO in underlying disks. Fix this problem by placing split
bio to the head of current-&gt;bio_list.

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    &gt;=(KB) .. &lt;(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    &gt;=(KB) .. &lt;(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long term problem from day one, and large
sequential IO read is pretty common case like video playing.

And even with this patch, in this test case IO is merged to at most 128k
is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
limit can get even better performance. However, we'll figure out how to do
this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren &lt;tieren@fnnas.com&gt;
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Bart Van Assche &lt;bvanassche@acm.org&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: skip unnecessary checks for split bio</title>
<updated>2025-09-10T11:23:46+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-09-10T06:30:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0b64682e78f7a53ea863e368b1aa66f05767858d'/>
<id>urn:sha1:0b64682e78f7a53ea863e368b1aa66f05767858d</id>
<content type='text'>
Lots of checks are already done while submitting this bio the first
time, and there is no need to check them again when this bio is
resubmitted after split.

Hence open code should_fail_bio() and blk_throtl_bio() that are still
necessary from submit_bio_split_bioset().

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: cleanup bio_issue</title>
<updated>2025-09-10T11:23:45+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-09-10T06:30:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1733e88874838ddebf7774440c285700865e6b08'/>
<id>urn:sha1:1733e88874838ddebf7774440c285700865e6b08</id>
<content type='text'>
Now that bio-&gt;bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-mq: Defer freeing flush queue to SRCU callback</title>
<updated>2025-09-08T14:05:32+00:00</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@redhat.com</email>
</author>
<published>2025-08-30T02:18:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=135b8521f21d4d4d4fde74e73b80d8e4d417e20a'/>
<id>urn:sha1:135b8521f21d4d4d4fde74e73b80d8e4d417e20a</id>
<content type='text'>
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. With this way, we can replace the big tags-&gt;lock in tags iterator
code path with srcu for solving the issue.

This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.

The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
  to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
  `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
  instance to ensure synchronization with tag iterators.

Reviewed-by: Hannes Reinecke &lt;hare@suse.de&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: fix kobject double initialization in add_disk</title>
<updated>2025-08-11T14:00:49+00:00</updated>
<author>
<name>Zheng Qixing</name>
<email>zhengqixing@huawei.com</email>
</author>
<published>2025-08-08T05:36:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=343dc5423bfe876c12bb80c56f5e44286e442a07'/>
<id>urn:sha1:343dc5423bfe876c12bb80c56f5e44286e442a07</id>
<content type='text'>
Device-mapper can call add_disk() multiple times for the same gendisk
due to its two-phase creation process (dm create + dm load). This leads
to kobject double initialization errors when the underlying iSCSI devices
become temporarily unavailable and then reappear.

However, if the first add_disk() call fails and is retried, the queue_kobj
gets initialized twice, causing:

kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
something is seriously wrong.
 Call Trace:
  &lt;TASK&gt;
  dump_stack_lvl+0x5b/0x80
  kobject_init.cold+0x43/0x51
  blk_register_queue+0x46/0x280
  add_disk_fwnode+0xb5/0x280
  dm_setup_md_queue+0x194/0x1c0
  table_load+0x297/0x2d0
  ctl_ioctl+0x2a2/0x480
  dm_ctl_ioctl+0xe/0x20
  __x64_sys_ioctl+0xc7/0x110
  do_syscall_64+0x72/0x390
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration:
 - Initialize queue_kobj early during gendisk allocation
 - add_disk() only adds the already-initialized kobject to sysfs
 - del_gendisk() removes from sysfs but doesn't destroy the kobject
 - Final cleanup happens when the disk is released

Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Reported-by: Li Lingfeng &lt;lilingfeng3@huawei.com&gt;
Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
Signed-off-by: Zheng Qixing &lt;zhengqixing@huawei.com&gt;
Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Nilay Shroff &lt;nilay@linux.ibm.com&gt;
Link: https://lore.kernel.org/r/20250808053609.3237836-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
