kernel/linux.git/block, branch v6.18.34

cgroup/rstat: validate cpu before css_rstat_cpu() access

2026-06-01T15:51:03+00:00

[ Upstream commit 8817005efbdfdf5d4e4814cb5dc52b53d12917d7 ] css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming Signed-off-by: Tejun Heo Signed-off-by: Sasha Levin

block: bio-integrity: Fix null-ptr-deref in bio_integrity_map_user()

2026-06-01T15:50:59+00:00

[ Upstream commit 8582792cf23b3d94674d4d838f7cde9a28d0fcaf ] pin_user_pages_fast() can partially succeed and return the number of pages that were actually pinned. However, the bio_integrity_map_user() does not handle this partial pinning. This leads to a general protection fault since bvec_from_pages() dereferences an unpinned page address, which is 0. To fix this, add a check to verify that all requested memory is pinned. If partial pinning occurs, unpin the memory and return -EFAULT. Kernel Oops: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 UID: 0 PID: 1061 Comm: nvme-passthroug Not tainted 7.0.0-11783-g90957f9314e8-dirty #16 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 RIP: 0010:bio_integrity_map_user.cold+0x1b0/0x9d6 Fixes: 492c5d455969 ("block: bio-integrity: directly map user buffers") Acked-by: Chao Shi Acked-by: Weidong Zhu Acked-by: Dave Tian Signed-off-by: Sungwoo Kim Tested-by: Shin'ichiro Kawasaki Link: https://github.com/linux-blktests/blktests/pull/244 Link: https://patch.msgid.link/20260512050929.541397-2-iam@sung-woo.kim Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: recompute nr_integrity_segments in blk_insert_cloned_request

2026-06-01T15:50:58+00:00

[ Upstream commit 2c6e6a18a37b905cb584eb0dda3ae482162a81ca ] blk_insert_cloned_request() already recomputes nr_phys_segments against the bottom queue, because "the queue settings related to segment counting may differ from the original queue." The exact same reasoning applies to integrity segments: a stacked driver's underlying queue can have tighter virt_boundary_mask, seg_boundary_mask, or max_segment_size than the top queue, in which case blk_rq_count_integrity_sg() against the bottom queue produces a different count than the cached rq->nr_integrity_segments inherited from the source request by blk_rq_prep_clone(). When the cached count is lower than the bottom queue's actual count, blk_rq_map_integrity_sg() trips BUG_ON(segments > rq->nr_integrity_segments); on dispatch. The same families of stacked setups that motivated the existing nr_phys_segments recompute -- dm-multipath fanning out to nvme-rdma in particular -- can produce this. Mirror the nr_phys_segments handling: when the request carries integrity, recompute nr_integrity_segments against the bottom queue and reject the request if it exceeds the bottom queue's max_integrity_segments. blk_rq_count_integrity_sg() and queue_max_integrity_segments() are both already available via , which blk-mq.c includes. This closes a latent gap in the stacking contract and brings the integrity-segment accounting in line with the existing phys-segment accounting. Fixes: 76c313f658d2 ("blk-integrity: improved sg segment mapping") Signed-off-by: Casey Chen Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260511212230.27511-1-cachen@purestorage.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: don't overwrite bip_vcnt in bio_integrity_copy_user()

2026-06-01T15:50:58+00:00

[ Upstream commit 637ad3a56a3b889527d1dacea6fea2a8bd648140 ] bio_integrity_add_page() already sets bip_vcnt to 1 for the bounce segment. Overwriting it with nr_vecs breaks bip_vcnt <= bip_max_vcnt on WRITE (bip_max_vcnt is 1), so the gap-merge checks in block/blk.h read past the bip_vec[] flex array. On READ the read is in bounds but lands on a saved user bvec instead of the bounce. The line was added for split propagation, but bio_integrity_clone() doesn't copy bip_vcnt and BIP_CLONE_FLAGS excludes BIP_COPY_USER. Fixes: 3991657ae707 ("block: set bip_vcnt correctly") Signed-off-by: David Carlier Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260511215151.346228-1-devnexen@gmail.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()

2026-05-23T11:06:23+00:00

[ Upstream commit 23308af722fefed00af5f238024c11710938fba3 ] Add the missing put_disk() on the error path in blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or blkg_tryget() fails, the function jumps to the out label which only calls rcu_read_unlock() but does not release the disk reference acquired by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk is already set to NULL before the lookup, blkcg_exit() cannot release this reference either, causing the disk to never be freed. Restore the reference release that was present as blk_put_queue() in the original code but was inadvertently dropped during the conversion from request_queue to gendisk. Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct") Signed-off-by: Jackie Liu Acked-by: Tejun Heo Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

loop: fix partition scan race between udev and loop_reread_partitions()

2026-05-23T11:06:23+00:00

[ Upstream commit 267ec4d7223a783f029a980f41b93c39b17996da ] When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following sequence occurs: 1. disk_force_media_change() sets GD_NEED_PART_SCAN 2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent 3. loop_global_unlock() releases the lock 4. loop_reread_partitions() calls bdev_disk_changed() to scan There is a race between steps 2 and 4: when udev receives the uevent and opens the device before loop_reread_partitions() runs, blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls bdev_disk_changed() for a first scan. Then loop_reread_partitions() does a second scan. The open_mutex serializes these two scans, but does not prevent both from running. The second scan in bdev_disk_changed() drops all partition devices from the first scan (via blk_drop_partitions()) before re-adding them, causing partition block devices to briefly disappear. This breaks any systemd unit with BindsTo= on the partition device: systemd observes the device going dead, fails the dependent units, and does not retry them when the device reappears. Fix this by removing the GD_NEED_PART_SCAN set from disk_force_media_change() entirely. None of the current callers need the lazy on-open partition scan triggered by this flag: - floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always false and GD_NEED_PART_SCAN has no effect. - loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is set, loop_reread_partitions() performs an explicit scan. When not set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path. - loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if LO_FLAGS_PARTSCAN is set. - nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere. With GD_NEED_PART_SCAN no longer set by disk_force_media_change(), udev opening the loop device after the uevent no longer triggers a redundant scan in blkdev_get_whole(), and only the single explicit scan from loop_reread_partitions() runs. A regression test for this bug has been submitted to blktests: https://github.com/linux-blktests/blktests/pull/240. Fixes: 9f65c489b68d ("loop: raise media_change event") Signed-off-by: Daan De Meyer Acked-by: Christian Brauner Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: wait for blkcg cleanup before initializing new disk

2026-05-23T11:06:22+00:00

[ Upstream commit 3dbaacf6ab68f81e3375fe769a2ecdbd3ce386fd ] When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the previous disk's blkcg state is cleaned up asynchronously via disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk() runs before that cleanup finishes, we may overwrite q->root_blkg while the old one is still alive, and radix_tree_insert() in blkg_create() fails with -EEXIST because the old blkg entries still occupy the same queue id slot in blkcg->blkg_tree. This causes the sd probe to fail with -ENOMEM. Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL, which indicates the previous disk's blkcg cleanup has completed. Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") Cc: Yi Zhang Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: fix zone write plug removal

2026-05-17T15:15:35+00:00

[ Upstream commit b7d4ffb510373cc6ecf16022dd0e510a023034fb ] Commit 7b295187287e ("block: Do not remove zone write plugs still in use") modified disk_should_remove_zone_wplug() to add a check on the reference count of a zone write plug to prevent removing zone write plugs from a disk hash table when the plugs are still being referenced by BIOs or requests in-flight. However, this check does not take into account that a BIO completion may happen right after its submission by a zone write plug BIO work, and before the zone write plug BIO work releases the zone write plug reference count. This situation leads to disk_should_remove_zone_wplug() returning false as in this case the zone write plug reference count is at least equal to 3. If the BIO that completes in such manner transitioned the zone to the FULL condition, the zone write plug for the FULL zone will remain in the disk hash table. Furthermore, relying on a particular value of a zone write plug reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile as reading the atomic reference count and doing a comparison with some value is not overall atomic at all. Address these issues by reworking the reference counting of zone write plugs so that removing plugs from a disk hash table can be done directly from disk_put_zone_wplug() when the last reference on a plug is dropped. To do so, replace the function disk_remove_zone_wplug() with disk_mark_zone_wplug_dead(). This new function sets the zone write plug flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and drops the initial reference on the zone write plug taken when the plug was added to the disk hash table. This function is called either for zones that are empty or full, or directly in the case of a forced plug removal (e.g. when the disk hash table is being destroyed on disk removal). With this change, disk_should_remove_zone_wplug() is also removed. disk_put_zone_wplug() is modified to call the function disk_free_zone_wplug() to remove a zone write plug from a disk hash table and free the plug structure (with a call_rcu()), when the last reference on a zone write plug is dropped. disk_free_zone_wplug() always checks that the BLK_ZONE_WPLUG_DEAD flag is set. In order to avoid having multiple zone write plugs for the same zone in the disk hash table, disk_get_and_lock_zone_wplug() checked for the BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for the new BLK_ZONE_WPLUG_DEAD flag is added to blk_zone_wplug_handle_write(). With this change, we continue preventing adding multiple zone write plugs for the same zone and at the same time re-inforce checks on the user behavior by failing new incoming write BIOs targeting a zone that is marked as dead. This case can happen only if the user erroneously issues write BIOs to zones that are full, or to zones that are currently being reset or finished. Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Signed-off-by: Jens Axboe [ dropped upstream blk_zone_set_cond() call and disk_zone_wplug_update_cond() context line ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

block: only read from sqe on initial invocation of blkdev_uring_cmd()

2026-05-14T13:30:17+00:00

commit 212ec34e4e726e8cd4af7bea4740db24de8a9dab upstream. This passthrough helper currently only supports discards. Part of that command is the start and length, which is read from the SQE. It does so on every invocation, where it really should just make it stable on the first invocation. This avoids needing to copy the SQE upfront, as we only really need those two 8b values stored in our per-req payload. Cc: stable@vger.kernel.org # 6.17+ Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: add pgmap check to biovec_phys_mergeable

2026-05-14T13:30:17+00:00

commit 13920e4b7b784b40cf4519ff1f0f3e513476a499 upstream. biovec_phys_mergeable() is used by the request merge, DMA mapping, and integrity merge paths to decide if two physically contiguous bvec segments can be coalesced into one. It currently has no check for whether the segments belong to different dev_pagemaps. When zone device memory is registered in multiple chunks, each chunk gets its own dev_pagemap. A single bio can legitimately contain bvecs from different pgmaps -- iov_iter_extract_bvecs() breaks at pgmap boundaries but the outer loop in bio_iov_iter_get_pages() continues filling the same bio. If such bvecs are physically contiguous, biovec_phys_mergeable() will coalesce them, making it impossible to recover the correct pgmap for the merged segment via page_pgmap(). Add a zone_device_pages_have_same_pgmap() check to prevent merging bvec segments that span different pgmaps. Fixes: 49580e690755 ("block: add check when merging zone device pages") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Naman Jain Link: https://patch.msgid.link/20260410153414.4159050-2-namjain@linux.microsoft.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman