kernel/linux.git/drivers/block, branch v6.18.34

ublk: reject max_sectors smaller than PAGE_SECTORS in parameter validation

2026-06-01T15:50:55+00:00

[ Upstream commit 1860c2f85922917d8a46f16a6f4bd2298ffa0fb5 ] blk_validate_limits() requires max_hw_sectors >= PAGE_SECTORS and fires a WARN_ON_ONCE if this invariant is violated. ublk_validate_params() only checked the upper bound of max_sectors against max_io_buf_bytes, allowing userspace to pass small values (including zero) that trigger the warning when blk_mq_alloc_disk() is called from ublk_ctrl_start_dev(). Before 494ea040bcb5, ublk used blk_queue_max_hw_sectors() which silently clamped small values up to PAGE_SECTORS. The conversion to passing queue_limits directly to blk_mq_alloc_disk() lost that clamping and now hits blk_validate_limits()'s WARN_ON_ONCE instead. Validate that max_sectors is at least PAGE_SECTORS in ublk_validate_params() so invalid values are rejected early with -EINVAL instead of reaching the block layer. Fixes: 494ea040bcb5 ("ublk: pass queue_limits to blk_mq_alloc_disk") Signed-off-by: Ming Lei Link: https://patch.msgid.link/20260510144843.769031-1-tom.leiming@gmail.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

rbd: eliminate a race in lock_dwork draining on unmap

2026-06-01T15:50:44+00:00

commit 9fc75b71fdd38465c76c6f6a884cdd4ae3c72d90 upstream. Given how rbd_lock_add_request() and rbd_img_exclusive_lock() are written, lock_dwork may be (re)queued more than it's actually needed: for example in case a new I/O request comes in while we are in the middle of rbd_acquire_lock() on behalf of another I/O request. This is expected and with rbd_release_lock() preemptively canceling lock_dwork is benign under normal operation. A more problematic example is maybe_kick_acquire(): if (have_requests || delayed_work_pending(&rbd_dev->lock_dwork)) { dout("%s rbd_dev %p kicking lock_dwork\n", __func__, rbd_dev); mod_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); } It's not unrealistic for lock_dwork to get canceled right after delayed_work_pending() returns true and for mod_delayed_work() to requeue it right there anyway. This is a classic TOCTOU race. When it comes to unmapping the image, there is an implicit assumption of no self-initiated exclusive lock activity past the point of return from rbd_dev_image_unlock() which unlocks the lock if it happens to be held. This unlock is assumed to be final and lock_dwork (as well as all other exclusive lock tasks, really) isn't expected to get queued again. However, lock_dwork is canceled only in cancel_tasks_sync() (i.e. later in the unmap sequence) and on top of that the cancellation can get in effect nullified by maybe_kick_acquire(). This may result in rbd_acquire_lock() executing after rbd_dev_device_release() and rbd_dev_image_release() run and free and/or reset a bunch of things. One of the possible failure modes then is a violated rbd_assert(rbd_image_format_valid(rbd_dev->image_format)); in rbd_dev_header_info() which is called via rbd_dev_refresh() from rbd_post_acquire_action(). Redo exclusive lock task draining to provide saner semantics and try to meet the assumptions around rbd_dev_image_unlock(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov Reviewed-by: Viacheslav Dubeyko Signed-off-by: Greg Kroah-Hartman

drbd: Balance RCU calls in drbd_adm_dump_devices()

2026-05-23T11:06:23+00:00

[ Upstream commit 2b31e86387e60b3689339f0f0fbb4d3623d9d494 ] Make drbd_adm_dump_devices() call rcu_read_lock() before rcu_read_unlock() is called. This has been detected by the Clang thread-safety analyzer. Tested-by: Christoph Böhmwalder Reviewed-by: Christoph Hellwig Cc: Andreas Gruenbacher Fixes: a55bbd375d18 ("drbd: Backport the "status" command") Signed-off-by: Bart Van Assche Link: https://patch.msgid.link/20260326214054.284593-1-bvanassche@acm.org Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

rbd: fix null-ptr-deref when device_add_disk() fails

2026-05-07T04:11:41+00:00

commit d1fef92e414433ca7b89abf85cb0df42b8d475eb upstream. do_rbd_add() publishes the device with device_add() before calling device_add_disk(). If device_add_disk() fails after device_add() succeeds, the error path calls rbd_free_disk() directly and then later falls through to rbd_dev_device_release(), which calls rbd_free_disk() again. This double teardown can leave blk-mq cleanup operating on invalid state and trigger a null-ptr-deref in __blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set(). Fix this by following the normal remove ordering: call device_del() before rbd_dev_device_release() when device_add_disk() fails after device_add(). That keeps the teardown sequence consistent and avoids re-entering disk cleanup through the wrong path. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64 guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer confines failslab injections to the __add_disk() range and injects fail-nth while mapping an RBD image through /sys/bus/rbd/add_single_major. On the unpatched kernel, fail-nth=4 reliably triggered the fault: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 UID: 0 PID: 273 Comm: bash Not tainted 7.0.0-01247-gd60bc1401583 #6 PREEMPT(lazy) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__blk_mq_free_map_and_rqs+0x8c/0x240 Code: 00 00 48 8b 6b 60 41 89 f4 49 c1 e4 03 4c 01 e5 45 85 ed 0f 85 0a 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 e9 48 c1 e9 03 <80> 3c 01 00 0f 85 31 01 00 00 4c 8b 6d 00 4d 85 ed 0f 84 e2 00 00 RSP: 0018:ff1100000ab0fac8 EFLAGS: 00000246 RAX: dffffc0000000000 RBX: ff1100000c4806a0 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ff1100000c4806f4 RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c000189001b R10: ff1100000c4800df R11: ff1100006cf37be0 R12: 0000000000000000 R13: 0000000000000000 R14: ff1100000c480700 R15: ff1100000c480004 FS: 00007f0fbe8fe740(0000) GS:ff110000e5851000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe53473b2e0 CR3: 0000000012eef000 CR4: 00000000007516f0 PKRU: 55555554 Call Trace: blk_mq_free_tag_set+0x77/0x460 do_rbd_add+0x1446/0x2b80 ? __pfx_do_rbd_add+0x10/0x10 ? lock_acquire+0x18c/0x300 ? find_held_lock+0x2b/0x80 ? sysfs_file_kobj+0xb6/0x1b0 ? __pfx_sysfs_kf_write+0x10/0x10 kernfs_fop_write_iter+0x2f4/0x4a0 vfs_write+0x98e/0x1000 ? expand_files+0x51f/0x850 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf2/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x690 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0fbea15907 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffe22346ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000058 RCX: 00007f0fbea15907 RDX: 0000000000000058 RSI: 0000563ace6c0ef0 RDI: 0000000000000001 RBP: 0000563ace6c0ef0 R08: 0000563ace6c0ef0 R09: 6b6435726d694141 R10: 5250337279762f78 R11: 0000000000000246 R12: 0000000000000058 R13: 00007f0fbeb1c780 R14: ff1100000c480700 R15: ff1100000c480004 With this fix applied, rerunning the reproducer over fail-nth=1..256 yields no KASAN reports. [ idryomov: rename err_out_device_del -> err_out_device ] Cc: stable@vger.kernel.org Fixes: 27c97abc30e2 ("rbd: add add_disk() error handling") Signed-off-by: Zilin Guan Signed-off-by: Dawei Feng Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

zram: do not forget to endio for partial discard requests

2026-05-07T04:11:34+00:00

commit e3668b371329ea036ff022ce8ecc82f8befcf003 upstream. As reported by Qu Wenruo and Avinesh Kumar, the following getconf PAGESIZE 65536 blkdiscard -p 4k /dev/zram0 takes literally forever to complete. zram doesn't support partial discards and just returns immediately w/o doing any discard work in such cases. The problem is that we forget to endio on our way out, so blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to end_bio label, which does bio_endio(). Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org Fixes: 0120dd6e4e20 ("zram: make zram_bio_discard more self-contained") Signed-off-by: Sergey Senozhatsky Reported-by: Qu Wenruo Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com Tested-by: Qu Wenruo Reported-by: Avinesh Kumar Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530 Reviewed-by: Christoph Hellwig Cc: Brian Geffon Cc: Jens Axboe Cc: Minchan Kim Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

ublk: fix NULL pointer dereference in ublk_ctrl_set_size()

2026-03-25T10:10:32+00:00

[ Upstream commit 25966fc097691e5c925ad080f64a2f19c5fd940a ] ublk_ctrl_set_size() unconditionally dereferences ub->ub_disk via set_capacity_and_notify() without checking if it is NULL. ub->ub_disk is NULL before UBLK_CMD_START_DEV completes (it is only assigned in ublk_ctrl_start_dev()) and after UBLK_CMD_STOP_DEV runs (ublk_detach_disk() sets it to NULL). Since the UBLK_CMD_UPDATE_SIZE handler performs no state validation, a user can trigger a NULL pointer dereference by sending UPDATE_SIZE to a device that has been added but not yet started, or one that has been stopped. Fix this by checking ub->ub_disk under ub->mutex before dereferencing it, and returning -ENODEV if the disk is not available. Fixes: 98b995660bff ("ublk: Add UBLK_U_CMD_UPDATE_SIZE") Cc: stable@vger.kernel.org Signed-off-by: Mehul Rao Reviewed-by: Ming Lei Signed-off-by: Jens Axboe [ adapted `&header` to `header` ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

drbd: fix null-pointer dereference on local read error

2026-03-12T11:09:42+00:00

commit 0d195d3b205ca90db30d70d09d7bb6909aac178f upstream. In drbd_request_endio(), READ_COMPLETED_WITH_ERROR is passed to __req_mod() with a NULL peer_device: __req_mod(req, what, NULL, &m); The READ_COMPLETED_WITH_ERROR handler then unconditionally passes this NULL peer_device to drbd_set_out_of_sync(), which dereferences it, causing a null-pointer dereference. Fix this by obtaining the peer_device via first_peer_device(device), matching how drbd_req_destroy() handles the same situation. Cc: stable@vger.kernel.org Reported-by: Tuo Li Link: https://lore.kernel.org/linux-block/20260104165355.151864-1-islituo@gmail.com Signed-off-by: Christoph Böhmwalder Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

drbd: fix "LOGIC BUG" in drbd_al_begin_io_nonblock()

2026-03-12T11:09:42+00:00

commit ab140365fb62c0bdab22b2f516aff563b2559e3b upstream. Even though we check that we "should" be able to do lc_get_cumulative() while holding the device->al_lock spinlock, it may still fail, if some other code path decided to do lc_try_lock() with bad timing. If that happened, we logged "LOGIC BUG for enr=...", but still did not return an error. The rest of the code now assumed that this request has references for the relevant activity log extents. The implcations are that during an active resync, mutual exclusivity of resync versus application IO is not guaranteed. And a potential crash at this point may not realizs that these extents could have been target of in-flight IO and would need to be resynced just in case. Also, once the request completes, it will give up activity log references it does not even hold, which will trigger a BUG_ON(refcnt == 0) in lc_put(). Fix: Do not crash the kernel for a condition that is harmless during normal operation: also catch "e->refcnt == 0", not only "e == NULL" when being noisy about "al_complete_io() called on inactive extent %u\n". And do not try to be smart and "guess" whether something will work, then be surprised when it does not. Deal with the fact that it may or may not work. If it does not, remember a possible "partially in activity log" state (only possible for requests that cross extent boundaries), and return an error code from drbd_al_begin_io_nonblock(). A latter call for the same request will then resume from where we left off. Cc: stable@vger.kernel.org Signed-off-by: Lars Ellenberg Signed-off-by: Christoph Böhmwalder Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

zloop: check for spurious options passed to remove

2026-03-12T11:09:16+00:00

[ Upstream commit 3c4617117a2b7682cf037be5e5533e379707f050 ] Zloop uses a command option parser for all control commands, but most options are only valid for adding a new device. Check for incorrectly specified options in the remove handler. Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

zloop: advertise a volatile write cache

2026-03-12T11:09:16+00:00

[ Upstream commit 6acf7860dcc79ed045cc9e6a79c8a8bb6959dba7 ] Zloop is file system backed and thus needs to sync the underlying file system to persist data. Set BLK_FEAT_WRITE_CACHE so that the block layer actually send flush commands, and fix the flush implementation as sync_filesystem requires s_umount to be held and the code currently misses that. Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin