kernel/linux.git/drivers/block/rbd.c, branch v6.12.80

rbd: check for EOD after exclusive lock is ensured to be held

2026-02-11T12:40:16+00:00

commit bd3884a204c3b507e6baa9a4091aa927f9af5404 upstream. Similar to commit 870611e4877e ("rbd: get snapshot context after exclusive lock is ensured to be held"), move the "beyond EOD" check into the image request state machine so that it's performed after exclusive lock is ensured to be held. This avoids various race conditions which can arise when the image is shrunk under I/O (in practice, mostly readahead). In one such scenario rbd_assert(objno < rbd_dev->object_map_size); can be triggered if a close-to-EOD read gets queued right before the shrink is initiated and the EOD check is performed against an outdated mapping_size. After the resize is done on the server side and exclusive lock is (re)acquired bringing along the new (now shrunk) object map, the read starts going through the state machine and rbd_obj_may_exist() gets invoked on an object that is out of bounds of rbd_dev->object_map array. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov Reviewed-by: Dongsheng Yang Signed-off-by: Greg Kroah-Hartman

Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client

2024-07-26T17:34:42+00:00

Pull ceph updates from Ilya Dryomov: "A small patchset to address bogus I/O errors and ultimately an assertion failure in the face of watch errors with -o exclusive mappings in RBD marked for stable and some assorted CephFS fixes" * tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client: rbd: don't assume rbd_is_lock_owner() for exclusive mappings rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait ceph: fix incorrect kmalloc size of pagevec mempool ceph: periodically flush the cap releases ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch() ceph: use cap_wait_list only if debugfs is enabled

rbd: don't assume rbd_is_lock_owner() for exclusive mappings

2024-07-25T10:18:29+00:00

Expanding on the previous commit, assuming that rbd_is_lock_owner() always returns true (i.e. that we are either in RBD_LOCK_STATE_LOCKED or RBD_LOCK_STATE_QUIESCING) if the mapping is exclusive is wrong too. In case ceph_cls_set_cookie() fails, the lock would be temporarily released even if the mapping is exclusive, meaning that we can end up even in RBD_LOCK_STATE_UNLOCKED. IOW, exclusive mappings are really "just" about disabling automatic lock transitions (as documented in the man page), not about grabbing the lock and holding on to it whatever it takes. Cc: stable@vger.kernel.org Fixes: 637cd060537d ("rbd: new exclusive lock wait/wake code") Signed-off-by: Ilya Dryomov Reviewed-by: Dongsheng Yang

rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings

2024-07-25T10:18:28+00:00

Every time a watch is reestablished after getting lost, we need to update the cookie which involves quiescing exclusive lock. For this, we transition from RBD_LOCK_STATE_LOCKED to RBD_LOCK_STATE_QUIESCING roughly for the duration of rbd_reacquire_lock() call. If the mapping is exclusive and I/O happens to arrive in this time window, it's failed with EROFS (later translated to EIO) based on the wrong assumption in rbd_img_exclusive_lock() -- "lock got released?" check there stopped making sense with commit a2b1da09793d ("rbd: lock should be quiesced on reacquire"). To make it worse, any such I/O is added to the acquiring list before EROFS is returned and this sets up for violating rbd_lock_del_request() precondition that the request is either on the running list or not on any list at all -- see commit ded080c86b3f ("rbd: don't move requests to the running list on errors"). rbd_lock_del_request() ends up processing these requests as if they were on the running list which screws up quiescing_wait completion counter and ultimately leads to rbd_assert(!completion_done(&rbd_dev->quiescing_wait)); being triggered on the next watch error. Cc: stable@vger.kernel.org # 06ef84c4e9c4: rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait Cc: stable@vger.kernel.org Fixes: 637cd060537d ("rbd: new exclusive lock wait/wake code") Signed-off-by: Ilya Dryomov Reviewed-by: Dongsheng Yang

rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait

2024-07-25T10:18:01+00:00

... to RBD_LOCK_STATE_QUIESCING and quiescing_wait to recognize that this state and the associated completion are backing rbd_quiesce_lock(), which isn't specific to releasing the lock. While exclusive lock does get quiesced before it's released, it also gets quiesced before an attempt to update the cookie is made and there the lock is not released as long as ceph_cls_set_cookie() succeeds. Signed-off-by: Ilya Dryomov Reviewed-by: Dongsheng Yang

block: move the stable_writes flag to queue_limits

2024-06-19T13:58:28+00:00

Move the stable_writes flag into the queue_limits feature field so that it can be set atomically with the queue frozen. The flag is now inherited by blk_stack_limits, which greatly simplifies the code in dm, and fixed md which previously did not pass on the flag set on lower devices. Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Reviewed-by: Hannes Reinecke Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de Signed-off-by: Jens Axboe

block: move the nonrot flag to queue_limits

2024-06-19T13:58:28+00:00

Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig Reviewed-by: Damien Le Moal Reviewed-by: Hannes Reinecke Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe

block: take io_opt and io_min into account for max_sectors

2024-06-14T16:19:44+00:00

The soft max_sectors limit is normally capped by the hardware limits and an arbitrary upper limit enforced by the kernel, but can be modified by the user. A few drivers want to increase this limit (nbd, rbd) or adjust it up or down based on hardware capabilities (sd). Change blk_validate_limits to default max_sectors to the optimal I/O size, or upgrade it to the preferred minimal I/O size if that is larger than the kernel default if no optimal I/O size is provided based on the logic in the SD driver. This keeps the existing kernel default for drivers that do not provide an io_opt or very big io_min value, but picks a much more useful default for those who provide these hints, and allows to remove the hacks to set the user max_sectors limit in nbd, rbd and sd. Signed-off-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Damien Le Moal Acked-by: Ilya Dryomov Reviewed-by: Martin K. Petersen Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de Signed-off-by: Jens Axboe

rbd: increase io_opt again

2024-06-14T16:19:44+00:00

Commit 16d80c54ad42 ("rbd: set io_min, io_opt and discard_granularity to alloc_size") lowered the io_opt size for rbd from objset_bytes which is 4MB for typical setup to alloc_size which is typically 64KB. The commit mostly talks about discard behavior and does mention io_min in passing. Reducing io_opt means reducing the readahead size, which seems counter-intuitive given that rbd currently abuses the user max_sectors setting to actually increase the I/O size. Switch back to the old setting to allow larger reads (the readahead size despite it's name actually limits the size of any buffered read) and to prepare for using io_opt in the max_sectors calculation and getting drivers out of the business of overriding the max_user_sectors value. Signed-off-by: Christoph Hellwig Acked-by: Ilya Dryomov Reviewed-by: Martin K. Petersen Link: https://lore.kernel.org/r/20240531074837.1648501-4-hch@lst.de Signed-off-by: Jens Axboe

rbd: pass queue_limits to blk_mq_alloc_disk

2024-02-19T23:59:31+00:00

Pass the limits rbd imposes directly to blk_mq_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig Link: https://lore.kernel.org/r/20240215070300.2200308-8-hch@lst.de Signed-off-by: Jens Axboe