summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
6 daysmd/md-bitmap: add a none backend for bitmap growYu Kuai3-16/+137
[ Upstream commit f2926a533d03fe70d753b512b713e06a2aa174af ] Add a real none bitmap backend that exposes the common bitmap sysfs group and use it to keep bitmap/location available when an array has no bitmap. Then switch the bitmap location sysfs path to move only between none and the classic bitmap backend, using the no-sysfs bitmap helpers while merging or unmerging the internal bitmap sysfs group. This restores mdadm --grow bitmap addition through bitmap/location. Fixes: fb8cc3b0d9db ("md/md-bitmap: delay registration of bitmap_ops until creating bitmap") Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd/md-bitmap: split bitmap sysfs groupsYu Kuai4-13/+40
[ Upstream commit aba3d6d6cb55c6e1116d1215140559dd7ecdf9a9 ] Split the classic bitmap sysfs files into a common bitmap group with the location attribute and a separate internal bitmap group for the remaining files. At the same time, convert bitmap operations from a single sysfs group to a sysfs group array so backends can share part of their sysfs layout while adding backend-specific attributes separately. Switch the bitmap sysfs helpers to use sysfs_update_groups() for the add and update path, and remove groups in reverse order so shared named groups are unmerged before the last group removes the directory. Also make bitmap operation lookup depend only on the currently selected bitmap id matching the installed backend. This prepares the lookup path for a later registered none backend. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow") Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: factor bitmap creation away from sysfs handlingYu Kuai1-29/+49
[ Upstream commit 8776d342cf8fa0b98ca5e6fb2d956966fb5ca364 ] Factor bitmap creation and destruction into helpers that do not touch bitmap sysfs registration. This prepares the bitmap sysfs rework so callers such as the sysfs bitmap location path can create or destroy a bitmap backend without coupling that to sysfs group lifetime management. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow") Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: add fallback to correct bitmap_ops on version mismatchYu Kuai1-1/+110
[ Upstream commit 09af773650024279a60348e7319d599e6571b15c ] If default bitmap version and on-disk version doesn't match, and mdadm is not the latest version to set bitmap_type, set bitmap_ops based on the disk version. Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Stable-dep-of: f2926a533d03 ("md/md-bitmap: add a none backend for bitmap grow") Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd/raid1,raid10: don't fail devices for invalid IO errorsKeith Busch1-1/+6
[ Upstream commit f7b24c7b41f23b5f9caa8b913afe79cd4c397d39 ] BLK_STS_INVAL indicates the IO request itself was invalid, not that the device has failed. When raid1 treats this as a device error, it retries on alternate mirrors which fail the same way, eventually exceeding the read error threshold and removing the device from the array. This happens when stacking configurations bypass bio_split_to_limits() in the IO path: dm-raid calls md_handle_request() directly without going through md_submit_bio(), skipping the alignment validation that would otherwise reject invalid bios early. The invalid bio reaches the lower block layers, which fail the bio with BLK_STS_INVAL, and raid1 wrongly interprets this as a device failure. Add BLK_STS_INVAL to raid1_should_handle_error() so that invalid IO errors are propagated back to the caller rather than triggering device removal. This is consistent with the previous kernel behavior when alignment checks were done earlier in the direct-io path. Fixes: 5ff3f74e145adc7 ("block: simplify direct io validity check") Reported-by: Tomáš Trnka <trnka@scm.com> Closes: https://lore.kernel.org/linux-block/2982107.4sosBPzcNG@electra/ Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Tomáš Trnka <trnka@scm.com> Link: https://lore.kernel.org/r/20260416140345.3872265-1-kbusch@meta.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix missing return in invalidate_committed's error pathMing-Hung Tsai1-1/+3
[ Upstream commit 8c0ee19db81f0fa1ff25fd75b22b17c0cc2acde3 ] In passthrough mode, dm-cache defers write submission until after metadata commit completes via the invalidate_committed() continuation. On commit error, invalidate_committed() calls invalidate_complete() to end the bio and free the migration struct, after which it should return immediately. The patch 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") omitted this early return, causing execution to fall through into the success path on error. This results in use-after-free on the migration struct in the subsequent calls. Fix by adding the missing return after the invalidate_complete() call. Fixes: 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/ Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm init: ensure device probing has finished in dm-mod.waitfor=Guillaume Gonnet1-1/+3
[ Upstream commit 99a2312f69805f4ba92d98a757625e0300a747ab ] The early_lookup_bdev() function returns successfully when the disk device is present but not necessarily its partitions. In this situation, dm_early_create() fails as the partition block device does not exist yet. In my case, this phenomenon occurs quite often because the device is an SD card with slow reading times, on which kernel takes time to enumerate available partitions. Fortunately, the underlying device is back to "probing" state while enumerating partitions. Waiting for all probing to end is enough to fix this issue. That's also the reason why this problem never occurs with rootwait= parameter: the while loop inside wait_for_root() explicitly waits for probing to be done and then the function calls async_synchronize_full(). These lines were omitted in 035641b, even though the commit says it's based on the rootwait logic... Anyway, calling wait_for_device_probe() after our while loop does the job (it both waits for probing and calls async_synchronize_full). Fixes: 035641b01e72 ("dm init: add dm-mod.waitfor to wait for asynchronously probed block devices") Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm log: fix out-of-bounds write due to region_count overflowJunrui Luo1-1/+5
[ Upstream commit c20e36b7631d83e7535877f08af8b0af72c44b1a ] The local variable region_count in create_log_context() is declared as unsigned int (32-bit), but dm_sector_div_up() returns sector_t (64-bit). When a device-mapper target has a sufficiently large ti->len with a small region_size, the division result can exceed UINT_MAX. The truncated value is then used to calculate bitset_size, causing clean_bits, sync_bits, and recovering_bits to be allocated far smaller than needed for the actual number of regions. Subsequent log operations (log_set_bit, log_clear_bit, log_test_bit) use region indices derived from the full untruncated region space, causing out-of-bounds writes to kernel heap memory allocated by vmalloc. This can be reproduced by creating a mirror target whose region_count overflows 32 bits: dmsetup create bigzero --table '0 8589934594 zero' dmsetup create mymirror --table '0 8589934594 mirror \ core 2 2 nosync 2 /dev/mapper/bigzero 0 \ /dev/mapper/bigzero 0' The status output confirms the truncation (sync_count=1 instead of 4294967297, because 0x100000001 was truncated to 1): $ dmsetup status mymirror 0 8589934594 mirror 2 254:1 254:1 1/4294967297 ... This leads to a kernel crash in core_in_sync: BUG: scheduling while atomic: (udev-worker)/9150/0x00000000 RIP: 0010:core_in_sync+0x14/0x30 [dm_log] CR2: 0000000000000008 Fixing recursive fault but reboot is needed! Fix by widening the local region_count to sector_t and adding an explicit overflow check before the value is assigned to lc->region_count. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache metadata: fix memory leak on metadata abort retryMing-Hung Tsai1-5/+8
[ Upstream commit 044ca491d4086dc5bf233e9fcb71db52df32f633 ] When failing to acquire the root_lock in dm_cache_metadata_abort because the block_manager is read-only, the temporary block_manager created outside the root_lock is not properly released, causing a memory leak. Reproduce steps: This can be reproduced by reloading a new table while the metadata is read-only. While the second call to dm_cache_metadata_abort is caused by lack of support for table preload in dm-cache, mentioned in commit 9b1cc9f251af ("dm cache: share cache-metadata object across inactive and active DM tables"), it exposes the memory leak in dm_cache_metadata_abort when the function is called multiple times. Specifically, dm-cache fails to sync the new cache object's mode during preresume, creating the reproducer condition. This issue could also occur through concurrent metadata_operation_failed calls due to races in cache mode updates, but the table preload scenario below provides a reliable reproducer. 1. Create a cache device with some faulty trailing metadata blocks dmsetup create cmeta <<EOF 0 200 linear /dev/sdc 0 200 7992 error EOF dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 131072 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 1 writethrough smq 0" 2. Suspend and resume the cache to start a new metadata transaction and trigger metadata io errors on the next metadata commit. dmsetup suspend cache dmsetup resume cache 3. Write to the cache device to update metadata fio --filename=/dev/mapper/cache --name test --rw=randwrite --bs=4k \ --randrepeat=0 --direct=1 --size 64k 4. Preload the same table dmsetup reload cache --table "$(dmsetup table cache)" 5. Resume the new table. This triggers the memory leak. dmsetup suspend cache dmsetup resume cache kmemleak logs: <snip> unreferenced object 0xffff8880080c2010 (size 16): comm "dmsetup", pid 132, jiffies 4294982580 hex dump (first 16 bytes): 00 38 b9 07 80 88 ff ff 6a 6b 6b 6b 6b 6b 6b a5 ... backtrace (crc 3118f31c): kmemleak_alloc+0x28/0x40 __kmalloc_cache_noprof+0x3d9/0x510 dm_block_manager_create+0x51/0x140 dm_cache_metadata_abort+0x85/0x320 metadata_operation_failed+0x103/0x1e0 cache_preresume+0xacd/0xe70 dm_table_resume_targets+0xd3/0x320 __dm_resume+0x1b/0xf0 dm_resume+0x127/0x170 <snip> Fixes: 352b837a5541 ("dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abort") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm-mpath: don't stop probing paths at presuspendBenjamin Marzinski1-7/+2
[ Upstream commit 51d81e14fe6788dc6463064c7517480f2acd2724 ] Commit 5c977f102315 ("dm-mpath: Don't grab work_mutex while probing paths"), added code to make multipath quit probing paths early, if it was trying to suspend. This isn't necessary. It was just an optimization to try to keep path probing from delaying a suspend. However it causes problems with the intended user of this code, qemu. The path probing code was added because failed ioctls to multipath devices don't cause paths to fail in cases where a regular IO failure would. If an ioctl to a path failed because the path was down, and the multipath device had passed presuspend, the M_MPATH_PROBE_PATHS ioctl would exit early, without probing the path. The caller would then retry the original ioctl, hoping to use a different path. But if there was only one path in the pathgroup, it would pick the same non-working path again, even if there were working paths in other pathgroups. ioctls to a suspended dm device will return -EAGAIN, notifying the caller that the device is suspended, but ioctls to a device that is just preparing to suspend won't (and in general, shouldn't). This means that the caller (qemu in this case) would get into a tight loop where it would issue an ioctl that failed, skip probing the paths because the device had already passed presuspend, and start over issuing the ioctl again. This would continue until the multipath device finally fully suspended, or the caller gave up and failed the ioctl. multipath's path probing code could return -EAGAIN in this case, and the caller could delay a bit before retrying, but the whole purpose of skipping the probe after presuspend was to speed things up, and that would just slow them down. Instead, remove the is_suspending flag, and check dm_suspended() instead to decide whether to exit the probing code early. This means that when the probing code exits early, future ioctls will also be delayed, because the device is fully suspended. Fixes: 5c977f102315 ("dm-mpath: Don't grab work_mutex while probing paths") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix dirty mapping checking in passthrough mode switchingMing-Hung Tsai3-33/+8
[ Upstream commit 322586745bd1a0e5f3559fd1635fdeb4dbd1d6b8 ] As mentioned in commit 9b1cc9f251af ("dm cache: share cache-metadata object across inactive and active DM tables"), dm-cache assumed table reload occurs after suspension, while LVM's table preload breaks this assumption. The dirty mapping check for passthrough mode was designed around this assumption and is performed during table creation, causing the check to fail with preload while metadata updates are ongoing. This risks loading dirty mappings into passthrough mode, resulting in data loss. Reproduce steps: 1. Create a writeback cache with zero migration_threshold to produce dirty mappings dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writeback smq \ 2 migration_threshold 0" 2. Preload a table in passthrough mode dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 passthrough smq 0" 3. Write to the first cache block to make it dirty fio --filename=/dev/mapper/cache --name=populate --rw=write --bs=4k \ --direct=1 --size=64k 4. Resume the inactive table. Now it's possible to load the dirty block into passthrough mode. dmsetup resume cache Fix by moving the checks to the preresume phase to support table preloading. Also remove the unused function dm_cache_metadata_all_clean. Fixes: 2ee57d587357 ("dm cache: add passthrough mode") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix concurrent write failure in passthrough modeMing-Hung Tsai1-0/+9
[ Upstream commit e4f66341779d0cf4c83c74793753a84094286d9e ] When bio prison cell lock acquisition fails due to concurrent writes to the same block in passthrough mode, dm-cache incorrectly returns an I/O error instead of properly handling the concurrency. This can occur in both process and workqueue contexts when invalidate_lock() is called for exclusive access to a data block. Fix this by deferring the write bios to ensure proper block device behavior. Reproduce steps: 1. Create a cache device dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. Promote the first data block into cache fio --filename=/dev/mapper/cache --name=populate --rw=write --bs=4k \ --direct=1 --size=64k 3. Reload the cache into passthrough mode dmsetup suspend cache dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 passthrough smq 0" dmsetup resume cache 4. Write to the first cached block concurrently. Sometimes one of the processes will receive I/O errors. fio --filename=/dev/mapper/cache --name test --rw=randwrite --bs=4k \ --randrepeat=0 --direct=1 --numjobs=2 --size 64k <snip> fio-3.41 fio: io_u error on file /dev/mapper/cache: Input/output error: write offset=4096, buflen=4096 fio: pid=106, err=5/file:io_u.c:2008, func=io_u error, error=Input/output error test: (groupid=0, jobs=1): err= 0: pid=105 test: (groupid=0, jobs=1): err= 5 (file:io_u.c:2008, func=io_u error, error=Input/output error): pid=106 <snip> Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache policy smq: fix missing locks in invalidating cache blocksMing-Hung Tsai1-0/+4
[ Upstream commit 2d1f7b65f5deedd2e6b09fdc6ea27f8375f24b45 ] In passthrough mode, the policy invalidate_mapping operation is called simultaneously from multiple workers, thus it should be protected by a lock. Otherwise, we might end up with data races on the allocated blocks counter, or even use-after-free issues with internal data structures when doing concurrent writes. Note that the existing FIXME in smq_invalidate_mapping() doesn't affect passthrough mode since migration tasks don't exist there, but would need attention if supporting fast device shrinking via suspend/resume without target reloading. Reproduce steps: 1. Create a cache device consisting of 1024 cache entries dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. Populate the cache, and record the number of cached blocks fio --name=populate --filename=/dev/mapper/cache --rw=randwrite --bs=4k \ --size=64m --direct=1 nr_cached=$(dmsetup status cache | awk '{split($7, a, "/"); print a[1]}') 3. Reload the cache into passthrough mode dmsetup suspend cache dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 passthrough smq 0" dmsetup resume cache 4. Write to the passthrough cache. By setting multiple jobs with I/O size equal to the cache block size, cache blocks are invalidated concurrently from different workers. fio --filename=/dev/mapper/cache --name=test --rw=randwrite --bs=64k \ --direct=1 --numjobs=2 --randrepeat=0 --size=64m 5. Check if demoted matches cached block count. These numbers should match but may differ due to the data race. nr_demoted=$(dmsetup status cache | awk '{print $12}') echo "$nr_cached, $nr_demoted" Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix write hang in passthrough modeMing-Hung Tsai1-5/+25
[ Upstream commit 4ca8b8bd952df7c3ccdc68af9bd3419d0839a04b ] The invalidate_remove() function has incomplete logic for handling write hit bios after cache invalidation. It sets up the remapping for the overwrite_bio but then drops it immediately without submission, causing write operations to hang. Fix by adding a new invalidate_committed() continuation that submits the remapped writes to the cache origin after metadata commit completes, while using the overwrite_endio hook to ensure proper completion sequencing. This maintains existing coherency. Also improve error handling in invalidate_complete() to preserve the original error status instead of using bio_io_error() unconditionally. Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix write path cache coherency in passthrough modeMing-Hung Tsai1-0/+1
[ Upstream commit 0c5eef0aad508231d8e43ff8392692925e131b68 ] In passthrough mode, dm-cache defers write bio submission until cache invalidation completes to maintain existing coherency, requiring the target map function to return DM_MAPIO_SUBMITTED. The current map_bio() returns DM_MAPIO_REMAPPED, violating the required ordering constraint. Reproduce steps: 1. Create a cache device dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. Promote the first data block into the cache fio --filename=/dev/mapper/cache --name=populate --rw=write --bs=4k \ --direct=1 --size=64k 3. Reload the cache into passthrough mode dmsetup suspend cache dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 passthrough smq 0" dmsetup resume cache 4. Write to the first data block, and check io ordering using ftrace echo 1 > /sys/kernel/debug/tracing/events/block/block_bio_queue/enable echo 1 > /sys/kernel/debug/tracing/events/block/block_bio_complete/enable echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable fio --filename=/dev/mapper/cache --name=test --rw=write --bs=64k \ --direct=1 --size 64k 5. ftrace logs show that write operations to the cache origin (252:2) and metadata operations (252:0) are unsynchronized: the origin write occurs before metadata commit. <snip> fio-146 [000] ..... 420.139562: block_bio_queue: 252,3 WS 0 + 128 [fio] fio-146 [000] ..... 420.149395: block_bio_queue: 252,2 WS 0 + 128 [fio] fio-146 [000] ..... 420.149763: block_bio_queue: 8,32 WS 262144 + 128 [fio] fio-146 [000] dNh1. 420.151446: block_rq_complete: 8,32 WS () 262144 + 128 be,0,4 [0] fio-146 [000] dNh1. 420.152731: block_bio_complete: 252,2 WS 0 + 128 [0] fio-146 [000] dNh1. 420.154229: block_bio_complete: 252,3 WS 0 + 128 [0] kworker/0:0-9 [000] ..... 420.160530: block_bio_queue: 252,0 W 408 + 8 [kworker/0:0] kworker/0:0-9 [000] ..... 420.161641: block_bio_queue: 8,32 W 408 + 8 [kworker/0:0] kworker/0:0-9 [000] ..... 420.162533: block_bio_queue: 252,0 W 416 + 8 [kworker/0:0] kworker/0:0-9 [000] ..... 420.162821: block_bio_queue: 8,32 W 416 + 8 [kworker/0:0] <snip> Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysdm cache: fix null-deref with concurrent writes in passthrough modeMing-Hung Tsai1-2/+4
[ Upstream commit 7d1f98d668ee34c1d15bdc0420fdd062f24a27c0 ] In passthrough mode, when dm-cache starts to invalidate a cache entry and bio prison cell lock fails due to concurrent write to the same cached block, mg->cell remains NULL. The error path in invalidate_complete() attempts to unlock and free the cell unconditionally, causing a NULL pointer dereference: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 UID: 0 PID: 134 Comm: fio Not tainted 6.19.0-rc7 #3 PREEMPT RIP: 0010:dm_cell_unlock_v2+0x3f/0x210 <snip> Call Trace: invalidate_complete+0xef/0x430 map_bio+0x130f/0x1a10 cache_map+0x320/0x6b0 __map_bio+0x458/0x510 dm_submit_bio+0x40e/0x16d0 __submit_bio+0x419/0x870 <snip> Reproduce steps: 1. Create a cache device dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 262144 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. Promote the first data block into cache fio --filename=/dev/mapper/cache --name=populate --rw=write --bs=4k \ --direct=1 --size=64k 3. Reload the cache into passthrough mode dmsetup suspend cache dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 passthrough smq 0" dmsetup resume cache 4. Write to the first cached block concurrently fio --filename=/dev/mapper/cache --name test --rw=randwrite --bs=4k \ --randrepeat=0 --direct=1 --numjobs=2 --size 64k Fix by checking if mg->cell is valid before attempting to unlock it. Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: wake raid456 reshape waiters before suspendYu Kuai1-0/+11
[ Upstream commit cf86bb53b9c92354904a328e947a05ffbfdd1840 ] During raid456 reshape, direct IO across the reshape position can sleep in raid5_make_request() waiting for reshape progress while still holding an active_io reference. If userspace then freezes reshape and writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io and waits for all in-flight IO to drain. This can deadlock: the IO needs reshape progress to continue, but the reshape thread is already frozen, so the active_io reference is never dropped and suspend never completes. raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do the same for normal md suspend when reshape is already interrupted, so waiting raid456 IO can abort, drop its reference, and let suspend finish. The mdadm test tests/25raid456-reshape-deadlock reproduces the hang. Fixes: 714d20150ed8 ("md: add new helpers to suspend/resume array") Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: remove unused static md_wq workqueueAbd-Alrhman Masalkhi1-8/+0
[ Upstream commit e4979f4fac4d6bbe757be50441b45e28e6bf7360 ] The md_wq workqueue is defined as static and initialized in md_init(), but it is not used anywhere within md.c. All asynchronous and deferred work in this file is handled via md_misc_wq or dedicated md threads. Fixes: b75197e86e6d3 ("md: Remove flush handling") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: fix array_state=clear sysfs deadlockYu Kuai1-1/+7
[ Upstream commit 2aa72276fab9851dbd59c2daeb4b590c5a113908 ] When "clear" is written to array_state, md_attr_store() breaks sysfs active protection so the array can delete itself from its own sysfs store method. However, md_attr_store() currently drops the mddev reference before calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0) has made the mddev eligible for delayed deletion, the temporary kobject reference taken by sysfs_break_active_protection() can become the last kobject reference protecting the md kobject. That allows sysfs_unbreak_active_protection() to drop the last kobject reference from the current sysfs writer context. kobject teardown then recurses into kernfs removal while the current sysfs node is still being unwound, and lockdep reports recursive locking on kn->active with kernfs_drain() in the call chain. Reproducer on an existing level: 1. Create an md0 linear array and activate it: mknod /dev/md0 b 9 0 echo none > /sys/block/md0/md/metadata_version echo linear > /sys/block/md0/md/level echo 1 > /sys/block/md0/md/raid_disks echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \ /sys/block/md0/md/dev-sdb/size echo 0 > /sys/block/md0/md/dev-sdb/slot echo active > /sys/block/md0/md/array_state 2. Wait briefly for the array to settle, then clear it: sleep 2 echo clear > /sys/block/md0/md/array_state The warning looks like: WARNING: possible recursive locking detected bash/588 is trying to acquire lock: (kn->active#65) at __kernfs_remove+0x157/0x1d0 but task is already holding lock: (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40 ... Call Trace: kernfs_drain __kernfs_remove kernfs_remove_by_name_ns sysfs_remove_group sysfs_remove_groups __kobject_del kobject_put md_attr_store kernfs_fop_write_iter vfs_write ksys_write Restore active protection before mddev_put() so the extra sysfs kobject reference is dropped while the mddev is still held alive. The actual md kobject deletion is then deferred until after the sysfs write path has fully returned. Fixes: 9e59d609763f ("md: call del_gendisk in control path") Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd/raid1: fix the comparing region of interval treeXiao Ni1-2/+2
[ Upstream commit de3544d2e5ea99064498de3c21ba490155864657 ] Interval tree uses [start, end] as a region which stores in the tree. In raid1, it uses the wrong end value. For example: bio(A,B) is too big and needs to be split to bio1(A,C-1), bio2(C,B). The region of bio1 is [A,C] and the region of bio2 is [C,B]. So bio1 and bio2 overlap which is not right. Fix this problem by using right end value of the region. Fixes: d0d2d8ba0494 ("md/raid1: introduce wait_for_serialization") Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysmd: suppress spurious superblock update error message for dm-raidChen Cheng1-1/+3
[ Upstream commit eff0d74c6c8fd358bc9474c05002e51fa5aa56ad ] dm-raid has external metadata management (mddev->external = 1) and no persistent superblock (mddev->persistent = 0). For these arrays, there's no superblock to update, so the error message is spurious. The error appears as: md_update_sb: can't update sb for read-only array md0 Fixes: 8c9e376b9d1a ("md: warn about updating super block failure") Reported-by: Tj <tj.iam.tj@proton.me> Closes: https://lore.kernel.org/all/20260128082430.96788-1-tj.iam.tj@proton.me/ Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20260210133847.269986-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-14md/raid10: fix divide-by-zero in setup_geo() with zero far_copiesJunrui Luo1-0/+2
commit 9aa6d860b0930e2f72795665c42c44252a558a0c upstream. setup_geo() extracts near_copies (nc) and far_copies (fc) from the user-provided layout parameter without checking for zero. When fc=0 with the "improved" far set layout selected, 'geo->far_set_size = disks / fc' triggers a divide-by-zero. Validate nc and fc immediately after extraction, returning -1 if either is zero. Fixes: 475901aff158 ("MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB7881A5E2556806CC1D318582AF232@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-verity-fec: fix the size of dm_verity_fec_io::erasuresEric Biggers1-1/+2
commit a7fca324d7d90f7b139d4d32747c83a629fdb446 upstream. At most 25 entries in dm_verity_fec_io::erasures are used: the maximum number of FEC roots plus one. Therefore, set the array size accordingly. This reduces the size of dm_verity_fec_io by 912 bytes. Note: a later commit introduces a constant DM_VERITY_FEC_MAX_ROOTS, which allows the size to be more clearly expressed as DM_VERITY_FEC_MAX_ROOTS + 1. This commit just fixes the size first. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-verity-fec: fix reading parity bytes split across blocks (take 3)Eric Biggers1-56/+44
commit 430a05cb926f6bdf53e81460a2c3a553257f3f61 upstream. fec_decode_bufs() assumes that the parity bytes of the first RS codeword it decodes are never split across parity blocks. This assumption is false. Consider v->fec->block_size == 4096 && v->fec->roots == 17 && fio->nbufs == 1, for example. In that case, each call to fec_decode_bufs() consumes v->fec->roots * (fio->nbufs << DM_VERITY_FEC_BUF_RS_BITS) = 272 parity bytes. Considering that the parity data for each message block starts on a block boundary, the byte alignment in the parity data will iterate through 272*i mod 4096 until the 3 parity blocks have been consumed. On the 16th call (i=15), the alignment will be 4080 bytes into the first block. Only 16 bytes remain in that block, but 17 parity bytes will be needed. The code reads out-of-bounds from the parity block buffer. Fortunately this doesn't normally happen, since it can occur only for certain non-default values of fec_roots *and* when the maximum number of buffers couldn't be allocated due to low memory. For example with block_size=4096 only the following cases are affected: fec_roots=17: nbufs in [1, 3, 5, 15] fec_roots=19: nbufs in [1, 229] fec_roots=21: nbufs in [1, 3, 5, 13, 15, 39, 65, 195] fec_roots=23: nbufs in [1, 89] Regardless, fix it by refactoring how the parity blocks are read. Fixes: 6df90c02bae4 ("dm-verity FEC: Fix RS FEC repair for roots unaligned to block size (take 2)") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-verity-fec: fix corrected block count statEric Biggers1-3/+2
commit 48640c88a8ddd482b6456fcbc084b08dd2bac083 upstream. dm_verity_fec::corrected seems to have been intended to count the number of corrected blocks. However, it actually counted the number of calls to fec_decode_bufs() that corrected at least one error. That's not the same thing. For example, in low-memory situations correcting a single block can require many calls to fec_decode_bufs(). Fix it to count corrected blocks instead. Fixes: ae97648e14f7 ("dm verity fec: Expose corrected block count via status") Cc: Shubhankar Mishra <shubhankarm@google.com> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-verity-fec: correctly reject too-small hash devicesEric Biggers1-1/+2
commit 4355142245f7e55336dcc005ec03592df4d546f8 upstream. Fix verity_fec_ctr() to reject too-small hash devices by correctly taking hash_start into account. Note that this is necessary because dm-verity doesn't call dm_bufio_set_sector_offset() on the hash device's bufio client (v->bufio). Thus, dm_bufio_get_device_size(v->bufio) returns a size relative to 0 rather than hash_start. An alternative fix would be to call dm_bufio_set_sector_offset() on v->bufio, but then all the code that reads from the hash device would have to be adjusted accordingly. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-verity-fec: correctly reject too-small FEC devicesEric Biggers1-3/+2
commit 2b14e0bb63cc671120e7791658f5c494fc66d072 upstream. Fix verity_fec_ctr() to reject too-small FEC devices by correctly computing the number of parity blocks as 'f->rounds * f->roots'. Previously it incorrectly used 'div64_u64(f->rounds * f->roots, v->fec->roots << SECTOR_SHIFT)' which is a much smaller value. Note that the units of 'rounds' are blocks, not bytes. This matches the units of the value returned by dm_bufio_get_device_size(), which are also blocks. A later commit will give 'rounds' a clearer name. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm: fix a buffer overflow in ioctl processingMikulas Patocka1-0/+4
commit 2fa49cc884f6496a915c35621ba4da35649bf159 upstream. Tony Asleson (using Claude) found a buffer overflow in dm-ioctl in the function retrieve_status: 1. The code in retrieve_status checks that the output string fits into the output buffer and writes the output string there 2. Then, the code aligns the "outptr" variable to the next 8-byte boundary: outptr = align_ptr(outptr); 3. The alignment doesn't check overflow, so outptr could point past the buffer end 4. The "for" loop is iterated again, it executes: remaining = len - (outptr - outbuf); 5. If "outptr" points past "outbuf + len", the arithmetics wraps around and the variable "remaining" contains unusually high number 6. With "remaining" being high, the code writes more data past the end of the buffer Luckily, this bug has no security implications because: 1. Only root can issue device mapper ioctls 2. The commonly used libraries that communicate with device mapper (libdevmapper and devicemapper-rs) use buffer size that is aligned to 8 bytes - thus, "outptr = align_ptr(outptr)" can't overshoot the input buffer and the bug can't happen accidentally Reported-by: Tony Asleson <tasleson@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Bryn M. Reeves <bmr@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm: don't report warning when doing deferred removeMikulas Patocka1-1/+1
commit b7cce3e2cca9cd78418f3c3784474b778e7996fe upstream. If dm_hash_remove_all was called from dm_deferred_remove, it would write a warning "remove_all left %d open device(s)" if there are some other devices active. The warning is bogus, so let's disable it in this case. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Cc: stable@vger.kernel.org Fixes: 2c140a246dc0 ("dm: allow remove to be deferred") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-14dm-thin: fix metadata refcount underflowMikulas Patocka1-0/+8
commit 09a65adc7d8bbfce06392cb6d375468e2728ead5 upstream. There's a bug in dm-thin in the function rebalance_children. If the internal btree node has one entry, the code tries to copy all btree entries from the node's child to the node itself and then decrement the child's reference count. If the child node is shared (it has reference count > 1), we won't free it, so there would be two pointers to each of the grandchildren nodes. But the reference counts of the grandchildren is not increased, thus the reference count doesn't match the number of pointers that point to the grandchildren. This results in "device mapper: space map common: unable to decrement block" errors. Fix this bug by incrementing reference counts on the grandchildren if the btree node is shared. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 3241b1d3e0aa ("dm: add persistent data library") Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07dm mirror: fix integer overflow in create_dirty_log()Junrui Luo1-3/+3
commit 4c788c6f921b22f9b6c3f316c4a071c05683e7de upstream. The argument count calculation in create_dirty_log() performs `*args_used = 2 + param_count` before validating against argc. When a user provides a param_count close to UINT_MAX via the device mapper table string, this unsigned addition wraps around to a small value, causing the subsequent `argc < *args_used` check to be bypassed. The overflowed param_count is then passed as argc to dm_dirty_log_create(), where it can cause out-of-bounds reads on the argv array. Fix by comparing param_count against argc - 2 before performing the addition, following the same pattern used by parse_features() in the same file. Since argc >= 2 is already guaranteed, the subtraction is safe. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07md/raid5: validate payload size before accessing journal metadataJunrui Luo1-15/+33
commit b0cc3ae97e893bf54bbce447f4e9fd2e0b88bff9 upstream. r5c_recovery_analyze_meta_block() and r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a journal metadata block using on-disk payload size fields without validating them against the remaining space in the metadata block. A corrupted journal contains payload sizes extending beyond the PAGE_SIZE boundary can cause out-of-bounds reads when accessing payload fields or computing offsets. Add bounds validation for each payload type to ensure the full payload fits within meta_size before processing. Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07md/raid5: fix soft lockup in retry_aligned_read()Chia-Ming Chang1-1/+7
commit 7f9f7c697474268d9ef9479df3ddfe7cdcfbbffc upstream. When retry_aligned_read() encounters an overlapped stripe, it releases the stripe via raid5_release_stripe() which puts it on the lockless released_stripes llist. In the next raid5d loop iteration, release_stripe_list() drains the stripe onto handle_list (since STRIPE_HANDLE is set by the original IO), but retry_aligned_read() runs before handle_active_stripes() and removes the stripe from handle_list via find_get_stripe() -> list_del_init(). This prevents handle_stripe() from ever processing the stripe to resolve the overlap, causing an infinite loop and soft lockup. Fix this by using __release_stripe() with temp_inactive_list instead of raid5_release_stripe() in the failure path, so the stripe does not go through the released_stripes llist. This allows raid5d to break out of its loop, and the overlap will be resolved when the stripe is eventually processed by handle_stripe(). Fixes: 773ca82fa1ee ("raid5: make release_stripe lockless") Cc: stable@vger.kernel.org Signed-off-by: FengWei Shih <dannyshih@synology.com> Signed-off-by: Chia-Ming Chang <chiamingc@synology.com> Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07md/md-llbitmap: raise barrier before state machine transitionYu Kuai1-4/+4
commit ef4ca3d4bf09716cff9ba00eb0351deadc8417ab upstream. Move the barrier raise operation before calling llbitmap_state_machine() in both llbitmap_start_write() and llbitmap_start_discard(). This ensures the barrier is in place before any state transitions occur, preventing potential race conditions where the state machine could complete before the barrier is properly raised. Cc: stable@vger.kernel.org Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07md/md-llbitmap: skip reading rdevs that are not in_syncYu Kuai1-1/+2
commit 7701e68b5072faa03a8f30b4081dc16df9092381 upstream. When reading bitmap pages from member disks, the code iterates through all rdevs and attempts to read from the first available one. However, it only checks for raid_disk assignment and Faulty flag, missing the In_sync flag check. This can cause bitmap data to be read from spare disks that are still being rebuilt and don't have valid bitmap information yet. Reading stale or uninitialized bitmap data from such disks can lead to incorrect dirty bit tracking, potentially causing data corruption during recovery or normal operation. Add the In_sync flag check to ensure bitmap pages are only read from fully synchronized member disks that have valid bitmap data. Cc: stable@vger.kernel.org Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-07md/raid10: fix deadlock with check operation and nowait requestsJosh Hunt1-2/+2
commit 7d96f3120a7fb7210d21b520c5b6f495da6ba436 upstream. When an array check is running it will raise the barrier at which point normal requests will become blocked and increment the nr_pending value to signal there is work pending inside of wait_barrier(). NOWAIT requests do not block and so will return immediately with an error, and additionally do not increment nr_pending in wait_barrier(). Upstream change commit 43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request") added a call to raid_end_bio_io() to fix a memory leak when NOWAIT requests hit this condition. raid_end_bio_io() eventually calls allow_barrier() and it will unconditionally do an atomic_dec_and_test(&conf->nr_pending) even though the corresponding increment on nr_pending didn't happen in the NOWAIT case. This can be easily seen by starting a check operation while an application is doing nowait IO on the same array. This results in a deadlocked state due to nr_pending value underflowing and so the md resync thread gets stuck waiting for nr_pending to == 0. Output of r10conf state of the array when we hit this condition: crash> struct r10conf barrier = 1, nr_pending = { counter = -41 }, nr_waiting = 15, nr_queued = 0, Example of md_sync thread stuck waiting on raise_barrier() and other requests stuck in wait_barrier(): md1_resync [<0>] raise_barrier+0xce/0x1c0 [<0>] raid10_sync_request+0x1ca/0x1ed0 [<0>] md_do_sync+0x779/0x1110 [<0>] md_thread+0x90/0x160 [<0>] kthread+0xbe/0xf0 [<0>] ret_from_fork+0x34/0x50 [<0>] ret_from_fork_asm+0x1a/0x30 kworker/u1040:2+flush-253:4 [<0>] wait_barrier+0x1de/0x220 [<0>] regular_request_wait+0x30/0x180 [<0>] raid10_make_request+0x261/0x1000 [<0>] md_handle_request+0x13b/0x230 [<0>] __submit_bio+0x107/0x1f0 [<0>] submit_bio_noacct_nocheck+0x16f/0x390 [<0>] ext4_io_submit+0x24/0x40 [<0>] ext4_do_writepages+0x254/0xc80 [<0>] ext4_writepages+0x84/0x120 [<0>] do_writepages+0x7a/0x260 [<0>] __writeback_single_inode+0x3d/0x300 [<0>] writeback_sb_inodes+0x1dd/0x470 [<0>] __writeback_inodes_wb+0x4c/0xe0 [<0>] wb_writeback+0x18b/0x2d0 [<0>] wb_workfn+0x2a1/0x400 [<0>] process_one_work+0x149/0x330 [<0>] worker_thread+0x2d2/0x410 [<0>] kthread+0xbe/0xf0 [<0>] ret_from_fork+0x34/0x50 [<0>] ret_from_fork_asm+0x1a/0x30 Fixes: 43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request") Cc: stable@vger.kernel.org Signed-off-by: Josh Hunt <johunt@akamai.com> Link: https://lore.kernel.org/linux-raid/20260303005619.1352958-1-johunt@akamai.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-22bcache: fix cached_dev.sb_bio use-after-free and crashMingzhe Zou1-0/+7
commit fec114a98b8735ee89c75216c45a78e28be0f128 upstream. In our production environment, we have received multiple crash reports regarding libceph, which have caught our attention: ``` [6888366.280350] Call Trace: [6888366.280452] blk_update_request+0x14e/0x370 [6888366.280561] blk_mq_end_request+0x1a/0x130 [6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd] [6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd] [6888366.280903] __complete_request+0x22/0x70 [libceph] [6888366.281032] osd_dispatch+0x15e/0xb40 [libceph] [6888366.281164] ? inet_recvmsg+0x5b/0xd0 [6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph] [6888366.281405] ceph_con_process_message+0x79/0x140 [libceph] [6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph] [6888366.281661] ceph_con_workfn+0x329/0x680 [libceph] ``` After analyzing the coredump file, we found that the address of dc->sb_bio has been freed. We know that cached_dev is only freed when it is stopped. Since sb_bio is a part of struct cached_dev, rather than an alloc every time. If the device is stopped while writing to the superblock, the released address will be accessed at endio. This patch hopes to wait for sb_write to complete in cached_dev_free. It should be noted that we analyzed the cause of the problem, then tell all details to the QWEN and adopted the modifications it made. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Fixes: cafe563591446 ("bcache: A block layer cache") Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook7-11/+9
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds7-16/+8
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds7-7/+7
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds65-116/+116
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook75-199/+185
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-19dm: dm-zoned: Adjust dmz_load_mapping() allocation typeKees Cook1-1/+1
In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct dmz_mblock **" but the returned type will be "struct dmz_mblk **". These are the same allocation size (pointer size), but the types do not match. Adjust the allocation type to match the assignment. Link: https://patch.msgid.link/20250426061707.work.587-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-19dm-crypt: Adjust crypt_alloc_tfms_aead() allocation typeKees Cook1-1/+1
In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct crypto_skcipher **" but the returned type will be "struct crypto_aead **". These are the same allocation size (pointer size), but the types don't match. Adjust the allocation type to match the assignment. Link: https://patch.msgid.link/20250426061629.work.266-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-17Merge tag 'block-7.0-20260216' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds3-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12md: ignore discard return valueChaitanya Kulkarni1-2/+2
__blkdev_issue_discard() always returns 0, making all error checking at call sites dead code. Simplify md to only check !discard_bio by ignoring the __blkdev_issue_discard() value. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-12Merge tag 'for-7.0/dm-changes' of ↵Linus Torvalds35-332/+404
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - dm-verity: - various optimizations and fixes related to forward error correction - add a .dm-verity keyring - dm-integrity: fix bugs with growing a device in bitmap mode - dm-mpath: - fix leaking fake timeout requests - fix UAF bug caused by stale rq->bio - fix minor bugs in device creation - dm-core: - fix a bug related to blkg association - avoid unnecessary blk-crypto work on invalid keys - dm-bufio: - dm-bufio cleanup and optimization (reducing hash table lookups) - various other minor fixes and cleanups * tag 'for-7.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm mpath: make pg_init_delay_msecs settable Revert "dm: fix a race condition in retrieve_deps" dm mpath: Add missing dm_put_device when failing to get scsi dh name dm vdo encodings: clean up header and version functions dm: use bio_clone_blkg_association dm: fix excessive blk-crypto operations for invalid keys dm-verity: fix section mismatch error dm-unstripe: fix mapping bug when there are multiple targets in a table dm-integrity: fix recalculation in bitmap mode dm-bufio: avoid redundant buffer_tree lookups dm-bufio: merge cache_put() into cache_put_and_wake() selftests: add dm-verity keyring selftests dm-verity: add dm-verity keyring dm: clear cloned request bio pointer when last clone bio completes dm-verity: fix up various workqueue-related comments dm-verity: switch to bio_advance_iter_single() dm-verity: consolidate the BH and normal work structs dm: add WQ_PERCPU to alloc_workqueue users dm-integrity: fix a typo in the code for write/discard race dm: use READ_ONCE in dm_blk_report_zones ...
2026-02-10Merge tag 'for-7.0/block-20260206' of ↵Linus Torvalds14-356/+315
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...
2026-02-02md: fix return value of mddev_trylockXiao Ni1-2/+2
A return value of 0 is treaded as successful lock acquisition. In fact, a return value of 1 means getting the lock successfully. Link: https://lore.kernel.org/linux-raid/20260127073951.17248-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Reported-by: Bart Van Assche <bvanassche@acm.org> Closes: https://lore.kernel.org/linux-raid/20250611073108.25463-1-xni@redhat.com/T/#mfa369ef5faa4aa58e13e6d9fdb88aecd862b8f2f Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>