summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2026-03-25dm-verity: disable recursive forward error correctionMikulas Patocka2-6/+1
[ Upstream commit d9f3e47d3fae0c101d9094bc956ed24e7a0ee801 ] There are two problems with the recursive correction: 1. It may cause denial-of-service. In fec_read_bufs, there is a loop that has 253 iterations. For each iteration, we may call verity_hash_for_block recursively. There is a limit of 4 nested recursions - that means that there may be at most 253^4 (4 billion) iterations. Red Hat QE team actually created an image that pushes dm-verity to this limit - and this image just makes the udev-worker process get stuck in the 'D' state. 2. It doesn't work. In fec_read_bufs we store data into the variable "fio->bufs", but fio bufs is shared between recursive invocations, if "verity_hash_for_block" invoked correction recursively, it would overwrite partially filled fio->bufs. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Guangwu Zhang <guazhang@redhat.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> [ The context change is due to the commit bdf253d580d7 ("dm-verity: remove support for asynchronous hashes") in v6.18 and the commit 9356fcfe0ac4 ("dm verity: set DM_TARGET_SINGLETON feature flag") in v6.9 which are irrelevant to the logic of this patch. ] Signed-off-by: Rahul Sharma <black.hawk@163.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04dm mpath: make pg_init_delay_msecs settableBenjamin Marzinski1-1/+1
[ Upstream commit 218b16992a37ea97b9e09b7659a25a864fb9976f ] "pg_init_delay_msecs X" can be passed as a feature in the multipath table and is used to set m->pg_init_delay_msecs in parse_features(). However, alloc_multipath_stage2(), which is called after parse_features(), resets m->pg_init_delay_msecs to its default value. Instead, set m->pg_init_delay_msecs in alloc_multipath(), which is called before parse_features(), to avoid overwriting a value passed in by the table. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04md/bitmap: fix GPF in write_page caused by resize raceJack Wang1-1/+2
[ Upstream commit 46ef85f854dfa9d5226b3c1c46493d79556c9589 ] A General Protection Fault occurs in write_page() during array resize: RIP: 0010:write_page+0x22b/0x3c0 [md_mod] This is a use-after-free race between bitmap_daemon_work() and __bitmap_resize(). The daemon iterates over `bitmap->storage.filemap` without locking, while the resize path frees that storage via md_bitmap_file_unmap(). `quiesce()` does not stop the md thread, allowing concurrent access to freed pages. Fix by holding `mddev->bitmap_info.mutex` during the bitmap update. Link: https://lore.kernel.org/linux-raid/20260120102456.25169-1-jinpu.wang@ionos.com Closes: https://lore.kernel.org/linux-raid/CAMGffE=Mbfp=7xD_hYxXk1PAaCZNSEAVeQGKGy7YF9f2S4=NEA@mail.gmail.com/T/#u Cc: stable@vger.kernel.org Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.") Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm-unstripe: fix mapping bug when there are multiple targets in a tableMatt Whitlock1-1/+1
[ Upstream commit 83c10e8dd43628d0bf86486616556cd749a3c310 ] The "unstriped" device-mapper target incorrectly calculates the sector offset on the mapped device when the target's origin is not zero. Take for example this hypothetical concatenation of the members of a two-disk RAID0: linearized: 0 2097152 unstriped 2 128 0 /dev/md/raid0 0 linearized: 2097152 2097152 unstriped 2 128 1 /dev/md/raid0 0 The intent in this example is to create a single device named /dev/mapper/linearized that comprises all of the chunks of the first disk of the RAID0 set, followed by all of the chunks of the second disk of the RAID0 set. This fails because dm-unstripe.c's map_to_core function does its computations based on the sector number within the mapper device rather than the sector number within the target. The bug turns invisible when the target's origin is at sector zero of the mapper device, as is the common case. In the example above, however, what happens is that the first half of the mapper device gets mapped correctly to the first disk of the RAID0, but the second half of the mapper device gets mapped past the end of the RAID0 device, and accesses to any of those sectors return errors. Signed-off-by: Matt Whitlock <kernel@mattwhitlock.name> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 18a5bf270532 ("dm: add unstriped target") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm-integrity: fix recalculation in bitmap modeMikulas Patocka1-0/+13
[ Upstream commit 118ba36e446c01e3cd34b3eedabf1d9436525e1d ] There's a logic quirk in the handling of suspend in the bitmap mode: This is the sequence of calls if we are reloading a dm-integrity table: * dm_integrity_ctr reads a superblock with the flag SB_FLAG_DIRTY_BITMAP set. * dm_integrity_postsuspend initializes a journal and clears the flag SB_FLAG_DIRTY_BITMAP. * dm_integrity_resume sees the superblock with SB_FLAG_DIRTY_BITMAP set - thus it interprets the journal as if it were a bitmap. This quirk causes recalculation problem if the user increases the size of the device in the bitmap mode. Fix this by reading a fresh copy on the superblock in dm_integrity_resume. This commit also fixes another logic quirk - the branch that sets bitmap bits if the device was extended should only be executed if the flag SB_FLAG_DIRTY_BITMAP is set. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Ondrej Kozina <okozina@redhat.com> Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm: clear cloned request bio pointer when last clone bio completesMichael Liang1-3/+10
[ Upstream commit fb8a6c18fb9a6561f7a15b58b272442b77a242dd ] Stale rq->bio values have been observed to cause double-initialization of cloned bios in request-based device-mapper targets, leading to use-after-free and double-free scenarios. One such case occurs when using dm-multipath on top of a PCIe NVMe namespace, where cloned request bios are freed during blk_complete_request(), but rq->bio is left intact. Subsequent clone teardown then attempts to free the same bios again via blk_rq_unprep_clone(). The resulting double-free path looks like: nvme_pci_complete_batch() nvme_complete_batch() blk_mq_end_request_batch() blk_complete_request() // called on a DM clone request bio_endio() // first free of all clone bios ... rq->end_io() // end_clone_request() dm_complete_request(tio->orig) dm_softirq_done() dm_done() dm_end_request() blk_rq_unprep_clone() // second free of clone bios Fix this by clearing the clone request's bio pointer when the last cloned bio completes, ensuring that later teardown paths do not attempt to free already-released bios. Signed-off-by: Michael Liang <mliang@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm-integrity: fix a typo in the code for write/discard raceMikulas Patocka1-1/+1
[ Upstream commit c698b7f417801fcd79f0dc844250b3361d38e6b8 ] If we send a write followed by a discard, it may be possible that the discarded data end up being overwritten by the previous write from the journal. The code tries to prevent that, but there was a typo in this logic that made it not being activated as it should be. Note that if we end up here the second time (when discard_retried is true), it means that the write bio is actually racing with the discard bio, and in this situation it is not specified which of them should win. Cc: stable@vger.kernel.org Fixes: 31843edab7cb ("dm integrity: improve discard in journal mode") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm-verity: correctly handle dm_bufio_client_create() failureEric Biggers1-2/+2
[ Upstream commit 119f4f04186fa4f33ee6bd39af145cdaff1ff17f ] If either of the calls to dm_bufio_client_create() in verity_fec_ctr() fails, then dm_bufio_client_destroy() is later called with an ERR_PTR() argument. That causes a crash. Fix this. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm: remove fake timeout to avoid leak requestDing Hui1-2/+1
[ Upstream commit f3a9c95a15d2f4466acad5c68faeff79ca5e9f47 ] Since commit 15f73f5b3e59 ("blk-mq: move failure injection out of blk_mq_complete_request"), drivers are responsible for calling blk_should_fake_timeout() at appropriate code paths and opportunities. However, the dm driver does not implement its own timeout handler and relies on the timeout handling of its slave devices. If an io-timeout-fail error is injected to a dm device, the request will be leaked and never completed, causing tasks to hang indefinitely. Reproduce: 1. prepare dm which has iscsi slave device 2. inject io-timeout-fail to dm echo 1 >/sys/class/block/dm-0/io-timeout-fail echo 100 >/sys/kernel/debug/fail_io_timeout/probability echo 10 >/sys/kernel/debug/fail_io_timeout/times 3. read/write dm 4. iscsiadm -m node -u Result: hang task like below [ 862.243768] INFO: task kworker/u514:2:151 blocked for more than 122 seconds. [ 862.244133] Tainted: G E 6.19.0-rc1+ #51 [ 862.244337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 862.244718] task:kworker/u514:2 state:D stack:0 pid:151 tgid:151 ppid:2 task_flags:0x4288060 flags:0x00080000 [ 862.245024] Workqueue: iscsi_ctrl_3:1 __iscsi_unbind_session [scsi_transport_iscsi] [ 862.245264] Call Trace: [ 862.245587] <TASK> [ 862.245814] __schedule+0x810/0x15c0 [ 862.246557] schedule+0x69/0x180 [ 862.246760] blk_mq_freeze_queue_wait+0xde/0x120 [ 862.247688] elevator_change+0x16d/0x460 [ 862.247893] elevator_set_none+0x87/0xf0 [ 862.248798] blk_unregister_queue+0x12e/0x2a0 [ 862.248995] __del_gendisk+0x231/0x7e0 [ 862.250143] del_gendisk+0x12f/0x1d0 [ 862.250339] sd_remove+0x85/0x130 [sd_mod] [ 862.250650] device_release_driver_internal+0x36d/0x530 [ 862.250849] bus_remove_device+0x1dd/0x3f0 [ 862.251042] device_del+0x38a/0x930 [ 862.252095] __scsi_remove_device+0x293/0x360 [ 862.252291] scsi_remove_target+0x486/0x760 [ 862.252654] __iscsi_unbind_session+0x18a/0x3e0 [scsi_transport_iscsi] [ 862.252886] process_one_work+0x633/0xe50 [ 862.253101] worker_thread+0x6df/0xf10 [ 862.253647] kthread+0x36d/0x720 [ 862.254533] ret_from_fork+0x2a6/0x470 [ 862.255852] ret_from_fork_asm+0x1a/0x30 [ 862.256037] </TASK> Remove the blk_should_fake_timeout() check from dm, as dm has no native timeout handling and should not attempt to fake timeouts. Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04dm: use bio_clone_blkg_associationMikulas Patocka1-0/+2
[ Upstream commit 2df8b310bcfe76827fd71092f58a2493ee6590b0 ] The origin bio carries blk-cgroup information which could be set from foreground(task_css(css) - wbc->wb->blkcg_css), so the blkcg won't control buffer io since commit ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone"). The synchronous io is still under control by blkcg, because 'bio->bi_blkg' is set by io submitting task which has been added into 'cgroup.procs'. Fix it by using bio_clone_blkg_association when submitting a cloned bio. Link: https://bugzilla.kernel.org/show_bug.cgi?id=220985 Fixes: ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone") Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04md/raid10: fix any_working flag handling in raid10_sync_requestLi Nan1-1/+1
[ Upstream commit 99582edb3f62e8ee6c34512021368f53f9b091f2 ] In raid10_sync_request(), 'any_working' indicates if any IO will be submitted. When there's only one In_sync disk with badblocks, 'any_working' might be set to 1 but no IO is submitted. Fix it by setting 'any_working' after badblock checks. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-11-linan666@huaweicloud.com Fixes: e875ecea266a ("md/raid10 record bad blocks as needed during recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-11dm-bufio: align write boundary on physical block sizeMikulas Patocka1-4/+6
commit d0ac06ae53be0cdb61f5fe6b62d25d3317c51657 upstream. There may be devices with physical block size larger than 4k. If dm-bufio sends I/O that is not aligned on physical block size, performance is degraded. The 4k minimum alignment limit is there because some SSDs report logical and physical block size 512 despite having 4k internally - so dm-bufio shouldn't send I/Os not aligned on 4k boundary, because they perform badly (the SSD does read-modify-write for them). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11dm-ebs: Mark full buffer dirty even on partial writeUladzislau Rezki (Sony)1-1/+1
commit 7fa3e7d114abc9cc71cc35d768e116641074ddb4 upstream. When performing a read-modify-write(RMW) operation, any modification to a buffered block must cause the entire buffer to be marked dirty. Marking only a subrange as dirty is incorrect because the underlying device block size(ubs) defines the minimum read/write granularity. A lower device can perform I/O only on regions which are fully aligned and sized to ubs. This change ensures that write-back operations always occur in full ubs-sized chunks, matching the intended emulation semantics of the EBS target. As for user space visible impact, submitting sub-ubs and misaligned I/O for devices which are tuned to ubs sizes only, will reject such requests, therefore it can lead to losing data. Example: 1) Create a 8K nvme device in qemu by adding -device nvme,drive=drv0,serial=foo,logical_block_size=8192,physical_block_size=8192 2) Setup dm-ebs to emulate 512B to 8K mapping urezki@pc638:~/bin$ cat dmsetup.sh lower=/dev/nvme0n1 len=$(blockdev --getsz "$lower") echo "0 $len ebs $lower 0 1 16" | dmsetup create nvme-8k urezki@pc638:~/bin$ offset 0, ebs=1 and ubs=16(in sectors). 3) Create an ext4 filesystem(default 4K block size) urezki@pc638:~/bin$ sudo mkfs.ext4 -F /dev/dm-0 mke2fs 1.47.0 (5-Feb-2023) Discarding device blocks: done Creating filesystem with 2072576 4k blocks and 518144 inodes Filesystem UUID: bd0b6ca6-0506-4e31-86da-8d22c9d50b63 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632 Allocating group tables: done Writing inode tables: done Creating journal (16384 blocks): done Writing superblocks and filesystem accounting information: mkfs.ext4: Input/output error while writing out and closing file system urezki@pc638:~/bin$ dmesg <snip> [ 1618.875449] buffer_io_error: 1028 callbacks suppressed [ 1618.875456] Buffer I/O error on dev dm-0, logical block 0, lost async page write [ 1618.875527] Buffer I/O error on dev dm-0, logical block 1, lost async page write [ 1618.875602] Buffer I/O error on dev dm-0, logical block 2, lost async page write [ 1618.875620] Buffer I/O error on dev dm-0, logical block 3, lost async page write [ 1618.875639] Buffer I/O error on dev dm-0, logical block 4, lost async page write [ 1618.894316] Buffer I/O error on dev dm-0, logical block 5, lost async page write [ 1618.894358] Buffer I/O error on dev dm-0, logical block 6, lost async page write [ 1618.894380] Buffer I/O error on dev dm-0, logical block 7, lost async page write [ 1618.894405] Buffer I/O error on dev dm-0, logical block 8, lost async page write [ 1618.894427] Buffer I/O error on dev dm-0, logical block 9, lost async page write <snip> Many I/O errors because the lower 8K device rejects sub-ubs/misaligned requests. with a patch: urezki@pc638:~/bin$ sudo mkfs.ext4 -F /dev/dm-0 mke2fs 1.47.0 (5-Feb-2023) Discarding device blocks: done Creating filesystem with 2072576 4k blocks and 518144 inodes Filesystem UUID: 9b54f44f-ef55-4bd4-9e40-c8b775a616ac Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632 Allocating group tables: done Writing inode tables: done Creating journal (16384 blocks): done Writing superblocks and filesystem accounting information: done urezki@pc638:~/bin$ sudo mount /dev/dm-0 /mnt/ urezki@pc638:~/bin$ ls -al /mnt/ total 24 drwxr-xr-x 3 root root 4096 Oct 17 15:13 . drwxr-xr-x 19 root root 4096 Jul 10 19:42 .. drwx------ 2 root root 16384 Oct 17 15:13 lost+found urezki@pc638:~/bin$ After this change: mkfs completes; mount succeeds. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11dm log-writes: Add missing set_freezable() for freezable kthreadHaotian Zhang1-0/+1
[ Upstream commit ab08f9c8b363297cafaf45475b08f78bf19b88ef ] The log_writes_kthread() calls try_to_freeze() but lacks set_freezable(), rendering the freeze attempt ineffective since kernel threads are non-freezable by default. This prevents proper thread suspension during system suspend/hibernate. Add set_freezable() to explicitly mark the thread as freezable. Fixes: 0e9cebe72459 ("dm: add log writes target") Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-11dm-raid: fix possible NULL dereference with undefined raid typeAlexey Simakov1-0/+2
[ Upstream commit 2f6cfd6d7cb165a7af8877b838a9f6aab4159324 ] rs->raid_type is assigned from get_raid_type_by_ll(), which may return NULL. This NULL value could be dereferenced later in the condition 'if (!(rs_is_raid10(rs) && rt_is_raid0(rs->raid_type)))'. Add a fail-fast check to return early with an error if raid_type is NULL, similar to other uses of this function. Found by Linux Verification Center (linuxtesting.org) with Svace. Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping") Signed-off-by: Alexey Simakov <bigalex934@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-11md/raid5: fix IO hang when array is broken with IO inflightYu Kuai1-2/+4
[ Upstream commit a913d1f6a7f607c110aeef8b58c8988f47a4b24e ] Following test can cause IO hang: mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none sleep 5 echo 1 > /sys/block/sda/device/delete echo 1 > /sys/block/sdb/device/delete echo 1 > /sys/block/sdc/device/delete echo 1 > /sys/block/sdd/device/delete dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct Root cause: 1) all disks removed, however all rdevs in the array is still in sync, IO will be issued normally. 2) IO failure from sda, and set badblocks failed, sda will be faulty and MD_SB_CHANGING_PENDING will be set. 3) error recovery try to recover this IO from other disks, IO will be issued to sdb, sdc, and sdd. 4) IO failure from sdb, and set badblocks failed again, now array is broken and will become read-only. 5) IO failure from sdc and sdd, however, stripe can't be handled anymore because MD_SB_CHANGING_PENDING is set: handle_stripe handle_stripe if (test_bit MD_SB_CHANGING_PENDING) set_bit STRIPE_HANDLE goto finish // skip handling failed stripe release_stripe if (test_bit STRIPE_HANDLE) list_add_tail conf->hand_list 6) later raid5d can't handle failed stripe as well: raid5d md_check_recovery md_update_sb if (!md_is_rdwr()) // can't clear pending bit return if (test_bit MD_SB_CHANGING_PENDING) break; // can't handle failed stripe Since MD_SB_CHANGING_PENDING can never be cleared for read-only array, fix this problem by skip this checking for read-only array. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com Fixes: d87f064f5874 ("md: never update metadata when array is read-only.") Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-11md: export md_is_rdwr() and is_md_suspended()Yu Kuai2-16/+17
[ Upstream commit 431e61257d631157e1d374f1368febf37aa59f7c ] The two apis will be used later to fix a deadlock in raid456, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com Stable-dep-of: a913d1f6a7f6 ("md/raid5: fix IO hang when array is broken with IO inflight") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-07dm-verity: fix unreliable memory allocationMikulas Patocka1-5/+1
commit fe680d8c747f4e676ac835c8c7fb0f287cd98758 upstream. GFP_NOWAIT allocation may fail anytime. It needs to be changed to GFP_NOIO. There's no need to handle an error because mempool_alloc with GFP_NOIO can't fail. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15dm: fix NULL pointer dereference in __dm_suspend()Zheng Qixing1-3/+4
commit 8d33a030c566e1f105cd5bf27f37940b6367f3be upstream. There is a race condition between dm device suspend and table load that can lead to null pointer dereference. The issue occurs when suspend is invoked before table load completes: BUG: kernel NULL pointer dereference, address: 0000000000000054 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 6 PID: 6798 Comm: dmsetup Not tainted 6.6.0-g7e52f5f0ca9b #62 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:blk_mq_wait_quiesce_done+0x0/0x50 Call Trace: <TASK> blk_mq_quiesce_queue+0x2c/0x50 dm_stop_queue+0xd/0x20 __dm_suspend+0x130/0x330 dm_suspend+0x11a/0x180 dev_suspend+0x27e/0x560 ctl_ioctl+0x4cf/0x850 dm_ctl_ioctl+0xd/0x20 vfs_ioctl+0x1d/0x50 __se_sys_ioctl+0x9b/0xc0 __x64_sys_ioctl+0x19/0x30 x64_sys_call+0x2c4a/0x4620 do_syscall_64+0x9e/0x1b0 The issue can be triggered as below: T1 T2 dm_suspend table_load __dm_suspend dm_setup_md_queue dm_mq_init_request_queue blk_mq_init_allocated_queue => q->mq_ops = set->ops; (1) dm_stop_queue / dm_wait_for_completion => q->tag_set NULL pointer! (2) => q->tag_set = set; (3) Fix this by checking if a valid table (map) exists before performing request-based suspend and waiting for target I/O. When map is NULL, skip these table-dependent suspend steps. Even when map is NULL, no I/O can reach any target because there is no table loaded; I/O submitted in this state will fail early in the DM layer. Skipping the table-dependent suspend logic in this case is safe and avoids NULL pointer dereferences. Fixes: c4576aed8d85 ("dm: fix request-based dm's use of dm_wait_for_completion") Cc: stable@vger.kernel.org Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15dm: fix queue start/stop imbalance under suspend/load/resume racesZheng Qixing2-3/+6
commit 7f597c2cdb9d3263a6fce07c4fc0a9eaa8e8fc43 upstream. When suspend and load run concurrently, before q->mq_ops is set in blk_mq_init_allocated_queue(), __dm_suspend() skip dm_stop_queue(). As a result, the queue's quiesce depth is not incremented. Later, once table load has finished and __dm_resume() runs, which triggers q->quiesce_depth ==0 warning in blk_mq_unquiesce_queue(): Call Trace: <TASK> dm_start_queue+0x16/0x20 [dm_mod] __dm_resume+0xac/0xb0 [dm_mod] dm_resume+0x12d/0x150 [dm_mod] do_resume+0x2c2/0x420 [dm_mod] dev_suspend+0x30/0x130 [dm_mod] ctl_ioctl+0x402/0x570 [dm_mod] dm_ctl_ioctl+0x23/0x30 [dm_mod] Fix this by explicitly tracking whether the request queue was stopped in __dm_suspend() via a new DMF_QUEUE_STOPPED flag. Only call dm_start_queue() in __dm_resume() if the queue was actually stopped. Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce") Cc: stable@vger.kernel.org Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15dm-integrity: limit MAX_TAG_SIZE to 255Mikulas Patocka1-1/+1
[ Upstream commit 77b8e6fbf9848d651f5cb7508f18ad0971f3ffdb ] MAX_TAG_SIZE was 0x1a8 and it may be truncated in the "bi->metadata_size = ic->tag_size" assignment. We need to limit it to 255. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15minmax: don't use max() in situations that want a C constant expressionLinus Torvalds1-1/+1
[ Upstream commit cb04e8b1d2f24c4c2c92f7b7529031fc35a16fed ] We only had a couple of array[] declarations, and changing them to just use 'MAX()' instead of 'max()' fixes the issue. This will allow us to simplify our min/max macros enormously, since they can now unconditionally use temporary variables to avoid using the argument values multiple times. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax: add a few more MIN_T/MAX_T usersLinus Torvalds1-1/+1
commit 4477b39c32fdc03363affef4b11d48391e6dc9ff upstream. Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28dm-table: fix checking for rq stackable devicesBenjamin Marzinski1-5/+5
[ Upstream commit 8ca719b81987be690f197e82fdb030580c0a07f3 ] Due to the semantics of iterate_devices(), the current code allows a request-based dm table as long as it includes one request-stackable device. It is supposed to only allow tables where there are no non-request-stackable devices. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28dm-mpath: don't print the "loaded" message if registering failsMikulas Patocka4-4/+12
[ Upstream commit 6e11952a6abc4641dc8ae63f01b318b31b44e8db ] If dm_register_path_selector, don't print the "version X loaded" message. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28md: dm-zoned-target: Initialize return variable r to avoid uninitialized usePurva Yeshi1-1/+1
[ Upstream commit 487767bff572d46f7c37ad846c4078f6d6c9cc55 ] Fix Smatch-detected error: drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices() error: uninitialized symbol 'r'. Smatch detects a possible use of the uninitialized variable 'r' in dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is skipped and 'r' is returned without being set, leading to undefined behavior. Initialize 'r' to 0 before the loop. This ensures that if there are no devices to iterate over, the function still returns a defined value. Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17raid10: cleanup memleak at raid10_make_requestNigel Croxon1-2/+8
[ Upstream commit 43806c3d5b9bb7d74ba4e33a6a8a41ac988bde24 ] If raid10_read_request or raid10_write_request registers a new request and the REQ_NOWAIT flag is set, the code does not free the malloc from the mempool. unreferenced object 0xffff8884802c3200 (size 192): comm "fio", pid 9197, jiffies 4298078271 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00 .........A...... 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc c1a049a2): __kmalloc+0x2bb/0x450 mempool_alloc+0x11b/0x320 raid10_make_request+0x19e/0x650 [raid10] md_handle_request+0x3b3/0x9e0 __submit_bio+0x394/0x560 __submit_bio_noacct+0x145/0x530 submit_bio_noacct_nocheck+0x682/0x830 __blkdev_direct_IO_async+0x4dc/0x6b0 blkdev_read_iter+0x1e5/0x3b0 __io_read+0x230/0x1110 io_read+0x13/0x30 io_issue_sqe+0x134/0x1180 io_submit_sqes+0x48c/0xe90 __do_sys_io_uring_enter+0x574/0x8b0 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x76/0x7e V4: changing backing tree to see if CKI tests will pass. The patch code has not changed between any versions. Fixes: c9aa889b035f ("md: raid10 add nowait support") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17md/raid1: Fix stack memory use after return in raid1_reshapeWang Jinchao1-0/+1
[ Upstream commit d67ed2ccd2d1dcfda9292c0ea8697a9d0f2f0d98 ] In the raid1_reshape function, newpool is allocated on the stack and assigned to conf->r1bio_pool. This results in conf->r1bio_pool.wait.head pointing to a stack address. Accessing this address later can lead to a kernel panic. Example access path: raid1_reshape() { // newpool is on the stack mempool_t newpool, oldpool; // initialize newpool.wait.head to stack address mempool_init(&newpool, ...); conf->r1bio_pool = newpool; } raid1_read_request() or raid1_write_request() { alloc_r1bio() { mempool_alloc() { // if pool->alloc fails remove_element() { --pool->curr_nr; } } } } mempool_free() { if (pool->curr_nr < pool->min_nr) { // pool->wait.head is a stack address // wake_up() will try to access this invalid address // which leads to a kernel panic return; wake_up(&pool->wait); } } Fix: reinit conf->r1bio_pool.wait after assigning newpool. Fixes: afeee514ce7f ("md: convert to bioset_init()/mempool_init()") Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17md/md-bitmap: fix GPF in bitmap_get_stats()Håkon Bugge1-2/+1
commit c17fb542dbd1db745c9feac15617056506dd7195 upstream. The commit message of commit 6ec1f0239485 ("md/md-bitmap: fix stats collection for external bitmaps") states: Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). But, the code does not adhere to the above, as it does only check for a valid super-block for "internal" bitmaps. Hence, we observe: Oops: GPF, probably for non-canonical address 0x1cd66f1f40000028 RIP: 0010:bitmap_get_stats+0x45/0xd0 Call Trace: seq_read_iter+0x2b9/0x46a seq_read+0x12f/0x180 proc_reg_read+0x57/0xb0 vfs_read+0xf6/0x380 ksys_read+0x6d/0xf0 do_syscall_64+0x8c/0x1b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e We fix this by checking the existence of a super-block for both the internal and external case. Fixes: 6ec1f0239485 ("md/md-bitmap: fix stats collection for external bitmaps") Cc: stable@vger.kernel.org Reported-by: Gerald Gibson <gerald.gibson@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://lore.kernel.org/linux-raid/20250702091035.2061312-1-haakon.bugge@oracle.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06dm-raid: fix variable in journal device checkHeinz Mauelshagen1-1/+1
commit db53805156f1e0aa6d059c0d3f9ac660d4ef3eb4 upstream. Replace "rdev" with correct loop variable name "r". Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org Fixes: 63c32ed4afc2 ("dm raid: add raid4/5/6 journaling support") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06bcache: fix NULL pointer in cache_set_flush()Linggang Zeng1-1/+6
[ Upstream commit 1e46ed947ec658f89f1a910d880cd05e42d3763e ] 1. LINE#1794 - LINE#1887 is some codes about function of bch_cache_set_alloc(). 2. LINE#2078 - LINE#2142 is some codes about function of register_cache_set(). 3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098. 1794 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) 1795 { ... 1860 if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) || 1861 mempool_init_slab_pool(&c->search, 32, bch_search_cache) || 1862 mempool_init_kmalloc_pool(&c->bio_meta, 2, 1863 sizeof(struct bbio) + sizeof(struct bio_vec) * 1864 bucket_pages(c)) || 1865 mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) || 1866 bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio), 1867 BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) || 1868 !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) || 1869 !(c->moving_gc_wq = alloc_workqueue("bcache_gc", 1870 WQ_MEM_RECLAIM, 0)) || 1871 bch_journal_alloc(c) || 1872 bch_btree_cache_alloc(c) || 1873 bch_open_buckets_alloc(c) || 1874 bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages))) 1875 goto err; ^^^^^^^^ 1876 ... 1883 return c; 1884 err: 1885 bch_cache_set_unregister(c); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1886 return NULL; 1887 } ... 2078 static const char *register_cache_set(struct cache *ca) 2079 { ... 2098 c = bch_cache_set_alloc(&ca->sb); 2099 if (!c) 2100 return err; ^^^^^^^^^^ ... 2128 ca->set = c; 2129 ca->set->cache[ca->sb.nr_this_dev] = ca; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... 2138 return NULL; 2139 err: 2140 bch_cache_set_unregister(c); 2141 return err; 2142 } (1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and call bch_cache_set_unregister()(LINE#1885). (2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return. (3) As (2) has returned, LINE#2128 - LINE#2129 would do *not* give the value to c->cache[], it means that c->cache[] is NULL. LINE#1624 - LINE#1665 is some codes about function of cache_set_flush(). As (1), in LINE#1885 call bch_cache_set_unregister() ---> bch_cache_set_stop() ---> closure_queue() -.-> cache_set_flush() (as below LINE#1624) 1624 static void cache_set_flush(struct closure *cl) 1625 { ... 1654 for_each_cache(ca, c, i) 1655 if (ca->alloc_thread) ^^ 1656 kthread_stop(ca->alloc_thread); ... 1665 } (4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the kernel crash occurred as below: [ 846.712887] bcache: register_cache() error drbd6: cannot allocate memory [ 846.713242] bcache: register_bcache() error : failed to register device [ 846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered [ 846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8 [ 846.714790] PGD 0 P4D 0 [ 846.715129] Oops: 0000 [#1] SMP PTI [ 846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1 [ 846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018 [ 846.716451] Workqueue: events cache_set_flush [bcache] [ 846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache] [ 846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7 [ 846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202 [ 846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000 [ 846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665 [ 846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700 [ 846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00 [ 846.720132] FS: 0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000 [ 846.720726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0 [ 846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 846.722131] PKRU: 55555554 [ 846.722467] Call Trace: [ 846.722814] process_one_work+0x1a7/0x3b0 [ 846.723157] worker_thread+0x30/0x390 [ 846.723501] ? create_worker+0x1a0/0x1a0 [ 846.723844] kthread+0x112/0x130 [ 846.724184] ? kthread_flush_work_fn+0x10/0x10 [ 846.724535] ret_from_fork+0x35/0x40 Now, check whether that ca is NULL in LINE#1655 to fix the issue. Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06md/md-bitmap: fix dm-raid max_write_behind settingYu Kuai1-1/+1
[ Upstream commit 2afe17794cfed5f80295b1b9facd66e6f65e5002 ] It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-14-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27dm-mirror: fix a tiny race conditionMikulas Patocka1-3/+2
commit 829451beaed6165eb11d7a9fb4e28eb17f489980 upstream. There's a tiny race condition in dm-mirror. The functions queue_bio and write_callback grab a spinlock, add a bio to the list, drop the spinlock and wake up the mirrord thread that processes bios in the list. It may be possible that the mirrord thread processes the bio just after spin_unlock_irqrestore is called, before wakeup_mirrord. This spurious wake-up is normally harmless, however if the device mapper device is unloaded just after the bio was processed, it may be possible that wakeup_mirrord(ms) uses invalid "ms" pointer. Fix this bug by moving wakeup_mirrord inside the spinlock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27dm: free table mempools if not used in __bindBenjamin Marzinski1-4/+4
[ Upstream commit e8819e7f03470c5b468720630d9e4e1d5b99159e ] With request-based dm, the mempools don't need reloading when switching tables, but the unused table mempools are not freed until the active table is finally freed. Free them immediately if they are not needed. Fixes: 29dec90a0f1d9 ("dm: fix bio_set allocation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27dm: don't change md if dm_table_set_restrictions() failsBenjamin Marzinski1-10/+12
[ Upstream commit 9eb7109a5bfc5b8226e9517e9f3cc6d414391884 ] __bind was changing the disk capacity, geometry and mempools of the mapped device before calling dm_table_set_restrictions() which could fail, forcing dm to drop the new table. Failing here would leave the device using the old table but with the wrong capacity and mempools. Move dm_table_set_restrictions() earlier in __bind(). Since it needs the capacity to be set, save the old version and restore it on failure. Fixes: bb37d77239af2 ("dm: introduce zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04dm: fix unconditional IO throttle caused by REQ_PREFLUSHJinliang Zheng1-2/+6
[ Upstream commit 88f7f56d16f568f19e1a695af34a7f4a6ce537a6 ] When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 #8 [ffff800084a2fa60] generic_make_request at ffff800040570138 #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc #18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def2845cc33 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04dm cache: prevent BUG_ON by blocking retries on failed device resumesMing-Hung Tsai1-0/+24
[ Upstream commit 5da692e2262b8f81993baa9592f57d12c2703dea ] A cache device failing to resume due to mapping errors should not be retried, as the failure leaves a partially initialized policy object. Repeating the resume operation risks triggering BUG_ON when reloading cache mappings into the incomplete policy object. Reproduce steps: 1. create a cache metadata consisting of 512 or more cache blocks, with some mappings stored in the first array block of the mapping array. Here we use cache_restore v1.0 to build the metadata. cat <<EOF >> cmeta.xml <superblock uuid="" block_size="128" nr_cache_blocks="512" \ policy="smq" hint_width="4"> <mappings> <mapping cache_block="0" origin_block="0" dirty="false"/> </mappings> </superblock> EOF dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2 dmsetup remove cmeta 2. wipe the second array block of the mapping array to simulate data degradations. mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \ 2>/dev/null | hexdump -e '1/8 "%u\n"') ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \ 2>/dev/null | hexdump -e '1/8 "%u\n"') dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock 3. try bringing up the cache device. The resume is expected to fail due to the broken array block. dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dmsetup create cache --notable dmsetup load cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" dmsetup resume cache 4. try resuming the cache again. An unexpected BUG_ON is triggered while loading cache mappings. dmsetup resume cache Kernel logs: (snip) ------------[ cut here ]------------ kernel BUG at drivers/md/dm-cache-policy-smq.c:752! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 332 Comm: dmsetup Not tainted 6.13.4 #3 RIP: 0010:smq_load_mapping+0x3e5/0x570 Fix by disallowing resume operations for devices that failed the initial attempt. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04dm: restrict dm device size to 2^63-512 bytesMikulas Patocka1-0/+4
[ Upstream commit 45fc728515c14f53f6205789de5bfd72a95af3b8 ] The devices with size >= 2^63 bytes can't be used reliably by userspace because the type off_t is a signed 64-bit integer. Therefore, we limit the maximum size of a device mapper device to 2^63-512 bytes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18dm: add missing unlock on in dm_keyslot_evict()Dan Carpenter1-1/+2
commit 650266ac4c7230c89bcd1307acf5c9c92cfa85e2 upstream. We need to call dm_put_live_table() even if dm_get_live_table() returns NULL. Fixes: 9355a9eb21a5 ("dm: support key eviction from keyslot managers of underlying devices") Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm: fix copying after src array boundariesTudor Ambarus1-1/+1
commit f1aff4bc199cb92c055668caed65505e3b4d2656 upstream. The blammed commit copied to argv the size of the reallocated argv, instead of the size of the old_argv, thus reading and copying from past the old_argv allocated memory. Following BUG_ON was hit: [ 3.038929][ T1] kernel BUG at lib/string_helpers.c:1040! [ 3.039147][ T1] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP ... [ 3.056489][ T1] Call trace: [ 3.056591][ T1] __fortify_panic+0x10/0x18 (P) [ 3.056773][ T1] dm_split_args+0x20c/0x210 [ 3.056942][ T1] dm_table_add_target+0x13c/0x360 [ 3.057132][ T1] table_load+0x110/0x3ac [ 3.057292][ T1] dm_ctl_ioctl+0x424/0x56c [ 3.057457][ T1] __arm64_sys_ioctl+0xa8/0xec [ 3.057634][ T1] invoke_syscall+0x58/0x10c [ 3.057804][ T1] el0_svc_common+0xa8/0xdc [ 3.057970][ T1] do_el0_svc+0x1c/0x28 [ 3.058123][ T1] el0_svc+0x50/0xac [ 3.058266][ T1] el0t_64_sync_handler+0x60/0xc4 [ 3.058452][ T1] el0t_64_sync+0x1b0/0x1b4 [ 3.058620][ T1] Code: f800865e a9bf7bfd 910003fd 941f48aa (d4210000) [ 3.058897][ T1] ---[ end trace 0000000000000000 ]--- [ 3.059083][ T1] Kernel panic - not syncing: Oops - BUG: Fatal exception Fix it by copying the size of src, and not the size of dst, as it was. Fixes: 5a2a6c428190 ("dm: always update the array size in realloc_argv on success") Cc: stable@vger.kernel.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09md: move initialization and destruction of 'io_acct_set' to md.cYu Kuai4-63/+23
commit c567c86b90d4715081adfe5eb812141a5b6b4883 upstream. 'io_acct_set' is only used for raid0 and raid456, prepare to use it for raid1 and raid10, so that io accounting from different levels can be consistent. By the way, follow up patches will also use this io clone mechanism to make sure 'active_io' represents in flight io, not io that is dispatching, so that mddev_suspend will wait for io to be done as designed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm-bufio: don't schedule in atomic contextLongPing Wei1-1/+2
commit a3d8f0a7f5e8b193db509c7191fefeed3533fc44 upstream. A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and try_verify_in_tasklet are enabled. [ 129.444685][ T934] BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2421 [ 129.444723][ T934] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 934, name: kworker/1:4 [ 129.444740][ T934] preempt_count: 201, expected: 0 [ 129.444756][ T934] RCU nest depth: 0, expected: 0 [ 129.444781][ T934] Preemption disabled at: [ 129.444789][ T934] [<ffffffd816231900>] shrink_work+0x21c/0x248 [ 129.445167][ T934] kernel BUG at kernel/sched/walt/walt_debug.c:16! [ 129.445183][ T934] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 129.445204][ T934] Skip md ftrace buffer dump for: 0x1609e0 [ 129.447348][ T934] CPU: 1 PID: 934 Comm: kworker/1:4 Tainted: G W OE 6.6.56-android15-8-o-g6f82312b30b9-debug #1 1400000003000000474e5500b3187743670464e8 [ 129.447362][ T934] Hardware name: Qualcomm Technologies, Inc. Parrot QRD, Alpha-M (DT) [ 129.447373][ T934] Workqueue: dm_bufio_cache shrink_work [ 129.447394][ T934] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 129.447406][ T934] pc : android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug] [ 129.447435][ T934] lr : __traceiter_android_rvh_schedule_bug+0x44/0x6c [ 129.447451][ T934] sp : ffffffc0843dbc90 [ 129.447459][ T934] x29: ffffffc0843dbc90 x28: ffffffffffffffff x27: 0000000000000c8b [ 129.447479][ T934] x26: 0000000000000040 x25: ffffff804b3d6260 x24: ffffffd816232b68 [ 129.447497][ T934] x23: ffffff805171c5b4 x22: 0000000000000000 x21: ffffffd816231900 [ 129.447517][ T934] x20: ffffff80306ba898 x19: 0000000000000000 x18: ffffffc084159030 [ 129.447535][ T934] x17: 00000000d2b5dd1f x16: 00000000d2b5dd1f x15: ffffffd816720358 [ 129.447554][ T934] x14: 0000000000000004 x13: ffffff89ef978000 x12: 0000000000000003 [ 129.447572][ T934] x11: ffffffd817a823c4 x10: 0000000000000202 x9 : 7e779c5735de9400 [ 129.447591][ T934] x8 : ffffffd81560d004 x7 : 205b5d3938373434 x6 : ffffffd8167397c8 [ 129.447610][ T934] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffffffc0843db9e0 [ 129.447629][ T934] x2 : 0000000000002f15 x1 : 0000000000000000 x0 : 0000000000000000 [ 129.447647][ T934] Call trace: [ 129.447655][ T934] android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug 1400000003000000474e550080cce8a8a78606b6] [ 129.447681][ T934] __might_resched+0x190/0x1a8 [ 129.447694][ T934] shrink_work+0x180/0x248 [ 129.447706][ T934] process_one_work+0x260/0x624 [ 129.447718][ T934] worker_thread+0x28c/0x454 [ 129.447729][ T934] kthread+0x118/0x158 [ 129.447742][ T934] ret_from_fork+0x10/0x20 [ 129.447761][ T934] Code: ???????? ???????? ???????? d2b5dd1f (d4210000) [ 129.447772][ T934] ---[ end trace 0000000000000000 ]--- dm_bufio_lock will call spin_lock_bh when try_verify_in_tasklet is enabled, and __scan will be called in atomic context. Fixes: 7cd326747f46 ("dm bufio: remove dm_bufio_cond_resched()") Signed-off-by: LongPing Wei <weilongping@oppo.com> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm: always update the array size in realloc_argv on successBenjamin Marzinski1-2/+3
commit 5a2a6c428190f945c5cbf5791f72dbea83e97f66 upstream. realloc_argv() was only updating the array size if it was called with old_argv already allocated. The first time it was called to create an argv array, it would allocate the array but return the array size as zero. dm_split_args() would think that it couldn't store any arguments in the array and would call realloc_argv() again, causing it to reallocate the initial slots (this time using GPF_KERNEL) and finally return a size. Aside from being wasteful, this could cause deadlocks on targets that need to process messages without starting new IO. Instead, realloc_argv should always update the allocated array size on success. Fixes: a0651926553c ("dm table: don't copy from a NULL pointer in realloc_argv()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm-integrity: fix a warning on invalid table lineMikulas Patocka1-1/+1
commit 0a533c3e4246c29d502a7e0fba0e86d80a906b04 upstream. If we use the 'B' mode and we have an invalit table line, cancel_delayed_work_sync would trigger a warning. This commit avoids the warning. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02md/raid1: Add check for missing source disk in process_checks()Meir Elisha1-10/+16
[ Upstream commit b7c178d9e57c8fd4238ff77263b877f6f16182ba ] During recovery/check operations, the process_checks function loops through available disks to find a 'primary' source with successfully read data. If no suitable source disk is found after checking all possibilities, the 'primary' index will reach conf->raid_disks * 2. Add an explicit check for this condition after the loop. If no source disk was found, print an error message and return early to prevent further processing without a valid primary source. Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com Signed-off-by: Meir Elisha <meir.elisha@volumez.com> Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25md: fix mddev uaf while iterating all_mddevs listYu Kuai1-6/+12
commit 8542870237c3a48ff049b6c5df5f50c8728284fa upstream. While iterating all_mddevs list from md_notify_reboot() and md_exit(), list_for_each_entry_safe is used, and this can race with deletint the next mddev, causing UAF: t1: spin_lock //list_for_each_entry_safe(mddev, n, ...) mddev_get(mddev1) // assume mddev2 is the next entry spin_unlock t2: //remove mddev2 ... mddev_free spin_lock list_del spin_unlock kfree(mddev2) mddev_put(mddev1) spin_lock //continue dereference mddev2->all_mddevs The old helper for_each_mddev() actually grab the reference of mddev2 while holding the lock, to prevent from being freed. This problem can be fixed the same way, however, the code will be complex. Hence switch to use list_for_each_entry, in this case mddev_put() can free the mddev1 and it's not safe as well. Refer to md_seq_show(), also factor out a helper mddev_put_locked() to fix this problem. Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot") Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit") Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org> Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [skip md_seq_show() that is not exist] Signed-off-by: Yu Kuai <yukuai3@huawei.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25md: factor out a helper from mddev_put()Yu Kuai1-14/+18
commit 3d8d32873c7b6d9cec5b40c2ddb8c7c55961694f upstream. There are no functional changes, prepare to simplify md_seq_ops in next patch. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com [minor context conflict] Signed-off-by: Yu Kuai <yukuai3@huawei.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25md/md-bitmap: fix stats collection for external bitmapsZheng Qixing1-3/+2
[ Upstream commit 6ec1f0239485028445d213d91cfee5242f3211ba ] The bitmap_get_stats() function incorrectly returns -ENOENT for external bitmaps. Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). Note: "bitmap_info.external" here refers to a bitmap stored in a separate file (bitmap_file), not to external metadata. Fixes: 8d28d0ddb986 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25md/raid10: fix missing discard IO accountingYu Kuai1-0/+1
[ Upstream commit d05af90d6218e9c8f1c2026990c3f53c1b41bfb0 ] md_account_bio() is not called from raid10_handle_discard(), now that we handle bitmap inside md_account_bio(), also fix missing bitmap_startwrite for discard. Test whole disk discard for 20G raid10: Before: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 48.00 16.00 0.00 0.00 5.42 341.33 After: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 68.00 20462.00 0.00 0.00 2.65 308133.65 Link: https://lore.kernel.org/linux-raid/20250325015746.3195035-1-yukuai1@huaweicloud.com Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25dm-verity: fix prefetch-vs-suspend raceMikulas Patocka1-0/+8
commit 2de510fccbca3d1906b55f4be5f1de83fa2424ef upstream. There's a possible race condition in dm-verity - the prefetch work item may race with suspend and it is possible that prefetch continues to run while the device is suspended. Fix this by calling flush_workqueue and dm_bufio_client_reset in the postsuspend hook. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>