summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
9 daysmd/raid1: fix data lost for writemostly rdevYu Kuai1-1/+1
[ Upstream commit 93dec51e716db88f32d770dc9ab268964fff320b ] If writemostly is enabled, alloc_behind_master_bio() will allocate a new bio for rdev, with bi_opf set to 0. Later, raid1_write_request() will clone from this bio, hence bi_opf is still 0 for the cloned bio. Submit this cloned bio will end up to be read, causing write data lost. Fix this problem by inheriting bi_opf from original bio for behind_mast_bio. Fixes: e879a0d9cb08 ("md/raid1,raid10: don't ignore IO flags") Reported-and-tested-by: Ian Dall <ian@beware.dropbear.id.au> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220507 Link: https://lore.kernel.org/linux-raid/20250903014140.3690499-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysmd: prevent incorrect update of resync/recovery offsetLi Nan1-0/+5
[ Upstream commit 7202082b7b7a256d04ec96131c7f859df0a79f64 ] In md_do_sync(), when md_sync_action returns ACTION_FROZEN, subsequent call to md_sync_position() will return MaxSector. This causes 'curr_resync' (and later 'recovery_offset') to be set to MaxSector too, which incorrectly signals that recovery/resync has completed, even though disk data has not actually been updated. To fix this issue, skip updating any offset values when the sync action is FROZEN. The same holds true for IDLE. Fixes: 7d9f107a4e94 ("md: use new helpers in md_do_sync()") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysmd/md-bitmap: fix wrong bitmap_limit for clustermd when write sbSu Yue1-3/+3
commit 6130825f34d41718c98a9b1504a79a23e379701e upstream. In clustermd, separate write-intent-bitmaps are used for each cluster node: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | So in node 1, pg_index in __write_sb_page() could equal to bitmap->storage.file_pages. Then bitmap_limit will be calculated to 0. md_super_write() will be called with 0 size. That means the first 4k sb area of node 1 will never be updated through filemap_write_page(). This bug causes hang of mdadm/clustermd_tests/01r1_Grow_resize. Here use (pg_index % bitmap->storage.file_pages) to make calculation of bitmap_limit correct. Fixes: ab99a87542f1 ("md/md-bitmap: fix writing non bitmap pages") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Link: https://lore.kernel.org/linux-raid/20250303033918.32136-1-glass.su@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysmd/raid1,raid10: strip REQ_NOWAIT from member biosZheng Qixing2-1/+4
commit 5fa31c49928139fa948f078b094d80f12ed83f5f upstream. RAID layers don't implement proper non-blocking semantics for REQ_NOWAIT, making the flag potentially misleading when propagated to member disks. This patch clear REQ_NOWAIT from cloned bios in raid1/raid10. Retain original bio's REQ_NOWAIT flag for upper layer error handling. Maybe we can implement non-blocking I/O handling mechanisms within RAID in future work. Fixes: 9f346f7d4ea7 ("md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250702102341.1969154-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysmd/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAITYu Kuai3-14/+26
commit 9f346f7d4ea73692b82f5102ca8698e4040469ea upstream. IO with REQ_RAHEAD or REQ_NOWAIT can fail early, even if the storage medium is fine, hence record badblocks or remove the disk from array does not make sense. This problem if found by lvm2 test lvcreate-large-raid, where dm-zero will fail read ahead IO directly. Fixes: e879a0d9cb08 ("md/raid1,raid10: don't ignore IO flags") Reported-and-tested-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/34fa755d-62c8-4588-8ee1-33cb1249bdf2@redhat.com/ Link: https://lore.kernel.org/linux-raid/20250527081407.3004055-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysmd/raid1,raid10: don't ignore IO flagsYu Kuai2-11/+0
commit e879a0d9cb086c8e52ce6c04e5bfa63825a6213c upstream. If blk-wbt is enabled by default, it's found that raid write performance is quite bad because all IO are throttled by wbt of underlying disks, due to flag REQ_IDLE is ignored. And turns out this behaviour exist since blk-wbt is introduced. Other than REQ_IDLE, other flags should not be ignored as well, for example REQ_META can be set for filesystems, clearing it can cause priority reverse problems; And REQ_NOWAIT should not be cleared as well, because io will wait instead of failing directly in underlying disks. Fix those problems by keep IO flags from master bio. Fises: f51d46d0e7cb ("md: add support for REQ_NOWAIT") Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Fixes: 5404bc7a87b9 ("[PATCH] Allow file systems to differentiate between data and meta reads") Link: https://lore.kernel.org/linux-raid/20250227121657.832356-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> [ Harshit: Resolve conflicts due to missing commit: f2a38abf5f1c ("md/raid1: Atomic write support") and commit: a1d9b4fd42d9 ("md/raid10: Atomic write support") in 6.12.y, we don't have Atomic writes feature in 6.12.y ] Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28dm: Check for forbidden splitting of zone write operationsDamien Le Moal1-4/+13
commit 409f9287dab3b53bffe8d28d883a529028aa6a42 upstream. DM targets must not split zone append and write operations using dm_accept_partial_bio() as doing so is forbidden for zone append BIOs, breaks zone append emulation using regular write BIOs and potentially creates deadlock situations with queue freeze operations. Modify dm_accept_partial_bio() to add missing BUG_ON() checks for all these cases, that is, check that the BIO is a write or write zeroes operation. This change packs all the zone related checks together under a static_branch_unlikely(&zoned_enabled) and done only if the target is a zoned device. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28dm: dm-crypt: Do not partially accept write BIOs with zoned targetsDamien Le Moal1-10/+39
commit e549663849e5bb3b985dc2d293069f0d9747ae72 upstream. Read and write operations issued to a dm-crypt target may be split according to the dm-crypt internal limits defined by the max_read_size and max_write_size module parameters (default is 128 KB). The intent is to improve processing time of large BIOs by splitting them into smaller operations that can be parallelized on different CPUs. For zoned dm-crypt targets, this BIO splitting is still done but without the parallel execution to ensure that the issuing order of write operations to the underlying devices remains sequential. However, the splitting itself causes other problems: 1) Since dm-crypt relies on the block layer zone write plugging to handle zone append emulation using regular write operations, the reminder of a split write BIO will always be plugged into the target zone write plugged. Once the on-going write BIO finishes, this reminder BIO is unplugged and issued from the zone write plug work. If this reminder BIO itself needs to be split, the reminder will be re-issued and plugged again, but that causes a call to a blk_queue_enter(), which may block if a queue freeze operation was initiated. This results in a deadlock as DM submission still holds BIOs that the queue freeze side is waiting for. 2) dm-crypt relies on the emulation done by the block layer using regular write operations for processing zone append operations. This still requires to properly return the written sector as the BIO sector of the original BIO. However, this can be done correctly only and only if there is a single clone BIO used for processing the original zone append operation issued by the user. If the size of a zone append operation is larger than dm-crypt max_write_size, then the orginal BIO will be split and processed as a chain of regular write operations. Such chaining result in an incorrect written sector being returned to the zone append issuer using the original BIO sector. This in turn results in file system data corruptions using xfs or btrfs. Fix this by modifying get_max_request_size() to always return the size of the BIO to avoid it being split with dm_accpet_partial_bio() in crypt_map(). get_max_request_size() is renamed to get_max_request_sectors() to clarify the unit of the value returned and its interface is changed to take a struct dm_target pointer and a pointer to the struct bio being processed. In addition to this change, to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt device max_hw_sectors limit to be at most BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture). This forces DM core to split write BIOs before passing them to crypt_map(), and thus guaranteeing that dm-crypt can always accept an entire write BIO without needing to split it. This change does not have any effect on the read path of dm-crypt. Read operations can still be split and the BIO fragments processed in parallel. There is also no impact on the performance of the write path given that all zone write BIOs were already processed inline instead of in parallel. This change also does not affect in any way regular dm-crypt block devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20dm: split write BIOs on zone boundaries when zone append is not emulatedShin'ichiro Kawasaki1-11/+7
commit 675f940576351bb049f5677615140b9d0a7712d0 upstream. Commit 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") updates the device-mapper driver to perform splits for the write BIOs. However, it did not address the cases where DM targets do not emulate zone append, such as in the cases of dm-linear or dm-flakey. For these targets, when the write BIOs span across zone boundaries, they trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in blk_zone_wplug_handle_write(). This results in I/O errors. The errors are reproduced by running blktests test case zbd/004 using zoned dm-linear or dm-flakey devices. To avoid the I/O errors, handle the write BIOs regardless whether DM targets emulate zone append or not, so that all write BIOs are split at zone boundaries. For that purpose, drop the check for zone append emulation in dm_zone_bio_needs_split(). Its argument 'md' is no longer used then drop it also. Fixes: 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250717103539.37279-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20dm: Always split write BIOs to zoned device limitsDamien Le Moal1-7/+22
commit 2df7168717b7d2d32bcf017c68be16e4aae9dd13 upstream. Any zoned DM target that requires zone append emulation will use the block layer zone write plugging. In such case, DM target drivers must not split BIOs using dm_accept_partial_bio() as doing so can potentially lead to deadlocks with queue freeze operations. Regular write operations used to emulate zone append operations also cannot be split by the target driver as that would result in an invalid writen sector value return using the BIO sector. In order for zoned DM target drivers to avoid such incorrect BIO splitting, we must ensure that large BIOs are split before being passed to the map() function of the target, thus guaranteeing that the limits for the mapped device are not exceeded. dm-crypt and dm-flakey are the only target drivers supporting zoned devices and using dm_accept_partial_bio(). In the case of dm-crypt, this function is used to split BIOs to the internal max_write_size limit (which will be suppressed in a different patch). However, since crypt_alloc_buffer() uses a bioset allowing only up to BIO_MAX_VECS (256) vectors in a BIO. The dm-crypt device max_segments limit, which is not set and so default to BLK_MAX_SEGMENTS (128), must thus be respected and write BIOs split accordingly. In the case of dm-flakey, since zone append emulation is not required, the block layer zone write plugging is not used and no splitting of BIOs required. Modify the function dm_zone_bio_needs_split() to use the block layer helper function bio_needs_zone_write_plugging() to force a call to bio_split_to_limits() in dm_split_and_process_bio(). This allows DM target drivers to avoid using dm_accept_partial_bio() for write operations on zoned DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250625093327.548866-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20block: Introduce bio_needs_zone_write_plugging()Damien Le Moal1-1/+3
commit f70291411ba20d50008db90a6f0731efac27872c upstream. In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and used in to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20dm-table: fix checking for rq stackable devicesBenjamin Marzinski1-5/+5
[ Upstream commit 8ca719b81987be690f197e82fdb030580c0a07f3 ] Due to the semantics of iterate_devices(), the current code allows a request-based dm table as long as it includes one request-stackable device. It is supposed to only allow tables where there are no non-request-stackable devices. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20dm-mpath: don't print the "loaded" message if registering failsMikulas Patocka4-4/+12
[ Upstream commit 6e11952a6abc4641dc8ae63f01b318b31b44e8db ] If dm_register_path_selector, don't print the "version X loaded" message. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20md: dm-zoned-target: Initialize return variable r to avoid uninitialized usePurva Yeshi1-1/+1
[ Upstream commit 487767bff572d46f7c37ad846c4078f6d6c9cc55 ] Fix Smatch-detected error: drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices() error: uninitialized symbol 'r'. Smatch detects a possible use of the uninitialized variable 'r' in dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is skipped and 'r' is returned without being set, leading to undefined behavior. Initialize 'r' to 0 before the loop. This ensures that if there are no devices to iterate over, the function still returns a defined value. Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20md/raid10: set chunk_sectors limitJohn Garry1-0/+1
[ Upstream commit 7ef50c4c6a9c36fa3ea6f1681a80c0bf9a797345 ] Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20dm-stripe: limit chunk_sectors to the stripe sizeJohn Garry1-0/+1
[ Upstream commit 5fb9d4341b782a80eefa0dc1664d131ac3c8885d ] Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Setting chunk_sectors limit in this way overrides the stacked limit already calculated based on the bottom device limits. This is ok, as when any bios are sent to the bottom devices, the block layer will still respect the bottom device chunk_sectors. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15Revert "bcache: remove heap-related macros and switch to generic min_heap"Kuan-Wei Chiu11-263/+217
commit 48fd7ebe00c1cdc782b42576548b25185902f64c upstream. This reverts commit 866898efbb25bb44fd42848318e46db9e785973a. The generic bottom-up min_heap implementation causes performance regression in invalidate_buckets_lru(), a hot path in bcache. Before the cache is fully populated, new_bucket_prio() often returns zero, leading to many equal comparisons. In such cases, bottom-up sift_down performs up to 2 * log2(n) comparisons, while the original top-down approach completes with just O() comparisons, resulting in a measurable performance gap. The performance degradation is further worsened by the non-inlined min_heap API functions introduced in commit 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions"), adding function call overhead to this critical path. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. This revert aims to restore bcache's original low-latency behavior. Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-3-visitorckw@gmail.com Fixes: 866898efbb25 ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: 92a8b224b833 ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15md/md-cluster: handle REMOVE message earlierHeming Zhao1-3/+6
[ Upstream commit 948b1fe12005d39e2b49087b50e5ee55c9a8f76f ] Commit a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add) Consider a 2-node cluster: - node1 set faulty & remove command on a disk. - node2 must correctly update the array metadata. Before a1fd37f97808, on node1, the delay between msg:METADATA_UPDATED (triggered by faulty) and msg:REMOVE was sufficient for node2 to reload the disk info (written by node1). After a1fd37f97808, node1 no longer waits between faulty and remove, causing it to send msg:REMOVE while node2 is still reloading disk info. This often results in node2 failing to remove the faulty disk. == how to trigger == set up a 2-node cluster (node1 & node2) with disks vdc & vdd. on node1: mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc check array status on both nodes with "mdadm -D /dev/md0". node1 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd node2 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd 0 254 32 - faulty /dev/vdc Fixes: a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/linux-raid/20250728042145.9989-1-heming.zhao@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24dm-bufio: fix sched in atomic contextSheng Yong1-1/+5
commit b1bf1a782fdf5c482215c0c661b5da98b8e75773 upstream. If "try_verify_in_tasklet" is set for dm-verity, DM_BUFIO_CLIENT_NO_SLEEP is enabled for dm-bufio. However, when bufio tries to evict buffers, there is a chance to trigger scheduling in spin_lock_bh, the following warning is hit: BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2745 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 123, name: kworker/2:2 preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by kworker/2:2/123: #0: ffff88800a2d1548 ((wq_completion)dm_bufio_cache){....}-{0:0}, at: process_one_work+0xe46/0x1970 #1: ffffc90000d97d20 ((work_completion)(&dm_bufio_replacement_work)){....}-{0:0}, at: process_one_work+0x763/0x1970 #2: ffffffff8555b528 (dm_bufio_clients_lock){....}-{3:3}, at: do_global_cleanup+0x1ce/0x710 #3: ffff88801d5820b8 (&c->spinlock){....}-{2:2}, at: do_global_cleanup+0x2a5/0x710 Preemption disabled at: [<0000000000000000>] 0x0 CPU: 2 UID: 0 PID: 123 Comm: kworker/2:2 Not tainted 6.16.0-rc3-g90548c634bd0 #305 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: dm_bufio_cache do_global_cleanup Call Trace: <TASK> dump_stack_lvl+0x53/0x70 __might_resched+0x360/0x4e0 do_global_cleanup+0x2f5/0x710 process_one_work+0x7db/0x1970 worker_thread+0x518/0xea0 kthread+0x359/0x690 ret_from_fork+0xf3/0x1b0 ret_from_fork_asm+0x1a/0x30 </TASK> That can be reproduced by: veritysetup format --data-block-size=4096 --hash-block-size=4096 /dev/vda /dev/vdb SIZE=$(blockdev --getsz /dev/vda) dmsetup create myverity -r --table "0 $SIZE verity 1 /dev/vda /dev/vdb 4096 4096 <data_blocks> 1 sha256 <root_hash> <salt> 1 try_verify_in_tasklet" mount /dev/dm-0 /mnt -o ro echo 102400 > /sys/module/dm_bufio/parameters/max_cache_size_bytes [read files in /mnt] Cc: stable@vger.kernel.org # v6.4+ Fixes: 450e8dee51aa ("dm bufio: improve concurrent IO performance") Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17raid10: cleanup memleak at raid10_make_requestNigel Croxon1-2/+8
[ Upstream commit 43806c3d5b9bb7d74ba4e33a6a8a41ac988bde24 ] If raid10_read_request or raid10_write_request registers a new request and the REQ_NOWAIT flag is set, the code does not free the malloc from the mempool. unreferenced object 0xffff8884802c3200 (size 192): comm "fio", pid 9197, jiffies 4298078271 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00 .........A...... 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc c1a049a2): __kmalloc+0x2bb/0x450 mempool_alloc+0x11b/0x320 raid10_make_request+0x19e/0x650 [raid10] md_handle_request+0x3b3/0x9e0 __submit_bio+0x394/0x560 __submit_bio_noacct+0x145/0x530 submit_bio_noacct_nocheck+0x682/0x830 __blkdev_direct_IO_async+0x4dc/0x6b0 blkdev_read_iter+0x1e5/0x3b0 __io_read+0x230/0x1110 io_read+0x13/0x30 io_issue_sqe+0x134/0x1180 io_submit_sqes+0x48c/0xe90 __do_sys_io_uring_enter+0x574/0x8b0 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x76/0x7e V4: changing backing tree to see if CKI tests will pass. The patch code has not changed between any versions. Fixes: c9aa889b035f ("md: raid10 add nowait support") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17md/raid1: Fix stack memory use after return in raid1_reshapeWang Jinchao1-0/+1
[ Upstream commit d67ed2ccd2d1dcfda9292c0ea8697a9d0f2f0d98 ] In the raid1_reshape function, newpool is allocated on the stack and assigned to conf->r1bio_pool. This results in conf->r1bio_pool.wait.head pointing to a stack address. Accessing this address later can lead to a kernel panic. Example access path: raid1_reshape() { // newpool is on the stack mempool_t newpool, oldpool; // initialize newpool.wait.head to stack address mempool_init(&newpool, ...); conf->r1bio_pool = newpool; } raid1_read_request() or raid1_write_request() { alloc_r1bio() { mempool_alloc() { // if pool->alloc fails remove_element() { --pool->curr_nr; } } } } mempool_free() { if (pool->curr_nr < pool->min_nr) { // pool->wait.head is a stack address // wake_up() will try to access this invalid address // which leads to a kernel panic return; wake_up(&pool->wait); } } Fix: reinit conf->r1bio_pool.wait after assigning newpool. Fixes: afeee514ce7f ("md: convert to bioset_init()/mempool_init()") Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17md/md-bitmap: fix GPF in bitmap_get_stats()Håkon Bugge1-2/+1
commit c17fb542dbd1db745c9feac15617056506dd7195 upstream. The commit message of commit 6ec1f0239485 ("md/md-bitmap: fix stats collection for external bitmaps") states: Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). But, the code does not adhere to the above, as it does only check for a valid super-block for "internal" bitmaps. Hence, we observe: Oops: GPF, probably for non-canonical address 0x1cd66f1f40000028 RIP: 0010:bitmap_get_stats+0x45/0xd0 Call Trace: seq_read_iter+0x2b9/0x46a seq_read+0x12f/0x180 proc_reg_read+0x57/0xb0 vfs_read+0xf6/0x380 ksys_read+0x6d/0xf0 do_syscall_64+0x8c/0x1b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e We fix this by checking the existence of a super-block for both the internal and external case. Fixes: 6ec1f0239485 ("md/md-bitmap: fix stats collection for external bitmaps") Cc: stable@vger.kernel.org Reported-by: Gerald Gibson <gerald.gibson@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://lore.kernel.org/linux-raid/20250702091035.2061312-1-haakon.bugge@oracle.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06dm-raid: fix variable in journal device checkHeinz Mauelshagen1-1/+1
commit db53805156f1e0aa6d059c0d3f9ac660d4ef3eb4 upstream. Replace "rdev" with correct loop variable name "r". Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org Fixes: 63c32ed4afc2 ("dm raid: add raid4/5/6 journaling support") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06bcache: fix NULL pointer in cache_set_flush()Linggang Zeng1-1/+6
[ Upstream commit 1e46ed947ec658f89f1a910d880cd05e42d3763e ] 1. LINE#1794 - LINE#1887 is some codes about function of bch_cache_set_alloc(). 2. LINE#2078 - LINE#2142 is some codes about function of register_cache_set(). 3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098. 1794 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) 1795 { ... 1860 if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) || 1861 mempool_init_slab_pool(&c->search, 32, bch_search_cache) || 1862 mempool_init_kmalloc_pool(&c->bio_meta, 2, 1863 sizeof(struct bbio) + sizeof(struct bio_vec) * 1864 bucket_pages(c)) || 1865 mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) || 1866 bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio), 1867 BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) || 1868 !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) || 1869 !(c->moving_gc_wq = alloc_workqueue("bcache_gc", 1870 WQ_MEM_RECLAIM, 0)) || 1871 bch_journal_alloc(c) || 1872 bch_btree_cache_alloc(c) || 1873 bch_open_buckets_alloc(c) || 1874 bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages))) 1875 goto err; ^^^^^^^^ 1876 ... 1883 return c; 1884 err: 1885 bch_cache_set_unregister(c); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1886 return NULL; 1887 } ... 2078 static const char *register_cache_set(struct cache *ca) 2079 { ... 2098 c = bch_cache_set_alloc(&ca->sb); 2099 if (!c) 2100 return err; ^^^^^^^^^^ ... 2128 ca->set = c; 2129 ca->set->cache[ca->sb.nr_this_dev] = ca; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... 2138 return NULL; 2139 err: 2140 bch_cache_set_unregister(c); 2141 return err; 2142 } (1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and call bch_cache_set_unregister()(LINE#1885). (2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return. (3) As (2) has returned, LINE#2128 - LINE#2129 would do *not* give the value to c->cache[], it means that c->cache[] is NULL. LINE#1624 - LINE#1665 is some codes about function of cache_set_flush(). As (1), in LINE#1885 call bch_cache_set_unregister() ---> bch_cache_set_stop() ---> closure_queue() -.-> cache_set_flush() (as below LINE#1624) 1624 static void cache_set_flush(struct closure *cl) 1625 { ... 1654 for_each_cache(ca, c, i) 1655 if (ca->alloc_thread) ^^ 1656 kthread_stop(ca->alloc_thread); ... 1665 } (4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the kernel crash occurred as below: [ 846.712887] bcache: register_cache() error drbd6: cannot allocate memory [ 846.713242] bcache: register_bcache() error : failed to register device [ 846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered [ 846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8 [ 846.714790] PGD 0 P4D 0 [ 846.715129] Oops: 0000 [#1] SMP PTI [ 846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1 [ 846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018 [ 846.716451] Workqueue: events cache_set_flush [bcache] [ 846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache] [ 846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7 [ 846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202 [ 846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000 [ 846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665 [ 846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700 [ 846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00 [ 846.720132] FS: 0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000 [ 846.720726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0 [ 846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 846.722131] PKRU: 55555554 [ 846.722467] Call Trace: [ 846.722814] process_one_work+0x1a7/0x3b0 [ 846.723157] worker_thread+0x30/0x390 [ 846.723501] ? create_worker+0x1a0/0x1a0 [ 846.723844] kthread+0x112/0x130 [ 846.724184] ? kthread_flush_work_fn+0x10/0x10 [ 846.724535] ret_from_fork+0x35/0x40 Now, check whether that ca is NULL in LINE#1655 to fix the issue. Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06md/md-bitmap: fix dm-raid max_write_behind settingYu Kuai1-1/+1
[ Upstream commit 2afe17794cfed5f80295b1b9facd66e6f65e5002 ] It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-14-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06dm vdo indexer: don't read request structure after enqueuingMatthew Sakai1-11/+13
[ Upstream commit 3da732687d72078e52cc7f334a482383e84ca156 ] The function get_volume_page_protected may place a request on a queue for another thread to process asynchronously. When this happens, the volume should not read the request from the original thread. This can not currently cause problems, due to the way request processing is handled, but it is not safe in general. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27dm: lock limits when reading themMikulas Patocka1-1/+7
commit abb4cf2f4c1c1b637cad04d726f2e13fd3051e03 upstream. Lock queue limits when reading them, so that we don't read halfway modified values. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27dm-verity: fix a memory leak if some arguments are specified multiple timesMikulas Patocka3-5/+24
commit 66be40a14e496689e1f0add50118408e22c96169 upstream. If some of the arguments "check_at_most_once", "ignore_zero_blocks", "use_fec_from_device", "root_hash_sig_key_desc" were specified more than once on the target line, a memory leak would happen. This commit fixes the memory leak. It also fixes error handling in verity_verify_sig_parse_opt_args. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27dm-mirror: fix a tiny race conditionMikulas Patocka1-3/+2
commit 829451beaed6165eb11d7a9fb4e28eb17f489980 upstream. There's a tiny race condition in dm-mirror. The functions queue_bio and write_callback grab a spinlock, add a bio to the list, drop the spinlock and wake up the mirrord thread that processes bios in the list. It may be possible that the mirrord thread processes the bio just after spin_unlock_irqrestore is called, before wakeup_mirrord. This spurious wake-up is normally harmless, however if the device mapper device is unloaded just after the bio was processed, it may be possible that wakeup_mirrord(ms) uses invalid "ms" pointer. Fix this bug by moving wakeup_mirrord inside the spinlock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19dm-flakey: make corrupting read bios workBenjamin Marzinski1-26/+28
[ Upstream commit 13e79076c89f6e96a6cca8f6df38b40d025907b4 ] dm-flakey corrupts the read bios in the endio function. However, the corrupt_bio_* functions checked bio_has_data() to see if there was data to corrupt. Since this was the endio function, there was no data left to complete, so bio_has_data() was always false. Fix this by saving a copy of the bio's bi_iter in flakey_map(), and using this to initialize the iter for corrupting the read bios. This patch also skips cloning the bio for write bios with no data. Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Fixes: a3998799fb4df ("dm flakey: add corrupt_bio_byte feature") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19dm-flakey: error all IOs when num_features is absentBenjamin Marzinski1-8/+8
[ Upstream commit 40ed054f39bc99eac09871c33198e501f4acdf24 ] dm-flakey would error all IOs if num_features was 0, but if it was absent, dm-flakey would never error any IO. Fix this so that no num_features works the same as num_features set to 0. Fixes: aa7d7bc99fed7 ("dm flakey: add an "error_reads" option") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19dm: fix dm_blk_report_zonesBenjamin Marzinski2-8/+18
[ Upstream commit 37f53a2c60d03743e0eacf7a0c01c279776fef4e ] If dm_get_live_table() returned NULL, dm_put_live_table() was never called. Also, it is possible that md->zone_revalidate_map will change while calling this function. Only read it once, so that we are always using the same value. Otherwise we might miss a call to dm_put_live_table(). Finally, while md->zone_revalidate_map is set and a process is calling blk_revalidate_disk_zones() to set up the zone append emulation resources, it is possible that another process, perhaps triggered by blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If blk_revalidate_disk_zones() fails, these resources can be freed while the other process is still using them, causing a use-after-free error. blk_revalidate_disk_zones() will only ever be called when initially setting up the zone append emulation resources, such as when setting up a zoned dm-crypt table for the first time. Further table swaps will not set md->zone_revalidate_map or call blk_revalidate_disk_zones(). However it must be called using the new table (referenced by md->zone_revalidate_map) and the new queue limits while the DM device is suspended. dm_blk_report_zones() needs some way to distinguish between a call from blk_revalidate_disk_zones(), which must be allowed to use md->zone_revalidate_map to access this not yet activated table, and all other calls to dm_blk_report_zones(), which should not be allowed while the device is suspended and cannot use md->zone_revalidate_map, since the zone resources might be freed by the process currently calling blk_revalidate_disk_zones(). Solve this by tracking the process that sets md->zone_revalidate_map in dm_revalidate_zones() and only allowing that process to make use of it in dm_blk_report_zones(). Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19dm: free table mempools if not used in __bindBenjamin Marzinski1-4/+4
[ Upstream commit e8819e7f03470c5b468720630d9e4e1d5b99159e ] With request-based dm, the mempools don't need reloading when switching tables, but the unused table mempools are not freed until the active table is finally freed. Free them immediately if they are not needed. Fixes: 29dec90a0f1d9 ("dm: fix bio_set allocation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19dm: don't change md if dm_table_set_restrictions() failsBenjamin Marzinski1-10/+12
[ Upstream commit 9eb7109a5bfc5b8226e9517e9f3cc6d414391884 ] __bind was changing the disk capacity, geometry and mempools of the mapped device before calling dm_table_set_restrictions() which could fail, forcing dm to drop the new table. Failing here would leave the device using the old table but with the wrong capacity and mempools. Move dm_table_set_restrictions() earlier in __bind(). Since it needs the capacity to be set, save the old version and restore it on failure. Fixes: bb37d77239af2 ("dm: introduce zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dm vdo: use a short static string for thread name prefixMatthew Sakai1-10/+1
[ Upstream commit 3280c9313c9adce01550cc9f00edfb1dc7c744da ] Also remove MODULE_NAME and a BUG_ON check, both unneeded. This fixes a warning about string truncation in snprintf that will never happen in practice: drivers/md/dm-vdo/vdo.c: In function ‘vdo_make’: drivers/md/dm-vdo/vdo.c:564:5: error: ‘%s’ directive output may be truncated writing up to 55 bytes into a region of size 16 [-Werror=format-truncation=] "%s%u", MODULE_NAME, instance); ^~ drivers/md/dm-vdo/vdo.c:563:2: note: ‘snprintf’ output between 2 and 66 bytes into a destination of size 16 snprintf(vdo->thread_name_prefix, sizeof(vdo->thread_name_prefix), ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "%s%u", MODULE_NAME, instance); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reported-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dm vdo indexer: prevent unterminated string warningChung Chung1-2/+3
[ Upstream commit f4e99b846c90163d350c69d6581ac38dd5818eb8 ] Fix array initialization that triggers a warning: error: initializer-string for array of ‘unsigned char’ is too long [-Werror=unterminated-string-initialization] Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dm: fix unconditional IO throttle caused by REQ_PREFLUSHJinliang Zheng1-2/+6
[ Upstream commit 88f7f56d16f568f19e1a695af34a7f4a6ce537a6 ] When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 #8 [ffff800084a2fa60] generic_make_request at ffff800040570138 #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc #18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def2845cc33 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dm cache: prevent BUG_ON by blocking retries on failed device resumesMing-Hung Tsai1-0/+24
[ Upstream commit 5da692e2262b8f81993baa9592f57d12c2703dea ] A cache device failing to resume due to mapping errors should not be retried, as the failure leaves a partially initialized policy object. Repeating the resume operation risks triggering BUG_ON when reloading cache mappings into the incomplete policy object. Reproduce steps: 1. create a cache metadata consisting of 512 or more cache blocks, with some mappings stored in the first array block of the mapping array. Here we use cache_restore v1.0 to build the metadata. cat <<EOF >> cmeta.xml <superblock uuid="" block_size="128" nr_cache_blocks="512" \ policy="smq" hint_width="4"> <mappings> <mapping cache_block="0" origin_block="0" dirty="false"/> </mappings> </superblock> EOF dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2 dmsetup remove cmeta 2. wipe the second array block of the mapping array to simulate data degradations. mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \ 2>/dev/null | hexdump -e '1/8 "%u\n"') ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \ 2>/dev/null | hexdump -e '1/8 "%u\n"') dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock 3. try bringing up the cache device. The resume is expected to fail due to the broken array block. dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dmsetup create cache --notable dmsetup load cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" dmsetup resume cache 4. try resuming the cache again. An unexpected BUG_ON is triggered while loading cache mappings. dmsetup resume cache Kernel logs: (snip) ------------[ cut here ]------------ kernel BUG at drivers/md/dm-cache-policy-smq.c:752! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 332 Comm: dmsetup Not tainted 6.13.4 #3 RIP: 0010:smq_load_mapping+0x3e5/0x570 Fix by disallowing resume operations for devices that failed the initial attempt. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dm: restrict dm device size to 2^63-512 bytesMikulas Patocka1-0/+4
[ Upstream commit 45fc728515c14f53f6205789de5bfd72a95af3b8 ] The devices with size >= 2^63 bytes can't be used reliably by userspace because the type off_t is a signed 64-bit integer. Therefore, we limit the maximum size of a device mapper device to 2^63-512 bytes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-18dm: add missing unlock on in dm_keyslot_evict()Dan Carpenter1-1/+2
commit 650266ac4c7230c89bcd1307acf5c9c92cfa85e2 upstream. We need to call dm_put_live_table() even if dm_get_live_table() returns NULL. Fixes: 9355a9eb21a5 ("dm: support key eviction from keyslot managers of underlying devices") Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm: fix copying after src array boundariesTudor Ambarus1-1/+1
commit f1aff4bc199cb92c055668caed65505e3b4d2656 upstream. The blammed commit copied to argv the size of the reallocated argv, instead of the size of the old_argv, thus reading and copying from past the old_argv allocated memory. Following BUG_ON was hit: [ 3.038929][ T1] kernel BUG at lib/string_helpers.c:1040! [ 3.039147][ T1] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP ... [ 3.056489][ T1] Call trace: [ 3.056591][ T1] __fortify_panic+0x10/0x18 (P) [ 3.056773][ T1] dm_split_args+0x20c/0x210 [ 3.056942][ T1] dm_table_add_target+0x13c/0x360 [ 3.057132][ T1] table_load+0x110/0x3ac [ 3.057292][ T1] dm_ctl_ioctl+0x424/0x56c [ 3.057457][ T1] __arm64_sys_ioctl+0xa8/0xec [ 3.057634][ T1] invoke_syscall+0x58/0x10c [ 3.057804][ T1] el0_svc_common+0xa8/0xdc [ 3.057970][ T1] do_el0_svc+0x1c/0x28 [ 3.058123][ T1] el0_svc+0x50/0xac [ 3.058266][ T1] el0t_64_sync_handler+0x60/0xc4 [ 3.058452][ T1] el0t_64_sync+0x1b0/0x1b4 [ 3.058620][ T1] Code: f800865e a9bf7bfd 910003fd 941f48aa (d4210000) [ 3.058897][ T1] ---[ end trace 0000000000000000 ]--- [ 3.059083][ T1] Kernel panic - not syncing: Oops - BUG: Fatal exception Fix it by copying the size of src, and not the size of dst, as it was. Fixes: 5a2a6c428190 ("dm: always update the array size in realloc_argv on success") Cc: stable@vger.kernel.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm: always update the array size in realloc_argv on successBenjamin Marzinski1-2/+3
commit 5a2a6c428190f945c5cbf5791f72dbea83e97f66 upstream. realloc_argv() was only updating the array size if it was called with old_argv already allocated. The first time it was called to create an argv array, it would allocate the array but return the array size as zero. dm_split_args() would think that it couldn't store any arguments in the array and would call realloc_argv() again, causing it to reallocate the initial slots (this time using GPF_KERNEL) and finally return a size. Aside from being wasteful, this could cause deadlocks on targets that need to process messages without starting new IO. Instead, realloc_argv should always update the allocated array size on success. Fixes: a0651926553c ("dm table: don't copy from a NULL pointer in realloc_argv()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm-integrity: fix a warning on invalid table lineMikulas Patocka1-1/+1
commit 0a533c3e4246c29d502a7e0fba0e86d80a906b04 upstream. If we use the 'B' mode and we have an invalit table line, cancel_delayed_work_sync would trigger a warning. This commit avoids the warning. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09dm-bufio: don't schedule in atomic contextLongPing Wei1-1/+8
commit a3d8f0a7f5e8b193db509c7191fefeed3533fc44 upstream. A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and try_verify_in_tasklet are enabled. [ 129.444685][ T934] BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2421 [ 129.444723][ T934] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 934, name: kworker/1:4 [ 129.444740][ T934] preempt_count: 201, expected: 0 [ 129.444756][ T934] RCU nest depth: 0, expected: 0 [ 129.444781][ T934] Preemption disabled at: [ 129.444789][ T934] [<ffffffd816231900>] shrink_work+0x21c/0x248 [ 129.445167][ T934] kernel BUG at kernel/sched/walt/walt_debug.c:16! [ 129.445183][ T934] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 129.445204][ T934] Skip md ftrace buffer dump for: 0x1609e0 [ 129.447348][ T934] CPU: 1 PID: 934 Comm: kworker/1:4 Tainted: G W OE 6.6.56-android15-8-o-g6f82312b30b9-debug #1 1400000003000000474e5500b3187743670464e8 [ 129.447362][ T934] Hardware name: Qualcomm Technologies, Inc. Parrot QRD, Alpha-M (DT) [ 129.447373][ T934] Workqueue: dm_bufio_cache shrink_work [ 129.447394][ T934] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 129.447406][ T934] pc : android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug] [ 129.447435][ T934] lr : __traceiter_android_rvh_schedule_bug+0x44/0x6c [ 129.447451][ T934] sp : ffffffc0843dbc90 [ 129.447459][ T934] x29: ffffffc0843dbc90 x28: ffffffffffffffff x27: 0000000000000c8b [ 129.447479][ T934] x26: 0000000000000040 x25: ffffff804b3d6260 x24: ffffffd816232b68 [ 129.447497][ T934] x23: ffffff805171c5b4 x22: 0000000000000000 x21: ffffffd816231900 [ 129.447517][ T934] x20: ffffff80306ba898 x19: 0000000000000000 x18: ffffffc084159030 [ 129.447535][ T934] x17: 00000000d2b5dd1f x16: 00000000d2b5dd1f x15: ffffffd816720358 [ 129.447554][ T934] x14: 0000000000000004 x13: ffffff89ef978000 x12: 0000000000000003 [ 129.447572][ T934] x11: ffffffd817a823c4 x10: 0000000000000202 x9 : 7e779c5735de9400 [ 129.447591][ T934] x8 : ffffffd81560d004 x7 : 205b5d3938373434 x6 : ffffffd8167397c8 [ 129.447610][ T934] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffffffc0843db9e0 [ 129.447629][ T934] x2 : 0000000000002f15 x1 : 0000000000000000 x0 : 0000000000000000 [ 129.447647][ T934] Call trace: [ 129.447655][ T934] android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug 1400000003000000474e550080cce8a8a78606b6] [ 129.447681][ T934] __might_resched+0x190/0x1a8 [ 129.447694][ T934] shrink_work+0x180/0x248 [ 129.447706][ T934] process_one_work+0x260/0x624 [ 129.447718][ T934] worker_thread+0x28c/0x454 [ 129.447729][ T934] kthread+0x118/0x158 [ 129.447742][ T934] ret_from_fork+0x10/0x20 [ 129.447761][ T934] Code: ???????? ???????? ???????? d2b5dd1f (d4210000) [ 129.447772][ T934] ---[ end trace 0000000000000000 ]--- dm_bufio_lock will call spin_lock_bh when try_verify_in_tasklet is enabled, and __scan will be called in atomic context. Fixes: 7cd326747f46 ("dm bufio: remove dm_bufio_cond_resched()") Signed-off-by: LongPing Wei <weilongping@oppo.com> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02md/raid1: Add check for missing source disk in process_checks()Meir Elisha1-10/+16
[ Upstream commit b7c178d9e57c8fd4238ff77263b877f6f16182ba ] During recovery/check operations, the process_checks function loops through available disks to find a 'primary' source with successfully read data. If no suitable source disk is found after checking all possibilities, the 'primary' index will reach conf->raid_disks * 2. Add an explicit check for this condition after the loop. If no source disk was found, print an error message and return early to prevent further processing without a valid primary source. Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com Signed-off-by: Meir Elisha <meir.elisha@volumez.com> Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25md: fix mddev uaf while iterating all_mddevs listYu Kuai1-9/+13
commit 8542870237c3a48ff049b6c5df5f50c8728284fa upstream. While iterating all_mddevs list from md_notify_reboot() and md_exit(), list_for_each_entry_safe is used, and this can race with deletint the next mddev, causing UAF: t1: spin_lock //list_for_each_entry_safe(mddev, n, ...) mddev_get(mddev1) // assume mddev2 is the next entry spin_unlock t2: //remove mddev2 ... mddev_free spin_lock list_del spin_unlock kfree(mddev2) mddev_put(mddev1) spin_lock //continue dereference mddev2->all_mddevs The old helper for_each_mddev() actually grab the reference of mddev2 while holding the lock, to prevent from being freed. This problem can be fixed the same way, however, the code will be complex. Hence switch to use list_for_each_entry, in this case mddev_put() can free the mddev1 and it's not safe as well. Refer to md_seq_show(), also factor out a helper mddev_put_locked() to fix this problem. Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot") Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit") Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org> Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25md/md-bitmap: fix stats collection for external bitmapsZheng Qixing1-3/+2
[ Upstream commit 6ec1f0239485028445d213d91cfee5242f3211ba ] The bitmap_get_stats() function incorrectly returns -ENOENT for external bitmaps. Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). Note: "bitmap_info.external" here refers to a bitmap stored in a separate file (bitmap_file), not to external metadata. Fixes: 8d28d0ddb986 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25md/raid10: fix missing discard IO accountingYu Kuai1-0/+1
[ Upstream commit d05af90d6218e9c8f1c2026990c3f53c1b41bfb0 ] md_account_bio() is not called from raid10_handle_discard(), now that we handle bitmap inside md_account_bio(), also fix missing bitmap_startwrite for discard. Test whole disk discard for 20G raid10: Before: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 48.00 16.00 0.00 0.00 5.42 341.33 After: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 68.00 20462.00 0.00 0.00 2.65 308133.65 Link: https://lore.kernel.org/linux-raid/20250325015746.3195035-1-yukuai1@huaweicloud.com Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20dm-verity: fix prefetch-vs-suspend raceMikulas Patocka1-0/+8
commit 2de510fccbca3d1906b55f4be5f1de83fa2424ef upstream. There's a possible race condition in dm-verity - the prefetch work item may race with suspend and it is possible that prefetch continues to run while the device is suspended. Fix this by calling flush_workqueue and dm_bufio_client_reset in the postsuspend hook. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20dm-integrity: fix non-constant-time tag verificationJo Van Bulck1-23/+22
commit 8bde1033f9cfc1c08628255cc434c6cf39c9d9ba upstream. When using dm-integrity in standalone mode with a keyed hmac algorithm, integrity tags are calculated and verified internally. Using plain memcmp to compare the stored and computed tags may leak the position of the first byte mismatch through side-channel analysis, allowing to brute-force expected tags in linear time (e.g., by counting single-stepping interrupts in confidential virtual machine environments). Co-developed-by: Luca Wilke <work@luca-wilke.com> Signed-off-by: Luca Wilke <work@luca-wilke.com> Signed-off-by: Jo Van Bulck <jo.vanbulck@cs.kuleuven.be> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>