summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2024-01-01dm-integrity: don't modify bio's immutable bio_vec in integrity_metadata()Mikulas Patocka1-5/+6
commit b86f4b790c998afdbc88fe1aa55cfe89c4068726 upstream. __bio_for_each_segment assumes that the first struct bio_vec argument doesn't change - it calls "bio_advance_iter_single((bio), &(iter), (bvl).bv_len)" to advance the iterator. Unfortunately, the dm-integrity code changes the bio_vec with "bv.bv_len -= pos". When this code path is taken, the iterator would be out of sync and dm-integrity would report errors. This happens if the machine is out of memory and "kmalloc" fails. Fix this bug by making a copy of "bv" and changing the copy instead. Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-01dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_clientLi Lingfeng6-34/+44
[ Upstream commit d48300120627a1cb98914738fff38b424625b8ad ] As described in commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be triggered because shrinker_rwsem currently needs to held by dm_pool_abort_metadata() as a side-effect of thin-pool metadata operation failure. The following three problem scenarios have been noticed: 1) Described by commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") 2) shrinker_rwsem and throttle->lock P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_wait_block_bitmap __ext4_error ext4_handle_error ext4_commit_super ... dm_submit_bio do_worker throttle_work_update down_write(&t->lock) -- LOCK B process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker down_write(&shrinker_rwsem) -- LOCK A thin_map thin_bio_map thin_defer_bio_with_throttle throttle_lock down_read(&t->lock) - LOCK B 3) shrinker_rwsem and wait_on_buffer P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab ... ext4_wait_block_bitmap __ext4_error ext4_handle_error jbd2_journal_abort jbd2_journal_update_sb_errno jbd2_write_superblock submit_bh // LOCK B // RELEASE B do_worker throttle_work_update down_write(&t->lock) - LOCK B process_deferred_bios process_bio commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker register_shrinker_prepared down_write(&shrinker_rwsem) - LOCK A bio_endio wait_on_buffer __wait_on_buffer Fix these by resetting dm_bufio_client without holding shrinker_rwsem. Fixes: 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20bcache: avoid NULL checking to c->root in run_cache_set()Coly Li1-1/+1
[ Upstream commit 3eba5e0b2422aec3c9e79822029599961fdcab97 ] In run_cache_set() after c->root returned from bch_btree_node_get(), it is checked by IS_ERR_OR_NULL(). Indeed it is unncessary to check NULL because bch_btree_node_get() will not return NULL pointer to caller. This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()Coly Li1-0/+7
[ Upstream commit 31f5b956a197d4ec25c8a07cb3a2ab69d0c0b82f ] This patch adds code comments to bch_btree_node_get() and __bch_btree_node_alloc() that NULL pointer will not be returned and it is unnecessary to check NULL pointer by the callers of these routines. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20bcache: remove redundant assignment to variable cur_idxColin Ian King1-1/+1
[ Upstream commit be93825f0e6428c2d3f03a6e4d447dc48d33d7ff ] Variable cur_idx is being initialized with a value that is never read, it is being re-assigned later in a while-loop. Remove the redundant assignment. Cleans up clang scan build warning: drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-20bcache: avoid oversize memory allocation by small stripe_sizeColy Li2-0/+3
[ Upstream commit baf8fb7e0e5ec54ea0839f0c534f2cdcd79bea9c ] Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are used for dirty data writeback, their sizes are decided by backing device capacity and stripe size. Larger backing device capacity or smaller stripe size make these two arraies occupies more dynamic memory space. Currently bcache->stripe_size is directly inherited from queue->limits.io_opt of underlying storage device. For normal hard drives, its limits.io_opt is 0, and bcache sets the corresponding stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for devices do declare value for queue->limits.io_opt, small stripe_size (comparing to 1TB) becomes an issue for oversize memory allocations of bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the capacity of hard drives gets much larger in recent decade. For example a raid5 array assembled by three 20TB hardrives, the raid device capacity is 40TB with typical 512KB limits.io_opt. After the math calculation in bcache code, these two arraies will occupy 400MB dynamic memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is declared on a new 2TB hard drive, then these two arraies request 2GB and 512MB dynamic memory from kzalloc(). The result is that bcache device always fails to initialize on his system. To avoid the oversize memory allocation, bcache->stripe_size should not directly inherited by queue->limits.io_opt from the underlying device. This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size and set bcache device's stripe size against the declared limits.io_opt value from the underlying storage device, - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will set its stripe size directly by this limits.io_opt value. - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will set its stripe size by a value multiplying limits.io_opt and euqal or large than BCH_MIN_STRIPE_SZ. Then the minimal stripe size of a bcache device will always be >= 4MB. For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied by these two arraies will be 2.5MB in total. Such mount of memory allocated for bcache->stripe_sectors_dirty and bcache->full_dirty_stripes is reasonable for most of storage devices. Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net> Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-13md/raid6: use valid sector values to determine if an I/O should wait on the ↵David Jeffery1-2/+2
reshape commit c467e97f079f0019870c314996fae952cc768e82 upstream. During a reshape or a RAID6 array such as expanding by adding an additional disk, I/Os to the region of the array which have not yet been reshaped can stall indefinitely. This is from errors in the stripe_ahead_of_reshape function causing md to think the I/O is to a region in the actively undergoing the reshape. stripe_ahead_of_reshape fails to account for the q disk having a sector value of 0. By not excluding the q disk from the for loop, raid6 will always generate a min_sector value of 0, causing a return value which stalls. The function's max_sector calculation also uses min() when it should use max(), causing the max_sector value to always be 0. During a backwards rebuild this can cause the opposite problem where it allows I/O to advance when it should wait. Fixing these errors will allow safe I/O to advance in a timely manner and delay only I/O which is unsafe due to stripes in the middle of undergoing the reshape. Fixes: 486f60558607 ("md/raid5: Check all disks in a stripe_head for reshape progress") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231128181233.6187-1-djeffery@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()Yu Kuai1-11/+13
[ Upstream commit c9f7cb5b2bc968adcdc686c197ed108f47fd8eb0 ] If md_set_readonly() failed, the array could still be read-write, however 'MD_RECOVERY_FROZEN' could still be set, which leave the array in an abnormal state that sync or recovery can't continue anymore. Hence make sure the flag is cleared after md_set_readonly() returns. Fixes: 88724bfa68be ("md: wait for pending superblock updates before switching to read-only") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231205094215.1824240-3-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-13md: introduce md_ro_stateYe Bin1-70/+82
[ Upstream commit f97a5528b21eb175d90dce2df9960c8d08e1be82 ] Introduce md_ro_state for mddev->ro, so it is easy to understand. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: c9f7cb5b2bc9 ("md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-08bcache: revert replacing IS_ERR_OR_NULL with IS_ERRMarkus Weippert1-1/+1
commit bb6cc253861bd5a7cf8439e2118659696df9619f upstream. Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") replaced IS_ERR_OR_NULL by IS_ERR. This leads to a NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000080 Call Trace: ? __die_body.cold+0x1a/0x1f ? page_fault_oops+0xd2/0x2b0 ? exc_page_fault+0x70/0x170 ? asm_exc_page_fault+0x22/0x30 ? btree_node_free+0xf/0x160 [bcache] ? up_write+0x32/0x60 btree_gc_coalesce+0x2aa/0x890 [bcache] ? bch_extent_bad+0x70/0x170 [bcache] btree_gc_recurse+0x130/0x390 [bcache] ? btree_gc_mark_node+0x72/0x230 [bcache] bch_btree_gc+0x5da/0x600 [bcache] ? cpuusage_read+0x10/0x10 ? bch_btree_gc+0x600/0x600 [bcache] bch_gc_thread+0x135/0x180 [bcache] The relevant code starts with: new_nodes[0] = NULL; for (i = 0; i < nodes; i++) { if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key))) goto out_nocoalesce; // ... out_nocoalesce: // ... for (i = 0; i < nodes; i++) if (!IS_ERR(new_nodes[i])) { // IS_ERR_OR_NULL before 028ddcac477b btree_node_free(new_nodes[i]); // new_nodes[0] is NULL rw_unlock(true, new_nodes[i]); } This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this. Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") Link: https://lore.kernel.org/all/3DF4A87A-2AC1-4893-AE5F-E921478419A9@suse.de/ Cc: stable@vger.kernel.org Cc: Zheng Wang <zyytlz.wz@163.com> Cc: Coly Li <colyli@suse.de> Signed-off-by: Markus Weippert <markus@gekmihesg.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08dm verity: don't perform FEC for failed readahead IOWu Bo1-1/+3
commit 0193e3966ceeeef69e235975918b287ab093082b upstream. We found an issue under Android OTA scenario that many BIOs have to do FEC where the data under dm-verity is 100% complete and no corruption. Android OTA has many dm-block layers, from upper to lower: dm-verity dm-snapshot dm-origin & dm-cow dm-linear ufs DM tables have to change 2 times during Android OTA merging process. When doing table change, the dm-snapshot will be suspended for a while. During this interval, many readahead IOs are submitted to dm_verity from filesystem. Then the kverity works are busy doing FEC process which cost too much time to finish dm-verity IO. This causes needless delay which feels like system is hung. After adding debugging it was found that each readahead IO needed around 10s to finish when this situation occurred. This is due to IO amplification: dm-snapshot suspend erofs_readahead // 300+ io is submitted dm_submit_bio (dm_verity) dm_submit_bio (dm_snapshot) bio return EIO bio got nothing, it's empty verity_end_io verity_verify_io forloop range(0, io->n_blocks) // each io->nblocks ~= 20 verity_fec_decode fec_decode_rsb fec_read_bufs forloop range(0, v->fec->rsn) // v->fec->rsn = 253 new_read submit_bio (dm_snapshot) end loop end loop dm-snapshot resume Readahead BIOs get nothing while dm-snapshot is suspended, so all of them will cause verity's FEC. Each readahead BIO needs to verify ~20 (io->nblocks) blocks. Each block needs to do FEC, and every block needs to do 253 (v->fec->rsn) reads. So during the suspend interval(~200ms), 300 readahead BIOs trigger ~1518000 (300*20*253) IOs to dm-snapshot. As readahead IO is not required by userspace, and to fix this issue, it is best to pass readahead errors to upper layer to handle it. Cc: stable@vger.kernel.org Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Signed-off-by: Wu Bo <bo.wu@vivo.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08dm verity: initialize fec io before freeing itWu Bo1-1/+2
commit 7be05bdfb4efc1396f7692562c7161e2b9f595f1 upstream. If BIO error, verity_end_io() can call verity_finish_io() before verity_fec_init_io(). Therefore, fec_io->rs is not initialized and may crash when doing memory freeing in verity_fec_finish_io(). Crash call stack: die+0x90/0x2b8 __do_kernel_fault+0x260/0x298 do_bad_area+0x2c/0xdc do_translation_fault+0x3c/0x54 do_mem_abort+0x54/0x118 el1_abort+0x38/0x5c el1h_64_sync_handler+0x50/0x90 el1h_64_sync+0x64/0x6c free_rs+0x18/0xac fec_rs_free+0x10/0x24 mempool_free+0x58/0x148 verity_fec_finish_io+0x4c/0xb0 verity_end_io+0xb8/0x150 Cc: stable@vger.kernel.org # v6.0+ Fixes: 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Wu Bo <bo.wu@vivo.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-08dm-verity: align struct dm_verity_fec_io properlyMikulas Patocka2-7/+2
commit 38bc1ab135db87577695816b190e7d6d8ec75879 upstream. dm_verity_fec_io is placed after the end of two hash digests. If the hash digest has unaligned length, struct dm_verity_fec_io could be unaligned. This commit fixes the placement of struct dm_verity_fec_io, so that it's aligned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03bcache: fixup lock c->root errorMingzhe Zou1-3/+11
commit e34820f984512b433ee1fc291417e60c47d56727 upstream. We had a problem with io hung because it was waiting for c->root to release the lock. crash> cache_set.root -l cache_set.list ffffa03fde4c0050 root = 0xffff802ef454c800 crash> btree -o 0xffff802ef454c800 | grep rw_semaphore [ffff802ef454c858] struct rw_semaphore lock; crash> struct rw_semaphore ffff802ef454c858 struct rw_semaphore { count = { counter = -4294967297 }, wait_list = { next = 0xffff00006786fc28, prev = 0xffff00005d0efac8 }, wait_lock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } }, osq = { tail = { counter = 0 } }, owner = 0xffffa03fdc586603 } The "counter = -4294967297" means that lock count is -1 and a write lock is being attempted. Then, we found that there is a btree with a counter of 1 in btree_cache_freeable. crash> cache_set -l cache_set.list ffffa03fde4c0050 -o|grep btree_cache [ffffa03fde4c1140] struct list_head btree_cache; [ffffa03fde4c1150] struct list_head btree_cache_freeable; [ffffa03fde4c1160] struct list_head btree_cache_freed; [ffffa03fde4c1170] unsigned int btree_cache_used; [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait; [ffffa03fde4c1190] struct task_struct *btree_cache_alloc_lock; crash> list -H ffffa03fde4c1140|wc -l 973 crash> list -H ffffa03fde4c1150|wc -l 1123 crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050 btree_cache_used = 2097 crash> list -s btree -l btree.list -H ffffa03fde4c1140|grep -E -A2 "^ lock = {" > btree_cache.txt crash> list -s btree -l btree.list -H ffffa03fde4c1150|grep -E -A2 "^ lock = {" > btree_cache_freeable.txt [root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd /var/crash/127.0.0.1-2023-08-04-16:40:28 [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt|grep counter|grep -v "counter = 0" [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt|grep counter|grep -v "counter = 0" counter = 1 We found that this is a bug in bch_sectors_dirty_init() when locking c->root: (1). Thread X has locked c->root(A) write. (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A). (3). Thread X bch_btree_set_root() changes c->root from A to B. (4). Thread X releases the lock(c->root A). (5). Thread Y successfully locks c->root(A). (6). Thread Y releases the lock(c->root B). down_write locked ---(1)----------------------┐ | | | down_read waiting ---(2)----┐ | | | ┌-------------┐ ┌-------------┐ bch_btree_set_root ===(3)========>> | c->root A | | c->root B | | | └-------------┘ └-------------┘ up_write ---(4)---------------------┘ | | | | | down_read locked ---(5)-----------┘ | | | up_read ---(6)-----------------------------┘ Since c->root may change, the correct steps to lock c->root should be the same as bch_root_usage(), compare after locking. static unsigned int bch_root_usage(struct cache_set *c) { unsigned int bytes = 0; struct bkey *k; struct btree *b; struct btree_iter iter; goto lock_root; do { rw_unlock(false, b); lock_root: b = c->root; rw_lock(false, b, b->level); } while (b != c->root); for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad) bytes += bkey_bytes(k); rw_unlock(false, b); return (bytes * 100) / btree_bytes(c); } Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03bcache: fixup init dirty data errorsMingzhe Zou1-1/+4
commit 7cc47e64d3d69786a2711a4767e26b26ba63d7ed upstream. We found that after long run, the dirty_data of the bcache device will have errors. This error cannot be eliminated unless re-register. We also found that reattach after detach, this error can accumulate. In bch_sectors_dirty_init(), all inode <= d->id keys will be recounted again. This is wrong, we only need to count the keys of the current device. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03bcache: prevent potential division by zero errorRand Deeb1-1/+1
commit 2c7f497ac274a14330208b18f6f734000868ebf9 upstream. In SHOW(), the variable 'n' is of type 'size_t.' While there is a conditional check to verify that 'n' is not equal to zero before executing the 'do_div' macro, concerns arise regarding potential division by zero error in 64-bit environments. The concern arises when 'n' is 64 bits in size, greater than zero, and the lower 32 bits of it are zeros. In such cases, the conditional check passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to 'uint32_t,' effectively truncating it to its lower 32 bits. Consequently, the 'n' value becomes zero. To fix this potential division by zero error and ensure precise division handling, this commit replaces the 'do_div' macro with div64_u64(). div64_u64() is designed to work with 64-bit operands, guaranteeing that division is performed correctly. This change enhances the robustness of the code, ensuring that division operations yield accurate results in all scenarios, eliminating the possibility of division by zero, and improving compatibility across different 64-bit environments. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Rand Deeb <rand.sec96@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03bcache: check return value from btree_node_alloc_replacement()Coly Li1-0/+2
commit 777967e7e9f6f5f3e153abffb562bffaf4430d26 upstream. In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns from btree_gc_rewrite_node(). There is potential possibility that 'n' is a non NULL ERR_PTR(), referencing such error code is not permitted in following code. Therefore a return value checking is necessary after 'n' is back from btree_node_alloc_replacement(). Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03dm-delay: fix a race between delay_presuspend and delay_bioMikulas Patocka1-5/+12
[ Upstream commit 6fc45b6ed921dc00dfb264dc08c7d67ee63d2656 ] In delay_presuspend, we set the atomic variable may_delay and then stop the timer and flush pending bios. The intention here is to prevent the delay target from re-arming the timer again. However, this test is racy. Suppose that one thread goes to delay_bio, sees that dc->may_delay is one and proceeds; now, another thread executes delay_presuspend, it sets dc->may_delay to zero, deletes the timer and flushes pending bios. Then, the first thread continues and adds the bio to delayed->list despite the fact that dc->may_delay is false. Fix this bug by changing may_delay's type from atomic_t to bool and only access it while holding the delayed_bios_lock mutex. Note that we don't have to grab the mutex in delay_resume because there are no bios in flight at this point. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up raceMingzhe Zou1-1/+2
commit 2faac25d7958c4761bb8cec54adb79f806783ad6 upstream. We get a kernel crash about "unable to handle kernel paging request": ```dmesg [368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5 [368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1 [368033.032012] Oops: 0003 [#1] SMP PTI [368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.es8_24.x86_64 #1 [368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017 [368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0 [368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93 ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02 [368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082 [368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3 [368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68 [368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff [368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70 [368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [368033.032038] FS: 0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000 [368033.032039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0 [368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b [368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [368033.032046] Call Trace: [368033.032054] _raw_spin_lock_irqsave+0x32/0x40 [368033.032061] __wake_up_common_lock+0x63/0xc0 [368033.032073] ? bch_ptr_invalid+0x10/0x10 [bcache] [368033.033502] bch_dirty_init_thread+0x14c/0x160 [bcache] [368033.033511] ? read_dirty_submit+0x60/0x60 [bcache] [368033.033516] kthread+0x112/0x130 [368033.033520] ? kthread_flush_work_fn+0x10/0x10 [368033.034505] ret_from_fork+0x35/0x40 ``` The crash occurred when call wake_up(&state->wait), and then we want to look at the value in the state. However, bch_sectors_dirty_init() is not found in the stack of any task. Since state is allocated on the stack, we guess that bch_sectors_dirty_init() has exited, causing bch_dirty_init_thread() to be unable to handle kernel paging request. In order to verify this idea, we added some printing information during wake_up(&state->wait). We find that "wake up" is printed twice, however we only expect the last thread to wake up once. ```dmesg [ 994.641004] alcache: bch_dirty_init_thread() wake up [ 994.641018] alcache: bch_dirty_init_thread() wake up [ 994.641523] alcache: bch_sectors_dirty_init() init exit ``` There is a race. If bch_sectors_dirty_init() exits after the first wake up, the second wake up will trigger this bug("unable to handle kernel paging request"). Proceed as follows: bch_sectors_dirty_init kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0]) ... ... atomic_inc(&state.started) ... ... ... atomic_read(&state.enough) ... ... atomic_set(&state->enough, 1) kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1]) ... atomic_dec_and_test(&state->started) ... atomic_inc(&state.started) ... ... ... wake_up(&state->wait) ... atomic_read(&state.enough) atomic_dec_and_test(&state->started) ... ... wait_event(state.wait, atomic_read(&state.started) == 0) ... return ... wake_up(&state->wait) We believe it is very common to wake up twice if there is no dirty, but crash is an extremely low probability event. It's hard for us to reproduce this issue. We attached and detached continuously for a week, with a total of more than one million attaches and only one crash. Putting atomic_inc(&state.started) before kthread_run() can avoid waking up twice. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03md: fix bi_status reporting in md_end_clone_ioSong Liu1-1/+2
commit 45b478951b2ba5aea70b2850c49c1aa83aedd0d2 upstream. md_end_clone_io() may overwrite error status in orig_bio->bi_status with BLK_STS_OK. This could happen when orig_bio has BIO_CHAIN (split by md_submit_bio => bio_split_to_limits, for example). As a result, upper layer may miss error reported from md (or the device) and consider the failed IO was successful. Fix this by only update orig_bio->bi_status when current bio reports error and orig_bio is BLK_STS_OK. This is the same behavior as __bio_chain_endio(). Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5") Cc: stable@vger.kernel.org # v5.14+ Reported-by: Bhanu Victor DiCara <00bvd0+linux@gmail.com> Closes: https://lore.kernel.org/regressions/5727380.DvuYhMxLoT@bvd0/ Signed-off-by: Song Liu <song@kernel.org> Tested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()Coly Li1-1/+1
commit f72f4312d4388376fc8a1f6cf37cb21a0d41758b upstream. Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") do the following change inside btree_gc_coalesce(), 31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce( 32 memset(new_nodes, 0, sizeof(new_nodes)); 33 closure_init_stack(&cl); 34 35 - while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b)) 36 + while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b)) 37 keys += r[nodes++].keys; 38 39 blocks = btree_default_blocks(b->c) * 2 / 3; At line 35 the original r[nodes].b is not always allocatored from __bch_btree_node_alloc(), and possibly initialized as NULL pointer by caller of btree_gc_coalesce(). Therefore the change at line 36 is not correct. This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid potential issue. Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") Cc: <stable@vger.kernel.org> # 6.5+ Cc: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28dm-verity: don't use blocking calls from taskletsMikulas Patocka3-14/+15
commit 28f07f2ab4b3a2714f1fefcc58ada4bcc195f806 upstream. The commit 5721d4e5a9cd enhanced dm-verity, so that it can verify blocks from tasklets rather than from workqueues. This reportedly improves performance significantly. However, dm-verity was using the flag CRYPTO_TFM_REQ_MAY_SLEEP from tasklets which resulted in warnings about sleeping function being called from non-sleeping context. BUG: sleeping function called from invalid context at crypto/internal.h:206 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0 preempt_count: 100, expected: 0 RCU nest depth: 0, expected: 0 CPU: 0 PID: 14 Comm: ksoftirqd/0 Tainted: G W 6.7.0-rc1 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x32/0x50 __might_resched+0x110/0x160 crypto_hash_walk_done+0x54/0xb0 shash_ahash_update+0x51/0x60 verity_hash_update.isra.0+0x4a/0x130 [dm_verity] verity_verify_io+0x165/0x550 [dm_verity] ? free_unref_page+0xdf/0x170 ? psi_group_change+0x113/0x390 verity_tasklet+0xd/0x70 [dm_verity] tasklet_action_common.isra.0+0xb3/0xc0 __do_softirq+0xaf/0x1ec ? smpboot_thread_fn+0x1d/0x200 ? sort_range+0x20/0x20 run_ksoftirqd+0x15/0x30 smpboot_thread_fn+0xed/0x200 kthread+0xdc/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x28/0x40 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> This commit fixes dm-verity so that it doesn't use the flags CRYPTO_TFM_REQ_MAY_SLEEP and CRYPTO_TFM_REQ_MAY_BACKLOG from tasklets. The crypto API would do GFP_ATOMIC allocation instead, it could return -ENOMEM and we catch -ENOMEM in verity_tasklet and requeue the request to the workqueue. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v6.0+ Fixes: 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10dm zoned: free dmz->ddev array in dmz_put_zoned_devicesFedor Pchelkin1-8/+7
commit 9850ccd5dd88075b2b7fd28d96299d5535f58cc5 upstream. Commit 4dba12881f88 ("dm zoned: support arbitrary number of devices") made the pointers to additional zoned devices to be stored in a dynamically allocated dmz->ddev array. However, this array is not freed. Rename dmz_put_zoned_device to dmz_put_zoned_devices and fix it to free the dmz->ddev array when cleaning up zoned device information. Remove NULL assignment for all dmz->ddev elements and just free the dmz->ddev array instead. Found by Linux Verification Center (linuxtesting.org). Fixes: 4dba12881f88 ("dm zoned: support arbitrary number of devices") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10md/raid5: release batch_last before waiting for another stripe_headDavid Jeffery1-0/+7
commit 2fd7b0f6d5ad655b1d947d3acdd82f687c31465e upstream. When raid5_get_active_stripe is called with a ctx containing a stripe_head in its batch_last pointer, it can cause a deadlock if the task sleeps waiting on another stripe_head to become available. The stripe_head held by batch_last can be blocking the advancement of other stripe_heads, leading to no stripe_heads being released so raid5_get_active_stripe waits forever. Like with the quiesce state handling earlier in the function, batch_last needs to be released by raid5_get_active_stripe before it waits for another stripe_head. Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231002183422.13047-1-djeffery@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06dm: fix a race condition in retrieve_depsMikulas Patocka3-9/+31
[ Upstream commit f6007dce0cd35d634d9be91ef3515a6385dcee16 ] There's a race condition in the multipath target when retrieve_deps races with multipath_message calling dm_get_device and dm_put_device. retrieve_deps walks the list of open devices without holding any lock but multipath may add or remove devices to the list while it is running. The end result may be memory corruption or use-after-free memory access. See this description of a UAF with multipath_message(): https://listman.redhat.com/archives/dm-devel/2022-October/052373.html Fix this bug by introducing a new rw semaphore "devices_lock". We grab devices_lock for read in retrieve_deps and we grab it for write in dm_get_device and dm_put_device. Reported-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Tested-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23dm: don't attempt to queue IO under RCU protectionJens Axboe1-21/+2
commit a9ce385344f916cd1c36a33905e564f5581beae9 upstream. dm looks up the table for IO based on the request type, with an assumption that if the request is marked REQ_NOWAIT, it's fine to attempt to submit that IO while under RCU read lock protection. This is not OK, as REQ_NOWAIT just means that we should not be sleeping waiting on other IO, it does not mean that we can't potentially schedule. A simple test case demonstrates this quite nicely: int main(int argc, char *argv[]) { struct iovec iov; int fd; fd = open("/dev/dm-0", O_RDONLY | O_DIRECT); posix_memalign(&iov.iov_base, 4096, 4096); iov.iov_len = 4096; preadv2(fd, &iov, 1, 0, RWF_NOWAIT); return 0; } which will instantly spew: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5580, name: dm-nowait preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 INFO: lockdep is turned off. CPU: 7 PID: 5580 Comm: dm-nowait Not tainted 6.6.0-rc1-g39956d2dcd81 #132 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x11d/0x1b0 __might_resched+0x3c3/0x5e0 ? preempt_count_sub+0x150/0x150 mempool_alloc+0x1e2/0x390 ? mempool_resize+0x7d0/0x7d0 ? lock_sync+0x190/0x190 ? lock_release+0x4b7/0x670 ? internal_get_user_pages_fast+0x868/0x2d40 bio_alloc_bioset+0x417/0x8c0 ? bvec_alloc+0x200/0x200 ? internal_get_user_pages_fast+0xb8c/0x2d40 bio_alloc_clone+0x53/0x100 dm_submit_bio+0x27f/0x1a20 ? lock_release+0x4b7/0x670 ? blk_try_enter_queue+0x1a0/0x4d0 ? dm_dax_direct_access+0x260/0x260 ? rcu_is_watching+0x12/0xb0 ? blk_try_enter_queue+0x1cc/0x4d0 __submit_bio+0x239/0x310 ? __bio_queue_enter+0x700/0x700 ? kvm_clock_get_cycles+0x40/0x60 ? ktime_get+0x285/0x470 submit_bio_noacct_nocheck+0x4d9/0xb80 ? should_fail_request+0x80/0x80 ? preempt_count_sub+0x150/0x150 ? lock_release+0x4b7/0x670 ? __bio_add_page+0x143/0x2d0 ? iov_iter_revert+0x27/0x360 submit_bio_noacct+0x53e/0x1b30 submit_bio_wait+0x10a/0x230 ? submit_bio_wait_endio+0x40/0x40 __blkdev_direct_IO_simple+0x4f8/0x780 ? blkdev_bio_end_io+0x4c0/0x4c0 ? stack_trace_save+0x90/0xc0 ? __bio_clone+0x3c0/0x3c0 ? lock_release+0x4b7/0x670 ? lock_sync+0x190/0x190 ? atime_needs_update+0x3bf/0x7e0 ? timestamp_truncate+0x21b/0x2d0 ? inode_owner_or_capable+0x240/0x240 blkdev_direct_IO.part.0+0x84a/0x1810 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 ? blkdev_read_iter+0x40d/0x530 ? reacquire_held_locks+0x4e0/0x4e0 ? __blkdev_direct_IO_simple+0x780/0x780 ? rcu_is_watching+0x12/0xb0 ? __mark_inode_dirty+0x297/0xd50 ? preempt_count_add+0x72/0x140 blkdev_read_iter+0x2a4/0x530 do_iter_readv_writev+0x2f2/0x3c0 ? generic_copy_file_range+0x1d0/0x1d0 ? fsnotify_perm.part.0+0x25d/0x630 ? security_file_permission+0xd8/0x100 do_iter_read+0x31b/0x880 ? import_iovec+0x10b/0x140 vfs_readv+0x12d/0x1a0 ? vfs_iter_read+0xb0/0xb0 ? rcu_is_watching+0x12/0xb0 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 do_preadv+0x1b3/0x260 ? do_readv+0x370/0x370 __x64_sys_preadv2+0xef/0x150 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5af41ad806 Code: 41 54 41 89 fc 55 44 89 c5 53 48 89 cb 48 83 ec 18 80 3d e4 dd 0d 00 00 74 7a 45 89 c1 49 89 ca 45 31 c0 b8 47 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 be 00 00 00 48 85 c0 79 4a 48 8b 0d da 55 RSP: 002b:00007ffd3145c7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000147 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5af41ad806 RDX: 0000000000000001 RSI: 00007ffd3145c850 RDI: 0000000000000003 RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 00007ffd3145c850 R14: 000055f5f0431dd8 R15: 0000000000000001 </TASK> where in fact it is dm itself that attempts to allocate a bio clone with GFP_NOIO under the rcu read lock, regardless of the request type. Fix this by getting rid of the special casing for REQ_NOWAIT, and just use the normal SRCU protected table lookup. Get rid of the bio based table locking helpers at the same time, as they are now unused. Cc: stable@vger.kernel.org Fixes: 563a225c9fd2 ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23md: Put the right device in md_seq_nextMariusz Tkaczyk1-1/+1
commit c8870379a21fbd9ad14ca36204ccfbe9d25def43 upstream. If there are multiple arrays in system and one mddevice is marked with MD_DELETED and md_seq_next() is called in the middle of removal then it _get()s proper device but it may _put() deleted one. As a result, active counter may never be zeroed for mddevice and it cannot be removed. Put the device which has been _get with previous md_seq_next() call. Cc: stable@vger.kernel.org Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Reported-by: AceLan Kao <acelan@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798 Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23md/raid1: fix error: ISO C90 forbids mixed declarationsNigel Croxon1-2/+1
[ Upstream commit df203da47f4428bc286fc99318936416253a321c ] There is a compile error when this commit is added: md: raid1: fix potential OOB in raid1_remove_disk() drivers/md/raid1.c: In function 'raid1_remove_disk': drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] 1844 |         struct raid1_info *p = conf->mirrors + number;     |         ^~~~~~ That's because the new code was inserted before the struct. The change is move the struct command above this commit. Fixes: 8b0472b50bcf ("md: raid1: fix potential OOB in raid1_remove_disk()") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23md: raid1: fix potential OOB in raid1_remove_disk()Zhang Shurong1-0/+4
[ Upstream commit 8b0472b50bcf0f19a5119b00a53b63579c8e1e4d ] If rddev->raid_disk is greater than mddev->raid_disks, there will be an out-of-bounds in raid1_remove_disk(). We have already found similar reports as follows: 1) commit d17f744e883b ("md-raid10: fix KASAN warning") 2) commit 1ebc2cec0b7d ("dm raid: fix KASAN warning in raid5_remove_disk") Fix this bug by checking whether the "number" variable is valid. Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: fix regression for null-ptr-deference in __md_stop()Yu Kuai1-1/+2
commit 433279beba1d4872da10b7b60a539e0cb828b32b upstream. Commit 3e453522593d ("md: Free resources in __md_stop") tried to fix null-ptr-deference for 'active_io' by moving percpu_ref_exit() to __md_stop(), however, the commit also moving 'writes_pending' to __md_stop(), and this will cause mdadm tests broken: BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37 RIP: 0010:free_percpu+0x465/0x670 Call Trace: <TASK> __percpu_ref_exit+0x48/0x70 percpu_ref_exit+0x1a/0x90 __md_stop+0xe9/0x170 do_md_stop+0x1e1/0x7b0 md_ioctl+0x90c/0x1aa0 blkdev_ioctl+0x19b/0x400 vfs_ioctl+0x20/0x50 __x64_sys_ioctl+0xba/0xe0 do_syscall_64+0x6c/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0xcd And the problem can be reporduced 100% by following test: mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force echo inactive > /sys/block/md0/md/array_state echo read-auto > /sys/block/md0/md/array_state echo inactive > /sys/block/md0/md/array_state Root cause: // start raid raid1_run mddev_init_writes_pending percpu_ref_init // inactive raid array_state_store do_md_stop __md_stop percpu_ref_exit // start raid again array_state_store do_md_run raid1_run mddev_init_writes_pending if (mddev->writes_pending.percpu_count_ptr) // won't reinit // inactive raid again ... percpu_ref_exit -> null-ptr-deference Before the commit, 'writes_pending' is exited when mddev is freed, and it's safe to restart raid because mddev_init_writes_pending() already make sure that 'writes_pending' will only be initialized once. Fix the prblem by moving 'writes_pending' back, it's a litter hard to find the relationship between alloc memory and free memory, however, code changes is much less and we lived with this for a long time already. Fixes: 3e453522593d ("md: Free resources in __md_stop") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13md: Free resources in __md_stopXiao Ni1-8/+5
commit 3e453522593d74a87cf68a38e14aa36ebca1dbcd upstream. If md_run() fails after ->active_io is initialized, then percpu_ref_exit is called in error path. However, later md_free_disk will call percpu_ref_exit again which leads to a panic because of null pointer dereference. It can also trigger this bug when resources are initialized but are freed in error path, then will be freed again in md_free_disk. BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP Workqueue: md_misc mddev_delayed_delete RIP: 0010:free_percpu+0x110/0x630 Call Trace: <TASK> __percpu_ref_exit+0x44/0x70 percpu_ref_exit+0x16/0x90 md_free_disk+0x2f/0x80 disk_release+0x101/0x180 device_release+0x84/0x110 kobject_put+0x12a/0x380 kobject_put+0x160/0x380 mddev_delayed_delete+0x19/0x30 process_one_work+0x269/0x680 worker_thread+0x266/0x640 kthread+0x151/0x1b0 ret_from_fork+0x1f/0x30 For creating raid device, md raid calls do_md_run->md_run, dm raid calls md_run. We alloc those memory in md_run. For stopping raid device, md raid calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can free those memory resources in __md_stop. Fixes: 72adae23a72c ("md: Change active_io to percpu") Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13md: raid0: account for split bio in iostat accountingDavid Jeffery1-2/+1
[ Upstream commit cc22b5407e9ca76adb7efeed843146510b1b72a5 ] When a bio is split by md raid0, the newly created bio will not be tracked by md for I/O accounting. Only the portion of I/O still assigned to the original bio which was reduced by the split will be accounted for. This results in md iostat data sometimes showing I/O values far below the actual amount of data being sent through md. md_account_bio() needs to be called for all bio generated by the bio split. A simple example of the issue was generated using a raid0 device on partitions to the same device. Since all raid0 I/O then goes to one device, it makes it easy to see a gap between the md device and its sd storage. Reading an lvm device on top of the md device, the iostat output (some 0 columns and extra devices removed to make the data more compact) was: Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 md2 1364.00 411496.00 0.00 0.00 411496 sde 1734.00 646144.00 0.00 0.00 646144 md2 1699.00 510680.00 0.00 0.00 510680 sde 2155.00 802784.00 0.00 0.00 802784 md2 803.00 241480.00 0.00 0.00 241480 sde 1016.00 377888.00 0.00 0.00 377888 md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 I/O was generated doing large direct I/O reads (12M) with dd to a linear lvm volume on top of the 4 leg raid0 device. The md2 reads were showing as roughly 2/3 of the reads to the sde device containing all of md2's raid partitions. The sum of reads to sde was 1826816 kB, which was the expected amount as it was the amount read by dd. With the patch, the total reads from md will match the reads from sde and be consistent with the amount of I/O generated. Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5") Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid0: Fix performance regression for large sequential writesJan Kara1-1/+1
[ Upstream commit 319ff40a542736d67e5bce18635de35d0e7a0bff ] Commit f00d7c85be9e ("md/raid0: fix up bio splitting.") among other things changed how bio that needs to be split is submitted. Before this commit, we have split the bio, mapped and submitted each part. After this commit, we map only the first part of the split bio and submit the second part unmapped. Due to bio sorting in __submit_bio_noacct() this results in the following request ordering: 9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392 Split off chunk-sized (1024 sectors) request: 9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488 Request is unaligned to the chunk so it's split in raid0_make_request(). This is the first part mapped and punted to bio_list: 8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464 Now raid0_make_request() returns, second part is postponed on bio_list. __submit_bio_noacct() resorts the bio_list, mapped request is submitted to the underlying device: 8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016 Now we take another request from the bio_list which is the remainder of the original huge request. Split off another chunk-sized bit from it and the situation repeats: 9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512 8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488 8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016 ... 9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write] 8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512 8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016 ... This repeats until we consume the whole original huge request. Now we finally get to processing the second parts of the split off requests (in reverse order): 8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920 8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896 8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872 8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848 ... I guess it is obvious that this IO pattern is extremely inefficient way to perform sequential IO. It also makes bio_list to grow to rather long lengths. Change raid0_make_request() to map both parts of the split bio. Since we know we are provided with at most chunk-sized bios, we will always need to split the incoming bio at most once. Fixes: f00d7c85be9e ("md/raid0: fix up bio splitting.") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid0: Factor out helper for mapping and submitting a bioJan Kara1-39/+40
[ Upstream commit af50e20afb401cc203bd2a9ff62ece0ae4976103 ] Factor out helper function for mapping and submitting a bio out of raid0_make_request(). We will use it later for submitting both parts of a split bio. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: add error_handlers for raid0 and linearMariusz Tkaczyk4-10/+31
[ Upstream commit c31fea2f8e2a72c817f318016bbc327095175a9f ] After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10") MD_BROKEN must be set if array is failed because state_store() checks it. If it is set then -EBUSY is returned to userspace. For raid0 and linear MD_BROKEN is not set by error_handler(). As a result mdadm is unable to trigger clean-up actions. It is a regression. This patch adds appropriate error_handler for raid0 and linear. The error handler sets MD_BROKEN for this device. Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com Stable-dep-of: 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()Yu Kuai1-2/+1
[ Upstream commit 0d0bd28c500173bfca78aa840f8f36d261ef1765 ] r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is empty, and then submit 'flush_bio', however, r5l_log_flush_endio() is clearing the list first and then clear the bio, which will cause null-ptr-deref: T1: submit flush io raid5d handle_active_stripes r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init submit_bio // io1 T2: io1 is done r5l_log_flush_endio list_splice_tail_init // clear the list T3: submit new flush io ... r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init bio_uninit // clear bio->bi_blkg submit_bio // null-ptr-deref Fix this problem by clearing bio before clearing the list in r5l_log_flush_endio(). Fixes: 0dd00cba99c3 ("raid5-cache: fully initialize flush_bio when needed") Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org> Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid5-cache: fix a deadlock in r5l_exit_log()Yu Kuai1-3/+6
[ Upstream commit a705b11b358dee677aad80630e7608b2d5f56691 ] Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") introduce a new problem: // caller hold reconfig_mutex r5l_exit_log flush_work(&log->disable_writeback_work) r5c_disable_writeback_async wait_event /* * conf->log is not NULL, and mddev_trylock() * will fail, wait_event() can never pass. */ conf->log = NULL Fix this problem by setting 'config->log' to NULL before wake_up() as it used to be, so that wait_event() from r5c_disable_writeback_async() can exist. In the meantime, move forward md_unregister_thread() so that null-ptr-deref this commit fixed can still be fixed. Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai1-0/+7
[ Upstream commit 44abfa6a95df425c0660d56043020b67e6d93ab8 ] Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai1-2/+0
[ Upstream commit b4d129640f194ffc4cc64c3e97f98ae944c072e8 ] Local variable is definied first in the beginning of backlog_store(), there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan1-8/+5
[ Upstream commit 673643490b9a0eb3b25633abe604f62b8f63dba1 ] Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, there are same issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md/raid10: factor out dereference_rdev_and_rrdev()Li Nan1-9/+20
[ Upstream commit b99f8fd2d91eb734f13098aa1cf337edaca454b7 ] Factor out a helper to get 'rdev' and 'replacement' from config->mirrors. Just to make code cleaner and prepare to fix the bug of io loss while 'replacement' replace 'rdev'. There is no functional change. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: 673643490b9a ("md/raid10: use dereference_rdev_and_rrdev() to get devices") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: restore 'noio_flag' for the last mddev_resume()Yu Kuai1-2/+4
[ Upstream commit e24ed04389f9619e0aaef615a8948633c182a8b0 ] memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restore 'noio_flag' for the last mddev_resume(). Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: Change active_io to percpuXiao Ni2-20/+25
[ Upstream commit 72adae23a72cb12e2ef0dcd7c0aa042867f27998 ] Now the type of active_io is atomic. It's used to count how many ios are in the submitting process and it's added and decreased very time. But it only needs to check if it's zero when suspending the raid. So we can switch atomic to percpu to improve the performance. After switching active_io to percpu type, we use the state of active_io to judge if the raid device is suspended. And we don't need to wake up ->sb_wait in md_handle_request anymore. It's done in the callback function which is registered when initing active_io. The argument mddev->suspended is only used to count how many users are trying to set raid to suspend state. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: Factor out is_md_suspended helperXiao Ni1-5/+12
[ Upstream commit d19329133d25ad3dc32f8a62635692cb2f189014 ] This helper function will be used in next patch. It's easy for understanding. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03dm cache policy smq: ensure IO doesn't prevent cleaner policy progressJoe Thornber1-10/+18
commit 1e4ab7b4c881cf26c1c72b3f56519e03475486fb upstream. When using the cleaner policy to decommission the cache, there is never any writeback started from the cache as it is constantly delayed due to normal I/O keeping the device busy. Meaning @idle=false was always being passed to clean_target_met() Fix this by adding a specific 'cleaner' flag that is set when the cleaner policy is configured. This flag serves to always allow the cleaner's writeback work to be queued until the cache is decommissioned (even if the cache isn't idle). Reported-by: David Jeffery <djeffery@redhat.com> Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2") Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03dm raid: protect md_stop() with 'reconfig_mutex'Yu Kuai2-1/+5
[ Upstream commit 7d5fff8982a2199d49ec067818af7d84d4f95ca0 ] __md_stop_writes() and __md_stop() will modify many fields that are protected by 'reconfig_mutex', and all the callers will grab 'reconfig_mutex' except for md_stop(). Also, update md_stop() to make certain 'reconfig_mutex' is held using lockdep_assert_held(). Fixes: 9d09e663d550 ("dm: raid456 basic support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03dm raid: clean up four equivalent goto tags in raid_ctr()Yu Kuai1-18/+9
[ Upstream commit e74c874eabe2e9173a8fbdad616cd89c70eb8ffd ] There are four equivalent goto tags in raid_ctr(), clean them up to use just one. There is no functional change and this is preparation to fix raid_ctr()'s unprotected md_stop(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Stable-dep-of: 7d5fff8982a2 ("dm raid: protect md_stop() with 'reconfig_mutex'") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03dm raid: fix missing reconfig_mutex unlock in raid_ctr() error pathsYu Kuai1-2/+7
[ Upstream commit bae3028799dc4f1109acc4df37c8ff06f2d8f1a0 ] In the error paths 'bad_stripe_cache' and 'bad_check_reshape', 'reconfig_mutex' is still held after raid_ctr() returns. Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23dm: verity-loadpin: Add NULL pointer check for 'bdev' parameterMatthias Kaehlcke1-0/+3
commit 47f04616f2c9b2f4f0c9127e30ca515a078db591 upstream. Add a NULL check for the 'bdev' parameter of dm_verity_loadpin_is_bdev_trusted(). The function is called by loadpin_check(), which passes the block device that corresponds to the super block of the file system from which a file is being loaded. Generally a super_block structure has an associated block device, however that is not always the case (e.g. tmpfs). Cc: stable@vger.kernel.org # v6.0+ Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Link: https://lore.kernel.org/r/20230627202800.1.Id63f7f59536d20f1ab83e1abdc1fda1471c7d031@changeid Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23dm init: add dm-mod.waitfor to wait for asynchronously probed block devicesPeter Korsgaard1-1/+21
commit 035641b01e72af4f6c6cf22a4bdb5d7dfc4e8e8e upstream. Just calling wait_for_device_probe() is not enough to ensure that asynchronously probed block devices are available (E.G. mmc, usb), so add a "dm-mod.waitfor=<device1>[,..,<deviceN>]" parameter to get dm-init to explicitly wait for specific block devices before initializing the tables with logic similar to the rootwait logic that was introduced with commit cc1ed7542c8c ("init: wait for asynchronously scanned block devices"). E.G. with dm-verity on mmc using: dm-mod.waitfor="PARTLABEL=hash-a,PARTLABEL=root-a" [ 0.671671] device-mapper: init: waiting for all devices to be available before creating mapped devices [ 0.671679] device-mapper: init: waiting for device PARTLABEL=hash-a ... [ 0.710695] mmc0: new HS200 MMC card at address 0001 [ 0.711158] mmcblk0: mmc0:0001 004GA0 3.69 GiB [ 0.715954] mmcblk0boot0: mmc0:0001 004GA0 partition 1 2.00 MiB [ 0.722085] mmcblk0boot1: mmc0:0001 004GA0 partition 2 2.00 MiB [ 0.728093] mmcblk0rpmb: mmc0:0001 004GA0 partition 3 512 KiB, chardev (249:0) [ 0.738274] mmcblk0: p1 p2 p3 p4 p5 p6 p7 [ 0.751282] device-mapper: init: waiting for device PARTLABEL=root-a ... [ 0.751306] device-mapper: init: all devices available [ 0.751683] device-mapper: verity: sha256 using implementation "sha256-generic" [ 0.759344] device-mapper: ioctl: dm-0 (vroot) is ready [ 0.766540] VFS: Mounted root (squashfs filesystem) readonly on device 254:0. Signed-off-by: Peter Korsgaard <peter@korsgaard.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>