path: root/drivers/md
Age | Commit message | Author | Files | Lines
2023-05-24 | md: fix soft lockup in status_resync | Yu Kuai | 1 | -9/+9
[ Upstream commit 6efddf1e32e2a264694766ca485a4f5e04ee82a7 ] status_resync() calculates 'curr_resync - recovery_active' to show the user a progress bar like the following:
  [============>........]  resync = 61.4%
'curr_resync' and 'recovery_active' are updated in md_do_sync(), and status_resync() can read them concurrently, hence it's possible for 'curr_resync - recovery_active' to overflow to a huge number. In this case status_resync() gets stuck in the loop printing a large number of '=' characters, which ends up as a soft lockup. Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case, so that a resync in progress is still reported to the user. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
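A minimal sketch of the idea behind this fix (not the literal upstream diff; the locals and their handling are simplified): treat a wrapped difference between the two concurrently updated counters as "resync still active" instead of looping over it.

    /* Hedged sketch: curr_resync and recovery_active race with md_do_sync(),
     * so the subtraction can wrap; clamp it rather than drawing a huge bar. */
    unsigned long long curr   = mddev->curr_resync;
    unsigned long long active = atomic_read(&mddev->recovery_active);
    unsigned long long resync;

    if (curr < active)                      /* a racing update made it wrap */
            resync = MD_RESYNC_ACTIVE;      /* report "resync in progress"  */
    else
            resync = curr - active;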
2023-05-11 | dm: don't lock fs when the map is NULL in process of resume | Li Lingfeng | 1 | -1/+4
commit 38d11da522aacaa05898c734a1cec86f1e611129 upstream. Commit fa247089de99 ("dm: requeue IO if mapping table not yet available") added a check of whether the mapping table is available in the IO submission process. If the mapping table is unavailable, it returns BLK_STS_RESOURCE and requeues the IO. This can lead to the following deadlock problem:

  dm create                       mount
  ioctl(DM_DEV_CREATE_CMD)
  ioctl(DM_TABLE_LOAD_CMD)
                                  do_mount
                                   vfs_get_tree
                                    ext4_get_tree
                                     get_tree_bdev
                                      sget_fc
                                       alloc_super
                                        // got &s->s_umount
                                        down_write_nested(&s->s_umount, ...);
                                    ext4_fill_super
                                     ext4_load_super
                                      ext4_read_bh
                                       submit_bio
                                       // submit and wait io end
  ioctl(DM_DEV_SUSPEND_CMD)
  dev_suspend
   do_resume
    dm_suspend
     __dm_suspend
      lock_fs
       freeze_bdev
        get_active_super
         grab_super
          // wait for &s->s_umount
          down_write(&s->s_umount);
    dm_swap_table
     __bind
      // set md->map (can't get here)

IO will be continuously requeued while holding the lock since the mapping table is NULL; at the same time, the mapping table won't be set since the lock is not available. Like request-based DM, bio-based DM also has the same problem. It's not proper to just abort IO if the mapping table is not available. So clear DM_SKIP_LOCKFS_FLAG when the mapping table is NULL; this allows the DM table to be loaded and the IO submitted upon resume. Fixes: fa247089de99 ("dm: requeue IO if mapping table not yet available") Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | dm ioctl: fix nested locking in table_clear() to remove deadlock concern | Mike Snitzer | 1 | -3/+4
commit 3d32aaa7e66d5c1479a3c31d6c2c5d45dd0d3b89 upstream. syzkaller found the following problematic rwsem locking (with write lock already held):

  down_read+0x9d/0x450 kernel/locking/rwsem.c:1509
  dm_get_inactive_table+0x2b/0xc0 drivers/md/dm-ioctl.c:773
  __dev_status+0x4fd/0x7c0 drivers/md/dm-ioctl.c:844
  table_clear+0x197/0x280 drivers/md/dm-ioctl.c:1537

In table_clear, it first acquires a write lock (https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L1520):

  down_write(&_hash_lock);

Then before the lock is released at L1539, there is a path shown above:

  table_clear -> __dev_status -> dm_get_inactive_table -> down_read
  (https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L773)
  down_read(&_hash_lock);

It tries to acquire the same read lock again, resulting in the deadlock problem. Fix this by moving table_clear()'s __dev_status() call to after its up_write(&_hash_lock); Cc: stable@vger.kernel.org Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
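A rough sketch of the reordering described above (simplified, not the literal dm-ioctl.c diff; the lookup and error handling are condensed): query the device status only after the hash-table write lock has been dropped, so the nested down_read() inside __dev_status() cannot self-deadlock.

    static int table_clear(struct file *filp, struct dm_ioctl *param, size_t param_size)
    {
            struct hash_cell *hc;
            struct mapped_device *md;

            down_write(&_hash_lock);
            hc = __find_device_hash_cell(param);    /* simplified lookup */
            if (!hc) {
                    up_write(&_hash_lock);
                    return -ENXIO;
            }
            /* ... destroy the inactive table held by hc ... */
            md = hc->md;
            up_write(&_hash_lock);

            /* Moved here: __dev_status() takes down_read(&_hash_lock) itself,
             * so it must run only after the write lock is released. */
            __dev_status(md, param);
            dm_put(md);
            return 0;
    }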
2023-05-11 | dm flakey: fix a crash with invalid table line | Mikulas Patocka | 1 | -2/+2
commit 98dba02d9a93eec11bffbb93c7c51624290702d2 upstream. This command will crash with NULL pointer dereference:

  dmsetup create flakey --table \
    "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 2 corrupt_bio_byte 512"

Fix the crash by checking if arg_name is non-NULL before comparing it. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
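A hedged sketch of the guard this fix describes (condensed from the flakey constructor's feature-argument parsing, not the literal diff): dm_shift_arg() returns NULL once the argument list is exhausted, so arg_name must be tested before strcasecmp().

    const char *arg_name;

    arg_name = dm_shift_arg(as);    /* may be NULL when the args run out */
    if (arg_name && !strcasecmp(arg_name, "corrupt_bio_byte")) {
            /* ... parse the corrupt_bio_byte parameters ... */
    }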
2023-05-11 | dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path | Mike Snitzer | 1 | -3/+5
commit 6b79a428c02769f2a11f8ae76bf866226d134887 upstream. Otherwise the journal_io_cache will leak if dm_register_target() fails. Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | dm clone: call kmem_cache_destroy() in dm_clone_init() error path | Mike Snitzer | 1 | -0/+1
commit 6827af4a9a9f5bb664c42abf7c11af4978d72201 upstream. Otherwise the _hydration_cache will leak if dm_register_target() fails. Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
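The two error-path fixes above follow the same pattern; a simplified sketch using the dm-integrity case (cache and struct names are taken from the message or assumed, not the literal driver code): tear the slab cache back down when target registration fails instead of leaking it.

    static int __init dm_integrity_init(void)
    {
            int r;

            journal_io_cache = kmem_cache_create("integrity_journal_io",
                                                 sizeof(struct journal_io),
                                                 0, 0, NULL);
            if (!journal_io_cache)
                    return -ENOMEM;

            r = dm_register_target(&integrity_target);
            if (r < 0)
                    kmem_cache_destroy(journal_io_cache);  /* previously leaked */

            return r;
    }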
2023-05-11 | dm verity: fix error handling for check_at_most_once on FEC | Yeongjin Gil | 1 | -1/+1
commit e8c5d45f82ce0c238a4817739892fe8897a3dcc3 upstream. In verity_end_io(), if bi_status is not BLK_STS_OK, it can be returned directly. But if FEC is configured, it is desirable to correct the data page through verity_verify_io(), and the return value will be converted to a blk_status and passed to verity_finish_io(). However, when a bit is set in v->validated_blocks, verity_verify_io() skips verification regardless of an I/O error for the corresponding bio. In this case the I/O error cannot be returned properly, and as a result abnormal data could be read for the corresponding block. To fix this problem, when an I/O error occurs, do not skip verification even if the related bit is set in v->validated_blocks. Fixes: 843f38d382b1 ("dm verity: add 'check_at_most_once' option to only validate hashes once") Cc: stable@vger.kernel.org Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | md/raid5: Improve performance for sequential IO | Jan Kara | 1 | -1/+44
commit fc05e06e6098ca2c28f7a10da0e00aeea20fa59e upstream. Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the order in which requests for underlying disks are created. Since for large sequential IO the adding of requests frequently races with the md_raid5 thread submitting bios to underlying disks, this results in a change in IO pattern because intermediate states of the new order of request creation result in more, smaller, discontiguous requests. For RAID5 on top of three rotational disks our performance testing revealed this results in a regression in write throughput:

  iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R

  before 7e55c60acfbb:
           KB  reclen   write  rewrite    read   reread
    131072000       4  493670   525964  524575   513384
    131072000       8  540467   532880  512028   513703

  after 7e55c60acfbb:
           KB  reclen   write  rewrite    read   reread
    131072000       4  421785   456184  531278   509248
    131072000       8  459283   456354  528449   543834

To reduce the amount of discontiguous requests we can start generating requests with the stripe with the lowest chunk offset, as that has the best chance of being adjacent to IO queued previously. This improves the performance to:

           KB  reclen   write  rewrite    read   reread
    131072000       4  497682   506317  518043   514559
    131072000       8  514048   501886  506453   504319

restoring a big part of the regression. Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230417171537.17899-1-jack@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | md/raid10: fix null-ptr-deref in raid10_sync_request | Li Nan | 1 | -4/+4
commit a405c6f0229526160aa3f177f65e20c86fce84c5 upstream. init_resync() inits the mempool and sets conf->have_replacement at the beginning of sync; close_sync() frees the mempool when sync is completed. After [1], recovery might be skipped and init_resync() is called but close_sync() is not. A null-ptr-deref occurs with r10bio->dev[i].repl_bio. The following is one way to reproduce the issue:

  1) create an array, wait for resync to complete, mddev->recovery_cp is set to MaxSector.
  2) recovery is woken and it is skipped. conf->have_replacement is set to 0 in init_resync(). close_sync() is not called.
  3) some io errors occur and rdev A is set to WantReplacement.
  4) a new device is added and set to A's replacement.
  5) recovery is woken, A has a replacement, but conf->have_replacement is 0. r10bio->dev[i].repl_bio will not be allocated and a null-ptr-deref occurs.

Fix it by not calling init_resync() if recovery was skipped.

  [1] commit 7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled")

Fixes: 7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled") Cc: stable@vger.kernel.org Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230222041000.3341651-3-linan666@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | md/raid10: don't call bio_start_io_acct twice for bio which experienced read error | Yu Kuai | 1 | -1/+3
[ Upstream commit 7cddb055bfda5f7b0be931e8ea750fc28bc18a27 ] handle_read_error() will resubmit r10_bio via raid10_read_request(), which will call bio_start_io_acct() again, while bio_end_io_acct() will only be called once. Fix the problem by not accounting io again from handle_read_error(). Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11 | md/raid10: fix memleak of md thread | Yu Kuai | 1 | -3/+3
[ Upstream commit f0ddb83da3cbbf8a1f9087a642c448ff52ee9abd ] In raid10_run(), if setup_conf() succeeds and raid10_run() fails before setting 'mddev->thread', then in the error path 'conf->thread' is not freed. Fix the problem by setting 'mddev->thread' right after setup_conf(). Fixes: 43a521238aca ("md-cluster: choose correct label when clustered layout is not supported") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11 | md/raid10: fix memleak for 'conf->bio_split' | Yu Kuai | 1 | -20/+17
[ Upstream commit c9ac2acde53f5385de185bccf6aaa91cf9ac1541 ] In the error path of raid10_run(), 'conf' needs to be freed; however, 'conf->bio_split' is missed and memory will be leaked. Since there are 3 places that free 'conf', factor out a helper to fix the problem. Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11 | md/raid10: fix leak of 'r10bio->remaining' for recovery | Yu Kuai | 1 | -10/+13
[ Upstream commit 26208a7cffd0c7cbf14237ccd20c7270b3ffeb7e ] raid10_sync_request() will add to 'r10bio->remaining' for both the rdev and the replacement rdev. However, if the read io fails, recovery_request_write() returns without issuing the write io; in this case end_sync_request() is only called once and 'remaining' is leaked, causing an io hang. Fix the problem by decreasing 'remaining' according to whether 'bio' and 'repl_bio' are valid. Fixes: 24afd80d99f8 ("md/raid10: handle recovery of replacement devices.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11 | md/raid10: fix task hung in raid10d | Li Nan | 1 | -5/+13
[ Upstream commit 72c215ed8731c88b2d7e09afc51fffc207ae47b8 ] commit fe630de009d0 ("md/raid10: avoid deadlock on recovery.") allowed normal io and sync io to exist at the same time. A task hang will occur as below:

  T1                            T2                      T3                      T4
  raid10d
   handle_read_error
    allow_barrier
     conf->nr_pending-- -> 0
                                // submit sync io
                                raid10_sync_request
                                 raise_barrier
                                  -> will not be blocked
                                  ...
                                                        // submit to drivers
                                                        raid10_read_request
                                                         wait_barrier
                                                          conf->nr_pending++ -> 1
                                                                                // retry read fail
                                                                                raid10_end_read_request
                                                                                 reschedule_retry
                                                                                  add to retry_list
                                                                                  conf->nr_queued++ -> 1
                                // sync io fail
                                end_sync_read
                                 __end_sync_read
                                  reschedule_retry
                                   add to retry_list
                                   conf->nr_queued++ -> 2
  ...
  handle_read_error
   get from retry_list
   conf->nr_queued--
   freeze_array
    wait nr_pending == nr_queued+1
         ->1           ->2
    // task hung

Retried reads and sync io will be added to retry_list (nr_queued -> 2) if they fail. raid10d() called handle_read_error() and hung in freeze_array(): nr_queued will not decrease because raid10d is blocked, and nr_pending will not increase because conf->barrier is not released. Fix it by moving allow_barrier() after raid10_read_request(). raise_barrier() will wait for nr_waiting to become 0; therefore, sync io and regular io will not be issued at the same time. Also remove the check of nr_queued in stop_waiting_barrier: it can be 0 but does not need to block. Remove the check for MD_RECOVERY_RUNNING as the check is redundant. Fixes: fe630de009d0 ("md/raid10: avoid deadlock on recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11 | blk-crypto: make blk_crypto_evict_key() return void | Eric Biggers | 1 | -14/+5
commit 70493a63ba04f754f7a7dd53a4fcc82700181490 upstream. blk_crypto_evict_key() is only called in contexts such as inode eviction where failure is not an option. So there is nothing the caller can do with errors except log them. (dm-table.c does "use" the error code, but only to pass on to upper layers, so it doesn't really count.) Just make blk_crypto_evict_key() return void and log errors itself. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11 | blk-crypto: don't use struct request_queue for public interfaces | Christoph Hellwig | 1 | -1/+1
commit fce3caea0f241f5d34855c82c399d5e0e2d91f07 upstream. Switch all public blk-crypto interfaces to use struct block_device arguments to specify the device they operate on instead of the request_queue, which is a block layer implementation detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-13 | dm: fix improper splitting for abnormal bios | Mike Snitzer | 1 | -3/+4
[ Upstream commit f7b58a69fad9d2c4c90cab0247811155dd0d48e7 ] "Abnormal" bios include discards, write zeroes and secure erase. By no longer passing the calculated 'len' pointer, commit 7dd06a2548b2 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios") took a senseless approach to disallowing dm_accept_partial_bio() from working for duplicate bios processed using __send_duplicate_bios(). It inadvertently and incorrectly stopped the use of 'len' when initializing a target's io (in alloc_tio). As such the resulting tio could address more area of a device than it should. For example, when discarding an entire DM striped device with the following DM table:

  vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
  vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before this fix:

  device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
  blkdiscard: attempt to access beyond end of device
  loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920
  device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
  blkdiscard: attempt to access beyond end of device
  loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

After this fix:

  device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
  device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872

Fixes: 7dd06a2548b2 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios") Cc: stable@vger.kernel.org Reported-by: Orange Kao <orange@aiven.io> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-13 | dm: change "unsigned" to "unsigned int" | Heinz Mauelshagen | 76 | -972/+972
[ Upstream commit 86a3238c7b9b759cb864f4f768ab2e24687dc0e6 ] Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Stable-dep-of: f7b58a69fad9 ("dm: fix improper splitting for abnormal bios") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-13 | dm integrity: Remove bi_sector that's only used by commented debug code | Jiapeng Chong | 1 | -7/+0
[ Upstream commit 5cd6d1d53a1f74222e73d8b42ab7ecf28ee2f34f ] drivers/md/dm-integrity.c:1738:13: warning: variable 'bi_sector' set but not used. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3895 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Stable-dep-of: f7b58a69fad9 ("dm: fix improper splitting for abnormal bios") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-13 | dm cache: Add some documentation to dm-cache-background-tracker.h | Joe Thornber | 1 | -3/+37
[ Upstream commit 22c40e134c4c7a828ac09d25a5a8597b1e45c031 ] Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Stable-dep-of: f7b58a69fad9 ("dm: fix improper splitting for abnormal bios") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-06 | dm: fix __send_duplicate_bios() to always allow for splitting IO | Mike Snitzer | 1 | -0/+2
commit 666eed46769d929c3e13636134ecfc67d75ef548 upstream. Commit 7dd76d1feec70 ("dm: improve bio splitting and associated IO accounting") only called setup_split_accounting() from __send_duplicate_bios() if a single bio were being issued. But the case where duplicate bios are issued must call it too. Otherwise the bio won't be split and resubmitted (via recursion through block core back to DM) to submit the later portions of a bio (which may map to an entirely different target). For example, when discarding an entire DM striped device with the following DM table:

  vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
  vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before (broken, discards the first striped target's devices twice):

  device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
  device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
  device-mapper: striped: target_stripe=0, bdev=7:0, start=2049 len=22528
  device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=22528

After (works as expected):

  device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
  device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
  device-mapper: striped: target_stripe=0, bdev=7:2, start=2048 len=22528
  device-mapper: striped: target_stripe=1, bdev=7:3, start=2048 len=22528

Fixes: 7dd76d1feec70 ("dm: improve bio splitting and associated IO accounting") Cc: stable@vger.kernel.org Reported-by: Orange Kao <orange@aiven.io> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-06 | md: avoid signed overflow in slot_store() | NeilBrown | 1 | -0/+3
[ Upstream commit 3bc57292278a0b6ac4656cad94c14f2453344b57 ] slot_store() uses kstrtouint() to get a slot number, but stores the result in an "int" variable (by casting a pointer). This can result in a negative slot number if the unsigned int value is very large. A negative number means that the slot is empty, but setting a negative slot number this way will not remove the device from the array. I don't think this is a serious problem, but it could cause confusion and it is best to fix it. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
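An illustrative sketch of the kind of guard this describes (simplified, not the literal slot_store() diff; the error code is an assumption): parse into an unsigned variable and reject values that would go negative once stored in the int slot field.

    unsigned int slot;
    int err = kstrtouint(buf, 10, &slot);

    if (err < 0)
            return err;
    if (slot > INT_MAX)     /* would wrap to a negative slot number */
            return -ENOSPC; /* assumed errno, for illustration only */
    /* ... store slot and update the array ... */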
2023-03-30 | dm crypt: avoid accessing uninitialized tasklet | Mike Snitzer | 1 | -6/+9
commit d9a02e016aaf5a57fb44e9a5e6da8ccd3b9e2e70 upstream. When neither "no_read_workqueue" nor "no_write_workqueue" are enabled, tasklet_trylock() in crypt_dec_pending() may still return false due to an uninitialized state, and dm-crypt will unnecessarily do io completion in the io_queue workqueue instead of the current context. Fix this by adding an 'in_tasklet' flag to the dm_crypt_io struct and initializing it to false in crypt_io_init(). Set this flag to true in kcryptd_queue_crypt() before calling tasklet_schedule(). If set, crypt_dec_pending() will punt io completion to a workqueue. This also nicely avoids the tasklet_trylock/unlock hack when tasklets aren't in use. Fixes: 8e14f610159d ("dm crypt: do not call bio_endio() from the dm-crypt tasklet") Cc: stable@vger.kernel.org Reported-by: Hou Tao <houtao1@huawei.com> Suggested-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-30 | dm crypt: add cond_resched() to dmcrypt_write() | Mikulas Patocka | 1 | -0/+1
commit fb294b1c0ba982144ca467a75e7d01ff26304e2b upstream. The loop in dmcrypt_write may run for an unbounded amount of time, thus we need cond_resched() in it. This commit fixes the following warning:

  [ 3391.153255][  C12] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [dmcrypt_write/2:2897]
  ...
  [ 3391.387210][  C12] Call trace:
  [ 3391.390338][  C12]  blk_attempt_bio_merge.part.6+0x38/0x158
  [ 3391.395970][  C12]  blk_attempt_plug_merge+0xc0/0x1b0
  [ 3391.401085][  C12]  blk_mq_submit_bio+0x398/0x550
  [ 3391.405856][  C12]  submit_bio_noacct+0x308/0x380
  [ 3391.410630][  C12]  dmcrypt_write+0x1e4/0x208 [dm_crypt]
  [ 3391.416005][  C12]  kthread+0x130/0x138
  [ 3391.419911][  C12]  ret_from_fork+0x10/0x18

Reported-by: yangerkun <yangerkun@huawei.com> Fixes: dc2676210c42 ("dm crypt: offload writes to thread") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
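A minimal sketch of the fix pattern (the bio-popping helper is hypothetical; the real dmcrypt_write() walks a tree of pending writes): yield the CPU on every iteration of the potentially unbounded submission loop.

    /* Hedged sketch, not the literal dm-crypt code. */
    while ((bio = pop_next_pending_write(cc)) != NULL) {   /* hypothetical helper */
            submit_bio_noacct(bio);
            cond_resched();     /* the added call: give the scheduler a chance */
    }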
2023-03-30 | dm stats: check for and propagate alloc_percpu failure | Jiasheng Jiang | 3 | -3/+10
commit d3aa3e060c4a80827eb801fc448debc9daa7c46b upstream. Check alloc_percpu()'s return value and return an error from dm_stats_init() if it fails. Update alloc_dev() to fail if dm_stats_init() does. Otherwise, a NULL pointer dereference will occur in dm_stats_cleanup() even if dm-stats isn't being actively used. Fixes: fd2ed4d25270 ("dm: add statistics support") Cc: stable@vger.kernel.org Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
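A hedged sketch of the check being described (simplified; the real dm_stats_init() also sets up lists and locks, and its callers must now handle the new return value): report -ENOMEM instead of leaving a NULL per-cpu pointer to be dereferenced later.

    int dm_stats_init(struct dm_stats *stats)
    {
            /* ... list and lock setup ... */
            stats->last = alloc_percpu(struct dm_stats_last_position);
            if (!stats->last)
                    return -ENOMEM;   /* previously ignored, later dereferenced */
            return 0;
    }

With this shape, alloc_dev() checks the return value and fails device creation instead of crashing in dm_stats_cleanup().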
2023-03-30 | dm thin: fix deadlock when swapping to thin device | Coly Li | 1 | -0/+2
commit 9bbf5feecc7eab2c370496c1c161bbfe62084028 upstream. This is an already known issue: a dm-thin volume cannot be used as swap, otherwise a deadlock may happen when dm-thin internal memory demand triggers swap I/O on the dm-thin volume itself. But thanks to commit a666e5c05e7c ("dm: fix deadlock when swapping to encrypted device"), the limit_swap_bios target flag can also be used for dm-thin to avoid the recursive I/O when it is used as swap. The fix is to simply set ti->limit_swap_bios to true in both pool_ctr() and thin_ctr(). In my test, I create a dm-thin volume /dev/vg/swap and use it as the swap device. Then I run fio on another dm-thin volume /dev/vg/main and use a large --blocksize to trigger swap I/O onto /dev/vg/swap. The following fio command line is used in my test:

  fio --name recursive-swap-io --lockmem 1 --iodepth 128 \
      --ioengine libaio --filename /dev/vg/main --rw randrw \
      --blocksize 1M --numjobs 32 --time_based --runtime=12h

Without this fix, the whole system can be locked up within 15 seconds. With this fix, there is no deadlock or hung task observed after 2 hours of running fio. Furthermore, if blocksize is changed from 1M to 128M, after around 30 seconds fio has no visible I/O, and the out-of-memory killer message shows up in the kernel log. After around 20 minutes all fio processes are killed and the whole system is back to being alive. This is exactly what is expected when recursive I/O happens on a dm-thin volume used as swap. Depends-on: a666e5c05e7c ("dm: fix deadlock when swapping to encrypted device") Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
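A minimal sketch of what the message describes (constructor bodies elided): opt the thin targets into the existing limit_swap_bios mechanism.

    static int thin_ctr(struct dm_target *ti, unsigned int argc, char **argv)
    {
            /* ... existing thin target setup ... */
            ti->limit_swap_bios = true;   /* avoid recursive swap I/O deadlock */
            return 0;
    }
    /* pool_ctr() gets the same one-line assignment. */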
2023-03-22 | md: select BLOCK_LEGACY_AUTOLOAD | NeilBrown | 1 | -0/+4
commit 6c0f5898836c05c6d850a750ed7940ba29e4e6c5 upstream. When BLOCK_LEGACY_AUTOLOAD is not enabled, mdadm is not able to activate new arrays unless "CREATE names=yes" appears in mdadm.conf. As this is a regression we need to always enable BLOCK_LEGACY_AUTOLOAD when MD is selected - at least until mdadm is updated and the updates are widely available. Cc: stable@vger.kernel.org # v5.18+ Fixes: fbdee71bb5d8 ("block: deprecate autoloading based on dev_t") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm flakey: fix a bug with 32-bit highmem systems | Mikulas Patocka | 1 | -1/+2
commit 8eb29c4fbf9661e6bd4dd86197a37ffe0ecc9d50 upstream. The function page_address does not work with 32-bit systems with high memory. Use bvec_kmap_local/kunmap_local instead. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
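A brief sketch of the substitution the message describes (the surrounding corruption logic is elided): map the bvec with the kmap-local helpers so highmem pages are handled correctly instead of calling page_address().

    struct bio_vec bvec = bio_iovec(bio);
    char *data = bvec_kmap_local(&bvec);   /* safe on 32-bit highmem */

    /* ... corrupt the chosen byte in data[] ... */

    kunmap_local(data);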
2023-03-10 | dm flakey: don't corrupt the zero page | Mikulas Patocka | 1 | -2/+5
commit f50714b57aecb6b3dc81d578e295f86d9c73f078 upstream. When we need to zero some range on a block device, the function __blkdev_issue_zero_pages submits a write bio with the bio vector pointing to the zero page. If we use dm-flakey with corrupt bio writes option, it will corrupt the content of the zero page which results in crashes of various userspace programs. Glibc assumes that memory returned by mmap is zeroed and it uses it for calloc implementation; if the newly mapped memory is not zeroed, calloc will return non-zeroed memory. Fix this bug by testing if the page is equal to ZERO_PAGE(0) and avoiding the corruption in this case. Cc: stable@vger.kernel.org Fixes: a00f5276e266 ("dm flakey: Properly corrupt multi-page bios.") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
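A hedged sketch of the guard (simplified from the description above, not the literal diff): skip the corruption when a write bio's bvec points at the shared zero page that __blkdev_issue_zero_pages() uses.

    struct bio_vec bvec = bio_iovec(bio);

    if (bvec.bv_page == ZERO_PAGE(0))
            return;   /* never scribble on the global zero page */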
2023-03-10 | dm cache: free background tracker's queued work in btracker_destroy | Joe Thornber | 1 | -0/+8
commit 95ab80a8a0fef2ce0cc494a306dd283948066ce7 upstream. Otherwise the kernel can BUG with: [ 2245.426978] ============================================================================= [ 2245.435155] BUG bt_work (Tainted: G B W ): Objects remaining in bt_work on __kmem_cache_shutdown() [ 2245.445233] ----------------------------------------------------------------------------- [ 2245.445233] [ 2245.454879] Slab 0x00000000b0ce2b30 objects=64 used=2 fp=0x000000000a3c6a4e flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff) [ 2245.467300] CPU: 7 PID: 10805 Comm: lvm Kdump: loaded Tainted: G B W 6.0.0-rc2 #19 [ 2245.476078] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.5.6 10/06/2021 [ 2245.483646] Call Trace: [ 2245.486100] <TASK> [ 2245.488206] dump_stack_lvl+0x34/0x48 [ 2245.491878] slab_err+0x95/0xcd [ 2245.495028] __kmem_cache_shutdown.cold+0x31/0x136 [ 2245.499821] kmem_cache_destroy+0x49/0x130 [ 2245.503928] btracker_destroy+0x12/0x20 [dm_cache] [ 2245.508728] smq_destroy+0x15/0x60 [dm_cache_smq] [ 2245.513435] dm_cache_policy_destroy+0x12/0x20 [dm_cache] [ 2245.518834] destroy+0xc0/0x110 [dm_cache] [ 2245.522933] dm_table_destroy+0x5c/0x120 [dm_mod] [ 2245.527649] __dm_destroy+0x10e/0x1c0 [dm_mod] [ 2245.532102] dev_remove+0x117/0x190 [dm_mod] [ 2245.536384] ctl_ioctl+0x1a2/0x290 [dm_mod] [ 2245.540579] dm_ctl_ioctl+0xa/0x20 [dm_mod] [ 2245.544773] __x64_sys_ioctl+0x8a/0xc0 [ 2245.548524] do_syscall_64+0x5c/0x90 [ 2245.552104] ? syscall_exit_to_user_mode+0x12/0x30 [ 2245.556897] ? do_syscall_64+0x69/0x90 [ 2245.560648] ? do_syscall_64+0x69/0x90 [ 2245.564394] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 2245.569447] RIP: 0033:0x7fe52583ec6b ... [ 2245.646771] ------------[ cut here ]------------ [ 2245.651395] kmem_cache_destroy bt_work: Slab cache still has objects when called from btracker_destroy+0x12/0x20 [dm_cache] [ 2245.651408] WARNING: CPU: 7 PID: 10805 at mm/slab_common.c:478 kmem_cache_destroy+0x128/0x130 Found using: lvm2-testsuite --only "cache-single-split.sh" Ben bisected and found that commit 0495e337b703 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete cleanup of its background tracker work objects. Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm flakey: fix logic when corrupting a bio | Mikulas Patocka | 1 | -10/+13
commit aa56b9b75996ff4c76a0a4181c2fa0206c3d91cc upstream. If "corrupt_bio_byte" is set to corrupt reads and corrupt_bio_flags is used, dm-flakey would erroneously return all writes as errors. Likewise, if "corrupt_bio_byte" is set to corrupt writes, dm-flakey would return errors for all reads. Fix the logic so that if fc->corrupt_bio_byte is non-zero, dm-flakey will not abort reads on writes with an error. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm: add cond_resched() to dm_wq_requeue_work() | Mike Snitzer | 1 | -0/+1
commit f77692d65d54665d81815349cc727baa85e8b71d upstream. Otherwise the while() loop in dm_wq_requeue_work() can result in a "dead loop" on systems that have preemption disabled. This is particularly problematic on single cpu systems. Fixes: 8b211aaccb915 ("dm: add two stage requeue mechanism") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm: add cond_resched() to dm_wq_work() | Pingfan Liu | 1 | -0/+1
commit 0ca44fcef241768fd25ee763b3d203b9852f269b upstream. Otherwise the while() loop in dm_wq_work() can result in a "dead loop" on systems that have preemption disabled. This is particularly problematic on single cpu systems. Cc: stable@vger.kernel.org Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm: send just one event on resize, not two | Mikulas Patocka | 3 | -18/+24
commit 7533afa1d27ba1234146d31d2402c195cf195962 upstream. Device mapper sends an uevent when the device is suspended, using the function set_capacity_and_notify. However, this causes a race condition with udev. Udev skips scanning dm devices that are suspended. If we send an uevent while we are suspended, udev will be racing with device mapper resume code. If the device mapper resume code wins the race, udev will process the uevent after the device is resumed and it will properly scan the device. However, if udev wins the race, it will receive the uevent, find out that the dm device is suspended and skip scanning the device. This causes bugs such as systemd unmounting the device - see https://bugzilla.redhat.com/show_bug.cgi?id=2158628 This commit fixes this race. We replace the function set_capacity_and_notify with set_capacity, so that the uevent is not sent at this point. In do_resume, we detect if the capacity has changed and we pass a boolean variable need_resize_uevent to dm_kobject_uevent. dm_kobject_uevent adds "RESIZE=1" to the uevent if need_resize_uevent is set. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Peter Rajnoha <prajnoha@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | md: don't update recovery_cp when curr_resync is ACTIVE | Hou Tao | 1 | -1/+1
commit 1d1f25bfda432a6b61bd0205d426226bbbd73504 upstream. Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise md may skip the resync of the first 3 sectors if the resync procedure is interrupted before the first calling of ->sync_request() as shown below:

  md_do_sync thread                          control thread
  // setup resync
  mddev->recovery_cp = 0
  j = 0
  mddev->curr_resync = MD_RESYNC_ACTIVE
                                             // e.g., set array as idle
                                             set_bit(MD_RECOVERY_INTR, &&mddev_recovery)
  // resync loop
  // check INTR before calling sync_request
  !test_bit(MD_RECOVERY_INTR, &mddev->recovery
  // resync interrupted
  // update recovery_cp from 0 to 3
  // the resync of three 3 sectors will be skipped
  mddev->recovery_cp = 3

Fixes: eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync") Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-10 | dm cache: add cond_resched() to various workqueue loops | Mike Snitzer | 1 | -0/+4
[ Upstream commit 76227f6dc805e9e960128bcc6276647361e0827c ] Otherwise on resource constrained systems these workqueues may be too greedy. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10 | dm thin: add cond_resched() to various workqueue loops | Mike Snitzer | 1 | -0/+2
[ Upstream commit e4f80303c2353952e6e980b23914e4214487f2a6 ] Otherwise on resource constrained systems these workqueues may be too greedy. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10 | dm: remove flush_scheduled_work() during local_exit() | Mike Snitzer | 1 | -1/+0
[ Upstream commit 0b22ff5360f5c4e11050b89206370fdf7dc0a226 ] Commit acfe0ad74d2e1 ("dm: allocate a special workqueue for deferred device removal") switched from using system workqueue to a single workqueue local to DM. But it didn't eliminate the call to flush_scheduled_work() that was introduced purely for the benefit of deferred device removal with commit 2c140a246dc ("dm: allow remove to be deferred"). Since DM core uses its own workqueue (and queue_work) there is no need to call flush_scheduled_work() from local_exit(). local_exit()'s destroy_workqueue(deferred_remove_workqueue) handles flushing work started with queue_work(). Fixes: acfe0ad74d2e1 ("dm: allocate a special workqueue for deferred device removal") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10 | dm: improve shrinker debug names | Mike Snitzer | 2 | -2/+2
[ Upstream commit c87791bcc455a91e51ca9800faaacc21c8d67785 ] Commit e33c267ab70d ("mm: shrinkers: provide shrinkers with names") chose some fairly bad names for DM's shrinkers. Fixes: e33c267ab70d ("mm: shrinkers: provide shrinkers with names") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09 | bcache: Silence memcpy() run-time false positive warnings | Kees Cook | 2 | -2/+4
[ Upstream commit be0d8f48ad97f5b775b0af3310343f676dbf318a ] struct bkey has internal padding in a union, but it isn't always named the same (e.g. key ## _pad, key_p, etc). This makes it extremely hard for the compiler to reason about the available size of copies done against such keys. Use unsafe_memcpy() for now, to silence the many run-time false positive warnings: memcpy: detected field-spanning write (size 264) of single field "&i->j" at drivers/md/bcache/journal.c:152 (size 240) memcpy: detected field-spanning write (size 24) of single field "&b->key" at drivers/md/bcache/btree.c:939 (size 16) memcpy: detected field-spanning write (size 24) of single field "&temp.key" at drivers/md/bcache/extents.c:428 (size 16) Reported-by: Alexandre Pereira <alexpereira@disroot.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216785 Acked-by: Coly Li <colyli@suse.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-bcache@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230106060229.never.047-kees@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-18 | block: handle bio_split_to_limits() NULL return | Jens Axboe | 2 | -0/+4
commit 613b14884b8595e20b9fac4126bf627313827fbe upstream. This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | md/bitmap: Fix bitmap chunk size overflow issues | Florian-Ewald Mueller | 1 | -8/+12
commit 4555211190798b6b6fa2c37667d175bf67945c78 upstream.

  - limit the bitmap chunk size internal u64 variable to values not overflowing the u32 bitmap superblock structure variable stored on persistent media
  - assign the bitmap chunk size internal u64 variable from unsigned values to avoid possible sign extension artifacts when assigning from a s32 value

The bug has been there since at least kernel 4.0. Steps to reproduce it:

  1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2
  2: resize member devices rnbd1 and rnbd2 to 8 TB
  3: mdadm --grow /dev/mdx --size=max

The bitmap_chunksize will overflow without the patch. Cc: stable@vger.kernel.org Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
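An illustrative sketch of the two points above (field names and error code chosen for illustration, not the literal md-bitmap.c code): keep the working value in an unsigned 64-bit variable sourced from unsigned fields, and refuse anything that cannot be stored in the on-disk 32-bit superblock field.

    /* Hedged sketch: the working copy is u64 and comes from unsigned sources. */
    unsigned long long chunksize = mddev->bitmap_info.chunksize;

    if (chunksize != (u32)chunksize)      /* would overflow sb->chunksize */
            return -EOVERFLOW;            /* assumed errno, for illustration */
    sb->chunksize = cpu_to_le32(chunksize);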
2023-01-07 | dm cache: set needs_check flag after aborting metadata | Mike Snitzer | 1 | -5/+5
commit 6b9973861cb2e96dcd0bb0f1baddc5c034207c5c upstream. Otherwise the commit that will be aborted will be associated with the metadata objects that will be torn down. Must write needs_check flag to metadata with a reset block manager. Found through code-inspection (and compared against dm-thin.c). Cc: stable@vger.kernel.org Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm cache: Fix UAF in destroy() | Luo Meng | 1 | -0/+1
commit 6a459d8edbdbe7b24db42a5a9f21e6aa9e00c2aa upstream. dm-cache also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancel the timer again in destroy(). Cc: stable@vger.kernel.org Fixes: c6b4fcbad044e ("dm: add cache target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm clone: Fix UAF in clone_dtr() | Luo Meng | 1 | -0/+1
commit e4b5957c6f749a501c464f92792f1c8e26b61a94 upstream. dm-clone also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancel the timer again in clone_dtr(). Cc: stable@vger.kernel.org Fixes: 7431b7835f554 ("dm: add clone target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm integrity: Fix UAF in dm_integrity_dtr() | Luo Meng | 1 | -0/+2
commit f50cb2cbabd6c4a60add93d72451728f86e4791c upstream. dm-integrity also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancel the timer again in dm_integrity_dtr(). Cc: stable@vger.kernel.org Fixes: 7eada909bfd7a ("dm: add integrity target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm thin: Fix UAF in run_timer_softirq() | Luo Meng | 1 | -0/+2
commit 88430ebcbc0ec637b710b947738839848c20feff upstream. When dm_resume() and dm_destroy() are concurrent, it will lead to UAF, as follows: BUG: KASAN: use-after-free in __run_timers+0x173/0x710 Write of size 8 at addr ffff88816d9490f0 by task swapper/0/0 <snip> Call Trace: <IRQ> dump_stack_lvl+0x73/0x9f print_report.cold+0x132/0xaa2 _raw_spin_lock_irqsave+0xcd/0x160 __run_timers+0x173/0x710 kasan_report+0xad/0x110 __run_timers+0x173/0x710 __asan_store8+0x9c/0x140 __run_timers+0x173/0x710 call_timer_fn+0x310/0x310 pvclock_clocksource_read+0xfa/0x250 kvm_clock_read+0x2c/0x70 kvm_clock_get_cycles+0xd/0x20 ktime_get+0x5c/0x110 lapic_next_event+0x38/0x50 clockevents_program_event+0xf1/0x1e0 run_timer_softirq+0x49/0x90 __do_softirq+0x16e/0x62c __irq_exit_rcu+0x1fa/0x270 irq_exit_rcu+0x12/0x20 sysvec_apic_timer_interrupt+0x8e/0xc0 One of the concurrency UAF can be shown as below: use free do_resume | __find_device_hash_cell | dm_get | atomic_inc(&md->holders) | | dm_destroy | __dm_destroy | if (!dm_suspended_md(md)) | atomic_read(&md->holders) | msleep(1) dm_resume | __dm_resume | dm_table_resume_targets | pool_resume | do_waker #add delay work | dm_put | atomic_dec(&md->holders) | | dm_table_destroy | pool_dtr | __pool_dec | __pool_destroy | destroy_workqueue | kfree(pool) # free pool time out __do_softirq run_timer_softirq # pool has already been freed This can be easily reproduced using: 1. create thin-pool 2. dmsetup suspend pool 3. dmsetup resume pool 4. dmsetup remove_all # Concurrent with 3 The root cause of this UAF bug is that dm_resume() adds timer after dm_destroy() skips cancelling the timer because of suspend status. After timeout, it will call run_timer_softirq(), however pool has already been freed. The concurrency UAF bug will happen. Therefore, cancelling timer again in __pool_destroy(). Cc: stable@vger.kernel.org Fixes: 991d9fa02da0d ("dm: add thin provisioning target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
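The four UAF fixes above (dm thin, dm integrity, dm clone, dm cache) share one pattern; a hedged sketch using the dm-thin pool as the example (field names are assumptions based on the message, not the literal diff): cancel the delayed work again in the destructor, so a resume racing with destroy cannot leave a timer armed past the free.

    static void __pool_destroy(struct pool *pool)
    {
            /* ... existing teardown ... */

            /* Added by the fix: a concurrent dm_resume() may have re-armed
             * these, so cancel them before the pool memory goes away. */
            cancel_delayed_work_sync(&pool->waker);
            cancel_delayed_work_sync(&pool->no_space_timeout);

            destroy_workqueue(pool->wq);
            kfree(pool);
    }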
2023-01-07 | dm thin: resume even if in FAIL mode | Luo Meng | 1 | -4/+12
commit 19eb1650afeb1aa86151f61900e9e5f1de5d8d02 upstream. If a thinpool sets fail_io while suspending, resume will fail with:

  device-mapper: resume ioctl on vg-thinpool failed: Invalid argument

The thin-pool also can't be removed if an in-flight bio is in the deferred list. This can be easily reproduced using:

  echo "offline" > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1
  dmsetup suspend /dev/mapper/pool
  mkfs.ext4 /dev/mapper/thin
  dmsetup resume /dev/mapper/pool

The root cause is that maybe_resize_data_dev() will check fail_io and return an error before dm_resume is called. Fix this by adding a FAIL mode check at the end of pool_preresume(). Cc: stable@vger.kernel.org Fixes: da105ed5fd7e ("dm thin metadata: introduce dm_pool_abort_metadata") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm thin: Use last transaction's pmd->root when commit failed | Zhihao Cheng | 1 | -0/+9
commit 7991dbff6849f67e823b7cc0c15e5a90b0549b9f upstream. Recently we found a softlock up problem in dm thin pool btree lookup code due to corrupted metadata: Kernel panic - not syncing: softlockup: hung tasks CPU: 7 PID: 2669225 Comm: kworker/u16:3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: <IRQ> dump_stack+0x9c/0xd3 panic+0x35d/0x6b9 watchdog_timer_fn.cold+0x16/0x25 __run_hrtimer+0xa2/0x2d0 </IRQ> RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio] __bufio_new+0x11f/0x4f0 [dm_bufio] new_read+0xa3/0x1e0 [dm_bufio] dm_bm_read_lock+0x33/0xd0 [dm_persistent_data] ro_step+0x63/0x100 [dm_persistent_data] btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data] dm_btree_lookup+0x16f/0x210 [dm_persistent_data] dm_thin_find_block+0x12c/0x210 [dm_thin_pool] __process_bio_read_only+0xc5/0x400 [dm_thin_pool] process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool] process_one_work+0x3c5/0x730 Following process may generate a broken btree mixed with fresh and stale btree nodes, which could get dm thin trapped in an infinite loop while looking up data block: Transaction 1: pmd->root = A, A->B->C // One path in btree pmd->root = X, X->Y->Z // Copy-up Transaction 2: X,Z is updated on disk, Y write failed. // Commit failed, dm thin becomes read-only. process_bio_read_only dm_thin_find_block __find_block dm_btree_lookup(pmd->root) The pmd->root points to a broken btree, Y may contain stale node pointing to any block, for example X, which gets dm thin trapped into a dead loop while looking up Z. Fix this by setting pmd->root in __open_metadata(), so that dm thin will use the last transaction's pmd->root if commit failed. Fetch a reproducer in [Link]. Linke: https://bugzilla.kernel.org/show_bug.cgi?id=216790 Cc: stable@vger.kernel.org Fixes: 991d9fa02da0 ("dm: add thin provisioning target") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 | dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata | Zhihao Cheng | 1 | -8/+43
commit 8111964f1b8524c4bb56b02cd9c7a37725ea21fd upstream. Following concurrent processes: P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_read_bh_nowait submit_bh dm_submit_bio do_worker process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata down_write(&pmd->root_lock) - LOCK B __destroy_persistent_data_objects dm_block_manager_destroy dm_bufio_client_destroy unregister_shrinker down_write(&shrinker_rwsem) thin_map | dm_thin_find_block ↓ down_read(&pmd->root_lock) --> ABBA deadlock , which triggers hung task: [ 76.974820] INFO: task kworker/u4:3:63 blocked for more than 15 seconds. [ 76.976019] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.978521] task:kworker/u4:3 state:D stack:0 pid:63 ppid:2 [ 76.978534] Workqueue: dm-thin do_worker [ 76.978552] Call Trace: [ 76.978564] __schedule+0x6ba/0x10f0 [ 76.978582] schedule+0x9d/0x1e0 [ 76.978588] rwsem_down_write_slowpath+0x587/0xdf0 [ 76.978600] down_write+0xec/0x110 [ 76.978607] unregister_shrinker+0x2c/0xf0 [ 76.978616] dm_bufio_client_destroy+0x116/0x3d0 [ 76.978625] dm_block_manager_destroy+0x19/0x40 [ 76.978629] __destroy_persistent_data_objects+0x5e/0x70 [ 76.978636] dm_pool_abort_metadata+0x8e/0x100 [ 76.978643] metadata_operation_failed+0x86/0x110 [ 76.978649] commit+0x6a/0x230 [ 76.978655] do_worker+0xc6e/0xd90 [ 76.978702] process_one_work+0x269/0x630 [ 76.978714] worker_thread+0x266/0x630 [ 76.978730] kthread+0x151/0x1b0 [ 76.978772] INFO: task test.sh:2646 blocked for more than 15 seconds. [ 76.979756] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.982111] task:test.sh state:D stack:0 pid:2646 ppid:2459 [ 76.982128] Call Trace: [ 76.982139] __schedule+0x6ba/0x10f0 [ 76.982155] schedule+0x9d/0x1e0 [ 76.982159] rwsem_down_read_slowpath+0x4f4/0x910 [ 76.982173] down_read+0x84/0x170 [ 76.982177] dm_thin_find_block+0x4c/0xd0 [ 76.982183] thin_map+0x201/0x3d0 [ 76.982188] __map_bio+0x5b/0x350 [ 76.982195] dm_submit_bio+0x2b6/0x930 [ 76.982202] __submit_bio+0x123/0x2d0 [ 76.982209] submit_bio_noacct_nocheck+0x101/0x3e0 [ 76.982222] submit_bio_noacct+0x389/0x770 [ 76.982227] submit_bio+0x50/0xc0 [ 76.982232] submit_bh_wbc+0x15e/0x230 [ 76.982238] submit_bh+0x14/0x20 [ 76.982241] ext4_read_bh_nowait+0xc5/0x130 [ 76.982247] ext4_read_block_bitmap_nowait+0x340/0xc60 [ 76.982254] ext4_mb_init_cache+0x1ce/0xdc0 [ 76.982259] ext4_mb_load_buddy_gfp+0x987/0xfa0 [ 76.982263] ext4_discard_preallocations+0x45d/0x830 [ 76.982274] ext4_clear_inode+0x48/0xf0 [ 76.982280] ext4_evict_inode+0xcf/0xc70 [ 76.982285] evict+0x119/0x2b0 [ 76.982290] dispose_list+0x43/0xa0 [ 76.982294] prune_icache_sb+0x64/0x90 [ 76.982298] super_cache_scan+0x155/0x210 [ 76.982303] do_shrink_slab+0x19e/0x4e0 [ 76.982310] shrink_slab+0x2bd/0x450 [ 76.982317] drop_slab+0xcc/0x1a0 [ 76.982323] drop_caches_sysctl_handler+0xb7/0xe0 [ 76.982327] proc_sys_call_handler+0x1bc/0x300 [ 76.982331] proc_sys_write+0x17/0x20 [ 76.982334] vfs_write+0x3d3/0x570 [ 76.982342] ksys_write+0x73/0x160 [ 76.982347] __x64_sys_write+0x1e/0x30 [ 76.982352] do_syscall_64+0x35/0x80 [ 76.982357] entry_SYSCALL_64_after_hwframe+0x63/0xcd Function metadata_operation_failed() is called when operations failed on dm pool metadata, dm pool will destroy and recreate metadata. 
So, shrinker will be unregistered and registered, which could down write shrinker_rwsem under pmd_write_lock. Fix it by allocating dm_block_manager before locking pmd->root_lock and destroying old dm_block_manager after unlocking pmd->root_lock, then old dm_block_manager is replaced with new dm_block_manager under pmd->root_lock. So, shrinker register/unregister could be done without holding pmd->root_lock. Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216676 Cc: stable@vger.kernel.org #v5.2+ Fixes: e49e582965b3 ("dm thin: add read only and fail io modes") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>