summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)AuthorFilesLines
2023-09-23md: Put the right device in md_seq_nextMariusz Tkaczyk1-1/+1
commit c8870379a21fbd9ad14ca36204ccfbe9d25def43 upstream. If there are multiple arrays in system and one mddevice is marked with MD_DELETED and md_seq_next() is called in the middle of removal then it _get()s proper device but it may _put() deleted one. As a result, active counter may never be zeroed for mddevice and it cannot be removed. Put the device which has been _get with previous md_seq_next() call. Cc: stable@vger.kernel.org Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Reported-by: AceLan Kao <acelan@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798 Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13md: fix regression for null-ptr-deference in __md_stop()Yu Kuai1-1/+2
commit 433279beba1d4872da10b7b60a539e0cb828b32b upstream. Commit 3e453522593d ("md: Free resources in __md_stop") tried to fix null-ptr-deference for 'active_io' by moving percpu_ref_exit() to __md_stop(), however, the commit also moving 'writes_pending' to __md_stop(), and this will cause mdadm tests broken: BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37 RIP: 0010:free_percpu+0x465/0x670 Call Trace: <TASK> __percpu_ref_exit+0x48/0x70 percpu_ref_exit+0x1a/0x90 __md_stop+0xe9/0x170 do_md_stop+0x1e1/0x7b0 md_ioctl+0x90c/0x1aa0 blkdev_ioctl+0x19b/0x400 vfs_ioctl+0x20/0x50 __x64_sys_ioctl+0xba/0xe0 do_syscall_64+0x6c/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0xcd And the problem can be reporduced 100% by following test: mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force echo inactive > /sys/block/md0/md/array_state echo read-auto > /sys/block/md0/md/array_state echo inactive > /sys/block/md0/md/array_state Root cause: // start raid raid1_run mddev_init_writes_pending percpu_ref_init // inactive raid array_state_store do_md_stop __md_stop percpu_ref_exit // start raid again array_state_store do_md_run raid1_run mddev_init_writes_pending if (mddev->writes_pending.percpu_count_ptr) // won't reinit // inactive raid again ... percpu_ref_exit -> null-ptr-deference Before the commit, 'writes_pending' is exited when mddev is freed, and it's safe to restart raid because mddev_init_writes_pending() already make sure that 'writes_pending' will only be initialized once. Fix the prblem by moving 'writes_pending' back, it's a litter hard to find the relationship between alloc memory and free memory, however, code changes is much less and we lived with this for a long time already. Fixes: 3e453522593d ("md: Free resources in __md_stop") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13md: Free resources in __md_stopXiao Ni1-8/+5
commit 3e453522593d74a87cf68a38e14aa36ebca1dbcd upstream. If md_run() fails after ->active_io is initialized, then percpu_ref_exit is called in error path. However, later md_free_disk will call percpu_ref_exit again which leads to a panic because of null pointer dereference. It can also trigger this bug when resources are initialized but are freed in error path, then will be freed again in md_free_disk. BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP Workqueue: md_misc mddev_delayed_delete RIP: 0010:free_percpu+0x110/0x630 Call Trace: <TASK> __percpu_ref_exit+0x44/0x70 percpu_ref_exit+0x16/0x90 md_free_disk+0x2f/0x80 disk_release+0x101/0x180 device_release+0x84/0x110 kobject_put+0x12a/0x380 kobject_put+0x160/0x380 mddev_delayed_delete+0x19/0x30 process_one_work+0x269/0x680 worker_thread+0x266/0x640 kthread+0x151/0x1b0 ret_from_fork+0x1f/0x30 For creating raid device, md raid calls do_md_run->md_run, dm raid calls md_run. We alloc those memory in md_run. For stopping raid device, md raid calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can free those memory resources in __md_stop. Fixes: 72adae23a72c ("md: Change active_io to percpu") Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13md: add error_handlers for raid0 and linearMariusz Tkaczyk1-0/+3
[ Upstream commit c31fea2f8e2a72c817f318016bbc327095175a9f ] After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10") MD_BROKEN must be set if array is failed because state_store() checks it. If it is set then -EBUSY is returned to userspace. For raid0 and linear MD_BROKEN is not set by error_handler(). As a result mdadm is unable to trigger clean-up actions. It is a regression. This patch adds appropriate error_handler for raid0 and linear. The error handler sets MD_BROKEN for this device. Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com Stable-dep-of: 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: restore 'noio_flag' for the last mddev_resume()Yu Kuai1-2/+4
[ Upstream commit e24ed04389f9619e0aaef615a8948633c182a8b0 ] memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restore 'noio_flag' for the last mddev_resume(). Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: Change active_io to percpuXiao Ni1-19/+24
[ Upstream commit 72adae23a72cb12e2ef0dcd7c0aa042867f27998 ] Now the type of active_io is atomic. It's used to count how many ios are in the submitting process and it's added and decreased very time. But it only needs to check if it's zero when suspending the raid. So we can switch atomic to percpu to improve the performance. After switching active_io to percpu type, we use the state of active_io to judge if the raid device is suspended. And we don't need to wake up ->sb_wait in md_handle_request anymore. It's done in the callback function which is registered when initing active_io. The argument mddev->suspended is only used to count how many users are trying to set raid to suspend state. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13md: Factor out is_md_suspended helperXiao Ni1-5/+12
[ Upstream commit d19329133d25ad3dc32f8a62635692cb2f189014 ] This helper function will be used in next patch. It's easy for understanding. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03dm raid: protect md_stop() with 'reconfig_mutex'Yu Kuai1-0/+2
[ Upstream commit 7d5fff8982a2199d49ec067818af7d84d4f95ca0 ] __md_stop_writes() and __md_stop() will modify many fields that are protected by 'reconfig_mutex', and all the callers will grab 'reconfig_mutex' except for md_stop(). Also, update md_stop() to make certain 'reconfig_mutex' is held using lockdep_assert_held(). Fixes: 9d09e663d550 ("dm: raid456 basic support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19md/raid10: fix wrong setting of max_corr_read_errorsLi Nan1-0/+2
[ Upstream commit f8b20a405428803bd9881881d8242c9d72c6b2b2 ] There is no input check when echo md/max_read_errors and overflow might occur. Add check of input number. Fixes: 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19md/raid10: fix overflow of md/safe_mode_delayLi Nan1-3/+4
[ Upstream commit 6beb489b2eed25978523f379a605073f99240c50 ] There is no input check when echo md/safe_mode_delay in safe_delay_store(). And msec might also overflow when HZ < 1000 in safe_delay_show(), Fix it by checking overflow in safe_delay_store() and use unsigned long conversion in safe_delay_show(). Fixes: 72e02075a33f ("md: factor out parsing of fixed-point numbers") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24md: fix soft lockup in status_resyncYu Kuai1-9/+9
[ Upstream commit 6efddf1e32e2a264694766ca485a4f5e04ee82a7 ] status_resync() will calculate 'curr_resync - recovery_active' to show user a progress bar like following: [============>........] resync = 61.4% 'curr_resync' and 'recovery_active' is updated in md_do_sync(), and status_resync() can read them concurrently, hence it's possible that 'curr_resync - recovery_active' can overflow to a huge number. In this case status_resync() will be stuck in the loop to print a large amount of '=', which will end up soft lockup. Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case, this way resync in progress will be reported to user. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-06md: avoid signed overflow in slot_store()NeilBrown1-0/+3
[ Upstream commit 3bc57292278a0b6ac4656cad94c14f2453344b57 ] slot_store() uses kstrtouint() to get a slot number, but stores the result in an "int" variable (by casting a pointer). This can result in a negative slot number if the unsigned int value is very large. A negative number means that the slot is empty, but setting a negative slot number this way will not remove the device from the array. I don't think this is a serious problem, but it could cause confusion and it is best to fix it. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10md: don't update recovery_cp when curr_resync is ACTIVEHou Tao1-1/+1
commit 1d1f25bfda432a6b61bd0205d426226bbbd73504 upstream. Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise md may skip the resync of the first 3 sectors if the resync procedure is interrupted before the first calling of ->sync_request() as shown below: md_do_sync thread control thread // setup resync mddev->recovery_cp = 0 j = 0 mddev->curr_resync = MD_RESYNC_ACTIVE // e.g., set array as idle set_bit(MD_RECOVERY_INTR, &&mddev_recovery) // resync loop // check INTR before calling sync_request !test_bit(MD_RECOVERY_INTR, &mddev->recovery // resync interrupted // update recovery_cp from 0 to 3 // the resync of three 3 sectors will be skipped mddev->recovery_cp = 3 Fixes: eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync") Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18block: handle bio_split_to_limits() NULL returnJens Axboe1-0/+2
commit 613b14884b8595e20b9fac4126bf627313827fbe upstream. This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04md: fix a crash in mempool_freeMikulas Patocka1-3/+6
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream. There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh. The reason for the crash is this: * super_written calls atomic_dec_and_test(&mddev->pending_writes) and wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev) and bio_put(bio). * so, the process that waited on sb_wait and that is woken up is racing with bio_put(bio). * if the process wins the race, it calls bioset_exit before bio_put(bio) is executed. * bio_put(bio) attempts to free a bio into a destroyed bio set - causing a crash in mempool_free. We fix this bug by moving bio_put before atomic_dec_and_test. We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown. The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 11557f0067 P4D 11557f0067 PUD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: kdelayd flush_expired_bios [dm_delay] RIP: 0010:mempool_free+0x47/0x80 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05 FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0 Call Trace: <TASK> clone_endio+0xf4/0x1c0 [dm_mod] clone_endio+0xf4/0x1c0 [dm_mod] __submit_bio+0x76/0x120 submit_bio_noacct_nocheck+0xb6/0x2a0 flush_expired_bios+0x28/0x2f [dm_delay] process_one_work+0x1b4/0x300 worker_thread+0x45/0x3e0 ? rescuer_thread+0x380/0x380 kthread+0xc2/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd] CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds1-3/+2
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig1-2/+2
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-22md: Remove extra mddev_get() in md_seq_start()Logan Gunthorpe1-1/+0
A regression is seen where mddev devices stay permanently after they are stopped due to an elevated reference count. This was tracked down to an extra mddev_get() in md_seq_start(). It only happened rarely because most of the time the md_seq_start() is called with a zero offset. The path with an extra mddev_get() only happens when it starts with a non-zero offset. The commit noted below changed an mddev_get() to check its success but inadvertently left the original call in. Remove the extra call. Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-08-24md: call __md_stop_writes in md_stopGuoqing Jiang1-0/+1
From the link [1], we can see raid1d was running even after the path raid_dtr -> md_stop -> __md_stop. Let's stop write first in destructor to align with normal md-raid to fix the KASAN issue. [1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop") Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-08-24Revert "md-raid: destroy the bitmap after destroying the thread"Guoqing Jiang1-1/+1
This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Because it obviously breaks clustered raid as noticed by Neil though it fixed KASAN issue for dm-raid, let's revert it and fix KASAN issue in next commit. [1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470 Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread") Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-08-24md: Flush workqueue md_rdev_misc_wq in md_alloc()David Sloan1-0/+1
A race condition still exists when removing and re-creating md devices in test cases. However, it is only seen on some setups. The race condition was tracked down to a reference still being held to the kobject by the rdev in the md_rdev_misc_wq which will be released in rdev_delayed_delete(). md_alloc() waits for previous deletions by waiting on the md_misc_wq, but the md_rdev_misc_wq may still be holding a reference to a recently removed device. To fix this, also flush the md_rdev_misc_wq in md_alloc(). Signed-off-by: David Sloan <david.sloan@eideticom.com> [logang@deltatee.com: rewrote commit message] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-08-03block: change the blk_queue_split calling conventionChristoph Hellwig1-1/+1
The double indirect bio leads to somewhat suboptimal code generation. Instead return the (original or split) bio, and make sure the request_queue arguments to the lower level helpers is passed after the bio to avoid constant reshuffling of the argument passing registers. Also give it and the helpers used to implement it more descriptive names. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md-raid: destroy the bitmap after destroying the threadMikulas Patocka1-1/+1
When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with kasan, we got failure in write_page. The reason for the failure is that md_bitmap_destroy is called before destroying the thread and the thread may be waiting in the function write_page for the bio to complete. When the thread finishes waiting, it executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which triggers the kasan warning. Note that the commit 48df498daf62 that caused this bug claims that it is neede for md-cluster, you should check md-cluster and possibly find another bugfix for it. BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod] Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539 CPU: 10 PID: 5539 Comm: mdX_raid1 Not tainted 5.19.0-rc2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x45/0x57a ? __lock_text_start+0x18/0x18 ? write_page+0x18d/0x680 [md_mod] kasan_report+0xa8/0xe0 ? write_page+0x18d/0x680 [md_mod] kasan_check_range+0x13f/0x180 write_page+0x18d/0x680 [md_mod] ? super_sync+0x4d5/0x560 [dm_raid] ? md_bitmap_file_kick+0xa0/0xa0 [md_mod] ? rs_set_dev_and_array_sectors+0x2e0/0x2e0 [dm_raid] ? mutex_trylock+0x120/0x120 ? preempt_count_add+0x6b/0xc0 ? preempt_count_sub+0xf/0xc0 md_update_sb+0x707/0xe40 [md_mod] md_reap_sync_thread+0x1b2/0x4a0 [md_mod] md_check_recovery+0x533/0x960 [md_mod] raid1d+0xc8/0x2a20 [raid1] ? var_wake_function+0xe0/0xe0 ? psi_group_change+0x411/0x500 ? preempt_count_sub+0xf/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? raid1_end_read_request+0x2a0/0x2a0 [raid1] ? preempt_count_sub+0xf/0xc0 ? _raw_spin_unlock_irqrestore+0x19/0x40 ? del_timer_sync+0xa9/0x100 ? try_to_del_timer_sync+0xc0/0xc0 ? _raw_spin_lock_irqsave+0x78/0xc0 ? __lock_text_start+0x18/0x18 ? __list_del_entry_valid+0x68/0xa0 ? finish_wait+0xa3/0x100 md_thread+0x161/0x260 [md_mod] ? unregister_md_personality+0xa0/0xa0 [md_mod] ? _raw_spin_lock_irqsave+0x78/0xc0 ? prepare_to_wait_event+0x2c0/0x2c0 ? unregister_md_personality+0xa0/0xa0 [md_mod] kthread+0x148/0x180 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 5522: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x80/0xa0 md_bitmap_create+0xa8/0xe80 [md_mod] md_run+0x777/0x1300 [md_mod] raid_ctr+0x249c/0x4a30 [dm_raid] dm_table_add_target+0x2b0/0x620 [dm_mod] table_load+0x1c8/0x400 [dm_mod] ctl_ioctl+0x29e/0x560 [dm_mod] dm_compat_ctl_ioctl+0x7/0x20 [dm_mod] __do_compat_sys_ioctl+0xfa/0x160 do_syscall_64+0x90/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 5680: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x40 kasan_set_free_info+0x20/0x40 __kasan_slab_free+0xf7/0x140 kfree+0x80/0x240 md_bitmap_free+0x1c3/0x280 [md_mod] __md_stop+0x21/0x120 [md_mod] md_stop+0x9/0x40 [md_mod] raid_dtr+0x1b/0x40 [dm_raid] dm_table_destroy+0x98/0x1e0 [dm_mod] __dm_destroy+0x199/0x360 [dm_mod] dev_remove+0x10c/0x160 [dm_mod] ctl_ioctl+0x29e/0x560 [dm_mod] dm_compat_ctl_ioctl+0x7/0x20 [dm_mod] __do_compat_sys_ioctl+0xfa/0x160 do_syscall_64+0x90/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: return the allocated devices from md_allocChristoph Hellwig1-32/+22
Two callers of md_alloc want to use the newly allocated devices, so return it instead of letting them find it cumbersomely after the allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: open code md_probe in autorun_devicesChristoph Hellwig1-1/+1
autorun_devices should not be limited to the controls for the legacy probe on open, so just call md_alloc directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: remove unneeded semicolonYang Li1-1/+1
Eliminate the following coccicheck warning: ./drivers/md/md.c:8208:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: fix build failure for !MODULEStephen Rothwell1-0/+2
After merging the block tree, today's linux-next build (x86_64 allmodconfig) failed like this: drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function] 717 | static struct mddev *mddev_find(dev_t unit) | ^~~~~~~~~~ cc1: all warnings being treated as errors Caused by commit 4500d5c17910 ("md: simplify md_open") Make mddev_find() available only for non-modular builds. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220721131132.070be166@canb.auug.org.au Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: simplify md_openChristoph Hellwig1-27/+15
Now that devices are on the all_mddevs list until the gendisk is freed, there can't be any duplicates. Remove the global list lookup and just grab a reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: only delete entries from all_mddevs when the disk is freedChristoph Hellwig1-18/+38
This ensures device names don't get prematurely reused. Instead add a deleted flag to skip already deleted devices in mddev_get and other places that only want to see live mddevs. Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: stop using for_each_mddev in md_exitChristoph Hellwig1-28/+11
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a reference when we drop the lock and delete the now unused for_each_mddev macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: stop using for_each_mddev in md_notify_rebootChristoph Hellwig1-3/+9
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a reference when we drop the lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: stop using for_each_mddev in md_do_syncChristoph Hellwig1-3/+5
Just do a plain list_for_each that only grabs a mddev reference in the case where the thread sleeps and restarts the list iteration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: factor out the rdev overlaps check from rdev_size_storeChristoph Hellwig1-45/+39
This splits the code into nicely readable chunks and also avoids the refcount inc/dec manipulations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: rename md_free to md_kobj_releaseChristoph Hellwig1-2/+2
The md_free name is rather misleading, so pick a better one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: implement ->free_diskChristoph Hellwig1-6/+12
Ensure that all private data is only freed once all accesses are done. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: fix error handling in md_allocChristoph Hellwig1-13/+32
Error handling in md_alloc is a mess. Untangle it to just free the mddev directly before add_disk is called and thus the gendisk is globally visible. After that clear the hold flag and let the mddev_put take care of cleaning up the mddev through the usual mechanisms. Fixes: 5e55e2f5fc95 ("[PATCH] md: convert compile time warnings into runtime warnings") Fixes: 9be68dd7ac0e ("md: add error handling support for add_disk()") Fixes: 7ad1069166c0 ("md: properly unwind when failing to add the kobject in md_alloc") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: fix mddev->kobj lifetimeChristoph Hellwig1-6/+4
Once a kobject is initialized, the containing object should not be directly freed. So delay initialization until it is added. Also remove the kobject_del call as the last put will remove the kobject as well. The explicitly delete isn't needed here, and dropping it will simplify further fixes. With this md_free now does not need to check that ->gendisk is non-NULL as it is always set by the time that kobject_init is called on mddev->kobj. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: unlock mddev before reap sync_thread in action_storeGuoqing Jiang1-2/+17
Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") fixed is related with action_store path, other callers which reap sync_thread didn't need to be changed. Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous bug with belows. 1. unlock mddev before md_reap_sync_thread in action_store. 2. save reshape_position before unlock, then restore it to ensure position not changed accidentally by others. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: Explicitly create command-line configured devicesChris Webb1-1/+1
Boot-time assembly of arrays with md= command-line arguments breaks when CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c calls blkdev_get_by_dev(), assuming this implicitly creates the block device. Fix this by attempting to md_alloc() the array first. As in the probe path, ignore any error as failure is caught by blkdev_get_by_dev() anyway. Signed-off-by: Chris Webb <chris@arachsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: Notify sysfs sync_completed in md_reap_sync_thread()Logan Gunthorpe1-0/+1
The mdadm test 07layouts randomly produces a kernel hung task deadlock. The deadlock is caused by the suspend_lo/suspend_hi files being set by the mdadm background process during reshape and not being cleared because the process hangs. (Leaving aside the issue of the fragility of freezing kernel tasks by buggy userspace processes...) When the background mdadm process hangs it, is waiting (without a timeout) on a change to the sync_completed file signalling that the reshape has completed. The process is woken up a couple times when the reshape finishes but it is woken up before MD_RECOVERY_RUNNING is cleared so sync_completed_show() reports 0 instead of "none". To fix this, notify the sysfs file in md_reap_sync_thread() after MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes it to continue and write to suspend_lo/suspend_hi to allow IO to continue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: Ensure resync is reported after it startsLogan Gunthorpe1-2/+12
The 07layouts test in mdadm fails on some systems. The failure presents itself as the backup file not being removed before the next layout is grown into: mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup: File exists This is because the background mdadm process, which is responsible for cleaning up this backup file gets into an infinite loop waiting for the reshape to start. mdadm checks the mdstat file if a reshape is going and, if it is not, it waits for an event on the file or times out in 5 seconds. On faster machines, the reshape may complete before the 5 seconds times out, and thus the background mdadm process loops waiting for a reshape to start that has already occurred. mdadm reads the mdstat file to start, but mdstat does not report that the reshape has begun, even though it has indeed begun. So the mdstat_wait() call (in mdadm) which polls on the mdstat file won't ever return until timing out. The reason mdstat reports the reshape has started is due to an issue in status_resync(). recovery_active is subtracted from curr_resync which will result in a value of zero for the first chunk of reshaped data, and the resulting read will report no reshape in progress. To fix this, if "resync - recovery_active" is an overloaded value, force the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md: Use enum for overloaded magic numbers used by mddev->curr_resyncLogan Gunthorpe1-22/+18
Comments in the code document special values used for mddev->curr_resync. Make this clearer by using an enum to label these values. The only functional change is a couple places use the wrong comparison operator that implied 3 is another special value. They are all fixed to imply that 3 or greater is an active resync. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14md/core: Combine two sync_page_io() argumentsBart Van Assche1-5/+5
Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: remove bdevnameChristoph Hellwig1-1/+1
Replace the remaining calls of bdevname with snprintf using the %pg format specifier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28block: remove blk_cleanup_diskChristoph Hellwig1-2/+2
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15Revert "md: don't unregister sync_thread with reconfig_mutex held"Guoqing Jiang1-9/+5
The 07reshape5intr test is broke because of below path. md_reap_sync_thread -> mddev_unlock -> md_unregister_thread(&mddev->sync_thread) And md_check_recovery is triggered by, mddev_unlock -> md_wakeup_thread(mddev->thread) then mddev->reshape_position is set to MaxSector in raid5_finish_reshape since MD_RECOVERY_INTR is cleared in md_check_recovery, which means feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's reshape_position can't be updated accordingly. Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-05-23md: fix double free of io_acct_set biosetXiao Ni1-4/+0
Now io_acct_set is alloc and free in personality. Remove the codes that free io_acct_set in md_free and md_stop. Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality) Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2022-05-23md: remove most calls to bdevnameChristoph Hellwig1-87/+65
Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-05-23md: protect md_unregister_thread from reentrancyGuoqing Jiang1-5/+10
Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simulitaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-05-23md: don't unregister sync_thread with reconfig_mutex heldGuoqing Jiang1-5/+9
Unregister sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure array. And it could cause deadlock problem for raid5 as follows: 1. process A tried to reap sync thread with reconfig_mutex held after echo idle to sync_action. 2. raid5 sync thread was blocked if there were too many active stripes. 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) which causes the number of active stripes can't be decreased. 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex. More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t And add one parameter to md_reap_sync_thread since it could be called by dm-raid which doesn't hold reconfig_mutex. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>