kernel/linux.git/drivers/md/raid5.c, branch v6.12.80

md/raid5: fix raid5_run() to return error when log_init() fails

2026-03-04T12:19:28+00:00

[ Upstream commit 2d9f7150ac197ce79c9c917a004d4cf0b26ad7e0 ] Since commit f63f17350e53 ("md/raid5: use the atomic queue limit update APIs"), the abort path in raid5_run() returns 'ret' instead of -EIO. However, if log_init() fails, 'ret' is still 0 from the previous successful call, causing raid5_run() to return success despite the failure. Fix this by capturing the return value from log_init(). Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-2-yukuai@fnnas.com Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs") Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/ Signed-off-by: Yu Kuai Reviewed-by: Li Nan Reviewed-by: Xiao Ni Reviewed-by: Christoph Hellwig Signed-off-by: Sasha Levin

md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()

2026-01-08T09:14:37+00:00

[ Upstream commit 7ad6ef91d8745d04aff9cce7bdbc6320d8e05fe9 ] The variable mddev->private is first assigned to conf and then checked: conf = mddev->private; if (!conf) ... If conf is NULL, then mddev->private is also NULL. In this case, null-pointer dereferences can occur when calling raid5_quiesce(): raid5_quiesce(mddev, true); raid5_quiesce(mddev, false); since mddev->private is assigned to conf again in raid5_quiesce(), and conf is dereferenced in several places, for example: conf->quiesce = 0; wake_up(&conf->wait_for_quiescent); To fix this issue, the function should unlock mddev and return before invoking raid5_quiesce() when conf is NULL, following the existing pattern in raid5_change_consistency_policy(). Fixes: fa1944bbe622 ("md/raid5: Wait sync io to finish before changing group cnt") Signed-off-by: Tuo Li Reviewed-by: Xiao Ni Reviewed-by: Paul Menzel Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com Signed-off-by: Yu Kuai Signed-off-by: Sasha Levin

md/raid5: fix IO hang when array is broken with IO inflight

2025-12-18T12:55:13+00:00

[ Upstream commit a913d1f6a7f607c110aeef8b58c8988f47a4b24e ] Following test can cause IO hang: mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none sleep 5 echo 1 > /sys/block/sda/device/delete echo 1 > /sys/block/sdb/device/delete echo 1 > /sys/block/sdc/device/delete echo 1 > /sys/block/sdd/device/delete dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct Root cause: 1) all disks removed, however all rdevs in the array is still in sync, IO will be issued normally. 2) IO failure from sda, and set badblocks failed, sda will be faulty and MD_SB_CHANGING_PENDING will be set. 3) error recovery try to recover this IO from other disks, IO will be issued to sdb, sdc, and sdd. 4) IO failure from sdb, and set badblocks failed again, now array is broken and will become read-only. 5) IO failure from sdc and sdd, however, stripe can't be handled anymore because MD_SB_CHANGING_PENDING is set: handle_stripe handle_stripe if (test_bit MD_SB_CHANGING_PENDING) set_bit STRIPE_HANDLE goto finish // skip handling failed stripe release_stripe if (test_bit STRIPE_HANDLE) list_add_tail conf->hand_list 6) later raid5d can't handle failed stripe as well: raid5d md_check_recovery md_update_sb if (!md_is_rdwr()) // can't clear pending bit return if (test_bit MD_SB_CHANGING_PENDING) break; // can't handle failed stripe Since MD_SB_CHANGING_PENDING can never be cleared for read-only array, fix this problem by skip this checking for read-only array. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com Fixes: d87f064f5874 ("md: never update metadata when array is read-only.") Signed-off-by: Yu Kuai Reviewed-by: Li Nan Signed-off-by: Sasha Levin

md: fix mssing blktrace bio split events

2025-10-23T14:20:42+00:00

[ Upstream commit 22f166218f7313e8fe2d19213b5f4b3265f8c39e ] If bio is split by internal handling like chunksize or badblocks, the corresponding trace_block_split() is missing, resulting in blktrace inability to catch BIO split events and making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 4b1faf931650 ("block: Kill bio_pair_split()") Signed-off-by: Yu Kuai Reviewed-by: Damien Le Moal Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

md/md-bitmap: move bitmap_{start, end}write to md upper layer

2025-02-08T08:58:12+00:00

commit cd5fc653381811f1e0ba65f5d169918cab61476f upstream. There are two BUG reports that raid5 will hang at bitmap_startwrite([1],[2]), root cause is that bitmap start write and end write is unbalanced, it's not quite clear where, and while reviewing raid5 code, it's found that bitmap operations can be optimized. For example, for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to the array: ┌────────────────────────────────────────────────────────────┐ │chunk 0 │ │ ┌────────────┬─────────────┬─────────────┬────────────┼ │ sh0 │A0: 0 + 4k │A1: 8k + 4k │A2: 16k + 4k │A3: P │ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P │ ┼──────┴────────────┴─────────────┴─────────────┴────────────┼ │chunk 1 │ │ ┌────────────┬─────────────┬─────────────┬────────────┤ │ sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P │C3: 40k + 4k│ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P │D3: 44k + 4k│ └──────┴────────────┴─────────────┴─────────────┴────────────┘ Before this patch, 4 stripe head will be used, and each sh will attach bio for 3 disks, and each attached bio will trigger bitmap_startwrite() once, which means total 12 times. - 3 times (0 + 4k), for (A0, A1 and A2) - 3 times (4 + 4k), for (B0, B1 and B2) - 3 times (8 + 4k), for (C0, C1 and C3) - 3 times (12 + 4k), for (D0, D1 and D3) After this patch, md upper layer will calculate that IO range (0 + 48k) is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite() just once. Noted that this patch will align bitmap ranges to the chunks, for example, if user issue a IO (0 + 4k) to array: - Before this patch, 1 time (0 + 4k), for A0; - After this patch, 1 time (0 + 8k) for chunk 0; Usually, one bitmap bit will represent more than one disk chunk, and this doesn't have any difference. And even if user really created a array that one chunk contain multiple bits, the overhead is that more data will be recovered after power failure. Also remove STRIPE_BITMAP_PENDING since it's not used anymore. [1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/ [2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/ Signed-off-by: Yu Kuai Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu Signed-off-by: Yu Kuai Signed-off-by: Greg Kroah-Hartman

md/raid5: implement pers->bitmap_sector()

2025-02-08T08:58:12+00:00

commit 9c89f604476cf15c31fbbdb043cff7fbf1dbe0cb upstream. Bitmap is used for the whole array for raid1/raid10, hence IO for the array can be used directly for bitmap. However, bitmap is used for underlying disks for raid5, hence IO for the array can't be used directly for bitmap. Implement pers->bitmap_sector() for raid5 to convert IO ranges from the array to the underlying disks. Signed-off-by: Yu Kuai Link: https://lore.kernel.org/r/20250109015145.158868-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu Signed-off-by: Yu Kuai Signed-off-by: Greg Kroah-Hartman

md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()

2025-02-08T08:58:11+00:00

commit 4f0e7d0e03b7b80af84759a9e7cfb0f81ac4adae upstream. For the case that IO failed for one rdev, the bit will be mark as NEEDED in following cases: 1) If badblocks is set and rdev is not faulty; 2) If rdev is faulty; Case 1) is useless because synchronize data to badblocks make no sense. Case 2) can be replaced with mddev->degraded. Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since case 2) no longer use them. Signed-off-by: Yu Kuai Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu Signed-off-by: Yu Kuai Signed-off-by: Greg Kroah-Hartman

md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()

2025-02-08T08:58:11+00:00

commit 08c50142a128dcb2d7060aa3b4c5db8837f7a46a upstream. behind_write is only used in raid1, prepare to refactor bitmap_{start/end}write(), there are no functional changes. Signed-off-by: Yu Kuai Reviewed-by: Xiao Ni Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu Signed-off-by: Yu Kuai Signed-off-by: Greg Kroah-Hartman

md/raid5: Wait sync io to finish before changing group cnt

2024-12-09T09:40:56+00:00

commit fa1944bbe6220eb929e2c02e5e8706b908565711 upstream. One customer reports a bug: raid5 is hung when changing thread cnt while resync is running. The stripes are all in conf->handle_list and new threads can't handle them. Commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()") removes pers->quiesce from mddev_suspend/resume. Before this patch, mddev_suspend needs to wait for all ios including sync io to finish. Now it's used to only wait normal io. Fix this by calling raid5_quiesce from raid5_store_group_thread_cnt directly to wait all sync requests to finish before changing the group cnt. Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") Cc: stable@vger.kernel.org Signed-off-by: Xiao Ni Reviewed-by: Yu Kuai Link: https://lore.kernel.org/r/20241106095124.74577-1-xni@redhat.com Signed-off-by: Song Liu Signed-off-by: Greg Kroah-Hartman

md/raid5: rename wait_for_overlap to wait_for_reshape

2024-08-29T16:37:10+00:00

The only remaining uses of wait_for_overlap are related to reshape so rename it accordingly. Signed-off-by: Artur Paszkiewicz Link: https://lore.kernel.org/r/20240827153536.6743-4-artur.paszkiewicz@intel.com Signed-off-by: Song Liu