kernel/linux.git/drivers/md/raid5.c, branch v6.18.21

md/raid5: fix IO hang with degraded array with llbitmap

2026-02-26T22:59:00+00:00

[ Upstream commit cd1635d844d26471c56c0a432abdee12fc9ad735 ] When llbitmap bit state is still unwritten, any new write should force rcw, as bitmap_ops->blocks_synced() is checked in handle_stripe_dirtying(). However, later the same check is missing in need_this_block(), causing stripe to deadloop during handling because handle_stripe() will decide to go to handle_stripe_fill(), meanwhile need_this_block() always return 0 and nothing is handled. Link: https://lore.kernel.org/linux-raid/20260123182623.3718551-2-yukuai@fnnas.com Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Signed-off-by: Yu Kuai Reviewed-by: Li Nan Signed-off-by: Sasha Levin

md/raid5: fix raid5_run() to return error when log_init() fails

2026-02-26T22:59:00+00:00

[ Upstream commit 2d9f7150ac197ce79c9c917a004d4cf0b26ad7e0 ] Since commit f63f17350e53 ("md/raid5: use the atomic queue limit update APIs"), the abort path in raid5_run() returns 'ret' instead of -EIO. However, if log_init() fails, 'ret' is still 0 from the previous successful call, causing raid5_run() to return success despite the failure. Fix this by capturing the return value from log_init(). Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-2-yukuai@fnnas.com Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs") Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/ Signed-off-by: Yu Kuai Reviewed-by: Li Nan Reviewed-by: Xiao Ni Reviewed-by: Christoph Hellwig Signed-off-by: Sasha Levin

md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt()

2026-01-08T09:16:51+00:00

[ Upstream commit 7ad6ef91d8745d04aff9cce7bdbc6320d8e05fe9 ] The variable mddev->private is first assigned to conf and then checked: conf = mddev->private; if (!conf) ... If conf is NULL, then mddev->private is also NULL. In this case, null-pointer dereferences can occur when calling raid5_quiesce(): raid5_quiesce(mddev, true); raid5_quiesce(mddev, false); since mddev->private is assigned to conf again in raid5_quiesce(), and conf is dereferenced in several places, for example: conf->quiesce = 0; wake_up(&conf->wait_for_quiescent); To fix this issue, the function should unlock mddev and return before invoking raid5_quiesce() when conf is NULL, following the existing pattern in raid5_change_consistency_policy(). Fixes: fa1944bbe622 ("md/raid5: Wait sync io to finish before changing group cnt") Signed-off-by: Tuo Li Reviewed-by: Xiao Ni Reviewed-by: Paul Menzel Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com Signed-off-by: Yu Kuai Signed-off-by: Sasha Levin

md/raid5: fix IO hang when array is broken with IO inflight

2025-12-18T13:03:28+00:00

[ Upstream commit a913d1f6a7f607c110aeef8b58c8988f47a4b24e ] Following test can cause IO hang: mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none sleep 5 echo 1 > /sys/block/sda/device/delete echo 1 > /sys/block/sdb/device/delete echo 1 > /sys/block/sdc/device/delete echo 1 > /sys/block/sdd/device/delete dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct Root cause: 1) all disks removed, however all rdevs in the array is still in sync, IO will be issued normally. 2) IO failure from sda, and set badblocks failed, sda will be faulty and MD_SB_CHANGING_PENDING will be set. 3) error recovery try to recover this IO from other disks, IO will be issued to sdb, sdc, and sdd. 4) IO failure from sdb, and set badblocks failed again, now array is broken and will become read-only. 5) IO failure from sdc and sdd, however, stripe can't be handled anymore because MD_SB_CHANGING_PENDING is set: handle_stripe handle_stripe if (test_bit MD_SB_CHANGING_PENDING) set_bit STRIPE_HANDLE goto finish // skip handling failed stripe release_stripe if (test_bit STRIPE_HANDLE) list_add_tail conf->hand_list 6) later raid5d can't handle failed stripe as well: raid5d md_check_recovery md_update_sb if (!md_is_rdwr()) // can't clear pending bit return if (test_bit MD_SB_CHANGING_PENDING) break; // can't handle failed stripe Since MD_SB_CHANGING_PENDING can never be cleared for read-only array, fix this problem by skip this checking for read-only array. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com Fixes: d87f064f5874 ("md: never update metadata when array is read-only.") Signed-off-by: Yu Kuai Reviewed-by: Li Nan Signed-off-by: Sasha Levin

Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

2025-10-02T17:16:56+00:00

Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature for the new bitmap are that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update bitmap bit, no additional overhead for the following writes. By supporting only resync or recover written data, means in the case creating new array or replacing with a new disk, there is no need to do a full disk resync/recovery. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferencing needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot, by using nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA interation for integrity requests - Improve how iovecs are treated with regards to O_DIRECT aligment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to. - Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...

md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter

2025-09-16T16:37:12+00:00

The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be equal to max_write_zeroes_sectors if it is set to a non-zero value. However, the stacked md drivers call md_init_stacking_limits() to initialize this parameter to UINT_MAX but only adjust max_write_zeroes_sectors when setting limits. Therefore, this discrepancy triggers a value check failure in blk_validate_limits(). $ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1 $ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2 mdadm: Defaulting to version 1.2 metadata mdadm: RUN_ARRAY failed: Invalid argument Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors. Since the linear and raid0 drivers support write zeroes, so they can support unmap write zeroes operation if all of the backend devices support it. However, the raid1/10/5 drivers don't support write zeroes, so we have to set it to zero. Fixes: 0c40d7cb5ef3 ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits") Reported-by: John Garry Closes: https://lore.kernel.org/linux-block/803a2183-a0bb-4b7a-92f1-afc5097630d2@oracle.com/ Signed-off-by: Zhang Yi Tested-by: John Garry Reviewed-by: Li Nan Reviewed-by: Martin K. Petersen Reviewed-by: Yu Kuai Reviewed-by: Hannes Reinecke Link: https://lore.kernel.org/linux-raid/20250910111107.3247530-2-yi.zhang@huaweicloud.com Signed-off-by: Yu Kuai

md/raid5: convert to use bio_submit_split_bioset()

2025-09-10T11:23:45+00:00

Unify bio split code, prepare to fix ordering of split IO. Signed-off-by: Yu Kuai Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe

md: fix mssing blktrace bio split events

2025-09-10T11:23:45+00:00

If bio is split by internal handling like chunksize or badblocks, the corresponding trace_block_split() is missing, resulting in blktrace inability to catch BIO split events and making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 4b1faf931650 ("block: Kill bio_pair_split()") Signed-off-by: Yu Kuai Reviewed-by: Damien Le Moal Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe

md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER

2025-09-06T09:20:32+00:00

This flag is used by llbitmap in later patches to skip raid456 initial recover and delay building initial xor data to first write. https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai

md/md-bitmap: add a new method blocks_synced() in bitmap_operations

2025-09-06T09:20:01+00:00

Currently, raid456 must perform a whole array initial recovery to build initail xor data, then IO to the array won't have to read all the blocks in underlying disks. This behavior will affect IO performance a lot, and nowadays there are huge disks and the initial recovery can take a long time. Hence llbitmap will support lazy initial recovery in following patches. This method is used to check if data blocks is synced or not, if not then IO will still have to read all blocks for raid456. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-9-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai