summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)AuthorFilesLines
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-6/+6
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-7/+7
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-17Merge tag 'block-7.0-20260216' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-12md: ignore discard return valueChaitanya Kulkarni1-2/+2
__blkdev_issue_discard() always returns 0, making all error checking at call sites dead code. Simplify md to only check !discard_bio by ignoring the __blkdev_issue_discard() value. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-12Merge tag 'for-7.0/dm-changes' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - dm-verity: - various optimizations and fixes related to forward error correction - add a .dm-verity keyring - dm-integrity: fix bugs with growing a device in bitmap mode - dm-mpath: - fix leaking fake timeout requests - fix UAF bug caused by stale rq->bio - fix minor bugs in device creation - dm-core: - fix a bug related to blkg association - avoid unnecessary blk-crypto work on invalid keys - dm-bufio: - dm-bufio cleanup and optimization (reducing hash table lookups) - various other minor fixes and cleanups * tag 'for-7.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm mpath: make pg_init_delay_msecs settable Revert "dm: fix a race condition in retrieve_deps" dm mpath: Add missing dm_put_device when failing to get scsi dh name dm vdo encodings: clean up header and version functions dm: use bio_clone_blkg_association dm: fix excessive blk-crypto operations for invalid keys dm-verity: fix section mismatch error dm-unstripe: fix mapping bug when there are multiple targets in a table dm-integrity: fix recalculation in bitmap mode dm-bufio: avoid redundant buffer_tree lookups dm-bufio: merge cache_put() into cache_put_and_wake() selftests: add dm-verity keyring selftests dm-verity: add dm-verity keyring dm: clear cloned request bio pointer when last clone bio completes dm-verity: fix up various workqueue-related comments dm-verity: switch to bio_advance_iter_single() dm-verity: consolidate the BH and normal work structs dm: add WQ_PERCPU to alloc_workqueue users dm-integrity: fix a typo in the code for write/discard race dm: use READ_ONCE in dm_blk_report_zones ...
2026-01-26md raid: fix hang when stopping arrays with metadata through dm-raidHeinz Mauelshagen1-6/+8
When using device-mapper's dm-raid target, stopping a RAID array can cause the system to hang under specific conditions. This occurs when: - A dm-raid managed device tree is suspended from top to bottom (the top-level RAID device is suspended first, followed by its underlying metadata and data devices) - The top-level RAID device is then removed Removing the top-level device triggers a hang in the following sequence: the dm-raid destructor calls md_stop(), which tries to flush the write-intent bitmap by writing to the metadata sub-devices. However, these devices are already suspended, making them unable to complete the write-intent operations and causing an indefinite block. Fix: - Prevent bitmap flushing when md_stop() is called from dm-raid destructor context and avoid a quiescing/unquescing cycle which could also cause I/O - Still allow write-intent bitmap flushing when called from dm-raid suspend context This ensures that RAID array teardown can complete successfully even when the underlying devices are in a suspended state. This second patch uses md_is_rdwr() to distinguish between suspend and destructor paths as elaborated on above. Link: https://lore.kernel.org/linux-raid/CAM23VxqYrwkhKEBeQrZeZwQudbiNey2_8B_SEOLqug=pXxaFrA@mail.gmail.com Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: remove recovery_disabledLi Nan1-3/+0
'recovery_disabled' logic is complex and confusing, originally intended to preserve raid in extreme scenarios. It was used in following cases: - When sync fails and setting badblocks also fails, kick out non-In_sync rdev and block spare rdev from joining to preserve raid [1] - When last backup is unavailable, prevent repeated add-remove of spares triggering recovery [2] The original issues are now resolved: - Error handlers in all raid types prevent last rdev from being kicked out - Disks with failed recovery are marked Faulty and can't re-join Therefore, remove 'recovery_disabled' as it's no longer needed. [1] 5389042ffa36 ("md: change managed of recovery_disabled.") [2] 4044ba58dd15 ("md: don't retry recovery of raid1 that fails due to error on source drive.") Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-13-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: move finish_reshape to md_finish_sync()Li Nan1-9/+6
finish_reshape implementations of raid10 and raid5 only update mddev and rdev configurations. Move these operations to md_finish_sync() as it is more appropriate. No functional changes. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-10-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: factor out sync completion update into helperLi Nan1-35/+47
Repeatedly reading 'mddev->recovery' flags in md_do_sync() may introduce potential risk if this flag is modified during sync, leading to incorrect offset updates. Therefore, replace direct 'mddev->recovery' checks with 'action'. Move sync completion update logic into helper md_finish_sync(), which improves readability and maintainability. The reshape completion update remains safe as it only updated after successful reshape when MD_RECOVERY_INTR is not set and 'curr_resync' equals 'max_sectors'. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-9-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: remove MD_RECOVERY_ERROR handling and simplify resync_offset updateLi Nan1-17/+4
Following previous patch "md: update curr_resync_completed even when MD_RECOVERY_INTR is set", 'curr_resync_completed' always equals 'curr_resync' for resync, so MD_RECOVERY_ERROR can be removed. Also, simplify resync_offset update logic. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-8-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: update curr_resync_completed even when MD_RECOVERY_INTR is setLi Nan1-1/+1
An error sync IO may be done and sub 'recovery_active' while its error handling work is pending. This work sets 'recovery_disabled' and MD_RECOVERY_INTR, then later removes the bad disk without Faulty flag. If 'curr_resync_completed' is updated before the disk is removed, it could lead to reading from sync-failed regions. With the previous patch, error IO will set badblocks or mark rdev as Faulty, sync-failed regions are no longer readable. After waiting for 'recovery_active' to reach 0 (in the previous line), all sync IO has *completed*, regardless of whether MD_RECOVERY_INTR is set. Thus, the MD_RECOVERY_INTR check can be removed. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-7-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: mark rdev Faulty when badblocks setting failsLi Nan1-1/+7
Currently when sync read fails and badblocks set fails (exceeding 512 limit), rdev isn't immediately marked Faulty. Instead 'recovery_disabled' is set and non-In_sync rdevs are removed later. This preserves array availability if bad regions aren't read, but bad sectors might be read by users before rdev removal. This occurs due to incorrect resync/recovery_offset updates that include these bad sectors. When badblocks exceed 512, keeping the disk provides little benefit while adding complexity. Prompt disk replacement is more important. Therefore when badblocks set fails, directly call md_error to mark rdev Faulty immediately, preventing potential data access issues. After this change, cleanup of offset update logic and 'recovery_disabled' handling will follow. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-6-linan666@huaweicloud.com Fixes: 5e5702898e93 ("md/raid10: Handle read errors during recovery better.") Fixes: 3a9f28a5117e ("md/raid1: improve handling of read failure during recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: factor error handling out of md_done_sync into helperLi Nan1-7/+10
The 'ok' parameter in md_done_sync() is redundant for most callers that always pass 'true'. Factor error handling logic into a separate helper function md_sync_error() to eliminate unnecessary parameter passing and improve code clarity. No functional changes introduced. Link: https://lore.kernel.org/linux-raid/20260105110300.1442509-3-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-01-26md: merge mddev serialize_policy into mddev_flagsYu Kuai1-8/+12
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-5-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: merge mddev faillast_dev into mddev_flagsYu Kuai1-4/+6
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-26md: merge mddev has_superblock into mddev_flagsYu Kuai1-3/+3
There is not need to use a separate field in struct mddev, there are no functional changes. Link: https://lore.kernel.org/linux-raid/20260114171241.3043364-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2026-01-14dm: add WQ_PERCPU to alloc_workqueue usersMarco Crivellari1-2/+2
This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-12-27md: Fix forward incompatibility from configurable logical block sizeLi Nan1-4/+44
Commit 62ed1b582246 ("md: allow configuring logical block size") used reserved pad to add 'logical_block_size' to metadata. RAID rejects non-zero reserved pad, so arrays fail when rolling back to old kernels after booting new ones. Set 'logical_block_size' only for newly created arrays to support rollback to old kernels. Importantly new arrays still won't work on old kernels to prevent data loss issue from LBS changes. For arrays created on old kernels which confirmed not to rollback, configure LBS by echo current LBS (queue/logical_block_size) to md/logical_block_size. Fixes: 62ed1b582246 ("md: allow configuring logical block size") Reported-by: BugReports <bugreports61@gmail.com> Closes: https://lore.kernel.org/linux-raid/825e532d-d1e1-44bb-5581-692b7c091796@huaweicloud.com/T/#t Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20251226024221.724201-2-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27md: Fix logical_block_size configuration being overwrittenLi Nan1-1/+3
In super_1_validate(), mddev->logical_block_size is directly overwritten with the value from metadata. This causes the previously configured lbs to be lost, making the configuration ineffective. Fix it. Fixes: 62ed1b582246 ("md: allow configuring logical block size") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20251226024221.724201-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27md: suspend array while updating raid_disks via sysfsFengWei Shih1-2/+2
In raid1_reshape(), freeze_array() is called before modifying the r1bio memory pool (conf->r1bio_pool) and conf->raid_disks, and unfreeze_array() is called after the update is completed. However, freeze_array() only waits until nr_sync_pending and (nr_pending - nr_queued) of all buckets reaches zero. When an I/O error occurs, nr_queued is increased and the corresponding r1bio is queued to either retry_list or bio_end_io_list. As a result, freeze_array() may unblock before these r1bios are released. This can lead to a situation where conf->raid_disks and the mempool have already been updated while queued r1bios, allocated with the old raid_disks value, are later released. Consequently, free_r1bio() may access memory out of bounds in put_all_bios() and release r1bios of the wrong size to the new mempool, potentially causing issues with the mempool as well. Since only normal I/O might increase nr_queued while an I/O error occurs, suspending the array avoids this issue. Note: Updating raid_disks via ioctl SET_ARRAY_INFO already suspends the array. Therefore, we suspend the array when updating raid_disks via sysfs to avoid this issue too. Signed-off-by: FengWei Shih <dannyshih@synology.com> Link: https://lore.kernel.org/linux-raid/20251226101816.4506-1-dannyshih@synology.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-25md: Fix static checker warning in analyze_sbsLi Nan1-4/+1
The following warn is reported: drivers/md/md.c:3912 analyze_sbs() warn: iterator 'i' not incremented Fixes: d8730f0cf4ef ("md: Remove deprecated CONFIG_MD_MULTIPATH") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-raid/7e2e95ce-3740-09d8-a561-af6bfb767f18@huaweicloud.com/T/#t Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20251215124412.4015572-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-30md: remove legacy 1s delay in md_notify_rebootTarun Sahu1-11/+0
During system shutdown, the md driver registered notifier function (md_notify_reboot) currently imposes a hardcoded one-second delay. This delay was introduced approximately 23 years ago and was likely necessary for the hardware generation of that time. Proposing this patch to make sure there are no known devices that need this delay. Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com Signed-off-by: Tarun Sahu <tarunsahu@google.com> Reviewed-by: Yu Kuai<yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-30md: warn about updating super block failureYu Kuai1-0/+1
Many personalities will handle IO error from daemon thread(like raid1d, raid10d, raid5d), and sb will require to be clean before hanlding these failed IO. However update sb can fail, for example array is broken by IO failure, or user config sysfs api array_state. This patch adds warning if updating sb failed first, in case this will be related to IO hang. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
2025-11-11md: allow configuring logical block sizeLi Nan1-0/+77
Previously, raid array used the maximum logical block size (LBS) of all member disks. Adding a larger LBS disk at runtime could unexpectedly increase RAID's LBS, risking corruption of existing partitions. This can be reproduced by: ``` # LBS of sd[de] is 512 bytes, sdf is 4096 bytes. mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean # LBS is 512 cat /sys/block/md0/queue/logical_block_size # create partition md0p1 parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100% lsblk | grep md0p1 # LBS becomes 4096 after adding sdf mdadm --add -q /dev/md0 /dev/sdf cat /sys/block/md0/queue/logical_block_size # partition lost partprobe /dev/md0 lsblk | grep md0p1 ``` Simply restricting larger-LBS disks is inflexible. In some scenarios, only disks with 512 bytes LBS are available currently, but later, disks with 4KB LBS may be added to the array. Making LBS configurable is the best way to solve this scenario. After this patch, the raid will: - store LBS in disk metadata - add a read-write sysfs 'mdX/logical_block_size' Future mdadm should support setting LBS via metadata field during RAID creation and the new sysfs. Though the kernel allows runtime LBS changes, users should avoid modifying it after creating partitions or filesystems to prevent compatibility issues. Only 1.x metadata supports configurable LBS. 0.90 metadata inits all fields to default values at auto-detect. Supporting 0.90 would require more extensive changes and no such use case has been observed. Note that many RAID paths rely on PAGE_SIZE alignment, including for metadata I/O. A larger LBS than PAGE_SIZE will result in metadata read/write failures. So this config should be prevented. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: add check_new_feature module parameterLi Nan1-3/+9
Raid checks if pad3 is zero when loading superblock from disk. Arrays created with new features may fail to assemble on old kernels as pad3 is used. Add module parameter check_new_feature to bypass this check. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: init bioset in mddev_initLi Nan1-36/+33
IO operations may be needed before md_run(), such as updating metadata after writing sysfs. Without bioset, this triggers a NULL pointer dereference as below: BUG: kernel NULL pointer dereference, address: 0000000000000020 Call Trace: md_update_sb+0x658/0xe00 new_level_store+0xc5/0x120 md_attr_store+0xc9/0x1e0 sysfs_kf_write+0x6f/0xa0 kernfs_fop_write_iter+0x141/0x2a0 vfs_write+0x1fc/0x5a0 ksys_write+0x79/0x180 __x64_sys_write+0x1d/0x30 x64_sys_call+0x2818/0x2880 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Reproducer ``` mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd] echo inactive > /sys/block/md0/md/array_state echo 10 > /sys/block/md0/md/new_level ``` mddev_init() can only be called once per mddev, no need to test if bioset has been initialized anymore. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com Fixes: d981ed841930 ("md: Add new_level sysfs interface") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: delete md_redundancy_group when array is becoming inactiveLi Nan1-0/+4
'md_redundancy_group' are created in md_run() and deleted in del_gendisk(), but these are not paired. Writing inactive/active to sysfs array_state can trigger md_run() multiple times without del_gendisk(), leading to duplicate creation as below: sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action' Call Trace: dump_stack_lvl+0x9f/0x120 dump_stack+0x14/0x20 sysfs_warn_dup+0x96/0xc0 sysfs_add_file_mode_ns+0x19c/0x1b0 internal_create_group+0x213/0x830 sysfs_create_group+0x17/0x20 md_run+0x856/0xe60 ? __x64_sys_openat+0x23/0x30 do_md_run+0x26/0x1d0 array_state_store+0x559/0x760 md_attr_store+0xc9/0x1e0 sysfs_kf_write+0x6f/0xa0 kernfs_fop_write_iter+0x141/0x2a0 vfs_write+0x1fc/0x5a0 ksys_write+0x79/0x180 __x64_sys_write+0x1d/0x30 x64_sys_call+0x2818/0x2880 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x4b/0x53 md: cannot register extra attributes for md0 Creation of it depends on 'pers', its lifecycle cannot be aligned with gendisk. So fix this issue by triggering 'md_redundancy_group' deletion when the array is becoming inactive. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: prevent adding disks with larger logical_block_size to active arraysLi Nan1-0/+7
When adding a disk to a md array, avoid updating the array's logical_block_size to match the new disk. This prevents accidental partition table loss that renders the array unusable. The later patch will introduce a way to configure the array's logical_block_size. The issue was introduced before Linux 2.6.12-rc2. Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/ Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: avoid repeated calls to del_gendiskXiao Ni1-1/+2
There is a uaf problem which is found by case 23rdev-lifetime: Oops: general protection fault, probably for non-canonical address 0xdead000000000122 RIP: 0010:bdi_unregister+0x4b/0x170 Call Trace: <TASK> __del_gendisk+0x356/0x3e0 mddev_unlock+0x351/0x360 rdev_attr_store+0x217/0x280 kernfs_fop_write_iter+0x14a/0x210 vfs_write+0x29e/0x550 ksys_write+0x74/0xf0 do_syscall_64+0xbb/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff5250a177e The sequence is: 1. rdev remove path gets reconfig_mutex 2. rdev remove path release reconfig_mutex in mddev_unlock 3. md stop calls do_md_stop and sets MD_DELETED 4. rdev remove path calls del_gendisk because MD_DELETED is set 5. md stop path release reconfig_mutex and calls del_gendisk again So there is a race condition we should resolve. This patch adds a flag MD_DO_DELETE to avoid the race condition. Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Suggested-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08Factor out code into md_should_do_recovery()Wu Guanghao1-12/+47
In md_check_recovery(), use new helper to make code cleaner. Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: fix rcu protection in md_wakeup_threadYun Zhou1-8/+6
We attempted to use RCU to protect the pointer 'thread', but directly passed the value when calling md_wakeup_thread(). This means that the RCU pointer has been acquired before rcu_read_lock(), which renders rcu_read_lock() ineffective and could lead to a use-after-free. Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com Fixes: 446931543982 ("md: protect md_thread with rcu") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: delete mddev kobj before deleting gendisk kobjXiao Ni1-1/+3
In sync del gendisk path, it deletes gendisk first and the directory /sys/block/md is removed. Then it releases mddev kobj in a delayed work. If we enable debug log in sysfs_remove_group, we can see the debug log 'sysfs group bitmap not found for kobject md'. It's the reason that the parent kobj has been deleted, so it can't find parent directory. In creating path, it allocs gendisk first, then adds mddev kobj. So it should delete mddev kobj before deleting gendisk. Before commit 9e59d609763f ("md: call del_gendisk in control path"), it releases mddev kobj first. If the kobj hasn't been deleted, it does clean job and deletes the kobj. Then it calls del_gendisk and releases gendisk kobj. So it doesn't need to call kobject_del to delete mddev kobj. After this patch, in sync del gendisk path, the sequence changes. So it needs to call kobject_del to delete mddev kobj. After this patch, the sequence is: 1. kobject del mddev kobj 2. del_gendisk deletes gendisk kobj 3. mddev_delayed_delete releases mddev kobj 4. md_kobj_release releases gendisk kobj Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-10-02Merge tag 'for-6.18/block-20250929' of ↵Linus Torvalds1-72/+310
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature for the new bitmap are that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update bitmap bit, no additional overhead for the following writes. By supporting only resync or recover written data, means in the case creating new array or replacing with a new disk, there is no need to do a full disk resync/recovery. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferencing needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot, by using nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA interation for integrity requests - Improve how iovecs are treated with regards to O_DIRECT aligment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to. - Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...
2025-09-06md/md-llbitmap: introduce new lockless bitmapYu Kuai1-0/+6
Redundant data is used to enhance data fault tolerance, and the storage method for redundant data vary depending on the RAID levels. And it's important to maintain the consistency of redundant data. Bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after power failure or readding a disk. If there is no bitmap, a full disk synchronization is required. Due to known performance issues with md-bitmap and the unreasonable implementations: - self-managed IO submitting like filemap_write_page(); - global spin_lock I have decided not to continue optimizing based on the current bitmap implementation, this new bitmap is invented without locking from IO fast path and can be used with fast disks. For designs and details, see the comments in drivers/md-llbitmap.c. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md/md-bitmap: make method bitmap_ops->daemon_work optionalYu Kuai1-1/+1
daemon_work() will be called by daemon thread, on the one hand, daemon thread doesn't have strict wake-up time; on the other hand, too much work are put to daemon thread, like handle sync IO, handle failed or specail normal IO, handle recovery, and so on. Hence daemon thread may be too busy to clear dirty bits in time. Make bitmap_ops->daemon_work() optional and following patches will use separate async work to clear dirty bits for the new bitmap. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVERYu Kuai1-1/+46
This flag is used by llbitmap in later patches to skip raid456 initial recover and delay building initial xor data to first write. https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-06md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operationsYu Kuai1-0/+7
This method is used to check if blocks can be skipped before calling into pers->sync_request(), llbitmap will use this method to skip resync for unwritten/clean data blocks, and recovery/check/repair for unwritten data blocks; Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md/md-bitmap: delay registration of bitmap_ops until creating bitmapYu Kuai1-37/+53
Currently bitmap_ops is registered while allocating mddev, this is fine when there is only one bitmap_ops. Delay setting bitmap_ops until creating bitmap, so that user can choose which bitmap to use before running the array. Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06md/md-bitmap: add a new sysfs api bitmap_typeYu Kuai1-0/+81
The api will be used by mdadm to set bitmap_type while creating new array or assembling array, prepare to add a new bitmap. Currently available options are: cat /sys/block/md0/md/bitmap_type none [bitmap] Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md: add a new mddev field 'bitmap_id'Yu Kuai1-6/+31
Prepare to store the bitmap id selected by user, also refactor mddev_set_bitmap_ops a bit in case the value is invalid. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06md/md-bitmap: support discard for bitmap opsYu Kuai1-4/+11
Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw' in struct md_io_clone to handle discard IO, prepare to support new md bitmap. Since all bitmap functions to hanlde write IO are the same, also add typedef to make code cleaner. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md: factor out a helper raid_is_456()Yu Kuai1-8/+1
There are no functional changes, the helper will be used by llbitmap in following patches. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md: add a new parameter 'offset' to md_super_write()Yu Kuai1-21/+31
The parameter is always set to 0 for now, following patches will use this helper to write llbitmap to underlying disks, allow writing dirty sectors instead of the whole page. Also rename md_super_write to md_write_metadata since there is nothing super-block specific. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06md/md-bitmap: introduce CONFIG_MD_BITMAPYu Kuai1-12/+28
Now that all implementations are internal, it's sensible to add a config option for md-bitmap, and it's a good way for isolation. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-16-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06md: check before referencing mddev->bitmap_opsYu Kuai1-20/+48
Prepare to introduce CONFIG_MD_BITMAP. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-15-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06md/md-bitmap: merge md_bitmap_group into bitmap_operationsYu Kuai1-1/+5
Now that all bitmap implementations are internal, it doesn't make sense to export md_bitmap_group anymore. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-04md: prevent incorrect update of resync/recovery offsetLi Nan1-0/+5
In md_do_sync(), when md_sync_action returns ACTION_FROZEN, subsequent call to md_sync_position() will return MaxSector. This causes 'curr_resync' (and later 'recovery_offset') to be set to MaxSector too, which incorrectly signals that recovery/resync has completed, even though disk data has not actually been updated. To fix this issue, skip updating any offset values when the sync action is FROZEN. The same holds true for IDLE. Fixes: 7d9f107a4e94 ("md: use new helpers in md_do_sync()") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-04Merge tag 'pull-getgeo' of ↵Jens Axboe1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block Pull struct block_device getgeo changes from Al. "switching ->getgeo() from struct block_device to struct gendisk Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>" * tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: block: switch ->getgeo() to struct gendisk scsi: switch ->bios_param() to passing gendisk scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk
2025-08-16md: fix sync_action incorrect display during resyncZheng Qixing1-2/+35
During raid resync, if a disk becomes faulty, the operation is briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by the disk failure causes sync_action to incorrectly show "recover" instead of "resync". The same issue affects reshape operations. Reproduction steps: mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d} // -> resync happened mdadm -f /dev/md1 /dev/sda // -> resync interrupted cat sync_action -> recover Add progress checks in md_sync_action() for resync/recover/reshape to ensure the interface correctly reports the actual operation type. Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-3-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16md: add helper rdev_needs_recovery()Zheng Qixing1-11/+12
Add a helper for checking if an rdev needs recovery. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-2-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>