path: root/drivers/md
Age | Commit message | Author | Files | Lines
14 days | Merge tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | 1 | -4/+8
Pull device mapper fix from Mikulas Patocka:

 - fix race condition in dm-cache-policy-smq

* tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache policy smq: check allocation under invalidate lock
2026-06-01 | dm cache policy smq: check allocation under invalidate lock | Guangshuo Li | 1 | -4/+8
commit 2d1f7b65f5de ("dm cache policy smq: fix missing locks in invalidating cache blocks") added mq->lock around the destructive part of smq_invalidate_mapping(), but left the e->allocated check outside the critical section. That leaves a check-then-act race. Two concurrent invalidators can both observe e->allocated as true before either of them takes mq->lock. The first invalidator that acquires the lock removes the entry from the queues and hash table and then calls free_entry(), which clears e->allocated and puts the entry back on the free list. The second invalidator can then acquire mq->lock and continue with the stale result of the unlocked check. This can corrupt the SMQ queues or hash table by deleting an entry that is no longer on those structures. It can also hit the allocation check in free_entry() when the same entry is freed again. Move the allocation check under mq->lock so the predicate and the destructive operations are serialized by the same lock. Fixes: 2d1f7b65f5de ("dm cache policy smq: fix missing locks in invalidating cache blocks") Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
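The check-then-act race described in this commit message can be illustrated with a small user-space sketch. It is not the kernel code: `struct policy`, `struct entry`, `invalidate_entry()` and the pthread mutex are hypothetical stand-ins for the SMQ policy, its entries, smq_invalidate_mapping() and mq->lock. The point it shows is that the allocation predicate and the destructive step sit inside the same critical section, so a second invalidator sees the cleared flag rather than a stale result.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a cache entry and the policy that owns it. */
struct entry {
	bool allocated;
};

struct policy {
	pthread_mutex_t lock;
	unsigned int nr_allocated;
};

/* Destructive step: only valid on an entry that is still allocated. */
static void free_entry(struct policy *p, struct entry *e)
{
	e->allocated = false;
	p->nr_allocated--;
}

/*
 * Returns 0 if the entry was invalidated, -1 if it was not (or no longer)
 * allocated. The check happens under the lock, so two concurrent callers
 * cannot both pass it for the same entry.
 */
int invalidate_entry(struct policy *p, struct entry *e)
{
	int r = 0;

	pthread_mutex_lock(&p->lock);
	if (!e->allocated)
		r = -1;
	else
		free_entry(p, e);
	pthread_mutex_unlock(&p->lock);

	return r;
}
```

Calling `invalidate_entry()` twice on the same entry succeeds the first time and returns -1 the second time, instead of freeing the entry twice.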
2026-05-25 | Merge tag 'for-7.1/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | 1 | -2/+2
Pull device mapper fix from Mikulas Patocka:

 - fix crashes in dm-vdo if GFP_NOWAIT allocation fails

* tag 'for-7.1/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm vdo: use GFP_NOIO for blkdev_issue_zeroout on format path
2026-05-04 | dm vdo: use GFP_NOIO for blkdev_issue_zeroout on format path | Bruce Johnston | 1 | -2/+2
GFP_NOWAIT is inappropriate when blkdev_issue_zeroout may sleep and bio_alloc can fail under pressure; use GFP_NOIO for clear_partition and vdo_clear_layout zeroout calls. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: fc1d43826702 ("dm vdo: save the formatted metadata to disk")
2026-05-01 | Merge tag 'block-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 9 | -109/+250
Pull block fixes from Jens Axboe:

 - MD pull request via Yu:
     - Fix a raid5 UAF on IO across the reshape position
     - Avoid failing RAID1/RAID10 devices for invalid IO errors
     - Fix RAID10 divide-by-zero when far_copies is zero
     - Restore bitmap grow through sysfs
     - Use mddev_is_dm() instead of open-coding gendisk checks
     - Use ATTRIBUTE_GROUPS() for md default sysfs attributes
     - Replace open-coded wait loops with wait_event helpers

 - NVMe pull request via Keith:
     - Target data transfer size configuration (Aurelien)
     - Enable P2P for RDMA (Shivaji Kant)
     - TCP target updates (Maurizio, Alistair, Chaitanya, Shivam Kumar)
     - TCP host updates (Alistair, Chaitanya)
     - Authentication updates (Alistair, Daniel, Chris Leech)
     - Multipath fixes (John Garry)
     - New quirks (Alan Cui, Tao Jiang)
     - Apple driver fix (Fedor Pchelkin)
     - PCI admin doorbell update fix (Keith)

 - Properly propagate CDROM read-only state to the block layer

* tag 'block-7.1-20260430' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (35 commits)
  md: use ATTRIBUTE_GROUPS() for md default sysfs attributes
  md: use mddev_is_dm() instead of open-coding gendisk checks
  md/raid1: replace wait loop with wait_event_idle() in raid1_write_request()
  md/md-bitmap: add a none backend for bitmap grow
  md/md-bitmap: split bitmap sysfs groups
  md: factor bitmap creation away from sysfs handling
  md: use mddev_lock_nointr() in mddev_suspend_and_lock_nointr()
  md: replace wait loop with wait_event() in md_handle_request()
  md/raid10: fix divide-by-zero in setup_geo() with zero far_copies
  md/raid1,raid10: don't fail devices for invalid IO errors
  MAINTAINERS: Add Xiao Ni as md/raid reviewer
  md/raid5: Fix UAF on IO across the reshape position
  cdrom, scsi: sr: propagate read-only status to block layer via set_disk_ro()
  nvme-auth: Hash DH shared secret to create session key
  nvme-pci: fix missed admin queue sq doorbell write
  nvme-auth: Include SC_C in RVAL controller hash
  nvme-tcp: teardown circular locking fixes
  nvmet-tcp: Don't clear tls_key when freeing sq
  Revert "nvmet-tcp: Don't free SQ on authentication success"
  nvme: skip trace completion for host path errors
  ...
2026-04-28 | md: use ATTRIBUTE_GROUPS() for md default sysfs attributes | Abd-Alrhman Masalkhi | 1 | -10/+2
Replace the md_default_group and md_attr_groups with ATTRIBUTE_GROUPS(). Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-4-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md: use mddev_is_dm() instead of open-coding gendisk checks | Abd-Alrhman Masalkhi | 1 | -2/+2
Replace direct checks on mddev->gendisk with mddev_is_dm() in md_handle_request() and md_run(). Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md/raid1: replace wait loop with wait_event_idle() in raid1_write_request() | Abd-Alrhman Masalkhi | 1 | -11/+4
The wait loop is equivalent to wait_event_idle(); use it to improve readability. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260423101303.48196-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md/md-bitmap: add a none backend for bitmap grow | Yu Kuai | 3 | -16/+137
Add a real none bitmap backend that exposes the common bitmap sysfs group and use it to keep bitmap/location available when an array has no bitmap. Then switch the bitmap location sysfs path to move only between none and the classic bitmap backend, using the no-sysfs bitmap helpers while merging or unmerging the internal bitmap sysfs group. This restores mdadm --grow bitmap addition through bitmap/location. Fixes: fb8cc3b0d9db ("md/md-bitmap: delay registration of bitmap_ops until creating bitmap") Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-4-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md/md-bitmap: split bitmap sysfs groups | Yu Kuai | 4 | -13/+40
Split the classic bitmap sysfs files into a common bitmap group with the location attribute and a separate internal bitmap group for the remaining files. At the same time, convert bitmap operations from a single sysfs group to a sysfs group array so backends can share part of their sysfs layout while adding backend-specific attributes separately. Switch the bitmap sysfs helpers to use sysfs_update_groups() for the add and update path, and remove groups in reverse order so shared named groups are unmerged before the last group removes the directory. Also make bitmap operation lookup depend only on the currently selected bitmap id matching the installed backend. This prepares the lookup path for a later registered none backend. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md: factor bitmap creation away from sysfs handling | Yu Kuai | 1 | -29/+49
Factor bitmap creation and destruction into helpers that do not touch bitmap sysfs registration. This prepares the bitmap sysfs rework so callers such as the sysfs bitmap location path can create or destroy a bitmap backend without coupling that to sysfs group lifetime management. Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/r/20260425024615.1696892-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md: use mddev_lock_nointr() in mddev_suspend_and_lock_nointr() | Abd-Alrhman Masalkhi | 1 | -1/+1
This keeps mddev locking consistent and ensures that any future changes to locking behavior are done through the wrapper. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/r/20260415140319.376578-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md: replace wait loop with wait_event() in md_handle_request() | Abd-Alrhman Masalkhi | 1 | -9/+1
The wait loop is equivalent to wait_event(); use it to simplify the code and improve readability. Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/r/20260415140319.376578-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md/raid10: fix divide-by-zero in setup_geo() with zero far_copies | Junrui Luo | 1 | -0/+2
setup_geo() extracts near_copies (nc) and far_copies (fc) from the user-provided layout parameter without checking for zero. When fc=0 with the "improved" far set layout selected, 'geo->far_set_size = disks / fc' triggers a divide-by-zero. Validate nc and fc immediately after extraction, returning -1 if either is zero. Fixes: 475901aff158 ("MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB7881A5E2556806CC1D318582AF232@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
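The validation described here can be sketched in user space as follows. This is an illustration, not the kernel's setup_geo(): `struct geom` and `setup_geo_sketch()` are hypothetical, and the bit positions used to decode `layout` (bits 0-7 for near copies, bits 8-15 for far copies, bit 17 for the "improved" far set layout) are assumptions made for the example.

```c
/* Hypothetical, reduced geometry; the kernel structure has more fields. */
struct geom {
	int near_copies;
	int far_copies;
	int far_set_size;
};

/*
 * Decode the copy counts from a user-provided layout word and reject zero
 * values before anything divides by them. Returns 0 on success, -1 on an
 * invalid layout.
 */
int setup_geo_sketch(struct geom *geo, int layout, int disks)
{
	int nc = layout & 255;		/* assumed: bits 0-7 */
	int fc = (layout >> 8) & 255;	/* assumed: bits 8-15 */

	/* Validate immediately after extraction, as the commit describes. */
	if (nc == 0 || fc == 0)
		return -1;

	geo->near_copies = nc;
	geo->far_copies = fc;

	/* assumed: bit 17 selects the "improved" far set layout */
	if ((layout >> 17) & 1)
		geo->far_set_size = disks / fc;	/* safe: fc != 0 here */
	else
		geo->far_set_size = disks;

	return 0;
}
```

With the check in place, a layout word that selects the improved far set layout but carries fc=0 is rejected with -1 before the division is reached.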
2026-04-28 | md/raid1,raid10: don't fail devices for invalid IO errors | Keith Busch | 1 | -1/+6
BLK_STS_INVAL indicates the IO request itself was invalid, not that the device has failed. When raid1 treats this as a device error, it retries on alternate mirrors which fail the same way, eventually exceeding the read error threshold and removing the device from the array. This happens when stacking configurations bypass bio_split_to_limits() in the IO path: dm-raid calls md_handle_request() directly without going through md_submit_bio(), skipping the alignment validation that would otherwise reject invalid bios early. The invalid bio reaches the lower block layers, which fail the bio with BLK_STS_INVAL, and raid1 wrongly interprets this as a device failure. Add BLK_STS_INVAL to raid1_should_handle_error() so that invalid IO errors are propagated back to the caller rather than triggering device removal. This is consistent with the previous kernel behavior when alignment checks were done earlier in the direct-io path. Fixes: 5ff3f74e145adc7 ("block: simplify direct io validity check") Reported-by: Tomáš Trnka <trnka@scm.com> Closes: https://lore.kernel.org/linux-block/2982107.4sosBPzcNG@electra/ Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Tomáš Trnka <trnka@scm.com> Link: https://lore.kernel.org/r/20260416140345.3872265-1-kbusch@meta.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | md/raid5: Fix UAF on IO across the reshape position | Benjamin Marzinski | 3 | -25/+14
If make_stripe_request() returns STRIPE_WAIT_RESHAPE, raid5_make_request() will free the cloned bio. But raid5_make_request() can call make_stripe_request() multiple times, writing to the various stripes. If that bio got added to the toread or towrite lists of a stripe disk in an earlier call to make_stripe_request(), then it's not safe to just free the bio if a later part of it is found to cross the reshape position. Doing so can lead to a UAF error, when bio_endio() is called on the bio for the earlier stripes. Instead, raid5_make_request() needs to wait until all parts of the bio have called bio_endio(). To do this, bios that cross the reshape position while the reshape can't make progress are flagged as needing to wait for all parts to complete. When raid5_make_request() has a bio that failed make_stripe_request() with STRIPE_WAIT_RESHAPE, it sets bi->bi_private to a completion struct and waits for completion after ending the bio. When the bio_endio() is called for the last time on a clone bio with bi->bi_private set, it wakes up the waiter. This guarantees that raid5_make_request() doesn't return until the cloned bio needing a retry for io across the reshape boundary is safely cleaned up. There is a simple reproducer available at [1]. Compile the kernel with KASAN for more useful reporting when the error is triggered (this is not necessary to see the bug). [1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5 Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20260408043548.1695157-1-bmarzins@redhat.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-28 | Merge tag 'for-7.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | 1 | -0/+8
Pull device mapper fix from Mikulas Patocka:

 - fix metadata corruption in dm-thin

* tag 'for-7.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-thin: fix metadata refcount underflow
2026-04-20 | dm-thin: fix metadata refcount underflow | Mikulas Patocka | 1 | -0/+8
There's a bug in dm-thin in the function rebalance_children. If the internal btree node has one entry, the code tries to copy all btree entries from the node's child to the node itself and then decrement the child's reference count. If the child node is shared (it has reference count > 1), we won't free it, so there would be two pointers to each of the grandchildren nodes. But the reference counts of the grandchildren are not increased, so the reference count doesn't match the number of pointers that point to the grandchildren. This results in "device mapper: space map common: unable to decrement block" errors. Fix this bug by incrementing reference counts on the grandchildren if the btree node is shared. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 3241b1d3e0aa ("dm: add persistent data library") Cc: stable@vger.kernel.org
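The reference-count invariant behind this fix (each node's count equals the number of pointers to it) can be shown with a toy in-memory tree. Everything here is hypothetical: `struct node`, `pull_up_only_child()` and the fixed fan-out are illustration only, and the real code works on on-disk btree blocks through the persistent-data library rather than C pointers.

```c
#define MAX_ENTRIES 4

/* Toy btree node: ref counts how many parents point at this node. */
struct node {
	int ref;
	int nr_entries;
	struct node *child[MAX_ENTRIES];
};

/*
 * The parent has a single child; copy the child's entries into the parent
 * and drop the parent's reference on the child.
 *
 * If the child is shared (ref > 1) it survives and keeps its own pointers
 * to the grandchildren, so each grandchild gains one pointer and its
 * reference count must be incremented. If the child is not shared, its
 * pointers go away with it and the counts stay as they are.
 */
void pull_up_only_child(struct node *parent)
{
	struct node *child = parent->child[0];
	int shared = child->ref > 1;
	int i;

	for (i = 0; i < child->nr_entries; i++) {
		parent->child[i] = child->child[i];
		if (shared)
			child->child[i]->ref++;
	}
	parent->nr_entries = child->nr_entries;
	child->ref--;
}
```

In the shared case each grandchild ends with a count of 2, matching the two nodes that now point to it. Skipping the increment would leave the count at 1, which is the mismatch the commit message describes.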
2026-04-16 | Merge tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | 71 | -840/+1239
Pull device mapper updates from Benjamin Marzinski:
 "There are fixes for some corner case crashes in dm-cache and dm-mirror, new setup functionality for dm-vdo, and miscellaneous minor fixes and cleanups, especially to dm-verity.

  dm-vdo:
   - Make dm-vdo able to format the device itself, like other dm targets, instead of needing a userspace formatting program
   - Add some sanity checks and code cleanup

  dm-cache:
   - Fix crashes and hangs when operating in passthrough mode (which have been around, unnoticed, since 4.12), as well as a late arriving fix for an error path bug in the passthrough fix
   - Fix a corner case memory leak

  dm-verity:
   - Another set of minor bugfixes and code cleanups to the forward error correction code

  dm-mirror:
   - Fix minor initialization bug
   - Fix overflow crash on large devices with small region sizes

  dm-crypt:
   - Reimplement elephant diffuser using AES library and minor cleanups

  dm-core:
   - Claude found a buffer overflow in /dev/mapper/control ioctl handling
   - make dm_mod.wait_for correctly wait for partitions
   - minor code fixes and cleanups"

* tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm cache: fix missing return in invalidate_committed's error path
  dm: fix a buffer overflow in ioctl processing
  dm-crypt: Make crypt_iv_operations::post return void
  dm vdo: Fix spelling mistake "postive" -> "positive"
  dm: provide helper to set stacked limits
  dm-integrity: always set the io hints
  dm-integrity: fix mismatched queue limits
  dm-bufio: use kzalloc_flex
  dm vdo: save the formatted metadata to disk
  dm vdo: add formatting logic and initialization
  dm vdo: add synchronous metadata I/O submission helper
  dm vdo: add geometry block structure
  dm vdo: add geometry block encoding
  dm vdo: add upfront validation for logical size
  dm vdo: add formatting parameters to table line
  dm vdo: add super block initialization to encodings.c
  dm vdo: add geometry block initialization to encodings.c
  dm-crypt: Make crypt_iv_operations::wipe return void
  dm-crypt: Reimplement elephant diffuser using AES library
  dm-verity-fec: warn even when there were no errors
  ...
2026-04-10 | dm cache: fix missing return in invalidate_committed's error path | Ming-Hung Tsai | 1 | -1/+3
In passthrough mode, dm-cache defers write submission until after metadata commit completes via the invalidate_committed() continuation. On commit error, invalidate_committed() calls invalidate_complete() to end the bio and free the migration struct, after which it should return immediately. The patch 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") omitted this early return, causing execution to fall through into the success path on error. This results in use-after-free on the migration struct in the subsequent calls. Fix by adding the missing return after the invalidate_complete() call. Fixes: 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/ Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
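The shape of this fix, returning right after the call that frees the object, can be sketched in user space. The names below (`struct migration`, `migration_complete()`, `commit_done()`) are hypothetical stand-ins for the migration struct, invalidate_complete() and invalidate_committed(); the sketch only shows the control flow, not dm-cache's continuation machinery.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the migration struct. */
struct migration {
	int status;
};

/* Reports the final status to the caller and frees the migration. */
static void migration_complete(struct migration *mg, int status, int *result)
{
	*result = status;
	free(mg);
}

/* Continuation run after the metadata commit finishes. */
void commit_done(struct migration *mg, int commit_error, int *result)
{
	if (commit_error) {
		migration_complete(mg, commit_error, result);
		/*
		 * mg has been freed. Without this return, execution would
		 * fall through into the success path and touch freed memory.
		 */
		return;
	}

	/* Success path: mg is still owned here. */
	mg->status = 0;
	migration_complete(mg, 0, result);
}
```

Running the error case under a memory checker such as AddressSanitizer with the `return` removed would be one way to observe the use-after-free that the commit message describes.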
2026-04-09 | dm: fix a buffer overflow in ioctl processing | Mikulas Patocka | 1 | -0/+4
Tony Asleson (using Claude) found a buffer overflow in dm-ioctl in the function retrieve_status:

1. The code in retrieve_status checks that the output string fits into the output buffer and writes the output string there
2. Then, the code aligns the "outptr" variable to the next 8-byte boundary: outptr = align_ptr(outptr);
3. The alignment doesn't check for overflow, so outptr could point past the buffer end
4. The "for" loop is iterated again and executes: remaining = len - (outptr - outbuf);
5. If "outptr" points past "outbuf + len", the arithmetic wraps around and the variable "remaining" contains an unusually high number
6. With "remaining" being high, the code writes more data past the end of the buffer

Luckily, this bug has no security implications because:

1. Only root can issue device mapper ioctls
2. The commonly used libraries that communicate with device mapper (libdevmapper and devicemapper-rs) use a buffer size that is aligned to 8 bytes, so "outptr = align_ptr(outptr)" can't overshoot the input buffer and the bug can't happen accidentally

Reported-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Bryn M. Reeves <bmr@redhat.com>
Cc: stable@vger.kernel.org
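The wraparound in this commit message can be reproduced with plain offset arithmetic. The sketch below is a user-space illustration that uses offsets instead of pointers; `ALIGN8`, `remaining_unchecked()` and `remaining_checked()` are hypothetical names, and the guard shown is one possible check, not necessarily the one the patch adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Round an offset up to the next multiple of 8. */
#define ALIGN8(x) (((x) + (size_t)7) & ~(size_t)7)

/*
 * Mirrors the described bug: after aligning, the offset may exceed len,
 * and the unsigned subtraction wraps to a very large value.
 */
size_t remaining_unchecked(size_t len, size_t used)
{
	size_t off = ALIGN8(used);

	return len - off;
}

/*
 * Guarded variant: reports failure when the aligned offset overshoots the
 * buffer. *remaining is written only on success.
 */
bool remaining_checked(size_t len, size_t used, size_t *remaining)
{
	size_t off = ALIGN8(used);

	if (off > len)
		return false;

	*remaining = len - off;
	return true;
}
```

With a 13-byte buffer and 10 bytes used, the aligned offset is 16. `remaining_unchecked(13, 10)` then yields a value far larger than the buffer, while `remaining_checked()` reports failure. A buffer length that is a multiple of 8 cannot be overshot this way, which matches the commit's note about libdevmapper and devicemapper-rs.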
2026-04-07 | md/raid5: fix soft lockup in retry_aligned_read() | Chia-Ming Chang | 1 | -1/+7
When retry_aligned_read() encounters an overlapped stripe, it releases the stripe via raid5_release_stripe() which puts it on the lockless released_stripes llist. In the next raid5d loop iteration, release_stripe_list() drains the stripe onto handle_list (since STRIPE_HANDLE is set by the original IO), but retry_aligned_read() runs before handle_active_stripes() and removes the stripe from handle_list via find_get_stripe() -> list_del_init(). This prevents handle_stripe() from ever processing the stripe to resolve the overlap, causing an infinite loop and soft lockup. Fix this by using __release_stripe() with temp_inactive_list instead of raid5_release_stripe() in the failure path, so the stripe does not go through the released_stripes llist. This allows raid5d to break out of its loop, and the overlap will be resolved when the stripe is eventually processed by handle_stripe(). Fixes: 773ca82fa1ee ("raid5: make release_stripe lockless") Cc: stable@vger.kernel.org Signed-off-by: FengWei Shih <dannyshih@synology.com> Signed-off-by: Chia-Ming Chang <chiamingc@synology.com> Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md: wake raid456 reshape waiters before suspend | Yu Kuai | 1 | -0/+11
During raid456 reshape, direct IO across the reshape position can sleep in raid5_make_request() waiting for reshape progress while still holding an active_io reference. If userspace then freezes reshape and writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io and waits for all in-flight IO to drain. This can deadlock: the IO needs reshape progress to continue, but the reshape thread is already frozen, so the active_io reference is never dropped and suspend never completes. raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do the same for normal md suspend when reshape is already interrupted, so waiting raid456 IO can abort, drop its reference, and let suspend finish. The mdadm test tests/25raid456-reshape-deadlock reproduces the hang. Fixes: 714d20150ed8 ("md: add new helpers to suspend/resume array") Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid1: serialize overlap io for writemostly disk | Xiao Ni | 3 | -14/+39
Previously, using wait_event() would wake up all waiters simultaneously, and they would compete for the tree lock. The bio which gets the lock first will be handled, so the write sequence cannot be guaranteed. For example: bio1(100,200) bio2(150,200) bio3(150,300) The write sequence of fast device is bio1,bio2,bio3. But the write sequence of slow device could be bio1,bio3,bio2 due to lock competition. This causes data corruption. Replace waitqueue with a fifo list to guarantee the write sequence. And it also needs to iterate the list when removing one entry. If not, it may miss the opportunity to wake up the waiting io. For example: bio1(1,3), bio2(2,4) bio3(5,7), bio4(6,8) These four bios are in the same bucket. bio1 and bio3 are inserted into the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the first one. bio3 returns from slow disk and tries to wake up the waiting bios. bio2 is removed from the list and will be handled. But bio1 hasn't finished. So bio2 will be added into waiting list again. Then bio1 returns from slow disk and wakes up waiting bios. bio4 is removed from the list and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left on the waiting list. So it needs to iterate the waiting list to wake up the right bio. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/md-llbitmap: optimize initial sync with write_zeroes_unmap support | Yu Kuai | 1 | -1/+61
For RAID-456 arrays with llbitmap, if all underlying disks support write_zeroes with unmap, issue write_zeroes to zero all disk data regions and initialize the bitmap to BitCleanUnwritten instead of BitUnwritten.

This optimization skips the initial XOR parity building because:

1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data has been written

The implementation adds two helper functions:

- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each rdev's data region to zero all disks

The zeroing and bitmap state setting happens in llbitmap_init_state() during bitmap initialization. If any disk fails to zero, we fall back to BitUnwritten and normal lazy recovery.

This significantly reduces array initialization time for RAID-456 arrays built on modern NVMe SSDs or other devices that support write_zeroes with unmap.

Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
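The parity argument (0 XOR 0 XOR ... = 0) can be checked with a few lines of user-space C. `xor_parity()` is a hypothetical helper for the example, and it models only the XOR parity block; the second RAID-6 syndrome is not covered by this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Compute the XOR parity of ndisks data chunks, each chunk bytes long,
 * stored back to back in data. The result is written to parity.
 */
void xor_parity(const uint8_t *data, size_t ndisks, size_t chunk,
		uint8_t *parity)
{
	size_t d, i;

	memset(parity, 0, chunk);
	for (d = 0; d < ndisks; d++)
		for (i = 0; i < chunk; i++)
			parity[i] ^= data[d * chunk + i];
}
```

For all-zero data chunks the computed parity is all zero as well, so a zeroed parity chunk already matches the zeroed data and no parity build is needed for that region.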
2026-04-07 | md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building | Yu Kuai | 1 | -12/+128
Add new states to the llbitmap state machine to support proactive XOR parity building for RAID-5 arrays. This allows users to pre-build parity data for unwritten regions before any user data is written.

New states added:

- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data written yet. When user writes to this region, it transitions to BitDirty.

New actions added:

- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/SyncingUnwritten states back to Unwritten before recovery starts.

State flows:

- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten before recovery starts, so recovery only rebuilds regions with user data

A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync (write-only) to trigger proactive sync. This only works for RAID-456 arrays.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
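The state flows listed in this commit message can be written down as a small transition function. This is a simplified model, not the llbitmap implementation: the `ST_*` and `ACT_*` names are hypothetical, `ACT_SYNC_DONE` and `ACT_SETTLE` are invented labels for the "(sync)" step and the unlabeled Dirty -> Clean step, and any transition the message does not list is treated as a no-op.

```c
/* Hypothetical names modeling the states mentioned in the commit. */
enum region_state {
	ST_UNWRITTEN,
	ST_NEED_SYNC,
	ST_DIRTY,
	ST_CLEAN,
	ST_NEED_SYNC_UNWRITTEN,
	ST_SYNCING_UNWRITTEN,
	ST_CLEAN_UNWRITTEN,
};

enum region_action {
	ACT_WRITE,		/* user data written to the region */
	ACT_SYNC_DONE,		/* assumed label for the "(sync)" step */
	ACT_SETTLE,		/* assumed label for Dirty -> Clean */
	ACT_PROACTIVE_SYNC,	/* models BitmapActionProactiveSync */
	ACT_CLEAR_UNWRITTEN,	/* models BitmapActionClearUnwritten */
};

/* Return the next state; unlisted transitions leave the state unchanged. */
enum region_state next_state(enum region_state s, enum region_action a)
{
	switch (a) {
	case ACT_WRITE:
		if (s == ST_UNWRITTEN)
			return ST_NEED_SYNC;
		if (s == ST_CLEAN_UNWRITTEN)
			return ST_DIRTY;
		break;
	case ACT_SYNC_DONE:
		if (s == ST_NEED_SYNC)
			return ST_DIRTY;
		if (s == ST_NEED_SYNC_UNWRITTEN)
			return ST_CLEAN_UNWRITTEN;
		break;
	case ACT_SETTLE:
		if (s == ST_DIRTY)
			return ST_CLEAN;
		break;
	case ACT_PROACTIVE_SYNC:
		if (s == ST_UNWRITTEN)
			return ST_NEED_SYNC_UNWRITTEN;
		break;
	case ACT_CLEAR_UNWRITTEN:
		if (s == ST_CLEAN_UNWRITTEN || s == ST_NEED_SYNC_UNWRITTEN ||
		    s == ST_SYNCING_UNWRITTEN)
			return ST_UNWRITTEN;
		break;
	}
	return s;
}
```

Walking the proactive flow through this model gives Unwritten, NeedSyncUnwritten, CleanUnwritten, then Dirty and Clean once user data arrives, while `ACT_CLEAR_UNWRITTEN` sends any of the three "unwritten" intermediate states back to Unwritten.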
2026-04-07 | md: add fallback to correct bitmap_ops on version mismatch | Yu Kuai | 1 | -1/+110
If the default bitmap version and the on-disk version don't match, and mdadm is not recent enough to set bitmap_type, set bitmap_ops based on the on-disk version. Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid5: validate payload size before accessing journal metadata | Junrui Luo | 1 | -15/+33
r5c_recovery_analyze_meta_block() and r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a journal metadata block using on-disk payload size fields without validating them against the remaining space in the metadata block. A corrupted journal containing payload sizes that extend beyond the PAGE_SIZE boundary can cause out-of-bounds reads when accessing payload fields or computing offsets. Add bounds validation for each payload type to ensure the full payload fits within meta_size before processing. Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1") Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
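The kind of bounds validation described here can be sketched with a generic length-prefixed record walker. The record layout (`struct payload_hdr` with host-endian fields) and `count_payloads()` are hypothetical and much simpler than the raid5 journal format; the sketch only shows checking each size field against the space left before using it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload header; size is the body length in bytes. */
struct payload_hdr {
	uint16_t type;
	uint16_t flags;
	uint32_t size;
};

/*
 * Walk the payloads packed into block[0..meta_size). Returns the number
 * of payloads, or -1 if a header or body would extend past meta_size.
 */
int count_payloads(const uint8_t *block, size_t meta_size)
{
	size_t off = 0;
	int n = 0;

	while (off < meta_size) {
		struct payload_hdr h;

		/* The header itself must fit in the remaining space. */
		if (meta_size - off < sizeof(h))
			return -1;
		memcpy(&h, block + off, sizeof(h));
		off += sizeof(h);

		/* The body must fit as well; compare without overflow. */
		if (h.size > meta_size - off)
			return -1;
		off += h.size;
		n++;
	}
	return n;
}
```

Both checks compare against `meta_size - off` rather than computing `off + size`, so a huge on-disk size cannot overflow the offset before the comparison takes place.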
2026-04-07 | md: remove unused static md_wq workqueue | Abd-Alrhman Masalkhi | 1 | -8/+0
The md_wq workqueue is defined as static and initialized in md_init(), but it is not used anywhere within md.c. All asynchronous and deferred work in this file is handled via md_misc_wq or dedicated md threads. Fixes: b75197e86e6d3 ("md: Remove flush handling") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations | Gregory Price | 1 | -9/+9
syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof() triggered by create_strip_zones() in the RAID0 driver. When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER). Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with __GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the corresponding kfree calls to kvfree. Both arrays are pure metadata lookup tables (arrays of pointers and zone descriptors) accessed only via indexing, so they do not require physically contiguous memory. Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/ Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-06 | md: fix array_state=clear sysfs deadlock | Yu Kuai | 1 | -1/+7
When "clear" is written to array_state, md_attr_store() breaks sysfs active protection so the array can delete itself from its own sysfs store method.

However, md_attr_store() currently drops the mddev reference before calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0) has made the mddev eligible for delayed deletion, the temporary kobject reference taken by sysfs_break_active_protection() can become the last kobject reference protecting the md kobject.

That allows sysfs_unbreak_active_protection() to drop the last kobject reference from the current sysfs writer context. kobject teardown then recurses into kernfs removal while the current sysfs node is still being unwound, and lockdep reports recursive locking on kn->active with kernfs_drain() in the call chain.

Reproducer on an existing level:

1. Create an md0 linear array and activate it:

     mknod /dev/md0 b 9 0
     echo none > /sys/block/md0/md/metadata_version
     echo linear > /sys/block/md0/md/level
     echo 1 > /sys/block/md0/md/raid_disks
     echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
     echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
         /sys/block/md0/md/dev-sdb/size
     echo 0 > /sys/block/md0/md/dev-sdb/slot
     echo active > /sys/block/md0/md/array_state

2. Wait briefly for the array to settle, then clear it:

     sleep 2
     echo clear > /sys/block/md0/md/array_state

The warning looks like:

  WARNING: possible recursive locking detected
  bash/588 is trying to acquire lock:
  (kn->active#65) at __kernfs_remove+0x157/0x1d0
  but task is already holding lock:
  (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
  ...
  Call Trace:
   kernfs_drain
   __kernfs_remove
   kernfs_remove_by_name_ns
   sysfs_remove_group
   sysfs_remove_groups
   __kobject_del
   kobject_put
   md_attr_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write

Restore active protection before mddev_put() so the extra sysfs kobject reference is dropped while the mddev is still held alive. The actual md kobject deletion is then deferred until after the sysfs write path has fully returned.

Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-03 | bcache: fix uninitialized closure object | Mingzhe Zou | 1 | -1/+2
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and crash"), we adopted a simple modification suggestion from AI to fix the use-after-free. But in actual testing, we found an extreme case where the device is stopped before calling bch_write_bdev_super(). At this point, struct closure sb_write has not been initialized yet. For this patch, we ensure that sb_bio has been completed via sb_write_mutex. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03 | bcache: fix cached_dev.sb_bio use-after-free and crash | Mingzhe Zou | 1 | -0/+7
In our production environment, we have received multiple crash reports regarding libceph, which have caught our attention:

```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```

After analyzing the coredump file, we found that the memory at dc->sb_bio had been freed. cached_dev is only freed when it is stopped, and sb_bio is embedded in struct cached_dev rather than allocated for each write. So if the device is stopped while the superblock is being written, the freed memory is accessed at endio.

This patch makes cached_dev_free wait for sb_write to complete.

It should be noted that we analyzed the cause of the problem ourselves, then gave all the details to QWEN and adopted the modifications it made.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-30dm-crypt: Make crypt_iv_operations::post return voidEric Biggers1-18/+13
Since all implementations of crypt_iv_operations::post now return 0, change the return type to void. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
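The kind of change described above can be sketched roughly as follows. This is a simplified userspace illustration, not the dm-crypt code: `struct iv_ops_sketch`, `xor_post` and `finish_sector` are hypothetical names, and the real `post` callback takes different arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an IV operations table. */
struct iv_ops_sketch {
    /* Before the change this would have been:
     *     int (*post)(uint8_t *iv, size_t len);
     * with every implementation returning 0. */
    void (*post)(uint8_t *iv, size_t len);
};

/* Example implementation: it cannot fail, so it has nothing to return. */
static void xor_post(uint8_t *iv, size_t len)
{
    for (size_t i = 0; i < len; i++)
        iv[i] ^= 0xff;
}

static const struct iv_ops_sketch xor_ops = { .post = xor_post };

/* The caller no longer carries a dead error path for the post step. */
void finish_sector(const struct iv_ops_sketch *ops, uint8_t *iv, size_t len)
{
    assert(ops != NULL && iv != NULL);
    if (ops->post)
        ops->post(iv, len);
}
```

Dropping the return value removes error handling in callers that could never trigger, which is presumably where most of the deleted lines in a change like this come from.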
2026-03-30dm vdo: Fix spelling mistake "postive" -> "positive"Colin Ian King1-1/+1
There is a spelling mistake in a vdo_log_error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-28dm: provide helper to set stacked limitsKeith Busch4-27/+4
There are multiple device mappers that set up their stacking limits exactly the same for the logical, physical and minimum IO queue limits. Provide a helper for it. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
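A minimal sketch of what such a helper might look like, assuming the shared pattern is "raise each of the three limits to at least the target's block size". The struct and function names here are hypothetical; the field names mirror the logical, physical and minimum IO limits mentioned above, and the actual helper's name and signature are not given in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three queue limits named in the commit. */
struct limits_sketch {
    unsigned int logical_block_size;
    unsigned int physical_block_size;
    unsigned int io_min;
};

static unsigned int max_uint(unsigned int a, unsigned int b)
{
    return a > b ? a : b;
}

/* Raise each limit to at least block_size; never lower an existing limit. */
void stack_block_size_limits(struct limits_sketch *lim, unsigned int block_size)
{
    assert(lim != NULL);
    lim->logical_block_size = max_uint(lim->logical_block_size, block_size);
    lim->physical_block_size = max_uint(lim->physical_block_size, block_size);
    lim->io_min = max_uint(lim->io_min, block_size);
}
```

With one shared helper, each target's io_hints code reduces to a single call instead of three near-identical assignments.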
2026-03-28dm-integrity: always set the io hintsKeith Busch1-13/+8
Don't depend on the defaults to be what is desired if the integrity device was set up with 512b sector size. Always set the queue limits to be at least what the device mapper wants. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-28dm-integrity: fix mismatched queue limitsKeith Busch1-3/+9
A user can run integritysetup on a backing device with a 4k logical block size but request that the dm device use 1k or 2k. This mismatch creates an inconsistency such that the dm device would report limits for IO that it can't actually execute. Fix this by using the backing device's limits if they are larger. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm-bufio: use kzalloc_flexRosen Penev1-2/+2
Avoid manual size calculations and use the proper helper. Add __counted_by for extra runtime analysis. Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
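For readers unfamiliar with the pattern, the following userspace sketch shows the manual size calculation that a flexible-array allocation helper is meant to hide. It does not call the kernel helper; `struct cache_sketch` and `cache_alloc` are hypothetical, and `calloc` stands in for a zeroing kernel allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure ending in a flexible array member. In kernel
 * code the array could be annotated with __counted_by(num_slots) so that
 * bounds checkers know its runtime length. */
struct cache_sketch {
    size_t num_slots;
    int slots[];
};

/* Allocate a zeroed structure with room for n slots, or return NULL. */
struct cache_sketch *cache_alloc(size_t n)
{
    struct cache_sketch *c;

    /* Manual overflow check: this is the error-prone part. */
    if (n > (SIZE_MAX - sizeof(*c)) / sizeof(c->slots[0]))
        return NULL;

    c = calloc(1, sizeof(*c) + n * sizeof(c->slots[0]));
    if (c)
        c->num_slots = n; /* set the count before touching slots[] */
    return c;
}
```

Setting the count field immediately after allocation matters when the array is bounds-annotated, since checks on `slots[]` are made against that field.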
2026-03-26dm vdo: save the formatted metadata to diskBruce Johnston3-20/+147
Add vdo_save_super_block() and vdo_save_geometry_block() to perform asynchronous writes of the super block and geometry block respectively. Add vdo_clear_layout() to zero the UDS index's first block, the block map partition, and the recovery journal partition. These operations are driven by new phases in the pre-load state machine (PRE_LOAD_PHASE_FORMAT_*), ensuring that disk writes happen during pre-resume rather than during dmsetup create. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add formatting logic and initializationBruce Johnston2-25/+81
Add the core formatting logic. The initialization path is updated to read the geometry block (block 0 on the storage device). If the block is entirely zeroed, the device is treated as unformatted and vdo_format() is called. Otherwise, the existing geometry is parsed and the VDO is loaded as before. The vdo_format() function initializes the volume geometry and super block, and marks the VDO as needing its layout saved to disk. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
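The format-or-load decision described above can be sketched as below. This is an illustration only: `block_is_zeroed`, `choose_load_action` and the enum are hypothetical names, and the real code operates on the buffer read from block 0 rather than on a caller-supplied array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of inspecting the geometry block. */
enum load_action_sketch {
    LOAD_ACTION_FORMAT, /* block is all zeroes: treat device as unformatted */
    LOAD_ACTION_PARSE,  /* block has content: parse the existing geometry */
};

/* Return true only if every byte of the block is zero. */
bool block_is_zeroed(const uint8_t *block, size_t len)
{
    assert(block != NULL);
    for (size_t i = 0; i < len; i++) {
        if (block[i] != 0)
            return false;
    }
    return true;
}

enum load_action_sketch choose_load_action(const uint8_t *geometry_block,
                                           size_t len)
{
    return block_is_zeroed(geometry_block, len) ? LOAD_ACTION_FORMAT
                                                : LOAD_ACTION_PARSE;
}
```

A single non-zero byte is enough to route the device down the parse path, so a corrupted but non-empty geometry block is reported as a parse error instead of being silently reformatted.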
2026-03-26dm vdo: add synchronous metadata I/O submission helperBruce Johnston3-13/+34
Add vdo_submit_metadata_vio_wait(), a synchronous I/O submission helper that blocks until completion. This is needed for I/O during early initialization before work queues are available. Refactor read_geometry_block() to use it. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add geometry block structureBruce Johnston2-45/+66
Introduce a vdo_geometry_block structure, containing a vio and buffer, mirroring the existing vdo_super_block structure. Both are now initialized at VDO startup and freed at shutdown, establishing the infrastructure needed to read and write the geometry block using the same mechanisms as the super block. Refactor read_geometry_block() to use the new structure. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add geometry block encodingBruce Johnston2-0/+58
Add vdo_encode_volume_geometry() to write the geometry block into a buffer so that it can be written to disk. The corresponding decode path already exists. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add upfront validation for logical sizeBruce Johnston1-0/+6
Add a validation check that the logical size passed via the table line does not exceed MAXIMUM_VDO_LOGICAL_BLOCKS. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
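An upfront bound check of this kind might look like the sketch below. The limit used here is a placeholder chosen for illustration, not the real value of MAXIMUM_VDO_LOGICAL_BLOCKS, and `validate_logical_blocks` is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder limit for illustration; NOT the real
 * MAXIMUM_VDO_LOGICAL_BLOCKS value. */
#define MAX_LOGICAL_BLOCKS_SKETCH (UINT64_C(1) << 40)

/* Reject a logical size above the limit before any setup work is done. */
int validate_logical_blocks(uint64_t logical_blocks)
{
    if (logical_blocks > MAX_LOGICAL_BLOCKS_SKETCH)
        return -EINVAL;
    return 0;
}
```

Checking at table-parse time means an oversized request fails immediately with a clear error, before any structures sized from that value are allocated.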
2026-03-26dm vdo: add formatting parameters to table lineBruce Johnston5-17/+111
Extend the dm table line with three new optional parameters: indexMemory (UDS index memory size), indexSparse (dense vs sparse index), and slabSize (blocks per allocation slab). These values are parsed, validated, and stored in the device configuration for use during formatting. Rework the slab size constants from the single MAX_VDO_SLAB_BITS into explicit MIN_VDO_SLAB_BLOCKS, MAX_VDO_SLAB_BLOCKS, and DEFAULT_VDO_SLAB_BLOCKS values. Bump the target version from 9.1.0 to 9.2.0 to reflect this table line change. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add super block initialization to encodings.cBruce Johnston3-0/+90
Add vdo_initialize_component_states() to populate the super block, computing the space required for the main VDO components on disk. Those include the slab depot, block map, and recovery journal. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26dm vdo: add geometry block initialization to encodings.cBruce Johnston4-0/+103
Add vdo_initialize_volume_geometry() to populate the geometry block, computing the space required for the two main regions on disk. Add uds_compute_index_size() to calculate the space required for the UDS indexer from the UDS configuration. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23dm-crypt: Make crypt_iv_operations::wipe return voidEric Biggers1-14/+6
Since all implementations of crypt_iv_operations::wipe now return 0, change the return type to void. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23dm-crypt: Reimplement elephant diffuser using AES libraryEric Biggers2-55/+31
Simplify and optimize dm-crypt's implementation of Bitlocker's "elephant diffuser" to use the AES library instead of an "ecb(aes)" crypto_skcipher. Note: struct aes_enckey is fixed-size, so it could be embedded directly in struct iv_elephant_private. But I kept it as a separate allocation so that the size of struct crypt_config doesn't increase. The elephant diffuser is rarely used in dm-crypt. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>