<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/raid5.c, branch v6.6.132</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-01-11T14:21:35+00:00</updated>
<entry>
<title>md/raid5: fix IO hang when array is broken with IO inflight</title>
<updated>2026-01-11T14:21:35+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai@fnnas.com</email>
</author>
<published>2025-11-17T08:55:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fe78b35d19aeb7778aaf95ef539b061e6aaf30ae'/>
<id>urn:sha1:fe78b35d19aeb7778aaf95ef539b061e6aaf30ae</id>
<content type='text'>
[ Upstream commit a913d1f6a7f607c110aeef8b58c8988f47a4b24e ]

Following test can cause IO hang:

mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none
sleep 5
echo 1 &gt; /sys/block/sda/device/delete
echo 1 &gt; /sys/block/sdb/device/delete
echo 1 &gt; /sys/block/sdc/device/delete
echo 1 &gt; /sys/block/sdd/device/delete

dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct

Root cause:

1) all disks removed, however all rdevs in the array is still in sync,
IO will be issued normally.

2) IO failure from sda, and set badblocks failed, sda will be faulty
and MD_SB_CHANGING_PENDING will be set.

3) error recovery try to recover this IO from other disks, IO will be
issued to sdb, sdc, and sdd.

4) IO failure from sdb, and set badblocks failed again, now array is
broken and will become read-only.

5) IO failure from sdc and sdd, however, stripe can't be handled anymore
because MD_SB_CHANGING_PENDING is set:

handle_stripe
 handle_stripe
 if (test_bit MD_SB_CHANGING_PENDING)
  set_bit STRIPE_HANDLE
  goto finish
  // skip handling failed stripe

release_stripe
 if (test_bit STRIPE_HANDLE)
  list_add_tail conf-&gt;hand_list

6) later raid5d can't handle failed stripe as well:

raid5d
 md_check_recovery
  md_update_sb
   if (!md_is_rdwr())
    // can't clear pending bit
    return
 if (test_bit MD_SB_CHANGING_PENDING)
  break;
  // can't handle failed stripe

Since MD_SB_CHANGING_PENDING can never be cleared for read-only array,
fix this problem by skip this checking for read-only array.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com
Fixes: d87f064f5874 ("md: never update metadata when array is read-only.")
Signed-off-by: Yu Kuai &lt;yukuai@fnnas.com&gt;
Reviewed-by: Li Nan &lt;linan122@huawei.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>md/md-bitmap: move bitmap_{start, end}write to md upper layer</title>
<updated>2025-02-21T12:57:26+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-02-10T07:33:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=96156eb5772629b741c3e1b440aaf66eca742bf6'/>
<id>urn:sha1:96156eb5772629b741c3e1b440aaf66eca742bf6</id>
<content type='text'>
commit cd5fc653381811f1e0ba65f5d169918cab61476f upstream.

There are two BUG reports that raid5 will hang at
bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
write is unbalanced, it's not quite clear where, and while reviewing raid5
code, it's found that bitmap operations can be optimized. For example,
for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to
the array:

┌────────────────────────────────────────────────────────────┐
│chunk 0                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┼
│  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘

Before this patch, 4 stripe head will be used, and each sh will attach
bio for 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means total 12 times.
 - 3 times (0 + 4k), for (A0, A1 and A2)
 - 3 times (4 + 4k), for (B0, B1 and B2)
 - 3 times (8 + 4k), for (C0, C1 and C3)
 - 3 times (12 + 4k), for (D0, D1 and D3)

After this patch, md upper layer will calculate that IO range (0 + 48k)
is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite()
just once.

Noted that this patch will align bitmap ranges to the chunks, for example,
if user issue a IO (0 + 4k) to array:

- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k) for chunk 0;

Usually, one bitmap bit will represent more than one disk chunk, and this
doesn't have any difference. And even if user really created a array
that one chunk contain multiple bits, the overhead is that more data
will be recovered after power failure.

Also remove STRIPE_BITMAP_PENDING since it's not used anymore.

[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
[There is no bitmap_operations, resolve conflicts by replacing
bitmap_ops-&gt;{startwrite, endwrite} with md_bitmap_{startwrite, endwrite}]
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: implement pers-&gt;bitmap_sector()</title>
<updated>2025-02-21T12:57:26+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-02-10T07:33:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b29e6400be51c214a0ad30dce937348b57abe33a'/>
<id>urn:sha1:b29e6400be51c214a0ad30dce937348b57abe33a</id>
<content type='text'>
commit 9c89f604476cf15c31fbbdb043cff7fbf1dbe0cb upstream.

Bitmap is used for the whole array for raid1/raid10, hence IO for the
array can be used directly for bitmap. However, bitmap is used for
underlying disks for raid5, hence IO for the array can't be used
directly for bitmap.

Implement pers-&gt;bitmap_sector() for raid5 to convert IO ranges from the
array to the underlying disks.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20250109015145.158868-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
[ Resolve minor conflicts ]
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/md-bitmap: remove the last parameter for bimtap_ops-&gt;endwrite()</title>
<updated>2025-02-21T12:57:26+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-02-10T07:33:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3e41ab9aef120ed172b8bada50839894341f1d7d'/>
<id>urn:sha1:3e41ab9aef120ed172b8bada50839894341f1d7d</id>
<content type='text'>
commit 4f0e7d0e03b7b80af84759a9e7cfb0f81ac4adae upstream.

For the case that IO failed for one rdev, the bit will be mark as NEEDED
in following cases:

1) If badblocks is set and rdev is not faulty;
2) If rdev is faulty;

Case 1) is useless because synchronize data to badblocks make no sense.
Case 2) can be replaced with mddev-&gt;degraded.

Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since
case 2) no longer use them.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
[ Resolve minor conflicts ]
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()</title>
<updated>2025-02-21T12:57:25+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2025-02-10T07:33:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3b666dad382876f3bb741a62508d20fb94e58631'/>
<id>urn:sha1:3b666dad382876f3bb741a62508d20fb94e58631</id>
<content type='text'>
commit 08c50142a128dcb2d7060aa3b4c5db8837f7a46a upstream.

behind_write is only used in raid1, prepare to refactor
bitmap_{start/end}write(), there are no functional changes.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Reviewed-by: Xiao Ni &lt;xni@redhat.com&gt;
Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
[There is no bitmap_operations, resolve conflicts by exporting new api
md_bitmap_{start,end}_behind_write]
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: recheck if reshape has finished with device_lock held</title>
<updated>2025-02-21T12:57:25+00:00</updated>
<author>
<name>Benjamin Marzinski</name>
<email>bmarzins@redhat.com</email>
</author>
<published>2025-02-10T07:33:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=03877cb32666e76c543aacf2e83a7ebac8d4c9b7'/>
<id>urn:sha1:03877cb32666e76c543aacf2e83a7ebac8d4c9b7</id>
<content type='text'>
commit 25b3a8237a03ec0b67b965b52d74862e77ef7115 upstream.

When handling an IO request, MD checks if a reshape is currently
happening, and if so, where the IO sector is in relation to the reshape
progress. MD uses conf-&gt;reshape_progress for both of these tasks.  When
the reshape finishes, conf-&gt;reshape_progress is set to MaxSector.  If
this occurs after MD checks if the reshape is currently happening but
before it calls ahead_of_reshape(), then ahead_of_reshape() will end up
comparing the IO sector against MaxSector. During a backwards reshape,
this will make MD think the IO sector is in the area not yet reshaped,
causing it to use the previous configuration, and map the IO to the
sector where that data was before the reshape.

This bug can be triggered by running the lvm2
lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop,
although it's very hard to reproduce.

Fix this by factoring the code that checks where the IO sector is in
relation to the reshape out to a helper called get_reshape_loc(),
which reads reshape_progress and reshape_safe while holding the
device_lock, and then rechecks if the reshape has finished before
calling ahead_of_reshape with the saved values.

Also use the helper during the REQ_NOWAIT check to see if the location
is inside of the reshape region.

Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.")
Signed-off-by: Benjamin Marzinski &lt;bmarzins@redhat.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: avoid BUG_ON() while continue reshape after reassembling</title>
<updated>2024-08-14T11:58:41+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2024-06-11T13:22:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=775a9ba16c9ffe98fe54ebf14e55d5660f2bf600'/>
<id>urn:sha1:775a9ba16c9ffe98fe54ebf14e55d5660f2bf600</id>
<content type='text'>
[ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ]

Currently, mdadm support --revert-reshape to abort the reshape while
reassembling, as the test 07revert-grow. However, following BUG_ON()
can be triggerred by the test:

kernel BUG at drivers/md/raid5.c:6278!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
irq event stamp: 158985
CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94
RIP: 0010:reshape_request+0x3f1/0xe60
Call Trace:
 &lt;TASK&gt;
 raid5_sync_request+0x43d/0x550
 md_do_sync+0xb7a/0x2110
 md_thread+0x294/0x2b0
 kthread+0x147/0x1c0
 ret_from_fork+0x59/0x70
 ret_from_fork_asm+0x1a/0x30
 &lt;/TASK&gt;

Root cause is that --revert-reshape update the raid_disks from 5 to 4,
while reshape position is still set, and after reassembling the array,
reshape position will be read from super block, then during reshape the
checking of 'writepos' that is caculated by old reshape position will
fail.

Fix this panic the easy way first, by converting the BUG_ON() to
WARN_ON(), and stop the reshape if checkings fail.

Noted that mdadm must fix --revert-shape as well, and probably md/raid
should enhance metadata validation as well, however this means
reassemble will fail and there must be user tools to fix the wrong
metadata.

Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING</title>
<updated>2024-06-16T11:47:34+00:00</updated>
<author>
<name>Yu Kuai</name>
<email>yukuai3@huawei.com</email>
</author>
<published>2024-03-22T08:10:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cd2538e5af495b3c747e503db346470fc1ffc447'/>
<id>urn:sha1:cd2538e5af495b3c747e503db346470fc1ffc447</id>
<content type='text'>
commit 151f66bb618d1fd0eeb84acb61b4a9fa5d8bb0fa upstream.

Xiao reported that lvm2 test lvconvert-raid-takeover.sh can hang with
small possibility, the root cause is exactly the same as commit
bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")

However, Dan reported another hang after that, and junxiao investigated
the problem and found out that this is caused by plugged bio can't issue
from raid5d().

Current implementation in raid5d() has a weird dependence:

1) md_check_recovery() from raid5d() must hold 'reconfig_mutex' to clear
   MD_SB_CHANGE_PENDING;
2) raid5d() handles IO in a deadloop, until all IO are issued;
3) IO from raid5d() must wait for MD_SB_CHANGE_PENDING to be cleared;

This behaviour is introduce before v2.6, and for consequence, if other
context hold 'reconfig_mutex', and md_check_recovery() can't update
super_block, then raid5d() will waste one cpu 100% by the deadloop, until
'reconfig_mutex' is released.

Refer to the implementation from raid1 and raid10, fix this problem by
skipping issue IO if MD_SB_CHANGE_PENDING is still set after
md_check_recovery(), daemon thread will be woken up when 'reconfig_mutex'
is released. Meanwhile, the hang problem will be fixed as well.

Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Reported-and-tested-by: Dan Moulding &lt;dan@danm.net&gt;
Closes: https://lore.kernel.org/all/20240123005700.9302-1-dan@danm.net/
Investigated-by: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Signed-off-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Link: https://lore.kernel.org/r/20240322081005.1112401-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: fix atomicity violation in raid5_cache_count</title>
<updated>2024-04-03T13:28:22+00:00</updated>
<author>
<name>Gui-Dong Han</name>
<email>2045gemini@gmail.com</email>
</author>
<published>2024-01-12T07:10:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ec4a68a25a231e2303716a0aba8ff1413d47c87'/>
<id>urn:sha1:8ec4a68a25a231e2303716a0aba8ff1413d47c87</id>
<content type='text'>
[ Upstream commit dfd2bf436709b2bccb78c2dda550dde93700efa7 ]

In raid5_cache_count():
    if (conf-&gt;max_nr_stripes &lt; conf-&gt;min_nr_stripes)
        return 0;
    return conf-&gt;max_nr_stripes - conf-&gt;min_nr_stripes;
The current check is ineffective, as the values could change immediately
after being checked.

In raid5_set_cache_size():
    ...
    conf-&gt;min_nr_stripes = size;
    ...
    while (size &gt; conf-&gt;max_nr_stripes)
        conf-&gt;min_nr_stripes = conf-&gt;max_nr_stripes;
    ...

Due to intermediate value updates in raid5_set_cache_size(), concurrent
execution of raid5_cache_count() and raid5_set_cache_size() may lead to
inconsistent reads of conf-&gt;max_nr_stripes and conf-&gt;min_nr_stripes.
The current checks are ineffective as values could change immediately
after being checked, raising the risk of conf-&gt;min_nr_stripes exceeding
conf-&gt;max_nr_stripes and potentially causing an integer overflow.

This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency bugs
including data races and atomicity violations. The above possible bug is
reported when our tool analyzes the source code of Linux 6.2.

To resolve this issue, it is suggested to introduce local variables
'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the
values remain stable throughout the check. Adding locks in
raid5_cache_count() fails to resolve atomicity violations, as
raid5_set_cache_size() may hold intermediate values of
conf-&gt;min_nr_stripes while unlocked. With this patch applied, our tool no
longer reports the bug, with the kernel configuration allyesconfig for
x86_64. Due to the lack of associated hardware, we cannot test the patch
in runtime testing, and just verify it according to the code logic.

Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han &lt;2045gemini@gmail.com&gt;
Reviewed-by: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""</title>
<updated>2024-01-25T23:36:01+00:00</updated>
<author>
<name>Song Liu</name>
<email>song@kernel.org</email>
</author>
<published>2024-01-25T08:21:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=aab69ef769707ad987ff905d79e0bd6591812580'/>
<id>urn:sha1:aab69ef769707ad987ff905d79e0bd6591812580</id>
<content type='text'>
This reverts commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d.

The original set [1][2] was expected to undo a suboptimal fix in [2], and
replace it with a better fix [1]. However, as reported by Dan Moulding [2]
causes an issue with raid5 with journal device.

Revert [2] for now to close the issue. We will follow up on another issue
reported by Juxiao Bi, as [2] is expected to fix it. We believe this is a
good trade-off, because the latter issue happens less freqently.

In the meanwhile, we will NOT revert [1], as it contains the right logic.

[1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
[2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")

Reported-by: Dan Moulding &lt;dan@danm.net&gt;
Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/
Fixes: bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Cc: stable@vger.kernel.org # v5.19+
Cc: Junxiao Bi &lt;junxiao.bi@oracle.com&gt;
Cc: Yu Kuai &lt;yukuai3@huawei.com&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
