<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/raid5.c, branch v4.4.171</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.4.171</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.4.171'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2018-09-19T20:48:57+00:00</updated>
<entry>
<title>md/raid5: fix data corruption of replacements after originals dropped</title>
<updated>2018-09-19T20:48:57+00:00</updated>
<author>
<name>BingJing Chang</name>
<email>bingjingc@synology.com</email>
</author>
<published>2018-08-01T09:08:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ccae23ff45ccd3e9c610c46ee8449c2617118daa'/>
<id>urn:sha1:ccae23ff45ccd3e9c610c46ee8449c2617118daa</id>
<content type='text'>
[ Upstream commit d63e2fc804c46e50eee825c5d3a7228e07048b47 ]

During raid5 replacement, stripes can be marked with the R5_NeedReplace
flag. Data can then be read from the device being replaced and written to
the replacing spare without reading all other devices. (This is 'replace'
mode, s.replacing = 1.) If the device being replaced is dropped, the
replacement is interrupted and resumed in pure recovery mode. However,
stripes that already existed before the interruption can no longer read
from the dropped device. They trigger many WARN_ON messages and cause
data corruption, because they write problematic data to the replacement
device and still update the progress.

# Erase disks (1MB + 2GB)
dd if=/dev/zero of=/dev/sda bs=1MB count=2049
dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
# Ensure array stores non-zero data
dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
# Start replacement
mdadm /dev/md0 -a /dev/sdd
mdadm /dev/md0 --replace /dev/sda

Then hot-unplug /dev/sda during recovery and wait for the recovery to finish.
echo check &gt; /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

Soon after /dev/sda is hot-unplugged, many WARN_ON messages appear and
the replacement recovery is interrupted. Once the recovery finishes, the
array contains corrupted data.

Actually, this is just an unhandled case of replacement. Since commit
&lt;f94c0b6658c7&gt; (md/raid5: fix interaction of 'replace' and 'recovery'.),
a NeedReplace device that is not UPTODATE is treated as an error, but the
code merely prints a WARN_ON and still marks these corrupted stripes
with R5_WantReplace (meaning they are ready for writes).

To fix this case, we can leverage the 'sync and replace' mode mentioned in
commit &lt;9a3e1101b827&gt; (md/raid5: detect and handle replacements during
recovery.). We can add logic to detect these stripes and use 'sync and
replace' mode for them.

Reported-by: Alex Chen &lt;alexchen@synology.com&gt;
Reviewed-by: Alex Wu &lt;alexwu@synology.com&gt;
Reviewed-by: Chung-Chiang Cheng &lt;cccheng@synology.com&gt;
Signed-off-by: BingJing Chang &lt;bingjingc@synology.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md: raid5: avoid string overflow warning</title>
<updated>2018-05-30T05:49:00+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2018-02-20T13:09:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e163691fa8e8403d1637690c7ef43c0a423deb25'/>
<id>urn:sha1:e163691fa8e8403d1637690c7ef43c0a423deb25</id>
<content type='text'>
[ Upstream commit 53b8d89ddbdbb0e4625a46d2cdbb6f106c52f801 ]

gcc warns about a possible overflow of the kmem_cache string, when adding
four characters to a string of the same length:

drivers/md/raid5.c: In function 'setup_conf':
drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
                                  ^~~~
drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
  sprintf(conf-&gt;cache_name[1], "%s-alt", conf-&gt;cache_name[0]);
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If I'm counting correctly, we need 11 characters for the fixed part
of the string and 18 characters for a 64-bit pointer (when no gendisk
is used), so that leaves three characters for conf-&gt;level, which should
always be sufficient.
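
The character budget above can be checked with a short sketch. Python is
used only for illustration; the 32-byte size comes from the gcc note, the
other two figures are the estimates quoted in this message, and the
function name is hypothetical.

```python
# Character budget for the cache name, using the figures quoted above.
BUF_SIZE = 32        # destination size reported in the gcc note
FIXED_PART = 11      # estimated fixed characters, per this message
POINTER_CHARS = 18   # estimated characters for a 64-bit pointer, per this message


def chars_left_for_level(buf_size=BUF_SIZE, fixed=FIXED_PART, pointer=POINTER_CHARS):
    """Return how many characters remain for the decimal RAID level."""
    return buf_size - fixed - pointer


print(chars_left_for_level())  # prints 3
```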

This makes the code use snprintf() with the correct length, to
make the code more robust against changes, and to get the compiler
to shut up.

In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for
kmem_cache when mddev has no gendisk") from 2010, Neil said that
the pointer could be removed "shortly" once devices without gendisk
are disallowed. I have no idea if that happened, but if it did, that
should probably be changed as well.

Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Shaohua Li &lt;sh.li@alibaba-inc.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock</title>
<updated>2018-04-13T17:50:00+00:00</updated>
<author>
<name>Julia Cartwright</name>
<email>julia@ni.com</email>
</author>
<published>2017-04-28T17:41:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5c6230241a858905eadbfcbfa19d715191d9d394'/>
<id>urn:sha1:5c6230241a858905eadbfcbfa19d715191d9d394</id>
<content type='text'>
[ Upstream commit 3d05f3aed5d721c2c77d20288c29ab26c6193ed5 ]

On mainline, there is no functional difference, just less code, and
symmetric lock/unlock paths.

On PREEMPT_RT builds, this fixes the following warning, seen by
Alexander GQ Gerasiov, due to the sleeping nature of spinlocks.

   BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993
   in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1
   CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G        W       4.9.20-rt16-stand6-686 #1
   Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015
   Workqueue: writeback wb_workfn (flush-253:0)
   Call Trace:
    dump_stack+0x47/0x68
    ? migrate_enable+0x4a/0xf0
    ___might_sleep+0x101/0x180
    rt_spin_lock+0x17/0x40
    add_stripe_bio+0x4e3/0x6c0 [raid456]
    ? preempt_count_add+0x42/0xb0
    raid5_make_request+0x737/0xdd0 [raid456]

Reported-by: Alexander GQ Gerasiov &lt;gq@redlab-i.ru&gt;
Tested-by: Alexander GQ Gerasiov &lt;gq@redlab-i.ru&gt;
Signed-off-by: Julia Cartwright &lt;julia@ni.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid6: Fix anomaly when recovering a single device in RAID6.</title>
<updated>2018-03-22T08:23:25+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-04-03T02:11:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2b84883f74b2be66de7f424e1f7d9bdf3b1530b2'/>
<id>urn:sha1:2b84883f74b2be66de7f424e1f7d9bdf3b1530b2</id>
<content type='text'>
[ Upstream commit 7471fb77ce4dc4cb81291189947fcdf621a97987 ]

When recovering a single missing/failed device in a RAID6,
those stripes where the Q block is on the missing device are
handled a bit differently.  In these cases it is easy to
check that the P block is correct, so we do.  This results
in the P block being destroyed.  Consequently the P block needs
to be read a second time in order to compute Q.  This causes
lots of seeks and hurts performance.

It shouldn't be necessary to re-read P as it can be computed
from the DATA.  But we only compute blocks on missing
devices, since c337869d9501 ("md: do not compute parity
unless it is on a failed drive").
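
As a reminder of why P can be rebuilt from the data, here is a minimal
sketch, assuming P is the byte-wise XOR of the data blocks in a stripe.
It is written in Python for illustration only; compute_p and the sample
blocks are hypothetical and are not taken from raid5.c.

```python
from functools import reduce


def compute_p(data_blocks):
    """XOR equal-length blocks byte by byte to form the P block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))


# Hypothetical 4-byte blocks standing in for the data blocks of one stripe.
data = [bytes([1, 2, 3, 4]), bytes([16, 32, 48, 64]), bytes([255, 0, 255, 0])]
p = compute_p(data)

# XOR-ing P with all but one data block gives back the remaining block.
assert compute_p([p, data[1], data[2]]) == data[0]
print(p.hex())  # prints ee22cc44
```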

So relax the change made in that commit to allow computing
the P block in a RAID6 when it is the only missing
block.

This makes RAID6 recovery run much faster as the disk just
"before" the recovering device is no longer seeking
back-and-forth.

Reported-by-tested-by: Brad Campbell &lt;lists2009@fnarfbargle.com&gt;
Reviewed-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>raid5: Set R5_Expanded on parity devices as well as data.</title>
<updated>2017-12-20T09:05:00+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-10-17T05:18:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=24bc48af0aee2e144655b3b1d36a96def03480d0'/>
<id>urn:sha1:24bc48af0aee2e144655b3b1d36a96def03480d0</id>
<content type='text'>
[ Upstream commit 235b6003fb28f0dd8e7ed8fbdb088bb548291766 ]

When reshaping a fully degraded raid5/raid6 to a larger
number of devices, the new device(s) are not in-sync,
which can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, but not for parity devices.

Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.

So when constructing parity on an EXPAND_READY stripe,
set R5_Expanded.

Reported-by: Curt &lt;lightspd@gmail.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list</title>
<updated>2017-10-05T07:41:44+00:00</updated>
<author>
<name>Dennis Yang</name>
<email>dennisyang@qnap.com</email>
</author>
<published>2017-09-06T03:02:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=29854a77f7936feb85194c65d5d873a86f9b01de'/>
<id>urn:sha1:29854a77f7936feb85194c65d5d873a86f9b01de</id>
<content type='text'>
commit 184a09eb9a2fe425e49c9538f1604b05ed33cfef upstream.

In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
set, it indicates that this stripe_head is already in the raid5_plug_cb
list and release_stripe() would be called instead to drop a reference
count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
stripe_head and it will get queued into the raid5_plug_cb list.

Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
a stripe could be re-added to the plug list while it is still on that
list, in the following situation. If stripe_head A is added to another
stripe_head B's batch list, A will have batch_head != NULL and be added
to the plug list. After that, stripe_head B gets handled and calls
break_stripe_batch_list(), which resets the state of all the batched
stripe_heads (including A, which is still on the plug list) and sets
their batch_head to NULL. If another write request comes in and gets
stripe_head A before the plug list is processed, A will have
batch_head == NULL (cleared by calling break_stripe_batch_list() on B)
and be added to the plug list once again.
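
The double-queueing described above can be illustrated with a toy model.
This is a sketch in Python, not the kernel code: the function names mirror
this message, but the bodies, the Stripe class and times_queued are
hypothetical simplifications, and the preserve_flag parameter exists only
to contrast the behaviour before and after the fix.

```python
# Toy model: names mirror the message, the logic is a simplification.
ON_UNPLUG_LIST = "STRIPE_ON_UNPLUG_LIST"


class Stripe:
    def __init__(self):
        self.state = set()


def release_stripe_plug(plug_list, sh):
    """Queue sh on the plug list unless its flag says it is already there."""
    if ON_UNPLUG_LIST in sh.state:
        return False  # already queued; the real code would drop a reference
    sh.state.add(ON_UNPLUG_LIST)
    plug_list.append(sh)
    return True


def break_stripe_batch_list(sh, preserve_flag):
    """Reset the stripe state, optionally keeping the plug-list flag."""
    keep = sh.state.intersection({ON_UNPLUG_LIST}) if preserve_flag else set()
    sh.state = keep


def times_queued(preserve_flag):
    plug_list, a = [], Stripe()
    release_stripe_plug(plug_list, a)          # A is queued once
    break_stripe_batch_list(a, preserve_flag)  # handling B resets A's state
    release_stripe_plug(plug_list, a)          # another write reaches A
    return plug_list.count(a)


print(times_queued(preserve_flag=False))  # prints 2
print(times_queued(preserve_flag=True))   # prints 1
```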

Signed-off-by: Dennis Yang &lt;dennisyang@qnap.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md/raid5: fix a race condition in stripe batch</title>
<updated>2017-10-05T07:41:44+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-08-25T17:40:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d03d1567866e8015db3b7cc706c3659deba500de'/>
<id>urn:sha1:d03d1567866e8015db3b7cc706c3659deba500de</id>
<content type='text'>
commit 3664847d95e60a9a943858b7800f8484669740fc upstream.

We have a race condition in the scenario below. Say we have 3 contiguous stripes,
sh1, sh2 and sh3, where sh1 is the batch head of sh2 and sh3:

CPU1				CPU2				CPU3
handle_stripe(sh3)
				stripe_add_to_batch_list(sh3)
				-&gt; lock(sh2, sh3)
				-&gt; lock batch_lock(sh1)
				-&gt; add sh3 to batch_list of sh1
				-&gt; unlock batch_lock(sh1)
								clear_batch_ready(sh1)
								-&gt; lock(sh1) and batch_lock(sh1)
								-&gt; clear STRIPE_BATCH_READY for all stripes in batch_list
								-&gt; unlock(sh1) and batch_lock(sh1)
-&gt;clear_batch_ready(sh3)
--&gt;test_and_clear_bit(STRIPE_BATCH_READY, sh3)
---&gt;return 0 as sh-&gt;batch_head == NULL
				-&gt; sh3-&gt;batch_head = sh1
				-&gt; unlock (sh2, sh3)

On CPU1, handle_stripe() will continue to handle sh3 even though it is in the
batch list of sh1. By moving the sh3-&gt;batch_head assignment under batch_lock,
we make it impossible to clear STRIPE_BATCH_READY before batch_head is set.

Thanks Stephane for helping debug this tricky issue.

Reported-and-tested-by: Stephane Thiell &lt;sthiell@stanford.edu&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md/raid5: release/flush io in raid5_do_work()</title>
<updated>2017-09-27T09:00:12+00:00</updated>
<author>
<name>Song Liu</name>
<email>songliubraving@fb.com</email>
</author>
<published>2017-08-24T16:53:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d5c59ee8482042a0c63fa033c043989d00582525'/>
<id>urn:sha1:d5c59ee8482042a0c63fa033c043989d00582525</id>
<content type='text'>
commit 9c72a18e46ebe0f09484cce8ebf847abdab58498 upstream.

In raid5, there are scenarios where some IOs are deferred to a later
time, and some IOs need a flush to complete. To make sure we make
progress with these IOs, we need to call the following functions:

    flush_deferred_bios(conf);
    r5l_flush_stripe_to_raid(conf-&gt;log);

Both of these functions are called in raid5d(), but are missing from
raid5_do_work(). As a result, they are not called when
multi-threading (group_thread_cnt &gt; 0) is enabled. This patch
adds calls to these functions to raid5_do_work().

Note for stable branches:

  r5l_flush_stripe_to_raid(conf-&gt;log) is needed for 4.4+
  flush_deferred_bios(conf) is only needed for 4.11+

Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md/raid5: add thread_group worker async_tx_issue_pending_all</title>
<updated>2017-08-07T02:19:39+00:00</updated>
<author>
<name>Ofer Heifetz</name>
<email>oferh@marvell.com</email>
</author>
<published>2017-07-24T06:17:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=376830c340eca3f6c2f872a50b90afc0ebf88c72'/>
<id>urn:sha1:376830c340eca3f6c2f872a50b90afc0ebf88c72</id>
<content type='text'>
commit 7e96d559634b73a8158ee99a7abece2eacec2668 upstream.

Since the thread_group worker and the raid5d kthread are not in sync, if
a worker writes a stripe before raid5d does, requests will be left
waiting for issue_pending.

The issue was observed when building raid5 with ext4: in some build runs
jbd2 would hang while requests sat in the HW engine waiting to be
issued.

Fix this by adding a call to async_tx_issue_pending_all() in
raid5_do_work().

Signed-off-by: Ofer Heifetz &lt;oferh@marvell.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Raid5 should update rdev-&gt;sectors after reshape</title>
<updated>2017-07-27T22:06:06+00:00</updated>
<author>
<name>Xiao Ni</name>
<email>xni@redhat.com</email>
</author>
<published>2017-07-05T09:34:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=098333918b4f70a21bc2bc996ae20d2d523bd0b5'/>
<id>urn:sha1:098333918b4f70a21bc2bc996ae20d2d523bd0b5</id>
<content type='text'>
commit b5d27718f38843a74552e9a93d32e2391fd3999f upstream.

The raid5 md device is created from disks whose full size is not used. For example,
each device is 5G and only 3G of each device is used to create the raid5 array.
Then change the chunk size and wait for the reshape to finish. After the reshape finishes,
stop the array and assemble it again. Assembly fails.
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
mdadm /dev/md0 --grow --chunk=64
wait for the reshape to finish
mdadm -S /dev/md0
mdadm -As
The error messages:
[197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
[197519.821686] md: md_import_device returned -22

The reshape changes the data offset; in this scenario the backwards direction is selected.
super_1_load() compares the available space of the underlying device with sb-&gt;data_size.
Because the new data offset is larger after the reshape, super_1_load() returns -EINVAL.
rdev-&gt;sectors is updated in md_finish_reshape(), and super_1_sync() then sets sb-&gt;data_size
based on rdev-&gt;sectors. So call md_finish_reshape() from end_reshape().

Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Acked-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
