<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/raid1.c, branch v4.11.5</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2017-05-20T12:49:48+00:00</updated>
<entry>
<title>md/raid1: avoid reusing a resync bio after error handling.</title>
<updated>2017-05-20T12:49:48+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2017-04-06T02:06:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9bcb9cc96d7396430d9f978cce5085f8c9928970'/>
<id>urn:sha1:9bcb9cc96d7396430d9f978cce5085f8c9928970</id>
<content type='text'>
commit 0c9d5b127f695818c2c5a3868c1f28ca2969e905 upstream.

fix_sync_read_error() modifies a bio on a newly faulty
device by setting bi_end_io to end_sync_write.
This ensure that put_buf() will still call rdev_dec_pending()
as required, but makes sure that subsequent code in
fix_sync_read_error() doesn't try to read from the device.

Unfortunately this interacts badly with sync_request_write()
which assumes that any bio with bi_end_io set to non-NULL
other than end_sync_read is safe to write to.

As the device is now faulty it doesn't make sense to write.
As the bio was recently used for a read, it is "dirty"
and not suitable for immediate submission.
In particular, -&gt;bi_next might be non-NULL, which will cause
generic_make_request() to complain.

Break this interaction by refusing to write to devices
which are marked as Faulty.

Reported-and-tested-by: Michael Wang &lt;yun.wang@profitbricks.com&gt;
Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>md/raid1: fix a trivial typo in comments</title>
<updated>2017-03-14T18:10:44+00:00</updated>
<author>
<name>Zhilong Liu</name>
<email>zlliu@suse.com</email>
</author>
<published>2017-03-14T07:52:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=11353b9d10392e79e32603d2178e75feb25eaf0d'/>
<id>urn:sha1:11353b9d10392e79e32603d2178e75feb25eaf0d</id>
<content type='text'>
raid1.c: fix a trivial typo in comments of freeze_array().

Cc: Jack Wang &lt;jack.wang.usish@gmail.com&gt;
Cc: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Cc: John Stoffel &lt;john@stoffel.org&gt;
Acked-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Zhilong Liu &lt;zlliu@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid1/10: fix potential deadlock</title>
<updated>2017-03-09T17:02:42+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-02-28T21:00:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=61eb2b43b99ebdc9bc6bc83d9792257b243e7cb3'/>
<id>urn:sha1:61eb2b43b99ebdc9bc6bc83d9792257b243e7cb3</id>
<content type='text'>
Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:

1. generic_make_request(bio), this will set current-&gt;bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrer, add underlayer disk bio to
current-&gt;bio_list
4. __make_request(bio2), wait_barrer

If raise_barrier happens between 3 &amp; 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() doesn't finished yet.

The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
   } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Note, this only happens in read path. In write path, the bio is flushed to
underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
It's queued in current-&gt;bio_list.

Cc: Coly Li &lt;colyli@suse.de&gt;
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown &lt;neilb@suse.com&gt;
Reviewed-by: Jack Wang &lt;jinpu.wang@profitbricks.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md: move funcs from pers-&gt;resize to update_size</title>
<updated>2017-03-09T17:02:18+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>gqjiang@suse.com</email>
</author>
<published>2017-02-24T03:15:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c94836342192b05d599d6aa3397f732f7a238689'/>
<id>urn:sha1:c94836342192b05d599d6aa3397f732f7a238689</id>
<content type='text'>
raid1_resize and raid5_resize should also check the
mddev-&gt;queue if run underneath dm-raid.

And both set_capacity and revalidate_disk are used in
pers-&gt;resize such as raid1, raid10 and raid5. So
move them from personality file to common code.

Reviewed-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>sched/headers: Prepare for new header dependencies before moving code to &lt;linux/sched/signal.h&gt;</title>
<updated>2017-03-02T07:42:29+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2017-02-08T17:51:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3f07c0144132e4f59d88055ac8ff3e691a5fa2b8'/>
<id>urn:sha1:3f07c0144132e4f59d88055ac8ff3e691a5fa2b8</id>
<content type='text'>
We are going to split &lt;linux/sched/signal.h&gt; out of &lt;linux/sched.h&gt;, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder &lt;linux/sched/signal.h&gt; file that just
maps to &lt;linux/sched.h&gt; to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md</title>
<updated>2017-02-24T22:42:19+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-02-24T22:42:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a682e0035494c449e53a57d039f86f75b9e2fe67'/>
<id>urn:sha1:a682e0035494c449e53a57d039f86f75b9e2fe67</id>
<content type='text'>
Pull md updates from Shaohua Li:
 "Mainly fixes bugs and improves performance:

   - Improve scalability for raid1 from Coly

   - Improve raid5-cache read performance, disk efficiency and IO
     pattern from Song and me

   - Fix a race condition of disk hotplug for linear from Coly

   - A few cleanup patches from Ming and Byungchul

   - Fix a memory leak from Neil

   - Fix WRITE SAME IO failure from me

   - Add doc for raid5-cache from me"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
  md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
  md/raid1: handle flush request correctly
  md/linear: shutup lockdep warnning
  md/raid1: fix a use-after-free bug
  RAID1: avoid unnecessary spin locks in I/O barrier code
  RAID1: a new I/O barrier implementation to remove resync window
  md/raid5: Don't reinvent the wheel but use existing llist API
  md: fast clone bio in bio_clone_mddev()
  md: remove unnecessary check on mddev
  md/raid1: use bio_clone_bioset_partial() in case of write behind
  md: fail if mddev-&gt;bio_set can't be created
  block: introduce bio_clone_bioset_partial()
  md: disable WRITE SAME if it fails in underlayer disks
  md/raid5-cache: exclude reclaiming stripes in reclaim check
  md/raid5-cache: stripe reclaim only counts valid stripes
  MD: add doc for raid5-cache
  Documentation: move MD related doc into a separate dir
  md: ensure md devices are freed before module is unloaded.
  md/r5cache: improve journal device efficiency
  md/r5cache: enable chunk_aligned_read with write back cache
  ...
</content>
</entry>
<entry>
<title>md/raid1: fix write behind issues introduced by bio_clone_bioset_partial</title>
<updated>2017-02-23T19:59:44+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-02-21T22:27:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1ec492232ed659acde8cc00b9ecc7529778e03e1'/>
<id>urn:sha1:1ec492232ed659acde8cc00b9ecc7529778e03e1</id>
<content type='text'>
There are two issues, introduced by commit 8e58e32(md/raid1: use
bio_clone_bioset_partial() in case of write behind):
- bio_clone_bioset_partial() uses bytes instead of sectors as parameters
- in writebehind mode, we return bio if all !writemostly disk bios finish,
  which could happen before writemostly disk bios run. So all
  writemostly disk bios should have their bvec. Here we just make sure
  all bios are cloned instead of fast cloned.

Reviewed-by: Ming Lei &lt;tom.leiming@gmail.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid1: handle flush request correctly</title>
<updated>2017-02-23T19:59:43+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-02-21T18:56:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=aff8da09f2381f0869faaf6637b0d892a3ee99ed'/>
<id>urn:sha1:aff8da09f2381f0869faaf6637b0d892a3ee99ed</id>
<content type='text'>
I got a warning triggered in align_to_barrier_unit_end. It's a flush
request so sectors == 0. The flush request happens to work well without
the new barrier patch, but we'd better handle it explictly.

Cc: NeilBrown &lt;neilb@suse.com&gt;
Acked-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid1: fix a use-after-free bug</title>
<updated>2017-02-20T06:41:27+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-02-20T06:41:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=af5f42a7e426a87bfa69adc9b9d8930385a1ddf6'/>
<id>urn:sha1:af5f42a7e426a87bfa69adc9b9d8930385a1ddf6</id>
<content type='text'>
Commit fd76863 (RAID1: a new I/O barrier implementation to remove resync
window) introduces a user-after-free bug.

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>RAID1: avoid unnecessary spin locks in I/O barrier code</title>
<updated>2017-02-20T06:04:25+00:00</updated>
<author>
<name>colyli@suse.de</name>
<email>colyli@suse.de</email>
</author>
<published>2017-02-17T19:05:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=824e47daddbfc6ebe1006b8659f080620472a136'/>
<id>urn:sha1:824e47daddbfc6ebe1006b8659f080620472a136</id>
<content type='text'>
When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf-&gt;resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf-&gt;resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which calls _wait_barrier(), is used for regular write I/O. If there is
   resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, task will wait until no barrier on same barrier bucket,
   or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), it is unnecessary
   to wait for barrier in regular read I/Os, waiting in only necessary
   when the whole array is frozen.

The operations on conf-&gt;nr_pending[idx], conf-&gt;nr_waiting[idx], conf-&gt;
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf-&gt;
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf-&gt;resync_lock, it can just increase
conf-&gt;nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf-&gt;nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf-&gt;nr_queued[] from atomic_t to int.
- Change conf-&gt;array_frozen from atomic_t back to int, and use
  READ_ONCE(conf-&gt;array_frozen) to check value of conf-&gt;array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&amp;conf-&gt;wait_barrier) after atomic_dec(&amp;conf-&gt;nr_pending[idx]),
  to fix a deadlock between  _wait_barrier()/wait_read_barrier and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
  atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Hannes Reinecke &lt;hare@suse.com&gt;
Cc: Johannes Thumshirn &lt;jthumshirn@suse.de&gt;
Cc: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Reviewed-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
</feed>
