<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/raid1.h, branch v4.11.5</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.11.5'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2017-02-20T06:04:25+00:00</updated>
<entry>
<title>RAID1: avoid unnecessary spin locks in I/O barrier code</title>
<updated>2017-02-20T06:04:25+00:00</updated>
<author>
<name>colyli@suse.de</name>
<email>colyli@suse.de</email>
</author>
<published>2017-02-17T19:05:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=824e47daddbfc6ebe1006b8659f080620472a136'/>
<id>urn:sha1:824e47daddbfc6ebe1006b8659f080620472a136</id>
<content type='text'>
When I ran a parallel read performance test on an md raid1 device built
from two NVMe SSDs, I was surprised to observe very poor throughput: with
fio at 64KB block size, 40 sequential read I/O jobs and 128 iodepth, the
overall throughput was only 2.7GB/s, around 50% of the ideal performance
number.

perf reports that the lock contention happens in the allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is that these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always hold conf-&gt;resync_lock first, even when there are only regular
read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which
holds conf-&gt;resync_lock only when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provided
comments to improve it. I continued to work on it and brought the patch
into its current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   It calls _wait_barrier() and is used for regular write I/O. If there
   is resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, the task will wait until there is no barrier on the
   same bucket, or until the whole array is unfrozen.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   makes sure only uptodate data will be read out), it is unnecessary
   to wait for a barrier in regular read I/Os; waiting is only necessary
   when the whole array is frozen.

The operations on conf-&gt;nr_pending[idx], conf-&gt;nr_waiting[idx] and
conf-&gt;barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once
conf-&gt;nr_pending[idx] is increased, a resync I/O with the same barrier
bucket index has to wait in raise_barrier(). Then in _wait_barrier(),
if no barrier is raised on the same barrier bucket index and the array
is not frozen, the regular I/O doesn't need to hold conf-&gt;resync_lock;
it can just increase conf-&gt;nr_pending[idx] and return to its caller.
wait_read_barrier() is very similar to _wait_barrier(); the only
difference is that it only waits when the array is frozen. For heavy
parallel read I/Os, the lockless I/O barrier code gets rid of almost
all of the spin lock cost.
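
A minimal sketch of the fast path described above (simplified from the
shape this patch gives _wait_barrier(); the raise_barrier() side and
some details of the slow path are omitted):

        static void _wait_barrier(struct r1conf *conf, int idx)
        {
                /* Increase nr_pending[idx] first, so a concurrent
                 * raise_barrier() on this bucket must wait for us. */
                atomic_inc(&amp;conf-&gt;nr_pending[idx]);
                smp_mb__after_atomic();

                /* Fast path: no barrier raised on this bucket and the
                 * array is not frozen, return without resync_lock. */
                if (!READ_ONCE(conf-&gt;array_frozen) &amp;&amp;
                    !atomic_read(&amp;conf-&gt;barrier[idx]))
                        return;

                /* Slow path: back out, wake a possible waiter in
                 * freeze_array(), then sleep under resync_lock. */
                atomic_dec(&amp;conf-&gt;nr_pending[idx]);
                atomic_inc(&amp;conf-&gt;nr_waiting[idx]);
                wake_up(&amp;conf-&gt;wait_barrier);
                spin_lock_irq(&amp;conf-&gt;resync_lock);
                wait_event_lock_irq(conf-&gt;wait_barrier,
                                    !conf-&gt;array_frozen &amp;&amp;
                                    !atomic_read(&amp;conf-&gt;barrier[idx]),
                                    conf-&gt;resync_lock);
                atomic_inc(&amp;conf-&gt;nr_pending[idx]);
                atomic_dec(&amp;conf-&gt;nr_waiting[idx]);
                spin_unlock_irq(&amp;conf-&gt;resync_lock);
        }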

This patch significantly improves raid1 read performance. In my testing,
on a raid1 device built from two NVMe SSDs, running fio with 64KB block
size, 40 sequential read I/O jobs and 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf-&gt;nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf-&gt;nr_queued[] from atomic_t to int.
- Change conf-&gt;array_frozen from atomic_t back to int, and use
  READ_ONCE(conf-&gt;array_frozen) to check value of conf-&gt;array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&amp;conf-&gt;wait_barrier) after atomic_dec(&amp;conf-&gt;nr_pending[idx]),
  to fix a deadlock between _wait_barrier()/wait_read_barrier() and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no race when checking two
  atomic_t variables at the same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Cc: Shaohua Li &lt;shli@fb.com&gt;
Cc: Hannes Reinecke &lt;hare@suse.com&gt;
Cc: Johannes Thumshirn &lt;jthumshirn@suse.de&gt;
Cc: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Reviewed-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>RAID1: a new I/O barrier implementation to remove resync window</title>
<updated>2017-02-20T06:04:24+00:00</updated>
<author>
<name>colyli@suse.de</name>
<email>colyli@suse.de</email>
</author>
<published>2017-02-17T19:05:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fd76863e37fef26fe05547fddfa6e3d05e1682e6'/>
<id>urn:sha1:fd76863e37fef26fe05547fddfa6e3d05e1682e6</id>
<content type='text'>
'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduced a sliding resync window for the raid1 I/O barrier. This idea
limits I/O barriers to happen only inside the sliding resync window, so
regular I/Os outside this resync window don't need to wait for a barrier
any more. On large raid1 devices, it helps a lot to improve parallel
write I/O throughput when there are background resync I/Os running at
the same time.

The idea of the sliding resync window is awesome, but the code complexity
is a challenge. The sliding resync window requires several variables to
work collectively; this is complex and very hard to make work correctly.
Just grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more
patches fixing the original resync window patch. And this is not the end;
any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier by
removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a globally unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflicts; regular
   I/O only has to wait for a resync I/O when both of them have the same
   barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable number if there
   are enough barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, each of BARRIER_UNIT_SECTOR_SIZE sectors.
   Bio requests won't cross the border of a barrier unit, which means the
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE&lt;&lt;9 (64MB) in bytes.
   For random I/O, 64MB is large enough for both read and write requests;
   for sequential I/O, considering that the underlying block layer may
   merge them into larger requests, 64MB is still good enough.
   Neil also points out that for resync operation, "we want the resync to
   move from region to region fairly quickly so that the slowness caused
   by having to synchronize with the resync is averaged out over a fairly
   small time frame". For a full speed resync, 64MB should take less than
   1 second. When resync is competing with other I/O, it could take a few
   minutes. Therefore 64MB is a fairly good size for resync.
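   For reference, the definitions this patch adds to raid1.h should look
   roughly like the following (64MB is 1&lt;&lt;17 sectors of 512 bytes),
        #define BARRIER_UNIT_SECTOR_BITS  17
        #define BARRIER_UNIT_SECTOR_SIZE  (1&lt;&lt;17)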

 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR        (1&lt;&lt;BARRIER_BUCKETS_NR_BITS)
   this patch changes the following members of struct r1conf from plain
   integers into pointers to arrays of integers,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     *nr_pending;
        +       int                     *nr_waiting;
        +       int                     *nr_queued;
        +       int                     *barrier;
   the number of array elements is defined as BARRIER_BUCKETS_NR. For a
   4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O
   barrier buckets, and each array of integers occupies a single memory
   page. With 1024 buckets, a request smaller than the I/O barrier unit
   size has only a ~0.1% chance of having to wait for a resync to pause,
   which is a small enough fraction. Also, requesting a single memory
   page is more friendly to the kernel page allocator than a larger
   memory size.
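   As an illustration (not necessarily the literal diff), setup_conf()
   can then grab each per-bucket array as a single page-sized
   allocation,
        conf-&gt;nr_pending = kcalloc(BARRIER_BUCKETS_NR,
                                   sizeof(int), GFP_KERNEL);
        if (!conf-&gt;nr_pending)
                goto abort;
   and likewise for nr_waiting, nr_queued and barrier; with 1024 buckets
   and 4-byte integers each allocation is exactly one 4KB page.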

 - I/O barrier bucket is indexed by bio start sector
   If multiple I/O requests hit different I/O barrier units, they only
   need to compete for the I/O barrier with other I/Os that hit the same
   barrier bucket index. The index of the barrier bucket which a bio
   should look at is calculated by sector_to_idx(), which is defined in
   raid1.h as an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector &gt;&gt; BARRIER_UNIT_SECTOR_BITS,
                                BARRIER_BUCKETS_NR_BITS);
        }
   Here sector is the start sector number of a bio.

 - A single bio won't cross the boundary of an I/O barrier unit
   If a request crosses the boundary of a barrier unit, it will be split.
   A bio may be split in raid1_make_request() or raid1_sync_request(), if
   the sector count returned by align_to_barrier_unit_end() is smaller
   than the original bio size.
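   A sketch of what this helper computes (the number of sectors left
   before the end of start_sector's barrier unit, capped at the request
   size; details may differ slightly from the patch),
        static sector_t align_to_barrier_unit_end(sector_t start_sector,
                                                  sector_t sectors)
        {
                sector_t len;

                /* sectors from start_sector to the end of its unit */
                len = round_up(start_sector + 1,
                               BARRIER_UNIT_SECTOR_SIZE) - start_sector;
                if (len &gt; sectors)
                        len = sectors;
                return len;
        }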

Comparing to the single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier unit. So the I/O behavior is
   similar to the single sliding resync window.
 - But a barrier unit bucket is shared by all barrier units with an
   identical barrier unit index, so the probability of conflict might be
   higher than with a single sliding resync window, in the case that
   writing I/Os always hit barrier units which have identical barrier
   bucket indexes as the resync I/Os. This is a very rare condition in
   real I/O workloads; I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low conflict rate with a much
   simpler barrier algorithm and implementation.

There are a few changes that should be noted,
 - In raid1d(), I change the code that decreases conf-&gt;nr_queued[idx]
   into a single loop; it looks like this,
        spin_lock_irqsave(&amp;conf-&gt;device_lock, flags);
        conf-&gt;nr_queued[idx]--;
        spin_unlock_irqrestore(&amp;conf-&gt;device_lock, flags);
   This change generates more spin lock operations, but in the next patch
   of this patch set it will be replaced by a single line of code,
        atomic_dec(&amp;conf-&gt;nr_queued[idx]);
   so we don't need to worry about spin lock cost here.
 - Mainline raid1 code splits the original raid1_make_request() into
   raid1_read_request() and raid1_write_request(). If the original bio
   crosses an I/O barrier unit boundary, the bio will be split before
   calling raid1_read_request() or raid1_write_request(); this makes
   the code logic simpler and clearer.
 - In this patch wait_barrier() is moved from raid1_make_request() to
   raid1_write_request(). In raid1_read_request(), the original
   wait_barrier() is replaced by wait_read_barrier().
   The difference is that wait_read_barrier() only waits if the array is
   frozen; using a different barrier function in each code path makes
   the code cleaner and easier to read.

Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes based on review comments from Neil,
  - Go back to using pointers instead of arrays in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of
    BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and
    get_all_queued()
  - Increase bucket number from 512 to 1024
- Change the code comment format, based on review from Shaohua.
V2:
- Use bio_split() to split the original bio if it crosses a barrier unit
  boundary, to make the code simpler, by suggestion from Shaohua and
  Neil.
- Use hash_long() to replace the original linear hash, to avoid a
  possible conflict between resync I/O and sequential write I/O, by
  suggestion from Shaohua.
- Add conf-&gt;total_barriers to record barrier depth, which is used to
  control the number of parallel sync I/O barriers, by suggestion from
  Shaohua.
- In the V1 patch, the barrier-bucket related members of r1conf listed
  below were allocated in a memory page. To make the code simpler, the
  V2 patch moves the memory space into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is by suggestion from Shaohua.
- Remove some irrelevant code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Cc: Johannes Thumshirn &lt;jthumshirn@suse.de&gt;
Cc: Guoqing Jiang &lt;gqjiang@suse.com&gt;
Reviewed-by: Neil Brown &lt;neilb@suse.de&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md/raid1: add failfast handling for reads.</title>
<updated>2016-11-22T17:13:18+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2016-11-18T05:16:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2e52d449bcec31cb66d80aa8c798b15f76f1f5e0'/>
<id>urn:sha1:2e52d449bcec31cb66d80aa8c798b15f76f1f5e0</id>
<content type='text'>
If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try read repair but just allow
the failure.  If it was the last device, we don't fail the
device of course, and the retry happens on the same device -
this time without FAILFAST.  A subsequent failure will not
retry but will just pass up the error.

During resync we may use FAILFAST requests and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case where there are multiple places to read from - i.e. if
there are &gt; 2 devices.  If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

The new R1BIO_FailFast flag is set on a read request to suggest
that a FAILFAST request might be acceptable.  The rdev needs
to have FailFast set as well for the read to actually use
REQ_FAILFAST_*.
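
A sketch of the resulting check when submitting the read bio
(MD_FAILFAST is the md-level mask expanding to REQ_FAILFAST_* flags;
the exact placement in raid1.c may differ):

        if (test_bit(R1BIO_FailFast, &amp;r1_bio-&gt;state) &amp;&amp;
            test_bit(FailFast, &amp;mirror-&gt;rdev-&gt;flags))
                read_bio-&gt;bi_opf |= MD_FAILFAST;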

We need to know there are at least two working devices
before we can set R1BIO_FailFast, so we mustn't stop looking
at the first device we find.  So the "min_pending == 0"
handling is changed to not exit early, but to always choose
the best_pending_disk if min_pending == 0.

The spinlocked region in raid1_error() is enlarged to ensure
that if two bios, reading from two different devices, fail
at the same time, then there is no risk that both devices
will be marked faulty, leaving zero "In_sync" devices.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md: define mddev flags, recovery flags and r1bio state bits using enums</title>
<updated>2016-11-09T20:53:52+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2016-11-08T23:21:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=be306c2989804ca5b90388df66fd3cf28ec74967'/>
<id>urn:sha1:be306c2989804ca5b90388df66fd3cf28ec74967</id>
<content type='text'>
This is less error prone than using individual #defines.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
</content>
</entry>
<entry>
<title>md-cluster: Use a small window for resync</title>
<updated>2015-10-12T06:32:05+00:00</updated>
<author>
<name>Goldwyn Rodrigues</name>
<email>rgoldwyn@suse.com</email>
</author>
<published>2015-08-18T22:14:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c40f341f1e7fd4eddcfc5881d94cfa8669071ee6'/>
<id>urn:sha1:c40f341f1e7fd4eddcfc5881d94cfa8669071ee6</id>
<content type='text'>
Suspending the entire device for resync could take too long. Resync
in small chunks.

The cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window; if not, issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it to their suspension
   list.
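
A hedged sketch of these four steps as they might look in raid1's
sync_request() (the cluster_sync_low/high fields and
CLUSTER_RESYNC_WINDOW_SECTORS follow this patch, but the exact flow
here is a simplification, not the literal diff):

        if (mddev_is_clustered(mddev) &amp;&amp;
            conf-&gt;cluster_sync_high &lt; sector_nr + RESYNC_SECTORS) {
                /* 1. start the window at the last completed sector */
                conf-&gt;cluster_sync_low = mddev-&gt;curr_resync_completed;
                /* 2. if sector_nr still falls outside the new window,
                 *    drain regular I/O and restart the window there */
                if (sector_nr + RESYNC_SECTORS &gt; conf-&gt;cluster_sync_low +
                                CLUSTER_RESYNC_WINDOW_SECTORS) {
                        wait_barrier(conf, NULL);
                        conf-&gt;cluster_sync_low = sector_nr;
                }
                /* 3. the window is 32M worth of sectors */
                conf-&gt;cluster_sync_high = conf-&gt;cluster_sync_low +
                                CLUSTER_RESYNC_WINDOW_SECTORS;
                /* 4. tell the other nodes to suspend this range */
                md_cluster_ops-&gt;resync_info_update(mddev,
                                conf-&gt;cluster_sync_low,
                                conf-&gt;cluster_sync_high);
        }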

bitmap_cond_end_sync() is modified to allow forcing a sync, in order
to get curr_resync_completed up to date with the sector passed in.

Signed-off-by: Goldwyn Rodrigues &lt;rgoldwyn@suse.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md/raid1: ensure device failure recorded before write request returns.</title>
<updated>2015-08-31T17:43:23+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2015-08-14T01:11:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=55ce74d4bfe1b9444436264c637f39a152d1e5ac'/>
<id>urn:sha1:55ce74d4bfe1b9444436264c637f39a152d1e5ac</id>
<content type='text'>
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive won't be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a race to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
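
A sketch of the draining side in raid1d(), assuming the new queue is
the bio_end_io_list this patch adds to r1conf (simplified; the locking
against conf-&gt;device_lock is omitted):

        /* Only complete queued bios once the metadata write is safe. */
        if (!list_empty_careful(&amp;conf-&gt;bio_end_io_list) &amp;&amp;
            !test_bit(MD_CHANGE_PENDING, &amp;mddev-&gt;flags)) {
                while (!list_empty(&amp;conf-&gt;bio_end_io_list)) {
                        r1_bio = list_first_entry(&amp;conf-&gt;bio_end_io_list,
                                                  struct r1bio, retry_list);
                        list_del(&amp;r1_bio-&gt;retry_list);
                        raid_end_bio_io(r1_bio);
                }
        }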

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
</content>
</entry>
<entry>
<title>md: make -&gt;congested robust against personality changes.</title>
<updated>2015-02-03T21:35:52+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-12-15T01:56:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5c675f83c68fbdf9c0e103c1090b06be747fa62c'/>
<id>urn:sha1:5c675f83c68fbdf9c0e103c1090b06be747fa62c</id>
<content type='text'>
There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend() calls synchronize_rcu(), mddev_congested can
avoid races by including the whole call inside an rcu_read_lock()
region.
This requires that the congested functions for all subordinate devices
can be run under rcu_read_lock.  Fortunately this is the case.
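
A sketch of the central helper described above (close to the shape this
patch gives it; the void * signature follows the bdi congested callback
convention):

        static int mddev_congested(void *data, int bits)
        {
                struct mddev *mddev = data;
                struct md_personality *pers = mddev-&gt;pers;
                int ret = 0;

                rcu_read_lock();
                if (mddev-&gt;suspended)
                        ret = 1;        /* safe guess while suspended */
                else if (pers &amp;&amp; pers-&gt;congested)
                        ret = pers-&gt;congested(mddev, bits);
                rcu_read_unlock();
                return ret;
        }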

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md: remove unwanted white space from md.c</title>
<updated>2014-10-14T02:08:29+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-09-30T04:23:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f72ffdd68616e3697bc782b21c82197aeb480fd5'/>
<id>urn:sha1:f72ffdd68616e3697bc782b21c82197aeb480fd5</id>
<content type='text'>
My editor shows much of this is RED.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>raid1: Rewrite the implementation of iobarrier.</title>
<updated>2013-11-19T04:19:18+00:00</updated>
<author>
<name>majianpeng</name>
<email>majianpeng@gmail.com</email>
</author>
<published>2013-11-15T06:55:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=79ef3a8aa1cb1523cc231c9a90a278333c21f761'/>
<id>urn:sha1:79ef3a8aa1cb1523cc231c9a90a278333c21f761</id>
<content type='text'>
There is an iobarrier in raid1 because of contention between normal IO and
resync IO.  It suspends all normal IO when resync/recovery happens.

However if normal IO is outside the resync window, there is no
contention.  So this patch changes the barrier mechanism to only block
IO that could contend with the resync that is currently happening.

We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
        start   next_resync   start_next_window    end_window

start + RESYNC_WINDOW = next_resync
next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
start_next_window + NEXT_NORMALIO_DISTANCE = end_window

Firstly we introduce some concepts:

1 - RESYNC_WINDOW: For resync, there are at most 32 resync requests at
      the same time. A sync request is RESYNC_BLOCK_SIZE (64*1024 bytes).
      So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
      and start_next_window.  It also indicates the distance between
      start_next_window and end_window.
      It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
      this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
      next_resync.
5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
      beyond next_resync.  Normal-io after this position doesn't need to
      wait for resync-io to complete.
6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
      next_resync.  This also doesn't need to wait, but is counted
      differently.
7 - current_window_requests:  the count of normalIO between
      start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.

NormalIO will be partitioned into four types:

NormIO1:  the end sector of the bio is smaller than or equal to start
NormIO2:  the start sector of the bio is larger than or equal to
          end_window
NormIO3:  the start sector of the bio is larger than or equal to
          start_next_window.
NormIO4:  the bio lies between start_next_window and end_window

|--------|-----------|--------------------|----------------|-------------|
    | start   |   next_resync   |  start_next_window   |  end_window |
 NormIO1   NormIO4            NormIO4                NormIO3      NormIO2

For NormIO1, we don't need any io barrier.
For NormIO4, we use a similar approach to the original iobarrier
    mechanism.  The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two windows.
    For these, we don't wait for resync io to complete.
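
Putting the four cases together, a sketch of the resulting
need_to_wait_for_sync() check for the write path (simplified;
bi_sector is the bio start field of that kernel era):

        static bool need_to_wait_for_sync(struct r1conf *conf,
                                          struct bio *bio)
        {
                bool wait = false;

                if (conf-&gt;array_frozen || !bio)
                        wait = true;    /* frozen, or a resync caller */
                else if (conf-&gt;barrier &amp;&amp; bio_data_dir(bio) == WRITE) {
                        /* NormIO1: ends at or before the resync start,
                         * NormIO2/3: starts at or past start_next_window */
                        if ((conf-&gt;mddev-&gt;curr_resync_completed
                             &gt;= bio_end_sector(bio)) ||
                            (conf-&gt;next_resync + NEXT_NORMALIO_DISTANCE
                             &lt;= bio-&gt;bi_sector))
                                wait = false;
                        else    /* NormIO4: inside the resync window */
                                wait = true;
                }
                return wait;
        }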

For the resync action, if there are NormIO4s, we must wait for them.
If not, we can proceed.
But if the resync action reaches start_next_window and
current_window_requests &gt; 0 (that is, there are NormIO3s), we must
wait until current_window_requests becomes zero.
When current_window_requests becomes zero, start_next_window also
moves forward. Then current_window_requests will be replaced by
next_window_requests.

There is the problem of when and how to change from NormIO2 to
NormIO3.  Only then can the sync action progress.

We add a field in struct r1conf "start_next_window".

A: if start_next_window == MaxSector, it means there are no NormIO2/3.
   So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 &amp;&amp; next_window_requests != 0, it
   means start_next_window moves to end_window

There is another problem of how to differentiate between the
old NormIO2 (now NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and one bio which is
NormIO3. The NormIO3 completes first, so the NormIO2 bios become NormIO3.

We add a field in struct r1bio "start_next_window".
This is used to record the position conf-&gt;start_next_window when the call
to wait_barrier() is made in make_request().

In allow_barrier(), we check conf-&gt;start_next_window.
If r1bio-&gt;start_next_window == conf-&gt;start_next_window, it means
there was no transition between NormIO2 and NormIO3.
If r1bio-&gt;start_next_window != conf-&gt;start_next_window, it means
there was a transition between NormIO2 and NormIO3.  There can only
have been one transition.  So it only means the bio is old NormIO2.

For one bio, there may be many r1bio's. So we make sure
all the r1bio-&gt;start_next_window values are the same.
If we meet a blocked_dev in make_request(), we must call allow_barrier()
and then wait_barrier(), so the value of conf-&gt;start_next_window may
change between the two calls.
If one bio's r1bio's then carry different start_next_window values,
allow_barrier() for that bio would depend on the value in the last
r1bio, which would cause an error. To avoid this, we must wait for
previous r1bios to complete.

Signed-off-by: Jianpeng Ma &lt;majianpeng@gmail.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>raid1: Add a field array_frozen to indicate whether raid in freeze state.</title>
<updated>2013-11-19T04:19:18+00:00</updated>
<author>
<name>majianpeng</name>
<email>majianpeng@gmail.com</email>
</author>
<published>2013-11-14T04:16:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b364e3d048e49b1d177eb7ee7853e77aa0560464'/>
<id>urn:sha1:b364e3d048e49b1d177eb7ee7853e77aa0560464</id>
<content type='text'>
Because the following patch will rewrite the coordination between normal
IO and resync IO, we add a field to indicate whether the raid array is
in the frozen state.

Signed-off-by: Jianpeng Ma &lt;majianpeng@gmail.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
</feed>
