<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/bcache, branch v4.14.85</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.14.85</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.14.85'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2018-11-13T19:14:45+00:00</updated>
<entry>
<title>bcache: fix miss key refill-&gt;end in writeback</title>
<updated>2018-11-13T19:14:45+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui.linux@gmail.com</email>
</author>
<published>2018-10-08T12:41:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dcd048d6bb08d3da1e9cb1923bce80dbc0f90fed'/>
<id>urn:sha1:dcd048d6bb08d3da1e9cb1923bce80dbc0f90fed</id>
<content type='text'>
commit 2d6cb6edd2c7fb4f40998895bda45006281b1ac5 upstream.

refill-&gt;end record the last key of writeback, for example, at the first
time, keys (1,128K) to (1,1024K) are flush to the backend device, but
the end key (1,1024K) is not included, since the bellow code:
	if (bkey_cmp(k, refill-&gt;end) &gt;= 0) {
		ret = MAP_DONE;
		goto out;
	}
And in the next time when we refill writeback keybuf again, we searched
key start from (1,1024K), and got a key bigger than it, so the key
(1,1024K) missed.
This patch modify the above code, and let the end key to be included to
the writeback key buffer.

Signed-off-by: Tang Junhui &lt;tang.junhui.linux@gmail.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bcache: trace missed reading by cache_missed</title>
<updated>2018-11-13T19:14:45+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui.linux@gmail.com</email>
</author>
<published>2018-10-08T12:41:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=82239bda9dac002ffb68c6019134c5245aeb4cae'/>
<id>urn:sha1:82239bda9dac002ffb68c6019134c5245aeb4cae</id>
<content type='text'>
commit 502b291568fc7faf1ebdb2c2590f12851db0ff76 upstream.

Missed reading IOs are identified by s-&gt;cache_missed, not the
s-&gt;cache_miss, so in trace_bcache_read() using trace_bcache_read
to identify whether the IO is missed or not.

Signed-off-by: Tang Junhui &lt;tang.junhui.linux@gmail.com&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bcache: release dc-&gt;writeback_lock properly in bch_writeback_thread()</title>
<updated>2018-09-09T17:56:01+00:00</updated>
<author>
<name>Shan Hai</name>
<email>shan.hai@oracle.com</email>
</author>
<published>2018-08-22T18:02:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d1a265da7b2983e3201988f1a8202e749e2d4352'/>
<id>urn:sha1:d1a265da7b2983e3201988f1a8202e749e2d4352</id>
<content type='text'>
commit 3943b040f11ed0cc6d4585fd286a623ca8634547 upstream.

The writeback thread would exit with a lock held when the cache device
is detached via sysfs interface, fix it by releasing the held lock
before exiting the while-loop.

Fixes: fadd94e05c02 (bcache: quit dc-&gt;writeback_thread when BCACHE_DEV_DETACHING is set)
Signed-off-by: Shan Hai &lt;shan.hai@oracle.com&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Tested-by: Shenghui Wang &lt;shhuiw@foxmail.com&gt;
Cc: stable@vger.kernel.org #4.17+
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>bcache: quit dc-&gt;writeback_thread when BCACHE_DEV_DETACHING is set</title>
<updated>2018-05-30T05:52:30+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2018-03-19T00:36:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4a092479bb4f302d5e51a21d0d18e74aa6ea5837'/>
<id>urn:sha1:4a092479bb4f302d5e51a21d0d18e74aa6ea5837</id>
<content type='text'>
[ Upstream commit fadd94e05c02afec7b70b0b14915624f1782f578 ]

In patch "bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc-&gt;writeback_thread, and
cached_dev_put() is called when exiting dc-&gt;writeback_thread. This
modification works well unless people detach the bcache device manually by
    'echo 1 &gt; /sys/block/bcache&lt;N&gt;/bcache/detach'
Because this sysfs interface only calls bch_cached_dev_detach() which wakes
up dc-&gt;writeback_thread but does not stop it. The reason is, before patch
"bcache: fix cached_dev-&gt;count usage for bch_cache_set_error()", inside
bch_writeback_thread(), if cache is not dirty after writeback,
cached_dev_put() will be called here. And in cached_dev_make_request() when
a new write request makes cache from clean to dirty, cached_dev_get() will
be called there. Since we don't operate dc-&gt;count in these locations,
refcount d-&gt;count cannot be dropped after cache becomes clean, and
cached_dev_detach_finish() won't be called to detach bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and cache is clean
(no existing writeback_keys), break the while-loop, call cached_dev_put()
and quit the writeback thread.

Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
writeback thread should continue to perform writeback, this is the original
design of manually detach.

It is safe to do the following check without locking, let me explain why,
+	if (!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags) &amp;&amp;
+	    (!atomic_read(&amp;dc-&gt;has_dirty) || !dc-&gt;writeback_running)) {

If the kenrel thread does not sleep and continue to run due to conditions
are not updated in time on the running CPU core, it just consumes more CPU
cycles and has no hurt. This should-sleep-but-run is safe here. We just
focus on the should-run-but-sleep condition, which means the writeback
thread goes to sleep in mistake while it should continue to run.
1, First of all, no matter the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake up it and
   terminate by making kthread_should_stop() return true. And in normal
   run time, bit on index BCACHE_DEV_DETACHING is always cleared, the
   condition
	!test_bit(BCACHE_DEV_DETACHING, &amp;dc-&gt;disk.flags)
   is always true and can be ignored as constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep,
   "!atomic_read(&amp;dc-&gt;has_dirty)" or "!dc-&gt;writeback_running)"
   each of them independently controls the writeback thread should sleep or
   not, let's analyse them one by one.
2.1 condition "!atomic_read(&amp;dc-&gt;has_dirty)"
   If dc-&gt;has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc-&gt;writeback_thread) is called. It sets writeback
   thread's task state to TASK_RUNNING and following an implicit memory
   barrier, then tries to wake up the writeback thread.
   In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
   doing the condition check. If other CPU core sets the TASK_RUNNING state
   after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
   will be scheduled to run very soon because its state is not
   TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
   writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier
   of wake_up_process() will make sure modification of dc-&gt;has_dirty on
   other CPU core is updated and observed on the CPU core of writeback
   thread. Therefore the condition check will correctly be false, and
   continue writeback code without sleeping.
2.2 condition "!dc-&gt;writeback_running)"
   dc-&gt;writeback_running can be changed via sysfs file, every time it is
   modified, a following bch_writeback_queue() is alwasy called. So the
   change is always observed on the CPU core of writeback thread. If
   dc-&gt;writeback_running is changed from 0 to 1 on other CPU core, this
   condition check will observe the modification and allow writeback
   thread to continue to run without sleeping.
Now we can see, even without a locking protection, multiple conditions
check is safe here, no deadlock or process hang up will happen.

I compose a separte patch because that patch "bcache: fix cached_dev-&gt;count
usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes
Reinecke. Also this fix is not trivial and good for a separate patch.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Hannes Reinecke &lt;hare@suse.com&gt;
Cc: Huijun Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix kcrashes with fio in RAID5 backend dev</title>
<updated>2018-05-30T05:52:05+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-27T17:49:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f07b6505f474a108326ebb99018c0036c789c153'/>
<id>urn:sha1:f07b6505f474a108326ebb99018c0036c789c153</id>
<content type='text'>
[ Upstream commit 60eb34ec5526e264c2bbaea4f7512d714d791caf ]

Kernel crashed when run fio in a RAID5 backend bcache device, the call
trace is bellow:
[  440.012034] kernel BUG at block/blk-ioc.c:146!
[  440.012696] invalid opcode: 0000 [#1] SMP NOPTI
[  440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[  440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[  440.028615] RIP: 0010:put_io_context+0x8b/0x90
[  440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
[  440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000
0f4240
[  440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882
94fca0
[  440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8
0c1700
[  440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000
01000
[  440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11
bd1e0
[  440.035239] FS:  0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:
0000000000000000
[  440.060190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001
606e0
[  440.110498] Call Trace:
[  440.135443]  bio_disassociate_task+0x1b/0x60
[  440.160355]  bio_free+0x1b/0x60
[  440.184666]  bio_put+0x23/0x30
[  440.208272]  search_free+0x23/0x40 [bcache]
[  440.231448]  cached_dev_write_complete+0x31/0x70 [bcache]
[  440.254468]  closure_put+0xb6/0xd0 [bcache]
[  440.277087]  request_endio+0x30/0x40 [bcache]
[  440.298703]  bio_endio+0xa1/0x120
[  440.319644]  handle_stripe+0x418/0x2270 [raid456]
[  440.340614]  ? load_balance+0x17b/0x9c0
[  440.360506]  handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[  440.380675]  ? __release_stripe+0x15/0x20 [raid456]
[  440.400132]  raid5d+0x3ed/0x5d0 [raid456]
[  440.419193]  ? schedule+0x36/0x80
[  440.437932]  ? schedule_timeout+0x1d2/0x2f0
[  440.456136]  md_thread+0x122/0x150
[  440.473687]  ? wait_woken+0x80/0x80
[  440.491411]  kthread+0x102/0x140
[  440.508636]  ? find_pers+0x70/0x70
[  440.524927]  ? kthread_associate_blkcg+0xa0/0xa0
[  440.541791]  ret_from_fork+0x35/0x40
[  440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 &lt;0f&gt; 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
[  440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
[  440.628575] ---[ end trace a1fd79d85643a73e ]--

All the crash issue happened when a bypass IO coming, in such scenario
s-&gt;iop.bio is pointed to the s-&gt;orig_bio. In search_free(), it finishes the
s-&gt;orig_bio by calling bio_complete(), and after that, s-&gt;iop.bio became
invalid, then kernel would crash when calling bio_put(). Maybe its upper
layer's faulty, since bio should not be freed before we calling bio_put(),
but we'd better calling bio_put() first before calling bio_complete() to
notify upper layer ending this bio.

This patch moves bio_complete() under bio_put() to avoid kernel crash.

[mlyle: fixed commit subject for character limits]

Reported-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Tested-by: Matthias Ferdinand &lt;bcache@mfedv.net&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: return attach error when no cache set exist</title>
<updated>2018-04-26T09:02:18+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c4c9fd55899fd780ee010eda172fe574c65bf56e'/>
<id>urn:sha1:c4c9fd55899fd780ee010eda172fe574c65bf56e</id>
<content type='text'>
[ Upstream commit 7f4fc93d4713394ee8f1cd44c238e046e11b4f15 ]

I attach a back-end device to a cache set, and the cache set is not
registered yet, this back-end device did not attach successfully, and no
error returned:
[root]# echo 87859280-fec6-4bcc-20df7ca8f86b &gt; /sys/block/sde/bcache/attach
[root]#

In sysfs_attach(), the return value "v" is initialized to "size" in
the beginning, and if no cache set exist in bch_cache_sets, the "v" value
would not change any more, and return to sysfs, sysfs regard it as success
since the "size" is a positive number.

This patch fixes this issue by assigning "v" with "-ENOENT" in the
initialization.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix for data collapse after re-attaching an attached device</title>
<updated>2018-04-26T09:02:18+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4c8e0270dc7afb10d4e06be4adfd177db5cb3cb8'/>
<id>urn:sha1:4c8e0270dc7afb10d4e06be4adfd177db5cb3cb8</id>
<content type='text'>
[ Upstream commit 73ac105be390c1de42a2f21643c9778a5e002930 ]

back-end device sdm has already attached a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f &gt; attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 &gt; label

Then we reboot the system, when the system power on, the back-end
device can not attach to cache_set, a messages show in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc-&gt;sb.set_uuid was assigned to the value
which input through sysfs, no matter whether it is success
or not in bch_cached_dev_attach(). For example, If the back-end
device has already attached to an cache set, bch_cached_dev_attach()
would fail, but dc-&gt;sb.set_uuid was changed. Then modify the
label of bcache device, it will call bch_write_bdev_super(),
which would write the dc-&gt;sb.set_uuid to the super block, so we
record a wrong cache set ID in the super block, after the system
reboot, the cache set couldn't find the uuid of the back-end
device, so the bcache device couldn't exist and use any more.

In this patch, we don't assigned cache set ID to dc-&gt;sb.set_uuid
in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
and assigned dc-&gt;sb.set_uuid to the cache set ID after the back-end
device attached to the cache set successful.

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix for allocator and register thread race</title>
<updated>2018-04-26T09:02:18+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-02-07T19:41:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=311e31419b7240151de0cf7aa3f2d972ee103a5c'/>
<id>urn:sha1:311e31419b7240151de0cf7aa3f2d972ee103a5c</id>
<content type='text'>
[ Upstream commit 682811b3ce1a5a4e20d700939a9042f01dbc66c4 ]

After long time running of random small IO writing,
I reboot the machine, and after the machine power on,
I found bcache got stuck, the stack is:
[root@ceph153 ~]# cat /proc/2510/task/*/stack
[&lt;ffffffffa06b2455&gt;] closure_sync+0x25/0x90 [bcache]
[&lt;ffffffffa06b6be8&gt;] bch_journal+0x118/0x2b0 [bcache]
[&lt;ffffffffa06b6dc7&gt;] bch_journal_meta+0x47/0x70 [bcache]
[&lt;ffffffffa06be8f7&gt;] bch_prio_write+0x237/0x340 [bcache]
[&lt;ffffffffa06a8018&gt;] bch_allocator_thread+0x3c8/0x3d0 [bcache]
[&lt;ffffffff810a631f&gt;] kthread+0xcf/0xe0
[&lt;ffffffff8164c318&gt;] ret_from_fork+0x58/0x90
[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
[root@ceph153 ~]# cat /proc/2038/task/*/stack
[&lt;ffffffffa06b1abd&gt;] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[&lt;ffffffffa06b1bd1&gt;] bch_btree_insert+0xf1/0x170 [bcache]
[&lt;ffffffffa06b637f&gt;] bch_journal_replay+0x13f/0x230 [bcache]
[&lt;ffffffffa06c75fe&gt;] run_cache_set+0x79a/0x7c2 [bcache]
[&lt;ffffffffa06c0cf8&gt;] register_bcache+0xd48/0x1310 [bcache]
[&lt;ffffffff812f702f&gt;] kobj_attr_store+0xf/0x20
[&lt;ffffffff8125b216&gt;] sysfs_write_file+0xc6/0x140
[&lt;ffffffff811dfbfd&gt;] vfs_write+0xbd/0x1e0
[&lt;ffffffff811e069f&gt;] SyS_write+0x7f/0xe0
[&lt;ffffffff8164c3c9&gt;] system_call_fastpath+0x16/0x1
The stack shows the register thread and allocator thread
were getting stuck when registering cache device.

I reboot the machine several times, the issue always
exsit in this machine.

I debug the code, and found the call trace as bellow:
register_bcache()
   ==&gt;run_cache_set()
      ==&gt;bch_journal_replay()
         ==&gt;bch_btree_insert()
            ==&gt;__bch_btree_map_nodes()
               ==&gt;btree_insert_fn()
                  ==&gt;btree_split() //node need split
                     ==&gt;btree_check_reserve()
In btree_check_reserve(), It will check if there is enough buckets
of RESERVE_BTREE type, since allocator thread did not work yet, so
no buckets of RESERVE_BTREE type allocated, so the register thread
waits on c-&gt;btree_cache_wait, and goes to sleep.

Then the allocator thread initialized, the call trace is bellow:
bch_allocator_thread()
==&gt;bch_prio_write()
   ==&gt;bch_journal_meta()
      ==&gt;bch_journal()
         ==&gt;journal_wait_for_write()
In journal_wait_for_write(), It will check if journal is full by
journal_full(), but the long time random small IO writing
causes the exhaustion of journal buckets(journal.blocks_free=0),
In order to release the journal buckets,
the allocator calls btree_flush_write() to flush keys to
btree nodes, and waits on c-&gt;journal.wait until btree nodes writing
over or there has already some journal buckets space, then the
allocator thread goes to sleep. but in btree_flush_write(), since
bch_journal_replay() is not finished, so no btree nodes have journal
(condition "if (btree_current_write(b)-&gt;journal)" never satisfied),
so we got no btree node to flush, no journal bucket released,
and allocator sleep all the times.

Through the above analysis, we can see that:
1) Register thread wait for allocator thread to allocate buckets of
   RESERVE_BTREE type;
2) Alloctor thread wait for register thread to replay journal, so it
   can flush btree nodes and get journal bucket.
   then they are all got stuck by waiting for each other.

Hua Rui provided a patch for me, by allocating some buckets of
RESERVE_BTREE type in advance, so the register thread can get bucket
when btree node splitting and no need to waiting for the allocator
thread. I tested it, it has effect, and register thread run a step
forward, but finally are still got stuck, the reason is only 8 bucket
of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
then btree_check_reserve() is not satisfied anymore, so it goes to sleep
again, and in the same time, alloctor thread did not flush enough btree
nodes to release a journal bucket, so they all got stuck again.

So we need to allocate more buckets of RESERVE_BTREE type in advance,
but how much is enough?  By experience and test, I think it should be
as much as journal buckets. Then I modify the code as this patch,
and test in the machine, and it works.

This patch modified base on Hua Rui’s patch, and allocate more buckets
of RESERVE_BTREE type in advance to avoid register thread and allocate
thread going to wait for each other.

[patch v2] ca-&gt;sb.njournal_buckets would be 0 in the first time after
cache creation, and no journal exists, so just 8 btree buckets is OK.

Signed-off-by: Hua Rui &lt;huarui.dev@gmail.com&gt;
Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: properly set task state in bch_writeback_thread()</title>
<updated>2018-04-26T09:02:18+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2018-02-07T19:41:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f89edd17aff46a3970f302add6949700198dfd92'/>
<id>urn:sha1:f89edd17aff46a3970f302add6949700198dfd92</id>
<content type='text'>
[ Upstream commit 99361bbf26337186f02561109c17a4c4b1a7536a ]

Kernel thread routine bch_writeback_thread() has the following code block,

447         down_write(&amp;dc-&gt;writeback_lock);
448~450     if (check conditions) {
451                 up_write(&amp;dc-&gt;writeback_lock);
452                 set_current_state(TASK_INTERRUPTIBLE);
453
454                 if (kthread_should_stop())
455                         return 0;
456
457                 schedule();
458                 continue;
459         }

If condition check is true, its task state is set to TASK_INTERRUPTIBLE
and call schedule() to wait for others to wake up it.

There are 2 issues in current code,
1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if
   another process changes the condition and call wake_up_process(dc-&gt;
   writeback_thread), then at line 452 task state is set back to
   TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be
   waken up.
2, At line 454 if kthread_should_stop() is true, writeback kernel thread
   will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and
   call do_exit(). It is not good to enter do_exit() with task state
   TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a
   warning message is reported by __might_sleep(): "WARNING: do not call
   blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".

For the first issue, task state should be set before condition checks.
Ineed because dc-&gt;writeback_lock is required when modifying all the
conditions, calling set_current_state() inside code block where dc-&gt;
writeback_lock is hold is safe. But this is quite implicit, so I still move
set_current_state() before all the condition checks.

For the second issue, frankley speaking it does not hurt when kernel thread
exits with TASK_INTERRUPTIBLE state, but this warning message scares users,
makes them feel there might be something risky with bcache and hurt their
data.  Setting task state to TASK_RUNNING before returning fixes this
problem.

In alloc.c:allocator_wait(), there is also a similar issue, and is also
fixed in this patch.

Changelog:
v3: merge two similar fixes into one patch
v2: fix the race issue in v1 patch.
v1: initial buggy fix.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Reviewed-by: Hannes Reinecke &lt;hare@suse.de&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Michael Lyle &lt;mlyle@lyle.org&gt;
Cc: Junhui Tang &lt;tang.junhui@zte.com.cn&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: segregate flash only volume write streams</title>
<updated>2018-04-12T10:32:18+00:00</updated>
<author>
<name>Tang Junhui</name>
<email>tang.junhui@zte.com.cn</email>
</author>
<published>2018-01-08T20:21:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fad9bcb1176be5c9068c3ef6f13398de502b35d8'/>
<id>urn:sha1:fad9bcb1176be5c9068c3ef6f13398de502b35d8</id>
<content type='text'>
[ Upstream commit 4eca1cb28d8b0574ca4f1f48e9331c5f852d43b9 ]

In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |

So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.

Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui &lt;tang.junhui@zte.com.cn&gt;
Reviewed-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Michael Lyle &lt;mlyle@lyle.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
