<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/bcache, branch v5.4.50</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.4.50</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.4.50'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2020-06-24T15:50:45+00:00</updated>
<entry>
<title>bcache: fix potential deadlock problem in btree_gc_coalesce</title>
<updated>2020-06-24T15:50:45+00:00</updated>
<author>
<name>Zhiqiang Liu</name>
<email>liuzhiqiang26@huawei.com</email>
</author>
<published>2020-06-14T16:53:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f651e94899ed08b1766bda30f410d33fdd3970ff'/>
<id>urn:sha1:f651e94899ed08b1766bda30f410d33fdd3970ff</id>
<content type='text'>
[ Upstream commit be23e837333a914df3f24bf0b32e87b0331ab8d1 ]

coccicheck reports:
  drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417

In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]-&gt;write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]-&gt;
write_lock for freeing new_nodes[i] before return.

btree_gc_coalesce func details as follows:
	if alloc new_nodes[i] fails:
		goto out_nocoalesce;
	// obtain new_nodes[i]-&gt;write_lock
	mutex_lock(&amp;new_nodes[i]-&gt;write_lock)
	// main coalescing process
	for (i = nodes - 1; i &gt; 0; --i)
		[snipped]
		if coalescing process fails:
			// Here, directly goto out_nocoalesce
			 // tag will cause a deadlock
			goto out_nocoalesce;
		[snipped]
	// release new_nodes[i]-&gt;write_lock
	mutex_unlock(&amp;new_nodes[i]-&gt;write_lock)
	// coalesing succ, return
	return;
out_nocoalesce:
	btree_node_free(new_nodes[i])	// free new_nodes[i]
	// obtain new_nodes[i]-&gt;write_lock
	mutex_lock(&amp;new_nodes[i]-&gt;write_lock);
	// set flag for reuse
	clear_bit(BTREE_NODE_dirty, &amp;ew_nodes[i]-&gt;flags);
	// release new_nodes[i]-&gt;write_lock
	mutex_unlock(&amp;new_nodes[i]-&gt;write_lock);

To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]-&gt;write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]-&gt;write_lock before free new_nodes[i] in
out_nocoalesce tag.

(Coly Li helps to clean up commit log format.)

Fixes: 2a285686c109816 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu &lt;liuzhiqiang26@huawei.com&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix refcount underflow in bcache_device_free()</title>
<updated>2020-06-22T07:31:09+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-05-27T04:01:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=62e7e4f5976ce617d57453b01ec4ab24608499cd'/>
<id>urn:sha1:62e7e4f5976ce617d57453b01ec4ab24608499cd</id>
<content type='text'>
[ Upstream commit 86da9f736740eba602389908574dfbb0f517baa5 ]

The problematic code piece in bcache_device_free() is,

 785 static void bcache_device_free(struct bcache_device *d)
 786 {
 787     struct gendisk *disk = d-&gt;disk;
 [snipped]
 799     if (disk) {
 800             if (disk-&gt;flags &amp; GENHD_FL_UP)
 801                     del_gendisk(disk);
 802
 803             if (disk-&gt;queue)
 804                     blk_cleanup_queue(disk-&gt;queue);
 805
 806             ida_simple_remove(&amp;bcache_device_idx,
 807                               first_minor_to_idx(disk-&gt;first_minor));
 808             put_disk(disk);
 809         }
 [snipped]
 816 }

At line 808, put_disk(disk) may encounter kobject refcount of 'disk'
being underflow.

Here is how to reproduce the issue,
- Attche the backing device to a cache device and do random write to
  make the cache being dirty.
- Stop the bcache device while the cache device has dirty data of the
  backing device.
- Only register the backing device back, NOT register cache device.
- The bcache device node /dev/bcache0 won't show up, because backing
  device waits for the cache device shows up for the missing dirty
  data.
- Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending
  backing device.
- After the pending backing device stopped, use 'dmesg' to check kernel
  message, a use-after-free warning from KASA reported the refcount of
  kobject linked to the 'disk' is underflow.

The dropping refcount at line 808 in the above code piece is added by
add_disk(d-&gt;disk) in bch_cached_dev_run(). But in the above condition
the cache device is not registered, bch_cached_dev_run() has no chance
to be called and the refcount is not added. The put_disk() for a non-
added refcount of gendisk kobject triggers a underflow warning.

This patch checks whether GENHD_FL_UP is set in disk-&gt;flags, if it is
not set then the bcache device was not added, don't call put_disk()
and the the underflow issue can be avoided.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: properly initialize 'path' and 'err' in register_bcache()</title>
<updated>2020-02-24T07:37:03+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-01-23T17:01:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cea9007ebb956945f1fb4a4922cdca837e09ef06'/>
<id>urn:sha1:cea9007ebb956945f1fb4a4922cdca837e09ef06</id>
<content type='text'>
[ Upstream commit 29cda393bcaad160c4bf3676ddd99855adafc72f ]

Patch "bcache: rework error unwinding in register_bcache" from
Christoph Hellwig changes the local variables 'path' and 'err'
in undefined initial state. If the code in register_bcache() jumps
to label 'out:' or 'out_module_put:' by goto, these two variables
might be reference with undefined value by the following line,

	out_module_put:
	        module_put(THIS_MODULE);
	out:
	        pr_info("error %s: %s", path, err);
	        return ret;

Therefore this patch initializes these two local variables properly
in register_bcache() to avoid such issue.

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix incorrect data type usage in btree_flush_write()</title>
<updated>2020-02-24T07:37:01+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-02-01T14:42:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=793137b0511ca6b92e2350da72a73cb6caba58fa'/>
<id>urn:sha1:793137b0511ca6b92e2350da72a73cb6caba58fa</id>
<content type='text'>
[ Upstream commit d1c3cc34f5a78b38d2b809b289d912c3560545df ]

Dan Carpenter points out that from commit 2aa8c529387c ("bcache: avoid
unnecessary btree nodes flushing in btree_flush_write()"), there is a
incorrect data type usage which leads to the following static checker
warning:
	drivers/md/bcache/journal.c:444 btree_flush_write()
	warn: 'ref_nr' unsigned &lt;= 0

drivers/md/bcache/journal.c
   422  static void btree_flush_write(struct cache_set *c)
   423  {
   424          struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR];
   425          unsigned int i, nr, ref_nr;
                                    ^^^^^^

   426          atomic_t *fifo_front_p, *now_fifo_front_p;
   427          size_t mask;
   428
   429          if (c-&gt;journal.btree_flushing)
   430                  return;
   431
   432          spin_lock(&amp;c-&gt;journal.flush_write_lock);
   433          if (c-&gt;journal.btree_flushing) {
   434                  spin_unlock(&amp;c-&gt;journal.flush_write_lock);
   435                  return;
   436          }
   437          c-&gt;journal.btree_flushing = true;
   438          spin_unlock(&amp;c-&gt;journal.flush_write_lock);
   439
   440          /* get the oldest journal entry and check its refcount */
   441          spin_lock(&amp;c-&gt;journal.lock);
   442          fifo_front_p = &amp;fifo_front(&amp;c-&gt;journal.pin);
   443          ref_nr = atomic_read(fifo_front_p);
   444          if (ref_nr &lt;= 0) {
                    ^^^^^^^^^^^
Unsigned can't be less than zero.

   445                  /*
   446                   * do nothing if no btree node references
   447                   * the oldest journal entry
   448                   */
   449                  spin_unlock(&amp;c-&gt;journal.lock);
   450                  goto out;
   451          }
   452          spin_unlock(&amp;c-&gt;journal.lock);

As the warning information indicates, local varaible ref_nr in unsigned
int type is wrong, which does not matche atomic_read() and the "&lt;= 0"
checking.

This patch fixes the above error by defining local variable ref_nr as
int type.

Fixes: 2aa8c529387c ("bcache: avoid unnecessary btree nodes flushing in btree_flush_write()")
Reported-by: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: explicity type cast in bset_bkey_last()</title>
<updated>2020-02-24T07:37:01+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-02-01T14:42:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=57a180a630d874d630feca9a3347405c9d98321e'/>
<id>urn:sha1:57a180a630d874d630feca9a3347405c9d98321e</id>
<content type='text'>
[ Upstream commit 7c02b0055f774ed9afb6e1c7724f33bf148ffdc0 ]

In bset.h, macro bset_bkey_last() is defined as,
    bkey_idx((struct bkey *) (i)-&gt;d, (i)-&gt;keys)

Parameter i can be variable type of data structure, the macro always
works once the type of struct i has member 'd' and 'keys'.

bset_bkey_last() is also used in macro csum_set() to calculate the
checksum of a on-disk data structure. When csum_set() is used to
calculate checksum of on-disk bcache super block, the parameter 'i'
data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in
struct cache_sb) the member keys is __u16 type. But bkey_idx() expects
unsigned int (a 32bit width), so there is problem when sending
parameters via stack to call bkey_idx().

Sparse tool from Intel 0day kbuild system reports this incompatible
problem. bkey_idx() is part of user space API, so the simplest fix is
to cast the (i)-&gt;keys to unsigned int type in macro bset_bkey_last().

Reported-by: kbuild test robot &lt;lkp@intel.com&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix memory corruption in bch_cache_accounting_clear()</title>
<updated>2020-02-24T07:37:01+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-02-01T14:42:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=374eec821858d3f0df8ce5ae70fa018d19355669'/>
<id>urn:sha1:374eec821858d3f0df8ce5ae70fa018d19355669</id>
<content type='text'>
[ Upstream commit 5bebf7486d4f4940b2a8e4009beb1dff5041853d ]

Commit 83ff9318c44ba ("bcache: not use hard coded memset size in
bch_cache_accounting_clear()") tries to make the code more easy to
understand by removing the hard coded number with following change,
	void bch_cache_accounting_clear(...)
	{
		memset(&amp;acc-&gt;total.cache_hits,
			0,
	-		sizeof(unsigned long) * 7);
	+		sizeof(struct cache_stats));
	}

Unfortunately the change was wrong (it also tells us the original code
was not easy to correctly understand). The hard coded number 7 is used
because in struct cache_stats,
 15 struct cache_stats {
 16         struct kobject          kobj;
 17
 18         unsigned long cache_hits;
 19         unsigned long cache_misses;
 20         unsigned long cache_bypass_hits;
 21         unsigned long cache_bypass_misses;
 22         unsigned long cache_readaheads;
 23         unsigned long cache_miss_collisions;
 24         unsigned long sectors_bypassed;
 25
 26         unsigned int            rescale;
 27 };
only members in LINE 18-24 want to be set to 0. It is wrong to use
'sizeof(struct cache_stats)' to replace 'sizeof(unsigned long) * 7), the
memory objects behind acc-&gt;total is staled by this change.

Сорокин Артем Сергеевич reports that by the following steps, kernel
panic will be triggered,
1. Create new set: make-bcache -B /dev/nvme1n1 -C /dev/sda --wipe-bcache
2. Run in /sys/fs/bcache/&lt;uuid&gt;:
   echo 1 &gt; clear_stats &amp;&amp; cat stats_five_minute/cache_bypass_hits

I can reproduce the panic and get following dmesg with KASAN enabled,
[22613.172742] ==================================================================
[22613.172862] BUG: KASAN: null-ptr-deref in sysfs_kf_seq_show+0x117/0x230
[22613.172864] Read of size 8 at addr 0000000000000000 by task cat/6753

[22613.172870] CPU: 1 PID: 6753 Comm: cat Not tainted 5.5.0-rc7-lp151.28.16-default+ #11
[22613.172872] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[22613.172873] Call Trace:
[22613.172964]  dump_stack+0x8b/0xbb
[22613.172968]  ? sysfs_kf_seq_show+0x117/0x230
[22613.172970]  ? sysfs_kf_seq_show+0x117/0x230
[22613.173031]  __kasan_report+0x176/0x192
[22613.173064]  ? pr_cont_kernfs_name+0x40/0x60
[22613.173067]  ? sysfs_kf_seq_show+0x117/0x230
[22613.173070]  kasan_report+0xe/0x20
[22613.173072]  sysfs_kf_seq_show+0x117/0x230
[22613.173105]  seq_read+0x199/0x6d0
[22613.173110]  vfs_read+0xa5/0x1a0
[22613.173113]  ksys_read+0x110/0x160
[22613.173115]  ? kernel_write+0xb0/0xb0
[22613.173177]  do_syscall_64+0x77/0x290
[22613.173238]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[22613.173241] RIP: 0033:0x7fc2c886ac61
[22613.173244] Code: fe ff ff 48 8d 3d c7 a0 09 00 48 83 ec 08 e8 46 03 02 00 66 0f 1f 44 00 00 8b 05 ca fb 2c 00 48 63 ff 85 c0 75 13 31 c0 0f 05 &lt;48&gt; 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48 89
[22613.173245] RSP: 002b:00007ffebe776d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[22613.173248] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2c886ac61
[22613.173249] RDX: 0000000000020000 RSI: 00007fc2c8cca000 RDI: 0000000000000003
[22613.173250] RBP: 0000000000020000 R08: ffffffffffffffff R09: 0000000000000000
[22613.173251] R10: 000000000000038c R11: 0000000000000246 R12: 00007fc2c8cca000
[22613.173253] R13: 0000000000000003 R14: 00007fc2c8cca00f R15: 0000000000020000
[22613.173255] ==================================================================
[22613.173256] Disabling lock debugging due to kernel taint
[22613.173350] BUG: kernel NULL pointer dereference, address: 0000000000000000
[22613.178380] #PF: supervisor read access in kernel mode
[22613.180959] #PF: error_code(0x0000) - not-present page
[22613.183444] PGD 0 P4D 0
[22613.184867] Oops: 0000 [#1] SMP KASAN PTI
[22613.186797] CPU: 1 PID: 6753 Comm: cat Tainted: G    B             5.5.0-rc7-lp151.28.16-default+ #11
[22613.191253] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[22613.196706] RIP: 0010:sysfs_kf_seq_show+0x117/0x230
[22613.199097] Code: ff 48 8b 0b 48 8b 44 24 08 48 01 e9 eb a6 31 f6 48 89 cf ba 00 10 00 00 48 89 4c 24 10 e8 b1 e6 e9 ff 4c 89 ff e8 19 07 ea ff &lt;49&gt; 8b 07 48 85 c0 48 89 44 24 08 0f 84 91 00 00 00 49 8b 6d 00 48
[22613.208016] RSP: 0018:ffff8881d4f8fd78 EFLAGS: 00010246
[22613.210448] RAX: 0000000000000000 RBX: ffff8881eb99b180 RCX: ffffffff810d9ef6
[22613.213691] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[22613.216893] RBP: 0000000000001000 R08: fffffbfff072ddcd R09: fffffbfff072ddcd
[22613.220075] R10: 0000000000000001 R11: fffffbfff072ddcc R12: ffff8881de5c0200
[22613.223256] R13: ffff8881ed175500 R14: ffff8881eb99b198 R15: 0000000000000000
[22613.226290] FS:  00007fc2c8d3d500(0000) GS:ffff8881f2a80000(0000) knlGS:0000000000000000
[22613.229637] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22613.231993] CR2: 0000000000000000 CR3: 00000001ec89a004 CR4: 00000000003606e0
[22613.234909] Call Trace:
[22613.235931]  seq_read+0x199/0x6d0
[22613.237259]  vfs_read+0xa5/0x1a0
[22613.239229]  ksys_read+0x110/0x160
[22613.240590]  ? kernel_write+0xb0/0xb0
[22613.242040]  do_syscall_64+0x77/0x290
[22613.243625]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[22613.245450] RIP: 0033:0x7fc2c886ac61
[22613.246706] Code: fe ff ff 48 8d 3d c7 a0 09 00 48 83 ec 08 e8 46 03 02 00 66 0f 1f 44 00 00 8b 05 ca fb 2c 00 48 63 ff 85 c0 75 13 31 c0 0f 05 &lt;48&gt; 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48 89
[22613.253296] RSP: 002b:00007ffebe776d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[22613.255835] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2c886ac61
[22613.258472] RDX: 0000000000020000 RSI: 00007fc2c8cca000 RDI: 0000000000000003
[22613.260807] RBP: 0000000000020000 R08: ffffffffffffffff R09: 0000000000000000
[22613.263188] R10: 000000000000038c R11: 0000000000000246 R12: 00007fc2c8cca000
[22613.265598] R13: 0000000000000003 R14: 00007fc2c8cca00f R15: 0000000000020000
[22613.268729] Modules linked in: scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock fuse bnep kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel snd_ens1371 snd_ac97_codec ac97_bus bcache snd_pcm btusb btrtl btbcm btintel crc64 aesni_intel glue_helper crypto_simd vmw_balloon cryptd bluetooth snd_timer snd_rawmidi snd joydev pcspkr e1000 rfkill vmw_vmci soundcore ecdh_generic ecc gameport i2c_piix4 mptctl ac button hid_generic usbhid sr_mod cdrom ata_generic ehci_pci vmwgfx uhci_hcd drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm ehci_hcd mptspi scsi_transport_spi mptscsih ata_piix mptbase ahci usbcore libahci drm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[22613.292429] CR2: 0000000000000000
[22613.293563] ---[ end trace a074b26a8508f378 ]---
[22613.295138] RIP: 0010:sysfs_kf_seq_show+0x117/0x230
[22613.296769] Code: ff 48 8b 0b 48 8b 44 24 08 48 01 e9 eb a6 31 f6 48 89 cf ba 00 10 00 00 48 89 4c 24 10 e8 b1 e6 e9 ff 4c 89 ff e8 19 07 ea ff &lt;49&gt; 8b 07 48 85 c0 48 89 44 24 08 0f 84 91 00 00 00 49 8b 6d 00 48
[22613.303553] RSP: 0018:ffff8881d4f8fd78 EFLAGS: 00010246
[22613.305280] RAX: 0000000000000000 RBX: ffff8881eb99b180 RCX: ffffffff810d9ef6
[22613.307924] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[22613.310272] RBP: 0000000000001000 R08: fffffbfff072ddcd R09: fffffbfff072ddcd
[22613.312685] R10: 0000000000000001 R11: fffffbfff072ddcc R12: ffff8881de5c0200
[22613.315076] R13: ffff8881ed175500 R14: ffff8881eb99b198 R15: 0000000000000000
[22613.318116] FS:  00007fc2c8d3d500(0000) GS:ffff8881f2a80000(0000) knlGS:0000000000000000
[22613.320743] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22613.322628] CR2: 0000000000000000 CR3: 00000001ec89a004 CR4: 00000000003606e0

Here this patch fixes the following problem by explicity set all the 7
members to 0 in bch_cache_accounting_clear().

Reported-by: Сорокин Артем Сергеевич &lt;a.sorokin@bank-hlynov.ru&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: fix use-after-free in register_bcache()</title>
<updated>2020-02-24T07:36:57+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-01-23T17:01:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b5f6bf0fdd71eecec7fcac94049caa1017d0769c'/>
<id>urn:sha1:b5f6bf0fdd71eecec7fcac94049caa1017d0769c</id>
<content type='text'>
[ Upstream commit ae3cd299919af6eb670d5af0bc9d7ba14086bd8e ]

The patch "bcache: rework error unwinding in register_bcache" introduces
a use-after-free regression in register_bcache(). Here are current code,
	2510 out_free_path:
	2511         kfree(path);
	2512 out_module_put:
	2513         module_put(THIS_MODULE);
	2514 out:
	2515         pr_info("error %s: %s", path, err);
	2516         return ret;
If some error happens and the above code path is executed, at line 2511
path is released, but referenced at line 2515. Then KASAN reports a use-
after-free error message.

This patch changes line 2515 in the following way to fix the problem,
	2515         pr_info("error %s: %s", path?path:"", err);

Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: rework error unwinding in register_bcache</title>
<updated>2020-02-24T07:36:56+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-01-23T17:01:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=393b8509be3348d5b11a643f0d05597ce3ea4c7c'/>
<id>urn:sha1:393b8509be3348d5b11a643f0d05597ce3ea4c7c</id>
<content type='text'>
[ Upstream commit 50246693f81fe887f4db78bf7089051d7f1894cc ]

Split the successful and error return path, and use one goto label for each
resource to unwind.  This also fixes some small errors like leaking the
module reference count in the reboot case (which seems entirely harmless)
or printing the wrong warning messages for early failures.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: cached_dev_free needs to put the sb page</title>
<updated>2020-02-24T07:36:56+00:00</updated>
<author>
<name>Liang Chen</name>
<email>liangchen.linux@gmail.com</email>
</author>
<published>2020-01-23T17:01:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f7d8ebf26d239ed7f9cc4b940ff240d78c4b61d1'/>
<id>urn:sha1:f7d8ebf26d239ed7f9cc4b940ff240d78c4b61d1</id>
<content type='text'>
[ Upstream commit e8547d42095e58bee658f00fef8e33d2a185c927 ]

Same as cache device, the buffer page needs to be put while
freeing cached_dev.  Otherwise a page would be leaked every
time a cached_dev is stopped.

Signed-off-by: Liang Chen &lt;liangchen.linux@gmail.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>bcache: avoid unnecessary btree nodes flushing in btree_flush_write()</title>
<updated>2020-02-14T21:34:19+00:00</updated>
<author>
<name>Coly Li</name>
<email>colyli@suse.de</email>
</author>
<published>2020-01-23T17:01:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=837c36e0451f14225e684bb4942a9f4b4eed3cb7'/>
<id>urn:sha1:837c36e0451f14225e684bb4942a9f4b4eed3cb7</id>
<content type='text'>
commit 2aa8c529387c25606fdc1484154b92f8bfbc5746 upstream.

the commit 91be66e1318f ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c-&gt;btree_cache, avoid scanning a lot
  of clean btree nodes.
- Take c-&gt;btree_cache as a LRU-like list, aggressively flushing all
  dirty nodes from tail of c-&gt;btree_cache util the btree node with
  oldest journal entry is flushed. This is to reduce the time of holding
  c-&gt;bucket_lock.

Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c-&gt;btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c-&gt;btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.

After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c-&gt;btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c-&gt;btree_cache is not
  a LRU-like list, the btree nodes with oldest journal entry won't
  be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
  entry, before flushing all the nodes the oldest journal entry cannot
  be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c-&gt;btree_cache list is really NOT a good idea.

Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c-&gt;journal.lock when getting oldest journal entry of
  fifo c-&gt;journal.pin. In rested locations check the journal entries
  locklessly, so their values can be changed on other cores
  in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
  point of fifo c-&gt;journal.pin. If it is different from the original
  point which we get with locking c-&gt;journal.lock, it means the oldest
  journal entry is reclaim on other cores. At this moment, all selected
  dirty nodes recorded in array btree_nodes[] are all flushed and clean
  on other CPU cores, it is unncessary to iterate c-&gt;btree_cache any
  longer. Just quit the list_for_each_entry_safe_reverse() loop and
  the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
  loop. Check the refcount value of orignial fifo front point, if the
  value is larger than selected node number of btree_nodes[], it means
  more matching btree nodes should be scanned. Otherwise it means no
  more matching btee nodes in rest of c-&gt;btree_cache list, the loop
  can be quit. If the original oldest journal entry is reclaimed and
  fifo front point is updated, the refcount of original fifo front point
  will be 0, then the loop will be quit too.
- Not hold c-&gt;bucket_lock too long time. c-&gt;bucket_lock is also required
  for space allocation for cached data, hold it for too long time will
  block regular I/O requests. When iterating list c-&gt;btree_cache, even
  there are a lot of maching btree nodes, in order to not holding
  c-&gt;bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
  selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for  unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c-&gt;btree_cache.

At last, one more thing to explain: Why it is safe to read front point
of c-&gt;journal.pin without holding c-&gt;journal.lock inside the
list_for_each_entry_safe_reverse() loop ?

Here is my answer: When reading the front point of fifo c-&gt;journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c-&gt;jouranl.lock is
held). For such purpose, it works as expected without holding
c-&gt;journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b-&gt;write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c-&gt;btree_cache.

Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang &lt;fangguoju@gmail.com&gt;
Reported-by: Shuang Li &lt;psymon@bonuscloud.io&gt;
Signed-off-by: Coly Li &lt;colyli@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
