<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/dm-thin.c, branch v5.7.4</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.7.4</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.7.4'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2020-01-15T01:23:13+00:00</updated>
<entry>
<title>dm thin: change data device's flush_bio to be member of struct pool</title>
<updated>2020-01-15T01:23:13+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2020-01-13T20:41:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f06c03d1ded22cf1c059650c4ba02496e93a0c06'/>
<id>urn:sha1:f06c03d1ded22cf1c059650c4ba02496e93a0c06</id>
<content type='text'>
With commit fe64369163c5 ("dm thin: don't allow changing data device
during thin-pool load") it is now possible to re-parent the data
device's flush_bio from the pool_c to pool structure.  Doing so offers
improved lifetime guarantees for the flush_bio so that the call to
dm_pool_register_pre_commit_callback can now be done safely from
pool_ctr().

Depends-on: fe64369163c5 ("dm thin: don't allow changing data device during thin-pool load")
Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: don't allow changing data device during thin-pool reload</title>
<updated>2020-01-15T01:22:51+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2020-01-13T20:04:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=873937e75f9a8ea231a502c3d29d9cb6ad91b3ef'/>
<id>urn:sha1:873937e75f9a8ea231a502c3d29d9cb6ad91b3ef</id>
<content type='text'>
The existing code allows changing the data device when the thin-pool
target is reloaded.

This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
	https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.

Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: fix use-after-free in metadata_pre_commit_callback</title>
<updated>2020-01-15T01:22:50+00:00</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2020-01-13T17:29:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a4a8d286586d4b28c8517a51db8d86954aadc74b'/>
<id>urn:sha1:a4a8d286586d4b28c8517a51db8d86954aadc74b</id>
<content type='text'>
dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.

Since commit 694cfe7f31db3 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd-&gt;pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd-&gt;pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.

Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.

In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.

Fixes: 694cfe7f31db3 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac &lt;zkabelac@redhat.com&gt;
Reported-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: Flush data device before committing metadata</title>
<updated>2019-12-06T16:46:16+00:00</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-12-04T14:07:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=694cfe7f31db36912725e63a38a5179c8628a496'/>
<id>urn:sha1:694cfe7f31db36912725e63a38a5179c8628a496</id>
<content type='text'>
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sectors writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device.)

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Joe Thornber &lt;ejt@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: wakeup worker only when deferred bios exist</title>
<updated>2019-11-18T15:03:12+00:00</updated>
<author>
<name>Jeffle Xu</name>
<email>jefflexu@linux.alibaba.com</email>
</author>
<published>2019-11-18T01:50:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d256d796279de0bdc227ff4daef565aa7e80c898'/>
<id>urn:sha1:d256d796279de0bdc227ff4daef565aa7e80c898</id>
<content type='text'>
Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
numjobs=1) over dm-thin device has poor performance versus bare nvme
device.

Further investigation with perf indicates that queue_work_on() consumes
over 20% CPU time when doing IO over dm-thin device. The call stack is
as follows.

- 40.57% thin_map
    + 22.07% queue_work_on
    + 9.95% dm_thin_find_block
    + 2.80% cell_defer_no_holder
      1.91% inc_all_io_entry.isra.33.part.34
    + 1.78% bio_detain.isra.35

In cell_defer_no_holder(), wakeup_worker() is always called, no matter
whether the tc-&gt;deferred_bio_list list is empty or not. In single thread
IO model, this list is most likely empty. So skip waking up worker thread
if tc-&gt;deferred_bio_list list is empty.

Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
once the needless wake_worker() calls are properly skipped.

Signed-off-by: Jeffle Xu &lt;jefflexu@linux.alibaba.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: replace spin_lock_irqsave with spin_lock_irq</title>
<updated>2019-11-05T19:38:26+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2019-10-15T12:16:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8e0c9dacc39eccd27e70cf3243896b43f5398c17'/>
<id>urn:sha1:8e0c9dacc39eccd27e70cf3243896b43f5398c17</id>
<content type='text'>
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: add sanity checks to thin-pool and external snapshot creation</title>
<updated>2019-03-05T19:53:49+00:00</updated>
<author>
<name>Jason Cai (Xiang Feng)</name>
<email>jason.cai.kern@gmail.com</email>
</author>
<published>2019-01-20T14:39:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=70de2cbda8a5d788284469e755f8b097d339c240'/>
<id>urn:sha1:70de2cbda8a5d788284469e755f8b097d339c240</id>
<content type='text'>
Invoking dm_get_device() twice on the same device path with different
modes is dangerous.  Because in that case, upgrade_mode() will alloc a
new 'dm_dev' and free the old one, which may be referenced by a previous
caller.  Dereferencing the dangling pointer will trigger kernel NULL
pointer dereference.

The following two cases can reproduce this issue.  Actually, they are
invalid setups that must be disallowed, e.g.:

1. Creating a thin-pool with read_only mode, and the same device as
both metadata and data.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdb /dev/vdb 128 0 1 read_only"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
...
Call Trace:
 new_read+0xfb/0x110 [dm_bufio]
 dm_bm_read_lock+0x43/0x190 [dm_persistent_data]
 ? kmem_cache_alloc_trace+0x15c/0x1e0
 __create_persistent_data_objects+0x65/0x3e0 [dm_thin_pool]
 dm_pool_metadata_open+0x8c/0xf0 [dm_thin_pool]
 pool_ctr.cold.79+0x213/0x913 [dm_thin_pool]
 ? realloc_argv+0x50/0x70 [dm_mod]
 dm_table_add_target+0x14e/0x330 [dm_mod]
 table_load+0x122/0x2e0 [dm_mod]
 ? dev_status+0x40/0x40 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ? handle_mm_fault+0xda/0x200
 ? __do_page_fault+0x26c/0x4f0
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

2. Creating a external snapshot using the same thin-pool device.

dmsetup create thinp --table \
    "0 41943040 thin-pool /dev/vdc /dev/vdb 128 0 2 ignore_discard"
dmsetup message /dev/mapper/thinp 0 "create_thin 0"
dmsetup create snap --table \
            "0 204800 thin /dev/mapper/thinp 0 /dev/mapper/thinp"

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
Call Trace:
? __alloc_pages_nodemask+0x13c/0x2e0
retrieve_status+0xa5/0x1f0 [dm_mod]
? dm_get_live_or_inactive_table.isra.7+0x20/0x20 [dm_mod]
 table_status+0x61/0xa0 [dm_mod]
 ctl_ioctl+0x1aa/0x3e0 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 do_vfs_ioctl+0xa2/0x600
 ksys_ioctl+0x60/0x90
 ? ksys_write+0x4f/0xb0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x55/0x150
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Jason Cai (Xiang Feng) &lt;jason.cai@linux.alibaba.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm: eliminate 'split_discard_bios' flag from DM target interface</title>
<updated>2019-02-21T04:24:55+00:00</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2019-01-18T19:19:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=61697a6abd24acba941359c6268a94f4afe4a53d'/>
<id>urn:sha1:61697a6abd24acba941359c6268a94f4afe4a53d</id>
<content type='text'>
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits.  A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: fix bug where bio that overwrites thin block ignores FUA</title>
<updated>2019-02-15T00:02:29+00:00</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-02-14T18:38:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4ae280b4ee3463fa57bbe6eede26b97daff8a0f1'/>
<id>urn:sha1:4ae280b4ee3463fa57bbe6eede26b97daff8a0f1</id>
<content type='text'>
When provisioning a new data block for a virtual block, either because
the block was previously unallocated or because we are breaking sharing,
if the whole block of data is being overwritten the bio that triggered
the provisioning is issued immediately, skipping copying or zeroing of
the data block.

When this bio completes the new mapping is inserted in to the pool's
metadata by process_prepared_mapping(), where the bio completion is
signaled to the upper layers.

This completion is signaled without first committing the metadata.  If
the bio in question has the REQ_FUA flag set and the system crashes
right after its completion and before the next metadata commit, then the
write is lost despite the REQ_FUA flag requiring that I/O completion for
this request must only be signaled after the data has been committed to
non-volatile storage.

Fix this by deferring the completion of overwrite bios, with the REQ_FUA
flag set, until after the metadata has been committed.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Joe Thornber &lt;ejt@redhat.com&gt;
Acked-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: fix passdown_double_checking_shared_status()</title>
<updated>2019-01-15T21:10:41+00:00</updated>
<author>
<name>Joe Thornber</name>
<email>ejt@redhat.com</email>
</author>
<published>2019-01-15T18:27:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d445bd9cec1a850c2100fcf53684c13b3fd934f2'/>
<id>urn:sha1:d445bd9cec1a850c2100fcf53684c13b3fd934f2</id>
<content type='text'>
Commit 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next
stage processing") changed process_prepared_discard_passdown_pt1() to
increment all the blocks being discarded until after the passdown had
completed to avoid them being prematurely reused.

IO issued to a thin device that breaks sharing with a snapshot, followed
by a discard issued to snapshot(s) that previously shared the block(s),
results in passdown_double_checking_shared_status() being called to
iterate through the blocks double checking their reference count is zero
and issuing the passdown if so.  So a side effect of commit 00a0ea33b495
is passdown_double_checking_shared_status() was broken.

Fix this by checking if the block reference count is greater than 1.
Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().

Fixes: 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next stage processing")
Cc: stable@vger.kernel.org # 4.9+
Reported-by: ryan.p.norwood@gmail.com
Signed-off-by: Joe Thornber &lt;ejt@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
</feed>
