<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/md/dm-thin.c, branch v5.15.7</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.15.7</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.15.7'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2021-08-10T17:34:23+00:00</updated>
<entry>
<title>dm: update target status functions to support IMA measurement</title>
<updated>2021-08-10T17:34:23+00:00</updated>
<author>
<name>Tushar Sugandhi</name>
<email>tusharsu@linux.microsoft.com</email>
</author>
<published>2021-07-13T00:49:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ec456629d0bf051e41ef2c87a60755f941dd11c'/>
<id>urn:sha1:8ec456629d0bf051e41ef2c87a60755f941dd11c</id>
<content type='text'>
For device mapper targets to take advantage of IMA's measurement
capabilities, the status functions for the individual targets need to be
updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their
respective attributes to be measured using IMA.
 01. cache
 02. crypt
 03. integrity
 04. linear
 05. mirror
 06. multipath
 07. raid
 08. snapshot
 09. striped
 10. verity

For rest of the targets, handle the STATUSTYPE_IMA case by setting the
measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the
system needs to be updated to have the following line, and the system
needs to be restarted for the measurements to take effect.

/etc/ima/ima-policy
 measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients
running on the system, and send them to external services for attesting
the system.

The DM target data measured by IMA subsystem can alternatively
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
DM_TABLE_STATUS_CMD.

Signed-off-by: Tushar Sugandhi &lt;tusharsu@linux.microsoft.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: remove needless request_queue NULL pointer check</title>
<updated>2021-03-26T18:53:42+00:00</updated>
<author>
<name>Xu Wang</name>
<email>vulab@iscas.ac.cn</email>
</author>
<published>2021-03-19T08:11:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=695902bb2e17baf10a5a312ef048b71f738ddbe8'/>
<id>urn:sha1:695902bb2e17baf10a5a312ef048b71f738ddbe8</id>
<content type='text'>
Since commit ff9ea323816d ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.

Signed-off-by: Xu Wang &lt;vulab@iscas.ac.cn&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>writeback: remove bdi-&gt;congested_fn</title>
<updated>2020-07-08T23:20:46+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-07-01T09:06:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=21cf866145047f8bfecb38ec8d2fed64464c074f'/>
<id>urn:sha1:21cf866145047f8bfecb38ec8d2fed64464c074f</id>
<content type='text'>
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures.  And
pktdvd is a deprecated driver that isn't useful in stack setup
either.  So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Song Liu &lt;song@kernel.org&gt;
Acked-by: David Sterba &lt;dsterba@suse.com&gt;
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: rename generic_make_request to submit_bio_noacct</title>
<updated>2020-07-01T13:27:24+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-07-01T08:59:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead'/>
<id>urn:sha1:ed00aabd5eb9fb44d6aff1173234a2e911b9fead</id>
<content type='text'>
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>dm thin: change data device's flush_bio to be member of struct pool</title>
<updated>2020-01-15T01:23:13+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2020-01-13T20:41:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f06c03d1ded22cf1c059650c4ba02496e93a0c06'/>
<id>urn:sha1:f06c03d1ded22cf1c059650c4ba02496e93a0c06</id>
<content type='text'>
With commit fe64369163c5 ("dm thin: don't allow changing data device
during thin-pool load") it is now possible to re-parent the data
device's flush_bio from the pool_c to pool structure.  Doing so offers
improved lifetime guarantees for the flush_bio so that the call to
dm_pool_register_pre_commit_callback can now be done safely from
pool_ctr().

Depends-on: fe64369163c5 ("dm thin: don't allow changing data device during thin-pool load")
Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: don't allow changing data device during thin-pool reload</title>
<updated>2020-01-15T01:22:51+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2020-01-13T20:04:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=873937e75f9a8ea231a502c3d29d9cb6ad91b3ef'/>
<id>urn:sha1:873937e75f9a8ea231a502c3d29d9cb6ad91b3ef</id>
<content type='text'>
The existing code allows changing the data device when the thin-pool
target is reloaded.

This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
	https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.

Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: fix use-after-free in metadata_pre_commit_callback</title>
<updated>2020-01-15T01:22:50+00:00</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2020-01-13T17:29:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a4a8d286586d4b28c8517a51db8d86954aadc74b'/>
<id>urn:sha1:a4a8d286586d4b28c8517a51db8d86954aadc74b</id>
<content type='text'>
dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.

Since commit 694cfe7f31db3 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd-&gt;pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd-&gt;pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.

Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.

In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.

Fixes: 694cfe7f31db3 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac &lt;zkabelac@redhat.com&gt;
Reported-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: Flush data device before committing metadata</title>
<updated>2019-12-06T16:46:16+00:00</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-12-04T14:07:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=694cfe7f31db36912725e63a38a5179c8628a496'/>
<id>urn:sha1:694cfe7f31db36912725e63a38a5179c8628a496</id>
<content type='text'>
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, that now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sectors writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device.)

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Joe Thornber &lt;ejt@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: wakeup worker only when deferred bios exist</title>
<updated>2019-11-18T15:03:12+00:00</updated>
<author>
<name>Jeffle Xu</name>
<email>jefflexu@linux.alibaba.com</email>
</author>
<published>2019-11-18T01:50:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d256d796279de0bdc227ff4daef565aa7e80c898'/>
<id>urn:sha1:d256d796279de0bdc227ff4daef565aa7e80c898</id>
<content type='text'>
Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
numjobs=1) over dm-thin device has poor performance versus bare nvme
device.

Further investigation with perf indicates that queue_work_on() consumes
over 20% CPU time when doing IO over dm-thin device. The call stack is
as follows.

- 40.57% thin_map
    + 22.07% queue_work_on
    + 9.95% dm_thin_find_block
    + 2.80% cell_defer_no_holder
      1.91% inc_all_io_entry.isra.33.part.34
    + 1.78% bio_detain.isra.35

In cell_defer_no_holder(), wakeup_worker() is always called, no matter
whether the tc-&gt;deferred_bio_list list is empty or not. In single thread
IO model, this list is most likely empty. So skip waking up worker thread
if tc-&gt;deferred_bio_list list is empty.

Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
once the needless wake_worker() calls are properly skipped.

Signed-off-by: Jeffle Xu &lt;jefflexu@linux.alibaba.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm thin: replace spin_lock_irqsave with spin_lock_irq</title>
<updated>2019-11-05T19:38:26+00:00</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2019-10-15T12:16:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8e0c9dacc39eccd27e70cf3243896b43f5398c17'/>
<id>urn:sha1:8e0c9dacc39eccd27e70cf3243896b43f5398c17</id>
<content type='text'>
If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
</feed>
