diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2018-01-31 22:05:47 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2018-01-31 22:05:47 +0300 |
commit | 0be600a5add76e8e8b9e1119f2a7426ff849aca8 (patch) | |
tree | d5fcc2b119f03143f9bed1b9aa5cb85458c8bd03 /Documentation | |
parent | 040639b7fcf73ee39c15d38257f652a2048e96f2 (diff) | |
parent | 9614e2ba9161c7f5419f4212fa6057d2a65f6ae6 (diff) | |
download | linux-0be600a5add76e8e8b9e1119f2a7426ff849aca8.tar.xz |
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core fixes to ensure that bio submission follows a depth-first
tree walk; this is critical to allow forward progress without the
need to use the bioset's BIOSET_NEED_RESCUER.
- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well leveraging per-bio-data more).
- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool
that stack directly on NVMe data device.
- DM multipath improvements to factor out legacy SCSI-only (e.g.
scsi_dh) code paths to allow for more optimized support for NVMe
multipath.
- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.
- Numerous DM raid target fixes and improvements.
- Add a new DM "unstriped" target that enables Intel to workaround
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target).
- Various Documentation fixes and improvements.
- Misc cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
dm cache: Documentation: update default migration_throttling value
dm mpath selector: more evenly distribute ties
dm unstripe: fix target length versus number of stripes size check
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
dm table: fix NVMe bio-based dm_table_determine_type() validation
dm: various cleanups to md->queue initialization code
dm mpath: delay the retry of a request if the target responded as busy
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
dm log writes: fix max length used for kstrndup
dm: backfill missing calls to mutex_destroy()
dm snapshot: use mutex instead of rw_semaphore
dm flakey: check for null arg_name in parse_features()
dm thin: extend thinpool status format string with omitted fields
dm thin: fixes in thin-provisioning.txt
dm thin: document representation of <highest mapped sector> when there is none
dm thin: fix documentation relative to low water mark threshold
dm cache: be consistent in specifying sectors and SI units in cache.txt
dm cache: delete obsoleted paragraph in cache.txt
dm cache: fix grammar in cache-policies.txt
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/device-mapper/cache-policies.txt | 4 | ||||
-rw-r--r-- | Documentation/device-mapper/cache.txt | 9 | ||||
-rw-r--r-- | Documentation/device-mapper/dm-raid.txt | 5 | ||||
-rw-r--r-- | Documentation/device-mapper/snapshot.txt | 4 | ||||
-rw-r--r-- | Documentation/device-mapper/thin-provisioning.txt | 14 | ||||
-rw-r--r-- | Documentation/device-mapper/unstriped.txt | 124 |
6 files changed, 146 insertions, 14 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt index d3ca8af21a31..86786d87d9a8 100644 --- a/Documentation/device-mapper/cache-policies.txt +++ b/Documentation/device-mapper/cache-policies.txt @@ -60,7 +60,7 @@ Memory usage: The mq policy used a lot of memory; 88 bytes per cache block on a 64 bit machine. -smq uses 28bit indexes to implement it's data structures rather than +smq uses 28bit indexes to implement its data structures rather than pointers. It avoids storing an explicit hit count for each block. It has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of the entries (each hotspot block covers a larger area than a single @@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions. Adaptability: The mq policy maintained a hit count for each cache block. For a -different block to get promoted to the cache it's hit count has to +different block to get promoted to the cache its hit count has to exceed the lowest currently in the cache. This meant it could take a long time for the cache to adapt between varying IO patterns. diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt index cdfd0feb294e..ff0841711fd5 100644 --- a/Documentation/device-mapper/cache.txt +++ b/Documentation/device-mapper/cache.txt @@ -59,7 +59,7 @@ Fixed block size The origin is divided up into blocks of a fixed size. This block size is configurable when you first create the cache. Typically we've been using block sizes of 256KB - 1024KB. The block size must be between 64 -(32KB) and 2097152 (1GB) and a multiple of 64 (32KB). +sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB). Having a fixed block size simplifies the target a lot. But it is something of a compromise. For instance, a small part of a block may be @@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments. For the time being, a message "migration_threshold <#sectors>" can be used to set the maximum number of sectors being migrated, -the default being 204800 sectors (or 100MB). +the default being 2048 sectors (1MB). Updating on-disk metadata ------------------------- @@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small. Like the dirty flags this data is lost if there's a crash so a safe fallback value should always be possible. -For instance, the 'mq' policy, which is currently the default policy, -uses this facility to store the hit count of the cache blocks. If -there's a crash this information will be lost, which means the cache -may be less efficient until those hit counts are regenerated. - Policy hints affect performance, not correctness. Policy messaging diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt index 32df07e29f68..390c145f01d7 100644 --- a/Documentation/device-mapper/dm-raid.txt +++ b/Documentation/device-mapper/dm-raid.txt @@ -343,5 +343,8 @@ Version History 1.11.0 Fix table line argument order (wrong raid10_copies/raid10_format sequence) 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option -1.12.1 fix for MD deadlock between mddev_suspend() and md_write_start() available +1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A') +1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an + state races. +1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt index ad6949bff2e3..b8bbb516f989 100644 --- a/Documentation/device-mapper/snapshot.txt +++ b/Documentation/device-mapper/snapshot.txt @@ -49,6 +49,10 @@ The difference between persistent and transient is with transient snapshots less metadata must be saved on disk - they can be kept in memory by the kernel. +When loading or unloading the snapshot target, the corresponding +snapshot-origin or snapshot-merge target must be suspended. A failure to +suspend the origin target could result in data corruption. + * snapshot-merge <origin> <COW device> <persistent> <chunksize> diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt index 1699a55b7b70..4bcd4b7f79f9 100644 --- a/Documentation/device-mapper/thin-provisioning.txt +++ b/Documentation/device-mapper/thin-provisioning.txt @@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size. If free space on the data device drops below this level then a dm event will be triggered which a userspace daemon should catch allowing it to extend the pool device. Only one such event will be sent. -Resuming a device with a new table itself triggers an event so the -userspace daemon can use this to detect a situation where a new table -already exceeds the threshold. + +No special event is triggered if a just resumed device's free space is below +the low water mark. However, resuming a device always triggers an +event; a userspace daemon should verify that free space exceeds the low +water mark when handling this event. A low water mark for the metadata device is maintained in the kernel and will trigger a dm event if free space on the metadata device drops below @@ -274,7 +276,8 @@ ii) Status <transaction id> <used metadata blocks>/<total metadata blocks> <used data blocks>/<total data blocks> <held metadata root> - [no_]discard_passdown ro|rw + ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space + needs_check|- transaction id: A 64-bit number used by userspace to help synchronise with metadata @@ -394,3 +397,6 @@ ii) Status If the pool has encountered device errors and failed, the status will just contain the string 'Fail'. The userspace recovery tools should then be used. + + In the case where <nr mapped sectors> is 0, there is no highest + mapped sector and the value of <highest mapped sector> is unspecified. diff --git a/Documentation/device-mapper/unstriped.txt b/Documentation/device-mapper/unstriped.txt new file mode 100644 index 000000000000..0b2a306c54ee --- /dev/null +++ b/Documentation/device-mapper/unstriped.txt @@ -0,0 +1,124 @@ +Introduction +============ + +The device-mapper "unstriped" target provides a transparent mechanism to +unstripe a device-mapper "striped" target to access the underlying disks +without having to touch the true backing block-device. It can also be +used to unstripe a hardware RAID-0 to access backing disks. + +Parameters: +<number of stripes> <chunk size> <stripe #> <dev_path> <offset> + +<number of stripes> + The number of stripes in the RAID 0. + +<chunk size> + The amount of 512B sectors in the chunk striping. + +<dev_path> + The block device you wish to unstripe. + +<stripe #> + The stripe number within the device that corresponds to physical + drive you wish to unstripe. This must be 0 indexed. + + +Why use this module? +==================== + +An example of undoing an existing dm-stripe +------------------------------------------- + +This small bash script will setup 4 loop devices and use the existing +striped target to combine the 4 devices into one. It then will use +the unstriped target ontop of the striped device to access the +individual backing loop devices. We write data to the newly exposed +unstriped devices and verify the data written matches the correct +underlying device on the striped array. + +#!/bin/bash + +MEMBER_SIZE=$((128 * 1024 * 1024)) +NUM=4 +SEQ_END=$((${NUM}-1)) +CHUNK=256 +BS=4096 + +RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512)) +DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}" +COUNT=$((${MEMBER_SIZE} / ${BS})) + +for i in $(seq 0 ${SEQ_END}); do + dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct + losetup /dev/loop${i} member-${i} + DM_PARMS+=" /dev/loop${i} 0" +done + +echo $DM_PARMS | dmsetup create raid0 +for i in $(seq 0 ${SEQ_END}); do + echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i} +done; + +for i in $(seq 0 ${SEQ_END}); do + dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct + diff /dev/mapper/set-${i} member-${i} +done; + +for i in $(seq 0 ${SEQ_END}); do + dmsetup remove set-${i} +done + +dmsetup remove raid0 + +for i in $(seq 0 ${SEQ_END}); do + losetup -d /dev/loop${i} + rm -f member-${i} +done + +Another example +--------------- + +Intel NVMe drives contain two cores on the physical device. +Each core of the drive has segregated access to its LBA range. +The current LBA model has a RAID 0 128k chunk on each core, resulting +in a 256k stripe across the two cores: + + Core 0: Core 1: + __________ __________ + | LBA 512| | LBA 768| + | LBA 0 | | LBA 256| + ---------- ---------- + +The purpose of this unstriping is to provide better QoS in noisy +neighbor environments. When two partitions are created on the +aggregate drive without this unstriping, reads on one partition +can affect writes on another partition. This is because the partitions +are striped across the two cores. When we unstripe this hardware RAID 0 +and make partitions on each new exposed device the two partitions are now +physically separated. + +With the dm-unstriped target we're able to segregate an fio script that +has read and write jobs that are independent of each other. Compared to +when we run the test on a combined drive with partitions, we were able +to get a 92% reduction in read latency using this device mapper target. + + +Example dmsetup usage +===================== + +unstriped ontop of Intel NVMe device that has 2 cores +----------------------------------------------------- +dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0' +dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0' + +There will now be two devices that expose Intel NVMe core 0 and 1 +respectively: +/dev/mapper/nvmset0 +/dev/mapper/nvmset1 + +unstriped ontop of striped with 4 drives using 128K chunk size +-------------------------------------------------------------- +dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0' +dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0' +dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0' +dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0' |