summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2020-12-29Merge tag 'for-5.11/dm-fix' of ↵Linus Torvalds1-4/+3
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: "Revert WQ_SYSFS change that broke reencryption (and all other functionality that requires reloading a dm-crypt DM table)" * tag 'for-5.11/dm-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: Revert "dm crypt: export sysfs of kcryptd workqueue"
2020-12-29Revert "dm crypt: export sysfs of kcryptd workqueue"Mike Snitzer1-4/+3
This reverts commit a2b8b2d975673b1a50ab0bcce5d146b9335edfad. WQ_SYSFS breaks the ability to reload a DM table due to sysfs kobject collision (due to active and inactive table). Given lack of demonstrated need for exposing this workqueue via sysfs: revert exposing it. Reported-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-24Merge tag 'block-5.11-2020-12-23' of git://git.kernel.dk/linux-blockLinus Torvalds2-3/+1
Pull block fixes from Jens Axboe: "A few stragglers in here, but mostly just straight fixes. In particular: - Set of rnbd fixes for issues around changes for the merge window (Gioh, Jack, Md Haris Iqbal) - iocost tracepoint addition (Baolin) - Copyright/maintainers update (Christoph) - Remove old blk-mq fast path CPU warning (Daniel) - loop max_part fix (Josh) - Remote IPI threaded IRQ fix (Sebastian) - dasd stable fixes (Stefan) - bcache merge window fixup and style fixup (Yi, Zheng)" * tag 'block-5.11-2020-12-23' of git://git.kernel.dk/linux-block: md/bcache: convert comma to semicolon bcache:remove a superfluous check in register_bcache block: update some copyrights block: remove a pointless self-reference in block_dev.c MAINTAINERS: add fs/block_dev.c to the block section blk-mq: Don't complete on a remote CPU in force threaded mode s390/dasd: fix list corruption of lcu list s390/dasd: fix list corruption of pavgroup group list s390/dasd: prevent inconsistent LCU device data s390/dasd: fix hanging device offline processing blk-iocost: Add iocg idle state tracepoint nbd: Respect max_part for all partition scans block/rnbd-clt: Does not request pdu to rtrs-clt block/rnbd-clt: Dynamically allocate sglist for rnbd_iu block/rnbd: Set write-back cache and fua same to the target device block/rnbd: Fix typos block/rnbd-srv: Protect dev session sysfs removal block/rnbd-clt: Fix possible memleak block/rnbd-clt: Get rid of warning regarding size argument in strlcpy blk-mq: Remove 'running from the wrong CPU' warning
2020-12-23md/bcache: convert comma to semicolonZheng Yongjun1-1/+1
Replace a comma between expression statements by a semicolon. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Coly Li <colyli@sue.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-23bcache:remove a superfluous check in register_bcacheYi Li1-2/+0
There have no reassign the bdev after check It is IS_ERR. the double check !IS_ERR(bdev) is superfluous. After commit 4e7b5671c6a8 ("block: remove i_bdev"), "Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code." so after lookup_bdev call, there no need to do bdput. remove a superfluous check the bdev & don't call bdput after lookup_bdev. Fixes: 4e7b5671c6a8("block: remove i_bdev") Signed-off-by: Yi Li <yili@winhong.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-23Merge tag 'for-5.11/dm-changes' of ↵Linus Torvalds18-26/+339
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Add DM verity support for signature verification with 2nd keyring - Fix DM verity to skip verity work if IO completes with error while system is shutting down - Add new DM multipath "IO affinity" path selector that maps IO destined to a given path to a specific CPU based on user provided mapping - Rename DM multipath path selector source files to have "dm-ps" prefix - Add REQ_NOWAIT support to some other simple DM targets that don't block in more elaborate ways waiting for IO - Export DM crypt's kcryptd workqueue via sysfs (WQ_SYSFS) - Fix error return code in DM's target_message() if empty message is received - A handful of other small cleanups * tag 'for-5.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: simplify the return expression of load_mapping() dm ebs: avoid double unlikely() notation when using IS_ERR() dm verity: skip verity work if I/O error when system is shutting down dm crypt: export sysfs of kcryptd workqueue dm ioctl: fix error return code in target_message dm crypt: Constify static crypt_iv_operations dm: add support for REQ_NOWAIT to various targets dm: rename multipath path selector source files to have "dm-ps" prefix dm mpath: add IO affinity path selector dm verity: Add support for signature verification with 2nd keyring dm: remove unnecessary current->bio_list check when submitting split bio
2020-12-22dm cache: simplify the return expression of load_mapping()Zheng Yongjun1-6/+1
Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-21dm ebs: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli1-1/+1
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-21dm verity: skip verity work if I/O error when system is shutting downHyeongseok Kim1-1/+11
If emergency system shutdown is called, like by thermal shutdown, a dm device could be alive when the block device couldn't process I/O requests anymore. In this state, the handling of I/O errors by new dm I/O requests or by those already in-flight can lead to a verity corruption state, which is a misjudgment. So, skip verity work in response to I/O error when system is shutting down. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-17Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds6-52/+75
Pull block driver updates from Jens Axboe: "Nothing major in here: - NVMe pull request from Christoph: - nvmet passthrough improvements (Chaitanya Kulkarni) - fcloop error injection support (James Smart) - read-only support for zoned namespaces without Zone Append (Javier González) - improve some error message (Minwoo Im) - reject I/O to offline fabrics namespaces (Victor Gladkov) - PCI queue allocation cleanups (Niklas Schnelle) - remove an unused allocation in nvmet (Amit Engel) - a Kconfig spelling fix (Colin Ian King) - nvme_req_qid simplication (Baolin Wang) - MD pull request from Song: - Fix race condition in md_ioctl() (Dae R. Jeong) - Initialize read_slot properly for raid10 (Kevin Vigor) - Code cleanup (Pankaj Gupta) - md-cluster resync/reshape fix (Zhao Heming) - Move null_blk into its own directory (Damien Le Moal) - null_blk zone and discard improvements (Damien Le Moal) - bcache race fix (Dongsheng Yang) - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang, Lutz Pogrell, Md Haris Iqbal) - lightnvm NULL pointer deref fix (tangzhenhao) - sr in_interrupt() removal (Sebastian Andrzej Siewior) - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included as it made it easier for them to funnel the feature through the block driver tree. - Follow up fixes (Colin Ian King)" * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits) block: drop dead assignments in loop_init() sr: Remove in_interrupt() usage in sr_init_command(). sr: Switch the sector size back to 2048 if sr_read_sector() changed it. cdrom: Reset sector_size back it is not 2048. drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c null_blk: Move driver into its own directory null_blk: Allow controlling max_hw_sectors limit null_blk: discard zones on reset null_blk: cleanup discard handling null_blk: Improve implicit zone close null_blk: improve zone locking block: Align max_hw_sectors to logical blocksize null_blk: Fail zone append to conventional zones null_blk: Fix zone size initialization bcache: fix race between setting bdev state to none and new write request direct to backing block/rnbd: fix a null pointer dereference on dev->blk_symlink_name block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name block/rnbd: call kobject_put in the failure path Documentation/ABI/rnbd-srv: add document for force_close block/rnbd-srv: close a mapped device from server side. ...
2020-12-16Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds14-170/+112
Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ...
2020-12-14Revert "dm raid: fix discard limits for raid1 and raid10"Mike Snitzer1-7/+5
This reverts commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512. Reverting 6ffeb1c3f822 ("md: change mddev 'chunk_sectors' from int to unsigned") exposes dm-raid.c compiler warnings detailed that commit's header. Clearly this more conservative fix, of simply reverting e0910c8e4f8, would've been more prudent given how late we were in the v5.10 release. Lessons have been learned. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-14Revert "md: change mddev 'chunk_sectors' from int to unsigned"Mike Snitzer1-2/+2
This reverts commit 6ffeb1c3f8226244c08105bcdbeecc04bad6b89a. This change caused unexpected v5.10 raid6 mount failures, see: https://lkml.org/lkml/2020/12/14/7 Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-13Merge tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-blockLinus Torvalds6-393/+82
Pull block fixes from Jens Axboe: "This should be it for 5.10. Mike and Song looked into the warning case, and thankfully it appears the fix was pretty trivial - we can just change the md device chunk type to unsigned int to get rid of it. They cannot currently be < 0, and nobody is checking for that either. We're reverting the discard changes as the corruption reports came in very late, and there's just no time to attempt to deal with it at this point. Reverting the changes in question is the right call for 5.10" * tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block: md: change mddev 'chunk_sectors' from int to unsigned Revert "md: add md_submit_discard_bio() for submitting discard bio" Revert "md/raid10: extend r10bio devs to raid disks" Revert "md/raid10: pull codes that wait for blocked dev into one function" Revert "md/raid10: improve raid10 discard request" Revert "md/raid10: improve discard request for far layout" Revert "dm raid: remove unnecessary discard limits for raid10"
2020-12-12md: change mddev 'chunk_sectors' from int to unsignedMike Snitzer1-2/+2
Commit e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"") exposed compiler warnings introduced by commit e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10"): In file included from ./include/linux/kernel.h:14, from ./include/asm-generic/bug.h:20, from ./arch/x86/include/asm/bug.h:93, from ./include/linux/bug.h:5, from ./include/linux/mmdebug.h:5, from ./include/linux/gfp.h:5, from ./include/linux/slab.h:15, from drivers/md/dm-raid.c:8: drivers/md/dm-raid.c: In function ‘raid_io_hints’: ./include/linux/minmax.h:18:28: warning: comparison of distinct pointer types lacks a cast (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^~ ./include/linux/minmax.h:32:4: note: in expansion of macro ‘__typecheck’ (__typecheck(x, y) && __no_side_effects(x, y)) ^~~~~~~~~~~ ./include/linux/minmax.h:42:24: note: in expansion of macro ‘__safe_cmp’ __builtin_choose_expr(__safe_cmp(x, y), \ ^~~~~~~~~~ ./include/linux/minmax.h:51:19: note: in expansion of macro ‘__careful_cmp’ #define min(x, y) __careful_cmp(x, y, <) ^~~~~~~~~~~~~ ./include/linux/minmax.h:84:39: note: in expansion of macro ‘min’ __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); }) ^~~ drivers/md/dm-raid.c:3739:33: note: in expansion of macro ‘min_not_zero’ limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors, ^~~~~~~~~~~~ Fix this by changing the chunk_sectors member of 'struct mddev' from int to 'unsigned int' to match the type used for the 'chunk_sectors' member of 'struct queue_limits'. Various MD code still uses 'int' but none of it appears to ever make use of signed int; and storing positive signed int in unsigned is perfectly safe. Reported-by: Song Liu <songliubraving@fb.com> Fixes: e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"") Fixes: e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10") Cc: stable@vger,kernel.org # e0910c8e4f87 was marked for stable@ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-10Revert "md: add md_submit_discard_bio() for submitting discard bio"Song Liu3-24/+12
This reverts commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: extend r10bio devs to raid disks"Song Liu1-4/+5
This reverts commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: pull codes that wait for blocked dev into one function"Song Liu1-67/+51
This reverts commit f046f5d0d79cdb968f219ce249e497fd1accf484. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: improve raid10 discard request"Song Liu1-255/+1
This reverts commit bcc90d280465ebd51ab8688be86e1f00c62dccf9. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: improve discard request for far layout"Song Liu2-64/+23
This reverts commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "dm raid: remove unnecessary discard limits for raid10"Song Liu1-0/+11
This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-07bcache: fix race between setting bdev state to none and new write request ↵Dongsheng Yang2-9/+9
direct to backing There is a race condition in detaching as below: A. detaching B. Write request (1) writing back (2) write back done, set bdev state to clean. (3) cached_dev_put() and schedule_work(&dc->detach); (4) write data [0 - 4K] directly into backing and ack to user. (5) power-failure... When we restart this bcache device, this bdev is clean but not detached, and read [0 - 4K], we will get unexpected old data from cache device. To fix this problem, set the bdev state to none when we writeback done in detaching, and then if power-failure happened as above, the data in cache will not be used in next bcache device starting, it's detached, we will read the correct data from backing derectly. Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-05dm crypt: export sysfs of kcryptd workqueueJeffle Xu1-3/+4
It should be helpful to export sysfs of "kcryptd" workqueue in some cases, such as setting specific CPU affinity of the workqueue. Besides, also tweak the name format a little. The slash inside a directory name will be translate into exclamation mark, such as /sys/devices/virtual/workqueue/'kcryptd!253:0'. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm ioctl: fix error return code in target_messageQinglang Miao1-0/+1
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 2ca4c92f58f9 ("dm ioctl: prevent empty message") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm crypt: Constify static crypt_iv_operationsRikard Falkeborn1-3/+3
The only usage of these structs is to assign their address to the iv_gen_ops field in the crypt config struct, which is a pointer to const. Make them const like the rest of the static crypt_iv_operations structs. This allows the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm: add support for REQ_NOWAIT to various targetsJeffle Xu4-1/+4
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added a new queue flag QUEUE_FLAG_NOWAIT to advertise if the bdev supports handling of REQ_NOWAIT or not. DM core supports stacking QUEUE_FLAG_NOWAIT since commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target"), in which only dm-linear enabled it. Update others DM targets, which just do simple remapping, to enable support for REQ_NOWAIT. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm: rename multipath path selector source files to have "dm-ps" prefixMike Snitzer6-7/+12
Additional prefix helps clarify that these source files implement path selectors. Required updating Makefile to still build modules _without_ the "dm-ps" prefix to preserve dm-multipath's ability to autoload path selector modules. While at it, cleaned up some DM whitespace in Makefile. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm mpath: add IO affinity path selectorMike Christie3-0/+282
This patch adds a path selector that selects paths based on a CPU to path mapping the user passes in and what CPU we are executing on. The primary user for this PS is where the app is optimized to use specific CPUs so other PSs undo the apps handy work, and the storage and it's transport are not a bottlneck. For these io-affinity PS setups a path's transport/interconnect perf is not going to flucuate a lot and there is no major differences between paths, so QL/HST smarts do not help and RR always messes up what the app is trying to do. On a system with 16 cores, where you have a job per CPU: fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=16 and a dm-multipath device setup where each CPU is mapped to one path: // When in mq mode I had to set dm_mq_nr_hw_queues=$NUM_PATHS. // Bio mode also showed similar results. 0 16777216 multipath 0 0 1 1 io-affinity 0 16 1 8:16 1 8:32 2 8:64 4 8:48 8 8:80 10 8:96 20 8:112 40 8:128 80 8:144 100 8:160 200 8:176 400 8:192 800 8:208 1000 8:224 2000 8:240 4000 65:0 8000 we can see a IOPs increase of 25%. The percent increase depends on the device and interconnect. For a slower/medium speed path/device that can do around 180K IOPs a path if you ran that fio command to it directly we saw a 25% increase like above. Slower path'd devices that could do around 90K per path showed maybe around a 2 - 5% increase. If you use something like null_blk or scsi_debug which can multi-million IOPs and hack it up so each device they export shows up as a path then you see 50%+ increases. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm verity: Add support for signature verification with 2nd keyringMickaël Salaün2-3/+19
Add a new configuration DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING to enable dm-verity signatures to be verified against the secondary trusted keyring. Instead of relying on the builtin trusted keyring (with hard-coded certificates), the second trusted keyring can include certificate authorities from the builtin trusted keyring and child certificates loaded at run time. Using the secondary trusted keyring enables to use dm-verity disks (e.g. loop devices) signed by keys which did not exist at kernel build time, leveraging the certificate chain of trust model. In practice, this makes it possible to update certificates without kernel update and reboot, aligning with module and kernel (kexec) signature verification which already use the secondary trusted keyring. Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05dm: remove unnecessary current->bio_list check when submitting split bioJeffle Xu1-1/+1
The depth-first splitting is introduced in commit 18a25da84354 ("dm: ensure bio submission follows a depth-first tree walk"), which is used to fix the potential deadlock in case of the misordering handling of bios caused by bio_list. There're two paths submitting split bios, dm_wq_work() from worker thread and submit_bio() from application. Back upon that time, dm_wq_work() thread calls __split_and_process_bio() directly and thus will not trigger this issue since bio_list doesn't exist here. So this issue will only be triggered from application calling submit_bio(), and the fix has to check if current->bio_list is non-NULL to distinguish this case. However since commit 0c2915b8c6db1 ("dm: fix missing imposition of queue_limits from dm_wq_work() thread"), dm_wq_work() thread calls submit_bio_noacct() and thus also uses bio_list. Since then all entries into __split_and_process_bio() are under protection of bio_list, and thus the checking of current->bio_list when determinning if the depth-first principle should be used, seems kind of nonsense. After all the checking always succeeds now. Fixes: 0c2915b8c6db1 ("dm: fix missing imposition of queue_limits from dm_wq_work() thread") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-04dm: remove invalid sparse __acquires and __releases annotationsMike Snitzer1-2/+0
Fixes sparse warnings: drivers/md/dm.c:508:12: warning: context imbalance in 'dm_prepare_ioctl' - wrong count at exit drivers/md/dm.c:543:13: warning: context imbalance in 'dm_unprepare_ioctl' - wrong count at exit Fixes: 971888c46993f ("dm: hold DM table for duration of ioctl rather than use blkdev_get") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-04dm: fix double RCU unlock in dm_dax_zero_page_range() error pathMike Snitzer1-2/+0
Remove redundant dm_put_live_table() in dm_dax_zero_page_range() error path to fix sparse warning: drivers/md/dm.c:1208:9: warning: context imbalance in 'dm_dax_zero_page_range' - unexpected unlock Fixes: cdf6cdcd3b99a ("dm,dax: Add dax zero_page_range operation") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-04dm: fix IO splittingMike Snitzer2-13/+11
Commit 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") caused a couple regressions: 1) Using lcm_not_zero() when stacking chunk_sectors was a bug because chunk_sectors must reflect the most limited of all devices in the IO stack. 2) DM targets that set max_io_len but that do _not_ provide an .iterate_devices method no longer had there IO split properly. And commit 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") also caused a regression where DM no longer supported varied (per target) IO splitting. The implication being the potential for severely reduced performance for IO stacks that use a DM target like dm-cache to hide performance limitations of a slower device (e.g. one that requires 4K IO splitting). Coming full circle: Fix all these issues by discontinuing stacking chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(), add optional chunk_sectors override argument to blk_max_size_offset() and update DM's max_io_len() to pass ti->max_io_len to its blk_max_size_offset() call. Passing in an optional chunk_sectors override to blk_max_size_offset() allows for code reuse of block's centralized calculation for max IO size based on provided offset and split boundary. Fixes: 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") Fixes: 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") Cc: stable@vger.kernel.org Reported-by: John Dorminy <jdorminy@redhat.com> Reported-by: Bruce Johnston <bjohnsto@redhat.com> Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: John Dorminy <jdorminy@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk>
2020-12-04block: remove the request_queue to argument request based tracepointsChristoph Hellwig1-1/+1
The request_queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04block: remove the request_queue argument to the block_bio_remap tracepointChristoph Hellwig7-25/+18
The request_queue can trivially be derived from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04block: remove the request_queue argument to the block_split tracepointChristoph Hellwig1-1/+1
The request_queue can trivially be derived from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: stop using bdget_disk for partition 0Christoph Hellwig1-14/+2
We can just dereference the point in struct gendisk instead. Also remove the now unused export. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: switch partition lookup to use struct block_deviceChristoph Hellwig3-6/+6
Use struct block_device to lookup partitions on a disk. This removes all usage of struct hd_struct from the I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: allocate struct hd_struct as part of struct bdev_inodeChristoph Hellwig2-3/+3
Allocate hd_struct together with struct block_device to pre-load the lifetime rule changes in preparation of merging the two structures. Note that part0 was previously embedded into struct gendisk, but is a separate allocation now, and already points to the block_device instead of the hd_struct. The lifetime of struct gendisk is still controlled by the struct device embedded in the part0 hd_struct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: remove the nr_sects field in struct hd_structChristoph Hellwig1-1/+1
Now that the hd_struct always has a block device attached to it, there is no need for having two size field that just get out of sync. Additionally the field in hd_struct did not use proper serialization, possibly allowing for torn writes. By only using the block_device field this problem also gets fixed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: remove i_bdevChristoph Hellwig2-17/+12
Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02block: add a bdev_kobj helperChristoph Hellwig2-8/+3
Add a little helper to find the kobject for a struct block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02dm: remove the block_device reference in struct mapped_deviceChristoph Hellwig2-13/+14
Get rid of the long-lasting struct block_device reference in struct mapped_device. The only remaining user is the freeze code, where we can trivially look up the block device at freeze time and release the reference at thaw time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02dm: simplify flush_bio initialization in __send_empty_flushChristoph Hellwig1-9/+3
We don't really need the struct block_device to initialize a bio. So switch from using bio_set_dev to manually setting up bi_disk (bi_partno will always be zero and has been cleared by bio_init already). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02fs: simplify freeze_bdev/thaw_bdevChristoph Hellwig2-19/+6
Store the frozen superblock in struct block_device to avoid the awkward interface that can return a sb only used a cookie, an ERR_PTR or NULL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01dm writecache: remove BUG() and fail gracefully insteadMike Snitzer1-1/+1
Building on arch/s390/ results in this build error: cc1: some warnings being treated as errors ../drivers/md/dm-writecache.c: In function 'persistent_memory_claim': ../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type] Fix this by replacing the BUG() with an -EOPNOTSUPP return. Fixes: 48debafe4f2f ("dm: add writecache target") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01dm table: Remove BUG_ON(in_interrupt())Thomas Gleixner1-6/+0
The BUG_ON(in_interrupt()) in dm_table_event() is a historic leftover from a rework of the dm table code which changed the calling context. Issuing a BUG for a wrong calling context is frowned upon and in_interrupt() is deprecated and only covering parts of the wrong contexts. The sanity check for the context is covered by CONFIG_DEBUG_ATOMIC_SLEEP and other debug facilities already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01dm: fix bug with RCU locking in dm_blk_report_zonesSergei Shtepa1-2/+4
The dm_get_live_table() function makes RCU read lock so dm_put_live_table() must be called even if dm_table map is not found. Fixes: e76239a3748c9 ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01Revert "dm cache: fix arm link errors with inline"Nick Desaulniers1-4/+0
This reverts commit 43aeaa29573924df76f44eda2bbd94ca36e407b5. Since commit 0bddd227f3dc ("Documentation: update for gcc 4.9 requirement") the minimum supported version of GCC is gcc-4.9. It's now safe to remove this code. Link: https://github.com/ClangBuiltLinux/linux/issues/427 Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-30md/cluster: fix deadlock when node is doing resync jobZhao Heming2-31/+42
md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg. During sending msg, node can concurrently receive msg from another node. When node does resync job, grab token_lockres:EX may trigger a deadlock: ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever ``` Explaination: a. A send METADATA_UPDATED This will block another node to send msg b. B does sync jobs, which will send RESYNCING at intervals. This will be block for holding token_lockres:EX lock. c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. d. B recv METADATA_UPDATED msg, which send from A in step <a>. This will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") In that commit, step c is "update sb". This patch step c is "mdadm --remove". For fixing this issue, we can refer the solution of function: metadata_update_start. Which does the same grab lock_token action. lock_comm can use the same steps to avoid deadlock. By moving MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock. Repro steps (I only triggered 3 times with hundreds tests): two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` test script will hung when executing "mdadm --remove". ``` # dump stacks by "echo t > /proc/sysrq-trigger" md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` At last, thanks for Xiao's solution. Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Suggested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>