path: root/drivers/md
Age | Commit message | Author | Files | Lines
2020-05-21dm zoned: separate random and cache zonesHannes Reinecke4-67/+159
Instead of lumping emulated zones together with random zones, handle them as separate 'cache' zones. This improves code readability and allows an easier implementation of different cache policies. Add allocation flags to separate the zone type (cache, random, or sequential) from the purpose of the allocation (e.g. reclaim). Also switch the allocation policy to not use random zones as buffer zones if cache zones are present; this avoids a performance drop when all cache zones are used. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
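As an aside, a minimal sketch of how allocation flags can keep the zone type separate from the allocation purpose; the flag names mirror the commit's description, but the exact identifiers, bit values and the dmz_alloc_zone()/dmz_nr_cache_zones() helpers are assumptions rather than the actual dm-zoned code:

    /* Illustrative only: flag names, bit values and helpers are assumptions. */
    enum {
            DMZ_ALLOC_RND     = 1 << 0,   /* type: random zone */
            DMZ_ALLOC_CACHE   = 1 << 1,   /* type: cache zone */
            DMZ_ALLOC_SEQ     = 1 << 2,   /* type: sequential zone */
            DMZ_ALLOC_RECLAIM = 1 << 16,  /* purpose: allocation on behalf of reclaim */
    };

    /* Buffer-zone policy sketched in the commit: prefer cache zones and fall
     * back to random zones only when no cache zones are configured at all. */
    static struct dm_zone *alloc_buffer_zone(struct dmz_metadata *zmd)
    {
            if (dmz_nr_cache_zones(zmd))
                    return dmz_alloc_zone(zmd, DMZ_ALLOC_CACHE);
            return dmz_alloc_zone(zmd, DMZ_ALLOC_RND);
    }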
2020-05-21dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zoneHannes Reinecke2-4/+4
The only case where dmz_get_zone_for_reclaim() cannot return a zone is if the respective lists are empty. So we should just return a simple NULL value here as we really don't have an error code which would make sense. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-21dm zoned: Avoid 64-bit division error in dmz_fixup_devicesNathan Chancellor1-2/+3
When building arm32 allyesconfig: ld.lld: error: undefined symbol: __aeabi_uldivmod >>> referenced by dm-zoned-target.c >>> md/dm-zoned-target.o:(dmz_ctr) in archive drivers/built-in.a dmz_fixup_devices uses DIV_ROUND_UP with variables of type sector_t. As such, it should be using DIV_ROUND_UP_SECTOR_T, which handles this automatically. Fixes: 70978208ec91 ("dm zoned: metadata version 2") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
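For context, a simplified paraphrase of the two macros involved (not the verbatim kernel definitions): plain DIV_ROUND_UP emits a native 64-bit division, which 32-bit ARM lowers to the libgcc helper __aeabi_uldivmod that the kernel does not provide, while DIV_ROUND_UP_SECTOR_T routes sector_t values through the do_div()-based DIV_ROUND_UP_ULL helper on 32-bit builds.

    /* Simplified paraphrase of include/linux/kernel.h - not the exact definitions. */
    #define DIV_ROUND_UP(n, d)            (((n) + (d) - 1) / (d))   /* plain C division */

    #if BITS_PER_LONG == 32
    # define DIV_ROUND_UP_SECTOR_T(ll, d) DIV_ROUND_UP_ULL(ll, d)   /* do_div()-based, no libgcc */
    #else
    # define DIV_ROUND_UP_SECTOR_T(ll, d) DIV_ROUND_UP(ll, d)
    #endif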
2020-05-21dm: use DMDEBUG macros now that they use pr_debug variantsMike Snitzer2-7/+7
Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses pr_debug_ratelimited, clean up DM's two direct pr_debug callers to use them and get the benefit of consistent DM_FMT formatting of debugging messages. While doing so, dm-mpath.c:dm_report_EIO() was switched over to DMDEBUG_LIMIT due to the potential for error handling floods in the IO completion path. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-21dm zoned: remove spurious newlines from debugging messagesHannes Reinecke2-4/+4
DMDEBUG will already add a newline to the logging messages, so we shouldn't be adding it to the message itself. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-21dm: replace zero-length array with flexible-arrayGustavo A. R. Silva9-9/+9
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
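The conversion the patch applies, shown on a stand-alone example (struct and field names are illustrative); the allocation pattern is unchanged because sizeof(struct foo) still excludes the trailing array, and the kernel additionally offers the struct_size() helper for overflow-safe sizing of such allocations:

    #include <stdlib.h>

    struct boo { int x; };

    /* Before: zero-length array, a GNU extension to C90. */
    struct foo_old {
            int        stuff;
            struct boo array[0];
    };

    /* After: C99 flexible array member; must be the last member of the struct. */
    struct foo {
            int        stuff;
            struct boo array[];
    };

    /* sizeof(struct foo) covers only 'stuff' (plus padding), so the trailing
     * elements are still sized explicitly at allocation time. */
    static struct foo *foo_alloc(size_t n)
    {
            return malloc(sizeof(struct foo) + n * sizeof(struct boo));
    }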
2020-05-21dm zoned: metadata version 2Hannes Reinecke3-102/+400
Implement handling for metadata version 2. The new metadata adds a label and UUID for the device mapper device, and additional UUID for the underlying block devices. It also allows for an additional regular drive to be used for emulating random access zones. The emulated zones will be placed logically in front of the zones from the zoned block device, causing the superblocks and metadata to be stored on that device. The first zone of the original zoned device will be used to hold another, tertiary copy of the metadata; this copy carries a generation number of 0 and is never updated; it's just used for identification. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20dm zoned: ignore metadata zone in dmz_alloc_zone()Hannes Reinecke1-0/+6
When looking up zones in dmz_alloc_zone() we need to ignore metadata zones so as not to accidentally overwrite metadata. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20dm zoned: Reduce logging output on startupHannes Reinecke1-12/+12
dm-zoned is becoming quite chatty during startup; reduce the noise by moving some information to 'debug' level. Suggested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-20dm zoned: add metadata logging functionsHannes Reinecke1-39/+57
Use the metadata label for logging and not the underlying device. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19dm zoned: use dmz_zone_to_dev() when handling metadata I/OHannes Reinecke1-5/+7
Use accessors to retrieve the device pointer in preparation for adding an additional block device. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19dm zoned: replace 'target' pointer in the bio contextHannes Reinecke1-20/+24
Replace the 'target' pointer in the bio context with the device pointer as this is what's actually used. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19dm zoned: remove 'dev' argument from reclaimHannes Reinecke3-30/+32
Use the dmz_zone_to_dev() mapping function to remove the 'dev' argument from reclaim. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-19blk-mq: allow blk_mq_make_request to consume the q_usage_counter referenceChristoph Hellwig1-1/+10
blk_mq_make_request currently needs to grab a q_usage_counter reference when allocating a request. This is because the block layer grabs one before calling blk_mq_make_request, but also releases it as soon as blk_mq_make_request returns. Remove the blk_queue_exit call after blk_mq_make_request returns and instead let it consume the reference. This works perfectly fine for the block layer caller; only device mapper needs an extra reference, as the old problem still persists there. Open code blk_queue_enter_live in device mapper, as there should be no other callers and this allows better documenting why we do a non-try get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-15dm zoned: Introduce dmz_dev_is_dying() and dmz_check_dev()Hannes Reinecke4-5/+18
Introduce accessors dmz_dev_is_dying() and dmz_check_dev() to avoid having to reference the devices directly. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: introduce dmz_metadata_label() to format device nameHannes Reinecke4-42/+62
Introduce dmz_metadata_label() to format the device-mapper device name and use it instead of the device name of the underlying device. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: move fields from struct dmz_dev to dmz_metadataHannes Reinecke4-63/+95
Move fields from the device structure into the metadata structure and provide accessor functions. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: store device in struct dmz_sbHannes Reinecke1-31/+59
Store the device together with the superblock so that we don't have to go back to the metadata to find it. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: use array for superblock zonesHannes Reinecke1-16/+25
Instead of storing just the first superblock zone and calculating the secondary one relative to it, use an array to hold the superblock zones. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: store zone id within the zone structure and kill dmz_id()Hannes Reinecke4-36/+31
Instead of calculating the zone index from the offset within the zone array, store the index within the structure itself. With that, the helper dmz_id() is pointless and can be replaced with accessing the ->id value directly. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm zoned: add 'message' callbackHannes Reinecke1-0/+15
Add a callback for 'dmsetup message' to allow the reclaim process to be triggered manually. E.g. 'dmsetup message /dev/dm-X 0 message' will start the reclaim process even if the default threshold of 50 percent of free random zones is not reached. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
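A hedged sketch of what such a message hook can look like; the prototype follows the 5.7-era dm_message_fn typedef, while the dmz_target layout and the dmz_schedule_reclaim() helper are assumptions used purely for illustration:

    #include <linux/device-mapper.h>

    /* Sketch only: private-data layout and reclaim helper are assumed. */
    static int dmz_message(struct dm_target *ti, unsigned int argc, char **argv,
                           char *result, unsigned int maxlen)
    {
            struct dmz_target *dmz = ti->private;

            if (argc == 1 && !strcasecmp(argv[0], "message")) {
                    /* Kick reclaim even if the free-zone threshold is not reached. */
                    dmz_schedule_reclaim(dmz->reclaim);
                    return 0;
            }
            return -EINVAL;
    }

    /* Hooked up via the target_type table: .message = dmz_message, */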
2020-05-15dm zoned: add 'status' callbackHannes Reinecke3-0/+44
Add a callback to supply information for 'dmsetup status' and 'dmsetup table'. The output of 'dmsetup status' is: 0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential where <nr_unmap_rnd> is the number of unmapped (i.e. free) random zones, <nr_rnd> the total number of random zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq> the total number of sequential zones. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
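Similarly, a hedged sketch of a status hook that emits the line quoted above through the DMEMIT() helper; the dmz_nr_*() accessors and the dmz_target layout are assumptions:

    /* Sketch only: DMEMIT() appends formatted text to 'result' up to 'maxlen'. */
    static void dmz_status(struct dm_target *ti, status_type_t type,
                           unsigned int status_flags, char *result,
                           unsigned int maxlen)
    {
            struct dmz_target *dmz = ti->private;
            int sz = 0;

            switch (type) {
            case STATUSTYPE_INFO:
                    DMEMIT("%u zones %u/%u random %u/%u sequential",
                           dmz_nr_zones(dmz->metadata),
                           dmz_nr_unmap_rnd_zones(dmz->metadata),
                           dmz_nr_rnd_zones(dmz->metadata),
                           dmz_nr_unmap_seq_zones(dmz->metadata),
                           dmz_nr_seq_zones(dmz->metadata));
                    break;
            case STATUSTYPE_TABLE:
                    DMEMIT("%s", dmz->dev_name);    /* constructor arguments (assumed field) */
                    break;
            }
    }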
2020-05-15dm mpath: add Historical Service Time Path SelectorKhazhismel Kumykov3-0/+573
This new selector keeps an exponential moving average of the service time for each path (loosely defined as the delta between start_io and end_io), and uses this along with the number of inflight requests to estimate future service time for a path. Since we don't have a prober to account for temporarily slow paths, re-try "slow" paths every once in a while (num_paths * historical_service_time). To account for fast paths transitioning to slow, if a path has not completed any request within (num_paths * historical_service_time), limit the number of outstanding requests. To account for low volume situations where the number of inflight IOs would be zero, the last finish time of each path is factored in. Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
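The core idea, an exponential moving average of per-path service time combined with the inflight count, in a small self-contained sketch; the smoothing weight and the nanosecond unit are illustrative choices, not the selector's actual tuning:

    #include <stdint.h>

    #define HST_WEIGHT 64   /* assumed smoothing factor */

    struct path_stats {
            uint64_t historical_service_time_ns;   /* the moving average */
            uint64_t inflight;                      /* requests currently outstanding */
    };

    /* EMA update: avg = (avg * (w - 1) + sample) / w */
    static void hst_complete_io(struct path_stats *p, uint64_t start_ns, uint64_t end_ns)
    {
            uint64_t sample = end_ns - start_ns;    /* "service time" of this I/O */

            p->historical_service_time_ns =
                    (p->historical_service_time_ns * (HST_WEIGHT - 1) + sample) /
                    HST_WEIGHT;
            p->inflight--;
    }

    /* Selection favours the path with the smallest predicted completion time. */
    static uint64_t hst_predict(const struct path_stats *p)
    {
            return (p->inflight + 1) * p->historical_service_time_ns;
    }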
2020-05-15dm mpath: pass IO start time to path selectorGabriel Krisman Bertazi5-6/+18
The HST path selector needs this information to perform path prediction. For request-based mpath, struct request's io_start_time_ns is used, while for bio-based, use the start_time stored in dm_io. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm writecache: improve performance on DDR persistent memory (Optane)Mikulas Patocka1-1/+37
When testing the dm-writecache target on a real DDR persistent memory (Intel Optane), it turned out that explicit cache flushing using the clflushopt instruction performs better than non-temporal stores for block sizes 1k, 2k and 4k. The dm-writecache target is singlethreaded (all the copying is done while holding the writecache lock), so it benefits from clwb, see: http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com Add a new function memcpy_flushcache_optimized() that tests if clflushopt is present - and if it is, we use it instead of memcpy_flushcache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
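A hedged sketch of the shape of such a helper, closely following the commit's description; static_cpu_has(), X86_FEATURE_CLFLUSHOPT, clflushopt() and memcpy_flushcache() are existing x86/kernel interfaces, while the 768-byte cutoff is an assumption about the tuning:

    /* Sketch: x86 fast path with clflushopt, falling back to memcpy_flushcache().
     * dm-writecache block sizes are powers of two >= 512, so no partial cache
     * line remains after the loop. */
    static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
    {
    #ifdef CONFIG_X86
            if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
                likely(boot_cpu_data.x86_clflush_size == 64) &&
                likely(size >= 768)) {                  /* assumed threshold */
                    do {
                            memcpy(dest, source, 64);
                            clflushopt(dest);           /* flush the just-written line */
                            dest += 64;
                            source += 64;
                            size -= 64;
                    } while (size >= 64);
                    return;
            }
    #endif
            memcpy_flushcache(dest, source, size);
    }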
2020-05-15dm writecache: remove superfluous test in persistent_memory_claimMikulas Patocka1-4/+0
Remove superfluous test if dax_dev is NULL - dax_direct_access already does this test. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm persistent data: switch exit_ro_spine to return voidZhiqiang Liu2-5/+3
In commit 4c7da06f5a78 ("dm persistent data: eliminate unnecessary return values"), r value in exit_ro_spine will not change, so exit_ro_spine doesn't need a return value. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm integrity: remove set but not used variablesYueHaibing1-4/+0
Fixes gcc '-Wunused-but-set-variable' warning: drivers/md/dm-integrity.c: In function 'integrity_metadata': drivers/md/dm-integrity.c:1557:12: warning: variable 'save_metadata_offset' set but not used [-Wunused-but-set-variable] drivers/md/dm-integrity.c:1556:12: warning: variable 'save_metadata_block' set but not used [-Wunused-but-set-variable] They are never used, so remove them. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm ebs: pass discards down to underlying deviceHeinz Mauelshagen1-7/+34
Make use of dm_bufio_issue_discard() to pass discards down to the underlying device. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm bufio: implement discardMikulas Patocka1-5/+64
Add functions dm_bufio_issue_discard and dm_bufio_discard_buffers. dm_bufio_issue_discard sends discard request to the underlying device. dm_bufio_discard_buffers frees buffers in the range and then calls dm_bufio_issue_discard. Also, factor out block_to_sector for reuse in dm_bufio_issue_discard. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm: add emulated block size targetHeinz Mauelshagen3-0/+454
This new target is similar to the linear target except that it emulates a smaller logical block size on a device with a larger logical block size. Its main purpose is to emulate 512 byte sectors on 4K native disks (i.e. 512e). See Documentation/admin-guide/device-mapper/dm-ebs.rst for details. Reviewed-by: Damien Le Moal <DamienLeMoal@wdc.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> [Kconfig fixes] Signed-off-by: Zheng Bin <zhengbin13@huawei.com> [static fixes] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm mpath: switch paths in dm_blk_ioctl() code pathMartin Wilck1-1/+1
SCSI LUN passthrough code such as qemu's "scsi-block" device model passes every IO to the host via SG_IO ioctls. Currently, dm-multipath calls choose_pgpath() only in the block IO code path, not in the ioctl code path (unless current_pgpath is NULL). This has the effect that no path switching and thus no load balancing is done for SCSI-passthrough IO, unless the active path fails. Fix this by using the same logic in multipath_prepare_ioctl() as in multipath_clone_and_map(). Note: The allegedly best path selection algorithm, service-time, still wouldn't work perfectly, because the io size of the current request is always set to 0. Changing that for the IO passthrough case would require the ioctl cmd and arg to be passed to dm's prepare_ioctl() method. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-15dm crypt: support using encrypted keysDmitry Baryshkov2-19/+58
Allow one to use "encrypted" in addition to "user" and "logon" key types for device encryption. Signed-off-by: Dmitry Baryshkov <dmitry_baryshkov@mentor.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-05-14block: Inline encryption support for blk-mqSatya Tangirala1-0/+3
We must have some way of letting a storage device driver know what encryption context it should use for en/decrypting a request. However, it's the upper layers (like the filesystem/fscrypt) that know about and manages encryption contexts. As such, when the upper layer submits a bio to the block layer, and this bio eventually reaches a device driver with support for inline encryption, the device driver will need to have been told the encryption context for that bio. We want to communicate the encryption context from the upper layer to the storage device along with the bio, when the bio is submitted to the block layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can represent an encryption context (note that we can't use the bi_private field in struct bio to do this because that field does not function to pass information across layers in the storage stack). We also introduce various functions to manipulate the bio_crypt_ctx and make the bio/request merging logic aware of the bio_crypt_ctx. We also make changes to blk-mq to make it handle bios with encryption contexts. blk-mq can merge many bios into the same request. These bios need to have contiguous data unit numbers (the necessary changes to blk-merge are also made to ensure this) - as such, it suffices to keep the data unit number of just the first bio, since that's all a storage driver needs to infer the data unit number to use for each data block in each bio in a request. blk-mq keeps track of the encryption context to be used for all the bios in a request with the request's rq_crypt_ctx. When the first bio is added to an empty request, blk-mq will program the encryption context of that bio into the request_queue's keyslot manager, and store the returned keyslot in the request's rq_crypt_ctx. All the functions to operate on encryption contexts are in blk-crypto.c. Upper layers only need to call bio_crypt_set_ctx with the encryption key, algorithm and data_unit_num; they don't have to worry about getting a keyslot for each encryption context, as blk-mq/blk-crypto handles that. Blk-crypto also makes it possible for request-based layered devices like dm-rq to make use of inline encryption hardware by cloning the rq_crypt_ctx and programming a keyslot in the new request_queue when necessary. Note that any user of the block layer can submit bios with an encryption context, such as filesystems, device-mapper targets, etc. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
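From the submitting layer's point of view the interface reduces to attaching a context to the bio before submission; a hedged sketch follows, with the caveat that the exact blk-crypto prototypes have changed across kernel releases, so treat the signatures as approximate and the key setup as already done elsewhere:

    #include <linux/bio.h>
    #include <linux/blk-crypto.h>

    /* Sketch: attach an encryption context to a bio before submitting it.
     * 'key' is a struct blk_crypto_key the upper layer has already initialised;
     * 'first_dun' is the data unit number of the first data block in the bio. */
    static void submit_encrypted_bio(struct bio *bio,
                                     const struct blk_crypto_key *key,
                                     u64 first_dun)
    {
            u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

            bio_crypt_set_ctx(bio, key, dun, GFP_NOIO);
            submit_bio(bio);    /* blk-mq/blk-crypto handle keyslot programming */
    }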
2020-05-13md/raid1: Replace zero-length array with flexible-arrayGustavo A. R. Silva3-3/+3
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: add a newline when printing parameter 'start_ro' by sysfsXiongfeng Wang1-1/+1
Add a missing newline when printing module parameter 'start_ro' by sysfs. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
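The class of fix involved, sketched; the handler and variable names follow md.c but may differ slightly from the actual patch:

    /* Sketch: module-parameter 'show' output read via sysfs should end in '\n'. */
    static int get_ro(char *buffer, const struct kernel_param *kp)
    {
            return sprintf(buffer, "%d\n", start_readonly);
    }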
2020-05-13md: stop using ->queuedataChristoph Hellwig1-2/+1
Pointer to mddev is already available in private_data. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md/raid1: release pending accounting for an I/O only after write-behind is also finishedDavid Jeffery1-6/+7
When using RAID1 and write-behind, md can deadlock when errors occur. With write-behind, r1bio structs can be accounted by raid1 as queued but not counted as pending. The pending count is dropped when the original bio is returned complete but write-behind for the r1bio may still be active. This breaks the accounting used in some conditions to know when the raid1 md device has reached an idle state. It can result in calls to freeze_array deadlocking. freeze_array will never complete from a negative "unqueued" value being calculated due to a queued count larger than the pending count. To properly account for write-behind, move the call to allow_barrier from call_bio_endio to raid_end_bio_io. When using write-behind, md can call call_bio_endio before all write-behind I/O is complete. Using raid_end_bio_io for the point to call allow_barrier will release the pending count at a point where all I/O for an r1bio, even write-behind, is done. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: remove redundant memalloc scope API usageColy Li1-4/+4
In mddev_create_serial_pool(), the memalloc scope APIs memalloc_noio_save() and memalloc_noio_restore() are used when allocating memory by calling mempool_create_kmalloc_pool(). After adding the memalloc scope APIs to the raid array suspend context, it is unnecessary to explicitly call them around mempool_create_kmalloc_pool() any longer. This patch removes the redundant memalloc scope APIs in mddev_create_serial_pool(). Signed-off-by: Coly Li <colyli@suse.de> Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13raid5: update code comment of scribble_alloc()Coly Li1-2/+5
The code comments of scribble_alloc() have been outdated for a while. This patch updates the comments in the function header to match the new parameter list. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13raid5: remove gfp flags from scribble_alloc()Coly Li1-6/+9
Using the GFP_NOIO flag to call scribble_alloc() from resize_chunk() does not have the expected behavior: kvmalloc_array() inside scribble_alloc(), which receives the GFP_NOIO flag, will eventually call kmalloc_node() to allocate physically contiguous pages. Now that we have memalloc scope APIs in mddev_suspend()/mddev_resume() to prevent memory reclaim I/Os during the raid array suspend context, calling kvmalloc_array() with the GFP_KERNEL flag avoids the deadlock from recursive I/O as expected. This patch removes the useless gfp flags from the parameter list of scribble_alloc() and calls kvmalloc_array() with the GFP_KERNEL flag. The incorrect GFP_NOIO flag does not exist anymore. Fixes: b330e6a49dc3 ("md: convert to kvmalloc") Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: use memalloc scope APIs in mddev_suspend()/mddev_resume()Coly Li2-0/+5
In raid5.c:resize_chunk(), scribble_alloc() is called with the GFP_NOIO flag, which is then passed to kvmalloc_array() inside scribble_alloc(). The problem is that kvmalloc_array() eventually calls kvmalloc_node(), which does not accept non-GFP_KERNEL-compatible flags like GFP_NOIO, so kmalloc_node() is called instead to allocate physically contiguous pages. When system memory is under heavy pressure and the requested size is large, there is a high probability that allocating contiguous pages will fail. But simply using the GFP_KERNEL flag to call kvmalloc_array() is also problematic: in the code path where scribble_alloc() is called, the raid array is suspended, and if kvmalloc_node() triggers memory reclaim I/Os which go back to the suspended raid array, a deadlock will happen. What is desired here is to allocate non-physically (a.k.a. virtually) contiguous pages and avoid memory reclaim I/Os. Michal Hocko suggests using the memalloc scope APIs to restrict memory reclaim I/O in the allocating context, specifically to call memalloc_noio_save() when suspending the raid array and memalloc_noio_restore() when resuming it. This patch adds the memalloc scope APIs in mddev_suspend() and mddev_resume() to restrict memory reclaim I/Os while the raid array is suspended. The benefit of adding the memalloc scope APIs in the unified entry points mddev_suspend()/mddev_resume() is that, no matter which md raid array type (personality) is used, we are sure the deadlock caused by recursive memory reclaim I/O won't happen in the suspending context. Please note that the memalloc scope APIs only take effect in the raid array suspending context; if memory is allocated from another kthread created after the raid array is suspended, the recursive memory reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are used for the critical section where the raid metadata is being modified; creating a kthread to allocate memory inside that critical section is odd and very probably buggy. Fixes: b330e6a49dc3 ("md: convert to kvmalloc") Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
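The pattern the patch adds, sketched on the suspend/resume entry points; memalloc_noio_save() and memalloc_noio_restore() are the documented scope APIs, while the noio_flag field name and its placement in struct mddev are assumptions:

    #include <linux/sched/mm.h>

    /* Sketch: forbid reclaim-initiated I/O for the whole suspended section. */
    void mddev_suspend(struct mddev *mddev)
    {
            /* ... existing quiesce logic ... */
            mddev->noio_flag = memalloc_noio_save();    /* assumed field name */
    }

    void mddev_resume(struct mddev *mddev)
    {
            memalloc_noio_restore(mddev->noio_flag);
            /* ... existing resume logic ... */
    }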
2020-05-13md: remove the extra line for ->hot_add_diskGuoqing Jiang1-4/+2
It is not necessary to add a newline for them since they don't exceed 80 characters, and it is not intuitive to distinguish ->hot_add_disk() from hot_add_disk() either. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: flush md_rdev_misc_wq for HOT_ADD_DISK caseGuoqing Jiang1-1/+1
Since rdev->kobj is removed asynchronously, it is possible that the rdev->kobj still exists when trying to add the rdev again after it has been removed. But the path md_ioctl (HOT_ADD_DISK) -> hot_add_disk -> bind_rdev_to_array missed it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: don't flush workqueue unconditionally in md_openGuoqing Jiang1-1/+2
We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> #2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] 
do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: add new workqueue for delete rdevGuoqing Jiang1-8/+29
Since the purpose of calling flush_workqueue in new_dev_store is to ensure md_delayed_delete() has completed, we should check whether rdev->del_work is pending. To suppress the lockdep warning we would have to check mddev->del_work, but md_delayed_delete is attached to rdev->del_work, so that check would not be aligned with the purpose of the flush. A new workqueue is therefore needed to avoid this awkward situation, and a new function flush_rdev_wq is introduced to flush the new workqueue after checking whether there is pending work. Like new_dev_store, the ADD_NEW_DISK ioctl has the same purpose of flushing the workqueue while it holds bdev->bd_mutex, so make the same change in the ioctl to avoid a similar lock issue. And md_delayed_delete actually wants to delete the rdev, so rename the function to rdev_delayed_delete. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
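A hedged sketch of what such a targeted flush can look like; flush_rdev_wq and the per-rdev del_work follow the commit text, while the md_rdev_misc_wq name and the RCU iteration details are assumptions:

    #include <linux/workqueue.h>

    /* Sketch: flush the rdev-deletion workqueue only if some rdev actually has a
     * pending delayed delete, so the flush dependency is not taken needlessly. */
    static void flush_rdev_wq(struct mddev *mddev)
    {
            struct md_rdev *rdev;

            rcu_read_lock();
            rdev_for_each_rcu(rdev, mddev)
                    if (work_pending(&rdev->del_work)) {
                            flush_workqueue(md_rdev_misc_wq);
                            break;
                    }
            rcu_read_unlock();
    }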
2020-05-13md: add checkings before flush md_misc_wqGuoqing Jiang1-2/+4
Coly reported a possible circular locking dependency with LOCKDEP enabled; quoting the info below from the detailed report [1]. [ 1607.673903] Chain exists of: [ 1607.673903] kn->count#256 --> (wq_completion)md_misc --> (work_completion)(&rdev->del_work) [ 1607.673903] [ 1607.827946] Possible unsafe locking scenario: [ 1607.827946] [ 1607.898780] CPU0 CPU1 [ 1607.952980] ---- ---- [ 1608.007173] lock((work_completion)(&rdev->del_work)); [ 1608.069690] lock((wq_completion)md_misc); [ 1608.149887] lock((work_completion)(&rdev->del_work)); [ 1608.242563] lock(kn->count#256); [ 1608.283238] [ 1608.283238] *** DEADLOCK *** [ 1608.283238] [ 1608.354078] 2 locks held by kworker/5:0/843: [ 1608.405152] #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at: process_one_work+0x42b/0xb30 [ 1608.512399] #1: ffff888a1d3b7e10 ((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30 [ 1608.632130] Since the works (rdev->del_work and mddev->del_work) are queued in md_misc_wq, the lockdep_map lock is held while either of them is running, and both of them try to take the kernfs lock by calling kobject_del. If new_dev_store or array_state_store is then triggered by a write to the related sysfs node, the write operation takes the kernfs lock but also needs the lockdep_map lock, because all of them eventually trigger flush_workqueue(md_misc_wq). To suppress the lockdep warning, we should flush the workqueue only in case the related work is pending. Several works are attached to md_misc_wq, so we need to decide which work should be checked: 1. For __md_stop_writes, the purpose of flushing the workqueue is to ensure the sync thread is started if it was starting, so checking whether mddev->del_work is pending is sufficient, since md_start_sync is attached to mddev->del_work. 2. __md_stop flushes md_misc_wq to ensure event_work is done; checking the event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't need the kernfs lock. 3. Both new_dev_store (which holds the kernfs lock) and the ADD_NEW_DISK ioctl (which holds bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has completed; this case will be handled in the next patch. 4. md_open flushes the workqueue to ensure the previous md has disappeared, but it holds bdev->bd_mutex and then tries to flush the workqueue, so it is better to check mddev->del_work as well to avoid a potential lock issue; this will be done in another patch. [1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2 Cc: Coly Li <colyli@suse.de> Reported-by: Coly Li <colyli@suse.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-01Merge tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dmLinus Torvalds3-18/+42
Pull device mapper fixes from Mike Snitzer: - Document DM integrity allow_discard feature that was added during 5.7 merge window. - Fix potential for DM writecache data corruption during DM table reloads. - Fix DM verity's FEC support's hash block number calculation in verity_fec_decode(). - Fix bio-based DM multipath crash due to use of stale copy of MPATHF_QUEUE_IO flag state in __map_bio(). * tag 'for-5.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath dm verity fec: fix hash block number in verity_fec_decode dm writecache: fix data corruption when reloading the target dm integrity: document allow_discard option
2020-04-29dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpathGabriel Krisman Bertazi1-2/+4
When adding devices that don't have a scsi_dh on a BIO based multipath, I was able to consistently hit the warning below and lock-up the system. The problem is that __map_bio reads the flag before it potentially being modified by choose_pgpath, and ends up using the older value. The WARN_ON below is not trivially linked to the issue. It goes like this: The activate_path delayed_work is not initialized for non-scsi_dh devices, but we always set MPATHF_QUEUE_IO, asking for initialization. That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath. Nevertheless, only for BIO-based mpath, we cache the flag before calling choose_pgpath, and use the older version when deciding if we should initialize the path. Therefore, we end up trying to initialize the paths, and calling the non-initialized activate_path work. [ 82.437100] ------------[ cut here ]------------ [ 82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624 __queue_delayed_work+0x71/0x90 [ 82.438436] Modules linked in: [ 82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339 [ 82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90 [ 82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9 94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74 a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07 [ 82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007 [ 82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000 [ 82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200 [ 82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970 [ 82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930 [ 82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008 [ 82.445427] FS: 00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000 [ 82.446040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0 [ 82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 82.448012] Call Trace: [ 82.448164] queue_delayed_work_on+0x6d/0x80 [ 82.448472] __pg_init_all_paths+0x7b/0xf0 [ 82.448714] pg_init_all_paths+0x26/0x40 [ 82.448980] __multipath_map_bio.isra.0+0x84/0x210 [ 82.449267] __map_bio+0x3c/0x1f0 [ 82.449468] __split_and_process_non_flush+0x14a/0x1b0 [ 82.449775] __split_and_process_bio+0xde/0x340 [ 82.450045] ? dm_get_live_table+0x5/0xb0 [ 82.450278] dm_process_bio+0x98/0x290 [ 82.450518] dm_make_request+0x54/0x120 [ 82.450778] generic_make_request+0xd2/0x3e0 [ 82.451038] ? submit_bio+0x3c/0x150 [ 82.451278] submit_bio+0x3c/0x150 [ 82.451492] mpage_readpages+0x129/0x160 [ 82.451756] ? bdev_evict_inode+0x1d0/0x1d0 [ 82.452033] read_pages+0x72/0x170 [ 82.452260] __do_page_cache_readahead+0x1ba/0x1d0 [ 82.452624] force_page_cache_readahead+0x96/0x110 [ 82.452903] generic_file_read_iter+0x84f/0xae0 [ 82.453192] ? __seccomp_filter+0x7c/0x670 [ 82.453547] new_sync_read+0x10e/0x190 [ 82.453883] vfs_read+0x9d/0x150 [ 82.454172] ksys_read+0x65/0xe0 [ 82.454466] do_syscall_64+0x4e/0x210 [ 82.454828] entry_SYSCALL_64_after_hwframe+0x49/0xbe [...] [ 82.462501] ---[ end trace bb39975e9cf45daa ]--- Cc: stable@vger.kernel.org Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
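The essence of the fix, sketched as a before/after pattern: consult MPATHF_QUEUE_IO in m->flags again after choose_pgpath() instead of acting on a copy taken earlier; the surrounding code is condensed and queue_io_and_init_paths() is a stand-in for the real queuing/path-initialisation logic:

    /* Before (buggy pattern): the flag is cached before choose_pgpath() runs. */
    bool queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);

    if (!pgpath || !queue_io)
            pgpath = choose_pgpath(m, mpio->nr_bytes);  /* may clear MPATHF_QUEUE_IO */
    if (pgpath && queue_io)
            queue_io_and_init_paths(m, bio);            /* acts on the stale copy */

    /* After: always consult the live flag, which choose_pgpath() may have cleared. */
    if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
            pgpath = choose_pgpath(m, mpio->nr_bytes);
    if (pgpath && test_bit(MPATHF_QUEUE_IO, &m->flags))
            queue_io_and_init_paths(m, bio);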
2020-04-25block: bypass ->make_request_fn for blk-mq driversChristoph Hellwig1-0/+3
Call blk_mq_make_request when no ->make_request_fn is set. This is safe now that blk_alloc_queue always sets up the pointer for make_request based drivers. This avoids an indirect call in the blk-mq driver I/O fast path, which is rather expensive due to spectre mitigations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>