summaryrefslogtreecommitdiff
path: root/drivers/md/dm.c
AgeCommit message (Collapse)AuthorFilesLines
2016-05-05dm: remove unused mapped_device argument from free_tio()Mike Snitzer1-6/+4
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-04-11dm: fix dm_target_io leak if clone_bio() returns an errorMikulas Patocka1-1/+3
Commit c80914e81ec5b08 ("dm: return error if bio_integrity_clone() fails in clone_bio()") changed clone_bio() such that if it does return error then the alloc_tio() created resources (both the bio that was allocated to be a clone and the containing dm_target_io struct) will leak. Fix this by calling free_tio() in __clone_and_map_data_bio()'s clone_bio() error path. Fixes: c80914e81ec5b08 ("dm: return error if bio_integrity_clone() fails in clone_bio()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-15dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()Bryn M. Reeves1-1/+1
An "old" (.request_fn) DM 'struct request' stores a pointer to the associated 'struct dm_rq_target_io' in rq->special. dm_requeue_original_request(), previously named dm_requeue_unmapped_original_request(), called dm_unprep_request() to reset rq->special to NULL. But rq_end_stats() would go on to hit a NULL pointer deference because its call to tio_from_request() returned NULL. Fix this by calling rq_end_stats() _before_ dm_unprep_request() Signed-off-by: Bryn M. Reeves <bmr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: e262f34741 ("dm stats: add support for request-based DM devices") Cc: stable@vger.kernel.org # 4.2+
2016-03-11dm: return error if bio_integrity_clone() fails in clone_bio()Mike Snitzer1-7/+20
clone_bio() now checks if bio_integrity_clone() returned an error rather than just drop it on the floor. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11dm: drop unnecessary assignment of md->queueBob Liu1-6/+1
md->queue and q are the same thing in dm_old_init_request_queue() and dm_mq_init_request_queue(). Also drop the temporary 'struct request_queue *q' in dm_old_init_request_queue(). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11dm: reorder 'struct mapped_device' members to fix alignment and holesMike Snitzer1-16/+18
Saves 16 bytes by eliminating 4 4byte holes but more importantly: numerous members that crossed cachelines were fixed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11dm: remove dummy definition of 'struct dm_table'Mike Snitzer1-11/+3
Change the map pointer in 'struct mapped_device' from 'struct dm_table __rcu *' to 'void __rcu *' to avoid the need for the dummy definition. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-03-11dm: add 'dm_numa_node' module parameterMike Snitzer1-8/+44
Allows user to control which NUMA node the memory for DM device structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set) is allocated from. Defaults to NUMA_NO_NODE (-1). Allowable range is from -1 until the last online NUMA node id. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-23dm: allow immutable request-based targets to use blk-mq pduMike Snitzer1-7/+26
This will allow DM multipath to use a portion of the blk-mq pdu space for target data (e.g. struct dm_mpath_io). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-23dm: rename target's per_bio_data_size to per_io_data_sizeMike Snitzer1-4/+4
Request-based DM will also make use of per_bio_data_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-23dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DMMike Snitzer1-49/+53
Rename various methods to have either a "dm_old" or "dm_mq" prefix. Improve code comments to assist with understanding the duality of code that handles both "dm_old" and "dm_mq" cases. It is no much easier to quickly look at the code and _know_ that a given method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-23dm: remove support for stacking dm-mq on .request_fn device(s)Mike Snitzer1-39/+16
Remove all fiddley code that propped up this support for a blk-mq request-queue ontop of all .request_fn devices. Testing has proven this niche request-based dm-mq mode to be buggy, when testing fault tolerance with DM multipath, and there is no point trying to preserve it. Should help improve efficiency of pure dm-mq code and make code maintenance less delicate. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-23dm: fix a couple locking issues with use of block interfacesMike Snitzer1-7/+21
old_stop_queue() was checking blk_queue_stopped() without holding the q->queue_lock. dm_requeue_original_request() needed to check blk_queue_stopped(), with q->queue_lock held, before calling blk_mq_kick_requeue_list(). And a side-effect of that change is start_queue() must also call blk_mq_kick_requeue_list(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: allocate blk_mq_tag_set rather than embed in mapped_deviceMike Snitzer1-18/+27
The blk_mq_tag_set is only needed for dm-mq support. There is point wasting space in 'struct mapped_device' for non-dm-mq devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
2016-02-22dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module paramsMike Snitzer1-2/+25
Allow user to change these values via module params or sysfs. 'dm_mq_nr_hw_queues' defaults to 1 (max 32). 'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too small under moderate sized workloads -- the dm-multipath device would continuously block waiting for tags (requests) to become available). The maximum is BLK_MQ_MAX_DEPTH (currently 10240). Keep in mind the total number of pre-allocated requests per request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth' (currently 2048). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: optimize dm_request_fn()Mike Snitzer1-30/+17
DM multipath is the only request-based DM target -- which only supports tables with a single target that is immutable. Leverage this fact in dm_request_fn(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: optimize dm_mq_queue_rq()Mike Snitzer1-22/+18
DM multipath is the only dm-mq target. But that aside, request-based DM only supports tables with a single target that is immutable. Leverage this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in the mapped_device when the table was made active. This saves the need to even take the read-side of the SRCU via dm_{get,put}_live_table. If the active DM table does not have an immutable target (e.g. "error" target was swapped in) then fallback to the slow-path where the target is looked up from the live table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: cleanup dm_any_congested()Mike Snitzer1-9/+8
The request-based DM support for checking queue congestion doesn't require access to the live DM table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: remove unused dm_get_rq_mapinfo()Mike Snitzer1-8/+0
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: fix excessive dm-mq context switchingMike Snitzer1-7/+6
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower than if an underlying null_blk device were used directly. One of the reasons for this drop in performance is that blk_insert_clone_request() was calling blk_mq_insert_request() with @async=true. This forced the use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues which ushered in ping-ponging between process context (fio in this case) and kblockd's kworker to submit the cloned request. The ftrace function_graph tracer showed: kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to _not_ use kblockd to submit the cloned requests isn't enough to eliminate the observed context switches. In addition to this dm-mq specific blk-core fix, there are 2 DM core fixes to dm-mq that (when paired with the blk-core fix) completely eliminate the observed context switching: 1) don't blk_mq_run_hw_queues in blk-mq request completion Motivated by desire to reduce overhead of dm-mq, punting to kblockd just increases context switches. In my testing against a really fast null_blk device there was no benefit to running blk_mq_run_hw_queues() on completion (and no other blk-mq driver does this). So hopefully this change doesn't induce the need for yet another revert like commit 621739b00e16ca2d ! 2) use blk_mq_complete_request() in dm_complete_request() blk_complete_request() doesn't offer the traditional q->mq_ops vs .request_fn branching pattern that other historic block interfaces do (e.g. blk_get_request). Using blk_mq_complete_request() for blk-mq requests is important for performance. It should be noted that, like blk_complete_request(), blk_mq_complete_request() doesn't natively handle partial completions -- but the request-based DM-multipath target does provide the required partial completion support by dm.c:end_clone_bio() triggering requeueing of the request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE. dm-mq fix #2 is _much_ more important than #1 for eliminating the context switches. Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475 After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472 With these changes multithreaded async read IOPs improved from ~950K to ~1350K for this dm-mq stacked on null_blk test-case. The raw read IOPs of the underlying null_blk device for the same workload is ~1950K. Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()") Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM") Cc: stable@vger.kernel.org # 4.1+ Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-22dm: fix sparse "unexpected unlock" warnings in ioctl codeMike Snitzer1-27/+29
Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it do the dm_{get,put}_live_table() rather than split those operations. The dm_grab_bdev_for_ioctl() callers only care about the block_device associated with a singleton DM device so there isn't any need to retain a reference to the live DM table. It is sufficient to: 1) dm_get_live_table() 2) bdgrab() the bdev associated with the singleton table's target 3) dm_put_live_table() 4) bdput() the bdev Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: do not return target from dm_get_live_table_for_ioctl()Mike Snitzer1-20/+13
None of the callers actually used the returned target. Also, just reuse bdev pointer passed to dm_blk_ioctl(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq pathsMike Snitzer1-0/+2
Using request-based DM mpath configured with the following stacking (.request_fn DM mpath ontop of scsi-mq paths): echo Y > /sys/module/scsi_mod/parameters/use_blk_mq echo N > /sys/module/dm_mod/parameters/use_blk_mq 'struct dm_rq_target_io' would leak if a request is requeued before a blk-mq clone is allocated (or fails to allocate). free_rq_tio() wasn't being called. kmemleak reported: unreferenced object 0xffff8800b90b98c0 (size 112): comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s) hex dump (first 32 bytes): 00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@....... e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............ backtrace: [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0 [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20 [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170 [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod] [<ffffffff812fbd78>] blk_peek_request+0x168/0x290 [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod] [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40 [<ffffffff812f9585>] blk_delay_work+0x25/0x40 [<ffffffff81096fff>] process_one_work+0x14f/0x3d0 [<ffffffff81097715>] worker_thread+0x125/0x4b0 [<ffffffff8109ce88>] kthread+0xd8/0xf0 [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff crash> struct -o dm_rq_target_io struct dm_rq_target_io { ... } SIZE: 112 Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-17dm: do not reuse dm_blk_ioctl block_device input as local variableMike Snitzer1-2/+3
(Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for targets' .prepare_ioctl to fail if they go on to check the bdev for !NULL. Fixes: e56f81e0b01e ("dm: refactor ioctl handling") Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-17dm: fix ioctl retry termination with signalJunichi Nomura1-1/+1
dm-mpath retries ioctl, when no path is readily available and the device is configured to queue I/O in such a case. If you want to stop the retry before multipathd decides to turn off queueing mode, you could send signal for the process to exit from the loop. However the check of fatal signal has not carried along when commit 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") moved the loop from dm-mpath to dm core. As a result, we can't terminate such a process in the retry loop. Easy reproducer of the situation is: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # dmsetup message mp 0 'queue_if_no_path' # sg_inq /dev/mapper/mp then you should be able to terminate sg_inq by pressing Ctrl+C. Fixes: 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-11-07block: change ->make_request_fn() and users to return a queue cookieJens Axboe1-3/+3
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-05Merge tag 'dm-4.4-changes' of ↵Linus Torvalds1-27/+182
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "Smaller set of DM changes for this merge. I've based these changes on Jens' for-4.4/reservations branch because the associated DM changes required it. - Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups" * tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm switch: simplify conditional in alloc_region_table() dm delay: document that offsets are specified in sectors dm delay: capitalize the start of an delay_ctr() error message dm delay: Use DM_MAPIO macros instead of open-coded equivalents dm linear: remove redundant target name from error messages dm persistent data: eliminate unnecessary return values dm: eliminate unused "bioset" process for each bio-based DM device dm: convert ffs to __ffs dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() dm: add support for passing through persistent reservations dm: refactor ioctl handling Revert "dm mpath: fix stalls when handling invalid ioctls" dm: initialize non-blk-mq queue data before queue is used
2015-11-05Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+0
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-01dm: eliminate unused "bioset" process for each bio-based DM deviceMikulas Patocka1-2/+6
Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make generic_make_request handle arbitrarily sized bios") makes it possible for block devices to process large bios. In doing so that commit allocates a new queue->bio_split bioset for each block device, this bioset is used for allocating bios when the driver needs to split large bios. Each bioset allocates a workqueue process, thus the above commit increases the number of processes allocated per block device. DM doesn't need the queue->bio_split bioset, thus we can deallocate it. This reduces the number of allocated processes per bio-based DM device from 3 to 2. Also remove the call to blk_queue_split(), it is not needed because DM does its own splitting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-01dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()Julia Lawall1-9/+4
Remove DM's unneeded NULL tests before calling these destroy functions, now that they check for NULL, thanks to these v4.3 commits: 3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()") 4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()") The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-01dm: add support for passing through persistent reservationsChristoph Hellwig1-0/+123
This adds support to pass through persistent reservation requests similar to the existing ioctl handling, and with the same limitations, e.g. devices may only have a single target attached. This is mostly intended for multipathing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-01dm: refactor ioctl handlingChristoph Hellwig1-13/+42
This moves the call to blkdev_ioctl and the argument checking to DM core code, and only leaves a callout to find the block device to operate on in the targets. This simplifies the code and allows us to pass through ioctl-like command using other methods in the next patch. Also split out a helper around calling the prepare_ioctl method that will be reused for persistent reservation handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-30dm: initialize non-blk-mq queue data before queue is usedMikulas Patocka1-3/+7
Commit bfebd1cdb497a57757c83f5fbf1a29931591e2a4 ("dm: add full blk-mq support to request-based DM") moves the initialization of the fields backing_dev_info.congested_fn, backing_dev_info.congested_data and queuedata from the function dm_init_md_queue (that is called when the device is created) to dm_init_old_md_queue (that is called after the device type is determined). There is no locking when accessing these variables, thus it is possible for other parts of the kernel to briefly see this data in a transient state (e.g. queue->backing_dev_info.congested_fn initialized and md->queue->backing_dev_info.congested_data uninitialized, resulting in passing an incorrect parameter to the function dm_any_congested). This queue data is left initialized for blk-mq devices even though they that don't use it. Fixes: bfebd1cdb497 ("dm: add full blk-mq support to request-based DM") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+
2015-10-21md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdownDan Williams1-2/+0
Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-06dm: fix request-based dm error reportingJunichi Nomura1-2/+3
end_clone_bio() is a endio callback for clone bio and should check and save the clone's bi_error for error reporting. However, 4246a0b63bd8 ("block: add a bi_error field to struct bio") changed the function to check the original bio's bi_error, which is 0. Without this fix, clone's error is ignored and reported to the original request as success. Thus data corruption will be observed. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-01dm: fix AB-BA deadlock in __dm_destroy()Junichi Nomura1-4/+2
__dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and suspend_lock in reverse order. Doing so can cause AB-BA deadlock: __dm_destroy dm_swap_table --------------------------------------------------- mutex_lock(suspend_lock) dm_get_live_table() srcu_read_lock(io_barrier) dm_sync_table() synchronize_srcu(io_barrier) .. waiting for dm_put_live_table() mutex_lock(suspend_lock) .. waiting for suspend_lock Fix this by taking the locks in proper order. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Fixes: ab7c7bb6f4ab ("dm: hold suspend_lock while suspending device during device deletion") Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-09-03Merge tag 'dm-4.3-changes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-126/+11
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-08-13block: kill merge_bvec_fn() completelyKent Overstreet1-125/+2
As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13block: make generic_make_request handle arbitrarily sized biosKent Overstreet1-0/+2
The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-12dm: test return value for DM_MAPIO_SUBMITTEDMikulas Patocka1-1/+1
In properly written code we should not assume that DM_MAPIO_SUBMITTED is zero. We should test the return value for DM_MAPIO_SUBMITTED rather than testing it for zero. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-04dm: fix dm_merge_bvec regression on 32 bit systemsMike Snitzer1-17/+10
A DM regression on 32 bit systems was reported against v4.2-rc3 here: https://lkml.org/lkml/2015/7/29/401 Fix this by reverting both commit 1c220c69 ("dm: fix casting bug in dm_merge_bvec()") and 148e51ba ("dm: improve documentation and code clarity in dm_merge_bvec"). This combined revert is done to eliminate the possibility of a partial revert in stable@ kernels. In hindsight the correct fix, at the time 1c220c69 was applied to fix the regression that 148e51ba introduced, should've been to simply revert 148e51ba. Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Tested-by: Adam Williamson <awilliam@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig1-8/+7
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-13dm: fix use after free crash due to incorrect cleanup sequenceMikulas Patocka1-2/+2
Linux 4.2-rc1 Commit 0f20972f7bf6 ("dm: factor out a common cleanup_mapped_device()") moved a common cleanup code to a separate function. Unfortunately, that commit incorrectly changed the order of cleanup, so that it destroys the mapped_device's srcu structure 'io_barrier' before destroying its workqueue. The function that is executed on the workqueue (dm_wq_work) uses the srcu structure, thus it may use it after being freed. That results in a crash in the LVM test suite's mirror-vgreduce-removemissing.sh test. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 0f20972f7bf6 ("dm: factor out a common cleanup_mapped_device()") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-08Revert "dm: only run the queue on completion if congested or no requests ↵Mike Snitzer1-6/+2
pending" This reverts commit 9a0e609e3fd8a95c96629b9fbde6b8c5b9a1456a. (Resolved a conflict during revert due to commit bfebd1cdb4 that came after) This revert is motivated by a couple failure reports on request-based DM multipath testbeds: 1) Netapp reported that their multipath fault injection test under heavy IO load can stall longer than 300 seconds. 2) IBM reported elevated lock contention in their testbed (likely due to increased back pressure due to IO not being dispatched as quickly): https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
2015-06-26Merge tag 'dm-4.2-fixes' of ↵Linus Torvalds1-68/+152
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"
2015-06-26Revert "block, dm: don't copy bios for request clones"Mike Snitzer1-40/+131
This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26Revert "dm: do not allocate any mempools for blk-mq request-based DM"Mike Snitzer1-38/+31
This reverts commit cbc4e3c1350beb47beab8f34ad9be3d34a20c705. Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26Merge tag 'dm-4.2-changes' of ↵Linus Torvalds1-78/+112
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM core cleanups: * blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy * smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt * tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits) dm stats: add support for request-based DM devices dm stats: collect and report histogram of IO latencies dm stats: support precise timestamps dm stats: fix divide by zero if 'number_of_areas' arg is zero dm cache: switch the "default" cache replacement policy from mq to smq dm space map metadata: fix occasional leak of a metadata block on resize dm thin metadata: fix a race when entering fail mode dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages dm thin: range discard support dm thin metadata: add dm_thin_remove_range() dm thin metadata: add dm_thin_find_mapped_range() dm btree: add dm_btree_remove_leaves() dm stats: Use kvfree() in dm_kvfree() dm cache: age and write back cache entries even without active IO dm cache: prefix all DMERR and DMINFO messages with cache device name dm cache: add fail io mode and needs_check flag dm cache: wake the worker thread every time we free a migration object dm cache: add stochastic-multi-queue (smq) policy dm cache: boost promotion of blocks that will be overwritten dm cache: defer whole cells ...
2015-06-26Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...