summaryrefslogtreecommitdiff
path: root/drivers/md/dm-raid1.c
AgeCommit message (Collapse)AuthorFilesLines
2015-09-03Merge tag 'dm-4.3-changes' of ↵Linus Torvalds1-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock
2015-08-12dm: do not override error code returned from dm_get_device()Vivek Goyal1-3/+5
Some of the device mapper targets override the error code returned by dm_get_device() and return either -EINVAL or -ENXIO. There is nothing gained by this override. It is better to propagate the returned error code unchanged to caller. This work was motivated by hitting an issue where the underlying device was busy but -EINVAL was being returned. After this change we get -EBUSY instead and it is easier to figure out the problem. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig1-11/+13
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29dm raid1: keep issuing IO after leg failureLidong Zhong1-17/+58
Currently if there is a leg failure, the bio will be put into the hold list until userspace does a remove/replace on the leg. Doing so in a cluster config (clvmd) is problematic because there may be a temporary path failure that results in cluster raid1 remove/replace. Such recovery takes a long time due to a full resync. Update dm-raid1 to optionally ignore these failures so bios continue being issued without interrupton. To enable this feature userspace must pass "keep_log" when creating the dm-raid1 device. Signed-off-by: Lidong Zhong <lzhong@suse.com> Tested-by: Liuhua Wang <lwang@suse.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-22block: remove management of bi_remaining when restoring original bi_end_ioMike Snitzer1-2/+0
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") regressed all existing callers that followed this pattern: 1) saving a bio's original bi_end_io 2) wiring up an intermediate bi_end_io 3) restoring the original bi_end_io from intermediate bi_end_io 4) calling bio_endio() to execute the restored original bi_end_io The regression was due to BIO_CHAIN only ever getting set if bio_inc_remaining() is called. For the above pattern it isn't set until step 3 above (step 2 would've needed to establish BIO_CHAIN). As such the first bio_endio(), in step 2 above, never decremented __bi_remaining before calling the intermediate bi_end_io -- leaving __bi_remaining with the value 1 instead of 0. When bio_inc_remaining() occurred during step 3 it brought it to a value of 2. When the second bio_endio() was called, in step 4 above, it should've called the original bi_end_io but it didn't because there was an extra reference that wasn't dropped (due to atomic operations being optimized away since BIO_CHAIN wasn't set upfront). Fix this issue by removing the __bi_remaining management complexity for all callers that use the above pattern -- bio_chain() is the only interface that _needs_ to be concerned with __bi_remaining. For the above pattern callers just expect the bi_end_io they set to get called! Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls that aren't associated with the bio_chain() interface. Also, the bio_inc_remaining() interface has been moved local to bio.c. Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_remaining for non-chainsJens Axboe1-1/+1
Struct bio has an atomic ref count for chained bio's, and we use this to know when to end IO on the bio. However, most bio's are not chained, so we don't need to always introduce this atomic operation as part of ending IO. Add a helper to elevate the bi_remaining count, and flag the bio as now actually needing the decrement at end_io time. Rename the field to __bi_remaining to catch any current users of this doing the incrementing manually. For high IOPS workloads, this reduces the overhead of bio_endio() substantially. Tested-by: Robert Elliott <elliott@hp.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-14dm mirror: do not degrade the mirror on discard errorMikulas Patocka1-0/+9
It may be possible that a device claims discard support but it rejects discards with -EOPNOTSUPP. It happens when using loopback on ext2/ext3 filesystem driven by the ext4 driver. It may also happen if the underlying devices are moved from one disk on another. If discard error happens, we reject the bio with -EOPNOTSUPP, but we do not degrade the array. This patch fixes failed test shell/lvconvert-repair-transient.sh in the lvm2 testsuite if the testsuite is extracted on an ext2 or ext3 filesystem and it is being driven by the ext4 driver. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-02-18dm raid1: fix immutable biovec related BUG when retrying read bioMikulas Patocka1-0/+3
When restoring bi_end_io, increase bi_remaining before retrying the bio to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-11-24block: Convert drivers to immutable biovecsKent Overstreet1-4/+4
Now that we've got a mechanism for immutable biovecs - bi_iter.bi_bvec_done - we need to convert drivers to use primitives that respect it instead of using the bvec array directly. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com
2013-11-24block: Abstract out bvec iteratorKent Overstreet1-8/+8
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-08-23dm: stop using WQ_NON_REENTRANTTejun Heo1-2/+1
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2013-03-24block: Use bio_sectors() more consistentlyKent Overstreet1-1/+1
Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>
2013-03-02dm kcopyd: introduce configurable throttlingMikulas Patocka1-1/+4
This patch allows the administrator to reduce the rate at which kcopyd issues I/O. Each module that uses kcopyd acquires a throttle parameter that can be set in /sys/module/*/parameters. We maintain a history of kcopyd usage by each module in the variables io_period and total_period in struct dm_kcopyd_throttle. The actual kcopyd activity is calculated as a percentage of time equal to "(100 * io_period / total_period)". This is compared with the user-defined throttle percentage threshold and if it is exceeded, we sleep. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm: rename request variables to biosAlasdair G Kergon1-2/+2
Use 'bio' in the name of variables and functions that deal with bios rather than 'request' to avoid confusion with the normal block layer use of 'request'. No functional changes. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm: fix truncated status stringsMikulas Patocka1-5/+3
Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm: remove map_infoMikulas Patocka1-4/+2
This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm raid1: dont use map_contextMikulas Patocka1-9/+13
Don't use map_info any more in dm-raid1. map_info was used for writes to hold the region number. For this purpose we add a new field dm_bio_details to dm_raid1_bio_record. map_info was used for reads to hold a pointer to dm_raid1_bio_record (if the pointer was non-NULL, bio details were saved; if the pointer was NULL, bio details were not saved). We use dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is NULL, details were not saved, if bi_bdev is non-NULL, details were saved. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm raid1: rename read_record to bio_recordMikulas Patocka1-11/+11
Rename struct read_record to bio_record in dm-raid1. In the following patch, the structure will be used for both read and write bios, so rename it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm raid1: use per_bio_dataMikulas Patocka1-34/+5
Replace read_record_pool with per_bio_data in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm raid: use DM_ENDIO_INCOMPLETEMikulas Patocka1-1/+1
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm raid1: remove impossible mempool_alloc error testMikulas Patocka1-5/+3
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition that tests if read_record is non-NULL is always true. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-08-21workqueue: deprecate flush[_delayed]_work_sync()Tejun Heo1-1/+1
flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
2012-07-27dm thin: commit before gathering statusAlasdair G Kergon1-1/+1
Commit outstanding metadata before returning the status for a dm thin pool so that the numbers reported are as up-to-date as possible. The commit is not performed if the device is suspended or if the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target through a new 'status_flags' parameter in the target's dm_status_fn. The userspace dmsetup tool will support the --noflush flag with the 'dmsetup status' and 'dmsetup wait' commands from version 1.02.76 onwards. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27dm: use bool bitfields in struct dm_targetAlasdair G Kergon1-1/+1
Use boolean bit fields for flags in struct dm_target. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27dm: support non power of two target max_io_lenMike Snitzer1-1/+5
Remove the restriction that limits a target's specified maximum incoming I/O size to be a power of 2. Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'. Change it from sector_t to uint32_t, which is plenty big enough, and introduce a wrapper function dm_set_target_max_io_len() to set it. Use sector_div() to process it now that it is not necessarily a power of 2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-20dm raid1: set discard_zeroes_data_unsupportedMikulas Patocka1-0/+1
We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if the underlying disks support zero on discard. So this patch sets ti->discard_zeroes_data_unsupported. For example, if the mirror is in the process of resynchronizing, it may happen that kcopyd reads a piece of data, then discard is sent on the same area and then kcopyd writes the piece of data to another leg. Consequently, the data is not zeroed. The flag was made available by commit 983c7db347db8ce2d8453fd1d89b7a4bb6920d56 (dm crypt: always disable discard_zeroes_data). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-20dm raid1: fix crash with mirror recovery and discardMikulas Patocka1-1/+1
This patch fixes a crash when a discard request is sent during mirror recovery. Firstly, some background. Generally, the following sequence happens during mirror synchronization: - function do_recovery is called - do_recovery calls dm_rh_recovery_prepare - dm_rh_recovery_prepare uses a semaphore to limit the number simultaneously recovered regions (by default the semaphore value is 1, so only one region at a time is recovered) - dm_rh_recovery_prepare calls __rh_recovery_prepare, __rh_recovery_prepare asks the log driver for the next region to recover. Then, it sets the region state to DM_RH_RECOVERING. If there are no pending I/Os on this region, the region is added to quiesced_regions list. If there are pending I/Os, the region is not added to any list. It is added to the quiesced_regions list later (by dm_rh_dec function) when all I/Os finish. - when the region is on quiesced_regions list, there are no I/Os in flight on this region. The region is popped from the list in dm_rh_recovery_start function. Then, a kcopyd job is started in the recover function. - when the kcopyd job finishes, recovery_complete is called. It calls dm_rh_recovery_end. dm_rh_recovery_end adds the region to recovered_regions or failed_recovered_regions list (depending on whether the copy operation was successful or not). The above mechanism assumes that if the region is in DM_RH_RECOVERING state, no new I/Os are started on this region. When I/O is started, dm_rh_inc_pending is called, which increases reg->pending count. When I/O is finished, dm_rh_dec is called. It decreases reg->pending count. If the count is zero and the region was in DM_RH_RECOVERING state, dm_rh_dec adds it to the quiesced_regions list. Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is in DM_RH_RECOVERING state, it could be added to quiesced_regions list multiple times or it could be added to this list when kcopyd is copying data (it is assumed that the region is not on any list while kcopyd does its jobs). This results in memory corruption and crash. There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests do not belong to any region, so they are always added to the sync list in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH requests. These bypasses avoid the crash possibility described above. These bypasses were improperly implemented for REQ_DISCARD when the mirror target gained discard support in commit 5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard). In do_writes, REQ_DISCARD requests is always added to the sync queue and immediately dispatched (even if the region is in DM_RH_RECOVERING). However, dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list corruption described above. This patch changes it so that REQ_DISCARD requests follow the same path as REQ_FLUSH. This avoids the crash. Reference: https://bugzilla.redhat.com/837607 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28dm: reject trailing characters in sccanf inputMikulas Patocka1-4/+8
Device mapper uses sscanf to convert arguments to numbers. The problem is that the way we use it ignores additional unmatched characters in the scanned string. For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number, but also it will match number with some garbage appended, like "123abc". As a result, device mapper accepts garbage after some numbers. For example the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"' will pass without an error. This patch fixes all sscanf uses in device mapper. It appends "%c" with a pointer to a dummy character variable to every sscanf statement. The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds only if string is a null-terminated number (optionally preceded by some whitespace characters). If there is some character appended after the number, sscanf matches "%c", writes the character to the dummy variable and returns 2. We check the return value for 1 and consequently reject numbers with some garbage appended. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29dm kcopyd: return client directly and not through a pointerMikulas Patocka1-2/+4
Return client directly from dm_kcopyd_client_create, not through a parameter, making it consistent with dm_io_client_create. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29dm kcopyd: reserve fewer pagesMikulas Patocka1-2/+1
Reserve just the minimum of pages needed to process one job. Because we allocate pages from page allocator, we don't need to reserve a large number of pages. The maximum job size is SUB_JOB_SIZE and we calculate the number of reserved pages based on this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29dm io: use fixed initial mempool sizeMikulas Patocka1-2/+1
Replace the arbitrary calculation of an initial io struct mempool size with a constant. The code calculated the number of reserved structures based on the request size and used a "magic" multiplication constant of 4. This patch changes it to reserve a fixed number - itself still chosen quite arbitrarily. Further testing might show if there is a better number to choose. Note that if there is no memory pressure, we can still allocate an arbitrary number of "struct io" structures. One structure is enough to process the whole request. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-10block: remove per-queue pluggingJens Axboe1-2/+0
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-13dm: use non reentrant workqueues if equivalentTejun Heo1-2/+3
kmirrord_wq, kcopyd_work and md->wq are created per dm instance and serve only a single work item from the dm instance, so non-reentrant workqueues would provide the same ordering guarantees as ordered ones while allowing CPU affinity and use of the workqueues for other purposes. Switch them to non-reentrant workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm: convert workqueues to alloc_orderedTejun Heo1-1/+1
Convert all create[_singlethread]_work() users to the new alloc[_ordered]_workqueue(). This conversion is mechanical and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm: dont use flush_scheduled_workTejun Heo1-1/+1
flush_scheduled_work() is being deprecated. Flush the used work directly instead. In all dm targets, the only work which uses system_wq is ->trigger_event. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-01-13dm raid1: support discardMike Snitzer1-2/+10
Enable discard support in the DM mirror target. Also change an existing use of 'bvec' to 'addr' in the union. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-09-10dm: implement REQ_FLUSH/FUA support for bio-based dmTejun Heo1-4/+4
This patch converts bio-based dm to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. * -EOPNOTSUPP handling logic dropped. * Preflush is handled as before but postflush is dropped and replaced with passing down REQ_FUA to member request_queues. This replaces one array wide cache flush w/ member specific FUA writes. * __split_and_process_bio() now calls __clone_and_map_flush() directly for flushes and guarantees all FLUSH bio's going to targets are zero ` length. * It's now guaranteed that all FLUSH bio's which are passed onto dm targets are zero length. bio_empty_barrier() tests are replaced with REQ_FLUSH tests. * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes. * Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely enough to be marked with unlikely(). * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue doesn't support cache flushing. Advertise REQ_FLUSH | REQ_FUA capability. * Request based dm isn't converted yet. dm_init_request_based_queue() resets flush support to 0 for now. To avoid disturbing request based dm code, dm->flush_error is added for bio based dm while requested based dm continues to use dm->barrier_error. Lightly tested linear, stripe, raid1, snap and crypt targets. Please proceed with caution as I'm not familiar with the code base. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-12dm: use dm_target_offset macroAlasdair G Kergon1-1/+1
Use new dm_target_offset() macro to avoid most references to ti->begin in dm targets. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-07block: unify flags for struct bio and struct requestChristoph Hellwig1-1/+1
Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-03-06dm raid1: fix deadlock when suspending failed deviceTakahiro Yasui1-18/+23
To prevent deadlock, bios in the hold list should be flushed before dm_rh_stop_recovery() is called in mirror_suspend(). The recovery can't start because there are pending bios and therefore dm_rh_stop_recovery deadlocks. When there are pending bios in the hold list, the recovery waits for the completion of the bios after recovery_count is acquired. The recovery_count is released when the recovery finished, however, the bios in the hold list are processed after dm_rh_stop_recovery() in mirror_presuspend(). dm_rh_stop_recovery() also acquires recovery_count, then deadlock occurs. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
2010-03-06dm table: remove unused dm_get_device range parametersNikanth Karthikesan1-2/+1
Remove unused parameters(start and len) of dm_get_device() and fix the callers. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-03-06dm raid1: always return error if all legs failMikulas Patocka1-3/+6
If all mirror legs fail, always return an error instead of holding the bio, even if the handle_errors option was set. At present it is the responsibility of the driver underneath us to deal with retries, multipath etc. The patch adds the bio to the failures list instead of holding it directly. do_failures tests first if all legs failed and, if so, returns the bio with -EIO. If any leg is still alive and handle_errors is set, do_failures calls hold_bio. Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16dm raid1: fail writes if errors are not handled and log failsMikulas Patocka1-1/+1
If the mirror log fails when the handle_errors option was not selected and there is no remaining valid mirror leg, writes return success even though they weren't actually written to any device. This patch completes them with EIO instead. This code path is taken: do_writes: bio_list_merge(&ms->failures, &sync); do_failures: if (!get_valid_mirror(ms)) (false) else if (errors_handled(ms)) (false) else bio_endio(bio, 0); The logic in do_failures is based on presuming that the write was already tried: if it succeeded at least on one leg (without handle_errors) it is reported as success. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: explicitly initialise bio_listsMikulas Patocka1-0/+4
Explicitly initialize bio lists instead of relying on kzalloc. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: hold all write bios when leg failsMikulas Patocka1-2/+10
Hold all write bios when leg fails and errors are handled When using a userspace daemon such as dmeventd to handle errors, we must delay completing bios until it has done its job. This patch prevents the following race: - primary leg fails - write "1" fail, the write is held, secondary leg is set default - write "2" goes straight to the secondary leg Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: hold write bios when errors are handledMikulas Patocka1-31/+32
Hold all write bios when errors are handled. Previously the failures list was used only when handling errors with a userspace daemon such as dmeventd. Now, it is always used for all bios. The regions where some writes failed must be marked as nosync. This can only be done in process context (i.e. in raid1 workqueue), not in the write_callback function. Previously the write would succeed if writing to at least one leg succeeded. This is wrong because data from the failed leg may be replicated to the correct leg. Now, if using a userspace daemon, the write with some failures will be held until the daemon has done its job and reconfigured the array. If not using a daemon, the write still succeeds if at least one leg succeeds. This is bad, but it is consistent with current behavior. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: remove bio_endio from dm_rh_mark_nosyncMikulas Patocka1-1/+2
Move bio completion out of dm_rh_mark_nosync in preparation for the next patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: abstract get_valid_mirror functionMikulas Patocka1-7/+15
Move the logic to get a valid mirror leg into a function for re-use in a later patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: use hold framework in do_failuresMikulas Patocka1-25/+9
Use the hold framework in do_failures. This patch doesn't change the bio processing logic, it just simplifies failure handling and avoids periodically polling the failures list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-12-11dm raid1: add framework to hold bios during suspendMikulas Patocka1-0/+41
Add framework to delay bios until a suspend and then resubmit them with either DM_ENDIO_REQUEUE (if the suspend was noflush) or complete them with -EIO. I/O barrier support will use this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Tested-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>