summaryrefslogtreecommitdiff
path: root/drivers/md/dm-thin.c
AgeCommit message (Collapse)AuthorFilesLines
2014-02-27dm thin: allow metadata space larger than supported to go unusedMike Snitzer1-12/+19
It was always intended that a user could provide a thin metadata device that is larger than the max supported by the on-disk format. The extra space would just go unused. Unfortunately that never worked. If the user attempted to use a larger metadata device on creation they would get an error like the following: device-mapper: space map common: space map too large device-mapper: transaction manager: couldn't create metadata space map device-mapper: thin metadata: tm_create_with_sm failed device-mapper: table: 252:17: thin-pool: Error creating metadata object device-mapper: ioctl: error adding target to table Fix this by allowing the initial metadata space map creation to cap its size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS). get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported). Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Lastly, remove the "excess space will not be used" warning message from get_metadata_dev_size(); it resulted in printing the warning multiple times. Factor out warn_if_metadata_device_too_big(), call it from pool_ctr() and maybe_resize_metadata_dev(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-24dm thin: fix the error path for the thin device constructorMike Snitzer1-1/+4
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len() fails in thin_ctr(). Otherwise __pool_destroy() will fail because the pool will still have an open thin device: device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed. Also, must establish error code if failing thin_ctr() because the pool is in fail_io mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
2014-02-17dm thin: avoid metadata commit if a pool's thin devices haven't changedMike Snitzer1-1/+2
Commit 905e51b ("dm thin: commit outstanding data every second") introduced a periodic commit. This commit occurs regardless of whether any thin devices have made changes. Fix the periodic commit to check if any of a pool's thin devices have changed using dm_pool_changed_this_transaction(). Reported-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-12/+18
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-16dm thin: fix pool feature parsingMike Snitzer1-1/+1
Commit 787a996cb251e20 ("dm thin: add error_if_no_space feature") mistakenly forgot to increase the number of feature args supported. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-07dm thin: fix set_pool_mode exposed pool operation racesMike Snitzer1-13/+27
The pool mode must not be switched until after the corresponding pool process_* methods have been established. Otherwise, because set_pool_mode() isn't interlocked with the IO path for performance reasons, the IO path can end up executing process_* operations that don't match the mode. This patch eliminates problems like the following (as seen on really fast PCIe SSD storage when transitioning the pool's mode from PM_READ_ONLY to PM_WRITE): kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event. kernel: device-mapper: thin: 253:2: no free data space available. kernel: device-mapper: thin: 253:2: switching pool to read-only mode kernel: device-mapper: thin: 253:2: switching pool to write mode kernel: ------------[ cut here ]------------ kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]() ... kernel: Workqueue: dm-thin do_worker [dm_thin_pool] kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3 kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800 kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3 kernel: Call Trace: kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0 kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20 kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool] kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool] kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool] kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool] kernel: [<ffffffff81067823>] process_one_work+0x183/0x490 kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0 kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160 kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0 kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70 kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70 kernel: ---[ end trace 3f00528e08ffa55c ]--- kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!? dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY); at the top of handle_unserviceable_bio(). And as the additional debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like expected, it was already PM_WRITE, yet pool->process_bio was still set to process_bio_read_only(). Also, while fixing this up, reduce logging of redundant pool mode transitions by checking new_mode is different from old_mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-01-07dm thin: eliminate the no_free_space flagMike Snitzer1-22/+4
The pool's error_if_no_space flag can easily serve the same purpose that no_free_space did, namely: control whether handle_unserviceable_bio() will error a bio or requeue it. This is cleaner since error_if_no_space is established when the pool's features are processed during table load. So it avoids managing the no_free_space flag by taking the pool's spinlock. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-07dm thin: add error_if_no_space featureMike Snitzer1-6/+25
If the pool runs out of data or metadata space, the pool can either queue or error the IO destined to the data device. The default is to queue the IO until more space is added. An admin may now configure the pool to error IO when no space is available by setting the 'error_if_no_space' feature when loading the thin-pool table. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: requeue bios to DM core if no_free_space and in read-only modeMike Snitzer1-6/+20
Now that we switch the pool to read-only mode when the data device runs out of space it causes active writers to get IO errors once we resume after resizing the data device. If no_free_space is set, save bios to the 'retry_on_resume_list' and requeue them on resume (once the data or metadata device may have been resized). With this patch the resize_io test passes again (on slower storage): dmtest run --suite thin-provisioning -n /resize_io/ Later patches fix some subtle races associated with the pool mode transitions done as part of the pool's -ENOSPC handling. These races are exposed on fast storage (e.g. PCIe SSD). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: cleanup and improve no space handlingMike Snitzer1-29/+32
Factor out_of_data_space() out of alloc_data_block(). Eliminate the use of 'no_free_space' as a latch in alloc_data_block() -- this is no longer needed now that we switch to read-only mode when we run out of data or metadata space. In a later patch, the 'no_free_space' flag will be eliminated entirely (in favor of checking metadata rather than relying on a transient flag). Move no metdata space handling into metdata_operation_failed(). Set no_free_space when metadata space is exhausted too. This is useful, because it offers consistency, for the following patch that will requeue data IOs if no_free_space. Also, rename no_space() to retry_bios_on_resume(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: log info when growing the data or metadata deviceMike Snitzer1-0/+7
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: handle metadata failures more consistentlyJoe Thornber1-21/+27
Introduce metadata_operation_failed() wrappers, around set_pool_mode(), to assist with improving the consistency of how metadata failures are handled. Logging is improved and metadata operation failures trigger read-only mode immediately. Also, eliminate redundant set_pool_mode() calls in the two alloc_data_block() caller's error paths. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-07dm thin: factor out check_low_water_mark and use boolsJoe Thornber1-15/+22
Factor check_low_water_mark() out of alloc_data_block(). Change a couple unsigned flags in the pool structure to bool. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-07dm thin: add mappings to end of prepared_* listsMike Snitzer1-3/+3
Mappings could be processed in descending logical block order, particularly if buffered IO is used. This could adversely affect the latency of IO processing. Fix this by adding mappings to the end of the 'prepared_mappings' and 'prepared_discards' lists. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: return error from alloc_data_block if pool is not in write modeJoe Thornber1-0/+3
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-07dm thin: use bool rather than unsigned for flags in structuresMike Snitzer1-11/+11
Also, move 'err' member in dm_thin_new_mapping structure to eliminate 4 byte hole (reduces size from 88 bytes to 80). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07dm thin: fix discard support to a previously shared blockJoe Thornber1-2/+12
If a snapshot is created and later deleted the origin dm_thin_device's snapshotted_time will have been updated to reflect the snapshot's creation time. The 'shared' flag in the dm_thin_lookup_result struct returned from dm_thin_find_block() is an approximation based on snapshotted_time -- this is done to avoid 0(n), or worse, time complexity. In this case, the shared flag would be true. But because the 'shared' flag reflects an approximation a block can be incorrectly assumed to be shared (e.g. false positive for 'shared' because the snapshot no longer exists). This could result in discards issued to a thin device not being passed down to the pool's underlying data device. To fix this we double check that a thin block is really still in-use after a mapping is removed using dm_pool_block_is_used(). If the reference count for a block is now zero the discard is allowed to be passed down. Also add a 'definitely_not_shared' member to the dm_thin_new_mapping structure -- reflects that the 'shared' flag in the response from dm_thin_find_block() can only be held as definitive if false is returned. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-01-07dm thin: initialize dm_thin_new_mapping returned by get_next_mappingMike Snitzer1-11/+6
As additional members are added to the dm_thin_new_mapping structure care should be taken to make sure they get initialized before use. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
2013-12-31Merge tag 'v3.13-rc6' into for-3.14/coreJens Axboe1-27/+39
Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c
2013-12-11dm thin: allow pool in read-only mode to transition to read-write modeJoe Thornber1-2/+10
A thin-pool may be in read-only mode because the pool's data or metadata space was exhausted. To allow for recovery, by adding more space to the pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE mode. Otherwise, running out of space will render the pool permanently read-only. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-12-11dm thin: re-establish read-only state when switching to fail modeJoe Thornber1-0/+1
If the thin-pool transitioned to fail mode and the thin-pool's table were reloaded for some reason: the new table's default pool mode would be read-write, though it will transition to fail mode during resume. When the pool mode transitions directly from PM_WRITE to PM_FAIL we need to re-establish the intermediate read-only state in both the metadata and persistent-data block manager (as is usually done with the normal pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-12-11dm thin: always fallback the pool mode if commit failsJoe Thornber1-22/+15
Rename commit_or_fallback() to commit(). Now all previous calls to commit() will trigger the pool mode to fallback if the commit fails. Also, check the error returned from commit() in alloc_data_block(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-12-11dm thin: switch to read-only mode if metadata space is exhaustedMike Snitzer1-2/+10
Switch the thin pool to read-only mode in alloc_data_block() if dm_pool_alloc_data_block() fails because the pool's metadata space is exhausted. Differentiate between data and metadata space in messages about no free space available. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed <snip ... these repeat for a _very_ long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: commit failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: thin: 253:4: reached low water mark for metadata device: sending event. device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
2013-12-11dm thin: switch to read only mode if a mapping insert failsJoe Thornber1-1/+3
Switch the thin pool to read-only mode when dm_thin_insert_block() fails since there is little reason to expect the cause of the failure to be resolved without further action by user space. This issue was noticed with the device-mapper-test-suite using: dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/ The quantity of errors logged in this case must be reduced. before patch: device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: dm_thin_insert_block() failed device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map metadata: unable to allocate new metadata block <snip ... these repeat for a long while ... > device-mapper: space map metadata: unable to allocate new metadata block device-mapper: space map common: dm_tm_shadow_block() failed device-mapper: thin: 253:4: no free metadata space available. device-mapper: thin: 253:4: switching pool to read-only mode after patch: device-mapper: space map metadata: unable to allocate new metadata block device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28 device-mapper: thin: 253:4: switching pool to read-only mode Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-11-24block: Generic bio chainingKent Overstreet1-2/+6
This adds a generic mechanism for chaining bio completions. This is going to be used for a bio_split() replacement, and it turns out to be very useful in a fair amount of driver code - a fair number of drivers were implementing this in their own roundabout ways, often painfully. Note that this means it's no longer to call bio_endio() more than once on the same bio! This can cause problems for drivers that save/restore bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all - in all but the simplest cases they'd be better off just cloning the bio, and immutable biovecs is making bio cloning cheaper. But for now, we add a bio_endio_nodec() for these cases. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>
2013-11-24block: Abstract out bvec iteratorKent Overstreet1-10/+12
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-09-23dm thin: do not expose non-zero discard limits if discards disabledMike Snitzer1-3/+11
Fix issue where the block layer would stack the discard limits of the pool's data device even if the "ignore_discard" pool feature was specified. The pool and thin device(s) still had discards disabled because the QUEUE_FLAG_DISCARD request_queue flag wasn't set. But to avoid user confusion when "ignore_discard" is used: both the pool device and the thin device(s) have zeroes for all discard limits. Also, always set discard_zeroes_data_unsupported in targets because they should never advertise the 'discard_zeroes_data' capability (even if the pool's data device supports it). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2013-09-06dm thin: always return -ENOSPC if no_free_space is setMike Snitzer1-25/+31
If pool has 'no_free_space' set it means a previous allocation already determined the pool has no free space (and failed that allocation with -ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do not allow the pool to oscillate between allocating blocks and then not. But a side-effect of this determinism is that if a user wants to be able to allocate new blocks they'll need to reload the pool's table (to clear the 'no_free_space' flag). This reload will happen automatically if the pool's data volume is resized. But if the user takes action to free a lot of space by deleting snapshot volumes, etc the pool will no longer allow data allocations to continue without an intervening table reload. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-06dm thin: set pool read-only if breaking_sharing fails block allocationMike Snitzer1-2/+4
break_sharing() now handles an arbitrary alloc_data_block() error the same way as provision_block(): marks pool read-only and errors the cell. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-09-06dm thin: prefix pool error messages with pool device nameMike Snitzer1-16/+32
Useful to know which pool is experiencing the error. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-08-23dm thin: fix stacking of geometry limitsMike Snitzer1-2/+10
Do not blindly override the queue limits (specifically io_min and io_opt). Allow traditional stacking of these limits if io_opt is a factor of the thin-pool's data block size. Without this patch mkfs.xfs does not recognize the thin device's provided limits as a useful geometry (e.g. raid) so these hints are ignored. This was due to setting io_min to a useless value. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2013-05-19dm thin: fix metadata dev resize detectionAlasdair G Kergon1-2/+2
Fix detection of the need to resize the dm thin metadata device. The code incorrectly tried to extend the metadata device when it didn't need to due to a merging error with patch 24347e9 ("dm thin: detect metadata device resizing"). device-mapper: transaction manager: couldn't open metadata space map device-mapper: thin metadata: tm_open_with_sm failed device-mapper: thin: aborting transaction failed device-mapper: thin: switching pool to failure mode Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm thin: generate event when metadata threshold passedJoe Thornber1-0/+38
Generate a dm event when the amount of remaining thin pool metadata space falls below a certain level. The threshold is taken to be a quarter of the size of the metadata device with a minimum threshold of 4MB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm thin: detect metadata device resizingJoe Thornber1-3/+51
Allow the dm thin pool metadata device to be extended. Whenever a pool is resumed, detect whether the size of the metadata device has increased, and if so, extend the metadata to use the new space. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm thin: open dev read only when possibleJoe Thornber1-11/+14
If a thin pool is created in read-only-metadata mode then only open the metadata device read-only. Previously it was always opened with FMODE_READ | FMODE_WRITE. (Note that dm_get_device() still allows read-only dm devices to be used read-write at the moment: If I create a read-only linear device for the metadata, via dmsetup load --readonly, then I can still create a rw pool out of it.) Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm thin: refactor data dev resizeJoe Thornber1-28/+59
Refactor device size functions in preparation for similar metadata device resizing functions. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20dm thin: fix non power of two discard granularity calcJoe Thornber1-1/+6
Fix a discard granularity calculation to work for non power of 2 block sizes. In order for thinp to passdown discard bios to the underlying data device, the data device must have a discard granularity that is a factor of the thinp block size. Originally this check was done by using bitops since the block_size was known to be a power of two. Introduced by commit f13945d75730081830b6f3360266950e2b7c9067 ("dm thin: support a non power of 2 discard_granularity"). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-20dm thin: fix discard corruptionJoe Thornber1-2/+2
Fix a bug in dm_btree_remove that could leave leaf values with incorrect reference counts. The effect of this was that removal of a shared block could result in the space maps thinking the block was no longer used. More concretely, if you have a thin device and a snapshot of it, sending a discard to a shared region of the thin could corrupt the snapshot. Thinp uses a 2-level nested btree to store it's mappings. This first level is indexed by thin device, and the second level by logical block. Often when we're removing an entry in this mapping tree we need to rebalance nodes, which can involve shadowing them, possibly creating a copy if the block is shared. If we do create a copy then children of that node need to have their reference counts incremented. In this way reference counts percolate down the tree as shared trees diverge. The rebalance functions were incrementing the children at the appropriate time, but they were always assuming the children were internal nodes. This meant the leaf values (in our case packed block/flags entries) were not being incremented. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm thin: remove cells from stackJoe Thornber1-15/+32
This patch takes advantage of the new bio-prison interface where the memory is now passed in rather than using a mempool in bio-prison. This allows the map function to avoid performing potentially-blocking allocations that could lead to deadlocks: We want to avoid the cell allocation that is done in bio_detain. (The potential for mempool deadlocks still remains in other functions that use bio_detain.) Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm bio prison: pass cell memory inJoe Thornber1-26/+77
Change the dm_bio_prison interface so that instead of allocating memory internally, dm_bio_detain is supplied with a pre-allocated cell each time it is called. This enables a subsequent patch to move the allocation of the struct dm_bio_prison_cell outside the thin target's mapping function so it can no longer block there. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm kcopyd: introduce configurable throttlingMikulas Patocka1-1/+4
This patch allows the administrator to reduce the rate at which kcopyd issues I/O. Each module that uses kcopyd acquires a throttle parameter that can be set in /sys/module/*/parameters. We maintain a history of kcopyd usage by each module in the variables io_period and total_period in struct dm_kcopyd_throttle. The actual kcopyd activity is calculated as a percentage of time equal to "(100 * io_period / total_period)". This is compared with the user-defined throttle percentage threshold and if it is exceeded, we sleep. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm: rename request variables to biosAlasdair G Kergon1-6/+6
Use 'bio' in the name of variables and functions that deal with bios rather than 'request' to avoid confusion with the normal block layer use of 'request'. No functional changes. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm thin: use block_size_is_power_of_twoMike Snitzer1-12/+13
Use block_size_is_power_of_two() rather than checking sectors_per_block_shift directly. Also introduce local pool variable in get_bio_block() to eliminate redundant tc->pool dereferences. No functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm thin: support a non power of 2 discard_granularityMike Snitzer1-8/+1
Support a non-power-of-2 discard granularity in dm-thin, now that the block layer supports this(via 8dd2cb7e880d2f77fba53b523c99133ad5054cfd "block: discard granularity might not be power of 2" and 59771079c18c44e39106f0f30054025acafadb41 "blk: avoid divide-by-zero with zero discard granularity"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-02dm: fix truncated status stringsMikulas Patocka1-31/+49
Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-01-31dm thin: fix queue limits stackingMike Snitzer1-12/+1
thin_io_hints() is blindly copying the queue limits from the thin-pool which can lead to incorrect limits being set. The fix here simply deletes the thin_io_hints() hook which leaves the existing stacking infrastructure to set the limits correctly. When a thin-pool uses an MD device for the data device a thin device from the thin-pool must respect MD's constraints about disallowing a bio from spanning multiple chunks. Otherwise we can see problems. If the raid0 chunksize is 1152K and thin-pool chunksize is 256K I see the following md/raid0 error (with extra debug tracing added to thin_endio) when mkfs.xfs is executed against the thin device: md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127 device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0 This extra DM debugging shows that the failing bio is spanning across the first and second logical 1152K chunk (sector 2080 + 255 takes the bio beyond the first chunk's boundary of sector 2304). So the bio splitting that DM is doing clearly isn't respecting the MD limits. max_hw_sectors_kb is 127 for both the thin-pool and thin device (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of precision). So this explains why bi_size is 130560. But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given that it doesn't have a .merge function (for bio_add_page to consult indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD device that has a compulsory merge_bvec_fn. This scenario is exactly why DM must resort to sending single PAGE_SIZE bios to the underlying layer. Some additional context for this is available in the header for commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe boundaries"). Long story short, the reason a thin device doesn't properly get configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that thin_io_hints() is blindly copying the queue limits from the thin-pool device directly to the thin device's queue limits. Fix this by eliminating thin_io_hints. Doing so is safe because the block layer's queue limits stacking already enables the upper level thin device to inherit the thin-pool device's discard and minimum_io_size and optimal_io_size limits that get set in pool_io_hints. But avoiding the queue limits copy allows the thin and thin-pool limits to be different where it is important, namely max_hw_sectors_kb. Reported-by: Daniel Browning <db@kavod.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm: remove map_infoMikulas Patocka1-10/+5
This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm thin: dont use map_contextMikulas Patocka1-36/+13
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead. This patch removes any use of map_info in preparation for the next patch that removes map_info from bio-based device mapper. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm kcopyd: add WRITE SAME support to dm_kcopyd_zeroMike Snitzer1-1/+1
Add WRITE SAME support to dm-io and make it accessible to dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface whereas the blkdev_issue_write_same() interface is synchronous. WRITE SAME is a SCSI command that can be leveraged for more efficient zeroing of a specified logical extent of a device which supports it. Only a single zeroed logical block is transfered to the target for each WRITE SAME and the target then writes that same block across the specified extent. The dm thin target uses this. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-22dm thin: use DMERR_LIMIT for errorsMike Snitzer1-10/+15
Throttle all errors logged from the IO path by dm thin. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>