path: root/drivers/md
Age | Commit message | Author | Files | Lines
2016-05-17 | MD: make bio mergeable | Shaohua Li | 1 | -0/+2
[ Upstream commit 9c573de3283af007ea11c17bde1e4568d9417328 ]

blk_queue_split marks a bio unmergeable, which makes sense for a normal bio. But when the bio is dispatched to an underlying disk, the blk_queue_split checks no longer apply, so the bio may become mergeable again. In the reported bug, this caused a severe performance drop for trim against raid0:
https://bugzilla.kernel.org/show_bug.cgi?id=117051

Reported-and-tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Fixes: 6ac45aeb6bca(block: avoid to merge splitted bio)
Cc: stable@vger.kernel.org (v4.3+)
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Reviewed-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18 | md: multipath: don't hardcopy bio in .make_request path | Ming Lei | 1 | -1/+3
[ Upstream commit fafcde3ac1a418688a734365203a12483b83907a ]

Inside multipath_make_request(), multipath maps the incoming bio into the low-level device's bio, but it is wrong to copy the bio into the mapped bio via '*mapped_bio = *bio'. For example, .__bi_remaining is kept in the copy, which matters especially if the incoming bio has been chained to by bio splitting; in that situation .bi_end_io is never called for the mapped bio in the completion path.

This patch fixes the issue by cloning the bio instead.

Cc: stable@vger.kernel.org (v3.14+)
Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
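The difference between a plain struct copy and a clone can be sketched in user-space C. This is only an illustration: `struct io`, `clone_io()` and `complete_io()` are hypothetical stand-ins, not the block layer's types or functions, and `remaining` merely plays the role that .__bi_remaining plays in the text.

```c
#include <assert.h>

/* Hypothetical I/O descriptor; 'remaining' stands in for .__bi_remaining. */
struct io {
    int remaining;        /* completions still outstanding */
    unsigned long sector; /* where the I/O goes */
    int done;             /* set once the end-io step has run */
};

/* Clone style: copy only the fields that describe the I/O and give the
 * new object its own fresh completion state. */
struct io clone_io(const struct io *src)
{
    struct io c;

    c.sector = src->sector;
    c.remaining = 1;
    c.done = 0;
    return c;
}

/* Completion: the end-io step runs only when the last reference drops. */
void complete_io(struct io *io)
{
    assert(io->remaining > 0);
    if (--io->remaining == 0)
        io->done = 1;
}
```

A plain `struct io copy = parent;` of a chained parent (with `remaining == 2`) inherits the count, so a single completion never reaches the end-io step, whereas the clone completes after one call.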
2016-04-18 | bcache: fix cache_set_flush() NULL pointer dereference on OOM | Eric Wheeler | 1 | -0/+3
[ Upstream commit f8b11260a445169989d01df75d35af0f56178f95 ]

When bch_cache_set_alloc() fails to kzalloc the cache_set, the asynchronous closure handling tries to dereference a cache_set that has not been allocated inside cache_set_flush(), which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18 | bcache: cleaned up error handling around register_cache() | Eric Wheeler | 1 | -12/+22
[ Upstream commit 9b299728ed777428b3908ac72ace5f8f84b97789 ]

Fix null pointer dereference by changing register_cache() to return an int instead of being void. This allows it to return -ENOMEM or -ENODEV and enables upper layers to handle the OOM case without NULL pointer issues.

See this thread:
http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521

Fixes this error:

    gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
    bcache: register_cache() error opening sdh2: cannot allocate memory
    BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
    IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
    PGD 120dff067 PUD 1119a3067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in: veth ip6table_filter ip6_tables (...)
    CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    Workqueue: events cache_set_flush [bcache]
    task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
    RIP: 0010:[<ffffffffc05a7e8d>] [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Tested-by: Marc MERLIN <marc@merlins.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
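The void-to-int change described above follows a common pattern: let the registration step report failure so the caller never touches an object that was not created. A minimal user-space sketch, assuming hypothetical names throughout (`cache_set_sketch`, `register_cache_sketch()`, `register_and_count()` and the `simulate_oom` switch are illustrative, not bcache code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the object that registration creates. */
struct cache_set_sketch {
    int nr_caches;
};

/* Registration reports failure to its caller instead of returning void.
 * 'simulate_oom' stands in for a failed allocation. */
int register_cache_sketch(struct cache_set_sketch **out, int simulate_oom)
{
    struct cache_set_sketch *c;

    assert(out != NULL);
    *out = NULL;
    c = simulate_oom ? NULL : calloc(1, sizeof(*c));
    if (!c)
        return -ENOMEM;
    c->nr_caches = 1;
    *out = c;
    return 0;
}

/* Upper layer: only touch the cache set when registration succeeded. */
int register_and_count(int simulate_oom)
{
    struct cache_set_sketch *c;
    int n;
    int err = register_cache_sketch(&c, simulate_oom);

    if (err)
        return err; /* no dereference of a missing cache set */
    n = c->nr_caches;
    free(c);
    return n;
}
```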
2016-04-18 | bcache: fix race of writeback thread starting before complete initialization | Eric Wheeler | 1 | -1/+8
[ Upstream commit 07cc6ef8edc47f8b4fc1e276d31127a0a5863d4d ] The bch_writeback_thread might BUG_ON in read_dirty() if dc->sb==BDEV_STATE_DIRTY and bch_sectors_dirty_init has not yet completed its related initialization. This patch downs the dc->writeback_lock until after initialization is complete, thus preventing bch_writeback_thread from proceeding prematurely. See this thread: http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3453 Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Marc MERLIN <marc@merlins.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18 | md/raid5: Compare apples to apples (or sectors to sectors) | Jes Sorensen | 1 | -2/+2
[ Upstream commit e7597e69dec59b65c5525db1626b9d34afdfa678 ] 'max_discard_sectors' is in sectors, while 'stripe' is in bytes. This fixes the problem where DISCARD would get disabled on some larger RAID5 configurations (6 or more drives in my testing), while it worked as expected with smaller configurations. Fixes: 620125f2bf8 ("MD: raid5 trim support") Cc: stable@vger.kernel.org v3.7+ Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
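The unit mismatch can be illustrated with a small user-space sketch. The helper names and the 512-byte sector shift are assumptions made for the example; this is not the raid5 code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/* Hypothetical helper: can a device whose discard limit is given in
 * sectors handle a discard covering one whole stripe, whose size is
 * given in bytes? */
bool discard_covers_stripe(uint32_t max_discard_sectors, uint64_t stripe_bytes)
{
    assert(stripe_bytes % (1u << SECTOR_SHIFT) == 0);
    /* Convert bytes to sectors before comparing. */
    return max_discard_sectors >= (stripe_bytes >> SECTOR_SHIFT);
}

/* The unit-mismatched comparison, kept only to show the failure mode:
 * the limit looks 512 times smaller than it really is. */
bool discard_covers_stripe_mixed_units(uint32_t max_discard_sectors,
                                       uint64_t stripe_bytes)
{
    return max_discard_sectors >= stripe_bytes;
}
```

With an assumed limit of 2097152 sectors, the mixed-unit test still passes for a 1572864-byte stripe but fails for a 2621440-byte one, which is consistent with the symptom appearing only on larger arrays.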
2016-02-10 | dm snapshot: fix hung bios when copy error occurs | Mikulas Patocka | 4 | -19/+12
[ Upstream commit 385277bfb57faac44e92497104ba542cdd82d5fe ] When there is an error copying a chunk dm-snapshot can incorrectly hold associated bios indefinitely, resulting in hung IO. The function copy_callback sets pe->error if there was error copying the chunk, and then calls complete_exception. complete_exception calls pending_complete on error, otherwise it calls commit_exception with commit_callback (and commit_callback calls complete_exception). The persistent exception store (dm-snap-persistent.c) assumes that calls to prepare_exception and commit_exception are paired. persistent_prepare_exception increases ps->pending_count and persistent_commit_exception decreases it. If there is a copy error, persistent_prepare_exception is called but persistent_commit_exception is not. This results in the variable ps->pending_count never returning to zero and that causes some pending exceptions (and their associated bios) to be held forever. Fix this by unconditionally calling commit_exception regardless of whether the copy was successful. A new "valid" parameter is added to commit_exception -- when the copy fails this parameter is set to zero so that the chunk that failed to copy (and all following chunks) is not recorded in the snapshot store. Also, remove commit_callback now that it is merely a wrapper around pending_complete. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
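The pairing requirement can be sketched as a counter that every prepare increments and every commit decrements. The types and functions below are hypothetical simplifications; in particular, the sketch only skips recording the failed chunk and does not model "all following chunks".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the persistent exception store. */
struct store {
    int pending_count; /* prepared exceptions not yet committed */
    int recorded;      /* chunks recorded in the snapshot store */
};

void prepare_exception(struct store *s)
{
    s->pending_count++;
}

/* Called exactly once per prepare. 'valid' is false when the copy
 * failed, in which case the chunk is not recorded. */
void commit_exception(struct store *s, bool valid)
{
    assert(s->pending_count > 0);
    s->pending_count--;
    if (valid)
        s->recorded++;
}

/* Copy completion: commit unconditionally so the count can reach zero. */
void copy_done(struct store *s, bool copy_ok)
{
    commit_exception(s, copy_ok);
}
```

If `copy_done()` skipped the commit on failure, `pending_count` would stay above zero forever, which is the hang the commit message describes.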
2016-02-10 | bcache: Change refill_dirty() to always scan entire disk if necessary | Kent Overstreet | 1 | -7/+30
[ Upstream commit 627ccd20b4ad3ba836472468208e2ac4dfadbf03 ] Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: prevent crash on changing writeback_running | Stefan Bader | 1 | -1/+2
[ Upstream commit 8d16ce540c94c9d366eb36fc91b7154d92d6397b ] Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: allows use of register in udev to avoid "device_busy" error. | Gabriel de Perthuis | 1 | -2/+3
[ Upstream commit d7076f21629f8f329bca4a44dc408d94670f49e2 ] Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: unregister reboot notifier if bcache fails to unregister device | Zheng Liu | 1 | -1/+3
[ Upstream commit 2ecf0cdb2b437402110ab57546e02abfa68a716b ] In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: fix a leak in bch_cached_dev_run() | Al Viro | 1 | -1/+4
[ Upstream commit 4d4d8573a8451acc9f01cbea24b7e55f04a252fe ] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device | Zheng Liu | 1 | -0/+2
[ Upstream commit fecaee6f20ee122ad75402c53d8278f9bb142ddc ]

This bug can be reproduced by the following script:

    #!/bin/bash

    bcache_sysfs="/sys/fs/bcache"

    function clear_cache()
    {
        if [ ! -e $bcache_sysfs ]; then
            echo "no bcache sysfs"
            exit
        fi

        cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
        sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
        sleep 5
        sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
    }

    for ((i=0;i<10;i++)); do
        clear_cache
    done

The warning messages look like below:

    [ 275.948611] ------------[ cut here ]------------
    [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- )
    [ 275.979253] Hardware name: Tecal RH2285
    [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
    [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
    [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
    [ 276.089315] Call Trace:
    [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
    [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
    [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
    [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
    [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
    [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
    [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
    [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
    [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
    [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
    [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
    [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90
    [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
    [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
    [ 276.338241] ------------[ cut here ]------------
    [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- )
    [ 276.386017] Hardware name: Tecal RH2285
    [ 276.401430] Couldn't create device <-> cache set symlinks
    [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
    [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
    [ 276.482169] Call Trace:
    [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
    [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
    [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
    [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
    [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
    [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
    [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
    [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
    [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90
    [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
    [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---

The BCACHE_DEV_UNLINK_DONE flag is not cleared in bcache_device_attach() when a backing device is attached for the first time. After this backing device is detached, the flag is set and sysfs_remove_link() is no longer called in bcache_device_unlink(). When the backing device is attached again, sysfs_create_link() then returns an EEXIST error in bcache_device_link().

The fix is trivial: clear this flag in bcache_device_link().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
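The stale-flag problem can be modelled with two booleans. This is a deliberately reduced sketch: `bdev_sketch`, `sketch_attach()` and `sketch_detach()` are hypothetical, and a single attach step stands in for both the attach and link functions named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bdev_sketch {
    bool unlink_done; /* stands in for BCACHE_DEV_UNLINK_DONE */
    bool link_exists; /* stands in for the sysfs symlink */
};

/* Returns 0 on success, -1 if the link already exists (the EEXIST case). */
int sketch_attach(struct bdev_sketch *d)
{
    assert(d != NULL);
    if (d->link_exists)
        return -1;
    d->link_exists = true;
    d->unlink_done = false; /* clear the stale flag on every (re)attach */
    return 0;
}

/* The link is removed only if it has not been marked as removed already. */
void sketch_detach(struct bdev_sketch *d)
{
    assert(d != NULL);
    if (!d->unlink_done) {
        d->link_exists = false;
        d->unlink_done = true;
    }
}
```

Without the `unlink_done = false` line, the second detach would skip the removal and the next attach would fail with the duplicate-link error.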
2016-02-10 | bcache: Add a cond_resched() call to gc | Kent Overstreet | 1 | -0/+1
[ Upstream commit c5f1e5adf956e3ba82d204c7c141a75da9fa449a ] Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-10 | bcache: fix a livelock when we cause a huge number of cache misses | Zheng Liu | 1 | -1/+3
[ Upstream commit 2ef9ccbfcb90cf84bdba320a571b18b05c41101b ]

Subject : [PATCH v2] bcache: fix a livelock in btree lock
Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)

This commit tries to fix a livelock in bcache. This livelock might happen when we cause a huge number of cache misses simultaneously.

When we get a cache miss, bcache will execute the following path.

    ->cached_dev_make_request()
      ->cached_dev_read()
        ->cached_lookup()
          ->bch->btree_map_keys()
            ->btree_root() <-------------------------
              ->bch_btree_map_keys_recurse()        |
                ->cache_lookup_fn()                 |
                  ->cached_dev_cache_miss()         |
                    ->bch_btree_insert_check_key() -|
                      [If btree->seq is not equal to seq + 1, we should
                       return EINTR and traverse btree again.]

In bch_btree_insert_check_key() we first need to check the upgrade flag (op->lock == -1), and when this flag is true we need to release the read btree->lock and try to take the write btree->lock. While this write lock is taken and released, btree->seq is monotonically increased in order to prevent other threads from modifying this node during a cache miss (see btree.h:74). But if many requests cause cache misses at the same time, we could hit a livelock because btree->seq is always being changed by others. Thus no one can make progress.

This commit will try to take the write btree->lock if it encounters a race when we traverse the btree. Although this sacrifices scalability, it ensures that only one thread can modify the btree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-02 | dm thin: fix race condition when destroying thin pool workqueue | Nikolay Borisov | 1 | -2/+2
[ Upstream commit 18d03e8c25f173f4107a40d0b8c24defb6ed69f3 ] When a thin pool is being destroyed delayed work items are cancelled using cancel_delayed_work(), which doesn't guarantee that on return the delayed item isn't running. This can cause the work item to requeue itself on an already destroyed workqueue. Fix this by using cancel_delayed_work_sync() which guarantees that on return the work item is not running anymore. Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second") Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-15 | md/raid5: fix locking in handle_stripe_clean_event() | Roman Gushchin | 1 | -2/+4
[ Upstream commit b8a9d66d043ffac116100775a469f05f5158c16f ]

After commit 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()") __find_stripe() is called under conf->hash_locks + hash. But handle_stripe_clean_event() calls remove_hash() under conf->device_lock.

Under some circumstances the hash chain can be circuited, and we get an infinite loop with disabled interrupts and locked hash lock in __find_stripe(). This leads to a hard lockup on multiple CPUs and a subsequent system crash.

I was able to reproduce this behavior on raid6 over 6 ssd disks. The devices_handle_discard_safely option should be set to enable trim support. The following script was used:

    for i in `seq 1 32`; do
        dd if=/dev/zero of=large$i bs=10M count=100 &
    done

neilb: original was against a 3.x kernel. I forward-ported to 4.3-rc. This version is suitable for any kernel since Commit: 59fc630b8b5f ("RAID5: batch adjacent full stripe write") (v4.1+). I'll post a version for earlier kernels to stable.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Fixes: 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()")
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org> # 3.13 - 4.2
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-15 | Revert "md: allow a partially recovered device to be hot-added to an array." | NeilBrown | 1 | -2/+1
[ Upstream commit d01552a76d71f9879af448e9142389ee9be6e95b ]

This reverts commit 7eb418851f3278de67126ea0c427641ab4792c57.

This commit is poorly justified, I can find no discussion in email, and it clearly causes a problem.

If a device which is being recovered fails and is subsequently re-added to an array, there could easily have been changes to the array *before* the point where the recovery was up to. So the recovery must start again from the beginning.

If a spare is being recovered and fails, then when it is re-added we really should do a bitmap-based recovery up to the recovery-offset, and then a full recovery from there. Before this reversion, we only did the "full recovery from there" which is not correct. After this reversion we will do a full recovery from the start, which is safer but not ideal.

It will be left to a future patch to arrange the two different styles of recovery.

Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (3.14+)
Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-15 | md/raid10: submit_bio_wait() returns 0 on success | Jes Sorensen | 1 | -1/+1
[ Upstream commit 681ab4696062f5aa939c9e04d058732306a97176 ] This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-15 | md/raid1: submit_bio_wait() returns 0 on success | Jes Sorensen | 1 | -1/+1
[ Upstream commit 203d27b0226a05202438ddb39ef0ef1acb14a759 ] This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b which changed the return value of submit_bio_wait() to return != 0 on error, but didn't update the caller accordingly. Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md") Cc: stable@vger.kernel.org (v3.10) Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
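Both submit_bio_wait() entries above come down to a caller reading a return value under the wrong convention. A small sketch of that failure mode, using a hypothetical `submit_and_wait()` in place of the real call and an assumed error value of -5:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a call that returns 0 on success and a
 * negative errno-style value on failure. */
int submit_and_wait(bool device_ok)
{
    return device_ok ? 0 : -5; /* -5 mirrors -EIO */
}

/* Caller written for the "0 means success" convention. */
bool write_succeeded(bool device_ok)
{
    int rc = submit_and_wait(device_ok);

    assert(rc <= 0); /* by convention the call never returns a positive value */
    return rc == 0;
}

/* Caller still treating non-zero as success: the result is inverted. */
bool write_succeeded_stale(bool device_ok)
{
    return submit_and_wait(device_ok) != 0;
}
```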
2015-11-15 | dm btree: fix leak of bufio-backed block in btree_split_beneath error path | Mike Snitzer | 1 | -1/+1
[ Upstream commit 4dcb8b57df3593dcb20481d9d6cf79d1dc1534be ] btree_split_beneath()'s error path had an outstanding FIXME that speaks directly to the potential for _not_ cleaning up a previously allocated bufio-backed block. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-15 | dm btree remove: fix a bug when rebalancing nodes after removal | Joe Thornber | 1 | -6/+11
[ Upstream commit 2871c69e025e8bc507651d5a9cf81a8a7da9d24b ] Commit 4c7e309340ff ("dm btree remove: fix bug in redistribute3") wasn't a complete fix for redistribute3(). The redistribute3 function takes 3 btree nodes and shares out the entries evenly between them. If the three nodes in total contained (MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in the center. Fix this issue by being more careful about calculating the target number of entries for the left and right nodes. Unit tested in userspace using this program: https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
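The arithmetic behind the overflow can be checked with a short sketch. `split3()` is one illustrative way to pick targets that never exceed the per-node maximum; it is not taken from the kernel patch. `split3_truncating()` reproduces the split described in the message, and the limit of 126 entries is an arbitrary value chosen for the example.

```c
#include <assert.h>

struct targets {
    unsigned left, center, right;
};

/* Split 'total' entries across three nodes so that no node exceeds
 * max_entries. Illustrative arithmetic only. */
struct targets split3(unsigned total, unsigned max_entries)
{
    struct targets t;
    unsigned base = total / 3;
    unsigned rem = total % 3;

    assert(total <= 3 * max_entries);
    t.left = base + (rem > 0); /* left takes one of the leftover entries */
    t.right = base;
    t.center = total - t.left - t.right;
    assert(t.left <= max_entries && t.center <= max_entries);
    return t;
}

/* The truncating split: left and right get total / 3 and the center
 * absorbs whatever is left over. */
struct targets split3_truncating(unsigned total)
{
    struct targets t;

    t.left = t.right = total / 3;
    t.center = total - t.left - t.right;
    return t;
}
```

For total = 3 * 126 - 1 = 377, the truncating split gives 125, 127, 125 (center over the limit), while `split3()` gives 126, 126, 125.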
2015-11-13 | dm thin: fix missing pool reference count decrement in pool_ctr error path | Mike Snitzer | 1 | -1/+1
[ Upstream commit ba30670f4d5292c4e7f7980bbd5071f7c4794cdd ] Fixes: ac8c3f3df ("dm thin: generate event when metadata threshold passed") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-28 | md: flush ->event_work before stopping array. | NeilBrown | 1 | -0/+2
[ Upstream commit ee5d004fd0591536a061451eba2b187092e9127c ] The 'event_work' worker used by dm-raid may still be running when the array is stopped. This can result in an oops. So flush the workqueue on which it is run after detaching and before destroying the device. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release) Fixes: 9d09e663d550 ("dm: raid456 basic support") Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-28 | dm cache: fix NULL pointer when switching from cleaner policy | Joe Thornber | 1 | -1/+1
[ Upstream commit 2bffa1503c5c06192eb1459180fac4416575a966 ] The cleaner policy doesn't make use of the per cache block hint space in the metadata (unlike the other policies). When switching from the cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block) was observed. The crash was caused by bugs in dm-cache-metadata.c when trying to skip creation of the hint btree. The minimal fix is to change hint size for the cleaner policy to 4 bytes (only hint size supported). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-28 | dm raid: fix round up of default region size | Mikulas Patocka | 1 | -2/+1
[ Upstream commit 042745ee53a0a7c1f5aff191a4a24213c6dcfb52 ] Commit 3a0f9aaee028 ("dm raid: round region_size to power of two") intended to make sure that the default region size is a power of two. However, the logic in that commit is incorrect and sets the variable region_size to 0 or 1, depending on whether min_region_size is a power of two. Fix this logic, using roundup_pow_of_two(), so that region_size is properly rounded up to the next power of two. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 3a0f9aaee028 ("dm raid: round region_size to power of two") Cc: stable@vger.kernel.org # v3.8+ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
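The difference between rounding up and merely testing for a power of two is easy to see in isolation. `round_up_pow2()` is a user-space stand-in for the kernel's roundup_pow_of_two(), and `not_power_of_two_flag()` is a hypothetical reconstruction of the kind of expression that yields only 0 or 1; neither is the dm-raid code.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two that is >= n, for 1 <= n <= 2^63. */
uint64_t round_up_pow2(uint64_t n)
{
    uint64_t p = 1;

    assert(n >= 1 && n <= (UINT64_C(1) << 63));
    while (p < n)
        p <<= 1;
    return p;
}

/* A boolean test mistaken for a rounding step: the result is 0 when n is
 * a power of two and 1 otherwise, never a usable region size. */
uint64_t not_power_of_two_flag(uint64_t n)
{
    return !(n != 0 && (n & (n - 1)) == 0);
}
```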
2015-10-28 | dm btree: add ref counting ops for the leaves of top level btrees | Joe Thornber | 4 | -15/+47
[ Upstream commit b0dc3c8bc157c60b1d470163882be8c13e1950af ] When using nested btrees, the top leaves of the top levels contain block addresses for the root of the next tree down. If we shadow a shared leaf node the leaf values (sub tree roots) should be incremented accordingly. This is only an issue if there is metadata sharing in the top levels. Which only occurs if metadata snapshots are being used (as is possible with dm-thinp). And could result in a block from the thinp metadata snap being reused early, thus corrupting the thinp metadata snap. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-10-07 | md/raid10: always set reshape_safe when initializing reshape_position. | NeilBrown | 1 | -1/+4
[ Upstream commit 299b0685e31c9f3dcc2d58ee3beca761a40b44b3 ] 'reshape_position' tracks where in the reshape we have reached. 'reshape_safe' tracks where in the reshape we have safely recorded in the metadata. These are compared to determine when to update the metadata. So it is important that reshape_safe is initialised properly. Currently it isn't. When starting a reshape from the beginning it usually has the correct value by luck. But when reducing the number of devices in a RAID10, it has the wrong value and this leads to the metadata not being updated correctly. This can lead to corruption if the reshape is not allowed to complete. This patch is suitable for any -stable kernel which supports RAID10 reshape, which is 3.5 and later. Fixes: 3ea7daa5d7fd ("md/raid10: add reshape support") Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-09-16 | dm thin metadata: delete btrees when releasing metadata snapshot | Joe Thornber | 1 | -2/+2
[ Upstream commit 7f518ad0a212e2a6fd68630e176af1de395070a7 ] The device details and mapping trees were just being decremented before. Now btree_del() is called to do a deep delete. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-08-27 | md/bitmap: return an error when bitmap superblock is corrupt. | NeilBrown | 1 | -0/+2
[ Upstream commit HEAD ]

commit b97e92574c0bf335db1cd2ec491d8ff5cd5d0b49 upstream
Use separate bitmaps for each nodes in the cluster

bitmap_read_sb() validates the bitmap superblock that it reads in. If it finds an inconsistency like a bad magic number or out-of-range version number, it prints an error and returns, but it incorrectly returns zero, so the array is still assembled with the (invalid) bitmap.

This means it could try to use a bitmap with a new version number which it therefore does not understand.

This bug was introduced in 3.5 and fixed as part of a larger patch in 4.1. So the patch is suitable for any -stable kernel in that range.

Fixes: 27581e5ae01f ("md/bitmap: centralise allocation of bitmap file pages.")
Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: GuoQing Jiang <gqjiang@suse.com>
(cherry picked from commit ed9691677d6dda3fff331673f44d18e85938bd76)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
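The bug class here is a validation routine whose failure paths still return success. One common way to avoid it is to start from a pessimistic error value and set it to zero only after every check has passed. The structure, constants and function below are placeholders for this sketch and are not the md bitmap definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC   0x6d746962u /* placeholder magic for this sketch */
#define SKETCH_VER_MIN 3u          /* placeholder version range */
#define SKETCH_VER_MAX 4u

struct sb_sketch {
    uint32_t magic;
    uint32_t version;
};

/* Returns 0 if the superblock looks sane, -EINVAL otherwise. Every failed
 * check reaches the caller as a non-zero result. */
int validate_sb(const struct sb_sketch *sb)
{
    int err = -EINVAL; /* pessimistic default */

    assert(sb != NULL);
    if (sb->magic != SKETCH_MAGIC)
        goto out;
    if (sb->version < SKETCH_VER_MIN || sb->version > SKETCH_VER_MAX)
        goto out;
    err = 0; /* reached only when all checks passed */
out:
    return err;
}
```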
2015-08-27 | md/raid1: fix test for 'was read error from last working device'. | NeilBrown | 1 | -1/+1
[ Upstream commit 34cab6f42003cb06f48f86a86652984dec338ae9 ]

When we get a read error from the last working device, we don't try to repair it, and don't fail the device. We simply report a read error to the caller.

However the current test for 'is this the last working device' is wrong. When there is only one fully working device, it assumes that a non-faulty device is that device. However a spare which is rebuilding would be non-faulty but is not the only working device.

So change the test from "!Faulty" to "In_sync". If ->degraded says there is only one fully working device and this device is in_sync, this must be the one.

This bug has existed since we allowed read_balance to read from a recovering spare in v3.0

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
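The two tests can be compared on a two-device mirror with one good member and one rebuilding spare. The structure and helpers below are hypothetical; only the flag names follow the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state; the flag names follow the text. */
struct dev_sketch {
    bool faulty;  /* device has failed */
    bool in_sync; /* device holds a complete, current copy */
};

/* 'degraded' counts members that are missing or not yet recovered. With
 * raid_disks - degraded == 1 there is exactly one fully working device. */
bool is_last_working_device(const struct dev_sketch *d, int raid_disks,
                            int degraded)
{
    assert(degraded >= 0 && degraded <= raid_disks);
    return d->in_sync && (raid_disks - degraded == 1);
}

/* The weaker test: any non-faulty device qualifies, including a spare
 * that is still rebuilding. */
bool is_last_working_device_weak(const struct dev_sketch *d, int raid_disks,
                                 int degraded)
{
    return !d->faulty && (raid_disks - degraded == 1);
}
```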
2015-08-27 | md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies | NeilBrown | 1 | -4/+6
[ Upstream commit 423f04d63cf421ea436bcc5be02543d549ce4b28 ]

raid1_end_read_request() assumes that the In_sync bits are consistent with the ->degraded count. raid1_spare_active() updates the In_sync bit before the ->degraded count and so exposes an inconsistency, as does error().

So extend the spinlock in raid1_spare_active() and error() to hide those inconsistencies.

This should probably be part of
  Commit: 34cab6f42003 ("md/raid1: fix test for 'was read error from last working device'.")
as it addresses the same issue. It fixes the same bug and should go to -stable for the same reasons.

Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-08-21 | md: use kzalloc() when bitmap is disabled | Benjamin Randazzo | 1 | -2/+1
[ Upstream commit 33afeac21b9cb79ad8fc5caf239af89c79e25e1e ]

commit b6878d9e03043695dbf3fa1caa6dfc09db225b16 upstream.

In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a mdu_bitmap_file_t called "file".

    file = kmalloc(sizeof(*file), GFP_NOIO);
    if (!file)
        return -ENOMEM;

This structure is copied to user space at the end of the function.

    if (err == 0 &&
        copy_to_user(arg, file, sizeof(*file)))
        err = -EFAULT;

But if bitmap is disabled only the first byte of "file" is initialized with zero, so it's possible to read some bytes (up to 4095) of kernel space memory from user space. This is an information leak.

    /* bitmap disabled, zero the first byte and copy out */
    if (!mddev->bitmap_info.file)
        file->pathname[0] = '\0';

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
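A user-space analogue of the kmalloc()/kzalloc() distinction is malloc() versus calloc(). The sketch below assumes a 4096-byte pathname buffer (consistent with the "up to 4095" figure above) and uses hypothetical names; it is not the md.c code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PATHNAME_LEN 4096 /* assumed buffer size for this sketch */

/* Hypothetical reply structure that is copied out to the caller in full. */
struct bitmap_file_reply {
    char pathname[PATHNAME_LEN];
};

/* Zero-filled allocation: every byte that is later copied out is defined,
 * even when only pathname[0] is written explicitly. */
struct bitmap_file_reply *alloc_reply(void)
{
    struct bitmap_file_reply *r = calloc(1, sizeof(*r));

    if (r)
        r->pathname[0] = '\0'; /* the "bitmap disabled" case */
    return r;
}

/* Returns 1 if every byte of the reply is zero, 0 otherwise. */
int reply_is_all_zero(const struct bitmap_file_reply *r)
{
    static const char zeros[PATHNAME_LEN];

    assert(r != NULL);
    return memcmp(r->pathname, zeros, sizeof(zeros)) == 0;
}
```

With malloc() in place of calloc(), the 4095 bytes after pathname[0] would hold whatever the allocator last stored there.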
2015-08-04md: fix a build warningFiro Yang1-1/+1
[ Upstream commit 4e023612325a9034a542bfab79f78b1fe5ebb841 ] Warning like this:
    drivers/md/md.c: In function "update_array_info":
    drivers/md/md.c:6394:26: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses]
        !mddev->persistent != info->not_persistent||
Fix it as Neil Brown said:
    mddev->persistent != !info->not_persistent ||
Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
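A small check can show that the rewritten comparison keeps the same truth table as the original while avoiding the ambiguous-looking form. The sketch assumes both fields only ever hold 0 or 1 (an assumption not stated in the commit); the function names are hypothetical.

```c
#include <assert.h>

/* Makes the 0/1 assumption explicit. */
static void check_flag_sketch(int v)
{
    assert(v == 0 || v == 1);
}

/* Original form, parenthesised the way the compiler actually parses it. */
static int mismatch_original_sketch(int persistent, int not_persistent)
{
    check_flag_sketch(persistent);
    check_flag_sketch(not_persistent);
    return (!persistent) != not_persistent;
}

/* Form used by the fix: the negation moves to the right-hand operand. */
static int mismatch_fixed_sketch(int persistent, int not_persistent)
{
    check_flag_sketch(persistent);
    check_flag_sketch(not_persistent);
    return persistent != !not_persistent;
}

/* Returns 1 if both forms agree for every 0/1 combination, else 0. */
static int forms_agree_sketch(void)
{
    for (int p = 0; p <= 1; p++)
        for (int np = 0; np <= 1; np++)
            if (mismatch_original_sketch(p, np) != mismatch_fixed_sketch(p, np))
                return 0;
    return 1;
}
```

For values outside 0 and 1 the two forms can still differ from what a reader might expect, which is why the 0/1 assumption is asserted rather than implied.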
2015-08-04dm btree: silence lockdep lock inversion in dm_btree_del()Joe Thornber1-1/+1
[ Upstream commit 1c7518794a3647eb345d59ee52844e8a40405198 ] Allocate memory using GFP_NOIO when deleting a btree. dm_btree_del() can be called via an ioctl and we don't want to recurse into the FS or block layer. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-08-04dm btree remove: fix bug in redistribute3Dennis Yang1-3/+3
[ Upstream commit 4c7e309340ff85072e96f529582d159002c36734 ] redistribute3() shares entries out across 3 nodes. Some entries were being moved the wrong way, breaking the ordering. This manifested as a BUG() in dm-btree-remove.c:shift() when entries were removed from the btree. For additional context see: https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html Signed-off-by: Dennis Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-07-04dm stats: fix divide by zero if 'number_of_areas' arg is zeroMikulas Patocka1-0/+2
[ Upstream commit dd4c1b7d0c95be1c9245118a3accc41a16f1db67 ] If the number_of_areas argument was zero the kernel would crash on div-by-zero. Add better input validation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
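The kind of input validation described above might look like the following. This is a hypothetical sketch, not the dm-stats code: the helper name, its parameters, and the round-up step computation are assumptions chosen only to show a zero divisor being rejected before the division.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: splits `len` sectors into `number_of_areas` areas and
 * stores the per-area size (rounded up) in *step.
 * Returns 0 on success, -EINVAL if the divisor would be zero.
 */
static int area_step_sketch(unsigned long long len, unsigned number_of_areas,
                            unsigned long long *step)
{
    assert(step != NULL);
    if (number_of_areas == 0)
        return -EINVAL;  /* reject before dividing */
    *step = len / number_of_areas + (len % number_of_areas != 0);
    return 0;
}
```

Returning an error code keeps the failure on the caller's normal error path instead of trapping in the division.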
2015-07-04dm space map metadata: fix occasional leak of a metadata block on resizeJoe Thornber1-15/+35
[ Upstream commit 6096d91af0b65a3967139b32d5adbb3647858a26 ] The metadata space map has a simplified 'bootstrap' mode that is operational when extending the space maps. Whilst in this mode it's possible for some refcount decrement operations to become queued (eg, as a result of shadowing one of the bitmap indexes). These decrements were not being applied when switching out of bootstrap mode. The effect of this bug was the leaking of a 4k metadata block. This is detected by the latest version of thin_check as a non fatal error. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-14md/raid0: fix restore to sector variable in raid0_make_requestEric Work1-1/+3
[ Upstream commit a81157768a00e8cf8a7b43b5ea5cac931262374f ] The variable "sector" in "raid0_make_request()" was improperly updated by a call to "sector_div()" which modifies its first argument in place. Commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd restored this variable after the call for later re-use. Unfortunately the restore was done after the referenced variable "bio" was advanced. This led to the original value and the restored value being different. Here we move this line to the proper place. One observed side effect of this bug was that discarding a file through unlinking would cause an unrelated file's contents to be discarded. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 47d68979cc96 ("md/raid0: fix bug with chunksize not a power of 2.") Cc: stable@vger.kernel.org (any that received above backport) URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501 Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-10md/raid5: don't record new size if resize_stripes fails.NeilBrown1-1/+2
[ Upstream commit 6e9eac2dcee5e19f125967dd2be3e36558c42fff ] If any memory allocation in resize_stripes fails we will return -ENOMEM, but in some cases we update conf->pool_size anyway. This means that if we try again, the allocations will be assumed to be larger than they are, and badness results. So only update pool_size if there is no error. This bug was introduced in 2.6.17 and the patch is suitable for -stable. Fixes: ad01c9e3752f ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array") Cc: stable@vger.kernel.org (v2.6.17+) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
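The rule "only update pool_size if there is no error" can be sketched as below. The struct, the function, and the simulate_alloc_failure switch are hypothetical; the real resize_stripes() performs several allocations, which this sketch collapses into a single simulated outcome.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of the raid5 conf that tracks the size. */
struct conf_sketch {
    int pool_size;
};

/* Records the new size only when the (simulated) allocations succeeded. */
static int resize_stripes_sketch(struct conf_sketch *conf, int newsize,
                                 int simulate_alloc_failure)
{
    int err = 0;

    assert(conf != NULL);
    if (simulate_alloc_failure)
        err = -ENOMEM;
    if (!err)
        conf->pool_size = newsize;  /* skipped on failure, so a retry sees the true size */
    return err;
}
```

After a failed call the recorded size is unchanged, so a later retry does not assume the allocations are larger than they really are.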
2015-05-23Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"Rabin Vincent1-6/+6
[ Upstream commit c0403ec0bb5a8c5b267fb7e16021bec0b17e4964 ] This reverts Linux 4.1-rc1 commit 0618764cb25f6fa9fb31152995de42a8a0496475. The problem which that commit attempts to fix actually lies in the Freescale CAAM crypto driver not dm-crypt. dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means that the crypto driver should internally backlog requests which arrive when the queue is full and process them later. Until the crypto hw's queue becomes full, the driver returns -EINPROGRESS. When the crypto hw's queue is full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is expected to backlog the request and process it when the hardware has queue space. At the point when the driver takes the request from the backlog and starts processing it, it calls the completion function with a status of -EINPROGRESS. The completion function is called (for a second time, in the case of backlogged requests) with a status/err of 0 when a request is done. Crypto drivers for hardware without hardware queueing use the crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request() and crypto_get_backlog() helpers to implement this behaviour correctly, while others implement this behaviour without these helpers (ccp, for example). dm-crypt (before the patch that needs reverting) uses this API correctly. It queues up as many requests as the hw queues will allow (i.e. as long as it gets back -EINPROGRESS from the request function). Then, when it sees at least one backlogged request (gets -EBUSY), it waits till that backlogged request is handled (completion gets called with -EINPROGRESS), and then continues. The references to af_alg_wait_for_completion() and af_alg_complete() in that commit's commit message are irrelevant because those functions only handle one request at a time, unlike dm-crypt. 
The problem is that the Freescale CAAM driver, which that commit describes as having been tested with, fails to implement the backlogging behaviour correctly. In caam_jr_enqueue(), if the hardware queue is full, it simply returns -EBUSY without backlogging the request. What the observed deadlock was is not described in the commit message but it is obviously the wait_for_completion() in crypto_convert() where dm-crypt would wait for the completion being called with -EINPROGRESS in the case of backlogged requests. This completion will never be completed due to the bug in the CAAM driver. Commit 0618764cb25 incorrectly made dm-crypt wait for every request, even when the driver/hardware queues are not full, which means that dm-crypt will never see -EBUSY. This means that that commit will cause a performance regression on all crypto drivers which implement the API correctly. Revert it. Correct backlog handling should be implemented in the CAAM driver instead. Cc'ing stable purely because commit 0618764cb25 did. If for some reason a stable@ kernel did pick up commit 0618764cb25 it should get reverted. Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Horia Geanta <horia.geanta@freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-18Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"Ben Collins1-6/+6
[ Upstream commit c0403ec0bb5a8c5b267fb7e16021bec0b17e4964 ] This reverts Linux 4.1-rc1 commit 0618764cb25f6fa9fb31152995de42a8a0496475. The problem which that commit attempts to fix actually lies in the Freescale CAAM crypto driver not dm-crypt. dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG. This means that the crypto driver should internally backlog requests which arrive when the queue is full and process them later. Until the crypto hw's queue becomes full, the driver returns -EINPROGRESS. When the crypto hw's queue is full, the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is expected to backlog the request and process it when the hardware has queue space. At the point when the driver takes the request from the backlog and starts processing it, it calls the completion function with a status of -EINPROGRESS. The completion function is called (for a second time, in the case of backlogged requests) with a status/err of 0 when a request is done. Crypto drivers for hardware without hardware queueing use the crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request() and crypto_get_backlog() helpers to implement this behaviour correctly, while others implement this behaviour without these helpers (ccp, for example). dm-crypt (before the patch that needs reverting) uses this API correctly. It queues up as many requests as the hw queues will allow (i.e. as long as it gets back -EINPROGRESS from the request function). Then, when it sees at least one backlogged request (gets -EBUSY), it waits till that backlogged request is handled (completion gets called with -EINPROGRESS), and then continues. The references to af_alg_wait_for_completion() and af_alg_complete() in that commit's commit message are irrelevant because those functions only handle one request at a time, unlike dm-crypt. 
The problem is that the Freescale CAAM driver, which that commit describes as having been tested with, fails to implement the backlogging behaviour correctly. In caam_jr_enqueue(), if the hardware queue is full, it simply returns -EBUSY without backlogging the request. What the observed deadlock was is not described in the commit message but it is obviously the wait_for_completion() in crypto_convert() where dm-crypt would wait for the completion being called with -EINPROGRESS in the case of backlogged requests. This completion will never be completed due to the bug in the CAAM driver. Commit 0618764cb25 incorrectly made dm-crypt wait for every request, even when the driver/hardware queues are not full, which means that dm-crypt will never see -EBUSY. This means that that commit will cause a performance regression on all crypto drivers which implement the API correctly. Revert it. Correct backlog handling should be implemented in the CAAM driver instead. Cc'ing stable purely because commit 0618764cb25 did. If for some reason a stable@ kernel did pick up commit 0618764cb25 it should get reverted. Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Horia Geanta <horia.geanta@freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-18md/raid0: fix bug with chunksize not a power of 2.NeilBrown1-1/+2
[ Upstream commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd ] Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 in v3.14-rc1 RAID0 has performed incorrect calculations when the chunksize is not a power of 2. This happens because "sector_div()" modifies its first argument, but this wasn't taken into account in the patch. So restore that first arg before re-using the variable. Reported-by: Joe Landman <joe.landman@gmail.com> Reported-by: Dave Chinner <david@fromorbit.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Cc: stable@vger.kernel.org (3.14 and later). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
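The in-place behaviour of sector_div() that this commit works around can be imitated in userspace. In the sketch below, sector_div_sketch() is a hypothetical pointer-taking imitation of the kernel macro (quotient stored back into the first argument, remainder returned), and the chunk_start_* functions are illustrative only, not the raid0 mapping code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_sketch_t;

/* Imitation of sector_div(): *n becomes the quotient, the remainder is returned. */
static uint32_t sector_div_sketch(sector_sketch_t *n, uint32_t base)
{
    uint32_t rem;

    assert(base != 0);
    rem = (uint32_t)(*n % base);
    *n /= base;
    return rem;
}

/* Buggy: keeps using `sector` after the division overwrote it. */
static sector_sketch_t chunk_start_buggy_sketch(sector_sketch_t sector,
                                                uint32_t chunk_sects)
{
    uint32_t offset = sector_div_sketch(&sector, chunk_sects);

    /* `sector` now holds the chunk index, not the original sector. */
    return sector - offset;
}

/* Fixed: restores the first argument before re-using the variable. */
static sector_sketch_t chunk_start_fixed_sketch(sector_sketch_t sector,
                                                uint32_t chunk_sects)
{
    sector_sketch_t saved = sector;
    uint32_t offset = sector_div_sketch(&sector, chunk_sects);

    sector = saved;  /* undo the in-place modification */
    return sector - offset;
}
```

For sector 100 and a 24-sector chunk (not a power of 2), the fixed version returns 96, the first sector of the chunk, while the buggy version computes 4 - 4 = 0.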
2015-04-17dm snapshot: suspend merging snapshot when doing exception handoverMikulas Patocka2-6/+42
[ Upstream commit 09ee96b21456883e108c3b00597bb37ec512151b ] The "dm snapshot: suspend origin when doing exception handover" commit fixed an exception store handover bug associated with pending exceptions to the "snapshot-origin" target. However, a similar problem exists in snapshot merging. When snapshot merging is in progress, we use the target "snapshot-merge" instead of "snapshot-origin". Consequently, during exception store handover, we must find the snapshot-merge target and suspend its associated mapped_device. To avoid lockdep warnings, the target must be suspended and resumed without holding _origins_lock. Introduce a dm_hold() function that grabs a reference on a mapped_device, but unlike dm_get(), it doesn't crash if the device has the DMF_FREEING flag set; it returns an error in this case. In snapshot_resume() we grab the reference to the origin device using dm_hold() while holding _origins_lock (_origins_lock guarantees that the device won't disappear). Then we release _origins_lock, suspend the device and grab _origins_lock again. NOTE to stable@ people: When backporting to kernels 3.18 and older, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-17dm snapshot: suspend origin when doing exception handoverMikulas Patocka2-9/+86
[ Upstream commit b735fede8d957d9d255e9c5cf3964cfa59799637 ] In the function snapshot_resume we perform exception store handover. If there is another active snapshot target, the exception store is moved from this target to the target that is being resumed. The problem is that if there is some pending exception, it will point to an incorrect exception store after that handover, causing a crash due to dm-snap-persistent.c:get_exception()'s BUG_ON. This bug can be triggered by repeatedly changing snapshot permissions with "lvchange -p r" and "lvchange -p rw" while there are writes on the associated origin device. To fix this bug, we must suspend the origin device when doing the exception store handover to make sure that there are no pending exceptions: - introduce _origin_hash that keeps track of dm_origin structures. - introduce functions __lookup_dm_origin, __insert_dm_origin and __remove_dm_origin that manipulate the origin hash. - modify snapshot_resume so that it calls dm_internal_suspend_fast() and dm_internal_resume_fast() on the origin device. NOTE to stable@ people: When backporting to kernels 3.12-3.18, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. When backporting to kernels older than 3.12, you need to pick functions dm_internal_suspend and dm_internal_resume from the commit fd2ed4d252701d3bbed4cd3e3d267ad469bb832a. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-17dm thin: fix to consistently zero-fill reads to unprovisioned blocksSasha Levin1-11/+0
[ Upstream commit 5f027a3bf184d1d36e68745f7cd3718a8b879cc0 ] It was always intended that a read to an unprovisioned block will return zeroes regardless of whether the pool is in read-only or read-write mode. thin_bio_map() was inconsistent with its handling of such reads when the pool is in read-only mode, it now properly zero-fills the bios it returns in response to unprovisioned block reads. Eliminate thin_bio_map()'s special read-only mode handling of -ENODATA and just allow the IO to be deferred to the worker which will result in pool->process_bio() handling the IO (which already properly zero-fills reads to unprovisioned blocks). Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-17dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAMEDarrick J. Wong1-4/+11
[ Upstream commit e5db29806b99ce2b2640d2e4d4fcb983cea115c5 ] Since it's possible for the discard and write same queue limits to change while the upper level command is being sliced and diced, fix up both of them (a) to reject IO if the special command is unsupported at the start of the function and (b) read the limits once and let the commands error out on their own if the status happens to change. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-17dm: hold suspend_lock while suspending device during device deletionMikulas Patocka1-0/+6
[ Upstream commit ab7c7bb6f4ab95dbca96fcfc4463cd69843e3e24 ] __dm_destroy() must take the suspend_lock so that its presuspend and postsuspend calls do not race with an internal suspend. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-24dm snapshot: fix a possible invalid memory access on unloadMikulas Patocka1-2/+2
commit 22aa66a3ee5b61e0f4a0bfeabcaa567861109ec3 upstream. When the snapshot target is unloaded, snapshot_dtr() waits until pending_exceptions_count drops to zero. Then, it destroys the snapshot. Therefore, the function that decrements pending_exceptions_count should not touch the snapshot structure after the decrement. pending_complete() calls free_pending_exception(), which decrements pending_exceptions_count, and then it performs up_write(&s->lock) and it calls retry_origin_bios() which dereferences s->origin. These two memory accesses to the fields of the snapshot may touch the dm_snapshot structure after it is freed. This patch moves the call to free_pending_exception() to the end of pending_complete(), so that the snapshot will not be destroyed while pending_complete() is in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-24dm: fix a race condition in dm_get_mdMikulas Patocka1-17/+10
commit 2bec1f4a8832e74ebbe859f176d8a9cb20dd97f4 upstream. The function dm_get_md finds a device mapper device with a given dev_t, increases the reference count and returns the pointer. dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the device, tests that the device doesn't have DMF_DELETING or DMF_FREEING flag, drops _minor_lock and returns pointer to the device. dm_get_md then calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag, otherwise it increments the reference count. There is a possible race condition - after dm_find_md exits and before dm_get is called, there are no locks held, so the device may disappear or DMF_FREEING flag may be set, which results in BUG. To fix this bug, we need to call dm_get while we hold _minor_lock. This patch renames dm_find_md to dm_get_md and changes it so that it calls dm_get while holding the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>