[ Upstream commit 248aa2645aa7fc9175d1107c2593cc90d4af5a4e ]
In cases where dec_in_flight() has to requeue the integrity_bio_wait
work to transfer the rest of the data, the bio's __bi_remaining might
already have been decremented to 0, e.g. if the bio passed to the
underlying data device was split via blk_queue_split().
Use dm_bio_{record,restore} rather than effectively open-coding them in
dm-integrity -- these methods now manage __bi_remaining too.
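For reference, the dm_bio_{record,restore} pattern works roughly as
follows (a simplified sketch; the real helpers live in
drivers/md/dm-bio-record.h and save additional fields such as bi_disk
and bi_flags):

  struct dm_bio_details {
          struct bvec_iter bi_iter;
          bio_end_io_t *bi_end_io;
          unsigned int __bi_remaining;
  };

  static inline void dm_bio_record(struct dm_bio_details *bd, struct bio *bio)
  {
          bd->bi_iter = bio->bi_iter;
          bd->bi_end_io = bio->bi_end_io;
          /* may already be 0 if the bio was split by blk_queue_split() */
          bd->__bi_remaining = atomic_read(&bio->__bi_remaining);
  }

  static inline void dm_bio_restore(struct dm_bio_details *bd, struct bio *bio)
  {
          bio->bi_iter = bd->bi_iter;
          bio->bi_end_io = bd->bi_end_io;
          atomic_set(&bio->__bi_remaining, bd->__bi_remaining);
  }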
Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
Reported-by: Daniel Glöckner <dg@emlix.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1b17159e52bb31f982f82a6278acd7fab1d3f67b ]
Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 53770f0ec5fd417429775ba006bc4abe14002335 upstream.
If we need to perform synchronous I/O in dm_integrity_map_continue(),
we must make sure that we are not in the map function - in order to
avoid the deadlock due to bio queuing in generic_make_request. To
avoid the deadlock, we offload the request to metadata_wq.
However, metadata_wq also processes metadata updates for write requests.
If there are too many requests that get offloaded to metadata_wq at the
beginning of dm_integrity_map_continue, the workqueue metadata_wq
becomes clogged and the system is incapable of processing any metadata
updates.
This causes a deadlock because all the requests that need to do metadata
updates wait for metadata_wq to proceed and metadata_wq waits inside
wait_and_add_new_range until some existing request releases its range
lock (which doesn't happen because the range lock is released after
metadata update).
In order to fix the deadlock, we create a new workqueue offload_wq and
offload requests to it - so that processing of offload_wq is independent
from processing of metadata_wq.
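Sketched, the fix has this shape (a hedged sketch; the constant name
METADATA_WORKQUEUE_MAX_ACTIVE and the dio->work field are assumptions
about dm-integrity internals):

  /* dm_integrity_ctr(): a queue dedicated to offloaded map work */
  ic->offload_wq = alloc_workqueue("dm-integrity-offload", WQ_MEM_RECLAIM,
                                   METADATA_WORKQUEUE_MAX_ACTIVE);

  /* map path: offload here instead of metadata_wq, so metadata
   * updates can always make progress */
  queue_work(ic->offload_wq, &dio->work);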
Fixes: 7eada909bfd7 ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 41c526c5af46d4c4dab7f72c99000b7fac0b9702 upstream.
Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 upstream.
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.
An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress. As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries
writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.
We must use drain_workqueue - that waits until the work and all re-queued
works finish.
As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.
With this change, drain_workqueue() can be used effectively. This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
there are no regressions.
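A minimal sketch of the resulting suspend path, assuming the
dm-writecache field names used in the text above:

  static void writecache_suspend(struct dm_target *ti)
  {
          struct dm_writecache *wc = ti->private;

          /* drain waits for the running work item and for anything it
           * re-queues; flush_workqueue() only waited for the former.
           * Safe now that dm_suspended(wc->ti) returns true here. */
          drain_workqueue(wc->writeback_wq);
  }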
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7cdf6a0aae1cccf5167f3f04ecddcf648b78e289 upstream.
The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
while :; do make check T=shell/lvconvert-thin-external-cache.sh; done
The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.
The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.
Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.
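The shape of the fix, sketched (the waker and wq field names follow
dm-cache conventions and are assumptions here):

  static void cache_postsuspend_sketch(struct cache *cache)
  {
          /* _sync also waits for a currently running instance,
           * so the work item cannot re-submit itself afterwards */
          cancel_delayed_work_sync(&cache->waker);

          /* drain also waits for work items queued by running items */
          drain_workqueue(cache->wq);
  }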
Fixes: c6b4fcbad044 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7c02b0055f774ed9afb6e1c7724f33bf148ffdc0 ]
In bset.h, the macro bset_bkey_last() is defined as,
bkey_idx((struct bkey *) (i)->d, (i)->keys)
Parameter i can be a variable of any data structure type; the macro
works as long as the type of struct i has the members 'd' and 'keys'.
bset_bkey_last() is also used in the macro csum_set() to calculate the
checksum of an on-disk data structure. When csum_set() is used to
calculate the checksum of the on-disk bcache super block, the parameter
'i' data type is struct cache_sb_disk. Inside struct cache_sb_disk (and
also in struct cache_sb) the member keys is of type __u16, but
bkey_idx() expects an unsigned int (32 bits wide), so there is a
problem when sending parameters via the stack to call bkey_idx().
The Sparse tool from the Intel 0day kbuild system reports this
incompatibility. bkey_idx() is part of the user-space API, so the
simplest fix is to cast (i)->keys to unsigned int type in the macro
bset_bkey_last().
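The fix therefore amounts to a one-line cast along these lines:

  #define bset_bkey_last(i)  bkey_idx((struct bkey *)(i)->d, \
                                      (unsigned int)(i)->keys)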
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e8547d42095e58bee658f00fef8e33d2a185c927 ]
Same as cache device, the buffer page needs to be put while
freeing cached_dev. Otherwise a page would be leaked every
time a cached_dev is stopped.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 038ba8cc1bffc51250add4a9b9249d4331576d8f upstream.
In 2007, high-performance SSDs were still expensive; in order to
save more space for real workloads or metadata, the readahead I/Os
for non-metadata were bypassed and not cached on SSD.
Nowadays, SSD prices have dropped a lot and people can find larger
SSDs at a more comfortable price. It is unnecessary to always bypass
normal readahead I/Os to save SSD space anymore.
This patch adds options for readahead data cache policies via the sysfs
file /sys/block/bcache<N>/readahead_cache_policy, the options are,
- "all": cache all readahead data I/Os.
- "meta-only": only cache metadata, and bypass other regular I/Os.
If users want bcache to continue to only cache readahead requests
for metadata and bypass regular data readahead, please set "meta-only"
in this sysfs file. By default, bcache now caches all readahead
requests.
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aa9509209c5ac2f0b35d01a922bf9ae072d0c2fc upstream.
When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache
It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.
In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.
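In sketch form, the order inside writecache_flush() becomes (function
names from the list above; arguments simplified):

  writecache_commit_flushed(wc);       /* 1. write metadata */
  writecache_wait_for_ios(wc, WRITE);  /* 2. wait for data writes (was step 3) */
  writecache_commit_flushed(wc);       /* 3. flush disk cache (was step 2) */
  wc->seq_count++;                     /* 4. then commit the superblock */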
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 47ace7e012b9f7ad71d43ac9063d335ea3d6820b upstream.
Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).
Otherwise generic_make_request() will crash simply by doing:
dmsetup create -n test
mount /dev/dm-N /mnt
While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.
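In sketch form (a hedged sketch; dm_make_request is dm.c's bio-based
entry point of that era):

  /* in dm.c:alloc_dev(), so q->make_request_fn is valid for the
   * device's whole lifetime, even before any table is loaded */
  blk_queue_make_request(md->queue, dm_make_request);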
Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab34583a ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c3860c ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4ea9471fbd1addb25a4d269991dc724e200ca5b5 upstream.
If benbi IV is used in AEAD construction, for example:
cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
the constructor uses wrong skcipher function and crashes:
BUG: kernel NULL pointer dereference, address: 00000014
...
EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
Call Trace:
? crypt_subkey_size+0x20/0x20 [dm_crypt]
crypt_ctr+0x567/0xfc0 [dm_crypt]
dm_table_add_target+0x15f/0x340 [dm_mod]
Fix this by properly using crypt_aead_blocksize() in this case.
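The likely shape of the fix in crypt_iv_benbi_ctr(), sketched from the
description (helper names follow dm-crypt conventions):

  /* pick the block size according to the construction in use */
  if (crypt_integrity_aead(cc))
          bs = crypt_aead_blocksize(cc);      /* AEAD construction */
  else
          bs = crypt_skcipher_blocksize(cc);  /* plain skcipher */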
Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # v4.12+
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
Reported-by: Jerad Simpson <jbsimpson@gmail.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4feaef830de7ffdd8352e1fe14ad3bf13c9688f8 upstream.
The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets. There are variants for
tracking metadata blocks and data blocks.
Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.
When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was. This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).
This fix keeps the 'begin' field, but now it's just a hint to speed up
the search. Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free. Much simpler.
This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.
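A sketch of the new allocation order (the helper names here are
hypothetical stand-ins for the space-map internals):

  /* search the CURRENT transaction first; smd->begin is only a hint */
  r = sm_find_free_in_current(sm, smd->begin, &b);
  if (r)
          return r;
  /* double-check the block was also free in the OLD transaction */
  if (!sm_block_free_in_old(sm, b))
          return -ENOSPC;   /* simplified: the real code keeps searching */
  smd->begin = b + 1;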
Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b39962950339912978484cdac50069258545d753 upstream.
dm-zoned is observed to log failed kernel assertions and not work
correctly when operating against a device with a zone size smaller
than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
bitmap size per zone is calculated as zero with such a small zone
size. Fix this problem and also make the code related to zone bitmap
management be able to handle per zone bitmaps smaller than a single
block.
A dm-zoned-tools patch is required to properly format dm-zoned devices
with zone sizes smaller than 128MiB.
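The arithmetic behind the 128MiB threshold (illustrative):

  /* one 4K bitmap block holds 4096 * 8 = 32768 bits, one bit per 4K
   * zone block, so it covers 32768 * 4KiB = 128MiB of zone space;
   * integer division then yields 0 bitmap blocks per smaller zone */
  unsigned int bits_per_bitmap_block = 4096 * 8;   /* 32768 */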
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d66c9920c0cf984cf99cab5036fd5f3a1b7fba46 ]
The copy_to_user() function returns the number of bytes remaining to be
copied, but the intention here was to return -EFAULT if the copy fails.
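The fix applies the usual idiom, sketched here on a hypothetical
handler:

  if (copy_to_user(user_buf, &kernel_data, sizeof(kernel_data)))
          return -EFAULT;   /* not the count of bytes left uncopied */
  return 0;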
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.
Logical block size has type unsigned short. That means that it can be at
most 32768 (the largest power-of-two block size representable in an
unsigned short). However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.
For example (run this on an architecture with 64k pages):
Mount will fail with this error, because it tries to read the superblock
using 2-sector access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblock
This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.
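In sketch form, the type change in struct queue_limits (other fields
elided):

  struct queue_limits {
          /* ... */
          unsigned int logical_block_size;   /* was: unsigned short */
          /* ... */
  };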
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 028288df635f5a9addd48ac4677b720192747944 ]
In the raid1_sync_request function, rdev should be checked before being
referenced.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b ]
In bch_mca_scan(), the number of btree nodes to shrink is calculated
by code like this,
unsigned long nr = sc->nr_to_scan;
nr /= c->btree_pages;
nr = min_t(unsigned long, nr, mca_can_free(c));
variable sc->nr_to_scan is the number of objects (here the number of
bcache B+tree nodes) to shrink, and the pointer variable sc is passed
from the memory management code as the parameter of a callback.
If sc->nr_to_scan is smaller than c->btree_pages, after the above
calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
frequently observed that only 1 or 2 is set to sc->nr_to_scan, making
nr zero. Then bch_mca_scan() will do nothing more than acquiring
and releasing mutex c->bucket_lock.
This patch checks whether nr is 0 after the above calculation; if so,
it sets 'nr' to 1, so that bch_mca_scan() will at least try to shrink a
single B+tree node.
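A sketch of the adjusted calculation:

  unsigned long nr = sc->nr_to_scan;

  nr /= c->btree_pages;
  if (nr == 0)
          nr = 1;   /* shrink at least one B+tree node per invocation */
  nr = min_t(unsigned long, nr, mca_can_free(c));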
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 84c529aea182939e68f618ed9813740c9165c7eb ]
bcache_allocator can call the following:
bch_allocator_thread()
-> bch_prio_write()
-> bch_bucket_alloc()
-> wait on &ca->set->bucket_wait
But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:
[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D 0 15861 2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429] __schedule+0x2a8/0x670
[ 1158.504432] schedule+0x2d/0x90
[ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453] ? wait_woken+0x80/0x80
[ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491] kthread+0x121/0x140
[ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506] ? kthread_park+0xb0/0xb0
[ 1158.504510] ret_from_fork+0x35/0x40
Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.
Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.
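Sketched, the allocator's new call site looks roughly like this (per
the description above; the error path is simplified):

  /* wait=false: never sleep on bucket_wait from the allocator itself */
  if (bch_prio_write(ca, false) < 0) {
          set_current_state(TASK_INTERRUPTIBLE);
          wake_up_gc(ca->set);   /* GC may free buckets for the retry */
          schedule();
  }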
BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2d8869518a525c9bce5f5268419df9dfbe3dfdeb ]
Commit cafe56359144 ("bcache: A block layer cache") leads to the
following static checker warning:
./drivers/md/bcache/super.c:770 bcache_device_free()
warn: variable dereferenced before check 'd->disk' (see line 766)
drivers/md/bcache/super.c
762 static void bcache_device_free(struct bcache_device *d)
763 {
764 lockdep_assert_held(&bch_register_lock);
765
766 pr_info("%s stopped", d->disk->disk_name);
^^^^^^^^^
Unchecked dereference.
767
768 if (d->c)
769 bcache_device_detach(d);
770 if (d->disk && d->disk->flags & GENHD_FL_UP)
^^^^^^^
Check too late.
771 del_gendisk(d->disk);
772 if (d->disk && d->disk->queue)
773 blk_cleanup_queue(d->disk->queue);
774 if (d->disk) {
775 ida_simple_remove(&bcache_device_idx,
776 first_minor_to_idx(d->disk->first_minor));
777 put_disk(d->disk);
778 }
779
It is not guaranteed that the gendisk struct of a bcache device will
always be there; the warning makes sense when there is a problem in the
block core. This patch removes the static checker warning by checking
d->disk before dereferencing it, to avoid NULL pointer dereferences.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fadcbd2901a0f7c8721f3bdb69eac95c272dc8ed ]
We need to move "spin_lock_irq(&bitmap->counts.lock)" before unmapping
the previous storage, otherwise a panic like the one below could happen.
[ 902.353802] sdl: detected capacity change from 1077936128 to 3221225472
[ 902.616948] general protection fault: 0000 [#1] SMP
[snip]
[ 902.618588] CPU: 12 PID: 33698 Comm: md0_raid1 Tainted: G O 4.14.144-1-pserver #4.14.144-1.1~deb10
[ 902.618870] Hardware name: Supermicro SBA-7142G-T4/BHQGE, BIOS 3.00 10/24/2012
[ 902.619120] task: ffff9ae1860fc600 task.stack: ffffb52e4c704000
[ 902.619301] RIP: 0010:bitmap_file_clear_bit+0x90/0xd0 [md_mod]
[ 902.619464] RSP: 0018:ffffb52e4c707d28 EFLAGS: 00010087
[ 902.619626] RAX: ffe8008b0d061000 RBX: ffff9ad078c87300 RCX: 0000000000000000
[ 902.619792] RDX: ffff9ad986341868 RSI: 0000000000000803 RDI: ffff9ad078c87300
[ 902.619986] RBP: ffff9ad0ed7a8000 R08: 0000000000000000 R09: 0000000000000000
[ 902.620154] R10: ffffb52e4c707ec0 R11: ffff9ad987d1ed44 R12: ffff9ad0ed7a8360
[ 902.620320] R13: 0000000000000003 R14: 0000000000060000 R15: 0000000000000800
[ 902.620487] FS: 0000000000000000(0000) GS:ffff9ad987d00000(0000) knlGS:0000000000000000
[ 902.620738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 902.620901] CR2: 000055ff12aecec0 CR3: 0000001005207000 CR4: 00000000000406e0
[ 902.621068] Call Trace:
[ 902.621256] bitmap_daemon_work+0x2dd/0x360 [md_mod]
[ 902.621429] ? find_pers+0x70/0x70 [md_mod]
[ 902.621597] md_check_recovery+0x51/0x540 [md_mod]
[ 902.621762] raid1d+0x5c/0xeb0 [raid1]
[ 902.621939] ? try_to_del_timer_sync+0x4d/0x80
[ 902.622102] ? del_timer_sync+0x35/0x40
[ 902.622265] ? schedule_timeout+0x177/0x360
[ 902.622453] ? call_timer_fn+0x130/0x130
[ 902.622623] ? find_pers+0x70/0x70 [md_mod]
[ 902.622794] ? md_thread+0x94/0x150 [md_mod]
[ 902.622959] md_thread+0x94/0x150 [md_mod]
[ 902.623121] ? wait_woken+0x80/0x80
[ 902.623280] kthread+0x119/0x130
[ 902.623437] ? kthread_create_on_node+0x60/0x60
[ 902.623600] ret_from_fork+0x22/0x40
[ 902.624225] RIP: bitmap_file_clear_bit+0x90/0xd0 [md_mod] RSP: ffffb52e4c707d28
Because mdadm was running on another cpu to do the resize, bitmap_resize
was called to replace the bitmap, as shown below.
PID: 38801 TASK: ffff9ad074a90e00 CPU: 0 COMMAND: "mdadm"
[exception RIP: queued_spin_lock_slowpath+56]
[snip]
-- <NMI exception stack> --
#5 [ffffb52e60f17c58] queued_spin_lock_slowpath at ffffffff9c0b27b8
#6 [ffffb52e60f17c58] bitmap_resize at ffffffffc0399877 [md_mod]
#7 [ffffb52e60f17d30] raid1_resize at ffffffffc0285bf9 [raid1]
#8 [ffffb52e60f17d50] update_size at ffffffffc038a31a [md_mod]
#9 [ffffb52e60f17d70] md_ioctl at ffffffffc0395ca4 [md_mod]
The procedure to keep bitmap resize safe is: allocate new storage
space, then quiesce, copy bits, replace the bitmap, and re-start.
However, the daemon (bitmap_daemon_work) can run even while the array is
quiesced, which means that when bitmap_file_clear_bit is triggered by
raid1d, it thinks it is fine to access store->filemap since
counts->lock is held, but the resize could change the storage without
the protection of the lock.
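The shape of the fix in the resize path, sketched from the description
(helper names simplified):

  spin_lock_irq(&bitmap->counts.lock);   /* moved: taken BEFORE unmap */
  /* unmap and replace the previous storage under the lock, so
   * bitmap_file_clear_bit() can never see a half-swapped filemap */
  bitmap_file_unmap(&store);
  spin_unlock_irq(&bitmap->counts.lock);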
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 474e559567fa631dea8fb8407ab1b6090c903755 upstream.
We got the following warnings from thin_check during thin-pool setup:
$ thin_check /dev/vdb
examining superblock
examining devices tree
missing devices: [1, 84]
too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
examining mapping tree
The phenomenon is that the number of entries in one node of the
details_info tree is less than (max_entries / 3). It can be easily
reproduced by the following procedure:
$ new a thin pool
$ presume the max entries of details_info tree is 126
$ new 127 thin devices (e.g. 1~127) to make the root node being full
and then split
$ remove the first 43 (e.g. 1~43) thin devices to make the children
rebalance repeatedly
$ stop the thin pool
$ thin_check
The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariant: the minimal number of entries in a
non-root node should be >= (max_entries / 3).
Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.
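With the max_entries = 126 figure from the thin_check output, the
arithmetic is (illustrative):

  unsigned max_entries = 126;                /* from the report above */
  unsigned min_valid = max_entries / 3;      /* 42: B-tree invariant */
  unsigned new_floor = max_entries / 3 + 1;  /* 43: post-fix per-child floor */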
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dbaf971c9cdf10843071a60dcafc1aaab3162354 upstream.
Remove the branching for the edge case where no SCSI device handler
exists. The __map_bio_fast() method was far too limited, by only
selecting a new pathgroup or path IFF there was a path failure; fix this
by eliminating it in favor of __map_bio(). __map_bio()'s extra SCSI
device handler specific MPATHF_PG_INIT_REQUIRED test is not in the fast
path anyway.
This change restores full path selector functionality for bio-based
configurations that don't have a SCSI device handler. But it should be
noted that the path selectors do have an impact on performance for
certain networks that are extremely fast (and don't require frequent
switching).
Fixes: 8d47e65948dd ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
Cc: stable@vger.kernel.org
Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
Suggested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a7ede3d16808b8f3915c8572d783530a82b2f027 ]
With commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ("raid5: don't set
STRIPE_HANDLE to stripe which is in batch list"), we don't want to set
STRIPE_HANDLE flag for sh which is already in batch list.
However, the stripe which is the head of the batch list should set this
flag; otherwise a panic could happen inside init_stripe at
BUG_ON(sh->batch_head). It is reproducible with raid5 on top of nvdimm
devices, as Xiao observed.
Thanks for Xiao's effort to verify the change.
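The adjusted condition, in sketch form (context simplified):

  /* handle stripes outside any batch list, and the batch head itself,
   * but never other members of a batch list */
  if (!sh->batch_head || sh == sh->batch_head)
          set_bit(STRIPE_HANDLE, &sh->state);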
Fixes: 6ce220dd2f8ea ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list")
Reported-by: Xiao Ni <xni@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream.
If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O, instead of assuming the flush logic
will push it to completion.
Convert md_flush_request to return a bool and no longer call the raid
driver's make_request function. If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing it
like a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.
Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.
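A sketch of the resulting calling convention in a raid driver's
make_request function (names here are illustrative):

  static bool example_make_request(struct mddev *mddev, struct bio *bio)
  {
          if (unlikely(bio->bi_opf & REQ_PREFLUSH)
              && md_flush_request(mddev, bio))
                  return true;   /* md flush logic owns the bio now */

          /* otherwise keep processing it like a normal bio */
          return example_handle_bio(mddev, bio);   /* hypothetical helper */
  }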
Fixes: 2bc13b83e629 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e7fad909b68aa37470d9f2d2731b5bec355ee5d6 upstream.
Commit 75d66ffb48efb3 added backing device health checks and, as a part
of these checks, the check_events() block ops template call is invoked in
the dm-zoned mapping path as well as in the reclaim and flush paths. Calling
check_events() with ATA or SCSI backing devices introduces a blocking
scsi_test_unit_ready() call being made in sd_check_events(). Even though
the overhead of calling scsi_test_unit_ready() is small for ATA zoned
devices, it is much larger for SCSI and it affects performance in a very
negative way.
Fix this performance regression by executing check_events() only in case
of any I/O errors. The function dmz_bdev_is_dying() is modified to call
only blk_queue_dying(), while calls to check_events() are made in a new
helper function, dmz_check_bdev().
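In sketch form (per the description; the bdev field name is an
assumption):

  bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
  {
          /* cheap check, safe in the hot mapping path */
          return blk_queue_dying(bdev_get_queue(dmz_dev->bdev));
  }

  /* runs check_events(); called only after an I/O error */
  bool dmz_check_bdev(struct dmz_dev *dmz_dev);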
Reported-by: zhangxiaoxu <zhangxiaoxu5@huawei.com>
Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c1005322ff02110a4df7f0033368ea015062b583 upstream.
Call writecache_flush() on REQ_FUA in writecache_map().
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Maged Mokhtar <mmokhtar@petasan.org>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e3fc3f3d0943b126f76b8533960e4168412d9e5a ]
The first argument to WARN() is supposed to be a condition. The
original code will just print the mdname() instead of the full warning
message.
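For reference, WARN()'s contract, with the likely shape of the fix
sketched from the description:

  /* WARN(cond, fmt, ...): argument one is the condition. Passing the
   * message string first makes it the (always-true) condition and
   * mdname() the format, so only the name gets printed. */
  pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
         mdname(mddev));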
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 ]
Sending a check/repair message infrequently leads to -EBUSY instead of
properly identifying an active resync. This occurs because
raid_message() is testing recovery bits in a racy way.
Fix by calling decipher_sync_action() from raid_message() to properly
identify the idle state of the RAID device.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a00f5276e26636cbf72f24f79831026d2e2868e7 ]
The flakey target is documented to be able to corrupt the Nth byte in
a bio, but does not corrupt byte indices after the first biovec in the
bio. Change the corrupting function to actually corrupt the Nth byte
no matter in which biovec that index falls.
A test device generating two-page bios, atop a flakey device configured
to corrupt a byte index on the second page, verified both the failure
to corrupt before this patch and the expected corruption after this
change.
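A sketch of walking every biovec to reach the Nth byte (simplified;
the real target must also map highmem pages and flag the bio dirty):

  static void corrupt_nth_byte(struct bio *bio, unsigned int nth, u8 value)
  {
          struct bvec_iter iter;
          struct bio_vec bvec;
          unsigned int remaining = nth - 1;   /* Nth byte is 1-based */

          bio_for_each_segment(bvec, bio, iter) {
                  if (remaining < bvec.bv_len) {
                          u8 *seg = page_address(bvec.bv_page) + bvec.bv_offset;
                          seg[remaining] = value;
                          return;
                  }
                  remaining -= bvec.bv_len;
          }
  }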
Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 79b791466e525c98f6aeee9acf5726e7b27f4231 ]
A fresh backing device is not attached to any cache_set, and
has no writeback kthread created until first attached to some
cache_set.
But bch_cached_dev_writeback_init runs
"
dc->writeback_running = true;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
&dc->disk.flags));
"
for any newly formatted backing devices.
For a fresh standalone backing device, we can get something like
following even if no writeback kthread created:
------------------------
/sys/block/bcache0/bcache# cat writeback_running
1
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate: 512.0k/sec
dirty: 0.0k
target: 0.0k
proportional: 0.0k
integral: 0.0k
change: 0.0k/sec
next io: -15427384ms
The non-zero fields are misleading, as no writeback kthread is alive
yet.
Set dc->writeback_running false, as no writeback thread is created in
bch_cached_dev_writeback_init().
The writeback thread is created and woken up in
bch_cached_dev_writeback_start(). Set dc->writeback_running true before
bch_writeback_queue() is called, as the writeback thread will check if
dc->writeback_running is true before writing back dirty data, and will
hang if it detects false.
After the change, we can get the following output for a fresh standalone
backing device:
-----------------------
/sys/block/bcache0/bcache$ cat writeback_running
0
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate: 0.0k/sec
dirty: 0.0k
target: 0.0k
proportional: 0.0k
integral: 0.0k
change: 0.0k/sec
next io: 0ms
v1 -> v2:
Set dc->writeback_running before bch_writeback_queue() is called.
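In sketch form, the two sides of the change (per the description
above):

  /* bch_cached_dev_writeback_init(): no kthread exists yet */
  dc->writeback_running = false;

  /* bch_cached_dev_writeback_start(): flip the flag on before queueing,
   * since the kthread checks it before writing back dirty data */
  dc->writeback_running = true;
  bch_writeback_queue(dc);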
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ae17102316550b4b230a283febe31b2a9ff30084 ]
debugfs_remove and debugfs_remove_recursive will check whether the
dentry pointer is NULL or an error pointer, and will do nothing in that
case. Remove the redundant checks in cache_set_free and bch_debug_init.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 45422b704db392a6d79d07ee3e3670b11048bd53 upstream.
Due to unneeded multiplication in the out_free_pages portion of
r10buf_pool_alloc(), when using a 3-copy raid10 layout, it is
possible to access a resync_pages offset that has not been
initialized. This access translates into a crash of the system
within resync_free_pages() while passing a bad pointer to
put_page(). Remove the multiplication, preventing access to the
uninitialized area.
Fixes: f0250618361db ("md: raid10: don't use bio's vec table to manage resync pages")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: John Pittman <jpittman@redhat.com>
Suggested-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d857ad75edf3c0066fcd920746f9dc75382b3324 ]
With raid4/5/6, journal device and write intent bitmap are mutually exclusive.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3 ]
Recalculate cached_dev_sectors when a cached_dev is detached, just as
the recalculation is done when a cached_dev is attached.
Update cached_dev_sectors before bcache_device_detach is called,
as bcache_device_detach will set bcache_device->c to NULL.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7a55948d38eb9b274cbbdd56dc1dd4b96ebfbe04 ]
UUIDs are considered metadata. __uuid_write should add the number
of buckets (in sectors) written to disk to ca->meta_sectors_written.
Currently only 1 bucket is used in a uuid write.
Steps to test:
1) create a fresh backing device and a fresh cache device separately.
The backing device didn't attach to any cache set.
2) cd /sys/block/<cache device>/bcache
cat metadata_written // record the output value
cat bucket_size
3) attach the backing device to cache set
4) cat metadata_written
The output value is almost the same as the value in step 2
before the change.
After the change, the value is bigger by about 1 bucket size.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Reviewed-by: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 059421e041eb461fb2b3e81c9adaec18ef03ca3c ]
Commit 35bfc52187f6 ("md: allow metadata update while suspending.")
added support for allowing md_check_recovery() to still perform
metadata updates while the array is entering the 'suspended' state.
This is needed to allow the processes of entering the state to
complete.
Unfortunately, the patch doesn't really work. The test for
"mddev->suspended" at the start of md_check_recovery() means that the
function doesn't try to do anything at all while entering suspend.
This patch moves the code of updating the metadata while suspending to
*before* the test on mddev->suspended.
Reported-by: Jeff Mahoney <jeffm@suse.com>
Fixes: 35bfc52187f6 ("md: allow metadata update while suspending.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dab71b2db98dcdd4657d151b01a7be88ce10f9d1 ]
dc->writeback_rate_minimum is an unsigned integer variable. It is set
via the sysfs interface, and converted from the input string to an
unsigned integer by d_strtoul_nonzero(). When the converted input value
is larger than UINT_MAX, an unsigned integer overflow happens.
This patch fixes the overflow by using sysfs_strtoul_clamp() to
convert the input string and clamp the value to the range [1, UINT_MAX],
so the overflow can be avoided.
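The store path after the fix, sketched (usage per bcache's sysfs
helper macros):

  sysfs_strtoul_clamp(writeback_rate_minimum,
                      dc->writeback_rate_minimum,
                      1, UINT_MAX);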
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ]
Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.
The implementation of this throttling mechanism is prone to a deadlock:
1. One or more threads write to the origin device causing COW, which is
performed by kcopyd.
2. At some point some of these threads might reach the s->cow_count
semaphore limit and block in down(&s->cow_count), holding a read lock
on _origins_lock.
3. Someone tries to acquire a write lock on _origins_lock, e.g.,
snapshot_ctr(), which blocks because the threads at step (2) already
hold a read lock on it.
4. A COW operation completes and kcopyd runs dm-snapshot's completion
callback, which ends up calling pending_complete().
pending_complete() tries to resubmit any deferred origin bios. This
requires acquiring a read lock on _origins_lock, which blocks.
This happens because the read-write semaphore implementation gives
priority to writers, meaning that as soon as a writer tries to enter
the critical section, no readers will be allowed in, until all
writers have completed their work.
So, pending_complete() waits for the writer at step (3) to acquire
and release the lock. This writer waits for the readers at step (2)
to release the read lock and those readers wait for
pending_complete() (the kcopyd thread) to signal the s->cow_count
semaphore: DEADLOCK.
The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.
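A heavily simplified sketch of the reworked throttle (the real code
sleeps on a waitqueue instead of the bare schedule() placeholder, and
the limit name is an assumption):

  static void wait_for_in_progress(struct dm_snapshot *s)
  {
          while (READ_ONCE(s->in_progress) >= MAX_IN_PROGRESS_COW_JOBS) {
                  up_read(&_origins_lock);   /* wait without holding locks */
                  schedule();                /* placeholder for waitqueue sleep */
                  down_read(&_origins_lock);
          }
  }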
Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ]
This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 13bd677a472d534bf100bab2713efc3f9e3f5978 upstream.
GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being
available and it fails if the mempool is exhausted and there is not enough
memory.
If we go down this path:
map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
we can see that map_bio() doesn't check the return value of mg_start(),
and the bio is leaked.
If we go down this path:
map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
mg_lock_writes -> mg_complete
the bio is ended with an error - it is unacceptable because it could
cause filesystem corruption if the machine ran out of memory
temporarily.
Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
wait until memory becomes available. mempool_alloc with GFP_NOIO can't
fail, so remove the code paths that deal with allocation failure.
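The idiom the fix applies, sketched (the mempool field shape is an
assumption):

  /* GFP_NOIO lets mempool_alloc() sleep until an element is available,
   * so the allocation cannot fail and no error path is needed */
  mg = mempool_alloc(&cache->migration_pool, GFP_NOIO);
  /* was: mempool_alloc(..., GFP_NOWAIT), which may return NULL anytime */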
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3874d73e06c9b9dc15de0b7382fc223986d75571 ]
The message should match the parameter, i.e. raid0.default_layout.
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: NeilBrown <neilb@suse.de>
Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ]
If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.
A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones. All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.
It is not possible to determine what layout to use - it depends which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refuse to assemble an affected array if that parameter is
not set.
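The knob is a module parameter, which in sketch form looks like:

  /* set via raid0.default_layout= on the kernel command line, or via
   * /sys/module/raid0/parameters/default_layout at runtime */
  static int default_layout = 0;   /* 0 = unset: refuse multi-zone arrays */
  module_param(default_layout, int, 0644);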
Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 480523feae581ab714ba6610388a3b4619a2f695 upstream.
Since commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.
So we should only call it when the cost is justified.
md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays). For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeably
affect IO submission latency.
In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been no
recent IO. So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.
This measurably reduces mean and maximum IO submission latency during
resync/recovery.
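A sketch of the saved-state pattern the text describes (surrounding
recovery logic elided):

  void md_check_recovery(struct mddev *mddev)
  {
          /* remember at entry whether there has been no recent IO */
          bool try_set_sync = mddev->safemode != 0;

          /* ... recovery/resync bookkeeping ... */

          if (try_set_sync && !mddev->external && !mddev->in_sync) {
                  spin_lock(&mddev->lock);
                  set_in_sync(mddev);
                  spin_unlock(&mddev->lock);
          }
  }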
Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream.
Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready. As soon as
it is, fsck can be run. If it finds the size to be zero, it will fail.
So add a new flag to provide an interlock between do_md_run() and
array_state_show(). This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.
Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.
We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared. This ensures the
information is ready before the notification is sent.
Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().
Note that MD_NOT_READY is cleared twice. This is deliberate, to cover
both success and error paths with minimal noise.
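The interlock, sketched (simplified; revalidation details elided):

  set_bit(MD_NOT_READY, &mddev->flags);   /* array_state_show() now
                                             reports the array inactive */
  err = md_run(mddev);
  if (err)
          goto out;
  revalidate_disk(mddev->gendisk);        /* size becomes valid here */
  clear_bit(MD_NOT_READY, &mddev->flags); /* success path */
  /* sysfs_notify*() calls happen here, after the flag is cleared */
  out:
  clear_bit(MD_NOT_READY, &mddev->flags); /* error path: "cleared twice" */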
Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream.
7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.
Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on the P member disk (while the other disks have no read errors).
V2: upper-layer read requests don't read parity/Q data, so there is no
need to consider that situation.
This is Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 226b4fc75c78f9c497c5182d939101b260cfb9f3 ]
SCSI maintains its own driver private data hooked off of each SCSI
request, and the private data won't be freed after scsi_queue_rq()
returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
(e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
fully dispatched them, due to a lower level SCSI driver's resource
limitation identified in scsi_queue_rq(). Currently SCSI's per-request
private data is leaked when the upper layer driver (dm-rq) frees and
then retries these requests in response to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
This usecase is so specialized that it doesn't warrant training an
existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
account for freeing its driver private data -- doing so would add an
extra branch for handling a special case that all other consumers of
SCSI (and blk-mq) won't ever need to worry about.
So the most pragmatic way forward is to delegate freeing SCSI driver
private data to the upper layer driver (dm-rq). Do so by adding
new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
from dm-rq. A following commit will implement the .cleanup_rq() hook
in scsi_mq_ops.
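The new hook, sketched per the description above:

  /* blk-mq side: give the driver that owns rq a chance to free its
   * per-request private data before dm-rq frees and retries it */
  static inline void blk_mq_cleanup_rq(struct request *rq)
  {
          if (rq->q->mq_ops->cleanup_rq)
                  rq->q->mq_ops->cleanup_rq(rq);
  }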
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b76b4715eba0d0ed574f58918b29c1b2f0fa37a8 ]
MD currently counts read errors returned by the lower layer; if those
errors are -EILSEQ, instead of -EIO, it should NOT increase the
read_errors count.
When RAID6 is set up on a dm-integrity target that detects massive
corruption, the leg will be ejected from the array, even if the
issue is correctable with a sector re-write and the array has the
necessary redundancy to correct it.
The leg is ejected because it runs up rdev->read_errors beyond
conf->max_nr_stripes. The return status in dm-crypt when there is
a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
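In sketch form, the end-of-read accounting becomes (status names per
blk_types.h; context simplified):

  /* only genuine media errors count toward ejecting the leg */
  if (bi->bi_status != BLK_STS_PROTECTION)   /* -EILSEQ from dm-crypt */
          atomic_inc(&rdev->read_errors);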
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ]
If a stripe in a batch list is set with the STRIPE_HANDLE flag, then the
stripe could be set with STRIPE_ACTIVE by the handle_stripe function. If
an error happens to the batch_head at the same time,
break_stripe_batch_list is called, and then the warning below could
happen (the same report as in [1]); it means a member of the batch list
was set with STRIPE_ACTIVE.
[7028915.431770] stripe state: 2001
[7028915.431815] ------------[ cut here ]------------
[7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
[...]
[7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G O 4.14.86-1-storage #4.14.86-1.2~deb9
[7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
[7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
[7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
[7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
[7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
[7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
[7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
[7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
[7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
[7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
[7028915.431906] FS: 0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
[7028915.431907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
[7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7028915.431910] Call Trace:
[7028915.431923] handle_stripe+0x8e7/0x2020 [raid456]
[7028915.431930] ? __wake_up_common_lock+0x89/0xc0
[7028915.431935] handle_active_stripes.isra.58+0x35f/0x560 [raid456]
[7028915.431939] raid5_do_work+0xc6/0x1f0 [raid456]
Also commit 59fc630b8b5f9f ("RAID5: batch adjacent full stripe write")
said "If a stripe is added to batch list, then only the first stripe
of the list should be put to handle_list and run handle_stripe."
So don't set STRIPE_HANDLE on a stripe which is already in a batch list;
otherwise the stripe could be put on the handle_list and run through
handle_stripe, triggering the above warning.
[1]. https://www.spinics.net/lists/raid/msg62552.html
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 07f1a6850c5d5a65c917c3165692b5179ac4cb6b ]
When run test case:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
mdadm fails to run, with kernel messages as follows:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
In fact, when the number of active disks in the raid1 array is less than
one, we need to return failure from raid1_run().
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|