Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 984bf2cc531e778e49298fdf6730e0396166aa21 ]
There was a problem that a user burned a dm-integrity image on CDROM
and could not activate it because it had a non-empty journal.
Fix this problem by flushing the journal (done by the previous commit)
and clearing the journal (done by this commit). Once the journal is
cleared, dm-integrity won't attempt to replay it on the next
activation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e5dab5ec763d600fe0a67837dd9155bdc42f961 ]
This commit flushes the journal on suspend. It is prerequisite for the
next commit that enables activating dm integrity devices in read-only mode.
Note that we deliberately didn't flush the journal on suspend, so that the
journal replay code would be tested. However, the dm-integrity code is 5
years old now, so that journal replay is well-tested, and we can make this
change now.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 4fe1ec995483737f3d2a14c3fe1d8fe634972979 upstream.
__list_versions will first estimate the required space using the
"dm_target_iterate(list_version_get_needed, &needed)" call and then will
fill the space using the "dm_target_iterate(list_version_get_info,
&iter_info)" call. Each of these calls locks the targets using the
"down_read(&_lock)" and "up_read(&_lock)" calls, however between the first
and second "dm_target_iterate" there is no lock held and the target
modules can be loaded at this point, so the second "dm_target_iterate"
call may need more space than what was the first "dm_target_iterate"
returned.
The code tries to handle this overflow (see the beginning of
list_version_get_info), however this handling is incorrect.
The code sets "param->data_size = param->data_start + needed" and
"iter_info.end = (char *)vers+len" - "needed" is the size returned by the
first dm_target_iterate call; "len" is the size of the buffer allocated by
userspace.
"len" may be greater than "needed"; in this case, the code will write up
to "len" bytes into the buffer, however param->data_size is set to
"needed", so it may write data past the param->data_size value. The ioctl
interface copies only up to param->data_size into userspace, thus part of
the result will be truncated.
Fix this bug by setting "iter_info.end = (char *)vers + needed;" - this
guarantees that the second "dm_target_iterate" call will write only up to
the "needed" buffer and it will exit with "DM_BUFFER_FULL_FLAG" if it
overflows the "needed" space - in this case, userspace will allocate a
larger buffer and retry.
Note that there is also a bug in list_version_get_needed - we need to add
"strlen(tt->name) + 1" to the needed size, not "strlen(tt->name)".
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 99f4f5bcb975527508eb7a5e3e34bdb91d576746 ]
Fixes: 74fe6ba923949 ("dm: convert to blk_alloc_disk/blk_cleanup_disk")
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74 ]
A complicated deadlock exists when using the journal and an elevated
group_thrtead_cnt. It was found with loop devices, but its not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
When the deadlock occurs, multiple threads will hang in different ways:
1) The group threads will hang in the blk-wbt code with bios waiting to
be submitted to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
ops_run_io+0x46b/0x1a30
handle_stripe+0xcd3/0x36b0
handle_active_stripes.constprop.0+0x6f6/0xa60
raid5_do_work+0x177/0x330
Or:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
flush_deferred_bios+0x136/0x170
raid5_do_work+0x262/0x330
2) The r5l_reclaim thread will hang in the same way, submitting a
bio to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
submit_bio+0x3f/0xf0
md_super_write+0x12f/0x1b0
md_update_sb.part.0+0x7c6/0xff0
md_update_sb+0x30/0x60
r5l_do_reclaim+0x4f9/0x5e0
r5l_reclaim_thread+0x69/0x30b
However, before hanging, the MD_SB_CHANGE_PENDING flag will be
set for sb_flags in r5l_write_super_and_discard_space(). This
flag will never be cleared because the submit_bio() call never
returns.
3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
will do no processing on any pending stripes and re-set
STRIPE_HANDLE. This will cause the raid5d thread to enter an
infinite loop, constantly trying to handle the same stripes
stuck in the queue.
The raid5d thread has a blk_plug that holds a number of bios
that are also stuck waiting seeing the thread is in a loop
that never schedules. These bios have been accounted for by
blk-wbt thus preventing the other threads above from
continuing when they try to submit bios. --Deadlock.
To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
from the same thread and can thus deadlock if the thread is put to
sleep. So avoid waiting if md_check_recovery() is being called in the
loop.
It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:
16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
is updated.")
Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d2d05b88035d2d51a5bb6c5afec88a0880c73df4 ]
Inside set_at_max_writeback_rate() the calculation in following if()
check is wrong,
if (atomic_inc_return(&c->idle_counter) <
atomic_read(&c->attached_dev_nr) * 6)
Because each attached backing device has its own writeback thread
running and increasing c->idle_counter, the counter increates much
faster than expected. The correct calculation should be,
(counter / dev_nr) < dev_nr * 6
which equals to,
counter < dev_nr * dev_nr * 6
This patch fixes the above mistake with correct calculation, and helper
routine idle_counter_exceeded() is added to make code be more clear.
Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c66a6f41e09ad386fd2cce22b9cded837bbbc704 ]
When running chunk-sized reads on disks with badblocks duplicate bio
free/puts are observed:
=============================================================================
BUG bio-200 (Not tainted): Object already free
-----------------------------------------------------------------------------
Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
__slab_alloc.constprop.0+0x5a/0xb0
kmem_cache_alloc+0x31e/0x330
mempool_alloc_slab+0x17/0x20
mempool_alloc+0x100/0x2b0
bio_alloc_bioset+0x181/0x460
do_mpage_readpage+0x776/0xd00
mpage_readahead+0x166/0x320
blkdev_readahead+0x15/0x20
read_pages+0x13f/0x5f0
page_cache_ra_unbounded+0x18d/0x220
force_page_cache_ra+0x181/0x1c0
page_cache_sync_ra+0x65/0xb0
filemap_get_pages+0x1df/0xaf0
filemap_read+0x1e1/0x700
blkdev_read_iter+0x1e5/0x330
vfs_read+0x42a/0x570
Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
kmem_cache_free+0x46d/0x490
mempool_free_slab+0x17/0x20
mempool_free+0x66/0x190
bio_free+0x78/0x90
bio_put+0x100/0x1a0
raid5_make_request+0x2259/0x2450
md_handle_request+0x402/0x600
md_submit_bio+0xd9/0x120
__submit_bio+0x11f/0x1b0
submit_bio_noacct_nocheck+0x204/0x480
submit_bio_noacct+0x32e/0xc70
submit_bio+0x98/0x1a0
mpage_readahead+0x250/0x320
blkdev_readahead+0x15/0x20
read_pages+0x13f/0x5f0
page_cache_ra_unbounded+0x18d/0x220
Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Workqueue: raid5wq raid5_do_work
Call Trace:
<TASK>
dump_stack_lvl+0x5a/0x78
dump_stack+0x10/0x16
print_trailer+0x158/0x165
object_err+0x35/0x50
free_debug_processing.cold+0xb7/0xbe
__slab_free+0x1ae/0x330
kmem_cache_free+0x46d/0x490
mempool_free_slab+0x17/0x20
mempool_free+0x66/0x190
bio_free+0x78/0x90
bio_put+0x100/0x1a0
mpage_end_io+0x36/0x150
bio_endio+0x2fd/0x360
md_end_io_acct+0x7e/0x90
bio_endio+0x2fd/0x360
handle_failed_stripe+0x960/0xb80
handle_stripe+0x1348/0x3760
handle_active_stripes.constprop.0+0x72a/0xaf0
raid5_do_work+0x177/0x330
process_one_work+0x616/0xb20
worker_thread+0x2bd/0x6f0
kthread+0x179/0x1b0
ret_from_fork+0x22/0x30
</TASK>
The double free is caused by an unnecessary bio_put() in the
if(is_badblock(...)) error path in raid5_read_one_chunk().
The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c
("md/raid5: move checking badblock before clone bio in
raid5_read_one_chunk"). The previous code checked and freed align_bio
which required a bio_put. After the move that is no longer needed as
raid_bio is returned to the control of the common io path which
performs its own endio resulting in a double free on bad device blocks.
Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
Signed-off-by: David Sloan <david.sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2eed85bc75138a9eeb63863d20f8904ac42a577 ]
When doing degrade/recover tests using the journal a kernel BUG
is hit at drivers/md/raid5.c:4381 in handle_parity_checks5():
BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
This was found to occur because handle_stripe_fill() was skipped
for stripes in the journal due to a condition in that function.
Thus blocks were not fetched and R5_UPTODATE was not set when
the code reached handle_parity_checks5().
To fix this, don't skip handle_stripe_fill() unless the stripe is
for read.
Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path")
Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1727fd5015d8f93474148f94e34cda5aa6ad4a43 ]
Current code produces a warning as shown below when total characters
in the constituent block device names plus the slashes exceeds 200.
snprintf() returns the number of characters generated from the given
input, which could cause the expression “200 – len” to wrap around
to a large positive number. Fix this by using scnprintf() instead,
which returns the actual number of characters written into the buffer.
[ 1513.267938] ------------[ cut here ]------------
[ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510
[ 1513.267944] Modules linked in: <snip>
[ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu
[ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510
<-snip->
[ 1513.267982] Call Trace:
[ 1513.267986] snprintf+0x45/0x70
[ 1513.267990] ? disk_name+0x71/0xa0
[ 1513.267993] dump_zones+0x114/0x240 [raid0]
[ 1513.267996] ? _cond_resched+0x19/0x40
[ 1513.267998] raid0_run+0x19e/0x270 [raid0]
[ 1513.268000] md_run+0x5e0/0xc50
[ 1513.268003] ? security_capable+0x3f/0x60
[ 1513.268005] do_md_run+0x19/0x110
[ 1513.268006] md_ioctl+0x195e/0x1f90
[ 1513.268007] blkdev_ioctl+0x91f/0x9f0
[ 1513.268010] block_ioctl+0x3d/0x50
[ 1513.268012] do_vfs_ioctl+0xa9/0x640
[ 1513.268014] ? __fput+0x162/0x260
[ 1513.268016] ksys_ioctl+0x75/0x80
[ 1513.268017] __x64_sys_ioctl+0x1a/0x20
[ 1513.268019] do_syscall_64+0x5e/0x200
[ 1513.268021] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 766038846e875 ("md/raid0: replace printk() with pr_*()")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e8daf906f890560df430d30617c692a794acb73 ]
A race condition still exists when removing and re-creating md devices
in test cases. However, it is only seen on some setups.
The race condition was tracked down to a reference still being held
to the kobject by the rdev in the md_rdev_misc_wq which will be released
in rdev_delayed_delete().
md_alloc() waits for previous deletions by waiting on the md_misc_wq,
but the md_rdev_misc_wq may still be holding a reference to a recently
removed device.
To fix this, also flush the md_rdev_misc_wq in md_alloc().
Signed-off-by: David Sloan <david.sloan@eideticom.com>
[logang@deltatee.com: rewrote commit message]
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0dd84b319352bb8ba64752d4e45396d8b13e6018 upstream.
From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.
Let's stop write first in destructor to align with normal md-raid to
fix the KASAN issue.
[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1d258758cf06a0734482989911d184dd5837ed4e upstream.
This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Because it
obviously breaks clustered raid as noticed by Neil though it fixed KASAN
issue for dm-raid, let's revert it and fix KASAN issue in next commit.
[1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470
Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 104212471b1c1817b311771d817fb692af983173 ]
In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and
may cause sh to be released. However, sh is subsequently used in lines
2886 "if (sh->batch_head && sh != sh->batch_head)". This may result in an
use-after-free bug.
It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of
the function.
Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9973f0fa7d20269fe6fefe6333997fb5914449c1 ]
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)
When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7dad24db59d2d2803576f2e3645728866a056dab ]
There is a KASAN warning in raid_resume when running the lvm test
lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
is greater than rs->raid_disks, so the loop touches one entry beyond
the allocated length.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1fbeea217d8f297fe0e0956a1516d14ba97d0396 ]
There is this warning when using a kernel with the address sanitizer
and running this testsuite:
https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid
==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x9c
print_address_description.constprop.0+0x1f/0x1e0
print_report.cold+0x55/0x244
kasan_report+0xc9/0x100
raid_status+0x1747/0x2820 [dm_raid]
dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
table_load+0x35c/0x630 [dm_mod]
ctl_ioctl+0x411/0x630 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
__x64_sys_ioctl+0x12a/0x1a0
do_syscall_64+0x5b/0x80
The warning is caused by reading conf->max_nr_stripes in raid_status. The
code in raid_status reads mddev->private, casts it to struct r5conf and
reads the entry max_nr_stripes.
However, if we have different raid type than 4/5/6, mddev->private
doesn't point to struct r5conf; it may point to struct r0conf, struct
r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
of these structs to struct r5conf, we will be reading invalid memory
and KASAN warns about it.
Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3534e5a5ed2997ca1b00f44a0378a075bd05e8a3 ]
Fault inject on pool metadata device reports:
BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950
CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_address_description.constprop.0.cold+0xeb/0x3f4
kasan_report.cold+0xe6/0x147
dm_pool_register_metadata_threshold+0x40/0x80
pool_ctr+0xa0a/0x1150
dm_table_add_target+0x2c8/0x640
table_load+0x1fd/0x430
ctl_ioctl+0x2c4/0x5a0
dm_ctl_ioctl+0xa/0x10
__x64_sys_ioctl+0xb3/0xd0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This can be easily reproduced using:
echo offline > /sys/block/sda/device/state
dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"
If a metadata commit fails, the transaction will be aborted and the
metadata space maps will be destroyed. If a DM table reload then
happens for this failed thin-pool, a use-after-free will occur in
dm_sm_register_threshold_callback (called from
dm_pool_register_metadata_threshold).
Fix this by in dm_pool_register_metadata_threshold() by returning the
-EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr()
with a new error message: "Error registering metadata threshold".
Fixes: ac8c3f3df65e4 ("dm thin: generate event when metadata threshold passed")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ca7dc242e358e46d963b32f9d9dd829785a9e957 ]
dm-writecache has the capability to limit the number of writeback jobs
in progress. However, this feature was off by default. As such there
were some out-of-memory crashes observed when lowering the low
watermark while the cache is full.
This commit enables writeback limit by default. It is set to 256MiB or
1/16 of total system memory, whichever is smaller.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e120a5f1e78fab6223544e425015f393d90d6f0d ]
Otherwise PR ops may be issued while the broader DM device is being
reconfigured, etc.
Fixes: 9c72bad1f31a ("dm: call PR reserve/unreserve on each underlying device")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2ee73ef60db4d79b9f9b8cd501e8188b5179449f ]
Change dm-writecache, so that it counts the number of blocks discarded
instead of the number of discard bios. Make it consistent with the
read and write statistics counters that were changed to count the
number of blocks instead of bios.
Fixes: e3a35d03407c ("dm writecache: add event counters")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b2676e1482af89714af6988ce5d31a84692e2530 ]
Change dm-writecache, so that it counts the number of blocks written
instead of the number of write bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.
Fixes: e3a35d03407c ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2c6e755b49d273243431f5f1184654e71221fc78 ]
Change dm-writecache, so that it counts the number of blocks read
instead of the number of read bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.
Fixes: e3a35d03407c ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9bc0c92e4b82adb017026dbb2aa816b1ac2bef31 ]
The functions writecache_map_remap_origin and writecache_bio_copy_ssd
only return a single value, thus they can be made to return void.
This helps simplify the following IO accounting changes.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit d17f744e883b2f8d13cca252d71cfe8ace346f7d upstream.
There's a KASAN warning in raid10_remove_disk when running the lvm
test lvconvert-raid-reshape.sh. We fix this warning by verifying that the
value "number" is valid.
BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10]
Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682
CPU: 3 PID: 124682 Comm: mdX_raid10 Not tainted 5.19.0-rc6 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x45/0x57a
? __lock_text_start+0x18/0x18
? raid10_remove_disk+0x61/0x2a0 [raid10]
kasan_report+0xa8/0xe0
? raid10_remove_disk+0x61/0x2a0 [raid10]
raid10_remove_disk+0x61/0x2a0 [raid10]
Buffer I/O error on dev dm-76, logical block 15344, async page read
? __mutex_unlock_slowpath.constprop.0+0x1e0/0x1e0
remove_and_add_spares+0x367/0x8a0 [md_mod]
? super_written+0x1c0/0x1c0 [md_mod]
? mutex_trylock+0xac/0x120
? _raw_spin_lock+0x72/0xc0
? _raw_spin_lock_bh+0xc0/0xc0
md_check_recovery+0x848/0x960 [md_mod]
raid10d+0xcf/0x3360 [raid10]
? sched_clock_cpu+0x185/0x1a0
? rb_erase+0x4d4/0x620
? var_wake_function+0xe0/0xe0
? psi_group_change+0x411/0x500
? preempt_count_sub+0xf/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? raid10_sync_request+0x36c0/0x36c0 [raid10]
? preempt_count_sub+0xf/0xc0
? _raw_spin_unlock_irqrestore+0x19/0x40
? del_timer_sync+0xa9/0x100
? try_to_del_timer_sync+0xc0/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? _raw_spin_unlock_irq+0x11/0x24
? __list_del_entry_valid+0x68/0xa0
? finish_wait+0xa3/0x100
md_thread+0x161/0x260 [md_mod]
? unregister_md_personality+0xa0/0xa0 [md_mod]
? _raw_spin_lock_irqsave+0x78/0xc0
? prepare_to_wait_event+0x2c0/0x2c0
? unregister_md_personality+0xa0/0xa0 [md_mod]
kthread+0x148/0x180
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 124495:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x80/0xa0
setup_conf+0x140/0x5c0 [raid10]
raid10_run+0x4cd/0x740 [raid10]
md_run+0x6f9/0x1300 [md_mod]
raid_ctr+0x2531/0x4ac0 [dm_raid]
dm_table_add_target+0x2b0/0x620 [dm_mod]
table_load+0x1c8/0x400 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
L __fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
__fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff889108f3d200
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 0 bytes to the right of
256-byte region [ffff889108f3d200, ffff889108f3d300)
The buggy address belongs to the physical page:
page:000000007ef2a34c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1108f3c
head:000000007ef2a34c order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=2)
raw: 4000000000010200 0000000000000000 dead000000000001 ffff889100042b40
raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e151db8ecfb019b7da31d076130a794574c89f6f upstream.
When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with
kasan, we got failure in write_page.
The reason for the failure is that md_bitmap_destroy is called before
destroying the thread and the thread may be waiting in the function
write_page for the bio to complete. When the thread finishes waiting, it
executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which
triggers the kasan warning.
Note that the commit 48df498daf62 that caused this bug claims that it is
neede for md-cluster, you should check md-cluster and possibly find
another bugfix for it.
BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod]
Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539
CPU: 10 PID: 5539 Comm: mdX_raid1 Not tainted 5.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x45/0x57a
? __lock_text_start+0x18/0x18
? write_page+0x18d/0x680 [md_mod]
kasan_report+0xa8/0xe0
? write_page+0x18d/0x680 [md_mod]
kasan_check_range+0x13f/0x180
write_page+0x18d/0x680 [md_mod]
? super_sync+0x4d5/0x560 [dm_raid]
? md_bitmap_file_kick+0xa0/0xa0 [md_mod]
? rs_set_dev_and_array_sectors+0x2e0/0x2e0 [dm_raid]
? mutex_trylock+0x120/0x120
? preempt_count_add+0x6b/0xc0
? preempt_count_sub+0xf/0xc0
md_update_sb+0x707/0xe40 [md_mod]
md_reap_sync_thread+0x1b2/0x4a0 [md_mod]
md_check_recovery+0x533/0x960 [md_mod]
raid1d+0xc8/0x2a20 [raid1]
? var_wake_function+0xe0/0xe0
? psi_group_change+0x411/0x500
? preempt_count_sub+0xf/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? raid1_end_read_request+0x2a0/0x2a0 [raid1]
? preempt_count_sub+0xf/0xc0
? _raw_spin_unlock_irqrestore+0x19/0x40
? del_timer_sync+0xa9/0x100
? try_to_del_timer_sync+0xc0/0xc0
? _raw_spin_lock_irqsave+0x78/0xc0
? __lock_text_start+0x18/0x18
? __list_del_entry_valid+0x68/0xa0
? finish_wait+0xa3/0x100
md_thread+0x161/0x260 [md_mod]
? unregister_md_personality+0xa0/0xa0 [md_mod]
? _raw_spin_lock_irqsave+0x78/0xc0
? prepare_to_wait_event+0x2c0/0x2c0
? unregister_md_personality+0xa0/0xa0 [md_mod]
kthread+0x148/0x180
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 5522:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x80/0xa0
md_bitmap_create+0xa8/0xe80 [md_mod]
md_run+0x777/0x1300 [md_mod]
raid_ctr+0x249c/0x4a30 [dm_raid]
dm_table_add_target+0x2b0/0x620 [dm_mod]
table_load+0x1c8/0x400 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Freed by task 5680:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x40
kasan_set_free_info+0x20/0x40
__kasan_slab_free+0xf7/0x140
kfree+0x80/0x240
md_bitmap_free+0x1c3/0x280 [md_mod]
__md_stop+0x21/0x120 [md_mod]
md_stop+0x9/0x40 [md_mod]
raid_dtr+0x1b/0x40 [dm_raid]
dm_table_destroy+0x98/0x1e0 [dm_mod]
__dm_destroy+0x199/0x360 [dm_mod]
dev_remove+0x10c/0x160 [dm_mod]
ctl_ioctl+0x29e/0x560 [dm_mod]
dm_compat_ctl_ioctl+0x7/0x20 [dm_mod]
__do_compat_sys_ioctl+0xfa/0x160
do_syscall_64+0x90/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 617b365872a247480e9dcd50a32c8d1806b21861 upstream.
There's a KASAN warning in raid5_add_disk when running the LVM testsuite.
The warning happens in the test
lvconvert-raid-reshape-linear_to_raid6-single-type.sh. We fix the warning
by verifying that rdev->saved_raid_disk is within limits.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 332bd0778775d0cf105c4b9e03e460b590749916 upstream.
On dm-raid table load (using raid_ctr), dm-raid allocates an array
rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks
is defined by the number of raid metadata and image tupples passed
into the target's constructor.
In the case of RAID layout changes being requested, that number can be
different from the current number of members for existing raid sets as
defined in their superblocks. Example RAID layout changes include:
- raid1 legs being added/removed
- raid4/5/6/10 number of stripes changed (stripe reshaping)
- takeover to higher raid level (e.g. raid5 -> raid6)
When accessing array members, rs->raid_disks must be used in control
loops instead of the potentially larger value in rs->md.raid_disks.
Otherwise it will cause memory access beyond the end of the rs->devs
array.
Fix this by changing code that is prone to out-of-bounds access.
Also fix validate_raid_redundancy() to validate all devices that are
added. Also, use braces to help clean up raid_iterate_devices().
The out-of-bounds memory accesses was discovered using KASAN.
This commit was verified to pass all LVM2 RAID tests (with KASAN
enabled).
Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
bch_sectors_dirty_init()
commit 7d6b902ea0e02b2a25c480edf471cbaa4ebe6b3c upstream.
The local variables check_state (in bch_btree_check()) and state (in
bch_sectors_dirty_init()) should be fully filled by 0, because before
allocating them on stack, they were dynamically allocated by kzalloc().
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 90736eb3232d208ee048493f371075e4272e0944 upstream.
Commit 85e123c27d5c ("dm mirror log: round up region bitmap size to
BITS_PER_LONG") introduced a regression on 64-bit architectures in the
lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation.
If the device is shrunk, we need to clear log bits beyond the end of the
device. The code clears bits up to a 32-bit boundary and then calculates
lc->sync_count by summing set bits up to a 64-bit boundary (the commit
changed that; previously, this boundary was 32-bit too). So, it was using
some non-zeroed bits in the calculation and this caused misbehavior.
Fix this regression by clearing bits up to BITS_PER_LONG boundary.
Fixes: 85e123c27d5c ("dm mirror log: round up region bitmap size to BITS_PER_LONG")
Cc: stable@vger.kernel.org
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ae6e8b1c9bbf6874163d1243e393137313762b7 upstream.
During postsuspend dm-era does the following:
1. Archives the current era
2. Commits the metadata, as part of the RPC call for archiving the
current era
3. Stops the worker
Until the worker stops, it might write to the metadata again. Moreover,
these writes are not flushed to disk immediately, but are cached by the
dm-bufio client, which writes them back asynchronously.
As a result, the committed metadata of a suspended dm-era device might
not be consistent with the in-core metadata.
In some cases, this can result in the corruption of the on-disk
metadata. Suppose the following sequence of events:
1. Load a new table, e.g. a snapshot-origin table, to a device with a
dm-era table
2. Suspend the device
3. dm-era commits its metadata, but the worker does a few more metadata
writes until it stops, as part of digesting an archived writeset
4. These writes are cached by the dm-bufio client
5. Load the dm-era table to another device.
6. The new instance of the dm-era target loads the committed, on-disk
metadata, which don't include the extra writes done by the worker
after the metadata commit.
7. Resume the new device
8. The new dm-era target instance starts using the metadata
9. Resume the original device
10. The destructor of the old dm-era target instance is called and
destroys the dm-bufio client, which results in flushing the cached
writes to disk
11. These writes might overwrite the writes done by the new dm-era
instance, hence corrupting its metadata.
Fix this by committing the metadata after the worker stops running.
stop_worker uses flush_workqueue to flush the current work. However, the
work item may re-queue itself and flush_workqueue doesn't wait for
re-queued works to finish.
This could result in the worker changing the metadata after they have
been committed, or writing to the metadata concurrently with the commit
in the postsuspend thread.
Use drain_workqueue instead, which waits until the work and all
re-queued works finish.
Fixes: eec40579d8487 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 85e123c27d5cbc22cfdc01de1e2ca1d9003a02d0 upstream.
The code in dm-log rounds up bitset_size to 32 bits. It then uses
find_next_zero_bit_le on the allocated region. find_next_zero_bit_le
accesses the bitmap using unsigned long pointers. So, on 64-bit
architectures, it may access 4 bytes beyond the allocated size.
Fix this bug by rounding up bitset_size to BITS_PER_LONG.
This bug was found by running the lvm2 testsuite with kasan.
Fixes: 29121bd0b00e ("[PATCH] dm mirror log: bitset_size fix")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ea23994edc4169bd90d7a9b5908c6ccefd82fa40 upstream.
The RAID0 layout is irrelevant if all members have the same size so the
array has only one zone. It is *also* irrelevant if the array has two
zones and the second zone has only one device, for example if the array
has two members of different sizes.
So in that case it makes sense to allow assembly even when the layout is
undefined, like what is done when the array has only one zone.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1e267742283a4b5a8ca65755c44166be27e9aa0f ]
Generally, the md_unregister_thread is called with reconfig_mutex, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread,
so md_unregister_thread can be called simulitaneously from two call sites
in theory.
Then after previous commit which remove the protection of reconfig_mutex
for md_unregister_thread completely, the potential issue could be worse
than before.
Let's take pers_lock at the beginning of function to ensure reentrancy.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 40f567bbb3b0639d2ec7d1c6ad4b1b018f80cf19 upstream.
The function kzalloc() in detached_dev_do_request() can fail, so its
return value should be checked.
Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 42b805af102471f53e3c7867b8c2b502ea4eef7e upstream.
Now io_acct_set is alloc and free in personality. Remove the codes that
free io_acct_set in md_free and md_stop.
Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality)
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0f2571ad7a30ff6b33cde142439f9378669f8b4f upstream.
In normal stop process, it does like this:
do_md_stop
|
__md_stop (pers->free(); mddev->private=NULL)
|
md_free (free mddev)
__md_stop sets mddev->private to NULL after pers->free. The raid device
will be stopped and mddev memory is free. But in reshape, it doesn't
free the mddev and mddev will still be used in new raid.
In reshape, it first sets mddev->private to new_pers and then runs
old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free.
The new raid can't work anymore. It will panic when dereference
mddev->private because of NULL pointer dereference.
It can panic like this:
[63010.814972] kernel BUG at drivers/md/raid10.c:928!
[63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1
[63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020
[63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10]
[63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246
[63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000
[63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800
[63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000
[63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200
[63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003
[63010.906399] FS: 0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000
[63010.914485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0
[63010.927363] Call Trace:
[63010.929822] ? bio_reset+0xe/0x40
[63010.933144] ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10]
[63010.938629] raid10_sync_request+0x756/0x1610 [raid10]
[63010.943770] md_do_sync.cold+0x3e4/0x94c
[63010.947698] md_thread+0xab/0x160
[63010.951024] ? md_write_inc+0x50/0x50
[63010.954688] kthread+0x149/0x170
[63010.957923] ? set_kthread_struct+0x40/0x40
[63010.962107] ret_from_fork+0x22/0x30
Removing the code that sets mddev->private to NULL in raid0 can fix
problem.
Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality)
Reported-by: Fine Fan <ffan@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6 upstream.
The journal no-space deadlock was reported time to time. Such deadlock
can happen in the following situation.
When all journal buckets are fully filled by active jset with heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and inserting them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results btree node split, new journal request might be
triggered. For example, the btree grows one more level after the node
split, then the root node record in cache device super block will be
upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
space in journal buckets, the journal replay has to wait for new journal
bucket to be reclaimed after at least one journal bucket replayed. This
is one example that how the journal no-space deadlock happens.
The solution to avoid the deadlock is to reserve 1 journal bucket in
run time, and only permit the reserved journal bucket to be used during
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, there is no chance for
journal no-space deadlock to happen anymore.
This patch adds a new member "bool do_reserve" in struct journal, it is
inititalized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization done in
run_cache_set(). In the run time when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whetheer there is free journal bucket to allocate. By
this method, there is always 1 jouranl bucket reserved in run time.
During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.
Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 80db4e4707e78cb22287da7d058d7274bd4cb370 upstream.
After making bch_sectors_dirty_init() being multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() doesn't
release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
bkeys. Because a read lock is added on btree root node to prevent the
btree to be split during the dirty sectors counting, other I/O requester
has no chance to gain the write lock even restart bcache_btree().
That is to say, the incremental dirty sectors counting is incompatible
to the multhreaded bch_sectors_dirty_init(). We have to choose one and
drop another one.
In my testing, with 512 bytes random writes, I generate 1.2T dirty data
and a btree with 400K nodes. With single thread and incremental dirty
sectors counting, it takes 30+ minites to register the backing device.
And with multithreaded dirty sectors counting, the backing device
registration can be accomplished within 2 minutes.
The 30+ minutes V.S. 2- minutes difference makes me decide to keep
multithreaded bch_sectors_dirty_init() and drop the incremental dirty
sectors counting. This is what this patch does.
But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU
will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This is to avoid the watchdog reports a bogus soft lockup
warning.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4dc34ae1b45fe26e772a44379f936c72623dd407 upstream.
Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be
multithreaded") makes bch_sectors_dirty_init() to be much faster
when counting dirty sectors by iterating all dirty keys in the btree.
But it isn't in ideal shape yet, still can be improved.
This patch does the following changes to improve current parallel dirty
keys iteration on the btree,
- Add read lock to root node when multiple threads iterating the btree,
to prevent the root node gets split by I/Os from other registered
bcache devices.
- Remove local variable "char name[32]" and generate kernel thread name
string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on stack and
avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed.
- Increase &state->started to count created kernel thread after it
succeeds to create.
- When wait for all dirty key counting threads to finish, use
wait_event() to replace wait_event_interruptible().
With the above changes, the code is more clear, and some potential error
conditions are avoided.
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 622536443b6731ec82c563aae7807165adbe9178 upstream.
Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() to be much faster when checking
all btree nodes during cache device registration. But it isn't in ideal
shap yet, still can be improved.
This patch does the following thing to improve current parallel btree
nodes check by multiple threads in bch_btree_check(),
- Add read lock to root node while checking all the btree nodes with
multiple threads. Although currently it is not mandatory but it is
good to have a read lock in code logic.
- Remove local variable 'char name[32]', and generate kernel thread name
string directly when calling kthread_run().
- Allocate local variable "struct btree_check_state check_state" on the
stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed.
- Increase check_state->started to count created kernel thread after it
succeeds to create.
- When wait for all checking kernel threads to finish, use wait_event()
to replace wait_event_interruptible().
With this change, the code is more clear, and some potential error
conditions are avoided.
Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 64c54d9244a4efe9bc6e9c98e13c4bbb8bb39083 upstream.
The bug is here:
if (!rdev || rdev->desc_nr != nr) {
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each_rcu(), so it is incorrect to assume that the
iterator value will be NULL if the list is empty or no element
found (In fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). Otherwise it will bypass the check
and lead to invalid memory access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'pdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone")
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fc8738343eefc4ea8afb6122826dea48eacde514 upstream.
The bug is here:
if (!rdev)
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element found.
Otherwise it will bypass the NULL check and lead to invalid memory
access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org
Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update")
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e68cb83a57a458b01c9739e2ad9cb70b04d1e6d2 ]
If bitmap area contains invalid data, kernel will crash then mdadm
triggers "Segmentation fault".
This is cluster-md speical bug. In non-clustered env, mdadm will
handle broken metadata case. In clustered array, only kernel space
handles bitmap slot info. But even this bug only happened in clustered
env, current sanity check is wrong, the code should be changed.
How to trigger: (faulty injection)
dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda
dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb
mdadm -Ss
echo aaa > magic.txt
== below modifying slot 2 bitmap data ==
dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic
dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize
mdadm -A /dev/md0 /dev/sda /dev/sdb
== kernel crashes. mdadm outputs "Segmentation fault" ==
Reason of kernel crash:
In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't
block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T()
trigger "divide error".
Crash log:
kernel: md: md0 stopped.
kernel: md/raid1:md0: not clean -- starting background reconstruction
kernel: md/raid1:md0: active with 2 out of 2 mirrors
kernel: dlm: ... ...
kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1
kernel: md0: invalid bitmap file superblock: bad magic
kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2
kernel: md-cluster: Could not gather bitmaps from slot 2
kernel: divide error: 0000 [#1] SMP NOPTI
kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246
kernel: ... ...
kernel: Call Trace:
kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0]
kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a]
kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0]
kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a]
kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a]
kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a]
kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a]
kernel: ? blkdev_ioctl+0xb1/0x2b0
kernel: block_ioctl+0x3b/0x40
kernel: __x64_sys_ioctl+0x7f/0xb0
kernel: do_syscall_64+0x59/0x80
kernel: ? exit_to_user_mode_prepare+0x1ab/0x230
kernel: ? syscall_exit_to_user_mode+0x18/0x40
kernel: ? do_syscall_64+0x69/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f4a15fa722b
kernel: ... ...
kernel: ---[ end trace 8afa7612f559c868 ]---
kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod]
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 57668f0a4cc4083a120cc8c517ca0055c4543b59 upstream.
Raid456 module had allowed to achieve failed state. It was fixed by
fb73b357fb9 ("raid5: block failing device if raid will be failed").
This fix introduces a bug, now if raid5 fails during IO, it may result
with a hung task without completion. Faulty flag on the device is
necessary to process all requests and is checked many times, mainly in
analyze_stripe().
Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
As a result, this level is allowed to achieve failed state again, but
communication with userspace (via -EBUSY status) will be preserved.
This restores possibility to fail array via #mdadm --set-faulty command
and will be fixed by additional verification on mdadm side.
Reproduction steps:
mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
mkfs.xfs /dev/md126 -f
mount /dev/md126 /mnt/root/
fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
--bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
--time_based --group_reporting --name=throughput-test-job
--eta-newline=1 &
echo 1 > /sys/block/nvme2n1/device/device/remove
echo 1 > /sys/block/nvme1n1/device/device/remove
[ 1475.787779] Call Trace:
[ 1475.793111] __schedule+0x2a6/0x700
[ 1475.799460] schedule+0x38/0xa0
[ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
[ 1475.813856] ? finish_wait+0x80/0x80
[ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
[ 1475.828281] ? finish_wait+0x80/0x80
[ 1475.834727] ? finish_wait+0x80/0x80
[ 1475.841127] ? finish_wait+0x80/0x80
[ 1475.847480] md_handle_request+0x119/0x190
[ 1475.854390] md_make_request+0x8a/0x190
[ 1475.861041] generic_make_request+0xcf/0x310
[ 1475.868145] submit_bio+0x3c/0x160
[ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
[ 1475.882070] iomap_dio_bio_actor+0x175/0x390
[ 1475.889149] iomap_apply+0xff/0x310
[ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.909974] iomap_dio_rw+0x2f2/0x490
[ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
[ 1475.923680] ? atime_needs_update+0x77/0xe0
[ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
[ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
[ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
[ 1475.953403] aio_read+0xd5/0x180
[ 1475.959395] ? _cond_resched+0x15/0x30
[ 1475.965907] io_submit_one+0x20b/0x3c0
[ 1475.972398] __x64_sys_io_submit+0xa2/0x180
[ 1475.979335] ? do_io_getevents+0x7c/0xc0
[ 1475.986009] do_syscall_64+0x5b/0x1a0
[ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1476.000255] RIP: 0033:0x7f11fc27978d
[ 1476.006631] Code: Bad RIP value.
[ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
Cc: stable@vger.kernel.org
Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4caae58406f8ceb741603eee460d79bacca9b1b5 upstream.
The device-mapper framework provides a mechanism to mark targets as
immutable (and hence fail table reloads that try to change the target
type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's
feature flags to prevent switching the verity target with a different
target type.
Fixes: a4ffc152198e ("dm: add verity target")
Cc: stable@vger.kernel.org
Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bfe2b0146c4d0230b68f5c71a64380ff8d361f8b upstream.
dm-stats can be used with a very large number of entries (it is only
limited by 1/4 of total system memory), so add rescheduling points to
the loops that iterate over the entries.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 567dd8f34560fa221a6343729474536aa7ede4fd upstream.
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to
report the current key to userspace. However, this is not a constant-time
operation and it may leak information about the key via timing, via cache
access patterns or via the branch predictor.
Change dm-crypt's key printing to use "%c" instead of "%02x". Also
introduce hex2asc() that carefully avoids any branching or memory
accesses when converting a number in the range 0 ... 15 to an ascii
character.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d3f2a14b8906df913cb04a706367b012db94a6e8 upstream.
The "r" variable shadows an earlier "r" that has function scope. It
means that we accidentally return success instead of an error code.
Smatch has a warning for this:
drivers/md/dm-integrity.c:4503 dm_integrity_ctr()
warn: missing error code 'r'
Fixes: 7eada909bfd7 ("dm: add integrity target")
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 08c1af8f1c13bbf210f1760132f4df24d0ed46d6 upstream.
It is possible to set up dm-integrity in such a way that the
"tag_size" parameter is less than the actual digest size. In this
situation, a part of the digest beyond tag_size is ignored.
In this case, dm-integrity would write beyond the end of the
ic->recalc_tags array and corrupt memory. The corruption happened in
integrity_recalc->integrity_sector_checksum->crypto_shash_final.
Fix this corruption by increasing the tags array so that it has enough
padding at the end to accomodate the loop in integrity_recalc() being
able to write a full digest size for the last member of the tags
array.
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ce40426fdc3c92acdba6b5ca74bc7277ffaa6a3d ]
Mixing sched_clock() and ktime_get_ns() usage will give bad results.
Switch hst_select_path() from using sched_clock() to ktime_get_ns().
Also rename path_service_time()'s 'sched_now' variable to 'now'.
Fixes: 2613eab11996 ("dm mpath: add Historical Service Time Path Selector")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|