| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit d67ed2ccd2d1dcfda9292c0ea8697a9d0f2f0d98 ]
In the raid1_reshape function, newpool is
allocated on the stack and assigned to conf->r1bio_pool.
This results in conf->r1bio_pool.wait.head pointing
to a stack address.
Accessing this address later can lead to a kernel panic.
Example access path:
raid1_reshape()
{
// newpool is on the stack
mempool_t newpool, oldpool;
// initialize newpool.wait.head to stack address
mempool_init(&newpool, ...);
conf->r1bio_pool = newpool;
}
raid1_read_request() or raid1_write_request()
{
alloc_r1bio()
{
mempool_alloc()
{
// if pool->alloc fails
remove_element()
{
--pool->curr_nr;
}
}
}
}
mempool_free()
{
if (pool->curr_nr < pool->min_nr) {
// pool->wait.head is a stack address
// wake_up() will try to access this invalid address
// which leads to a kernel panic
return;
wake_up(&pool->wait);
}
}
Fix:
reinit conf->r1bio_pool.wait after assigning newpool.
Fixes: afeee514ce7f ("md: convert to bioset_init()/mempool_init()")
Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b7c178d9e57c8fd4238ff77263b877f6f16182ba ]
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.
If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.
Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit cd5fc653381811f1e0ba65f5d169918cab61476f upstream.
There are two BUG reports that raid5 will hang at
bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
write is unbalanced, it's not quite clear where, and while reviewing raid5
code, it's found that bitmap operations can be optimized. For example,
for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to
the array:
┌────────────────────────────────────────────────────────────┐
│chunk 0 │
│ ┌────────────┬─────────────┬─────────────┬────────────┼
│ sh0 │A0: 0 + 4k │A1: 8k + 4k │A2: 16k + 4k │A3: P │
│ ┼────────────┼─────────────┼─────────────┼────────────┼
│ sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1 │
│ ┌────────────┬─────────────┬─────────────┬────────────┤
│ sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P │C3: 40k + 4k│
│ ┼────────────┼─────────────┼─────────────┼────────────┼
│ sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘
Before this patch, 4 stripe head will be used, and each sh will attach
bio for 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means total 12 times.
- 3 times (0 + 4k), for (A0, A1 and A2)
- 3 times (4 + 4k), for (B0, B1 and B2)
- 3 times (8 + 4k), for (C0, C1 and C3)
- 3 times (12 + 4k), for (D0, D1 and D3)
After this patch, md upper layer will calculate that IO range (0 + 48k)
is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite()
just once.
Noted that this patch will align bitmap ranges to the chunks, for example,
if user issue a IO (0 + 4k) to array:
- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k) for chunk 0;
Usually, one bitmap bit will represent more than one disk chunk, and this
doesn't have any difference. And even if user really created a array
that one chunk contain multiple bits, the overhead is that more data
will be recovered after power failure.
Also remove STRIPE_BITMAP_PENDING since it's not used anymore.
[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[There is no bitmap_operations, resolve conflicts by replacing
bitmap_ops->{startwrite, endwrite} with md_bitmap_{startwrite, endwrite}]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f0e7d0e03b7b80af84759a9e7cfb0f81ac4adae upstream.
For the case that IO failed for one rdev, the bit will be mark as NEEDED
in following cases:
1) If badblocks is set and rdev is not faulty;
2) If rdev is faulty;
Case 1) is useless because synchronize data to badblocks make no sense.
Case 2) can be replaced with mddev->degraded.
Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since
case 2) no longer use them.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[ Resolve minor conflicts ]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 08c50142a128dcb2d7060aa3b4c5db8837f7a46a upstream.
behind_write is only used in raid1, prepare to refactor
bitmap_{start/end}write(), there are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[There is no bitmap_operations, resolve conflicts by exporting new api
md_bitmap_{start,end}_behind_write]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fcf3f7e2fc8a53a6140beee46ec782a4c88e4744 upstream.
r1_bio->bios[] is used to record new bios that will be issued to
underlying disks, however, in raid1_write_request(), r1_bio->bios[]
will set to the original bio temporarily. Meanwhile, if blocked rdev
is set, free_r1bio() will be called causing that all r1_bio->bios[]
to be freed:
raid1_write_request()
r1_bio = alloc_r1bio(mddev, bio); -> r1_bio->bios[] is NULL
for (i = 0; i < disks; i++) -> for each rdev in conf
// first rdev is normal
r1_bio->bios[0] = bio; -> set to original bio
// second rdev is blocked
if (test_bit(Blocked, &rdev->flags))
break
if (blocked_rdev)
free_r1bio()
put_all_bios()
bio_put(r1_bio->bios[0]) -> original bio is freed
Test scripts:
mdadm -CR /dev/md0 -l1 -n4 /dev/sd[abcd] --assume-clean
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 \
-iodepth=128 -name=test -direct=1
echo blocked > /sys/block/md0/md/rd2/state
Test result:
BUG bio-264 (Not tainted): Object already free
-----------------------------------------------------------------------------
Allocated in mempool_alloc_slab+0x24/0x50 age=1 cpu=1 pid=869
kmem_cache_alloc+0x324/0x480
mempool_alloc_slab+0x24/0x50
mempool_alloc+0x6e/0x220
bio_alloc_bioset+0x1af/0x4d0
blkdev_direct_IO+0x164/0x8a0
blkdev_write_iter+0x309/0x440
aio_write+0x139/0x2f0
io_submit_one+0x5ca/0xb70
__do_sys_io_submit+0x86/0x270
__x64_sys_io_submit+0x22/0x30
do_syscall_64+0xb1/0x210
entry_SYSCALL_64_after_hwframe+0x6c/0x74
Freed in mempool_free_slab+0x1f/0x30 age=1 cpu=1 pid=869
kmem_cache_free+0x28c/0x550
mempool_free_slab+0x1f/0x30
mempool_free+0x40/0x100
bio_free+0x59/0x80
bio_put+0xf0/0x220
free_r1bio+0x74/0xb0
raid1_make_request+0xadf/0x1150
md_handle_request+0xc7/0x3b0
md_submit_bio+0x76/0x130
__submit_bio+0xd8/0x1d0
submit_bio_noacct_nocheck+0x1eb/0x5c0
submit_bio_noacct+0x169/0xd40
submit_bio+0xee/0x1d0
blkdev_direct_IO+0x322/0x8a0
blkdev_write_iter+0x309/0x440
aio_write+0x139/0x2f0
Since that bios for underlying disks are not allocated yet, fix this
problem by using mempool_free() directly to free the r1_bio.
Fixes: 992db13a4aee ("md/raid1: free the r1bio before waiting for blocked rdev")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240308093726.1047420-1-yukuai1@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7dab24554dedd4e6f408af8eb2d25c89997a6a1f upstream.
Use the type blk_opf_t for read and write operations instead of int. This
patch does not affect the generated code but fixes the following sparse
warning:
drivers/md/raid1.c:1993:60: sparse: sparse: incorrect type in argument 5 (different base types)
expected restricted blk_opf_t [usertype] opf
got int rw
Cc: Song Liu <song@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Fixes: 3c5e514db58f ("md/raid1: Use the new blk_opf_t type")
Cc: stable@vger.kernel.org # v6.0+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080657.UjFnvQgX-lkp@intel.com/
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240108001223.23835-1-bvanassche@acm.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
There is a compile error when this commit is added:
md: raid1: fix potential OOB in raid1_remove_disk()
drivers/md/raid1.c: In function 'raid1_remove_disk':
drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations
and code [-Werror=declaration-after-statement]
1844 | struct raid1_info *p = conf->mirrors + number;
| ^~~~~~
That's because the new code was inserted before the struct.
The change is move the struct command above this commit.
Fixes: 8b0472b50bcf ("md: raid1: fix potential OOB in raid1_remove_disk()")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com
|
|
As the WriteMostly flag can be set on any component device of a RAID1
array, remove the constraint that it only works if set on the first one.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/2a9592bf3340f34bf588eec984b23ee219f3985e.1692013451.git.heinzm@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
handle_read_error() will call allow_barrier() to match the former barrier
raising. However, it should put the allow_barrier() at the end to avoid a
concurrent raid reshape.
Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
Raid1 reshape will change mempool and r1conf::raid_disks which are
needed to free r1bio. allow_barrier() make a concurrent raid1_reshape()
possible. So, free the in-flight r1bio before waiting blocked rdev.
Fixes: 6bfe0b499082 ("md: support blocking writes to an array on device failure")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
After allow_barrier, a concurrent raid1_reshape() will replace old mempool
and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(),
so that r1bio can be freed safely.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock holding check before
acquiring mddev->sync_thread by passing mdev to md_unregister_thread().
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
If rddev->raid_disk is greater than mddev->raid_disks, there will be
an out-of-bounds in raid1_remove_disk(). We have already found
similar reports as follows:
1) commit d17f744e883b ("md-raid10: fix KASAN warning")
2) commit 1ebc2cec0b7d ("dm raid: fix KASAN warning in raid5_remove_disk")
Fix this bug by checking whether the "number" variable is
valid.
Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
wake_up is called unconditionally in a few paths such as make_request(),
which cause lock contention under high concurrency workload like below
raid1_end_write_request
wake_up
__wake_up_common_lock
spin_lock_irqsave
Improve performance by only call wake_up() if waitqueue is not empty
Fio test script:
[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]
Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server.
Before this patch With this patch
READ BW=4621MB/s BW=7337MB/s
WRITE BW=1980MB/s BW=3144MB/s
The patch is inspired by Yu Kuai's change for raid10:
https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
In fix_read_error(), 'success' will be checked immediately after assigning
it, if it is set to 1 then the loop will break. Checking it again in
condition of loop is redundant. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
New disk should be added to "removed" position first instead of to be a
replacement. Commit 6090368abcb4 ("md/raid10: prioritize adding disk to
'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230627014332.3810102-1-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
|
|
Two problems can be fixed this way:
1) 'active_io' will represent inflight io instead of io that is
dispatching.
2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
|
|
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:
Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:
[root@fedora ~]# cat /sys/block/md0/inflight
0 4474191
Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
|
|
current->bio_list will be set under submit_bio() context, in this case
bitmap io will be added to the list and wait for current io submission to
finish, while current io submission must wait for bitmap io to be done.
commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix
the deadlock by handling plugged bio by daemon thread.
On the one hand, the deadlock won't exist after commit a214b949d8e3
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, current solution makes it impossible to flush plugged bio
in raid1/10_make_request(), because this will cause that all the writes
will goto daemon thread.
In order to limit the number of plugged bio, commit 874807a83139
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
|
|
There are multiple places to do the same thing, factor out a helper to
prevent redundant code, and the helper will be used in following patch
as well.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
|
|
The code in raid1 and raid10 is identical, prepare to limit the number
of plugged bios.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
|
|
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:
1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().
Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.
Fix this problem by protecting all md_thread with rcu.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
|
|
The sync request code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/6cf7f66c6e646231200d025dfd5f2d3ae75c8fe5.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
alloc_behind_master_bio() can possibly add multiple pages to a bio, but it
is not checking for the return value of bio_add_page() if adding really
succeeded.
Check if the page adding succeeded and if not bail out.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/827aa12d44ebf3f50b41b47f5cedc0f80179f2c1.1685532726.git.johannes.thumshirn@wdc.com
[axboe: fold in s/free_page/put_page fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This macro is obsolete, so replace the last few uses with open coded
bi_opf assignments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
fail run raid1 array when we assemble array with the inactive disk only,
but the mdx_raid1 thread were not stop, Even if the associated resources
have been released. it will caused a NULL dereference when we do poweroff.
This causes the following Oops:
[ 287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
[ 287.594762] #PF: supervisor read access in kernel mode
[ 287.599912] #PF: error_code(0x0000) - not-present page
[ 287.605061] PGD 0 P4D 0
[ 287.607612] Oops: 0000 [#1] SMP NOPTI
[ 287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0
[ 287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
[ 287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
[ 287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
[ 287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
[ 287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
[ 287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
[ 287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
[ 287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
[ 287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
[ 287.692052] FS: 0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
[ 287.700149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
[ 287.713033] Call Trace:
[ 287.715498] raid1d+0x6c/0xbbb [raid1]
[ 287.719256] ? __schedule+0x1ff/0x760
[ 287.722930] ? schedule+0x3b/0xb0
[ 287.726260] ? schedule_timeout+0x1ed/0x290
[ 287.730456] ? __switch_to+0x11f/0x400
[ 287.734219] md_thread+0xe9/0x140 [md_mod]
[ 287.738328] ? md_thread+0xe9/0x140 [md_mod]
[ 287.742601] ? wait_woken+0x80/0x80
[ 287.746097] ? md_register_thread+0xe0/0xe0 [md_mod]
[ 287.751064] kthread+0x11a/0x140
[ 287.754300] ? kthread_park+0x90/0x90
[ 287.757974] ret_from_fork+0x1f/0x30
In fact, when raid1 array run fail, we need to do
md_unregister_thread() before raid1_free().
Signed-off-by: Jiang Li <jiang.li@ugreen.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
Improve static type checking by using the new blk_opf_t type for
variables that represent request flags.
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-35-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Improve uniformity in the kernel of handling of request operation and
flags by passing these as a single argument.
Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the remaining calls of bdevname with snprintf using the %pg
format specifier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the %pg format specifier to save on stack consumption and code size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
|
|
There is no direct mechanism to determine raid failure outside
personality. It is done by checking rdev->flags after executing
md_error(). If "faulty" flag is not set then -EBUSY is returned to
userspace. -EBUSY means that array will be failed after drive removal.
Mdadm has special routine to handle the array failure and it is executed
if -EBUSY is returned by md.
There are at least two known reasons to not consider this mechanism
as correct:
1. drive can be removed even if array will be failed[1].
2. -EBUSY seems to be wrong status. Array is not busy, but removal
process cannot proceed safe.
-EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch first issue is resolved by adding support
for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
next commit.
The idea is to set the MD_BROKEN if we are sure that raid is in failed
state now. This is done in each error_handler(). In md_error() MD_BROKEN
flag is checked. If is set, then -EBUSY is returned to userspace.
As in previous commit, it causes that #mdadm --set-faulty is able to
fail array. Previously proposed workaround is valid if optional
functionality[1] is disabled.
[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
RAID1/RAID10.")
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.
The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to check the nonrot flag based on the block_device instead
of having to poke into the block layer internal request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the magic autofree semantics and require the callers to explicitly
call bio_init to initialize the bio.
This allows bio_free to catch accidental bio_put calls on bio_init()ed
bios as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
|
|
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, pm8001,
libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
and bug fixes.
The high blast radius core update is the removal of write same, which
affects block and several non-SCSI devices. The other big change,
which is more local, is the removal of the SCSI pointer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
scsi: scsi_ioctl: Drop needless assignment in sg_io()
scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
scsi: lpfc: Copyright updates for 14.2.0.0 patches
scsi: lpfc: Update lpfc version to 14.2.0.0
scsi: lpfc: SLI path split: Refactor BSG paths
scsi: lpfc: SLI path split: Refactor Abort paths
scsi: lpfc: SLI path split: Refactor SCSI paths
scsi: lpfc: SLI path split: Refactor CT paths
scsi: lpfc: SLI path split: Refactor misc ELS paths
scsi: lpfc: SLI path split: Refactor VMID paths
scsi: lpfc: SLI path split: Refactor FDISC paths
scsi: lpfc: SLI path split: Refactor LS_RJT paths
scsi: lpfc: SLI path split: Refactor LS_ACC paths
scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
scsi: lpfc: SLI path split: Refactor lpfc_iocbq
scsi: lpfc: Use kcalloc()
...
|
|
Pull block driver updates from Jens Axboe:
- NVMe updates via Christoph:
- add vectored-io support for user-passthrough (Kanchan Joshi)
- add verbose error logging (Alan Adamson)
- support buffered I/O on block devices in nvmet (Chaitanya
Kulkarni)
- central discovery controller support (Martin Belanger)
- fix and extended the globally unique idenfier validation
(Christoph)
- move away from the deprecated IDA APIs (Sagi Grimberg)
- misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin,
Chaitanya Kulkarni)
- add lockdep annotations for in-kernel sockets (Chris Leech)
- use vmalloc for ANA log buffer (Hannes Reinecke)
- kerneldoc fixes (Chaitanya Kulkarni)
- cleanups (Guoqing Jiang, Chaitanya Kulkarni, Christoph)
- warn about shared namespaces without multipathing (Christoph)
- MD updates via Song with a set of cleanups (Christoph, Mariusz, Paul,
Erik, Dirk)
- loop cleanups and queue depth configuration (Chaitanya)
- null_blk cleanups and fixes (Chaitanya)
- Use descriptive init/exit names in virtio_blk (Randy)
- Use bvec_kmap_local() in drivers (Christoph)
- bcache fixes (Mingzhe)
- xen blk-front persistent grant speedups (Juergen)
- rnbd fix and cleanup (Gioh)
- Misc fixes (Christophe, Colin)
* tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block: (76 commits)
virtio_blk: eliminate anonymous module_init & module_exit
nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH
nvme: remove nvme_alloc_request and nvme_alloc_request_qid
nvme: cleanup how disk->disk_name is assigned
nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate
nvmet: use snprintf() with PAGE_SIZE in configfs
nvmet: don't fold lines
nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal
nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport
nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport
nvme-tcp: lockdep: annotate in-kernel sockets
nvme-tcp: don't fold the line
nvme-tcp: don't initialize ret variable
nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio
nvme-multipath: use vmalloc for ANA log buffer
xen/blkfront: speed up purge_persistent_grants()
raid5: initialize the stripe_head embeeded bios as needed
raid5-cache: statically allocate the recovery ra bio
raid5-cache: fully initialize flush_bio when needed
raid5-ppl: fully initialize the bio in ppl_new_iounit
...
|
|
Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue
limitation for RAID1 and RAID10"). Remove them from all code (conf and
plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move
definition of raid1_plug_cb to common raid1-10 definitions and use it for
RAID10 too.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the %pg format specifier to save on stack consuption and code size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.
Link: https://lore.kernel.org/r/20220209082828.2629273-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass the block_device that we plan to use this bio for and the
operation to bio_reset to optimize the assigment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block driver updates from Jens Axboe:
- mtip32xx pci cleanups (Bjorn)
- mtip32xx conversion to generic power management (Vaibhav)
- rsxx pci powermanagement cleanups (Bjorn)
- Remove the rsxx driver. This hardware never saw much adoption, and
it's been end of lifed for a while. (Christoph)
- MD pull request from Song:
- REQ_NOWAIT support (Vishal Verma)
- raid6 benchmark optimization (Dirk Müller)
- Fix for acct bioset (Xiao Ni)
- Clean up max_queued_requests (Mariusz Tkaczyk)
- PREEMPT_RT optimization (Davidlohr Bueso)
- Use default_groups in kobj_type (Greg Kroah-Hartman)
- Use attribute groups in pktcdvd and rnbd (Greg)
- NVMe pull request from Christoph:
- increment request genctr on completion (Keith Busch, Geliang
Tang)
- add a 'iopolicy' module parameter (Hannes Reinecke)
- print out valid arguments when reading from /dev/nvme-fabrics
(Hannes Reinecke)
- Use struct_group() in drbd (Kees)
- null_blk fixes (Ming)
- Get rid of congestion logic in pktcdvd (Neil)
- Floppy ejection hang fix (Tasos)
- Floppy max user request size fix (Xiongwei)
- Loop locking fix (Tetsuo)
* tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits)
md: use default_groups in kobj_type
md: Move alloc/free acct bioset in to personality
lib/raid6: Use strict priority ranking for pq gen() benchmarking
lib/raid6: skip benchmark of non-chosen xor_syndrome functions
md: fix spelling of "its"
md: raid456 add nowait support
md: raid10 add nowait support
md: raid1 add nowait support
md: add support for REQ_NOWAIT
md: drop queue limitation for RAID1 and RAID10
md/raid5: play nice with PREEMPT_RT
block/rnbd-clt-sysfs: use default_groups in kobj_type
pktcdvd: convert to use attribute groups
block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0
nvme: add 'iopolicy' module parameter
nvme: drop unused variable ctrl in nvme_setup_cmd
nvme: increment request genctr on completion
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
block: remove the rsxx driver
rsxx: Drop PCI legacy power management
...
|
|
This adds nowait support to the RAID1 driver. It makes RAID1 driver
return with EAGAIN for situations where it could wait for eg:
- Waiting for the barrier,
wait_barrier() fn is modified to return bool to support error for
wait barriers. It returns true in case of wait or if wait is not
required and returns false if wait was required but not performed
to support nowait.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
As suggested by Neil Brown[1], this limitation seems to be
deprecated.
With plugging in use, writes are processed behind the raid thread
and conf->pending_count is not increased. This limitation occurs only
if caller doesn't use plugs.
It can be avoided and often it is (with plugging). There are no reports
that queue is growing to enormous size so remove queue limitation for
non-plugged IOs too.
[1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
commit [1] causes missing bitmap updates when there isn't any WriteMostly
devices.
Detailed steps to reproduce by Norbert (which somehow didn't make to lore):
# setup md10 (raid1) with two drives (1 GByte sparse files)
dd if=/dev/zero of=disk1 bs=1024k seek=1024 count=0
dd if=/dev/zero of=disk2 bs=1024k seek=1024 count=0
losetup /dev/loop11 disk1
losetup /dev/loop12 disk2
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop11 /dev/loop12
# add bitmap (aka write-intent log)
mdadm /dev/md10 --grow --bitmap=internal
echo check > /sys/block/md10/md/sync_action
root:# cat /sys/block/md10/md/mismatch_cnt
0
root:#
# remove member drive disk2 (loop12)
mdadm /dev/md10 -f loop12 ; mdadm /dev/md10 -r loop12
# modify degraded md device
dd if=/dev/urandom of=/dev/md10 bs=512 count=1
# no blocks recorded as out of sync on the remaining member disk1/loop11
root:# mdadm -X /dev/loop11 | grep Bitmap
Bitmap : 16 bits (chunks), 0 dirty (0.0%)
root:#
# re-add disk2, nothing synced because of empty bitmap
mdadm /dev/md10 --re-add /dev/loop12
# check integrity again
echo check > /sys/block/md10/md/sync_action
# disk1 and disk2 are no longer in sync, reads return differend data
root:# cat /sys/block/md10/md/mismatch_cnt
128
root:#
# clean up
mdadm -S /dev/md10
losetup -d /dev/loop11
losetup -d /dev/loop12
rm disk1 disk2
Fix this by moving the WriteMostly check to the if condition for
alloc_behind_master_bio().
[1] commit fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device")
Fixes: fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device")
Cc: stable@vger.kernel.org # v5.12+
Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Reported-by: Norbert Warmuth <nwarmuth@t-online.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
|
|
We already get rdev from conf->mirrors[i].rdev at the beginning of the
loop, so just use it.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|