| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 4555211190798b6b6fa2c37667d175bf67945c78 upstream.
- limit bitmap chunk size internal u64 variable to values not overflowing
the u32 bitmap superblock structure variable stored on persistent media
- assign bitmap chunk size internal u64 variable from unsigned values to
avoid possible sign extension artifacts when assigning from a s32 value
The bug has been there since at least kernel 4.0.
Steps to reproduce it:
1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2
-n2 /dev/rnbd1 /dev/rnbd2
2 resize member device rnbd1 and rnbd2 to 8 TB
3 mdadm --grow /dev/mdx --size=max
The bitmap_chunksize will overflow without patch.
Cc: stable@vger.kernel.org
Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6b9973861cb2e96dcd0bb0f1baddc5c034207c5c upstream.
Otherwise the commit that will be aborted will be associated with the
metadata objects that will be torn down. Must write needs_check flag
to metadata with a reset block manager.
Found through code-inspection (and compared against dm-thin.c).
Cc: stable@vger.kernel.org
Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6a459d8edbdbe7b24db42a5a9f21e6aa9e00c2aa upstream.
Dm_cache also has the same UAF problem when dm_resume()
and dm_destroy() are concurrent.
Therefore, cancelling timer again in destroy().
Cc: stable@vger.kernel.org
Fixes: c6b4fcbad044e ("dm: add cache target")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e4b5957c6f749a501c464f92792f1c8e26b61a94 upstream.
Dm_clone also has the same UAF problem when dm_resume()
and dm_destroy() are concurrent.
Therefore, cancelling timer again in clone_dtr().
Cc: stable@vger.kernel.org
Fixes: 7431b7835f554 ("dm: add clone target")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f50cb2cbabd6c4a60add93d72451728f86e4791c upstream.
Dm_integrity also has the same UAF problem when dm_resume()
and dm_destroy() are concurrent.
Therefore, cancelling timer again in dm_integrity_dtr().
Cc: stable@vger.kernel.org
Fixes: 7eada909bfd7a ("dm: add integrity target")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 88430ebcbc0ec637b710b947738839848c20feff upstream.
When dm_resume() and dm_destroy() are concurrent, it will
lead to UAF, as follows:
BUG: KASAN: use-after-free in __run_timers+0x173/0x710
Write of size 8 at addr ffff88816d9490f0 by task swapper/0/0
<snip>
Call Trace:
<IRQ>
dump_stack_lvl+0x73/0x9f
print_report.cold+0x132/0xaa2
_raw_spin_lock_irqsave+0xcd/0x160
__run_timers+0x173/0x710
kasan_report+0xad/0x110
__run_timers+0x173/0x710
__asan_store8+0x9c/0x140
__run_timers+0x173/0x710
call_timer_fn+0x310/0x310
pvclock_clocksource_read+0xfa/0x250
kvm_clock_read+0x2c/0x70
kvm_clock_get_cycles+0xd/0x20
ktime_get+0x5c/0x110
lapic_next_event+0x38/0x50
clockevents_program_event+0xf1/0x1e0
run_timer_softirq+0x49/0x90
__do_softirq+0x16e/0x62c
__irq_exit_rcu+0x1fa/0x270
irq_exit_rcu+0x12/0x20
sysvec_apic_timer_interrupt+0x8e/0xc0
One of the concurrency UAF can be shown as below:
use free
do_resume |
__find_device_hash_cell |
dm_get |
atomic_inc(&md->holders) |
| dm_destroy
| __dm_destroy
| if (!dm_suspended_md(md))
| atomic_read(&md->holders)
| msleep(1)
dm_resume |
__dm_resume |
dm_table_resume_targets |
pool_resume |
do_waker #add delay work |
dm_put |
atomic_dec(&md->holders) |
| dm_table_destroy
| pool_dtr
| __pool_dec
| __pool_destroy
| destroy_workqueue
| kfree(pool) # free pool
time out
__do_softirq
run_timer_softirq # pool has already been freed
This can be easily reproduced using:
1. create thin-pool
2. dmsetup suspend pool
3. dmsetup resume pool
4. dmsetup remove_all # Concurrent with 3
The root cause of this UAF bug is that dm_resume() adds timer after
dm_destroy() skips cancelling the timer because of suspend status.
After timeout, it will call run_timer_softirq(), however pool has
already been freed. The concurrency UAF bug will happen.
Therefore, cancelling timer again in __pool_destroy().
Cc: stable@vger.kernel.org
Fixes: 991d9fa02da0d ("dm: add thin provisioning target")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 19eb1650afeb1aa86151f61900e9e5f1de5d8d02 upstream.
If a thinpool set fail_io while suspending, resume will fail with:
device-mapper: resume ioctl on vg-thinpool failed: Invalid argument
The thin-pool also can't be removed if an in-flight bio is in the
deferred list.
This can be easily reproduced using:
echo "offline" > /sys/block/sda/device/state
dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1
dmsetup suspend /dev/mapper/pool
mkfs.ext4 /dev/mapper/thin
dmsetup resume /dev/mapper/pool
The root cause is maybe_resize_data_dev() will check fail_io and return
error before called dm_resume.
Fix this by adding FAIL mode check at the end of pool_preresume().
Cc: stable@vger.kernel.org
Fixes: da105ed5fd7e ("dm thin metadata: introduce dm_pool_abort_metadata")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7991dbff6849f67e823b7cc0c15e5a90b0549b9f upstream.
Recently we found a softlock up problem in dm thin pool btree lookup
code due to corrupted metadata:
Kernel panic - not syncing: softlockup: hung tasks
CPU: 7 PID: 2669225 Comm: kworker/u16:3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Workqueue: dm-thin do_worker [dm_thin_pool]
Call Trace:
<IRQ>
dump_stack+0x9c/0xd3
panic+0x35d/0x6b9
watchdog_timer_fn.cold+0x16/0x25
__run_hrtimer+0xa2/0x2d0
</IRQ>
RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio]
__bufio_new+0x11f/0x4f0 [dm_bufio]
new_read+0xa3/0x1e0 [dm_bufio]
dm_bm_read_lock+0x33/0xd0 [dm_persistent_data]
ro_step+0x63/0x100 [dm_persistent_data]
btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data]
dm_btree_lookup+0x16f/0x210 [dm_persistent_data]
dm_thin_find_block+0x12c/0x210 [dm_thin_pool]
__process_bio_read_only+0xc5/0x400 [dm_thin_pool]
process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool]
process_one_work+0x3c5/0x730
Following process may generate a broken btree mixed with fresh and
stale btree nodes, which could get dm thin trapped in an infinite loop
while looking up data block:
Transaction 1: pmd->root = A, A->B->C // One path in btree
pmd->root = X, X->Y->Z // Copy-up
Transaction 2: X,Z is updated on disk, Y write failed.
// Commit failed, dm thin becomes read-only.
process_bio_read_only
dm_thin_find_block
__find_block
dm_btree_lookup(pmd->root)
The pmd->root points to a broken btree, Y may contain stale node
pointing to any block, for example X, which gets dm thin trapped into
a dead loop while looking up Z.
Fix this by setting pmd->root in __open_metadata(), so that dm thin
will use the last transaction's pmd->root if commit failed.
Fetch a reproducer in [Link].
Linke: https://bugzilla.kernel.org/show_bug.cgi?id=216790
Cc: stable@vger.kernel.org
Fixes: 991d9fa02da0 ("dm: add thin provisioning target")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8111964f1b8524c4bb56b02cd9c7a37725ea21fd upstream.
Following concurrent processes:
P1(drop cache) P2(kworker)
drop_caches_sysctl_handler
drop_slab
shrink_slab
down_read(&shrinker_rwsem) - LOCK A
do_shrink_slab
super_cache_scan
prune_icache_sb
dispose_list
evict
ext4_evict_inode
ext4_clear_inode
ext4_discard_preallocations
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_read_block_bitmap_nowait
ext4_read_bh_nowait
submit_bh
dm_submit_bio
do_worker
process_deferred_bios
commit
metadata_operation_failed
dm_pool_abort_metadata
down_write(&pmd->root_lock) - LOCK B
__destroy_persistent_data_objects
dm_block_manager_destroy
dm_bufio_client_destroy
unregister_shrinker
down_write(&shrinker_rwsem)
thin_map |
dm_thin_find_block ↓
down_read(&pmd->root_lock) --> ABBA deadlock
, which triggers hung task:
[ 76.974820] INFO: task kworker/u4:3:63 blocked for more than 15 seconds.
[ 76.976019] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910
[ 76.978521] task:kworker/u4:3 state:D stack:0 pid:63 ppid:2
[ 76.978534] Workqueue: dm-thin do_worker
[ 76.978552] Call Trace:
[ 76.978564] __schedule+0x6ba/0x10f0
[ 76.978582] schedule+0x9d/0x1e0
[ 76.978588] rwsem_down_write_slowpath+0x587/0xdf0
[ 76.978600] down_write+0xec/0x110
[ 76.978607] unregister_shrinker+0x2c/0xf0
[ 76.978616] dm_bufio_client_destroy+0x116/0x3d0
[ 76.978625] dm_block_manager_destroy+0x19/0x40
[ 76.978629] __destroy_persistent_data_objects+0x5e/0x70
[ 76.978636] dm_pool_abort_metadata+0x8e/0x100
[ 76.978643] metadata_operation_failed+0x86/0x110
[ 76.978649] commit+0x6a/0x230
[ 76.978655] do_worker+0xc6e/0xd90
[ 76.978702] process_one_work+0x269/0x630
[ 76.978714] worker_thread+0x266/0x630
[ 76.978730] kthread+0x151/0x1b0
[ 76.978772] INFO: task test.sh:2646 blocked for more than 15 seconds.
[ 76.979756] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910
[ 76.982111] task:test.sh state:D stack:0 pid:2646 ppid:2459
[ 76.982128] Call Trace:
[ 76.982139] __schedule+0x6ba/0x10f0
[ 76.982155] schedule+0x9d/0x1e0
[ 76.982159] rwsem_down_read_slowpath+0x4f4/0x910
[ 76.982173] down_read+0x84/0x170
[ 76.982177] dm_thin_find_block+0x4c/0xd0
[ 76.982183] thin_map+0x201/0x3d0
[ 76.982188] __map_bio+0x5b/0x350
[ 76.982195] dm_submit_bio+0x2b6/0x930
[ 76.982202] __submit_bio+0x123/0x2d0
[ 76.982209] submit_bio_noacct_nocheck+0x101/0x3e0
[ 76.982222] submit_bio_noacct+0x389/0x770
[ 76.982227] submit_bio+0x50/0xc0
[ 76.982232] submit_bh_wbc+0x15e/0x230
[ 76.982238] submit_bh+0x14/0x20
[ 76.982241] ext4_read_bh_nowait+0xc5/0x130
[ 76.982247] ext4_read_block_bitmap_nowait+0x340/0xc60
[ 76.982254] ext4_mb_init_cache+0x1ce/0xdc0
[ 76.982259] ext4_mb_load_buddy_gfp+0x987/0xfa0
[ 76.982263] ext4_discard_preallocations+0x45d/0x830
[ 76.982274] ext4_clear_inode+0x48/0xf0
[ 76.982280] ext4_evict_inode+0xcf/0xc70
[ 76.982285] evict+0x119/0x2b0
[ 76.982290] dispose_list+0x43/0xa0
[ 76.982294] prune_icache_sb+0x64/0x90
[ 76.982298] super_cache_scan+0x155/0x210
[ 76.982303] do_shrink_slab+0x19e/0x4e0
[ 76.982310] shrink_slab+0x2bd/0x450
[ 76.982317] drop_slab+0xcc/0x1a0
[ 76.982323] drop_caches_sysctl_handler+0xb7/0xe0
[ 76.982327] proc_sys_call_handler+0x1bc/0x300
[ 76.982331] proc_sys_write+0x17/0x20
[ 76.982334] vfs_write+0x3d3/0x570
[ 76.982342] ksys_write+0x73/0x160
[ 76.982347] __x64_sys_write+0x1e/0x30
[ 76.982352] do_syscall_64+0x35/0x80
[ 76.982357] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Function metadata_operation_failed() is called when operations failed
on dm pool metadata, dm pool will destroy and recreate metadata. So,
shrinker will be unregistered and registered, which could down write
shrinker_rwsem under pmd_write_lock.
Fix it by allocating dm_block_manager before locking pmd->root_lock
and destroying old dm_block_manager after unlocking pmd->root_lock,
then old dm_block_manager is replaced with new dm_block_manager under
pmd->root_lock. So, shrinker register/unregister could be done without
holding pmd->root_lock.
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216676
Cc: stable@vger.kernel.org #v5.2+
Fixes: e49e582965b3 ("dm thin: add read only and fail io modes")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 352b837a5541690d4f843819028cf2b8be83d424 upstream.
Same ABBA deadlock pattern fixed in commit 4b60f452ec51 ("dm thin: Fix
ABBA deadlock between shrink_slab and dm_pool_abort_metadata") to
DM-cache's metadata.
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: stable@vger.kernel.org
Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream.
There's a crash in mempool_free when running the lvm test
shell/lvchange-rebuild-raid.sh.
The reason for the crash is this:
* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
and bio_put(bio).
* so, the process that waited on sb_wait and that is woken up is racing
with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
a crash in mempool_free.
We fix this bug by moving bio_put before atomic_dec_and_test.
We also move rdev_dec_pending before atomic_dec_and_test as suggested by
Neil Brown.
The function md_end_flush has a similar bug - we must call bio_put before
we decrement the number of in-progress bios.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 11557f0067 P4D 11557f0067 PUD 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: kdelayd flush_expired_bios [dm_delay]
RIP: 0010:mempool_free+0x47/0x80
Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
Call Trace:
<TASK>
clone_endio+0xf4/0x1c0 [dm_mod]
clone_endio+0xf4/0x1c0 [dm_mod]
__submit_bio+0x76/0x120
submit_bio_noacct_nocheck+0xb6/0x2a0
flush_expired_bios+0x28/0x2f [dm_delay]
process_one_work+0x1b4/0x300
worker_thread+0x45/0x3e0
? rescuer_thread+0x380/0x380
kthread+0xc2/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b611ad14006e5be2170d9e8e611bf49dff288911 ]
fail run raid1 array when we assemble array with the inactive disk only,
but the mdx_raid1 thread were not stop, Even if the associated resources
have been released. it will caused a NULL dereference when we do poweroff.
This causes the following Oops:
[ 287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
[ 287.594762] #PF: supervisor read access in kernel mode
[ 287.599912] #PF: error_code(0x0000) - not-present page
[ 287.605061] PGD 0 P4D 0
[ 287.607612] Oops: 0000 [#1] SMP NOPTI
[ 287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0
[ 287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
[ 287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
[ 287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
[ 287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
[ 287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
[ 287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
[ 287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
[ 287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
[ 287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
[ 287.692052] FS: 0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
[ 287.700149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
[ 287.713033] Call Trace:
[ 287.715498] raid1d+0x6c/0xbbb [raid1]
[ 287.719256] ? __schedule+0x1ff/0x760
[ 287.722930] ? schedule+0x3b/0xb0
[ 287.726260] ? schedule_timeout+0x1ed/0x290
[ 287.730456] ? __switch_to+0x11f/0x400
[ 287.734219] md_thread+0xe9/0x140 [md_mod]
[ 287.738328] ? md_thread+0xe9/0x140 [md_mod]
[ 287.742601] ? wait_woken+0x80/0x80
[ 287.746097] ? md_register_thread+0xe0/0xe0 [md_mod]
[ 287.751064] kthread+0x11a/0x140
[ 287.754300] ? kthread_park+0x90/0x90
[ 287.757974] ret_from_fork+0x1f/0x30
In fact, when raid1 array run fail, we need to do
md_unregister_thread() before raid1_free().
Signed-off-by: Jiang Li <jiang.li@ugreen.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8e1a2279ca2b0485cc379a153d02a9793f74a48f ]
It should use disk_stack_limits to get a proper max_discard_sectors
rather than setting a value by stack drivers.
And there is a bug. If all member disks are rotational devices,
raid0/raid10 set max_discard_sectors. So the member devices are
not ssd/nvme, but raid0/raid10 export the wrong value. It reports
warning messages in function __blkdev_issue_discard when mkfs.xfs
like this:
[ 4616.022599] ------------[ cut here ]------------
[ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0
[ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0
[ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7
[ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246
[ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080
[ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080
[ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000
[ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000
[ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080
[ 4616.213259] FS: 00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000
[ 4616.222298] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0
[ 4616.236689] Call Trace:
[ 4616.239428] blkdev_issue_discard+0x52/0xb0
[ 4616.244108] blkdev_common_ioctl+0x43c/0xa00
[ 4616.248883] blkdev_ioctl+0x116/0x280
[ 4616.252977] __x64_sys_ioctl+0x8a/0xc0
[ 4616.257163] do_syscall_64+0x5c/0x90
[ 4616.261164] ? handle_mm_fault+0xc5/0x2a0
[ 4616.265652] ? do_user_addr_fault+0x1d8/0x690
[ 4616.270527] ? do_syscall_64+0x69/0x90
[ 4616.274717] ? exc_page_fault+0x62/0x150
[ 4616.279097] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 4616.284748] RIP: 0033:0x7f9a55398c6b
Signed-off-by: Xiao Ni <xni@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3bd548e5b819b8c0f2c9085de775c5c7bff9052f ]
Check the return value of md_bitmap_get_counter() in case it returns
NULL pointer, which will result in a null pointer dereference.
v2: update the check to include other dereference
Signed-off-by: Li Zhong <floridsleeves@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1a581b72169968f4154b5793828f3bc28b258b35 ]
dm is a bit special in that it opens the underlying devices. Commit
89f871af1b26 ("dm: delay registering the gendisk") tried to accommodate
that by allowing to add the holder to the list before add_gendisk and
then just add them to sysfs once add_disk is called. But that leads to
really odd lifetime problems and error handling problems as we can't
know the state of the kobjects and don't unwind properly. To fix this
switch to just registering all existing table_devices with the holder
code right after add_disk, and remove them before calling del_gendisk.
Fixes: 89f871af1b26 ("dm: delay registering the gendisk")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221115141054.1051801-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d563792c8933a810d28ce0f2831f0726c2b15a31 ]
open_table_device() and close_table_device() is protected by
table_devices_lock, hence use it to protect add_disk() and
del_gendisk().
Prepare to track per-add_disk holder relations in dm.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221115141054.1051801-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b5865831c1003122f737df5e16adaa583f1a595 ]
Take the list unlink and free into close_table_device so that no half
torn down table_devices exist. Also remove the check for a NULL bdev
as that can't happen - open_table_device never adds a table_device to
the list that does not have a valid block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221115141054.1051801-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b9a785d2dc6567b2fd9fc60057a6a945a276927a ]
Move all the logic for allocation the table_device and linking it into
the list into the open_table_device. This keeps the code tidy and
ensures that the table_devices only exist in fully initialized state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221115141054.1051801-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 984bf2cc531e778e49298fdf6730e0396166aa21 ]
There was a problem that a user burned a dm-integrity image on CDROM
and could not activate it because it had a non-empty journal.
Fix this problem by flushing the journal (done by the previous commit)
and clearing the journal (done by this commit). Once the journal is
cleared, dm-integrity won't attempt to replay it on the next
activation.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e5dab5ec763d600fe0a67837dd9155bdc42f961 ]
This commit flushes the journal on suspend. It is prerequisite for the
next commit that enables activating dm integrity devices in read-only mode.
Note that we deliberately didn't flush the journal on suspend, so that the
journal replay code would be tested. However, the dm-integrity code is 5
years old now, so that journal replay is well-tested, and we can make this
change now.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 50a893359cd2643ee1afc96eedc9e7084cab49fa ]
This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221110184501.2451620-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 29aa778bb66795e6a78b1c99beadc83887827868 ]
This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221110184501.2451620-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 4fe1ec995483737f3d2a14c3fe1d8fe634972979 upstream.
__list_versions will first estimate the required space using the
"dm_target_iterate(list_version_get_needed, &needed)" call and then will
fill the space using the "dm_target_iterate(list_version_get_info,
&iter_info)" call. Each of these calls locks the targets using the
"down_read(&_lock)" and "up_read(&_lock)" calls, however between the first
and second "dm_target_iterate" there is no lock held and the target
modules can be loaded at this point, so the second "dm_target_iterate"
call may need more space than what was the first "dm_target_iterate"
returned.
The code tries to handle this overflow (see the beginning of
list_version_get_info), however this handling is incorrect.
The code sets "param->data_size = param->data_start + needed" and
"iter_info.end = (char *)vers+len" - "needed" is the size returned by the
first dm_target_iterate call; "len" is the size of the buffer allocated by
userspace.
"len" may be greater than "needed"; in this case, the code will write up
to "len" bytes into the buffer, however param->data_size is set to
"needed", so it may write data past the param->data_size value. The ioctl
interface copies only up to param->data_size into userspace, thus part of
the result will be truncated.
Fix this bug by setting "iter_info.end = (char *)vers + needed;" - this
guarantees that the second "dm_target_iterate" call will write only up to
the "needed" buffer and it will exit with "DM_BUFFER_FULL_FLAG" if it
overflows the "needed" space - in this case, userspace will allocate a
larger buffer and retry.
Note that there is also a bug in list_version_get_needed - we need to add
"strlen(tt->name) + 1" to the needed size, not "strlen(tt->name)".
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
dm_bufio_client_create failed
commit 0dfc1f4ceae86a0d09d880ab87625c86c61ed33c upstream.
The 'no_sleep_enabled' should be decreased in error handling path
in dm_bufio_client_create() when the DM_BUFIO_CLIENT_NO_SLEEP flag
is set, otherwise static_branch_unlikely() will always return true
even if no dm_bufio_client instances have DM_BUFIO_CLIENT_NO_SLEEP
flag set.
Cc: stable@vger.kernel.org
Fixes: 3c1c875d0586 ("dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 86e4d3e8d1838ca88fb9267e669c36f6c8f7c6cd ]
This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.
Link: https://lore.kernel.org/linux-block/20221101001558.648ee024@xps.demsh.org/
Fixes: b1a000d3b8ec5 ("block: relax direct io memory alignment")
Reportred-by: Eric Biggers <ebiggers@kernel.org>
Reported-by: Dmitrii Tcvetkov <me@demsh.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20221110184501.2451620-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 99f4f5bcb975527508eb7a5e3e34bdb91d576746 ]
Fixes: 74fe6ba923949 ("dm: convert to blk_alloc_disk/blk_cleanup_disk")
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 141b3523e9be6f15577acf4bbc3bc1f82d81d6d1 upstream.
The function test_bit doesn't provide any memory barrier. It may be
possible that the read requests that follow test_bit(B_READING, &b->state)
are reordered before the test, reading invalid data that existed before
B_READING was cleared.
Fix this bug by changing test_bit to test_bit_acquire. This is
particularly important on arches with weak(er) memory ordering
(e.g. arm64).
Depends-On: 8238b4579866 ("wait_on_bit: add an acquire memory barrier")
Depends-On: d6ffe6067a54 ("provide arch_test_bit_acquire for architectures that define test_bit")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5434ee8d28575b2e784bd5b4dbfc912e5da90759 upstream.
Use %pg for printing the block device name, instead of %pd.
Fixes: 385411ffba0c ("dm: stop using bdevname")
Cc: stable@vger.kernel.org # v5.18+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74 ]
A complicated deadlock exists when using the journal and an elevated
group_thrtead_cnt. It was found with loop devices, but its not clear
whether it can be seen with real disks. The deadlock can occur simply
by writing data with an fio script.
When the deadlock occurs, multiple threads will hang in different ways:
1) The group threads will hang in the blk-wbt code with bios waiting to
be submitted to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
ops_run_io+0x46b/0x1a30
handle_stripe+0xcd3/0x36b0
handle_active_stripes.constprop.0+0x6f6/0xa60
raid5_do_work+0x177/0x330
Or:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
flush_deferred_bios+0x136/0x170
raid5_do_work+0x262/0x330
2) The r5l_reclaim thread will hang in the same way, submitting a
bio to the block layer:
io_schedule+0x70/0xb0
rq_qos_wait+0x153/0x210
wbt_wait+0x115/0x1b0
__rq_qos_throttle+0x38/0x60
blk_mq_submit_bio+0x589/0xcd0
__submit_bio+0xe6/0x100
submit_bio_noacct_nocheck+0x42e/0x470
submit_bio_noacct+0x4c2/0xbb0
submit_bio+0x3f/0xf0
md_super_write+0x12f/0x1b0
md_update_sb.part.0+0x7c6/0xff0
md_update_sb+0x30/0x60
r5l_do_reclaim+0x4f9/0x5e0
r5l_reclaim_thread+0x69/0x30b
However, before hanging, the MD_SB_CHANGE_PENDING flag will be
set for sb_flags in r5l_write_super_and_discard_space(). This
flag will never be cleared because the submit_bio() call never
returns.
3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe()
will do no processing on any pending stripes and re-set
STRIPE_HANDLE. This will cause the raid5d thread to enter an
infinite loop, constantly trying to handle the same stripes
stuck in the queue.
The raid5d thread has a blk_plug that holds a number of bios
that are also stuck waiting seeing the thread is in a loop
that never schedules. These bios have been accounted for by
blk-wbt thus preventing the other threads above from
continuing when they try to submit bios. --Deadlock.
To fix this, add the same wait_event() that is used in raid5_do_work()
to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will
schedule and wait until the flag is cleared. The schedule action will
flush the plug which will allow the r5l_reclaim thread to continue,
thus preventing the deadlock.
However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING
from the same thread and can thus deadlock if the thread is put to
sleep. So avoid waiting if md_check_recovery() is being called in the
loop.
It's not clear when the deadlock was introduced, but the similar
wait_event() call in raid5_do_work() was added in 2017 by this
commit:
16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata
is updated.")
Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d2d05b88035d2d51a5bb6c5afec88a0880c73df4 ]
Inside set_at_max_writeback_rate() the calculation in following if()
check is wrong,
if (atomic_inc_return(&c->idle_counter) <
atomic_read(&c->attached_dev_nr) * 6)
Because each attached backing device has its own writeback thread
running and increasing c->idle_counter, the counter increates much
faster than expected. The correct calculation should be,
(counter / dev_nr) < dev_nr * 6
which equals to,
counter < dev_nr * dev_nr * 6
This patch fixes the above mistake with correct calculation, and helper
routine idle_counter_exceeded() is added to make code be more clear.
Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3bfc3bcd787c48aa31e4fde4a6dfcef4cd7ee2c2 ]
A regression is seen where mddev devices stay permanently after they
are stopped due to an elevated reference count.
This was tracked down to an extra mddev_get() in md_seq_start().
It only happened rarely because most of the time the md_seq_start()
is called with a zero offset. The path with an extra mddev_get() only
happens when it starts with a non-zero offset.
The commit noted below changed an mddev_get() to check its success
but inadvertently left the original call in. Remove the extra call.
Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c66a6f41e09ad386fd2cce22b9cded837bbbc704 ]
When running chunk-sized reads on disks with badblocks duplicate bio
free/puts are observed:
=============================================================================
BUG bio-200 (Not tainted): Object already free
-----------------------------------------------------------------------------
Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
__slab_alloc.constprop.0+0x5a/0xb0
kmem_cache_alloc+0x31e/0x330
mempool_alloc_slab+0x17/0x20
mempool_alloc+0x100/0x2b0
bio_alloc_bioset+0x181/0x460
do_mpage_readpage+0x776/0xd00
mpage_readahead+0x166/0x320
blkdev_readahead+0x15/0x20
read_pages+0x13f/0x5f0
page_cache_ra_unbounded+0x18d/0x220
force_page_cache_ra+0x181/0x1c0
page_cache_sync_ra+0x65/0xb0
filemap_get_pages+0x1df/0xaf0
filemap_read+0x1e1/0x700
blkdev_read_iter+0x1e5/0x330
vfs_read+0x42a/0x570
Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
kmem_cache_free+0x46d/0x490
mempool_free_slab+0x17/0x20
mempool_free+0x66/0x190
bio_free+0x78/0x90
bio_put+0x100/0x1a0
raid5_make_request+0x2259/0x2450
md_handle_request+0x402/0x600
md_submit_bio+0xd9/0x120
__submit_bio+0x11f/0x1b0
submit_bio_noacct_nocheck+0x204/0x480
submit_bio_noacct+0x32e/0xc70
submit_bio+0x98/0x1a0
mpage_readahead+0x250/0x320
blkdev_readahead+0x15/0x20
read_pages+0x13f/0x5f0
page_cache_ra_unbounded+0x18d/0x220
Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Workqueue: raid5wq raid5_do_work
Call Trace:
<TASK>
dump_stack_lvl+0x5a/0x78
dump_stack+0x10/0x16
print_trailer+0x158/0x165
object_err+0x35/0x50
free_debug_processing.cold+0xb7/0xbe
__slab_free+0x1ae/0x330
kmem_cache_free+0x46d/0x490
mempool_free_slab+0x17/0x20
mempool_free+0x66/0x190
bio_free+0x78/0x90
bio_put+0x100/0x1a0
mpage_end_io+0x36/0x150
bio_endio+0x2fd/0x360
md_end_io_acct+0x7e/0x90
bio_endio+0x2fd/0x360
handle_failed_stripe+0x960/0xb80
handle_stripe+0x1348/0x3760
handle_active_stripes.constprop.0+0x72a/0xaf0
raid5_do_work+0x177/0x330
process_one_work+0x616/0xb20
worker_thread+0x2bd/0x6f0
kthread+0x179/0x1b0
ret_from_fork+0x22/0x30
</TASK>
The double free is caused by an unnecessary bio_put() in the
if(is_badblock(...)) error path in raid5_read_one_chunk().
The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c
("md/raid5: move checking badblock before clone bio in
raid5_read_one_chunk"). The previous code checked and freed align_bio
which required a bio_put. After the move that is no longer needed as
raid_bio is returned to the control of the common io path which
performs its own endio resulting in a double free on bad device blocks.
Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
Signed-off-by: David Sloan <david.sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2eed85bc75138a9eeb63863d20f8904ac42a577 ]
When doing degrade/recover tests using the journal a kernel BUG
is hit at drivers/md/raid5.c:4381 in handle_parity_checks5():
BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
This was found to occur because handle_stripe_fill() was skipped
for stripes in the journal due to a condition in that function.
Thus blocks were not fetched and R5_UPTODATE was not set when
the code reached handle_parity_checks5().
To fix this, don't skip handle_stripe_fill() unless the stripe is
for read.
Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path")
Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1727fd5015d8f93474148f94e34cda5aa6ad4a43 ]
Current code produces a warning as shown below when total characters
in the constituent block device names plus the slashes exceeds 200.
snprintf() returns the number of characters generated from the given
input, which could cause the expression “200 – len” to wrap around
to a large positive number. Fix this by using scnprintf() instead,
which returns the actual number of characters written into the buffer.
[ 1513.267938] ------------[ cut here ]------------
[ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510
[ 1513.267944] Modules linked in: <snip>
[ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu
[ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510
<-snip->
[ 1513.267982] Call Trace:
[ 1513.267986] snprintf+0x45/0x70
[ 1513.267990] ? disk_name+0x71/0xa0
[ 1513.267993] dump_zones+0x114/0x240 [raid0]
[ 1513.267996] ? _cond_resched+0x19/0x40
[ 1513.267998] raid0_run+0x19e/0x270 [raid0]
[ 1513.268000] md_run+0x5e0/0xc50
[ 1513.268003] ? security_capable+0x3f/0x60
[ 1513.268005] do_md_run+0x19/0x110
[ 1513.268006] md_ioctl+0x195e/0x1f90
[ 1513.268007] blkdev_ioctl+0x91f/0x9f0
[ 1513.268010] block_ioctl+0x3d/0x50
[ 1513.268012] do_vfs_ioctl+0xa9/0x640
[ 1513.268014] ? __fput+0x162/0x260
[ 1513.268016] ksys_ioctl+0x75/0x80
[ 1513.268017] __x64_sys_ioctl+0x1a/0x20
[ 1513.268019] do_syscall_64+0x5e/0x200
[ 1513.268021] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 766038846e875 ("md/raid0: replace printk() with pr_*()")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 916ef6232cc4b84db7082b4c3d3cf1753d9462ba upstream.
Verity targets can be configured to ignore corrupted data blocks.
LoadPin must only trust verity targets that are configured to
perform some kind of enforcement when data corruption is detected,
like returning an error, restarting the system or triggering a
panic.
Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin")
Reported-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220907133055.1.Ic8a1dafe960dc0f8302e189642bc88ebb785d274@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull block fixes from Jens Axboe:
- MD pull request via Song:
- Fix for clustered raid (Guoqing Jiang)
- req_op fix (Bart Van Assche)
- Fix race condition in raid recreate (David Sloan)
- loop configuration overflow fix (Siddh)
- Fix missing commit_rqs call for certain conditions (Yu)
* tag 'block-6.0-2022-08-26' of git://git.kernel.dk/linux-block:
md: call __md_stop_writes in md_stop
Revert "md-raid: destroy the bitmap after destroying the thread"
md: Flush workqueue md_rdev_misc_wq in md_alloc()
md/raid10: Fix the data type of an r10_sync_page_io() argument
loop: Check for overflow while configuring loop
blk-mq: fix io hung due to missing commit_rqs
|
|
From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.
Let's stop write first in destructor to align with normal md-raid to
fix the KASAN issue.
[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
|
|
This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Because it
obviously breaks clustered raid as noticed by Neil though it fixed KASAN
issue for dm-raid, let's revert it and fix KASAN issue in next commit.
[1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470
Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
|
|
A race condition still exists when removing and re-creating md devices
in test cases. However, it is only seen on some setups.
The race condition was tracked down to a reference still being held
to the kobject by the rdev in the md_rdev_misc_wq which will be released
in rdev_delayed_delete().
md_alloc() waits for previous deletions by waiting on the md_misc_wq,
but the md_rdev_misc_wq may still be holding a reference to a recently
removed device.
To fix this, also flush the md_rdev_misc_wq in md_alloc().
Signed-off-by: David Sloan <david.sloan@eideticom.com>
[logang@deltatee.com: rewrote commit message]
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
|
|
Fix the following sparse warning:
drivers/md/raid10.c:2647:60: sparse: sparse: incorrect type in argument 5 (different base types) @@ expected restricted blk_opf_t [usertype] opf @@ got int rw @@
This patch does not change any functionality since REQ_OP_READ = READ = 0
and since REQ_OP_WRITE = WRITE = 1.
Cc: Rong A Chen <rong.a.chen@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Fixes: 4ce4c73f662b ("md/core: Combine two sync_page_io() arguments")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Song Liu <song@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- A few fixes for the DM verity and bufio changes in this merge window
- A smatch warning fix for DM writecache locking in writecache_map
* tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix some cases where the code sleeps with spinlock held
dm writecache: fix smatch warning about invalid return from writecache_map
dm verity: fix verity_parse_opt_args parsing
dm verity: fix DM_VERITY_OPTS_MAX value yet again
dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking
|
|
Commit b32d45824aa7 ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
added a "NO_SLEEP" mode, it replaces a mutex with a spinlock, and it
is only usable when the device is in read-only mode (because the write
path may be sleeping while holding the dm_bufio_client lock).
However, there are still two points where the code could sleep even in
read-only mode. One is in __get_unclaimed_buffer -> __make_buffer_clean.
The other is in __try_evict_buffer -> __make_buffer_clean. These functions
will call __make_buffer_clean which sleeps if the buffer is being read.
Fix these cases so that if c->no_sleep is set __make_buffer_clean
will not be called and the buffer will be skipped instead.
Fixes: b32d45824aa7 ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
|
|
There's a smatch warning "inconsistent returns '&wc->lock'" in
dm-writecache. The reason for the warning is that writecache_map()
doesn't drop the lock on the impossible path.
Fix this warning by adding wc_unlock() after the BUG statement (so
that it will be compiled-away anyway).
Fixes: df699cc16ea5e ("dm writecache: report invalid return from writecache_map helpers")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
|
|
Commit df326e7a0699 ("dm verity: allow optional args to alter primary
args handling") introduced a bug where verity_parse_opt_args() wouldn't
properly shift past an optional argument's additional params (by
ignoring them).
Fix this by avoiding returning with error if an unknown argument is
encountered when @only_modifier_opts=true is passed to
verity_parse_opt_args().
In practice this regressed the cryptsetup testsuite's FEC testing
because unknown optional arguments were encountered, wherey
short-circuiting ever testing FEC mode. With this fix all of the
cryptsetup testsuite's verity FEC tests pass.
Fixes: df326e7a0699 ("dm verity: allow optional args to alter primary args handling")
Reported-by: Milan Broz <gmazyland@gmail.com>>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
|
|
Must account for the possibility that "try_verify_in_tasklet" is used.
This is the same issue that was fixed with commit 160f99db94322 -- it
is far too easy to miss that additional a new argument(s) require
bumping DM_VERITY_OPTS_MAX accordingly.
Fixes: 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
|
|
Historically none of the bufio code runs in interrupt context but with
the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit
5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature")
That said, the new tasklet usecase still doesn't require interrupts be
disabled by bufio (let alone conditionally restore them).
Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care
must be taken to properly synchronize between softirq and process
context, otherwise ABBA deadlock may occur. While it is unnecessary to
disable bottom-half preemption within a tasklet, we must consistently do
so in process context to ensure locking is in the proper order.
Fix these issues by switching from spin_lock_irq{save,restore} to using
spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member
in dm_bufio_client struct (that can be used unsafely if bufio must
recurse on behalf of some caller, e.g. block layer's submit_bio).
Fixes: 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
- Add flags argument to dm_bufio_client_create and introduce
DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather
than mutex for its locking.
- Add optional "try_verify_in_tasklet" feature to DM verity target.
This feature gives users the option to improve IO latency by using a
tasklet to verify, using hashes in bufio's cache, rather than wait to
schedule a work item via workqueue. But if there is a bufio cache
miss, or an error, then the tasklet will fallback to using workqueue.
- Incremental changes to both dm-bufio and the DM verity target to use
jump_label to minimize cost of branching associated with the niche
"try_verify_in_tasklet" feature. DM-bufio in particular is used by
quite a few other DM targets so it doesn't make sense to incur
additional bufio cost in those targets purely for the benefit of this
niche verity feature if the feature isn't ever used.
- Optimize verity_verify_io, which is used by both workqueue and
tasklet based verification, if FEC is not configured or tasklet based
verification isn't used.
- Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE
flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if
"try_verify_in_tasklet" is specified.
* tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet"
dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND
dm verity: only copy bvec_iter in verity_verify_io if in_tasklet
dm verity: optimize verity_verify_io if FEC not configured
dm verity: conditionally enable branching for "try_verify_in_tasklet"
dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP
dm verity: allow optional args to alter primary args handling
dm verity: Add optional "try_verify_in_tasklet" feature
dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag
dm bufio: Add flags argument to dm_bufio_client_create
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
|
Pull block driver updates from Jens Axboe:
- NVMe pull requests via Christoph:
- add support for In-Band authentication (Hannes Reinecke)
- handle the persistent internal error AER (Michael Kelley)
- use in-capsule data for TCP I/O queue connect (Caleb Sander)
- remove timeout for getting RDMA-CM established event (Israel
Rukshin)
- misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni,
Guixin Liu, Xiang wangx)
- use command_id instead of req->tag in trace_nvme_complete_rq()
(Bean Huo)
- various fixes for the new authentication code (Lukas Bulwahn,
Dan Carpenter, Colin Ian King, Chaitanya Kulkarni, Hannes
Reinecke)
- small cleanups (Liu Song, Christoph Hellwig)
- restore compat_ioctl support (Nick Bowler)
- make a nvmet-tcp workqueue lockdep-safe (Sagi Grimberg)
- enable generic interface (/dev/ngXnY) for unknown command sets
(Joel Granados, Christoph Hellwig)
- don't always build constants.o (Christoph Hellwig)
- print the command name of aborted commands (Christoph Hellwig)
- MD pull requests via Song:
- Improve raid5 lock contention, by Logan Gunthorpe.
- Misc fixes to raid5, by Logan Gunthorpe.
- Fix race condition with md_reap_sync_thread(), by Guoqing Jiang.
- Fix potential deadlock with raid5_quiesce and
raid5_get_active_stripe, by Logan Gunthorpe.
- Refactoring md_alloc(), by Christoph"
- Fix md disk_name lifetime problems, by Christoph Hellwig
- Convert prepare_to_wait() to wait_woken() api, by Logan
Gunthorpe;
- Fix sectors_to_do bitmap issue, by Logan Gunthorpe.
- Work on unifying the null_blk module parameters and configfs API
(Vincent)
- drbd bitmap IO error fix (Lars)
- Set of rnbd fixes (Guoqing, Md Haris)
- Remove experimental marker on bcache async device registration (Coly)
- Series from cleaning up the bio splitting (Christoph)
- Removal of the sx8 block driver. This hardware never really
widespread, and it didn't receive a lot of attention after the
initial merge of it back in 2005 (Christoph)
- A few fixes for s390 dasd (Eric, Jiang)
- Followup set of fixes for ublk (Ming)
- Support for UBLK_IO_NEED_GET_DATA for ublk (ZiyangZhang)
- Fixes for the dio dma alignment (Keith)
- Misc fixes and cleanups (Ming, Yu, Dan, Christophe
* tag 'for-5.20/block-2022-08-04' of git://git.kernel.dk/linux-block: (136 commits)
s390/dasd: Establish DMA alignment
s390/dasd: drop unexpected word 'for' in comments
ublk_drv: add support for UBLK_IO_NEED_GET_DATA
ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA
ublk_drv: cleanup ublksrv_ctrl_dev_info
ublk_drv: add SET_PARAMS/GET_PARAMS control command
ublk_drv: fix ublk device leak in case that add_disk fails
ublk_drv: cancel device even though disk isn't up
block: fix leaking page ref on truncated direct io
block: ensure bio_iov_add_page can't fail
block: ensure iov_iter advances for added pages
drivers:md:fix a potential use-after-free bug
md/raid5: Ensure batch_last is released before sleeping for quiesce
md/raid5: Move stripe_request_ctx up
md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()
md/raid5: Make is_inactive_blocked() helper
md/raid5: Refactor raid5_get_active_stripe()
block: pass struct queue_limits to the bio splitting helpers
block: move bio_allowed_max_sectors to blk-merge.c
block: move the call to get_max_io_size out of blk_bio_segment_split
...
|
|
Allow verify_wq to preempt softirq since verification in tasklet will
fall-back to using it for error handling (or if the bufio cache
doesn't have required hashes).
Suggested-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
|