summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2023-01-18block: handle bio_split_to_limits() NULL returnJens Axboe2-0/+4
commit 613b14884b8595e20b9fac4126bf627313827fbe upstream. This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07md/bitmap: Fix bitmap chunk size overflow issuesFlorian-Ewald Mueller1-8/+12
commit 4555211190798b6b6fa2c37667d175bf67945c78 upstream. - limit bitmap chunk size internal u64 variable to values not overflowing the u32 bitmap superblock structure variable stored on persistent media - assign bitmap chunk size internal u64 variable from unsigned values to avoid possible sign extension artifacts when assigning from a s32 value The bug has been there since at least kernel 4.0. Steps to reproduce it: 1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2 2 resize member device rnbd1 and rnbd2 to 8 TB 3 mdadm --grow /dev/mdx --size=max The bitmap_chunksize will overflow without patch. Cc: stable@vger.kernel.org Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm cache: set needs_check flag after aborting metadataMike Snitzer1-5/+5
commit 6b9973861cb2e96dcd0bb0f1baddc5c034207c5c upstream. Otherwise the commit that will be aborted will be associated with the metadata objects that will be torn down. Must write needs_check flag to metadata with a reset block manager. Found through code-inspection (and compared against dm-thin.c). Cc: stable@vger.kernel.org Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm cache: Fix UAF in destroy()Luo Meng1-0/+1
commit 6a459d8edbdbe7b24db42a5a9f21e6aa9e00c2aa upstream. Dm_cache also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in destroy(). Cc: stable@vger.kernel.org Fixes: c6b4fcbad044e ("dm: add cache target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm clone: Fix UAF in clone_dtr()Luo Meng1-0/+1
commit e4b5957c6f749a501c464f92792f1c8e26b61a94 upstream. Dm_clone also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in clone_dtr(). Cc: stable@vger.kernel.org Fixes: 7431b7835f554 ("dm: add clone target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm integrity: Fix UAF in dm_integrity_dtr()Luo Meng1-0/+2
commit f50cb2cbabd6c4a60add93d72451728f86e4791c upstream. Dm_integrity also has the same UAF problem when dm_resume() and dm_destroy() are concurrent. Therefore, cancelling timer again in dm_integrity_dtr(). Cc: stable@vger.kernel.org Fixes: 7eada909bfd7a ("dm: add integrity target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm thin: Fix UAF in run_timer_softirq()Luo Meng1-0/+2
commit 88430ebcbc0ec637b710b947738839848c20feff upstream. When dm_resume() and dm_destroy() are concurrent, it will lead to UAF, as follows: BUG: KASAN: use-after-free in __run_timers+0x173/0x710 Write of size 8 at addr ffff88816d9490f0 by task swapper/0/0 <snip> Call Trace: <IRQ> dump_stack_lvl+0x73/0x9f print_report.cold+0x132/0xaa2 _raw_spin_lock_irqsave+0xcd/0x160 __run_timers+0x173/0x710 kasan_report+0xad/0x110 __run_timers+0x173/0x710 __asan_store8+0x9c/0x140 __run_timers+0x173/0x710 call_timer_fn+0x310/0x310 pvclock_clocksource_read+0xfa/0x250 kvm_clock_read+0x2c/0x70 kvm_clock_get_cycles+0xd/0x20 ktime_get+0x5c/0x110 lapic_next_event+0x38/0x50 clockevents_program_event+0xf1/0x1e0 run_timer_softirq+0x49/0x90 __do_softirq+0x16e/0x62c __irq_exit_rcu+0x1fa/0x270 irq_exit_rcu+0x12/0x20 sysvec_apic_timer_interrupt+0x8e/0xc0 One of the concurrency UAF can be shown as below: use free do_resume | __find_device_hash_cell | dm_get | atomic_inc(&md->holders) | | dm_destroy | __dm_destroy | if (!dm_suspended_md(md)) | atomic_read(&md->holders) | msleep(1) dm_resume | __dm_resume | dm_table_resume_targets | pool_resume | do_waker #add delay work | dm_put | atomic_dec(&md->holders) | | dm_table_destroy | pool_dtr | __pool_dec | __pool_destroy | destroy_workqueue | kfree(pool) # free pool time out __do_softirq run_timer_softirq # pool has already been freed This can be easily reproduced using: 1. create thin-pool 2. dmsetup suspend pool 3. dmsetup resume pool 4. dmsetup remove_all # Concurrent with 3 The root cause of this UAF bug is that dm_resume() adds timer after dm_destroy() skips cancelling the timer because of suspend status. After timeout, it will call run_timer_softirq(), however pool has already been freed. The concurrency UAF bug will happen. Therefore, cancelling timer again in __pool_destroy(). Cc: stable@vger.kernel.org Fixes: 991d9fa02da0d ("dm: add thin provisioning target") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm thin: resume even if in FAIL modeLuo Meng1-4/+12
commit 19eb1650afeb1aa86151f61900e9e5f1de5d8d02 upstream. If a thinpool set fail_io while suspending, resume will fail with: device-mapper: resume ioctl on vg-thinpool failed: Invalid argument The thin-pool also can't be removed if an in-flight bio is in the deferred list. This can be easily reproduced using: echo "offline" > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1 dmsetup suspend /dev/mapper/pool mkfs.ext4 /dev/mapper/thin dmsetup resume /dev/mapper/pool The root cause is maybe_resize_data_dev() will check fail_io and return error before called dm_resume. Fix this by adding FAIL mode check at the end of pool_preresume(). Cc: stable@vger.kernel.org Fixes: da105ed5fd7e ("dm thin metadata: introduce dm_pool_abort_metadata") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm thin: Use last transaction's pmd->root when commit failedZhihao Cheng1-0/+9
commit 7991dbff6849f67e823b7cc0c15e5a90b0549b9f upstream. Recently we found a softlock up problem in dm thin pool btree lookup code due to corrupted metadata: Kernel panic - not syncing: softlockup: hung tasks CPU: 7 PID: 2669225 Comm: kworker/u16:3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: <IRQ> dump_stack+0x9c/0xd3 panic+0x35d/0x6b9 watchdog_timer_fn.cold+0x16/0x25 __run_hrtimer+0xa2/0x2d0 </IRQ> RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio] __bufio_new+0x11f/0x4f0 [dm_bufio] new_read+0xa3/0x1e0 [dm_bufio] dm_bm_read_lock+0x33/0xd0 [dm_persistent_data] ro_step+0x63/0x100 [dm_persistent_data] btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data] dm_btree_lookup+0x16f/0x210 [dm_persistent_data] dm_thin_find_block+0x12c/0x210 [dm_thin_pool] __process_bio_read_only+0xc5/0x400 [dm_thin_pool] process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool] process_one_work+0x3c5/0x730 Following process may generate a broken btree mixed with fresh and stale btree nodes, which could get dm thin trapped in an infinite loop while looking up data block: Transaction 1: pmd->root = A, A->B->C // One path in btree pmd->root = X, X->Y->Z // Copy-up Transaction 2: X,Z is updated on disk, Y write failed. // Commit failed, dm thin becomes read-only. process_bio_read_only dm_thin_find_block __find_block dm_btree_lookup(pmd->root) The pmd->root points to a broken btree, Y may contain stale node pointing to any block, for example X, which gets dm thin trapped into a dead loop while looking up Z. Fix this by setting pmd->root in __open_metadata(), so that dm thin will use the last transaction's pmd->root if commit failed. Fetch a reproducer in [Link]. Linke: https://bugzilla.kernel.org/show_bug.cgi?id=216790 Cc: stable@vger.kernel.org Fixes: 991d9fa02da0 ("dm: add thin provisioning target") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadataZhihao Cheng1-8/+43
commit 8111964f1b8524c4bb56b02cd9c7a37725ea21fd upstream. Following concurrent processes: P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_read_bh_nowait submit_bh dm_submit_bio do_worker process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata down_write(&pmd->root_lock) - LOCK B __destroy_persistent_data_objects dm_block_manager_destroy dm_bufio_client_destroy unregister_shrinker down_write(&shrinker_rwsem) thin_map | dm_thin_find_block ↓ down_read(&pmd->root_lock) --> ABBA deadlock , which triggers hung task: [ 76.974820] INFO: task kworker/u4:3:63 blocked for more than 15 seconds. [ 76.976019] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.978521] task:kworker/u4:3 state:D stack:0 pid:63 ppid:2 [ 76.978534] Workqueue: dm-thin do_worker [ 76.978552] Call Trace: [ 76.978564] __schedule+0x6ba/0x10f0 [ 76.978582] schedule+0x9d/0x1e0 [ 76.978588] rwsem_down_write_slowpath+0x587/0xdf0 [ 76.978600] down_write+0xec/0x110 [ 76.978607] unregister_shrinker+0x2c/0xf0 [ 76.978616] dm_bufio_client_destroy+0x116/0x3d0 [ 76.978625] dm_block_manager_destroy+0x19/0x40 [ 76.978629] __destroy_persistent_data_objects+0x5e/0x70 [ 76.978636] dm_pool_abort_metadata+0x8e/0x100 [ 76.978643] metadata_operation_failed+0x86/0x110 [ 76.978649] commit+0x6a/0x230 [ 76.978655] do_worker+0xc6e/0xd90 [ 76.978702] process_one_work+0x269/0x630 [ 76.978714] worker_thread+0x266/0x630 [ 76.978730] kthread+0x151/0x1b0 [ 76.978772] INFO: task test.sh:2646 blocked for more than 15 seconds. [ 76.979756] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.982111] task:test.sh state:D stack:0 pid:2646 ppid:2459 [ 76.982128] Call Trace: [ 76.982139] __schedule+0x6ba/0x10f0 [ 76.982155] schedule+0x9d/0x1e0 [ 76.982159] rwsem_down_read_slowpath+0x4f4/0x910 [ 76.982173] down_read+0x84/0x170 [ 76.982177] dm_thin_find_block+0x4c/0xd0 [ 76.982183] thin_map+0x201/0x3d0 [ 76.982188] __map_bio+0x5b/0x350 [ 76.982195] dm_submit_bio+0x2b6/0x930 [ 76.982202] __submit_bio+0x123/0x2d0 [ 76.982209] submit_bio_noacct_nocheck+0x101/0x3e0 [ 76.982222] submit_bio_noacct+0x389/0x770 [ 76.982227] submit_bio+0x50/0xc0 [ 76.982232] submit_bh_wbc+0x15e/0x230 [ 76.982238] submit_bh+0x14/0x20 [ 76.982241] ext4_read_bh_nowait+0xc5/0x130 [ 76.982247] ext4_read_block_bitmap_nowait+0x340/0xc60 [ 76.982254] ext4_mb_init_cache+0x1ce/0xdc0 [ 76.982259] ext4_mb_load_buddy_gfp+0x987/0xfa0 [ 76.982263] ext4_discard_preallocations+0x45d/0x830 [ 76.982274] ext4_clear_inode+0x48/0xf0 [ 76.982280] ext4_evict_inode+0xcf/0xc70 [ 76.982285] evict+0x119/0x2b0 [ 76.982290] dispose_list+0x43/0xa0 [ 76.982294] prune_icache_sb+0x64/0x90 [ 76.982298] super_cache_scan+0x155/0x210 [ 76.982303] do_shrink_slab+0x19e/0x4e0 [ 76.982310] shrink_slab+0x2bd/0x450 [ 76.982317] drop_slab+0xcc/0x1a0 [ 76.982323] drop_caches_sysctl_handler+0xb7/0xe0 [ 76.982327] proc_sys_call_handler+0x1bc/0x300 [ 76.982331] proc_sys_write+0x17/0x20 [ 76.982334] vfs_write+0x3d3/0x570 [ 76.982342] ksys_write+0x73/0x160 [ 76.982347] __x64_sys_write+0x1e/0x30 [ 76.982352] do_syscall_64+0x35/0x80 [ 76.982357] entry_SYSCALL_64_after_hwframe+0x63/0xcd Function metadata_operation_failed() is called when operations failed on dm pool metadata, dm pool will destroy and recreate metadata. So, shrinker will be unregistered and registered, which could down write shrinker_rwsem under pmd_write_lock. Fix it by allocating dm_block_manager before locking pmd->root_lock and destroying old dm_block_manager after unlocking pmd->root_lock, then old dm_block_manager is replaced with new dm_block_manager under pmd->root_lock. So, shrinker register/unregister could be done without holding pmd->root_lock. Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216676 Cc: stable@vger.kernel.org #v5.2+ Fixes: e49e582965b3 ("dm thin: add read only and fail io modes") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abortMike Snitzer1-7/+47
commit 352b837a5541690d4f843819028cf2b8be83d424 upstream. Same ABBA deadlock pattern fixed in commit 4b60f452ec51 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") to DM-cache's metadata. Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: stable@vger.kernel.org Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04md: fix a crash in mempool_freeMikulas Patocka1-3/+6
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream. There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh. The reason for the crash is this: * super_written calls atomic_dec_and_test(&mddev->pending_writes) and wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev) and bio_put(bio). * so, the process that waited on sb_wait and that is woken up is racing with bio_put(bio). * if the process wins the race, it calls bioset_exit before bio_put(bio) is executed. * bio_put(bio) attempts to free a bio into a destroyed bio set - causing a crash in mempool_free. We fix this bug by moving bio_put before atomic_dec_and_test. We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown. The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 11557f0067 P4D 11557f0067 PUD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: kdelayd flush_expired_bios [dm_delay] RIP: 0010:mempool_free+0x47/0x80 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05 FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0 Call Trace: <TASK> clone_endio+0xf4/0x1c0 [dm_mod] clone_endio+0xf4/0x1c0 [dm_mod] __submit_bio+0x76/0x120 submit_bio_noacct_nocheck+0xb6/0x2a0 flush_expired_bios+0x28/0x2f [dm_delay] process_one_work+0x1b4/0x300 worker_thread+0x45/0x3e0 ? rescuer_thread+0x380/0x380 kthread+0xc2/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd] CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-31md/raid1: stop mdx_raid1 thread when raid1 array run failedJiang Li1-0/+1
[ Upstream commit b611ad14006e5be2170d9e8e611bf49dff288911 ] fail run raid1 array when we assemble array with the inactive disk only, but the mdx_raid1 thread were not stop, Even if the associated resources have been released. it will caused a NULL dereference when we do poweroff. This causes the following Oops: [ 287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070 [ 287.594762] #PF: supervisor read access in kernel mode [ 287.599912] #PF: error_code(0x0000) - not-present page [ 287.605061] PGD 0 P4D 0 [ 287.607612] Oops: 0000 [#1] SMP NOPTI [ 287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0 [ 287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022 [ 287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod] [ 287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ...... [ 287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202 [ 287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000 [ 287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800 [ 287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff [ 287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800 [ 287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500 [ 287.692052] FS: 0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 287.700149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0 [ 287.713033] Call Trace: [ 287.715498] raid1d+0x6c/0xbbb [raid1] [ 287.719256] ? __schedule+0x1ff/0x760 [ 287.722930] ? schedule+0x3b/0xb0 [ 287.726260] ? schedule_timeout+0x1ed/0x290 [ 287.730456] ? __switch_to+0x11f/0x400 [ 287.734219] md_thread+0xe9/0x140 [md_mod] [ 287.738328] ? md_thread+0xe9/0x140 [md_mod] [ 287.742601] ? wait_woken+0x80/0x80 [ 287.746097] ? md_register_thread+0xe0/0xe0 [md_mod] [ 287.751064] kthread+0x11a/0x140 [ 287.754300] ? kthread_park+0x90/0x90 [ 287.757974] ret_from_fork+0x1f/0x30 In fact, when raid1 array run fail, we need to do md_unregister_thread() before raid1_free(). Signed-off-by: Jiang Li <jiang.li@ugreen.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31md/raid0, raid10: Don't set discard sectors for request queueXiao Ni2-3/+0
[ Upstream commit 8e1a2279ca2b0485cc379a153d02a9793f74a48f ] It should use disk_stack_limits to get a proper max_discard_sectors rather than setting a value by stack drivers. And there is a bug. If all member disks are rotational devices, raid0/raid10 set max_discard_sectors. So the member devices are not ssd/nvme, but raid0/raid10 export the wrong value. It reports warning messages in function __blkdev_issue_discard when mkfs.xfs like this: [ 4616.022599] ------------[ cut here ]------------ [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0 [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0 [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7 [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246 [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080 [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080 [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000 [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000 [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080 [ 4616.213259] FS: 00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000 [ 4616.222298] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0 [ 4616.236689] Call Trace: [ 4616.239428] blkdev_issue_discard+0x52/0xb0 [ 4616.244108] blkdev_common_ioctl+0x43c/0xa00 [ 4616.248883] blkdev_ioctl+0x116/0x280 [ 4616.252977] __x64_sys_ioctl+0x8a/0xc0 [ 4616.257163] do_syscall_64+0x5c/0x90 [ 4616.261164] ? handle_mm_fault+0xc5/0x2a0 [ 4616.265652] ? do_user_addr_fault+0x1d8/0x690 [ 4616.270527] ? do_syscall_64+0x69/0x90 [ 4616.274717] ? exc_page_fault+0x62/0x150 [ 4616.279097] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 4616.284748] RIP: 0033:0x7f9a55398c6b Signed-off-by: Xiao Ni <xni@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()Li Zhong1-12/+15
[ Upstream commit 3bd548e5b819b8c0f2c9085de775c5c7bff9052f ] Check the return value of md_bitmap_get_counter() in case it returns NULL pointer, which will result in a null pointer dereference. v2: update the check to include other dereference Signed-off-by: Li Zhong <floridsleeves@gmail.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31dm: track per-add_disk holder relations in DMChristoph Hellwig1-10/+39
[ Upstream commit 1a581b72169968f4154b5793828f3bc28b258b35 ] dm is a bit special in that it opens the underlying devices. Commit 89f871af1b26 ("dm: delay registering the gendisk") tried to accommodate that by allowing to add the holder to the list before add_gendisk and then just add them to sysfs once add_disk is called. But that leads to really odd lifetime problems and error handling problems as we can't know the state of the kobjects and don't unwind properly. To fix this switch to just registering all existing table_devices with the holder code right after add_disk, and remove them before calling del_gendisk. Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31dm: make sure create and remove dm device won't race with open and close tableYu Kuai1-0/+16
[ Upstream commit d563792c8933a810d28ce0f2831f0726c2b15a31 ] open_table_device() and close_table_device() is protected by table_devices_lock, hence use it to protect add_disk() and del_gendisk(). Prepare to track per-add_disk holder relations in dm. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31dm: cleanup close_table_deviceChristoph Hellwig1-9/+3
[ Upstream commit 7b5865831c1003122f737df5e16adaa583f1a595 ] Take the list unlink and free into close_table_device so that no half torn down table_devices exist. Also remove the check for a NULL bdev as that can't happen - open_table_device never adds a table_device to the list that does not have a valid block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31dm: cleanup open_table_deviceChristoph Hellwig1-29/+27
[ Upstream commit b9a785d2dc6567b2fd9fc60057a6a945a276927a ] Move all the logic for allocation the table_device and linking it into the list into the open_table_device. This keeps the code tidy and ensures that the table_devices only exist in fully initialized state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 1a581b721699 ("dm: track per-add_disk holder relations in DM") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-19Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linuxLinus Torvalds3-0/+3
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira) - Memory leak fix in nvmet (Sagi Grimberg) - Regression fix for block cgroups pinning the wrong blkcg, causing leaks of cgroups and blkcgs (Chris) - UAF fix for drbd setup error handling (Dan) - Fix DMA alignment propagation in DM (Keith) * tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux: dm-log-writes: set dma_alignment limit in io_hints dm-integrity: set dma_alignment limit in io_hints block: make blk_set_default_limits() private dm-crypt: provide dma_alignment limit in io_hints block: make dma_alignment a stacking queue_limit nvmet: fix a memory leak in nvmet_auth_set_key nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000 drbd: use after free in drbd_create_device() nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro blk-cgroup: properly pin the parent in blkcg_css_online
2022-11-18dm integrity: clear the journal on suspendMikulas Patocka1-0/+13
There was a problem that a user burned a dm-integrity image on CDROM and could not activate it because it had a non-empty journal. Fix this problem by flushing the journal (done by the previous commit) and clearing the journal (done by this commit). Once the journal is cleared, dm-integrity won't attempt to replay it on the next activation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-11-18dm integrity: flush the journal on suspendMikulas Patocka1-6/+1
This commit flushes the journal on suspend. It is prerequisite for the next commit that enables activating dm integrity devices in read-only mode. Note that we deliberately didn't flush the journal on suspend, so that the journal replay code would be tested. However, the dm-integrity code is 5 years old now, so that journal replay is well-tested, and we can make this change now. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-11-18dm bufio: Fix missing decrement of no_sleep_enabled if ↵Zhihao Cheng1-0/+2
dm_bufio_client_create failed The 'no_sleep_enabled' should be decreased in error handling path in dm_bufio_client_create() when the DM_BUFIO_CLIENT_NO_SLEEP flag is set, otherwise static_branch_unlikely() will always return true even if no dm_bufio_client instances have DM_BUFIO_CLIENT_NO_SLEEP flag set. Cc: stable@vger.kernel.org Fixes: 3c1c875d0586 ("dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-11-18dm ioctl: fix misbehavior if list_versions races with module loadingMikulas Patocka1-2/+2
__list_versions will first estimate the required space using the "dm_target_iterate(list_version_get_needed, &needed)" call and then will fill the space using the "dm_target_iterate(list_version_get_info, &iter_info)" call. Each of these calls locks the targets using the "down_read(&_lock)" and "up_read(&_lock)" calls, however between the first and second "dm_target_iterate" there is no lock held and the target modules can be loaded at this point, so the second "dm_target_iterate" call may need more space than what was the first "dm_target_iterate" returned. The code tries to handle this overflow (see the beginning of list_version_get_info), however this handling is incorrect. The code sets "param->data_size = param->data_start + needed" and "iter_info.end = (char *)vers+len" - "needed" is the size returned by the first dm_target_iterate call; "len" is the size of the buffer allocated by userspace. "len" may be greater than "needed"; in this case, the code will write up to "len" bytes into the buffer, however param->data_size is set to "needed", so it may write data past the param->data_size value. The ioctl interface copies only up to param->data_size into userspace, thus part of the result will be truncated. Fix this bug by setting "iter_info.end = (char *)vers + needed;" - this guarantees that the second "dm_target_iterate" call will write only up to the "needed" buffer and it will exit with "DM_BUFFER_FULL_FLAG" if it overflows the "needed" space - in this case, userspace will allocate a larger buffer and retry. Note that there is also a bug in list_version_get_needed - we need to add "strlen(tt->name) + 1" to the needed size, not "strlen(tt->name)". Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-11-17dm-log-writes: set dma_alignment limit in io_hintsKeith Busch1-0/+1
This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-6-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17dm-integrity: set dma_alignment limit in io_hintsKeith Busch1-0/+1
This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17dm-crypt: provide dma_alignment limit in io_hintsKeith Busch1-0/+1
This device mapper needs bio vectors to be sized and memory aligned to the logical block size. Set the minimum required queue limit accordingly. Link: https://lore.kernel.org/linux-block/20221101001558.648ee024@xps.demsh.org/ Fixes: b1a000d3b8ec5 ("block: relax direct io memory alignment") Reportred-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Dmitrii Tcvetkov <me@demsh.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221110184501.2451620-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-19dm clone: Fix typo in block_device format specifierNikos Tsironis1-1/+1
Use %pg for printing the block device name, instead of %pd. Fixes: 385411ffba0c ("dm: stop using bdevname") Cc: stable@vger.kernel.org # v5.18+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm: remove unnecessary assignment statement in alloc_dev()Genjian Zhang1-1/+0
Fixes: 74fe6ba923949 ("dm: convert to blk_alloc_disk/blk_cleanup_disk") Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm cache: delete the redundant word 'each' in commentShaomin Deng1-1/+1
Signed-off-by: Shaomin Deng <dengshaomin@cdjrlc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm raid: fix typo in analyse_superblocks code commentJiangshan Yi1-1/+1
Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm verity: enable WQ_HIGHPRI on verify_wqNathan Huckleberry1-8/+10
WQ_HIGHPRI increases throughput and decreases disk latency when using dm-verity. This is important in Android for camera startup speed. The following tests were run by doing 60 seconds of random reads using a dm-verity device backed by two ramdisks. Without WQ_HIGHPRI lat (usec): min=13, max=3947, avg=69.53, stdev=50.55 READ: bw=51.1MiB/s (53.6MB/s), 51.1MiB/s-51.1MiB/s (53.6MB/s-53.6MB/s) With WQ_HIGHPRI: lat (usec): min=13, max=7854, avg=31.15, stdev=30.42 READ: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s) Further testing was done by measuring how long it takes to open a camera on an Android device. Without WQ_HIGHPRI Total verity work queue wait times (ms): 880.960, 789.517, 898.852 With WQ_HIGHPRI: Total verity work queue wait times (ms): 528.824, 439.191, 433.300 The average time to open the camera is reduced by 350ms (or 40-50%). Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm raid: delete the redundant word 'that' in commentJilin Yuan1-1/+1
Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-19dm: change from DMWARN to DMERR or DMCRIT for fatal errorsMikulas Patocka5-85/+85
Change DMWARN to DMERR in cases when there is an unrecoverable error. Change DMWARN to DMCRIT when handling of a case is unimplemented. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-18dm bufio: use the acquire memory barrier when testing for B_READINGMikulas Patocka1-6/+7
The function test_bit doesn't provide any memory barrier. It may be possible that the read requests that follow test_bit(B_READING, &b->state) are reordered before the test, reading invalid data that existed before B_READING was cleared. Fix this bug by changing test_bit to test_bit_acquire. This is particularly important on arches with weak(er) memory ordering (e.g. arm64). Depends-On: 8238b4579866 ("wait_on_bit: add an acquire memory barrier") Depends-On: d6ffe6067a54 ("provide arch_test_bit_acquire for architectures that define test_bit") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-12treewide: use get_random_u32() when possibleJason A. Donenfeld1-1/+1
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-12treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld1-1/+1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-07Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linuxLinus Torvalds1-1/+3
Pull passthrough updates from Jens Axboe: "With these changes, passthrough NVMe support over io_uring now performs at the same level as block device O_DIRECT, and in many cases 6-8% better. This contains: - Add support for fixed buffers for passthrough (Anuj, Kanchan) - Enable batched allocations and freeing on passthrough, similarly to what we support on the normal storage path (me) - Fix from Geert fixing an issue with !CONFIG_IO_URING" * tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux: io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy nvme: wire up fixed buffer support for nvme passthrough nvme: pass ubuffer as an integer block: extend functionality to map bvec iterator block: factor out blk_rq_map_bio_alloc helper block: rename bio_map_put to blk_mq_map_bio_put nvme: refactor nvme_alloc_request nvme: refactor nvme_add_user_metadata nvme: Use blk_rq_map_user_io helper scsi: Use blk_rq_map_user_io helper block: add blk_rq_map_user_io io_uring: introduce fixed buffer support for io_uring_cmd io_uring: add io_uring_cmd_import_fixed nvme: enable batched completions of passthrough IO nvme: split out metadata vs non metadata end_io uring_cmd completions block: allow end_io based requests in the completion batch handling block: change request end_io handler to pass back a return value block: enable batched allocation for blk_mq_alloc_request() block: kill deprecated BUG_ON() in the flush handling
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds12-174/+263
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-10-04Merge tag 'dlm-6.1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Fix a couple races found with a new torture test - Improve errors when api functions are used incorrectly - Improve tracing for lock requests from user space - Fix use after free in recently added tracing cod. - Small internal code cleanups * tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: fs: dlm: fix possible use after free if tracing fs: dlm: const void resource name parameter fs: dlm: LSFL_CB_DELAY only for kernel lockspaces fs: dlm: remove DLM_LSFL_FS from uapi fs: dlm: trace user space callbacks fs: dlm: change ls_clear_proc_locks to spinlock fs: dlm: remove dlm_del_ast prototype fs: dlm: handle rcom in else if branch fs: dlm: allow lockspaces have zero lvblen fs: dlm: fix invalid derefence of sb_lvbptr fs: dlm: handle -EINVAL as log_error() fs: dlm: use __func__ for function name fs: dlm: handle -EBUSY first in unlock validation fs: dlm: handle -EBUSY first in lock arg validation fs: dlm: fix race between test_bit() and queue_work() fs: dlm: fix race in lowcomms
2022-10-04Merge tag 'hardening-v6.1-rc1' of ↵Linus Torvalds3-0/+25
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: "Most of the collected changes here are fixes across the tree for various hardening features (details noted below). The most notable new feature here is the addition of the memcpy() overflow warning (under CONFIG_FORTIFY_SOURCE), which is the next step on the path to killing the common class of "trivially detectable" buffer overflow conditions (i.e. on arrays with sizes known at compile time) that have resulted in many exploitable vulnerabilities over the years (e.g. BleedingTooth). This feature is expected to still have some undiscovered false positives. It's been in -next for a full development cycle and all the reported false positives have been fixed in their respective trees. All the known-bad code patterns we could find with Coccinelle are also either fixed in their respective trees or in flight. The commit message in commit 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") for the feature has extensive details, but I'll repeat here that this is a warning _only_, and is not intended to actually block overflows (yet). The many patches fixing array sizes and struct members have been landing for several years now, and we're finally able to turn this on to find any remaining stragglers. Summary: Various fixes across several hardening areas: - loadpin: Fix verity target enforcement (Matthias Kaehlcke). - zero-call-used-regs: Add missing clobbers in paravirt (Bill Wendling). - CFI: clean up sparc function pointer type mismatches (Bart Van Assche). - Clang: Adjust compiler flag detection for various Clang changes (Sami Tolvanen, Kees Cook). - fortify: Fix warnings in arch-specific code in sh, ARM, and xen. Improvements to existing features: - testing: improve overflow KUnit test, introduce fortify KUnit test, add more coverage to LKDTM tests (Bart Van Assche, Kees Cook). - overflow: Relax overflow type checking for wider utility. New features: - string: Introduce strtomem() and strtomem_pad() to fill a gap in strncpy() replacement needs. - um: Enable FORTIFY_SOURCE support. - fortify: Enable run-time struct member memcpy() overflow warning" * tag 'hardening-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (27 commits) Makefile.extrawarn: Move -Wcast-function-type-strict to W=1 hardening: Remove Clang's enable flag for -ftrivial-auto-var-init=zero sparc: Unbreak the build x86/paravirt: add extra clobbers with ZERO_CALL_USED_REGS enabled x86/paravirt: clean up typos and grammaros fortify: Convert to struct vs member helpers fortify: Explicitly check bounds are compile-time constants x86/entry: Work around Clang __bdos() bug ARM: decompressor: Include .data.rel.ro.local fortify: Adjust KUnit test for modular build sh: machvec: Use char[] for section boundaries kunit/memcpy: Avoid pathological compile-time string size lib: Improve the is_signed_type() kunit test LoadPin: Require file with verity root digests to have a header dm: verity-loadpin: Only trust verity targets with enforcement LoadPin: Fix Kconfig doc about format of file with verity digests um: Enable FORTIFY_SOURCE lkdtm: Update tests for memcpy() run-time warnings fortify: Add run-time WARN for cross-field memcpy() fortify: Use SIZE_MAX instead of (size_t)-1 ...
2022-09-30block: change request end_io handler to pass back a return valueJens Axboe1-1/+3
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30Merge branch 'for-6.1/block' into for-6.1/passthroughJens Axboe12-174/+263
* for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig2-5/+3
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-22md: Fix spelling mistake in comments of r5l_logZhou nan1-1/+1
Fix spelling of dones't in comments. Signed-off-by: Zhou nan <zhounan@nfschina.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5dLogan Gunthorpe1-0/+12
A complicated deadlock exists when using the journal and an elevated group_thrtead_cnt. It was found with loop devices, but its not clear whether it can be seen with real disks. The deadlock can occur simply by writing data with an fio script. When the deadlock occurs, multiple threads will hang in different ways: 1) The group threads will hang in the blk-wbt code with bios waiting to be submitted to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 ops_run_io+0x46b/0x1a30 handle_stripe+0xcd3/0x36b0 handle_active_stripes.constprop.0+0x6f6/0xa60 raid5_do_work+0x177/0x330 Or: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 flush_deferred_bios+0x136/0x170 raid5_do_work+0x262/0x330 2) The r5l_reclaim thread will hang in the same way, submitting a bio to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 submit_bio+0x3f/0xf0 md_super_write+0x12f/0x1b0 md_update_sb.part.0+0x7c6/0xff0 md_update_sb+0x30/0x60 r5l_do_reclaim+0x4f9/0x5e0 r5l_reclaim_thread+0x69/0x30b However, before hanging, the MD_SB_CHANGE_PENDING flag will be set for sb_flags in r5l_write_super_and_discard_space(). This flag will never be cleared because the submit_bio() call never returns. 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe() will do no processing on any pending stripes and re-set STRIPE_HANDLE. This will cause the raid5d thread to enter an infinite loop, constantly trying to handle the same stripes stuck in the queue. The raid5d thread has a blk_plug that holds a number of bios that are also stuck waiting seeing the thread is in a loop that never schedules. These bios have been accounted for by blk-wbt thus preventing the other threads above from continuing when they try to submit bios. --Deadlock. To fix this, add the same wait_event() that is used in raid5_do_work() to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will schedule and wait until the flag is cleared. The schedule action will flush the plug which will allow the r5l_reclaim thread to continue, thus preventing the deadlock. However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING from the same thread and can thus deadlock if the thread is put to sleep. So avoid waiting if md_check_recovery() is being called in the loop. It's not clear when the deadlock was introduced, but the similar wait_event() call in raid5_do_work() was added in 2017 by this commit: 16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata is updated.") Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: convert resync_lock to use seqlockYu Kuai2-30/+59
Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier', and io can't be dispatched until 'barrier' is dropped. Since holding the 'barrier' is not common, convert 'resync_lock' to use seqlock so that holding lock can be avoided in fast path. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: fix improper BUG_ON() in raise_barrier()Yu Kuai1-1/+1
'conf->barrier' is protected by 'conf->resync_lock', reading 'conf->barrier' without holding the lock is wrong. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: prevent unnecessary calls to wake_up() in fast pathYu Kuai1-2/+8
Currently, wake_up() is called unconditionally in fast path such as raid10_make_request(), which will cause lock contention under high concurrency: raid10_make_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty in allow_barrier() and raid10_make_request(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowaitYu Kuai1-2/+2
For the case nowait in wait_barrier(), there is no point to increase nr_waiting and then decrease it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>