summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2021-02-04bcache: only check feature sets when sb->version >= ↵Coly Li1-0/+6
BCACHE_SB_VERSION_CDEV_WITH_FEATURES commit 0df28cad06eb41cc36bfea69d9c882fb567fd0d6 upstream. For super block version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES, it doesn't make sense to check the feature sets. This patch checks super block version in bch_has_feature_* routines, if the version doesn't have feature sets yet, returns 0 (false) to the caller. Fixes: 5342fd425502 ("bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET") Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket") Cc: stable@vger.kernel.org # 5.9+ Reported-and-tested-by: Bockholdt Arne <a.bockholdt@precitec-optronik.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm integrity: select CRYPTO_SKCIPHERAnthony Iliopoulos1-0/+1
[ Upstream commit f7b347acb5f6c29d9229bb64893d8b6a2c7949fb ] The integrity target relies on skcipher for encryption/decryption, but certain kernel configurations may not enable CRYPTO_SKCIPHER, leading to compilation errors due to unresolved symbols. Explicitly select CRYPTO_SKCIPHER for DM_INTEGRITY, since it is unconditionally dependent on it. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27dm integrity: conditionally disable "recalculate" featureMikulas Patocka1-2/+24
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream. Otherwise a malicious user could (ab)use the "recalculate" feature that makes dm-integrity calculate the checksums in the background while the device is already usable. When the system restarts before all checksums have been calculated, the calculation continues where it was interrupted even if the recalculate feature is not requested the next time the dm device is set up. Disable recalculating if we use internal_hash or journal_hash with a key (e.g. HMAC) and we don't have the "legacy_recalculate" flag. This may break activation of a volume, created by an older kernel, that is not yet fully recalculated -- if this happens, the user should add the "legacy_recalculate" flag to constructor parameters. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Daniel Glockner <dg@emlix.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm integrity: fix a crash if "recalculate" used without "internal_hash"Mikulas Patocka1-0/+6
commit 2d06dfecb132a1cc2e374a44eae83b5c4356b8b4 upstream. Recalculate can only be specified with internal_hash. Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm: avoid filesystem lookup in dm_get_dev_t()Hannes Reinecke1-3/+12
commit 809b1e4945774c9ec5619a8f4e2189b7b3833c0c upstream. This reverts commit 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()") dm_get_dev_t() is just used to convert an arbitrary 'path' string into a dev_t. It doesn't presume that the device is present; that check will be done later, as the only caller is dm_get_device(), which does a dm_get_table_device() later on, which will properly open the device. So if the path string already _is_ in major:minor representation we can convert it directly, avoiding a recursion into the filesystem to lookup the block device. This avoids a hang in multipath_message() when the filesystem is inaccessible. Fixes: 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()") Cc: stable@vger.kernel.org Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm crypt: fix copy and paste bug in crypt_alloc_req_aeadIgnat Korchagin1-3/+3
commit 004b8ae9e2de55ca7857ba8471209dd3179e088c upstream. In commit d68b29584c25 ("dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq") code was incorrectly copy and pasted from crypt_alloc_req_skcipher()'s crypto request allocation code to crypt_alloc_req_aead(). It is OK from runtime perspective as both simple encryption request pointer and AEAD request pointer are part of a union, but may confuse code reviewers. Fixes: d68b29584c25 ("dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq") Cc: stable@vger.kernel.org # v5.9+ Reported-by: Pavel Machek <pavel@denx.de> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm: eliminate potential source of excessive kernel log noiseMike Snitzer1-1/+1
commit 0378c625afe80eb3f212adae42cc33c9f6f31abf upstream. There wasn't ever a real need to log an error in the kernel log for ioctls issued with insufficient permissions. Simply return an error and if an admin/user is sufficiently motivated they can enable DM's dynamic debugging to see an explanation for why the ioctls were disallowed. Reported-by: Nir Soffer <nsoffer@redhat.com> Fixes: e980f62353c6 ("dm: don't allow ioctls to targets that don't map to whole devices") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm zoned: select CONFIG_CRC32Arnd Bergmann1-0/+1
commit b690bd546b227c32b860dae985a18bed8aa946fe upstream. Without crc32 support, this driver fails to link: arm-linux-gnueabi-ld: drivers/md/dm-zoned-metadata.o: in function `dmz_write_sb': dm-zoned-metadata.c:(.text+0xe98): undefined reference to `crc32_le' arm-linux-gnueabi-ld: drivers/md/dm-zoned-metadata.o: in function `dmz_check_sb': dm-zoned-metadata.c:(.text+0x7978): undefined reference to `crc32_le' Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm crypt: defer decryption to a tasklet if interrupts disabledIgnat Korchagin1-2/+6
commit c87a95dc28b1431c7e77e2c0c983cf37698089d2 upstream. On some specific hardware on early boot we occasionally get: [ 1193.920255][ T0] BUG: sleeping function called from invalid context at mm/mempool.c:381 [ 1193.936616][ T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/69 [ 1193.953233][ T0] no locks held by swapper/69/0. [ 1193.965871][ T0] irq event stamp: 575062 [ 1193.977724][ T0] hardirqs last enabled at (575061): [<ffffffffab73f662>] tick_nohz_idle_exit+0xe2/0x3e0 [ 1194.002762][ T0] hardirqs last disabled at (575062): [<ffffffffab74e8af>] flush_smp_call_function_from_idle+0x4f/0x80 [ 1194.029035][ T0] softirqs last enabled at (575050): [<ffffffffad600fd2>] asm_call_irq_on_stack+0x12/0x20 [ 1194.054227][ T0] softirqs last disabled at (575043): [<ffffffffad600fd2>] asm_call_irq_on_stack+0x12/0x20 [ 1194.079389][ T0] CPU: 69 PID: 0 Comm: swapper/69 Not tainted 5.10.6-cloudflare-kasan-2021.1.4-dev #1 [ 1194.104103][ T0] Hardware name: NULL R162-Z12-CD/MZ12-HD4-CD, BIOS R10 06/04/2020 [ 1194.119591][ T0] Call Trace: [ 1194.130233][ T0] dump_stack+0x9a/0xcc [ 1194.141617][ T0] ___might_sleep.cold+0x180/0x1b0 [ 1194.153825][ T0] mempool_alloc+0x16b/0x300 [ 1194.165313][ T0] ? remove_element+0x160/0x160 [ 1194.176961][ T0] ? blk_mq_end_request+0x4b/0x490 [ 1194.188778][ T0] crypt_convert+0x27f6/0x45f0 [dm_crypt] [ 1194.201024][ T0] ? rcu_read_lock_sched_held+0x3f/0x70 [ 1194.212906][ T0] ? module_assert_mutex_or_preempt+0x3e/0x70 [ 1194.225318][ T0] ? __module_address.part.0+0x1b/0x3a0 [ 1194.237212][ T0] ? is_kernel_percpu_address+0x5b/0x190 [ 1194.249238][ T0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [dm_crypt] [ 1194.261593][ T0] ? is_module_address+0x25/0x40 [ 1194.272905][ T0] ? static_obj+0x8a/0xc0 [ 1194.283582][ T0] ? lockdep_init_map_waits+0x26a/0x700 [ 1194.295570][ T0] ? __raw_spin_lock_init+0x39/0x110 [ 1194.307330][ T0] kcryptd_crypt_read_convert+0x31c/0x560 [dm_crypt] [ 1194.320496][ T0] ? kcryptd_queue_crypt+0x1be/0x380 [dm_crypt] [ 1194.333203][ T0] blk_update_request+0x6d7/0x1500 [ 1194.344841][ T0] ? blk_mq_trigger_softirq+0x190/0x190 [ 1194.356831][ T0] blk_mq_end_request+0x4b/0x490 [ 1194.367994][ T0] ? blk_mq_trigger_softirq+0x190/0x190 [ 1194.379693][ T0] flush_smp_call_function_queue+0x24b/0x560 [ 1194.391847][ T0] flush_smp_call_function_from_idle+0x59/0x80 [ 1194.403969][ T0] do_idle+0x287/0x450 [ 1194.413891][ T0] ? arch_cpu_idle_exit+0x40/0x40 [ 1194.424716][ T0] ? lockdep_hardirqs_on_prepare+0x286/0x3f0 [ 1194.436399][ T0] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 1194.447759][ T0] cpu_startup_entry+0x19/0x20 [ 1194.458038][ T0] secondary_startup_64_no_verify+0xb0/0xbb IO completion can be queued to a different CPU by the block subsystem as a "call single function/data". The CPU may run these routines from the idle task, but it does so with interrupts disabled. It is not a good idea to do decryption with irqs disabled even in an idle task context, so just defer it to a tasklet (as is done with requests from hard irqs). Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm crypt: do not call bio_endio() from the dm-crypt taskletIgnat Korchagin1-1/+23
commit 8e14f610159d524cd7aac37982826d3ef75c09e8 upstream. Sometimes, when dm-crypt executes decryption in a tasklet, we may get "BUG: KASAN: use-after-free in tasklet_action_common.constprop..." with a kasan-enabled kernel. When the decryption fully completes in the tasklet, dm-crypt will call bio_endio(), which in turn will call clone_endio() from dm.c core code. That function frees the resources associated with the bio, including per bio private structures. For dm-crypt it will free the current struct dm_crypt_io, which contains our tasklet object, causing use-after-free, when the tasklet is being dequeued by the kernel. To avoid this, do not call bio_endio() from the current tasklet context, but delay its execution to the dm-crypt IO workqueue. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: <stable@vger.kernel.org> # v5.9+ Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm crypt: do not wait for backlogged crypto request completion in softirqIgnat Korchagin1-5/+98
commit 8abec36d1274bbd5ae8f36f3658b9abb3db56c31 upstream. Commit 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. When Crypto API backlogs a crypto request, dm-crypt uses wait_for_completion to avoid sending further requests to an already overloaded crypto driver. However, if the code is executing in softirq context, we might get the following stacktrace: [ 210.235213][ C0] BUG: scheduling while atomic: fio/2602/0x00000102 [ 210.236701][ C0] Modules linked in: [ 210.237566][ C0] CPU: 0 PID: 2602 Comm: fio Tainted: G W 5.10.0+ #50 [ 210.239292][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.241233][ C0] Call Trace: [ 210.241946][ C0] <IRQ> [ 210.242561][ C0] dump_stack+0x7d/0xa3 [ 210.243466][ C0] __schedule_bug.cold+0xb3/0xc2 [ 210.244539][ C0] __schedule+0x156f/0x20d0 [ 210.245518][ C0] ? io_schedule_timeout+0x140/0x140 [ 210.246660][ C0] schedule+0xd0/0x270 [ 210.247541][ C0] schedule_timeout+0x1fb/0x280 [ 210.248586][ C0] ? usleep_range+0x150/0x150 [ 210.249624][ C0] ? unpoison_range+0x3a/0x60 [ 210.250632][ C0] ? ____kasan_kmalloc.constprop.0+0x82/0xa0 [ 210.251949][ C0] ? unpoison_range+0x3a/0x60 [ 210.252958][ C0] ? __prepare_to_swait+0xa7/0x190 [ 210.254067][ C0] do_wait_for_common+0x2ab/0x370 [ 210.255158][ C0] ? usleep_range+0x150/0x150 [ 210.256192][ C0] ? bit_wait_io_timeout+0x160/0x160 [ 210.257358][ C0] ? blk_update_request+0x757/0x1150 [ 210.258582][ C0] ? _raw_spin_lock_irq+0x82/0xd0 [ 210.259674][ C0] ? _raw_read_unlock_irqrestore+0x30/0x30 [ 210.260917][ C0] wait_for_completion+0x4c/0x90 [ 210.261971][ C0] crypt_convert+0x19a6/0x4c00 [ 210.263033][ C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.264193][ C0] ? kasan_set_track+0x1c/0x30 [ 210.265191][ C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.266283][ C0] ? kmem_cache_free+0x104/0x470 [ 210.267363][ C0] ? crypt_endio+0x91/0x180 [ 210.268327][ C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.269565][ C0] blk_update_request+0x757/0x1150 [ 210.270563][ C0] blk_mq_end_request+0x4b/0x480 [ 210.271680][ C0] blk_done_softirq+0x21d/0x340 [ 210.272775][ C0] ? _raw_spin_lock+0x81/0xd0 [ 210.273847][ C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.275031][ C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.276182][ C0] __do_softirq+0x190/0x611 [ 210.277203][ C0] ? handle_edge_irq+0x221/0xb60 [ 210.278340][ C0] asm_call_irq_on_stack+0x12/0x20 [ 210.279514][ C0] </IRQ> [ 210.280164][ C0] do_softirq_own_stack+0x37/0x40 [ 210.281281][ C0] irq_exit_rcu+0x110/0x1b0 [ 210.282286][ C0] common_interrupt+0x74/0x120 [ 210.283376][ C0] asm_common_interrupt+0x1e/0x40 [ 210.284496][ C0] RIP: 0010:_aesni_enc1+0x65/0xb0 Fix this by making crypt_convert function reentrant from the point of a single bio and make dm-crypt defer further bio processing to a workqueue, if Crypto API backlogs a request in interrupt context. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm crypt: use GFP_ATOMIC when allocating crypto requests from softirqIgnat Korchagin1-10/+25
commit d68b29584c25dbacd01ed44a3e45abb35353f1de upstream. Commit 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") made it possible for some code paths in dm-crypt to be executed in softirq context, when the underlying driver processes IO requests in interrupt/softirq context. In this case sometimes when allocating a new crypto request we may get a stacktrace like below: [ 210.103008][ C0] BUG: sleeping function called from invalid context at mm/mempool.c:381 [ 210.104746][ C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2602, name: fio [ 210.106599][ C0] CPU: 0 PID: 2602 Comm: fio Tainted: G W 5.10.0+ #50 [ 210.108331][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 210.110212][ C0] Call Trace: [ 210.110921][ C0] <IRQ> [ 210.111527][ C0] dump_stack+0x7d/0xa3 [ 210.112411][ C0] ___might_sleep.cold+0x122/0x151 [ 210.113527][ C0] mempool_alloc+0x16b/0x2f0 [ 210.114524][ C0] ? __queue_work+0x515/0xde0 [ 210.115553][ C0] ? mempool_resize+0x700/0x700 [ 210.116586][ C0] ? crypt_endio+0x91/0x180 [ 210.117479][ C0] ? blk_update_request+0x757/0x1150 [ 210.118513][ C0] ? blk_mq_end_request+0x4b/0x480 [ 210.119572][ C0] ? blk_done_softirq+0x21d/0x340 [ 210.120628][ C0] ? __do_softirq+0x190/0x611 [ 210.121626][ C0] crypt_convert+0x29f9/0x4c00 [ 210.122668][ C0] ? _raw_spin_lock_irqsave+0x87/0xe0 [ 210.123824][ C0] ? kasan_set_track+0x1c/0x30 [ 210.124858][ C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 [ 210.125930][ C0] ? kmem_cache_free+0x104/0x470 [ 210.126973][ C0] ? crypt_endio+0x91/0x180 [ 210.127947][ C0] kcryptd_crypt_read_convert+0x30e/0x420 [ 210.129165][ C0] blk_update_request+0x757/0x1150 [ 210.130231][ C0] blk_mq_end_request+0x4b/0x480 [ 210.131294][ C0] blk_done_softirq+0x21d/0x340 [ 210.132332][ C0] ? _raw_spin_lock+0x81/0xd0 [ 210.133289][ C0] ? blk_mq_stop_hw_queue+0x30/0x30 [ 210.134399][ C0] ? _raw_read_lock_irq+0x40/0x40 [ 210.135458][ C0] __do_softirq+0x190/0x611 [ 210.136409][ C0] ? handle_edge_irq+0x221/0xb60 [ 210.137447][ C0] asm_call_irq_on_stack+0x12/0x20 [ 210.138507][ C0] </IRQ> [ 210.139118][ C0] do_softirq_own_stack+0x37/0x40 [ 210.140191][ C0] irq_exit_rcu+0x110/0x1b0 [ 210.141151][ C0] common_interrupt+0x74/0x120 [ 210.142171][ C0] asm_common_interrupt+0x1e/0x40 Fix this by allocating crypto requests with GFP_ATOMIC mask in interrupt context. Fixes: 39d42fa96ba1 ("dm crypt: add flags to optionally bypass kcryptd workqueues") Cc: stable@vger.kernel.org # v5.9+ Reported-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm integrity: fix the maximum number of argumentsMikulas Patocka1-1/+1
commit 17ffc193cdc6dc7a613d00d8ad47fc1f801b9bf0 upstream. Advance the maximum number of arguments from 9 to 15 to account for all potential feature flags that may be supplied. Linux 4.19 added "meta_device" (356d9d52e1221ba0c9f10b8b38652f78a5298329) and "recalculate" (a3fcf7253139609bf9ff901fbf955fba047e75dd) flags. Commit 468dfca38b1a6fbdccd195d875599cb7c8875cd9 added "sectors_per_bit" and "bitmap_flush_interval". Commit 84597a44a9d86ac949900441cea7da0af0f2f473 added "allow_discards". And the commit d537858ac8aaf4311b51240893add2fc62003b97 added "fix_padding". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm integrity: fix flush with external metadata deviceMikulas Patocka2-11/+55
commit 9b5948267adc9e689da609eb61cf7ed49cae5fa8 upstream. With external metadata device, flush requests are not passed down to the data device. Fix this by submitting the flush request in dm_integrity_flush_buffers. In order to not degrade performance, we overlap the data device flush with the metadata device flush. Reported-by: Lukas Straub <lukasstraub2@web.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm snapshot: flush merged data before committing metadataAkilesh Kailash1-0/+24
commit fcc42338375a1e67b8568dbb558f8b784d0f3b01 upstream. If the origin device has a volatile write-back cache and the following events occur: 1: After finishing merge operation of one set of exceptions, merge_callback() is invoked. 2: Update the metadata in COW device tracking the merge completion. This update to COW device is flushed cleanly. 3: System crashes and the origin device's cache where the recent merge was completed has not been flushed. During the next cycle when we read the metadata from the COW device, we will skip reading those metadata whose merge was completed in step (1). This will lead to data loss/corruption. To address this, flush the origin device post merge IO before updating the metadata. Cc: stable@vger.kernel.org Signed-off-by: Akilesh Kailash <akailash@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm raid: fix discard limits for raid1Mike Snitzer1-3/+3
commit cc07d72bf350b77faeffee1c37bc52197171473f upstream. Block core warned that discard_granularity was 0 for dm-raid with personality of raid1. Reason is that raid_io_hints() was incorrectly special-casing raid1 rather than raid0. Fix raid_io_hints() by removing discard limits settings for raid1. Check for raid0 instead. Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface") Cc: stable@vger.kernel.org Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Stephan Bärwolf <stephan@matrixstorm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17bcache: set bcache device into read-only mode for ↵Coly Li1-0/+15
BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET commit 5342fd4255021ef0c4ce7be52eea1c4ebda11c63 upstream. If BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET is set in incompat feature set, it means the cache device is created with obsoleted layout with obso_bucket_site_hi. Now bcache does not support this feature bit, a new BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE incompat feature bit is added for a better layout to support large bucket size. For the legacy compatibility purpose, if a cache device created with obsoleted BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit, all bcache devices attached to this cache set should be set to read-only. Then the dirty data can be written back to backing device before re-create the cache device with BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE feature bit by the latest bcache-tools. This patch checks BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET feature bit when running a cache set and attach a bcache device to the cache set. If this bit is set, - When run a cache set, print an error kernel message to indicate all following attached bcache device will be read-only. - When attach a bcache device, print an error kernel message to indicate the attached bcache device will be read-only, and ask users to update to latest bcache-tools. Such change is only for cache device whose bucket size >= 32MB, this is for the zoned SSD and almost nobody uses such large bucket size at this moment. If you don't explicit set a large bucket size for a zoned SSD, such change is totally transparent to your bcache device. Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket") Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucketColy Li3-7/+28
commit b16671e8f493e3df40b1fb0dff4078f391c5099a upstream. When large bucket feature was added, BCH_FEATURE_INCOMPAT_LARGE_BUCKET was introduced into the incompat feature set. It used bucket_size_hi (which was added at the tail of struct cache_sb_disk) to extend current 16bit bucket size to 32bit with existing bucket_size in struct cache_sb_disk. This is not a good idea, there are two obvious problems, - Bucket size is always value power of 2, if store log2(bucket size) in existing bucket_size of struct cache_sb_disk, it is unnecessary to add bucket_size_hi. - Macro csum_set() assumes d[SB_JOURNAL_BUCKETS] is the last member in struct cache_sb_disk, bucket_size_hi was added after d[] which makes csum_set calculate an unexpected super block checksum. To fix the above problems, this patch introduces a new incompat feature bit BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE, when this bit is set, it means bucket_size in struct cache_sb_disk stores the order of power-of-2 bucket size value. When user specifies a bucket size larger than 32768 sectors, BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE will be set to incompat feature set, and bucket_size stores log2(bucket size) more than store the real bucket size value. The obsoleted BCH_FEATURE_INCOMPAT_LARGE_BUCKET won't be used anymore, it is renamed to BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET and still only recognized by kernel driver for legacy compatible purpose. The previous bucket_size_hi is renmaed to obso_bucket_size_hi in struct cache_sb_disk and not used in bcache-tools anymore. For cache device created with BCH_FEATURE_INCOMPAT_LARGE_BUCKET feature, bcache-tools and kernel driver still recognize the feature string and display it as "obso_large_bucket". With this change, the unnecessary extra space extend of bcache on-disk super block can be avoided, and csum_set() may generate expected check sum as well. Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12bcache: check unsupported feature sets for bcache registerColy Li2-0/+29
commit 1dfc0686c29a9bbd3a446a29f9ccde3dec3bc75a upstream. This patch adds the check for features which is incompatible for current supported feature sets. Now if the bcache device created by bcache-tools has features that current kernel doesn't support, read_super() will fail with error messoage. E.g. if an unsupported incompatible feature detected, bcache register will fail with dmesg "bcache: register_bcache() error : Unsupported incompatible feature found". Fixes: d721a43ff69c ("bcache: increase super block version for cache device and backing device") Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12bcache: fix typo from SUUP to SUPP in features.hColy Li1-3/+3
commit f7b4943dea48a572ad751ce1f18a245d43debe7e upstream. This patch fixes the following typos, from BCH_FEATURE_COMPAT_SUUP to BCH_FEATURE_COMPAT_SUPP from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_INCOMPAT_SUPP from BCH_FEATURE_INCOMPAT_SUUP to BCH_FEATURE_RO_COMPAT_SUPP Fixes: d721a43ff69c ("bcache: increase super block version for cache device and backing device") Fixes: ffa470327572 ("bcache: add bucket_size_hi into struct cache_sb_disk for large bucket") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06dm verity: skip verity work if I/O error when system is shutting downHyeongseok Kim1-1/+11
[ Upstream commit 252bd1256396cebc6fc3526127fdb0b317601318 ] If emergency system shutdown is called, like by thermal shutdown, a dm device could be alive when the block device couldn't process I/O requests anymore. In this state, the handling of I/O errors by new dm I/O requests or by those already in-flight can lead to a verity corruption state, which is a misjudgment. So, skip verity work in response to I/O error when system is shutting down. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-06md/raid10: initialize r10_bio->read_slot before use.Kevin Vigor1-1/+2
commit 93decc563637c4288380912eac0eb42fb246cc04 upstream. In __make_request() a new r10bio is allocated and passed to raid10_read_request(). The read_slot member of the bio is not initialized, and the raid10_read_request() uses it to index an array. This leads to occasional panics. Fix by initializing the field to invalid value and checking for valid value in raid10_read_request(). Cc: stable@vger.kernel.org Signed-off-by: Kevin Vigor <kvigor@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30md/cluster: fix deadlock when node is doing resync jobZhao Heming2-31/+42
commit bca5b0658020be90b6b504ca514fd80110204f71 upstream. md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg. During sending msg, node can concurrently receive msg from another node. When node does resync job, grab token_lockres:EX may trigger a deadlock: ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever ``` Explaination: a. A send METADATA_UPDATED This will block another node to send msg b. B does sync jobs, which will send RESYNCING at intervals. This will be block for holding token_lockres:EX lock. c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. d. B recv METADATA_UPDATED msg, which send from A in step <a>. This will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") In that commit, step c is "update sb". This patch step c is "mdadm --remove". For fixing this issue, we can refer the solution of function: metadata_update_start. Which does the same grab lock_token action. lock_comm can use the same steps to avoid deadlock. By moving MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock. Repro steps (I only triggered 3 times with hundreds tests): two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` test script will hung when executing "mdadm --remove". ``` # dump stacks by "echo t > /proc/sysrq-trigger" md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` At last, thanks for Xiao's solution. Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Suggested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30md/cluster: block reshape with remote resync jobZhao Heming1-2/+6
commit a8da01f79c89755fad55ed0ea96e8d2103242a72 upstream. Reshape request should be blocked with ongoing resync job. In cluster env, a node can start resync job even if the resync cmd isn't executed on it, e.g., user executes "mdadm --grow" on node A, sometimes node B will start resync job. However, current update_raid_disks() only check local recovery status, which is incomplete. As a result, we see user will execute "mdadm --grow" successfully on local, while the remote node deny to do reshape job when it doing resync job. The inconsistent handling cause array enter unexpected status. If user doesn't observe this issue and continue executing mdadm cmd, the array doesn't work at last. Fix this issue by blocking reshape request. When node executes "--grow" and detects ongoing resync, it should stop and report error to user. The following script reproduces the issue with ~100% probability. (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB) ``` # on node1, node2 is the remote node. ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30dm ioctl: fix error return code in target_messageQinglang Miao1-0/+1
[ Upstream commit 4d7659bfbe277a43399a4a2d90fca141e70f29e1 ] Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 2ca4c92f58f9 ("dm ioctl: prevent empty message") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-26md: fix a warning caused by a race between concurrent md_ioctl()sDae R. Jeong1-2/+5
commit c731b84b51bf7fe83448bea8f56a6d55006b0615 upstream. Syzkaller reports a warning as belows. WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169 ... Call Trace: ... RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169 RSP: 0018:ffff888096027950 EFLAGS: 00010293 RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2 RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007 RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482 R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932 R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408 __blkdev_driver_ioctl block/ioctl.c:304 [inline] blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606 block_ioctl+0xee/0x130 fs/block_dev.c:1930 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is caused by a race between two concurrenct md_ioctl()s closing the array. CPU1 (md_ioctl()) CPU2 (md_ioctl()) ------ ------ set_bit(MD_CLOSING, &mddev->flags); did_set_md_closing = true; WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags)); if(did_set_md_closing) clear_bit(MD_CLOSING, &mddev->flags); Fix the warning by returning immediately if the MD_CLOSING bit is set in &mddev->flags which indicates that the array is being closed. Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop") Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-14Revert "dm raid: fix discard limits for raid1 and raid10"Greg Kroah-Hartman1-7/+5
This reverts commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512. It causes problems :( Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Mike Snitzer <snitzer@redhat.com> Cc: Zdenek Kabelac <zkabelac@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-14Revert "md: change mddev 'chunk_sectors' from int to unsigned"Greg Kroah-Hartman1-2/+2
This reverts commit 6ffeb1c3f8226244c08105bcdbeecc04bad6b89a. It causes problems :( Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Mike Snitzer <snitzer@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-13Merge tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-blockLinus Torvalds6-393/+82
Pull block fixes from Jens Axboe: "This should be it for 5.10. Mike and Song looked into the warning case, and thankfully it appears the fix was pretty trivial - we can just change the md device chunk type to unsigned int to get rid of it. They cannot currently be < 0, and nobody is checking for that either. We're reverting the discard changes as the corruption reports came in very late, and there's just no time to attempt to deal with it at this point. Reverting the changes in question is the right call for 5.10" * tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block: md: change mddev 'chunk_sectors' from int to unsigned Revert "md: add md_submit_discard_bio() for submitting discard bio" Revert "md/raid10: extend r10bio devs to raid disks" Revert "md/raid10: pull codes that wait for blocked dev into one function" Revert "md/raid10: improve raid10 discard request" Revert "md/raid10: improve discard request for far layout" Revert "dm raid: remove unnecessary discard limits for raid10"
2020-12-12md: change mddev 'chunk_sectors' from int to unsignedMike Snitzer1-2/+2
Commit e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"") exposed compiler warnings introduced by commit e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10"): In file included from ./include/linux/kernel.h:14, from ./include/asm-generic/bug.h:20, from ./arch/x86/include/asm/bug.h:93, from ./include/linux/bug.h:5, from ./include/linux/mmdebug.h:5, from ./include/linux/gfp.h:5, from ./include/linux/slab.h:15, from drivers/md/dm-raid.c:8: drivers/md/dm-raid.c: In function ‘raid_io_hints’: ./include/linux/minmax.h:18:28: warning: comparison of distinct pointer types lacks a cast (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^~ ./include/linux/minmax.h:32:4: note: in expansion of macro ‘__typecheck’ (__typecheck(x, y) && __no_side_effects(x, y)) ^~~~~~~~~~~ ./include/linux/minmax.h:42:24: note: in expansion of macro ‘__safe_cmp’ __builtin_choose_expr(__safe_cmp(x, y), \ ^~~~~~~~~~ ./include/linux/minmax.h:51:19: note: in expansion of macro ‘__careful_cmp’ #define min(x, y) __careful_cmp(x, y, <) ^~~~~~~~~~~~~ ./include/linux/minmax.h:84:39: note: in expansion of macro ‘min’ __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); }) ^~~ drivers/md/dm-raid.c:3739:33: note: in expansion of macro ‘min_not_zero’ limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors, ^~~~~~~~~~~~ Fix this by changing the chunk_sectors member of 'struct mddev' from int to 'unsigned int' to match the type used for the 'chunk_sectors' member of 'struct queue_limits'. Various MD code still uses 'int' but none of it appears to ever make use of signed int; and storing positive signed int in unsigned is perfectly safe. Reported-by: Song Liu <songliubraving@fb.com> Fixes: e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"") Fixes: e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10") Cc: stable@vger,kernel.org # e0910c8e4f87 was marked for stable@ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-10Revert "md: add md_submit_discard_bio() for submitting discard bio"Song Liu3-24/+12
This reverts commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: extend r10bio devs to raid disks"Song Liu1-4/+5
This reverts commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: pull codes that wait for blocked dev into one function"Song Liu1-67/+51
This reverts commit f046f5d0d79cdb968f219ce249e497fd1accf484. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: improve raid10 discard request"Song Liu1-255/+1
This reverts commit bcc90d280465ebd51ab8688be86e1f00c62dccf9. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "md/raid10: improve discard request for far layout"Song Liu2-64/+23
This reverts commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-10Revert "dm raid: remove unnecessary discard limits for raid10"Song Liu1-0/+11
This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28. Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix. [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/ Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Xiao Ni <xni@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-12-04dm: remove invalid sparse __acquires and __releases annotationsMike Snitzer1-2/+0
Fixes sparse warnings: drivers/md/dm.c:508:12: warning: context imbalance in 'dm_prepare_ioctl' - wrong count at exit drivers/md/dm.c:543:13: warning: context imbalance in 'dm_unprepare_ioctl' - wrong count at exit Fixes: 971888c46993f ("dm: hold DM table for duration of ioctl rather than use blkdev_get") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-04dm: fix double RCU unlock in dm_dax_zero_page_range() error pathMike Snitzer1-2/+0
Remove redundant dm_put_live_table() in dm_dax_zero_page_range() error path to fix sparse warning: drivers/md/dm.c:1208:9: warning: context imbalance in 'dm_dax_zero_page_range' - unexpected unlock Fixes: cdf6cdcd3b99a ("dm,dax: Add dax zero_page_range operation") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-04dm: fix IO splittingMike Snitzer2-13/+11
Commit 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") caused a couple regressions: 1) Using lcm_not_zero() when stacking chunk_sectors was a bug because chunk_sectors must reflect the most limited of all devices in the IO stack. 2) DM targets that set max_io_len but that do _not_ provide an .iterate_devices method no longer had there IO split properly. And commit 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") also caused a regression where DM no longer supported varied (per target) IO splitting. The implication being the potential for severely reduced performance for IO stacks that use a DM target like dm-cache to hide performance limitations of a slower device (e.g. one that requires 4K IO splitting). Coming full circle: Fix all these issues by discontinuing stacking chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(), add optional chunk_sectors override argument to blk_max_size_offset() and update DM's max_io_len() to pass ti->max_io_len to its blk_max_size_offset() call. Passing in an optional chunk_sectors override to blk_max_size_offset() allows for code reuse of block's centralized calculation for max IO size based on provided offset and split boundary. Fixes: 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") Fixes: 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") Cc: stable@vger.kernel.org Reported-by: John Dorminy <jdorminy@redhat.com> Reported-by: Bruce Johnston <bjohnsto@redhat.com> Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: John Dorminy <jdorminy@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk>
2020-12-01dm writecache: remove BUG() and fail gracefully insteadMike Snitzer1-1/+1
Building on arch/s390/ results in this build error: cc1: some warnings being treated as errors ../drivers/md/dm-writecache.c: In function 'persistent_memory_claim': ../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type] Fix this by replacing the BUG() with an -EOPNOTSUPP return. Fixes: 48debafe4f2f ("dm: add writecache target") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01dm table: Remove BUG_ON(in_interrupt())Thomas Gleixner1-6/+0
The BUG_ON(in_interrupt()) in dm_table_event() is a historic leftover from a rework of the dm table code which changed the calling context. Issuing a BUG for a wrong calling context is frowned upon and in_interrupt() is deprecated and only covering parts of the wrong contexts. The sanity check for the context is covered by CONFIG_DEBUG_ATOMIC_SLEEP and other debug facilities already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01dm: fix bug with RCU locking in dm_blk_report_zonesSergei Shtepa1-2/+4
The dm_get_live_table() function makes RCU read lock so dm_put_live_table() must be called even if dm_table map is not found. Fixes: e76239a3748c9 ("block: add a report_zones method") Cc: stable@vger.kernel.org Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-01Revert "dm cache: fix arm link errors with inline"Nick Desaulniers1-4/+0
This reverts commit 43aeaa29573924df76f44eda2bbd94ca36e407b5. Since commit 0bddd227f3dc ("Documentation: update for gcc 4.9 requirement") the minimum supported version of GCC is gcc-4.9. It's now safe to remove this code. Link: https://github.com/ClangBuiltLinux/linux/issues/427 Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-17dm writecache: fix the maximum number of argumentsMikulas Patocka1-1/+1
Advance the maximum number of arguments to 16. This fixes issue where certain operations, combined with table configured args, exceed 10 arguments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-17dm writecache: advance the number of arguments when reporting max_ageMikulas Patocka1-0/+2
When reporting the "max_age" value the number of arguments must advance by two. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 3923d4854e18 ("dm writecache: implement gradual cleanup") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-11-17dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORYMikulas Patocka1-2/+2
Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY set. These drivers allocate memory and thus they are not suitable for block I/O processing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-15Merge tag 'for-5.10/dm-changes' of ↵Linus Torvalds13-393/+222
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Improve DM core's bio splitting to use blk_max_size_offset(). Also fix bio splitting for bios that were deferred to the worker thread due to a DM device being suspended. - Remove DM core's special handling of NVMe devices now that block core has internalized efficiencies drivers previously needed to be concerned about (via now removed direct_make_request). - Fix request-based DM to not bounce through indirect dm_submit_bio; instead have block core make direct call to blk_mq_submit_bio(). - Various DM core cleanups to simplify and improve code. - Update DM cryot to not use drivers that set CRYPTO_ALG_ALLOCATES_MEMORY. - Fix DM raid's raid1 and raid10 discard limits for the purposes of linux-stable. But then remove DM raid's discard limits settings now that MD raid can efficiently handle large discards. - A couple small cleanups across various targets. * tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix request-based DM to not bounce through indirect dm_submit_bio dm: remove special-casing of bio-based immutable singleton target on NVMe dm: export dm_copy_name_and_uuid dm: fix comment in __dm_suspend() dm: fold dm_process_bio() into dm_submit_bio() dm: fix missing imposition of queue_limits from dm_wq_work() thread dm snap persistent: simplify area_io() dm thin metadata: Remove unused local variable when create thin and snap dm raid: remove unnecessary discard limits for raid10 dm raid: fix discard limits for raid1 and raid10 dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY dm: use dm_table_get_device_name() where appropriate in targets dm table: make 'struct dm_table' definition accessible to all of DM core dm: eliminate need for start_io_acct() forward declaration dm: simplify __process_abnormal_io() dm: push use of on-stack flush_bio down to __send_empty_flush() dm: optimize max_io_len() by inlining max_io_len_target_boundary() dm: push md->immutable_target optimization down to __process_bio() dm: change max_io_len() to use blk_max_size_offset() dm table: stack 'chunk_sectors' limit to account for target-specific splitting
2020-10-13Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds24-620/+1032
Pull block driver updates from Jens Axboe: "Here are the driver updates for 5.10. A few SCSI updates in here too, in coordination with Martin as they depend on core block changes for the shared tag bitmap. This contains: - NVMe pull requests via Christoph: - fix keep alive timer modification (Amit Engel) - order the PCI ID list more sensibly (Andy Shevchenko) - cleanup the open by controller helper (Chaitanya Kulkarni) - use an xarray for the CSE log lookup (Chaitanya Kulkarni) - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni) - fix nvme_ns_report_zones (Christoph Hellwig) - add a sanity check to nvmet-fc (James Smart) - fix interrupt allocation when too many polled queues are specified (Jeffle Xu) - small nvmet-tcp optimization (Mark Wunderlich) - fix a controller refcount leak on init failure (Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni) - major refactoring of the scanning code (Christoph Hellwig) - MD updates via Song: - Bug fixes in bitmap code, from Zhao Heming - Fix a work queue check, from Guoqing Jiang - Fix raid5 oops with reshape, from Song Liu - Clean up unused code, from Jason Yan - Discard improvements, from Xiao Ni - raid5/6 page offset support, from Yufen Yu - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap, Hannes) - null_blk open/active zone limit support (Niklas) - Set of bcache updates (Coly, Dongsheng, Qinglang)" * tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits) md/raid5: fix oops during stripe resizing md/bitmap: fix memory leak of temporary bitmap md: fix the checking of wrong work queue md/bitmap: md_bitmap_get_counter returns wrong blocks md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks md/raid0: remove unused function is_io_in_chunk_boundary() nvme-core: remove extra condition for vwc nvme-core: remove extra variable nvme: remove nvme_identify_ns_list nvme: refactor nvme_validate_ns nvme: move nvme_validate_ns nvme: query namespace identifiers before adding the namespace nvme: revalidate zone bitmaps in nvme_update_ns_info nvme: remove nvme_update_formats nvme: update the known admin effects nvme: set the queue limits in nvme_update_ns_info nvme: remove the 0 lba_shift check in nvme_update_ns_info nvme: clean up the check for too large logic block sizes nvme: freeze the queue over ->lba_shift updates nvme: factor out a nvme_configure_metadata helper ...
2020-10-13Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds13-119/+106
Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype