summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2021-07-19dm btree remove: assign new_root only when removal succeedsHou Tao1-1/+2
commit b6e58b5466b2959f83034bead2e2e1395cca8aeb upstream. remove_raw() in dm_btree_remove() may fail due to IO read error (e.g. read the content of origin block fails during shadowing), and the value of shadow_spine::root is uninitialized, but the uninitialized value is still assign to new_root in the end of dm_btree_remove(). For dm-thin, the value of pmd->details_root or pmd->root will become an uninitialized value, so if trying to read details_info tree again out-of-bound memory may occur as showed below: general protection fault, probably for non-canonical address 0x3fdcb14c8d7520 CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6 Hardware name: QEMU Standard PC RIP: 0010:metadata_ll_load_ie+0x14/0x30 Call Trace: sm_metadata_count_is_more_than_one+0xb9/0xe0 dm_tm_shadow_block+0x52/0x1c0 shadow_step+0x59/0xf0 remove_raw+0xb2/0x170 dm_btree_remove+0xf4/0x1c0 dm_pool_delete_thin_device+0xc3/0x140 pool_message+0x218/0x2b0 target_message+0x251/0x290 ctl_ioctl+0x1c4/0x4d0 dm_ctl_ioctl+0xe/0x20 __x64_sys_ioctl+0x7b/0xb0 do_syscall_64+0x40/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixing it by only assign new_root when removal succeeds Signed-off-by: Hou Tao <houtao1@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19dm space maps: don't reset space map allocation cursor when committingJoe Thornber2-2/+16
[ Upstream commit 5faafc77f7de69147d1e818026b9a0cbf036a7b2 ] Current commit code resets the place where the search for free blocks will begin back to the start of the metadata device. There are a couple of repercussions to this: - The first allocation after the commit is likely to take longer than normal as it searches for a free block in an area that is likely to have very few free blocks (if any). - Any free blocks it finds will have been recently freed. Reusing them means we have fewer old copies of the metadata to aid recovery from hardware error. Fix these issues by leaving the cursor alone, only resetting when the search hits the end of the metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-16dm verity: fix require_signatures module_param permissionsJohn Keeping1-1/+1
[ Upstream commit 0c1f3193b1cdd21e7182f97dc9bca7d284d18a15 ] The third parameter of module_param() is permissions for the sysfs node but it looks like it is being used as the initial value of the parameter here. In fact, false here equates to omitting the file from sysfs and does not affect the value of require_signatures. Making the parameter writable is not simple because going from false->true is fine but it should not be possible to remove the requirement to verify a signature. But it can be useful to inspect the value of this parameter from userspace, so change the permissions to make a read-only file in sysfs. Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03dm snapshot: properly fix a crash when an origin has no snapshotsMikulas Patocka1-1/+1
commit 7e768532b2396bcb7fbf6f82384b85c0f1d2f197 upstream. If an origin target has no snapshots, o->split_boundary is set to 0. This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split(). Fix this by initializing chunk_size, and in turn split_boundary, to rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits into "unsigned" type. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-26dm snapshot: fix crash with transient storage and zero chunk sizeMikulas Patocka1-0/+1
commit c699a0db2d62e3bbb7f0bf35c87edbc8d23e3062 upstream. The following commands will crash the kernel: modprobe brd rd_size=1048576 dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0" dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0" The reason is that when we test for zero chunk size, we jump to the label bad_read_metadata without setting the "r" variable. The function snapshot_ctr destroys all the structures and then exits with "r == 0". The kernel then crashes because it falsely believes that snapshot_ctr succeeded. In order to fix the bug, we set the variable "r" to -EINVAL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-14md: Fix missing unused status line of /proc/mdstatJan Glauber1-1/+5
commit 7abfabaf5f805f5171d133ce6af9b65ab766e76a upstream. Reading /proc/mdstat with a read buffer size that would not fit the unused status line in the first read will skip this line from the output. So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something like: unused devices: <none> Don't return NULL immediately in start() for v=2 but call show() once to print the status line also for multiple reads. Cc: stable@vger.kernel.org Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: Jan Glauber <jglauber@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-14md: md_open returns -EBUSY when entering racing areaZhao Heming1-2/+1
commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream. commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env. This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which will break the infinitely retry when md_open enter racing area. This patch is partly fix soft lockup issue, full fix needs mddev_find is split into two functions: mddev_find & mddev_find_or_alloc. And md_open should call new mddev_find (it only does searching job). For more detail, please refer with Christoph's "split mddev_find" patch in later commits.
2021-05-14md: factor out a mddev_find_locked helper from mddev_findChristoph Hellwig1-13/+19
commit 8b57251f9a91f5e5a599de7549915d2d226cc3af upstream. Factor out a self-contained helper to just lookup a mddev by the dev_t "unit". Cc: stable@vger.kernel.org Reviewed-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-14md: split mddev_findChristoph Hellwig1-5/+19
commit 65aa97c4d2bfd76677c211b9d03ef05a98c6d68e upstream. Split mddev_find into a simple mddev_find that just finds an existing mddev by the unit number, and a more complicated mddev_find that deals with find or allocating a mddev. This turns out to fix this bug reported by Zhao Heming. ----------------------------- snip ------------------------------ commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env.
2021-05-14md-cluster: fix use-after-free issue when removing rdevHeming Zhao1-4/+4
commit f7c7a2f9a23e5b6e0f5251f29648d0238bb7757e upstream. md_kick_rdev_from_array will remove rdev, so we should use rdev_for_each_safe to search list. How to trigger: env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns). ``` node2=192.168.0.3 for i in {1..20}; do echo ==== $i `date` ====; mdadm -Ss && ssh ${node2} "mdadm -Ss" wipefs -a /dev/sda /dev/sdb mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \ /dev/sdb --assume-clean ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" mdadm --wait /dev/md0 ssh ${node2} "mdadm --wait /dev/md0" mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda sleep 1 done ``` Crash stack: ``` stack segment: 0000 [#1] SMP ... ... RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod] ... ... RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50 RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246 RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800 FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: raid1d+0x5c/0xd40 [raid1] ? finish_task_switch+0x75/0x2a0 ? lock_timer_base+0x67/0x80 ? try_to_del_timer_sync+0x4d/0x80 ? del_timer_sync+0x41/0x50 ? schedule_timeout+0x254/0x2d0 ? md_start_sync+0xe0/0xe0 [md_mod] ? md_thread+0x127/0x160 [md_mod] md_thread+0x127/0x160 [md_mod] ? wait_woken+0x80/0x80 kthread+0x10d/0x130 ? kthread_park+0xa0/0xa0 ret_from_fork+0x1f/0x40 ``` Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code") Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment") Cc: stable@vger.kernel.org Reviewed-by: Gang He <ghe@suse.com> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-14md/bitmap: wait for external bitmap writes to complete during tear downSudhakar Panneerselvam1-0/+2
commit 404a8ef512587b2460107d3272c17a89aef75edf upstream. NULL pointer dereference was observed in super_written() when it tries to access the mddev structure. [The below stack trace is from an older kernel, but the problem described in this patch applies to the mainline kernel.] [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000 [ 1194.488000] RIP: 0010:super_written+0x29/0xe1 [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046 [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048 [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200 [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000 [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200 [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000 [ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0 [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1194.663316] PKRU: 55555554 [ 1194.674090] Call Trace: [ 1194.683735] <IRQ> [ 1194.692948] bio_endio+0xae/0x135 [ 1194.703580] blk_update_request+0xad/0x2fa [ 1194.714990] blk_update_bidi_request+0x20/0x72 [ 1194.726578] __blk_end_bidi_request+0x2c/0x4d [ 1194.738373] __blk_end_request_all+0x31/0x49 [ 1194.749344] blk_flush_complete_seq+0x377/0x383 [ 1194.761550] flush_end_io+0x1dd/0x2a7 [ 1194.772910] blk_finish_request+0x9f/0x13c [ 1194.784544] scsi_end_request+0x180/0x25c [ 1194.796149] scsi_io_completion+0xc8/0x610 [ 1194.807503] scsi_finish_command+0xdc/0x125 [ 1194.818897] scsi_softirq_done+0x81/0xde [ 1194.830062] blk_done_softirq+0xa4/0xcc [ 1194.841008] __do_softirq+0xd9/0x29f [ 1194.851257] irq_exit+0xe6/0xeb [ 1194.861290] do_IRQ+0x59/0xe3 [ 1194.871060] common_interrupt+0x1c6/0x382 [ 1194.881988] </IRQ> [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5 [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43 [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000 [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2 [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003 [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a [ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5 [ 1194.999077] cpuidle_enter+0x17/0x19 [ 1195.008395] call_cpuidle+0x23/0x3a [ 1195.017718] do_idle+0x172/0x1d5 [ 1195.026358] cpu_startup_entry+0x73/0x75 [ 1195.035769] start_secondary+0x1b9/0x20b [ 1195.044894] secondary_startup_64+0xa5/0xa5 [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78 [ 1195.096354] CR2: 00000000000002b8 bio in the above stack is a bitmap write whose completion is invoked after the tear down sequence sets the mddev structure to NULL in rdev. During tear down, there is an attempt to flush the bitmap writes, but for external bitmaps, there is no explicit wait for all the bitmap writes to complete. For instance, md_bitmap_flush() is called to flush the bitmap writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush() could generate new bitmap writes for which there is no explicit wait to complete those writes. The call to md_bitmap_update_sb() will return simply for external bitmaps and the follow-up call to md_update_sb() is conditional and may not get called for external bitmaps. This results in a kernel panic when the completion routine, super_written() is called which tries to reference mddev in the rdev that has been set to NULL(in unbind_rdev_from_array() by tear down sequence). The solution is to call md_super_wait() for external bitmaps after the last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending bitmap writes before proceeding with the tear down. Cc: stable@vger.kernel.org Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com> Reviewed-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11dm rq: fix double free of blk_mq_tag_set in dev remove after table load failsBenjamin Block1-0/+2
commit 8e947c8f4a5620df77e43c9c75310dc510250166 upstream. When loading a device-mapper table for a request-based mapped device, and the allocation/initialization of the blk_mq_tag_set for the device fails, a following device remove will cause a double free. E.g. (dmesg): device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device device-mapper: ioctl: unable to set up device queue for new table. Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0305e098835de000 TEID: 0305e098835de803 Fault in home space mode while using kernel ASCE. AS:000000025efe0007 R3:0000000000000024 Oops: 0038 ilc:3 [#1] SMP Modules linked in: ... lots of modules ... Supported: Yes, External CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G W X 5.3.18-53-default #1 SLE15-SP3 Hardware name: IBM 8561 T01 7I2 (LPAR) Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000 000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000 000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640 00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8 Krnl Code: 000000025e368eb8: c4180041e100 lgrl %r1,25eba50b8 000000025e368ebe: ecba06b93a55 risbg %r11,%r10,6,185,58 #000000025e368ec4: e3b010000008 ag %r11,0(%r1) >000000025e368eca: e310b0080004 lg %r1,8(%r11) 000000025e368ed0: a7110001 tmll %r1,1 000000025e368ed4: a7740129 brc 7,25e369126 000000025e368ed8: e320b0080004 lg %r2,8(%r11) 000000025e368ede: b904001b lgr %r1,%r11 Call Trace: [<000000025e368eca>] kfree+0x42/0x330 [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8 [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod] [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod] [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod] [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod] [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod] [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod] [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0 [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40 [<000000025e8c15ac>] system_call+0xd8/0x2c8 Last Breaking-Event-Address: [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8 Kernel panic - not syncing: Fatal exception: panic_on_oops When allocation/initialization of the blk_mq_tag_set fails in dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer is not reset to NULL; so when dev_remove() later gets into dm_mq_cleanup_mapped_device() it sees the pointer and tries to uninitialize and free it again. Fix this by setting the pointer to NULL in dm_mq_init_request_queue() error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device(). Cc: <stable@vger.kernel.org> # 4.6+ Fixes: 1c357a1e86a4 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device") Signed-off-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11dm integrity: fix missing goto in bitmap_flush_interval error handlingTian Tao1-0/+1
commit 17e9e134a8efabbbf689a0904eee92bb5a868172 upstream. Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11dm space map common: fix division bug in sm_ll_find_free_block()Joe Thornber1-0/+2
commit 5208692e80a1f3c8ce2063a22b675dd5589d1d80 upstream. This division bug meant the search for free metadata space could skip the final allocation bitmap's worth of entries. Fix affects DM thinp, cache and era targets. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Tested-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11dm persistent data: packed struct should have an aligned() attribute tooJoe Thornber2-6/+6
commit a88b2358f1da2c9f9fcc432f2e0a79617fea397c upstream. Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to byte-by-byte access. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload ↵Heinz Mauelshagen1-6/+28
sequences commit f99a8e4373eeacb279bc9696937a55adbff7a28a upstream. If fast table reloads occur during an ongoing reshape of raid4/5/6 devices the target may race reading a superblock vs the the MD resync thread; causing an inconclusive reshape state to be read in its constructor. lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause BUG_ON() to trigger in md_run(), e.g.: "kernel BUG at drivers/md/raid5.c:7567!". Scenario triggering the bug: 1. the MD sync thread calls end_reshape() from raid5_sync_request() when done reshaping. However end_reshape() _only_ updates the reshape position to MaxSector keeping the changed layout configuration though (i.e. any delta disks, chunk sector or RAID algorithm changes). That inconclusive configuration is stored in the superblock. 2. dm-raid constructs a mapping, loading named inconsistent superblock as of step 1 before step 3 is able to finish resetting the reshape state completely, and calls md_run() which leads to mentioned bug in raid5.c. 3. the MD RAID personality's finish_reshape() is called; which resets the reshape information on chunk sectors, delta disks, etc. This explains why the bug is rarely seen on multi-core machines, as MD's finish_reshape() superblock update races with the dm-raid constructor's superblock load in step 2. Fix identifies inconclusive superblock content in the dm-raid constructor and resets it before calling md_run(), factoring out identifying checks into rs_is_layout_change() to share in existing rs_reshape_requested() and new rs_reset_inclonclusive_reshape(). Also enhance a comment and remove an empty line. Cc: stable@vger.kernel.org Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11md/raid1: properly indicate failure when ending a failed write requestPaul Clements1-0/+2
commit 2417b9869b81882ab90fd5ed1081a1cb2d4db1dd upstream. This patch addresses a data corruption bug in raid1 arrays using bitmaps. Without this fix, the bitmap bits for the failed I/O end up being cleared. Since we are in the failure leg of raid1_end_write_request, the request either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded). Fixes: eeba6809d8d5 ("md/raid1: end bio when the device faulty") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Paul Clements <paul.clements@us.sios.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21dm verity fec: fix misaligned RS roots IOJaegeuk Kim2-3/+9
commit 8ca7cab82bda4eb0b8064befeeeaa38106cac637 upstream. commit df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") introduced the possibility for misaligned roots IO relative to the underlying device's logical block size. E.g. Android's default RS roots=2 results in dm_bufio->block_size=1024, which causes the following EIO if the logical block size of the device is 4096, given v->data_dev_block_bits=12: E sd 0 : 0:0:0: [sda] tag#30 request not aligned to the logical block size E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5 Fix this by onlu using f->roots for dm_bufio blocksize IFF it is aligned to v->data_dev_block_bits. Fixes: df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") Cc: stable@vger.kernel.org Signed-off-by: Jaegeuk Kim <jaegeuk@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-30dm ioctl: fix out of bounds array access when no devicesMikulas Patocka1-1/+1
commit 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a upstream. If there are not any dm devices, we need to zero the "dev" argument in the first structure dm_name_list. However, this can cause out of bounds write, because the "needed" variable is zero and len may be less than eight. Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is too small to hold the "nl->dev" value. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-30dm verity: fix DM_VERITY_OPTS_MAX valueJeongHyeon Lee1-1/+1
commit 160f99db943224e55906dd83880da1a704c6e6b9 upstream. Three optional parameters must be accepted at once in a DM verity table, e.g.: (verity_error_handling_mode) (ignore_zero_block) (check_at_most_once) Fix this to be possible by incrementing DM_VERITY_OPTS_MAX. Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com> Fixes: 843f38d382b1 ("dm verity: add 'check_at_most_once' option to only validate hashes once") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-11dm table: fix zoned iterate_devices based device capability checksJeffle Xu1-33/+17
commit 24f6b6036c9eec21191646930ad42808e6180510 upstream. Fix dm_table_supports_zoned_model() and invert logic of both iterate_devices_callout_fn so that all devices' zoned capabilities are properly checked. Add one more parameter to dm_table_any_dev_attr(), which is actually used as the @data parameter of iterate_devices_callout_fn, so that dm_table_matches_zone_sectors() can be replaced by dm_table_any_dev_attr(). Fixes: dd88d313bef02 ("dm table: add zoned block devices validation") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> [jeffle: also convert partial completion check] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-11dm table: fix DAX iterate_devices based device capability checksJeffle Xu3-29/+12
commit 5b0fab508992c2e120971da658ce80027acbc405 upstream. Fix dm_table_supports_dax() and invert logic of both iterate_devices_callout_fn so that all devices' DAX capabilities are properly checked. Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-11dm table: fix iterate_devices based device capability checksJeffle Xu1-49/+54
commit a4c8dd9c2d0987cf542a2a0c42684c9c6d78a04e upstream. According to the definition of dm_iterate_devices_fn: * This function must iterate through each section of device used by the * target until it encounters a non-zero return code, which it then returns. * Returns zero if no callout returned non-zero. For some target type (e.g. dm-stripe), one call of iterate_devices() may iterate multiple underlying devices internally, in which case a non-zero return code returned by iterate_devices_callout_fn will stop the iteration in advance. No iterate_devices_callout_fn should return non-zero unless device iteration should stop. Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and elevate it for reuse to stop iterating (and return non-zero) on the first device that causes iterate_devices_callout_fn to return non-zero. Use dm_table_any_dev_attr() to properly iterate through devices. Rename device_is_nonrot() to device_is_rotational() and invert logic accordingly to fix improper disposition. Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set") Fixes: 4693c9668fdc ("dm table: propagate non rotational flag") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-09dm verity: fix FEC for RS roots unaligned to block sizeMilan Broz1-11/+12
commit df7b59ba9245c4a3115ebaa905e3e5719a3810da upstream. Optional Forward Error Correction (FEC) code in dm-verity uses Reed-Solomon code and should support roots from 2 to 24. The error correction parity bytes (of roots lengths per RS block) are stored on a separate device in sequence without any padding. Currently, to access FEC device, the dm-verity-fec code uses dm-bufio client with block size set to verity data block (usually 4096 or 512 bytes). Because this block size is not divisible by some (most!) of the roots supported lengths, data repair cannot work for partially stored parity bytes. This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT" where we can be sure that the full parity data is always available. (There cannot be partial FEC blocks because parity must cover whole sectors.) Because the optional FEC starting offset could be unaligned to this new block size, we have to use dm_bufio_set_sector_offset() to configure it. The problem is easily reproduced using veritysetup, e.g. for roots=13: # create verity device with RS FEC dd if=/dev/urandom of=data.img bs=4096 count=8 status=none veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash # create an erasure that should be always repairable with this roots setting dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none # try to read it through dm-verity veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash) dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer # wait for possible recursive recovery in kernel udevadm settle veritysetup close test With this fix, errors are properly repaired. device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors ... Without it, FEC code usually ends on unrecoverable failure in RS decoder: device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74 ... This problem is present in all kernels since the FEC code's introduction (kernel 4.5). It is thought that this problem is not visible in Android ecosystem because it always uses a default RS roots=2. Depends-on: a14e5ec66a7a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size") Signed-off-by: Milan Broz <gmazyland@gmail.com> Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: stable@vger.kernel.org # 4.5+ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-09dm bufio: subtract the number of initial sectors in dm_bufio_get_device_sizeMikulas Patocka1-0/+4
commit a14e5ec66a7a66e57b24e2469f9212a78460207e upstream. dm_bufio_get_device_size returns the device size in blocks. Before returning the value, we must subtract the nubmer of starting sectors. The number of starting sectors may not be divisible by block size. Note that currently, no target is using dm_bufio_set_sector_offset and dm_bufio_get_device_size simultaneously, so this change has no effect. However, an upcoming dm-verity-fec fix needs this change. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Update in-core bitset after committing the metadataNikos Tsironis1-6/+19
commit 2099b145d77c1d53f5711f029c37cc537897cee6 upstream. In case of a system crash, dm-era might fail to mark blocks as written in its metadata, although the corresponding writes to these blocks were passed down to the origin device and completed successfully. Consider the following sequence of events: 1. We write to a block that has not been yet written in the current era 2. era_map() checks the in-core bitmap for the current era and sees that the block is not marked as written. 3. The write is deferred for submission after the metadata have been updated and committed. 4. The worker thread processes the deferred write (process_deferred_bios()) and marks the block as written in the in-core bitmap, **before** committing the metadata. 5. The worker thread starts committing the metadata. 6. We do more writes that map to the same block as the write of step (1) 7. era_map() checks the in-core bitmap and sees that the block is marked as written, **although the metadata have not been committed yet**. 8. These writes are passed down to the origin device immediately and the device reports them as completed. 9. The system crashes, e.g., power failure, before the commit from step (5) finishes. When the system recovers and we query the dm-era target for the list of written blocks it doesn't report the aforementioned block as written, although the writes of step (6) completed successfully. The issue is that era_map() decides whether to defer or not a write based on non committed information. The root cause of the bug is that we update the in-core bitmap, **before** committing the metadata. Fix this by updating the in-core bitmap **after** successfully committing the metadata. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: only resize metadata in preresumeNikos Tsironis1-11/+10
commit cca2c6aebe86f68103a8615074b3578e854b5016 upstream. Metadata resize shouldn't happen in the ctr. The ctr loads a temporary (inactive) table that will only become active upon resume. That is why resize should always be done in terms of resume. Otherwise a load (ctr) whose inactive table never becomes active will incorrectly resize the metadata. Also, perform the resize directly in preresume, instead of using the worker to do it. The worker might run other metadata operations, e.g., it could start digestion, before resizing the metadata. These operations will end up using the old size. This could lead to errors, like: device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed device-mapper: era: process_old_eras: digest step failed, stopping digestion The reason of the above error is that the worker started the digestion of the archived writeset using the old, larger size. As a result, metadata_digest_transcribe_writeset tried to write beyond the end of the era array. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Reinitialize bitset cache before digesting a new writesetNikos Tsironis1-6/+6
commit 2524933307fd0036d5c32357c693c021ab09a0b0 upstream. In case of devices with at most 64 blocks, the digestion of consecutive eras uses the writeset of the first era as the writeset of all eras to digest, leading to lost writes. That is, we lose the information about what blocks were written during the affected eras. The digestion code uses a dm_disk_bitset object to access the archived writesets. This structure includes a one word (64-bit) cache to reduce the number of array lookups. This structure is initialized only once, in metadata_digest_start(), when we kick off digestion. But, when we insert a new writeset into the writeset tree, before the digestion of the previous writeset is done, or equivalently when there are multiple writesets in the writeset tree to digest, then all these writesets are digested using the same cache and the cache is not re-initialized when moving from one writeset to the next. For devices with more than 64 blocks, i.e., the size of the cache, the cache is indirectly invalidated when we move to a next set of blocks, so we avoid the bug. But for devices with at most 64 blocks we end up using the same cached data for digesting all archived writesets, i.e., the cache is loaded when digesting the first writeset and it never gets reloaded, until the digestion is done. As a result, the writeset of the first era to digest is used as the writeset of all the following archived eras, leading to lost writes. Fix this by reinitializing the dm_disk_bitset structure, and thus invalidating the cache, every time the digestion code starts digesting a new writeset. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Use correct value size in equality function of writeset treeNikos Tsironis1-1/+1
commit 64f2d15afe7b336aafebdcd14cc835ecf856df4b upstream. Fix the writeset tree equality test function to use the right value size when comparing two btree values. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Fix bitset memory leaksNikos Tsironis1-0/+6
commit 904e6b266619c2da5c58b5dce14ae30629e39645 upstream. Deallocate the memory allocated for the in-core bitsets when destroying the target and in error paths. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Verify the data block size hasn't changedNikos Tsironis1-1/+9
commit c8e846ff93d5eaa5384f6f325a1687ac5921aade upstream. dm-era doesn't support changing the data block size of existing devices, so check explicitly that the requested block size for a new target matches the one stored in the metadata. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm era: Recover committed writeset after crashNikos Tsironis1-8/+9
commit de89afc1e40fdfa5f8b666e5d07c43d21a1d3be0 upstream. Following a system crash, dm-era fails to recover the committed writeset for the current era, leading to lost writes. That is, we lose the information about what blocks were written during the affected era. dm-era assumes that the writeset of the current era is archived when the device is suspended. So, when resuming the device, it just moves on to the next era, ignoring the committed writeset. This assumption holds when the device is properly shut down. But, when the system crashes, the code that suspends the target never runs, so the writeset for the current era is not archived. There are three issues that cause the committed writeset to get lost: 1. dm-era doesn't load the committed writeset when opening the metadata 2. The code that resizes the metadata wipes the information about the committed writeset (assuming it was loaded at step 1) 3. era_preresume() starts a new era, without taking into account that the current era might not have been archived, due to a system crash. To fix this: 1. Load the committed writeset when opening the metadata 2. Fix the code that resizes the metadata to make sure it doesn't wipe the loaded writeset 3. Fix era_preresume() to check for a loaded writeset and archive it, before starting a new era. Fixes: eec40579d84873 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm writecache: fix writing beyond end of underlying device when shrinkingMikulas Patocka1-0/+18
commit 4134455f2aafdfeab50cabb4cccb35e916034b93 upstream. Do not attempt to write any data beyond the end of the underlying data device while shrinking it. The DM writecache device must be suspended when the underlying data device is shrunk. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04dm: fix deadlock when swapping to encrypted deviceMikulas Patocka3-0/+65
commit a666e5c05e7c4aaabb2c5d58117b0946803d03d2 upstream. The system would deadlock when swapping to a dm-crypt device. The reason is that for each incoming write bio, dm-crypt allocates memory that holds encrypted data. These excessive allocations exhaust all the memory and the result is either deadlock or OOM trigger. This patch limits the number of in-flight swap bios, so that the memory consumed by dm-crypt is limited. The limit is enforced if the target set the "limit_swap_bios" variable and if the bio has REQ_SWAP set. Non-swap bios are not affected becuase taking the semaphore would cause performance degradation. This is similar to request-based drivers - they will also block when the number of requests is over the limit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04bcache: Move journal work to new flush wqKai Krakow3-2/+19
commit afe78ab46f638ecdf80a35b122ffc92c20d9ae5d upstream. This is potentially long running and not latency sensitive, let's get it out of the way of other latency sensitive events. As observed in the previous commit, the `system_wq` comes easily congested by bcache, and this fixes a few more stalls I was observing every once in a while. Let's not make this `WQ_MEM_RECLAIM` as it showed to reduce performance of boot and file system operations in my tests. Also, without `WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches the previous behavior as `system_wq` also does no memory reclaim: > // workqueue.c: > system_wq = alloc_workqueue("events", 0, 0); Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04bcache: Give btree_io_wq correct semantics againKai Krakow1-1/+1
commit d797bd9897e3559eb48d68368550d637d32e468c upstream. Before killing `btree_io_wq`, the queue was allocated using `create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After killing it, it no longer had this property but `system_wq` is not single threaded. Let's combine both worlds and make it multi threaded but able to reclaim memory. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04Revert "bcache: Kill btree_io_wq"Kai Krakow3-2/+25
commit 9f233ffe02e5cef611100cd8c5bcf4de26ca7bef upstream. This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee. With the btree using the `system_wq`, I seem to see a lot more desktop latency than I should. After some more investigation, it looks like the original assumption of 56b3077 no longer is true, and bcache has a very high potential of congesting the `system_wq`. In turn, this introduces laggy desktop performance, IO stalls (at least with btrfs), and input events may be delayed. So let's revert this. It's important to note that the semantics of using `system_wq` previously mean that `btree_io_wq` should be created before and destroyed after other bcache wqs to keep the same assumptions. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Kai Krakow <kai@kaishome.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10md: Set prev_flush_start and flush_bio in an atomic wayXiao Ni1-0/+2
commit dc5d17a3c39b06aef866afca19245a9cfb533a79 upstream. One customer reports a crash problem which causes by flush request. It triggers a warning before crash. /* new request after previous flush is completed */ if (ktime_after(req_start, mddev->prev_flush_start)) { WARN_ON(mddev->flush_bio); mddev->flush_bio = bio; bio = NULL; } The WARN_ON is triggered. We use spin lock to protect prev_flush_start and flush_bio in md_flush_request. But there is no lock protection in md_submit_flush_data. It can set flush_bio to NULL first because of compiler reordering write instructions. For example, flush bio1 sets flush bio to NULL first in md_submit_flush_data. An interrupt or vmware causing an extended stall happen between updating flush_bio and prev_flush_start. Because flush_bio is NULL, flush bio2 can get the lock and submit to underlayer disks. Then flush bio1 updates prev_flush_start after the interrupt or extended stall. Then flush bio3 enters in md_flush_request. The start time req_start is behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK again. INIT_WORK() will re-initialize the list pointers in the work_struct, which then can result in a corrupted work list and the work_struct queued a second time. With the work list corrupted, it can lead in invalid work items being used and cause a crash in process_one_work. We need to make sure only one flush bio can be handled at one same time. So add spin lock in md_submit_flush_data to protect prev_flush_start and flush_bio in an atomic way. Reviewed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30dm integrity: conditionally disable "recalculate" featureMikulas Patocka1-1/+23
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream. Otherwise a malicious user could (ab)use the "recalculate" feature that makes dm-integrity calculate the checksums in the background while the device is already usable. When the system restarts before all checksums have been calculated, the calculation continues where it was interrupted even if the recalculate feature is not requested the next time the dm device is set up. Disable recalculating if we use internal_hash or journal_hash with a key (e.g. HMAC) and we don't have the "legacy_recalculate" flag. This may break activation of a volume, created by an older kernel, that is not yet fully recalculated -- if this happens, the user should add the "legacy_recalculate" flag to constructor parameters. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Daniel Glockner <dg@emlix.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm integrity: select CRYPTO_SKCIPHERAnthony Iliopoulos1-0/+1
[ Upstream commit f7b347acb5f6c29d9229bb64893d8b6a2c7949fb ] The integrity target relies on skcipher for encryption/decryption, but certain kernel configurations may not enable CRYPTO_SKCIPHER, leading to compilation errors due to unresolved symbols. Explicitly select CRYPTO_SKCIPHER for DM_INTEGRITY, since it is unconditionally dependent on it. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27dm integrity: fix a crash if "recalculate" used without "internal_hash"Mikulas Patocka1-0/+6
commit 2d06dfecb132a1cc2e374a44eae83b5c4356b8b4 upstream. Recalculate can only be specified with internal_hash. Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27dm: avoid filesystem lookup in dm_get_dev_t()Hannes Reinecke1-3/+12
commit 809b1e4945774c9ec5619a8f4e2189b7b3833c0c upstream. This reverts commit 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()") dm_get_dev_t() is just used to convert an arbitrary 'path' string into a dev_t. It doesn't presume that the device is present; that check will be done later, as the only caller is dm_get_device(), which does a dm_get_table_device() later on, which will properly open the device. So if the path string already _is_ in major:minor representation we can convert it directly, avoiding a recursion into the filesystem to lookup the block device. This avoids a hang in multipath_message() when the filesystem is inaccessible. Fixes: 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()") Cc: stable@vger.kernel.org Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm: eliminate potential source of excessive kernel log noiseMike Snitzer1-1/+1
commit 0378c625afe80eb3f212adae42cc33c9f6f31abf upstream. There wasn't ever a real need to log an error in the kernel log for ioctls issued with insufficient permissions. Simply return an error and if an admin/user is sufficiently motivated they can enable DM's dynamic debugging to see an explanation for why the ioctls were disallowed. Reported-by: Nir Soffer <nsoffer@redhat.com> Fixes: e980f62353c6 ("dm: don't allow ioctls to targets that don't map to whole devices") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm integrity: fix flush with external metadata deviceMikulas Patocka2-8/+54
[ Upstream commit 9b5948267adc9e689da609eb61cf7ed49cae5fa8 ] With external metadata device, flush requests are not passed down to the data device. Fix this by submitting the flush request in dm_integrity_flush_buffers. In order to not degrade performance, we overlap the data device flush with the metadata device flush. Reported-by: Lukas Straub <lukasstraub2@web.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19dm integrity: fix the maximum number of argumentsMikulas Patocka1-1/+1
commit 17ffc193cdc6dc7a613d00d8ad47fc1f801b9bf0 upstream. Advance the maximum number of arguments from 9 to 15 to account for all potential feature flags that may be supplied. Linux 4.19 added "meta_device" (356d9d52e1221ba0c9f10b8b38652f78a5298329) and "recalculate" (a3fcf7253139609bf9ff901fbf955fba047e75dd) flags. Commit 468dfca38b1a6fbdccd195d875599cb7c8875cd9 added "sectors_per_bit" and "bitmap_flush_interval". Commit 84597a44a9d86ac949900441cea7da0af0f2f473 added "allow_discards". And the commit d537858ac8aaf4311b51240893add2fc62003b97 added "fix_padding". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm snapshot: flush merged data before committing metadataAkilesh Kailash1-0/+24
commit fcc42338375a1e67b8568dbb558f8b784d0f3b01 upstream. If the origin device has a volatile write-back cache and the following events occur: 1: After finishing merge operation of one set of exceptions, merge_callback() is invoked. 2: Update the metadata in COW device tracking the merge completion. This update to COW device is flushed cleanly. 3: System crashes and the origin device's cache where the recent merge was completed has not been flushed. During the next cycle when we read the metadata from the COW device, we will skip reading those metadata whose merge was completed in step (1). This will lead to data loss/corruption. To address this, flush the origin device post merge IO before updating the metadata. Cc: stable@vger.kernel.org Signed-off-by: Akilesh Kailash <akailash@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19dm raid: fix discard limits for raid1Mike Snitzer1-3/+3
commit cc07d72bf350b77faeffee1c37bc52197171473f upstream. Block core warned that discard_granularity was 0 for dm-raid with personality of raid1. Reason is that raid_io_hints() was incorrectly special-casing raid1 rather than raid0. Fix raid_io_hints() by removing discard limits settings for raid1. Check for raid0 instead. Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface") Cc: stable@vger.kernel.org Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Stephan Bärwolf <stephan@matrixstorm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06dm verity: skip verity work if I/O error when system is shutting downHyeongseok Kim1-1/+11
[ Upstream commit 252bd1256396cebc6fc3526127fdb0b317601318 ] If emergency system shutdown is called, like by thermal shutdown, a dm device could be alive when the block device couldn't process I/O requests anymore. In this state, the handling of I/O errors by new dm I/O requests or by those already in-flight can lead to a verity corruption state, which is a misjudgment. So, skip verity work in response to I/O error when system is shutting down. Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-06md/raid10: initialize r10_bio->read_slot before use.Kevin Vigor1-1/+2
commit 93decc563637c4288380912eac0eb42fb246cc04 upstream. In __make_request() a new r10bio is allocated and passed to raid10_read_request(). The read_slot member of the bio is not initialized, and the raid10_read_request() uses it to index an array. This leads to occasional panics. Fix by initializing the field to invalid value and checking for valid value in raid10_read_request(). Cc: stable@vger.kernel.org Signed-off-by: Kevin Vigor <kvigor@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30md/cluster: fix deadlock when node is doing resync jobZhao Heming2-31/+42
commit bca5b0658020be90b6b504ca514fd80110204f71 upstream. md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg. During sending msg, node can concurrently receive msg from another node. When node does resync job, grab token_lockres:EX may trigger a deadlock: ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever ``` Explaination: a. A send METADATA_UPDATED This will block another node to send msg b. B does sync jobs, which will send RESYNCING at intervals. This will be block for holding token_lockres:EX lock. c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. d. B recv METADATA_UPDATED msg, which send from A in step <a>. This will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") In that commit, step c is "update sb". This patch step c is "mdadm --remove". For fixing this issue, we can refer the solution of function: metadata_update_start. Which does the same grab lock_token action. lock_comm can use the same steps to avoid deadlock. By moving MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock. Repro steps (I only triggered 3 times with hundreds tests): two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` test script will hung when executing "mdadm --remove". ``` # dump stacks by "echo t > /proc/sysrq-trigger" md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` At last, thanks for Xiao's solution. Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Suggested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>