summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2020-11-05md/raid5: fix oops during stripe resizingSong Liu1-2/+2
commit b44c018cdf748b96b676ba09fdbc5b34fc443ada upstream. KoWei reported crash during raid5 reshape: [ 1032.252932] Oops: 0002 [#1] SMP PTI [...] [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10 [...] [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286 [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000 [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000 [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8 [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8 [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000 [ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000 [ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0 [ 1032.252962] Call Trace: [ 1032.252969] ? async_memcpy+0x179/0x1000 [async_memcpy] [ 1032.252977] ? raid5_release_stripe+0x8e/0x110 [raid456] [ 1032.252982] handle_stripe_expansion+0x15a/0x1f0 [raid456] [ 1032.252988] handle_stripe+0x592/0x1270 [raid456] [ 1032.252993] handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456] [ 1032.252999] raid5d+0x35c/0x550 [raid456] [ 1032.253002] ? schedule+0x42/0xb0 [ 1032.253006] ? schedule_timeout+0x10e/0x160 [ 1032.253011] md_thread+0x97/0x160 [ 1032.253015] ? wait_woken+0x80/0x80 [ 1032.253019] kthread+0x104/0x140 [ 1032.253022] ? md_start_sync+0x60/0x60 [ 1032.253024] ? kthread_park+0x90/0x90 [ 1032.253027] ret_from_fork+0x35/0x40 This is because cache_size_mutex was unlocked too early in resize_stripes, which races with grow_one_stripe() that grow_one_stripe() allocates a stripe with wrong pool_size. Fix this issue by unlocking cache_size_mutex after updating pool_size. Cc: <stable@vger.kernel.org> # v4.4+ Reported-by: KoWei Sung <winders@amazon.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05md/bitmap: md_bitmap_get_counter returns wrong blocksZhao Heming1-1/+1
[ Upstream commit d837f7277f56e70d82b3a4a037d744854e62f387 ] md_bitmap_get_counter() has code: ``` if (bitmap->bp[page].hijacked || bitmap->bp[page].map == NULL) csize = ((sector_t)1) << (bitmap->chunkshift + PAGE_COUNTER_SHIFT - 1); ``` The minus 1 is wrong, this branch should report 2048 bits of space. With "-1" action, this only report 1024 bit of space. This bug code returns wrong blocks, but it doesn't inflence bitmap logic: 1. Most callers focus this function return value (the counter of offset), not the parameter blocks. 2. The bug is only triggered when hijacked is true or map is NULL. the hijacked true condition is very rare. the "map == null" only true when array is creating or resizing. 3. Even the caller gets wrong blocks, current code makes caller just to call md_bitmap_get_counter() one more time. Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01bcache: fix a lost wake-up problem caused by mca_cannibalize_lockGuoju Fang3-4/+10
[ Upstream commit 34cf78bf34d48dddddfeeadb44f9841d7864997a ] This patch fix a lost wake-up problem caused by the race between mca_cannibalize_lock and bch_cannibalize_unlock. Consider two processes, A and B. Process A is executing mca_cannibalize_lock, while process B takes c->btree_cache_alloc_lock and is executing bch_cannibalize_unlock. The problem happens that after process A executes cmpxchg and will execute prepare_to_wait. In this timeslice process B executes wake_up, but after that process A executes prepare_to_wait and set the state to TASK_INTERRUPTIBLE. Then process A goes to sleep but no one will wake up it. This problem may cause bcache device to dead. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-09dm thin metadata: Avoid returning cmd->bm wild pointer on errorYe Bin1-2/+6
commit 219403d7e56f9b716ad80ab87db85d29547ee73e upstream. Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09dm cache metadata: Avoid returning cmd->bm wild pointer on errorYe Bin1-2/+6
commit d16ff19e69ab57e08bf908faaacbceaf660249de upstream. Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03overflow.h: Add allocation size calculation helpersKees Cook1-5/+5
commit 610b15c50e86eb1e4b77274fabcaea29ac72d6a8 upstream. In preparation for replacing unchecked overflows for memory allocations, this creates helpers for the 3 most common calculations: array_size(a, b): 2-dimensional array array3_size(a, b, c): 3-dimensional array struct_size(ptr, member, n): struct followed by n-many trailing members Each of these return SIZE_MAX on overflow instead of wrapping around. (Additionally renames a variable named "array_size" to avoid future collision.) Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21dm cache: remove all obsolete writethrough-specific codeMike Snitzer1-81/+1
commit 9958f1d9a04efb3db19134482b3f4c6897e0e7b8 upstream. Now that the writethrough code is much simpler there is no need to track so much state or cascade bio submission (as was done, via writethrough_endio(), to issue origin then cache IO in series). As such the obsolete writethrough list and workqueue is also removed. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21dm cache: submit writethrough writes in parallel to origin and cacheMike Snitzer1-17/+37
commit 2df3bae9a6543e90042291707b8db0cbfbae9ee9 upstream. Discontinue issuing writethrough write IO in series to the origin and then cache. Use bio_clone_fast() to create a new origin clone bio that will be mapped to the origin device and then bio_chain() it to the bio that gets remapped to the cache device. The origin clone bio does _not_ have a copy of the per_bio_data -- as such check_if_tick_bio_needed() will not be called. The cache bio (parent bio) will not complete until the origin bio has completed -- this fulfills bio_clone_fast()'s requirements as well as the requirement to not complete the original IO until the write IO has completed to both the origin and cache device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21dm cache: pass cache structure to mode functionsMike Snitzer1-16/+16
commit 8e3c3827776fc93728c0c8d7c7b731226dc6ee23 upstream. No functional changes, just a bit cleaner than passing cache_features structure. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()Ming Lei1-3/+0
[ Upstream commit e766668c6cd49d741cfb49eaeb38998ba34d27bc ] dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't formally stop the blk-mq queue; therefore there is no point making the blk_mq_queue_stopped() check -- it will never be stopped. In addition, even though dm_stop_queue() actually tries to quiesce hw queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced() to avoid unnecessary queue quiesce isn't reliable because: the QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and dm_stop_queue() may be called when synchronize_rcu() from another blk_mq_quiesce_queue() is in-progress. Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21bcache: allocate meta data pages as compound pagesColy Li4-5/+5
commit 5fe48867856367142d91a82f2cbf7a57a24cbb70 upstream. There are some meta data of bcache are allocated by multiple pages, and they are used as bio bv_page for I/Os to the cache device. for example cache_set->uuids, cache->disk_buckets, journal_write->data, bset_tree->data. For such meta data memory, all the allocated pages should be treated as a single memory block. Then the memory management and underlying I/O code can treat them more clearly. This patch adds __GFP_COMP flag to all the location allocating >0 order pages for the above mentioned meta data. Then their pages are treated as compound pages now. Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21md/raid5: Fix Force reconstruct-write io stuck in degraded raid5ChangSyun Peng1-1/+2
commit a1c6ae3d9f3dd6aa5981a332a6f700cf1c25edef upstream. In degraded raid5, we need to read parity to do reconstruct-write when data disks fail. However, we can not read parity from handle_stripe_dirtying() in force reconstruct-write mode. Reproducible Steps: 1. Create degraded raid5 mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing 2. Set rmw_level to 0 echo 0 > /sys/block/md2/md/rmw_level 3. IO to raid5 Now some io may be stuck in raid5. We can use handle_stripe_fill() to read the parity in this situation. Cc: <stable@vger.kernel.org> # v4.4+ Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-21bcache: fix super block seq numbers comparision in register_cache_set()Coly Li1-1/+8
[ Upstream commit 117f636ea695270fe492d0c0c9dfadc7a662af47 ] In register_cache_set(), c is pointer to struct cache_set, and ca is pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this registering cache has up to date version and other members, the in- memory version and other members should be updated to the newer value. But current implementation makes a cache set only has a single cache device, so the above assumption works well except for a special case. The execption is when a cache device new created and both ca->sb.seq and c->sb.seq are 0, because the super block is never flushed out yet. In the location for the following if() check, 2156 if (ca->sb.seq > c->sb.seq) { 2157 c->sb.version = ca->sb.version; 2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16); 2159 c->sb.flags = ca->sb.flags; 2160 c->sb.seq = ca->sb.seq; 2161 pr_debug("set version = %llu\n", c->sb.version); 2162 } c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0, the if() check will fail (because both values are 0), and the cache set version, set_uuid, flags and seq won't be updated. The above problem is hiden for current code, because the bucket size is compatible among different super block version. And the next time when running cache set again, ca->sb.seq will be larger than 0 and cache set super block version will be updated properly. But if the large bucket feature is enabled, sb->bucket_size is the low 16bits of the bucket size. For a power of 2 value, when the actual bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then read_super_common() will fail because the if() check to is_power_of_2(sb->bucket_size) is false. This is how the long time hidden bug is triggered. This patch modifies the if() check to the following way, 2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) { Then cache set's version, set_uuid, flags and seq will always be updated corectly including for a new created cache device. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21md-cluster: fix wild pointer of unlock_all_bitmaps()Zhao Heming1-0/+1
[ Upstream commit 60f80d6f2d07a6d8aee485a1d1252327eeee0c81 ] reproduction steps: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb node1 # mdadm -G /dev/md0 -b none mdadm: failed to remove clustered bitmap. node1 # mdadm -S --scan ^C <==== mdadm hung & kernel crash ``` kernel stack: ``` [ 335.230657] general protection fault: 0000 [#1] SMP NOPTI [...] [ 335.230848] Call Trace: [ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster] [ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster] [ 335.230899] leave+0x10f/0x190 [md_cluster] [ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod] [ 335.230947] ? leave+0x5/0x190 [md_cluster] [ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod] [ 335.230999] md_bitmap_free+0x142/0x150 [md_mod] [ 335.231013] ? _cond_resched+0x15/0x40 [ 335.231025] ? mutex_lock+0xe/0x30 [ 335.231056] __md_stop+0x1c/0xa0 [md_mod] [ 335.231083] do_md_stop+0x160/0x580 [md_mod] [ 335.231119] ? 0xffffffffc05fb078 [ 335.231148] md_ioctl+0xa04/0x1930 [md_mod] [ 335.231165] ? filename_lookup+0xf2/0x190 [ 335.231179] blkdev_ioctl+0x93c/0xa10 [ 335.231205] ? _cond_resched+0x15/0x40 [ 335.231214] ? __check_object_size+0xd4/0x1a0 [ 335.231224] block_ioctl+0x39/0x40 [ 335.231243] do_vfs_ioctl+0xa0/0x680 [ 335.231253] ksys_ioctl+0x70/0x80 [ 335.231261] __x64_sys_ioctl+0x16/0x20 [ 335.231271] do_syscall_64+0x65/0x1f0 [ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22dm: use noio when sending kobject eventMikulas Patocka1-3/+12
commit 6958c1c640af8c3f40fa8a2eee3b5b905d95b677 upstream. kobject_uevent may allocate memory and it may be called while there are dm devices suspended. The allocation may recurse into a suspended device, causing a deadlock. We must set the noio flag when sending a uevent. The observed deadlock was reported here: https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html Reported-by: Khazhismel Kumykov <khazhy@google.com> Reported-by: Tahsin Erdogan <tahsin@google.com> Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09dm zoned: assign max_io_len correctlyHou Tao1-1/+1
commit 7b2377486767503d47265e4d487a63c651f6b55d upstream. The unit of max_io_len is sector instead of byte (spotted through code review), so fix it. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-25md: add feature flag MD_FEATURE_RAID0_LAYOUTNeilBrown2-0/+16
[ Upstream commit 33f2c35a54dfd75ad0e7e86918dcbe4de799a56c ] Due to a bug introduced in Linux 3.14 we cannot determine the correctly layout for a multi-zone RAID0 array - there are two possibilities. It is possible to tell the kernel which to chose using a module parameter, but this can be clumsy to use. It would be best if the choice were recorded in the metadata. So add a feature flag for this purpose. If it is set, then the 'layout' field of the superblock is used to determine which layout to use. If this flag is not set, then mddev->layout gets set to -1, which causes the module parameter to be required. Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25bcache: fix potential deadlock problem in btree_gc_coalesceZhiqiang Liu1-2/+6
[ Upstream commit be23e837333a914df3f24bf0b32e87b0331ab8d1 ] coccicheck reports: drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417 In btree_gc_coalesce func, if the coalescing process fails, we will goto to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock. Then, it will cause a deadlock when trying to acquire new_nodes[i]-> write_lock for freeing new_nodes[i] before return. btree_gc_coalesce func details as follows: if alloc new_nodes[i] fails: goto out_nocoalesce; // obtain new_nodes[i]->write_lock mutex_lock(&new_nodes[i]->write_lock) // main coalescing process for (i = nodes - 1; i > 0; --i) [snipped] if coalescing process fails: // Here, directly goto out_nocoalesce // tag will cause a deadlock goto out_nocoalesce; [snipped] // release new_nodes[i]->write_lock mutex_unlock(&new_nodes[i]->write_lock) // coalesing succ, return return; out_nocoalesce: btree_node_free(new_nodes[i]) // free new_nodes[i] // obtain new_nodes[i]->write_lock mutex_lock(&new_nodes[i]->write_lock); // set flag for reuse clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags); // release new_nodes[i]->write_lock mutex_unlock(&new_nodes[i]->write_lock); To fix the problem, we add a new tag 'out_unlock_nocoalesce' for releasing new_nodes[i]->write_lock before out_nocoalesce tag. If coalescing process fails, we will go to out_unlock_nocoalesce tag for releasing new_nodes[i]->write_lock before free new_nodes[i] in out_nocoalesce tag. (Coly Li helps to clean up commit log format.) Fixes: 2a285686c109816 ("bcache: btree locking rework") Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zoneHannes Reinecke2-4/+4
[ Upstream commit 489dc0f06a5837f87482c0ce61d830d24e17082e ] The only case where dmz_get_zone_for_reclaim() cannot return a zone is if the respective lists are empty. So we should just return a simple NULL value here as we really don't have an error code which would make sense. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25dm mpath: switch paths in dm_blk_ioctl() code pathMartin Wilck1-1/+1
[ Upstream commit 2361ae595352dec015d14292f1b539242d8446d6 ] SCSI LUN passthrough code such as qemu's "scsi-block" device model pass every IO to the host via SG_IO ioctls. Currently, dm-multipath calls choose_pgpath() only in the block IO code path, not in the ioctl code path (unless current_pgpath is NULL). This has the effect that no path switching and thus no load balancing is done for SCSI-passthrough IO, unless the active path fails. Fix this by using the same logic in multipath_prepare_ioctl() as in multipath_clone_and_map(). Note: The allegedly best path selection algorithm, service-time, still wouldn't work perfectly, because the io size of the current request is always set to 0. Changing that for the IO passthrough case would require the ioctl cmd and arg to be passed to dm's prepare_ioctl() method. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-20dm crypt: avoid truncating the logical block sizeEric Biggers1-1/+1
commit 64611a15ca9da91ff532982429c44686f4593b5f upstream. queue_limits::logical_block_size got changed from unsigned short to unsigned int, but it was forgotten to update crypt_io_hints() to use the new type. Fix it. Fixes: ad6bf88a6c19 ("block: fix an integer overflow in logical block size") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20md: don't flush workqueue unconditionally in md_openGuoqing Jiang1-1/+2
[ Upstream commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae ] We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> #2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-05dm verity fec: fix hash block number in verity_fec_decodeSunwook Eom1-1/+1
commit ad4e80a639fc61d5ecebb03caa5cdbfb91fcebfc upstream. The error correction data is computed as if data and hash blocks were concatenated. But hash block number starts from v->hash_start. So, we have to calculate hash block number based on that. Fixes: a739ff3f543af ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org Signed-off-by: Sunwook Eom <speed.eom@samsung.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-24dm flakey: check for null arg_name in parse_features()Goldwyn Rodrigues1-0/+5
[ Upstream commit 7690e25302dc7d0cd42b349e746fe44b44a94f2b ] One can crash dm-flakey by specifying more feature arguments than the number of features supplied. Checking for null in arg_name avoids this. dmsetup create flakey-test --table "0 66076080 flakey /dev/sdb9 0 0 180 2 drop_writes" Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-24dm zoned: remove duplicate nr_rnd_zones increase in dmz_init_zone()Bob Liu1-1/+0
[ Upstream commit b8fdd090376a7a46d17db316638fe54b965c2fb0 ] zmd->nr_rnd_zones was increased twice by mistake. The other place it is increased in dmz_init_zone() is the only one needed: 1131 zmd->nr_useable_zones++; 1132 if (dmz_is_rnd(zone)) { 1133 zmd->nr_rnd_zones++; ^^^ Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-24dm verity fec: fix memory leak in verity_fec_dtrShetty, Harshini X (EXT-Sony Mobile)1-0/+1
commit 75fa601934fda23d2f15bf44b09c2401942d8e15 upstream. Fix below kmemleak detected in verity_fec_ctr. output_pool is allocated for each dm-verity-fec device. But it is not freed when dm-table for the verity target is removed. Hence free the output mempool in destructor function verity_fec_dtr. unreferenced object 0xffffffffa574d000 (size 4096): comm "init", pid 1667, jiffies 4294894890 (age 307.168s) hex dump (first 32 bytes): 8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00 .6..f........... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000060e82407>] __kmalloc+0x2b4/0x340 [<00000000dd99488f>] mempool_kmalloc+0x18/0x20 [<000000002560172b>] mempool_init_node+0x98/0x118 [<000000006c3574d2>] mempool_init+0x14/0x20 [<0000000008cb266e>] verity_fec_ctr+0x388/0x3b0 [<000000000887261b>] verity_ctr+0x87c/0x8d0 [<000000002b1e1c62>] dm_table_add_target+0x174/0x348 [<000000002ad89eda>] table_load+0xe4/0x328 [<000000001f06f5e9>] dm_ctl_ioctl+0x3b4/0x5a0 [<00000000bee5fbb7>] do_vfs_ioctl+0x5dc/0x928 [<00000000b475b8f5>] __arm64_sys_ioctl+0x70/0x98 [<000000005361e2e8>] el0_svc_common+0xa0/0x158 [<000000001374818f>] el0_svc_handler+0x6c/0x88 [<000000003364e9f4>] el0_svc+0x8/0xc [<000000009d84cec9>] 0xffffffffffffffff Fixes: a739ff3f543af ("dm verity: add support for forward error correction") Depends-on: 6f1c819c219f7 ("dm: convert to bioset_init()/mempool_init()") Cc: stable@vger.kernel.org Signed-off-by: Harshini Shetty <harshini.x.shetty@sony.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02dm bio record: save/restore bi_end_io and bi_integrityMike Snitzer1-0/+15
[ Upstream commit 1b17159e52bb31f982f82a6278acd7fab1d3f67b ] Also, save/restore __bi_remaining in case the bio was used in a BIO_CHAIN (e.g. due to blk_queue_split). Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-11dm integrity: fix a deadlock due to offloading to an incorrect workqueueMikulas Patocka1-2/+13
commit 53770f0ec5fd417429775ba006bc4abe14002335 upstream. If we need to perform synchronous I/O in dm_integrity_map_continue(), we must make sure that we are not in the map function - in order to avoid the deadlock due to bio queuing in generic_make_request. To avoid the deadlock, we offload the request to metadata_wq. However, metadata_wq also processes metadata updates for write requests. If there are too many requests that get offloaded to metadata_wq at the beginning of dm_integrity_map_continue, the workqueue metadata_wq becomes clogged and the system is incapable of processing any metadata updates. This causes a deadlock because all the requests that need to do metadata updates wait for metadata_wq to proceed and metadata_wq waits inside wait_and_add_new_range until some existing request releases its range lock (which doesn't happen because the range lock is released after metadata update). In order to fix the deadlock, we create a new workqueue offload_wq and offload requests to it - so that processing of offload_wq is independent from processing of metadata_wq. Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org # v4.12+ Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11dm cache: fix a crash due to incorrect work item cancellingMikulas Patocka1-2/+2
commit 7cdf6a0aae1cccf5167f3f04ecddcf648b78e289 upstream. The crash can be reproduced by running the lvm2 testsuite test lvconvert-thin-external-cache.sh for several minutes, e.g.: while :; do make check T=shell/lvconvert-thin-external-cache.sh; done The crash happens in this call chain: do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset -> memset -> __memset -- which accesses an invalid pointer in the vmalloc area. The work entry on the workqueue is executed even after the bitmap was freed. The problem is that cancel_delayed_work doesn't wait for the running work item to finish, so the work item can continue running and re-submitting itself even after cache_postsuspend. In order to make sure that the work item won't be running, we must use cancel_delayed_work_sync. Also, change flush_workqueue to drain_workqueue, so that if some work item submits itself or another work item, we are properly waiting for both of them. Fixes: c6b4fcbad044 ("dm: add cache target") Cc: stable@vger.kernel.org # v3.9 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28bcache: explicity type cast in bset_bkey_last()Coly Li1-1/+2
[ Upstream commit 7c02b0055f774ed9afb6e1c7724f33bf148ffdc0 ] In bset.h, macro bset_bkey_last() is defined as, bkey_idx((struct bkey *) (i)->d, (i)->keys) Parameter i can be variable type of data structure, the macro always works once the type of struct i has member 'd' and 'keys'. bset_bkey_last() is also used in macro csum_set() to calculate the checksum of a on-disk data structure. When csum_set() is used to calculate checksum of on-disk bcache super block, the parameter 'i' data type is struct cache_sb_disk. Inside struct cache_sb_disk (also in struct cache_sb) the member keys is __u16 type. But bkey_idx() expects unsigned int (a 32bit width), so there is problem when sending parameters via stack to call bkey_idx(). Sparse tool from Intel 0day kbuild system reports this incompatible problem. bkey_idx() is part of user space API, so the simplest fix is to cast the (i)->keys to unsigned int type in macro bset_bkey_last(). Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15dm: fix potential for q->make_request_fn NULL pointerMike Snitzer1-2/+7
commit 47ace7e012b9f7ad71d43ac9063d335ea3d6820b upstream. Move blk_queue_make_request() to dm.c:alloc_dev() so that q->make_request_fn is never NULL during the lifetime of a DM device (even one that is created without a DM table). Otherwise generic_make_request() will crash simply by doing: dmsetup create -n test mount /dev/dm-N /mnt While at it, move ->congested_data initialization out of dm.c:alloc_dev() and into the bio-based specific init method. Reported-by: Stefan Bader <stefan.bader@canonical.com> BugLink: https://bugs.launchpad.net/bugs/1860231 Fixes: ff36ab34583a ("dm: remove request-based logic from make_request_fn wrapper") Depends-on: c12c9a3c3860c ("dm: various cleanups to md->queue initialization code") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> [smb: adjusted for context and dm_init_md_queue() exitsting in older kernels] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15dm crypt: fix benbi IV constructor crash if used in authenticated modeMilan Broz1-2/+8
commit 4ea9471fbd1addb25a4d269991dc724e200ca5b5 upstream. If benbi IV is used in AEAD construction, for example: cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256 the constructor uses wrong skcipher function and crashes: BUG: kernel NULL pointer dereference, address: 00000014 ... EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt] Call Trace: ? crypt_subkey_size+0x20/0x20 [dm_crypt] crypt_ctr+0x567/0xfc0 [dm_crypt] dm_table_add_target+0x15f/0x340 [dm_mod] Fix this by properly using crypt_aead_blocksize() in this case. Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Cc: stable@vger.kernel.org # v4.12+ Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051 Reported-by: Jerad Simpson <jbsimpson@gmail.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15dm space map common: fix to ensure new block isn't already in useJoe Thornber4-3/+37
commit 4feaef830de7ffdd8352e1fe14ad3bf13c9688f8 upstream. The space-maps track the reference counts for disk blocks allocated by both the thin-provisioning and cache targets. There are variants for tracking metadata blocks and data blocks. Transactionality is implemented by never touching blocks from the previous transaction, so we can rollback in the event of a crash. When allocating a new block we need to ensure the block is free (has reference count of 0) in both the current and previous transaction. Prior to this fix we were doing this by searching for a free block in the previous transaction, and relying on a 'begin' counter to track where the last allocation in the current transaction was. This 'begin' field was not being updated in all code paths (eg, increment of a data block reference count due to breaking sharing of a neighbour block in the same btree leaf). This fix keeps the 'begin' field, but now it's just a hint to speed up the search. Instead the current transaction is searched for a free block, and then the old transaction is double checked to ensure it's free. Much simpler. This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when DM thin-provisioning's snapshots are heavily used. Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net> Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15dm zoned: support zone sizes smaller than 128MiBDmitry Fomichev1-9/+14
commit b39962950339912978484cdac50069258545d753 upstream. dm-zoned is observed to log failed kernel assertions and not work correctly when operating against a device with a zone size smaller than 128MiB (e.g. 32768 bits per 4K block). The reason is that the bitmap size per zone is calculated as zero with such a small zone size. Fix this problem and also make the code related to zone bitmap management be able to handle per zone bitmaps smaller than a single block. A dm-zoned-tools patch is required to properly format dm-zoned devices with zone sizes smaller than 128MiB. Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29md: Avoid namespace collision with bitmap APIAndy Shevchenko3-9/+9
commit e64e4018d572710c44f42c923d4ac059f0a23320 upstream. bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> [only take the bitmap_free change for stable - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23block: fix an integer overflow in logical block sizeMikulas Patocka2-2/+2
commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream. Logical block size has type unsigned short. That means that it can be at most 32768. However, there are architectures that can run with 64k pages (for example arm64) and on these architectures, it may be possible to create block devices with 64k block size. For exmaple (run this on an architecture with 64k pages): Mount will fail with this error because it tries to read the superblock using 2-sector access: device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536 EXT4-fs (dm-0): unable to read superblock This patch changes the logical block size from unsigned short to unsigned int to avoid the overflow. Cc: stable@vger.kernel.org Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09md: raid1: check rdev before reference in raid1_sync_request funcZhiqiang Liu1-1/+1
[ Upstream commit 028288df635f5a9addd48ac4677b720192747944 ] In raid1_sync_request func, rdev should be checked before reference. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04bcache: at least try to shrink 1 node in bch_mca_scan()Coly Li1-0/+2
[ Upstream commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b ] In bch_mca_scan(), the number of shrinking btree node is calculated by code like this, unsigned long nr = sc->nr_to_scan; nr /= c->btree_pages; nr = min_t(unsigned long, nr, mca_can_free(c)); variable sc->nr_to_scan is number of objects (here is bcache B+tree nodes' number) to shrink, and pointer variable sc is sent from memory management code as parametr of a callback. If sc->nr_to_scan is smaller than c->btree_pages, after the above calculation, variable 'nr' will be 0 and nothing will be shrunk. It is frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make nr to be zero. Then bch_mca_scan() will do nothing more then acquiring and releasing mutex c->bucket_lock. This patch checkes whether nr is 0 after the above calculation, if 0 is the result then set 1 to variable 'n'. Then at least bch_mca_scan() will try to shrink a single B+tree node. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-21dm btree: increase rebalance threshold in __rebalance2()Hou Tao1-1/+7
commit 474e559567fa631dea8fb8407ab1b6090c903755 upstream. We got the following warnings from thin_check during thin-pool setup: $ thin_check /dev/vdb examining superblock examining devices tree missing devices: [1, 84] too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126) examining mapping tree The phenomenon is the number of entries in one node of details_info tree is less than (max_entries / 3). And it can be easily reproduced by the following procedures: $ new a thin pool $ presume the max entries of details_info tree is 126 $ new 127 thin devices (e.g. 1~127) to make the root node being full and then split $ remove the first 43 (e.g. 1~43) thin devices to make the children reblance repeatedly $ stop the thin pool $ thin_check The root cause is that the B-tree removal procedure in __rebalance2() doesn't guarantee the invariance: the minimal number of entries in non-root node should be >= (max_entries / 3). Simply fix the problem by increasing the rebalance threshold to make sure the number of entries in each child will be greater than or equal to (max_entries / 3 + 1), so no matter which child is used for removal, the number will still be valid. Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17raid5: need to set STRIPE_HANDLE for batch headGuoqing Jiang1-1/+1
[ Upstream commit a7ede3d16808b8f3915c8572d783530a82b2f027 ] With commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list"), we don't want to set STRIPE_HANDLE flag for sh which is already in batch list. However, the stripe which is the head of batch list should set this flag, otherwise panic could happen inside init_stripe at BUG_ON(sh->batch_head), it is reproducible with raid5 on top of nvdimm devices per Xiao oberserved. Thanks for Xiao's effort to verify the change. Fixes: 6ce220dd2f8ea ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list") Reported-by: Xiao Ni <xni@redhat.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17dm zoned: reduce overhead of backing device checksDmitry Fomichev4-32/+61
commit e7fad909b68aa37470d9f2d2731b5bec355ee5d6 upstream. Commit 75d66ffb48efb3 added backing device health checks and as a part of these checks, check_events() block ops template call is invoked in dm-zoned mapping path as well as in reclaim and flush path. Calling check_events() with ATA or SCSI backing devices introduces a blocking scsi_test_unit_ready() call being made in sd_check_events(). Even though the overhead of calling scsi_test_unit_ready() is small for ATA zoned devices, it is much larger for SCSI and it affects performance in a very negative way. Fix this performance regression by executing check_events() only in case of any I/O errors. The function dmz_bdev_is_dying() is modified to call only blk_queue_dying(), while calls to check_events() are made in a new helper function, dmz_check_bdev(). Reported-by: zhangxiaoxu <zhangxiaoxu5@huawei.com> Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17md/raid0: Fix an error message in raid0_make_request()Dan Carpenter1-1/+1
[ Upstream commit e3fc3f3d0943b126f76b8533960e4168412d9e5a ] The first argument to WARN() is supposed to be a condition. The original code will just print the mdname() instead of the full warning message. Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05dm flakey: Properly corrupt multi-page bios.Sweet Tea1-11/+22
[ Upstream commit a00f5276e26636cbf72f24f79831026d2e2868e7 ] The flakey target is documented to be able to corrupt the Nth byte in a bio, but does not corrupt byte indices after the first biovec in the bio. Change the corrupting function to actually corrupt the Nth byte no matter in which biovec that index falls. A test device generating two-page bios, atop a flakey device configured to corrupt a byte index on the second page, verified both the failure to corrupt before this patch and the expected corruption after this change. Signed-off-by: John Dorminy <jdorminy@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01md/raid10: prevent access of uninitialized resync_pages offsetJohn Pittman1-1/+1
commit 45422b704db392a6d79d07ee3e3670b11048bd53 upstream. Due to unneeded multiplication in the out_free_pages portion of r10buf_pool_alloc(), when using a 3-copy raid10 layout, it is possible to access a resync_pages offset that has not been initialized. This access translates into a crash of the system within resync_free_pages() while passing a bad pointer to put_page(). Remove the multiplication, preventing access to the uninitialized area. Fixes: f0250618361db ("md: raid10: don't use bio's vec table to manage resync pages") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: John Pittman <jpittman@redhat.com> Suggested-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-01dm raid: avoid bitmap with raid4/5/6 journal deviceHeinz Mauelshagen1-1/+1
[ Upstream commit d857ad75edf3c0066fcd920746f9dc75382b3324 ] With raid4/5/6, journal device and write intent bitmap are mutually exclusive. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24bcache: recal cached_dev_sectors on detachShenghui Wang1-0/+1
[ Upstream commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3 ] Recal cached_dev_sectors on cached_dev detached, as recal done on cached_dev attached. Update the cached_dev_sectors before bcache_device_detach called as bcache_device_detach will set bcache_device->c to NULL. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-24md: allow metadata updates while suspending an array - fixNeilBrown1-10/+12
[ Upstream commit 059421e041eb461fb2b3e81c9adaec18ef03ca3c ] Commit 35bfc52187f6 ("md: allow metadata update while suspending.") added support for allowing md_check_recovery() to still perform metadata updates while the array is entering the 'suspended' state. This is needed to allow the processes of entering the state to complete. Unfortunately, the patch doesn't really work. The test for "mddev->suspended" at the start of md_check_recovery() means that the function doesn't try to do anything at all while entering suspend. This patch moves the code of updating the metadata while suspending to *before* the test on mddev->suspended. Reported-by: Jeff Mahoney <jeffm@suse.com> Fixes: 35bfc52187f6 ("md: allow metadata update while suspending.") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-06dm: Use kzalloc for all structs with embedded biosets/mempoolsKent Overstreet7-7/+7
[ Upstream commit d377535405686f735b90a8ad4ba269484cd7c96e ] mempool_init()/bioset_init() require that the mempools/biosets be zeroed first; they probably should not _require_ this, but not allocating those structs with kzalloc is a fairly nonsensical thing to do (calling mempool_exit()/bioset_exit() on an uninitialized mempool/bioset is legal and safe, but only works if said memory was zeroed.) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-06dm snapshot: rework COW throttling to fix deadlockMikulas Patocka1-16/+64
[ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ] Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and workqueue stalls") introduced a semaphore to limit the maximum number of in-flight kcopyd (COW) jobs. The implementation of this throttling mechanism is prone to a deadlock: 1. One or more threads write to the origin device causing COW, which is performed by kcopyd. 2. At some point some of these threads might reach the s->cow_count semaphore limit and block in down(&s->cow_count), holding a read lock on _origins_lock. 3. Someone tries to acquire a write lock on _origins_lock, e.g., snapshot_ctr(), which blocks because the threads at step (2) already hold a read lock on it. 4. A COW operation completes and kcopyd runs dm-snapshot's completion callback, which ends up calling pending_complete(). pending_complete() tries to resubmit any deferred origin bios. This requires acquiring a read lock on _origins_lock, which blocks. This happens because the read-write semaphore implementation gives priority to writers, meaning that as soon as a writer tries to enter the critical section, no readers will be allowed in, until all writers have completed their work. So, pending_complete() waits for the writer at step (3) to acquire and release the lock. This writer waits for the readers at step (2) to release the read lock and those readers wait for pending_complete() (the kcopyd thread) to signal the s->cow_count semaphore: DEADLOCK. The above was thoroughly analyzed and documented by Nikos Tsironis as part of his initial proposal for fixing this deadlock, see: https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html Fix this deadlock by reworking COW throttling so that it waits without holding any locks. Add a variable 'in_progress' that counts how many kcopyd jobs are running. A function wait_for_in_progress() will sleep if 'in_progress' is over the limit. It drops _origins_lock in order to avoid the deadlock. Reported-by: Guruswamy Basavaiah <guru2018@gmail.com> Reported-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Tested-by: Nikos Tsironis <ntsironis@arrikto.com> Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls") Cc: stable@vger.kernel.org # v5.0+ Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-11-06dm snapshot: introduce account_start_copy() and account_end_copy()Mikulas Patocka1-3/+13
[ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ] This simple refactoring moves code for modifying the semaphore cow_count into separate functions to prepare for changes that will extend these methods to provide for a more sophisticated mechanism for COW throttling. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>