summaryrefslogtreecommitdiff
path: root/drivers/md/md-bitmap.c
AgeCommit message (Collapse)AuthorFilesLines
2024-08-27md/md-bitmap: add 'file_pages' into struct md_bitmap_statsYu Kuai1-2/+5
There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: add 'sync_size' into struct md_bitmap_statsYu Kuai1-0/+6
To avoid dereferencing bitmap directly in md-cluster to prepare inventing a new bitmap. BTW, also fix following checkpatch warnings: WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: add 'events_cleared' into struct md_bitmap_statsYu Kuai1-0/+2
Also add a new helper to get events_cleared to avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()Yu Kuai1-19/+6
There are no functional changes, and the new helper will be used in multiple places in following patches to avoid dereferencing bitmap directly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request()Yu Kuai1-0/+1
Use the existed helper instead of open coding it to make the code cleaner. There are no functional changes, and also avoid dereferencing bitmap directly to prepare inventing a new bitmap. Noted that this patch also export md_bitmap_wait_behind_writes(), which is necessary for now, and the exported api will be removed in following patches to convert bitmap apis into ops. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-06-12md/md-bitmap: fix writing non bitmap pagesOfir Gal1-3/+3
__write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check the final size exceeds the bitmap length. For example: page count - 1 page size - 4K data offset - 1M optimal io size - 256K The final io size would be 256K (64 pages) but md_bitmap_storage_alloc() allocated 1 page, the IO would write 1 valid page and 63 pages that happens to be allocated afterwards. This leaks memory to the raid device superblock. This issue caused a data transfer failure in nvme-tcp. The network drivers checks the first page of an IO with sendpage_ok(), it returns true if the page isn't a slabpage and refcount >= 1. If the page !sendpage_ok() the network driver disables MSG_SPLICE_PAGES. As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on. The bitmap pages aren't slab pages, the first page of the IO is sendpage_ok(), but the additional pages that happens to be allocated after the bitmap pages might be !sendpage_ok(). That cause skb_splice_from_iter() to stop the data transfer, in the case below it hangs 'mdadm --create'. The bug is reproducible, in order to reproduce we need nvme-over-tcp controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid with bitmap over those devices reproduces the bug. In order to simulate large optimal IO size you can use dm-stripe with a single device. Script to reproduce the issue on top of brd devices using dm-stripe is attached below (will be added to blktest). I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee === __write_sb_page before md_super_write. logging pages === pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com
2024-05-03md: fix resync softlockup when bitmap size is less than array sizeYu Kuai1-3/+3
Is is reported that for dm-raid10, lvextend + lvchange --syncaction will trigger following softlockup: kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1 RIP: 0010:_raw_spin_unlock_irq+0x13/0x30 Call Trace: <TASK> md_bitmap_start_sync+0x6b/0xf0 raid10_sync_request+0x25c/0x1b40 [raid10] md_do_sync+0x64b/0x1020 md_thread+0xa7/0x170 kthread+0xcf/0x100 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1a/0x30 And the detailed process is as follows: md_do_sync j = mddev->resync_min while (j < max_sectors) sectors = raid10_sync_request(mddev, j, &skipped) if (!md_bitmap_start_sync(..., &sync_blocks)) // md_bitmap_start_sync set sync_blocks to 0 return sync_blocks + sectors_skippe; // sectors = 0; j += sectors; // j never change Root cause is that commit 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") return early from md_bitmap_get_counter(), without setting returned blocks. Fix this problem by always set returned blocks from md_bitmap_get_counter"(), as it used to be. Noted that this patch just fix the softlockup problem in kernel, the case that bitmap size doesn't match array size still need to be fixed. Fixes: 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") Reported-and-tested-by: Nigel Croxon <ncroxon@redhat.com> Closes: https://lore.kernel.org/all/71ba5272-ab07-43ba-8232-d2da642acb4e@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240422065824.2516-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-03-06md: add a mddev_add_trace_msg helperChristoph Hellwig1-6/+3
Add a small wrapper around blk_add_trace_msg that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de
2024-02-27md/md-bitmap: fix incorrect usage for sb_indexHeming Zhao1-3/+6
Commit d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file") removed page->index from bitmap code, but left wrong code logic for clustered-md. current code never set slot offset for cluster nodes, will sometimes cause crash in clustered env. Call trace (partly): md_bitmap_file_set_bit+0x110/0x1d8 [md_mod] md_bitmap_startwrite+0x13c/0x240 [md_mod] raid1_make_request+0x6b0/0x1c08 [raid1] md_handle_request+0x1dc/0x368 [md_mod] md_submit_bio+0x80/0xf8 [md_mod] __submit_bio+0x178/0x300 submit_bio_noacct_nocheck+0x11c/0x338 submit_bio_noacct+0x134/0x614 submit_bio+0x28/0xdc submit_bh_wbc+0x130/0x1cc submit_bh+0x1c/0x28 Fixes: d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240223121128.28985-1-heming.zhao@suse.com
2023-10-11md: cleanup mddev_create/destroy_serial_pool()Yu Kuai1-4/+4
Now that except for stopping the array, all the callers already suspend the array, there is no need to suspend anymore, hence remove the second parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-15-yukuai1@huaweicloud.com
2023-10-11md: use new apis to suspend array before mddev_create/destroy_serial_poolYu Kuai1-4/+4
mddev_create/destroy_serial_pool() will be called from several places where mddev_suspend() will be called later. Prepare to remove the mddev_suspend() from mddev_create/destroy_serial_pool(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-14-yukuai1@huaweicloud.com
2023-10-11md/md-bitmap: use new apis to suspend array for location_store()Yu Kuai1-4/+2
Convert to use new apis, the old apis will be removed eventually. This is not hot path, so performance is not concerned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231010151958.145896-8-yukuai1@huaweicloud.com
2023-09-22md-bitmap: suspend array earlier in location_store()Yu Kuai1-23/+20
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to call mddev_suspend() earlier. This will also be helper to refactor mddev_suspend() later. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com
2023-09-22md-bitmap: remove the checking of 'pers->quiesce' from location_store()Yu Kuai1-4/+0
After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"), there is no need to check 'pers->quiesce' before calling mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com
2023-07-27md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai1-0/+7
Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai1-2/+0
Local variable is definied first in the beginning of backlog_store(), there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: make bitmap file support optionalChristoph Hellwig1-0/+15
The support for write intent bitmaps in files on an external files in md is a hot mess that abuses ->bmap to map file offsets into physical device objects, and also abuses buffer_heads in a creative way. Make this code optional so that MD can be built into future kernels without buffer_head support, and so that we can eventually deprecate it. Note this does not affect the internal bitmap support, which has none of the problems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
2023-07-27md-bitmap: don't use ->index for pages backing the bitmap fileChristoph Hellwig1-27/+38
The md driver allocates pages for storing the bitmap file data, which are not page cache pages, and then stores the page granularity file offset in page->index, which is a field that isn't really valid except for page cache pages. Use a separate index for the superblock, and use the scheme used at read size to recalculate the index for the bitmap pages instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-10-hch@lst.de
2023-07-27md-bitmap: account for mddev->bitmap_info.offset in read_sb_pageChristoph Hellwig1-9/+8
Diretly apply mddev->bitmap_info.offset to the sector number to read instead of doing that in both callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-9-hch@lst.de
2023-07-27md-bitmap: cleanup read_sb_pageChristoph Hellwig1-12/+11
Convert read_sb_page to the normal kernel coding style, calculate the target sector only once, and add a local iosize variable to make the call to sync_page_io more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-8-hch@lst.de
2023-07-27md-bitmap: refactor md_bitmap_init_from_diskChristoph Hellwig1-71/+70
Split the confusing loop in md_bitmap_init_from_disk that iterates over all chunks but also needs to read and map the pages into three separate loops: one that iterates over the pages to read them, a second optional one to iterate over the pages to mark them invalid if the bitmaps are out of date, and a final one that actually iterates over the chunks. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202306160552.smw0qbmb-lkp@intel.com/ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-7-hch@lst.de
2023-07-27md-bitmap: rename read_page to read_file_pageChristoph Hellwig1-6/+4
Make the difference to read_sb_page clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-6-hch@lst.de
2023-07-27md-bitmap: split file writes into a separate helperChristoph Hellwig1-24/+24
Split the file write code out of write_page into a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-5-hch@lst.de
2023-07-27md-bitmap: use %pD to print the file name in md_bitmap_file_kickChristoph Hellwig1-10/+2
Don't bother allocating an extra buffer in the I/O failure handler and instead use the printk built-in format to print the last 4 path name components. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-4-hch@lst.de
2023-07-27md-bitmap: initialize variables at declaration time in md_bitmap_file_unmapChristoph Hellwig1-8/+4
Just a small tidyup to prepare for bigger changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-3-hch@lst.de
2023-07-27md-bitmap: set BITMAP_WRITE_ERROR in write_sb_pageChristoph Hellwig1-13/+8
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating the error to the caller and setting it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230615064840.629492-2-hch@lst.de
2023-06-14md/md-bitmap: add a new helper to unplug bitmap asynchrouslyYu Kuai1-0/+29
If bitmap is enabled, bitmap must update before submitting write io, this is why unplug callback must move these io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which will suffer performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submit bitmap io in raid10_unplug() doesn't require that 'current->bio_list' is empty. This patch prepare to limit the number of plugged bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
2023-06-14md/raid1-10: submit write io directly if bitmap is not enabledYu Kuai1-3/+1
Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10") add bitmap support, and it changed that write io is submitted through daemon thread because bitmap need to be updated before write io. And later, plug is used to fix performance regression because all the write io will go to demon thread, which means io can't be issued concurrently. However, if bitmap is not enabled, the write io should not go to daemon thread in the first place, and plug is not needed as well. Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
2023-06-14md: protect md_thread with rcuYu Kuai1-2/+8
Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-14md/bitmap: factor out a helper to set timeoutYu Kuai1-16/+19
Register/unregister 'mddev->thread' are both under 'reconfig_mutex', however, some context didn't hold the mutex to access mddev->thread, which can cause null-ptr-deference: 1) md_bitmap_daemon_work() can be called from md_check_recovery() where 'reconfig_mutex' is not held, deference 'mddev->thread' might cause null-ptr-deference, because md_unregister_thread() reset the pointer before stopping the thread. 2) timeout_store() access 'mddev->thread' multiple times, null-ptr-deference can be triggered if 'mddev->thread' is reset in the middle. This patch factor out a helper to set timeout, the new helper always check if 'mddev->thread' is null first, so that problem 1 can be fixed. Now that this helper only access 'mddev->thread' once, but it's possible that 'mddev->thread' can be freed while this helper is still in progress, hence the problem is not fixed yet. Follow up patches will fix this by protecting md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-5-yukuai1@huaweicloud.com
2023-06-14md/bitmap: always wake up md_thread in timeout_storeYu Kuai1-3/+3
md_wakeup_thread() can handle the case that pass in md_thread is NULL, the only difference is that md_wakeup_thread() will be called when current timeout is 'MAX_SCHEDULE_TIMEOUT', this should not matter because timeout_store() is not hot path, and the daemon process is woke up more than demand from other context already. Prepare to factor out a helper to set timeout. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-4-yukuai1@huaweicloud.com
2023-06-14md/raid10: check slab-out-of-bounds in md_bitmap_get_counterLi Nan1-8/+9
If we write a large number to md/bitmap_set_bits, md_bitmap_checkpage() will return -EINVAL because 'page >= bitmap->pages', but the return value was not checked immediately in md_bitmap_get_counter() in order to set *blocks value and slab-out-of-bounds occurs. Move check of 'page >= bitmap->pages' to md_bitmap_get_counter() and return directly if true. Fixes: ef4256733506 ("md/bitmap: optimise scanning of empty bitmaps.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230515134808.3936750-2-linan666@huaweicloud.com
2023-04-28md: Fix bitmap offset type in sb writerJonathan Derrick1-3/+3
Bitmap offset is allowed to be negative, indicating that bitmap precedes metadata. Change the type back from sector_t to loff_t to satisfy conditionals and calculations. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-raid/CAPhsuW6HuaUJ5WcyPajVgUfkQFYp2D_cy1g6qxN4CU_gP2=z7g@mail.gmail.com/ Fixes: 10172f200b67 ("md: Fix types in sb writer") Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev> Suggested-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230425011438.71046-1-jonathan.derrick@linux.dev
2023-04-14md: Use optimal I/O size for last bitmap pageJon Derrick1-4/+29
If the bitmap space has enough room, size the I/O for the last bitmap page write to the optimal I/O size for the storage device. The expanded write is checked that it won't overrun the data or metadata. The drive this was tested against has higher latencies when there are sub-4k writes due to device-side read-mod-writes of its atomic 4k write unit. This change helps increase performance by sizing the last bitmap page I/O for the device's preferred write unit, if it is given. Example Intel/Solidigm P5520 Raid10, Chunk-size 64M, bitmap-size 57228 bits $ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1 --assume-clean --bitmap=internal --bitmap-chunk=64M $ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60 Without patch: write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets With patch: write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets Biosnoop: Without patch: Time Process PID Device LBA Size Lat 1.410377 md0_raid10 6900 nvme0n1 W 16 4096 0.02 1.410387 md0_raid10 6900 nvme2n1 W 16 4096 0.02 1.410374 md0_raid10 6900 nvme3n1 W 16 4096 0.01 1.410381 md0_raid10 6900 nvme1n1 W 16 4096 0.02 1.410411 md0_raid10 6900 nvme1n1 W 115346512 4096 0.01 1.410418 md0_raid10 6900 nvme0n1 W 115346512 4096 0.02 1.410915 md0_raid10 6900 nvme2n1 W 24 3584 0.43 <-- 1.410935 md0_raid10 6900 nvme3n1 W 24 3584 0.45 <-- 1.411124 md0_raid10 6900 nvme1n1 W 24 3584 0.64 <-- 1.411147 md0_raid10 6900 nvme0n1 W 24 3584 0.66 <-- 1.411176 md0_raid10 6900 nvme3n1 W 2019022184 4096 0.01 1.411189 md0_raid10 6900 nvme2n1 W 2019022184 4096 0.02 With patch: Time Process PID Device LBA Size Lat 5.747193 md0_raid10 727 nvme0n1 W 16 4096 0.01 5.747192 md0_raid10 727 nvme1n1 W 16 4096 0.02 5.747195 md0_raid10 727 nvme3n1 W 16 4096 0.01 5.747202 md0_raid10 727 nvme2n1 W 16 4096 0.02 5.747229 md0_raid10 727 nvme3n1 W 1196223704 4096 0.02 5.747224 md0_raid10 727 nvme0n1 W 1196223704 4096 0.01 5.747279 md0_raid10 727 nvme0n1 W 24 4096 0.01 <-- 5.747279 md0_raid10 727 nvme1n1 W 24 4096 0.02 <-- 5.747284 md0_raid10 727 nvme3n1 W 24 4096 0.02 <-- 5.747291 md0_raid10 727 nvme2n1 W 24 4096 0.02 <-- 5.747314 md0_raid10 727 nvme2n1 W 2234636712 4096 0.01 5.747317 md0_raid10 727 nvme1n1 W 2234636712 4096 0.02 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
2023-04-14md: Fix types in sb writerJon Derrick1-21/+14
Page->index is a pgoff_t and multiplying could cause overflows on a 32-bit architecture. In the sb writer, this is used to calculate and verify the sector being used, and is multiplied by a sector value. Using sector_t will cast it to a u64 type and is the more appropriate type for the unit. Additionally, the integer size unit is converted to a sector unit in later calculations, and is now corrected to be an unsigned type. Finally, clean up the calculations using variable aliases to improve readabiliy. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230224183323.638-3-jonathan.derrick@linux.dev
2023-04-14md: Move sb writer loop to its own functionJon Derrick1-60/+65
Preparatory patch for optimal I/O size calculation. Move the sb writer loop routine into its own function for clarity. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230224183323.638-2-jonathan.derrick@linux.dev
2022-11-14md/bitmap: Fix bitmap chunk size overflow issuesFlorian-Ewald Mueller1-8/+12
- limit bitmap chunk size internal u64 variable to values not overflowing the u32 bitmap superblock structure variable stored on persistent media - assign bitmap chunk size internal u64 variable from unsigned values to avoid possible sign extension artifacts when assigning from a s32 value The bug has been there since at least kernel 4.0. Steps to reproduce it: 1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2 2 resize member device rnbd1 and rnbd2 to 8 TB 3 mdadm --grow /dev/mdx --size=max The bitmap_chunksize will overflow without patch. Cc: stable@vger.kernel.org Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Song Liu <song@kernel.org>
2022-11-14drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()Li Zhong1-12/+15
Check the return value of md_bitmap_get_counter() in case it returns NULL pointer, which will result in a null pointer dereference. v2: update the check to include other dereference Signed-off-by: Li Zhong <floridsleeves@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
2022-07-14fs/buffer: Combine two submit_bh() and ll_rw_block() argumentsBart Van Assche1-2/+2
Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14md/core: Combine two sync_page_io() argumentsBart Van Assche1-1/+1
Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-26md: replace deprecated strlcpy & remove duplicated lineHeming Zhao1-2/+1
This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by b97e92574c0bf become necessary. - commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So cf921cc19cf7 became useless after d3b178adb3a3a, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-26md/bitmap: don't set sb values if can't pass sanity checkHeming Zhao1-21/+23
If bitmap area contains invalid data, kernel will crash then mdadm triggers "Segmentation fault". This is cluster-md speical bug. In non-clustered env, mdadm will handle broken metadata case. In clustered array, only kernel space handles bitmap slot info. But even this bug only happened in clustered env, current sanity check is wrong, the code should be changed. How to trigger: (faulty injection) dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm -Ss echo aaa > magic.txt == below modifying slot 2 bitmap data == dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize mdadm -A /dev/md0 /dev/sda /dev/sdb == kernel crashes. mdadm outputs "Segmentation fault" == Reason of kernel crash: In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T() trigger "divide error". Crash log: kernel: md: md0 stopped. kernel: md/raid1:md0: not clean -- starting background reconstruction kernel: md/raid1:md0: active with 2 out of 2 mirrors kernel: dlm: ... ... kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1 kernel: md0: invalid bitmap file superblock: bad magic kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2 kernel: md-cluster: Could not gather bitmaps from slot 2 kernel: divide error: 0000 [#1] SMP NOPTI kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246 kernel: ... ... kernel: Call Trace: kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0] kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a] kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0] kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a] kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a] kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a] kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a] kernel: ? blkdev_ioctl+0xb1/0x2b0 kernel: block_ioctl+0x3b/0x40 kernel: __x64_sys_ioctl+0x7f/0xb0 kernel: do_syscall_64+0x59/0x80 kernel: ? exit_to_user_mode_prepare+0x1ab/0x230 kernel: ? syscall_exit_to_user_mode+0x18/0x40 kernel: ? do_syscall_64+0x69/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f4a15fa722b kernel: ... ... kernel: ---[ end trace 8afa7612f559c868 ]--- kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-11-02md/bitmap: don't set max_write_behind if there is no write mostly deviceGuoqing Jiang1-0/+19
We shouldn't set it since write behind IO should only happen to write mostly device. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com>
2021-06-15md: Constify attribute_group structsRikard Falkeborn1-1/+1
The attribute_group structs are never modified, they're only passed to sysfs_create_group() and sysfs_remove_group(). Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md/bitmap: wait for external bitmap writes to complete during tear downSudhakar Panneerselvam1-0/+2
NULL pointer dereference was observed in super_written() when it tries to access the mddev structure. [The below stack trace is from an older kernel, but the problem described in this patch applies to the mainline kernel.] [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000 [ 1194.488000] RIP: 0010:super_written+0x29/0xe1 [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046 [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048 [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200 [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000 [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200 [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000 [ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0 [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1194.663316] PKRU: 55555554 [ 1194.674090] Call Trace: [ 1194.683735] <IRQ> [ 1194.692948] bio_endio+0xae/0x135 [ 1194.703580] blk_update_request+0xad/0x2fa [ 1194.714990] blk_update_bidi_request+0x20/0x72 [ 1194.726578] __blk_end_bidi_request+0x2c/0x4d [ 1194.738373] __blk_end_request_all+0x31/0x49 [ 1194.749344] blk_flush_complete_seq+0x377/0x383 [ 1194.761550] flush_end_io+0x1dd/0x2a7 [ 1194.772910] blk_finish_request+0x9f/0x13c [ 1194.784544] scsi_end_request+0x180/0x25c [ 1194.796149] scsi_io_completion+0xc8/0x610 [ 1194.807503] scsi_finish_command+0xdc/0x125 [ 1194.818897] scsi_softirq_done+0x81/0xde [ 1194.830062] blk_done_softirq+0xa4/0xcc [ 1194.841008] __do_softirq+0xd9/0x29f [ 1194.851257] irq_exit+0xe6/0xeb [ 1194.861290] do_IRQ+0x59/0xe3 [ 1194.871060] common_interrupt+0x1c6/0x382 [ 1194.881988] </IRQ> [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5 [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43 [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000 [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2 [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003 [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a [ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5 [ 1194.999077] cpuidle_enter+0x17/0x19 [ 1195.008395] call_cpuidle+0x23/0x3a [ 1195.017718] do_idle+0x172/0x1d5 [ 1195.026358] cpu_startup_entry+0x73/0x75 [ 1195.035769] start_secondary+0x1b9/0x20b [ 1195.044894] secondary_startup_64+0xa5/0xa5 [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78 [ 1195.096354] CR2: 00000000000002b8 bio in the above stack is a bitmap write whose completion is invoked after the tear down sequence sets the mddev structure to NULL in rdev. During tear down, there is an attempt to flush the bitmap writes, but for external bitmaps, there is no explicit wait for all the bitmap writes to complete. For instance, md_bitmap_flush() is called to flush the bitmap writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush() could generate new bitmap writes for which there is no explicit wait to complete those writes. The call to md_bitmap_update_sb() will return simply for external bitmaps and the follow-up call to md_update_sb() is conditional and may not get called for external bitmaps. This results in a kernel panic when the completion routine, super_written() is called which tries to reference mddev in the rdev that has been set to NULL(in unbind_rdev_from_array() by tear down sequence). The solution is to call md_super_wait() for external bitmaps after the last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending bitmap writes before proceeding with the tear down. Cc: stable@vger.kernel.org Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com> Reviewed-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2020-10-09md/bitmap: fix memory leak of temporary bitmapZhao Heming1-1/+2
Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-09md/bitmap: md_bitmap_get_counter returns wrong blocksZhao Heming1-1/+1
md_bitmap_get_counter() has code: ``` if (bitmap->bp[page].hijacked || bitmap->bp[page].map == NULL) csize = ((sector_t)1) << (bitmap->chunkshift + PAGE_COUNTER_SHIFT - 1); ``` The minus 1 is wrong, this branch should report 2048 bits of space. With "-1" action, this only report 1024 bit of space. This bug code returns wrong blocks, but it doesn't inflence bitmap logic: 1. Most callers focus this function return value (the counter of offset), not the parameter blocks. 2. The bug is only triggered when hijacked is true or map is NULL. the hijacked true condition is very rare. the "map == null" only true when array is creating or resizing. 3. Even the caller gets wrong blocks, current code makes caller just to call md_bitmap_get_counter() one more time. Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-09md/bitmap: md_bitmap_read_sb uses wrong bitmap blocksZhao Heming1-2/+2
The patched code is used to get chunks number, should use round-up div to replace current sector_div. The same code is in md_bitmap_resize(): ``` chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift); ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-25md: only calculate blocksize once and use i_blocksize()Xianting Tian1-3/+4
We alreday has the interface i_blocksize(), which can be used to get blocksize, so use it. Only calculate blocksize once and use it within read_page(). Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-24treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>