path: root/drivers/md/raid5.c
Age | Commit message | Author | Files | Lines
2026-01-11 | md/raid5: fix IO hang when array is broken with IO inflight | Yu Kuai | 1 | -2/+4
[ Upstream commit a913d1f6a7f607c110aeef8b58c8988f47a4b24e ]

The following test can cause an IO hang:

  mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none
  sleep 5
  echo 1 > /sys/block/sda/device/delete
  echo 1 > /sys/block/sdb/device/delete
  echo 1 > /sys/block/sdc/device/delete
  echo 1 > /sys/block/sdd/device/delete
  dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct

Root cause:

1) All disks are removed, but all rdevs in the array are still in sync, so IO is issued normally.
2) IO fails on sda and setting badblocks fails, so sda becomes faulty and MD_SB_CHANGE_PENDING is set.
3) Error recovery tries to recover this IO from the other disks, so IO is issued to sdb, sdc and sdd.
4) IO fails on sdb and setting badblocks fails again; the array is now broken and becomes read-only.
5) IO fails on sdc and sdd, but the stripe can't be handled anymore because MD_SB_CHANGE_PENDING is set:

  handle_stripe
   if (test_bit MD_SB_CHANGE_PENDING)
    set_bit STRIPE_HANDLE
    goto finish
    // skip handling failed stripe

  release_stripe
   if (test_bit STRIPE_HANDLE)
    list_add_tail conf->handle_list

6) Later, raid5d can't handle the failed stripe either:

  raid5d
   md_check_recovery
    md_update_sb
     if (!md_is_rdwr())
      // can't clear pending bit
      return
   if (test_bit MD_SB_CHANGE_PENDING)
    break;
    // can't handle failed stripe

Since MD_SB_CHANGE_PENDING can never be cleared for a read-only array, fix this problem by skipping this check for read-only arrays.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com
Fixes: d87f064f5874 ("md: never update metadata when array is read-only.")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
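The condition described in the fix above can be modelled in a few lines of userspace C. This is only a sketch: `struct array_state`, `must_defer_old()` and `must_defer_new()` are hypothetical names invented for illustration, not kernel symbols, and the two booleans merely stand in for the MD_SB_CHANGE_PENDING bit and the result of md_is_rdwr().

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the md state discussed above. */
struct array_state {
	bool sb_change_pending; /* models the MD_SB_CHANGE_PENDING bit */
	bool read_write;        /* models the result of md_is_rdwr() */
};

/* Before the fix: a pending superblock update always defers the stripe. */
bool must_defer_old(const struct array_state *s)
{
	return s->sb_change_pending;
}

/*
 * After the fix: only defer while the array is read-write, because a
 * read-only array can never clear the pending bit.
 */
bool must_defer_new(const struct array_state *s)
{
	return s->read_write && s->sb_change_pending;
}
```

For a broken, read-only array with the pending bit set, the old predicate defers forever while the new one lets the failed stripe be handled.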
2025-02-21 | md/md-bitmap: move bitmap_{start, end}write to md upper layer | Yu Kuai | 1 | -45/+5
commit cd5fc653381811f1e0ba65f5d169918cab61476f upstream.

There are two BUG reports that raid5 will hang at bitmap_startwrite() ([1], [2]). The root cause is that bitmap start write and end write are unbalanced; it's not quite clear where. While reviewing the raid5 code, it was also found that bitmap operations can be optimized.

For example, for a 4-disk raid5 with chunksize=8k, if the user issues an IO (0 + 48k) to the array:

┌────────────────────────────────────────────────────────────┐
│chunk 0                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┼
│  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘

Before this patch, 4 stripe heads are used, each sh attaches a bio for 3 disks, and each attached bio triggers bitmap_startwrite() once, which means 12 calls in total:

- 3 times (0 + 4k), for (A0, A1 and A2)
- 3 times (4 + 4k), for (B0, B1 and B2)
- 3 times (8 + 4k), for (C0, C1 and C3)
- 3 times (12 + 4k), for (D0, D1 and D3)

After this patch, the md upper layer calculates that the IO range (0 + 48k) corresponds to the bitmap range (0 + 16k), and calls bitmap_startwrite() just once.

Note that this patch aligns bitmap ranges to the chunks. For example, if the user issues an IO (0 + 4k) to the array:

- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k), for chunk 0;

Usually, one bitmap bit represents more than one disk chunk, so this makes no difference. And even if the user really created an array in which one chunk contains multiple bits, the only overhead is that more data will be recovered after a power failure.

Also remove STRIPE_BITMAP_PENDING since it's not used anymore.

[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[There is no bitmap_operations, resolve conflicts by replacing bitmap_ops->{startwrite, endwrite} with md_bitmap_{startwrite, endwrite}]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
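The range arithmetic in the example above (array IO (0 + 48k) becoming bitmap range (0 + 16k), and (0 + 4k) becoming (0 + 8k)) can be reproduced with a small userspace sketch. It assumes a simplified model in which the array range is rounded out to whole rows of data chunks and then divided by the number of data disks; `struct byte_range` and `bitmap_range()` are hypothetical names, and the real raid5 conversion may differ in detail.

```c
#include <stdint.h>

/* Hypothetical helper type: a byte range on one underlying disk. */
struct byte_range {
	uint64_t start;
	uint64_t len;
};

/* Round v down/up to a multiple of m (m need not be a power of two). */
static uint64_t round_down_to(uint64_t v, uint64_t m)
{
	return v - v % m;
}

static uint64_t round_up_to(uint64_t v, uint64_t m)
{
	return round_down_to(v + m - 1, m);
}

/*
 * Simplified model: align the array IO range to whole rows of data chunks
 * (chunk * data_disks bytes of array space), then scale it down to the
 * per-disk range that the bitmap tracks.
 */
struct byte_range bitmap_range(uint64_t io_start, uint64_t io_len,
			       uint64_t chunk, unsigned int data_disks)
{
	uint64_t row = chunk * data_disks;
	uint64_t start = round_down_to(io_start, row);
	uint64_t end = round_up_to(io_start + io_len, row);
	struct byte_range r = { start / data_disks, (end - start) / data_disks };

	return r;
}
```

With chunk = 8k and 3 data disks (the 4-disk raid5 from the example), `bitmap_range(0, 48k, ...)` yields (0 + 16k) and `bitmap_range(0, 4k, ...)` yields (0 + 8k), matching the numbers in the commit message.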
2025-02-21 | md/raid5: implement pers->bitmap_sector() | Yu Kuai | 1 | -0/+51
commit 9c89f604476cf15c31fbbdb043cff7fbf1dbe0cb upstream.

Bitmap is used for the whole array for raid1/raid10, hence IO for the array can be used directly for bitmap. However, bitmap is used for underlying disks for raid5, hence IO for the array can't be used directly for bitmap.

Implement pers->bitmap_sector() for raid5 to convert IO ranges from the array to the underlying disks.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[ Resolve minor conflicts ]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 | md/md-bitmap: remove the last parameter for bitmap_ops->endwrite() | Yu Kuai | 1 | -11/+3
commit 4f0e7d0e03b7b80af84759a9e7cfb0f81ac4adae upstream.

When IO fails for one rdev, the bit will be marked as NEEDED in the following cases:

1) If badblocks is set and rdev is not faulty;
2) If rdev is faulty;

Case 1) is useless because synchronizing data to badblocks makes no sense. Case 2) can be replaced with mddev->degraded.

Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since case 2) no longer uses them.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[ Resolve minor conflicts ]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 | md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() | Yu Kuai | 1 | -7/+6
commit 08c50142a128dcb2d7060aa3b4c5db8837f7a46a upstream.

behind_write is only used in raid1. Prepare to refactor bitmap_{start/end}write(); there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
[There is no bitmap_operations, resolve conflicts by exporting new api md_bitmap_{start,end}_behind_write]
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21 | md/raid5: recheck if reshape has finished with device_lock held | Benjamin Marzinski | 1 | -23/+41
commit 25b3a8237a03ec0b67b965b52d74862e77ef7115 upstream.

When handling an IO request, MD checks if a reshape is currently happening, and if so, where the IO sector is in relation to the reshape progress. MD uses conf->reshape_progress for both of these tasks. When the reshape finishes, conf->reshape_progress is set to MaxSector. If this occurs after MD checks if the reshape is currently happening but before it calls ahead_of_reshape(), then ahead_of_reshape() will end up comparing the IO sector against MaxSector. During a backwards reshape, this will make MD think the IO sector is in the area not yet reshaped, causing it to use the previous configuration, and map the IO to the sector where that data was before the reshape.

This bug can be triggered by running the lvm2 lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop, although it's very hard to reproduce.

Fix this by factoring the code that checks where the IO sector is in relation to the reshape out to a helper called get_reshape_loc(), which reads reshape_progress and reshape_safe while holding the device_lock, and then rechecks if the reshape has finished before calling ahead_of_reshape() with the saved values. Also use the helper during the REQ_NOWAIT check to see if the location is inside of the reshape region.

Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
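The snapshot-and-recheck idea described above can be illustrated with a single-threaded userspace sketch. It is not the kernel helper: `struct reshape_state`, `get_loc()` and the three-value enum are hypothetical simplifications, locking is omitted (the kernel takes the snapshot with device_lock held), and the "inside the reshape region" case based on reshape_safe is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SECTOR UINT64_MAX /* stands in for the kernel's MaxSector */

enum reshape_loc { LOC_NO_RESHAPE, LOC_AHEAD_OF_RESHAPE, LOC_BEHIND_RESHAPE };

/* Hypothetical, reduced view of the reshape state. */
struct reshape_state {
	uint64_t reshape_progress; /* MAX_SECTOR once the reshape is finished */
	bool backwards;
};

/* The comparison that misbehaves when progress is already MAX_SECTOR. */
bool ahead_of_reshape(bool backwards, uint64_t sector, uint64_t progress)
{
	return backwards ? sector < progress : sector >= progress;
}

enum reshape_loc get_loc(const struct reshape_state *st, uint64_t sector)
{
	/* Take one snapshot; in the kernel this read happens under device_lock. */
	uint64_t progress = st->reshape_progress;

	/* Recheck on the saved value before comparing against it. */
	if (progress == MAX_SECTOR)
		return LOC_NO_RESHAPE;

	return ahead_of_reshape(st->backwards, sector, progress)
		? LOC_AHEAD_OF_RESHAPE : LOC_BEHIND_RESHAPE;
}
```

For a finished backwards reshape, calling `ahead_of_reshape()` directly reports every sector as not yet reshaped, while `get_loc()` reports that no reshape is running.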
2024-08-14 | md/raid5: avoid BUG_ON() while continue reshape after reassembling | Yu Kuai | 1 | -7/+13
[ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ]

Currently, mdadm supports --revert-reshape to abort the reshape while reassembling, as in the test 07revert-grow. However, the following BUG_ON() can be triggered by the test:

  kernel BUG at drivers/md/raid5.c:6278!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  irq event stamp: 158985
  CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94
  RIP: 0010:reshape_request+0x3f1/0xe60
  Call Trace:
   <TASK>
   raid5_sync_request+0x43d/0x550
   md_do_sync+0xb7a/0x2110
   md_thread+0x294/0x2b0
   kthread+0x147/0x1c0
   ret_from_fork+0x59/0x70
   ret_from_fork_asm+0x1a/0x30
   </TASK>

The root cause is that --revert-reshape updates raid_disks from 5 to 4 while the reshape position is still set. After reassembling the array, the reshape position is read from the super block, and then during reshape the check of 'writepos', which is calculated from the old reshape position, fails.

Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON(), and stop the reshape if the checks fail.

Note that mdadm must fix --revert-reshape as well, and md/raid should probably enhance metadata validation too; however, this means reassembly will fail and there must be user tools to fix the wrong metadata.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-16 | md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING | Yu Kuai | 1 | -12/+3
commit 151f66bb618d1fd0eeb84acb61b4a9fa5d8bb0fa upstream.

Xiao reported that the lvm2 test lvconvert-raid-takeover.sh can hang with a small probability; the root cause is exactly the same as commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"").

However, Dan reported another hang after that, and Junxiao investigated the problem and found out that it is caused by plugged bios that can't be issued from raid5d().

The current implementation in raid5d() has a weird dependence:

1) md_check_recovery() from raid5d() must hold 'reconfig_mutex' to clear MD_SB_CHANGE_PENDING;
2) raid5d() handles IO in a dead loop, until all IO is issued;
3) IO from raid5d() must wait for MD_SB_CHANGE_PENDING to be cleared;

This behaviour was introduced before v2.6, and as a consequence, if another context holds 'reconfig_mutex' and md_check_recovery() can't update the super_block, then raid5d() will waste one cpu at 100% in the dead loop until 'reconfig_mutex' is released.

Following the implementation in raid1 and raid10, fix this problem by skipping issuing IO if MD_SB_CHANGE_PENDING is still set after md_check_recovery(); the daemon thread will be woken up when 'reconfig_mutex' is released. Meanwhile, the hang problem is fixed as well.

Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Reported-and-tested-by: Dan Moulding <dan@danm.net>
Closes: https://lore.kernel.org/all/20240123005700.9302-1-dan@danm.net/
Investigated-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240322081005.1112401-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-03 | md/raid5: fix atomicity violation in raid5_cache_count | Gui-Dong Han | 1 | -6/+8
[ Upstream commit dfd2bf436709b2bccb78c2dda550dde93700efa7 ]

In raid5_cache_count():

    if (conf->max_nr_stripes < conf->min_nr_stripes)
        return 0;
    return conf->max_nr_stripes - conf->min_nr_stripes;

The current check is ineffective, as the values could change immediately after being checked.

In raid5_set_cache_size():

    ...
    conf->min_nr_stripes = size;
    ...
    while (size > conf->max_nr_stripes)
        conf->min_nr_stripes = conf->max_nr_stripes;
    ...

Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The current checks are ineffective as values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow.

This possible bug was found by an experimental static analysis tool developed by our team. This tool analyzes the locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs including data races and atomicity violations. The above possible bug was reported when our tool analyzed the source code of Linux 6.2.

To resolve this issue, it is suggested to introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() fails to resolve atomicity violations, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked. With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we cannot test the patch at runtime, and can only verify it according to the code logic.

Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
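The suggested approach of reading each field once into a local variable can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel patch: `struct cache_conf`, `cache_count()` and the `READ_ONCE_INT()` macro are hypothetical stand-ins (the macro only imitates a single volatile read), and the local names follow the 'min_stripes' and 'max_stripes' mentioned in the message.

```c
/* Imitates a single, non-repeated read of an int (userspace stand-in). */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/* Hypothetical, reduced version of the two fields discussed above. */
struct cache_conf {
	int min_nr_stripes;
	int max_nr_stripes;
};

unsigned long cache_count(struct cache_conf *conf)
{
	/*
	 * Snapshot each field exactly once so that the comparison and the
	 * subtraction below operate on the same pair of values.
	 */
	int max_stripes = READ_ONCE_INT(conf->max_nr_stripes);
	int min_stripes = READ_ONCE_INT(conf->min_nr_stripes);

	if (max_stripes < min_stripes)
		return 0;
	return (unsigned long)(max_stripes - min_stripes);
}
```

Because the check and the subtraction use the same locals, a concurrent writer can no longer make the subtraction see a pair of values that the check never saw.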
2024-01-26 | Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"" | Song Liu | 1 | -0/+12
This reverts commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d.

The original set [1][2] was expected to undo a suboptimal fix in [2], and replace it with a better fix [1]. However, as reported by Dan Moulding, [2] causes an issue with raid5 with a journal device. Revert [2] for now to close the issue. We will follow up on another issue reported by Junxiao Bi, as [2] is expected to fix it. We believe this is a good trade-off, because the latter issue happens less frequently.

In the meanwhile, we will NOT revert [1], as it contains the right logic.

[1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
[2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")

Reported-by: Dan Moulding <dan@danm.net>
Closes: https://lore.kernel.org/linux-raid/20240123005700.9302-1-dan@danm.net/
Fixes: bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Cc: stable@vger.kernel.org # v5.19+
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-20 | Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" | Junxiao Bi | 1 | -12/+0
commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d upstream.

This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.

That commit introduced the following race, which can cause the system to hang:

  md_write_start:                          raid5d:
   // mddev->in_sync == 1
   set "MD_SB_CHANGE_PENDING"
                                            // running before md_write_start wakeup it
                                            waiting "MD_SB_CHANGE_PENDING" cleared
                                            >>>>>>>>> hung
   wakeup mddev->thread
   ...
   waiting "MD_SB_CHANGE_PENDING" cleared
   >>>> hung, raid5d should clear this flag
   but get hung by same flag.

The issue that the reverted commit fixed is fixed in a new way by the last patch.

Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Cc: stable@vger.kernel.org # v5.19+
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231108182216.73611-2-junxiao.bi@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13 | md/raid6: use valid sector values to determine if an I/O should wait on the reshape | David Jeffery | 1 | -2/+2
commit c467e97f079f0019870c314996fae952cc768e82 upstream.

During a reshape of a RAID6 array, such as expanding by adding an additional disk, I/Os to the region of the array which has not yet been reshaped can stall indefinitely. This is caused by errors in the stripe_ahead_of_reshape function, which make md think the I/O is to a region actively undergoing the reshape.

stripe_ahead_of_reshape fails to account for the q disk having a sector value of 0. By not excluding the q disk from the for loop, raid6 will always generate a min_sector value of 0, causing a return value which stalls.

The function's max_sector calculation also uses min() when it should use max(), causing the max_sector value to always be 0. During a backwards rebuild this can cause the opposite problem where it allows I/O to advance when it should wait.

Fixing these errors will allow safe I/O to advance in a timely manner and delay only I/O which is unsafe due to stripes in the middle of undergoing the reshape.

Fixes: 486f60558607 ("md/raid5: Check all disks in a stripe_head for reshape progress")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231128181233.6187-1-djeffery@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
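The two corrections described above (skip both parity disks, and use max() for the upper bound) can be shown with a small userspace sketch. `struct sector_range` and `data_sector_range()` are hypothetical names, and the per-device sector values are passed as a plain array rather than read from a stripe_head.

```c
#include <stdint.h>

/* Hypothetical result type: lowest and highest data-disk sector. */
struct sector_range {
	uint64_t min;
	uint64_t max;
};

/*
 * Compute the sector range covered by the data disks of one stripe.
 * The P and Q slots carry a sector value of 0, so both must be skipped;
 * otherwise min would always collapse to 0.
 */
struct sector_range data_sector_range(const uint64_t *dev_sector, int disks,
				      int pd_idx, int qd_idx)
{
	struct sector_range r = { UINT64_MAX, 0 };

	for (int i = 0; i < disks; i++) {
		if (i == pd_idx || i == qd_idx)
			continue;
		if (dev_sector[i] < r.min)
			r.min = dev_sector[i];
		if (dev_sector[i] > r.max) /* max(), not min() */
			r.max = dev_sector[i];
	}
	return r;
}
```

With data sectors 128 and 256 and both parity slots at 0, the range comes out as 128..256 instead of being dragged down to 0 by the parity slots.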
2023-10-03 | md/raid5: release batch_last before waiting for another stripe_head | David Jeffery | 1 | -0/+7
When raid5_get_active_stripe is called with a ctx containing a stripe_head in its batch_last pointer, it can cause a deadlock if the task sleeps waiting on another stripe_head to become available. The stripe_head held by batch_last can be blocking the advancement of other stripe_heads, leading to no stripe_heads being released so raid5_get_active_stripe waits forever.

Like with the quiesce state handling earlier in the function, batch_last needs to be released by raid5_get_active_stripe before it waits for another stripe_head.

Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231002183422.13047-1-djeffery@redhat.com
2023-08-15 | md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread | Li Lingfeng | 1 | -1/+1
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mddev to md_unregister_thread().

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2023-07-27 | raid5: fix missing io accounting in raid5_align_endio() | Yu Kuai | 1 | -21/+8
Io will only be accounted as done from raid5_align_endio() if the io succeeded, and the io inflight counter will be leaked if such io failed.

Fix this problem by switching to use md_account_bio() for io accounting.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-4-yukuai1@huaweicloud.com
2023-07-27 | md: also clone new io if io accounting is disabled | Yu Kuai | 1 | -9/+9
Currently, 'active_io' is grabbed before make_request() is called, and it's dropped immediately after make_request() returns. Hence 'active_io' actually means io is dispatching, not io is inflight.

For raid0 and raid456, where io accounting is enabled, 'active_io' will also be grabbed when the bio is cloned for io accounting, and this 'active_io' is not dropped until io is done.

Always clone a new bio so that 'active_io' will mean that io is inflight; raid1 and raid10 will switch to this method in later patches. Now that the bio will be cloned even if io accounting is disabled, also rename the related structures from '*_acct_*' to '*_clone_*'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
2023-07-27 | md: move initialization and destruction of 'io_acct_set' to md.c | Yu Kuai | 1 | -30/+11
'io_acct_set' is only used for raid0 and raid456; prepare to use it for raid1 and raid10, so that io accounting from different levels can be consistent.

By the way, follow-up patches will also use this io clone mechanism to make sure 'active_io' represents in-flight io, not io that is dispatching, so that mddev_suspend will wait for io to be done as designed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
2023-06-28 | Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux | Linus Torvalds | 1 | -2/+2
Pull hardening updates from Kees Cook:

"There are three areas of note:

A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes).

The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit df8fc4e934c12b.

The under-development compiler attribute __counted_by has been added so that we can start annotating flexible array members with their associated structure member that tracks the count of flexible array elements at run-time. It is possible (likely?) that the exact syntax of the attribute will change before it is finalized, but GCC and Clang are working together to sort it out. Any changes can be made to the macro while we continue to add annotations.

As an example of that last case, I have a treewide commit waiting with such annotations found via Coccinelle:

  https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b

Also see commit dd06e72e68bcb4 for more details.
Summary:

- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were either Acked by respective maintainers or were trivial changes that went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members"

* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
  netfilter: ipset: Replace strlcpy with strscpy
  uml: Replace strlcpy with strscpy
  um: Use HOST_DIR for mrproper
  kallsyms: Replace all non-returning strlcpy with strscpy
  sh: Replace all non-returning strlcpy with strscpy
  of/flattree: Replace all non-returning strlcpy with strscpy
  sparc64: Replace all non-returning strlcpy with strscpy
  Hexagon: Replace all non-returning strlcpy with strscpy
  kobject: Use return value of strreplace()
  lib/string_helpers: Change returned value of the strreplace()
  jbd2: Avoid printing outside the boundary of the buffer
  checkpatch: Check for 0-length and 1-element arrays
  riscv/purgatory: Do not use fortified string functions
  s390/purgatory: Do not use fortified string functions
  x86/purgatory: Do not use fortified string functions
  acpi: Replace struct acpi_table_slit 1-element array with flex-array
  clocksource: Replace all non-returning strlcpy with strscpy
  string: use __builtin_memcpy() in strlcpy/strlcat
  staging: most: Replace all non-returning strlcpy with strscpy
  drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
  ...
2023-06-26 | Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux | Linus Torvalds | 1 | -8/+60
Pull block updates from Jens Axboe:

- NVMe pull request via Keith:
    - Various cleanups all around (Irvin, Chaitanya, Christophe)
    - Better struct packing (Christophe JAILLET)
    - Reduce controller error logs for optional commands (Keith)
    - Support for >=64KiB block sizes (Daniel Gomez)
    - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)
- bcache updates via Coly:
    - Fix a race at init time (Mingzhe Zou)
    - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks (Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-14 | md/raid5: don't start reshape when recovery or replace is in progress | Yu Kuai | 1 | -0/+8
When recovery is interrupted (reboot, etc.), checking for MD_RECOVERY_RUNNING is not enough to tell whether recovery is in progress. Also check recovery_cp before starting reshape.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
2023-06-14 | md: protect md_thread with rcu | Yu Kuai | 1 | -7/+8
Currently, there are many places where md_thread can be accessed without protection. The following are known scenarios that can cause null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places. This problem could be fixed likewise; however, using a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-14 | md/raid5: fix a deadlock in the case that reshape is interrupted | Yu Kuai | 1 | -1/+43
If reshape is in progress and io across reshape_position is issued, such io will wait for reshape to make progress (see details in the case that make_stripe_request() returns STRIPE_SCHEDULE_AND_RETRY).

It has been reported several times that if the system reboots while growing raid5 to raid6, array assembly will hang indefinitely ([1, 2]). This is because the following deadlock is triggered:

1) a normal io is waiting for reshape to progress; this io can be from systemd-udevd or mdadm.
2) during assembly, mdadm tries to suspend the array, hence 'reconfig_mutex' is held and mddev_suspend() must wait for normal io to be done.
3) the daemon thread can't start reshape because 'reconfig_mutex' can't be held.

1) and 3) are unbreakable because they are part of the foundational design. In order to break 2), the following are the possible solutions that I can think of:

a) Letting mddev_suspend() fail is not a good option, because this will break many scenarios since mddev_suspend() didn't fail before.
b) Fail the io that is waiting for reshape to make progress from mddev_suspend().
c) Return false for the io that is waiting for reshape to make progress from raid5_make_request(), and these io will wait for suspend to be done in md_handle_request(), where 'active_io' is not grabbed.

c) sounds better than b); however, b) is used because it's easy and straightforward, and it's verified that mdadm can assemble in this case. On the other hand, c) breaks the logic that mddev_suspend() will wait for submitted io to be completely handled.

Fix the problem by checking reshape in mddev_suspend(): if reshape can't make progress and there are still some io waiting for reshape, fail those io.

[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/

Reported-by: Jove <jovetoo@gmail.com>
Reported-by: David Gilmour <dgilmour76@gmail.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
2023-06-14md/raid5: don't allow replacement while reshape is in progressYu Kuai1-0/+1
If reshape is interrupted (for example, by echoing frozen to sync_action), then rdev replacement can be set. This is safe at runtime because reshape is always handled before resync in md_check_recovery(). However, if the system reboots, the kernel will complain that it cannot handle concurrent replacement and reshape, and the array can no longer be assembled. Fix this problem by not allowing replacement until reshape is done. Reported-by: Peter Neuwirth <reddunur@online.de> Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-2-yukuai1@huaweicloud.com
2023-05-31md/raid5: Convert stripe_head's "dev" to flexible array memberKees Cook1-2/+2
Replace old-style 1-element array of "dev" in struct stripe_head with modern C99 flexible array. In the future, we can additionally annotate it with the run-time size, found in the "disks" member. Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/lkml/20230522212114.gonna.589-kees@kernel.org/ --- It looks like this memory calculation: memory = conf->min_nr_stripes * (sizeof(struct stripe_head) + max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; ... was already buggy (i.e. it included the single "dev" bytes in the result). However, I'm not entirely sure if that is the right analysis, since "dev" is not related to struct bio nor PAGE_SIZE?
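The size arithmetic behind this conversion can be sketched in userspace C. This is a minimal illustration, not the kernel structures: `demo_dev`, `demo_stripe_old`, `demo_stripe` and the helper names are hypothetical stand-ins. It shows why a one-element trailing array needs `disks - 1` extra elements, why a flexible array member needs `disks`, and how a naive "sizeof plus disks elements" on the old layout counts one element twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-disk element; not the kernel's struct r5dev. */
struct demo_dev {
	long sector;
	unsigned long flags;
};

/* Old style: one-element trailing array, so sizeof() already holds one element. */
struct demo_stripe_old {
	int disks;
	struct demo_dev dev[1];
};

/* New style: C99 flexible array member, so sizeof() covers the header only. */
struct demo_stripe {
	int disks;
	struct demo_dev dev[];
};

/* Exact allocation size for the old layout. */
size_t demo_old_size(size_t disks)
{
	assert(disks >= 1);
	return sizeof(struct demo_stripe_old) + (disks - 1) * sizeof(struct demo_dev);
}

/* Exact allocation size for the new layout. */
size_t demo_new_size(size_t disks)
{
	return sizeof(struct demo_stripe) + disks * sizeof(struct demo_dev);
}

/* Naive formula applied to the old layout: one element is counted twice. */
size_t demo_old_naive_size(size_t disks)
{
	return sizeof(struct demo_stripe_old) + disks * sizeof(struct demo_dev);
}
```

On common ABIs both exact formulas give the same byte count, which is why the conversion does not change the allocation size as long as callers drop the `- 1` adjustment.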
2023-05-24md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk()Yu Kuai1-1/+1
'end_sector' is compared to 'rdev->recovery_offset', which is an offset into the rdev; however, commit e82ed3a4fbb5 ("md/raid6: refactor raid5_read_one_chunk") changed the calculation of 'end_sector' to an offset into the array. Fix this miscalculation. Fixes: e82ed3a4fbb5 ("md/raid6: refactor raid5_read_one_chunk") Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230524014118.3172781-1-yukuai1@huaweicloud.com
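To illustrate why the two offsets must be in the same address space, here is a small sketch under stated assumptions: `demo_sector_t` and `demo_read_is_recovered()` are hypothetical names, and the check is simplified to a single comparison. The point is only that the end of the read has to be computed from the device-relative start sector before it is compared with a device-relative recovery offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_sector_t;

/*
 * dev_sector:      start of the read as an offset into the member device
 * nr_sectors:      length of the read in sectors
 * recovery_offset: how far the member device has been recovered,
 *                  also an offset into the member device
 *
 * Returns true when the whole read lies inside the recovered region.
 */
bool demo_read_is_recovered(demo_sector_t dev_sector,
			    demo_sector_t nr_sectors,
			    demo_sector_t recovery_offset)
{
	demo_sector_t end_sector;

	assert(nr_sectors > 0);
	end_sector = dev_sector + nr_sectors; /* device-relative end */
	return recovery_offset >= end_sector;
}
```

With a device recovered up to sector 2000, a read at device sector 1000 is inside the recovered region. Passing the array-relative sector for the same read (say 3000 on a three-data-disk array) would wrongly report it as not recovered.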
2023-04-28md/raid5: Improve performance for sequential IOJan Kara1-1/+44
Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the order in which requests for underlying disks are created. Since for large sequential IO adding of requests frequently races with md_raid5 thread submitting bios to underlying disks, this results in a change in IO pattern because intermediate states of new order of request creation result in more smaller discontiguous requests. For RAID5 on top of three rotational disks our performance testing revealed this results in regression in write throughput:

iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R

before 7e55c60acfbb:
       KB  reclen   write  rewrite    read  reread
131072000       4  493670   525964  524575  513384
131072000       8  540467   532880  512028  513703

after 7e55c60acfbb:
       KB  reclen   write  rewrite    read  reread
131072000       4  421785   456184  531278  509248
131072000       8  459283   456354  528449  543834

To reduce the amount of discontiguous requests we can start generating requests with the stripe with the lowest chunk offset as that has the best chance of being adjacent to IO queued previously. This improves the performance to:

       KB  reclen   write  rewrite    read  reread
131072000       4  497682   506317  518043  514559
131072000       8  514048   501886  506453  504319

restoring big part of the regression.

Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230417171537.17899-1-jack@suse.cz
2023-04-14md/raid5: remove unused working_disks variableTom Rix1-4/+1
clang with W=1 reports drivers/md/raid5.c:7719:6: error: variable 'working_disks' set but not used [-Werror,-Wunused-but-set-variable] int working_disks = 0; ^ This variable is not used so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230327132324.1769595-1-trix@redhat.com
2022-09-22md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5dLogan Gunthorpe1-0/+12
A complicated deadlock exists when using the journal and an elevated group_thread_cnt. It was found with loop devices, but it's not clear whether it can be seen with real disks. The deadlock can occur simply by writing data with an fio script.

When the deadlock occurs, multiple threads will hang in different ways:

1) The group threads will hang in the blk-wbt code with bios waiting to be submitted to the block layer:

    io_schedule+0x70/0xb0
    rq_qos_wait+0x153/0x210
    wbt_wait+0x115/0x1b0
    io_schedule+0x70/0xb0
    rq_qos_wait+0x153/0x210
    wbt_wait+0x115/0x1b0
    __rq_qos_throttle+0x38/0x60
    blk_mq_submit_bio+0x589/0xcd0
    wbt_wait+0x115/0x1b0
    __rq_qos_throttle+0x38/0x60
    blk_mq_submit_bio+0x589/0xcd0
    __submit_bio+0xe6/0x100
    submit_bio_noacct_nocheck+0x42e/0x470
    submit_bio_noacct+0x4c2/0xbb0
    ops_run_io+0x46b/0x1a30
    handle_stripe+0xcd3/0x36b0
    handle_active_stripes.constprop.0+0x6f6/0xa60
    raid5_do_work+0x177/0x330

Or:

    io_schedule+0x70/0xb0
    rq_qos_wait+0x153/0x210
    wbt_wait+0x115/0x1b0
    __rq_qos_throttle+0x38/0x60
    blk_mq_submit_bio+0x589/0xcd0
    __submit_bio+0xe6/0x100
    submit_bio_noacct_nocheck+0x42e/0x470
    submit_bio_noacct+0x4c2/0xbb0
    flush_deferred_bios+0x136/0x170
    raid5_do_work+0x262/0x330

2) The r5l_reclaim thread will hang in the same way, submitting a bio to the block layer:

    io_schedule+0x70/0xb0
    rq_qos_wait+0x153/0x210
    wbt_wait+0x115/0x1b0
    __rq_qos_throttle+0x38/0x60
    blk_mq_submit_bio+0x589/0xcd0
    __submit_bio+0xe6/0x100
    submit_bio_noacct_nocheck+0x42e/0x470
    submit_bio_noacct+0x4c2/0xbb0
    submit_bio+0x3f/0xf0
    md_super_write+0x12f/0x1b0
    md_update_sb.part.0+0x7c6/0xff0
    md_update_sb+0x30/0x60
    r5l_do_reclaim+0x4f9/0x5e0
    r5l_reclaim_thread+0x69/0x30b

However, before hanging, the MD_SB_CHANGE_PENDING flag will be set for sb_flags in r5l_write_super_and_discard_space(). This flag will never be cleared because the submit_bio() call never returns.

3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe() will do no processing on any pending stripes and re-set STRIPE_HANDLE. This will cause the raid5d thread to enter an infinite loop, constantly trying to handle the same stripes stuck in the queue.

The raid5d thread has a blk_plug that holds a number of bios that are also stuck waiting, since the thread is in a loop that never schedules. These bios have been accounted for by blk-wbt, thus preventing the other threads above from continuing when they try to submit bios. --Deadlock.

To fix this, add the same wait_event() that is used in raid5_do_work() to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will schedule and wait until the flag is cleared. The schedule action will flush the plug, which will allow the r5l_reclaim thread to continue, thus preventing the deadlock.

However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING from the same thread and can thus deadlock if the thread is put to sleep. So avoid waiting if md_check_recovery() is being called in the loop.

It's not clear when the deadlock was introduced, but the similar wait_event() call in raid5_do_work() was added in 2017 by this commit:

    16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata is updated.")

Link: https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()David Sloan1-1/+0
When running chunk-sized reads on disks with badblocks duplicate bio free/puts are observed:

=============================================================================
BUG bio-200 (Not tainted): Object already free
-----------------------------------------------------------------------------
Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
  __slab_alloc.constprop.0+0x5a/0xb0
  kmem_cache_alloc+0x31e/0x330
  mempool_alloc_slab+0x17/0x20
  mempool_alloc+0x100/0x2b0
  bio_alloc_bioset+0x181/0x460
  do_mpage_readpage+0x776/0xd00
  mpage_readahead+0x166/0x320
  blkdev_readahead+0x15/0x20
  read_pages+0x13f/0x5f0
  page_cache_ra_unbounded+0x18d/0x220
  force_page_cache_ra+0x181/0x1c0
  page_cache_sync_ra+0x65/0xb0
  filemap_get_pages+0x1df/0xaf0
  filemap_read+0x1e1/0x700
  blkdev_read_iter+0x1e5/0x330
  vfs_read+0x42a/0x570
Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
  kmem_cache_free+0x46d/0x490
  mempool_free_slab+0x17/0x20
  mempool_free+0x66/0x190
  bio_free+0x78/0x90
  bio_put+0x100/0x1a0
  raid5_make_request+0x2259/0x2450
  md_handle_request+0x402/0x600
  md_submit_bio+0xd9/0x120
  __submit_bio+0x11f/0x1b0
  submit_bio_noacct_nocheck+0x204/0x480
  submit_bio_noacct+0x32e/0xc70
  submit_bio+0x98/0x1a0
  mpage_readahead+0x250/0x320
  blkdev_readahead+0x15/0x20
  read_pages+0x13f/0x5f0
  page_cache_ra_unbounded+0x18d/0x220
Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Workqueue: raid5wq raid5_do_work
Call Trace:
 <TASK>
 dump_stack_lvl+0x5a/0x78
 dump_stack+0x10/0x16
 print_trailer+0x158/0x165
 object_err+0x35/0x50
 free_debug_processing.cold+0xb7/0xbe
 __slab_free+0x1ae/0x330
 kmem_cache_free+0x46d/0x490
 mempool_free_slab+0x17/0x20
 mempool_free+0x66/0x190
 bio_free+0x78/0x90
 bio_put+0x100/0x1a0
 mpage_end_io+0x36/0x150
 bio_endio+0x2fd/0x360
 md_end_io_acct+0x7e/0x90
 bio_endio+0x2fd/0x360
 handle_failed_stripe+0x960/0xb80
 handle_stripe+0x1348/0x3760
 handle_active_stripes.constprop.0+0x72a/0xaf0
 raid5_do_work+0x177/0x330
 process_one_work+0x616/0xb20
 worker_thread+0x2bd/0x6f0
 kthread+0x179/0x1b0
 ret_from_fork+0x22/0x30
 </TASK>

The double free is caused by an unnecessary bio_put() in the if(is_badblock(...)) error path in raid5_read_one_chunk().

The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk"). The previous code checked and freed align_bio which required a bio_put. After the move that is no longer needed as raid_bio is returned to the control of the common io path which performs its own endio resulting in a double free on bad device blocks.

Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
Signed-off-by: David Sloan <david.sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Ensure stripe_fill happens on non-read IO with journalLogan Gunthorpe1-1/+1
When doing degrade/recover tests using the journal a kernel BUG is hit at drivers/md/raid5.c:4381 in handle_parity_checks5(): BUG_ON(!test_bit(R5_UPTODATE, &dev->flags)); This was found to occur because handle_stripe_fill() was skipped for stripes in the journal due to a condition in that function. Thus blocks were not fetched and R5_UPTODATE was not set when the code reached handle_parity_checks5(). To fix this, don't skip handle_stripe_fill() unless the stripe is for read. Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path") Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/ Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-09-22md/raid5: Don't read ->active_stripes if it's not neededLogan Gunthorpe1-3/+2
The atomic_read() is not needed in many cases so only do the read after the first checks are done. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
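The idea of deferring the counter read can be sketched as follows. This is a simplified userspace model, not the kernel code: `demo_conf`, `demo_can_take_inactive()`, `demo_check()` and the three-quarters threshold are assumptions made for illustration. The cheap flag checks run first, and the atomic counter is loaded only on the path that actually compares it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state the check looks at. */
struct demo_conf {
	bool inactive_list_empty;
	bool inactive_blocked;
	atomic_int active_stripes;
	int max_nr_stripes;
};

/* Returns true when a waiter may go ahead and take an inactive stripe. */
bool demo_can_take_inactive(struct demo_conf *conf)
{
	int active;

	assert(conf->max_nr_stripes > 0);

	if (conf->inactive_list_empty)
		return false;		/* nothing to take; counter not needed */
	if (!conf->inactive_blocked)
		return true;		/* not throttled; counter not needed */

	/* Only this path needs the counter, so read it here. */
	active = atomic_load(&conf->active_stripes);
	return active < conf->max_nr_stripes * 3 / 4;
}

/* Small helper so the check can be exercised with plain values. */
bool demo_check(bool list_empty, bool blocked, int active, int max_nr_stripes)
{
	struct demo_conf conf = {
		.inactive_list_empty = list_empty,
		.inactive_blocked = blocked,
		.max_nr_stripes = max_nr_stripes,
	};

	atomic_init(&conf.active_stripes, active);
	return demo_can_take_inactive(&conf);
}
```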
2022-09-22md/raid5: Cleanup prototype of raid5_get_active_stripe()Logan Gunthorpe1-23/+26
Drop the three bools in the prototype of raid5_get_active_stripe() and replace them with a flags parameter. At the same time, drop the distinction with __raid5_get_active_stripe(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
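As a rough illustration of replacing several bool parameters with one flags parameter, consider the sketch below. The `DEMO_GAS_*` names and the `demo_request` structure are hypothetical and are not claimed to match the kernel's definitions. A single bitmask keeps call sites readable, because each option is named where it is passed instead of appearing as an anonymous `true` or `false`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names, one bit per former bool argument. */
enum demo_get_flags {
	DEMO_GAS_PREVIOUS  = 1 << 0,	/* use the previous geometry */
	DEMO_GAS_NOBLOCK   = 1 << 1,	/* do not sleep waiting for a stripe */
	DEMO_GAS_NOQUIESCE = 1 << 2,	/* do not wait for quiesce to end */
};

#define DEMO_GAS_ALL \
	(DEMO_GAS_PREVIOUS | DEMO_GAS_NOBLOCK | DEMO_GAS_NOQUIESCE)

/* The three options as the callee sees them after decoding. */
struct demo_request {
	bool previous;
	bool noblock;
	bool noquiesce;
};

/* Decode a flags word into the individual options. */
struct demo_request demo_decode_flags(unsigned int flags)
{
	struct demo_request r;

	assert((flags & ~(unsigned int)DEMO_GAS_ALL) == 0);
	r.previous = (flags & DEMO_GAS_PREVIOUS) != 0;
	r.noblock = (flags & DEMO_GAS_NOBLOCK) != 0;
	r.noquiesce = (flags & DEMO_GAS_NOQUIESCE) != 0;
	return r;
}
```

A call such as `demo_decode_flags(DEMO_GAS_PREVIOUS | DEMO_GAS_NOQUIESCE)` documents itself, whereas `f(sector, true, false, true)` forces the reader to look up the parameter order.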
2022-09-22md/raid5: Refactor raid5_get_active_stripe()Logan Gunthorpe1-41/+41
Refactor raid5_get_active_stripe() without the gotos with an explicit infinite loop and some additional nesting. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-08-06Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. 
- mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-03drivers:md:fix a potential use-after-free bugWentao_Liang1-1/+1
In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and may cause sh to be released. However, sh is subsequently used in line 2886, "if (sh->batch_head && sh != sh->batch_head)". This may result in a use-after-free bug. It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of the function. Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
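The ordering rule behind this fix can be shown with a tiny reference-counted object. This is a userspace sketch with hypothetical names (`demo_stripe`, `demo_stripe_put()`, `demo_finish()`), in which the last put frees the object outright; it is not the kernel's stripe lifecycle. Every read of the object happens before the reference is dropped, so nothing touches it after it may have been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object with a pointer to its batch head. */
struct demo_stripe {
	int refcount;
	struct demo_stripe *batch_head;
};

struct demo_stripe *demo_stripe_new(void)
{
	struct demo_stripe *sh = calloc(1, sizeof(*sh));

	if (sh)
		sh->refcount = 1;
	return sh;
}

/* Drop one reference; the object is freed when the last one goes away. */
void demo_stripe_put(struct demo_stripe *sh)
{
	if (--sh->refcount == 0)
		free(sh);
}

/*
 * Fixed ordering: read everything needed from sh first, and drop the
 * reference as the very last step of the function.
 */
bool demo_finish(struct demo_stripe *sh)
{
	bool in_batch = sh->batch_head && sh != sh->batch_head;

	demo_stripe_put(sh);	/* sh must not be used after this line */
	return in_batch;
}

/* Build a stripe, optionally attach it to a batch head, and finish it. */
bool demo_run(bool batched)
{
	struct demo_stripe *head = demo_stripe_new();
	struct demo_stripe *sh = demo_stripe_new();
	bool result;

	if (!head || !sh) {
		free(head);
		free(sh);
		return false;
	}
	sh->batch_head = batched ? head : NULL;
	result = demo_finish(sh);
	demo_stripe_put(head);
	return result;
}
```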
2022-08-03md/raid5: Ensure batch_last is released before sleeping for quiesceLogan Gunthorpe1-8/+28
A race condition exists where if raid5_quiesce() is called in the middle of a request that has set batch_last, it will deadlock. batch_last will hold a reference to a stripe when raid5_quiesce() is called. This will cause the next raid5_get_active_stripe() call to sleep waiting for the quiesce to finish, but the raid5_quiesce() thread will wait for active_stripes to go to zero, which will never happen because the request thread is waiting for the quiesce to stop. Fix this by creating a special __raid5_get_active_stripe() function which takes the request context and clears batch_last before sleeping. While we're at it, change the arguments of raid5_get_active_stripe() to bools. Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch") Reported-by: David Sloan <David.Sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Move stripe_request_ctx upLogan Gunthorpe1-27/+27
Move stripe_request_ctx up. No functional changes intended. This will be necessary in the next patch to release the batch_last in the context before sleeping. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()Logan Gunthorpe1-1/+0
Now that raid5_get_active_stripe() has been refactored it is apparent that r5c_check_stripe_cache_usage() doesn't need to be called in the wait_for_stripe branch. r5c_check_stripe_cache_usage() will only conditionally call r5l_wake_reclaim(), but that function is called two lines later. Drop the call for cleanup. Reported-by: Martin Oliveira <martin.oliveira@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Make is_inactive_blocked() helperLogan Gunthorpe1-5/+19
The logic to wait_for_stripe is difficult to parse being on so many lines and with confusing operator precedence. Move it to a helper function to make it easier to read. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Refactor raid5_get_active_stripe()Logan Gunthorpe1-31/+36
Refactor the raid5_get_active_stripe() to read more linearly in the order it's typically executed. The init_stripe() call is called if a free stripe is found and the function is exited early which removes a lot of if (sh) checks and unindents the following code. Remove the while loop in favour of the 'goto retry' pattern, which reduces indentation further. And use a 'goto wait_for_stripe' instead of an additional indent seeing it is the unusual path and this makes the code easier to read. No functional changes intended. Will make subsequent changes in patches easier to understand. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03raid5: fix duplicate checks for rdev->saved_raid_diskJackie Liu1-2/+1
'first' will always be greater than or equal to 0, it is unnecessary to repeat the 0 check, clean it up. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Convert prepare_to_wait() to wait_woken() apiLogan Gunthorpe1-7/+6
raid5_get_active_stripe() can sleep in various situations and it is called by make_stripe_request() while inside the prepare_to_wait()/finish_wait() section. Nested waits like this are not supported. This was noticed while making other changes that add different sleeps to raid5_get_active_stripe() that caused a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP. No ill effects have been noticed with the code as is, but theoretically a nested wait here could cause a deadlock, so it should be fixed. To fix this, convert the prepare_to_wait() call to use wait_woken(), which supports nested sleeps. Link: https://lwn.net/Articles/628628/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Fix sectors_to_do bitmap overflow in raid5_make_request()Logan Gunthorpe1-8/+11
For unaligned IO that have nearly maximum sectors, the number of stripes will end up being one greater than the size of the bitmap. When this happens, the last stripe in the IO will not be processed as it should be, resulting in data corruption. However, this is not normally seen when the backing block devices have 4K physical block sizes since the block layer will split the request before that happens. To fix this increase the bitmap size by one bit and ensure the full number of stripes are checked when calling find_first_bit(). Reported-by: David Sloan <David.Sloan@eideticom.com> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
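The off-by-one can be checked with a few lines of arithmetic. The sketch below assumes a stripe unit of 8 sectors (4 KiB in 512-byte sectors) and the 256-stripe request limit introduced by the pivot commit; the `DEMO_*` names and `demo_stripes_touched()` are hypothetical. An aligned request of 2048 sectors touches 256 stripes, but the same length starting one sector later straddles 257, which is one more than a 256-bit bitmap can track.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_STRIPE_SECTORS  8u		/* assumed: 4 KiB stripe unit */
#define DEMO_MAX_REQ_STRIPES 256u	/* assumed request limit in stripes */

/* Number of stripe units a request overlaps, counting partial ones. */
unsigned int demo_stripes_touched(uint64_t first_sector, unsigned int nr_sectors)
{
	uint64_t first, last;

	assert(nr_sectors > 0);
	first = first_sector / DEMO_STRIPE_SECTORS;
	last = (first_sector + nr_sectors - 1) / DEMO_STRIPE_SECTORS;
	return (unsigned int)(last - first + 1);
}
```

Sizing the bitmap as `DEMO_MAX_REQ_STRIPES + 1` bits, as the fix describes, leaves room for the extra partial stripe at the end of an unaligned request.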
2022-08-03md/raid5: Increase restriction on max segments per requestLogan Gunthorpe1-0/+3
The block layer defaults the maximum segments to 128, which means requests tend to get split around the 512KB depending on how many pages can be merged. There's no such restriction in the raid5 code so increase the limit to USHRT_MAX so that larger requests can be sent as one. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Improve debug printsLogan Gunthorpe1-2/+6
Add a debug print for raid5_make_request() so that each request is printed and add the logical sector number to the debug print in __add_stripe_bio(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Pivot raid5_make_request()Logan Gunthorpe1-6/+83
raid5_make_request() loops through every page in the request, finds the appropriate stripe and adds the bio for that page in the disk. This causes a great deal of contention on the hash_lock and extra work seeing each stripe must be found once for every data disk. The number of times a stripe must be found can be reduced by pivoting raid5_make_request() so that it loops through every stripe and then loops through every disk in that stripe to see if the bio must be added. This reduces the number of times the hash lock must be taken by a factor equal to the number of data disks. To accomplish this, the logical sectors that have already been added must be tracked. Tracking them is done with a bitmap: the bits for all pages are set at the start of the request and each bit is cleared once the bio is added to a stripe. Finding the next sector to be done is then just a call to find_first_bit() so that sectors that have been done can simply be skipped. One minor downside is that the maximum sectors for a request must be limited so that the bitmap can be appropriately sized on the stack. This limit is arbitrarily chosen to be 256 stripe pages which works out to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually restrict the maximum request further seeing the default block queue settings are used which restricts the number of segments to 128 (which results in request sizes that are approximately 512KB). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
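The bitmap bookkeeping described above can be modelled in userspace as follows. This is a sketch only: the bit helpers are simple reimplementations with `demo_` prefixes, not the kernel's bitmap API, and the per-stripe work is reduced to a comment. All bits are set up front, the lowest set bit selects the next stripe, and the bit is cleared once that stripe has been handled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MAX_STRIPES   257
#define DEMO_LONGS \
	((DEMO_MAX_STRIPES + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

static void demo_set_bit(size_t nr, unsigned long *map)
{
	map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

static void demo_clear_bit(size_t nr, unsigned long *map)
{
	map[nr / DEMO_BITS_PER_LONG] &= ~(1UL << (nr % DEMO_BITS_PER_LONG));
}

/* Index of the lowest set bit below nbits, or nbits when none is set. */
static size_t demo_find_first_bit(const unsigned long *map, size_t nbits)
{
	size_t i;

	for (i = 0; i < nbits; i++)
		if (map[i / DEMO_BITS_PER_LONG] & (1UL << (i % DEMO_BITS_PER_LONG)))
			return i;
	return nbits;
}

/* Walk a request stripe by stripe; returns how many stripes were handled. */
size_t demo_process_request(size_t nr_stripes)
{
	unsigned long todo[DEMO_LONGS] = { 0 };
	size_t handled = 0;
	size_t s;

	assert(nr_stripes <= DEMO_MAX_STRIPES);

	for (s = 0; s < nr_stripes; s++)
		demo_set_bit(s, todo);

	while ((s = demo_find_first_bit(todo, nr_stripes)) < nr_stripes) {
		/* The bios for every disk of stripe s would be added here. */
		demo_clear_bit(s, todo);
		handled++;
	}
	return handled;
}
```

A stripe that cannot be handled yet would simply keep its bit set, so a later pass finds it again without any extra bookkeeping.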
2022-08-03md/raid5: Check all disks in a stripe_head for reshape progressLogan Gunthorpe1-14/+39
When testing if a previous stripe has had reshape expand past it, use the earliest or latest logical sector in all the disks for that stripe head. This will allow adding multiple disks at a time in a subsequent patch. To do this more cleanly, refactor the check into a helper function called stripe_ahead_of_reshape(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Refactor add_stripe_bio()Logan Gunthorpe1-30/+56
Factor out two helper functions from add_stripe_bio(): one to check for overlap (stripe_bio_overlaps()), and one to actually add the bio to the stripe (__add_stripe_bio()). The latter function will always succeed. This will be useful in the next patch so that overlap can be checked for multiple disks before adding any Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
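The value of splitting "check" from "add" is that a caller can verify every disk before changing any of them. The sketch below is a simplified model under stated assumptions: the `demo_` types and functions are hypothetical, ranges are half-open sector intervals, and the same range is applied to every device, which the real code does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_BIOS 8

struct demo_range {		/* half-open interval [start, end) */
	uint64_t start;
	uint64_t end;
};

struct demo_dev {
	struct demo_range queued[DEMO_MAX_BIOS];
	size_t count;
};

/* Pure check: does r overlap anything already queued on this device? */
bool demo_bio_overlaps(const struct demo_dev *dev, struct demo_range r)
{
	size_t i;

	for (i = 0; i < dev->count; i++)
		if (r.start < dev->queued[i].end && dev->queued[i].start < r.end)
			return true;
	return false;
}

/* Unconditional add: the caller has already ruled out overlap. */
void demo_add_bio(struct demo_dev *dev, struct demo_range r)
{
	assert(dev->count < DEMO_MAX_BIOS);
	dev->queued[dev->count++] = r;
}

/* Check every device first, then add to all of them or to none. */
bool demo_add_to_all(struct demo_dev *devs, size_t ndevs, struct demo_range r)
{
	size_t i;

	for (i = 0; i < ndevs; i++)
		if (demo_bio_overlaps(&devs[i], r))
			return false;
	for (i = 0; i < ndevs; i++)
		demo_add_bio(&devs[i], r);
	return true;
}

/* Returns dev0 count * 10 + dev1 count, or -1 on an unexpected result. */
int demo_scenario(void)
{
	struct demo_dev devs[2] = { { .count = 0 }, { .count = 0 } };
	struct demo_range busy = { 0, 8 };
	struct demo_range clash = { 4, 12 };
	struct demo_range ok = { 8, 16 };

	demo_add_bio(&devs[1], busy);
	if (demo_add_to_all(devs, 2, clash))
		return -1;	/* overlap on dev1 must reject the whole add */
	if (!demo_add_to_all(devs, 2, ok))
		return -1;
	return (int)(devs[0].count * 10 + devs[1].count);
}
```

Because the rejected range is never partially queued, the first device stays empty after the failed attempt, which is the property a combined check-and-add helper could not guarantee.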
2022-08-03md/raid5: Keep a reference to last stripe_head for batchLogan Gunthorpe1-12/+40
When batching, every stripe head has to find the previous stripe head to add to the batch list. This involves taking the hash lock which is highly contended during IO. Instead of finding the previous stripe_head each time, store a reference to the previous stripe_head in a pointer so that it doesn't require taking the contended lock another time. The reference to the previous stripe must be released before scheduling and waiting for work to get done. Otherwise, it can hold up raid5_activate_delayed() and deadlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03md/raid5: Refactor for loop in raid5_make_request() into while loopLogan Gunthorpe1-4/+5
The for loop with retry label can be more cleanly expressed as a while loop by moving the logical_sector increment into the success path. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
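The equivalence of the two loop shapes can be demonstrated with a toy operation that fails once on some sectors. Everything here is hypothetical (`demo_state`, `demo_try()`, a step of one sector); the point is only that moving the increment into the success path lets `continue` express the retry that previously needed a label and a `goto`.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct demo_state {
	unsigned int attempts;		/* total calls to demo_try() */
	unsigned int last_failed;	/* sector that failed most recently */
};

/* Fails the first attempt at every third sector to force one retry. */
static bool demo_try(struct demo_state *st, unsigned int sector)
{
	st->attempts++;
	if (sector % 3 == 0 && st->last_failed != sector) {
		st->last_failed = sector;
		return false;
	}
	return true;
}

/* Original shape: for loop with a retry label inside the body. */
unsigned int demo_for_with_retry(unsigned int first, unsigned int last)
{
	struct demo_state st = { 0, UINT_MAX };
	unsigned int sector;

	assert(first <= last);
	for (sector = first; sector < last; sector++) {
retry:
		if (!demo_try(&st, sector))
			goto retry;
	}
	return st.attempts;
}

/* Refactored shape: while loop that advances only on success. */
unsigned int demo_while(unsigned int first, unsigned int last)
{
	struct demo_state st = { 0, UINT_MAX };
	unsigned int sector = first;

	assert(first <= last);
	while (sector < last) {
		if (!demo_try(&st, sector))
			continue;	/* retry the same sector */
		sector++;		/* success path: move on */
	}
	return st.attempts;
}
```

For the range 0 to 6, sectors 0 and 3 each fail once, so both versions make eight attempts in total.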