summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2024-05-09dm-delay: fix hung task introduced by kthread modeJoel Colledge1-2/+1
If the worker thread is not woken due to a bio, then it is not woken at all. This causes the hung task check to trigger. This occurs, for instance, when no bios are submitted. Also when a delay of 0 is configured, delay_bio() returns without waking the worker. Prevent the hung task check from triggering by creating the thread with kthread_run() instead of using kthread_create() directly. Fixes: 70bbeb29fab0 ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Joel Colledge <joel.colledge@linbit.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-05-09dm-delay: fix workqueue delay_timer raceBenjamin Marzinski1-2/+4
delay_timer could be pending when delay_dtr() is called. It needs to be shut down before kdelayd_wq is destroyed, so it won't try queueing more work to kdelayd_wq while that's getting destroyed. Also the del_timer_sync() call in delay_presuspend() doesn't protect against the timer getting immediately rearmed by the queued call to flush_delayed_bios(), but there's no real harm if that does happen. timer_delete() is less work, and is basically just as likely to stop a pointless call to flush_delayed_bios(). Fixes: 26b9f228703f ("dm: delay target") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-05-09bcache: fix variable length array abuse in btree_iterMatthew Mirvish6-59/+70
btree_iter is used in two ways: either allocated on the stack with a fixed size MAX_BSETS, or from a mempool with a dynamic size based on the specific cache set. Previously, the struct had a fixed-length array of size MAX_BSETS which was indexed out-of-bounds for the dynamically-sized iterators, which causes UBSAN to complain. This patch uses the same approach as in bcachefs's sort_iter and splits the iterator into a btree_iter with a flexible array member and a btree_iter_stack which embeds a btree_iter as well as a fixed-length data array. Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039368 Signed-off-by: Matthew Mirvish <matthew@mm12.xyz> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-09bcache: Remove usage of the deprecated ida_simple_xx() APIChristophe JAILLET1-5/+5
ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_max() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07md: Revert "md: Fix overflow in is_mddev_idle"Li Nan2-6/+5
This reverts commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro1-1/+1
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)Al Viro2-3/+3
going to be faster, actually - shift is cheaper than dereference... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-9-viro@zeniv.linux.org.uk Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03Merge tag 'md-6.10-20240502' of ↵Jens Axboe1-3/+3
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fix from Song: "This fixes an issue observed with dm-raid." * tag 'md-6.10-20240502' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix resync softlockup when bitmap size is less than array size
2024-05-03md: fix resync softlockup when bitmap size is less than array sizeYu Kuai1-3/+3
Is is reported that for dm-raid10, lvextend + lvchange --syncaction will trigger following softlockup: kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1 RIP: 0010:_raw_spin_unlock_irq+0x13/0x30 Call Trace: <TASK> md_bitmap_start_sync+0x6b/0xf0 raid10_sync_request+0x25c/0x1b40 [raid10] md_do_sync+0x64b/0x1020 md_thread+0xa7/0x170 kthread+0xcf/0x100 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1a/0x30 And the detailed process is as follows: md_do_sync j = mddev->resync_min while (j < max_sectors) sectors = raid10_sync_request(mddev, j, &skipped) if (!md_bitmap_start_sync(..., &sync_blocks)) // md_bitmap_start_sync set sync_blocks to 0 return sync_blocks + sectors_skippe; // sectors = 0; j += sectors; // j never change Root cause is that commit 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") return early from md_bitmap_get_counter(), without setting returned blocks. Fix this problem by always set returned blocks from md_bitmap_get_counter"(), as it used to be. Noted that this patch just fix the softlockup problem in kernel, the case that bitmap size doesn't match array size still need to be fixed. Fixes: 301867b1c168 ("md/raid10: check slab-out-of-bounds in md_bitmap_get_counter") Reported-and-tested-by: Nigel Croxon <ncroxon@redhat.com> Closes: https://lore.kernel.org/all/71ba5272-ab07-43ba-8232-d2da642acb4e@redhat.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240422065824.2516-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-05-03bcache_register(): don't bother with set_blocksize()Al Viro1-4/+0
We are not using __bread() anymore and read_cache_page_gfp() doesn't care about block size. Moreover, we should *not* change block size on a device that is currently held exclusive - filesystems that use buffer cache expect the block numbers to be interpreted in units set by filesystem. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> ACKed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-3/+9
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01dm: Check that a zoned table leads to a valid mapped deviceDamien Le Moal2-1/+59
Using targets such as dm-linear, a mapped device can be created to contain only conventional zones. Such device should not be treated as zoned as it does not contain any mandatory sequential write required zone. Since such device can be randomly written, we can modify dm_set_zones_restrictions() to set the mapped device zoned queue limit to false to expose it as a regular block device. The function dm_check_zoned() does this after counting the number of conventional zones of the mapped device and comparing it to the total number of zones reported. The special dm_check_zoned_cb() report zones callback function is used to count conventional zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Link: https://lore.kernel.org/r/20240501110907.96950-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-26Merge tag 'for-6.9/dm-fixes-3' of ↵Linus Torvalds2-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix 6.9 regression so that DM device removal is performed synchronously by default. Asynchronous removal has always been possible but it isn't the default. It is important that synchronous removal be preserved, otherwise it is an interface change that breaks lvm2. - Remove errant semicolon in drivers/md/dm-vdo/murmurhash3.c * tag 'for-6.9/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: restore synchronous close of device mapper block device dm vdo murmurhash: remove unneeded semicolon
2024-04-25Merge tag 'md-6.10-20240425' of ↵Jens Axboe3-18/+16
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.10/block Pull MD fixes from Song: "These changes contain various fixes by Yu Kuai, Li Nan, and Florian-Ewald Mueller." * tag 'md-6.10-20240425' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: don't account sync_io if iostats of the disk is disabled md: Fix overflow in is_mddev_idle md: add check for sleepers in md_wakeup_thread() md/raid5: fix deadlock that raid5d() wait for itself to clear MD_SB_CHANGE_PENDING
2024-04-23dm-crypt: don't set WQ_CPU_INTENSIVE for WQ_UNBOUND crypt_queueMike Snitzer1-1/+5
Fix crypt_queue's use of WQ_UNBOUND to _not_ use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-23dm: use queue_limits_setChristoph Hellwig1-15/+12
Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-23dm-crypt: stop constraining max_segment_size to PAGE_SIZEMike Snitzer1-10/+2
This change effectively reverts commit 586b286b110e ("dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE") and relies on block core's late bio-splitting to ensure that dm-crypt's encryption bios are split accordingly if they exceed the underlying device's limits (e.g. max_segment_size). Commit 586b286b110e was applied as a 4.3 fix for the benefit of stable@ kernels 4.0+ just after block core's late bio-splitting was introduced in 4.3 with commit 54efd50bfd873 ("block: make generic_make_request handle arbitrarily sized bios"). Given block core's late bio-splitting it is past time that dm-crypt make use of it. Also, given the recent need to revert meaningful progress that was attempted during the 6.9 merge window (see commit bff4b74625fe Revert "dm: use queue_limits_set") this change allows DM core to safely make use of queue_limits_set() without risk of breaking dm-crypt on NVMe. Though it should be noted this commit isn't a prereq for reinstating DM core's use of queue_limits_set() because blk_validate_limits() was made less strict with commit b561ea56a264 ("block: allow device to have both virt_boundary_mask and max segment size"). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h 386f4a737964 ("trace: events: cleanup deprecated strncpy uses") a4833e3abae1 ("SUNRPC: Fix rpcgss_context trace event acceptor field") Adjacent changes: drivers/net/ethernet/intel/ice/ice_tc_lib.c 2cca35f5dd78 ("ice: Fix checking for unsupported keys on non-tunnel device") 784feaa65dfd ("ice: Add support for PFCP hardware offload in switchdev") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17block: Simplify blk_revalidate_disk_zones() interfaceDamien Le Moal1-1/+1
The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17dm: Use the block layer zone append emulationDamien Le Moal4-459/+93
For targets requiring zone append operation emulation with regular writes (e.g. dm-crypt), we can use the block layer emulation provided by zone write plugging. Remove DM implemented zone append emulation and enable the block layer one. This is done by setting the max_zone_append_sectors limit of the mapped device queue to 0 for mapped devices that have a target table that cannot support native zone append operations (e.g. dm-crypt). Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag. dm_split_and_process_bio() is modified to execute blk_zone_write_plug_bio() for such device to let the block layer transform zone append operations into regular writes. This is done after ensuring that the submitted BIO is split if it straddles zone boundaries. Both changes are implemented unsing the inline helpers dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively. dm_revalidate_zones() is also modified to use the block layer provided function blk_revalidate_disk_zones() so that all zone resources needed for zone append emulation are initialized by the block layer without DM core needing to do anything. Since the device table is not yet live when dm_revalidate_zones() is executed, enabling the use of blk_revalidate_disk_zones() requires adding a pointer to the device table in struct mapped_device. This avoids errors in dm_blk_report_zones() trying to get the table with dm_get_live_table(). The mapped device table pointer is set to the table passed as argument to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and reset to NULL after this function returns to restore the live table handling for user call of report zones. All the code related to zone append emulation is removed from dm-zone.c. This leads to simplifications of the functions __map_bio() and dm_zone_endio(). This later function now only needs to deal with completions of real zone append operations for targets that support it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-13-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-16dm-crypt: export sysfs of all workqueuesyangerkun1-7/+21
Once there is a heavy IO load, so many encrypt/decrypt work will occupy all of the cpu, which may lead to the poor performance for other service. So the improved visibility and controls over dm-crypt workqueues, as was offered with commit a2b8b2d97567 ("dm crypt: export sysfs of kcryptd workqueue"), seems necessary. By exporting dm-crypt's workqueues in sysfs, the entry like cpumask/max_active and so on can help us to limit the CPU usage for encrypt/decrypt work. However, commit a2b8b2d97567 did not consider that DM table reload will call .ctr before .dtr, so the reload for dm-crypt failed because the same sysfs name was present. This was the original need for commit 48b0777cd93d ("Revert "dm crypt: export sysfs of kcryptd workqueue""). Reintroduce the use of WQ_SYSFS, and use it for both the IO and crypt workqueues, but make the workqueue names include a unique id (via ida) to allow both old and new sysfs entries to coexist. Signed-off-by: yangerkun <yangerkun@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-16dm-crypt: add the optional "high_priority" flagMikulas Patocka1-10/+25
When WQ_HIGHPRI was used for the dm-crypt kcryptd workqueue it was reported that dm-crypt performs badly when the system is loaded[1]. Because of reports of audio skipping, dm-crypt stopped using WQ_HIGHPRI with commit f612b2132db5 (Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"). But it has since been determined that WQ_HIGHPRI provides improved performance (with reduced latency) for highend systems with much more resources than those laptop/desktop users which suffered from the use of WQ_HIGHPRI. As such, add an option "high_priority" that allows the use of WQ_HIGHPRI for dm-crypt's workqueues and also sets the write_thread to nice level MIN_NICE (-20). This commit makes it optional, so that normal users won't be harmed by it. [1] https://listman.redhat.com/archives/dm-devel/2023-February/053410.html Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-16dm: restore synchronous close of device mapper block deviceMing Lei1-2/+8
'dmsetup remove' and 'dmsetup remove_all' require synchronous bdev release. Otherwise dm_lock_for_deletion() may return -EBUSY if the open count is > 0, because the open count is dropped in dm_blk_close() which occurs after fput() completes. So if dm_blk_close() is delayed because of asynchronous fput(), this device mapper device is skipped during remove, which is a regression. Fix the issue by using __fput_sync(). Also, DM device removal has long supported being made asynchronous by setting the DMF_DEFERRED_REMOVE flag on the DM device. So leverage using async fput() in close_table_device() if DMF_DEFERRED_REMOVE flag is set. Reported-by: Zhong Changhui <czhong@redhat.com> Fixes: a28d893eb327 ("md: port block device access to file") Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> [snitzer: editted commit header, use fput() if DMF_DEFERRED_REMOVE set] Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-10dm vdo murmurhash: remove unneeded semicolonMatthew Sakai1-1/+1
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404050327.4ebVLBD3-lkp@intel.com/ Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-04-09md: don't account sync_io if iostats of the disk is disabledLi Nan2-1/+6
If iostats is disabled, disk_stats will not be updated and part_stat_read_accum() only returns a constant value. In this case, continuing to count sync_io and to check is_mddev_idle() is no longer meaningful. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-04-09md: Fix overflow in is_mddev_idleLi Nan2-5/+6
UBSAN reports this problem: UBSAN: Undefined behaviour in drivers/md/md.c:8175:15 signed integer overflow: -2147483291 - 2072033152 cannot be represented in type 'int' Call trace: dump_backtrace+0x0/0x310 show_stack+0x28/0x38 dump_stack+0xec/0x15c ubsan_epilogue+0x18/0x84 handle_overflow+0x14c/0x19c __ubsan_handle_sub_overflow+0x34/0x44 is_mddev_idle+0x338/0x3d8 md_do_sync+0x1bb8/0x1cf8 md_thread+0x220/0x288 kthread+0x1d8/0x1e0 ret_from_fork+0x10/0x18 'curr_events' will overflow when stat accum or 'sync_io' is greater than INT_MAX. Fix it by changing sync_io, last_events and curr_events to 64bit. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240117031946.2324519-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-04-09md: add check for sleepers in md_wakeup_thread()Florian-Ewald Mueller1-1/+2
Check for sleeping thread before attempting its wake_up in md_wakeup_thread() to avoid unnecessary spinlock contention. With a 6.1 kernel, fio random read/write tests on many (>= 100) virtual volumes, of 100 GiB each, on 3 md-raid5s on 8 SSDs each (building a raid50), show by 3 to 4 % improved IOPS performance. Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20240327114022.74634-1-jinpu.wang@ionos.com Signed-off-by: Song Liu <song@kernel.org>
2024-04-09md/raid5: fix deadlock that raid5d() wait for itself to clear ↵Yu Kuai1-12/+3
MD_SB_CHANGE_PENDING Xiao reported that lvm2 test lvconvert-raid-takeover.sh can hang with small possibility, the root cause is exactly the same as commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") However, Dan reported another hang after that, and junxiao investigated the problem and found out that this is caused by plugged bio can't issue from raid5d(). Current implementation in raid5d() has a weird dependence: 1) md_check_recovery() from raid5d() must hold 'reconfig_mutex' to clear MD_SB_CHANGE_PENDING; 2) raid5d() handles IO in a deadloop, until all IO are issued; 3) IO from raid5d() must wait for MD_SB_CHANGE_PENDING to be cleared; This behaviour is introduce before v2.6, and for consequence, if other context hold 'reconfig_mutex', and md_check_recovery() can't update super_block, then raid5d() will waste one cpu 100% by the deadloop, until 'reconfig_mutex' is released. Refer to the implementation from raid1 and raid10, fix this problem by skipping issue IO if MD_SB_CHANGE_PENDING is still set after md_check_recovery(), daemon thread will be woken up when 'reconfig_mutex' is released. Meanwhile, the hang problem will be fixed as well. Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Reported-and-tested-by: Dan Moulding <dan@danm.net> Closes: https://lore.kernel.org/all/20240123005700.9302-1-dan@danm.net/ Investigated-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240322081005.1112401-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-04-09Merge tag 'md-6.9-20240408' of ↵Jens Axboe1-1/+1
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.9 Pull MD fix from Song: "This change, by Yu Kuai, fixes a UAF in a corner case." * tag 'md-6.9-20240408' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: raid1: fix use-after-free for original bio in raid1_write_request()
2024-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-26/+9
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/ip_gre.c 17af420545a7 ("erspan: make sure erspan_base_hdr is present in skb->head") 5832c4a77d69 ("ip_tunnel: convert __be16 tunnel flags to bitmaps") https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/ Adjacent changes: net/ipv6/ip6_fib.c d21d40605bca ("ipv6: Fix infinite recursion in fib6_dump_done().") 5fc68320c1fb ("ipv6: remove RTNL protection from inet6_dump_fib()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-01dm: use bio_list_merge_initChristoph Hellwig8-36/+17
Use bio_list_merge_init instead of open coding bio_list_merge and bio_list_init. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240328084147.2954434-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01bitmap: introduce generic optimized bitmap_size()Alexander Lobakin1-5/+0
The number of times yet another open coded `BITS_TO_LONGS(nbits) * sizeof(long)` can be spotted is huge. Some generic helper is long overdue. Add one, bitmap_size(), but with one detail. BITS_TO_LONGS() uses DIV_ROUND_UP(). The latter works well when both divident and divisor are compile-time constants or when the divisor is not a pow-of-2. When it is however, the compilers sometimes tend to generate suboptimal code (GCC 13): 48 83 c0 3f add $0x3f,%rax 48 c1 e8 06 shr $0x6,%rax 48 8d 14 c5 00 00 00 00 lea 0x0(,%rax,8),%rdx %BITS_PER_LONG is always a pow-2 (either 32 or 64), but GCC still does full division of `nbits + 63` by it and then multiplication by 8. Instead of BITS_TO_LONGS(), use ALIGN() and then divide by 8. GCC: 8d 50 3f lea 0x3f(%rax),%edx c1 ea 03 shr $0x3,%edx 81 e2 f8 ff ff 1f and $0x1ffffff8,%edx Now it shifts `nbits + 63` by 3 positions (IOW performs fast division by 8) and then masks bits[2:0]. bloat-o-meter: add/remove: 0/0 grow/shrink: 20/133 up/down: 156/-773 (-617) Clang does it better and generates the same code before/after starting from -O1, except that with the ALIGN() approach it uses %edx and thus still saves some bytes: add/remove: 0/0 grow/shrink: 9/133 up/down: 18/-538 (-520) Note that we can't expand DIV_ROUND_UP() by adding a check and using this approach there, as it's used in array declarations where expressions are not allowed. Add this helper to tools/ as well. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29dm integrity: fix out-of-range warningArnd Bergmann1-1/+1
Depending on the value of CONFIG_HZ, clang complains about a pointless comparison: drivers/md/dm-integrity.c:4085:12: error: result of comparison of constant 42949672950 with expression of type 'unsigned int' is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (val >= (uint64_t)UINT_MAX * 1000 / HZ) { As the check remains useful for other configurations, shut up the warning by adding a second type cast to uint64_t. Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-29dm vdo murmurhash3: use kernel byteswapping routines instead of GCC onesKen Raeburn1-25/+8
Also open-code the calls. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-22Merge tag 'for-6.9/dm-fixes' of ↵Linus Torvalds2-6/+16
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix a memory leak in DM integrity recheck code that was added during the 6.9 merge. Also fix the recheck code to ensure it issues bios with proper alignment. - Fix DM snapshot's dm_exception_table_exit() to schedule while handling an large exception table during snapshot device shutdown. * tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-integrity: align the outgoing bio in integrity_recheck dm snapshot: fix lockup in dm_exception_table_exit dm-integrity: fix a memory leak when rechecking the data
2024-03-22Merge tag 'kbuild-v6.9' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ...
2024-03-21dm-integrity: align the outgoing bio in integrity_recheckMikulas Patocka1-2/+10
It is possible to set up dm-integrity with smaller sector size than the logical sector size of the underlying device. In this situation, dm-integrity guarantees that the outgoing bios have the same alignment as incoming bios (so, if you create a filesystem with 4k block size, dm-integrity would send 4k-aligned bios to the underlying device). This guarantee was broken when integrity_recheck was implemented. integrity_recheck sends bio that is aligned to ic->sectors_per_block. So if we set up integrity with 512-byte sector size on a device with logical block size 4k, we would be sending unaligned bio. This triggered a bug in one of our internal tests. This commit fixes it by determining the actual alignment of the incoming bio and then makes sure that the outgoing bio in integrity_recheck has the same alignment. Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-20dm snapshot: fix lockup in dm_exception_table_exitMikulas Patocka1-1/+3
There was reported lockup when we exit a snapshot with many exceptions. Fix this by adding "cond_resched" to the loop that frees the exceptions. Reported-by: John Pittman <jpittman@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-19dm-integrity: fix a memory leak when rechecking the dataMikulas Patocka1-3/+3
Memory for the "checksums" pointer will leak if the data is rechecked after checksum failure (because the associated kfree won't happen due to 'goto skip_io'). Fix this by freeing the checksums memory before recheck, and just use the "checksum_onstack" memory for storing checksum during recheck. Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-15Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds1-7/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
2024-03-15Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds1-8/+9
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-03-13Merge tag 'for-6.9/dm-vdo' of ↵Linus Torvalds111-0/+52362
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper VDO target from Mike Snitzer: "Introduce the DM vdo target which provides block-level deduplication, compression, and thin provisioning. Please see: Documentation/admin-guide/device-mapper/vdo.rst Documentation/admin-guide/device-mapper/vdo-design.rst The DM vdo target handles its concurrency by pinning an IO, and subsequent stages of handling that IO, to a particular VDO thread. This aspect of VDO is "unique" but its overall implementation is very tightly coupled to its mostly lockless threading model. As such, VDO is not easily changed to use more traditional finer-grained locking and Linux workqueues. Please see the "Zones and Threading" section of vdo-design.rst The DM vdo target has been used in production for many years but has seen significant changes over the past ~6 years to prepare it for upstream inclusion. The codebase is still large but it is isolated to drivers/md/dm-vdo/ and has been made considerably more approachable and maintainable. Matt Sakai has been added to the MAINTAINERS file to reflect that he will send VDO changes upstream through the DM subsystem maintainers" * tag 'for-6.9/dm-vdo' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (142 commits) dm vdo: document minimum metadata size requirements dm vdo: remove meaningless version number constant dm vdo: remove vdo_perform_once dm vdo block-map: Remove stray semicolon dm vdo string-utils: change from uds_ to vdo_ namespace dm vdo logger: change from uds_ to vdo_ namespace dm vdo funnel-queue: change from uds_ to vdo_ namespace dm vdo indexer: fix use after free dm vdo logger: remove log level to string conversion code dm vdo: document log_level parameter dm vdo: add 'log_level' module parameter dm vdo: remove all sysfs interfaces dm vdo target: eliminate inappropriate uses of UDS_SUCCESS dm vdo indexer: update ASSERT and ASSERT_LOG_ONLY usage dm vdo encodings: update some stale comments dm vdo permassert: audit all of ASSERT to test for VDO_SUCCESS dm-vdo funnel-workqueue: return VDO_SUCCESS from make_simple_work_queue dm vdo thread-utils: return VDO_SUCCESS on vdo_create_thread success dm vdo int-map: return VDO_SUCCESS on success dm vdo: check for VDO_SUCCESS return value from memory-alloc functions ...
2024-03-13Merge tag 'for-6.9/dm-bh-wq' of ↵Linus Torvalds3-25/+50
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper BH workqueue conversion from Mike Snitzer: "Convert the DM verity and crypt targets from (ab)using tasklets to using BH workqueues. These changes were coordinated with Tejun and are based ontop of DM's 6.9 changes and Tejun's 6.9 workqueue tree" * tag 'for-6.9/dm-bh-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-verity: Convert from tasklet to BH workqueue dm-crypt: Convert from tasklet to BH workqueue
2024-03-13Merge tag 'for-6.9/dm-changes' of ↵Linus Torvalds25-103/+164
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix DM core's IO submission (which include dm-io and dm-bufio) such that a bio's IO priority is propagated. Work focused on enabling both DM crypt and verity targets to retain the appropriate IO priority - Fix DM raid reshape logic to not allow an empty flush bio to be requeued due to false concern about the bio, which doesn't have a data payload, accessing beyond the end of the device - Fix DM core's internal resume so that it properly calls both presume and resume methods, which fixes the potential for a postsuspend and resume imbalance - Update DM verity target to set DM_TARGET_SINGLETON flag because it doesn't make sense to have a DM table with a mix of targets that include dm-verity - Small cleanups in DM crypt, thin, and integrity targets - Fix references to dm-devel mailing list to use latest list address * tag 'for-6.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: call the resume method on internal suspend dm raid: fix false positive for requeue needed during reshape dm-integrity: set max_integrity_segments in dm_integrity_io_hints dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing list dm ioctl: update DM_DRIVER_EMAIL to new dm-devel mailing list dm verity: set DM_TARGET_SINGLETON feature flag dm crypt: Fix IO priority lost when queuing write bios dm verity: Fix IO priority lost when reading FEC and hash dm bufio: Support IO priority dm io: Support IO priority dm crypt: remove redundant state settings after waking up dm thin: add braces around conditional code that spans lines
2024-03-12dm: call the resume method on internal suspendMikulas Patocka1-6/+20
There is this reported crash when experimenting with the lvm2 testsuite. The list corruption is caused by the fact that the postsuspend and resume methods were not paired correctly; there were two consecutive calls to the origin_postsuspend function. The second call attempts to remove the "hash_list" entry from a list, while it was already removed by the first call. Fix __dm_internal_resume so that it calls the preresume and resume methods of the table's targets. If a preresume method of some target fails, we are in a tricky situation. We can't return an error because dm_internal_resume isn't supposed to return errors. We can't return success, because then the "resume" and "postsuspend" methods would not be paired correctly. So, we set the DMF_SUSPENDED flag and we fake normal suspend - it may confuse userspace tools, but it won't cause a kernel crash. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:56! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 1 PID: 8343 Comm: dmsetup Not tainted 6.8.0-rc6 #4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 RIP: 0010:__list_del_entry_valid_or_report+0x77/0xc0 <snip> RSP: 0018:ffff8881b831bcc0 EFLAGS: 00010282 RAX: 000000000000004e RBX: ffff888143b6eb80 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff819053d0 RDI: 00000000ffffffff RBP: ffff8881b83a3400 R08: 00000000fffeffff R09: 0000000000000058 R10: 0000000000000000 R11: ffffffff81a24080 R12: 0000000000000001 R13: ffff88814538e000 R14: ffff888143bc6dc0 R15: ffffffffa02e4bb0 FS: 00000000f7c0f780(0000) GS:ffff8893f0a40000(0000) knlGS:0000000000000000 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 0000000057fb5000 CR3: 0000000143474000 CR4: 00000000000006b0 Call Trace: <TASK> ? die+0x2d/0x80 ? do_trap+0xeb/0xf0 ? __list_del_entry_valid_or_report+0x77/0xc0 ? do_error_trap+0x60/0x80 ? __list_del_entry_valid_or_report+0x77/0xc0 ? exc_invalid_op+0x49/0x60 ? __list_del_entry_valid_or_report+0x77/0xc0 ? asm_exc_invalid_op+0x16/0x20 ? table_deps+0x1b0/0x1b0 [dm_mod] ? __list_del_entry_valid_or_report+0x77/0xc0 origin_postsuspend+0x1a/0x50 [dm_snapshot] dm_table_postsuspend_targets+0x34/0x50 [dm_mod] dm_suspend+0xd8/0xf0 [dm_mod] dev_suspend+0x1f2/0x2f0 [dm_mod] ? table_deps+0x1b0/0x1b0 [dm_mod] ctl_ioctl+0x300/0x5f0 [dm_mod] dm_compat_ctl_ioctl+0x7/0x10 [dm_mod] __x64_compat_sys_ioctl+0x104/0x170 do_syscall_64+0x184/0x1b0 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0xf7e6aead <snip> ---[ end trace 0000000000000000 ]--- Fixes: ffcc39364160 ("dm: enhance internal suspend and resume interface") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-12dm raid: fix false positive for requeue needed during reshapeMing Lei1-2/+2
An empty flush doesn't have a payload, so it should never be looked at when considering to possibly requeue a bio for the case when a reshape is in progress. Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Reported-by: Patrick Plenefisch <simonpatp@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-03-12Revert "dm: use queue_limits_set"Linus Torvalds1-12/+15
This reverts commit 8e0ef412869430d114158fc3b9b1fb111e247bd3. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/ Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds17-772/+1092
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds5-56/+59
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-08raid1: fix use-after-free for original bio in raid1_write_request()Yu Kuai1-1/+1
r1_bio->bios[] is used to record new bios that will be issued to underlying disks, however, in raid1_write_request(), r1_bio->bios[] will set to the original bio temporarily. Meanwhile, if blocked rdev is set, free_r1bio() will be called causing that all r1_bio->bios[] to be freed: raid1_write_request() r1_bio = alloc_r1bio(mddev, bio); -> r1_bio->bios[] is NULL for (i = 0; i < disks; i++) -> for each rdev in conf // first rdev is normal r1_bio->bios[0] = bio; -> set to original bio // second rdev is blocked if (test_bit(Blocked, &rdev->flags)) break if (blocked_rdev) free_r1bio() put_all_bios() bio_put(r1_bio->bios[0]) -> original bio is freed Test scripts: mdadm -CR /dev/md0 -l1 -n4 /dev/sd[abcd] --assume-clean fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 \ -iodepth=128 -name=test -direct=1 echo blocked > /sys/block/md0/md/rd2/state Test result: BUG bio-264 (Not tainted): Object already free ----------------------------------------------------------------------------- Allocated in mempool_alloc_slab+0x24/0x50 age=1 cpu=1 pid=869 kmem_cache_alloc+0x324/0x480 mempool_alloc_slab+0x24/0x50 mempool_alloc+0x6e/0x220 bio_alloc_bioset+0x1af/0x4d0 blkdev_direct_IO+0x164/0x8a0 blkdev_write_iter+0x309/0x440 aio_write+0x139/0x2f0 io_submit_one+0x5ca/0xb70 __do_sys_io_submit+0x86/0x270 __x64_sys_io_submit+0x22/0x30 do_syscall_64+0xb1/0x210 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Freed in mempool_free_slab+0x1f/0x30 age=1 cpu=1 pid=869 kmem_cache_free+0x28c/0x550 mempool_free_slab+0x1f/0x30 mempool_free+0x40/0x100 bio_free+0x59/0x80 bio_put+0xf0/0x220 free_r1bio+0x74/0xb0 raid1_make_request+0xadf/0x1150 md_handle_request+0xc7/0x3b0 md_submit_bio+0x76/0x130 __submit_bio+0xd8/0x1d0 submit_bio_noacct_nocheck+0x1eb/0x5c0 submit_bio_noacct+0x169/0xd40 submit_bio+0xee/0x1d0 blkdev_direct_IO+0x322/0x8a0 blkdev_write_iter+0x309/0x440 aio_write+0x139/0x2f0 Since that bios for underlying disks are not allocated yet, fix this problem by using mempool_free() directly to free the r1_bio. Fixes: 992db13a4aee ("md/raid1: free the r1bio before waiting for blocked rdev") Cc: stable@vger.kernel.org # v6.6+ Reported-by: Coly Li <colyli@suse.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240308093726.1047420-1-yukuai1@huaweicloud.com