summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)AuthorFilesLines
2025-08-20ublk: check for unprivileged daemon on each I/O fetchCaleb Sander Mateos1-9/+7
[ Upstream commit 5058a62875e1916e5133a1639f0207ea2148c0bc ] Commit ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon") allowed each ublk I/O to have an independent daemon task. However, nr_privileged_daemon is only computed based on whether the last I/O fetched in each ublk queue has an unprivileged daemon task. Fix this by checking whether every fetched I/O's daemon is privileged. Change nr_privileged_daemon from a count of queues to a boolean indicating whether any I/Os have an unprivileged daemon. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250808155216.296170-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20drbd: add missing kref_get in handle_write_conflictsSarah Newman1-1/+5
[ Upstream commit 00c9c9628b49e368d140cfa61d7df9b8922ec2a8 ] With `two-primaries` enabled, DRBD tries to detect "concurrent" writes and handle write conflicts, so that even if you write to the same sector simultaneously on both nodes, they end up with the identical data once the writes are completed. In handling "superseeded" writes, we forgot a kref_get, resulting in a premature drbd_destroy_device and use after free, and further to kernel crashes with symptoms. Relevance: No one should use DRBD as a random data generator, and apparently all users of "two-primaries" handle concurrent writes correctly on layer up. That is cluster file systems use some distributed lock manager, and live migration in virtualization environments stops writes on one node before starting writes on the other node. Which means that other than for "test cases", this code path is never taken in real life. FYI, in DRBD 9, things are handled differently nowadays. We still detect "write conflicts", but no longer try to be smart about them. We decided to disconnect hard instead: upper layers must not submit concurrent writes. If they do, that's their fault. Signed-off-by: Sarah Newman <srn@prgmr.com> Signed-off-by: Lars Ellenberg <lars@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20250627095728.800688-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20loop: Avoid updating block size under exclusive ownerJan Kara1-8/+30
[ Upstream commit 7e49538288e523427beedd26993d446afef1a6fb ] Syzbot came up with a reproducer where a loop device block size is changed underneath a mounted filesystem. This causes a mismatch between the block device block size and the block size stored in the superblock causing confusion in various places such as fs/buffer.c. The particular issue triggered by syzbot was a warning in __getblk_slow() due to requested buffer size not matching block device block size. Fix the problem by getting exclusive hold of the loop device to change its block size. This fails if somebody (such as filesystem) has already an exclusive ownership of the block device and thus prevents modifying the loop device under some exclusive owner which doesn't expect it. Reported-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20250711163202.19623-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20sunvdc: Balance device refcount in vdc_port_mpgroup_checkMa Ke1-1/+3
commit 63ce53724637e2e7ba51fe3a4f78351715049905 upstream. Using device_find_child() to locate a probed virtual-device-port node causes a device refcount imbalance, as device_find_child() internally calls get_device() to increment the device’s reference count before returning its pointer. vdc_port_mpgroup_check() directly returns true upon finding a matching device without releasing the reference via put_device(). We should call put_device() to decrement refcount. As comment of device_find_child() says, 'NOTE: you will need to drop the reference with put_device() after use'. Found by code review. Cc: stable@vger.kernel.org Fixes: 3ee70591d6c4 ("sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Link: https://lore.kernel.org/r/20250719075856.3447953-1-make24@iscas.ac.cn Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15zloop: fix KASAN use-after-free of tag setShin'ichiro Kawasaki1-1/+2
commit 765761851d89c772f482494d452e266795460278 upstream. When a zoned loop device, or zloop device, is removed, KASAN enabled kernel reports "BUG KASAN use-after-free" in blk_mq_free_tag_set(). The BUG happens because zloop_ctl_remove() calls put_disk(), which invokes zloop_free_disk(). The zloop_free_disk() frees the memory allocated for the zlo pointer. However, after the memory is freed, zloop_ctl_remove() calls blk_mq_free_tag_set(&zlo->tag_set), which accesses the freed zlo. Hence the KASAN use-after-free. zloop_ctl_remove() put_disk(zlo->disk) put_device() kobject_put() ... zloop_free_disk() kvfree(zlo) blk_mq_free_tag_set(&zlo->tag_set) To avoid the BUG, move the call to blk_mq_free_tag_set(&zlo->tag_set) from zloop_ctl_remove() into zloop_free_disk(). This ensures that the tag_set is freed before the call to kvfree(zlo). Fixes: eb0570c7df23 ("block: new zoned loop block device driver") CC: stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250731110745.165751-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15ublk: validate ublk server pidMing Lei1-0/+9
[ Upstream commit c2c8089f325ed703fd5123b39e2dece1dd605904 ] ublk server pid(the `tgid` of the process opening the ublk device) is stored in `ublk_device->ublksrv_tgid`. This `tgid` is then checked against the `ublksrv_pid` in `ublk_ctrl_start_dev` and `ublk_ctrl_end_recovery`. This ensures that correct ublk server pid is stored in device info. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15ublk: speed up ublk server exit handlingUday Shankar1-15/+21
[ Upstream commit 2fa9c93035e17380cafa897ee1a4d503881a3770 ] Recently, we've observed a few cases where a ublk server is able to complete restart more quickly than the driver can process the exit of the previous ublk server. The new ublk server comes up, attempts recovery of the preexisting ublk devices, and observes them still in state UBLK_S_DEV_LIVE. While this is possible due to the asynchronous nature of io_uring cleanup and should therefore be handled properly in the ublk server, it is still preferable to make ublk server exit handling faster if possible, as we should strive for it to not be a limiting factor in how fast a ublk server can restart and provide service again. Analysis of the issue showed that the vast majority of the time spent in handling the ublk server exit was in calls to blk_mq_quiesce_queue, which is essentially just a (relatively expensive) call to synchronize_rcu. The ublk server exit path currently issues an unnecessarily large number of calls to blk_mq_quiesce_queue, for two reasons: 1. It tries to call blk_mq_quiesce_queue once per ublk_queue. However, blk_mq_quiesce_queue targets the request_queue of the underlying ublk device, of which there is only one. So the number of calls is larger than necessary by a factor of nr_hw_queues. 2. In practice, it calls blk_mq_quiesce_queue _more_ than once per ublk_queue. This is because of a data race where we read ubq->canceling without any locking when deciding if we should call ublk_start_cancel. It is thus possible for two calls to ublk_uring_cmd_cancel_fn against the same ublk_queue to both call ublk_start_cancel against the same ublk_queue. Fix this by making the "canceling" flag a per-device state. This actually matches the existing code better, as there are several places where the flag is set or cleared for all queues simultaneously, and there is the general expectation that cancellation corresponds with ublk server exit. This per-device canceling flag is then checked under a (new) lock (addressing the data race (2) above), and the queue is only quiesced if it is cleared (addressing (1) above). The result is just one call to blk_mq_quiesce_queue per ublk device. To minimize the number of cache lines that are accessed in the hot path, the per-queue canceling flag is kept. The values of the per-device canceling flag and all per-queue canceling flags should always match. In our setup, where one ublk server handles I/O for 128 ublk devices, each having 24 hardware queues of depth 4096, here are the results before and after this patch, where teardown time is measured from the first call to io_ring_ctx_wait_and_kill to the return from the last ublk_ch_release: before after number of calls to blk_mq_quiesce_queue: 6469 256 teardown time: 11.14s 2.44s There are still some potential optimizations here, but this takes care of a big chunk of the ublk server exit handling delay. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-1-3527b5339eeb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c2c8089f325e ("ublk: validate ublk server pid") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15nbd: fix lockdep deadlock warningMing Lei1-1/+11
[ Upstream commit 8b428f42f3edfd62422aa7ad87049ab232a2eaa9 ] nbd grabs device lock nbd->config_lock for updating nr_hw_queues, this ways cause the following lock dependency: -> #2 (&disk->open_mutex){+.+.}-{4:4}: lock_acquire kernel/locking/lockdep.c:5871 [inline] lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828 __mutex_lock_common kernel/locking/mutex.c:602 [inline] __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747 mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799 __del_gendisk+0x132/0xac6 block/genhd.c:706 del_gendisk+0xf6/0x19a block/genhd.c:819 nbd_dev_remove+0x3c/0xf2 drivers/block/nbd.c:268 nbd_dev_remove_work+0x1c/0x26 drivers/block/nbd.c:284 process_one_work+0x96a/0x1f32 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3321 [inline] worker_thread+0x5ce/0xde8 kernel/workqueue.c:3402 kthread+0x39c/0x7d4 kernel/kthread.c:464 ret_from_fork_kernel+0x2a/0xbb2 arch/riscv/kernel/process.c:214 ret_from_fork_kernel_asm+0x16/0x18 arch/riscv/kernel/entry.S:327 -> #1 (&set->update_nr_hwq_lock){++++}-{4:4}: lock_acquire kernel/locking/lockdep.c:5871 [inline] lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828 down_write+0x9c/0x19a kernel/locking/rwsem.c:1577 blk_mq_update_nr_hw_queues+0x3e/0xb86 block/blk-mq.c:5041 nbd_start_device+0x140/0xb2c drivers/block/nbd.c:1476 nbd_genl_connect+0xae0/0x1b24 drivers/block/nbd.c:2201 genl_family_rcv_msg_doit+0x206/0x2e6 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x514/0x78e net/netlink/genetlink.c:1210 netlink_rcv_skb+0x206/0x3be net/netlink/af_netlink.c:2534 genl_rcv+0x36/0x4c net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x4f0/0x82c net/netlink/af_netlink.c:1339 netlink_sendmsg+0x85e/0xdd6 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0xcc/0x160 net/socket.c:727 ____sys_sendmsg+0x63e/0x79c net/socket.c:2566 ___sys_sendmsg+0x144/0x1e6 net/socket.c:2620 __sys_sendmsg+0x188/0x246 net/socket.c:2652 __do_sys_sendmsg net/socket.c:2657 [inline] __se_sys_sendmsg net/socket.c:2655 [inline] __riscv_sys_sendmsg+0x70/0xa2 net/socket.c:2655 syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112 do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341 handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197 -> #0 (&nbd->config_lock){+.+.}-{4:4}: check_noncircular+0x132/0x146 kernel/locking/lockdep.c:2178 check_prev_add kernel/locking/lockdep.c:3168 [inline] check_prevs_add kernel/locking/lockdep.c:3287 [inline] validate_chain kernel/locking/lockdep.c:3911 [inline] __lock_acquire+0x12b2/0x24ea kernel/locking/lockdep.c:5240 lock_acquire kernel/locking/lockdep.c:5871 [inline] lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828 __mutex_lock_common kernel/locking/mutex.c:602 [inline] __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747 mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799 refcount_dec_and_mutex_lock+0x60/0xd8 lib/refcount.c:118 nbd_config_put+0x3a/0x610 drivers/block/nbd.c:1423 nbd_release+0x94/0x15c drivers/block/nbd.c:1735 blkdev_put_whole+0xac/0xee block/bdev.c:721 bdev_release+0x3fe/0x600 block/bdev.c:1144 blkdev_release+0x1a/0x26 block/fops.c:684 __fput+0x382/0xa8c fs/file_table.c:465 ____fput+0x1c/0x26 fs/file_table.c:493 task_work_run+0x16a/0x25e kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0x118/0x134 kernel/entry/common.c:114 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline] do_trap_ecall_u+0x3f0/0x530 arch/riscv/kernel/traps.c:355 handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197 Also it isn't necessary to require nbd->config_lock, because blk_mq_update_nr_hw_queues() does grab tagset lock for sync everything. Fixes the issue by releasing ->config_lock & retry in case of concurrent updating nr_hw_queues. Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: syzbot+2bcecf3c38cb3e8fdc8d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6855034f.a00a0220.137b3.0031.GAE@google.com Reviewed-by: Yu Kuai <yukuai3@huawei.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250709111744.2353050-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15block: mtip32xx: Fix usage of dma_map_sg()Thomas Fourier1-10/+17
[ Upstream commit 8e1fab9cccc7b806b0cffdceabb09b310b83b553 ] The dma_map_sg() can fail and, in case of failure, returns 0. If it fails, mtip_hw_submit_io() returns an error. The dma_unmap_sg() requires the nents parameter to be the same as the one passed to dma_map_sg(). This patch saves the nents in command->scatter_ents. Fixes: 88523a61558a ("block: Add driver for Micron RealSSD pcie flash cards") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250627121123.203731-2-fourier.thomas@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15ublk: use vmalloc for ublk_device's __queuesCaleb Sander Mateos1-2/+2
[ Upstream commit c2f48453b7806d41f5a3270f206a5cd5640ed207 ] struct ublk_device's __queues points to an allocation with up to UBLK_MAX_NR_QUEUES (4096) queues, each of which have: - struct ublk_queue (48 bytes) - Tail array of up to UBLK_MAX_QUEUE_DEPTH (4096) struct ublk_io's, 32 bytes each This means the full allocation can exceed 512 MB, which may well be impossible to service with contiguous physical pages. Switch to kvcalloc() and kvfree(), since there is no need for physically contiguous memory. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-18Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linuxLinus Torvalds1-3/+2
Pull block fixes from Jens Axboe: - NVMe changes via Christoph: - revert the cross-controller atomic write size validation that caused regressions (Christoph Hellwig) - fix endianness of command word printout in nvme_log_err_passthru() (John Garry) - fix callback lock for TLS handshake (Maurizio Lombardi) - fix misaccounting of nvme-mpath inflight I/O (Yu Kuai) - fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() (Zheng Qixing) - Fix for a kobject leak in queue unregistration - Fix for loop async file write start/end handling * tag 'block-6.16-20250718' of git://git.kernel.dk/linux: loop: use kiocb helpers to fix lockdep warning nvmet-tcp: fix callback lock for TLS handshake nvme: fix misaccounting of nvme-mpath inflight I/O nvme: revert the cross-controller atomic write size validation nvme: fix endianness of command word prints in nvme_log_err_passthru() nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() block: fix kobject leak in blk_unregister_queue
2025-07-16loop: use kiocb helpers to fix lockdep warningMing Lei1-3/+2
The lockdep tool can report a circular lock dependency warning in the loop driver's AIO read/write path: ``` [ 6540.587728] kworker/u96:5/72779 is trying to acquire lock: [ 6540.593856] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop] [ 6540.603786] [ 6540.603786] but task is already holding lock: [ 6540.610291] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop] [ 6540.620210] [ 6540.620210] other info that might help us debug this: [ 6540.627499] Possible unsafe locking scenario: [ 6540.627499] [ 6540.634110] CPU0 [ 6540.636841] ---- [ 6540.639574] lock(sb_writers#9); [ 6540.643281] lock(sb_writers#9); [ 6540.646988] [ 6540.646988] *** DEADLOCK *** ``` This patch fixes the issue by using the AIO-specific helpers `kiocb_start_write()` and `kiocb_end_write()`. These functions are designed to be used with a `kiocb` and manage write sequencing correctly for asynchronous I/O without introducing the problematic lock dependency. The `kiocb` is already part of the `loop_cmd` struct, so this change also simplifies the completion function `lo_rw_aio_do_completion()` by using the `iocb` from the `cmd` struct directly, instead of retrieving the loop device from the request queue. Fixes: 39d86db34e41 ("loop: add file_start_write() and file_end_write()") Cc: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250716114808.3159657-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-11Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linuxLinus Torvalds1-3/+3
Pull block fixes from Jens Axboe: - MD changes via Yu: - fix UAF due to stack memory used for bio mempool (Jinchao) - fix raid10/raid1 nowait IO error path (Nigel and Qixing) - fix kernel crash from reading bitmap sysfs entry (Håkon) - Fix for a UAF in the nbd connect error path - Fix for blocksize being bigger than pagesize, if THP isn't enabled * tag 'block-6.16-20250710' of git://git.kernel.dk/linux: block: reject bs > ps block devices when THP is disabled nbd: fix uaf in nbd_genl_connect() error path md/md-bitmap: fix GPF in bitmap_get_stats() md/raid1,raid10: strip REQ_NOWAIT from member bios raid10: cleanup memleak at raid10_make_request md/raid1: Fix stack memory use after return in raid1_reshape
2025-07-07nbd: fix uaf in nbd_genl_connect() error pathZheng Qixing1-3/+3
There is a use-after-free issue in nbd: block nbd6: Receive control failed (result -104) block nbd6: shutting down sockets ================================================================== BUG: KASAN: slab-use-after-free in recv_work+0x694/0xa80 drivers/block/nbd.c:1022 Write of size 4 at addr ffff8880295de478 by task kworker/u33:0/67 CPU: 2 UID: 0 PID: 67 Comm: kworker/u33:0 Not tainted 6.15.0-rc5-syzkaller-00123-g2c89c1b655c0 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: nbd6-recv recv_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xc3/0x670 mm/kasan/report.c:521 kasan_report+0xe0/0x110 mm/kasan/report.c:634 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_dec include/linux/atomic/atomic-instrumented.h:592 [inline] recv_work+0x694/0xa80 drivers/block/nbd.c:1022 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> nbd_genl_connect() does not properly stop the device on certain error paths after nbd_start_device() has been called. This causes the error path to put nbd->config while recv_work continue to use the config after putting it, leading to use-after-free in recv_work. This patch moves nbd_start_device() after the backend file creation. Reported-by: syzbot+48240bab47e705c53126@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68227a04.050a0220.f2294.00b5.GAE@google.com/T/ Fixes: 6497ef8df568 ("nbd: provide a way for userspace processes to identify device backends") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250612132405.364904-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linuxLinus Torvalds2-7/+10
Pull block fixes from Jens Axboe: - NVMe fixes via Christoph: - fix incorrect cdw15 value in passthru error logging (Alok Tiwari) - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov) - refresh visible attrs after being checked (Eugen Hristev) - fix suspicious RCU usage warning in the multipath code (Geliang Tang) - correctly account for namespace head reference counter (Nilay Shroff) - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request. - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release. * tag 'block-6.16-20250704' of git://git.kernel.dk/linux: brd: fix sleeping function called from invalid context in brd_insert_page() ublk: don't queue request if the associated uring_cmd is canceled nvme-multipath: fix suspicious RCU usage warning nvme-pci: refresh visible attrs after being checked nvmet: fix memory leak of bio integrity nvme: correctly account for namespace head reference counter nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-01brd: fix sleeping function called from invalid context in brd_insert_page()Yu Kuai1-2/+4
__xa_cmpxchg() is called with rcu_read_lock(), and it will allocate memory if necessary. Fix the problem by moving rcu_read_lock() after __xa_cmpxchg(), meanwhile, it still should be held before xa_unlock(), prevent returned page to be freed by concurrent discard. Fixes: bbcacab2e8ee ("brd: avoid extra xarray lookups on first write") Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: don't queue request if the associated uring_cmd is canceledMing Lei1-5/+6
Commit 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task") need to dereference `io->cmd` for checking if the IO can be added to current batch, see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However, `io->cmd` may become invalid after the uring_cmd is canceled. Fixes it by only allowing to queue this IO in case that ublk_prep_req() returns `BLK_STS_OK`, when 'io->cmd' is guaranteed to be valid. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-27Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linuxLinus Torvalds1-12/+37
Pull block fixes from Jens Axboe: - Fixes for ublk: - fix C++ narrowing warnings in the uapi header - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring - ublk_get_data() error handling fix - sanity check more arguments in ublk_ctrl_add_dev() - selftest addition - NVMe pull request via Christoph: - reset delayed remove_work after reconnect - fix atomic write size validation - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window * tag 'block-6.16-20250626' of git://git.kernel.dk/linux: block: fix false warning in bdev_count_inflight_rw() ublk: sanity check add_dev input for underflow nvme: fix atomic write size validation nvme: refactor the atomic write unit detection nvme: reset delayed remove_work after reconnect ublk: setup ublk_io correctly in case of ublk_get_data() failure ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header ublk: fix narrowing warnings in UAPI header selftests: ublk: don't take same backing file for more than one ublk devices ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-26ublk: sanity check add_dev input for underflowRonnie Sahlberg1-1/+2
Add additional checks that queue depth and number of queues are non-zero. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25ublk: setup ublk_io correctly in case of ublk_get_data() failureMing Lei1-10/+25
If ublk_get_data() fails, -EIOCBQUEUED is returned and the current command becomes ASYNC. And the only reason is that mapping data can't move on, because of no enough pages or pending signal, then the current ublk request has to be requeued. Once the request need to be requeued, we have to setup `ublk_io` correctly, including io->cmd and flags, otherwise the request may not be forwarded to ublk server successfully. Fixes: 9810362a57cb ("ublk: don't call ublk_dispatch_req() for NEED_GET_DATA") Reported-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGVVp+VN9QcpHUz_0nasFf5q9i1gi8H8j-G-6mkBoqa3TyjRHA@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/r/20250624104121.859519-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25ublk: build batch from IOs in same io_ring_ctx and io taskMing Lei1-1/+10
ublk_queue_cmd_list() dispatches the whole batch list by scheduling task work via the tail request's io_uring_cmd, this way is fine even though more than one io_ring_ctx are involved for this batch since it is just one running context. However, the task work handler ublk_cmd_list_tw_cb() takes `issue_flags` of tail uring_cmd's io_ring_ctx for completing all commands. This way is wrong if any uring_cmd is issued from different io_ring_ctx. Fixes it by always building batch IOs from same io_ring_ctx and io task because ublk_dispatch_req() does validate task context, and IO needs to be aborted in case of running from fallback task work context. For typical per-queue or per-io daemon implementation, this way shouldn't make difference from performance viewpoint, because single io_ring_ctx is taken in each daemon for normal use case. Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250625022554.883571-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-20Merge tag 'block-6.16-20250619' of git://git.kernel.dk/linuxLinus Torvalds4-3/+22
Pull block fixes from Jens Axboe: - Two fixes for aoe which fixes issues dating back to when this driver was converted to blk-mq - Fix for ublk, checking for valid queue depth and count values before setting up a device * tag 'block-6.16-20250619' of git://git.kernel.dk/linux: ublk: santizize the arguments from userspace when adding a device aoe: defer rexmit timer downdev work to workqueue aoe: clean device rq_list in aoedev_downdev()
2025-06-19ublk: santizize the arguments from userspace when adding a deviceRonnie Sahlberg1-0/+3
Sanity check the values for queue depth and number of queues we get from userspace when adding a device. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Fixes: 62fe99cef94a ("ublk: add read()/write() support for ublk char device") Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17aoe: defer rexmit timer downdev work to workqueueJustin Sanders3-3/+11
When aoe's rexmit_timer() notices that an aoe target fails to respond to commands for more than aoe_deadsecs, it calls aoedev_downdev() which cleans the outstanding aoe and block queues. This can involve sleeping, such as in blk_mq_freeze_queue(), which should not occur in irq context. This patch defers that aoedev_downdev() call to the aoe device's workqueue. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17aoe: clean device rq_list in aoedev_downdev()Justin Sanders1-0/+8
An aoe device's rq_list contains accepted block requests that are waiting to be transmitted to the aoe target. This queue was added as part of the conversion to blk_mq. However, the queue was not cleaned out when an aoe device is downed which caused blk_mq_freeze_queue() to sleep indefinitely waiting for those requests to complete, causing a hang. This fix cleans out the queue before calling blk_mq_freeze_queue(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Fixes: 3582dd291788 ("aoe: convert aoeblk to blk-mq") Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-14Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linuxLinus Torvalds1-6/+5
Pull block fixes from Jens Axboe: - Fix for a deadlock on queue freeze with zoned writes - Fix for zoned append emulation - Two bio folio fixes, for sparsemem and for very large folios - Fix for a performance regression introduced in 6.13 when plug insertion was changed - Fix for NVMe passthrough handling for polled IO - Document the ublk auto registration feature - loop lockdep warning fix * tag 'block-6.16-20250614' of git://git.kernel.dk/linux: nvme: always punt polled uring_cmd end_io work to task_work Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists block: Fix bvec_set_folio() for very large folios bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP block: use plug request list tail for one-shot backmerge attempt block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) loop: move lo_set_size() out of queue freeze
2025-06-11loop: move lo_set_size() out of queue freezeMing Lei1-6/+5
It isn't necessary to freeze queue for updating disk size given submit_bio() doesn't grab queue usage counter for checking eod. Also many driver won't freeze queue for calling set_capacity_and_notify(). Move lo_set_size() out of queue freeze for fixing many lockdep warning report. Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/ Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar6-10/+14
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-06Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linuxLinus Torvalds2-57/+62
Pull more block updates from Jens Axboe: - NVMe pull request via Christoph: - TCP error handling fix (Shin'ichiro Kawasaki) - TCP I/O stall handling fixes (Hannes Reinecke) - fix command limits status code (Keith Busch) - support vectored buffers also for passthrough (Pavel Begunkov) - spelling fixes (Yi Zhang) - MD pull request via Yu: - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10 - fix max_write_behind setting for dm-raid - some minor cleanups - Integrity data direction fix and cleanup - bcache NULL pointer fix - Fix for loop missing write start/end handling - Decouple hardware queues and IO threads in ublk - Slew of ublk selftests additions and updates * tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits) nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code selftests: ublk: kublk: improve behavior on init failure block: flip iter directions in blk_rq_integrity_map_user() block: drop direction param from bio_integrity_copy_user() selftests: ublk: cover PER_IO_DAEMON in more stress tests Documentation: ublk: document UBLK_F_PER_IO_DAEMON selftests: ublk: add stress test for per io daemons selftests: ublk: add functional test for per io daemons selftests: ublk: kublk: decouple ublk_queues from ublk server threads selftests: ublk: kublk: move per-thread data out of ublk_queue selftests: ublk: kublk: lift queue initialization out of thread selftests: ublk: kublk: tie sqe allocation to io instead of queue selftests: ublk: kublk: plumb q_id in io_uring user_data ublk: have a per-io daemon instead of a per-queue daemon ...
2025-06-04Merge tag 'pci-v6.16-changes' of ↵Linus Torvalds1-5/+2
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Print the actual delay time in pci_bridge_wait_for_secondary_bus() instead of assuming it was 1000ms (Wilfred Mallawa) - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI devices', which broke resume from system sleep on AMD platforms and has been fixed by other commits (Lukas Wunner) Resource management: - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated and unnecessary (Philipp Stanner) - Remove pcim_iounmap_regions() and pcim_request_region_exclusive() and related flags since all uses have been removed (Philipp Stanner) - Rework devres 'request' functions so they are no longer 'hybrid', i.e., their behavior no longer depends on whether pcim_enable_device or pci_enable_device() was used, and remove related code (Philipp Stanner) - Warn (not BUG()) about failure to assign optional resources (Ilpo Järvinen) Error handling: - Log the DPC Error Source ID only when it's actually valid (when ERR_FATAL or ERR_NONFATAL was received from a downstream device) and decode into bus/device/function (Bjorn Helgaas) - Determine AER log level once and save it so all related messages use the same level (Karolina Stolarek) - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable Errors (Karolina Stolarek) - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs controls on interval and burst count, to avoid flooding logs and RCU stall warnings (Jon Pan-Doh) Power management: - Increment PM usage counter when probing reset methods so we don't try to read config space of a powered-off device (Alex Williamson) - Set all devices to D0 during enumeration to ensure ACPI opregion is connected via _REG (Mario Limonciello) Power control: - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match the filename paths. Retain old deprecated symbols for compatibility, except for the pwrctrl slot driver (PCI_PWRCTRL_SLOT) (Johan Hovold) - When unregistering pwrctrl, cancel outstanding rescan work before cleaning up data structures to avoid use-after-free issues (Brian Norris) Bandwidth control: - Simplify link bandwidth controller by replacing the count of Link Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN flag (Ilpo Järvinen) - Update the Link Speed after retraining, since the Link Speed may have changed (Ilpo Järvinen) PCIe native device hotplug: - Ignore Presence Detect Changed caused by DPC. pciehp already ignores Link Down/Up events caused by DPC, but on slots using in-band presence detect, DPC causes a spurious Presence Detect Changed event (Lukas Wunner) - Ignore Link Down/Up caused by Secondary Bus Reset. On hotplug ports using in-band presence detect, the reset causes a Presence Detect Changed event, which mistakenly caused teardown and re-enumeration of the device. Drivers may need to annotate code that resets their device (Lukas Wunner) Virtualization: - Add an ACS quirk for Loongson Root Ports that don't advertise ACS but don't allow peer-to-peer transactions between Root Ports; the quirk allows each Root Port to be in a separate IOMMU group (Huacai Chen) Endpoint framework: - For fixed-size BARs, retain both the actual size and the possibly larger size allocated to accommodate iATU alignment requirements (Jerome Brunet) - Simplify ctrl/SPAD space allocation and avoid allocating more space than needed (Jerome Brunet) - Correct MSI-X PBA offset calculations for DesignWare and Cadence endpoint controllers (Niklas Cassel) - Align the return value (number of interrupts) encoding for pci_epc_get_msi()/pci_epc_ops::get_msi() and pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel) - Align the nr_irqs parameter encoding for pci_epc_set_msi()/pci_epc_ops::set_msi() and pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel) Common host controller library: - Convert pci-host-common to a library so platforms that don't need native host controller drivers don't need to include these helper functions (Manivannan Sadhasivam) Apple PCIe controller driver: - Extract ECAM bridge creation helper from pci_host_common_probe() to separate driver-specific things like MSI from PCI things (Marc Zyngier) - Dynamically allocate RID-to_SID bitmap to prepare for SoCs with varying capabilities (Marc Zyngier) - Skip ports disabled in DT when setting up ports (Janne Grunau) - Add t6020 compatible string (Alyssa Rosenzweig) - Add T602x PCIe support (Hector Martin) - Directly set/clear INTx mask bits because T602x dropped the accessors that could do this without locking (Marc Zyngier) - Move port PHY registers to their own reg items to accommodate T602x, which moves them around; retain default offsets for existing DTs that lack phy%d entries with the reg offsets (Hector Martin) - Stop polling for core refclk, which doesn't work on T602x and the bootloader has already done anyway (Hector Martin) - Use gpiod_set_value_cansleep() when asserting PERST# in probe because we're allowed to sleep there (Hector Martin) Cadence PCIe controller driver: - Drop a runtime PM 'put' to resolve a runtime atomic count underflow (Hans Zhang) - Make the cadence core buildable as a module (Kishon Vijay Abraham I) - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by loadable drivers when they are removed (Siddharth Vadapalli) Freescale i.MX6 PCIe controller driver: - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP (Richard Zhu) - Remove redundant dw_pcie_wait_for_link() from imx_pcie_start_link(); since the DWC core does this, imx6 only needs it when retraining for a faster link speed (Richard Zhu) - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu) - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in some cases, the controller can't exit 'L23 Ready' through Beacon or PERST# deassertion (Richard Zhu) - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum: controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8 GT/s, causing timeouts in L1 (Richard Zhu) - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu) - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu) Mobiveil PCIe controller driver: - Return bool (not int) for link-up check in mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans Zhang) NVIDIA Tegra194 PCIe controller driver: - Create debugfs directory for 'aspm_state_cnt' only when CONFIG_PCIEASPM is enabled, since there are no other entries (Hans Zhang) Qualcomm PCIe controller driver: - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane equalization presets (Krishna Chaitanya Chundru) - Read Maximum Link Width from the Link Capabilities register if DT lacks 'num-lanes' property (Krishna Chaitanya Chundru) - Add Physical Layer 64 GT/s Capability ID and register offsets for 8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya Chundru) - Add generic dwc support for configuring lane equalization presets (Krishna Chaitanya Chundru) - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar) Renesas R-Car PCIe controller driver: - Describe endpoint BAR 4 as being fixed size (Jerome Brunet) - Document how to obtain R-Car V4H (r8a779g0) controller firmware (Yoshihiro Shimoda) Rockchip PCIe controller driver: - Reorder rockchip_pci_core_rsts because reset_control_bulk_deassert() deasserts in reverse order, to fix a link training regression (Jensen Huang) - Mark RK3399 as being capable of raising INTx interrupts (Niklas Cassel) Rockchip DesignWare PCIe controller driver: - Check only PCIE_LINKUP, not LTSSM status, to determine whether the link is up (Shawn Lin) - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s for Root Complex and Endpoint modes (Shawn Lin) - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead of rockchip_pcie_ep_pre_init() so it stays hidden after PERST# resets non-sticky registers (Shawn Lin) - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit() (Diederik de Haas) Synopsys DesignWare PCIe controller driver: - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training more robust; this will not affect the intended link width if all lanes are functional (Wenbin Yao) - Return bool (not int) for link-up check in dw_pcie_ops.link_up() and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay, keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx, tegra194, uniphier, visconti (Hans Zhang) - Add debugfs support for exposing DWC device-specific PTM context (Manivannan Sadhasivam) TI J721E PCIe driver: - Make j721e buildable as a loadable and removable module (Siddharth Vadapalli) - Fix j721e host/endpoint dependencies that result in link failures in some configs (Arnd Bergmann) Device tree bindings: - Add qcom DT binding for 'global' interrupt (PCIe controller and link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p, sc7280, sc8180x sdm845, sm8150, sm8250, sm8350 (Manivannan Sadhasivam) - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074, ipq8074-gen3, ipq6018 (Manivannan Sadhasivam) - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang) - Correct indentation and style of examples in brcm,stb-pcie, cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie, microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm (Krzysztof Kozlowski) - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and armada8k from text to schema DT bindings (Rob Herring) - Remove obsolete .txt DT bindings for content that has been moved to schemas (Rob Herring) - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074 and IPQ9574 (Varadarajan Narayanan) - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring) - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since PolarFire may be configured that way (Conor Dooley) Miscellaneous: - Drop 'pci' suffix from intel_mid_pci.c filename to match similar files (Andy Shevchenko) - All platforms with PCI have an MMU, so add PCI Kconfig dependency on MMU to simplify build testing and avoid inadvertent build regressions (Arnd Bergmann) - Update Krzysztof Wilczyński's email address in MAINTAINERS (Krzysztof Wilczyński) - Update Manivannan Sadhasivam's email address in MAINTAINERS (Manivannan Sadhasivam)" * tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits) MAINTAINERS: Update Manivannan Sadhasivam email address PCI: j721e: Fix host/endpoint dependencies PCI: j721e: Add support to build as a loadable module PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup PCI: cadence: Add support to build pcie-cadence library as a kernel module MAINTAINERS: Update Krzysztof Wilczyński email address PCI: Remove unnecessary linesplit in __pci_setup_bridge() PCI: WARN (not BUG()) when we fail to assign optional resources PCI: Remove unused pci_printk() PCI: qcom: Replace PERST# sleep time with proper macro PCI: dw-rockchip: Replace PERST# sleep time with proper macro PCI: host-common: Convert to library for host controller drivers PCI/ERR: Remove misleading TODO regarding kernel panic PCI: cadence: Remove duplicate message code definitions PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding PCI: cadence-ep: Correct PBA offset in .set_msix() callback ...
2025-06-03Merge tag 'mm-stable-2025-06-01-14-06' of ↵Linus Torvalds6-13/+35
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ...
2025-06-01zram: support deflate-specific paramsSergey Senozhatsky3-6/+28
Introduce support of algorithm specific parameters in algorithm_params device attribute. The expected format is algorithm.param=value. For starters, add support for deflate.winbits parameter. Link: https://lkml.kernel.org/r/20250514024825.1745489-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-01zram: rename ZCOMP_PARAM_NO_LEVELSergey Senozhatsky6-7/+7
Patch series "zram: support algorithm-specific parameters". This patchset adds support for algorithm-specific parameters. For now, only deflate-specific winbits can be configured, which fixes deflate support on some s390 setups. This patch (of 2): Use more generic name because this will be default "un-set" value for more params in the future. Link: https://lkml.kernel.org/r/20250514024825.1745489-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20250514024825.1745489-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-01Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds1-108/+223
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-31ublk: have a per-io daemon instead of a per-queue daemonUday Shankar1-55/+56
Currently, ublk_drv associates to each hardware queue (hctx) a unique task (called the queue's ubq_daemon) which is allowed to issue COMMIT_AND_FETCH commands against the hctx. If any other task attempts to do so, the command fails immediately with EINVAL. When considered together with the block layer architecture, the result is that for each CPU C on the system, there is a unique ublk server thread which is allowed to handle I/O submitted on CPU C. This can lead to suboptimal performance under imbalanced load generation. For an extreme example, suppose all the load is generated on CPUs mapping to a single ublk server thread. Then that thread may be fully utilized and become the bottleneck in the system, while other ublk server threads are totally idle. This issue can also be addressed directly in the ublk server without kernel support by having threads dequeue I/Os and pass them around to ensure even load. But this solution requires inter-thread communication at least twice for each I/O (submission and completion), which is generally a bad pattern for performance. The problem gets even worse with zero copy, as more inter-thread communication would be required to have the buffer register/unregister calls to come from the correct thread. Therefore, address this issue in ublk_drv by allowing each I/O to have its own daemon task. Two I/Os in the same queue are now allowed to be serviced by different daemon tasks - this was not possible before. Imbalanced load can then be balanced across all ublk server threads by having the ublk server threads issue FETCH_REQs in a round-robin manner. As a small toy example, consider a system with a single ublk device having 2 queues, each of depth 4. A ublk server having 4 threads could issue its FETCH_REQs against this device as follows (where each entry is the qid,tag pair that the FETCH_REQ targets): ublk server thread: T0 T1 T2 T3 0,0 0,1 0,2 0,3 1,3 1,0 1,1 1,2 This setup allows for load that is concentrated on one hctx/ublk_queue to be spread out across all ublk server threads, alleviating the issue described above. Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers can use to essentially test for the presence of this change and tailor their behavior accordingly. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-1-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27loop: add file_start_write() and file_end_write()Ming Lei1-2/+6
file_start_write() and file_end_write() should be added around ->write_iter(). Recently we switch to ->write_iter() from vfs_iter_write(), and the implied file_start_write() and file_end_write() are lost. Also we never add them for dio code path, so add them back for covering both. Cc: Jeff Moyer <jmoyer@redhat.com> Fixes: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Fixes: bc07c10a3603 ("block: loop: support DIO & AIO") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250527153405.837216-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds8-320/+1892
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-23ublk: add feature UBLK_F_QUIESCEMing Lei1-1/+123
Add feature UBLK_F_QUIESCE, which adds control command `UBLK_U_CMD_QUIESCE_DEV` for quiescing device, then device state can become `UBLK_S_DEV_QUIESCED` or `UBLK_S_DEV_FAIL_IO` finally from ublk_ch_release() with ublk server cooperation. This feature can help to support to upgrade ublk server application by shutting down ublk server gracefully, meantime keep ublk block device persistent during the upgrading period. The feature is only available for UBLK_F_USER_RECOVERY. Suggested-by: Yoav Cohen <yoav@nvidia.com> Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250522163523.406289-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-22Merge tag 'block-6.15-20250522' of git://git.kernel.dk/linuxLinus Torvalds1-3/+0
Pull block fixes from Jens Axboe: - Fix for a regression with setting up loop on a file system without ->write_iter() - Fix for an nvme sysfs regression * tag 'block-6.15-20250522' of git://git.kernel.dk/linux: nvme: avoid creating multipath sysfs group under namespace path devices loop: don't require ->write_iter for writable files in loop_configure
2025-05-22ublk: run auto buf unregisgering in same io_ring_ctx with registeringMing Lei1-3/+16
UBLK_F_AUTO_BUF_REG requires that the buffer registered automatically is unregistered in same `io_ring_ctx`, so check it explicitly. Document this requirement for UBLK_F_AUTO_BUF_REG. Drop WARN_ON_ONCE() which is triggered from userspace code path. Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250522152043.399824-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-22ublk: remove io argument from ublk_auto_buf_reg_fallback()Caleb Sander Mateos1-2/+2
The argument has been unused since the function was added, so remove it. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250521160720.1893326-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()Ming Lei1-1/+1
If ublk_set_auto_buf_reg() fails, we need to unlock and return, otherwise `ub->mutex` is leaked. Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250521025502.71041-2-ming.lei@redhat.com Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20ublk: support UBLK_AUTO_BUF_REG_FALLBACKMing Lei1-0/+16
For UBLK_F_AUTO_BUF_REG, buffer is registered to uring_cmd context automatically with the provided buffer index. User may provide one wrong buffer index, or the specified buffer is registered by application already. Add UBLK_AUTO_BUF_REG_FALLBACK for supporting to auto buffer registering fallback by completing the uring_cmd and telling ublk server the register failure via UBLK_AUTO_BUF_REG_FALLBACK, then ublk server still can register the buffer from userspace. So we can provide reliable way for supporting auto buffer register. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20ublk: register buffer to local io_uring with provided buf index via ↵Ming Lei1-7/+49
UBLK_F_AUTO_BUF_REG Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically to local io_uring context with provided buffer index. Add UAPI structure `struct ublk_auto_buf_reg` for holding user parameter to register request buffer automatically, one 'flags' field is defined, and there is still 32bit available for future extension, such as, adding one io_ring FD field for registering buffer to external io_uring. `struct ublk_auto_buf_reg` is populated from ublk uring_cmd's sqe->addr, and all existing ublk commands are data-less, so it is just fine to reuse sqe->addr for this purpose. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20ublk: prepare for supporting to register request buffer automaticallyMing Lei1-6/+64
UBLK_F_SUPPORT_ZERO_COPY requires ublk server to issue explicit buffer register/unregister uring_cmd for each IO, this way is not only inefficient, but also introduce dependency between buffer consumer and buffer register/ unregister uring_cmd, please see tools/testing/selftests/ublk/stripe.c in which backing file IO has to be issued one by one by IOSQE_IO_LINK. Prepare for adding feature UBLK_F_AUTO_BUF_REG for addressing the existing zero copy limitation: - register request buffer automatically to ublk uring_cmd's io_uring context before delivering io command to ublk server - unregister request buffer automatically from the ublk uring_cmd's io_uring context when completing the request - io_uring will unregister the buffer automatically when uring is exiting, so we needn't worry about accident exit For using this feature, ublk server has to create one sparse buffer table Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20ublk: convert to refcount_tMing Lei1-14/+5
Convert to refcount_t and prepare for supporting to register bvec buffer automatically, which needs to initialize reference counter as 2, and kref doesn't provide this interface, so convert to refcount_t. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20loop: don't require ->write_iter for writable files in loop_configureChristoph Hellwig1-3/+0
Block devices can be opened read-write even if they can't be written to for historic reasons. Remove the check requiring file->f_op->write_iter when the block devices was opened in loop_configure. The call to loop_check_backing_file just below ensures the ->write_iter is present for backing files opened for writing, which is the only check that is actually needed. Fixes: f5c84eff634b ("loop: Add sanity check for read/write_iter") Reported-by: Christian Hesse <mail@eworm.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250520135420.1177312-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - fixes for atomic writes (Alan Adamson) - fixes for polled CQs in nvmet-epf (Damien Le Moal) - fix for polled CQs in nvme-pci (Keith Busch) - fix compile on odd configs that need to be forced to inline (Kees Cook) - one more quirk (Ilya Guterman) - Fix for missing allocation of an integrity buffer for some cases - Fix for a regression with ublk command cancelation * tag 'block-6.15-20250515' of git://git.kernel.dk/linux: ublk: fix dead loop when canceling io command nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro nvme: all namespaces in a subsystem must adhere to a common atomic write size nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ nvmet: pci-epf: improve debug message nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq() nvmet: pci-epf: do not fall back to using INTX if not supported nvmet: pci-epf: clear completion queue IRQ flag on delete nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable nvme-pci: make nvme_pci_npages_prp() __always_inline block: always allocate integrity buffer when required
2025-05-15ublk: fix dead loop when canceling io commandMing Lei1-1/+1
Commit: f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd") adds a request state check in ublk_cancel_cmd(), and if the request is started, skips canceling this uring_cmd. However, the current uring_cmd may be in ACTIVE state, without block request coming to the uring command. Meantime, if the cached request in tag_set.tags[tag] has been delivered to ublk server and reycycled, then this uring_cmd can't be canceled. ublk requests are aborted in ublk char device release handler, which depends on canceling all ACTIVE uring_cmd. So it causes a dead loop. Fix this issue by not taking a stale request into account when canceling uring_cmd in ublk_cancel_cmd(). Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/ Fixes: f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250515162601.77346-1-ming.lei@redhat.com [axboe: rewording of commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14brd: avoid extra xarray lookups on first writeChristoph Hellwig1-43/+33
The xarray can return the previous entry at a location. Use this fact to simplify the brd code when there is no existing page at a location. This also slighly improves the handling of racy discards as we now always have a page under RCU protection by the time we are ready to copy the data. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250507060700.3929430-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>