kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2026-01-22	bcache: use bio cloning for detached device requests	Shida Zhang	3	-46/+54
	Previously, bcache hijacked the bi_end_io and bi_private fields of the incoming bio when the backing device was in a detached state. This is fragile and breaks if the bio is needed to be processed by other layers. This patch transitions to using a cloned bio embedded within a private structure. This ensures the original bio's metadata remains untouched. Fixes: 53280e398471 ("bcache: fix improper use of bi_end_io") Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Acked-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-22	blk-mq: use BLK_POLL_ONESHOT for synchronous poll completion	Ming Lei	1	-1/+1
	blk_execute_rq() with polling is used in kernel code paths such as NVMe controller connect. The aggressive spinning in blk_hctx_poll() can prevent the completion task from getting a chance to run, causing a lockup. The spinning with cpu_relax() doesn't yield CPU, so need_resched() only becomes true on timer tick. This causes unnecessary spinning while the completion task is already waiting to run. Before commit f22ecf9c14c1, the loop would exit early because task_is_running() was always true. After that commit removed the check, the loop now spins until need_resched(). Fix this by using BLK_POLL_ONESHOT in blk_rq_poll_completion(). This causes blk_hctx_poll() to poll once and return immediately, letting the outer loop's cond_resched() yield CPU so the completion task can run. Fixes: f22ecf9c14c1 ("blk-mq: delete task running check in blk_hctx_poll()") Cc: Diangang Li <lidiangang@bytedance.com> Cc: Fengnan Chang <changfengnan@bytedance.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21	selftests/ublk: fix garbage output in foreground mode	Ming Lei	1	-1/+2
	Initialize _evtfd to -1 in struct dev_ctx to prevent garbage output when running kublk in foreground mode. Without this, _evtfd is zero-initialized to 0 (stdin), and ublk_send_dev_event() writes binary data to stdin which appears as garbage on the terminal. Also fix debug message format string. Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21	selftests/ublk: fix error handling for starting device	Ming Lei	1	-2/+4
	Fix error handling in ublk_start_daemon() when start_dev fails: 1. Call ublk_ctrl_stop_dev() to cancel inflight uring_cmd before cleanup. Without this, the device deletion may hang waiting for I/O completion that will never happen. 2. Add fail_start label so that pthread_join() is called on the error path. This ensures proper thread cleanup when startup fails. Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21	selftests/ublk: fix IO thread idle check	Ming Lei	1	-1/+1
	Include cmd_inflight in ublk_thread_is_done() check. Without this, the thread may exit before all FETCH commands are completed, which may cause device deletion to hang. Fixes: 6aecda00b7d1 ("selftests: ublk: add kernel selftests for ublk") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21	block: make the new blkzoned UAPI constants discoverable	Christoph Hellwig	1	-2/+4
	The Linux 6.19 merge window added the new BLKREPORTZONESV2 ioctl, and with it the new BLK_ZONE_REP_CACHED and BLK_ZONE_COND_ACTIVE constants. The two constants are defined as part of enums, which makes it very painful for userspace to discover if they are present in the installed system headers. Use the #define to the same name trick to make them trivially discoverable using CPP directives. Fixes: 0bf0e2e46668 ("block: track zone conditions") Fixes: b30ffcdc0c15 ("block: introduce BLKREPORTZONESV2 ioctl") Reported-by: Andrey Albershteyn <aalbersh@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-21	ublk: fix ublksrv pid handling for pid namespaces	Seamus Connor	1	-5/+34
	When ublksrv runs inside a pid namespace, START/END_RECOVERY compared the stored init-ns tgid against the userspace pid (getpid vnr), so the check failed and control ops could not proceed. Compare against the caller’s init-ns tgid and store that value, then translate it back to the caller’s pid namespace when reporting GET_DEV_INFO so ublk list shows a sensible pid. Testing: start/recover in a pid namespace; `ublk list` shows reasonable pid values in init, child, and sibling namespaces. Fixes: c2c8089f325e ("ublk: validate ublk server pid") Signed-off-by: Seamus Connor <sconnor@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-19	block: Fix an error path in disk_update_zone_resources()	Bart Van Assche	1	-0/+1
	Any queue_limits_start_update() call must be followed either by a queue_limits_commit_update() call or by a queue_limits_cancel_update() call. Make sure that the error path near the start of disk_update_zone_resources() follows this requirement. Remove the "goto unfreeze" statement from that error path to make the code easier to verify. This was detected by annotating the queue_limits_*() calls with Clang thread-safety attributes and by building the kernel with thread-safety checking enabled. Without this patch and with thread-safety checking enabled, the following error is reported: block/blk-zoned.c:2020:1: error: mutex 'disk->queue->limits_lock' is not held on every path through here [-Werror,-Wthread-safety-analysis] 2020 \| } \| ^ block/blk-zoned.c:1959:8: note: mutex acquired here 1959 \| lim = queue_limits_start_update(q); \| ^ Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: bba4322e3f30 ("block: freeze queue when updating zone resources") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260114192803.4171847-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-15	rnbd-clt: fix refcount underflow in device unmap path	Chaitanya Kulkarni	1	-1/+0
	During device unmapping (triggered by module unload or explicit unmap), a refcount underflow occurs causing a use-after-free warning: [14747.574913] ------------[ cut here ]------------ [14747.574916] refcount_t: underflow; use-after-free. [14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378 [14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ... [14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G O N 6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary) [14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client] [14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90 [14747.575037] Call Trace: [14747.575038] <TASK> [14747.575038] rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client] [14747.575044] process_one_work+0x211/0x600 [14747.575052] worker_thread+0x184/0x330 [14747.575055] ? __pfx_worker_thread+0x10/0x10 [14747.575058] kthread+0x10d/0x250 [14747.575062] ? __pfx_kthread+0x10/0x10 [14747.575066] ret_from_fork+0x319/0x390 [14747.575069] ? __pfx_kthread+0x10/0x10 [14747.575072] ret_from_fork_asm+0x1a/0x30 [14747.575083] </TASK> [14747.575096] ---[ end trace 0000000000000000 ]--- Befor this patch :- The bug is a double kobject_put() on dev->kobj during device cleanup. Kobject Lifecycle: kobject_init_and_add() sets kobj.kref = 1 (initialization) kobject_put() sets kobj.kref = 0 (should be called once) * Before this patch: rnbd_clt_unmap_device() rnbd_destroy_sysfs() kobject_del(&dev->kobj) [remove from sysfs] kobject_put(&dev->kobj) PUT #1 (WRONG!) kref: 1 to 0 rnbd_dev_release() kfree(dev) [DEVICE FREED!] rnbd_destroy_gen_disk() [use-after-free!] rnbd_clt_put_dev() refcount_dec_and_test(&dev->refcount) kobject_put(&dev->kobj) PUT #2 (UNDERFLOW!) kref: 0 to -1 [WARNING!] The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the device via rnbd_dev_release(), then the second kobject_put() in rnbd_clt_put_dev() causes refcount underflow. * After this patch :- Remove kobject_put() from rnbd_destroy_sysfs(). This function should only remove sysfs visibility (kobject_del), not manage object lifetime. Call Graph (FIXED): rnbd_clt_unmap_device() rnbd_destroy_sysfs() kobject_del(&dev->kobj) [remove from sysfs only] [kref unchanged: 1] rnbd_destroy_gen_disk() [device still valid] rnbd_clt_put_dev() refcount_dec_and_test(&dev->refcount) kobject_put(&dev->kobj) ONLY PUT (CORRECT!) kref: 1 to 0 [BALANCED] rnbd_dev_release() kfree(dev) [CLEAN DESTRUCTION] This follows the kernel pattern where sysfs removal (kobject_del) is separate from object destruction (kobject_put). Fixes: 581cf833cac4 ("block: rnbd: add .release to rnbd_dev_ktype") Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-14	Merge tag 'nvme-6.19-2026-01-14' of git://git.infradead.org/nvme into block-6.19	Jens Axboe	5	-7/+26
	Pull NVMe fixes from Keith: "- Device quirk to disable faulty temperature (Ilikara) - TCP target null pointer fix from bad host protocol usage (Shivam) - Add compatible apple controller (Janne) - FC tagset leak fix (Chaitanya) - TCP socket deadlock fix (Hannes) - Target name buffer overrun fix (Shin'ichiro)" * tag 'nvme-6.19-2026-01-14' of git://git.infradead.org/nvme: nvme: fix PCIe subsystem reset controller state transition nvmet: do not copy beyond sybsysnqn string length nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready() nvme-fc: release admin tagset if init fails nvme-apple: add "apple,t8103-nvme-ans2" as compatible nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec nvme-pci: disable secondary temp for Wodposit WPBSNM8
2026-01-14	nvme: fix PCIe subsystem reset controller state transition	Nilay Shroff	1	-1/+4
	The commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING state”) disallows controller state transitions directly from RESETTING to LIVE. However, the NVMe PCIe subsystem reset path relies on this transition to recover the controller on PowerPC (PPC) systems. On PPC systems, issuing a subsystem reset causes a temporary loss of communication with the NVMe adapter. A subsequent PCIe MMIO read then triggers EEH recovery, which restores the PCIe link and brings the controller back online. For EEH recovery to proceed correctly, the controller must transition back to the LIVE state. Due to the changes introduced by commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING state”), the controller can no longer transition directly from RESETTING to LIVE. As a result, EEH recovery exits prematurely, leaving the controller stuck in the RESETTING state. Fix this by explicitly transitioning the controller state from RESETTING to CONNECTING and then to LIVE. This satisfies the updated state transition rules and allows the controller to be successfully recovered on PPC systems following a PCIe subsystem reset. Cc: stable@vger.kernel.org Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state") Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-14	nvmet: do not copy beyond sybsysnqn string length	Shin'ichiro Kawasaki	1	-1/+1
	Commit edd17206e363 ("nvmet: remove redundant subsysnqn field from ctrl") replaced ctrl->subsysnqn with ctrl->subsys->subsysnqn. This change works as expected because both point to strings with the same data. However, their memory allocation lengths differ. ctrl->subsysnqn had the fixed size defined as NVMF_NQN_FILED_LEN, while ctrl->subsys->subsysnqn has variable length determined by kstrndup(). Due to this difference, KASAN slab-out-of-bounds occurs at memcpy() in nvmet_passthru_override_id_ctrl() after the commit. The failure can be recreated by running the blktests test case nvme/033. To prevent such failures, replace memcpy() with strscpy(), which copies only the string length and avoids overruns. Fixes: edd17206e363 ("nvmet: remove redundant subsysnqn field from ctrl") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13	nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()	Hannes Reinecke	1	-5/+4
	When the socket is closed while in TCP_LISTEN a callback is run to flush all outstanding packets, which in turns calls nvmet_tcp_listen_data_ready() with the sk_callback_lock held. So we need to check if we are in TCP_LISTEN before attempting to get the sk_callback_lock() to avoid a deadlock. Link: https://lore.kernel.org/linux-nvme/CAHj4cs-zu7eVB78yUpFjVe2UqMWFkLk8p+DaS3qj+uiGCXBAoA@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13	null_blk: fix kmemleak by releasing references to fault configfs items	Nilay Shroff	1	-1/+11
	When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, the null-blk driver sets up fault injection support by creating the timeout_inject, requeue_inject, and init_hctx_fault_inject configfs items as children of the top-level nullbX configfs group. However, when the nullbX device is removed, the references taken to these fault-config configfs items are not released. As a result, kmemleak reports a memory leak, for example: unreferenced object 0xc00000021ff25c40 (size 32): comm "mkdir", pid 10665, jiffies 4322121578 hex dump (first 32 bytes): 69 6e 69 74 5f 68 63 74 78 5f 66 61 75 6c 74 5f init_hctx_fault_ 69 6e 6a 65 63 74 00 88 00 00 00 00 00 00 00 00 inject.......... backtrace (crc 1a018c86): __kmalloc_node_track_caller_noprof+0x494/0xbd8 kvasprintf+0x74/0xf4 config_item_set_name+0xf0/0x104 config_group_init_type_name+0x48/0xfc fault_config_init+0x48/0xf0 0xc0080000180559e4 configfs_mkdir+0x304/0x814 vfs_mkdir+0x49c/0x604 do_mkdirat+0x314/0x3d0 sys_mkdir+0xa0/0xd8 system_call_exception+0x1b0/0x4f0 system_call_vectored_common+0x15c/0x2ec Fix this by explicitly releasing the references to the fault-config configfs items when dropping the reference to the top-level nullbX configfs group. Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Fixes: bb4c19e030f4 ("block: null_blk: make fault-injection dynamically configurable per device") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10	block: zero non-PI portion of auto integrity buffer	Caleb Sander Mateos	1	-1/+1
	The auto-generated integrity buffer for writes needs to be fully initialized before being passed to the underlying block device, otherwise the uninitialized memory can be read back by userspace or anyone with physical access to the storage device. If protection information is generated, that portion of the integrity buffer is already initialized. The integrity data is also zeroed if PI generation is disabled via sysfs or the PI tuple size is 0. However, this misses the case where PI is generated and the PI tuple size is nonzero, but the metadata size is larger than the PI tuple. In this case, the remainder ("opaque") of the metadata is left uninitialized. Generalize the BLK_INTEGRITY_CSUM_NONE check to cover any case when the metadata is larger than just the PI tuple. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: c546d6f43833 ("block: only zero non-PI metadata tuples in bio_integrity_prep") Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-09	nvme-fc: release admin tagset if init fails	Chaitanya Kulkarni	1	-0/+2
	nvme_fabrics creates an NVMe/FC controller in following path: nvmf_dev_write() -> nvmf_create_ctrl() -> nvme_fc_create_ctrl() -> nvme_fc_init_ctrl() nvme_fc_init_ctrl() allocates the admin blk-mq resources right after nvme_add_ctrl() succeeds. If any of the subsequent steps fail (changing the controller state, scheduling connect work, etc.), we jump to the fail_ctrl path, which tears down the controller references but never frees the admin queue/tag set. The leaked blk-mq allocations match the kmemleak report seen during blktests nvme/fc. Check ctrl->ctrl.admin_tagset in the fail_ctrl path and call nvme_remove_admin_tag_set() when it is set so that all admin queue allocations are reclaimed whenever controller setup aborts. Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-09	nvme-apple: add "apple,t8103-nvme-ans2" as compatible	Janne Grunau	1	-0/+1
	After discussion with the devicetree maintainers we agreed to not extend lists with the generic compatible "apple,nvme-ans2" anymore [1]. Add "apple,t8103-nvme-ans2" as fallback compatible as it is the SoC the driver and bindings were written for. [1]: https://lore.kernel.org/asahi/12ab93b7-1fc2-4ce0-926e-c8141cfe81bf@kernel.org/ Cc: stable@vger.kernel.org # v6.18+ Fixes: 5bd2927aceba ("nvme-apple: Add initial Apple SoC NVMe driver") Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Janne Grunau <j@jannau.net> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-09	nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec	Shivam Kumar	1	-0/+12
	Commit efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length") added ttag bounds checking and data_offset validation in nvmet_tcp_handle_h2c_data_pdu(), but it did not validate whether the command's data structures (cmd->req.sg and cmd->iov) have been properly initialized before processing H2C_DATA PDUs. The nvmet_tcp_build_pdu_iovec() function dereferences these pointers without NULL checks. This can be triggered by sending H2C_DATA PDU immediately after the ICREQ/ICRESP handshake, before sending a CONNECT command or NVMe write command. Attack vectors that trigger NULL pointer dereferences: 1. H2C_DATA PDU sent before CONNECT → both pointers NULL 2. H2C_DATA PDU for READ command → cmd->req.sg allocated, cmd->iov NULL 3. H2C_DATA PDU for uninitialized command slot → both pointers NULL The fix validates both cmd->req.sg and cmd->iov before calling nvmet_tcp_build_pdu_iovec(). Both checks are required because: - Uninitialized commands: both NULL - READ commands: cmd->req.sg allocated, cmd->iov NULL - WRITE commands: both allocated Fixes: efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-09	ublk: fix use-after-free in ublk_partition_scan_work	Ming Lei	1	-15/+22
	A race condition exists between the async partition scan work and device teardown that can lead to a use-after-free of ub->ub_disk: 1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk() 2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does: - del_gendisk(ub->ub_disk) - ublk_detach_disk() sets ub->ub_disk = NULL - put_disk() which may free the disk 3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk leading to UAF Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold a reference to the disk during the partition scan. The spinlock in ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker either gets a valid reference or sees NULL and exits early. Also change flush_work() to cancel_work_sync() to avoid running the partition scan work unnecessarily when the disk is already detached. Fixes: 7fc4da6a304b ("ublk: scan partition in async way") Reported-by: Ruikai Peng <ruikai@pwno.io> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07	blk-mq: avoid stall during boot due to synchronize_rcu_expedited	Mikulas Patocka	1	-2/+1
	On the kernel 6.19-rc, I am experiencing 15-second boot stall in a virtual machine when probing a virtio-scsi disk: [ 1.011641] SCSI subsystem initialized [ 1.013972] virtio_scsi virtio6: 16/0/0 default/read/poll queues [ 1.015983] scsi host0: Virtio SCSI HBA [ 1.019578] ACPI: \_SB_.GSIA: Enabled at IRQ 16 [ 1.020225] ahci 0000:00:1f.2: AHCI vers 0001.0000, 32 command slots, 1.5 Gbps, SATA mode [ 1.020228] ahci 0000:00:1f.2: 6/6 ports implemented (port mask 0x3f) [ 1.020230] ahci 0000:00:1f.2: flags: 64bit ncq only [ 1.024688] scsi host1: ahci [ 1.025432] scsi host2: ahci [ 1.025966] scsi host3: ahci [ 1.026511] scsi host4: ahci [ 1.028371] scsi host5: ahci [ 1.028918] scsi host6: ahci [ 1.029266] ata1: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23100 irq 16 lpm-pol 1 [ 1.029305] ata2: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23180 irq 16 lpm-pol 1 [ 1.029316] ata3: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23200 irq 16 lpm-pol 1 [ 1.029327] ata4: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23280 irq 16 lpm-pol 1 [ 1.029341] ata5: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23300 irq 16 lpm-pol 1 [ 1.029356] ata6: SATA max UDMA/133 abar m4096@0xfea23000 port 0xfea23380 irq 16 lpm-pol 1 [ 1.118111] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 1.348916] ata1: SATA link down (SStatus 0 SControl 300) [ 1.350713] ata2: SATA link down (SStatus 0 SControl 300) [ 1.351025] ata6: SATA link down (SStatus 0 SControl 300) [ 1.351160] ata5: SATA link down (SStatus 0 SControl 300) [ 1.351326] ata3: SATA link down (SStatus 0 SControl 300) [ 1.351536] ata4: SATA link down (SStatus 0 SControl 300) [ 1.449153] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2 [ 16.483477] sd 0:0:0:0: Power-on or device reset occurred [ 16.483691] sd 0:0:0:0: [sda] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB) [ 16.483762] sd 0:0:0:0: [sda] Write Protect is off [ 16.483877] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.569225] sd 0:0:0:0: [sda] Attached SCSI disk I bisected it and it is caused by the commit 89e1fb7ceffd which introduces calls to synchronize_rcu_expedited. This commit replaces synchronize_rcu_expedited and kfree with a call to kfree_rcu_mightsleep, avoiding the 15-second delay. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 89e1fb7ceffd ("blk-mq: fix potential uaf for 'queue_hw_ctx'") Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07	loop: add missing bd_abort_claiming in loop_set_status	Tetsuo Handa	1	-1/+3
	Commit 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") forgot to call bd_abort_claiming() when mutex_lock_killable() failed. Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07	block: don't merge bios with different app_tags	Caleb Sander Mateos	1	-5/+18
	nvme_set_app_tag() uses the app_tag value from the bio_integrity_payload of the struct request's first bio. This assumes all the request's bios have the same app_tag. However, it is possible for bios with different app_tag values to be merged into a single request. Add a check in blk_integrity_merge_{bio,rq}() to prevent the merging of bios/requests with different app_tag values if BIP_CHECK_APPTAG is set. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 3d8b5a22d404 ("block: add support to pass user meta buffer") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-07	blk-rq-qos: Remove unlikely() hints from QoS checks	Breno Leitao	1	-16/+9
	The unlikely() annotations on QUEUE_FLAG_QOS_ENABLED checks are counterproductive. Writeback throttling (WBT) might be enabled by default, mainly because CONFIG_BLK_WBT_MQ defaults to 'y'. Branch profiling on Meta servers, which have WBT enabled, confirms 100% misprediction rates on these checks. Remove the unlikely() annotations to let the CPU's branch predictor learn the actual behavior, potentially improving I/O path performance. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-06	loop: don't change loop device under exclusive opener in loop_set_status	Raphael Pinsonneault-Thibeault	1	-11/+30
	loop_set_status() is allowed to change the loop device while there are other openers of the device, even exclusive ones. In this case, it causes a KASAN: slab-out-of-bounds Read in ext4_search_dir(), since when looking for an entry in an inlined directory, e_value_offs is changed underneath the filesystem by loop_set_status(). Fix the problem by forbidding loop_set_status() from modifying the loop device while there are exclusive openers of the device. This is similar to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") alongside commit ecbe6bc0003b ("block: use bd_prepare_to_claim directly in the loop driver"). Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397 Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01	block, bfq: update outdated comment	Julia Lawall	1	-1/+1
	The function bfq_bfqq_may_idle() was renamed as bfq_better_to_idle() in commit 277a4a9b56cd ("block, bfq: give a better name to bfq_bfqq_may_idle"). Update the comment accordingly. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-31	Merge tag 'md-6.19-20251231' of ↵	Jens Axboe	2	-15/+56
	gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into block-6.19 Pull MD fixes from Yu Kuai: "- Fix null-pointer dereference in raid5 sysfs group_thread_cnt store (Tuo Li) - Fix possible mempool corruption during raid1 raid_disks update via sysfs (FengWei Shih) - Fix logical_block_size configuration being overwritten during super_1_validate() (Li Nan) - Fix forward incompatibility with configurable logical block size: arrays assembled on new kernels could not be assembled on kernels <=6.18 due to non-zero reserved pad rejection (Li Nan) - Fix static checker warning about iterator not incremented (Li Nan)" * tag 'md-6.19-20251231' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: md: Fix forward incompatibility from configurable logical block size md: Fix logical_block_size configuration being overwritten md: suspend array while updating raid_disks via sysfs md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt() md: Fix static checker warning in analyze_sbs
2025-12-30	blk-mq: skip CPU offline notify on unmapped hctx	Cong Zhang	1	-1/+1
	If an hctx has no software ctx mapped, blk_mq_map_swqueue() never allocates tags and leaves hctx->tags NULL. The CPU hotplug offline notifier can still run for that hctx, return early since hctx cannot hold any requests. Signed-off-by: Cong Zhang <cong.zhang@oss.qualcomm.com> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28	selftests/ublk: fix Makefile to rebuild on header changes	Ming Lei	1	-2/+2
	Add header dependencies to kublk build rule so that changes to kublk.h, ublk_dep.h, or utils.h trigger a rebuild. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28	selftests/ublk: add test for async partition scan	Ming Lei	3	-4/+81
	Add test_generic_15.sh to verify that async partition scan prevents IO hang when reading partition tables. The test creates ublk devices with fault_inject target and very large delay (60s) to simulate blocked partition table reads, then kills the daemon to verify proper state transitions without hanging: 1. Without recovery support: - Create device with fault_inject and 60s delay - Kill daemon while partition scan may be blocked - Verify device transitions to DEAD state 2. With recovery support (-r 1): - Create device with fault_inject, 60s delay, and recovery - Kill daemon while partition scan may be blocked - Verify device transitions to QUIESCED state Before the async partition scan fix, killing the daemon during partition scan would cause deadlock as partition scan held ub->mutex while waiting for IO. With the async fix, partition scan happens in a work function and flush_work() ensures proper synchronization. Add _add_ublk_dev_no_settle() helper function to skip udevadm settle, which would otherwise hang waiting for partition scan events to complete when partition table read is delayed. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28	ublk: scan partition in async way	Ming Lei	1	-3/+32
	Implement async partition scan to avoid IO hang when reading partition tables. Similar to nvme_partition_scan_work(), partition scanning is deferred to a work queue to prevent deadlocks. When partition scan happens synchronously during add_disk(), IO errors can cause the partition scan to wait while holding ub->mutex, which can deadlock with other operations that need the mutex. Changes: - Add partition_scan_work to ublk_device structure - Implement ublk_partition_scan_work() to perform async scan - Always suppress sync partition scan during add_disk() - Schedule async work after add_disk() for trusted daemons - Add flush_work() in ublk_stop_dev() before grabbing ub->mutex Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reported-by: Yoav Cohen <yoav@nvidia.com> Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/ Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-28	block,bfq: fix aux stat accumulation destination	shechenglong	1	-1/+1
	Route bfqg_stats_add_aux() time accumulation into the destination stats object instead of the source, aligning with other stat fields. Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: shechenglong <shechenglong@xfusion.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-27	md: Fix forward incompatibility from configurable logical block size	Li Nan	1	-4/+44
	Commit 62ed1b582246 ("md: allow configuring logical block size") used reserved pad to add 'logical_block_size' to metadata. RAID rejects non-zero reserved pad, so arrays fail when rolling back to old kernels after booting new ones. Set 'logical_block_size' only for newly created arrays to support rollback to old kernels. Importantly new arrays still won't work on old kernels to prevent data loss issue from LBS changes. For arrays created on old kernels which confirmed not to rollback, configure LBS by echo current LBS (queue/logical_block_size) to md/logical_block_size. Fixes: 62ed1b582246 ("md: allow configuring logical block size") Reported-by: BugReports <bugreports61@gmail.com> Closes: https://lore.kernel.org/linux-raid/825e532d-d1e1-44bb-5581-692b7c091796@huaweicloud.com/T/#t Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20251226024221.724201-2-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27	md: Fix logical_block_size configuration being overwritten	Li Nan	1	-1/+3
	In super_1_validate(), mddev->logical_block_size is directly overwritten with the value from metadata. This causes the previously configured lbs to be lost, making the configuration ineffective. Fix it. Fixes: 62ed1b582246 ("md: allow configuring logical block size") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20251226024221.724201-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27	md: suspend array while updating raid_disks via sysfs	FengWei Shih	1	-2/+2
	In raid1_reshape(), freeze_array() is called before modifying the r1bio memory pool (conf->r1bio_pool) and conf->raid_disks, and unfreeze_array() is called after the update is completed. However, freeze_array() only waits until nr_sync_pending and (nr_pending - nr_queued) of all buckets reaches zero. When an I/O error occurs, nr_queued is increased and the corresponding r1bio is queued to either retry_list or bio_end_io_list. As a result, freeze_array() may unblock before these r1bios are released. This can lead to a situation where conf->raid_disks and the mempool have already been updated while queued r1bios, allocated with the old raid_disks value, are later released. Consequently, free_r1bio() may access memory out of bounds in put_all_bios() and release r1bios of the wrong size to the new mempool, potentially causing issues with the mempool as well. Since only normal I/O might increase nr_queued while an I/O error occurs, suspending the array avoids this issue. Note: Updating raid_disks via ioctl SET_ARRAY_INFO already suspends the array. Therefore, we suspend the array when updating raid_disks via sysfs to avoid this issue too. Signed-off-by: FengWei Shih <dannyshih@synology.com> Link: https://lore.kernel.org/linux-raid/20251226101816.4506-1-dannyshih@synology.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-27	md/raid5: fix possible null-pointer dereferences in ↵	Tuo Li	1	-4/+6
	raid5_store_group_thread_cnt() The variable mddev->private is first assigned to conf and then checked: conf = mddev->private; if (!conf) ... If conf is NULL, then mddev->private is also NULL. In this case, null-pointer dereferences can occur when calling raid5_quiesce(): raid5_quiesce(mddev, true); raid5_quiesce(mddev, false); since mddev->private is assigned to conf again in raid5_quiesce(), and conf is dereferenced in several places, for example: conf->quiesce = 0; wake_up(&conf->wait_for_quiescent); To fix this issue, the function should unlock mddev and return before invoking raid5_quiesce() when conf is NULL, following the existing pattern in raid5_change_consistency_policy(). Fixes: fa1944bbe622 ("md/raid5: Wait sync io to finish before changing group cnt") Signed-off-by: Tuo Li <islituo@gmail.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20251225130326.67780-1-islituo@gmail.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-25	md: Fix static checker warning in analyze_sbs	Li Nan	1	-4/+1
	The following warn is reported: drivers/md/md.c:3912 analyze_sbs() warn: iterator 'i' not incremented Fixes: d8730f0cf4ef ("md: Remove deprecated CONFIG_MD_MULTIPATH") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-raid/7e2e95ce-3740-09d8-a561-af6bfb767f18@huaweicloud.com/T/#t Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20251215124412.4015572-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-12-20	block: rnbd-clt: Fix signedness bug in init_dev()	Dan Carpenter	1	-1/+1
	The "dev->clt_device_id" variable is set using ida_alloc_max() which returns an int and in particular it returns negative error codes. Change the type from u32 to int to fix the error checking. Fixes: c9b5645fd8ca ("block: rnbd-clt: Fix leaked ID in init_dev()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20	ublk: clean up user copy references on ublk server exit	Caleb Sander Mateos	1	-2/+1
	If a ublk server process releases a ublk char device file, any requests dispatched to the ublk server but not yet completed will retain a ref value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef83 ("ublk: simplify aborting ublk request"), __ublk_fail_req() would decrement the reference count before completing the failed request. However, that commit optimized __ublk_fail_req() to call __ublk_complete_rq() directly without decrementing the request reference count. The leaked reference count incorrectly allows user copy and zero copy operations on the completed ublk request. It also triggers the WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit() and ublk_deinit_queue(). Commit c5c5eb24ed61 ("ublk: avoid ublk_io_release() called after ublk char dev is closed") already fixed the issue for ublk devices using UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference count leak also affects UBLK_F_USER_COPY, the other reference-counted data copy mode. Fix the condition in ublk_check_and_reset_active_ref() to include all reference-counted data copy modes. This ensures that any ublk requests still owned by the ublk server when it exits have their reference counts reset to 0. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: e63d2228ef83 ("ublk: simplify aborting ublk request") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18	block: validate interval_exp integrity limit	Caleb Sander Mateos	1	-1/+6
	Various code assumes that the integrity interval is at least 1 sector and evenly divides the logical block size. Add these checks to blk_validate_integrity_limits(). This guards against block drivers that report invalid interval_exp values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18	block: validate pi_offset integrity limit	Caleb Sander Mateos	1	-4/+3
	The PI tuple must be contained within the metadata value, so validate that pi_offset + pi_tuple_size <= metadata_size. This guards against block drivers that report invalid pi_offset values. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18	block: rnbd-clt: Fix leaked ID in init_dev()	Thomas Fourier	1	-5/+8
	If kstrdup() fails in init_dev(), then the newly allocated ID is lost. Fixes: 64e8a6ece1a5 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-18	ublk: fix deadlock when reading partition table	Ming Lei	1	-4/+28
	When one process(such as udev) opens ublk block device (e.g., to read the partition table via bdev_open()), a deadlock[1] can occur: 1. bdev_open() grabs disk->open_mutex 2. The process issues read I/O to ublk backend to read partition table 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request() runs bio->bi_end_io() callbacks 4. If this triggers fput() on file descriptor of ublk block device, the work may be deferred to current task's task work (see fput() implementation) 5. This eventually calls blkdev_release() from the same context 6. blkdev_release() tries to grab disk->open_mutex again 7. Deadlock: same task waiting for a mutex it already holds The fix is to run blk_update_request() and blk_mq_end_request() with bottom halves disabled. This forces blkdev_release() to run in kernel work-queue context instead of current task work context, and allows ublk server to make forward progress, and avoids the deadlock. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Link: https://github.com/ublk-org/ublksrv/issues/170 [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> [axboe: rewrite comment in ublk] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17	block: add allocation size check in blkdev_pr_read_keys()	Deepanshu Kartikey	2	-4/+7
	blkdev_pr_read_keys() takes num_keys from userspace and uses it to calculate the allocation size for keys_info via struct_size(). While there is a check for SIZE_MAX (integer overflow), there is no upper bound validation on the allocation size itself. A malicious or buggy userspace can pass a large num_keys value that doesn't trigger overflow but still results in an excessive allocation attempt, causing a warning in the page allocator when the order exceeds MAX_PAGE_ORDER. Fix this by introducing PR_KEYS_MAX to limit the number of keys to a sane value. This makes the SIZE_MAX check redundant, so remove it. Also switch to kvzalloc/kvfree to handle larger allocations gracefully. Fixes: 22a1ffea5f80 ("block: add IOC_PR_READ_KEYS ioctl") Tested-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Reported-by: syzbot+660d079d90f8a1baf54d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=660d079d90f8a1baf54d Link: https://lore.kernel.org/all/20251212013510.3576091-1-kartikey406@gmail.com/T/ [v1] Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15	Documentation: admin-guide: blockdev: replace zone_capacity with ↵	Yongpeng Yang	1	-1/+1
	zone_capacity_mb when creating devices The "zone_capacity=%umb" option is no longer used. The effective option is now "zone_capacity_mb=%u", so update the documentation accordingly. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15	zloop: use READ_ONCE() to read lo->lo_state in queue_rq path	Yongpeng Yang	1	-4/+4
	In the queue_rq path, zlo->state is accessed without locking, and direct access may read stale data. This patch uses READ_ONCE() to read zlo->state and data_race() to silence code checkers, and changes all assignments to use WRITE_ONCE(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15	loop: use READ_ONCE() to read lo->lo_state without locking	Yongpeng Yang	1	-9/+13
	When lo->lo_mutex is not held, direct access may read stale data. This patch uses READ_ONCE() to read lo->lo_state and data_race() to silence code checkers, and changes all assignments to use WRITE_ONCE(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15	nvme-pci: disable secondary temp for Wodposit WPBSNM8	Ilikara Zheng	1	-0/+2
	Secondary temperature thresholds (temp2_{min,max}) were not reported properly on this NVMe SSD. This resulted in an error while attempting to read these values with sensors(1): ERROR: Can't get value of subfeature temp2_min: I/O error ERROR: Can't get value of subfeature temp2_max: I/O error Add the device to the nvme_id_table with the NVME_QUIRK_NO_SECONDARY_TEMP_THRESH flag to suppress access to all non- composite temperature thresholds. Cc: stable@vger.kernel.org Tested-by: Wu Haotian <rigoligo03@gmail.com> Signed-off-by: Ilikara Zheng <ilikara@aosc.io> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-12-12	block: fix race between wbt_enable_default and IO submission	Ming Lei	6	-11/+23
	When wbt_enable_default() is moved out of queue freezing in elevator_change(), it can cause the wbt inflight counter to become negative (-1), leading to hung tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter is in an inconsistent state. The issue occurs because wbt_enable_default() could race with IO submission, allowing the counter to be decremented before proper initialization. This manifests as: rq_wait[0]: inflight: -1 has_waiters: True rwb_enabled() checks the state, which can be updated exactly between wbt_wait() (rq_qos_throttle()) and wbt_track()(rq_qos_track()), then the inflight counter will become negative. And results in hung task warnings like: task:kworker/u24:39 state:D stack:0 pid:14767 Call Trace: rq_qos_wait+0xb4/0x150 wbt_wait+0xa9/0x100 __rq_qos_throttle+0x24/0x40 blk_mq_submit_bio+0x672/0x7b0 ... Fix this by: 1. Splitting wbt_enable_default() into: - __wbt_enable_default(): Returns true if wbt_init() should be called - wbt_enable_default(): Wrapper for existing callers (no init) - wbt_init_enable_default(): New function that checks and inits WBT 2. Using wbt_init_enable_default() in blk_register_queue() to ensure proper initialization during queue registration 3. Move wbt_init() out of wbt_enable_default() which is only for enabling disabled wbt from bfq and iocost, and wbt_init() isn't needed. Then the original lock warning can be avoided. 4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling code since it's no longer needed This ensures WBT is properly initialized before any IO can be submitted, preventing the counter from going negative. Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai@fnnas.com> Cc: Guangwu Zhang <guazhang@redhat.com> Fixes: 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-12	selftests: ublk: add user copy test cases	Caleb Sander Mateos	9	-0/+243
	The ublk selftests cover every data copy mode except user copy. Add tests for user copy based on the existing test suite: - generic_14 ("basic recover function verification (user copy)") based on generic_04 and generic_05 - null_03 ("basic IO test with user copy") based on null_01 and null_02 - loop_06 ("write and verify over user copy") based on loop_01 and loop_03 - loop_07 ("mkfs & mount & umount with user copy") based on loop_02 and loop_04 - stripe_05 ("write and verify test on user copy") based on stripe_03 - stripe_06 ("mkfs & mount & umount on user copy") based on stripe_02 and stripe_04 - stress_06 ("run IO and remove device (user copy)") based on stress_01 and stress_03 - stress_07 ("run IO and kill ublk server (user copy)") based on stress_02 and stress_04 Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-12	selftests: ublk: add support for user copy to kublk	Caleb Sander Mateos	4	-9/+64
	The ublk selftests mock ublk server kublk supports every data copy mode except user copy. Add support for user copy to kublk, enabled via the --user_copy (-u) command line argument. On writes, issue pread() calls to copy the write data into the ublk_io's buffer before dispatching the write to the target implementation. On reads, issue pwrite() calls to copy read data from the ublk_io's buffer before committing the request. Copy in 2 KB chunks to provide some coverage of the offseting logic. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>