kernel/linux.git/drivers/nvme, branch v6.6.142

nvme-pci: fix missed admin queue sq doorbell write

2026-05-23T11:03:27+00:00

[ Upstream commit 1cc4cdae2a3b7730d462d69e30f213fd2efe7807 ] We can batch admin commands submitted through io_uring_cmd passthrough, which means bd->last may be false and skips the doorbell write to aggregate multiple commands per write. If a subsequent command can't be dispatched for whatever reason, we have to provide the blk-mq ops' commit_rqs callback in order to ensure we properly update the doorbell. Fixes: 58e5bdeb9c2b ("nvme: enable uring-passthrough for admin commands") Reviewed-by: Christoph Hellwig Reviewed-by: Kanchan Joshi Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet-tcp: propagate nvmet_tcp_build_pdu_iovec() errors to its callers

2026-05-23T11:03:27+00:00

[ Upstream commit ea8e356acb165cb1fd75537a52e1f66e5e76c538 ] Currently, when nvmet_tcp_build_pdu_iovec() detects an out-of-bounds PDU length or offset, it triggers nvmet_tcp_fatal_error(cmd->queue) and returns early. However, because the function returns void, the callers are entirely unaware that a fatal error has occurred and that the cmd->recv_msg.msg_iter was left uninitialized. Callers such as nvmet_tcp_handle_h2c_data_pdu() proceed to blindly overwrite the queue state with queue->rcv_state = NVMET_TCP_RECV_DATA Consequently, the socket receiving loop may attempt to read incoming network data into the uninitialized iterator. Fix this by shifting the error handling responsibility to the callers. Fixes: 52a0a9854934 ("nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec") Reviewed-by: Hannes Reinecke Reviewed-by: Yunje Shin Reviewed-by: Chaitanya Kulkarni Signed-off-by: Maurizio Lombardi Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet: avoid recursive nvmet-wq flush in nvmet_ctrl_free

2026-05-17T15:13:41+00:00

commit aade8abd8b868b6ffa9697aadaea28ec7f65bee6 upstream. nvmet_tcp_release_queue_work() runs on nvmet-wq and can drop the final controller reference through nvmet_cq_put(). If that triggers nvmet_ctrl_free(), the teardown path flushes ctrl->async_event_work on the same nvmet-wq. Call chain: nvmet_tcp_schedule_release_queue() kref_put(&queue->kref, nvmet_tcp_release_queue) nvmet_tcp_release_queue() queue_work(nvmet_wq, &queue->release_work) <--- nvmet_wq process_one_work() nvmet_tcp_release_queue_work() nvmet_cq_put(&queue->nvme_cq) nvmet_cq_destroy() nvmet_ctrl_put(cq->ctrl) nvmet_ctrl_free() flush_work(&ctrl->async_event_work) <--- nvmet_wq Previously Scheduled by :- nvmet_add_async_event queue_work(nvmet_wq, &ctrl->async_event_work); This trips lockdep with a possible recursive locking warning. [ 5223.015876] run blktests nvme/003 at 2026-04-07 20:53:55 [ 5223.061801] loop0: detected capacity change from 0 to 2097152 [ 5223.072206] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 5223.088368] nvmet_tcp: enabling port 0 (127.0.0.1:4420) [ 5223.126086] nvmet: Created discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 5223.128453] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 [ 5233.199447] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery" [ 5233.227718] ============================================ [ 5233.231283] WARNING: possible recursive locking detected [ 5233.234696] 7.0.0-rc3nvme+ #20 Tainted: G O N [ 5233.238434] -------------------------------------------- [ 5233.241852] kworker/u192:6/2413 is trying to acquire lock: [ 5233.245429] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 [ 5233.251438] but task is already holding lock: [ 5233.255254] ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0 [ 5233.261125] other info that might help us debug this: [ 5233.265333] Possible unsafe locking scenario: [ 5233.269217] CPU0 [ 5233.270795] ---- [ 5233.272436] lock((wq_completion)nvmet-wq); [ 5233.275241] lock((wq_completion)nvmet-wq); [ 5233.278020] *** DEADLOCK *** [ 5233.281793] May be due to missing lock nesting notation [ 5233.286195] 3 locks held by kworker/u192:6/2413: [ 5233.289192] #0: ffff888111632548 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x5cc/0x6e0 [ 5233.294569] #1: ffffc9000e2a7e40 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x6e0 [ 5233.300128] #2: ffffffff82d7dc40 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530 [ 5233.304290] stack backtrace: [ 5233.306520] CPU: 4 UID: 0 PID: 2413 Comm: kworker/u192:6 Tainted: G O N 7.0.0-rc3nvme+ #20 PREEMPT(full) [ 5233.306524] Tainted: [O]=OOT_MODULE, [N]=TEST [ 5233.306525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 5233.306527] Workqueue: nvmet-wq nvmet_tcp_release_queue_work [nvmet_tcp] [ 5233.306532] Call Trace: [ 5233.306534] [ 5233.306536] dump_stack_lvl+0x73/0xb0 [ 5233.306552] print_deadlock_bug+0x225/0x2f0 [ 5233.306556] __lock_acquire+0x13f0/0x2290 [ 5233.306563] lock_acquire+0xd0/0x300 [ 5233.306565] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306571] ? __flush_work+0x20b/0x530 [ 5233.306573] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306577] touch_wq_lockdep_map+0x3b/0x90 [ 5233.306580] ? touch_wq_lockdep_map+0x26/0x90 [ 5233.306583] ? __flush_work+0x20b/0x530 [ 5233.306585] __flush_work+0x268/0x530 [ 5233.306588] ? __pfx_wq_barrier_func+0x10/0x10 [ 5233.306594] ? xen_error_entry+0x30/0x60 [ 5233.306600] nvmet_ctrl_free+0x140/0x310 [nvmet] [ 5233.306617] nvmet_cq_put+0x74/0x90 [nvmet] [ 5233.306629] nvmet_tcp_release_queue_work+0x19f/0x360 [nvmet_tcp] [ 5233.306634] process_one_work+0x206/0x6e0 [ 5233.306640] worker_thread+0x184/0x320 [ 5233.306643] ? __pfx_worker_thread+0x10/0x10 [ 5233.306646] kthread+0xf1/0x130 [ 5233.306648] ? __pfx_kthread+0x10/0x10 [ 5233.306651] ret_from_fork+0x355/0x450 [ 5233.306653] ? __pfx_kthread+0x10/0x10 [ 5233.306656] ret_from_fork_asm+0x1a/0x30 [ 5233.306664] There is also no need to flush async_event_work from controller teardown. The admin queue teardown already fails outstanding AER requests before the final controller put :- nvmet_sq_destroy(admin sq) nvmet_async_events_failall(ctrl) The controller has already been removed from the subsystem list before nvmet_ctrl_free() quiesces outstanding work. Replace flush_work() with cancel_work_sync() so a pending async_event_work item is canceled and a running instance is waited on without recursing into the same workqueue. Fixes: 06406d81a2d7 ("nvmet: cancel fatal error and flush async work before free controller") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Chaitanya Kulkarni Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

nvme-apple: drop invalid put of admin queue reference count

2026-05-17T15:13:41+00:00

commit ba9d308ccd6732dd97ed8080d834a4a89e758e14 upstream. Commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") moved the admin queue reference ->put call into nvme_free_ctrl() - a controller device release callback performed for every nvme driver doing nvme_init_ctrl(). nvme-apple sets refcount of the admin queue to 1 at allocation during the probe function and then puts it twice now: nvme_free_ctrl() blk_put_queue(ctrl->admin_q) // #1 ->free_ctrl() apple_nvme_free_ctrl() blk_put_queue(anv->ctrl.admin_q) // #2 Note that there is a commit 941f7298c70c ("nvme-apple: remove an extra queue reference") which intended to drop taking an extra admin queue reference. Looks like at that moment it accidentally fixed a refcount leak, which existed since the driver's introduction. There were two ->get calls at driver's probe function and a single ->put inside apple_nvme_free_ctrl(). However now after commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") the refcount is imbalanced again. Fix it by removing extra ->put call from apple_nvme_free_ctrl(). anv->dev and ctrl->dev point to the same device, so use ctrl->dev directly for simplification. Compile tested only. Found by Linux Verification Center (linuxtesting.org). Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig Signed-off-by: Fedor Pchelkin Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

nvme: respect NVME_QUIRK_DISABLE_WRITE_ZEROES when wzsl is set

2026-05-17T15:13:28+00:00

commit 40f0496b617b431f8d2dd94d7f785c1121f8a68a upstream. The NVM Command Set Identify Controller data may report a non-zero Write Zeroes Size Limit (wzsl). When present, nvme_init_non_mdts_limits() unconditionally overrides max_zeroes_sectors from wzsl, even if NVME_QUIRK_DISABLE_WRITE_ZEROES previously set it to zero. This effectively re-enables write zeroes for devices that need it disabled, defeating the quirk. Several Kingston OM* drives rely on this quirk to avoid firmware issues with write zeroes commands. Check for the quirk before applying the wzsl override. Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits") Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

nvme-pci: add NVME_QUIRK_DISABLE_WRITE_ZEROES for Kingston OM3SGP4

2026-05-17T15:13:28+00:00

commit a8eebf9699d69987cc49cec4e4fdb4111ab32423 upstream. The Kingston OM3SGP42048K2-A00 (PCI ID 2646:502f) firmware has a race condition when processing concurrent write zeroes and DSM (discard) commands, causing spurious "LBA Out of Range" errors and IOMMU page faults at address 0x0. The issue is reliably triggered by running two concurrent mkfs commands on different partitions of the same drive, which generates interleaved write zeroes and discard operations. Disable write zeroes for this device, matching the pattern used for other Kingston OM* drives that have similar firmware issues. Cc: stable@vger.kernel.org Signed-off-by: Robert Beckett Assisted-by: claude-opus-4-6-v1 Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

nvme: fix admin queue leak on controller reset

2026-04-02T11:07:30+00:00

[ Upstream commit b84bb7bd913d8ca2f976ee6faf4a174f91c02b8d ] When nvme_alloc_admin_tag_set() is called during a controller reset, a previous admin queue may still exist. Release it properly before allocating a new one to avoid orphaning the old queue. This fixes a regression introduced by commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime"). Cc: Keith Busch Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime"). Reported-and-tested-by: Yi Zhang Closes: https://lore.kernel.org/linux-block/CAHj4cs9wv3SdPo+N01Fw2SHBYDs9tj2M_e1-GdQOkRy=DsBB1w@mail.gmail.com/ Signed-off-by: Ming Lei Signed-off-by: Keith Busch Signed-off-by: Li hongliang <1468888505@139.com> Signed-off-by: Greg Kroah-Hartman

nvme-pci: ensure we're polling a polled queue

2026-04-02T11:07:14+00:00

[ Upstream commit 166e31d7dbf6aa44829b98aa446bda5c9580f12a ] A user can change the polled queue count at run time. There's a brief window during a reset where a hipri task may try to poll that queue before the block layer has updated the queue maps, which would race with the now interrupt driven queue and may cause double completions. Reviewed-by: Christoph Hellwig Reviewed-by: Kanchan Joshi Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme-fabrics: use kfree_sensitive() for DHCHAP secrets

2026-04-02T11:07:13+00:00

[ Upstream commit 0a1fc2f301529ac75aec0ce80d5ab9d9e4dc4b16 ] The DHCHAP secrets (dhchap_secret and dhchap_ctrl_secret) contain authentication key material for NVMe-oF. Use kfree_sensitive() instead of kfree() in nvmf_free_options() to ensure secrets are zeroed before the memory is freed, preventing recovery from freed pages. Reviewed-by: Christoph Hellwig Signed-off-by: Daniel Hodges Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme-pci: cap queue creation to used queues

2026-04-02T11:07:13+00:00

[ Upstream commit 4735b510a00fb2d4ac9e8d21a8c9552cb281f585 ] If the user reduces the special queue count at runtime and resets the controller, we need to reduce the number of queues and interrupts requested accordingly rather than start with the pre-allocated queue count. Tested-by: Kanchan Joshi Reviewed-by: Kanchan Joshi Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin