| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 166e31d7dbf6aa44829b98aa446bda5c9580f12a ]
A user can change the polled queue count at run time. There's a brief
window during a reset where a hipri task may try to poll that queue
before the block layer has updated the queue maps, which would race with
the now interrupt driven queue and may cause double completions.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0a1fc2f301529ac75aec0ce80d5ab9d9e4dc4b16 ]
The DHCHAP secrets (dhchap_secret and dhchap_ctrl_secret) contain
authentication key material for NVMe-oF. Use kfree_sensitive() instead
of kfree() in nvmf_free_options() to ensure secrets are zeroed before
the memory is freed, preventing recovery from freed pages.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4735b510a00fb2d4ac9e8d21a8c9552cb281f585 ]
If the user reduces the special queue count at runtime and resets the
controller, we need to reduce the number of queues and interrupts
requested accordingly rather than start with the pre-allocated queue
count.
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc71f409b22ca831a9f87a2712eaa09ef2bb4a5e ]
In the following scenario, pdev can be disabled between (1) and (3) by
(2). This sets pdev->msix_enabled = 0. Then, pci_irq_vector() will
return MSI-X IRQ(>15) for (1) whereas return INTx IRQ(<=15) for (2).
This causes IRQ warning because it tries to enable INTx IRQ that has
never been disabled before.
To fix this, save IRQ number into a local variable and ensure
disable_irq() and enable_irq() operate on the same IRQ number. Even if
pci_free_irq_vectors() frees the IRQ concurrently, disable_irq() and
enable_irq() on a stale IRQ number is still valid and safe, and the
depth accounting reamins balanced.
task 1:
nvme_poll_irqdisable()
disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(1)
enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(3)
task 2:
nvme_reset_work()
nvme_dev_disable()
pdev->msix_enable = 0; ...(2)
crash log:
------------[ cut here ]------------
Unbalanced enable for IRQ 10
WARNING: kernel/irq/manage.c:753 at __enable_irq+0x102/0x190 kernel/irq/manage.c:753, CPU#1: kworker/1:0H/26
Modules linked in:
CPU: 1 UID: 0 PID: 26 Comm: kworker/1:0H Not tainted 6.19.0-dirty #9 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:__enable_irq+0x107/0x190 kernel/irq/manage.c:753
Code: ff df 48 89 fa 48 c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 79 48 8d 3d 2e 7a 3f 05 41 8b 74 24 2c <67> 48 0f b9 3a e8 ef b9 21 00 5b 41 5c 5d e9 46 54 66 03 e8 e1 b9
RSP: 0018:ffffc900001bf550 EFLAGS: 00010046
RAX: 0000000000000007 RBX: 0000000000000000 RCX: ffffffffb20c0e90
RDX: 0000000000000000 RSI: 000000000000000a RDI: ffffffffb74b88f0
RBP: ffffc900001bf560 R08: ffff88800197cf00 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000003 R12: ffff8880012a6000
R13: 1ffff92000037eae R14: 000000000000000a R15: 0000000000000293
FS: 0000000000000000(0000) GS:ffff8880b49f7000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555da4a25fa8 CR3: 00000000208e8000 CR4: 00000000000006f0
Call Trace:
<TASK>
enable_irq+0x121/0x1e0 kernel/irq/manage.c:797
nvme_poll_irqdisable+0x162/0x1c0 drivers/nvme/host/pci.c:1494
nvme_timeout+0x965/0x14b0 drivers/nvme/host/pci.c:1744
blk_mq_rq_timed_out block/blk-mq.c:1653 [inline]
blk_mq_handle_expired+0x227/0x2d0 block/blk-mq.c:1721
bt_iter+0x2fc/0x3a0 block/blk-mq-tag.c:292
__sbitmap_for_each_set include/linux/sbitmap.h:269 [inline]
sbitmap_for_each_set include/linux/sbitmap.h:290 [inline]
bt_for_each block/blk-mq-tag.c:324 [inline]
blk_mq_queue_tag_busy_iter+0x969/0x1e80 block/blk-mq-tag.c:536
blk_mq_timeout_work+0x627/0x870 block/blk-mq.c:1763
process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x65c/0xe60 kernel/workqueue.c:3421
kthread+0x41a/0x930 kernel/kthread.c:463
ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
</TASK>
irq event stamp: 74478
hardirqs last enabled at (74477): [<ffffffffb5720a9c>] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:159 [inline]
hardirqs last enabled at (74477): [<ffffffffb5720a9c>] _raw_spin_unlock_irq+0x2c/0x60 kernel/locking/spinlock.c:202
hardirqs last disabled at (74478): [<ffffffffb57207b5>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
hardirqs last disabled at (74478): [<ffffffffb57207b5>] _raw_spin_lock_irqsave+0x85/0xa0 kernel/locking/spinlock.c:162
softirqs last enabled at (74304): [<ffffffffb1e9466c>] __do_softirq kernel/softirq.c:656 [inline]
softirqs last enabled at (74304): [<ffffffffb1e9466c>] invoke_softirq kernel/softirq.c:496 [inline]
softirqs last enabled at (74304): [<ffffffffb1e9466c>] __irq_exit_rcu+0xdc/0x120 kernel/softirq.c:723
softirqs last disabled at (74287): [<ffffffffb1e9466c>] __do_softirq kernel/softirq.c:656 [inline]
softirqs last disabled at (74287): [<ffffffffb1e9466c>] invoke_softirq kernel/softirq.c:496 [inline]
softirqs last disabled at (74287): [<ffffffffb1e9466c>] __irq_exit_rcu+0xdc/0x120 kernel/softirq.c:723
---[ end trace 0000000000000000 ]---
Fixes: fa059b856a59 (nvme-pci: Simplify nvme_poll_irqdisable)
Acked-by: Chao Shi <cshi008@fiu.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Acked-by: Dave Tian <daveti@purdue.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b4e78f1427c7d6859229ae9616df54e1fc05a516 ]
dev->online_queues is a count incremented in nvme_init_queue. Thus,
valid indices are 0 through dev->online_queues − 1.
This patch fixes the loop condition to ensure the index stays within the
valid range. Index 0 is excluded because it is the admin queue.
KASAN splat:
==================================================================
BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline]
BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404
Read of size 2 at addr ffff88800592a574 by task kworker/u8:5/74
CPU: 0 UID: 0 PID: 74 Comm: kworker/u8:5 Not tainted 6.19.0-dirty #10 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: nvme-reset-wq nvme_reset_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0xea/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xce/0x5d0 mm/kasan/report.c:482
kasan_report+0xdc/0x110 mm/kasan/report.c:595
__asan_report_load2_noabort+0x18/0x20 mm/kasan/report_generic.c:379
nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline]
nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404
nvme_reset_work+0x36b/0x8c0 drivers/nvme/host/pci.c:3252
process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x65c/0xe60 kernel/workqueue.c:3421
kthread+0x41a/0x930 kernel/kthread.c:463
ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
</TASK>
Allocated by task 34 on cpu 1 at 4.241550s:
kasan_save_stack+0x2c/0x60 mm/kasan/common.c:57
kasan_save_track+0x1c/0x70 mm/kasan/common.c:78
kasan_save_alloc_info+0x3c/0x50 mm/kasan/generic.c:570
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0xb5/0xc0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__do_kmalloc_node mm/slub.c:5657 [inline]
__kmalloc_node_noprof+0x2bf/0x8d0 mm/slub.c:5663
kmalloc_array_node_noprof include/linux/slab.h:1075 [inline]
nvme_pci_alloc_dev drivers/nvme/host/pci.c:3479 [inline]
nvme_probe+0x2f1/0x1820 drivers/nvme/host/pci.c:3534
local_pci_probe+0xef/0x1c0 drivers/pci/pci-driver.c:324
pci_call_probe drivers/pci/pci-driver.c:392 [inline]
__pci_device_probe drivers/pci/pci-driver.c:417 [inline]
pci_device_probe+0x743/0x920 drivers/pci/pci-driver.c:451
call_driver_probe drivers/base/dd.c:583 [inline]
really_probe+0x29b/0xb70 drivers/base/dd.c:661
__driver_probe_device+0x3b0/0x4a0 drivers/base/dd.c:803
driver_probe_device+0x56/0x1f0 drivers/base/dd.c:833
__driver_attach_async_helper+0x155/0x340 drivers/base/dd.c:1159
async_run_entry_fn+0xa6/0x4b0 kernel/async.c:129
process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x65c/0xe60 kernel/workqueue.c:3421
kthread+0x41a/0x930 kernel/kthread.c:463
ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
The buggy address belongs to the object at ffff88800592a000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 244 bytes to the right of
allocated 1152-byte region [ffff88800592a000, ffff88800592a480)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5928
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0xfffffc0000040(head|node=0|zone=1|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001
head: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 000fffffc0000003 ffffea0000164a01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88800592a400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88800592a480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88800592a500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88800592a580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88800592a600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fixes: 0f0d2c876c96 (nvme: free sq/cq dbbuf pointers when dbbuf set fails)
Acked-by: Chao Shi <cshi008@fiu.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Acked-by: Dave Tian <daveti@purdue.edu>
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c3320153769f05fd7fe9d840cb555dd3080ae424 ]
nvme_pr_read_keys() takes num_keys from userspace and uses it to
calculate the allocation size for rse via struct_size(). The upper
limit is PR_KEYS_MAX (64K).
A malicious or buggy userspace can pass a large num_keys value that
results in a 4MB allocation attempt at most, causing a warning in
the page allocator when the order exceeds MAX_PAGE_ORDER.
To fix this, use kvzalloc() instead of kzalloc().
This bug has the same reasoning and fix with the patch below:
https://lore.kernel.org/linux-block/20251212013510.3576091-1-kartikey406@gmail.com/
Warning log:
WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216, CPU#1: syz-executor117/272
Modules linked in:
CPU: 1 UID: 0 PID: 272 Comm: syz-executor117 Not tainted 6.19.0 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:__alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216
Code: ff 83 bd a8 fe ff ff 0a 0f 86 69 fb ff ff 0f b6 1d f9 f9 c4 04 80 fb 01 0f 87 3b 76 30 ff 83 e3 01 75 09 c6 05 e4 f9 c4 04 01 <0f> 0b 48 c7 85 70 fe ff ff 00 00 00 00 e9 8f fd ff ff 31 c0 e9 0d
RSP: 0018:ffffc90000fcf450 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1ffff920001f9ea0
RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0
RBP: ffffc90000fcf648 R08: ffff88800b6c3380 R09: 0000000000000001
R10: ffffc90000fcf840 R11: ffff88807ffad280 R12: 0000000000000000
R13: 0000000000040dc0 R14: 0000000000000001 R15: ffffc90000fcf620
FS: 0000555565db33c0(0000) GS:ffff8880be26c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000000c CR3: 0000000003b72000 CR4: 00000000000006f0
Call Trace:
<TASK>
alloc_pages_mpol+0x236/0x4d0 mm/mempolicy.c:2486
alloc_frozen_pages_noprof+0x149/0x180 mm/mempolicy.c:2557
___kmalloc_large_node+0x10c/0x140 mm/slub.c:5598
__kmalloc_large_node_noprof+0x25/0xc0 mm/slub.c:5629
__do_kmalloc_node mm/slub.c:5645 [inline]
__kmalloc_noprof+0x483/0x6f0 mm/slub.c:5669
kmalloc_noprof include/linux/slab.h:961 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
nvme_pr_read_keys+0x8f/0x4c0 drivers/nvme/host/pr.c:245
blkdev_pr_read_keys block/ioctl.c:456 [inline]
blkdev_common_ioctl+0x1b71/0x29b0 block/ioctl.c:730
blkdev_ioctl+0x299/0x700 block/ioctl.c:786
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl fs/ioctl.c:583 [inline]
__x64_sys_ioctl+0x1bf/0x220 fs/ioctl.c:583
x64_sys_call+0x1280/0x21b0 mnt/fuzznvme_1/fuzznvme/linux-build/v6.19/./arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x71/0x330 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fb893d3108d
Code: 28 c3 e8 46 1e 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffff61f2f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffff61f3138 RCX: 00007fb893d3108d
RDX: 0000000020000040 RSI: 00000000c01070ce RDI: 0000000000000003
RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffff61f3138
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffff61f3128 R14: 00007fb893dae530 R15: 0000000000000001
</TASK>
Fixes: 5fd96a4e15de (nvme: Add pr_ops read_keys support)
Acked-by: Chao Shi <cshi008@fiu.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Acked-by: Dave Tian <daveti@purdue.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0f5197ea9a73a4c406c75e6d8af3a13f7f96ae89 ]
We need to fall back to the synchronous removal if we can't get a
reference on the module needed for the deferred removal.
Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node")
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b84bb7bd913d8ca2f976ee6faf4a174f91c02b8d ]
When nvme_alloc_admin_tag_set() is called during a controller reset,
a previous admin queue may still exist. Release it properly before
allocating a new one to avoid orphaning the old queue.
This fixes a regression introduced by commit 03b3bcd319b3 ("nvme: fix
admin request_queue lifetime").
Cc: Keith Busch <kbusch@kernel.org>
Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime").
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAHj4cs9wv3SdPo+N01Fw2SHBYDs9tj2M_e1-GdQOkRy=DsBB1w@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
The initial state of dma_needs_unmap may be false, but change to true
while mapping the data iterator. Enabling swiotlb is one such case that
can change the result. The nvme driver needs to save the mapped dma
vectors to be unmapped later, so allocate as needed during iteration
rather than assume it was always allocated at the beginning. This fixes
a NULL dereference from accessing an uninitialized dma_vecs when the
device dma unmapping requirements change mid-iteration.
Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Link: https://lore.kernel.org/linux-nvme/20260202125738.1194899-1-pradeep.pragallapati@oss.qualcomm.com/
Reported-by: Pradeep P V K <pradeep.pragallapati@oss.qualcomm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The call to nvme_free_sgls() in nvme_unmap_data() has the sg_list and sge
parameters swapped. This wasn't noticed by the compiler because both share
the same type. On a Xen PV hardware domain, and possibly any other
architectures that takes that path, this leads to corruption of the NVMe
contents.
Fixes: f0887e2a52d4 ("nvme-pci: create common sgl unmapping helper")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING
state”) disallows controller state transitions directly from RESETTING
to LIVE. However, the NVMe PCIe subsystem reset path relies on this
transition to recover the controller on PowerPC (PPC) systems.
On PPC systems, issuing a subsystem reset causes a temporary loss of
communication with the NVMe adapter. A subsequent PCIe MMIO read then
triggers EEH recovery, which restores the PCIe link and brings the
controller back online. For EEH recovery to proceed correctly, the
controller must transition back to the LIVE state.
Due to the changes introduced by commit d2fe192348f9 (“nvme: only allow
entering LIVE from CONNECTING state”), the controller can no longer
transition directly from RESETTING to LIVE. As a result, EEH recovery
exits prematurely, leaving the controller stuck in the RESETTING state.
Fix this by explicitly transitioning the controller state from RESETTING
to CONNECTING and then to LIVE. This satisfies the updated state
transition rules and allows the controller to be successfully recovered
on PPC systems following a PCIe subsystem reset.
Cc: stable@vger.kernel.org
Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_fabrics creates an NVMe/FC controller in following path:
nvmf_dev_write()
-> nvmf_create_ctrl()
-> nvme_fc_create_ctrl()
-> nvme_fc_init_ctrl()
nvme_fc_init_ctrl() allocates the admin blk-mq resources right after
nvme_add_ctrl() succeeds. If any of the subsequent steps fail (changing
the controller state, scheduling connect work, etc.), we jump to the
fail_ctrl path, which tears down the controller references but never
frees the admin queue/tag set. The leaked blk-mq allocations match the
kmemleak report seen during blktests nvme/fc.
Check ctrl->ctrl.admin_tagset in the fail_ctrl path and call
nvme_remove_admin_tag_set() when it is set so that all admin queue
allocations are reclaimed whenever controller setup aborts.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
After discussion with the devicetree maintainers we agreed to not extend
lists with the generic compatible "apple,nvme-ans2" anymore [1]. Add
"apple,t8103-nvme-ans2" as fallback compatible as it is the SoC the
driver and bindings were written for.
[1]: https://lore.kernel.org/asahi/12ab93b7-1fc2-4ce0-926e-c8141cfe81bf@kernel.org/
Cc: stable@vger.kernel.org # v6.18+
Fixes: 5bd2927aceba ("nvme-apple: Add initial Apple SoC NVMe driver")
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Janne Grunau <j@jannau.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Secondary temperature thresholds (temp2_{min,max}) were not reported
properly on this NVMe SSD. This resulted in an error while attempting to
read these values with sensors(1):
ERROR: Can't get value of subfeature temp2_min: I/O error
ERROR: Can't get value of subfeature temp2_max: I/O error
Add the device to the nvme_id_table with the
NVME_QUIRK_NO_SECONDARY_TEMP_THRESH flag to suppress access to all non-
composite temperature thresholds.
Cc: stable@vger.kernel.org
Tested-by: Wu Haotian <rigoligo03@gmail.com>
Signed-off-by: Ilikara Zheng <ilikara@aosc.io>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull NVMe updates from Keith:
"- Subsystem usage cleanups (Max)
- Endpoint device fixes (Shin'ichiro)
- Debug statements (Gerd)
- FC fabrics cleanups and fixes (Daniel)
- Consistent alloc API usages (Israel)
- Code comment updates (Chu)
- Authentication retry fix (Justin)"
* tag 'nvme-6.19-2025-12-04' of git://git.infradead.org/nvme:
nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
nvme-auth: use kvfree() for memory allocated with kvcalloc()
nvmet-tcp: use kvcalloc for commands array
nvmet-rdma: use kvcalloc for commands and responses arrays
nvme: fix typo error in nvme target
nvmet-fc: use pr_* print macros instead of dev_*
nvmet-fcloop: remove unused lsdir member.
nvmet-fcloop: check all request and response have been processed
nvme-fc: check all request and response have been processed
nvme-fc: don't hold rport lock when putting ctrl
nvme-pci: add debug message on fail to read CSTS
nvme-pci: print error message on failure in nvme_probe
nvmet: pci-epf: fix DMA channel debug print
nvmet: pci-epf: move DMA initialization to EPC init callback
nvmet: remove redundant subsysnqn field from ctrl
nvmet: add sanity checks when freeing subsystem
|
|
With authentication, in addition to EKEYREJECTED there is also no point in
retrying reconnects when status is ENOKEY. Thus, add -ENOKEY as another
criteria to determine when to stop retries.
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Closes: https://lore.kernel.org/linux-nvme/20250829-nvme-fc-sync-v3-0-d69c87e63aee@kernel.org/
Signed-off-by: Justin Tee <justintee8345@gmail.com>
Tested-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Memory allocated by kvcalloc() may come from vmalloc or kmalloc,
so use kvfree() instead of kfree() for proper deallocation.
Fixes: aa36d711e945 ("nvme-auth: convert dhchap_auth_list to an array")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When the rport is removed there shouldn't be any in flight request or
responses.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The pr_read_keys() interface has a u32 num_keys parameter. The NVMe
Reservation Report command has a u32 maximum length. Reject num_keys
values that are too large to fit.
This will become important when pr_read_keys() is exposed to untrusted
userspace via an <linux/pr.h> ioctl.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use bio_alloc_bioset for passthru IO by default, so that we can enable
bio cache for irq and polled passthru IO in later.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Fix head insertion for mq-deadline, a regression from when priority
support was added
- Series simplifying and improving the ublk user copy code
- Various ublk related cleanups
- Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
request is punted to a thread for handling
- Merge and then later revert loop dio nowait support, as it ended up
causing excessive stack usage for when the inline issue code needs to
dip back into the full file system code
- Improve auto integrity code, making it less deadlock prone
- Speedup polled IO handling, but manually managing the hctx lookups
- Fixes for blk-throttle for SSD devices
- Small series with fixes for the S390 dasd driver
- Add support for caching zones, avoiding unnecessary report zone
queries
- MD pull requests via Yu:
- fix null-ptr-dereference regression for dm-raid0
- fix IO hang for raid5 when array is broken with IO inflight
- remove legacy 1s delay to speed up system shutdown
- change maintainer's email address
- data can be lost if array is created with different lbs devices,
fix this problem and record lbs of the array in metadata
- fix rcu protection for md_thread
- fix mddev kobject lifetime regression
- enable atomic writes for md-linear
- some cleanups
- bcache updates via Coly
- remove useless discard and cache device code
- improve usage of per-cpu workqueues
- Reorganize the IO scheduler switching code, fixing some lockdep
reports as well
- Improve the block layer P2P DMA support
- Add support to the block tracing code for zoned devices
- Segment calculation improves, and memory alignment flexibility
improvements
- Set of prep and cleanups patches for ublk batching support. The
actual batching hasn't been added yet, but helps shrink down the
workload of getting that patchset ready for 6.20
- Fix for how the ps3 block driver handles segments offsets
- Improve how block plugging handles batch tag allocations
- nbd fixes for use-after-free of the configuration on device clear/put
- Set of improvements and fixes for zloop
- Add Damien as maintainer of the block zoned device code handling
- Various other fixes and cleanups
* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
block/rnbd: correct all kernel-doc complaints
blk-mq: use queue_hctx in blk_mq_map_queue_type
md: remove legacy 1s delay in md_notify_reboot
md/raid5: fix IO hang when array is broken with IO inflight
md: warn about updating super block failure
md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
sbitmap: fix all kernel-doc warnings
ublk: add helper of __ublk_fetch()
ublk: pass const pointer to ublk_queue_is_zoned()
ublk: refactor auto buffer register in ublk_dispatch_req()
ublk: add `union ublk_io_buf` with improved naming
ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
kfifo: add kfifo_alloc_node() helper for NUMA awareness
blk-mq: fix potential uaf for 'queue_hw_ctx'
blk-mq: use array manage hctx map instead of xarray
ublk: prevent invalid access with DEBUG
s390/dasd: Use scnprintf() instead of sprintf()
s390/dasd: Move device name formatting into separate function
s390/dasd: Remove unnecessary debugfs_create() return checks
s390/dasd: Fix gendisk parent after copy pair swap
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available. And both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19. It includes a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
zcrx b/w multiple io_uring instances
- Series cleaning up ring initializations, notable deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
|
|
nvme_fc_ctrl_put can acquire the rport lock when freeing the
ctrl object:
nvme_fc_ctrl_put
nvme_fc_ctrl_free
spin_lock_irqsave(rport->lock)
Thus we can't hold the rport lock when calling nvme_fc_ctrl_put.
Justin suggested use the safe list iterator variant because
nvme_fc_ctrl_put will also modify the rport->list.
Cc: Justin Tee <justin.tee@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a debug log spelling out that reading the CSTS register failed - to
distinguish this from other reasons for ENODEV.
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a new error message that makes failures to probe visible in the
kernel log, like:
nvme 0008:00:00.0: error -ENODEV: probe failed
This highlights issues with a particular device right away instead of
leaving users to search for missing drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Conflicts:
net/xdp/xsk.c
0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
30ed05adca4a ("xsk: use a smaller new lock for shared pool case")
https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au
https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nvme_fc_delete_assocation() waits for pending I/O to complete before
returning, and an error can cause ->ioerr_work to be queued after
cancel_work_sync() had been called. Move the call to cancel_work_sync() to
be after nvme_fc_delete_association() to ensure ->ioerr_work is not running
when the nvme_fc_ctrl object is freed. Otherwise the following can occur:
[ 1135.911754] list_del corruption, ff2d24c8093f31f8->next is NULL
[ 1135.917705] ------------[ cut here ]------------
[ 1135.922336] kernel BUG at lib/list_debug.c:52!
[ 1135.926784] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 1135.931851] CPU: 48 UID: 0 PID: 726 Comm: kworker/u449:23 Kdump: loaded Not tainted 6.12.0 #1 PREEMPT(voluntary)
[ 1135.943490] Hardware name: Dell Inc. PowerEdge R660/0HGTK9, BIOS 2.5.4 01/16/2025
[ 1135.950969] Workqueue: 0x0 (nvme-wq)
[ 1135.954673] RIP: 0010:__list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1135.961041] Code: c7 c7 98 68 72 94 e8 26 45 fe ff 0f 0b 48 c7 c7 70 68 72 94 e8 18 45 fe ff 0f 0b 48 89 fe 48 c7 c7 80 69 72 94 e8 07 45 fe ff <0f> 0b 48 89 d1 48 c7 c7 a0 6a 72 94 48 89 c2 e8 f3 44 fe ff 0f 0b
[ 1135.979788] RSP: 0018:ff579b19482d3e50 EFLAGS: 00010046
[ 1135.985015] RAX: 0000000000000033 RBX: ff2d24c8093f31f0 RCX: 0000000000000000
[ 1135.992148] RDX: 0000000000000000 RSI: ff2d24d6bfa1d0c0 RDI: ff2d24d6bfa1d0c0
[ 1135.999278] RBP: ff2d24c8093f31f8 R08: 0000000000000000 R09: ffffffff951e2b08
[ 1136.006413] R10: ffffffff95122ac8 R11: 0000000000000003 R12: ff2d24c78697c100
[ 1136.013546] R13: fffffffffffffff8 R14: 0000000000000000 R15: ff2d24c78697c0c0
[ 1136.020677] FS: 0000000000000000(0000) GS:ff2d24d6bfa00000(0000) knlGS:0000000000000000
[ 1136.028765] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1136.034510] CR2: 00007fd207f90b80 CR3: 000000163ea22003 CR4: 0000000000f73ef0
[ 1136.041641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1136.048776] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1136.055910] PKRU: 55555554
[ 1136.058623] Call Trace:
[ 1136.061074] <TASK>
[ 1136.063179] ? show_trace_log_lvl+0x1b0/0x2f0
[ 1136.067540] ? show_trace_log_lvl+0x1b0/0x2f0
[ 1136.071898] ? move_linked_works+0x4a/0xa0
[ 1136.075998] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.081744] ? __die_body.cold+0x8/0x12
[ 1136.085584] ? die+0x2e/0x50
[ 1136.088469] ? do_trap+0xca/0x110
[ 1136.091789] ? do_error_trap+0x65/0x80
[ 1136.095543] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.101289] ? exc_invalid_op+0x50/0x70
[ 1136.105127] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.110874] ? asm_exc_invalid_op+0x1a/0x20
[ 1136.115059] ? __list_del_entry_valid_or_report.cold+0xf/0x6f
[ 1136.120806] move_linked_works+0x4a/0xa0
[ 1136.124733] worker_thread+0x216/0x3a0
[ 1136.128485] ? __pfx_worker_thread+0x10/0x10
[ 1136.132758] kthread+0xfa/0x240
[ 1136.135904] ? __pfx_kthread+0x10/0x10
[ 1136.139657] ret_from_fork+0x31/0x50
[ 1136.143236] ? __pfx_kthread+0x10/0x10
[ 1136.146988] ret_from_fork_asm+0x1a/0x30
[ 1136.150915] </TASK>
Fixes: 19fce0470f05 ("nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context")
Cc: stable@vger.kernel.org
Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Now target is removed from nvme_fc_ctrl_free() which is the ctrl->ref
release handler. And even admin queue is unquiesced there, this way
is definitely wrong because the ctr->ref is grabbed when submitting
command.
And Marco observed that nvme_fc_ctrl_free() can be called from request
completion code path, and trigger kernel warning since request completes
from softirq context.
Fix the issue by moveing target removal into nvme_fc_delete_ctrl(),
which is also aligned with nvme-tcp and nvme-rdma.
Patch originally proposed by Ming Lei, then modified to move the tagset
removal down to after nvme_fc_delete_association() after further testing.
Cc: Marco Patalano <mpatalan@redhat.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Blktests test cases nvme/014, 057 and 058 fail occasionally due to a
lockdep WARN. As reported in the Closes tag URL, the WARN indicates that
a deadlock can happen due to the dependency among disk->open_mutex,
kblockd workqueue completion and partition_scan_work completion.
To avoid the lockdep WARN and the potential deadlock, cut the dependency
by running the partition_scan_work not by kblockd workqueue but by
nvme_wq.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAHj4cs8mJ+R_GmQm9R8ebResKAWUE8kF5+_WVg0v8zndmqd6BQ@mail.gmail.com/
Link: https://lore.kernel.org/linux-block/oeyzci6ffshpukpfqgztsdeke5ost5hzsuz4rrsjfmvpqcevax@5nhnwbkzbrpa/
Fixes: 1f021341eef4 ("nvme-multipath: defer partition scanning")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In commit eadaa8b255f3 ("dma-mapping: introduce new DMA attribute to
indicate MMIO memory"), DMA_ATTR_MMIO attribute was added to describe
MMIO addresses, which require to avoid any memory cache flushing, as
an outcome of the discussion pointed in Link tag below.
In case of PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, blk-mq-dm logic
treated this as regular page and relied on "struct page" DMA flow.
That flow performs CPU cache flushing, which shouldn't be done here,
and doesn't set IOMMU_MMIO flag in DMA-IOMMU case.
As a solution, let's encode peer-to-peer transaction type in NVMe IOD
flags variable and provide it to blk-mq-dma API.
Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After introduction of dma_map_phys(), there is no need to convert
from physical address to struct page in order to map page. So let's
use it directly.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; rdma fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but tcp
and fc never needed it in the first place so they get to report no
virtual boundary.
Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide whether to use the PRP format the same as before if the IO allows
it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Cross-merge networking fixes after downstream PR (net-6.18-rc5).
Conflicts:
drivers/net/wireless/ath/ath12k/mac.c
9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"")
6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon")
https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net
Adjacent changes:
drivers/net/ethernet/intel/Kconfig
b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG")
93f53db9f9dc ("ice: switch to Page Pool")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The namespaces can access the controller's admin request_queue, and
stale references on the namespaces may exist after tearing down the
controller. Ensure the admin request_queue is active by moving the
controller's 'put' to after all controller references have been released
to ensure no one is can access the request_queue. This fixes a reported
use-after-free bug:
BUG: KASAN: slab-use-after-free in blk_queue_enter+0x41c/0x4a0
Read of size 8 at addr ffff88c0a53819f8 by task nvme/3287
CPU: 67 UID: 0 PID: 3287 Comm: nvme Tainted: G E 6.13.2-ga1582f1a031e #15
Tainted: [E]=UNSIGNED_MODULE
Hardware name: Jabil /EGS 2S MB1, BIOS 1.00 06/18/2025
Call Trace:
<TASK>
dump_stack_lvl+0x4f/0x60
print_report+0xc4/0x620
? _raw_spin_lock_irqsave+0x70/0xb0
? _raw_read_unlock_irqrestore+0x30/0x30
? blk_queue_enter+0x41c/0x4a0
kasan_report+0xab/0xe0
? blk_queue_enter+0x41c/0x4a0
blk_queue_enter+0x41c/0x4a0
? __irq_work_queue_local+0x75/0x1d0
? blk_queue_start_drain+0x70/0x70
? irq_work_queue+0x18/0x20
? vprintk_emit.part.0+0x1cc/0x350
? wake_up_klogd_work_func+0x60/0x60
blk_mq_alloc_request+0x2b7/0x6b0
? __blk_mq_alloc_requests+0x1060/0x1060
? __switch_to+0x5b7/0x1060
nvme_submit_user_cmd+0xa9/0x330
nvme_user_cmd.isra.0+0x240/0x3f0
? force_sigsegv+0xe0/0xe0
? nvme_user_cmd64+0x400/0x400
? vfs_fileattr_set+0x9b0/0x9b0
? cgroup_update_frozen_flag+0x24/0x1c0
? cgroup_leave_frozen+0x204/0x330
? nvme_ioctl+0x7c/0x2c0
blkdev_ioctl+0x1a8/0x4d0
? blkdev_common_ioctl+0x1930/0x1930
? fdget+0x54/0x380
__x64_sys_ioctl+0x129/0x190
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f765f703b0b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 52 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2cefe808 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe2cefe860 RCX: 00007f765f703b0b
RDX: 00007ffe2cefe860 RSI: 00000000c0484e41 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
R10: 00007f765f611d50 R11: 0000000000000202 R12: 0000000000000003
R13: 00000000c0484e41 R14: 0000000000000001 R15: 00007ffe2cefea60
</TASK>
Reported-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes following the path of a zone information
harder to understand.
Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report a zone information.
All block device drivers supporting the report_zones block operation are
updated to use this new scheme.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Update all struct proto_ops connect() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update all struct proto_ops bind() callback function prototypes from
"struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the
compiler about object sizes. Calls into struct proto handlers gain casts
that will be removed in the struct proto conversion patch.
No binary changes expected.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The dma_map_bvec helper doesn't work for p2p data, so use the same
blk_map_iter method that sgl uses for this memory type.
Reported-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
With TLS enabled, records that are encrypted and appended to TLS TX
list can fail to see a retry if the underlying TCP socket is busy, for
example, hitting an EAGAIN from tcp_sendmsg_locked(). This is not known
to the NVMe TCP driver, as the TLS layer successfully generated a record.
Typically, the TLS write_space() callback would ensure such records are
retried, but in the NVMe TCP Host driver, write_space() invokes
nvme_tcp_write_space(). This causes a partially sent record in the TLS TX
list to timeout after not being retried.
This patch fixes the above by calling queue->write_space(), which calls
into the TLS layer to retry any pending records.
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The sc_c field is currently not updated in the host response to the
controller challenge leading to failures while attempting secure
channel concatenation. Fix this by adding a new sc_c variable to the
dhchap queue context structure which is appropriately set during
negotiate and then used in the host response.
Fixes: e88a7595b57f ("nvme-tcp: request secure channel concatenation")
Signed-off-by: Martin George <marting@netapp.com>
Signed-off-by: Prashanth Adurthi <prashana@netapp.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
For queue-depth I/O policy, this patch fixes unbalanced I/Os across
nvme multipaths.
Issue Description:
The RETRY disposition incorrectly increments ns->ctrl->nr_active
counter and reinitializes iostat start-time. In such cases nr_active
counter never goes back to zero until that path disconnects and
reconnects.
Such a path is not chosen for new I/Os if multiple RETRY cases on a given
a path cause its queue-depth counter to be artificially higher compared
to other paths. This leads to unbalanced I/Os across paths.
The patch skips incrementing nr_active if NVME_MPATH_CNT_ACTIVE is already
set. And it skips restarting io stats if NVME_MPATH_IO_STATS is already set.
base-commit: e989a3da2d371a4b6597ee8dee5c72e407b4db7a
Fixes: d4d957b53d91eeb ("nvme-multipath: support io stats on the mpath device")
Signed-off-by: Amit Chaudhary <achaudhary@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- FC target fixes (Daniel)
- Authentication fixes and updates (Martin, Chris)
- Admin controller handling (Kamaljit)
- Target lockdep assertions (Max)
- Keep-alive updates for discovery (Alastair)
- Suspend quirk (Georg)
- MD pull request via Yu:
- Add support for a lockless bitmap.
A key feature for the new bitmap are that the IO fastpath is
lockless. If a user issues lots of write IO to the same bitmap
bit in a short time, only the first write has additional overhead
to update bitmap bit, no additional overhead for the following
writes.
By supporting only resync or recover written data, means in the
case creating new array or replacing with a new disk, there is no
need to do a full disk resync/recovery.
- Switch ->getgeo() and ->bios_param() to using struct gendisk rather
than struct block_device.
- Rust block changes via Andreas. This series adds configuration via
configfs and remote completion to the rnull driver. The series also
includes a set of changes to the rust block device driver API: a few
cleanup patches, and a few features supporting the rnull changes.
The series removes the raw buffer formatting logic from
`kernel::block` and improves the logic available in `kernel::string`
to support the same use as the removed logic.
- floppy arch cleanups
- Reduce the number of dereferencing needed for ublk commands
- Restrict supported sockets for nbd. Mostly done to eliminate a class
of issues perpetually reported by syzbot, by using nonsensical socket
setups.
- A few s390 dasd block fixes
- Fix a few issues around atomic writes
- Improve DMA interation for integrity requests
- Improve how iovecs are treated with regards to O_DIRECT aligment
constraints.
We used to require each segment to adhere to the constraints, now
only the request as a whole needs to.
- Clean up and improve p2p support, enabling use of p2p for metadata
payloads
- Improve locking of request lookup, using SRCU where appropriate
- Use page references properly for brd, avoiding very long RCU sections
- Fix ordering of recursively submitted IOs
- Clean up and improve updating nr_requests for a live device
- Various fixes and cleanups
* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
s390/dasd: enforce dma_alignment to ensure proper buffer validation
s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
ublk: remove redundant zone op check in ublk_setup_iod()
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
blk-cgroup: fix possible deadlock while configuring policy
blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
blk-mq: Fix more tag iteration function documentation
selftests: ublk: fix behavior when fio is not installed
ublk: don't access ublk_queue in ublk_unmap_io()
ublk: pass ublk_io to __ublk_complete_rq()
ublk: don't access ublk_queue in ublk_need_complete_req()
ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
ublk: don't pass ublk_queue to ublk_fetch()
ublk: don't access ublk_queue in ublk_config_io_buf()
ublk: don't access ublk_queue in ublk_check_fetch_buf()
ublk: pass q_id and tag to __ublk_check_and_get_req()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good ideas as that struct has a vastly different life time.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage, but also memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user to specify
the rx buffer length on setup.
- Syscall / synchronous bufer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
mutli-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"Lots of platform specific updates for Qualcomm SoCs, including a new
TEE subsystem driver for the Qualcomm QTEE firmware interface.
Added support for the Apple A11 SoC in drivers that are shared with
the M1/M2 series, among more updates for those.
Smaller platform specific driver updates for Renesas, ASpeed,
Broadcom, Nvidia, Mediatek, Amlogic, TI, Allwinner, and Freescale
SoCs.
Driver updates in the cache controller, memory controller and reset
controller subsystems.
SCMI firmware updates to add more features and improve robustness.
This includes support for having multiple SCMI providers in a single
system.
TEE subsystem support for protected DMA-bufs, allowing hardware to
access memory areas that managed by the kernel but remain inaccessible
from the CPU in EL1/EL0"
* tag 'soc-drivers-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (139 commits)
soc/fsl/qbman: Use for_each_online_cpu() instead of for_each_cpu()
soc: fsl: qe: Drop legacy-of-mm-gpiochip.h header from GPIO driver
soc: fsl: qe: Change GPIO driver to a proper platform driver
tee: fix register_shm_helper()
pmdomain: apple: Add "apple,t8103-pmgr-pwrstate"
dt-bindings: spmi: Add Apple A11 and T2 compatible
serial: qcom-geni: Load UART qup Firmware from linux side
spi: geni-qcom: Load spi qup Firmware from linux side
i2c: qcom-geni: Load i2c qup Firmware from linux side
soc: qcom: geni-se: Add support to load QUP SE Firmware via Linux subsystem
soc: qcom: geni-se: Cleanup register defines and update copyright
dt-bindings: qcom: se-common: Add QUP Peripheral-specific properties for I2C, SPI, and SERIAL bus
Documentation: tee: Add Qualcomm TEE driver
tee: qcom: enable TEE_IOC_SHM_ALLOC ioctl
tee: qcom: add primordial object
tee: add Qualcomm TEE driver
tee: increase TEE_MAX_ARG_SIZE to 4096
tee: add TEE_IOCTL_PARAM_ATTR_TYPE_OBJREF
tee: add TEE_IOCTL_PARAM_ATTR_TYPE_UBUF
tee: add close_context to TEE driver operation
...
|
|
for-6.18/block
Pull NVMe updates from Keith:
" - FC target fixes (Daniel)
- Authentication fixes and updates (Martin, Chris)
- Admin controller handling (Kamaljit)
- Target lockdep assertions (Max)
- Keep-alive updates for discovery (Alastair)
- Suspend quirk (Georg)"
* tag 'nvme-6.18-2025-09-23' of git://git.infradead.org/nvme:
nvme: Use non zero KATO for persistent discovery connections
nvmet: add safety check for subsys lock
nvme-core: use nvme_is_io_ctrl() for I/O controller check
nvme-core: do ioccsz/iorcsz validation only for I/O controllers
nvme-core: add method to check for an I/O controller
nvme-pci: Add TUXEDO IBS Gen8 to Samsung sleep quirk
nvme-auth: use hkdf_expand_label()
nvme-auth: add hkdf_expand_label()
nvme-tcp: send only permitted commands for secure concat
nvme-fc: use lock accessing port_state and rport state
nvmet-fcloop: call done callback even when remote port is gone
nvmet-fc: avoid scheduling association deletion twice
nvmet-fc: move lsop put work to nvmet_fc_ls_req_op
nvme-auth: update bi_directional flag
|
|
The NVMe Base Specification 2.1 states that:
"""
A host requests an explicit persistent connection ... by specifying a
non-zero Keep Alive Timer value in the Connect command.
"""
As such if we are starting a persistent connection to a discovery
controller and the KATO is currently 0 we need to update KATO to a non
zero value to avoid continuous timeouts on the target.
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Replace the current I/O controller check in
nvme_init_non_mdts_limits() with the helper nvme_is_io_ctrl()
function to maintain consistency with similar checks in other
parts of the code and improve code readability.
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
An administrative controller does not support I/O queues, hence it
should ignore existing checks for IOCCSZ/IORCSZ. Currently, these checks
only exclude a discovery controller but need to also exclude an
administrative controller.
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add nvme_is_io_ctrl() to check if the controller is of type I/O
controller. Uses negative logic by excluding an administrative
controller and a discovery controller.
Signed-off-by: Kamaljit Singh <kamaljit.singh@opensource.wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|