summaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)AuthorFilesLines
2022-11-16nvme-auth: don't ignore key generation failures when initializing ctrl keysSagi Grimberg3-7/+25
nvme_auth_generate_key can fail, don't ignore it upon initialization. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-auth: remove redundant buffer deallocationsSagi Grimberg1-4/+0
host_response, host_key, ctrl_key and sess_key are freed in nvme_auth_reset_dhchap which is called from nvme_auth_free_dhchap. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-auth: don't re-authenticate if the controller is not LIVESagi Grimberg1-0/+7
The connect sequence will re-authenticate. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-auth: remove symbol export from nvme_auth_resetSagi Grimberg1-1/+0
Only the nvme module calls it. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-auth: rename authentication work elementsSagi Grimberg1-4/+4
Use nvme_ctrl_auth_work and nvme_queue_auth_work for better readability. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-auth: rename __nvme_auth_[reset|free] to nvme_auth[reset|free]_dhchapSagi Grimberg1-6/+6
nvme_auth_[reset|free] operate on the controller while __nvme_auth_[reset|free] operate on a chap struct (which maps to a queue context). Rename it for clarity. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-pci: don't unbind the driver on reset failureChristoph Hellwig1-30/+10
Unbind a device driver when a reset fails is very unusual behavior. Just shut the controller down and leave it in dead state if we fail to reset it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-16nvme-pci: split the initial probe from the rest pathChristoph Hellwig1-53/+80
nvme_reset_work is a little fragile as it needs to handle both resetting a live controller and initializing one during probe. Split out the initial probe and open code it in nvme_probe and leave nvme_reset_work to just do the live controller reset. This fixes a recently introduced bug where nvme_dev_disable causes a NULL pointer dereferences in blk_mq_quiesce_tagset because the tagset pointer is not set when the reset state is entered directly from the new state. The separate probe code can skip the reset state and probe directly and fixes this. To make sure the system isn't single threaded on enabling nvme controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver structure so that the driver core probes in parallel. Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") Reported-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-16nvme-pci: move the HMPRE check into nvme_setup_host_memChristoph Hellwig1-5/+6
Check that a HMB is wanted into the allocation helper instead of the caller. This makes life simpler for an upcoming second caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-16nvmet: fix a memory leak in nvmet_auth_set_keySagi Grimberg1-0/+2
When changing dhchap secrets we need to release the old secrets as well. kmemleak complaint: -- unreferenced object 0xffff8c7f44ed8180 (size 64): comm "check", pid 7304, jiffies 4295686133 (age 72034.246s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 4c 64 4c 4f 64 71 DHHC-1:00:LdLOdq 79 56 69 67 77 48 55 32 6d 5a 59 4c 7a 35 59 38 yVigwHU2mZYLz5Y8 backtrace: [<00000000b6fc5071>] kstrdup+0x2e/0x60 [<00000000f0f4633f>] 0xffffffffc0e07ee6 [<0000000053006c05>] 0xffffffffc0dff783 [<00000000419ae922>] configfs_write_iter+0xb1/0x120 [<000000008183c424>] vfs_write+0x2be/0x3c0 [<000000009005a2a5>] ksys_write+0x5f/0xe0 [<00000000cd495c89>] do_syscall_64+0x38/0x90 [<00000000f2a84ac5>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000Tiago Dias Ferreira1-0/+2
Added a quirk to fix the Netac NV7000 SSD reporting duplicate NGUIDs. Cc: <stable@vger.kernel.org> Signed-off-by: Tiago Dias Ferreira <tiagodfer@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvme-pci: simplify nvme_dbbuf_dma_allocChristoph Hellwig1-16/+16
Move the OACS check and the error checking into nvme_dbbuf_dma_alloc so that an upcoming second caller doesn't have to duplicate this boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-15nvme-pci: call nvme_pci_configure_admin_queue from nvme_pci_enableChristoph Hellwig1-5/+2
nvme_pci_configure_admin_queue is called right after nvme_pci_enable, and it's work is undone by nvme_dev_disable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme-pci: set constant paramters in nvme_pci_alloc_ctrlChristoph Hellwig1-21/+17
Move setting of low-level constant parameters from nvme_reset_work to nvme_pci_alloc_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme-pci: factor out a nvme_pci_alloc_dev helperChristoph Hellwig1-35/+46
Add a helper that allocates the nvme_dev structure up to the point where we can call nvme_init_ctrl. This pairs with the free_ctrl method and can thus be used to cleanup the teardown path and make it more symmetric. Note that this now calls nvme_init_ctrl a lot earlier during probing, which also means the per-controller character device shows up earlier. Due to the controller state no commnds can be send on it, but it might make sense to delay the cdev registration until nvme_init_ctrl_finish. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme-pci: factor the iod mempool creation into a helperChristoph Hellwig1-23/+18
Add a helper to create the iod mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme-pci: move more teardown work to nvme_removeChristoph Hellwig1-2/+2
nvme_dbbuf_dma_free frees dma coherent memory, so it must not be called after ->remove has returned. Fortunately there is no way to use it after shutdown as no more I/O is possible so it can be moved. Similarly the iod_mempool can't be used for a device kept alive after shutdown, so move it next to freeing the PRP pools. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme-pci: put the admin queue in nvme_dev_remove_adminChristoph Hellwig1-2/+1
Once the controller is shutdown no one can access the admin queue. Tear it down in nvme_dev_remove_admin, which matches the flow in the other drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme: simplify transport specific device attribute handlingChristoph Hellwig3-17/+16
Allow the transport driver to override the attribute groups for the control device, so that the PCIe driver doesn't manually have to add a group after device creation and keep track of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme: move OPAL setup from PCIe to coreChristoph Hellwig8-25/+29
Nothing about the TCG Opal support is PCIe transport specific, so move it to the core code. For this nvme_init_ctrl_finish grows a new was_suspended argument that allows the transport driver to tell the OPAL code if the controller came out of a suspend cycle. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme: don't call nvme_init_ctrl_finish from nvme_passthru_endChristoph Hellwig1-2/+4
nvme_passthrough_end can race with a reset, which can lead to racing stores to the cels xarray as well as further shengians with upcoming more complicated initialization. So drop the call and just log that the controller capabilities might have changed and a reset could be required to use the new controller capabilities. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
2022-11-15nvme: implement the DEAC bit for the Write Zeroes commandChristoph Hellwig2-1/+13
While the specification allows devices to either deallocate data or to actually write zeroes on any Write Zeroes command, many SSDs only do the sensible thing and deallocate data when the DEAC bit is specific. Set it when it is supported and the caller doesn't explicitly opt out of deallocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-11-15nvme: identify-namespace without CAP_SYS_ADMINKanchan Joshi1-2/+16
Allow all identify-namespace variants (CNS 00h, 05h and 08h) without requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is needed to form IO commands for passthrough interface. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvme: fine-granular CAP_SYS_ADMIN for nvme io commandsKanchan Joshi1-33/+69
Currently both io and admin commands are kept under a coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely. $ ls -l /dev/ng* crw-rw-rw- 1 root root 242, 0 Sep 9 19:20 /dev/ng0n1 crw------- 1 root root 242, 1 Sep 9 19:20 /dev/ng0n2 In the example above, ng0n1 appears as if it may allow unprivileged read/write operation but it does not and behaves same as ng0n2. This patch implements a shift from CAP_SYS_ADMIN to more fine-granular control for io-commands. If CAP_SYS_ADMIN is present, nothing else is checked as before. Otherwise, following rules are in place - any admin-cmd is not allowed - vendor-specific and fabric commmand are not allowed - io-commands that can write are allowed if matching FMODE_WRITE permission is present - io-commands that read are allowed Add a helper nvme_cmd_allowed that implements above policy. Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed for any decision making. Since file open mode is counted for any approval/denial, change at various places to keep file-mode information handy. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvme-fc: improve memory usage in nvme_fc_rcv_ls_req()Christophe JAILLET1-8/+9
sizeof( struct nvmefc_ls_rcv_op ) = 64 sizeof( union nvmefc_ls_requests ) = 1024 sizeof( union nvmefc_ls_responses ) = 128 So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when kzalloc() is called. Because of the way memory allocations are performed, 2048 bytes are allocated. So about 800 bytes are wasted for each request. Switch to 3 distinct memory allocations, in order to: - save these 800 bytes - avoid zeroing this extra memory - make sure that memory is properly aligned in case of DMA access ("fc_dma_map_single(lsop->rspbuf)" just a few lines below) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvmet: only allocate a single slab for bvecsChristoph Hellwig3-22/+19
There is no need to have a separate slab cache for each namespace, and having separate ones creates duplicate debugs file names as well. Fixes: d5eff33ee6f8 ("nvmet: add simple file backed ns support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-11-15nvmet: force reconnect when number of queue changesDaniel Wagner1-1/+8
In order to test queue number changes we need to make sure that the host reconnects. Because only when the host disconnects from the target the number of queues are allowed to change according the spec. The initial idea was to disable and re-enable the ports and have the host wait until the KATO timer expires, triggering error recovery. Though the host would see a DNR reply when trying to reconnect. Because of the DNR bit the connection is dropped completely. There is no point in trying to reconnect with the same parameters according the spec. We can force to reconnect the host is by deleting all controllers. The host will observe any newly posted request to fail and thus starts the error recovery but this time without the DNR bit set. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvmet: use try_cmpxchg in nvmet_update_sq_headUros Bizjak1-3/+2
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in nvmet_update_sq_head. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-15nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron NitroBean Huo1-0/+2
Added a quirk to fix Micron Nitro NVMe reporting duplicate NGUIDs. Cc: <stable@vger.kernel.org> Signed-off-by: Bean Huo <beanhuo@micron.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-12Merge tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linuxLinus Torvalds3-6/+7
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Quiet user passthrough command errors (Keith Busch) - Fix memory leak in nvmet_subsys_attr_model_store_locked - Fix a memory leak in nvmet-auth (Sagi Grimberg) - Fix a potential NULL point deref in bfq (Yu) - Allocate command/response buffers separately for DMA for sed-opal, rather than rely on embedded alignment (Serge) * tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux: nvmet: fix a memory leak nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked nvme: quiet user passthrough command errors block: sed-opal: kmalloc the cmd/resp buffers block, bfq: fix null pointer dereference in bfq_bio_bfqg()
2022-11-09nvmet: fix a memory leakSagi Grimberg1-0/+1
We need to also free the dhchap_ctrl_secret when releasing nvmet_host. kmemleak complaint: -- unreferenced object 0xffff99b1cbca5140 (size 64): comm "check", pid 4864, jiffies 4305092436 (age 2913.583s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 65 36 2b 41 63 44 DHHC-1:00:e6+AcD 39 76 47 4d 52 57 59 78 67 54 47 44 51 59 47 78 9vGMRWYxgTGDQYGx backtrace: [<00000000c07d369d>] kstrdup+0x2e/0x60 [<000000001372171c>] 0xffffffffc0cceec6 [<0000000010dbf50b>] 0xffffffffc0cc6783 [<000000007465e93c>] configfs_write_iter+0xb1/0x120 [<0000000039c23f62>] vfs_write+0x2be/0x3c0 [<000000002da4351c>] ksys_write+0x5f/0xe0 [<00000000d5011e32>] do_syscall_64+0x38/0x90 [<00000000503870cf>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09nvmet: fix memory leak in nvmet_subsys_attr_model_store_lockedAleksandr Miloserdov1-2/+5
Since model_number is allocated before it needs to be freed before kmemdump_nul. Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com> Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-09nvme: quiet user passthrough command errorsKeith Busch2-4/+1
The driver is spamming the kernel logs for entirely harmless errors from user space submitting unsupported commands. Just silence the errors. The application has direct access to command status, so there's no need to log these. And since every passthrough command now uses the quiet flag, move the setting to the common initializer. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Tested-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-02nvme: use blk_mq_[un]quiesce_tagsetChao Leng2-27/+9
All controller namespaces share the same tagset, so we can use this interface which does the optimal operation for parallel quiesce based on the tagset type(e.g. blocking tagsets and non-blocking tagsets). nvme connect_q should not be quiesced when quiesce tagset, so set the QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q. Currently we use NVME_NS_STOPPED to ensure pairing quiescing and unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED. In addition, we never really quiesce a single namespace. It is a better choice to move the flag from ns to ctrl. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: rebased on top of prep patches] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02blk-mq: pass a tagset to blk_mq_wait_quiesce_doneChristoph Hellwig1-2/+2
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme-apple: don't unquiesce the I/O queues in apple_nvme_reset_workChristoph Hellwig1-1/+0
apple_nvme_reset_work schedules apple_nvme_remove, to be called, which will call apple_nvme_disable and unquiesce the I/O queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrlChristoph Hellwig1-1/+0
nvme_remove_dead_ctrl schedules nvme_remove to be called, which will call nvme_dev_disable and unquiesce the I/O queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme: split nvme_kill_queuesChristoph Hellwig4-33/+15
nvme_kill_queues does two things: 1) mark the gendisk of all namespaces dead 2) unquiesce all I/O queues These used to be be intertwined due to block layer issues, but aren't any more. So move the unquiscing of the I/O queues into the callers, and rename the rest of the function to the now more descriptive nvme_mark_namespaces_dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme: don't unquiesce the admin queue in nvme_kill_queuesChristoph Hellwig1-4/+0
None of the callers of nvme_kill_queues needs it to unquiesce the admin queues, as all of them already do it themselves: 1) nvme_reset_work explicit call nvme_start_admin_queue toward the beginning of the function. The extra call to nvme_start_admin_queue in nvme_reset_work this won't do anything as NVME_CTRL_ADMIN_Q_STOPPED will already be cleared. 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at the very beginning of the function if the PCIe device was not present, which is the precondition for the call to nvme_kill_queues. nvme_dev_disable already calls nvme_start_admin_queue toward the end of the function when the shutdown flag is set to true, so the admin queue is already enabled at this point. 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver, which will end up in nvme_remove, which calls nvme_dev_disable with the shutdown flag. This case will call nvme_start_admin_queue a bit later than before. 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl above. 5) nvme_remove_namespaces only calls nvme_kill_queues when the controller is in the DEAD state. That can only happen in the PCIe driver, and only from nvme_remove. See item 2) above for the conditions there. So it is safe to just remove the call to nvme_start_admin_queue in nvme_kill_queues without replacement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme: remove the NVME_NS_DEAD check in nvme_validate_nsChristoph Hellwig1-4/+0
At the point where namespaces are marked dead, the controller is in a non-live state and we won't get pass the identify commands. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespacesChristoph Hellwig1-1/+1
The NVME_NS_DEAD check only made sense when we revalidated namespaces in nvme_passthrough_end for commands that affected the namespace inventory. These days NVME_NS_DEAD is only set during reset or when tearing down namespaces, and we always remove all namespaces right after that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme: don't remove namespaces in nvme_passthru_endChristoph Hellwig1-1/+0
The call to nvme_remove_invalid_namespaces made sense when nvme_passthru_end revalidated all namespaces and had to remove those that didn't exist any more. Since we don't revalidate from nvme_passthru_end now, this call is entirely spurious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02nvme-pci: refactor the tagset handling in nvme_reset_workChristoph Hellwig1-16/+28
The code to create, update or delete a tagset and namespaces in nvme_reset_work is a bit convoluted. Refactor it with a two high-level conditionals for first probe vs reset and I/O queues vs no I/O queues to make the code flow more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de [axboe: fix whitespace issue] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02block: set the disk capacity to 0 in blk_mark_disk_deadChristoph Hellwig1-6/+1
nvme and xen-blkfront are already doing this to stop buffered writes from creating dirty pages that can't be written out later. Move it to the common code. This also removes the comment about the ordering from nvme, as bd_mutex not only is gone entirely, but also hasn't been used for locking updates to the disk size long before that, and thus the ordering requirement documented there doesn't apply any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-30Merge tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linuxLinus Torvalds2-2/+12
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - make the multipath dma alignment match the non-multipath one (Keith Busch) - fix a bogus use of sg_init_marker() (Nam Cao) - fix circulr locking in nvme-tcp (Sagi Grimberg) - Initialization fix for requests allocated via the special hw queue allocator (John) - Fix for a regression added in this release with the batched completions of end_io backed requests (Ming) - Error handling leak fix for rbd (Yang) - Error handling leak fix for add_disk() failure (Yu) * tag 'block-6.1-2022-10-28' of git://git.kernel.dk/linux: blk-mq: Properly init requests from blk_mq_alloc_request_hctx() blk-mq: don't add non-pt request with ->end_io to batch rbd: fix possible memory leak in rbd_sysfs_init() nvme-multipath: set queue dma alignment to 3 nvme-tcp: fix possible circular locking when deleting a controller under memory pressure nvme-tcp: replace sg_init_marker() with sg_init_table() block: fix memory leak for elevator on add_disk failure
2022-10-25nvme-multipath: set queue dma alignment to 3Keith Busch1-0/+1
NVMe spec requires all transports support dword aligned addresses, which is already set in the namespace request_queue. Set the same limit in the multipath device's request_queue as well. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25nvme-tcp: fix possible circular locking when deleting a controller under ↵Sagi Grimberg1-1/+10
memory pressure When destroying a queue, when calling sock_release, the network stack might need to allocate an skb to send a FIN/RST. When that happens during memory pressure, there is a need to reclaim memory, which in turn may ask the nvme-tcp device to write out dirty pages, however this is not possible due to a ctrl teardown that is going on. Set PF_MEMALLOC to the task that releases the socket to grant access to PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp thread as this may also originate from the swap itself and should be more resilient to memory pressure situations. This fixes the following lockdep complaint: -- ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc2+ #25 Tainted: G W ------------------------------------------------------ kswapd0/92 is trying to acquire lock: ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0 but task is already holding lock: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x11e/0x160 kmem_cache_alloc_node+0x44/0x530 __alloc_skb+0x158/0x230 tcp_send_active_reset+0x7e/0x730 tcp_disconnect+0x1272/0x1ae0 __tcp_close+0x707/0xd90 tcp_close+0x26/0x80 inet_release+0xfa/0x220 sock_release+0x85/0x1a0 nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp] nvme_do_delete_ctrl+0x130/0x13d [nvme_core] nvme_sysfs_delete.cold+0x8/0xd [nvme_core] kernfs_fop_write_iter+0x356/0x530 vfs_write+0x4e8/0xce0 ksys_write+0xfd/0x1d0 do_syscall_64+0x58/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}: __lock_acquire+0x2a0c/0x5690 lock_acquire+0x18e/0x4f0 lock_sock_nested+0x37/0xc0 tcp_sendpage+0x23/0xa0 inet_sendpage+0xad/0x120 kernel_sendpage+0x156/0x440 nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp] nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp] __blk_mq_try_issue_directly+0x452/0x660 blk_mq_plug_issue_direct.constprop.0+0x207/0x700 blk_mq_flush_plug_list+0x6f5/0xc70 __blk_flush_plug+0x264/0x410 blk_finish_plug+0x4b/0xa0 shrink_lruvec+0x1263/0x1ea0 shrink_node+0x736/0x1a80 balance_pgdat+0x740/0x10d0 kswapd+0x5f2/0xaf0 kthread+0x256/0x2f0 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sk_lock-AF_INET-NVME); lock(fs_reclaim); lock(sk_lock-AF_INET-NVME); *** DEADLOCK *** 3 locks held by kswapd0/92: #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0 #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70 #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp] Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25nvme-tcp: replace sg_init_marker() with sg_init_table()Nam Cao1-1/+1
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an uninitialized scatterlist. This is probably fine, but gcc complains: CC [M] drivers/nvme/host/tcp.o In file included from ./include/linux/dma-mapping.h:10, from ./include/linux/skbuff.h:31, from ./include/net/net_namespace.h:43, from ./include/linux/netdevice.h:38, from ./include/net/sock.h:46, from drivers/nvme/host/tcp.c:12: In function ‘sg_mark_end’, inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2, inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2: ./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized] 234 | sg->page_link |= SG_END; | ~~^~~~~~~~~~~ drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’: drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here 388 | struct scatterlist sg; | ^~ cc1: all warnings being treated as errors Use sg_init_table() instead, which basically memset the scatterlist to zero first before calling sg_init_marker(). Signed-off-by: Nam Cao <namcaov@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-10-25nvme-apple: remove an extra queue referenceChristoph Hellwig1-9/+0
Now that blk_mq_destroy_queue does not release the queue reference, there is no need for a second admin queue reference to be held by the apple_nvme structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sven Peter <sven@svenpeter.dev> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25nvme-pci: remove an extra queue referenceChristoph Hellwig1-6/+0
Now that blk_mq_destroy_queue does not release the queue reference, there is no need for a second admin queue reference to be held by the nvme_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>