kernel/linux.git/drivers/nvme, branch v4.14.85

nvme-loop: fix kernel oops in case of unhandled command

2018-11-21T08:24:17+00:00

commit 11d9ea6f2ca69237d35d6c55755beba3e006b106 upstream. When nvmet_req_init() fails, __nvmet_req_complete() is called to handle the target request via .queue_response(), so nvme_loop_queue_response() shouldn't be called again for handling the failure. This patch fixes this case by the following way: - move blk_mq_start_request() before nvmet_req_init(), so nvme_loop_queue_response() may work well to complete this host request - don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq() - don't call nvme_loop_queue_response() which is done via .queue_response() Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig [trimmed changelog] Signed-off-by: Keith Busch Signed-off-by: Jens Axboe Signed-off-by: Sudip Mukherjee Signed-off-by: Greg Kroah-Hartman

nvme_fc: fix ctrl create failures racing with workq items

2018-10-13T07:27:28+00:00

commit cf25809bec2c7df4b45df5b2196845d9a4a3c89b upstream. If there are errors during initial controller create, the transport will teardown the partially initialized controller struct and free the ctlr memory. Trouble is - most of those errors can occur due to asynchronous events happening such io timeouts and subsystem connectivity failures. Those failures invoke async workq items to reset the controller and attempt reconnect. Those may be in progress as the main thread frees the ctrl memory, resulting in NULL ptr oops. Prevent this from happening by having the main ctrl failure thread changing state to DELETING followed by synchronously cancelling any pending queued work item. The change of state will prevent the scheduling of resets or reconnect events. Signed-off-by: James Smart Signed-off-by: Keith Busch Signed-off-by: Jens Axboe Signed-off-by: Amit Pundir Signed-off-by: Greg Kroah-Hartman

nvmet-rdma: fix possible bogus dereference under heavy load

2018-10-10T06:54:24+00:00

[ Upstream commit 8407879c4e0d7731f6e7e905893cecf61a7762c7 ] Currently we always repost the recv buffer before we send a response capsule back to the host. Since ordering is not guaranteed for send and recv completions, it is posible that we will receive a new request from the host before we got a send completion for the response capsule. Today, we pre-allocate 2x rsps the length of the queue, but in reality, under heavy load there is nothing that is really preventing the gap to expand until we exhaust all our rsps. To fix this, if we don't have any pre-allocated rsps left, we dynamically allocate a rsp and make sure to free it when we are done. If under memory pressure we fail to allocate a rsp, we silently drop the command and wait for the host to retry. Reported-by: Steve Wise Tested-by: Steve Wise Signed-off-by: Sagi Grimberg [hch: dropped a superflous assignment] Signed-off-by: Christoph Hellwig Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

nvme-fcloop: Fix dropped LS's to removed target port

2018-10-04T00:00:59+00:00

[ Upstream commit afd299ca996929f4f98ac20da0044c0cdc124879 ] When a targetport is removed from the config, fcloop will avoid calling the LS done() routine thinking the targetport is gone. This leaves the initiator reset/reconnect hanging as it waits for a status on the Create_Association LS for the reconnect. Change the filter in the LS callback path. If tport null (set when failed validation before "sending to remote port"), be sure to call done. This was the main bug. But, continue the logic that only calls done if tport was set but there is no remoteport (e.g. case where remoteport has been removed, thus host doesn't expect a completion). Signed-off-by: James Smart Signed-off-by: Christoph Hellwig Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

nvme-rdma: unquiesce queues when deleting the controller

2018-09-26T06:38:02+00:00

[ Upstream commit 90140624e8face94207003ac9a9d2a329b309d68 ] If the controller is going away, we need to unquiesce the IO queues so that all pending request can fail gracefully before moving forward with controller deletion. Do that before we destroy the IO queues so blk_cleanup_queue won't block in freeze. Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event

2018-09-05T07:26:36+00:00

commit f1ed3df20d2d223e0852cc4ac1f19bba869a7e3c upstream. In many architectures loads may be reordered with older stores to different locations. In the nvme driver the following two operations could be reordered: - Write shadow doorbell (dbbuf_db) into memory. - Read EventIdx (dbbuf_ei) from memory. This can result in a potential race condition between driver and VM host processing requests (if given virtual NVMe controller has a support for shadow doorbell). If that occurs, then the NVMe controller may decide to wait for MMIO doorbell from guest operating system, and guest driver may decide not to issue MMIO doorbell on any of subsequent commands. This issue is purely timing-dependent one, so there is no easy way to reproduce it. Currently the easiest known approach is to run "Oracle IO Numbers" (orion) that is shipped with Oracle DB: orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \ concat -write 40 -duration 120 -matrix row -testname nvme_test Where nvme_test is a .lun file that contains a list of NVMe block devices to run test against. Limiting number of vCPUs assigned to given VM instance seems to increase chances for this bug to occur. On test environment with VM that got 4 NVMe drives and 1 vCPU assigned the virtual NVMe controller hang could be observed within 10-20 minutes. That correspond to about 400-500k IO operations processed (or about 100GB of IO read/writes). Orion tool was used as a validation and set to run in a loop for 36 hours (equivalent of pushing 550M IO operations). No issues were observed. That suggest that the patch fixes the issue. Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices") Signed-off-by: Michal Wnukowski Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg [hch: updated changelog and comment a bit] Signed-off-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman

nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD

2018-08-24T11:09:21+00:00

[ Upstream commit 9b382768135ee3ff282f828c906574a8478e036b ] The old code in nvme_user_cmd() passed the userspace virtual address from nvme_passthru_cmd.metadata as the length of the metadata buffer as well as the address to nvme_submit_user_cmd(). Fixes: 63263d60 ("nvme: Use metadata for passthrough commands") Signed-off-by: Roland Dreier Reviewed-by: Keith Busch Signed-off-by: Christoph Hellwig Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

nvmet: reset keep alive timer in controller enable

2018-08-24T11:09:02+00:00

[ Upstream commit d68a90e148f5a82aa67654c5012071e31c0e4baa ] Controllers that are not yet enabled should not really enforce keep alive timeouts, but we still want to track a timeout and cleanup in case a host died before it enabled the controller. Hence, simply reset the keep alive timer when the controller is enabled. Suggested-by: Max Gurtovoy Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

nvmet-fc: fix target sgl list on large transfers

2018-08-09T10:16:39+00:00

commit d082dc1562a2ff0947b214796f12faaa87e816a9 upstream. The existing code to carve up the sg list expected an sg element-per-page which can be very incorrect with iommu's remapping multiple memory pages to fewer bus addresses. To hit this error required a large io payload (greater than 256k) and a system that maps on a per-page basis. It's possible that large ios could get by fine if the system condensed the sgl list into the first 64 elements. This patch corrects the sg list handling by specifically walking the sg list element by element and attempting to divide the transfer up on a per-sg element boundary. While doing so, it still tries to keep sequences under 256k, but will exceed that rule if a single sg element is larger than 256k. Fixes: 48fa362b6c3f ("nvmet-fc: simplify sg list handling") Cc: # 4.14 Signed-off-by: James Smart Signed-off-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman

nvme-pci: Fix queue double allocations

2018-08-09T10:16:39+00:00

commit 62314e405fa101dbb82563394f9dfc225e3f1167 upstream. The queue count says the highest queue that's been allocated, so don't reallocate a queue lower than that. Fixes: 147b27e4bd0 ("nvme-pci: allocate device queues storage space at probe") Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig Signed-off-by: Jon Derrick Signed-off-by: Greg Kroah-Hartman