path: root/drivers/nvme/host
Age  Commit message  [Author; files; lines]
2018-12-21  nvme: validate controller state before rescheduling keep alive  [James Smart; files: 1; lines: -1/+9]
[ Upstream commit 86880d646122240596d6719b642fee3213239994 ] Delete operations are seeing NULL pointer references in call_timer_fn. Tracking these back, the timer appears to be the keep alive timer. nvme_keep_alive_work(), which is tied to the timer that is cancelled by nvme_stop_keep_alive(), simply starts the keep alive io but doesn't wait for its completion. So nvme_stop_keep_alive() only stops a timer when it's pending. When a keep alive is in flight, there is no timer running and nvme_stop_keep_alive() will have no effect on the keep alive io. Thus, if the io completes successfully, the keep alive timer will be rescheduled. In the failure case, delete is called, the controller state is changed, nvme_stop_keep_alive() is called while the io is outstanding, and the delete path continues on. The keep alive happens to successfully complete before the delete paths mark it as aborted as part of the queue termination, so the timer is restarted. The delete paths then tear down the controller, and later on the timer code fires and the timer entry is now corrupt. Fix by validating the controller state before rescheduling the keep alive. Testing with the fix has confirmed the condition above was hit. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
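The state check described in this commit can be pictured with a minimal user-space sketch. The names `ctrl_state`, `ctrl`, `ka_timer_armed` and `keep_alive_done` are hypothetical, the choice of which states may re-arm the timer is an assumption, and the locking a real driver would need around the state read is left out.

```c
#include <stdbool.h>

/* Hypothetical controller states; a real driver defines its own set. */
enum ctrl_state { CTRL_LIVE, CTRL_CONNECTING, CTRL_DELETING };

struct ctrl {
    enum ctrl_state state;
    bool ka_timer_armed;    /* stands in for the keep alive timer */
};

/* Called when a keep alive io completes successfully. */
static void keep_alive_done(struct ctrl *c)
{
    /* Assumption: only these two states are allowed to re-arm the timer. */
    if (c->state == CTRL_LIVE || c->state == CTRL_CONNECTING)
        c->ka_timer_armed = true;
    /* In any other state (e.g. deleting) the timer is left alone. */
}
```

With this shape, a keep alive that completes after delete has changed the controller state no longer restarts a timer that teardown is about to free.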
2018-12-17  nvme-rdma: fix double freeing of async event data  [Prabhath Sajeepa; files: 1; lines: -0/+2]
[ Upstream commit 6344d02dc8f886b6bbcd922ae1a17e4a41500f2d ] Some error paths in the configuration of the admin queue free the data buffer associated with the async request SQE without resetting the data buffer pointer to NULL. This buffer is then freed again if the controller is shut down or reset. Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com> Reviewed-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17  nvme: flush namespace scanning work just before removing namespaces  [Sagi Grimberg; files: 1; lines: -1/+3]
[ Upstream commit f6c8e432cb0479255322c5d0335b9f1699a0270c ] nvme_stop_ctrl can also be called for the reset flow, where there is no need to flush the scan_work as namespaces are not being removed. This can cause a deadlock in the rdma, fc and loop drivers since nvme_stop_ctrl barriers before controller teardown (and specifically I/O cancellation of the scan_work itself) takes place, but the scan_work will be blocked anyway so there is no need to flush it. Instead, move the scan_work flush to nvme_remove_namespaces(), where it really needs to flush. Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Tested-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17  nvme: warn when finding multi-port subsystems without multipathing enabled  [Christoph Hellwig; files: 1; lines: -0/+3]
[ Upstream commit 14a1336e6fff47dd1028b484d6c802105c58e2ee ] Without CONFIG_NVME_MULTIPATH enabled a multi-port subsystem might show up as individual devices and cause problems, so warn about it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-13  nvme-fc: resolve io failures during connect  [James Smart; files: 1; lines: -10/+63]
[ Upstream commit 4cff280a5fccf6513ed9e895bb3a4e7ad8b0cedc ] If an io error occurs on an io issued while connecting, recovery of the io falls flat as the state checking ends up nooping the error handler. Create an err_work work item that is scheduled upon an io error while connecting. The work thread terminates all io on all queues and marks the queues as not connected. The termination of the io will return back to the callee, which will then back out of the connection attempt and will reschedule, if possible, the connection attempt. The changes: - in case there are several commands hitting the error handler, a state flag is kept so that the error work is only scheduled once, on the first error. The subsequent errors can be ignored. - The calling sequence to stop keep alive and terminate the queues and their io is lifted from the reset routine. Made a small service routine used by both reset and err_work. - During debugging, found that the teardown path can reference an uninitialized pointer, resulting in a NULL pointer oops. The aen_ops weren't initialized yet. Add validation on their initialization before calling the teardown routine. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27  nvme: make sure ns head inherits underlying device limits  [Sagi Grimberg; files: 2; lines: -1/+4]
[ Upstream commit 8f676b8508c250bbe255096522fdefb73f1ea0b9 ] Whenever we update ns_head info, we need to make sure it is still compatible with all underlying backing devices because although nvme multipath doesn't have any explicit use of these limits, other devices can still be stacked on top of it which may rely on the underlying limits. Start with unlimited stacking limits, and every info update iterate over siblings and adjust queue limits. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-13  nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O  [James Smart; files: 2; lines: -2/+12]
[ Upstream commit 783f4a4408e1251d17f333ad56abac24dde988b9 ] When an io is rejected by nvmf_check_ready() due to validation of the controller state, the nvmf_fail_nonready_command() will normally return BLK_STS_RESOURCE to requeue and retry. However, if the controller is dying or the I/O is marked for NVMe multipath, the I/O is failed so that the controller can terminate or so that the io can be issued on a different path. Unfortunately, as this reject point is before the transport has accepted the command, blk-mq ends up completing the I/O and never calls nvme_complete_rq(), which is where multipath may preserve or re-route the I/O. The end result is, the device user ends up seeing an EIO error. Example: single path connectivity, controller is under load, and a reset is induced. An I/O is received: a) while the reset state has been set but the queues have yet to be stopped; or b) after queues are started (at end of reset) but before the reconnect has completed. The I/O finishes with an EIO status. This patch makes the following changes: - Adds the HOST_PATH_ERROR pathing status from TP4028 - Modifies the reject point such that it appears to queue successfully, but actually completes the io with the new pathing status and calls nvme_complete_rq(). - nvme_complete_rq() recognizes the new status, avoids resetting the controller (likely was already done in order to get this new status), and calls the multipather to clear the current path that errored. This allows the next command (retry or new command) to select a new path if there is one. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-08  nvme: remove ns sibling before clearing path  [Keith Busch; files: 1; lines: -1/+1]
The code had been clearing a namespace being deleted as the current path while that namespace was still in the path siblings list. It is possible a new IO could set that namespace back to the current path since it appeared to be an eligible path to select, which may result in a use-after-free error. This patch ensures a namespace being removed is not eligible to be reset as a current path prior to clearing it as the current path. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-26  nvme: properly propagate errors in nvme_mpath_init  [Susobhan Dey; files: 1; lines: -2/+4]
Signed-off-by: Susobhan Dey <susobhan.dey@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28  nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event  [Michal Wnukowski; files: 1; lines: -0/+8]
In many architectures loads may be reordered with older stores to different locations. In the nvme driver the following two operations could be reordered:
- Write shadow doorbell (dbbuf_db) into memory.
- Read EventIdx (dbbuf_ei) from memory.
This can result in a race condition between the driver and the VM host processing requests (if the given virtual NVMe controller supports shadow doorbells). If that occurs, the NVMe controller may decide to wait for an MMIO doorbell from the guest operating system, and the guest driver may decide not to issue an MMIO doorbell on any subsequent commands.

This issue is purely timing-dependent, so there is no easy way to reproduce it. Currently the easiest known approach is to run "Oracle IO Numbers" (orion), which is shipped with Oracle DB:

    orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
        concat -write 40 -duration 120 -matrix row -testname nvme_test

where nvme_test is a .lun file that contains a list of NVMe block devices to run the test against. Limiting the number of vCPUs assigned to a given VM instance seems to increase the chances of this bug occurring. In a test environment with a VM that had 4 NVMe drives and 1 vCPU assigned, the virtual NVMe controller hang could be observed within 10-20 minutes. That corresponds to about 400-500k IO operations processed (or about 100GB of IO reads/writes).

The orion tool was used for validation and set to run in a loop for 36 hours (the equivalent of pushing 550M IO operations). No issues were observed, which suggests that the patch fixes the issue.

Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <wnukowski@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
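The store-then-load ordering problem described in this commit can be sketched in user space with C11 atomics standing in for the kernel's barrier primitives. Everything named here (`shadow_doorbell`, `need_event`, `update_and_check_event`) is hypothetical, and the wraparound comparison in `need_event` is an assumed event-index rule rather than something stated in the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared area: the driver writes shadow_db, the device
 * side (e.g. the VM host) writes event_idx. */
struct shadow_doorbell {
    _Atomic uint32_t shadow_db;
    _Atomic uint32_t event_idx;
};

/* Assumed rule: an MMIO doorbell is needed when event_idx lies in the
 * half-open window [old_idx, new_idx), computed with 16-bit wraparound. */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Returns true when the caller should also ring the MMIO doorbell. */
static bool update_and_check_event(struct shadow_doorbell *sd, uint16_t value)
{
    uint16_t old_value =
        (uint16_t)atomic_load_explicit(&sd->shadow_db, memory_order_relaxed);

    /* Step 1: publish the new doorbell value. */
    atomic_store_explicit(&sd->shadow_db, value, memory_order_relaxed);

    /* Full fence: keeps the load below from being reordered before the
     * store above, which is the reordering the commit describes. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Step 2: only now read the event index written by the other side. */
    uint16_t ei =
        (uint16_t)atomic_load_explicit(&sd->event_idx, memory_order_relaxed);

    return need_event(ei, value, old_value);
}
```

Without the fence, the load of `event_idx` could observe a stale value while the other side has not yet seen the new `shadow_db`, so each side would wait on the other.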
2018-08-16  Merge branch 'linus/master' into rdma.git for-next  [Jason Gunthorpe; files: 10; lines: -359/+670]
rdma.git merge resolution for the 4.19 merge window Conflicts: drivers/infiniband/core/rdma_core.c - Use the rdma code and revise with the new spelling for atomic_fetch_add_unless drivers/nvme/host/rdma.c - Replace max_sge with max_send_sge in new blk code drivers/nvme/target/rdma.c - Use the blk code and revise to use NULL for ib_post_recv when appropriate - Replace max_sge with max_recv_sge in new blk code net/rds/ib_send.c - Use the net code and revise to use NULL for ib_post_recv when appropriate Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16  Merge tag 'v4.18' into rdma.git for-next  [Jason Gunthorpe; files: 7; lines: -81/+131]
Resolve merge conflicts from the -rc cycle against the rdma.git tree: Conflicts: drivers/infiniband/core/uverbs_cmd.c - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next - Merge removal of file->ucontext in for-next with new code in -rc drivers/infiniband/core/uverbs_main.c - for-next removed code from ib_uverbs_write() that was modified in for-rc Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-14  Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block  [Linus Torvalds; files: 10; lines: -359/+670]
Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains:
- Fix for a thundering herd issue in the wbt block code (Anchal Agarwal)
- A few NVMe pull requests:
  * Improved tracepoints (Keith)
  * Larger inline data support for RDMA (Steve Wise)
  * RDMA setup/teardown fixes (Sagi)
  * Effects log support for NVMe target (Chaitanya Kulkarni)
  * Buffered IO support for NVMe target (Chaitanya Kulkarni)
  * TP4004 (ANA) support (Christoph)
  * Various NVMe fixes
- Block io-latency controller support. Much needed support for properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack (Kees)
- Lightnvm fixes and updates/improvements (Matias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-08  nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever  [Tal Shorer; files: 1; lines: -1/+1]
When the user supplies a ctrl_loss_tmo < 0, we warn them that this will cause the fabrics layer to attempt reconnection forever. However, in reality the fabrics layer never attempts to reconnect because the condition to test whether we should reconnect is backwards in this case. Signed-off-by: Tal Shorer <tal.shorer@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
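A minimal sketch of the kind of inverted test this commit describes. The struct and function names are hypothetical, and the use of -1 as the "retry forever" sentinel that a negative ctrl_loss_tmo maps to is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical reconnect bookkeeping. */
struct reconnect_policy {
    int max_reconnects;   /* assumption: -1 means "retry forever" */
    int nr_reconnects;    /* attempts made so far */
};

/* Intended behaviour: reconnect when retries are unlimited, or when
 * the attempt budget has not been used up yet. */
static bool should_reconnect(const struct reconnect_policy *p)
{
    return p->max_reconnects == -1 ||
           p->nr_reconnects < p->max_reconnects;
}

/* One way the test can be written backwards: the "forever" sentinel
 * now blocks reconnection instead of allowing it. */
static bool should_reconnect_backwards(const struct reconnect_policy *p)
{
    return p->max_reconnects != -1 &&
           p->nr_reconnects < p->max_reconnects;
}
```

With `max_reconnects == -1` the first function always returns true, while the backwards variant never does, which matches the symptom of a warning about infinite reconnects followed by no reconnect attempts at all.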
2018-08-08  nvme: set gendisk read only based on nsattr  [Chaitanya Kulkarni; files: 1; lines: -0/+6]
NVMe 1.3 TP 4005 introduces a new field (NSATTR). This field indicates whether the given namespace is write protected. This patch sets the gendisk associated with the namespace to read only based on the identify namespace nsattr field. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
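As a rough illustration of mapping the attribute field to a read-only disk flag, consider the sketch below. The types `ns_identify` and `disk` and the helper `update_disk_ro` are hypothetical, and treating bit 0 of nsattr as the write-protected bit is an assumption to be checked against the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 0 of NSATTR marks the namespace as write protected. */
#define NSATTR_WRITE_PROTECTED (1u << 0)

/* Hypothetical subset of the identify namespace data. */
struct ns_identify {
    uint8_t nsattr;
};

/* Hypothetical stand-in for the gendisk read-only state. */
struct disk {
    bool read_only;
};

/* Mirror the write-protect attribute into the disk's read-only flag. */
static void update_disk_ro(struct disk *d, const struct ns_identify *id)
{
    d->read_only = (id->nsattr & NSATTR_WRITE_PROTECTED) != 0;
}
```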
2018-08-07  nvme: fixup crash on failed discovery  [Hannes Reinecke; files: 1; lines: -1/+1]
When the initial discovery fails the subsystem hasn't been set up yet in nvme_mpath_stop, and we can't dereference ctrl->subsys. Fixes: 0d0b660f ("nvme: add ANA support") Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-06  lightnvm: remove minor version check for 2.0  [Matias Bjørling; files: 1; lines: -6/+0]
A minor version number increase should not break backwards compatibility. Fixes: 3cb98f84d368b ("lightnvm: add minor version to generic geometry") Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06  Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2  [Jens Axboe; files: 4; lines: -62/+422]
Pull NVMe changes from Christoph: "This contains the support for TP4004, Asymmetric Namespace Access, which makes NVMe multipathing usable in practice." * 'nvme-4.19' of git://git.infradead.org/nvme: nvmet: use Retain Async Event bit to clear AEN nvmet: support configuring ANA groups nvmet: add minimal ANA support nvmet: track and limit the number of namespaces per subsystem nvmet: keep a port pointer in nvmet_ctrl nvme: add ANA support nvme: remove nvme_req_needs_failover nvme: simplify the API for getting log pages nvme.h: add ANA definitions nvme.h: add support for the log specific field Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06  Merge tag 'v4.18-rc6' into for-4.19/block2  [Jens Axboe; files: 2; lines: -34/+41]
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30  nvme: use blk API to remap ref tags for IOs with metadata  [Max Gurtovoy; files: 3; lines: -82/+20]
Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support). Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30  block: move ref_tag calculation func to the block layer  [Max Gurtovoy; files: 1; lines: -2/+1]
Currently this function is implemented in the scsi layer, but its proper place is the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Suggested-by: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27  Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block  [Linus Torvalds; files: 4; lines: -6/+11]
Pull block fixes from Jens Axboe: "Bigger than usual at this time, mostly due to the O_DIRECT corruption issue and the fact that I was on vacation last week. This contains: - NVMe pull request with two fixes for the FC code, and two target fixes (Christoph) - a DIF bio reset iteration fix (Greg Edwards) - two nbd reply and requeue fixes (Josef) - SCSI timeout fixup (Keith) - a small series that fixes an issue with bio_iov_iter_get_pages(), which ended up causing corruption for larger sized O_DIRECT writes that ended up racing with buffered writes (Martin Wilck)" * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block: block: reset bi_iter.bi_done after splitting bio block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs blkdev: __blkdev_direct_IO_simple: fix leak in error case block: bio_iov_iter_get_pages: fix size of last iovec nvmet: only check for filebacking on -ENOTBLK nvmet: fixup crash on NULL device path scsi: set timed out out mq requests to complete blk-mq: export setting request completion state nvme: if_ready checks to fail io to deleting controller nvmet-fc: fix target sgl list on large transfers nbd: handle unexpected replies better nbd: don't requeue the same request twice.
2018-07-27  nvme: add ANA support  [Christoph Hellwig; files: 3; lines: -27/+408]
Add support for Asymmetric Namespace Access as specified in NVMe 1.3 TP 4004. With ANA each namespace attached to a controller belongs to an ANA group that describes the characteristics of accessing the namespaces through this controller. In the Optimized and Non-Optimized states namespaces can be accessed regularly, although in a multi-pathing environment we should always prefer to access a namespace through a controller where an optimized relationship exists. Namespaces in the Inaccessible, Persistent Loss or Change state for a given controller should not be accessed. The states are updated through reading the ANA log page, which is read once during controller initialization, whenever the ANA change notice AEN is received, or when one of the ANA specific status codes that signal a state change is received on a command. The ANA state is kept in the nvme_ns structure, which makes the checks in the fast path very simple. Updating the ANA state when reading the log page is also very simple; the only downside is that finding the initial ANA state when scanning for namespaces is a bit cumbersome. The gendisk for a ns_head is only registered once a live path for it exists. Without that the kernel would hang during partition scanning. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
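The path preference described in this commit (prefer an optimized path, fall back to a non-optimized one, never use the remaining states) can be sketched as follows. The enum, the `path` struct and `select_path` are hypothetical names for illustration, and the enum values are arbitrary rather than the encodings used on the wire.

```c
#include <stddef.h>

/* Hypothetical ANA states; values are arbitrary, not the spec encoding. */
enum ana_state {
    ANA_OPTIMIZED,
    ANA_NONOPTIMIZED,
    ANA_INACCESSIBLE,
    ANA_PERSISTENT_LOSS,
    ANA_CHANGE
};

/* Hypothetical per-controller view of one namespace. */
struct path {
    int ctrl_id;
    enum ana_state state;
};

/* Return the first optimized path, else the first non-optimized path,
 * else NULL when no path may be used for I/O. */
static const struct path *select_path(const struct path *paths, size_t n)
{
    const struct path *fallback = NULL;

    for (size_t i = 0; i < n; i++) {
        switch (paths[i].state) {
        case ANA_OPTIMIZED:
            return &paths[i];
        case ANA_NONOPTIMIZED:
            if (!fallback)
                fallback = &paths[i];
            break;
        default:
            /* Inaccessible, Persistent Loss, Change: skip this path. */
            break;
        }
    }
    return fallback;
}
```

Keeping the state next to each path, as the commit does with the nvme_ns structure, is what makes a check like this cheap enough for the fast path.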
2018-07-27  nvme: remove nvme_req_needs_failover  [Christoph Hellwig; files: 3; lines: -14/+2]
Now that we just call out to blk_path_error there isn't really any good reason to not merge it into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27  nvme: simplify the API for getting log pages  [Christoph Hellwig; files: 3; lines: -25/+16]
Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes a plain nsid instead of the nvme_ns pointer. Also add support for the log specific field while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-25  nvme-rdma: Simplify ib_post_(send|recv|srq_recv)() calls  [Bart Van Assche; files: 1; lines: -6/+5]
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL as third argument to ib_post_(send|recv|srq_recv)(). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24  nvme-rdma: centralize admin/io queue teardown sequence  [Sagi Grimberg; files: 1; lines: -37/+29]
We follow the same queue teardown sequence in delete, reset and error recovery. Centralize the logic. This patch does not change any functionality. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme-rdma: centralize controller setup sequence  [Sagi Grimberg; files: 1; lines: -77/+53]
Centralize the controller setup sequence into a single routine that correctly cleans up after failures instead of having multiple appearances in several flows (create, reset, reconnect). One thing that we also gain here is the sanity/boundary checks when connecting back to a dynamic controller. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme-rdma: unquiesce queues when deleting the controller  [Sagi Grimberg; files: 1; lines: -0/+2]
If the controller is going away, we need to unquiesce the IO queues so that all pending requests can fail gracefully before moving forward with controller deletion. Do that before we destroy the IO queues so blk_cleanup_queue won't block in freeze. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme-rdma: mark expected switch fall-through  [Gustavo A. R. Silva; files: 1; lines: -0/+1]
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme: add disk name to trace events  [Keith Busch; files: 2; lines: -7/+37]
This will print the disk name to the nvme event trace for io requests so a user can better distinguish traffic to different disks. This can be used to create disk based filters. For example, to see only nvme0n2 traffic: echo "disk == \"nvme0n2\"" > /sys/kernel/debug/tracing/events/nvme/filter Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> [hch: turned __assign_disk_name into an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme: add controller name to trace events  [Keith Busch; files: 1; lines: -6/+11]
This appends the controller instance to the nvme trace buffer to distinguish which controller is dispatching and completing a command. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24  nvme: if_ready checks to fail io to deleting controller  [James Smart; files: 4; lines: -6/+11]
The revised if_ready checks skipped over the case of returning error when the controller is being deleted. Instead it was returning BUSY, which caused the ios to retry, which caused the ns delete to hang waiting for the ios to drain.

Stack trace of hang looks like:

    kworker/u64:2 D 0 74 2 0x80000000
    Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
    Call Trace:
     ? __schedule+0x26d/0x820
     schedule+0x32/0x80
     blk_mq_freeze_queue_wait+0x36/0x80
     ? remove_wait_queue+0x60/0x60
     blk_cleanup_queue+0x72/0x160
     nvme_ns_remove+0x106/0x140 [nvme_core]
     nvme_remove_namespaces+0x7e/0xa0 [nvme_core]
     nvme_delete_ctrl_work+0x4d/0x80 [nvme_core]
     process_one_work+0x160/0x350
     worker_thread+0x1c3/0x3d0
     kthread+0xf5/0x130
     ? process_one_work+0x350/0x350
     ? kthread_bind+0x10/0x10
     ret_from_fork+0x1f/0x30

Extend nvmf_fail_nonready_command() to supply the controller pointer so that the controller state can be looked at. Fail any io to a controller that is deleting.

Fixes: 3bc32bb1186c ("nvme-fabrics: refactor queue ready check")
Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
2018-07-23  nvme: use hw qid in trace events  [Keith Busch; files: 3; lines: -71/+53]
We cannot match a command to its completion based on the command id alone. We need the submitting queue identifier to pair with the completion, so this patch adds that to the trace buffer. This patch also collapses the admin and IO submission traces into a single one so we don't need to duplicate this and create unnecessary code branches: we know whether the command is an admin or IO command based on the qid. And since we're here, the patch fixes code formatting in the area. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> [hch: move the qid helper to nvme.h and made it an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23  nvme: cache struct nvme_ctrl reference to struct nvme_request  [Sagi Grimberg; files: 4; lines: -0/+5]
We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23  nvme-rdma: support up to 4 segments of inline data  [Steve Wise; files: 1; lines: -11/+27]
Allow up to 4 segments of inline data for NVMF WRITE operations. This reduces latency for small WRITEs by removing the need for the target to issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp. Also cap the inline segments used based on the limitations of the device. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23  nvme: move init of keep_alive work item to controller initialization  [James Smart; files: 1; lines: -3/+4]
Currently, the code initializes the keep alive work item whenever nvme_start_keep_alive() is called. However, this routine is called several times while reconnecting, etc. Although it's hoped that keep alive is always disabled and not scheduled when start is called, re-initing if it were scheduled or completing can have very bad side effects. There's no need for re-initialization. Move the keep_alive work item and cmd struct initialization to controller init. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-20  nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD  [Roland Dreier; files: 1; lines: -1/+1]
The old code in nvme_user_cmd() passed the userspace virtual address from nvme_passthru_cmd.metadata as the length of the metadata buffer as well as the address to nvme_submit_user_cmd(). Fixes: 63263d60 ("nvme: Use metadata for passthrough commands") Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-17  nvme: don't enable AEN if not supported  [Weiping Zhang; files: 1; lines: -4/+7]
Avoid executing the set_feature command if no bit is set in Optional Asynchronous Events Supported (OAES). Fixes: c0561f82 ("nvme: submit AEN event configuration on startup") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Weiping Zhang <zhangweiping@didichuxing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-17  nvme: ensure forward progress during Admin passthru  [Scott Bauer; files: 1; lines: -24/+26]
If the controller supports effects and goes down during the passthru admin command we will deadlock during namespace revalidation.

    [ 363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
    [ 363.488290] Not tainted 4.17.0+ #2
    [ 363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 363.488303] kworker/u16:5 D 0 231 2 0x80000000
    [ 363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    [ 363.488338] Call Trace:
    [ 363.488385] schedule+0x75/0x190
    [ 363.488396] rwsem_down_read_failed+0x1c3/0x2f0
    [ 363.488481] call_rwsem_down_read_failed+0x14/0x30
    [ 363.488504] down_read+0x1d/0x80
    [ 363.488523] nvme_stop_queues+0x1e/0xa0 [nvme_core]
    [ 363.488536] nvme_dev_disable+0xae4/0x1620 [nvme]
    [ 363.488614] nvme_reset_work+0xd1e/0x49d9 [nvme]
    [ 363.488911] process_one_work+0x81a/0x1400
    [ 363.488934] worker_thread+0x87/0xe80
    [ 363.488955] kthread+0x2db/0x390
    [ 363.488977] ret_from_fork+0x35/0x40

Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Keith Busch <keith.busch@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-13  lightnvm: limit get chunk meta request size  [Matias Bjørling; files: 1; lines: -2/+8]
For devices that do not specify a limit on their transfer size, the get_chk_meta command may send down a single I/O retrieving the full chunk metadata table, resulting in large 2-4MB I/O requests. Instead, split up the I/Os to a maximum of 256KB and issue them separately to reduce memory requirements. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
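The splitting described here amounts to a bounded-chunk loop. The sketch below is a generic illustration, not the driver code: `read_fn`, `read_in_chunks` and `count_chunk` are hypothetical names, and only the 256KB cap comes from the commit message.

```c
#include <stddef.h>

/* Cap taken from the commit message: at most 256KB per request. */
#define MAX_CHUNK_BYTES (256u * 1024u)

/* Hypothetical callback that issues one request for [offset, offset+len). */
typedef int (*read_fn)(size_t offset, size_t len, void *ctx);

/* Issue a large transfer as a sequence of requests no bigger than the
 * cap. Stops at, and returns, the first non-zero error from fn. */
static int read_in_chunks(size_t total, read_fn fn, void *ctx)
{
    size_t off = 0;

    while (off < total) {
        size_t len = total - off;

        if (len > MAX_CHUNK_BYTES)
            len = MAX_CHUNK_BYTES;

        int ret = fn(off, len, ctx);
        if (ret)
            return ret;

        off += len;
    }
    return 0;
}

/* Example callback: counts how many requests were issued. */
static int count_chunk(size_t offset, size_t len, void *ctx)
{
    (void)offset;
    (void)len;
    (*(size_t *)ctx)++;
    return 0;
}
```

Under these assumptions a 4MB metadata table is fetched as 16 requests of 256KB each, so no single request needs a multi-megabyte buffer.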
2018-07-13  lightnvm: Remove redundant rq->__data_len initialization  [Bart Van Assche; files: 1; lines: -4/+2]
Since both blk_old_get_request() and blk_mq_alloc_request() initialize rq->__data_len to zero, it is not necessary to initialize that member in nvme_nvm_alloc_request(). Hence remove the rq->__data_len initialization from nvme_nvm_alloc_request(). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-12  nvme-pci: fix memory leak on probe failure  [Keith Busch; files: 1; lines: -5/+7]
The nvme driver specific structures need to be initialized prior to enabling the generic controller so we can unwind on failure without using the reference counting callbacks so that 'probe' and 'remove' can be symmetric. The newly added iod_mempool is the only resource that was being allocated out of order, and a failure there would leak the generic controller memory. This patch just moves that allocation above the controller initialization. Fixes: 943e942e6266f ("nvme-pci: limit max IO size and segments to avoid high order allocations") Reported-by: Weiping Zhang <zwp10758@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-28nvme-rdma: fix possible double free of controller async event bufferSagi Grimberg1-2/+5
If reconnect/reset failed where the controller async event buffer was freed, we might end up freeing it again as we call nvme_rdma_destroy_admin_queue again in the remove path. Given that the sequence is guaranteed to serialize by .ctrl_stop, we simply set ctrl->async_event_sqe.data to NULL and don't free it in future visits. Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
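The "set the pointer to NULL after freeing" idea can be sketched in plain userspace C. This is only an illustration of the pattern, not the nvme-rdma code: `struct event_buf`, `event_buf_alloc()` and `event_buf_free()` are hypothetical stand-ins, and `malloc`/`free` replace the driver's own allocation calls.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure that owns one heap buffer. */
struct event_buf {
    void *data;
};

/* Allocate the buffer; returns 0 on success, -1 on failure. */
static int event_buf_alloc(struct event_buf *b, size_t len)
{
    assert(b != NULL);
    b->data = malloc(len);
    return b->data ? 0 : -1;
}

/*
 * Free the buffer and clear the pointer so a later call is a no-op.
 * Returns 1 if something was freed, 0 if there was nothing to free.
 */
static int event_buf_free(struct event_buf *b)
{
    assert(b != NULL);
    if (!b->data)
        return 0;

    free(b->data);
    b->data = NULL;   /* later visits see NULL and skip the free */
    return 1;
}
```

With this shape, a teardown path that runs twice (for example once after a failed reconnect and again on removal) frees the buffer only the first time. As the commit message notes, this relies on the two visits being serialized; the NULL check alone does not protect against concurrent callers.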
2018-06-21nvme-pci: limit max IO size and segments to avoid high order allocationsJens Axboe3-5/+39
nvme requires an sg table allocation for each request. If the request is large, then the allocation can become quite large. For instance, with our default software settings of 1280KB IO size, we'll need 10248 bytes of sg table. That turns into a 2nd order allocation, which we can't always guarantee. If we fail the allocation, blk-mq will retry it later. But there's no guarantee that we'll EVER be able to allocate that much contiguous memory. Limit the IO size such that we never need more than a single page of memory. That's a lot faster and more reliable. Then back that allocation with a mempool, so that we know we'll always be able to succeed the allocation at some point. Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
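The step from "10248 bytes" to "2nd order allocation" follows from page allocations coming in power-of-two page counts. A minimal sketch of that arithmetic, assuming 4KB pages (`SKETCH_PAGE_SIZE` and `alloc_order()` are hypothetical names, not kernel symbols):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)  /* assumption: 4KB pages */

/*
 * Smallest order such that (1 << order) pages hold `bytes` bytes.
 * Order 0 is one page, order 1 is two pages, order 2 is four pages.
 */
static unsigned int alloc_order(size_t bytes)
{
    assert(bytes > 0);

    /* Round up to whole pages. */
    size_t pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned int order = 0;

    /* Round the page count up to the next power of two. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under that assumption, 10248 bytes needs three pages, which rounds up to four contiguous pages (order 2), while anything up to 4096 bytes stays at order 0. That is the single-page case the commit limits the IO size to.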
2018-06-21nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrlJianchao Wang1-1/+1
There is a race between nvme_remove and nvme_reset_work that can lead to an io hang.

nvme_remove                        nvme_reset_work
                                   -> nvme_remove_dead_ctrl
                                     -> nvme_dev_disable
                                       -> quiesce request_queue
                                     -> queue remove_work
-> cancel_work_sync reset_work
-> nvme_remove_namespaces
  -> splice ctrl->namespaces
                                   nvme_remove_dead_ctrl_work
                                   -> nvme_kill_queues
  -> nvme_ns_remove                   do nothing
    -> blk_cleanup_queue
      -> blk_freeze_queue

Finally, the request_queue is still in the quiesced state while we wait for the freeze, so we get an io hang here. To fix it, move the nvme_kill_queues from nvme_remove_dead_ctrl_work to nvme_remove_dead_ctrl. Suggested-by: Keith Busch <keith.busch@linux.intel.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-21nvme-fc: release io queues to allow fast failJames Smart1-3/+3
Rather than leaving io queues quiesced after tearing down an association, restart them. This allows ios to be replayed, with fastfail ios terminating and non-fastfail getting into loops of retry. This follows rdma's lead. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimber.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-20nvme-rdma: don't override opts->queue_sizeSagi Grimberg1-11/+10
That is a user argument, and theoretically controller limits can change over time (over reconnects/resets). Instead, use the sqsize controller attribute to check queue depth boundaries and use it for the tagset allocation. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-20nvme-rdma: Fix command completion race at error recoveryIsrael Rukshin1-2/+6
The race is between completing the request at error recovery work and rdma completions. If we cancel the request before getting the good rdma completion we get a NULL deref of the request MR at nvme_rdma_process_nvme_rsp(). When canceling the request we return its mr to the mr pool (set mr to NULL) and also unmap its data. Canceling the requests while the rdma queues are active is not safe, because we can still get good rdma completions that use the mr pointer, which may be NULL. Completing the request too soon may also lead to performing DMA to/from user buffers which might have been already unmapped. The commit fixes the race by draining the QP before starting the abort commands mechanism. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-20nvme-rdma: fix possible free of a non-allocated async event bufferSagi Grimberg1-13/+11
If nvme_rdma_configure_admin_queue fails before we allocated the async event buffer, we will falsely free it because nvme_rdma_free_queue is freeing it. Fix it by allocating the buffer right after nvme_rdma_alloc_queue and free it right before nvme_rdma_queue_free to maintain orderly reverse cleanup sequence. Reported-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
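The "orderly reverse cleanup sequence" can be sketched as the usual C unwind pattern: acquire resources in a fixed order, and on failure release only what was already acquired, in the opposite order. The sketch below is a userspace illustration, not the nvme-rdma code; `struct ctrl_sketch`, `ctrl_setup()`, `ctrl_teardown()` and the `fail_step` parameter (used only to simulate a failure) are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object owning two resources, set up in this order. */
struct ctrl_sketch {
    void *queue;
    void *event_buf;
};

/*
 * Set up the queue, then the event buffer. `fail_step` (1 or 2)
 * simulates a failure at that step; 0 means no simulated failure.
 * On failure, only the steps that already succeeded are undone.
 */
static int ctrl_setup(struct ctrl_sketch *c, int fail_step)
{
    assert(c != NULL);
    c->queue = NULL;
    c->event_buf = NULL;

    c->queue = (fail_step == 1) ? NULL : malloc(64);
    if (!c->queue)
        goto out;

    c->event_buf = (fail_step == 2) ? NULL : malloc(64);
    if (!c->event_buf)
        goto out_free_queue;

    return 0;

out_free_queue:
    free(c->queue);
    c->queue = NULL;
out:
    return -1;
}

/* Full teardown: release in the reverse of the setup order. */
static void ctrl_teardown(struct ctrl_sketch *c)
{
    assert(c != NULL);
    free(c->event_buf);
    c->event_buf = NULL;
    free(c->queue);
    c->queue = NULL;
}
```

Because each error label undoes exactly the steps that precede it, a failure before the event buffer exists never reaches code that frees the event buffer, which is the class of bug the commit describes.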