summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/core.c
AgeCommit message (Collapse)AuthorFilesLines
2018-05-29nvme: fix extended data LBA supported settingMax Gurtovoy1-1/+1
This value depands on the metadata support value, so reorder the initialization to fit. Fixes: b5be3b392 ("nvme: always unregister the integrity profile in __nvme_revalidate_disk") Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org
2018-05-11nvme: Fix sync controller reset returnCharles Machalow1-1/+2
If a controller reset is requested while the device has no namespaces, we were incorrectly returning ENETRESET. This patch adds the check for ADMIN_ONLY controller state to indicate a successful reset. Fixes: 8000d1fdb0 ("nvme-rdma: fix sysfs invoked reset_ctrl error flow ") Cc: <stable@vger.kernel.org> Signed-off-by: Charles Machalow <charles.machalow@intel.com> [changelog] Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-07nvme: fix use-after-free in nvme_free_ns_headJianchao Wang1-0/+5
Currently only nvme_ctrl will take a reference counter of nvme_subsystem, nvme_ns_head also needs it. Otherwise nvme_free_ns_head will access the nvme_subsystem.ns_ida which has been freed by __nvme_release_subsystem after all the reference of nvme_subsystem have been released by nvme_free_ctrl. This could cause memory corruption. BUG: KASAN: use-after-free in radix_tree_next_chunk+0x9f/0x4b0 Read of size 8 at addr ffff88036494d2e8 by task fio/1815 CPU: 1 PID: 1815 Comm: fio Kdump: loaded Tainted: G W 4.17.0-rc1+ #18 Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017 Call Trace: dump_stack+0x91/0xeb print_address_description+0x6b/0x290 kasan_report+0x261/0x360 radix_tree_next_chunk+0x9f/0x4b0 ida_remove+0x8b/0x180 ida_simple_remove+0x26/0x40 nvme_free_ns_head+0x58/0xc0 __blkdev_put+0x30a/0x3a0 blkdev_close+0x44/0x50 __fput+0x184/0x380 task_work_run+0xaf/0xe0 do_exit+0x501/0x1440 do_group_exit+0x89/0x140 __x64_sys_exit_group+0x28/0x30 do_syscall_64+0x72/0x230 Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-03nvme/multipath: Fix multipath disabled naming collisionsKeith Busch1-25/+1
When CONFIG_NVME_MULTIPATH is set, but we're not using nvme to multipath, namespaces with multiple paths were not creating unique names due to reusing the same instance number from the namespace's head. This patch fixes this by falling back to the non-multipath naming method when the parameter disabled using multipath. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03nvme: Set integrity flag for user passthrough commandsKeith Busch1-0/+1
If the command a separate metadata buffer attached, the request needs to have the integrity flag set so the driver knows to map it. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: expand nvmf_check_if_ready checksJames Smart1-5/+12
The nvmf_check_if_ready() checks that were added are very simplistic. As such, the routine allows a lot of cases to fail ios during windows of reset or re-connection. In cases where there are not multi-path options present, the error goes back to the callee - the filesystem or application. Not good. The common routine was rewritten and calling syntax slightly expanded so that per-transport is_ready routines don't need to be present. The transports now call the routine directly. The routine is now a fabrics routine rather than an inline function. The routine now looks at controller state to decide the action to take. Some states mandate io failure. Others define the condition where a command can be accepted. When the decision is unclear, a generic queue-or-reject check is made to look for failfast or multipath ios and only fails the io if it is so marked. Otherwise, the io will be queued and wait for the controller state to resolve. Admin commands issued via ioctl share a live admin queue with commands from the transport for controller init. The ioctls could be intermixed with the initialization commands. It's possible for the ioctl cmd to be issued prior to the controller being enabled. To block this, the ioctl admin commands need to be distinguished from admin commands used for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to reflect this division and set it on ioctls requests. As the nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(), ensure that commands allocated by the ioctl path (actually anything in core.c) preps the nvme_req(req) before starting the io. This will preserve the USERCMD flag during execution and/or retry. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.e> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: Use admin command effects for admin commandsKeith Busch1-1/+1
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: check return value of init_srcu_struct functionMax Gurtovoy1-1/+4
Also add error flow in case srcu initialization function fails. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: unexport nvme_start_keep_aliveJohannes Thumshirn1-2/+1
nvme_start_keep_alive() isn't used outside core.c so unexport it and make it static. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12nvme: enforce 64bit offset for nvme_get_log_ext fnMatias Bjørling1-3/+3
Compiling on 32 bits system produces a warning for the shift width when shifting 32 bit integer with 64bit integer. Make sure that offset always is 64bit, and use macros for retrieving lower and upper bits of the offset. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-06Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-blockLinus Torvalds1-44/+79
Pull block layer updates from Jens Axboe: "It's a pretty quiet round this time, which is nice. This contains: - series from Bart, cleaning up the way we set/test/clear atomic queue flags. - series from Bart, fixing races between gendisk and queue registration and removal. - set of bcache fixes and improvements from various folks, by way of Michael Lyle. - set of lightnvm updates from Matias, most of it being the 1.2 to 2.0 transition. - removal of unused DIO flags from Nikolay. - blk-mq/sbitmap memory ordering fixes from Omar. - divide-by-zero fix for BFQ from Paolo. - minor documentation patches from Randy. - timeout fix from Tejun. - Alpha "can't write a char atomically" fix from Mikulas. - set of NVMe fixes by way of Keith. - bsg and bsg-lib improvements from Christoph. - a few sed-opal fixes from Jonas. - cdrom check-disk-change deadlock fix from Maurizio. - various little fixes, comment fixes, etc from various folks" * tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits) blk-mq: Directly schedule q->timeout_work when aborting a request blktrace: fix comment in blktrace_api.h lightnvm: remove function name in strings lightnvm: pblk: remove some unnecessary NULL checks lightnvm: pblk: don't recover unwritten lines lightnvm: pblk: implement 2.0 support lightnvm: pblk: implement get log report chunk lightnvm: pblk: rename ppaf* to addrf* lightnvm: pblk: check for supported version lightnvm: implement get log report chunk helpers lightnvm: make address conversions depend on generic device lightnvm: add support for 2.0 address format lightnvm: normalize geometry nomenclature lightnvm: complete geo structure with maxoc* lightnvm: add shorten OCSSD version in geo lightnvm: add minor version to generic geometry lightnvm: simplify geometry structure lightnvm: pblk: refactor init/exit sequences lightnvm: Avoid validation of default op value lightnvm: centralize permission check for lightnvm ioctl ...
2018-03-30lightnvm: implement get log report chunk helpersJavier González1-2/+2
The 2.0 spec provides a report chunk log page that can be retrieved using the stangard nvme get log page. This replaces the dedicated get/put bad block table in 1.2. This patch implements the helper functions to allow targets retrieve the chunk metadata using get log page. It makes nvme_get_log_ext available outside of nvme core so that we can use it form lightnvm. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-30nvme: lightnvm: add late setup of block size and metadataMatias Bjørling1-0/+2
The nvme driver sets up the size of the nvme namespace in two steps. First it initializes the device with standard logical block and metadata sizes, and then sets the correct logical block and metadata size. Due to the OCSSD 2.0 specification relies on the namespace to expose these sizes for correct initialization, let it be updated appropriately on the LightNVM side as well. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: make nvme_get_log_ext non-staticMatias Bjørling1-1/+1
Enable the lightnvm integration to use the nvme_get_log_ext() function. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: Add .stop_ctrl to nvme ctrl opsNitzan Carmi1-0/+2
For consistancy reasons, any fabric-specific works (e.g error recovery/reconnect) should be canceled in nvme_stop_ctrl, as for all other NVMe pending works (e.g. scan, keep alive). The patch aims to simplify the logic of the code, as we now only rely on a vague demand from any fabric to flush its private workqueues at the beginning of .delete_ctrl op. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: Skip checking heads without namespacesKeith Busch1-0/+1
If a task is holding a reference to a namespace on a removed controller, the head will not be released. If the same controller is added again later, its namespaces may not be successfully added. Instead, the user will see kernel message "Duplicate IDs for nsid <X>". This patch fixes that by skipping heads that don't have namespaces when considering if a new namespace is safe to add. Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: centralize ctrl removal printsMax Gurtovoy1-0/+3
nvme_delete_ctrl can be called from various contexts in parallel, and cause duplicated information prints, even though the specific context doesn't perform the actual removal. Instead, print the information when the actual removal occurs. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: implement log page low/high offset and dwordsMatias Bjørling1-10/+22
NVMe 1.2.1 extends the get log page interface to include 64 bit offset and increases the number of dwords to 32 bits. Implement for future use. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: change namespaces_mutext to namespaces_rwsemJianchao Wang1-32/+32
namespaces_mutext is used to synchronize the operations on ctrl namespaces list. Most of the time, it is a read operation. On the other hand, there are many interfaces in nvme core that need this lock, such as nvme_wait_freeze, and even more interfaces will be added. If we use mutex here, circular dependency could be introduced easily. For example: context A context B nvme_xxx nvme_xxx hold namespaces_mutext require namespaces_mutext sync context B So it is better to change it from mutex to rwsem. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: fix the dangerous reference of namespaces listJianchao Wang1-2/+14
nvme_remove_namespaces and nvme_remove_invalid_namespaces reference the ctrl->namespaces list w/o holding namespaces_mutext. It is ok to invoke nvme_ns_remove there, but what if there is others. To be safer, reference the ctrl->namespaces list under namespaces_mutext. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: Add fault injection featureThomas Tai1-0/+2
Linux's fault injection framework provides a systematic way to support error injection via debugfs in the /sys/kernel/debug directory. This patch uses the framework to add error injection to NVMe driver. The fault injection source code is stored in a separate file and only linked if CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected. Once the error injection is enabled, NVME_SC_INVALID_OPCODE with no retry will be injected into the nvme_end_request. Users can change the default status code and no retry flag via debufs. Following example shows how to enable and inject an error. For more examples, refer to Documentation/fault-injection/nvme-fault-injection.txt How to enable nvme fault injection: First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config, recompile the kernel. After booting up the kernel, do the following. How to inject an error: mount /dev/nvme0n1 /mnt echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability cp a.file /mnt Expected Result: cp: cannot stat ‘/mnt/a.file’: Input/output error Message from dmesg: FAULT_INJECTION: forcing a failure. name fault_inject, interval 1, probability 100, space 0, times 1 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: <IRQ> dump_stack+0x5c/0x7d should_fail+0x148/0x170 nvme_should_fail+0x2f/0x50 [nvme_core] nvme_process_cq+0xe7/0x1d0 [nvme] nvme_irq+0x1e/0x40 [nvme] __handle_irq_event_percpu+0x3a/0x190 handle_irq_event_percpu+0x30/0x70 handle_irq_event+0x36/0x60 handle_fasteoi_irq+0x78/0x120 handle_irq+0xa7/0x130 ? tick_irq_enter+0xa8/0xc0 do_IRQ+0x43/0xc0 common_interrupt+0xa2/0xa2 </IRQ> RIP: 0010:native_safe_halt+0x2/0x10 RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480 R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000 ? __sched_text_end+0x4/0x4 default_idle+0x18/0xf0 do_idle+0x150/0x1d0 cpu_startup_entry+0x6f/0x80 start_kernel+0x4c4/0x4e4 ? set_init_arg+0x55/0x55 secondary_startup_64+0xa5/0xb0 print_req_error: I/O error, dev nvme0n1, sector 9240 EXT4-fs error (device nvme0n1): ext4_find_entry:1436: inode #2: comm cp: reading directory lblock 0 Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Signed-off-by: Karl Volz <karl.volz@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: use define instead of magic value for identify sizeMinwoo Im1-2/+3
NVME_IDENTIFY_DATA_SIZE was added to linux/nvme.h by following commit. commit 0add5e8e588c ("nvmet: use NVME_IDENTIFY_DATA_SIZE") Make it use NVME_IDENTIFY_DATA_SIZE define instead of magic value 0x1000 in case of identify data size. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-09block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()Bart Van Assche1-2/+2
This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-07Revert "nvme: create 'slaves' and 'holders' entries for hidden controllers"Christoph Hellwig1-2/+0
This reverts commit e9a48034d7d1318ece7d4a235838a86c94db9d68. The slaves and holders link for the hidden gendisks confuse lsblk so that it errors out on, or doesn't report the nvme multipath devices. Given that we don't need holder relationships for something that can't even be directly accessed we should just stop creating those links. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Potnuri Bharat Teja <bharat@chelsio.com> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-02-28Merge branch 'for-jens' of git://git.infradead.org/nvme into for-linusJens Axboe1-9/+3
Pull NVMe fixes from Keith for 4.16-rc. * 'for-jens' of git://git.infradead.org/nvme: nvmet: fix PSDT field check in command format nvme-multipath: fix sysfs dangerously created links nvme-pci: Fix nvme queue cleanup if IRQ setup fails nvmet-loop: use blk_rq_payload_bytes for sgl selection nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
2018-02-28nvme-multipath: fix sysfs dangerously created linksBaegjae Sung1-9/+3
If multipathing is enabled, each NVMe subsystem creates a head namespace (e.g., nvme0n1) and multiple private namespaces (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for private namespaces, links of head namespace are used, so the namespace creation order must be followed (e.g., nvme0n1 -> nvme0c1n1). If the order is not followed, links of sysfs will be incomplete or kernel panic will occur. The kernel panic was: kernel BUG at fs/sysfs/symlink.c:27! Call Trace: nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core] nvme_validate_ns+0x5c2/0x850 [nvme_core] nvme_scan_work+0x1af/0x2d0 [nvme_core] Correct order Context A Context B nvme0n1 nvme0c0n1 nvme0c1n1 Incorrect order Context A Context B nvme0c1n1 nvme0n1 nvme0c0n1 The nvme_mpath_add_disk (for creating head namespace) is called just before the nvme_mpath_add_disk_links (for creating private namespaces). In nvme_mpath_add_disk, the first context acquires the lock of subsystem and creates a head namespace, and other contexts do nothing by checking GENHD_FL_UP of a head namespace after waiting to acquire the lock. We verified the code with or without multipathing using three vendors of dual-port NVMe SSDs. Signed-off-by: Baegjae Sung <baegjae@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-02-14nvme-rdma: fix sysfs invoked reset_ctrl error flowNitzan Carmi1-1/+5
When reset_controller that is invoked by sysfs fails, it enters an error flow which practically removes the nvme ctrl entirely (similar to delete_ctrl flow). It causes the system to hang, since a sysfs attribute cannot be unregistered by one of its own methods. This can be fixed by calling delete_ctrl as a work rather than sequential code. In addition, it should give the ctrl a chance to recover using reconnection mechanism (consistant with FC reset_ctrl error flow). Also, while we're here, return suitable errno in case the reset ended with non live ctrl. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-14nvme: fix the deadlock in nvme_update_formatsJianchao Wang1-3/+8
nvme_update_formats will invoke nvme_ns_remove under namespaces_mutext. The will cause deadlock because nvme_ns_remove will also require the namespaces_mutext. Fix it by getting the ns entries which should be removed under namespaces_mutext and invoke nvme_ns_remove out of namespaces_mutext. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-12nvme: Don't use a stack buffer for keep-alive commandRoland Dreier1-5/+3
In nvme_keep_alive() we pass a request with a pointer to an NVMe command on the stack into blk_execute_rq_nowait(). However, the block layer doesn't guarantee that the request is fully queued before blk_execute_rq_nowait() returns. If not, and the request is queued after nvme_keep_alive() returns, then we'll end up using stack memory that might have been overwritten to form the NVMe command we pass to hardware. Fix this by keeping a special command struct in the nvme_ctrl struct right next to the delayed work struct used for keep-alives. Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08nvme: Fix discard buffer overrunKeith Busch1-3/+5
This patch checks the discard range array bounds before setting it in case the driver gets a badly formed request. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08nvme: delete NVME_CTRL_LIVE --> NVME_CTRL_CONNECTING transitionMax Gurtovoy1-1/+0
There is no logical reason to move from live state to connecting state. In case of initial connection establishment, the transition should be NVME_CTRL_NEW --> NVME_CTRL_CONNECTING --> NVME_CTRL_LIVE. In case of error recovery or reset, the transition should be NVME_CTRL_LIVE --> NVME_CTRL_RESETTING --> NVME_CTRL_CONNECTING --> NVME_CTRL_LIVE. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08nvme-rdma: use NVME_CTRL_CONNECTING state to mark init processMax Gurtovoy1-0/+1
In order to avoid concurrent error recovery during initialization process (allowed by the NVME_CTRL_NEW --> NVME_CTRL_RESETTING transition) we must mark the ctrl as CONNECTING before initial connection establisment. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08nvme: rename NVME_CTRL_RECONNECTING state to NVME_CTRL_CONNECTINGMax Gurtovoy1-5/+5
In pci transport, this state is used to mark the initialization process. This should be also used in other transports as well. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-01Merge tag 'driver-core-4.16-rc1' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of "big" driver core patches for 4.16-rc1. The majority of the work here is in the firmware subsystem, with reworks to try to attempt to make the code easier to handle in the long run, but no functional change. There's also some tree-wide sysfs attribute fixups with lots of acks from the various subsystem maintainers, as well as a handful of other normal fixes and changes. And finally, some license cleanups for the driver core and sysfs code. All have been in linux-next for a while with no reported issues" * tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (48 commits) device property: Define type of PROPERTY_ENRTY_*() macros device property: Reuse property_entry_free_data() device property: Move property_entry_free_data() upper firmware: Fix up docs referring to FIRMWARE_IN_KERNEL firmware: Drop FIRMWARE_IN_KERNEL Kconfig option USB: serial: keyspan: Drop firmware Kconfig options sysfs: remove DEBUG defines sysfs: use SPDX identifiers drivers: base: add coredump driver ops sysfs: add attribute specification for /sysfs/devices/.../coredump test_firmware: fix missing unlock on error in config_num_requests_store() test_firmware: make local symbol test_fw_config static sysfs: turn WARN() into pr_warn() firmware: Fix a typo in fallback-mechanisms.rst treewide: Use DEVICE_ATTR_WO treewide: Use DEVICE_ATTR_RO treewide: Use DEVICE_ATTR_RW sysfs.h: Use octal permissions component: add debugfs support bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate ...
2018-01-29Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-18/+116
Pull block updates from Jens Axboe: "This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This contains: - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo. - Support for SMR zones for deadline and mq-deadline from Damien and Christoph. - Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly. - Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and the removing rrpc 1.2 in preparation for supporting 2.0. - A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes. - Support for blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping. - Series from Mike Snitzer that enables dm to register its queue more logically, something that's alwways been problematic on dm since it's a stacked device. - Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs. - Various fixes from Ming closing up holes around queue mapping and quiescing. - BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions. - Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it to reused at the wrong time. - null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me. - Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me. - sgl_alloc/free helpers from Bart. - Heavily contended tag case scalability improvement from me. - Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself" * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits) block: remove smart1,2.h nvme: add tracepoint for nvme_complete_rq nvme: add tracepoint for nvme_setup_cmd nvme-pci: introduce RECONNECTING state to mark initializing procedure nvme-rdma: remove redundant boolean for inline_data nvme: don't free uuid pointer before printing it nvme-pci: Suspend queues after deleting them bsg: use pr_debug instead of hand crafted macros blk-mq-debugfs: don't allow write on attributes with seq_operations set nvme-pci: Fix queue double allocations block: Set BIO_TRACE_COMPLETION on new bio during split blk-throttle: use queue_is_rq_based block: Remove kblockd_schedule_delayed_work{,_on}() blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() lib/scatterlist: Fix chaining support in sgl_alloc_order() blk-throttle: track read and write request individually block: add bdev_read_only() checks to common helpers block: fail op_is_write() requests to read-only partitions blk-throttle: export io_serviced_recursive, io_service_bytes_recursive ...
2018-01-26nvme: add tracepoint for nvme_complete_rqJohannes Thumshirn1-0/+2
Add a tracepoint in nvme_complete_rq() for completions of NVMe commands. An expmale output of the trace-point is as follows: <idle>-0 [001] d.h. 3.505266: nvme_complete_rq: cmdid=989, qid=1, res=0, retries=0, flags=0x0, status=0 Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-26nvme: add tracepoint for nvme_setup_cmdJohannes Thumshirn1-0/+7
Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands. Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd(): kworker/u8:0-5 [003] .... 2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0) and trace_nvme_setup_nvm_cmd(): dd-205 [001] .... 3.503929: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=989, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=4096, len=2047, ctrl=0x0, dsmgmt=0, reftag=0) Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-26nvme-pci: introduce RECONNECTING state to mark initializing procedureJianchao Wang1-1/+1
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect), both nvme-fc/rdma have following pattern: RESETTING - quiesce blk-mq queues, teardown and delete queues/ connections, clear out outstanding IO requests... RECONNECTING - establish new queues/connections and some other initializing things. Introduce RECONNECTING to nvme-pci transport to do the same mark. Then we get a coherent state definition among nvme pci/rdma/fc transports. Suggested-by: James Smart <james.smart@broadcom.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15nvme: host delete_work and reset_work on separate workqueuesRoy Shterman1-5/+39
We need to ensure that delete_work will be hosted on a different workqueue than all the works we flush or cancel from it. Otherwise we may hit a circular dependency warning [1]. Also, given that delete_work flushes reset_work, host reset_work on nvme_reset_wq and delete_work on nvme_delete_wq. In addition, fix the flushing in the individual drivers to flush nvme_delete_wq when draining queued deletes. [1]: [ 178.491942] ============================================= [ 178.492718] [ INFO: possible recursive locking detected ] [ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE [ 178.494382] --------------------------------------------- [ 178.495160] kworker/5:1/135 is trying to acquire lock: [ 178.495894] ( [ 178.496120] "nvme-wq" [ 178.496471] ){++++.+} [ 178.496599] , at: [ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0 [ 178.497670] but task is already holding lock: [ 178.498499] ( [ 178.498724] "nvme-wq" [ 178.499074] ){++++.+} [ 178.499202] , at: [ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.500343] other info that might help us debug this: [ 178.501269] Possible unsafe locking scenario: [ 178.502113] CPU0 [ 178.502472] ---- [ 178.502829] lock( [ 178.503115] "nvme-wq" [ 178.503467] ); [ 178.503716] lock( [ 178.504001] "nvme-wq" [ 178.504353] ); [ 178.504601] *** DEADLOCK *** [ 178.505441] May be due to missing lock nesting notation [ 178.506453] 2 locks held by kworker/5:1/135: [ 178.507068] #0: [ 178.507330] ( [ 178.507598] "nvme-wq" [ 178.507726] ){++++.+} [ 178.508079] , at: [ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.509004] #1: [ 178.509265] ( [ 178.509532] (&ctrl->delete_work) [ 178.509795] ){+.+.+.} [ 178.510145] , at: [ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.511070] stack backtrace: : [ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3 [ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp] [ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80 [ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700 [ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e [ 178.518443] Call Trace: [ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2 [ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0 [ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30 [ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0 [ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0 [ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220 [ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0 [ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220 [ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0 [ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40 [ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50 [ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20 [ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core] [ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp] [ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0 [ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0 [ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490 [ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120 [ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40 Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15nvme-pci: serialize pci resetsSagi Grimberg1-1/+2
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-10nvme/multipath: Consult blk_status_t for failoverKeith Busch1-4/+5
This removes nvme multipath's specific status decoding to see if failover is needed, using the generic blk_status_t that was decoded earlier. This abstraction from the raw NVMe status means all status decoding exists in one place. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme: Add more command status translationKeith Busch1-0/+7
This adds more NVMe status code translations to blk_status_t values, and captures all the current status codes NVMe multipath uses. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09treewide: Use DEVICE_ATTR_ROJoe Perches1-5/+5
Convert DEVICE_ATTR uses to DEVICE_ATTR_RO where possible. Done with perl script: $ git grep -w --name-only DEVICE_ATTR | \ xargs perl -i -e 'local $/; while (<>) { s/\bDEVICE_ATTR\s*\(\s*(\w+)\s*,\s*\(?(?:\s*S_IRUGO\s*|\s*0444\s*)\)?\s*,\s*\1_show\s*,\s*NULL\s*\)/DEVICE_ATTR_RO(\1)/g; print;}' Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Robert Jarzmik <robert.jarzmik@free.fr> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-08nvme: fix subsystem multiple controllers support checkIsrael Rukshin1-1/+17
There is a problem when another module (e.g. nvmet) takes a reference on the nvme block device and the physical nvme drive is removed. In that case nvme_free_ctrl() will not be called and the controller state will be "deleting" or "dead" unless nvmet module releases the block device. Later on, the same nvme drive probes back and nvme_init_subsystem() will be called and fail due to duplicate subnqn (if the nvme device doesn't support subsystem with multiple controllers). This will cause a probe failure. This commit changes the check of multiple controllers support at nvme_init_subsystem() by not counting all the controllers at "dead" or "deleting" state (this is safe because controllers at this state will never be active again). Fixes: ab9e00cc72fa ("nvme: track subsystems") Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08nvme: take refcount on transport moduleNitzan Carmi1-3/+14
The block device is backed by the transport so we must ensure that the transport driver will not be removed until all references are released. Otherwise, we might end up referencing freed memory. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08nvme-pci: fix NULL pointer reference in nvme_alloc_nsJianchao Wang1-3/+22
When the io queues setup or tagset allocation failed, ctrl.tagset is NULL. But the scan work will still be queued and executed, then panic comes up due to NULL pointer reference of ctrl.tagset. To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to inidcate only admin queue is live. When non io queues or tagset allocation failed, ctrl enters into this state, scan work will not be started. But async event work and nvme dev ioctl will be still available. This will be helpful to do further investigation and recovery. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08nvme: modify the debug level for setting shutdown timeoutMax Gurtovoy1-1/+1
When an NVMe controller reports RTD3 Entry Latency larger than the value of shutdown_timeout module parameter, we update the shutdown_timeout accordingly to honor RTD3 Entry Latency. Use an informational debug level instead of a warning level for it. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29nvme-mpath: fix last path removal during trafficSagi Grimberg1-0/+1
In case our last path is removed during traffic, we can end up requeueing the bio(s) but never schedule the actual requeue work as upper layers still have open handles on the mpath device node. Fix this by scheduling requeue work if the namespace being removed is the last path in the ns_head path list. Fixes: 32acab3181c7 ("nvme: implement multipath access to nvme subsystems") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29nvme: fix sector units when going between formatsJeff Lien1-1/+5
If you format a device with a 4k sector size back to 512 bytes, the queue limit values for physical block size and minimum IO size were not getting updated; only the logical block size was being updated. This patch adds code to update the physical block and IO minimum sizes. Signed-off-by: Jeff Lien <jeff.lien@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15nvme: setup streams after initializing namespace headKeith Busch1-1/+1
Fixes a NULL pointer dereference. Reported-by: Arnav Dawn <a.dawn@samsung.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>