summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/pci.c
AgeCommit message (Collapse)AuthorFilesLines
2019-05-13nvme-pci: init shadow doorbell after each resetMaxim Levitsky1-2/+1
The spec states: "The settings are not retained across a Controller Level Reset" Therefore the driver must enable the shadow doorbell, after each reset. This was caught while testing the nvme driver over upcoming nvme-mdev device. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01nvme: move command size checks to the coreChristoph Hellwig1-28/+3
Most command aren't PCIe specific, so move the size checking for them to core.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-01nvme-pci: check more command sizesMinwoo Im1-0/+7
All the NVMe command has 64bytes fixed size so that it has been assured with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h also need to be checked here. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01nvme-pci: remove an unneeded variable initializationMinwoo Im1-1/+1
Variable "n" will be assigned once kstrtoint() succeeds, otherwise it will not be referred because kstrtoint() will return an error which means go out from this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01nvme-pci: unquiesce admin queue on shutdownKeith Busch1-1/+4
Just like IO queues, the admin queue also will not be restarted after a controller shutdown. Unquiesce this queue so that we do not block request dispatch on a permanently disabled controller. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01nvme-pci: shutdown on timeout during deletionKeith Busch1-1/+4
We do not restart a controller in a deleting state for timeout errors. When in this state, unblock potential request dispatchers with failed completions by shutting down the controller on timeout detection. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01nvme-pci: fix psdt field for single segment sglsKlaus Birkelund Jensen1-0/+1
The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: tidy up nvme_map_dataChristoph Hellwig1-12/+5
Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: optimize mapping single segment requests using SGLsChristoph Hellwig1-0/+22
If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: optimize mapping of small single segment requestsChristoph Hellwig1-5/+40
If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove the inline scatterlist optimizationChristoph Hellwig1-32/+6
We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_dataChristoph Hellwig1-21/+27
This prepares for some bigger changes to the data mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: do not build a scatterlist to map metadataChristoph Hellwig1-13/+10
We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: only call nvme_unmap_data for requests transferring dataChristoph Hellwig1-1/+2
This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: merge nvme_free_iod into nvme_unmap_dataChristoph Hellwig1-27/+17
This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_dataChristoph Hellwig1-1/+1
Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove nvme_init_iodChristoph Hellwig1-34/+22
nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove unused nvme_iod memberKeith Busch1-2/+0
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: remove q_dmadev from nvme_queueKeith Busch1-5/+3
We don't need to save the dma device as it's not used in the hot path and hasn't in a long time. Shrink the struct nvme_queue removing this unnecessary member. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: use a flag for polled queuesKeith Busch1-20/+13
A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-16Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+2
Pull more block layer changes from Jens Axboe: "This is a collection of both stragglers, and fixes that came in after I finalized the initial pull. This contains: - An MD pull request from Song, with a few minor fixes - Set of NVMe patches via Christoph - Pull request from Konrad, with a few fixes for xen/blkback - pblk fix IO calculation fix (Javier) - Segment calculation fix for pass-through (Ming) - Fallthrough annotation for blkcg (Mathieu)" * tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits) blkcg: annotate implicit fall through nvme-tcp: support C2HData with SUCCESS flag nvmet: ignore EOPNOTSUPP for discard nvme: add proper write zeroes setup for the multipath device nvme: add proper discard setup for the multipath device nvme: remove nvme_ns_config_oncs nvme: disable Write Zeroes for qemu controllers nvmet-fc: bring Disconnect into compliance with FC-NVME spec nvmet-fc: fix issues with targetport assoc_list list walking nvme-fc: reject reconnect if io queue count is reduced to zero nvme-fc: fix numa_node when dev is null nvme-fc: use nr_phys_segments to determine existence of sgl nvme-loop: init nvmet_ctrl fatal_err_work when allocate nvme: update comment to make the code easier to read nvme: put ns_head ref if namespace fails allocation nvme-trace: fix cdw10 buffer overrun nvme: don't warn on block content change effects nvme: add get-feature to admin cmds tracer md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice ...
2019-03-13nvme: disable Write Zeroes for qemu controllersChristoph Hellwig1-1/+2
Qemu started out with a broken implementation of Write Zeroes written by yours truly. Disable Write Zeroes on qemu for now, eventually we need to go back and make all the qemu quirks version specific, but that is left for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-09Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-blockLinus Torvalds1-9/+3
Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...
2019-03-05Merge branch 'irq-core-for-linus' of ↵Linus Torvalds1-78/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The interrupt departement delivers this time: - New infrastructure to manage NMIs on platforms which have a sane NMI delivery, i.e. identifiable NMI vectors instead of a single lump. - Simplification of the interrupt affinity management so drivers don't have to implement ugly loops around the PCI/MSI enablement. - Speedup for interrupt statistics in /proc/stat - Provide a function to retrieve the default irq domain - A new interrupt controller for the Loongson LS1X platform - Affinity support for the SiFive PLIC - Better support for the iMX irqsteer driver - NUMA aware memory allocations for GICv3 - The usual small fixes, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irqchip/imx-irqsteer: Add multi output interrupts support irqchip/imx-irqsteer: Change to use reg_num instead of irq_group dt-bindings: irq: imx-irqsteer: Add multi output interrupts support dt-binding: irq: imx-irqsteer: Use irq number instead of group number irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables irqdomain: Allow the default irq domain to be retrieved irqchip/sifive-plic: Implement irq_set_affinity() for SMP host irqchip/sifive-plic: Differentiate between PLIC handler and context irqchip/sifive-plic: Add warning in plic_init() if handler already present irqchip/sifive-plic: Pre-compute context hart base and enable base PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets genirq/affinity: Remove the leftovers of the original set support nvme-pci: Simplify interrupt allocation genirq/affinity: Add new callback for (re)calculating interrupt sets genirq/affinity: Store interrupt sets size in struct irq_affinity genirq/affinity: Code consolidation irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid. irqchip/i8259: Fix shutdown order by moving syscore_ops registration dt-bindings: interrupt-controller: loongson ls1x intc ...
2019-02-20nvme-pci: convert to SPDX identifiersChristoph Hellwig1-9/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-pci: check kstrtoint() return value in queue_count_set()Bart Van Assche1-0/+2
This patch avoids that the compiler complains about 'ret' being set but not being used when building with W=1. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-18nvme-pci: Simplify interrupt allocationMing Lei1-76/+38
The NVME PCI driver contains a tedious mechanism for interrupt allocation, which is necessary to adjust the number and size of interrupt sets to the maximum available number of interrupts which depends on the underlying PCI capabilities and the available CPU resources. It works around the former short comings of the PCI and core interrupt allocation mechanims in combination with interrupt sets. The PCI interrupt allocation function allows to provide a maximum and a minimum number of interrupts to be allocated and tries to allocate as many as possible. This worked without driver interaction as long as there was only a single set of interrupts to handle. With the addition of support for multiple interrupt sets in the generic affinity spreading logic, which is invoked from the PCI interrupt allocation, the adaptive loop in the PCI interrupt allocation did not work for multiple interrupt sets. The reason is that depending on the total number of interrupts which the PCI allocation adaptive loop tries to allocate in each step, the number and the size of the interrupt sets need to be adapted as well. Due to the way the interrupt sets support was implemented there was no way for the PCI interrupt allocation code or the core affinity spreading mechanism to invoke a driver specific function for adapting the interrupt sets configuration. As a consequence the driver had to implement another adaptive loop around the PCI interrupt allocation function and calling that with maximum and minimum interrupts set to the same value. This ensured that the allocation either succeeded or immediately failed without any attempt to adjust the number of interrupts in the PCI code. The core code now allows drivers to provide a callback to recalculate the number and the size of interrupt sets during PCI interrupt allocation, which in turn allows the PCI interrupt allocation function to be called in the same way as with a single set of interrupts. The PCI code handles the adaptive loop and the interrupt affinity spreading mechanism invokes the driver callback to adapt the interrupt set configuration to the current loop value. This replaces the adaptive loop in the driver completely. Implement the NVME specific callback which adjusts the interrupt sets configuration and remove the adaptive allocation loop. [ tglx: Simplify the callback further and restore the dropped adjustment of number of sets ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
2019-02-18genirq/affinity: Store interrupt sets size in struct irq_affinityMing Lei1-4/+3
The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
2019-02-12nvme-pci: add missing unlock for reset errorKeith Busch1-3/+5
The reset work holds a mutex to prevent races with removal modifying the same resources, but was unlocking only on success. Unlock on failure too. Fixes: 5c959d73dba64 ("nvme-pci: fix rapid add remove sequence") Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-06nvme-pci: fix rapid add remove sequenceKeith Busch1-10/+12
A surprise removal may fail to tear down request queues if it is racing with the initial asynchronous probe. If that happens, the remove path won't see the queue resources to tear down, and the controller reset path may create a new request queue on a removed device, but will not be able to make forward progress, deadlocking the pci removal. Protect setting up non-blocking resources from a shutdown by holding the same mutex, and transition to the CONNECTING state after these resources are initialized so the probe path may see the dead controller state before dispatching new IO. Link: https://bugzilla.kernel.org/show_bug.cgi?id=202081 Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-20Merge tag 'for-linus-20190118' of git://git.kernel.dk/linux-blockLinus Torvalds1-8/+13
Pull block fixes from Jens Axboe: - block size setting fixes for loop/nbd (Jan Kara) - md bio_alloc_mddev() cleanup (Marcos) - Ensure we don't lose the REQ_INTEGRITY flag (Ming) - Two NVMe fixes by way of Christoph: - Fix NVMe IRQ calculation (Ming) - Uninitialized variable in nvmet-tcp (Sagi) - BFQ comment fix (Paolo) - License cleanup for recently added blk-mq-debugfs-zoned (Thomas) * tag 'for-linus-20190118' of git://git.kernel.dk/linux-block: block: Cleanup license notice nvme-pci: fix nvme_setup_irqs() nvmet-tcp: fix uninitialized variable access block: don't lose track of REQ_INTEGRITY flag blockdev: Fix livelocks on loop device nbd: Use set_blocksize() to set device blocksize md: Make bio_alloc_mddev use bio_alloc_bioset block, bfq: fix comments on __bfq_deactivate_entity
2019-01-16nvme-pci: fix nvme_setup_irqs()Ming Lei1-8/+13
When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(), we still try to allocate multiple irq vectors again, so irq queues covers the admin queue actually. But we don't consider that, then number of the allocated irq vector may be same with sum of io_queues[HCTX_TYPE_DEFAULT] and io_queues[HCTX_TYPE_READ], this way is obviously wrong, and finally breaks nvme_pci_map_queues(), and warning from pci_irq_get_affinity() is triggered. IRQ queues should cover admin queues, this patch makes this point explicitely in nvme_calc_io_queues(). We got severl boot failure internal report on aarch64, so please consider to fix it in v4.20. Fixes: 6451fe73fa0f ("nvme: fix irq vs io_queue calculations") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: fin4478 <fin4478@hotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-13Merge tag 'for-linus-20190112' of git://git.kernel.dk/linux-blockLinus Torvalds1-20/+47
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph, with little fixes all over the map - Loop caching fix for offset/bs change (Jaegeuk Kim) - Block documentation tweaks (Jeff, Jon, Weiping, John) - null_blk zoned tweak (John) - ahch mvebu suspend/resume support. Should have gone into the merge window, but there was some confusion on which tree had it. (Miquel) * tag 'for-linus-20190112' of git://git.kernel.dk/linux-block: (22 commits) ata: ahci: mvebu: request PHY suspend/resume for Armada 3700 ata: ahci: mvebu: add Armada 3700 initialization needed for S2RAM ata: ahci: mvebu: do Armada 38x configuration only on relevant SoCs ata: ahci: mvebu: remove stale comment ata: libahci_platform: comply to PHY framework loop: drop caches if offset or block_size are changed block: fix kerneldoc comment for blk_attempt_plug_merge() nvme: don't initlialize ctrl->cntlid twice nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN nvme: pad fake subsys NQN vid and ssvid with zeros nvme-multipath: zero out ANA log buffer nvme-fabrics: unset write/poll queues for discovery controllers nvme-tcp: don't ask if controller is fabrics nvme-tcp: remove dead code nvme-pci: fix out of bounds access in nvme_cqe_pending nvme-pci: rerun irq setup on IO queue init errors nvme-pci: use the same attributes when freeing host_mem_desc_bufs. nvme-pci: fix the wrong setting of nr_maps block: doc: add slice_idle_us to bfq documentation block: clarify documentation for blk_{start|finish}_plug ...
2019-01-09nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQNJames Dingwall1-0/+2
If a device provides an NQN it is expected to be globally unique. Unfortunately some firmware revisions for Intel 760p/Pro 7600p devices did not satisfy this requirement. In these circumstances if a system has >1 affected device then only one device is enabled. If this quirk is enabled then the device supplied subnqn is ignored and we fallback to generating one as if the field was empty. In this case we also suppress the version check so we don't print a warning when the quirk is enabled. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: James Dingwall <james@dingwall.me.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09nvme-pci: fix out of bounds access in nvme_cqe_pendingHongbo Yao1-1/+3
There is an out of bounds array access in nvme_cqe_peding(). When enable irq_thread for nvme interrupt, there is racing between the nvmeq->cq_head updating and reading. nvmeq->cq_head is updated in nvme_update_cq_head(), if nvmeq->cq_head equals nvmeq->q_depth and before its value set to zero, nvme_cqe_pending() uses its value as an array index, the index will be out of bounds. Signed-off-by: Hongbo Yao <yaohongbo@huawei.com> [hch: slight coding style update] Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09nvme-pci: rerun irq setup on IO queue init errorsKeith Busch1-14/+36
If the driver is unable to create a subset of IO queues for any reason, the read/write and polled queue sets will not match the actual allocated hardware contexts. This leaves gaps in the CPU affinity mappings and causes the following kernel panic after blk_mq_map_queue_type() returns a NULL hctx. BUG: unable to handle kernel NULL pointer dereference at 0000000000000198 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 64 PID: 1171 Comm: kworker/u259:1 Not tainted 4.20.0+ #241 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 Workqueue: nvme-wq nvme_scan_work [nvme_core] RIP: 0010:blk_mq_init_allocated_queue+0x2d9/0x440 RSP: 0018:ffffb1bf0abc3cd0 EFLAGS: 00010286 RAX: 000000000000001f RBX: ffff8ea744cf0718 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 000000000000007c RDI: ffffffff9109a820 RBP: ffff8ea7565f7008 R08: 000000000000001f R09: 000000000000003f R10: ffffb1bf0abc3c00 R11: 0000000000000000 R12: 000000000001d008 R13: ffff8ea7565f7008 R14: 000000000000003f R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8ea757200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000198 CR3: 0000000013058000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_init_queue+0x35/0x60 nvme_validate_ns+0xc6/0x7c0 [nvme_core] ? nvme_identify_ctrl.isra.56+0x7e/0xc0 [nvme_core] nvme_scan_work+0xc8/0x340 [nvme_core] ? __wake_up_common+0x6d/0x120 ? try_to_wake_up+0x55/0x410 process_one_work+0x1e9/0x3d0 worker_thread+0x2d/0x3d0 ? process_one_work+0x3d0/0x3d0 kthread+0x111/0x130 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nvme nvme_core serio_raw CR2: 0000000000000198 Fix by re-running the interrupt vector setup from scratch using a reduced count that may be successful until the created queues matches the irq affinity plus polling queue sets. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09nvme-pci: use the same attributes when freeing host_mem_desc_bufs.Liviu Dudau1-4/+6
When using HMB the PCIe host driver allocates host_mem_desc_bufs using dma_alloc_attrs() but frees them using dma_free_coherent(). Use the correct dma_free_attrs() function to free the buffers. Signed-off-by: Liviu Dudau <liviu@dudau.co.uk> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09nvme-pci: fix the wrong setting of nr_mapsJianchao Wang1-1/+0
We only set the nr_maps to 3 if poll queues are supported. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-08cross-tree: phase out dma_zalloc_coherent()Luis Chamberlain1-4/+4
We already need to zero out memory for dma_alloc_coherent(), as such using dma_zalloc_coherent() is superflous. Phase it out. This change was generated with the following Coccinelle SmPL patch: @ replace_dma_zalloc_coherent @ expression dev, size, data, handle, flags; @@ -dma_zalloc_coherent(dev, size, handle, flags) +dma_alloc_coherent(dev, size, handle, flags) Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> [hch: re-ran the script on the latest tree] Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-19nvme-pci: trace SQ status on completionsyupeng1-0/+2
Export the disk name, queue id, sq_head, sq_tail to a trace event in completion handling. Usage example: cd /sys/kernel/debug/tracing/events/nvme/nvme_sq echo 'disk=="nvme1n1"' > filter echo 1 > enable cat /sys/kernel/debug/tracing/trace_pipe Signed-off-by: yupeng <yupeng0921@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> [hch: slight formatting tweaks, use standard nvme tracepoint conventions] Signed-off-by: Christoph Hellwig <hch@lst.de> wip
2018-12-18nvme-pci: refactor nvme_poll_irqdisable to make sparse happyChristoph Hellwig1-6/+6
By duplicating the nvme_process_cq in both branches we keep the sparse lock context checking happy, so do it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-18nvme-pci: only set nr_maps to 2 if poll queues are supportedChristoph Hellwig1-0/+3
The block layer now enables polling support on a queue if nr_maps includes the poll map, so we should only set that if we actually support poll queues. Fixes: 6544d229bf ("block: enable polling by default if a poll map is initalized") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-12-17nvme-pci: don't share queue mapsChristoph Hellwig1-5/+1
Now that the block layer checks if a queue map has any queues inside it there is no more reason to duplicate the maps for the non-default types. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11nvme: fix irq vs io_queue calculationsJens Axboe1-35/+29
Guenter reported an boot hang issue on HPPA after we default to 0 poll queues. We have two issues in the queue count calculations: 1) We don't separate the poll queues from the read/write queues. This is important, since the former doesn't need interrupts. 2) The adjust logic is broken. Adjust the poll queue count before doing nvme_calc_io_queues(). The poll queue count is only limited by the IO queue count we were able to get from the controller, not failures in the IRQ allocation loop. This leaves nvme_calc_io_queues() just adjusting the read/write queue map. Reported-by: Reported-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04block: only allow polling if a poll queue_map existsChristoph Hellwig1-20/+9
This avoids having to have differnet mq_ops for different setups with or without poll queues. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04nvme-pci: remove the CQ lock for interrupt driven queuesChristoph Hellwig1-11/+26
Now that we can't poll regular, interrupt driven I/O queues there is almost nothing that can race with an interrupt. The only possible other contexts polling a CQ are the error handler and queue shutdown, and both are so far off in the slow path that we can simply use the big hammer of disabling interrupts. With that we can stop taking the cq_lock for normal queues. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04nvme-pci: don't poll from irq context when deleting queuesChristoph Hellwig1-8/+19
This is the last place outside of nvme_irq that handles CQEs from interrupt context, and thus is in the way of removing the cq_lock for normal queues, and avoiding lockdep warnings on the poll queues, for which we already take it without IRQ disabling. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04nvme-pci: refactor nvme_disable_io_queuesChristoph Hellwig1-21/+20
Pass the opcode for the delete SQ/CQ command as an argument instead of the somewhat confusing pass loop. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04nvme-pci: consolidate code for polling non-dedicated queuesChristoph Hellwig1-23/+12
We have three places that can poll for I/O completions on a normal interrupt-enabled queue. All of them are in slow path code, so consolidate them to a single helper that uses spin_lock_irqsave and removes the fast path cqe_pending check. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04nvme-pci: only allow polling with separate poll queuesChristoph Hellwig1-13/+5
This will allow us to simplify both the regular NVMe interrupt handler and the upcoming aio poll code. In addition to that the separate queues are generally a good idea for performance reasons. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>