Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull block fixes from Jens Axboe:
- Fix a regression in bdev partition locking (Christoph)
- NVMe pull request from Christoph:
- cancel async events before freeing them (David Milburn)
- revert a broken race fix (James Smart)
- fix command processing during resets (Sagi Grimberg)
- Fix a kyber crash with requeued flushes (Omar)
- Fix __bio_try_merge_page() same_page error for no merging (Ritesh)
* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
block: Set same_page to false in __bio_try_merge_page if ret is false
nvme-fabrics: allow to queue requests for live queues
block: only call sched requeue_request() for scheduled requests
nvme-tcp: cancel async events before freeing event struct
nvme-rdma: cancel async events before freeing event struct
nvme-fc: cancel async events before freeing event struct
nvme: Revert: Fix controller creation races with teardown flow
block: restore a specific error code in bdev_del_partition
|
|
Right now we are failing requests based on the controller state (which
is checked inline in nvmf_check_ready) however we should definitely
accept requests if the queue is live.
When entering controller reset, we transition the controller into
NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
requests (have blk_noretry_request set).
This is also the case for NVME_REQ_USER for the wrong reason. There
shouldn't be any reason for us to reject this I/O in a controller reset.
We do want to prevent passthru commands on the admin queue because we
need the controller to fully initialize first before we let user passthru
admin commands to be issued.
In a non-mpath setup, this means that the requests will simply be
requeued over and over forever not allowing the q_usage_counter to drop
its final reference, causing controller reset to hang if running
concurrently with heavy I/O.
Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Cancel async event work in case async event has been queued up, and
nvme_tcp_submit_async_event() runs after event has been freed.
Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Cancel async event work in case async event has been queued up, and
nvme_rdma_submit_async_event() runs after event has been freed.
Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Cancel async event work in case async event has been queued up, and
nvme_fc_submit_async_event() runs after event has been freed.
Signed-off-by: David Milburn <dmilburn@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The indicated patch introduced a barrier in the sysfs_delete attribute
for the controller that rejects the request if the controller isn't
created. "Created" is defined as at least 1 call to nvme_start_ctrl().
This is problematic in error-injection testing. If an error occurs on
the initial attempt to create an association and the controller enters
reconnect(s) attempts, the admin cannot delete the controller until
either there is a successful association created or ctrl_loss_tmo
times out.
Where this issue is particularly hurtful is when the "admin" is the
nvme-cli, it is performing a connection to a discovery controller, and
it is initiated via auto-connect scripts. With the FC transport, if the
first connection attempt fails, the controller enters a normal reconnect
state but returns control to the cli thread that created the controller.
In this scenario, the cli attempts to read the discovery log via ioctl,
which fails, causing the cli to see it as an empty log and then proceeds
to delete the discovery controller. The delete is rejected and the
controller is left live. If the discovery controller reconnect then
succeeds, there is no action to delete it, and it sits live doing nothing.
Cc: <stable@vger.kernel.org> # v5.7+
Fixes: ce1518139e69 ("nvme: Fix controller creation races with teardown flow")
Signed-off-by: James Smart <james.smart@broadcom.com>
CC: Israel Rukshin <israelr@mellanox.com>
CC: Max Gurtovoy <maxg@mellanox.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Keith Busch <kbusch@kernel.org>
CC: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pull block fixes from Jens Axboe:
"A bit larger than usual this week, mostly due to the NVMe fixes
arriving late for -rc3 and hence didn't make last weeks pull request.
- NVMe:
- instance leak and io boundary fixes from Keith
- fc locking fix from Christophe
- various tcp/rdma reset during traffic fixes from Sagi
- pci use-after-free fix from Tong
- tcp target null deref fix from Ziye
- Locking fix for partition removal (Christoph)
- Ensure bdi->io_pages is always set (me)
- Fixup for hd struct reference (Ming)
- Fix for zero length bvecs (Ming)
- Two small blk-iocost fixes (Tejun)"
* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
block: allow for_each_bvec to support zero len bvec
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
|
|
This patch addresses an irq free warning and null pointer dereference
error problem when nvme devices got timeout error during initialization.
This problem happens when nvme_timeout() function is called while
nvme_reset_work() is still in execution. This patch fixed the problem by
setting flag of the problematic request to NVME_REQ_CANCELLED before
calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
an error code and let nvme_submit_sync_cmd() fail gracefully.
The following is console output.
[ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
[ 62.488796] nvme nvme0: could not set timestamp (881)
[ 62.494888] ------------[ cut here ]------------
[ 62.495142] Trying to free already-free IRQ 11
[ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
[ 62.495742] Modules linked in:
[ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
[ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work
[ 62.497019] RIP: 0010:free_irq+0x1f7/0x370
[ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
[ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
[ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
[ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
[ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
[ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
[ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
[ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 62.501864] Call Trace:
[ 62.501993] pci_free_irq+0x13/0x20
[ 62.502167] nvme_reset_work+0x5d0/0x12a0
[ 62.502369] ? update_load_avg+0x59/0x580
[ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0
[ 62.502780] ? try_to_wake_up+0x1a2/0x450
[ 62.502979] process_one_work+0x1d2/0x390
[ 62.503179] worker_thread+0x45/0x3b0
[ 62.503361] ? process_one_work+0x390/0x390
[ 62.503568] kthread+0xf9/0x130
[ 62.503726] ? kthread_park+0x80/0x80
[ 62.503911] ret_from_fork+0x22/0x30
[ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
[ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
[ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues
[ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 123.917469] #PF: supervisor write access in kernel mode
[ 123.917725] #PF: error_code(0x0002) - not-present page
[ 123.917976] PGD 0 P4D 0
[ 123.918109] Oops: 0002 [#1] SMP PTI
[ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8
[ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
[ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work
[ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 123.924353] Call Trace:
[ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0
[ 123.924694] nvme_reset_work+0xed6/0x12a0
[ 123.924898] process_one_work+0x1d2/0x390
[ 123.925099] worker_thread+0x45/0x3b0
[ 123.925280] ? process_one_work+0x390/0x390
[ 123.925486] kthread+0xf9/0x130
[ 123.925642] ? kthread_park+0x80/0x80
[ 123.925825] ret_from_fork+0x22/0x30
[ 123.926004] Modules linked in:
[ 123.926158] CR2: 0000000000000000
[ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
[ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
[ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
[ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
[ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
[ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
[ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
[ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
[ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
[ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
[ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
[ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
The kernel requires a power of two for boundaries because that's the
only way it can efficiently split commands that cross them. A
controller, however, may report a non-power of two boundary.
The driver had been rounding the controller's value to one the kernel
can use, but splitting on the wrong boundary provides no benefit on the
device side, and incurs additional submission overhead from non-optimal
splits.
Don't provide any boundary hint if the controller's value can't be used
and log a warning when first scanning a disk's unreported IO boundary.
Since the chunk sector logic has grown, move it to a separate function.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
If the driver has to unbind from the controller for an early failure
before the subsystem has been set up, there won't be a subsystem holding
the controller's instance, so the controller needs to free its own
instance in this case.
Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
PCIe controllers do not have fabric opts, verify they exist before
showing ctrl_loss_tmo or reconnect_delay attributes.
Fixes: 764075fdcb2f ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
Reported-by: Tobias Markus <tobias@markus-regensburg.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that times out.
So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care how to
proceed (either schedule a reconnect of remove the controller).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.
However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
an overkill to what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.
Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.
In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.
Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
If the controller becomes unresponsive in the middle of a reset, we will
hang because we are waiting for the freeze to complete, but that cannot
happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that times out.
So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care how to proceed
(either schedule a reconnect of remove the controller).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.
However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
an overkill to what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.
Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.
In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.
Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
Users can detect if the wait has completed or not and take appropriate
actions based on this information (e.g. weather to continue
initialization or rather fail and schedule another initialization
attempt).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
NVME_CTRL_NEW should never see any I/O, because in order to start
initialization it has to transition to NVME_CTRL_CONNECTING and from
there it will never return to this state.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request from Sagi:
- nvme completion rework from Christoph and Chao that mostly came
from a bit of divergence of how we classify errors related to
pathing/retry etc.
- nvmet passthru fixes from Chaitanya
- minor nvmet fixes from Amit and I
- mpath round-robin path selection fix from Martin
- ignore noiob for zoned devices from Keith
- minor nvme-fc fix from Tianjia"
- BFQ cgroup leak fix (Dmitry)
- block layer MAINTAINERS addition (Geert)
- fix null_blk FUA checking (Hou)
- get_max_io_size() size fix (Keith)
- fix block page_is_mergeable() for compound pages (Matthew)
- discard granularity fixes (Ming)
- IO scheduler ordering fix (Ming)
- misc fixes
* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
null_blk: fix passing of REQ_FUA flag in null_handle_rq
nvmet: Disable keep-alive timer when kato is cleared to 0h
nvme: redirect commands on dying queue
nvme: just check the status code type in nvme_is_path_error
nvme: refactor command completion
nvme: rename and document nvme_end_request
nvme: skip noiob for zoned devices
nvme-pci: fix PRP pool size
nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
nvme: Use spin_lock_irq() when taking the ctrl->lock
nvmet: call blk_mq_free_request() directly
nvmet: fix oops in pt cmd execution
nvmet: add ns tear down label for pt-cmd handling
nvme: multipath: round-robin: eliminate "fallback" variable
nvme: multipath: round-robin: fix single non-optimized path case
nvme-fc: Fix wrong return value in __nvme_fc_init_request()
nvmet-passthru: Reject commands with non-sgl flags set
nvmet: fix a memory leak
blkcg: fix memleak for iolatency
MAINTAINERS: Add missing header files to BLOCK LAYER section
...
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
If a command send through nvme-multipath failed on a dying queue, resend it
on another path.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of the completion refactoring]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Check the SCT sub-field for a path related status instead of enumerating
invididual status code. As of NVMe 1.4 this adds "Internal Path Error"
and "Controller Pathing Error" to the list, but it also future proofs for
additional status codes added to the category.
Suggested-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Lift all the code to decide the dispostition of a completed command
from nvme_complete_rq and nvme_failover_req into a new helper, which
returns an emum of the potential actions. nvme_complete_rq then
just switches on those and calls the proper helper for the action.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nvme_end_request is a bit misnamed, as it wraps around the
blk_mq_complete_* API. It's semantics also are non-trivial, so give it
a more descriptive name and add a comment explaining the semantics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Zoned block devices reuse the chunk_sectors queue limit to define zone
boundaries. If a such a device happens to also report an optimal
boundary, do not use that to define the chunk_sectors as that may
intermittently interfere with io splitting and zone size queries.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All operations are based on the controller, not the host page size.
Switch the dma pool to use the controller page size as well to avoid
massive overallocations on large page size systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Recently nvme_dev.q_depth was changed from an int to u16 type.
This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the
result:
root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0000000000000010
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
[0000000000000010] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: nvme nvme_core
CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 #27
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
pc : __sg_alloc_table_from_pages+0xec/0x238
lr : __sg_alloc_table_from_pages+0xc8/0x238
sp : ffff800013ccbad0
x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
x27: 0000000000000000 x26: 0000000000002dc2
x25: 0000000000000dc0 x24: 0000000000000000
x23: 0000000000000000 x22: ffff800013ccbbe8
x21: 0000000000000010 x20: 0000000000000000
x19: 00000000fffff000 x18: ffffffffffffffff
x17: 00000000000000c0 x16: fffffe289eaf6380
x15: ffff800011b59948 x14: ffff002bc8fe98f8
x13: ff00000000000000 x12: ffff8000114ca000
x11: 0000000000000000 x10: ffffffffffffffff
x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0a27b5f9b680 x4 : 0000000000000000
x3 : ffff0a27b5f9b680 x2 : 0000000000000000
x1 : 0000000000000001 x0 : 0000000000000000
Call trace:
__sg_alloc_table_from_pages+0xec/0x238
sg_alloc_table_from_pages+0x18/0x28
iommu_dma_alloc+0x474/0x678
dma_alloc_attrs+0xd8/0xf0
nvme_alloc_queue+0x114/0x160 [nvme]
nvme_reset_work+0xb34/0x14b4 [nvme]
process_one_work+0x1e8/0x360
worker_thread+0x44/0x478
kthread+0x150/0x158
ret_from_fork+0x10/0x34
Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
---[ end trace 89bb2b72d59bf925 ]---
Fix by making onto a u32.
Also use u32 for nvme_dev.q_depth, as we assign this value from
nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
avoids the same crash as above.
Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth")
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
dead lock. The new spin_lock() calls recently added produce the
following lockdep warning when running the blktest nvme/003:
================================
WARNING: inconsistent lock state
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x164/0x500
_raw_spin_lock+0x28/0x40
nvme_get_effects_log+0x37/0x1c0
nvme_init_identify+0x9e4/0x14f0
nvme_reset_work+0xadd/0x2360
process_one_work+0x66b/0xb70
worker_thread+0x6e/0x6c0
kthread+0x1e7/0x210
ret_from_fork+0x22/0x30
irq event stamp: 1449221
hardirqs last enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
softirqs last enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&ctrl->lock);
<Interrupt>
lock(&ctrl->lock);
*** DEADLOCK ***
no locks held by ksoftirqd/2/22.
stack backtrace:
CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xc8/0x11a
print_usage_bug.cold.63+0x235/0x23e
mark_lock+0xa9c/0xcf0
__lock_acquire+0xd9a/0x2b50
lock_acquire+0x164/0x500
_raw_spin_lock_irqsave+0x40/0x60
nvme_keep_alive_end_io+0x50/0xc0
blk_mq_end_request+0x158/0x210
nvme_complete_rq+0x146/0x500
nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
blk_done_softirq+0x187/0x1e0
__do_softirq+0x118/0x595
run_ksoftirqd+0x35/0x50
smpboot_thread_fn+0x1d3/0x310
kthread+0x1e7/0x210
ret_from_fork+0x22/0x30
Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we find an optimized path, we quit the loop immediately. Thus we can use
just one variable for the next path, slighly simplifying the code.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If there's only one usable, non-optimized path, nvme_round_robin_path()
returns NULL, which is wrong. Fix it by falling back to "old", like in
the single optimized path case. Also, if the active path isn't changed,
there's no need to re-assign the pointer.
Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On an error exit path, a negative error code should be returned
instead of a positive return value.
Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
Cc: James Smart <jsmart2021@gmail.com>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block stacking updates from Jens Axboe:
"The stacking related fixes depended on both the core block and drivers
branches, so here's a topic branch with that change.
Outside of that, a late fix from Johannes for zone revalidation"
* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
block: don't do revalidate zones on invalid devices
block: remove blk_queue_stack_limits
block: remove bdev_stack_limits
block: inherit the zoned characteristics in blk_stack_limits
|
|
Pull block driver updates from Jens Axboe:
- NVMe:
- ZNS support (Aravind, Keith, Matias, Niklas)
- Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
Dongli, Max, Sagi)
- null_blk zone capacity support (Aravind)
- MD:
- raid5/6 fixes (ChangSyun)
- Warning fixes (Damien)
- raid5 stripe fixes (Guoqing, Song, Yufen)
- sysfs deadlock fix (Junxiao)
- raid10 deadlock fix (Vitaly)
- struct_size conversions (Gustavo)
- Set of bcache updates/fixes (Coly)
* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
md/raid5: Allow degraded raid6 to do rmw
md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
raid5: don't duplicate code for different paths in handle_stripe
raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
md: print errno in super_written
md/raid5: remove the redundant setting of STRIPE_HANDLE
md: register new md sysfs file 'uuid' read-only
md: fix max sectors calculation for super 1.0
nvme-loop: remove extra variable in create ctrl
nvme-loop: set ctrl state connecting after init
nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
nvme-multipath: fix logic for non-optimized paths
nvme-rdma: fix controller reset hang during traffic
nvme-tcp: fix controller reset hang during traffic
nvmet: introduce the passthru Kconfig option
nvmet: introduce the passthru configfs interface
nvmet: Add passthru enable/disable helpers
nvmet: add passthru code to process commands
nvme: export nvme_find_get_ns() and nvme_put_ns()
nvme: introduce nvme_ctrl_get_by_path()
...
|
|
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
|
|
Add a quirk for a device that does not support the Identify Namespace
Identification Descriptor list despite claiming 1.3 compliance.
Fixes: ea43d9709f72 ("nvme: fix identify error status silent ignore")
Reported-by: Ingo Brunberg <ingo_brunberg@web.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ingo Brunberg <ingo_brunberg@web.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
When nvme_round_robin_path() finds a valid namespace we should be using it;
falling back to __nvme_find_path() for non-optimized paths will cause the
result from nvme_round_robin_path() to be ignored for non-optimized paths.
Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Handle the special case where we have exactly one optimized path,
which we should keep using in this case.
Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy")
Signed off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
exposed an issue where we may hang trying to wait for queue freeze
during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple
queue maps (which we have now for default/read/poll) is attempting to
freeze the queue. However we never started queue freeze when starting the
reset, which means that we have inflight pending requests that entered the
queue that we will not complete once the queue is quiesced.
So start a freeze before we quiesce the queue, and unfreeze the queue
after we successfully connected the I/O queues (and make sure to call
blk_mq_update_nr_hw_queues only after we are sure that the queue was
already frozen).
This follows to how the pci driver handles resets.
Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
commit fe35ec58f0d3 ("block: update hctx map when use multiple maps")
exposed an issue where we may hang trying to wait for queue freeze
during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple
queue maps (which we have now for default/read/poll) is attempting to
freeze the queue. However we never started queue freeze when starting the
reset, which means that we have inflight pending requests that entered the
queue that we will not complete once the queue is quiesced.
So start a freeze before we quiesce the queue, and unfreeze the queue
after we successfully connected the I/O queues (and make sure to call
blk_mq_update_nr_hw_queues only after we are sure that the queue was
already frozen).
This follows to how the pci driver handles resets.
Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme_find_get_ns() and nvme_put_ns() are required by the target passthru
code and are exported under the NVME_TARGET_PASSTHRU namespace.
Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it
gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
It makes use of filp_open() to open the file and uses the private
data to obtain a pointer to the struct nvme_ctrl. If the fops of the
file do not match, -EINVAL is returned.
The purpose of this function is to support NVMe-OF target passthru
and is exported under the NVME_TARGET_PASSTHRU namespace.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Introduce a new nvme_execute_passthru_rq() helper which calls
nvme_passthru_[start|end]() around blk_execute_rq(). This ensures
all passthru calls (including nvme_submit_io()) will be wrapped
appropriately.
nvme_execute_passthru_rq() will also be useful for the nvmet passthru
code and is exported in the NVME_TARGET_PASSTHRU namespace.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Separate the code to obtain command effects from the code
to start a passthru request and move the nvme_passthru_start() and
nvme_passthru_end() functions up above nvme_submit_user_cmd() in order
that they may be used in a new helper a subsequent patch.
The new helper function will be necessary for nvmet passthru
code to determine if we need to change out of interrupt context
to handle the effects. It is exported in the NVME_TARGET_PASSTHRU
namespace.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The host driver should decide whether to use SGLs or PRPs and they
currently assume the flags are cleared after the call to
nvme_setup_cmd(). However, passed-through commands may erroneously
set these bits; so clear them for all cases.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Currently the FC transport is set max_hw_sectors based on the lldds
max sgl segment count. However, the block queue max segments is
set based on the controller's max_segments count, which the transport
does not set. As such, the lldd is receiving sgl lists that are
exceeding its max segment count.
Set the controller max segment count and derive max_hw_sectors from
the max segment count.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Stay consistent with the rest of the driver
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0
is an optimized path to it, path nvme1 happens to be
inaccessible.
2) Before scan_work is complete nvme0 disconnect is initiated
nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
3) scan_work(1) attempts to submit IO,
but nvme_path_is_optimized() observes nvme0 is not LIVE.
Since nvme1 is a possible path IO is requeued and scan_work hangs.
--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
--
4) Delete also hangs in flush_work(ctrl->scan_work)
from nvme_remove_namespaces().
Similiarly a deadlock with ana_work may happen: if ana_work has started
and calls nvme_mpath_set_live and device_add_disk, it will
trigger I/O. When we trigger disconnect I/O will block because
our accessible (optimized) path is disconnecting, but the alternate
path is inaccessible, so I/O blocks. Then disconnect tries to flush
the ana_work and hangs.
[ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[ 605.552087] Call Trace:
[ 605.552683] __schedule+0x2b9/0x6c0
[ 605.553507] schedule+0x42/0xb0
[ 605.554201] io_schedule+0x16/0x40
[ 605.555012] do_read_cache_page+0x438/0x830
[ 605.556925] read_cache_page+0x12/0x20
[ 605.557757] read_dev_sector+0x27/0xc0
[ 605.558587] amiga_partition+0x4d/0x4c5
[ 605.561278] check_partition+0x154/0x244
[ 605.562138] rescan_partitions+0xae/0x280
[ 605.563076] __blkdev_get+0x40f/0x560
[ 605.563830] blkdev_get+0x3d/0x140
[ 605.564500] __device_add_disk+0x388/0x480
[ 605.565316] device_add_disk+0x13/0x20
[ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core]
[ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core]
[ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core]
[ 605.573330] process_one_work+0x1db/0x380
[ 605.574144] worker_thread+0x4d/0x400
[ 605.574896] kthread+0x104/0x140
[ 605.577205] ret_from_fork+0x35/0x40
[ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830
[ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.582320] nvme D 0 14044 14043 0x00000000
[ 605.583424] Call Trace:
[ 605.583935] __schedule+0x2b9/0x6c0
[ 605.584625] schedule+0x42/0xb0
[ 605.585290] schedule_timeout+0x203/0x2f0
[ 605.588493] wait_for_completion+0xb1/0x120
[ 605.590066] __flush_work+0x123/0x1d0
[ 605.591758] __cancel_work_timer+0x10e/0x190
[ 605.593542] cancel_work_sync+0x10/0x20
[ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core]
[ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core]
[ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[ 605.598320] dev_attr_store+0x17/0x30
Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
indicate the phase of controller deletion where I/O cannot be allowed
to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
be issued to the bottom device, and only after we flush the ana_work
and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
from re-firing by aborting early if we are not LIVE, so we should be safe
here.
In addition, change the transport drivers to follow the updated state
machine.
Fixes: 0d0b660f214d ("nvme: add ANA support")
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We are starting to see some non-trivial states
so lets start documenting them.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Has the driver use shared CQs providing ~10%-20% improvement as seen in
the patch introducing shared CQs. Instead of opening a CQ for each QP
per controller connected, a CQ for each QP will be provided by the RDMA
core driver that will be shared between the QPs on that core reducing
interrupt overhead.
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|