path: root/drivers/block
Age | Commit message | Author | Files | Lines
2014-05-01 | block: null_blk: fix use after free | Ming Lei | 1 | -1/+1

The entry (cmd->ll_list) may belong to a new request once end_cmd() returns, so fix the bug with this patch. Without the change, it is easy to observe an oops when running the null_blk (timer) test.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
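The underlying pattern, as a minimal sketch (a generic llist walk, not the exact patch): read the next pointer before completing the command, because completion may recycle the node it is embedded in.

    /* advance the cursor first; end_cmd() may hand cmd->ll_list
     * to a new request, freeing/reusing the node under us */
    do {
        cmd = container_of(entry, struct nullb_cmd, ll_list);
        entry = entry->next;
        end_cmd(cmd);
    } while (entry);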
2014-04-30drbd: use list_first_entry_or_null in first_peer_device/first_connectionLars Ellenberg1-2/+2
If there are no peer_devices or connections, I'd rather have NULL than some "arbitrary" address pretending to point to a struct. Helps to avoid hard to debug symptoms, in case we ever try to use and dereference a drbd_connection or drbd_peer_device where we in fact don't have any connection at all. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
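For reference, list_first_entry_or_null() evaluates to NULL on an empty list, so the helper can look like this sketch (list head and member names assumed):

    static inline struct drbd_peer_device *
    first_peer_device(struct drbd_device *device)
    {
        /* NULL instead of a pointer computed from the empty list head */
        return list_first_entry_or_null(&device->peer_devices,
                                        struct drbd_peer_device,
                                        peer_devices);
    }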
2014-04-30 | drbd: Allow attaching of a newly created device to any backing device | Philipp Reisner | 1 | -1/+1

A newly created device was never exposed before, i.e. it has an exposed_data_uuid of 0. Then it is valid to attach to any current_uuid of a backing device (of course also to that of a newly created one).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Test cstate while holding req_lock | Philipp Reisner | 1 | -1/+2

In case a connection transitions into C_TIMEOUT within the timer function (request_timer_fn()), we need to make sure that the receiver thread (potentially running on a different CPU) sees the updated cstate later on.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: use blk_set_stacking_limits() | Philipp Reisner | 1 | -6/+6

...instead of directly assigning to q->limits.discard_zeroes_data.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: evaluate disk and network timeout on different requests | Lars Ellenberg | 1 | -14/+33

Just because it is the oldest not yet completed request does not make it the oldest request waiting for disk, or waiting for the peer. And we completely missed already completed requests that would still hold references to activity log extents, waiting only for the barrier ack.

Find the two oldest not yet completely processed requests: one that is still waiting for local completion, and one that is still waiting for some response from the peer. These may or may not be the same request object. Then apply the disk and network timeouts to them separately.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Fix a hole in the challenge-response connection authentication | Philipp Reisner | 1 | -0/+12

In the implementation as it was, the two peers sent each other a challenge and expected the challenge hashed with the shared secret back. An attacker could simply wait for the challenge of the peer and send the same challenge back. Then it waits for the response and sends the same response back.

Prevent this by not accepting a challenge from the peer that is the same as the challenge sent to the peer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
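As a hedged sketch of the guard (buffer and packet-info names assumed, not the exact patch), the handshake simply refuses to answer a challenge that mirrors its own:

    /* sketch: a reflected challenge means a replaying man in the middle */
    if (pi.size == CHALLENGE_LEN &&
        !memcmp(peers_ch, my_challenge, CHALLENGE_LEN))
        goto fail;  /* never hash our own challenge back to the peer */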
2014-04-30 | drbd: always implicitly close last epoch when idle | Lars Ellenberg | 1 | -33/+15

Once our sender thread needs to wait_for_work(), and actually needs to schedule(), just before we do that, we already check if it is useful to implicitly close the last epoch.

The condition was too strict: only implicitly close the epoch if there have been no new (write) requests at all. The assumption was that if there were new requests, they would always be communicated one way or another, and would send the necessary epoch separating barriers explicitly.

This is not always true, e.g. when becoming diskless, or while explicitly starting a full resync. The last communicated epoch could stay open for a long time, locking down corresponding activity log extents.

It is safe to always implicitly send that last barrier, as soon as we determine that there cannot be more requests in the last communicated epoch, even if there have been (uncommunicated) new requests in new epochs meanwhile.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: add back some fairness to AL transactions | Lars Ellenberg | 3 | -2/+22

When batching more updates to the activity log into single transactions, we lost the ability for new requests to force themselves into the active set: all preparation steps became non-blocking, and if all currently hot extents keep busy, they could starve out new incoming requests to cold extents for quite a while.

This can only happen if your IO backend accepts more IO operations per average DRBD replication round trip time than you have al-extents configured.

If we have incoming requests to cold extents, at least do one blocking update per transaction.

In an artificial worst-case workload on SSD with an asynchronous 600 ms replication link, with al-extents = 7 (the minimum we allow), and a concurrent full resync, without this patch some write requests have been observed to be starved for 40 seconds. With this patch, the application observed a worst-case latency of twice the replication round trip time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: keep max-bio size during detach/attach on disconnected primary | Lars Ellenberg | 1 | -1/+7

We want to store in persistent meta data what the peer DRBD can handle, which, due to spreading requests over multiple bios, may be more than its backing device can handle. Otherwise, if a disconnected Primary temporarily loses access to its local data as well, we may accidentally shrink the max-bio setting, potentially causing already assembled, but not yet processed, application bios to be spuriously failed due to device limits.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: fix a race between start_resync and send_and_submit | Lars Ellenberg | 1 | -3/+8

In the drbd make request function, specifically in drbd_send_and_submit(), we decide whether we want to send the actual write request or only a "set this block out of sync" information. We do so based on the current connection state, while holding the req_lock. The connection state is not supposed to change while holding the req_lock.

But in drbd_start_resync, we did change that state anyway, while only holding the global_state_lock, which is enough to change sync-after dependencies (paused vs active resync), but not good enough to change the connection state.

Fix: in drbd_start_resync, first grab the req_lock to serialize with drbd_send_and_submit(), before grabbing the global_state_lock to be able to evaluate the sync-after dependencies.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
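The resulting lock ordering, sketched with the lock names from the message (struct member paths assumed, surrounding code elided):

    /* drbd_start_resync(): serialize with drbd_send_and_submit() first */
    spin_lock_irq(&device->resource->req_lock);
    write_lock(&global_state_lock);   /* then evaluate sync-after deps */
    /* ... perform the connection state change ... */
    write_unlock(&global_state_lock);
    spin_unlock_irq(&device->resource->req_lock);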
2014-04-30 | drbd: Enable QUEUE_FLAG_DISCARD only if the peer can receive P_TRIM | Lars Ellenberg | 4 | -4/+37

Allow the use of REQ_DISCARD.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: prepare sending side for REQ_DISCARD | Lars Ellenberg | 4 | -4/+30

Note that I do NOT call __drbd_chk_io_error for a failed REQ_DISCARD. That may be wrong, though, or may need to differ between EOPNOTSUPP and other errors...

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: prepare receiving side for REQ_DISCARD | Lars Ellenberg | 5 | -26/+100

If the receiver needs to serve a discard request on a queue that does not announce itself as discard capable, it falls back to a synchronous blkdev_issue_zeroout(). We expect only "reasonably" large (up to one activity log extent?) discard requests. We do this so as not to block the receiver for too long in this fallback code path, and so as not to set/clear too many bits inside one spin_lock_irqsave() in drbd_set_in_sync/drbd_set_out_of_sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
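A sketch of that fallback against the 2014-era block API (error handling condensed):

    /* receiver side: emulate discard by zeroing when the backing
     * queue is not discard capable */
    if (blk_queue_discard(bdev_get_queue(bdev)))
        err = blkdev_issue_discard(bdev, sector, nr_sectors,
                                   GFP_NOIO, 0);
    else
        err = blkdev_issue_zeroout(bdev, sector, nr_sectors,
                                   GFP_NOIO);  /* synchronous */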
2014-04-30 | drbd: allow parallel promote/demote actions | Lars Ellenberg | 3 | -14/+88

We plan to use genl_family->parallel_ops = true in the future, but need to review all possible interactions first. For now, only selectively drop genl_lock() in drbd_set_role(), instead serializing on our own internal resource->conf_update mutex.

We now can be promoted/demoted on many resources in parallel, which may significantly improve cluster failover times when fencing is required.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: prepare for genetlink parallel_ops | Lars Ellenberg | 3 | -175/+198

Because all administrative requests via genetlink have been globally serialized via genl_lock(), we used to have one static struct drbd_config_context "admin context". Move this on-stack to the respective callback functions.

This will allow us to selectively drop the genl_lock() (or use genl_family->parallel_ops) in the future.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Do not BUG() when connection breaks in a special way | Philipp Reisner | 1 | -7/+7

When a 'cluster wide' disconnect executes, the result comes back from the peer, and immediately after that the connection breaks; then _conn_rq_cond() reported back SS_CW_SUCCESS. Therefore _conn_request_state() calls conn_set_state(), which has a BUG() in it. The BUG() is hit because conn_is_valid_transition() does not like the transition, which goes back to is_valid_soft_transition() returning SS_OUTDATE_WO_CONN.

The fix is to consider an error reported by is_valid_soft_transition() even when the peer agreed to the transition.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: don't let application IO pre-empt resync too often | Lars Ellenberg | 3 | -29/+34

Before, application IO could pre-empt resync activity for up to a hardcoded 20 seconds per resync request. A very busy server could throttle the effective resync bandwidth down to one request per 20 seconds.

Now, we only let application IO pre-empt resync traffic while the current resync rate estimate is above c-min-rate.

If you disable the c-min-rate throttle feature (set c-min-rate = 0), application IO will no longer pre-empt resync traffic at all.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: fix potential distributed deadlock during verify or resync | Lars Ellenberg | 2 | -15/+29

If max-buffers and the socket buffer sizes are "too small" for the chosen resync rate, this could potentially lead to a distributed deadlock, which may or may not resolve itself via the "ko-count" and request timeout mechanism, or could be resolved by a forced disconnect.

One option to deal with this is proper configuration: use larger max-buffers and socket buffer settings, or reduce the resync rate. But even with bad configuration we should not deadlock, but "gracefully" recover.

The issue is avoided by using only up to max-buffers/2 for resync requests, and by using max-buffers not as a hard limit for data buffer allocations, but as a throttle threshold only.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: resync: fix too large bursts for very slow rates | Lars Ellenberg | 1 | -1/+1

While merging adjacent dirty blocks into resync requests, the resync rate throttle was disregarded. For very low resync rates, the effective rate may have exceeded the intended rate by a large margin.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: fix stalled resync detection in /proc/drbd | Lars Ellenberg | 1 | -1/+1

If we don't make resync or verify progress for "too long", we want to flag it as "stalled". Since the 2010 change "use rolling marks for resync speed calculation", this "too long" was wrong by a factor of HZ. With HZ=250, it would have been flagged as stalled only after 100 minutes. Hardcode 3 minutes instead.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
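The shape of a fixed wall-clock stall check, as a sketch (identifiers hypothetical):

    /* stalled: no progress mark has moved for 3 minutes of real time */
    stalled = time_after(jiffies, last_mark_jiffies + 3 * 60 * HZ);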
2014-04-30 | drbd: Allow online layout change of AL while peer is not connected | Philipp Reisner | 1 | -1/+1

If a user forces the operation, he takes the blame in case the peer does not have enough space. No reason to deny this...

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Remove drbd_wrappers.h | Philipp Reisner | 6 | -58/+35

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Leave IO suspended if the fence handler finds the peer primary | Philipp Reisner | 2 | -11/+21

We now clear the susp_fen flag only if we are not going to call a fencing handler, since setting the susp_fen flag needs to be edge-triggered, not level-triggered.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-30 | drbd: Break a deadlock while concurrent fencing and establishing a connection | Philipp Reisner | 1 | -10/+13

When we need to outdate the peer while being promoted to primary, and the connection gets established at the same time, we deadlock in drbd_try_outdate_peer() when trying to clear the susp_fen bit.

Fix this by setting the STATE_SENT bit while holding the mutex.

Using drbd_change_state(..., CS_HARD, ...), which does not block until STATE_SENT is cleared, is only for clarity; it does not contribute anything to the fix.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-23 | mtip32xx: Fix ERO and NoSnoop values in PCIe upstream on AMD systems | Asai Thambi S P | 1 | -0/+53

A hardware quirk in P320h/P420m interferes with PCIe transactions on some AMD chipsets, making P320h/P420m unusable. This workaround disables the ERO and NoSnoop bits in the parent and root complex for normal functioning of these devices.

NOTE: This workaround is specific to AMD chipsets with a PCIe upstream device with device id 0x5aXX.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-23 | mtip32xx: Remove dfs_parent after pci unregister | Asai Thambi S P | 1 | -2/+2

In module exit, dfs_parent and its subtree were removed before unregistering with pci. When the removal of each device's debugfs entry is attempted in the pci_remove() context, the entries no longer exist, as dfs_parent and its children were already ripped apart. Modified to first unregister with pci and then remove dfs_parent.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-23 | mtip32xx: Increase timeout for STANDBY IMMEDIATE command | Asai Thambi S P | 1 | -31/+35

Increased the timeout for the STANDBY IMMEDIATE command to 2 minutes.

Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-22 | cciss: Use pci_enable_msix_exact() instead of pci_enable_msix() | Alexander Gordeev | 1 | -5/+1

As a result of the deprecation of the MSI-X/MSI enablement functions pci_enable_msix() and pci_enable_msi_block(), all drivers using these two interfaces need to be updated to use the new pci_enable_msi_range() or pci_enable_msi_exact() and pci_enable_msix_range() or pci_enable_msix_exact() interfaces.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: iss_storagedev@hp.com
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
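The shape of such a conversion, sketched with assumed driver-local names; pci_enable_msix_exact() returns 0 only when exactly the requested number of vectors is granted:

    /* succeed only if exactly 4 MSI-X vectors can be allocated */
    err = pci_enable_msix_exact(h->pdev, entries, 4);
    if (err < 0)
        goto fallback_to_intx;  /* hypothetical legacy-interrupt path */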
2014-04-22 | skd: Use pci_enable_msix_exact() instead of pci_enable_msix_range() | Alexander Gordeev | 1 | -4/+3

Function pci_enable_msix_exact() is a variation of pci_enable_msix_range() that allows a device driver to request a particular number of MSI-X interrupts, rather than any number within a specified range.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-17 | blk-mq: add async parameter to blk_mq_start_stopped_hw_queues | Christoph Hellwig | 1 | -2/+2

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-04-16 | blk-mq: split out tag initialization, support shared tags | Christoph Hellwig | 2 | -54/+86

Add a new blk_mq_tag_set structure that gets set up before we initialize the queue. A single blk_mq_tag_set structure can be shared by multiple queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modular export of blk_mq_{alloc,free}_tagset added by me.

Signed-off-by: Jens Axboe <axboe@fb.com>
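A hedged sketch of the resulting setup sequence (ops and sizes illustrative, not a specific driver):

    static struct blk_mq_tag_set set;

    set.ops          = &my_mq_ops;            /* driver's blk_mq_ops */
    set.nr_hw_queues = 1;
    set.queue_depth  = 64;
    set.numa_node    = NUMA_NO_NODE;
    set.cmd_size     = sizeof(struct my_cmd); /* per-request driver data */
    set.flags        = BLK_MQ_F_SHOULD_MERGE;

    if (blk_mq_alloc_tag_set(&set))
        return -ENOMEM;
    q = blk_mq_init_queue(&set);  /* several queues may share one set */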
2014-04-16 | blk-mq: add ->init_request and ->exit_request methods | Christoph Hellwig | 1 | -12/+11

The current blk_mq_init_commands/blk_mq_free_commands interface has two problems:

1) Because only the constructor is passed to blk_mq_init_commands, there is no easy way to clean up when a command initialization failed. The current code simply leaks the allocations done in the constructor.

2) There is no good place to call blk_mq_free_commands: before blk_cleanup_queue there is no guarantee that all outstanding commands have completed, so we can't free them yet. After blk_cleanup_queue the queue has usually been freed. This can be worked around by grabbing an unconditional reference before calling blk_cleanup_queue and dropping it after blk_mq_free_commands is done, although that's not exactly pretty and driver writers are guaranteed to get it wrong sooner or later.

Both issues are easily fixed by making the request constructor and destructor normal blk_mq_ops methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
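Sketched against the blk_mq_ops of that era (driver names assumed), the constructor becomes a method whose failure the core can unwind:

    static int my_init_request(void *data, struct request *rq,
                               unsigned int hctx_idx,
                               unsigned int request_idx,
                               unsigned int numa_node)
    {
        struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

        /* a nonzero return is now cleaned up by the core, not leaked */
        return my_cmd_setup(cmd, numa_node);
    }

    static struct blk_mq_ops my_mq_ops = {
        .queue_rq     = my_queue_rq,
        .map_queue    = blk_mq_map_queue,
        .init_request = my_init_request,
        .exit_request = my_exit_request,
    };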
2014-04-16 | blk-mq: do not initialize req->special | Christoph Hellwig | 2 | -5/+5

Drivers can reach their private data easily using the blk_mq_rq_to_pdu helper and don't need req->special. By not initializing it, code can be simplified nicely, and we also shave off a few more instructions from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
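For reference, the helper resolves to the memory placed directly behind the request, so no extra pointer needs to be stored:

    /* per-request driver data lives at (void *)rq + sizeof(*rq) */
    struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);  /* my_cmd is illustrative */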
2014-04-16 | block: remove struct request buffer member | Jens Axboe | 14 | -40/+41

This was used in the olden days, back when onions were proper yellow. Basically it mapped to the current buffer to be transferred. With highmem being added more than a decade ago, most drivers map pages out of a bio, and rq->buffer isn't pointing at anything valid.

Convert old style drivers to just use bio_data(). For the discard payload use case, just reference the page in the bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
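The typical conversion in an old-style driver, as a sketch (destination buffer assumed):

    /* before: rq->buffer no longer points at anything valid */
    memcpy(dst, rq->buffer, blk_rq_cur_bytes(rq));
    /* after: take the data pointer from the bio itself */
    memcpy(dst, bio_data(rq->bio), blk_rq_cur_bytes(rq));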
2014-04-13 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 2 | -40/+20

Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencies between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 | Merge git://git.infradead.org/users/willy/linux-nvme | Linus Torvalds | 2 | -236/+491

Pull NVMe driver updates from Matthew Wilcox:
 "Various updates to the NVMe driver. The most user-visible change is that drive hotplugging now works and CPU hotplug while an NVMe drive is installed should also work better"

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Retry failed commands with non-fatal errors
  NVMe: Add getgeo to block ops
  NVMe: Start-stop nvme_thread during device add-remove.
  NVMe: Make I/O timeout a module parameter
  NVMe: CPU hot plug notification
  NVMe: per-cpu io queues
  NVMe: Replace DEFINE_PCI_DEVICE_TABLE
  NVMe: Fix divide-by-zero in nvme_trans_io_get_num_cmds
  NVMe: IOCTL path RCU protect queue access
  NVMe: RCU protected access to io queues
  NVMe: Initialize device reference count earlier
  NVMe: Add CONFIG_PM_SLEEP to suspend/resume functions
2014-04-11 | NVMe: Retry failed commands with non-fatal errors | Keith Busch | 2 | -94/+151

For commands returned with a failed status, queue these for resubmission and continue retrying them until success or for a limited amount of time. The final timeout was arbitrarily chosen so requests can't be retried indefinitely.

Since these are requeued on the nvmeq that submitted the command, the callbacks have to take an nvmeq instead of an nvme_dev as a parameter so that we can use the locked queue to append the iod to retry later.

The nvme_iod conveniently can be used to track how long we've been trying to successfully complete an iod request. The nvme_iod also provides the nvme prp dma mappings, so I had to move a few things around so we can keep those mappings.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-11 | NVMe: Add getgeo to block ops | Keith Busch | 1 | -0/+11

Some programs require HDIO_GETGEO to work, which requires that we implement getgeo.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
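A getgeo implementation of this kind usually synthesizes a geometry; a sketch (CHS values illustrative, not necessarily the merged code):

    static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
    {
        /* fabricated CHS values; only the capacity really matters */
        geo->heads = 1 << 6;
        geo->sectors = 1 << 5;
        geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
        return 0;
    }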
2014-04-11 | NVMe: Start-stop nvme_thread during device add-remove | Dan McLeran | 1 | -14/+42

Done to ensure nvme_thread is not running when there are no devices to poll.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-11 | NVMe: Make I/O timeout a module parameter | Keith Busch | 1 | -0/+4

Increase the default timeout to 30 seconds to match SCSI.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[use byte instead of ushort]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
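A sketch of the parameter as described (variable name assumed; the bracketed note above confirms the byte type):

    static unsigned char io_timeout = 30;  /* seconds, default matches SCSI */
    module_param(io_timeout, byte, 0644);
    MODULE_PARM_DESC(io_timeout, "timeout in seconds for I/O");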
2014-04-11 | NVMe: CPU hot plug notification | Keith Busch | 1 | -0/+19

Registers with hot cpu notification to rebalance, and potentially allocate additional, io queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-04-11 | NVMe: per-cpu io queues | Keith Busch | 1 | -37/+167

The device's IO queues are associated with CPUs, so we can use a per-cpu variable to map a qid to a cpu. This provides a convenient way to optimally assign queues to multiple cpus when the device supports fewer queues than the host has cpus. The previous implementation may have assigned these poorly in these situations. This patch addresses this by sharing queues among cpus that are "close" together and should have a lower lock contention penalty.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
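A hedged sketch of a per-cpu qid lookup at submission time (field names hypothetical):

    unsigned short __percpu *io_queue = alloc_percpu(unsigned short);
    /* ... a hotplug notifier fills in a qid per cpu ... */

    int cpu = get_cpu();  /* pin, then pick this cpu's queue */
    struct nvme_queue *nvmeq = dev->queues[*per_cpu_ptr(io_queue, cpu)];
    /* ... submit on nvmeq ... */
    put_cpu();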
2014-04-10 | Merge branch 'for-linus' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -4/+4

Pull block layer fixes from Jens Axboe:
 "A small collection of fixes that should go in before -rc1. The pull request contains:

  - A two patch fix for a regression with block enabled tagging caused by a commit in the initial pull request. One patch is from Martin and ensures that SCSI doesn't truncate 64-bit block flags, the other one is from me and prevents us from double using struct request queuelist for both completion and busy tags. This caused anything from a boot crash for some, to crashes under load.

  - A blk-mq fix for a potential soft stall when hot unplugging CPUs with busy IO.

  - percpu_counter fix is listed in here, that caused a suspend issue with virtio-blk due to percpu counters having an inconsistent state during CPU removal. Andrew sent this in separately a few days ago, but it's here. JFYI.

  - A few fixes for block integrity from Martin.

  - A ratelimit fix for loop from Mike Galbraith, to avoid spewing too much in error cases"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix regression with block enabled tagging
  scsi: Make sure cmd_flags are 64-bit
  block: Ensure we only enable integrity metadata for reads and writes
  block: Fix integrity verification
  block: Fix for_each_bvec()
  drivers/block/loop.c: ratelimit error messages
  blk-mq: fix potential stall during CPU unplug with IO pending
  percpu_counter: fix bad counter state during suspend
2014-04-09 | drivers/block/loop.c: ratelimit error messages | Mike Galbraith | 1 | -4/+4

Metric tons of high speed spew is not helpful when things go pear shaped. systemd lost its mind, forgot how to stop services it insists on being sole manager of, massive printk() flood ensued, box eventually died.

[16206.684000] loop: Write error at byte offset 11412291584, length 4096.
[16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.
[16206.684000] loop: Write error at byte offset 13155434496, length 4096.
[16206.684000] loop: Write error at byte offset 13155438592, length 4096.
[16206.684000] loop: Write error at byte offset 13155442688, length 4096.
[16206.684000] loop: Write error at byte offset 13960736768, length 4096.
[16206.684000] loop: Write error at byte offset 14229172224, length 4096.
[16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.
[16206.684000] loop: Write error at byte offset 14766043136, length 4096.
[16206.684000] loop: Write error at byte offset 15034478592, length 4096.
[16206.684000] systemd-journald[1758]: /dev/kmsg buffer overrun, some messages lost.

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
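The ratelimited form of such a message keeps the log usable under sustained failure; a sketch matching the flooded line above (variable names assumed):

    printk_ratelimited(KERN_ERR "loop: Write error at byte offset %llu, length %i.\n",
                       (unsigned long long)pos, bvec->bv_len);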
2014-04-08 | Merge branch 'akpm' (incoming from Andrew) | Linus Torvalds | 10 | -170/+797

Merge second patch-bomb from Andrew Morton:
 - the rest of MM
 - zram updates
 - zswap updates
 - exit
 - procfs
 - exec
 - wait
 - crash dump
 - lib/idr
 - rapidio
 - adfs, affs, bfs, ufs
 - cris
 - Kconfig things
 - initramfs
 - small amount of IPC material
 - percpu enhancements
 - early ioremap support
 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
  MAINTAINERS: update Intel C600 SAS driver maintainers
  fs/ufs: remove unused ufs_super_block_third pointer
  fs/ufs: remove unused ufs_super_block_second pointer
  fs/ufs: remove unused ufs_super_block_first pointer
  fs/ufs/super.c: add __init to init_inodecache()
  doc/kernel-parameters.txt: add early_ioremap_debug
  arm64: add early_ioremap support
  arm64: initialize pgprot info earlier in boot
  x86: use generic early_ioremap
  mm: create generic early_ioremap() support
  x86/mm: sparse warning fix for early_memremap
  lglock: map to spinlock when !CONFIG_SMP
  percpu: add preemption checks to __this_cpu ops
  vmstat: use raw_cpu_ops to avoid false positives on preemption checks
  slub: use raw_cpu_inc for incrementing statistics
  net: replace __this_cpu_inc in route.c with raw_cpu_inc
  modules: use raw_cpu_write for initialization of per cpu refcount.
  mm: use raw_cpu ops for determining current NUMA node
  percpu: add raw_cpu_ops
  slub: fix leak of 'name' in sysfs_slab_add
  ...
2014-04-08 | zram: support REQ_DISCARD | Joonsoo Kim | 1 | -0/+62

zram is a ram based block device and can be used as the backend of a filesystem. When a filesystem deletes a file, it normally doesn't do anything on the data blocks of that file; it just marks them in the metadata of that file. This behavior is no problem on a disk based block device, but is a problem on a ram based block device, since we can't free the memory used for the data blocks. To overcome this disadvantage, there is the REQ_DISCARD functionality. If a block device supports REQ_DISCARD and the filesystem is mounted with the discard option, the filesystem sends REQ_DISCARD to the block device whenever some data blocks are discarded. All we have to do is to handle this request.

This patch sets the QUEUE_FLAG_DISCARD flag and handles these REQ_DISCARD requests. With it, we can free memory used by zram if it isn't used.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
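A sketch of both halves against the 3.15-era bio API (zram_bio_discard is a hypothetical helper name):

    /* advertise discard support on the zram queue */
    queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);

    /* ... in the request path: free the backing memory instead of
     * shuffling zeroes around */
    if (unlikely(bio->bi_rw & REQ_DISCARD)) {
        zram_bio_discard(zram, index, offset, bio);
        bio_endio(bio, 0);
        return;
    }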
2014-04-08 | zram: use scnprintf() in attrs show() methods | Sergey Senozhatsky | 2 | -9/+11

sysfs.txt documentation lists the following requirements:

 - The buffer will always be PAGE_SIZE bytes in length. On i386, this is 4096.
 - show() methods should return the number of bytes printed into the buffer. This is the return value of scnprintf().
 - show() should always use scnprintf().

Use scnprintf() in show() functions.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
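A show() method following those rules, sketched (dev_to_zram and the disksize field assumed from the driver):

    static ssize_t disksize_show(struct device *dev,
                                 struct device_attribute *attr, char *buf)
    {
        struct zram *zram = dev_to_zram(dev);

        /* returns bytes actually written, never more than PAGE_SIZE */
        return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
    }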
2014-04-08 | zram: propagate error to user | Minchan Kim | 3 | -11/+20

When we initialized zcomp with single, we couldn't change max_comp_streams without a zram reset, but the current interface doesn't show any error to the user, and it even changes max_comp_streams's value without any effect, so it would leave the user very confused. This patch prevents changes to max_comp_streams when zcomp was initialized as a single-stream zcomp, and emits the error to the user (e.g. via echo).

[akpm@linux-foundation.org: don't return with the lock held, per Sergey]
[fengguang.wu@intel.com: fix coccinelle warnings]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 | zram: return error-valued pointer from zcomp_create() | Sergey Senozhatsky | 2 | -15/+18

Instead of returning just NULL, return an ERR_PTR from zcomp_create() if compressing backend creation has failed: ERR_PTR(-EINVAL) for an unsupported compression algorithm request, ERR_PTR(-ENOMEM) for an allocation (zcomp or compression stream) error.

Perform an IS_ERR() check on the value returned from zcomp_create() in disksize_store() and set the return code to PTR_ERR(). Change suggested by Jerome Marchand.

[akpm@linux-foundation.org: clean up error recovery flow]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
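The caller-side contract, sketched (zcomp_create() arguments assumed):

    struct zcomp *comp = zcomp_create(zram->compressor, zram->max_comp_streams);
    if (IS_ERR(comp)) {
        pr_info("cannot initialise %s compressing backend\n", zram->compressor);
        return PTR_ERR(comp);  /* -EINVAL or -ENOMEM reaches the user */
    }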