summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)AuthorFilesLines
2014-06-16Merge git://git.infradead.org/users/willy/linux-nvmeLinus Torvalds2-90/+149
Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...
2014-06-13NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.Dan McLeran1-7/+6
This patch contains several fixes for Scsi START_STOP_UNIT. The previous code did not account for signed vs. unsigned arithmetic which resulted in an invalid lowest power state caculation when the device only supports 1 power state. The code for Power Condition == 2 (Idle) was not following the spec. The spec calls for setting the device to specific power states, depending upon Power Condition Modifier, without accounting for the number of power states supported by the device. The code for Power Condition == 3 (Standby) was using a hard-coded '0' which is replaced with the macro POWER_STATE_0. Signed-off-by: Dan McLeran <daniel.mcleran@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13NVMe: Use Log Page constants in SCSI emulationMatthew Wilcox1-3/+2
The nvme-scsi file defined its own Log Page constant. Use the newly-defined one from the header file instead. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13NVMe: Fix hot cpu notification dead lockKeith Busch1-10/+25
There is a potential dead lock if a cpu event occurs during nvme probe since it registered with hot cpu notification. This fixes the race by having the module register with notification outside of probe rather than have each device register. The actual work is done in a scheduled work queue instead of in the notifier since assigning IO queues has the potential to block if the driver creates additional queues. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-13Merge branch 'for-linus' of ↵Linus Torvalds1-45/+197
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...
2014-06-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds3-11/+14
Pull block layer fixes from Jens Axboe: "Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq. The pull request contains: - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging avoid having to do complicated split logic. From me. - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me. - Various blk-mq rq init fixes from me and others. - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal. - Fixup for blk-mq timer addition from Ming Lei. - Small round of performance fixes for mtip32xx from Sam Bradshaw. - Minor stack leak fix from Rickard Strandqvist. - Two __init annotations from Fabian Frederick" * 'for-linus' of git://git.kernel.dk/linux-block: block: add __init to blkcg_policy_register block: add __init to elv_register block: ensure that bio_add_page() always accepts a page for an empty bio blk-mq: add timer in blk_mq_start_request blk-mq: always initialize request->start_time block: blk-exec.c: Cleaning up local variable address returnd mtip32xx: minor performance enhancements blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() blk-mq: don't allow queue entering for a dying queue blk-mq: bump max tag depth to 10K tags block: add blk_rq_set_block_pc() block: add notion of a chunk size for request merging
2014-06-11rbd: only set disk to read-only onceJosh Durgin1-1/+1
rbd_open(), called every time the device is opened, calls set_device_ro(). There's no reason to set the device read-only or read-write every time it is opened. Just do this once during device setup, using set_disk_ro() instead because the struct block_device isn't available to us there. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-11rbd: move calls that may sleep out of spin lock rangeJosh Durgin1-11/+18
get_user() and set_disk_ro() may allocate memory, leading to a potential deadlock if theye are called while a spin lock is held. Move the acquisition and release of rbd_dev->lock from rbd_ioctl() into rbd_ioctl_set_ro(), so it can occur between get_user() and set_disk_ro(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-11rbd: add ioctl for rbdGuangliang Zhao1-2/+60
When running the following commands: [root@ceph0 mnt]# blockdev --setro /dev/rbd1 [root@ceph0 mnt]# blockdev --getro /dev/rbd1 0 The block setro didn't take effect, it is because the rbd doesn't support ioctl of block driver. This resolves: http://tracker.ceph.com/issues/6265 Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-06-07nbd: zero from and len fields in NBD_CMD_DISCONNECT.Hani Benhabiles1-5/+2
Len field is already set to zero, but not the from field which is sent as 0xfffffffffffffe00. This makes no sense, and may cause confuse server implementations doing sanity checks (qemu-nbd is an example.) Signed-off-by: Hani Benhabiles <hani@linux.com> Cc: Paul Clements <paul.clements@us.sios.com> Cc: Paul Clements <Paul.Clements@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mtip32xx: minor performance enhancementsSam Bradshaw2-10/+13
This patch adds the following: 1) Compiler hinting in the fast path. 2) A prefetch of port->flags to eliminate moderate cpu stalling later in mtip_hw_submit_io(). 3) Eliminate a redundant rq_data_dir(). 4) Reorder members of driver_data to eliminate false cacheline sharing between irq_workers_active and unal_qdepth. With some workload and topology configurations, I'm seeing ~1.5% throughput improvement in small block random read benchmarks as well as improved latency std. dev. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Add include of <linux/prefetch.h> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06block: add blk_rq_set_block_pc()Jens Axboe1-1/+1
With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06rbd: fix ida/idr memory leakIlya Dryomov1-0/+1
ida_destroy() needs to be called on module exit to release ida caches. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-06-06rbd: use reference counts for image requestsAlex Elder1-0/+9
Each image request contains a reference count, but to date it has not actually been used. (I think this was just an oversight.) A recent report involving rbd failing an assertion shed light on why and where we need to use these reference counts. Every OSD request associated with an object request uses rbd_osd_req_callback() as its callback function. That function will call a helper function (dependent on the type of OSD request) that will set the object request's "done" flag if the object request if appropriate. If that "done" flag is set, the object request is passed to rbd_obj_request_complete(). In rbd_obj_request_complete(), requests are processed in sequential order. So if an object request completes before one of its predecessors in the image request, the completion is deferred. Otherwise, if it's a completing object's "turn" to be completed, it is passed to rbd_img_obj_end_request(), which records the result of the operation, accumulates transferred bytes, and so on. Next, the successor to this request is checked and if it is marked "done", (deferred) completion processing is performed on that request, and so on. If the last object request in an image request is completed, rbd_img_request_complete() is called, which (typically) destroys the image request. There is a race here, however. The instant an object request is marked "done" it can be provided (by a thread handling completion of one of its predecessor operations) to rbd_img_obj_end_request(), which (for the last request) can then lead to the image request getting torn down. And this can happen *before* that object has itself entered rbd_img_obj_end_request(). As a result, once it *does* enter that function, the image request (and even the object request itself) may have been freed and become invalid. All that's necessary to avoid this is to properly count references to the image requests. We tear down an image request's object requests all at once--only when the entire image request has completed. So there's no need for an image request to count references for its object requests. However, we don't want an image request to go away until the last of its object requests has passed through rbd_img_obj_callback(). In other words, we don't want rbd_img_request_complete() to necessarily result in the image request being destroyed, because it may get called before we've finished processing on all of its object requests. So the fix is to add a reference to an image request for each of its object requests. The reference can be viewed as representing an object request that has not yet finished its call to rbd_img_obj_callback(). That is emphasized by getting the reference right after assigning that as the image object's callback function. The corresponding release of that reference is done at the end of rbd_img_obj_callback(), which every image object request passes through exactly once. Cc: stable@vger.kernel.org Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()Ilya Dryomov1-38/+85
osd_request, along with r_request and r_reply messages attached to it are leaked in __rbd_dev_header_watch_sync() if the requested image doesn't exist. This is because lingering requests are special and get an extra ref in the reply path. Fix it by unregistering linger request on the error path and split __rbd_dev_header_watch_sync() into two functions to make it maintainable. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-06-06rbd: make sure we have latest osdmap on 'rbd map'Ilya Dryomov1-3/+33
Given an existing idle mapping (img1), mapping an image (img2) in a newly created pool (pool2) fails: $ ceph osd pool create pool1 8 8 $ rbd create --size 1000 pool1/img1 $ sudo rbd map pool1/img1 $ ceph osd pool create pool2 8 8 $ rbd create --size 1000 pool2/img2 $ sudo rbd map pool2/img2 rbd: sysfs write failed rbd: map failed: (2) No such file or directory This is because client instances are shared by default and we don't request an osdmap update when bumping a ref on an existing client. The fix is to use the mon_get_version request to see if the osdmap we have is the latest, and block until the requested update is received if it's not. Fixes: http://tracker.ceph.com/issues/8184 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-06-06rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERODuan Jiong1-1/+1
This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-05Merge branch 'akpm' (patchbomb from Andrew) into nextLinus Torvalds2-5/+15
Merge misc updates from Andrew Morton: - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow. - various misc fixes and cleanups - most of the ocfs2 queue. Review is slow... - most of MM. The MM queue is pretty huge this time, but not much in the way of feature work. - some tweaks under kernel/ - printk maintenance work - updates to lib/ - checkpatch updates - tweaks to init/ * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits) fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul init/main.c: remove an ifdef kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND init/main.c: add initcall_blacklist kernel parameter init/main.c: don't use pr_debug() fs/binfmt_flat.c: make old_reloc() static fs/binfmt_elf.c: fix bool assignements fs/efs: convert printk(KERN_DEBUG to pr_debug fs/efs: add pr_fmt / use __func__ fs/efs: convert printk to pr_foo() scripts/checkpatch.pl: device_initcall is not the only __initcall substitute checkpatch: check stable email address checkpatch: warn on unnecessary void function return statements checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar); checkpatch: add warning for kmalloc/kzalloc with multiply checkpatch: warn on #defines ending in semicolon checkpatch: make --strict a default for files in drivers/net and net/ checkpatch: always warn on missing blank line after variable declaration block checkpatch: fix wildcard DT compatible string checking ...
2014-06-05zram: correct offset usage in zram_bio_discardWeijie Yang1-2/+2
We want to skip the physical block(PAGE_SIZE) which is partially covered by the discard bio, so we check the remaining size and subtract it if there is a need to goto the next physical block. The current offset usage in zram_bio_discard is incorrect, it will cause its upper filesystem breakdown. Consider the following scenario: On some architecture or config, PAGE_SIZE is 64K for example, filesystem is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a offset = 4K and size=72K, normally, it should not really discard any physical block as it partially cover two physical blocks. However, with the current offset usage, it will discard the second physical block and free its memory, which will cause filesystem breakdown. This patch corrects the offset usage in zram_bio_discard. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05brd: return -ENOSPC rather than -ENOMEM on page allocation failureMatthew Wilcox1-3/+3
brd is effectively a thinly provisioned device. Thinly provisioned devices return -ENOSPC when they can't write a new block. -ENOMEM is an implementation detail that callers shouldn't know. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05brd: add support for rw_page()Matthew Wilcox1-0/+10
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameterJens Axboe1-1/+3
We currently pass in the hardware queue, and get the tags from there. But from scsi-mq, with a shared tag space, it's a lot more convenient to pass in the blk_mq_tags instead as the hardware queue isn't always directly available. So instead of having to re-map to a given hardware queue from rq->mq_ctx, just pass in the tags structure. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-04NVMe: Rename io_timeout to nvme_io_timeoutMatthew Wilcox1-2/+2
It's positively immoral to have a global variable called 'io_timeout'. Keep the module parameter called io_timeout, though. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Use last bytes of f/w rev SCSI InquiryKeith Busch1-1/+6
After skipping right-padded spaces, use the last four bytes of the firmware revision when reporting the Inquiry Product Revision. These are generally more indicative to what is running. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Adhere to request queue block accounting enable/disableSam Bradshaw1-14/+19
Recently, a new sysfs control "iostats" was added to selectively enable or disable io statistics collection for request queues. This patch hooks that control. IO statistics collection is rather expensive on large, multi-node machines with drives pushing millions of iops. Having the ability to disable collection if not needed can improve throughput significantly. As a data point, on a quad E5-4640, I see more than 50% throughput improvement when io statistics accounting is disabled during heavily multi-threaded small block random read benchmarks where device performance is in the million iops+ range. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Fix nvme get/put queue semanticsKeith Busch1-8/+21
The routines to get and lock nvme queues required the caller to "put" or "unlock" them even if getting one returned NULL. This patch fixes that. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Delete NVME_GET_FEAT_TEMP_THRESHMatthew Wilcox1-1/+0
This define isn't used, and any code that wanted to use it should use NVME_FEAT_TEMP_THRESH instead. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Make admin timeout a module parameterKeith Busch1-3/+7
Signed-off-by: Keith Busch <keith.busch@intel.com> [made admin_timeout static] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Make iod bio timeout a parameterKeith Busch1-1/+5
This was originally set to 4 times the IO timeout, but that was when the IO timeout was 5 seconds instead of 30. 20 seconds for total time to failure seemed more reasonable than 2 minutes for most, but other users have requested to make this a module parameter instead. Signed-off-by: Keith Busch <keith.busch@intel.com> [renamed the module parameter to retry_time] [made retry_time static] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04Merge branch 'sched-core-for-linus' of ↵Linus Torvalds3-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull scheduler updates from Ingo Molnar: "The main scheduling related changes in this cycle were: - various sched/numa updates, for better performance - tree wide cleanup of open coded nice levels - nohz fix related to rq->nr_running use - cpuidle changes and continued consolidation to improve the kernel/sched/idle.c high level idle scheduling logic. As part of this effort I pulled cpuidle driver changes from Rafael as well. - standardized idle polling amongst architectures - continued work on preparing better power/energy aware scheduling - sched/rt updates - misc fixlets and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) sched/numa: Decay ->wakee_flips instead of zeroing sched/numa: Update migrate_improves/degrades_locality() sched/numa: Allow task switch if load imbalance improves sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice() sched: Initialize rq->age_stamp on processor start sched, nohz: Change rq->nr_running to always use wrappers sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance() sched: Use clamp() and clamp_val() to make sys_nice() more readable sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups() sched/numa: Fix initialization of sched_domain_topology for NUMA sched: Call select_idle_sibling() when not affine_sd sched: Simplify return logic in sched_read_attr() sched: Simplify return logic in sched_copy_attr() sched: Fix exec_start/task_hot on migrated tasks arm64: Remove TIF_POLLING_NRFLAG metag: Remove TIF_POLLING_NRFLAG sched/idle: Make cpuidle_idle_call() void sched/idle: Reflow cpuidle_idle_call() sched/idle: Delay clearing the polling bit ...
2014-06-04NVMe: Prevent possible NULL pointer dereferenceSantosh Y1-1/+4
kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod' can fail. So check the return value. Signed-off-by: Santosh Y <santosh.sy@samsung.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-04NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)Indraneel Mukherjee1-4/+4
In GetLogPage the buffer size passed to device is a 0's based value. Signed-off-by: Indraneel M <indraneel.m@samsung.com> Reported-by: Shiro Itou <shiro.itou@outlook.com> Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2014-06-03Merge branch 'for-3.16/drivers' of git://git.kernel.dk/linux-block into nextLinus Torvalds23-1148/+1212
Pull block driver changes from Jens Axboe: "Now that the core bits are in, here's the pull request for the driver related changes for 3.16. Nothing out of the ordinary here, mostly business as usual. There are a few pulls of for-3.16/core into this branch, which were done when the blk-mq was modified after the mtip32xx conversion was put in. The pull request contains: - skd and cciss converted to use pci_enable_msix_exact(). From Alexander Gordeev. - A few mtip32xx fixes from Asai @ Micron. - The conversion of mtip32xx from make_request_fn to blk-mq, and a later small fix for that conversion on quiescing for non-queued IO. From me. - A fix for bsg to use an exported function to check whether this driver is request based or not. Needed updating for blk-mq, which is request based, but does not have a request_fn hook. From me. - Small floppy bug fix from Jiri. - A series of cleanups for the cdrom uniform layer from Joe Perches. Gets rid of various old ugly macros, making the code conform more to the modern coding style. - A series of patches for drbd from the drbd crew (Lars Ellenberg and Philipp Reisner). - A use-after-free fix for null_blk from Ming Lei. - Also from Ming Lei is a performance patch for virtio-blk, which can net us a 3x win on kvm platforms where world notification is expensive. - Ming Lei also fixed a stall issue in virtio-blk, due to a race between queue start/stop and resource limits. - A small batch of fixes for xen-blk{back,front} from Olaf Hering and Valentin Priescu" * 'for-3.16/drivers' of git://git.kernel.dk/linux-block: (54 commits) block: virtio_blk: don't hold spin lock during world switch xen-blkback: defer freeing blkif to avoid blocking xenwatch xen blkif.h: fix comment typo in discard-alignment xen/blkback: disable discard feature if requested by toolstack xen-blkfront: remove type check from blkfront_setup_discard floppy: do not corrupt bio.bi_flags when reading block 0 mtip32xx: move error handling to service thread virtio_blk: fix race between start and stop queue mtip32xx: stop block hardware queues before quiescing IO mtip32xx: blk_mq_init_queue() returns an ERR_PTR mtip32xx: convert to use blk-mq cdrom: Remove unnecessary prototype for cdrom_get_disc_info cdrom: Remove unnecessary prototype for cdrom_mrw_exit cdrom: Remove cdrom_count_tracks prototype cdrom: Remove cdrom_get_next_writeable prototype cdrom: Remove cdrom_get_last_written prototype cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype cdrom: Remove unnecessary sanitize_format prototype cdrom: Remove unnecessary check_for_audio_disc prototype cdrom: Remove prototype for open_for_data ...
2014-06-02Merge tag 'pci-v3.16-changes' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into next Pull PCI changes from Bjorn Helgaas: "Enumeration - Notify driver before and after device reset (Keith Busch) - Use reset notification in NVMe (Keith Busch) NUMA - Warn if we have to guess host bridge node information (Myron Stowe) - Work around AMD Fam15h BIOSes that fail to provide _PXM (Suravee Suthikulpanit) - Clean up and mark early_root_info_init() as deprecated (Suravee Suthikulpanit) Driver binding - Add "driver_override" for force specific binding (Alex Williamson) - Fail "new_id" addition for devices we already know about (Bandan Das) Resource management - Support BAR sizes up to 8GB (Nikhil Rao, Alan Cox) - Don't move IORESOURCE_PCI_FIXED resources (Bjorn Helgaas) - Mark SBx00 HPET BAR as IORESOURCE_PCI_FIXED (Bjorn Helgaas) - Fail safely if we can't handle BARs larger than 4GB (Bjorn Helgaas) - Reject BAR above 4GB if dma_addr_t is too small (Bjorn Helgaas) - Don't convert BAR address to resource if dma_addr_t is too small (Bjorn Helgaas) - Don't set BAR to zero if dma_addr_t is too small (Bjorn Helgaas) - Don't print anything while decoding is disabled (Bjorn Helgaas) - Don't add disabled subtractive decode bus resources (Bjorn Helgaas) - Add resource allocation comments (Bjorn Helgaas) - Restrict 64-bit prefetchable bridge windows to 64-bit resources (Yinghai Lu) - Assign i82875p_edac PCI resources before adding device (Yinghai Lu) PCI device hotplug - Remove unnecessary "dev->bus" test (Bjorn Helgaas) - Use PCI_EXP_SLTCAP_PSN define (Bjorn Helgaas) - Fix rphahp endianess issues (Laurent Dufour) - Acknowledge spurious "cmd completed" event (Rajat Jain) - Allow hotplug service drivers to operate in polling mode (Rajat Jain) - Fix cpqphp possible NULL dereference (Rickard Strandqvist) MSI - Replace pci_enable_msi_block() by pci_enable_msi_exact() (Alexander Gordeev) - Replace pci_enable_msix() by pci_enable_msix_exact() (Alexander Gordeev) - Simplify populate_msi_sysfs() (Jan Beulich) Virtualization - Add Intel Patsburg (X79) root port ACS quirk (Alex Williamson) - Mark RTL8110SC INTx masking as broken (Alex Williamson) Generic host bridge driver - Add generic PCI host controller driver (Will Deacon) Freescale i.MX6 - Use new clock names (Lucas Stach) - Drop old IRQ mapping (Lucas Stach) - Remove optional (and unused) IRQs (Lucas Stach) - Add support for MSI (Lucas Stach) - Fix imx6_add_pcie_port() section mismatch warning (Sachin Kamat) Renesas R-Car - Add gen2 device tree support (Ben Dooks) - Use new OF interrupt mapping when possible (Lucas Stach) - Add PCIe driver (Phil Edworthy) - Add PCIe MSI support (Phil Edworthy) - Add PCIe device tree bindings (Phil Edworthy) Samsung Exynos - Remove unnecessary OOM messages (Jingoo Han) - Fix add_pcie_port() section mismatch warning (Sachin Kamat) Synopsys DesignWare - Make MSI ISR shared IRQ aware (Lucas Stach) Miscellaneous - Check for broken config space aliasing (Alex Williamson) - Update email address (Ben Hutchings) - Fix Broadcom CNB20LE unintended sign extension (Bjorn Helgaas) - Fix incorrect vgaarb conditional in WARN_ON() (Bjorn Helgaas) - Remove unnecessary __ref annotations (Bjorn Helgaas) - Add arch/x86/kernel/quirks.c to MAINTAINERS PCI file patterns (Bjorn Helgaas) - Fix use of uninitialized MPS value (Bjorn Helgaas) - Tidy x86/gart messages (Bjorn Helgaas) - Fix return value from pci_user_{read,write}_config_*() (Gavin Shan) - Turn pcibios_penalize_isa_irq() into a weak function (Hanjun Guo) - Remove unused serial device IDs (Jean Delvare) - Use designated initialization in PCI_VDEVICE (Mark Rustad) - Fix powerpc NULL dereference in pci_root_buses traversal (Mike Qiu) - Configure MPS on ARM (Murali Karicheri) - Remove unnecessary includes of <linux/init.h> (Paul Gortmaker) - Move Open Firmware devspec attribute to PCI common code (Sebastian Ott) - Use pdev->dev.groups for attribute creation on s390 (Sebastian Ott) - Remove pcibios_add_platform_entries() (Sebastian Ott) - Add new ID for Intel GPU "spurious interrupt" quirk (Thomas Jarosch) - Rename pci_is_bridge() to pci_has_subordinate() (Yijing Wang) - Add and use new pci_is_bridge() interface (Yijing Wang) - Make pci_bus_add_device() void (Yijing Wang) DMA API - Clarify physical/bus address distinction in docs (Bjorn Helgaas) - Fix typos in docs (Emilio López) - Update dma_pool_create ()and dma_pool_alloc() descriptions (Gioh Kim) - Change dma_declare_coherent_memory() CPU address to phys_addr_t (Bjorn Helgaas) - Pass GAPSPCI_DMA_BASE CPU & bus address to dma_declare_coherent_memory() (Bjorn Helgaas)" * tag 'pci-v3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (92 commits) MAINTAINERS: Add generic PCI host controller driver PCI: generic: Add generic PCI host controller driver PCI: imx6: Add support for MSI PCI: designware: Make MSI ISR shared IRQ aware PCI: imx6: Remove optional (and unused) IRQs PCI: imx6: Drop old IRQ mapping PCI: imx6: Use new clock names i82875p_edac: Assign PCI resources before adding device ARM/PCI: Call pcie_bus_configure_settings() to set MPS PCI: imx6: Fix imx6_add_pcie_port() section mismatch warning PCI: Make pci_bus_add_device() void PCI: exynos: Fix add_pcie_port() section mismatch warning PCI: Introduce new device binding path using pci_dev.driver_override PCI: rcar: Add gen2 device tree support PCI: cpqphp: Fix possible null pointer dereference PCI: rcar: Add R-Car PCIe device tree bindings PCI: rcar: Add MSI support for PCIe PCI: rcar: Add Renesas R-Car PCIe driver PCI: Fix return value from pci_user_{read,write}_config_*() PCI: exynos: Remove unnecessary OOM messages ...
2014-06-02Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into nextLinus Torvalds16-149/+124
Pull block core updates from Jens Axboe: "It's a big(ish) round this time, lots of development effort has gone into blk-mq in the last 3 months. Generally we're heading to where 3.16 will be a feature complete and performant blk-mq. scsi-mq is progressing nicely and will hopefully be in 3.17. A nvme port is in progress, and the Micron pci-e flash driver, mtip32xx, is converted and will be sent in with the driver pull request for 3.16. This pull request contains: - Lots of prep and support patches for scsi-mq have been integrated. All from Christoph. - API and code cleanups for blk-mq from Christoph. - Lots of good corner case and error handling cleanup fixes for blk-mq from Ming Lei. - A flew of blk-mq updates from me: * Provide strict mappings so that the driver can rely on the CPU to queue mapping. This enables optimizations in the driver. * Provided a bitmap tagging instead of percpu_ida, which never really worked well for blk-mq. percpu_ida relies on the fact that we have a lot more tags available than we really need, it fails miserably for cases where we exhaust (or are close to exhausting) the tag space. * Provide sane support for shared tag maps, as utilized by scsi-mq * Various fixes for IO timeouts. * API cleanups, and lots of perf tweaks and optimizations. - Remove 'buffer' from struct request. This is ancient code, from when requests were always virtually mapped. Kill it, to reclaim some space in struct request. From me. - Remove 'magic' from blk_plug. Since we store these on the stack and since we've never caught any actual bugs with this, lets just get rid of it. From me. - Only call part_in_flight() once for IO completion, as includes two atomic reads. Hopefully we'll get a better implementation soon, as the part IO stats are now one of the more expensive parts of doing IO on blk-mq. From me. - File migration of block code from {mm,fs}/ to block/. This includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me, from a discussion on lkml. That should describe the meat of the pull request. Also has various little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong, Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam Bradshaw" * 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits) blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() blk-mq: remember to start timeout handler for direct queue block: ensure that the timer is always added blk-mq: blk_mq_unregister_hctx() can be static blk-mq: make the sysfs mq/ layout reflect current mappings blk-mq: blk_mq_tag_to_rq should handle flush request block: remove dead code in scsi_ioctl:blk_verify_command blk-mq: request initialization optimizations block: add queue flag for disabling SG merging block: remove 'magic' from struct blk_plug blk-mq: remove alloc_hctx and free_hctx methods blk-mq: add file comments and update copyright notices blk-mq: remove blk_mq_alloc_request_pinned blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request blk-mq: remove blk_mq_wait_for_tags blk-mq: initialize request in __blk_mq_alloc_request blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request blk-mq: add helper to insert requests from irq context blk-mq: remove stale comment for blk_mq_complete_request() blk-mq: allow non-softirq completions ...
2014-06-02Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k into next Pull m68k updates from Geert Uytterhoeven: "Highlights: - support for running kernels in fast TT-RAM instead of slow ST-RAM on Atari - multi-platform EARLY_PRINTK - better support for machines with lots of RAM (think ARAnyM), and for running kernels larger than 4 MiB (think multi-platform)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k/hp300: Convert printk to pr_foo() m68k/apollo: Convert printk to pr_foo() m68k/amiga: Convert printk(foo to pr_foo() m68k: Increase initial mapping to 8 or 16 MiB if possible m68k: Update defconfigs for v3.15-rc2 m68k/atari: fix SCC initialization for debug console m68k/mvme16x: Adopt common boot console m68k: Multi-platform EARLY_PRINTK m68k: Toward platform agnostic framebuffer debug logging m68k/atari - atari_scsi: use correct virt/phys translation for DMA buffer m68k/atari - ataflop: use correct virt/phys translation for DMA buffer m68k/atari - atafb: convert allocation of fb ram to new interface m68k/atari - stram: alloc ST-RAM pool even if kernel not in ST-RAM
2014-05-30block: virtio_blk: don't hold spin lock during world switchMing Lei1-3/+6
Firstly, it isn't necessary to hold lock of vblk->vq_lock when notifying hypervisor about queued I/O. Secondly, virtqueue_notify() will cause world switch and it may take long time on some hypervisors(such as, qemu-arm), so it isn't good to hold the lock and block other vCPUs. On arm64 quad core VM(qemu-kvm), the patch can increase I/O performance a lot with VIRTIO_RING_F_EVENT_IDX enabled: - without the patch: 14K IOPS - with the patch: 34K IOPS fio script: [global] direct=1 bsrange=4k-4k timeout=10 numjobs=4 ioengine=libaio iodepth=64 filename=/dev/vdc group_reporting=1 [f1] rw=randread Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Ming Lei <ming.lei@canonical.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30Merge branch 'for-3.16/core' into for-3.16/driversJens Axboe1-3/+1
Pulled in for the blk_mq_tag_to_rq() change, which impacts mtip32xx. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28Merge branch 'stable/for-jens-3.16' of ↵Jens Axboe3-41/+56
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-3.16/drivers Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.16 which has a bunch of fixes to the Xen block frontend and backend driver and a new parameter for Xen backend driver - an override (set by the toolstack) whether to expose the discard support (if disk of course supports it) or not.
2014-05-28xen-blkback: defer freeing blkif to avoid blocking xenwatchValentin Priescu2-14/+36
Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O requests to finish. If the VBD is attached to a hot-swappable disk, then xenwatch can hang for a long period of time, stalling other watches. INFO: task xenwatch:39 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040 Call Trace: [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20 [<ffffffff8128f761>] ? list_del+0x11/0x40 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20 [<ffffffff8147a26a>] schedule+0x3a/0x60 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback] [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback] [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0 [<ffffffff81346488>] device_release_driver+0x28/0x40 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0 [<ffffffff81342c9f>] device_del+0x12f/0x1a0 [<ffffffff81342d2d>] device_unregister+0x1d/0x60 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback] [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback] [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80 [<ffffffff810799d6>] kthread+0x96/0xa0 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6 [<ffffffff81485930>] ? gs_change+0x13/0x13 With this patch, when there is still pending I/O, the actual disconnect is done by the last reference holder (last pending I/O request). In this case, xenwatch doesn't block indefinitely. Signed-off-by: Valentin Priescu <priescuv@amazon.com> Reviewed-by: Steven Kady <stevkady@amazon.com> Reviewed-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28xen/blkback: disable discard feature if requested by toolstackOlaf Hering1-1/+6
Newer toolstacks may provide a boolean property "discard-enable" in the backend node. Its purpose is to disable discard for file backed storage to avoid fragmentation. Recognize this setting also for physical storage. If that property exists and is false, do not advertise "feature-discard" to the frontend. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28xen-blkfront: remove type check from blkfront_setup_discardOlaf Hering1-26/+14
In its initial implementation a check for "type" was added, but only phy and file are handled. This breaks advertised discard support for other type values such as qdisk. Fix and simplify this function: If the backend advertises discard support it is supposed to implement it properly, so enable feature_discard unconditionally. If the backend advertises the need for a certain granularity and alignment then propagate both properties to the blocklayer. The discard-secure property is a boolean, update the code to reflect that. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28Merge branch 'for-3.16/core' into for-3.16/driversJens Axboe3-31/+1
Pull in core changes (again), since we got rid of the alloc/free hctx mq_ops hooks and mtip32xx then needed updating again. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28blk-mq: remove alloc_hctx and free_hctx methodsChristoph Hellwig2-29/+1
There is no need for drivers to control hardware context allocation now that we do the context to node mapping in common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28Merge branch 'for-3.16/core' into for-3.16/driversJens Axboe3-35/+6
mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the core changes so we have a properly merged end result. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28floppy: do not corrupt bio.bi_flags when reading block 0Jiri Kosina1-1/+1
Commit 41a55b4de39 ("floppy: silence warning during disk test") caused bio.bi_flags being overwritten, and its initialization to BIO_UPTODATE in bio_init() to be lost. This was unnoticed until 7b7b68bba5 ("floppy: bail out in open() if drive is not responding to block0 read"), because the error value wasn't checked for in the bio completion callback. Now we are actually looking at the error, and the loss of BIO_UPTODATE causes EIO to be wrongly passed to the callback, which confuses the FD_OPEN_SHOULD_FAIL_BIT logic. Fix this by not destroying previous value of bi_flags when setting BIO_QUIET. Cc: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-05-27blk-mq: pass in suggested NUMA node to ->alloc_hctx()Jens Axboe1-32/+3
Drivers currently have to figure this out on their own, and they are missing information to do it properly. The ones that did attempt to do it, do it wrong. So just pass in the suggested node directly to the alloc function. Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27NVMe: Implement PCIe reset notification callbackKeith Busch1-0/+11
Quiesce and shutdown the device prior to reset, then restart the device and resume IO after. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-05-27virtio_blk: fix race between start and stop queueMing Lei1-2/+2
When there isn't enough vring descriptor for adding to vq, blk-mq will be put as stopped state until some of pending descriptors are completed & freed. Unfortunately, the vq's interrupt may come just before blk-mq's BLK_MQ_S_STOPPED flag is set, so the blk-mq will still be kept as stopped even though lots of descriptors are completed and freed in the interrupt handler. The worst case is that all pending descriptors are freed in the interrupt handler, and the queue is kept as stopped forever. This patch fixes the problem by starting/stopping blk-mq with holding vq_lock. Cc: Jens Axboe <axboe@kernel.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: drivers/block/virtio_blk.c
2014-05-27m68k/atari - ataflop: use correct virt/phys translation for DMA bufferMichael Schmitz1-1/+1
With the kernel running from FastRAM instead of ST-RAM, none of ST-RAM is mapped by mem_init, and DMA-addressable buffer must be mapped by ioremap. Use platform specific virt/phys translation helpers for this case. Signed-off-by: Michael Schmitz <schmitz@debian.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>