summaryrefslogtreecommitdiff
path: root/include/linux/blkdev.h
AgeCommit message (Collapse)AuthorFilesLines
2016-12-17block: Remove unused member (busy) from struct blk_queue_tagRitesh Harjani1-1/+0
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-14Merge branch 'for-4.10' of ↵Linus Torvalds1-0/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: - Adam added opt-in ATA command priority support. - There are machines which hide multiple nvme devices behind an ahci BAR. Dan Williams proposed a solution to force-switch the mode but deemed too hackishd. People are gonna discuss the proper way to handle the situation in nvme standard meetings. For now, detect and warn about the situation. - Low level driver specific changes. Christoph Hellwig pipes in about the hidden nvme warning: "I wish that was the case. We've pretty much agreed that we'll want to implement it as a virtual PCIe root bridge, similar to Intels other 'innovation' VMD that we work around that way. But Intel management has apparently decided that they don't want to spend more cycles on this now that Lenovo has an optional BIOS that doesn't force this broken mode anymore, and no one outside of Intel has enough information to implement something like this. So for now I guess this warning is it, until Intel reconsideres and spends resources on fixing up the damage their Chipset people caused" * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: warn about remapped NVMe devices ahci-remap.h: add ahci remapping definitions nvme: move NVMe class code to pci_ids.h pata: imx: support controller modes up to PIO4 pata: imx: add support of setting timings for PIO modes pata: imx: set controller PIO mode with .set_piomode callback pata: imx: sort headers out ata: set ncq_prio_enabled iff device has support ata: ATA Command Priority Disabled By Default ata: Enabling ATA Command Priorities block: Add iocontext priority to request ahci: qoriq: added ls1046a platform support
2016-12-09block: improve handling of the magic discard payloadChristoph Hellwig1-3/+12
Instead of allocating a single unused biovec for discard requests, send them down without any payload. Instead we allow the driver to add a "special" payload using a biovec embedded into struct request (unioned over other fields never used while in the driver), and overloading the number of segments for this case. This has a couple of advantages: - we don't have to allocate the bio_vec - the amount of special casing for discard requests in the block layer is significantly reduced - using this same scheme for other request types is trivial, which will be important for implementing the new WRITE_ZEROES op on devices where it actually requires a payload (e.g. SCSI) - we can get rid of playing games with the request length, as we'll never touch it and completions will work just fine - it will allow us to support ranged discard operations in the future by merging non-contiguous discard bios into a single request - last but not least it removes a lot of code This patch is the common base for my WIP series for ranges discards and to remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01block: add support for REQ_OP_WRITE_ZEROESChaitanya Kulkarni1-0/+19
This adds a new block layer operation to zero out a range of LBAs. This allows to implement zeroing for devices that don't use either discard with a predictable zero pattern or WRITE SAME of zeroes. The prominent example of that is NVMe with the Write Zeroes command, but in the future, this should also help with improving the way zeroing discards work. For this operation, suitable entry is exported in sysfs which indicate the number of maximum bytes allowed in one write zeroes operation by the device. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01block: add async variant of blkdev_issue_zerooutChaitanya Kulkarni1-0/+3
Similar to __blkdev_issue_discard this variant allows submitting the final bio asynchronously and chaining multiple ranges into a single completion. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-18block: Change extern inline to static inlineTobias Klauser1-1/+1
With compilers which follow the C99 standard (like modern versions of gcc and clang), "extern inline" does the opposite thing from older versions of gcc (emits code for an externally linkable version of the inline function). "static inline" does the intended behavior in all cases instead. Description taken from commit 6d91857d4826 ("staging, rtl8192e, LLVMLinux: Change extern inline to static inline"). This also fixes the following GCC warning when building with CONFIG_PM disabled: ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes] Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()") Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-17blk-mq: make the polling code adaptiveJens Axboe1-1/+1
The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently: -1 Never enter hybrid sleep mode 0 Use half of the completion mean for the sleep delay >0 Use this specific value as the sleep delay Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17blk-mq: implement hybrid poll mode for sync O_DIRECTJens Axboe1-0/+1
This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-11block: move poll code to blk-mqJens Axboe1-1/+1
The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-10block: hook up writeback throttlingJens Axboe1-0/+3
Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10block: add scalable completion tracking of requestsJens Axboe1-0/+7
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-06block: add code to track actual device queue depthJens Axboe1-0/+11
For blk-mq, ->nr_requests does track queue depth, at least at init time. But for the older queue paths, it's simply a soft setting. On top of that, it's generally larger than the hardware setting on purpose, to allow backup of requests for merging. Fill a hole in struct request with a 'queue_depth' member, that drivers can call to more closely inform the block layer of the real queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-04block: immediately dispatch big size requestShaohua Li1-0/+1
Currently block plug holds up to 16 non-mergeable requests. This makes sense if the request size is small, eg, reduce lock contention. But if request size is big enough, we don't need to worry about lock contention. Holding such request makes no sense and it lows the disk utilization. In practice, this improves 10% throughput for my raid5 sequential write workload. The size (128k) is arbitrary right now, but it makes sure lock contention is small. This probably could be more intelligent, eg, check average request size holded. Since this is mainly for sequential IO, probably not worthy. V2: check the last request instead of the first request, so as long as there is one big size request we flush the plug. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Introduce blk_mq_quiesce_queue()Bart Van Assche1-0/+1
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations have finished. This function does *not* wait until all outstanding requests have finished (this means invocation of request.end_io()). The algorithm used by blk_mq_quiesce_queue() is as follows: * Hold either an RCU read lock or an SRCU read lock around .queue_rq() calls. The former is used if .queue_rq() does not block and the latter if .queue_rq() may block. * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues() followed by synchronize_srcu() or synchronize_rcu(). The latter call waits for .queue_rq() invocations that started before blk_mq_quiesce_queue() was called. * The blk_mq_hctx_stopped() calls that control whether or not .queue_rq() will be called are called with the (S)RCU read lock held. This is necessary to avoid race conditions against blk_mq_quiesce_queue(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: better op and flags encodingChristoph Hellwig1-24/+2
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: split out request-only flags into a new namespaceChristoph Hellwig1-1/+48
A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19block: Add iocontext priority to requestAdam Manzanares1-0/+14
Patch adds an association between iocontext ioprio and the ioprio of a request. This is done to enable request based drivers the ability to act on priority information stored in the request. An example being ATA devices that support command priorities. If the ATA driver discovers that the device supports command priorities and the request has valid priority information indicating the request is high priority, then a high priority command can be sent to the device. This should improve tail latencies for high priority IO on any device that queues requests internally and can make use of the priority information stored in the request. The ioprio of the request is set in blk_rq_set_prio which takes the request and the ioc as arguments. If the ioc is valid in blk_rq_set_prio then the iopriority of the request is set as the iopriority of the ioc. In init_request_from_bio a check is made to see if the ioprio of the bio is valid and if so then the request prio comes from the bio. Signed-off-by: Adam Manzananares <adam.manzanares@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-18blk-zoned: implement ioctlsShaun Tancheff1-0/+21
Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively obtaining the zone configuration of a zoned block device and resetting the write pointer of sequential zones of a zoned block device. The BLKREPORTZONE ioctl maps directly to a single call of the function blkdev_report_zones. The zone information result is passed as an array of struct blk_zone identical to the structure used internally for processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl maps to a single call of the blkdev_reset_zones function. Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-18block: Implement support for zoned block devicesHannes Reinecke1-0/+31
Implement zoned block device zone information reporting and reset. Zone information are reported as struct blk_zone. This implementation does not differentiate between host-aware and host-managed device models and is valid for both. Two functions are provided: blkdev_report_zones for discovering the zone configuration of a zoned block device, and blkdev_reset_zones for resetting the write pointer of sequential zones. The helper function blk_queue_zone_size and bdev_zone_size are also provided for, as the name suggest, obtaining the zone size (in 512B sectors) of the zones of the device. Signed-off-by: Hannes Reinecke <hare@suse.de> [Damien: * Removed the zone cache * Implement report zones operation based on earlier proposal by Shaun Tancheff <shaun.tancheff@seagate.com>] Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-18block: Add 'zoned' queue limitDamien Le Moal1-0/+47
Add the zoned queue limit to indicate the zoning model of a block device. Defined values are 0 (BLK_ZONED_NONE) for regular block devices, 1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM) for host-managed zone block devices. The standards defined drive managed model is not defined here since these block devices do not provide any command for accessing zone information. Drive managed model devices will be reported as BLK_ZONED_NONE. The helper functions blk_queue_zoned_model and bdev_zoned_model return the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned return a boolean for callers to test if a block device is zoned. The zoned attribute is also exported as a string to applications via sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and BLK_ZONED_HM as "host-managed". Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14blk-mq: introduce blk_mq_delay_kick_requeue_list()Mike Snitzer1-1/+1
blk_mq_delay_kick_requeue_list() provides the ability to kick the q->requeue_list after a specified time. To do this the request_queue's 'requeue_work' member was changed to a delayed_work. blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued requests while it doesn't make sense to immediately requeue them (e.g. when all paths in a DM multipath have failed). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-29block: add kblockd_schedule_work_on()Jens Axboe1-1/+1
Add a helper to schedule a regular struct work on a particular CPU. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-16block: Fix secure eraseAdrian Hunter1-2/+4
Commit 288dab8a35a0 ("block: add a separate operation type for secure erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering all the places REQ_OP_DISCARD was being used to mean either. Fix those. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase") Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07block/mm: make bdev_ops->rw_page() take a bool for read/writeJens Axboe1-1/+1
Commit abf545484d31 changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04mm/block: convert rw_page users to bio op useMike Christie1-1/+1
The rw_page users were not converted to use bio/req ops. As a result bdev_write_page is not passing down REQ_OP_WRITE and the IOs will be sent down as reads. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code") Modified by me to: 1) Drop op_flags passing into ->rw_page(), as we don't use it. 2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04Include: blkdev: Removed duplicate 'struct request;' declaration.John Pittman1-1/+0
In include/linux/blkdev.h duplicate declarations of the request struct exist. Cleaned up by removing the second, unneeded declaration. Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-29Merge tag 'libnvdimm-for-4.8' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...
2016-07-27Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-22/+8
Pull block driver updates from Jens Axboe: "This branch also contains core changes. I've come to the conclusion that from 4.9 and forward, I'll be doing just a single branch. We often have dependencies between core and drivers, and it's hard to always split them up appropriately without pulling core into drivers when that happens. That said, this contains: - separate secure erase type for the core block layer, from Christoph. - set of discard fixes, from Christoph. - bio shrinking fixes from Christoph, as a followup up to the op/flags change in the core branch. - map and append request fixes from Christoph. - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty exciting! - nvme-loop fixes from Arnd. - removal of ->driverfs_dev from Dan, after providing a device_add_disk() helper. - bcache fixes from Bhaktipriya and Yijing. - cdrom subchannel read fix from Vchannaiah. - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier. - set of drbd updates and fixes from Fabian, Lars, and Philipp. - mg_disk error path fix from Bart. - user notification for failed device add for loop, from Minfei. - NVMe in general: + NVMe delay quirk from Guilherme. + SR-IOV support and command retry limits from Keith. + fix for memory-less NUMA node from Masayoshi. + use UINT_MAX for discard sectors, from Minfei. + cancel IO fixes from Ming. + don't allocate unused major, from Neil. + error code fixup from Dan. + use constants for PSDT/FUSE from James. + variable init fix from Jay. + fabrics fixes from Ming, Sagi, and Wei. + various fixes" * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits) nvme/pci: Provide SR-IOV support nvme: initialize variable before logical OR'ing it block: unexport various bio mapping helpers scsi/osd: open code blk_make_request target: stop using blk_make_request block: simplify and export blk_rq_append_bio block: ensure bios return from blk_get_request are properly initialized virtio_blk: use blk_rq_map_kern memstick: don't allow REQ_TYPE_BLOCK_PC requests block: shrink bio size again block: simplify and cleanup bvec pool handling block: get rid of bio_rw and READA block: don't ignore -EOPNOTSUPP blkdev_issue_write_same block: introduce BLKDEV_DISCARD_ZERO to fix zeroout NVMe: don't allocate unused nvme_major nvme: avoid crashes when node 0 is memoryless node. nvme: Limit command retries loop: Make user notify for adding loop device failed nvme-loop: fix nvme-loop Kconfig dependencies nvmet: fix return value check in nvmet_subsys_alloc() ...
2016-07-21block: Fix front merge checkDamien Le Moal1-2/+3
For a front merge, the maximum number of sectors of the request must be checked against the front merge BIO sector, not the current sector of the request. Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-21block: add QUEUE_FLAG_DAX for devices to advertise their DAX supportToshi Kani1-0/+2
Currently, presence of direct_access() in block_device_operations indicates support of DAX on its block device. Because block_device_operations is instantiated with 'const', this DAX capablity may not be enabled conditinally. In preparation for supporting DAX to device-mapper devices, add QUEUE_FLAG_DAX to request_queue flags to advertise their DAX support. This will allow to set the DAX capability based on how mapped device is composed. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <linux-s390@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-21scsi/osd: open code blk_make_requestChristoph Hellwig1-2/+0
I wish the OSD code could simply use blk_rq_map_* helpers like everyone else, but the complex nature of deciding if we have DATA IN and/or DATA OUT buffers might make this impossible (at least for a mere human like me). But using blk_rq_append_bio at least allows sharing the setup code between request with or without dat a buffers, and given that this is the last user of blk_make_request it allows getting rid of that somewhat awkward interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Boaz Harrosh <ooo@electrozaur.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-21block: simplify and export blk_rq_append_bioChristoph Hellwig1-0/+1
The target SCSI passthrough backend is much better served with the low-level blk_rq_append_bio construct then the helpers built on top of it, so export it. Also use the opportunity to remove the pointless request_queue argument and make the code flow a little more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-21block: introduce BLKDEV_DISCARD_ZERO to fix zerooutChristoph Hellwig1-1/+3
Currently blkdev_issue_zeroout cascades down from discards (if the driver guarantees that discards zero data), to WRITE SAME and then to a loop writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the block layer blkdev_issue_discard helper to work around DM volumes that may have mixed discard support underneath. This patch intoroduces a new BLKDEV_DISCARD_ZERO flag to blkdev_issue_discard that indicates we are called for zeroing operation. This allows both to ignore the EOPNOTSUPP hack and actually consolidating the discard_zeroes_data check into the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-13pmem: kill __pmem address spaceDan Williams1-3/+3
The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-28block: Convert fifo_time from ulong to u64Jan Kara1-1/+1
Currently rq->fifo_time is unsigned long but CFQ stores nanosecond timestamp in it which would overflow on 32-bit archs. Convert it to u64 to avoid the overflow. Since the rq->fifo_time is unioned with struct call_single_data(), this does not change the size of struct request in any way. We have to slightly fixup block/deadline-iosched.c so that comparison happens in the right types. Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09block: add a separate operation type for secure eraseChristoph Hellwig1-19/+4
Instead of overloading the discard support with the REQ_SECURE flag. Use the opportunity to rename the queue flag as well, and remove the dead checks for this flag in the RAID 1 and RAID 10 drivers that don't claim support for secure erase. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09block: better packing for struct requestChristoph Hellwig1-3/+2
Keep the 32-bit CPU and cmd_type flags together to avoid holes on 64-bit architectures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers: add REQ_OP_FLUSH operationMike Christie1-0/+3
This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, fs, drivers: remove REQ_OP compat defs and related codeMike Christie1-4/+10
This patch drops the compat definition of req_op where it matches the rq_flag_bits definitions, and drops the related old and compat code that allowed users to set either the op or flags for the operation. We also then store the operation in the bi_rw/cmd_flags field similar to how we used to store the bio ioprio where it sat in the upper bits of the field. Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: convert is_sync helpers to use REQ_OPs.Mike Christie1-3/+3
This patch converts the is_sync helpers to use separate variables for the operation and flags. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: convert merge/insert code to check for REQ_OPs.Mike Christie1-10/+10
This patch converts the block layer merging code to use separate variables for the operation and flags, and to check req_op for the REQ_OP. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block discard: use bio set op accessorMike Christie1-1/+2
This converts the block issue discard helper and users to use the bio_set_op_attrs accessor and only pass in the operation flags like REQ_SEQURE. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block: add REQ_OP definitions and helpersMike Christie1-1/+9
The following patches separate the operation (WRITE, READ, DISCARD, etc) from the rq_flag_bits flags. This patch adds definitions for request/bio operations (REQ_OPs) and adds request/bio accessors to get/set the op. In this patch the REQ_OPs match the REQ rq_flag_bits ones for compat reasons while all the code is converted to use the op accessors in the set. In the last patches the op will become a number and the accessors and helpers in this patch will be dropped or updated. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-27Merge tag 'dax-misc-for-4.7' of ↵Linus Torvalds1-1/+14
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c
2016-05-18dax: enable dax in the presence of known media errors (badblocks)Dan Williams1-1/+1
1/ If a mapping overlaps a bad sector fail the request. 2/ Do not opportunistically report more dax-capable capacity than is requested when errors present. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [vishal: fix a conflict with system RAM collision patches] [vishal: add a 'size' parameter to ->direct_access] [vishal: fix a conflict with DAX alignment check patches] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+2
Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...
2016-05-17block: Update blkdev_dax_capable() for consistencyToshi Kani1-0/+1
blkdev_dax_capable() is similar to bdev_dax_supported(), but needs to remain as a separate interface for checking dax capability of a raw block device. Rename and relocate blkdev_dax_capable() to keep them maintained consistently, and call bdev_direct_access() for the dax capability check. There is no change in the behavior. Link: https://lkml.org/lkml/2016/5/9/950 Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@fb.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17block: Add bdev_dax_supported() for dax mount checksToshi Kani1-0/+1
DAX imposes additional requirements to a device. Add bdev_dax_supported() which performs all the precondition checks necessary for filesystem to mount the device with dax option. Also add a new check to verify if a partition is aligned by 4KB. When a partition is unaligned, any dax read/write access fails, except for metadata update. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@fb.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17block: Add vfs_msg() interfaceToshi Kani1-0/+11
In preparation of moving DAX capability checks to the block layer from filesystem code, add a VFS message interface that aligns with filesystem's message format. For instance, a vfs_msg() message followed by XFS messages in case of a dax mount error may look like: VFS (pmem0p1): error: unaligned partition for dax XFS (pmem0p1): DAX unsupported by block device. Turning off DAX. XFS (pmem0p1): Mounting V5 Filesystem : vfs_msg() is largely based on ext4_msg(). Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@fb.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-02block: add __blkdev_issue_discardChristoph Hellwig1-0/+2
This is a version of blkdev_issue_discard which doesn't wait for the I/O to complete, but instead allows the caller to submit the final bio and/or chain it to others. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>