2017-06-27  lightnvm: pblk: set mempool and workqueue params.  (Javier González, 4 files, -20/+44)
Define constants for the sizes of the internal mempools and workqueues. In the process, adjust the values to be more meaningful given the internal constraints of the FTL. To do this for workqueues, split the current auxiliary workqueue into two dedicated workqueues that manage lines being closed and bad blocks, respectively. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
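As a rough illustration of the kind of change involved (the constant and field names below are assumptions, not the ones this patch introduces), dedicated workqueues with explicit limits can be created like this:

    /* illustrative constants; the real patch defines its own values */
    #define PBLK_CLOSE_WQ_MAX_ACTIVE    1
    #define PBLK_BB_WQ_MAX_ACTIVE       1

    pblk->close_wq = alloc_workqueue("pblk-close-wq",
                WQ_MEM_RECLAIM | WQ_UNBOUND, PBLK_CLOSE_WQ_MAX_ACTIVE);
    pblk->bb_wq = alloc_workqueue("pblk-bb-wq",
                WQ_MEM_RECLAIM | WQ_UNBOUND, PBLK_BB_WQ_MAX_ACTIVE);
    if (!pblk->close_wq || !pblk->bb_wq)
        return -ENOMEM;    /* real code would also free whichever allocation succeeded */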
2017-06-27  lightnvm: pblk: redesign GC algorithm  (Javier González, 6 files, -278/+368)
At the moment, in order to get enough read parallelism, we recycle several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus not maintaining a sustainable free/recycled line ratio. The new design relies on a two-level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows us to (i) move data from only one line at a time, thus maintaining a sane free/recycled ratio, and (ii) keep the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: add lock assertions on helpers  (Javier González, 1 file, -0/+4)
Add lockdep assertions on helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
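The pattern being added is the standard lockdep annotation; a minimal sketch, with made-up pblk names, looks like this:

    /* helper that must be called with the write-buffer lock held */
    static void pblk_rb_helper(struct pblk_rb *rb)
    {
        lockdep_assert_held(&rb->w_lock);    /* no-op unless lockdep is enabled */
        /* ... touch state protected by w_lock ... */
    }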
2017-06-27  lightnvm: pblk: cleanup unnecessary code  (Javier González, 2 files, -7/+0)
Cleanup unnecessary headers and code lines. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: set metadata list for all I/Os  (Javier González, 2 files, -38/+54)
Set a DMA area for all I/Os in order to read/write from/to the metadata stored in the per-sector out-of-band area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: choose optimal victim GC line  (Javier González, 1 file, -1/+15)
At the moment, we separate the closed lines into three different lists based on their number of valid sectors. GC recycles lines from each list based on capacity. Lines from each list are taken in a FIFO fashion. Since the number of lines is limited (it corresponds to the number of blocks in a LUN, which is somewhere between 1000-2000), we can afford scanning the lists to choose the optimal line to be recycled. This helps especially with lines that have a high number of valid sectors. If the number of blocks per LUN increases, we will consider a more efficient policy. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
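Conceptually, the scan amounts to a linear minimum search over the candidate list; a sketch with assumed field and function names:

    static struct pblk_line *pblk_gc_pick_victim(struct list_head *gc_list)
    {
        struct pblk_line *line, *victim = NULL;

        /* pick the line with the fewest valid sectors, i.e. the cheapest to move */
        list_for_each_entry(line, gc_list, list)
            if (!victim || line->nr_valid < victim->nr_valid)
                victim = line;

        return victim;    /* NULL when the list is empty */
    }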
2017-06-27  lightnvm: pblk: decouple bad block from line alloc  (Javier González, 1 file, -16/+37)
Decouple bad block discovery from the line allocation logic. This allows returning meaningful error codes in case bad block discovery fails. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: simplify meta. memory allocation  (Javier González, 4 files, -8/+8)
smeta size will always be suitable for a kmalloc allocation. Simplify the code and leave the vmalloc fallback only for emeta, where the pblk configuration has an impact. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: issue multiplane reads if possible  (Javier González, 4 files, -12/+51)
If a read request is sequential and its size aligns with a multi-plane page size, use the multi-plane hint to process the I/O in parallel in the controller. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: delete redundant buffer pointer  (Javier González, 7 files, -41/+11)
After refactoring the metadata path, the backpointer controlling synced I/Os in a line becomes unnecessary; metadata is scheduled on the write thread, thus we know when the end of the line is reached and act on it directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: delete redundant debug line stat  (Javier González, 1 file, -5/+3)
Remove a legacy variable that helped verifying the consistency of the run-time metadata for the free line list. With the new metadata layout, this check is no longer necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: sched. metadata on write thread  (Javier González, 8 files, -285/+673)
At the moment, line metadata is persisted on a separate workqueue that is kicked each time a line is closed. The assumption when designing this was that freeing the write thread from creating a new write request was better than the potential impact of writes colliding on the media (user I/O and metadata I/O). Experimentation has proven that this assumption is wrong; collisions can cost up to 25% of bandwidth and introduce long tail latencies on the write thread, which potentially cause user write threads to spend more time spinning to get a free entry on the write buffer. This patch moves the metadata logic to the write thread. When a line is closed, the remaining metadata is written in memory and placed on a metadata queue. The write thread then takes the metadata corresponding to the previous line, creates the write request and schedules it to minimize collisions on the media. Using this approach, we see that we can saturate the media's bandwidth, which helps reduce both write latencies and the spinning time for user writer threads. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: rename read request pool  (Javier González, 5 files, -37/+38)
Read requests allocate some extra memory to store their per-I/O context. Instead of requiring yet another memory pool for other types of requests, generalize this context allocation (and change the naming accordingly). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: generalize erase path  (Javier González, 6 files, -90/+116)
Erase I/Os are scheduled with the following goals in mind: (i) minimize LUN collisions with write I/Os, and (ii) even out the cost of erasing across writes, instead of putting all the burden on when garbage collection runs. This works well in the current design, but is specific to the default mapping algorithm. This patch generalizes the erase path so that other mapping algorithms can select an arbitrary line to be erased instead. It also gets rid of the erase semaphore, since it creates jitter for user writes. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: pblk: expose max sec per write on sysfs  (Javier González, 4 files, -1/+48)
Allow configuring the maximum number of sectors per write command through sysfs. This makes it easier to tune write command sizes for different controller configurations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
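A hedged sketch of what the store side of such a knob typically looks like; the attribute wiring and the field names are assumptions about pblk's private sysfs code, not the actual patch:

    static ssize_t pblk_sysfs_set_sec_per_write(struct pblk *pblk,
                                                const char *page, size_t len)
    {
        unsigned int sec_per_write;

        if (kstrtouint(page, 0, &sec_per_write))
            return -EINVAL;

        /* must be a non-zero multiple of the minimal write size and
         * must not exceed the controller maximum */
        if (!sec_per_write || sec_per_write > pblk->max_write_pgs ||
            sec_per_write % pblk->min_write_pgs)
            return -EINVAL;

        pblk->sec_per_write = sec_per_write;
        return len;
    }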
2017-06-27  lightnvm: pblk: add debug stat for read cache hits  (Javier González, 4 files, -1/+10)
Add a new debug counter to measure cache hits on the read path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
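The counter itself is the usual debug-only atomic bump; roughly (the counter name and config guard are assumptions):

    #ifdef CONFIG_NVM_DEBUG
        atomic_long_inc(&pblk->cache_reads);    /* read fully served from the write buffer */
    #endif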
2017-06-27  lightnvm: pblk: spare double cpu_to_le64 calc.  (Javier González, 2 files, -4/+5)
Spare a double calculation on the fast write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27  lightnvm: propagate right error code to target  (Javier González, 1 file, -1/+1)
If nvme_alloc_request fails, propagate the right error, instead of assuming ENOMEM. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
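The general pattern for this kind of fix, shown as a sketch rather than the exact hunk:

    rq = nvme_alloc_request(q, cmd, 0, NVME_QID_ANY);
    if (IS_ERR(rq))
        return PTR_ERR(rq);    /* previously: return -ENOMEM; */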
2017-06-27  lightnvm: re-convert ppa format on I/O failure  (Javier González, 1 file, -1/+7)
In case of a failure when submitting a request, convert the ppa_list addresses back to the target format so that the target can interpret the PPAs for recovery. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-23  mtip32xx: fix up the checking for internal command failure  (Jens Axboe, 1 file, -17/+4)
This fixes up two commits that have touched this driver. The command status field is now a blk_status_t, so we can't check for < 0 and we definitely can't assume it's holding -Exxxx error values. All we care about here is whether ->status is zero or not. Check for that, and remove the various attempts at smart error reporting. Just log to dmesg what command failed, and the blk_status_t value. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 2a842acab109 ("block: introduce new block status code type") Fixes: 3f5e6a35774c ("mtip32xx: convert internal command issue to block IO path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-23  Merge commit '8e8320c9315c' into for-4.13/block  (Jens Axboe, 27 files, -148/+275)
Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22  blk-mq: remove double set queue_num  (weiping, 1 file, -1/+0)
The hctx's queue_num has already been set prior to calling blk_mq_init_hctx(), so there is no need to set it again. Signed-off-by: weiping <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  blk-mq: Make it safe to quiesce and unquiesce from an interrupt handler  (Bart Van Assche, 2 files, -11/+19)
Since blk_mq_quiesce_queue_nowait() can be called from interrupt context, make this safe. Since this function is not in the hot path, uninline it. Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
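The interrupt-safety part boils down to taking the queue lock with the irqsave variant; a sketch of the idea (the uninlining is a separate, mechanical move):

    unsigned long flags;

    spin_lock_irqsave(q->queue_lock, flags);    /* q->queue_lock is a pointer in this era */
    queue_flag_set(QUEUE_FLAG_QUIESCED, q);
    spin_unlock_irqrestore(q->queue_lock, flags);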
2017-06-21  block: Fix off-by-one errors in blk_status_to_errno() and print_req_error()  (Bart Van Assche, 1 file, -2/+2)
This was detected by the smatch static analyzer. Fixes: commit 2a842acab109 ("block: introduce new block status code type") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
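The usual way to bound such a table lookup, shown as an illustration (the table and member names mirror what blk-core's error table looks like, but this is a sketch, not the patched hunk):

    int blk_status_to_errno(blk_status_t status)
    {
        int idx = (__force int)status;

        /* ">=" is the correct bound; ">" would allow one slot past the end */
        if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
            return -EIO;
        return blk_errors[idx].errno;
    }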
2017-06-21  block: Declare local symbols static  (Bart Van Assche, 1 file, -1/+1)
When building with W=1, the compiler complains that declarations for bounce_bio_set and bounce_bio_split are missing; avoid this by declaring these local symbols static. References: commit a8821f3f32be ("block: Improvements to bounce-buffer handling") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Neil Brown <neilb@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Add fallthrough markers to switch statements  (Bart Van Assche, 3 files, -1/+4)
This patch suppresses gcc 7 warnings about falling through in switch statements when building with W=1. From the gcc documentation: The -Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
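For reference, the marker gcc 7 recognizes is a comment placed immediately before the next label; a generic, hypothetical example (the case names and helpers are made up):

    switch (cmd) {
    case CMD_PREPARE:
        prepare(req);        /* prepared commands are also submitted */
        /* fall through */
    case CMD_SUBMIT:
        submit(req);
        break;
    default:
        return -EINVAL;
    }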
2017-06-21  blk-mq: fix performance regression with shared tags  (Jens Axboe, 4 files, -24/+61)
If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices. Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to. Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS. Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
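The core idea, sketched with approximate names (the real patch also handles marking and unmarking the counter carefully):

    /* completion path: skip the expensive scan when no hctx asked for a restart */
    if (!atomic_read(&q->shared_hctx_restart))
        return;

    queue_for_each_hw_ctx(q, hctx, i) {
        if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
            atomic_dec(&q->shared_hctx_restart);
            blk_mq_run_hw_queue(hctx, true);
        }
    }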
2017-06-21  btrfs: use new block error code  (Dan Carpenter, 1 file, -1/+1)
This function is supposed to return blk_status_t error codes now but there was a stray -ENOMEM left behind. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  blk-mq: Warn when attempting to run a hardware queue that is not mapped  (Bart Van Assche, 1 file, -3/+5)
A queue must be frozen while the mapped state of a hardware queue is changed. Additionally, any change of the mapped state is followed by a call to blk_mq_map_swqueue() (see also blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()). Since blk_mq_map_swqueue() does not map any unmapped hardware queue onto any software queue, no attempt will be made to run an unmapped hardware queue. Hence issue a warning upon attempts to run an unmapped hardware queue. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
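The guard itself is small; something along these lines (sketch):

    /* an unmapped hctx has no software queues and must never be run */
    if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
        return;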
2017-06-21  block: Constify disk_type  (Bart Van Assche, 1 file, -2/+2)
The variable 'disk_type' is never modified so constify it. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  blk-mq: Document locking assumptions  (Bart Van Assche, 2 files, -0/+6)
Document the locking assumptions in functions that modify blk_mq_ctx.rq_list to make it easier for humans to verify this code. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Document what queue type each function is intended for  (Bart Van Assche, 2 files, -0/+35)
Some functions in block/blk-core.c must only be used on blk-sq queues while others are safe to use against any queue type. Document which functions are intended for blk-sq queues and issue a warning if the blk-sq API is misused. This not only helps block driver authors but will also make it easier to remove the blk-sq code once that code is declared obsolete. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Check locking assumptions at runtime  (Bart Van Assche, 5 files, -37/+64)
Instead of documenting the locking assumptions of most block layer functions as a comment, use lockdep_assert_held() to verify locking assumptions at runtime. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Add a comment above queue_lockdep_assert_held()  (Bart Van Assche, 1 file, -0/+7)
Add a comment above the queue_lockdep_assert_held() macro that explains the purpose of the q->queue_lock test. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  blk-mq: Initialize .rq_flags in blk_mq_rq_ctx_init()  (Bart Van Assche, 1 file, -1/+2)
Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init() is called after a value has been assigned to .rq_flags and .rq_flags is initialized in __blk_mq_finish_request(). Initialize .rq_flags in blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request(). Moving the initialization of .rq_flags is fine because all changes and tests of .rq_flags occur between blk_get_request() and finishing a request. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Change argument type of scsi_req_init()  (Bart Van Assche, 5 files, -7/+13)
Since scsi_req_init() works on a struct scsi_request, change the argument type into struct scsi_request *. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Make most scsi_req_init() calls implicit  (Bart Van Assche, 25 files, -33/+20)
Instead of explicitly calling scsi_req_init() after blk_get_request(), call that function from inside blk_get_request(). Add an .initialize_rq_fn() callback function to the block drivers that need it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn() because it is too small to keep it as a separate function. Keep the scsi_req_init() call in ide_prep_sense() because it follows a blk_rq_init() call. References: commit 82ed4db499b8 ("block: split scsi_request out of struct request") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: Introduce request_queue.initialize_rq_fn()  (Bart Van Assche, 3 files, -4/+19)
Several block drivers need to initialize the driver-private request data after having called blk_get_request() and before .prep_rq_fn() is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid having to repeat that initialization code after every blk_get_request() call by adding new callback functions to struct request_queue and to struct blk_mq_ops. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
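A hedged sketch of how a legacy-path driver might use the new hook; the driver names here are made up, and blk-mq drivers would set the equivalent field in struct blk_mq_ops instead:

    static void mydrv_initialize_rq(struct request *rq)
    {
        struct scsi_request *sreq = scsi_req(rq);

        scsi_req_init(sreq);    /* plus any driver-private initialization */
    }

    /* during queue setup (legacy request_fn path) */
    q->initialize_rq_fn = mydrv_initialize_rq;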
2017-06-21  block: Make request operation type argument declarations consistent  (Bart Van Assche, 4 files, -15/+17)
Instead of declaring the second argument of blk_*_get_request() as int and passing it to functions that expect an unsigned int, declare that second argument as unsigned int. Also, for consistency, rename that second argument from 'rw' into 'op'. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  blk-mq: Reduce blk_mq_hw_ctx size  (Bart Van Assche, 2 files, -10/+25)
Since the srcu structure is rather large (184 bytes on an x86-64 system with kernel debugging disabled), only allocate it if needed. Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
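The gist of the change, as a sketch with approximate naming (the actual patch places the srcu_struct as a flexible array member at the end of blk_mq_hw_ctx):

    struct blk_mq_hw_ctx *hctx;
    size_t size = sizeof(*hctx);

    if (set->flags & BLK_MQ_F_BLOCKING)
        size += sizeof(struct srcu_struct);    /* only blocking hctxs pay for SRCU */

    hctx = kzalloc_node(size, GFP_KERNEL, node);
    if (hctx && (set->flags & BLK_MQ_F_BLOCKING))
        init_srcu_struct(hctx->queue_rq_srcu);    /* error handling elided */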
2017-06-21  block: stop using bio_data() in blk_write_same_mergeable  (Christoph Hellwig, 1 file, -1/+2)
While the Write Same page currently always is in low memory, it is just as easy and safer to compare the page and offset directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21  block: remove the unused bio_to_phys macro  (Christoph Hellwig, 1 file, -1/+0)
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20  null_blk: add support for shared tags  (Jens Axboe, 1 file, -42/+70)
Some storage drivers need to share tag sets between devices. It's useful to be able to model that with null_blk, to find hangs or performance issues. Add a 'shared_tags' bool module parameter. If it is set to true and nr_devices is bigger than 1, all allocated devices will share the same tag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
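The parameter plumbing follows the standard module_param pattern; a sketch of that part (the tag-set sharing logic itself is the larger piece of the patch):

    static bool shared_tags;
    module_param(shared_tags, bool, 0444);
    MODULE_PARM_DESC(shared_tags, "Share tag set between devices for blk-mq");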
2017-06-20  btrfs: nowait aio support  (Goldwyn Rodrigues, 2 files, -6/+30)
Return EAGAIN if any of the following checks fail:
+ i_rwsem is not lockable
+ NODATACOW or PREALLOC is not set
+ Cannot nocow at the desired location
+ Writing beyond end of file which is not allocated
Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20  xfs: nowait aio support  (Goldwyn Rodrigues, 2 files, -6/+48)
If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation, either due to file extension, writing to a hole, or COW, or if it has to wait for other DIOs to finish. Return -EAGAIN if we don't have the extent list in memory. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20  ext4: nowait aio support  (Goldwyn Rodrigues, 1 file, -6/+29)
Return EAGAIN for direct I/O if any of the following conditions hold:
+ i_rwsem is not immediately lockable
+ The write extends beyond the end of the file (will trigger allocation)
+ Blocks are not allocated at the write location
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
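The i_rwsem part of this follows the common IOCB_NOWAIT pattern; a minimal sketch (ext4's full set of checks is richer than this):

    if (iocb->ki_flags & IOCB_NOWAIT) {
        if (!inode_trylock(inode))
            return -EAGAIN;
    } else {
        inode_lock(inode);
    }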
2017-06-20  block: return on congested block device  (Goldwyn Rodrigues, 5 files, -4/+42)
A new bio operation flag REQ_NOWAIT is introduced to identify bios originating from an iocb with IOCB_NOWAIT. This flag indicates to return immediately if a request cannot be made, instead of retrying. Stacked devices such as md (the ones with make_request_fn hooks) are currently not supported because they may block for housekeeping. For example, an md can have a part of the device suspended. For this reason, only request-based devices are supported. In the future, this feature will be expanded to stacked devices by teaching them how to handle the REQ_NOWAIT flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
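On the submission side, the flag is simply carried over from the iocb onto the bio; roughly:

    if (iocb->ki_flags & IOCB_NOWAIT)
        bio->bi_opf |= REQ_NOWAIT;    /* ask the block layer not to sleep for a request */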
2017-06-20  fs: Introduce IOMAP_NOWAIT  (Goldwyn Rodrigues, 2 files, -0/+9)
IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps. This is used by XFS in the XFS patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
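The translation is essentially a one-liner where the iomap flags are assembled; roughly:

    unsigned flags = IOMAP_WRITE | IOMAP_DIRECT;

    if (iocb->ki_flags & IOCB_NOWAIT)
        flags |= IOMAP_NOWAIT;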
2017-06-20  fs: return if direct I/O will trigger writeback  (Goldwyn Rodrigues, 1 file, -7/+25)
Find out if the I/O will trigger a wait due to writeback. If yes, return -EAGAIN. Return -EINVAL for buffered AIO: there are multiple causes of delay such as page locks, dirty throttling logic, page loading from disk etc. which cannot be taken care of. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20  fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT  (Goldwyn Rodrigues, 3 files, -1/+18)
RWF_NOWAIT informs the kernel to bail out if an AIO request will block for reasons such as file allocations, a writeback being triggered, or blocking while allocating requests when performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. FMODE_AIO_NOWAIT is a flag which identifies that the opened file is capable of returning -EAGAIN if the AIO call will block. It must be set by supporting filesystems in the ->open() call. Support for xfs, btrfs and ext4 is added in the following patches. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
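The flag translation at the AIO entry point looks roughly like this (sketch):

    if (rw_flags & RWF_NOWAIT) {
        if (!(filp->f_mode & FMODE_AIO_NOWAIT))
            return -EINVAL;        /* filesystem has not opted in */
        iocb->ki_flags |= IOCB_NOWAIT;
    }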