<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/block/blk-flush.c, branch v3.13.3</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v3.13.3</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v3.13.3'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2013-11-24T23:33:41+00:00</updated>
<entry>
<title>block: submit_bio_wait() conversions</title>
<updated>2013-11-24T23:33:41+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kmo@daterainc.com</email>
</author>
<published>2013-11-24T23:32:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c170bbb45febc03ac4d34ba2b8bb55e06104b7e7'/>
<id>urn:sha1:c170bbb45febc03ac4d34ba2b8bb55e06104b7e7</id>
<content type='text'>
It was being open coded in a few places.

Signed-off-by: Kent Overstreet &lt;kmo@daterainc.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Joern Engel &lt;joern@logfs.org&gt;
Cc: Prasad Joshi &lt;prasadjoshi.linux@gmail.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Chris Mason &lt;chris.mason@fusionio.com&gt;
Acked-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-mq: fix for flush deadlock</title>
<updated>2013-10-28T19:33:58+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2013-10-28T19:33:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3228f48be2d19b2dd90db96ec16a40187a2946f3'/>
<id>urn:sha1:3228f48be2d19b2dd90db96ec16a40187a2946f3</id>
<content type='text'>
The flush state machine takes in a struct request, which then is
submitted multiple times to the underling driver.  The old block code
requeses the same request for each of those, so it does not have an
issue with tapping into the request pool.  The new one on the other hand
allocates a new request for each of the actualy steps of the flush
sequence. If have already allocated all of the tags for IO, we will
fail allocating the flush request.

Set aside a reserved request just for flushes.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-mq: new multi-queue block IO queueing mechanism</title>
<updated>2013-10-25T10:56:00+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2013-10-24T08:20:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=320ae51feed5c2f13664aa05a76bec198967e04d'/>
<id>urn:sha1:320ae51feed5c2f13664aa05a76bec198967e04d</id>
<content type='text'>
Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
  request units for IO. The block layer provides various helper
  functionalities to let drivers share code, things like tag
  management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
  block layer and IO submitter. Since this bypasses the IO stack,
  driver generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS was rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
  be able to uniquely identify a request both in the driver and
  to the hardware. The tagging uses per-cpu caches for freed
  tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
  basis. Basically the driver should be able to get a notification,
  if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
  submission queues. blk-mq can redirect IO completions to the
  desired location.

- Support for per-request payloads. Drivers almost always need
  to associate a request structure with some driver private
  command structure. Drivers can tell blk-mq this at init time,
  and then any request handed to the driver will have the
  required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
  gets neither of these. Even for high IOPS devices, merging
  sequential IO reduces per-command overhead and thus
  increases bandwidth.

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li &lt;shli@fusionio.com&gt;
Alexander Gordeev &lt;agordeev@redhat.com&gt;
Christoph Hellwig &lt;hch@infradead.org&gt;
Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Matias Bjorling &lt;m@bjorling.me&gt;
Jeff Moyer &lt;jmoyer@redhat.com&gt;

Acked-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>Block: blk-flush: Fixed indent code style</title>
<updated>2013-03-22T18:22:51+00:00</updated>
<author>
<name>Alice Ferrazzi</name>
<email>alice.ferrazzi@gmail.com</email>
</author>
<published>2013-03-22T17:11:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f2fc7d0eddf86b0233faa34aa5af6780ea48bc08'/>
<id>urn:sha1:f2fc7d0eddf86b0233faa34aa5af6780ea48bc08</id>
<content type='text'>
Fixed code indent should use tabs where possible.

Signed-off-by: Alice Ferrazzi &lt;alice.ferrazzi@gmail.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: account iowait time when waiting for completion of IO request</title>
<updated>2013-02-15T15:45:07+00:00</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2013-02-14T14:19:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5577022f4ed8973762450ebe7fe7ebfd953817db'/>
<id>urn:sha1:5577022f4ed8973762450ebe7fe7ebfd953817db</id>
<content type='text'>
Using wait_for_completion() for waiting for a IO request to be executed
results in wrong iowait time accounting. For example, a system having
the only task doing write() and fdatasync() on a block device can be
reported being idle instead of iowaiting as it should because
blkdev_issue_flush() calls wait_for_completion() which in turn calls
schedule() that does not increment the iowait proc counter and thus does
not turn on iowait time accounting.

The patch makes block layer use wait_for_completion_io() instead of
wait_for_completion() where appropriate to account iowait time
correctly.

Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-flush: move the queue kick into</title>
<updated>2011-10-24T14:24:31+00:00</updated>
<author>
<name>Jeff Moyer</name>
<email>jmoyer@redhat.com</email>
</author>
<published>2011-10-17T10:57:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e67b77c791ca2778198c9e7088f3266ed2da7a55'/>
<id>urn:sha1:e67b77c791ca2778198c9e7088f3266ed2da7a55</id>
<content type='text'>
A dm-multipath user reported[1] a problem when trying to boot
a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
(block: fix flush machinery for stacking drivers with differring
flush flags) applied.  It turns out that an empty flush request
can be sent into blk_insert_flush.  When the BUG_ON was fixed
to allow for this, I/O on the underlying device would stall.  The
reason is that blk_insert_cloned_request does not kick the queue.
In the aforementioned commit, I had added a special case to
kick the queue if data was sent down but the queue flags did
not require a flush.  A better solution is to push the queue
kick up into blk_insert_cloned_request.

This patch, along with a follow-on which fixes the BUG_ON, fixes
the issue reported.

[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

Reported-by: Christophe Saout &lt;christophe@saout.de&gt;
Signed-off-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;

Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-flush: fix invalid BUG_ON in blk_insert_flush</title>
<updated>2011-10-24T14:24:30+00:00</updated>
<author>
<name>Jeff Moyer</name>
<email>jmoyer@redhat.com</email>
</author>
<published>2011-10-17T10:57:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=834f9f61a525d2f6d3d0c93894e26326c8d3ceed'/>
<id>urn:sha1:834f9f61a525d2f6d3d0c93894e26326c8d3ceed</id>
<content type='text'>
A user reported a regression due to commit
4853abaae7e4a2af938115ce9071ef8684fb7af4 (block: fix flush
machinery for stacking drivers with differring flush flags).
Part of the problem is that blk_insert_flush required a
single bio be attached to the request.  In reality, having
no attached bio is also a valid case, as can be observed with
an empty flush.

[1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

Reported-by: Christophe Saout &lt;christophe@saout.de&gt;
Signed-off-by: Jeff Moyer &lt;jmoyer@redhat.com
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;

Stable note: 3.1
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: fix flush machinery for stacking drivers with differring flush flags</title>
<updated>2011-08-15T19:37:25+00:00</updated>
<author>
<name>Jeff Moyer</name>
<email>jmoyer@redhat.com</email>
</author>
<published>2011-08-15T19:37:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4853abaae7e4a2af938115ce9071ef8684fb7af4'/>
<id>urn:sha1:4853abaae7e4a2af938115ce9071ef8684fb7af4</id>
<content type='text'>
Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
        struct request *rq;

        while (1) {
-               while (!list_empty(&amp;q-&gt;queue_head)) {
+               if (!list_empty(&amp;q-&gt;queue_head)) {
                        rq = list_entry_rq(q-&gt;queue_head.next);
-                       if (!(rq-&gt;cmd_flags &amp; (REQ_FLUSH | REQ_FUA)) ||
-                           (rq-&gt;cmd_flags &amp; REQ_FLUSH_SEQ))
-                               return rq;
-                       rq = blk_do_flush(q, rq);
-                       if (rq)
-                               return rq;
+                       return rq;
                }

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q-&gt;flush_flags; /* may change, cache it */
        bool has_flush = fflags &amp; REQ_FLUSH, has_fua = fflags &amp; REQ_FUA;
        bool do_preflush = has_flush &amp;&amp; (rq-&gt;cmd_flags &amp; REQ_FLUSH);
        bool do_postflush = has_flush &amp;&amp; !has_fua &amp;&amp; (rq-&gt;cmd_flags &amp;
        REQ_FUA);
        unsigned skip = 0;
...
        if (blk_rq_sectors(rq) &amp;&amp; !do_preflush &amp;&amp; !do_postflush) {
                rq-&gt;cmd_flags &amp;= ~REQ_FLUSH;
		if (!has_fua)
			rq-&gt;cmd_flags &amp;= ~REQ_FUA;
	        return rq;
	}

So, the flush machinery was bypassed in such cases (q-&gt;flush_flags == 0
&amp;&amp; rq-&gt;cmd_flags &amp; (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all.  Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.

The agreed upon approach is to fix the flush machinery to allow
stacking.  While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.

In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req-&gt;end_io).  Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers.  So, I didn't see a way around the additional field.

I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance.  Comments and other testers, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;jaxboe@fusionio.com&gt;
</content>
</entry>
<entry>
<title>allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH</title>
<updated>2011-08-09T18:32:09+00:00</updated>
<author>
<name>Jeff Moyer</name>
<email>jmoyer@redhat.com</email>
</author>
<published>2011-08-09T18:32:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fa1bf42ff9296ac4cf211b0a1b450a6071d26a95'/>
<id>urn:sha1:fa1bf42ff9296ac4cf211b0a1b450a6071d26a95</id>
<content type='text'>
blk_insert_flush has the following check:

	/*
	 * If there's data but flush is not necessary, the request can be
	 * processed directly without going through flush machinery.  Queue
	 * for normal execution.
	 */
	if ((policy &amp; REQ_FSEQ_DATA) &amp;&amp;
	    !(policy &amp; (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
		list_add_tail(&amp;rq-&gt;queuelist, &amp;q-&gt;queue_head);
		return;
	}

However, blk_flush_policy will not return with policy set to only
REQ_FSEQ_DATA:

static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
	unsigned int policy = 0;

	if (fflags &amp; REQ_FLUSH) {
		if (rq-&gt;cmd_flags &amp; REQ_FLUSH)
			policy |= REQ_FSEQ_PREFLUSH;
		if (blk_rq_sectors(rq))
			policy |= REQ_FSEQ_DATA;
		if (!(fflags &amp; REQ_FUA) &amp;&amp; (rq-&gt;cmd_flags &amp; REQ_FUA))
			policy |= REQ_FSEQ_POSTFLUSH;
	}
	return policy;
}

Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set.  Fix this
mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
check.

Tejun notes:

  Hmmm... yes, this can become a correctness issue if (and only if)
  blk_queue_flush() is called to change q-&gt;flush_flags while requests
  are in-flight; otherwise, requests wouldn't reach the function at all.
  Also, I think it would be a generally good idea to always set
  FSEQ_DATA if the request has data.

Cheers,
Jeff

Signed-off-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;jaxboe@fusionio.com&gt;
</content>
</entry>
<entry>
<title>block: hold queue if flush is running for non-queueable flush drive</title>
<updated>2011-05-06T17:36:25+00:00</updated>
<author>
<name>shaohua.li@intel.com</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-05-06T17:34:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3ac0cc4508709d42ec9aa351086c7d38bfc0660c'/>
<id>urn:sha1:3ac0cc4508709d42ec9aa351086c7d38bfc0660c</id>
<content type='text'>
In some drives, flush requests are non-queueable. When flush request is
running, normal read/write requests can't run. If block layer dispatches
such request, driver can't handle it and requeue it.  Tejun suggested we
can hold the queue when flush is running. This can avoid unnecessary
requeue.  Also this can improve performance. For example, we have
request flush1, write1, flush 2. flush1 is dispatched, then queue is
hold, write1 isn't inserted to queue. After flush1 is finished, flush2
will be dispatched. Since disk cache is already clean, flush2 will be
finished very soon, so looks like flush2 is folded to flush1.

In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

which causes about 20% regression running a sysbench fileio
workload.

Stable: 2.6.39 only

Cc: stable@kernel.org
Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;jaxboe@fusionio.com&gt;
</content>
</entry>
</feed>
