summaryrefslogtreecommitdiff
path: root/drivers/block/xen-blkback
AgeCommit message (Collapse)AuthorFilesLines
2015-02-13Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2-2/+11
Pull block driver changes from Jens Axboe: "This contains: - The 4k/partition fixes for brd from Boaz/Matthew. - A few xen front/back block fixes from David Vrabel and Roger Pau Monne. - Floppy changes from Takashi, cleaning the device file creation. - Switching libata to use the new blk-mq tagging policy, removing code (and a suboptimal implementation) from libata. This will throw you a merge conflict, since a bug in the original libata tagging code was fixed since this code was branched. Trivial. From Shaohua. - Conversion of loop to blk-mq, from Ming Lei. - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra. He claims it improves on unreadable code, which will cost him a beer. - Maintainer update or NDB, now handled by Markus Pargmann. - NVMe: - Optimization from me that avoids a kmalloc/kfree per IO for smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU overhead. - Removal of (now) dead RCU code, a relic from before NVMe was converted to blk-mq" * 'for-3.20/drivers' of git://git.kernel.dk/linux-block: xen-blkback: default to X86_32 ABI on x86 xen-blkfront: fix accounting of reqs when migrating xen-blkback,xen-blkfront: add myself as maintainer block: Simplify bsg complete all floppy: Avoid manual call of device_create_file() NVMe: avoid kmalloc/kfree for smaller IO MAINTAINERS: Update NBD maintainer libata: make sata_sil24 use fifo tag allocator libata: move sas ata tag allocation to libata-scsi.c libata: use blk taging NVMe: within nvme_free_queues(), delete RCU sychro/deferred free null_blk: suppress invalid partition info brd: Request from fdisk 4k alignment brd: Fix all partitions BUGs axonram: Fix bug in direct_access loop: add blk-mq.h include block: loop: don't handle REQ_FUA explicitly block: loop: introduce lo_discard() and lo_req_flush() block: loop: say goodby to bio block: loop: improve performance via blk-mq
2015-02-10xen-blkback: default to X86_32 ABI on x86David Vrabel2-2/+11
Prior to the existance of 64-bit backends using the X86_64 ABI, frontends used the X86_32 ABI. These old frontends do not specify the ABI and when used with a 64-bit backend do not work. On x86, default to the X86_32 ABI if one is not specified. Backends on ARM continue to default to their NATIVE ABI. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2015-01-28xen-blkback: safely unmap grants in case they are still in useJennifer Herbert2-50/+122
Use gnttab_unmap_refs_async() to wait until the mapped pages are no longer in use before unmapping them. This allows blkback to use network storage which may retain refs to pages in queued skbs after the block I/O has completed. Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jens Axboe <axboe@kernel.de> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-28xen/grant-table: add helpers for allocating pagesDavid Vrabel1-4/+4
Add gnttab_alloc_pages() and gnttab_free_pages() to allocate/free pages suitable to for granted maps. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-10-18Merge branch 'for-3.18/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2-3/+4
Pull block layer driver update from Jens Axboe: "This is the block driver pull request for 3.18. Not a lot in there this round, and nothing earth shattering. - A round of drbd fixes from the linbit team, and an improvement in asender performance. - Removal of deprecated (and unused) IRQF_DISABLED flag in rsxx and hd from Michael Opdenacker. - Disable entropy collection from flash devices by default, from Mike Snitzer. - A small collection of xen blkfront/back fixes from Roger Pau Monné and Vitaly Kuznetsov" * 'for-3.18/drivers' of git://git.kernel.dk/linux-block: block: disable entropy contributions for nonrot devices xen, blkfront: factor out flush-related checks from do_blkif_request() xen-blkback: fix leak on grant map error path xen/blkback: unmap all persistent grants when frontend gets disconnected rsxx: Remove deprecated IRQF_DISABLED block: hd: remove deprecated IRQF_DISABLED drbd: use RB_DECLARE_CALLBACKS() to define augment callbacks drbd: compute the end before rb_insert_augmented() drbd: Add missing newline in resync progress display in /proc/drbd drbd: reduce lock contention in drbd_worker drbd: Improve asender performance drbd: Get rid of the WORK_PENDING macro drbd: Get rid of the __no_warn and __cond_lock macros drbd: Avoid inconsistent locking warning drbd: Remove superfluous newline from "resync_extents" debugfs entry. drbd: Use consistent names for all the bi_end_io callbacks drbd: Use better variable names
2014-10-06xen: remove DEFINE_XENBUS_DRIVER() macroDavid Vrabel1-8/+3
The DEFINE_XENBUS_DRIVER() macro looks a bit weird and causes sparse errors. Replace the uses with standard structure definitions instead. This is similar to pci and usb device registration. Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-02xen-blkback: fix leak on grant map error pathRoger Pau Monné1-0/+1
Fix leaking a page when a grant mapping has failed. CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-and-Tested-by: Tao Chen <boby.chen@huawei.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-10-02xen/blkback: unmap all persistent grants when frontend gets disconnectedVitaly Kuznetsov1-3/+3
blkback does not unmap persistent grants when frontend goes to Closed state (e.g. when blkfront module is being removed). This leads to the following in guest's dmesg: [ 343.243825] xen:grant_table: WARNING: g.e. 0x445 still in use! [ 343.243825] xen:grant_table: WARNING: g.e. 0x42a still in use! ... When load module -> use device -> unload module sequence is performed multiple times it is possible to hit BUG() condition in blkfront module: [ 343.243825] kernel BUG at drivers/block/xen-blkfront.c:954! [ 343.243825] invalid opcode: 0000 [#1] SMP [ 343.243825] Modules linked in: xen_blkfront(-) ata_generic pata_acpi [last unloaded: xen_blkfront] ... [ 343.243825] Call Trace: [ 343.243825] [<ffffffff814111ef>] ? unregister_xenbus_watch+0x16f/0x1e0 [ 343.243825] [<ffffffffa0016fbf>] blkfront_remove+0x3f/0x140 [xen_blkfront] ... [ 343.243825] RIP [<ffffffffa0016aae>] blkif_free+0x34e/0x360 [xen_blkfront] [ 343.243825] RSP <ffff88001eb8fdc0> We don't need to keep these grants if we're disconnecting as frontend might already forgot about them. Solve the issue by moving xen_blkbk_free_caches() call from xen_blkif_free() to xen_blkif_disconnect(). Now we can see the following: [ 928.590893] xen:grant_table: WARNING: g.e. 0x587 still in use! [ 928.591861] xen:grant_table: WARNING: g.e. 0x372 still in use! ... [ 929.592146] xen:grant_table: freeing g.e. 0x587 [ 929.597174] xen:grant_table: freeing g.e. 0x372 ... Backend does not keep persistent grants any more, reconnect works fine. CC: stable@vger.kernel.org Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28xen-blkback: defer freeing blkif to avoid blocking xenwatchValentin Priescu2-14/+36
Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O requests to finish. If the VBD is attached to a hot-swappable disk, then xenwatch can hang for a long period of time, stalling other watches. INFO: task xenwatch:39 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782 ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040 ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040 Call Trace: [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20 [<ffffffff8128f761>] ? list_del+0x11/0x40 [<ffffffff8147a080>] ? wait_for_common+0x60/0x160 [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50 [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20 [<ffffffff8147a26a>] schedule+0x3a/0x60 [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback] [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback] [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90 [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0 [<ffffffff81346488>] device_release_driver+0x28/0x40 [<ffffffff813456e8>] bus_remove_device+0x78/0xe0 [<ffffffff81342c9f>] device_del+0x12f/0x1a0 [<ffffffff81342d2d>] device_unregister+0x1d/0x60 [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback] [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback] [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60 [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0 [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120 [<ffffffff8130b4cb>] frontend_changed+0xb/0x10 [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130 [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40 [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80 [<ffffffff810799d6>] kthread+0x96/0xa0 [<ffffffff81485934>] kernel_thread_helper+0x4/0x10 [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6 [<ffffffff81485930>] ? gs_change+0x13/0x13 With this patch, when there is still pending I/O, the actual disconnect is done by the last reference holder (last pending I/O request). In this case, xenwatch doesn't block indefinitely. Signed-off-by: Valentin Priescu <priescuv@amazon.com> Reviewed-by: Steven Kady <stevkady@amazon.com> Reviewed-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-28xen/blkback: disable discard feature if requested by toolstackOlaf Hering1-1/+6
Newer toolstacks may provide a boolean property "discard-enable" in the backend node. Its purpose is to disable discard for file backed storage to avoid fragmentation. Recognize this setting also for physical storage. If that property exists and is false, do not advertise "feature-discard" to the frontend. Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-12xen-blkback: init persistent_purge_work work_structRoger Pau Monne3-2/+3
Initialize persistent_purge_work work_struct on xen_blkif_alloc (and remove the previous initialization done in purge_persistent_gnt). This prevents flush_work from complaining even if purge_persistent_gnt has not been used. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-02-10Merge branch 'stable/for-jens-3.14' of ↵Jens Axboe3-22/+58
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into for-linus Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git stable/for-jens-3.14 which is based off v3.13-rc6. If you would like me to rebase it on a different branch/tag I would be more than happy to do so. The patches are all bug-fixes and hopefully can go in 3.14. They deal with xen-blkback shutdown and cause memory leaks as well as shutdown races. They should go to stable tree and if you are OK with I will ask them to backport those fixes. There is also a fix to xen-blkfront to deal with unexpected state transition. And lastly a fix to the header where it was using the __aligned__ unnecessarily.
2014-02-07xen-blkif: drop struct blkif_request_segment_alignedRoger Pau Monne2-2/+2
This was wrongly introduced in commit 402b27f9, the only difference between blkif_request_segment_aligned and blkif_request_segment is that the former has a named padding, while both share the same memory layout. Also correct a few minor glitches in the description, including for it to no longer assume PAGE_SIZE == 4096. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> [Description fix by Jan Beulich] Signed-off-by: Jan Beulich <jbeulich@suse.com> Reported-by: Jan Beulich <jbeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Matt Rushton <mrushton@amazon.com> Cc: Matt Wilson <msw@amazon.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07xen-blkback: fix shutdown raceRoger Pau Monne3-10/+24
Introduce a new variable to keep track of the number of in-flight requests. We need to make sure that when xen_blkif_put is called the request has already been freed and we can safely free xen_blkif, which was not the case before. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Matt Rushton <mrushton@amazon.com> Reviewed-by: Matt Rushton <mrushton@amazon.com> Cc: Matt Wilson <msw@amazon.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07xen-blkback: fix memory leaksRoger Pau Monne3-9/+31
I've at least identified two possible memory leaks in blkback, both related to the shutdown path of a VBD: - blkback doesn't wait for any pending purge work to finish before cleaning the list of free_pages. The purge work will call put_free_pages and thus we might end up with pages being added to the free_pages list after we have emptied it. Fix this by making sure there's no pending purge work before exiting xen_blkif_schedule, and moving the free_page cleanup code to xen_blkif_free. - blkback doesn't wait for pending requests to end before cleaning persistent grants and the list of free_pages. Again this can add pages to the free_pages list or persistent grants to the persistent_gnts red-black tree. Fixed by moving the persistent grants and free_pages cleanup code to xen_blkif_free. Also, add some checks in xen_blkif_free to make sure we are cleaning everything. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Matt Rushton <mrushton@amazon.com> Reviewed-by: Matt Rushton <mrushton@amazon.com> Cc: Matt Wilson <msw@amazon.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-07xen-blkback: fix memory leak when persistent grants are usedMatt Rushton1-3/+3
Currently shrink_free_pagepool() is called before the pages used for persistent grants are released via free_persistent_gnts(). This results in a memory leak when a VBD that uses persistent grants is torn down. Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Cc: linux-kernel@vger.kernel.org Cc: xen-devel@lists.xen.org Cc: Anthony Liguori <aliguori@amazon.com> Signed-off-by: Matt Rushton <mrushton@amazon.com> Signed-off-by: Matt Wilson <msw@amazon.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-24block: Abstract out bvec iteratorKent Overstreet1-1/+1
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-11-08xen/blkback: fix reference countingVegard Nossum1-1/+2
If the permission check fails, we drop a reference to the blkif without having taken it in the first place. The bug was introduced in commit 604c499cbbcc3d5fe5fb8d53306aa0fae1990109 (xen/blkback: Check device permissions before allowing OP_DISCARD). Cc: stable@vger.kernel.org Cc: Jan Beulich <JBeulich@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-12block: replace strict_strtoul() with kstrtoul()Jingoo Han1-1/+1
The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-23Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds3-322/+782
Pull block IO driver bits from Jens Axboe: "As I mentioned in the core block pull request, due to real life circumstances the driver pull request would be late. Now it looks like -rc2 late... On the plus side, apart form the rsxx update, these are all things that I could argue could go in later in the cycle as they are fixes and not features. So even though things are late, it's not ALL bad. The pull request contains: - Updates to bcache, all bug fixes, from Kent. - A pile of drbd bug fixes (no big features this time!). - xen blk front/back fixes. - rsxx driver updates, some of them deferred form 3.10. So should be well cooked by now" * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits) bcache: Allocation kthread fixes bcache: Fix GC_SECTORS_USED() calculation bcache: Journal replay fix bcache: Shutdown fix bcache: Fix a sysfs splat on shutdown bcache: Advertise that flushes are supported bcache: check for allocation failures bcache: Fix a dumb race bcache: Use standard utility code bcache: Update email address bcache: Delete fuzz tester bcache: Document shrinker reserve better bcache: FUA fixes drbd: Allow online change of al-stripes and al-stripe-size drbd: Constants should be UPPERCASE drbd: Ignore the exit code of a fence-peer handler if it returns too late drbd: Fix rcu_read_lock balance on error path drbd: fix error return code in drbd_init() drbd: Do not sleep inside rcu bcache: Refresh usage docs ...
2013-07-04drivers: avoid parsing names as kthread_run() format stringsKees Cook1-1/+1
Calling kthread_run with a single name parameter causes it to be handled as a format string. Many callers are passing potentially dynamic string content, so use "%s" in those cases to avoid any potential accidents. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-25xen-blkback: check the number of iovecs before allocating a biosRoger Pau Monne1-1/+2
With the introduction of indirect segments we can receive requests with a number of segments bigger than the maximum number of allowed iovecs in a bios, so make sure that blkback doesn't try to allocate a bios with more iovecs than BIO_MAX_PAGES Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-21xen-blkback: workaround compiler bug in gcc 4.1Roger Pau Monne1-10/+14
The code generat with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54) creates an unbound loop for the second foreach_grant_safe loop in purge_persistent_gnt. The workaround is to avoid having this second loop and instead perform all the work inside the first loop by adding a new variable, clean_used, that will be set when all the desired persistent grants have been removed and we need to iterate over the remaining ones to remove the WAS_ACTIVE flag. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Tom O'Neill <toneill@vmem.com> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-17xen/blkback: Check for insane amounts of request on the ring (v6).Konrad Rzeszutek Wilk3-1/+16
Check that the ring does not have an insane amount of requests (more than there could fit on the ring). If we detect this case we will stop processing the requests and wait until the XenBus disconnects the ring. The existing check RING_REQUEST_CONS_OVERFLOW which checks for how many responses we have created in the past (rsp_prod_pvt) vs requests consumed (req_cons) and whether said difference is greater or equal to the size of the ring, does not catch this case. Wha the condition does check if there is a need to process more as we still have a backlog of responses to finish. Note that both of those values (rsp_prod_pvt and req_cons) are not exposed on the shared ring. To understand this problem a mini crash course in ring protocol response/request updates is in place. There are four entries: req_prod and rsp_prod; req_event and rsp_event to track the ring entries. We are only concerned about the first two - which set the tone of this bug. The req_prod is a value incremented by frontend for each request put on the ring. Conversely the rsp_prod is a value incremented by the backend for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when pushing the responses on the ring). Both values can wrap and are modulo the size of the ring (in block case that is 32). Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details. The culprit here is that if the difference between the req_prod and req_cons is greater than the ring size we have a problem. Fortunately for us, the '__do_block_io_op' loop: rc = blk_rings->common.req_cons; rp = blk_rings->common.sring->req_prod; while (rc != rp) { .. blk_rings->common.req_cons = ++rc; /* before make_response() */ } will loop up to the point when rc == rp. The macros inside of the loop (RING_GET_REQUEST) is smart and is indexing based on the modulo of the ring size. If the frontend has provided a bogus req_prod value we will loop until the 'rc == rp' - which means we could be processing already processed requests (or responses) often. The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is b/c it only tracks how many responses we have internally produced and whether we would should process more. The astute reader will notice that the macro RING_REQUEST_CONS_OVERFLOW provides two arguments - more on this later. For example, if we were to enter this function with these values: blk_rings->common.sring->req_prod = X+31415 (X is the value from the last time __do_block_io_op was called). blk_rings->common.req_cons = X blk_rings->common.rsp_prod_pvt = X The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons) is doing: req_cons - rsp_prod_pvt >= 32 Which is, X - X >= 32 or 0 >= 32 And that is false, so we continue on looping (this bug). If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp instead (sring->req_prod) of rc, the this macro can do the check: req_prod - rsp_prov_pvt >= 32 Which is, X + 31415 - X >= 32 , or 31415 >= 32 which is true, so we can error out and break out of the function. Unfortunatly the difference between rsp_prov_pvt and req_prod can be at 32 (which would error out in the macro). This condition exists when the backend is lagging behind with the responses and still has not finished responding to all of them (so make_response has not been called), and the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able to use said macro. Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does a simple check of: req_prod - rsp_prod_pvt > RING_SIZE And with the X values from above: X + 31415 - X > 32 Returns true. Also not that if the ring is full (which is where the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the same condition: X + 32 - X > 32 Which is false. Lets use that macro. Note that in v5 of this patchset the macro was different - we used an earlier version. Cc: stable@vger.kernel.org [v1: Move the check outside the loop] [v2: Add a pr_warn as suggested by David] [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan] [v4: Move wake_up after kthread_stop as suggested by Jan] [v5: Use RING_REQUEST_PROD_OVERFLOW instead] [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> gadsa
2013-06-08xen/blkback: Check device permissions before allowing OP_DISCARDKonrad Rzeszutek Wilk1-1/+12
We need to make sure that the device is not RO or that the request is not past the number of sectors we want to issue the DISCARD operation for. This fixes CVE-2013-2140. Cc: stable@vger.kernel.org Acked-by: Jan Beulich <JBeulich@suse.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> [v1: Made it pr_warn instead of pr_debug] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-08xen/blkback: Use physical sector size for setupStefan Bader1-0/+5
Currently xen-blkback passes the logical sector size over xenbus and xen-blkfront sets up the paravirt disk with that logical block size. But newer drives usually have the logical sector size set to 512 for compatibility reasons and would show the actual sector size only in physical sector size. This results in the device being partitioned and accessed in dom0 with the correct sector size, but the guest thinks 512 bytes is the correct block size. And that results in poor performance. To fix this, blkback gets modified to pass also physical-sector-size over xenbus and blkfront to use both values to set up the paravirt disk. I did not just change the passed in sector-size because I am not sure having a bigger logical sector size than the physical one is valid (and that would happen if a newer dom0 kernel hits an older domU kernel). Also this way a domU set up before should still be accessible (just some tools might detect the unaligned setup). [v2: Make xenbus write failure non-fatal] [v3: Use xenbus_scanf instead of xenbus_gather] [v4: Rebased against segment changes] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-07xen-blkback: allocate list of pending reqs in small chunksRoger Pau Monne3-78/+106
Allocate pending requests in smaller chunks instead of allocating them all at the same time. This change also removes the global array of pending_reqs, it is no longer necessay. Variables related to the grant mapping have been grouped into a struct called "grant_page", this allows to allocate them in smaller chunks, and also improves memory locality. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-block: implement indirect descriptorsRoger Pau Monne3-41/+198
Indirect descriptors introduce a new block operation (BLKIF_OP_INDIRECT) that passes grant references instead of segments in the request. This grant references are filled with arrays of blkif_request_segment_aligned, this way we can send more segments in a request. The proposed implementation sets the maximum number of indirect grefs (frames filled with blkif_request_segment_aligned) to 256 in the backend and 32 in the frontend. The value in the frontend has been chosen experimentally, and the backend value has been set to a sane value that allows expanding the maximum number of indirect descriptors in the frontend if needed. The migration code has changed from the previous implementation, in which we simply remapped the segments on the shared ring. Now the maximum number of segments allowed in a request can change depending on the backend, so we have to requeue all the requests in the ring and in the queue and split the bios in them if they are bigger than the new maximum number of segments. [v2: Fixed minor comments by Konrad. [v1: Added padding to make the indirect request 64bit aligned. Added some BUGs, comments; fixed number of indirect pages in blkif_get_x86_{32/64}_req. Added description about the indirect operation in blkif.h] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> [v3: Fixed spaces and tabs mix ups] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: expand map/unmap functionsRoger Pau Monne1-55/+86
Preparatory change for implementing indirect descriptors. Change xen_blkbk_{map/unmap} in order to be able to map/unmap a random amount of grants (previously it was limited to BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req in the map/unmap functions, so we can map/unmap grants without needing to pass a pending_req. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: make the queue of free requests per backendRoger Pau Monne3-105/+74
Remove the last dependency from blkbk by moving the list of free requests to blkif. This change reduces the contention on the list of available requests. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: move pending handles list from blkbk to pending_reqRoger Pau Monne1-12/+4
Moving grant ref handles from blkbk to pending_req will allow us to get rid of the shared blkbk structure. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: implement LRU mechanism for persistent grantsRoger Pau Monne3-57/+250
This mechanism allows blkback to change the number of grants persistently mapped at run time. The algorithm uses a simple LRU mechanism that removes (if needed) the persistent grants that have not been used since the last LRU run, or if all grants have been used it removes the first grants in the list (that are not in use). The algorithm allows the user to change the maximum number of persistent grants, by changing max_persistent_grants in sysfs. Since we are storing the persistent grants used inside the request struct (to be able to mark them as "unused" when unmapping), we no longer need the bitmap (unmap_seg). Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: use balloon pages for all mappingsRoger Pau Monne3-121/+173
Using balloon pages for all granted pages allows us to simplify the logic in blkback, especially in the xen_blkbk_map function, since now we can decide if we want to map a grant persistently or not after we have actually mapped it. This could not be done before because persistent grants used ballooned pages, whereas non-persistent grants used pages from the kernel. This patch also introduces several changes, the first one is that the list of free pages is no longer global, now each blkback instance has it's own list of free pages that can be used to map grants. Also, a run time parameter (max_buffer_pages) has been added in order to tune the maximum number of free pages each blkback instance will keep in it's buffer. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-04-18xen-blkback: print stats about persistent grantsRoger Pau Monne1-2/+4
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19xen-blkback: don't store dev_bus_addrRoger Pau Monne2-16/+6
dev_bus_addr returned in the grant ref map operation is the mfn of the passed page, there's no need to store it in the persistent grant entry, since we can always get it provided that we have the page. This reduces the memory overhead of persistent grants in blkback. While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as it makes much more sense - as we use that value in bio_add_page which as the fourth argument expects the offset. We hadn't used the physical address as part of this at all. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org [v1: s/buf/offset/] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19xen-blkback: fix foreach_grant_safe to handle empty listsRoger Pau Monne1-1/+1
We may use foreach_grant_safe in the future with empty lists, so make sure we can handle them. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19xen-blkback: fix dispatch_rw_block_io() error pathJan Beulich1-6/+1
Commit 7708992 ("xen/blkback: Seperate the bio allocation and the bio submission") consolidated the pendcnt updates to just a single write, neglecting the fact that the error path relied on it getting set to 1 up front (such that the decrement in __end_block_io_op() would actually drop the count to zero, triggering the necessary cleanup actions). Also remove a misleading and a stale (after said commit) comment. CC: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11xen/blkback: Change statistics counter types to unsignedZoltan Kiss3-16/+16
These values shouldn't be negative, but after an overflow their value can turn into negative, if they are signed. xentop can show bogus values in this case. Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11xen/blkback: correctly respond to unknown, non-native requestsDavid Vrabel2-4/+52
If the frontend is using a non-native protocol (e.g., a 64-bit frontend with a 32-bit backend) and it sent an unrecognized request, the request was not translated and the response would have the incorrect ID. This may cause the frontend driver to behave incorrectly or crash. Since the ID field in the request is always in the same place, regardless of the request type we can get the correct ID and make a valid response (which will report BLKIF_RSP_EOPNOTSUPP). This bug affected 64-bit SLES 11 guests when using a 32-bit backend. This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES source) and would crash in blkif_int() as the ID in the response would be invalid. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-01xen/xen-blkback: preq.dev is used without initializedChen Gang1-1/+2
If call xen_vbd_translate failed, the preq.dev will be not initialized. Use blkif->vbd.pdevice instead (still better to print relative info). Note that before commit 01c681d4c70d64cb72142a2823f27c4146a02e63 (xen/blkback: Don't trust the handle from the frontend.) the value bogus, as it was the guest provided value from req->u.rw.handle rather than the actual device. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-20xen-blkback: use balloon pages for persistent grantsRoger Pau Monne1-2/+4
With current persistent grants implementation we are not freeing the persistent grants after we disconnect the device. Since grant map operations change the mfn of the allocated page, and we can no longer pass it to __free_page without setting the mfn to a sane value, use balloon grant pages instead, as the gntdev device does. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: stable@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-20xen/blkback: Don't trust the handle from the frontend.Konrad Rzeszutek Wilk1-1/+0
The 'handle' is the device that the request is from. For the life-time of the ring we copy it from a request to a response so that the frontend is not surprised by it. But we do not need it - when we start processing I/Os we have our own 'struct phys_req' which has only most essential information about the request. In fact the 'vbd_translate' ends up over-writing the preq.dev with a value from the backend. This assignment of preq.dev with the 'handle' value is superfluous so lets not do it. Cc: stable@vger.kernel.org Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-20xen-blkback: do not leak mode propertyJan Beulich1-25/+24
"be->mode" is obtained from xenbus_read(), which does a kmalloc() for the message body. The short string is never released, so do it along with freeing "be" itself, and make sure the string isn't kept when backend_changed() doesn't complete successfully (which made it desirable to slightly re-structure that function, so that the error cleanup can be done in one place). Reported-by: Olaf Hering <olaf@aepfle.de> CC: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-19Merge branch 'stable/for-jens-3.8' of ↵Jens Axboe1-7/+11
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.8 which has a bug-fix to the xen-blkfront and xen-blkback driver when using the persistent mode. An issue was discovered where LVM disks could not be read correctly and this fixes it. There is also a change in llist.h which has been blessed by akpm.
2012-12-18Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds3-27/+313
Pull block driver update from Jens Axboe: "Now that the core bits are in, here are the driver bits for 3.8. The branch contains: - A huge pile of drbd bits that were dumped from the 3.7 merge window. Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved. - A few cleanups from Akinobu Mita for drbd and cciss. - Queue improvement for loop from Lukas. This grew into adding a generic interface for waiting/checking an even with a specific lock, allowing this to be pulled out of md and now loop and drbd is also using it. - A few fixes for xen back/front block driver from Roger Pau Monne. - Partition improvements from Stephen Warren, allowing partiion UUID to be used as an identifier." * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits) drbd: update Kconfig to match current dependencies drbd: Fix drbdsetup wait-connect, wait-sync etc... commands drbd: close race between drbd_set_role and drbd_connect drbd: respect no-md-barriers setting also when changed online via disk-options drbd: Remove obsolete check drbd: fixup after wait_even_lock_irq() addition to generic code loop: Limit the number of requests in the bio list wait: add wait_event_lock_irq() interface xen-blkfront: free allocated page xen-blkback: move free persistent grants code block: partition: msdos: provide UUIDs for partitions init: reduce PARTUUID min length to 1 from 36 block: store partition_meta_info.uuid as a string cciss: use check_signature() cciss: cleanup bitops usage drbd: use copy_highpage drbd: if the replication link breaks during handshake, keep retrying drbd: check return of kmalloc in receive_uuids drbd: Broadcast sync progress no more often than once per second drbd: don't try to clear bits once the disk has failed ...
2012-12-08xen-blkback: implement safe iterator for the list of persistent grantsRoger Pau Monne1-7/+11
Change foreach_grant iterator to a safe version, that allows freeing the element while iterating. Also move the free code in free_persistent_gnts to prevent freeing the element before the rb_next call. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26xen-blkback: move free persistent grants codeRoger Pau Monne1-31/+37
Move the code that frees persistent grants from the red-black tree to a function. This will make it easier for other consumers to move this to a common place. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-04xen/blkback: persistent-grants fixesRoger Pau Monne2-4/+7
This patch contains fixes for persistent grants implementation v2: * handle == 0 is a valid handle, so initialize grants in blkback setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported by Konrad Rzeszutek Wilk. * new_map is a boolean, use "true" or "false" instead of 1 and 0. Reported by Konrad Rzeszutek Wilk. * blkfront announces the persistent-grants feature as feature-persistent-grants, use feature-persistent instead which is consistent with blkback and the public Xen headers. * Add a consistency check in blkfront to make sure we don't try to access segments that have not been set. Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> [v1: The new_map int->bool had already been changed] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30xen/blkback: Persistent grant maps for xen blk driversRoger Pau Monne3-27/+305
This patch implements persistent grants for the xen-blk{front,back} mechanism. The effect of this change is to reduce the number of unmap operations performed, since they cause a (costly) TLB shootdown. This allows the I/O performance to scale better when a large number of VMs are performing I/O. Previously, the blkfront driver was supplied a bvec[] from the request queue. This was granted to dom0; dom0 performed the I/O and wrote directly into the grant-mapped memory and unmapped it; blkfront then removed foreign access for that grant. The cost of unmapping scales badly with the number of CPUs in Dom0. An experiment showed that when Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests (at which point 650,000 IOPS are being performed in total). If more than 5 guests are used, the performance declines. By 10 guests, only 400,000 IOPS are being performed. This patch improves performance by only unmapping when the connection between blkfront and back is broken. On startup blkfront notifies blkback that it is using persistent grants, and blkback will do the same. If blkback is not capable of persistent mapping, blkfront will still use the same grants, since it is compatible with the previous protocol, and simplifies the code complexity in blkfront. To perform a read, in persistent mode, blkfront uses a separate pool of pages that it maps to dom0. When a request comes in, blkfront transmutes the request so that blkback will write into one of these free pages. Blkback keeps note of which grefs it has already mapped. When a new ring request comes to blkback, it looks to see if it has already mapped that page. If so, it will not map it again. If the page hasn't been previously mapped, it is mapped now, and a record is kept of this mapping. Blkback proceeds as usual. When blkfront is notified that blkback has completed a request, it memcpy's from the shared memory, into the bvec supplied. A record that the {gref, page} tuple is mapped, and not inflight is kept. Writes are similar, except that the memcpy is peformed from the supplied bvecs, into the shared pages, before the request is put onto the ring. Blkback stores a mapping of grefs=>{page mapped to by gref} in a red-black tree. As the grefs are not known apriori, and provide no guarantees on their ordering, we have to perform a search through this tree to find the page, for every gref we receive. This operation takes O(log n) time in the worst case. In blkfront grants are stored using a single linked list. The maximum number of grants that blkback will persistenly map is currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to prevent a malicios guest from attempting a DoS, by supplying fresh grefs, causing the Dom0 kernel to map excessively. If a guest is using persistent grants and exceeds the maximum number of grants to map persistenly the newly passed grefs will be mapped and unmaped. Using this approach, we can have requests that mix persistent and non-persistent grants, and we need to handle them correctly. This allows us to set the maximum number of persistent grants to a lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although setting it will lead to unpredictable performance. In writing this patch, the question arrises as to if the additional cost of performing memcpys in the guest (to/from the pool of granted pages) outweigh the gains of not performing TLB shootdowns. The answer to that question is `no'. There appears to be very little, if any additional cost to the guest of using persistent grants. There is perhaps a small saving, from the reduced number of hypercalls performed in granting, and ending foreign access. Signed-off-by: Oliver Chick <oliver.chick@citrix.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [v1: Fixed up the misuse of bool as int]
2012-10-30xen/blkback: Change xen_vbd's flush_support and discard_secure to have type ↵Oliver Chick1-2/+2
unsigned int, rather than bool Changing the type of bdev parameters to be unsigned int :1, rather than bool. This is more consistent with the types of other features in the block drivers. Signed-off-by: Oliver Chick <oliver.chick@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>