summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-02-28io_uring: support for IO pollingJens Axboe2-9/+271
Add support for a polled io_uring instance. When a read or write is submitted to a polled io_uring, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions, hence they need to be actively found by the application itself. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add fsync supportChristoph Hellwig2-1/+61
Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is only force out metadata which is required to retrieve the file data, but not others like metadata. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28Add io_uring IO interfaceJens Axboe12-2/+1390
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28block: introduce mp_bvec_for_each_page() for iterating over pageMing Lei1-0/+5
mp_bvec_for_each_segment() is a bit big for the iteration, so introduce a light-weight helper for iterating over pages, then 32bytes stack space can be saved. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-27block: optimize blk_bio_segment_split for single-page bvecMing Lei1-3/+9
Introduce a fast path for single-page bvec IO, then we can avoid to call bvec_split_segs() unnecessarily. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-27block: optimize __blk_segment_map_sg() for single-page bvecMing Lei1-2/+7
Introduce a fast path for single-page bvec IO, then blk_bvec_map_sg() can be avoided. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-27block: introduce bvec_nth_page()Ming Lei2-4/+9
Single-page bvec can often be seen in small BS workloads, so introduce bvec_nth_page() for avoiding to call nth_page() unnecessarily, which looks not cheap. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24iomap: wire up the iopoll methodChristoph Hellwig4-15/+32
Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified to use bio_set_polled(). Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24block: add bio_set_polled() helperJens Axboe2-2/+16
For the upcoming async polled IO, we can't sleep allocating requests. If we do, then we introduce a deadlock where the submitter already has async polled IO in-flight, but can't wait for them to complete since polled requests must be active found and reaped. Utilize the helper in the blockdev DIRECT_IO code. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24block: wire up block device iopoll methodChristoph Hellwig1-1/+17
Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24fs: add an iopoll method to struct file_operationsChristoph Hellwig2-0/+5
This new methods is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct iocb to store the polling cookie. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-23loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()Dongli Zhang1-4/+17
Commit 0da03cab87e6 ("loop: Fix deadlock when calling blkdev_reread_part()") moves blkdev_reread_part() out of the loop_ctl_mutex. However, GENHD_FL_NO_PART_SCAN is set before __blkdev_reread_part(). As a result, __blkdev_reread_part() will fail the check of GENHD_FL_NO_PART_SCAN and will not rescan the loop device to delete all partitions. Below are steps to reproduce the issue: step1 # dd if=/dev/zero of=tmp.raw bs=1M count=100 step2 # losetup -P /dev/loop0 tmp.raw step3 # parted /dev/loop0 mklabel gpt step4 # parted -a none -s /dev/loop0 mkpart primary 64s 1 step5 # losetup -d /dev/loop0 Step5 will not be able to delete /dev/loop0p1 (introduced by step4) and there is below kernel warning message: [ 464.414043] __loop_clr_fd: partition scan of loop0 failed (rc=-22) This patch sets GENHD_FL_NO_PART_SCAN after blkdev_reread_part(). Fixes: 0da03cab87e6 ("loop: Fix deadlock when calling blkdev_reread_part()") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-23loop: do not print warn message if partition scan is successfulDongli Zhang1-2/+3
Do not print warn message when the partition scan returns 0. Fixes: d57f3374ba48 ("loop: Move special partition reread handling in loop_clr_fd()") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-21block: bounce: make sure that bvec table is updatedMing Lei1-2/+6
Block bounce needs to allocate new page for doing IO, and the new page has to be updated to bvec table. Commit 6dc4f100c switches __blk_queue_bounce() to use the new bio_for_each_segment_all() interface. Unfortunately the new bio_for_each_segment_all() can't be used to update bvec table. This patch fixes this issue by retrieving bvec from the table directly, then the new allocated page can be updated to the bio. This way is safe because the cloned bio is single page bvec. Fixes: 6dc4f100c ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-21Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-5.1/blockJens Axboe29-289/+168
Pull NVMe changes for 5.1 from Christoph * 'nvme-5.1' of git://git.infradead.org/nvme: (22 commits) nvme-rdma: use nr_phys_segments when map rq to sgl nvmet: convert to SPDX identifiers nvmet-rdma: convert to SPDX identifiers nvme-loop: convert to SPDX identifiers nvmet-fcloop: convert to SPDX identifiers nvmet-fc: convert to SPDX identifiers nvme: convert to SPDX identifiers nvme-pci: convert to SPDX identifiers nvme-lightnvm: convert to SPDX identifiers nvme-rdma: convert to SPDX identifiers nvme-fc: convert to SPDX identifiers nvme-fabrics: convert to SPDX identifiers nvme-tcp.h: fix SPDX header nvme_ioctl.h: remove duplicate GPL boilerplate nvme: return error from nvme_alloc_ns() nvme: avoid that deleting a controller triggers a circular locking complaint nvme: introduce a helper function for controller deletion nvme: unexport nvme_delete_ctrl_sync() nvme-pci: check kstrtoint() return value in queue_count_set() nvme-fabrics: document the poll function argument ...
2019-02-21nvme-rdma: use nr_phys_segments when map rq to sglChaitanya Kulkarni1-2/+2
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check if a command contains data to be mapped. This fixes the case where a struct request contains LBAs, but it has no payload, such as Write Zeroes support. Fixes: 6e02318eaea5 ("nvme: add support for the Write Zeroes command") Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvmet: convert to SPDX identifiersChristoph Hellwig7-63/+7
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvmet-rdma: convert to SPDX identifiersChristoph Hellwig1-9/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-loop: convert to SPDX identifiersChristoph Hellwig1-9/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvmet-fcloop: convert to SPDX identifiersChristoph Hellwig1-12/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvmet-fc: convert to SPDX identifiersChristoph Hellwig1-13/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme: convert to SPDX identifiersChristoph Hellwig6-46/+6
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-pci: convert to SPDX identifiersChristoph Hellwig1-9/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-lightnvm: convert to SPDX identifiersChristoph Hellwig1-15/+1
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-rdma: convert to SPDX identifiersChristoph Hellwig2-18/+2
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-fc: convert to SPDX identifiersChristoph Hellwig3-35/+3
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-fabrics: convert to SPDX identifiersChristoph Hellwig2-18/+2
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme-tcp.h: fix SPDX headerChristoph Hellwig1-1/+1
For .h files we need to use /* */ style comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme_ioctl.h: remove duplicate GPL boilerplateChristoph Hellwig2-18/+1
We already have a ЅPDX header, so no need to duplicate the information. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20nvme: return error from nvme_alloc_ns()Hannes Reinecke1-10/+21
nvme_alloc_ns() might fail, so we should be returning an error code. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme: avoid that deleting a controller triggers a circular locking complaintBart Van Assche1-2/+3
Rework nvme_delete_ctrl_sync() such that it does not have to wait for queued work. This patch avoids that test nvme/008 triggers the following complaint: WARNING: possible circular locking dependency detected 5.0.0-rc6-dbg+ #10 Not tainted ------------------------------------------------------ nvme/7918 is trying to acquire lock: 000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410 but task is already holding lock: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (kn->count#389){++++}: lock_acquire+0xc5/0x1e0 __kernfs_remove+0x42a/0x4a0 kernfs_remove_by_name_ns+0x45/0x90 remove_files.isra.1+0x3a/0x90 sysfs_remove_group+0x5c/0xc0 sysfs_remove_groups+0x39/0x60 device_remove_attrs+0x68/0xb0 device_del+0x24d/0x570 cdev_device_del+0x1a/0x50 nvme_delete_ctrl_work+0xbd/0xe0 process_one_work+0x4f1/0xa40 worker_thread+0x67/0x5b0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 ((work_completion)(&ctrl->delete_work)){+.+.}: __lock_acquire+0x1323/0x17b0 lock_acquire+0xc5/0x1e0 __flush_work+0x399/0x410 flush_work+0x10/0x20 nvme_delete_ctrl_sync+0x65/0x70 nvme_sysfs_delete+0x4f/0x60 dev_attr_store+0x3e/0x50 sysfs_kf_write+0x87/0xa0 kernfs_fop_write+0x186/0x240 __vfs_write+0xd7/0x430 vfs_write+0xfa/0x260 ksys_write+0xab/0x130 __x64_sys_write+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(kn->count#389); lock((work_completion)(&ctrl->delete_work)); lock(kn->count#389); lock((work_completion)(&ctrl->delete_work)); *** DEADLOCK *** 3 locks held by nvme/7918: #0: 00000000e2223b44 (sb_writers#6){.+.+}, at: vfs_write+0x1eb/0x260 #1: 000000003404976f (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240 #2: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210 stack backtrace: CPU: 4 PID: 7918 Comm: nvme Not tainted 5.0.0-rc6-dbg+ #10 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xca print_circular_bug.isra.36.cold.54+0x173/0x1d5 check_prev_add.constprop.45+0x996/0x1110 __lock_acquire+0x1323/0x17b0 lock_acquire+0xc5/0x1e0 __flush_work+0x399/0x410 flush_work+0x10/0x20 nvme_delete_ctrl_sync+0x65/0x70 nvme_sysfs_delete+0x4f/0x60 dev_attr_store+0x3e/0x50 sysfs_kf_write+0x87/0xa0 kernfs_fop_write+0x186/0x240 __vfs_write+0xd7/0x430 vfs_write+0xfa/0x260 ksys_write+0xab/0x130 __x64_sys_write+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme: introduce a helper function for controller deletionBart Van Assche1-4/+9
This patch does not change any functionality but makes the next patch in this series easier to read. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme: unexport nvme_delete_ctrl_sync()Bart Van Assche2-3/+1
Since nvme_delete_ctrl_sync() is not called from any other kernel module, unexport it. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme-pci: check kstrtoint() return value in queue_count_set()Bart Van Assche1-0/+2
This patch avoids that the compiler complains about 'ret' being set but not being used when building with W=1. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme-fabrics: document the poll function argumentBart Van Assche1-0/+1
This patch avoids that the kernel-doc tool reports a warning when building with W=1. Fixes: 26c682274e0a ("nvme-fabrics: allow nvmf_connect_io_queue to poll") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvmet: fix indentationBart Van Assche1-1/+1
This patch avoids that smatch complains about inconsistent indentation. Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") # v4.10 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20nvme-multipath: round-robin I/O policyHannes Reinecke3-1/+100
Implement a simple round-robin I/O policy for multipathing. Path selection is done in two rounds, first iterating across all optimized paths, and if that doesn't return any valid paths, iterate over all optimized and non-optimized paths. If no paths are found, use the existing algorithm. Also add a sysfs attribute 'iopolicy' to switch between the current NUMA-aware I/O policy and the 'round-robin' I/O policy. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-19block: avoid to READ fields of null bioMing Lei1-1/+3
rq->bio can be NULL sometimes, such as flush request, so don't read bio->bi_seg_front_size until this 'bio' is checked as valid. Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: dcebd755926b0f39dd1e ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15Merge tag 'v5.0-rc6' into for-5.1/blockJens Axboe566-2185/+5560
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c. This is needed since io_uring is now based on the block branch, to avoid a conflict between the multi-page bvecs and the bits of io_uring that touch the core block parts. * tag 'v5.0-rc6': (525 commits) Linux 5.0-rc6 x86/mm: Make set_pmd_at() paravirt aware MAINTAINERS: Update the ocores i2c bus driver maintainer, etc blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter MAINTAINERS: unify reference to xen-devel list x86/mm/cpa: Fix set_mce_nospec() futex: Handle early deadlock return correctly futex: Fix barrier comment net: dsa: b53: Fix for failure when irq is not defined in dt blktrace: Show requests without sector mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach() geneve: should not call rt6_lookup() when ipv6 was disabled KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) signal: Better detection of synchronous signals ...
2019-02-15block: kill BLK_MQ_F_SG_MERGEMing Lei10-11/+7
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: kill QUEUE_FLAG_NO_SG_MERGEMing Lei5-43/+6
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"), physical segment number is mainly figured out in blk_queue_split() for fast path, and the flag of BIO_SEG_VALID is set there too. Now only blk_recount_segments() and blk_recalc_rq_segments() use this flag. Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID is set in blk_queue_split(). For another user of blk_recalc_rq_segments(): - run in partial completion branch of blk_update_request, which is an unusual case - run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed since dm-rq is the only user. Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the current setup of the I/O path, as it isn't going to save you a significant amount of cycles. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: document usage of bio iterator helpersMing Lei1-0/+25
Now multi-page bvec is supported, some helpers may return page by page, meantime some may return segment by segment, this patch documents the usage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: always define BIO_MAX_PAGES as 256Ming Lei1-8/+0
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to increase BIO_MAX_PAGES for it. CONFIG_THP_SWAP needs to split one THP into normal pages and adds them all to one bio. With multipage-bvec, it just takes one bvec to hold them all. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: enable multipage bvecsMing Lei4-12/+20
This patch pulls the trigger for multi-page bvecs. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: allow bio_for_each_segment_all() to iterate over multi-page bvecMing Lei27-46/+127
This patch introduces one extra iterator variable to bio_for_each_segment_all(), then we can allow bio_for_each_segment_all() to iterate over multi-page bvec. Given it is just one mechannical & simple change on all bio_for_each_segment_all() users, this patch does tree-wide change in one single patch, so that we can avoid to use a temporary helper for this conversion. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()Ming Lei1-1/+5
bch_bio_alloc_pages() is always called on one new bio, so it is safe to access the bvec table directly. Given it is the only kind of this case, open code the bvec table access since bio_for_each_segment_all() will be changed to support for iterating over multipage bvec. Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: loop: pass multi-page bvec to iov_iterMing Lei1-10/+10
iov_iter is implemented on bvec itererator helpers, so it is safe to pass multi-page bvec to it, and this way is much more efficient than passing one page in each bvec. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15btrfs: use mp_bvec_last_segment to get bio's last pageMing Lei1-2/+3
Preparing for supporting multi-page bvec. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15fs/buffer.c: use bvec iterator to truncate the bioMing Lei1-1/+4
Once multi-page bvec is enabled, the last bvec may include more than one page, this patch use mp_bvec_last_segment() to truncate the bio. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: introduce mp_bvec_last_segment()Ming Lei1-0/+22
BTRFS and guard_bio_eod() need to get the last singlepage segment from one multipage bvec, so introduce this helper to make them happy. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>