summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)AuthorFilesLines
2018-12-01floppy: fix race condition in __floppy_read_block_0()Jens Axboe1-1/+2
[ Upstream commit de7b75d82f70c5469675b99ad632983c50b6f7e7 ] LKP recently reported a hang at bootup in the floppy code: [ 245.678853] INFO: task mount:580 blocked for more than 120 seconds. [ 245.679906] Tainted: G T 4.19.0-rc6-00172-ga9f38e1 #1 [ 245.680959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 245.682181] mount D 6372 580 1 0x00000004 [ 245.683023] Call Trace: [ 245.683425] __schedule+0x2df/0x570 [ 245.683975] schedule+0x2d/0x80 [ 245.684476] schedule_timeout+0x19d/0x330 [ 245.685090] ? wait_for_common+0xa5/0x170 [ 245.685735] wait_for_common+0xac/0x170 [ 245.686339] ? do_sched_yield+0x90/0x90 [ 245.686935] wait_for_completion+0x12/0x20 [ 245.687571] __floppy_read_block_0+0xfb/0x150 [ 245.688244] ? floppy_resume+0x40/0x40 [ 245.688844] floppy_revalidate+0x20f/0x240 [ 245.689486] check_disk_change+0x43/0x60 [ 245.690087] floppy_open+0x1ea/0x360 [ 245.690653] __blkdev_get+0xb4/0x4d0 [ 245.691212] ? blkdev_get+0x1db/0x370 [ 245.691777] blkdev_get+0x1f3/0x370 [ 245.692351] ? path_put+0x15/0x20 [ 245.692871] ? lookup_bdev+0x4b/0x90 [ 245.693539] blkdev_get_by_path+0x3d/0x80 [ 245.694165] mount_bdev+0x2a/0x190 [ 245.694695] squashfs_mount+0x10/0x20 [ 245.695271] ? squashfs_alloc_inode+0x30/0x30 [ 245.695960] mount_fs+0xf/0x90 [ 245.696451] vfs_kern_mount+0x43/0x130 [ 245.697036] do_mount+0x187/0xc40 [ 245.697563] ? memdup_user+0x28/0x50 [ 245.698124] ksys_mount+0x60/0xc0 [ 245.698639] sys_mount+0x19/0x20 [ 245.699167] do_int80_syscall_32+0x61/0x130 [ 245.699813] entry_INT80_32+0xc7/0xc7 showing that we never complete that read request. The reason is that the completion setup is racy - it initializes the completion event AFTER submitting the IO, which means that the IO could complete before/during the init. If it does, we are passing garbage to complete() and we may sleep forever waiting for the event to occur. Fixes: 7b7b68bba5ef ("floppy: bail out in open() if drive is not responding to block0 read") Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27zram: close udev startup race condition as default groupsMinchan Kim1-20/+6
commit fef912bf860e upstream. commit 98af4d4df889 upstream. I got a report from Howard Chen that he saw zram and sysfs race(ie, zram block device file is created but sysfs for it isn't yet) when he tried to create new zram devices via hotadd knob. v4.20 kernel fixes it by [1, 2] but it's too large size to merge into -stable so this patch fixes the problem by registering defualt group by Greg KH's approach[3]. This patch should be applied to every stable tree [3.16+] currently existing from kernel.org because the problem was introduced at 2.6.37 by [4]. [1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk [2] 98af4d4df889, zram: register default groups with device_add_disk() [3] http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/ [4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Tested-by: Howard Chen <howardsoc@google.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-13xen/blkfront: avoid NULL blkfront_info dereference on device removalVasilis Liaskovitis1-0/+3
commit f92898e7f32e3533bfd95be174044bc349d416ca upstream. If a block device is hot-added when we are out of grants, gnttab_grant_foreign_access fails with -ENOSPC (log message "28 granting access to ring page") in this code path: talk_to_blkback -> setup_blkring -> xenbus_grant_ring -> gnttab_grant_foreign_access and the failing path in talk_to_blkback sets the driver_data to NULL: destroy_blkring: blkif_free(info, 0); mutex_lock(&blkfront_mutex); free_info(info); mutex_unlock(&blkfront_mutex); dev_set_drvdata(&dev->dev, NULL); This results in a NULL pointer BUG when blkfront_remove and blkif_free try to access the failing device's NULL struct blkfront_info. Cc: stable@vger.kernel.org # 4.5 and later Signed-off-by: Vasilis Liaskovitis <vliaskovitis@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13swim: fix cleanup on setup errorOmar Sandoval1-2/+11
[ Upstream commit 1448a2a5360ae06f25e2edc61ae070dff5c0beb4 ] If we fail to allocate the request queue for a disk, we still need to free that disk, not just the previous ones. Additionally, we need to cleanup the previous request queues. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13ataflop: fix error handling during setupOmar Sandoval1-10/+15
[ Upstream commit 71327f547ee3a46ec5c39fdbbd268401b2578d0e ] Move queue allocation next to disk allocation to fix a couple of issues: - If add_disk() hasn't been called, we should clear disk->queue before calling put_disk(). - If we fail to allocate a request queue, we still need to put all of the disks, not just the ones that we allocated queues for. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-10nbd: only set MSG_MORE when we have more to sendJosef Bacik1-4/+2
[ Upstream commit d61b7f972dab2a7d187c38254845546dfc8eed85 ] A user noticed that write performance was horrible over loopback and we traced it to an inversion of when we need to set MSG_MORE. It should be set when we have more bvec's to send, not when we are on the last bvec. This patch made the test go from 20 iops to 78k iops. Signed-off-by: Josef Bacik <jbacik@fb.com> Fixes: 429a787be679 ("nbd: fix use-after-free of rq/bio in the xmit path") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-10-04floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctlAndy Whitcroft1-0/+3
commit 65eea8edc315589d6c993cf12dbb5d0e9ef1fe4e upstream. The final field of a floppy_struct is the field "name", which is a pointer to a string in kernel memory. The kernel pointer should not be copied to user memory. The FDGETPRM ioctl copies a floppy_struct to user memory, including this "name" field. This pointer cannot be used by the user and it will leak a kernel address to user-space, which will reveal the location of kernel code and data and undermine KASLR protection. Model this code after the compat ioctl which copies the returned data to a previously cleared temporary structure on the stack (excluding the name pointer) and copy out to userspace from there. As we already have an inparam union with an appropriate member and that memory is already cleared even for read only calls make use of that as a temporary store. Based on an initial patch by Brian Belleville. CVE-2018-7755 Signed-off-by: Andy Whitcroft <apw@canonical.com> Broke up long line. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-17loop: remember whether sysfs_create_group() was doneTetsuo Handa2-5/+7
commit d3349b6b3c373ac1fbfb040b810fcee5e2adc7e0 upstream. syzbot is hitting WARN() triggered by memory allocation fault injection [1] because loop module is calling sysfs_remove_group() when sysfs_create_group() failed. Fix this by remembering whether sysfs_create_group() succeeded. [1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739ff29@syzkaller.appspotmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Renamed sysfs_ready -> sysfs_inited. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-17loop: add recursion validation to LOOP_CHANGE_FDTheodore Ts'o1-30/+38
commit d2ac838e4cd7e5e9891ecc094d626734b0245c99 upstream. Refactor the validation code used in LOOP_SET_FD so it is also used in LOOP_CHANGE_FD. Otherwise it is possible to construct a set of loop devices that all refer to each other. This can lead to a infinite loop in starting with "while (is_loop_device(f)) .." in loop_set_fd(). Fix this by refactoring out the validation code and using it for LOOP_CHANGE_FD as well as LOOP_SET_FD. Reported-by: syzbot+4349872271ece473a7c91190b68b4bac7c5dbc87@syzkaller.appspotmail.com Reported-by: syzbot+40bd32c4d9a3cc12a339@syzkaller.appspotmail.com Reported-by: syzbot+769c54e66f994b041be7@syzkaller.appspotmail.com Reported-by: syzbot+0a89a9ce473936c57065@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11drbd: fix access after freeLars Ellenberg1-1/+1
commit 64dafbc9530c10300acffc57fae3269d95fa8f93 upstream. We have struct drbd_requests { ... struct bio *private_bio; ... } to hold a bio clone for local submission. On local IO completion, we put that bio, and in case we want to use the result later, we overload that member to hold the ERR_PTR() of the completion result, Which, before v4.3, used to be the passed in "int error", so we could first bio_put(), then assign. v4.3-rc1~100^2~21 4246a0b63bd8 block: add a bi_error field to struct bio changed that: bio_put(req->private_bio); - req->private_bio = ERR_PTR(error); + req->private_bio = ERR_PTR(bio->bi_error); Which introduces an access after free, because it was non obvious that req->private_bio == bio. Impact of that was mostly unnoticable, because we only use that value in a multiple-failure case, and even then map any "unexpected" error code to EIO, so worst case we could potentially mask a more specific error with EIO in a multiple failure case. Unless the pointed to memory region was unmapped, as is the case with CONFIG_DEBUG_PAGEALLOC, in which case this results in BUG: unable to handle kernel paging request v4.13-rc1~70^2~75 4e4cbee93d56 block: switch bios to blk_status_t changes it further to bio_put(req->private_bio); req->private_bio = ERR_PTR(blk_status_to_errno(bio->bi_status)); And blk_status_to_errno() now contains a WARN_ON_ONCE() for unexpected values, which catches this "sometimes", if the memory has been reused quickly enough for other things. Should also go into stable since 4.3, with the trivial change around 4.13. Cc: stable@vger.kernel.org Fixes: 4246a0b63bd8 block: add a bi_error field to struct bio Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03rbd: flush rbd_dev->watch_dwork after watch is unregisteredDongsheng Yang1-1/+1
commit 23edca864951250af845a11da86bb3ea63522ed2 upstream. There is a problem if we are going to unmap a rbd device and the watch_dwork is going to queue delayed work for watch: unmap Thread watch Thread timer do_rbd_remove cancel_tasks_sync(rbd_dev) queue_delayed_work for watch destroy_workqueue(rbd_dev->task_wq) drain_workqueue(wq) destroy other resources in wq call_timer_fn __queue_work() Then the delayed work escape the cancel_tasks_sync() and destroy_workqueue() and we will get an user-after-free call trace: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI Modules linked in: CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE 4.17.0-rc6+ #13 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__queue_work+0x6a/0x3b0 RSP: 0018:ffff9427df1c3e90 EFLAGS: 00010086 RAX: ffff9427deca8400 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff9427deca8400 RSI: ffff9427df1c3e50 RDI: 0000000000000000 RBP: ffff942783e39e00 R08: ffff9427deca8400 R09: ffff9427df1c3f00 R10: 0000000000000004 R11: 0000000000000005 R12: ffff9427cfb85970 R13: 0000000000002000 R14: 000000000001eca0 R15: 0000000000000007 FS: 0000000000000000(0000) GS:ffff9427df1c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000004c900a005 CR4: 00000000000206e0 Call Trace: <IRQ> ? __queue_work+0x3b0/0x3b0 call_timer_fn+0x2d/0x130 run_timer_softirq+0x16e/0x430 ? tick_sched_timer+0x37/0x70 __do_softirq+0xd2/0x280 irq_exit+0xd5/0xe0 smp_apic_timer_interrupt+0x6c/0x130 apic_timer_interrupt+0xf/0x20 [ Move rbd_dev->watch_dwork cancellation so that rbd_reregister_watch() either bails out early because the watch is UNREGISTERED at that point or just gets cancelled. ] Cc: stable@vger.kernel.org Fixes: 99d1694310df ("rbd: retry watch re-registration periodically") Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30cdrom: do not call check_disk_change() inside cdrom_open()Maurizio Lombardi1-0/+2
[ Upstream commit 2bbea6e117357d17842114c65e9a9cf2d13ae8a3 ] when mounting an ISO filesystem sometimes (very rarely) the system hangs because of a race condition between two tasks. PID: 6766 TASK: ffff88007b2a6dd0 CPU: 0 COMMAND: "mount" #0 [ffff880078447ae0] __schedule at ffffffff8168d605 #1 [ffff880078447b48] schedule_preempt_disabled at ffffffff8168ed49 #2 [ffff880078447b58] __mutex_lock_slowpath at ffffffff8168c995 #3 [ffff880078447bb8] mutex_lock at ffffffff8168bdef #4 [ffff880078447bd0] sr_block_ioctl at ffffffffa00b6818 [sr_mod] #5 [ffff880078447c10] blkdev_ioctl at ffffffff812fea50 #6 [ffff880078447c70] ioctl_by_bdev at ffffffff8123a8b3 #7 [ffff880078447c90] isofs_fill_super at ffffffffa04fb1e1 [isofs] #8 [ffff880078447da8] mount_bdev at ffffffff81202570 #9 [ffff880078447e18] isofs_mount at ffffffffa04f9828 [isofs] #10 [ffff880078447e28] mount_fs at ffffffff81202d09 #11 [ffff880078447e70] vfs_kern_mount at ffffffff8121ea8f #12 [ffff880078447ea8] do_mount at ffffffff81220fee #13 [ffff880078447f28] sys_mount at ffffffff812218d6 #14 [ffff880078447f80] system_call_fastpath at ffffffff81698c49 RIP: 00007fd9ea914e9a RSP: 00007ffd5d9bf648 RFLAGS: 00010246 RAX: 00000000000000a5 RBX: ffffffff81698c49 RCX: 0000000000000010 RDX: 00007fd9ec2bc210 RSI: 00007fd9ec2bc290 RDI: 00007fd9ec2bcf30 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000010 R10: 00000000c0ed0001 R11: 0000000000000206 R12: 00007fd9ec2bc040 R13: 00007fd9eb6b2380 R14: 00007fd9ec2bc210 R15: 00007fd9ec2bcf30 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b This task was trying to mount the cdrom. It allocated and configured a super_block struct and owned the write-lock for the super_block->s_umount rwsem. While exclusively owning the s_umount lock, it called sr_block_ioctl and waited to acquire the global sr_mutex lock. PID: 6785 TASK: ffff880078720fb0 CPU: 0 COMMAND: "systemd-udevd" #0 [ffff880078417898] __schedule at ffffffff8168d605 #1 [ffff880078417900] schedule at ffffffff8168dc59 #2 [ffff880078417910] rwsem_down_read_failed at ffffffff8168f605 #3 [ffff880078417980] call_rwsem_down_read_failed at ffffffff81328838 #4 [ffff8800784179d0] down_read at ffffffff8168cde0 #5 [ffff8800784179e8] get_super at ffffffff81201cc7 #6 [ffff880078417a10] __invalidate_device at ffffffff8123a8de #7 [ffff880078417a40] flush_disk at ffffffff8123a94b #8 [ffff880078417a88] check_disk_change at ffffffff8123ab50 #9 [ffff880078417ab0] cdrom_open at ffffffffa00a29e1 [cdrom] #10 [ffff880078417b68] sr_block_open at ffffffffa00b6f9b [sr_mod] #11 [ffff880078417b98] __blkdev_get at ffffffff8123ba86 #12 [ffff880078417bf0] blkdev_get at ffffffff8123bd65 #13 [ffff880078417c78] blkdev_open at ffffffff8123bf9b #14 [ffff880078417c90] do_dentry_open at ffffffff811fc7f7 #15 [ffff880078417cd8] vfs_open at ffffffff811fc9cf #16 [ffff880078417d00] do_last at ffffffff8120d53d #17 [ffff880078417db0] path_openat at ffffffff8120e6b2 #18 [ffff880078417e48] do_filp_open at ffffffff8121082b #19 [ffff880078417f18] do_sys_open at ffffffff811fdd33 #20 [ffff880078417f70] sys_open at ffffffff811fde4e #21 [ffff880078417f80] system_call_fastpath at ffffffff81698c49 RIP: 00007f29438b0c20 RSP: 00007ffc76624b78 RFLAGS: 00010246 RAX: 0000000000000002 RBX: ffffffff81698c49 RCX: 0000000000000000 RDX: 00007f2944a5fa70 RSI: 00000000000a0800 RDI: 00007f2944a5fa70 RBP: 00007f2944a5f540 R8: 0000000000000000 R9: 0000000000000020 R10: 00007f2943614c40 R11: 0000000000000246 R12: ffffffff811fde4e R13: ffff880078417f78 R14: 000000000000000c R15: 00007f2944a4b010 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b This task tried to open the cdrom device, the sr_block_open function acquired the global sr_mutex lock. The call to check_disk_change() then saw an event flag indicating a possible media change and tried to flush any cached data for the device. As part of the flush, it tried to acquire the super_block->s_umount lock associated with the cdrom device. This was the same super_block as created and locked by the previous task. The first task acquires the s_umount lock and then the sr_mutex_lock; the second task acquires the sr_mutex_lock and then the s_umount lock. This patch fixes the issue by moving check_disk_change() out of cdrom_open() and let the caller take care of it. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-20block/loop: fix deadlock after loop_set_statusTetsuo Handa1-4/+8
commit 1e047eaab3bb5564f25b41e9cd3a053009f4e789 upstream. syzbot is reporting deadlocks at __blkdev_get() [1]. ---------------------------------------- [ 92.493919] systemd-udevd D12696 525 1 0x00000000 [ 92.495891] Call Trace: [ 92.501560] schedule+0x23/0x80 [ 92.502923] schedule_preempt_disabled+0x5/0x10 [ 92.504645] __mutex_lock+0x416/0x9e0 [ 92.510760] __blkdev_get+0x73/0x4f0 [ 92.512220] blkdev_get+0x12e/0x390 [ 92.518151] do_dentry_open+0x1c3/0x2f0 [ 92.519815] path_openat+0x5d9/0xdc0 [ 92.521437] do_filp_open+0x7d/0xf0 [ 92.527365] do_sys_open+0x1b8/0x250 [ 92.528831] do_syscall_64+0x6e/0x270 [ 92.530341] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.931922] 1 lock held by systemd-udevd/525: [ 92.933642] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x73/0x4f0 ---------------------------------------- The reason of deadlock turned out that wait_event_interruptible() in blk_queue_enter() got stuck with bdev->bd_mutex held at __blkdev_put() due to q->mq_freeze_depth == 1. ---------------------------------------- [ 92.787172] a.out S12584 634 633 0x80000002 [ 92.789120] Call Trace: [ 92.796693] schedule+0x23/0x80 [ 92.797994] blk_queue_enter+0x3cb/0x540 [ 92.803272] generic_make_request+0xf0/0x3d0 [ 92.807970] submit_bio+0x67/0x130 [ 92.810928] submit_bh_wbc+0x15e/0x190 [ 92.812461] __block_write_full_page+0x218/0x460 [ 92.815792] __writepage+0x11/0x50 [ 92.817209] write_cache_pages+0x1ae/0x3d0 [ 92.825585] generic_writepages+0x5a/0x90 [ 92.831865] do_writepages+0x43/0xd0 [ 92.836972] __filemap_fdatawrite_range+0xc1/0x100 [ 92.838788] filemap_write_and_wait+0x24/0x70 [ 92.840491] __blkdev_put+0x69/0x1e0 [ 92.841949] blkdev_close+0x16/0x20 [ 92.843418] __fput+0xda/0x1f0 [ 92.844740] task_work_run+0x87/0xb0 [ 92.846215] do_exit+0x2f5/0xba0 [ 92.850528] do_group_exit+0x34/0xb0 [ 92.852018] SyS_exit_group+0xb/0x10 [ 92.853449] do_syscall_64+0x6e/0x270 [ 92.854944] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 92.943530] 1 lock held by a.out/634: [ 92.945105] #0: 00000000a2849e25 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x3c/0x1e0 ---------------------------------------- The reason of q->mq_freeze_depth == 1 turned out that loop_set_status() forgot to call blk_mq_unfreeze_queue() at error paths for info->lo_encrypt_type != NULL case. ---------------------------------------- [ 37.509497] CPU: 2 PID: 634 Comm: a.out Tainted: G W 4.16.0+ #457 [ 37.513608] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 37.518832] RIP: 0010:blk_freeze_queue_start+0x17/0x40 [ 37.521778] RSP: 0018:ffffb0c2013e7c60 EFLAGS: 00010246 [ 37.524078] RAX: 0000000000000000 RBX: ffff8b07b1519798 RCX: 0000000000000000 [ 37.527015] RDX: 0000000000000002 RSI: ffffb0c2013e7cc0 RDI: ffff8b07b1519798 [ 37.529934] RBP: ffffb0c2013e7cc0 R08: 0000000000000008 R09: 47a189966239b898 [ 37.532684] R10: dad78b99b278552f R11: 9332dca72259d5ef R12: ffff8b07acd73678 [ 37.535452] R13: 0000000000004c04 R14: 0000000000000000 R15: ffff8b07b841e940 [ 37.538186] FS: 00007fede33b9740(0000) GS:ffff8b07b8e80000(0000) knlGS:0000000000000000 [ 37.541168] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 37.543590] CR2: 00000000206fdf18 CR3: 0000000130b30006 CR4: 00000000000606e0 [ 37.546410] Call Trace: [ 37.547902] blk_freeze_queue+0x9/0x30 [ 37.549968] loop_set_status+0x67/0x3c0 [loop] [ 37.549975] loop_set_status64+0x3b/0x70 [loop] [ 37.549986] lo_ioctl+0x223/0x810 [loop] [ 37.549995] blkdev_ioctl+0x572/0x980 [ 37.550003] block_ioctl+0x34/0x40 [ 37.550006] do_vfs_ioctl+0xa7/0x6d0 [ 37.550017] ksys_ioctl+0x6b/0x80 [ 37.573076] SyS_ioctl+0x5/0x10 [ 37.574831] do_syscall_64+0x6e/0x270 [ 37.576769] entry_SYSCALL_64_after_hwframe+0x42/0xb7 ---------------------------------------- [1] https://syzkaller.appspot.com/bug?id=cd662bc3f6022c0979d01a262c318fab2ee9b56f Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <bot+48594378e9851eab70bcd6f99327c7db58c5a28a@syzkaller.appspotmail.com> Fixes: ecdd09597a572513 ("block/loop: fix race between I/O and set_status") Cc: Ming Lei <tom.leiming@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: stable <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13Fix loop device flush before configure v3James Wang1-0/+3
[ Upstream commit 6460495709aeb651896bc8e5c134b2e4ca7d34a8 ] While installing SLES-12 (based on v4.4), I found that the installer will stall for 60+ seconds during LVM disk scan. The root cause was determined to be the removal of a bound device check in loop_flush() by commit b5dd2f6047ca ("block: loop: improve performance via blk-mq"). Restoring this check, examining ->lo_state as set by loop_set_fd() eliminates the bad behavior. Test method: modprobe loop max_loop=64 dd if=/dev/zero of=disk bs=512 count=200K for((i=0;i<4;i++))do losetup -f disk; done mkfs.ext4 -F /dev/loop0 for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done for f in `ls /dev/loop[0-9]*|sort`; do \ echo $f; dd if=$f of=/dev/null bs=512 count=1; \ done Test output: stock patched /dev/loop0 18.1217e-05 8.3842e-05 /dev/loop1 6.1114e-05 0.000147979 /dev/loop10 0.414701 0.000116564 /dev/loop11 0.7474 6.7942e-05 /dev/loop12 0.747986 8.9082e-05 /dev/loop13 0.746532 7.4799e-05 /dev/loop14 0.480041 9.3926e-05 /dev/loop15 1.26453 7.2522e-05 Note that from loop10 onward, the device is not mounted, yet the stock kernel consumes several orders of magnitude more wall time than it does for a mounted device. (Thanks for Mike Galbraith <efault@gmx.de>, give a changelog review.) Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: James Wang <jnwang@suse.com> Fixes: b5dd2f6047ca ("block: loop: improve performance via blk-mq") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08Revert "mtip32xx: use runtime tag to initialize command header"Greg Kroah-Hartman1-24/+12
This reverts commit db54facd56a40e7766bf7f7cda1ae138e72a691c which was commit a4e84aae8139aca9fbfbced1f45c51ca81b57488 upstream. Ben writes: MQ IO schedulers were introduced in 4.11, so this shouldn't be needed in older branches. It also causes a performance regression (fixed upstream). Please revert this for 4.4 and 4.9. So let's revert it! Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jens Axboe <axboe@fb.com> Cc: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-24mtip32xx: use runtime tag to initialize command headerMing Lei1-12/+24
[ Upstream commit a4e84aae8139aca9fbfbced1f45c51ca81b57488 ] mtip32xx supposes that 'request_idx' passed to .init_request() is tag of the request, and use that as request's tag to initialize command header. After MQ IO scheduler is in, request tag assigned isn't same with the request index anymore, so cause strange hardware failure on mtip32xx, even whole system panic is triggered. This patch fixes the issue by initializing command header via request's real tag. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18loop: Fix lost writes caused by missing flagRoss Zwisler1-1/+1
commit 1d037577c323e5090ce281e96bc313ab2eee5be2 upstream. The following commit: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") replaced __do_lo_send_write(), which used ITER_KVEC iterators, with lo_write_bvec() which uses ITER_BVEC iterators. In this change, though, the WRITE flag was lost: - iov_iter_kvec(&from, ITER_KVEC | WRITE, &kvec, 1, len); + iov_iter_bvec(&i, ITER_BVEC, bvec, 1, bvec->bv_len); This flag is necessary for the DAX case because we make decisions based on whether or not the iterator is a READ or a WRITE in dax_iomap_actor() and in dax_iomap_rw(). We end up going through this path in configurations where we combine a PMEM device with 4k sectors, a loopback device and DAX. The consequence of this missed flag is that what we intend as a write actually turns into a read in the DAX code, so no data is ever written. The very simplest test case is to create a loopback device and try and write a small string to it, then hexdump a few bytes of the device to see if the write took. Without this patch you read back all zeros, with this you read back the string you wrote. For XFS this causes us to fail or panic during the following xfstests: xfs/074 xfs/078 xfs/216 xfs/217 xfs/250 For ext4 we have a similar issue where writes never happen, but we don't currently have any xfstests that use loopback and show this issue. Fix this by restoring the WRITE flag argument to iov_iter_bvec(). This causes the xfstests to all pass. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Fixes: commit aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25rbd: silence bogus -Wmaybe-uninitialized warningIlya Dryomov1-1/+1
commit d4c2269b3d5d06a8ea434b1841fbcaec336ed396 upstream. drivers/block/rbd.c: In function ‘rbd_watch_cb’: drivers/block/rbd.c:3690:5: error: ‘struct_v’ may be used uninitialized in this function [-Werror=maybe-uninitialized] drivers/block/rbd.c:3759:5: note: ‘struct_v’ was declared here Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17pktcdvd: Fix pkt_setup_dev() error pathBart Van Assche1-3/+1
commit 5a0ec388ef0f6e33841aeb810d7fa23f049ec4cd upstream. Commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue") modified add_disk() and disk_release() but did not update any of the error paths that trigger a put_disk() call after disk->queue has been assigned. That introduced the following behavior in the pktcdvd driver if pkt_new_dev() fails: Kernel BUG at 00000000e98fd882 [verbose debug info unavailable] Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL, fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev() error path. Fixes: commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03loop: fix concurrent lo_open/lo_releaseLinus Torvalds1-2/+8
commit ae6650163c66a7eff1acd6eb8b0f752dcfa8eba5 upstream. 范龙飞 reports that KASAN can report a use-after-free in __lock_acquire. The reason is due to insufficient serialization in lo_release(), which will continue to use the loop device even after it has decremented the lo_refcnt to zero. In the meantime, another process can come in, open the loop device again as it is being shut down. Confusion ensues. Reported-by: 范龙飞 <long7573@126.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17rbd: set max_segments to USHRT_MAXIlya Dryomov1-1/+1
commit 21acdf45f4958135940f0b4767185cf911d4b010 upstream. Commit d3834fefcfe5 ("rbd: bump queue_max_segments") bumped max_segments (unsigned short) to max_hw_sectors (unsigned int). max_hw_sectors is set to the number of 512-byte sectors in an object and overflows unsigned short for 32M (largest possible) objects, making the block layer resort to handing us single segment (i.e. single page or even smaller) bios in that case. Fixes: d3834fefcfe5 ("rbd: bump queue_max_segments") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10nbd: fix use-after-free of rq/bio in the xmit pathJens Axboe1-9/+23
commit 429a787be6793554ee02aacc7e1f11ebcecc4453 upstream. For writes, we can get a completion in while we're still iterating the request and bio chain. If that happens, we're reading freed memory and we can crash. Break out after the last segment and avoid having the iterator read freed memory. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25nbd: set queue timeout properlyJosef Bacik1-1/+4
[ Upstream commit f8586855031a1d6b243f013c3082631346fddfad ] We can't just set the timeout on the tagset, we have to set it on the queue as it would have been setup already at this point. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14zram: set physical queue limits to avoid array out of bounds accessesJohannes Thumshirn1-0/+2
[ Upstream commit 0bc315381fe9ed9fb91db8b0e82171b645ac008f ] zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using the NVMe over Fabrics loopback target which potentially sends a huge bulk of pages attached to the bio's bvec this results in a kernel panic because of array out of bounds accesses in zram_decompress_page(). Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-15rbd: use GFP_NOIO for parent stat and data requestsIlya Dryomov1-2/+2
commit 1e37f2f84680fa7f8394fd444b6928e334495ccc upstream. rbd_img_obj_exists_submit() and rbd_img_obj_parent_read_full() are on the writeback path for cloned images -- we attempt a stat on the parent object to see if it exists and potentially read it in to call copyup. GFP_NOIO should be used instead of GFP_KERNEL here. Link: http://tracker.ceph.com/issues/22014 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27skd: Submit requests to firmware before triggering the doorbellBart Van Assche1-0/+6
commit 5fbd545cd3fd311ea1d6e8be4cedddd0ee5684c7 upstream. Ensure that the members of struct skd_msg_buf have been transferred to the PCIe adapter before the doorbell is triggered. This patch avoids that I/O fails sporadically and that the following error message is reported: (skd0:STM000196603:[0000:00:09.0]): Completion mismatch comp_id=0x0000 skreq=0x0400 new=0x0000 Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27skd: Avoid that module unloading triggers a use-after-freeBart Van Assche1-8/+9
commit 7277cc67b3916eed47558c64f9c9c0de00a35cda upstream. Since put_disk() triggers a disk_release() call and since that last function calls blk_put_queue() if disk->queue != NULL, clear the disk->queue pointer before calling put_disk(). This avoids that unloading the skd kernel module triggers the following use-after-free: WARNING: CPU: 8 PID: 297 at lib/refcount.c:128 refcount_sub_and_test+0x70/0x80 refcount_t: underflow; use-after-free. CPU: 8 PID: 297 Comm: kworker/8:1 Not tainted 4.11.10-300.fc26.x86_64 #1 Workqueue: events work_for_cpu_fn Call Trace: dump_stack+0x63/0x84 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5a/0x80 refcount_sub_and_test+0x70/0x80 refcount_dec_and_test+0x11/0x20 kobject_put+0x1f/0x50 blk_put_queue+0x15/0x20 disk_release+0xae/0xf0 device_release+0x32/0x90 kobject_release+0x67/0x170 kobject_put+0x2b/0x50 put_disk+0x17/0x20 skd_destruct+0x5c/0x890 [skd] skd_pci_probe+0x124d/0x13a0 [skd] local_pci_probe+0x42/0xa0 work_for_cpu_fn+0x14/0x20 process_one_work+0x19e/0x470 worker_thread+0x1dc/0x4a0 kthread+0x125/0x140 ret_from_fork+0x25/0x30 Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-25xen-blkfront: use a right index when checking requestsMunehisa Kamata1-3/+3
commit b15bd8cb37598afb2963f7eb9e2de468d2d60a2f upstream. Since commit d05d7f40791c ("Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block") and 3fc9d690936f ("Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block"), blkfront_resume() has been using an index for iterating ring_info to check request when iterating blk_shadow in an inner loop. This seems to have been accidentally introduced during the massive rewrite of the block layer macros in the commits. This may cause crash like this: [11798.057074] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [11798.058832] IP: [<ffffffff814411fa>] blkfront_resume+0x10a/0x610 .... [11798.061063] Call Trace: [11798.061063] [<ffffffff8139ce93>] xenbus_dev_resume+0x53/0x140 [11798.061063] [<ffffffff8139ce40>] ? xenbus_dev_probe+0x150/0x150 [11798.061063] [<ffffffff813f359e>] dpm_run_callback+0x3e/0x110 [11798.061063] [<ffffffff813f3a08>] device_resume+0x88/0x190 [11798.061063] [<ffffffff813f4cc0>] dpm_resume+0x100/0x2d0 [11798.061063] [<ffffffff813f5221>] dpm_resume_end+0x11/0x20 [11798.061063] [<ffffffff813950a8>] do_suspend+0xe8/0x1a0 [11798.061063] [<ffffffff813954bd>] shutdown_handler+0xfd/0x130 [11798.061063] [<ffffffff8139aba0>] ? split+0x110/0x110 [11798.061063] [<ffffffff8139ac26>] xenwatch_thread+0x86/0x120 [11798.061063] [<ffffffff810b4570>] ? prepare_to_wait_event+0x110/0x110 [11798.061063] [<ffffffff8108fe57>] kthread+0xd7/0xf0 [11798.061063] [<ffffffff811da811>] ? kfree+0x121/0x170 [11798.061063] [<ffffffff8108fd80>] ? kthread_park+0x60/0x60 [11798.061063] [<ffffffff810863b0>] ? call_usermodehelper_exec_work+0xb0/0xb0 [11798.061063] [<ffffffff810864ea>] ? call_usermodehelper_exec_async+0x13a/0x140 [11798.061063] [<ffffffff81534a45>] ret_from_fork+0x25/0x30 Use the right index in the inner loop. Fixes: d05d7f40791c ("Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block") Fixes: 3fc9d690936f ("Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Reviewed-by: Thomas Friebel <friebelt@amazon.de> Reviewed-by: Eduardo Valentin <eduval@amazon.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com> Cc: xen-devel@lists.xenproject.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11virtio_blk: fix panic in initialization error pathOmar Sandoval1-1/+2
[ Upstream commit 6bf6b0aa3da84a3d9126919a94c49c0fb7ee2fb3 ] If blk_mq_init_queue() returns an error, it gets assigned to vblk->disk->queue. Then, when we call put_disk(), we end up calling blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by only assigning to vblk->disk->queue on success. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11nbd: blk_mq_init_queue returns an error code on failure, not NULLJeff Moyer1-2/+4
[ Upstream commit 25b4acfc7de0fc4da3bfea3a316f7282c6fbde81 ] Additionally, don't assign directly to disk->queue, otherwise blk_put_queue (called via put_disk) will choke (panic) on the errno stored there. Bug found by code inspection after Omar found a similar issue in virtio_blk. Compile-tested only. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05xen/blkback: don't free be structure too earlyJuergen Gross1-3/+3
commit 71df1d7ccad1c36f7321d6b3b48f2ea42681c363 upstream. The be structure must not be freed when freeing the blkif structure isn't done. Otherwise a use-after-free of be when unmapping the ring used for communicating with the frontend will occur in case of a late call of xenblk_disconnect() (e.g. due to an I/O still active when trying to disconnect). Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05xen/blkback: don't use xen_blkif_get() in xen-blkback kthreadJuergen Gross2-4/+0
commit a24fa22ce22ae302b3bf8f7008896d52d5d57b8d upstream. There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread of xen-blkback. Thread stopping is synchronous and using the blkif reference counting in the kthread will avoid to ever let the reference count drop to zero at the end of an I/O running concurrent to disconnecting and multiple rings. Setting ring->xenblkd to NULL after stopping the kthread isn't needed as the kthread does this already. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29xen-blkback: don't leak stack data via response ringJan Beulich2-31/+17
commit 089bc0143f489bd3a4578bdff5f4ca68fb26f341 upstream. Rather than constructing a local structure instance on the stack, fill the fields directly on the shared ring, just like other backends do. Build on the fact that all response structure flavors are actually identical (the old code did make this assumption too). This is XSA-216. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29xen/blkback: fix disconnect while I/Os in flightJuergen Gross2-2/+6
commit 46464411307746e6297a034a9983a22c9dfc5a0c upstream. Today disconnecting xen-blkback is broken in case there are still I/Os in flight: xen_blkif_disconnect() will bail out early without releasing all resources in the hope it will be called again when the last request has terminated. This, however, won't happen as xen_blkif_free() won't be called on termination of the last running request: xen_blkif_put() won't decrement the blkif refcnt to 0 as xen_blkif_disconnect() didn't finish before thus some xen_blkif_put() calls in xen_blkif_disconnect() didn't happen. To solve this deadlock xen_blkif_disconnect() and xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and xen_blkif_get() but use some other way to do their accounting of resources. This at once fixes another error in xen_blkif_disconnect(): when it returned early with -EBUSY for another ring than 0 it would call xen_blkif_put() again for already handled rings on a subsequent call. This will lead to inconsistencies in the refcnt handling. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21nbd: fix 64-bit divisionJens Axboe1-1/+1
commit e88f72cb9f54f6d244e55f629fe5e2f34ca6f9ed upstream. We have this: ERROR: "__aeabi_ldivmod" [drivers/block/nbd.ko] undefined! ERROR: "__divdi3" [drivers/block/nbd.ko] undefined! nbd.c:(.text+0x247c72): undefined reference to `__divdi3' due to a recent commit, that did 64-bit division. Use the proper divider function so that 32-bit compiles don't break. Fixes: ef77b515243b ("nbd: use loff_t for blocksize and nbd_set_size args") Signed-off-by: Jens Axboe <axboe@fb.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21nbd: use loff_t for blocksize and nbd_set_size argsJosef Bacik1-4/+4
commit ef77b515243b3499d62cf446eda6ca7e0a0b079c upstream. If we have large devices (say like the 40t drive I was trying to test with) we will end up overflowing the int arguments to nbd_set_size and not get the right size for our device. Fix this by using loff_t everywhere so I don't have to think about this again. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> [bwh: Backported to 4.9: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21zram: do not use copy_page with non-page aligned addressMinchan Kim1-3/+3
commit d72e9a7a93e4f8e9e52491921d99e0c8aa89eb4e upstream. The copy_page is optimized memcpy for page-alinged address. If it is used with non-page aligned address, it can corrupt memory which means system corruption. With zram, it can happen with 1. 64K architecture 2. partial IO 3. slub debug Partial IO need to allocate a page and zram allocates it via kmalloc. With slub debug, kmalloc(PAGE_SIZE) doesn't return page-size aligned address. And finally, copy_page(mem, cmem) corrupts memory. So, this patch changes it to memcpy. Actuaully, we don't need to change zram_bvec_write part because zsmalloc returns page-aligned address in case of PAGE_SIZE class but it's not good to rely on the internal of zsmalloc. Note: When this patch is merged to stable, clear_page should be fixed, too. Unfortunately, recent zram removes it by "same page merge" feature so it's hard to backport this patch to -stable tree. I will handle it when I receive the mail from stable tree maintainer to merge this patch to backport. Fixes: 42e99bd ("zram: optimize memory operations with clear_page()/copy_page()") Link: http://lkml.kernel.org/r/1492042622-12074-2-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12loop: fix LO_FLAGS_PARTSCAN hangOmar Sandoval1-7/+8
commit e02898b423802b1f3a3aaa7f16e896da069ba8f7 upstream. loop_reread_partitions() needs to do I/O, but we just froze the queue, so we end up waiting forever. This can easily be reproduced with losetup -P. Fix it by moving the reread to after we unfreeze the queue. Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12block/loop: fix race between I/O and set_statusMing Lei1-5/+12
commit ecdd09597a57251323b0de50e3d45e69298c4a83 upstream. Inside set_status, transfer need to setup again, so we have to drain IO before the transition, otherwise oops may be triggered like the following: divide error: 0000 [#1] SMP KASAN CPU: 0 PID: 2935 Comm: loop7 Not tainted 4.10.0-rc7+ #213 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006ba1e840 task.stack: ffff880067338000 RIP: 0010:transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: 0018:ffff88006733f108 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800688d7000 RCX: 0000000000000059 RDX: 0000000000000000 RSI: 1ffff1000d743f43 RDI: ffff880068891c08 RBP: ffff88006733f160 R08: ffff8800688d7001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800688d7000 R13: ffff880067b7d000 R14: dffffc0000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006c17e0 CR3: 0000000066e3b000 CR4: 00000000001406f0 Call Trace: lo_do_transfer drivers/block/loop.c:251 [inline] lo_read_transfer drivers/block/loop.c:392 [inline] do_req_filebacked drivers/block/loop.c:541 [inline] loop_handle_cmd drivers/block/loop.c:1677 [inline] loop_queue_work+0xda0/0x49b0 drivers/block/loop.c:1689 kthread_worker_fn+0x4c3/0xa30 kernel/kthread.c:630 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 Code: 03 83 e2 07 41 29 df 42 0f b6 04 30 4d 8d 44 24 01 38 d0 7f 08 84 c0 0f 85 62 02 00 00 44 89 f8 41 0f b6 48 ff 25 ff 01 00 00 99 <f7> 7d c8 48 63 d2 48 03 55 d0 48 89 d0 48 89 d7 48 c1 e8 03 83 RIP: transfer_xor+0x1d1/0x440 drivers/block/loop.c:110 RSP: ffff88006733f108 ---[ end trace 0166f7bd3b0c0933 ]--- Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19virtio_blk: avoid DMA to stack for the sense bufferChristoph Hellwig1-1/+3
commit a14d749fcebe97ddf6af6db3d1f6ece85c9ddcb9 upstream. Most users of BLOCK_PC requests allocate the sense buffer on the stack, so to avoid DMA to the stack copy them to a field in the heap allocated virtblk_req structure. Without that any attempt at SCSI passthrough I/O, including the SG_IO ioctl from userspace will crash the kernel. Note that this includes running tools like hdparm even when the host does not have SCSI passthrough enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19zram: support BDI_CAP_STABLE_WRITESMinchan Kim1-2/+11
commit b09ab054b69b07077bd3292f67e777861ac796e5 upstream. zram has used per-cpu stream feature from v4.7. It aims for increasing cache hit ratio of scratch buffer for compressing. Downside of that approach is that zram should ask memory space for compressed page in per-cpu context which requires stricted gfp flag which could be failed. If so, it retries to allocate memory space out of per-cpu context so it could get memory this time and compress the data again, copies it to the memory space. In this scenario, zram assumes the data should never be changed but it is not true without stable page support. So, If the data is changed under us, zram can make buffer overrun so that zsmalloc free object chain is broken so system goes crash like below https://bugzilla.suse.com/show_bug.cgi?id=997574 This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block device needing *stable write*". Fixes: da9556a2367c ("zram: user per-cpu compression streams") Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hyeoncheol Lee <cheol.lee@lge.com> Cc: <yjay.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19zram: revalidate disk under init_lockMinchan Kim1-7/+1
commit e7ccfc4ccb703e0f033bd4617580039898e912dd upstream. Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk") moved revalidate_disk call out of init_lock to avoid lockdep false-positive splat. However, commit 08eee69fcf6b ("zram: remove init_lock in zram_make_request") removed init_lock in IO path so there is no worry about lockdep splat. So, let's restore it. This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next patch. Fixes: da9556a2367c ("zram: user per-cpu compression streams") Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hyeoncheol Lee <cheol.lee@lge.com> Cc: <yjay.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06loop: return proper error from loop_queue_rq()Omar Sandoval1-1/+1
commit b4a567e8114327518c09f5632339a5954ab975a3 upstream. ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not an errno. Fixes: f4aa4c7bbac6 ("block: loop: convert to per-device workqueue") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08zram: restrict add/remove attributes to root onlySergey Senozhatsky1-1/+7
zram hot_add sysfs attribute is a very 'special' attribute - reading from it creates a new uninitialized zram device. This file, by a mistake, can be read by a 'normal' user at the moment, while only root must be able to create a new zram device, therefore hot_add attribute must have S_IRUSR mode, not S_IRUGO. [akpm@linux-foundation.org: s/sence/sense/, reflow comment to use 80 cols] Fixes: 6566d1a32bf72 ("zram: add dynamic device add/remove functionality") Link: http://lkml.kernel.org/r/20161205155845.20129-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Steven Allen <steven@stebalien.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-01zram: fix unbalanced idr management at hot removalTakashi Iwai1-1/+2
The zram hot removal code calls idr_remove() even when zram_remove() returns an error (typically -EBUSY). This results in a leftover at the device release, eventually leading to a crash when the module is reloaded. As described in the bug report below, the following procedure would cause an Oops with zram: - provision three zram devices via modprobe zram num_devices=3 - configure a size for each device + echo "1G" > /sys/block/$zram_name/disksize - mkfs and mount zram0 only - attempt to hot remove all three devices + echo 2 > /sys/class/zram-control/hot_remove + echo 1 > /sys/class/zram-control/hot_remove + echo 0 > /sys/class/zram-control/hot_remove - zram0 removal fails with EBUSY, as expected - unmount zram0 - try zram0 hot remove again + echo 0 > /sys/class/zram-control/hot_remove - fails with ENODEV (unexpected) - unload zram kernel module + completes successfully - zram0 device node still exists - attempt to mount /dev/zram0 + mount command is killed + following BUG is encountered BUG: unable to handle kernel paging request at ffffffffa0002ba0 IP: get_disk+0x16/0x50 Oops: 0000 [#1] SMP CPU: 0 PID: 252 Comm: mount Not tainted 4.9.0-rc6 #176 Call Trace: exact_lock+0xc/0x20 kobj_lookup+0xdc/0x160 get_gendisk+0x2f/0x110 __blkdev_get+0x10c/0x3c0 blkdev_get+0x19d/0x2e0 blkdev_open+0x56/0x70 do_dentry_open.isra.19+0x1ff/0x310 vfs_open+0x43/0x60 path_openat+0x2c9/0xf30 do_filp_open+0x79/0xd0 do_sys_open+0x114/0x1e0 SyS_open+0x19/0x20 entry_SYSCALL_64_fastpath+0x13/0x94 This patch adds the proper error check in hot_remove_store() not to call idr_remove() unconditionally. Fixes: 17ec4cd98578 ("zram: don't call idr_remove() from zram_remove()") Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1010970 Link: http://lkml.kernel.org/r/20161121132140.12683-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: David Disseldorp <ddiss@suse.de> Reported-by: David Disseldorp <ddiss@suse.de> Tested-by: David Disseldorp <ddiss@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-12aoe: fix crash in page count manipulationJens Axboe1-41/+0
aoeblk contains some mysterious code, that wants to elevate the bio vec page counts while it's under IO. That is not needed, it's fragile, and it's causing kernel oopses for some. Reported-by: Tested-by: Don Koch <kochd@us.ibm.com> Tested-by: Tested-by: Don Koch <kochd@us.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10drbd: Fix kernel_sendmsg() usage - potential NULL derefRichard Weinberger1-1/+1
Don't pass a size larger than iov_len to kernel_sendmsg(). Otherwise it will cause a NULL pointer deref when kernel_sendmsg() returns with rv < size. DRBD as external module has been around in the kernel 2.4 days already. We used to be compatible to 2.4 and very early 2.6 kernels, we used to use rv = sock_sendmsg(sock, &msg, iov.iov_len); then later changed to rv = kernel_sendmsg(sock, &msg, &iov, 1, size); when we should have used rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len); tcp_sendmsg() used to totally ignore the size parameter. 57be5bd ip: convert tcp_sendmsg() to iov_iter primitives changes that, and exposes our long standing error. Even with this error exposed, to trigger the bug, we would need to have an environment (config or otherwise) causing us to not use sendpage() for larger transfers, a failing connection, and have it fail "just at the right time". Apparently that was unlikely enough for most, so this went unnoticed for years. Still, it is known to trigger at least some of these, and suspected for the others: [0] http://lists.linbit.com/pipermail/drbd-user/2016-July/023112.html [1] http://lists.linbit.com/pipermail/drbd-dev/2016-March/003362.html [2] https://forums.grsecurity.net/viewtopic.php?f=3&t=4546 [3] https://ubuntuforums.org/showthread.php?t=2336150 [4] http://e2.howsolveproblem.com/i/1175162/ This should go into 4.9, and into all stable branches since and including v4.0, which is the first to contain the exposing change. It is correct for all stable branches older than that as well (which contain the DRBD driver; which is 2.6.33 and up). It requires a small "conflict" resolution for v4.4 and earlier, with v4.5 we dropped the comment block immediately preceding the kernel_sendmsg(). Fixes: b411b3637fa7 ("The DRBD driver") Cc: <stable@vger.kernel.org> # 2.6.33.x- Cc: viro@zeniv.linux.org.uk Cc: christoph.lechleitner@iteg.at Cc: wolfgang.glas@iteg.at Reported-by: Christoph Lechleitner <christoph.lechleitner@iteg.at> Tested-by: Christoph Lechleitner <christoph.lechleitner@iteg.at> Signed-off-by: Richard Weinberger <richard@nod.at> [changed oneliner to be "obvious" without context; more verbose message] Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-07nbd: Fix error handlingChristophe JAILLET1-1/+1
'blk_mq_alloc_request()' returns an error pointer in case of error, not NULL. So test it with IS_ERR. Fixes: fd8383fd88a2 ("nbd: convert to blkmq") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-31virtio_blk: Delete an unnecessary initialisation in init_vq()Markus Elfring1-1/+1
The local variable "err" will be set to an appropriate value by a following statement. Thus omit the explicit initialisation at the beginning. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-10-31virtio_blk: Use kmalloc_array() in init_vq()Markus Elfring1-4/+4
Multiplications for the size determination of memory allocations indicated that array data structures should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>