summaryrefslogtreecommitdiff
path: root/block/genhd.c
AgeCommit message (Collapse)AuthorFilesLines
2015-09-18block: fix ext_dev_lock lockdep reportDan Williams1-6/+6
commit 4d66e5e9b6d720d8463e11d027bd4ad91c8b1318 upstream. ================================= [ INFO: inconsistent lock state ] 4.1.0-rc7+ #217 Tainted: G O --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 {SOFTIRQ-ON-W} state was registered at: [<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70 [<ffffffff810c1947>] lock_acquire+0xb7/0x290 [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50 [<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0 <-- take the lock in process context [..] [<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70 [<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70 [<ffffffff810c1947>] lock_acquire+0xb7/0x290 [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70 [<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50 [<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70 [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 <-- take the lock in softirq [<ffffffff8143bfec>] part_release+0x1c/0x50 [<ffffffff8158edf6>] device_release+0x36/0xb0 [<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8145aad0>] kobject_put+0x30/0x70 [<ffffffff8158f147>] put_device+0x17/0x20 [<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180 [<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0 [<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90 [<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90 [<ffffffff81067e2e>] __do_softirq+0xde/0x600 Neil sees this in his tests and it also triggers on pmem driver unbind for the libnvdimm tests. This fix is on top of an initial fix by Keith for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block: Fix dev_t minor allocation lifetime". Both this and 2da78092dda1 are candidates for -stable. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: Keith Busch <keith.busch@intel.com> Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
2015-04-14genhd: check for int overflow in disk_expand_part_tbl()Jens Axboe1-2/+9
commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream. We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-12-01genhd: fix leftover might_sleep() in blk_free_devt()Jens Axboe1-2/+0
commit 46f341ffcfb5d8530f7d1e60f3be06cce6661b62 upstream. Commit 2da78092 changed the locking from a mutex to a spinlock, so we now longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-12-01block: Fix dev_t minor allocation lifetimeKeith Busch1-9/+9
commit 2da78092dda13f1efd26edbbf99a567776913750 upstream. Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not so sleep. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com> [lizf: Backported to 3.4: - adjust context - remove idr_preload() and idr_preload_end()] Signed-off-by: Zefan Li <lizefan@huawei.com>
2013-07-13block: do not pass disk names as format stringsKees Cook1-1/+1
commit ffc8b30866879ed9ba62bd0a86fecdbd51cd3d19 upstream. Disk names may contain arbitrary strings, so they must not be interpreted as format strings. It seems that only md allows arbitrary strings to be used for disk names, but this could allow for a local memory corruption from uid 0 into ring 0. CVE-2013-2851 Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04block: fix synchronization and limit check in blk_alloc_devt()Tejun Heo1-8/+5
commit ce23bba842aee98092225d9576dba47c82352521 upstream. idr allocation in blk_alloc_devt() wasn't synchronized against lookup and removal, and its limit check was off by one - 1 << MINORBITS is the number of minors allowed, not the maximum allowed minor. Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit checking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04block: fix ext_devt_idr handlingTomas Henzl1-1/+5
commit 7b74e912785a11572da43292786ed07ada7e3e0c upstream. While adding and removing a lot of disks disks and partitions this sometimes shows up: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted) Hardware name: sysfs: cannot create duplicate filename '/dev/block/259:751' Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan] Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1 Call Trace: warn_slowpath_common+0x87/0xc0 warn_slowpath_fmt+0x46/0x50 sysfs_add_one+0xc9/0x130 sysfs_do_create_link+0x12b/0x170 sysfs_create_link+0x13/0x20 device_add+0x317/0x650 idr_get_new+0x13/0x50 add_partition+0x21c/0x390 rescan_partitions+0x32b/0x470 sd_open+0x81/0x1f0 [sd_mod] __blkdev_get+0x1b6/0x3c0 blkdev_get+0x10/0x20 register_disk+0x155/0x170 add_disk+0xa6/0x160 sd_probe_async+0x13b/0x210 [sd_mod] add_wait_queue+0x46/0x60 async_thread+0x102/0x250 default_wake_function+0x0/0x20 async_thread+0x0/0x250 kthread+0x96/0xa0 child_rip+0xa/0x20 kthread+0x0/0xa0 child_rip+0x0/0x20 This most likely happens because dev_t is freed while the number is still used and idr_get_new() is not protected on every use. The fix adds a mutex where it wasn't before and moves the dev_t free function so it is called after device del. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-15block: fix buffer overflow when printing partition UUIDsTejun Heo1-4/+6
6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct" added part_unpack_uuid() which assumes that the passed in buffer has enough space for sprintfing "%pU" - 37 characters including '\0'. Unfortunately, b5af921ec0233 "init: add support for root devices specified by partition UUID" supplied 33 bytes buffer to the function leading to the following panic with stackprotector enabled. Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e [<ffffffff815e226b>] panic+0xba/0x1c6 [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20 [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f [<ffffffff81aee0fa>] mount_root+0x57/0x5b [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176 [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30 [<ffffffff81aedd60>] kernel_init+0x155/0x15a [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0 [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10 [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5 [<ffffffff815f4d20>] ? gs_change+0x13/0x13 Increase the buffer size, remove the dangerous part_unpack_uuid() and use snprintf() directly from printk_all_partitions(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com> Cc: Will Drewry <wad@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02Block: use a freezable workqueue for disk-event pollingAlan Stern1-5/+5
This patch (as1519) fixes a bug in the block layer's disk-events polling. The polling is done by a work routine queued on the system_nrt_wq workqueue. Since that workqueue isn't freezable, the polling continues even in the middle of a system sleep transition. Obviously, polling a suspended drive for media changes and such isn't a good thing to do; in the case of USB mass-storage devices it can lead to real problems requiring device resets and even re-enumeration. The patch fixes things by creating a new system-wide, non-reentrant, freezable workqueue and using it for disk-events polling. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: <stable@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02block: fix __blkdev_get and add_disk race conditionStanislaw Gruszka1-13/+19
The following situation might occur: __blkdev_get: add_disk: register_disk() get_gendisk() disk_block_events() disk->ev == NULL disk_add_events() __disk_unblock_events() disk->ev != NULL --ev->block Then we unblock events, when they are suppose to be blocked. This can trigger events related block/genhd.c warnings, but also can crash in sd_check_events() or other places. I'm able to reproduce crashes with the following scripts (with connected usb dongle as sdb disk). <snip> DEV=/dev/sdb ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue function stop_me() { for i in `jobs -p` ; do kill $i 2> /dev/null ; done exit } trap stop_me SIGHUP SIGINT SIGTERM for ((i = 0; i < 10; i++)) ; do while true; do fdisk -l $DEV 2>&1 > /dev/null ; done & done while true ; do echo 1 > $ENABLE sleep 1 echo 0 > $ENABLE done </snip> I use the script to verify patch fixing oops in sd_revalidate_disk http://marc.info/?l=linux-scsi&m=132935572512352&w=2 Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in sd_revalidate_disk" or this one, script easily crash kernel within a few seconds. With both patches applied I do not observe crash. Unfortunately after some time (dozen of minutes), script will hung in: [ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20 [ 1563.906437] [<c04532d5>] msleep+0x15/0x20 [ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0 [ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170 [ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60 [ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0 [ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60 [ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0 [ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage] [ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage] Anyway I think this patch is some step forward. As drawback, I do not teardown on sysfs file create error, because I do not know how to nullify disk->ev (since it can be used). However add_disk error handling practically does not exist too, and things will work without this sysfs file, except events will not be exported to user space. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-16Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)
2012-01-07Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into ZAl Viro1-3/+2
2012-01-04switch device_get_devnode() and ->devnode() to umode_t *Al Viro1-1/+1
both callers of device_get_devnode() are only interested in lower 16bits and nobody tries to return anything wider than 16bit anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04fs: move code out of buffer.cAl Viro1-1/+0
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04make register_disk() staticAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-14block: misc updates to blk_get_queue()Tejun Heo1-1/+1
* blk_get_queue() is peculiar in that it returns 0 on success and 1 on failure instead of 0 / -errno or boolean. Update it such that it returns %true on success and %false on failure. * Make sure the caller checks for the return value. * Separate out __blk_get_queue() which doesn't check whether @q is dead and put it in blk.h. This will be used later. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-10block: Revert "[SCSI] genhd: add a new attribute "alias" in gendisk"Tejun Heo1-71/+0
This reverts commit a72c5e5eb738033938ab30d6a634b74d1d060f10. The commit introduced alias for block devices which is intended to be used during logging although actual usage hasn't been committed yet. This approach adds very limited benefit (raw log might be easier to follow) which can be trivially implemented in userland but has a lot of problems. It is much worse than netif renames because it doesn't rename the actual device but just adds conveninence name which isn't used universally or enforced. Everything internal including device lookup and sysfs still uses the internal name and nothing prevents two devices from using conflicting alias - ie. sda can have sdb as its alias. This has been nacked by people working on device driver core, block layer and kernel-userland interface and shouldn't have been upstreamed. Revert it. http://thread.gmane.org/gmane.linux.kernel/1155104 http://thread.gmane.org/gmane.linux.scsi/68632 http://thread.gmane.org/gmane.linux.scsi/69776 Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Kay Sievers <kay.sievers@vrfy.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nao Nishijima <nao.nishijima.xt@hitachi.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+2
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits) virtio-blk: use ida to allocate disk index hpsa: add small delay when using PCI Power Management to reset for kump cciss: add small delay when using PCI Power Management to reset for kump xen/blkback: Fix two races in the handling of barrier requests. xen/blkback: Check for proper operation. xen/blkback: Fix the inhibition to map pages when discarding sector ranges. xen/blkback: Report VBD_WSECT (wr_sect) properly. xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests. xen-blkfront: plug device number leak in xlblk_init() error path xen-blkfront: If no barrier or flush is supported, use invalid operation. xen-blkback: use kzalloc() in favor of kmalloc()+memset() xen-blkback: fixed indentation and comments xen-blkfront: fix a deadlock while handling discard response xen-blkfront: Handle discard requests. xen-blkback: Implement discard requests ('feature-discard') xen-blkfront: add BLKIF_OP_DISCARD and discard request struct drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd() drivers/block/loop.c: emit uevent on auto release drivers/block/cpqarray.c: use pci_dev->revision loop: always allow userspace partitions and optionally support automatic scanning ... Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-05Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+8
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-10-19block: make gendisk hold a reference to its queueTejun Heo1-0/+8
The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-08-29[SCSI] genhd: add a new attribute "alias" in gendiskNao Nishijima1-0/+71
This patch allows the user to set an "alias" of the disk via sysfs interface. This patch only adds a new attribute "alias" in gendisk structure. To show the alias instead of the device name in kernel messages, we need to revise printk messages and use alias_name() in them. Example: (current) printk("disk name is %s\n", disk->disk_name); (new) printk("disk name is %s\n", alias_name(disk)); Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can have an "alias" which length is up to 255 bytes. This attribute is write-once. Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Suggested-by: Jon Masters <jcm@redhat.com> Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-08-23block: add GENHD_FL_NO_PART_SCANTejun Heo1-2/+2
There are cases where suppressing partition scan is useful - e.g. for lo devices and pseudo SATA devices which advertise to be a disk but get upset on partition scan (some port multiplier control devices show such behavior). This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan regardless of the number of possible partitions. disk_partitionable() is renamed to disk_part_scan_enabled() as suppressing partition scan doesn't imply the device can't be partitioned using BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now directly tests disk_max_parts() to maintain backward-compatibility. -v2: Updated to make it clear that only partition scan is suppressed not partitioning itself as suggested by Kay Sievers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02block/genhd.c: remove useless cast in diskstats_show()Herbert Poetzl1-4/+4
Remove the (unsigned long long) cast in diskstats_show() and adjusts the seq_printf() format string to 'unsigned long' diskstats_show() uses part_stat_read() to get the stats, which either accesses the specified field in the struct disk_stats directly (non SMP) or sums up the per CPU values in a variable of the same type as the field, so in any case the result will have the same type and range as the specified field which for all disk_stats entries is unsigned long Also, for unsigned long ranges the output of %lu should be identical to the one of %llu, so no change in the actual proc entry contents. Signed-off-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-25Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-13/+15
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...
2011-07-21block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()Lai Jiangshan1-9/+1
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-01block: flush MEDIA_CHANGE from drivers on close(2)Tejun Heo1-10/+12
Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01Merge branch 'for-linus' into for-3.1/coreJens Axboe1-34/+45
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13block:remove some spare spaces in genhd.cWanlong Gao1-3/+3
Remove the end-of-line spaces in genhd.c. Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: make disk_block_events() properly wait for work cancellationTejun Heo1-0/+10
disk_block_events() should guarantee that the event work is not in flight on return and once blocked it shouldn't issue further cancellations. Because there was no synchronization between the first blocker doing cancel_delayed_work_sync() and the following blockers, the following blockers could finish before cancellation was complete, which broke both guarantees - event work could be in flight and cancellation could happen after return. This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 Fix it by adding an outer mutex which protects both block count manipulation and work cancellation. -v2: Use outer mutex instead of bit waitqueue per Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: remove non-syncing __disk_block_events() and fold it into ↵Tejun Heo1-31/+24
disk_block_events() After the previous update to disk_check_events(), nobody is using non-syncing __disk_block_events(). Remove @sync and, as this makes __disk_block_events() virtually identical to disk_block_events(), remove the underscore prefixed version. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: don't use non-syncing event blocking in disk_check_events()Tejun Heo1-3/+11
This patch is part of fix for triggering of WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 disk_clear_events() blocks events, schedules and flushes the event work. It expects the work to have started execution on schedule and finished on return from flush. WARN_ON_ONCE() triggers if the event work hasn't executed as expected. This problem happens because __disk_block_events() fails to guarantee that the event work item is not in flight on return from the function in race-free manner. The problem is two-fold and this patch addresses one of them. When __disk_block_events() is called with @sync == %false, it bumps event block count, calls cancel_delayed_work() and return. This makes it impossible to guarantee that event polling is not in flight on return from syncing __disk_block_events() - if the first blocker was non-syncing, polling could still be in progress and later syncing ones would assume that the first blocker already canceled it. Making __disk_block_events() cancel_sync regardless of block count isn't feasible either as it may race with forced event checking in disk_clear_events(). As disk_check_events() is the only user of non-syncing __disk_block_events(), updating it to directly cancel and schedule event work is the easiest way to solve the issue. Note that there's another bug in __disk_block_events() and this patch doesn't fix the issue completely. Later patch will fix the other bug. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26block: always allocate genhd->ev if check_events is implementedTejun Heo1-1/+1
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block drivers which have inadequate ->check_events(). Combined with earlier change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to userland), this enables using ->check_events() for internal processing while avoiding enabling in-kernel block event polling which can lead to infinite event loop. Unfortunately, this made many drivers including floppy without any bit set in disk->events and ->async_events in which case disk_add_events() simply skipped allocation of disk->ev, which disables whole event handling. As ->check_events() is still used during open processing for revalidation, this can lead to open failure. This patch always allocates disk->ev if ->check_events is implemented. In the long term, it would make sense to simply include the event structure inline into genhd as it's now used by virtually all block devices. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ondrej Zary <linux@rainbow-software.org> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-21block: don't propagate unlisted DISK_EVENTs to userlandTejun Heo1-2/+6
DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and internal event for revalidation of removeable devices. Some legacy drivers don't implement proper event detection and continuously generate events under certain circumstances. For example, ide-cd generates media changed continuously if there's no media in the drive, which can lead to infinite loop of events jumping back and forth between the driver and userland event handler. This patch updates disk event infrastructure such that it never propagates events not listed in disk->events to userland. Those events are processed the same for internal purposes but uevent generation is suppressed. This also ensures that userland only gets events which are advertised in the @events sysfs node lowering risk of confusion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-09block: Don't implicitly trigger event check on disk_unblock_events()Tejun Heo1-1/+1
Currently, disk_unblock_events() implicitly kick event check if the block count reaches zero. This behavior is not described in the comment and hinders with future changes. Make the unblocker explicitly check events by calling disk_check_events() as necessary. This patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org>
2011-03-04Merge branch 'for-linus' of ../linux-2.6-block into block-for-2.6.39/coreTejun Heo1-1/+1
This merge creates two set of conflicts. One is simple context conflicts caused by removal of throtl_scheduled_delayed_work() in for-linus and removal of throtl_shutdown_timer_wq() in for-2.6.39/core. The other is caused by commit 255bb490c8 (block: blk-flush shouldn't call directly into q->request_fn() __blk_run_queue()) in for-linus crashing with FLUSH reimplementation in for-2.6.39/core. The conflict isn't trivial but the resolution is straight-forward. * __blk_run_queue() calls in flush_end_io() and flush_data_end_io() should be called with @force_kblockd set to %true. * elv_insert() in blk_kick_flush() should use %ELEVATOR_INSERT_REQUEUE. Both changes are to avoid invoking ->request_fn() directly from request completion path and closely match the changes in the commit 255bb490c8. Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-02block/genhd: Change some numerals into macrosLiu Yuan1-8/+8
Rename the numerals in the diskstats_show() into the macros. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-02-24Fix over-zealous flush_disk when changing device size.NeilBrown1-1/+1
There are two cases when we call flush_disk. In one, the device has disappeared (check_disk_change) so any data will hold becomes irrelevant. In the oter, the device has changed size (check_disk_size_change) so data we hold may be irrelevant. In both cases it makes sense to discard any 'clean' buffers, so they will be read back from the device if needed. In the former case it makes sense to discard 'dirty' buffers as there will never be anywhere safe to write the data. In the second case it *does*not* make sense to discard dirty buffers as that will lead to file system corruption when you simply enlarge the containing devices. flush_disk calls __invalidate_devices. __invalidate_device calls both invalidate_inodes and invalidate_bdev. invalidate_inodes *does* discard I_DIRTY inodes and this does lead to fs corruption. invalidate_bev *does*not* discard dirty pages, but I don't really care about that at present. So this patch adds a flag to __invalidate_device (calling it __invalidate_device2) to indicate whether dirty buffers should be killed, and this is passed to invalidate_inodes which can choose to skip dirty inodes. flusk_disk then passes true from check_disk_change and false from check_disk_size_change. dm avoids tripping over this problem by calling i_size_write directly rathher than using check_disk_size_change. md does use check_disk_size_change and so is affected. This regression was introduced by commit 608aeef17a which causes check_disk_size_change to call flush_disk, so it is suitable for any kernel since 2.6.27. Cc: stable@kernel.org Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Andrew Patterson <andrew.patterson@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-13Merge branch 'for-2.6.38/event-handling' into for-2.6.38/coreJens Axboe1-28/+516
2011-01-07block: add internal hd part table referencesJens Axboe1-1/+2
We can't use krefs since it's apparently restricted to very basic reference counting. This reverts commit e4a683c8. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-05block: fix accounting bug on cross partition mergesJerome Marchand1-0/+1
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. Also add a refcount to struct hd_struct to keep the partition in memory as long as users exist. We use kref_test_and_get() to ensure we don't add a reference to a partition which is going away. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17fs/block: type signature of major_to_index(int) to major_to_index(unsigned)Yang Zhang1-1/+1
The major/minor device numbers are always defined and used as `unsigned'. Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)Yang Zhang1-1/+1
Signed-off-by: Yang Zhang <kthreadd@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16implement in-kernel gendisk events handlingTejun Heo1-0/+429
Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16block: move register_disk() and del_gendisk() to block/genhd.cTejun Heo1-3/+87
There's no reason for register_disk() and del_gendisk() to be in fs/partitions/check.c. Move both to genhd.c. While at it, collapse unlink_gendisk(), which was artificially in a separate function due to genhd.c / check.c split, into del_gendisk(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-16block: kill genhd_media_change_notify()Tejun Heo1-25/+0
There's no user of the facility. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-25Revert "block: fix accounting bug on cross partition merges"Jens Axboe1-14/+0
This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327. Conflicts: include/linux/genhd.h It has numerous issues with the cleanup path and non-elevator devices. Revert it for now so we can come up with a clean version without rushing things. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds1-5/+2
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits) driver core: Display error codes when class suspend fails Driver core: Add section count to memory_block struct Driver core: Add mutex for adding/removing memory blocks Driver core: Move find_memory_block routine hpilo: Despecificate driver from iLO generation driver core: Convert link_mem_sections to use find_memory_block_hinted. driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted. kobject: Introduce kset_find_obj_hinted. driver core: fix build for CONFIG_BLOCK not enabled driver-core: base: change to new flag variable sysfs: only access bin file vm_ops with the active lock sysfs: Fail bin file mmap if vma close is implemented. FW_LOADER: fix kconfig dependency warning on HOTPLUG uio: Statically allocate uio_class and use class .dev_attrs. uio: Support 2^MINOR_BITS minors uio: Cleanup irq handling. uio: Don't clear driver data uio: Fix lack of locking in init_uio_class SYSFS: Allow boot time switching between deprecated and modern sysfs layout driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices ...
2010-10-22SYSFS: Allow boot time switching between deprecated and modern sysfs layoutAndi Kleen1-5/+2
I have some systems which need legacy sysfs due to old tools that are making assumptions that a directory can never be a symlink to another directory, and it's a big hazzle to compile separate kernels for them. This patch turns CONFIG_SYSFS_DEPRECATED into a run time option that can be switched on/off the kernel command line. This way the same binary can be used in both cases with just a option on the command line. The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set the default. I kept the weird name to not break existing config files. Also the compat code can be still completely disabled by undefining CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes care of this now instead of lots of ifdefs. This makes the code look nicer. v2: This is an updated version on top of Kay's patch to only handle the block devices. I tested it on my old systems and that seems to work. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-19block: fix accounting bug on cross partition mergesYasuaki Ishimatsu1-0/+14
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. When reloading partition tables, quiesce IO to ensure that no request references to the partition struct exists. When it is safe to free the partition table, the IO for that device is restarted again. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>