summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2010-12-20Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds5-49/+54
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix cciss_revalidate panic block: max hardware sectors limit wrapper block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead blk-throttle: Correct the placement of smp_rmb() blk-throttle: Trim/adjust slice_end once a bio has been dispatched block: check for proper length of iov entries earlier in blk_rq_map_user_iov() drbd: fix for spin_lock_irqsave in endio callback drbd: don't recvmsg with zero length
2010-12-17block: max hardware sectors limit wrapperMike Snitzer1-6/+20
Implement blk_limits_max_hw_sectors() and make blk_queue_max_hw_sectors() a wrapper around it. DM needs this to avoid setting queue_limits' max_hw_sectors and max_sectors directly. dm_set_device_limits() now leverages blk_limits_max_hw_sectors() logic to establish the appropriate max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was incorrectly setting max_sectors rather than max_hw_sectors (which caused dm_merge_bvec()'s max_hw_sectors check to be ineffective). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-17block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits insteadMartin K. Petersen3-27/+6
When stacking devices, a request_queue is not always available. This forced us to have a no_cluster flag in the queue_limits that could be used as a carrier until the request_queue had been set up for a metadevice. There were several problems with that approach. First of all it was up to the stacking device to remember to set queue flag after stacking had completed. Also, the queue flag and the queue limits had to be kept in sync at all times. We got that wrong, which could lead to us issuing commands that went beyond the max scatterlist limit set by the driver. The proper fix is to avoid having two flags for tracking the same thing. We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the block layer merging functions. The queue_limit 'no_cluster' is turned into 'cluster' to avoid double negatives and to ease stacking. Clustering defaults to being enabled as before. The queue flag logic is removed from the stacking function, and explicitly setting the cluster flag is no longer necessary in DM and MD. Reported-by: Ed Lin <ed.lin@promise.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-09[SCSI] bsg: correct fault if queue object removed while dev_t openJames Smart1-0/+8
This patch corrects an issue in bsg that results in a general protection fault if an LLD is removed while an application is using an open file handle to a bsg device, and the application issues an ioctl. The fault occurs because the class_dev is NULL, having been cleared in bsg_unregister_queue() when the driver was removed. With this patch, a check is made for the class_dev, and the application will receive ENXIO if the related object is gone. Signed-off-by: Carl Lajeunesse <carl.lajeunesse@emulex.com> Signed-off-by: James Smart <james.smart@emulex.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-12-01blk-throttle: Correct the placement of smp_rmb()Vivek Goyal1-14/+9
o I was discussing what are the variable being updated without spin lock and why do we need barriers and Oleg pointed out that location of smp_rmb() should be between read of td->limits_changed and tg->limits_changed. This patch fixes it. o Following is one possible sequence of events. Say cpu0 is executing throtl_update_blkio_group_read_bps() and cpu1 is executing throtl_process_limit_change(). cpu0 cpu1 tg->limits_changed = true; smp_mb__before_atomic_inc(); atomic_inc(&td->limits_changed); if (!atomic_read(&td->limits_changed)) return; if (tg->limits_changed) do_something; If cpu0 has updated tg->limits_changed and td->limits_changed, we want to make sure that if update to td->limits_changed is visible on cpu1, then update to tg->limits_changed should also be visible. Oleg pointed out to ensure that we need to insert an smp_rmb() between td->limits_changed read and tg->limits_changed read. o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed). This patch fixes it. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-01blk-throttle: Trim/adjust slice_end once a bio has been dispatchedVivek Goyal1-0/+16
o During some testing I did following and noticed throttling stops working. - Put a very low limit on a cgroup, say 1 byte per second. - Start some reads, this will set slice_end to a very high value. - Change the limit to higher value say 1MB/s - Now IO unthrottles and finishes as expected. - Try to do the read again but IO is not limited to 1MB/s as expected. o What is happening. - Initially low value of limit sets slice_end to a very high value. - During updation of limit, slice_end is not being truncated. - Very high value of slice_end leads to keeping the existing slice valid for a very long time and new slice does not start. - tg_may_dispatch() is called in blk_throtle_bio(), and trim_slice() is not called in this path. So slice_start is some old value and practically we are able to do huge amount of IO. o There are many ways it can be fixed. I have fixed it by trying to adjust/cleanup slice_end in trim_slice(). Generally we extend slices if bio is big and can't be dispatched in one slice. After dispatch of bio, readjust the slice_end to make sure we don't end up with huge values. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-29block: check for proper length of iov entries earlier in blk_rq_map_user_iov()Xiaotian Feng1-2/+3
commit 9284bcf checks for proper length of iov entries in blk_rq_map_user_iov(). But if the map is unaligned, kernel will break out the loop without checking for the proper length. So we need to check the proper length before the unalign check. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-27Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-1/+1
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: fix build for PROC_FS disabled block: fix amiga and atari floppy driver compile warning blk-throttle: Fix calculation of max number of WRITES to be dispatched ioprio: grab rcu_read_lock in sys_ioprio_{set,get}() xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER xen/blkfront: change blk_shadow.request to proper pointer xen/blkfront: map REQ_FLUSH into a full barrier
2010-11-17BKL: remove extraneous #include <smp_lock.h>Arnd Bergmann2-2/+0
The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-15blk-throttle: Fix calculation of max number of WRITES to be dispatchedVivek Goyal1-1/+1
o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one dispatch round. ummy pointed out that there is a bug in max_nr_writes calculation. This patch fixes it. Reported-by: ummy y <yummylln@yahoo.com.cn> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-11block: remove unused copy_io_context()Jens Axboe1-14/+0
Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10block: remove REQ_HARDBARRIERChristoph Hellwig2-9/+2
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left at this point is: - various checks inside the block layer. - sanity checks in bio based drivers. - now unused bio_empty_barrier helper. - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while, but Xen really needs to sort out it's barrier situaton. - setting of ordered tags in uas - dead code copied from old scsi drivers. - scsi different retry for barriers - it's dead and should have been removed when flushes were converted to FS requests. - blktrace handling of barriers - removed. Someone who knows blktrace better should add support for REQ_FLUSH and REQ_FUA, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10block: ioctl: fix information leak to userlandVasiliy Kulikov1-0/+1
Structure hd_geometry is copied to userland with 4 padding bytes between cylinders and start fields uninitialized on 64-bit platforms. It leads to leaking of contents of kernel stack memory. Currently there is no memset() in real implementations of getgeo() in drivers/block/, so it makes sense to have memset() in blkdev_ioctl(). Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10block: read i_size with i_size_read()Mike Snitzer3-7/+7
Convert direct reads of an inode's i_size to using i_size_read(). i_size_{read,write} use a seqcount to protect reads from accessing incomple writes. Concurrent i_size_write()s require mutual exclussion to protect the seqcount that is used by i_size_{read,write}. But i_size_read() callers do not need to use additional locking. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: NeilBrown <neilb@suse.de> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10block: take care not to overflow when calculating total iov lengthJens Axboe1-10/+24
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10block: check for proper length of iov entries in blk_rq_map_user_iov()Jens Axboe1-0/+2
Ensure that we pass down properly validated iov segments before calling into the mapping or copy functions. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-25Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds4-31/+13
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "block: fix accounting bug on cross partition merges"
2010-10-25Merge branch 'for-next' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Update broken web addresses in arch directory. Update broken web addresses in the kernel. Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget Revert "Fix typo: configuation => configuration" partially ida: document IDA_BITMAP_LONGS calculation ext2: fix a typo on comment in ext2/inode.c drivers/scsi: Remove unnecessary casts of private_data drivers/s390: Remove unnecessary casts of private_data net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data drivers/infiniband: Remove unnecessary casts of private_data drivers/gpu/drm: Remove unnecessary casts of private_data kernel/pm_qos_params.c: Remove unnecessary casts of private_data fs/ecryptfs: Remove unnecessary casts of private_data fs/seq_file.c: Remove unnecessary casts of private_data arm: uengine.c: remove C99 comments arm: scoop.c: remove C99 comments Fix typo configue => configure in comments Fix typo: configuation => configuration Fix typo interrest[ing|ed] => interest[ing|ed] Fix various typos of valid in comments ... Fix up trivial conflicts in: drivers/char/ipmi/ipmi_si_intf.c drivers/usb/gadget/rndis.c net/irda/irnet/irnet_ppp.c
2010-10-25Revert "block: fix accounting bug on cross partition merges"Jens Axboe4-31/+13
This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327. Conflicts: include/linux/genhd.h It has numerous issues with the cleanup path and non-elevator devices. Revert it for now so we can come up with a clean version without rushing things. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-23block: fix use-after-free bug in blk throttle codeJens Axboe2-2/+2
blk_throtl_exit() frees the throttle data hanging off the queue in blk_cleanup_queue(), but blk_put_queue() will indirectly dereference this data when calling blk_sync_queue() which in turns calls throtl_shutdown_timer_wq(). Fix this by moving the freeing of the throttle data to when the queue is truly being released, and post the call to blk_sync_queue(). Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6Linus Torvalds1-5/+2
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (31 commits) driver core: Display error codes when class suspend fails Driver core: Add section count to memory_block struct Driver core: Add mutex for adding/removing memory blocks Driver core: Move find_memory_block routine hpilo: Despecificate driver from iLO generation driver core: Convert link_mem_sections to use find_memory_block_hinted. driver core: Introduce find_memory_block_hinted which utilizes kset_find_obj_hinted. kobject: Introduce kset_find_obj_hinted. driver core: fix build for CONFIG_BLOCK not enabled driver-core: base: change to new flag variable sysfs: only access bin file vm_ops with the active lock sysfs: Fail bin file mmap if vma close is implemented. FW_LOADER: fix kconfig dependency warning on HOTPLUG uio: Statically allocate uio_class and use class .dev_attrs. uio: Support 2^MINOR_BITS minors uio: Cleanup irq handling. uio: Don't clear driver data uio: Fix lack of locking in init_uio_class SYSFS: Allow boot time switching between deprecated and modern sysfs layout driver core: remove CONFIG_SYSFS_DEPRECATED_V2 but keep it for block devices ...
2010-10-23Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds9-486/+350
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
2010-10-23Merge branch 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds17-252/+2069
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits) cfq-iosched: Fix a gcc 4.5 warning and put some comments block: Turn bvec_k{un,}map_irq() into static inline functions block: fix accounting bug on cross partition merges block: Make the integrity mapped property a bio flag block: Fix double free in blk_integrity_unregister block: Ensure physical block size is unsigned int blkio-throttle: Fix possible multiplication overflow in iops calculations blkio-throttle: limit max iops value to UINT_MAX blkio-throttle: There is no need to convert jiffies to milli seconds blkio-throttle: Fix link failure failure on i386 blkio: Recalculate the throttled bio dispatch time upon throttle limit change blkio: Add root group to td->tg_list blkio: deletion of a cgroup was causes oops blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n block: set the bounce_pfn to the actual DMA limit rather than to max memory block: revert bad fix for memory hotplug causing bounces Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK block: set the bounce_pfn to the actual DMA limit rather than to max memory block: Prevent hang_check firing during long I/O cfq: improve fsync performance for small files ... Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
2010-10-22Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds1-0/+1
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: vfs: make no_llseek the default vfs: don't use BKL in default_llseek llseek: automatically add .llseek fop libfs: use generic_file_llseek for simple_attr mac80211: disallow seeks in minstrel debug code lirc: make chardev nonseekable viotape: use noop_llseek raw: use explicit llseek file operations ibmasmfs: use generic_file_llseek spufs: use llseek in all file operations arm/omap: use generic_file_llseek in iommu_debug lkdtm: use generic_file_llseek in debugfs net/wireless: use generic_file_llseek in debugfs drm: use noop_llseek
2010-10-22Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds1-3/+0
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: block: autoconvert trivial BKL users to private mutex drivers: autoconvert trivial BKL users to private mutex ipmi: autoconvert trivial BKL users to private mutex mac: autoconvert trivial BKL users to private mutex mtd: autoconvert trivial BKL users to private mutex scsi: autoconvert trivial BKL users to private mutex Fix up trivial conflicts (due to addition of private mutex right next to deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
2010-10-22SYSFS: Allow boot time switching between deprecated and modern sysfs layoutAndi Kleen1-5/+2
I have some systems which need legacy sysfs due to old tools that are making assumptions that a directory can never be a symlink to another directory, and it's a big hazzle to compile separate kernels for them. This patch turns CONFIG_SYSFS_DEPRECATED into a run time option that can be switched on/off the kernel command line. This way the same binary can be used in both cases with just a option on the command line. The old CONFIG_SYSFS_DEPRECATED_V2 option is still there to set the default. I kept the weird name to not break existing config files. Also the compat code can be still completely disabled by undefining CONFIG_SYSFS_DEPRECATED_SWITCH -- just the optimizer takes care of this now instead of lots of ifdefs. This makes the code look nicer. v2: This is an updated version on top of Kay's patch to only handle the block devices. I tested it on my old systems and that seems to work. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-10-22cfq-iosched: Fix a gcc 4.5 warning and put some commentsVivek Goyal1-3/+13
- Andi encountedred following warning with gcc 4.5 linux/block/cfq-iosched.c: In function ‘cfq_dispatch_requests’: linux/block/cfq-iosched.c:2156:3: warning: array subscript is above array bounds - Warning happens due to following code. slice = group_slice * count / max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio], cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg)); gcc is complaining about cfqg->busy_queues_avg[] being indexed by CFQ prio classes (RT, BE and IDLE) while the array size is only 2. - At run time, we never access cfqg->busy_queues_avg[IDLE] and return from function before this code hits. - To fix warning increase the array size though it will remain unused. This patch also puts some comments to clarify some of the confusions. - I have taken Jens's patch and modified it a bit. - Compile tested with gcc 4.4 and boot tested. I don't have gcc 4.5 running, Andi can you please test it with gcc 4.5 to make sure it worked. Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6Linus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: [SCSI] bsg: fix incorrect device_status value [SCSI] Fix VPD inquiry page wrapper
2010-10-19Merge branch 'v2.6.36-rc8' into for-2.6.37/barrierJens Axboe8-40/+162
Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19block: fix accounting bug on cross partition mergesYasuaki Ishimatsu4-13/+31
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. When reloading partition tables, quiesce IO to ensure that no request references to the partition struct exists. When it is safe to free the partition table, the IO for that device is restarted again. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-15[SCSI] bsg: fix incorrect device_status valueFUJITA Tomonori1-1/+1
bsg incorrectly returns sg's masked_status value for device_status. [jejb: fix up expression logic] Reported-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Stable Tree <stable@kernel.org> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-10-15llseek: automatically add .llseek fopArnd Bergmann1-0/+1
All file_operations should get a .llseek operation so we can make nonseekable_open the default for future file operations without a .llseek pointer. The three cases that we can automatically detect are no_llseek, seq_lseek and default_llseek. For cases where we can we can automatically prove that the file offset is always ignored, we use noop_llseek, which maintains the current behavior of not returning an error from a seek. New drivers should normally not use noop_llseek but instead use no_llseek and call nonseekable_open at open time. Existing drivers can be converted to do the same when the maintainer knows for certain that no user code relies on calling seek on the device file. The generated code is often incorrectly indented and right now contains comments that clarify for each added line why a specific variant was chosen. In the version that gets submitted upstream, the comments will be gone and I will manually fix the indentation, because there does not seem to be a way to do that using coccinelle. Some amount of new code is currently sitting in linux-next that should get the same modifications, which I will do at the end of the merge window. Many thanks to Julia Lawall for helping me learn to write a semantic patch that does all this. ===== begin semantic patch ===== // This adds an llseek= method to all file operations, // as a preparation for making no_llseek the default. // // The rules are // - use no_llseek explicitly if we do nonseekable_open // - use seq_lseek for sequential files // - use default_llseek if we know we access f_pos // - use noop_llseek if we know we don't access f_pos, // but we still want to allow users to call lseek // @ open1 exists @ identifier nested_open; @@ nested_open(...) { <+... nonseekable_open(...) ...+> } @ open exists@ identifier open_f; identifier i, f; identifier open1.nested_open; @@ int open_f(struct inode *i, struct file *f) { <+... ( nonseekable_open(...) | nested_open(...) ) ...+> } @ read disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ read_no_fpos disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { ... when != off } @ write @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ write_no_fpos @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { ... when != off } @ fops0 @ identifier fops; @@ struct file_operations fops = { ... }; @ has_llseek depends on fops0 @ identifier fops0.fops; identifier llseek_f; @@ struct file_operations fops = { ... .llseek = llseek_f, ... }; @ has_read depends on fops0 @ identifier fops0.fops; identifier read_f; @@ struct file_operations fops = { ... .read = read_f, ... }; @ has_write depends on fops0 @ identifier fops0.fops; identifier write_f; @@ struct file_operations fops = { ... .write = write_f, ... }; @ has_open depends on fops0 @ identifier fops0.fops; identifier open_f; @@ struct file_operations fops = { ... .open = open_f, ... }; // use no_llseek if we call nonseekable_open //////////////////////////////////////////// @ nonseekable1 depends on !has_llseek && has_open @ identifier fops0.fops; identifier nso ~= "nonseekable_open"; @@ struct file_operations fops = { ... .open = nso, ... +.llseek = no_llseek, /* nonseekable */ }; @ nonseekable2 depends on !has_llseek @ identifier fops0.fops; identifier open.open_f; @@ struct file_operations fops = { ... .open = open_f, ... +.llseek = no_llseek, /* open uses nonseekable */ }; // use seq_lseek for sequential files ///////////////////////////////////// @ seq depends on !has_llseek @ identifier fops0.fops; identifier sr ~= "seq_read"; @@ struct file_operations fops = { ... .read = sr, ... +.llseek = seq_lseek, /* we have seq_read */ }; // use default_llseek if there is a readdir /////////////////////////////////////////// @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier readdir_e; @@ // any other fop is used that changes pos struct file_operations fops = { ... .readdir = readdir_e, ... +.llseek = default_llseek, /* readdir is present */ }; // use default_llseek if at least one of read/write touches f_pos ///////////////////////////////////////////////////////////////// @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read.read_f; @@ // read fops use offset struct file_operations fops = { ... .read = read_f, ... +.llseek = default_llseek, /* read accesses f_pos */ }; @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, ... + .llseek = default_llseek, /* write accesses f_pos */ }; // Use noop_llseek if neither read nor write accesses f_pos /////////////////////////////////////////////////////////// @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; identifier write_no_fpos.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, .read = read_f, ... +.llseek = noop_llseek, /* read and write both use no f_pos */ }; @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write_no_fpos.write_f; @@ struct file_operations fops = { ... .write = write_f, ... +.llseek = noop_llseek, /* write uses no f_pos */ }; @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; @@ struct file_operations fops = { ... .read = read_f, ... +.llseek = noop_llseek, /* read uses no f_pos */ }; @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; @@ struct file_operations fops = { ... +.llseek = noop_llseek, /* no read or write fn */ }; ===== End semantic patch ===== Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Julia Lawall <julia@diku.dk> Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15block: Fix double free in blk_integrity_unregisterMartin K. Petersen1-1/+0
Commit 3839e4b introduced a kobject_put but failed to remove the kmem_cache_free beneath it, leading to a double free. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-13block: Ensure physical block size is unsigned intMartin K. Petersen1-1/+1
Physical block size was declared unsigned int to accomodate the maximum size reported by READ CAPACITY(16). Make sure we use the right type in the related functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-07elevator: fix oops on early call to elevator_change()Jens Axboe1-4/+8
2.6.36 introduces an API for drivers to switch the IO scheduler instead of manually calling the elevator exit and init functions. This API was added since q->elevator must be cleared in between those two calls. And since we already have this functionality directly from use by the sysfs interface to switch schedulers online, it was prudent to reuse it internally too. But this API needs the queue to be in a fully initialized state before it is called, or it will attempt to unregister elevator kobjects before they have been added. This results in an oops like this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000051 IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0 PGD 47ddfc067 PUD 47c6a1067 PMD 0 Oops: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq CPU 2 Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R RIP: 0010:[<ffffffff8116f15e>] [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0 RSP: 0018:ffff88027da25d08 EFLAGS: 00010246 RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000 RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88 RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0 R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528 FS: 00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090) Stack: ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88 <0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77 <0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528 Call Trace: [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0 [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60 [<ffffffff8123feb9>] kobject_add+0x69/0x90 [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0 [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0 [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50 [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0 [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0 [<ffffffff81224204>] elv_register_queue+0x34/0xa0 [<ffffffff81224aad>] elevator_change+0xfd/0x250 [<ffffffffa007e000>] ? t_init+0x0/0x361 [t] [<ffffffffa007e000>] ? t_init+0x0/0x361 [t] [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t] [<ffffffff810001de>] do_one_initcall+0x3e/0x170 [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220 [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6 RIP [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0 RSP <ffff88027da25d08> CR2: 0000000000000051 ---[ end trace a6541d3bf07945df ]--- Fix this by adding a registered bit to the elevator queue, which is set when the sysfs kobjects have been registered. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-05block: autoconvert trivial BKL users to private mutexArnd Bergmann1-3/+0
The block device drivers have all gained new lock_kernel calls from a recent pushdown, and some of the drivers were already using the BKL before. This turns the BKL into a set of per-driver mutexes. Still need to check whether this is safe to do. file=$1 name=$2 if grep -q lock_kernel ${file} ; then if grep -q 'include.*linux.mutex.h' ${file} ; then sed -i '/include.*<linux\/smp_lock.h>/d' ${file} else sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file} fi sed -i ${file} \ -e "/^#include.*linux.mutex.h/,$ { 1,/^\(static\|int\|long\)/ { /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex); } }" \ -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \ -e '/[ ]*cycle_kernel_lock();/d' else sed -i -e '/include.*\<smp_lock.h\>/d' ${file} \ -e '/cycle_kernel_lock()/d' fi Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-01blkio-throttle: Fix possible multiplication overflow in iops calculationsVivek Goyal1-1/+15
o User can specify max iops value of 32bit (UINT_MAX), through cgroup interface. If a user has specified say 4294967294 (UNIT_MAX - 2), then on 32bit platform, following multiplication can overflow. io_allowed = (tg->iops[rw] * jiffy_elapsed_rnd) o Explicitly cast the multiplication to 64bit and then perform division and then check whether result is still great then UNINT_MAX. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio-throttle: limit max iops value to UINT_MAXVivek Goyal2-4/+10
- Limit max iops value to UINT_MAX and return error to user if value is more than that instead of accepting bigger values and truncating implicitly. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio-throttle: There is no need to convert jiffies to milli secondsVivek Goyal1-4/+3
o Do not convert jiffies to mili seconds as it is not required. Just work with jiffies and HZ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio-throttle: Fix link failure failure on i386Vivek Goyal1-6/+10
o Randy Dunlap reported following linux-next failure. This patch fixes it. on i386: blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3' blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3' o bytes_per_second interface is 64bit and I was continuing to do 64 bit division even on 32bit platform without help of special macros/functions hence the failure. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio: Recalculate the throttled bio dispatch time upon throttle limit changeVivek Goyal4-35/+139
o Currently any cgroup throttle limit changes are processed asynchronousy and the change does not take affect till a new bio is dispatched from same group. o It might happen that a user sets a redicuously low limit on throttling. Say 1 bytes per second on reads. In such cases simple operations like mount a disk can wait for a very long time. o Once bio is throttled, there is no easy way to come out of that wait even if user increases the read limit later. o This patch fixes it. Now if a user changes the cgroup limits, we recalculate the bio dispatch time according to new limits. o Can't take queueu lock under blkcg_lock, hence after the change I wake up the dispatch thread again which recalculates the time. So there are some variables being synchronized across two threads without lock and I had to make use of barriers. Hoping I have used barriers correctly. Any review of memory barrier code especially will help. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio: Add root group to td->tg_listVivek Goyal1-4/+13
o Currently all the dynamically allocated groups, except root grp is added to td->tg_list. This was not a problem so far but in next patch I will travel through td->tg_list to process any updates of limits on the group. If root group is not in tg_list, then root group's updates are not processed. o It is better to root group also to tg_list instead of doing special processing for it during limit updates. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio: deletion of a cgroup was causes oopsVivek Goyal1-4/+5
o Now a cgroup list of blkg elements can contain blkg from multiple policies. Before sending an unlink event, make sure blkg belongs to they policy. If policy does not own the blkg, do not send update for this blkg. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=nVivek Goyal1-47/+50
Currently throttling related files were visible even if user had disabled throttling using config options. It was switching off background throttling of bio but not the cgroup files. This patch fixes it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01block: set the bounce_pfn to the actual DMA limit rather than to max memoryMalahal Naineni1-1/+1
The bounce_pfn of the request queue in 64 bit systems is set to the current max_low_pfn. Adding more memory later makes this incorrect. Memory allocated beyond this boot time max_low_pfn appear to require bounce buffers (bounce buffers are actually not allocated but used in calculating segments that may result in "over max segments limit" errors). Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01block: revert bad fix for memory hotplug causing bouncesJens Axboe1-1/+3
Revert "block: set the bounce_pfn to the actual DMA limit rather than to max memory" This reverts commit c49825facfd4969585224a896a5e717f88450cad. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-25block: prevent merges of discard and write requestsAdrian Hunter1-0/+12
Add logic to prevent two I/O requests being merged if only one of them is a discard. Ditto secure discard. Without this fix, it is possible for write requests to transform into discard requests. For example: Submit bio 1 to discard 8 sectors from sector n Submit bio 2 to write 8 sectors from sector n + 16 Submit bio 3 to write 8 sectors from sector n + 8 Bio 1 becomes request 1. Bio 2 becomes request 2. Bio 3 is merged with request 2, and then subsequently request 2 is merged with request 1 resulting in just one I/O request which discards all 24 sectors. Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com> (Moved the checks above the position checks /Jens) Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-24block: set the bounce_pfn to the actual DMA limit rather than to max memoryMalahal Naineni1-3/+1
The bounce_pfn of the request queue in 64 bit systems is set to the current max_low_pfn. Adding more memory later makes this incorrect. Memory allocated beyond this boot time max_low_pfn appear to require bounce buffers (bounce buffers are actually not allocated but used in calculating segments that may result in "over max segments limit" errors). Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-24block: Prevent hang_check firing during long I/OMark Lord1-1/+8
During long I/O operations, the hang_check timer may fire, trigger stack dumps that unnecessarily alarm the user. Eg. hdparm --security-erase NULL /dev/sdb ## can take *hours* to complete So, if hang_check is armed, we should wake up periodically to prevent it from triggering. This patch uses a wake-up interval equal to half the hang_check timer period, which keeps overhead low enough. Signed-off-by: Mark Lord <mlord@pobox.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-21cfq-iosched: fix a kernel OOPs when usb key is insertedVivek Goyal1-3/+13
Mike reported a kernel crash when a usb key hotplug is performed while all kernel thrads are not in a root cgroup and are running in one of the child cgroups of blkio controller. BUG: unable to handle kernel NULL pointer dereference at 0000002c IP: [<c11c7b08>] cfq_get_queue+0x232/0x412 *pde = 00000000 Oops: 0000 [#1] PREEMPT last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb2/2-1/2-1:1.0/host3/scsi_host/host3/uevent [..] Pid: 30039, comm: scsi_scan_3 Not tainted 2.6.35.2-fg.roam #1 Volvi2 /Aspire 4315 EIP: 0060:[<c11c7b08>] EFLAGS: 00010086 CPU: 0 EIP is at cfq_get_queue+0x232/0x412 EAX: f705f9c0 EBX: e977abac ECX: 00000000 EDX: 00000000 ESI: f00da400 EDI: f00da4ec EBP: e977a800 ESP: dff8fd00 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 Process scsi_scan_3 (pid: 30039, ti=dff8e000 task=f6b6c9a0 task.ti=dff8e000) Stack: 00000000 00000000 00000001 01ff0000 f00da508 00000000 f00da524 f00da540 <0> e7994940 dd631750 f705f9c0 e977a820 e977ac44 f00da4d0 00000001 f6b6c9a0 <0> 00000010 00008010 0000000b 00000000 00000001 e977a800 dd76fac0 00000246 Call Trace: [<c11c7f10>] ? cfq_set_request+0x228/0x34c [<c11c7ce8>] ? cfq_set_request+0x0/0x34c [<c11bb3b9>] ? elv_set_request+0xf/0x1c [<c11bdd51>] ? get_request+0x1ad/0x22f [<c11bddf2>] ? get_request_wait+0x1f/0x11a [<c11d013b>] ? kvasprintf+0x33/0x3b [<c127b537>] ? scsi_execute+0x1d/0x103 [<c127b675>] ? scsi_execute_req+0x58/0x83 [<c127c391>] ? scsi_probe_and_add_lun+0x188/0x7c2 [<c12718c6>] ? attribute_container_add_device+0x15/0xfa [<c11c95d1>] ? kobject_get+0xf/0x13 [<c126d1db>] ? get_device+0x10/0x14 [<c127be93>] ? scsi_alloc_target+0x217/0x24d [<c127cbd8>] ? __scsi_scan_target+0x95/0x480 [<c10204eb>] ? dequeue_entity+0x14/0x1fe [<c1020491>] ? update_curr+0x165/0x1ab [<c1020491>] ? update_curr+0x165/0x1ab [<c127d00d>] ? scsi_scan_channel+0x4a/0x76 [<c127d0b0>] ? scsi_scan_host_selected+0x77/0xad [<c127d13c>] ? do_scan_async+0x0/0x11a [<c127d137>] ? do_scsi_scan_host+0x51/0x56 [<c127d13c>] ? do_scan_async+0x0/0x11a [<c127d14a>] ? do_scan_async+0xe/0x11a [<c127d13c>] ? do_scan_async+0x0/0x11a [<c10354c5>] ? kthread+0x5e/0x63 [<c1035467>] ? kthread+0x0/0x63 [<c1002af6>] ? kernel_thread_helper+0x6/0x10 Code: 44 24 1c 54 83 44 24 18 54 83 fa 03 75 94 8b 06 c7 86 64 02 00 00 01 00 00 00 83 e0 03 09 f0 89 06 8b 44 24 28 8b 90 58 01 00 00 <8b> 42 2c 85 c0 75 03 8b 42 08 8d 54 24 48 52 8d 4c 24 50 51 68 EIP: [<c11c7b08>] cfq_get_queue+0x232/0x412 SS:ESP 0068:dff8fd00 CR2: 000000000000002c ---[ end trace 9a88306573f69b12 ]--- The problem here is that we don't have bdi->dev information available when thread does some IO. Hence when dev_name() tries to access bdi->dev, it crashes. This problem does not happen if kernel threads are in root group as root group is statically allocated at device initialization time and we don't hit this piece of code. Fix it by delaying the filling of major and minor number information of device in blk_group. Initially a blk_group is created with 0 as device information and this information is filled later once some more IO comes in from same group. Reported-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>