path: root/drivers/md/raid5.c
2013-11-28  md/raid5: fix newly-broken locking in get_active_stripe.  [NeilBrown, 1 file, -7/+4]
commit 566c09c53455d7c4f1 ("raid5: relieve lock contention in get_active_stripe()") modified the locking in get_active_stripe(), reducing the range protected by the (highly contended) device_lock. Unfortunately it reduced the range too much, opening up some races.

One race can occur if get_priority_stripe runs between the test on sh->count and device_lock being taken. This will mean that sh->lru is not empty while get_active_stripe thinks ->count is zero, resulting in a 'BUG' firing.

Another race happens if __release_stripe is called immediately after sh->count is tested and found to be non-zero. If STRIPE_HANDLE is not set, get_active_stripe should increment ->active_stripes when it increments ->count from 0, but as it didn't think it was 0, it doesn't.

Extending device_lock to cover the test on sh->count closes these races.

While we are here, fix the two BUG tests:
 - If count is zero, then lru really must not be empty, or we've lost the stripe_head somehow - no other tests are relevant.
 - STRIPE_ON_RELEASE_LIST is completely independent of ->lru, so testing it is pointless.

Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Fixes: 566c09c53455d7c4f1
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-28  md/raid5: fix new memory-reference bug in alloc_thread_groups.  [NeilBrown, 1 file, -1/+1]
In alloc_thread_groups, worker_groups is a pointer to an array, not an array of pointers. So worker_groups[i] is wrong; it should be &(*worker_groups)[i].

Found-by: coverity
Fixes: 60aaf9338545
Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
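A standalone illustration of the pointer-to-array pitfall this fixes (hypothetical names, not the kernel code itself):

    #include <stdio.h>

    struct worker_group { int id; };

    /* The callee receives a pointer to an array of four groups. */
    static void init_groups(struct worker_group (*worker_groups)[4])
    {
            for (int i = 0; i < 4; i++) {
                    /* worker_groups[i] would step over whole 4-element
                     * arrays, walking off the end for i > 0.
                     * The i-th element of the pointed-to array is: */
                    struct worker_group *g = &(*worker_groups)[i];
                    g->id = i;
            }
    }

    int main(void)
    {
            struct worker_group groups[4];
            init_groups(&groups);
            printf("%d\n", groups[3].id); /* prints 3 */
            return 0;
    }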
2013-11-21  Merge tag 'md/3.13' of git://neil.brown.name/md  [Linus Torvalds, 1 file, -101/+319]
Pull md update from Neil Brown:
 "Mostly optimisations and obscure bug fixes.
  - raid5 gets less lock contention
  - raid1 gets less contention between normal-io and resync-io during resync"

* tag 'md/3.13' of git://neil.brown.name/md:
  md/raid5: Use conf->device_lock protect changing of multi-thread resources.
  md/raid5: Before freeing old multi-thread worker, it should flush them.
  md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
  UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
  raid1: Rewrite the implementation of iobarrier.
  raid1: Add some macros to make code clearly.
  raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
  raid1: Add a field array_frozen to indicate whether raid in freeze state.
  md: Convert use of typedef ctl_table to struct ctl_table
  md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
  md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
  md: fix some places where mddev_lock return value is not checked.
  raid5: Retry R5_ReadNoMerge flag when hit a read error.
  raid5: relieve lock contention in get_active_stripe()
  raid5: relieve lock contention in get_active_stripe()
  wait: add wait_event_cmd()
  md/raid5.c: add proper locking to error path of raid5_start_reshape.
  md: fix calculation of stacking limits on level change.
  raid5: Use slow_path to release stripe when mddev->thread is null
2013-11-19  md/raid5: Use conf->device_lock protect changing of multi-thread resources.  [majianpeng, 1 file, -24/+39]
When we change group_thread_cnt from the sysfs entry, it can OOPS. The kernel messages are:

[ 135.299021] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.299107] PGD 0
[ 135.299122] Oops: 0000 [#1] SMP
[ 135.299144] Modules linked in: netconsole e1000e ptp pps_core
[ 135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
[ 135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
[ 135.299283] RIP: 0010:[<ffffffff815188ab>] [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.299323] RSP: 0018:ffff8800b77a5c48 EFLAGS: 00010002
[ 135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
[ 135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
[ 135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
[ 135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
[ 135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
[ 135.299479] FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
[ 135.299510] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
[ 135.299559] Stack:
[ 135.299570] ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
[ 135.299611] 000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
[ 135.299654] 000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
[ 135.299696] Call Trace:
[ 135.299711] [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
[ 135.299733] [<ffffffff81518f88>] raid5d+0x4c8/0x680
[ 135.299756] [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
[ 135.299781] [<ffffffff81524c9f>] md_thread+0x11f/0x170
[ 135.299804] [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
[ 135.299826] [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
[ 135.299850] [<ffffffff81069656>] kthread+0xc6/0xd0
[ 135.299871] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 135.299899] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[ 135.299923] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
[ 135.300005] RIP [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
[ 135.300005] RSP <ffff8800b77a5c48>
[ 135.300005] CR2: 0000000000000000
[ 135.300005] ---[ end trace 504854e5bb7562ed ]---
[ 135.300005] Kernel panic - not syncing: Fatal exception

This is because raid5d() can be running when the multi-thread resources are changed via sysfs, so we need to provide locking.

mddev->device_lock is suitable, but we cannot simply call alloc_thread_groups under this lock, as we cannot allocate memory while holding a spinlock. So change alloc_thread_groups() to allocate and return the data structures, then raid5_store_group_thread_cnt() can take the lock while updating the pointers to the data structures.

This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x stable series.

Fixes: b721420e8719131896b009b11edbbd27
Cc: stable@vger.kernel.org (3.12)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
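A minimal sketch of the allocate-then-swap pattern the fix describes (names such as alloc_groups and group_cnt are illustrative, not the exact md signatures):

    int err = 0;
    struct r5worker_group *new_groups, *old_groups;

    /* 1. Allocate with no spinlock held, since allocation may sleep. */
    new_groups = alloc_groups(new_cnt, &err);   /* alloc_groups is illustrative */
    if (err)
            return err;

    /* 2. Publish the new pointers under the lock that raid5d takes
     *    while it walks the worker groups. */
    spin_lock_irq(&conf->device_lock);
    old_groups = conf->worker_groups;
    conf->worker_groups = new_groups;
    conf->group_cnt = new_cnt;
    spin_unlock_irq(&conf->device_lock);

    /* 3. Free the old structures outside the lock. */
    kfree(old_groups);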
2013-11-19  md/raid5: Before freeing old multi-thread worker, it should flush them.  [majianpeng, 1 file, -0/+3]
When changing group_thread_cnt from the sysfs entry, the kernel can oops. The kernel messages are:

[ 740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
[ 740.961476] PGD b9013067 PUD b651e067 PMD 0
[ 740.961503] Oops: 0000 [#1] SMP
[ 740.961525] Modules linked in: netconsole e1000e ptp pps_core
[ 740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
[ 740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
[ 740.961673] RIP: 0010:[<ffffffff81062570>] [<ffffffff81062570>] process_one_work+0x30/0x500
[ 740.961708] RSP: 0018:ffff88013a247e08 EFLAGS: 00010086
[ 740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
[ 740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
[ 740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
[ 740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
[ 740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
[ 740.961861] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[ 740.961893] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
[ 740.962001] Stack:
[ 740.962001] 0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
[ 740.962001] ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
[ 740.962001] ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
[ 740.962001] Call Trace:
[ 740.962001] [<ffffffff810639c6>] worker_thread+0x206/0x3f0
[ 740.962001] [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
[ 740.962001] [<ffffffff81069656>] kthread+0xc6/0xd0
[ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 740.962001] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
[ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
[ 740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
[ 740.962001] RIP [<ffffffff81062570>] process_one_work+0x30/0x500
[ 740.962001] RSP <ffff88013a247e08>
[ 740.962001] CR2: 0000000000000008
[ 740.962001] ---[ end trace 39181460000748de ]---
[ 740.962001] Kernel panic - not syncing: Fatal exception

This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH. A worker is queued to handle them. But before calling raid5_do_work, raid5d handles those stripes, making conf->active_stripe = 0, so mddev_suspend() can return. We might then free the old worker resources before the queued raid5_do_work() has handled them; when it runs, it crashes:

  raid5d()                    raid5_store_group_thread_cnt()
  queue_work
                              mddev_suspend()
  handle_strips
  active_stripe=0
                              free(old worker resources)
  process_one_work
  raid5_do_work

To avoid this, we must flush the worker resources before freeing them.

This fixes a bug introduced in 3.12, so it is suitable for the 3.12.x stable series.

Cc: stable@vger.kernel.org (3.12)
Fixes: b721420e8719131896b009b11edbbd27
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
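A sketch of the ordering the fix enforces (raid5_wq is the driver's workqueue; surrounding detail omitted):

    mddev_suspend(mddev);

    /* Any raid5_do_work() already queued may still reference the old
     * groups; wait for it to finish before tearing them down. */
    flush_workqueue(raid5_wq);

    old_groups = conf->worker_groups;
    /* ... install the new groups under conf->device_lock ... */
    kfree(old_groups);

    mddev_resume(mddev);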
2013-11-19  md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.  [majianpeng, 1 file, -1/+1]
For R5_ReadNoMerge, it means this bio can't be merged with other bios or requests. REQ_FLUSH was used to achieve this, but REQ_NOMERGE can do the same work.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19  md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.  [NeilBrown, 1 file, -4/+15]
We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread, and that only happens when the thread finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test.

So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge the latest step in a reshape, as it may not be fully committed yet.

Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't have to wait for the speed to drop before we can abort.

Signed-off-by: NeilBrown <neilb@suse.de>
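The substitution looks like this at each abort check (a sketch; the 'abort' label is illustrative):

    /* Before: abort depended on the thread being asked to stop. */
    if (kthread_should_stop())
            goto abort;

    /* After: the interrupt flag leads to a normal finish, so
     * md_check_recovery() reaps the thread as usual. */
    if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
            goto abort;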
2013-11-19  raid5: Retry R5_ReadNoMerge flag when hit a read error.  [Bian Yu, 1 file, -0/+3]
Because of block-layer merging, one failing bio will cause the other bios belonging to the same request to fail too, so raid5_end_read_request will record all of these bios as badblocks. If we retry the request with the R5_ReadNoMerge flag to avoid bio merging, badblocks can record exactly the sectors that are bad.

test:
  hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
  mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
  mdadm /dev/md0 -f /dev/sdd
  mdadm /dev/md0 -r /dev/sdd
  mdadm --zero-superblock /dev/sdd
  mdadm /dev/md0 -a /dev/sdd

1. Without this patch:
  cat /sys/block/md0/md/rd*/bad_blocks
  299776 256
  299776 256

2. With this patch:
  cat /sys/block/md0/md/rd*/bad_blocks
  300000 8
  300000 8

Signed-off-by: Bian Yu <bianyu@kedacom.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-19  raid5: relieve lock contention in get_active_stripe()  [Shaohua Li, 1 file, -1/+7]
Track the number of empty inactive lists, so md_raid5_congested() can use it to make its decision.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
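A sketch of the bookkeeping (field name follows the description; the congested-callback shape is simplified):

    atomic_t empty_inactive_list_nr;   /* how many hash lists are empty */

    /* Taking the last stripe off a list: */
    if (list_empty(&conf->inactive_list[hash]))
            atomic_inc(&conf->empty_inactive_list_nr);

    /* Returning a stripe to a previously empty list: */
    if (was_empty)
            atomic_dec(&conf->empty_inactive_list_nr);

    /* md_raid5_congested() can then report congestion whenever any
     * inactive list is empty, without touching device_lock: */
    if (atomic_read(&conf->empty_inactive_list_nr))
            return 1;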
2013-11-15  llists: move llist_reverse_order from raid5 to llist.c  [Christoph Hellwig, 1 file, -14/+0]
Make this useful helper available for other users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-14  raid5: relieve lock contention in get_active_stripe()  [Shaohua Li, 1 file, -72/+245]
get_active_stripe() is the last place we have lock contention. It has two paths: one where the stripe isn't found and a new stripe is allocated, and one where the stripe is found.

The first path basically calls __find_stripe and init_stripe. It accesses conf->generation, conf->previous_raid_disks, conf->raid_disks, conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded, conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except for stripe_hashtbl and inactive_list, the other fields are changed very rarely.

With this patch, we split inactive_list and add new hash locks. Each free stripe belongs to a specific inactive list; which inactive list is determined by the stripe's lock_hash. Note that even if a stripe doesn't have a sector assigned, it has a lock_hash assigned. A stripe's inactive list is protected by a hash lock, which is determined by its lock_hash too. The lock_hash is derived from the current stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned to a specific lock_hash, so we can use the new hash locks to protect the stripe_hashtbl lists too. The goal of the new hash locks is that we only use them in the first path of get_active_stripe(). Since we have several hash locks, lock contention is relieved significantly.

The first path of get_active_stripe() accesses other fields as well; since they are changed rarely, changing them now requires taking conf->device_lock and all the hash locks. For a slow path, this isn't a problem. If we need to lock both device_lock and a hash lock, we always lock the hash lock first.

The tricky part is release_stripe and friends, which need to take device_lock first. Neil's suggestion is to put inactive stripes on a temporary list and re-add them to the inactive_list after device_lock is released. In this way, we add stripes to the temporary list with device_lock held and remove stripes from the list with the hash lock held. So we don't allow concurrent access to the temporary list, which means we need to allocate a temporary list for all participants of release_stripe.

One downside is that free stripes are maintained in their inactive list and can't cross between the lists. By default, we have 256 stripes in total and 8 lists, so each list will have 32 stripes. It's possible one list has a free stripe while another hasn't. The chance should be rare because stripe allocation is evenly distributed, and we can always allocate more stripes for the cache; several megabytes of memory isn't a big deal.

This completely removes the lock contention of the first path of get_active_stripe(). It slows down the second code path a little bit, because we now need to take two locks, but since the hash lock isn't contended, the overhead should be quite small (several atomic instructions). The second path of get_active_stripe() (basically sequential writes or big-request-size random writes) still has lock contention.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
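A sketch of the lock-hash derivation described above (the constants mirror the 8-list description; STRIPE_SHIFT is assumed from the driver; treat the whole thing as illustrative):

    #define NR_STRIPE_HASH_LOCKS 8
    #define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)

    /* Reuse the bits that index stripe_hashtbl, so every hash-table
     * chain maps to exactly one hash lock / inactive list. */
    static inline int stripe_hash_locks_hash(sector_t sect)
    {
            return (sect >> STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK;
    }

    /* Rare shape changes take device_lock plus all hash locks;
     * the lock order is always hash lock first, then device_lock. */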
2013-11-14  md/raid5.c: add proper locking to error path of raid5_start_reshape.  [NeilBrown, 1 file, -0/+6]
If raid5_start_reshape errors out, we need to reset all the fields that were updated (not just some), and we need to use the seq_counter to ensure make_request() doesn't use an inconsistent state.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-14  raid5: Use slow_path to release stripe when mddev->thread is null  [majianpeng, 1 file, -1/+2]
When release_stripe() is called in grow_one_stripe(), mddev->thread is null, so the wakeup of that thread to release the stripe is omitted. For this condition, use the slow_path to release the stripe.

Bug was introduced in 3.12.

Cc: stable@vger.kernel.org (3.12+)
Fixes: 773ca82fa1ee58dd1bf88b
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24  raid5: avoid finding "discard" stripe  [Shaohua Li, 1 file, -0/+8]
SCSI discard will damage the discard stripe's bio settings; e.g., some fields are changed. If the stripe is reused very soon, we have wrong bio settings. So remove the discard stripe from the hash list; next time the stripe will be fully initialized.

Suitable for backport to 3.7+.

Cc: <stable@vger.kernel.org> (3.7+)
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-10-24  raid5: set bio bi_vcnt 0 for discard request  [Shaohua Li, 1 file, -0/+12]
The SCSI layer will add a new payload for a discard request. If two bios are merged into one, the second bio has bi_vcnt 1, which is set in raid5. This will confuse SCSI and cause an oops.

Suitable for backport to 3.7+.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
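The shape of the fix is tiny (a sketch of the per-device bio setup; 'bi' stands for the stripe's member-disk bio):

    if (rw & REQ_DISCARD) {
            /* The SCSI layer attaches its own payload to discards;
             * a stale bi_vcnt from raid5 would confuse the merged bio. */
            bi->bi_vcnt = 0;
    }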
2013-09-02  raid5: only wakeup necessary threads  [Shaohua Li, 1 file, -6/+35]
If there are not enough stripes to handle, we'd better not always queue all available work_structs. If one worker can only handle a small number of stripes, or even none, it will impact request merging and create lock contention.

With this patch, the number of running work_structs will depend on the number of pending stripes.

Note: some statistics info used in the patch is accessed without locking protection. This shouldn't matter; we just try our best to avoid queueing unnecessary work_structs.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28  md/raid5: flush out all pending requests before proceeding with reshape.  [NeilBrown, 1 file, -0/+7]
Some requests - particularly 'discard' and 'read' - are handled differently depending on whether a reshape is active or not. It is harmless to assume reshape is active if it isn't, but wrong to act as though reshape is not active when it is.

So when we start a reshape - after making clear to all requests that reshape has started - use mddev_suspend/mddev_resume to flush out all requests. This will ensure that no requests will be assuming the absence of reshape once it really starts.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28  md/raid5: use seqcount to protect access to shape in make_request.  [NeilBrown, 1 file, -1/+13]
make_request() accesses various shape parameters (raid_disks, chunk_size, etc.) which might be changed by raid5_start_reshape(). If the latter is called at an awkward time during the former, the wrong stripe_head might be used.

So introduce a 'seqcount' and, after finding a stripe_head, make sure there is no reason to expect that we got the wrong one.

Signed-off-by: NeilBrown <neilb@suse.de>
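A sketch of the seqcount pattern (the generic <linux/seqlock.h> API; the gen_lock field name is an assumption about the driver):

    /* Writer - raid5_start_reshape(), already serialized: */
    write_seqcount_begin(&conf->gen_lock);
    conf->raid_disks = new_disks;
    conf->chunk_sectors = new_chunk_sectors;
    write_seqcount_end(&conf->gen_lock);

    /* Reader - make_request(): redo the mapping if the shape moved
     * while we were computing the stripe address. */
    do {
            seq = read_seqcount_begin(&conf->gen_lock);
            new_sector = raid5_compute_sector(conf, logical_sector,
                                              previous, &dd_idx, NULL);
    } while (read_seqcount_retry(&conf->gen_lock, seq));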
2013-08-28  raid5: sysfs entry to control worker thread number  [Shaohua Li, 1 file, -0/+60]
Add a sysfs entry to control the number of running workqueue threads. If group_thread_cnt is set to 0, workqueue offload handling of stripes is disabled.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28  raid5: offload stripe handle to workqueue  [Shaohua Li, 1 file, -15/+171]
This is another attempt to create multiple threads to handle raid5 stripes; this time using a workqueue.

raid5 handles requests (especially writes) in stripe units. A stripe is page-size aligned/long and crosses all disks. Writing to any disk sector, raid5 runs a state machine for the corresponding stripe, which includes reading some disks of the stripe, calculating parity, and writing some disks of the stripe. The state machine currently runs in the raid5d thread. Since there is only one thread, it doesn't scale well for high-speed storage. The obvious solution is multi-threading.

To get better performance, we have some requirements:
a. Locality. A stripe corresponding to a request submitted from one cpu is better handled in a thread on the local cpu or local node. The local cpu is preferred, but it can sometimes be a bottleneck, for example when parity calculation is too heavy; running on the local node has wide adaptability.
b. Configurability. Different raid5 array setups might need different configurations, especially the thread number. More threads don't always mean better performance, because of lock contention.

My original implementation created some kernel threads. There were interfaces to control which cpu's stripes each thread should handle, and userspace could set the affinity of the threads. This provided the biggest flexibility and configurability, but it was hard to use, and apparently a new thread-pool implementation is disfavored.

The recent workqueue improvements are quite promising. An unbound workqueue will be bound to a numa node. If WQ_SYSFS is set on the workqueue, there are sysfs options to do affinity settings; for example, we can include only one HT sibling in the affinity. Work items are non-reentrant by default, and we can control the running thread number by limiting the number of dispatched work_structs.

In this patch, I created several stripe worker groups. A group is a numa node: stripes from the cpus of one node will be added to that group's list, and workqueue threads of one node will only handle stripes of that node's worker group. In this way, stripe handling has numa-node locality. And, as said above, we can control the thread number by limiting the dispatched work_struct number.

The work_struct callback function handles several stripes in one run. A typical workqueue usage is to run one unit in each work_struct, which in the raid5 case would be a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock (because of reference accounting, and the stripe isn't in any list), queuing a work_struct for each stripe will make the workqueue lock contended very heavily.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we might dispatch requests. If each work_struct only handled one stripe, such block plugging would be meaningless.

This implementation can't do very fine-grained configuration, but the numa binding is the most popular usage model and should be enough for most workloads. A sketch of the shape follows below.

Note: since we have only one stripe queue, switching to multi-thread might decrease the request size dispatched down to the low-level layer. The impact depends on thread number, raid configuration and workload, so multi-threaded raid5 might not be proper for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disable multi-threading by default
3. add more descriptions in changelog

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
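A sketch of the setup (flags abbreviated; one group per NUMA node as described; pop_stripe is a hypothetical helper):

    static struct workqueue_struct *raid5_wq;

    /* An unbound workqueue parks its workers per NUMA node, and
     * WQ_SYSFS exposes the affinity knobs under sysfs. */
    raid5_wq = alloc_workqueue("raid5wq",
                               WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);

    /* Each worker callback drains a batch of stripes from its node's
     * group list inside one plug, instead of one stripe per work_struct: */
    struct blk_plug plug;
    struct stripe_head *sh;

    blk_start_plug(&plug);
    while ((sh = pop_stripe(group)) != NULL)
            handle_stripe(sh);
    blk_finish_plug(&plug);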
2013-08-28  raid5: fix stripe release order  [Shaohua Li, 1 file, -0/+15]
patch "make release_stripe lockless" changes the order stripes are released. Originally I thought block layer can take care of request merge, but it appears there are still some requests not merged. It's easy to fix the order. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28  raid5: make release_stripe lockless  [Shaohua Li, 1 file, -3/+46]
release_stripe still has big lock contention. With this patch we just add the stripe to an llist without taking device_lock. We let the raid5d thread do the real stripe release, which must hold device_lock anyway. In this way, release_stripe doesn't hold any locks.

The side effect is that the order in which stripes are released is changed. But that doesn't sound like a big deal: stripes are never handled in order, and I thought the block layer can already do nice request merging, which means order isn't that important.

I kept the unplug release batch, which is unnecessary with this patch from the lock-contention-avoidance point of view; actually, if we deleted it, the stripe_head release_list and lru could share storage. But the unplug release batch is also helpful for request merging. We could probably delay waking up raid5d till unplug, but I'm still afraid of the case where raid5d is already running.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
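A sketch of the lockless hand-off using <linux/llist.h> (field names follow the description, not necessarily the final code):

    /* release_stripe(): lock-free push, then poke raid5d. */
    if (!test_and_set_bit(STRIPE_ON_RELEASE_LIST, &sh->state))
            llist_add(&sh->release_list, &conf->released_stripes);
    md_wakeup_thread(conf->mddev->thread);

    /* raid5d: grab the whole list at once and do the real release
     * under device_lock, as before. */
    struct llist_node *node = llist_del_all(&conf->released_stripes);
    while (node) {
            struct stripe_head *sh =
                    llist_entry(node, struct stripe_head, release_list);
            node = node->next;
            clear_bit(STRIPE_ON_RELEASE_LIST, &sh->state);
            __release_stripe(conf, sh);
    }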
2013-07-25  md/raid5: fix interaction of 'replace' and 'recovery'.  [NeilBrown, 1 file, -5/+10]
If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over.

This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.replacing is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened, as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test: the latter ensures that no IO has been flagged but not started; the former checks if any parity calculation has been flagged but not started. We must wait for both of these to complete before triggering the 'replace'.

Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (it is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished.

Finally, if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning.

This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix.

Cc: stable@vger.kernel.org (3.3+)
Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04  md/raid5: allow 5-device RAID6 to be reshaped to 4-device.  [NeilBrown, 1 file, -1/+1]
There is a bug in 'check_reshape' for raid5.c. It checks that the new minimum number of devices is large enough (which is good), but it does so even after the reshape has started (bad).

This is bad because:
 - the calculation is now wrong, as mddev->raid_disks has changed already, and
 - it is pointless, because it is now too late to stop.

So only perform that test when the reshape has not yet been committed to.

Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14  md: replace strict_strto*() with kstrto*()  [Jingoo Han, 1 file, -2/+2]
The usage of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
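The conversion is mechanical; in a sysfs store handler it looks like this (a minimal sketch):

    unsigned long new;

    /* was: if (strict_strtoul(page, 10, &new)) return -EINVAL; */
    if (kstrtoul(page, 10, &new))
            return -EINVAL;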
2013-06-13  Merge tag 'md-3.10-fixes' of git://neil.brown.name/md  [Linus Torvalds, 1 file, -1/+3]
Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.
2013-06-13  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place  [H. Peter Anvin, 1 file, -1/+3]
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drives behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession, resulting in data loss.

However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7, when write_same support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
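The disable itself amounts to a single block-layer call per personality (a sketch, guarded for dm-created arrays that have no queue):

    /* Advertise zero WRITE SAME capacity so the block layer never
     * sends the command down to the member devices. */
    if (mddev->queue)
            blk_queue_max_write_same_sectors(mddev->queue, 0);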
2013-05-30  raid5: Initialize bi_vcnt  [Kent Overstreet, 1 file, -0/+2]
The patch that converted raid5 to use bio_reset() forgot to initialize bi_vcnt.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Tested-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-08  Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block  [Linus Torvalds, 1 file, -25/+24]
Pull block core updates from Jens Axboe:
 - Major bit is Kent's prep work for immutable bio vecs.
 - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation.
 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios.
 - Tejun's changes to convert the writeback thread pool to the generic workqueue mechanism.
 - Runtime PM framework; SCSI patches exist on top of these in James' tree.
 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-04-24  md/raid5: avoid an extra write when writing to a known-bad-block.  [NeilBrown, 1 file, -1/+8]
If we write to a known-bad-block it will be flagged as having a ReadError by analyse_stripe, but the write will proceed anyway (as it should). Then the read-error handling will kick in and write again, then re-read.

We don't need that 'write-again', so set R5_ReWrite so it looks like it has already been done. Then we will just get the re-read, which we want.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-24  md/raid5: Change or of some order to improve efficiency.  [majianpeng, 1 file, -3/+4]
As the function call is the most expensive of these tests, it should be done later in the chain so that it can be avoided in some cases.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-04-18Revert "block: add missing block_bio_complete() tracepoint"Linus Torvalds1-1/+10
This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f.

Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-02  Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core  [Jens Axboe, 1 file, -35/+81]
Tejun writes:

-----
This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to the workqueue tree), but the merge strategy is a bit involved due to the dependencies.

* Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in the block tree.

* Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and it is not included in this pull request.

The three commits are located in the following git branch:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue

Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits:

  e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
  2f6db2a707 ("raid5: use bio_reset()")

The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both.

The merged branch is available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge

so that you can use it for verification. The test merge commit has a proper merge description.

While these changes are a bit of a pain to route, they make the code simpler and even have a, while minute, measurable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion.
----

Fixed up the conflict.

Conflicts:
  drivers/md/raid5.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-24  Merge tag 'md-3.9-fixes' of git://neil.brown.name/md  [Linus Torvalds, 1 file, -35/+81]
Pull md fixes from NeilBrown:
 "A few bugfixes for md:
  - recent regressions in raid5
  - recent regressions in dmraid
  - a few instances of CONFIG_MULTICORE_RAID456 linger

  Several tagged for -stable"

* tag 'md-3.9-fixes' of git://neil.brown.name/md:
  md: remove CONFIG_MULTICORE_RAID456 entirely
  md/raid5: ensure sync and DISCARD don't happen at the same time.
  MD: Prevent sysfs operations on uninitialized kobjects
  MD RAID5: Avoid accessing gendisk or queue structs when not available
  md/raid5: schedule_construction should abort if nothing to do.
2013-03-24  raid5: use bio_reset()  [Kent Overstreet, 1 file, -14/+13]
Had to shuffle the code around a bit (where bi_rw and bi_end_io were set), but shouldn't really be anything tricky here.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
2013-03-24  block: Use bio_sectors() more consistently  [Kent Overstreet, 1 file, -4/+4]
Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
2013-03-24  block: Add bio_end_sector()  [Kent Overstreet, 1 file, -7/+7]
Just a little convenience macro - the main reason to add it now is preparing for immutable bio vecs; it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
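The macro itself, as it existed before the bvec_iter conversion:

    /* First sector past the end of the bio's data. */
    #define bio_end_sector(bio)  ((bio)->bi_sector + bio_sectors(bio))

so open-coded checks like "bio->bi_sector + (bio->bi_size >> 9)" collapse into one readable call.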
2013-03-20  md/raid5: ensure sync and DISCARD don't happen at the same time.  [NeilBrown, 1 file, -6/+39]
A number of problems can occur due to races between resync/recovery and discard.

- If sync_request calls handle_stripe() while a discard is happening on the stripe, it might call handle_stripe_clean_event before all of the individual discard requests have completed (so some devices are still locked, but not all). Since commit ca64cae96037de16e4af92678814f5d4bf0c1c65 ("md/raid5: Make sure we clear R5_Discard when discard is finished.") this will cause R5_Discard to be cleared for the parity device, so handle_stripe_clean_event() will not be called when the other devices do become unlocked, so their ->written will not be cleared. This ultimately leads to a WARN_ON in init_stripe and a lock-up.

- If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward time for resync, it can lead to s->uptodate being less than disks in handle_parity_checks5(), which triggers a BUG (because it is one).

So:
 - keep R5_Discard on the parity device until all other devices have completed their discard request
 - make sure we don't try to have a 'discard' and a 'sync' action at the same time. This involves a new stripe flag so we know when a 'discard' is happening, and the use of R5_Overlap on the parity disk so that when a discard is wanted while a sync is active, we know to wake up the discard at the appropriate time.

Discard support for RAID5 was added in 3.7, so this is suitable for any -stable kernel since 3.7.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20  MD RAID5: Avoid accessing gendisk or queue structs when not available  [Jonathan Brassow, 1 file, -13/+20]
MD RAID5: Fix kernel oops when RAID4/5/6 is used via device-mapper

Commit a9add5d (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver. However, when device-mapper is used to create RAID4/5/6 arrays, the mddev->gendisk and mddev->queue fields are not set up. Therefore, calling things like trace_block_bio_remap will cause a kernel oops. This patch conditionalizes those calls on whether the proper fields exist to make the calls. (Device-mapper will call trace_block_bio_remap on its own.)

This patch is suitable for the 3.8.y stable kernel.

Cc: stable@vger.kernel.org (v3.8+)
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-20  md/raid5: schedule_construction should abort if nothing to do.  [NeilBrown, 1 file, -16/+22]
Since commit 1ed850f356a0a422013846b5291acff08815008b ("md/raid5: make sure to_read and to_write never go negative") it has been possible for handle_stripe_dirtying to be called when there isn't actually any work to do. It then calls schedule_reconstruction(), which will set R5_LOCKED on the parity block(s) even when nothing else is happening. This then causes problems in do_release_stripe().

So add checks to schedule_reconstruction() so that if it doesn't find anything to do, it just aborts.

This bug was introduced in v3.7, so the patch is suitable for -stable kernels since then.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-03-06  Merge tag 'md-3.9' of git://neil.brown.name/md  [Linus Torvalds, 1 file, -37/+1]
Pull md updates from NeilBrown:
 "Mostly little bugfixes. Only "feature" is a new RAID10 layout which slightly improves the number of sets of devices that can concurrently fail, without data loss."

* tag 'md-3.9' of git://neil.brown.name/md:
  md: expedite metadata update when switching read-auto -> active
  md: remove CONFIG_MULTICORE_RAID456
  md/raid1,raid10: fix deadlock with freeze_array()
  md/raid0: improve error message when converting RAID4-with-spares to RAID0
  md: raid0: fix error return from create_stripe_zones.
  md: fix two bugs when attempting to resize RAID0 array.
  DM RAID: Add support for MD's RAID10 "far" and "offset" algorithms
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 2)
  MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)
  MD RAID10: Minor non-functional code changes
  md: raid1,10: Handle REQ_WRITE_SAME flag in write bios
  md: protect against crash upon fsync on ro array
2013-03-01  Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block  [Linus Torvalds, 1 file, -10/+1]
Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains:

  - The big cfq/blkcg update from Tejun and Vivek.
  - Additional block and writeback tracepoints from Tejun.
  - Improvement of the should-sort (based on queues) logic in the plug flushing.
  - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly.
  - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...
2013-02-28  hlist: drop the node parameter from iterators  [Sasha Levin, 1 file, -2/+1]
I'm not sure why, but the hlist for-each-entry iterators were conceived differently from the list ones. While the list iterator is:

  list_for_each_entry(pos, head, member)

the hlist ones were greedy and wanted an extra parameter:

  hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only do they not really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:
 - Fix up the actual hlist iterators in linux/list.h.
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small number of places were using the 'node' parameter; these were modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually.

The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

  @@
  iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
  type T;
  expression a,c,d,e;
  identifier b;
  statement S;
  @@

  -T b;
  <+... when != b
  (
  hlist_for_each_entry(a,
  - b,
  c, d) S
  |
  hlist_for_each_entry_continue(a,
  - b,
  c) S
  |
  hlist_for_each_entry_from(a,
  - b,
  c) S
  |
  hlist_for_each_entry_rcu(a,
  - b,
  c, d) S
  |
  hlist_for_each_entry_rcu_bh(a,
  - b,
  c, d) S
  |
  hlist_for_each_entry_continue_rcu_bh(a,
  - b,
  c) S
  |
  for_each_busy_worker(a, c,
  - b,
  d) S
  |
  ax25_uid_for_each(a,
  - b,
  c) S
  |
  ax25_for_each(a,
  - b,
  c) S
  |
  inet_bind_bucket_for_each(a,
  - b,
  c) S
  |
  sctp_for_each_hentry(a,
  - b,
  c) S
  |
  sk_for_each(a,
  - b,
  c) S
  |
  sk_for_each_rcu(a,
  - b,
  c) S
  |
  sk_for_each_from
  -(a, b)
  +(a)
  S
  + sk_for_each_from(a) S
  |
  sk_for_each_safe(a,
  - b,
  c, d) S
  |
  sk_for_each_bound(a,
  - b,
  c) S
  |
  hlist_for_each_entry_safe(a,
  - b,
  c, d, e) S
  |
  hlist_for_each_entry_continue_rcu(a,
  - b,
  c) S
  |
  nr_neigh_for_each(a,
  - b,
  c) S
  |
  nr_neigh_for_each_safe(a,
  - b,
  c, d) S
  |
  nr_node_for_each(a,
  - b,
  c) S
  |
  nr_node_for_each_safe(a,
  - b,
  c, d) S
  |
  - for_each_gfn_sp(a, c, d, b) S
  + for_each_gfn_sp(a, c, d) S
  |
  - for_each_gfn_indirect_valid_sp(a, c, d, b) S
  + for_each_gfn_indirect_valid_sp(a, c, d) S
  |
  for_each_host(a,
  - b,
  c) S
  |
  for_each_host_safe(a,
  - b,
  c, d) S
  |
  for_each_mesh_entry(a,
  - b,
  c, d) S
  )
  ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
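At a call site, the conversion looks like this (an illustrative structure, shown as a before/after pair):

    struct entry {
            int key;
            struct hlist_node node;
    };

    struct entry *e;
    struct hlist_node *pos;

    /* Before: a scratch cursor had to be threaded through. */
    hlist_for_each_entry(e, pos, head, node)
            if (e->key == key)
                    return e;

    /* After: reads exactly like list_for_each_entry. */
    hlist_for_each_entry(e, head, node)
            if (e->key == key)
                    return e;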
2013-02-28  md: remove CONFIG_MULTICORE_RAID456  [NeilBrown, 1 file, -37/+1]
This doesn't seem to actually help, and we have an alternate multi-threading approach waiting in the wings, so just get rid of this config option and associated code.

As a bonus, we remove one use of CONFIG_EXPERIMENTAL.

Cc: Dan Williams <djbw@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-01-14  block: add missing block_bio_complete() tracepoint  [Tejun Heo, 1 file, -10/+1]
bio completion didn't kick the block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes the block_bio_complete TP useless for tracers which want to know about bios, and all other bio-based drivers skip generating blktrace completion events.

This patch makes all bio completions via bio_endio() generate a block_bio_complete TP.

* Explicit trace_block_bio_complete() invocation removed from dm, and the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around w/o a queue associated; verifying and accessing the associated queue belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore bio completions if the request completion path is happening.

This makes all bio-based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful.

v2: With this change, block_bio_complete TP could be invoked on sg commands which have bios with a %NULL bi_bdev. Update the TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-12-18  Merge tag 'md-3.8' of git://neil.brown.name/md  [Linus Torvalds, 1 file, -7/+36]
Pull md update from Neil Brown:
 "Mostly just little fixes. Probably biggest part is AVX accelerated RAID6 calculations."

* tag 'md-3.8' of git://neil.brown.name/md:
  md/raid5: add blktrace calls
  md/raid5: use async_tx_quiesce() instead of open-coding it.
  md: Use ->curr_resync as last completed request when cleanly aborting resync.
  lib/raid6: build proper files on corresponding arch
  lib/raid6: Add AVX2 optimized gen_syndrome functions
  lib/raid6: Add AVX2 optimized recovery functions
  md: Update checkpoint of resync/recovery based on time.
  md:Add place to update ->recovery_cp.
  md.c: re-indent various 'switch' statements.
  md: close race between removing and adding a device.
  md: removed unused variable in calc_sb_1_csm.
2012-12-18  md/raid5: add blktrace calls  [NeilBrown, 1 file, -3/+35]
This makes it easier to trace what raid5 is doing.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-12-18  Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block  [Linus Torvalds, 1 file, -7/+5]
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8. The branch contains:

  - A huge pile of drbd bits that were dumped from the 3.7 merge window. Following that, it was both made perfectly clear that there is going to be no more over-the-wall pulls and how the situation on individual pulls can be improved.
  - A few cleanups from Akinobu Mita for drbd and cciss.
  - Queue improvement for loop from Lukas. This grew into adding a generic interface for waiting/checking an event with a specific lock, allowing this to be pulled out of md; now loop and drbd are also using it.
  - A few fixes for the xen back/front block driver from Roger Pau Monne.
  - Partition improvements from Stephen Warren, allowing partition UUID to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-14  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial  [Linus Torvalds, 1 file, -1/+1]
Pull trivial branch from Jiri Kosina:
 "Usual stuff -- comment/printk typo fixes, documentation updates, dead code elimination."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  HOWTO: fix double words typo
  x86 mtrr: fix comment typo in mtrr_bp_init
  propagate name change to comments in kernel source
  doc: Update the name of profiling based on sysfs
  treewide: Fix typos in various drivers
  treewide: Fix typos in various Kconfig
  wireless: mwifiex: Fix typo in wireless/mwifiex driver
  messages: i2o: Fix typo in messages/i2o
  scripts/kernel-doc: check that non-void fcts describe their return value
  Kernel-doc: Convention: Use a "Return" section to describe return values
  radeon: Fix typo and copy/paste error in comments
  doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
  various: Fix spelling of "asynchronous" in comments.
  Fix misspellings of "whether" in comments.
  eisa: Fix spelling of "asynchronous".
  various: Fix spelling of "registered" in comments.
  doc: fix quite a few typos within Documentation
  target: iscsi: fix comment typos in target/iscsi drivers
  treewide: fix typo of "suport" in various comments and Kconfig
  treewide: fix typo of "suppport" in various comments
  ...
2012-12-13  md/raid5: use async_tx_quiesce() instead of open-coding it.  [NeilBrown, 1 file, -4/+1]
handle_stripe_expansion contains:

    if (tx) {
        async_tx_ack(tx);
        dma_wait_for_async_tx(tx);
    }

which is very similar to the body of async_tx_quiesce(), except that the latter handles an error from dma_wait_for_async_tx() (admittedly by panicking, but that decision belongs in the dma code, not the md code).

So just use async_tx_quiesce().

Acked-by: Dan Williams <djbw@fb.com>
Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
handle_stripe_expansion contains: if (tx) { async_tx_ack(tx); dma_wait_for_async_tx(tx); } which is very similar to the body of async_tx_quiesce(), except that the later handles an error from dma_wait_for_async_tx() (admittedly by panicing, but that decision belongs in the dma code, not the md code). So just us async_tx_quiesce(). Acked-by: Dan Williams <djbw@fb.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NeilBrown <neilb@suse.de>