summaryrefslogtreecommitdiff
path: root/drivers/md/md.c
AgeCommit message (Collapse)AuthorFilesLines
2012-08-01Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-54/+5
Pull block driver changes from Jens Axboe: - Making the plugging support for drivers a bit more sane from Neil. This supersedes the plugging change from Shaohua as well. - The usual round of drbd updates. - Using a tail add instead of a head add in the request completion for ndb, making us find the most completed request more quickly. - A few floppy changes, getting rid of a duplicated flag and also running the floppy init async (since it takes forever in boot terms) from Andi. * 'for-3.6/drivers' of git://git.kernel.dk/linux-block: floppy: remove duplicated flag FD_RAW_NEED_DISK blk: pass from_schedule to non-request unplug functions. block: stack unplug blk: centralize non-request unplug handling. md: remove plug_cnt feature of plugging. block/nbd: micro-optimization in nbd request completion drbd: announce FLUSH/FUA capability to upper layers drbd: fix max_bio_size to be unsigned drbd: flush drbd work queue before invalidate/invalidate remote drbd: fix potential access after free drbd: call local-io-error handler early drbd: do not reset rs_pending_cnt too early drbd: reset congestion information before reporting it in /proc/drbd drbd: report congestion if we are waiting for some userland callback drbd: differentiate between normal and forced detach drbd: cleanup, remove two unused global flags floppy: Run floppy initialization asynchronous
2012-07-31blk: pass from_schedule to non-request unplug functions.NeilBrown1-1/+1
This will allow md/raid to know why the unplug was called, and will be able to act according - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31blk: centralize non-request unplug handling.NeilBrown1-51/+5
Both md and umem has similar code for getting notified on an blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31md: remove plug_cnt feature of plugging.NeilBrown1-4/+1
This seemed like a good idea at the time, but after further thought I cannot see it making a difference other than very occasionally and testing to try to exercise the case it is most likely to help did not show any performance difference by removing it. So remove the counting of active plugs and allow 'pending writes' to be activated at any time, not just when no plugs are active. This is only relevant when there is a write-intent bitmap, and the updating of the bitmap will likely introduce enough delay that the single-threading of bitmap updates will be enough to collect large numbers of updates together. Removing this will make it easier to centralise the unplug code, and will clear the other for other unplug enhancements which have a measurable effect. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31md: remove duplicated test on ->openers when calling do_md_stop()NeilBrown1-6/+2
do_md_stop tests mddev->openers while holding ->open_mutex, and fails if this count is too high. So callers do not need to check mddev->openers and doing so isn't very meaningful as they don't hold ->open_mutex so the number could change. So remove the unnecessary tests on mddev->openers. These are not called often enough for there to be any gain in an early test on ->open_mutex to avoid the need for a slightly more costly mutex_lock call. Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19md: avoid crash when stopping md array races with closing other open fds.NeilBrown1-13/+23
md will refuse to stop an array if any other fd (or mounted fs) is using it. When any fs is unmounted of when the last open fd is closed all pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put) so there will be no pending IO to worry about when the array is stopped. However in order to send the STOP_ARRAY ioctl to stop the array one must first get and open fd on the block device. If some fd is being used to write to the block device and it is closed after mdadm open the block device, but before mdadm issues the STOP_ARRAY ioctl, then there will be no last-close on the md device so __blkdev_put will not call sync_blockdev. If this happens, then IO can still be in-flight while md tears down the array and bad things can happen (use-after-free and subsequent havoc). So in the case where do_md_stop is being called from an open file descriptor, call sync_block after taking the mutex to ensure there will be no new openers. This is needed when setting a read-write device to read-only too. Cc: stable@vger.kernel.org Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19md: fix bug in handling of new_data_offsetNeilBrown1-0/+1
commit c6563a8c38fde3c1c7fc925a10bde3ca20799301 md: add possibility to change data-offset for devices. introduced a 'new_data_offset' attribute which should normally be the same as 'data_offset', but can be explicitly set to a different value to allow a reshape operation to move the data. Unfortunately when the 'data_offset' is explicitly set through sysfs, the new_data_offset is not also set, so the two would become out-of-sync incorrectly. One result of this is that trying to set the 'size' after the 'data_offset' would fail because it is not permitted to set the size when the 'data_offset' and 'new_data_offset' are different - as that can be confusing. Consequently when mdadm tried to do this while assembling an IMSM array it would fail. This bug was introduced in 3.5-rc1. Reported-by: Brian Downing <bdowning@lavos.net> Bisected-by: Brian Downing <bdowning@lavos.net> Tested-by: Brian Downing <bdowning@lavos.net> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: support re-add of recovering devices.NeilBrown1-2/+1
We currently only allow a device to be re-added if it appear to be in-sync. This is overly restrictive as it may be desirable to re-add a device that is in the middle of recovery. So remove the test for "InSync" - the test on rdev->raid_disk is sufficient to ensure that the re-add will succeed. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md: make 'name' arg to md_register_thread non-optional.NeilBrown1-1/+1
Having the 'name' arg optional and defaulting to the current personality name is no necessary and leads to errors, as when changing the level of an array we can end up using the name of the old level instead of the new one. So make it non-optional and always explicitly pass the name of the level that the array will be. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03md:Add blk_plug in sync_thread.majianpeng1-0/+3
Add blk_plug in sync_thread will increase the performance of sync. Because sync_thread did not blk_plug,so when raid sync, the bio merge not well. Testing environment: SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller. OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012 x86_64 x86_64 x86_64 GNU/Linux. RAID5: four ST31000524NS disk. Without blk_plug:recovery speed about 63M/Sec; Add blk_plug:recovery speed about 120M/Sec. Using blktrace: blktrace -d /dev/sdb -w 60 -o -|blkparse -i - without blk_plug: Total (8,16): Reads Queued: 309811, 1239MiB Writes Queued: 0, 0KiB Read Dispatches: 283583, 1189MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 273351, 1149MiB Writes Completed: 0, 0KiB Read Merges: 23533, 94132KiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 0 add blk_plug: Total (8,16): Reads Queued: 428697, 1714MiB Writes Queued: 0, 0KiB Read Dispatches: 3954, 1714MiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 3956, 1715MiB Writes Completed: 0, 0KiB Read Merges: 424743, 1698MiB Write Merges: 0, 0KiB IO unplugs: 0 Timer unplugs: 3384 The ratio of merge will be markedly increased. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md: check the return of mddev_find()Yuanhan Liu1-0/+3
Check the return of mddev_find(), since it may fail due to out of memeory or out of usable minor number. The reason I chose -ENODEV instead of -ENOMEM or something else is md_alloc() function chose that ;) Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22DM RAID: Set recovery flags on resumeJonathan Brassow1-3/+2
Properly initialize MD recovery flags when resuming device-mapper devices. When a device-mapper device is suspended, all I/O must stop. This is done by calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device may have been suspended while recovery was not yet complete, so the process needs to pick-up where it left off. Since 'mddev_resume' does not unset 'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves. 'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN' must be set outside of 'mddev_resume' due to how MD handles RAID reshaping. (e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the desired behavior.) Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)' there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'. Also clean up where level_store calls mddev_resume() - it current duplicates some of the funcitons of that call. - NB Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md: allow array to be resized while bitmap is present.NeilBrown1-5/+1
Now that bitmaps can be resized, we can allow an array to be resized while the bitmap is present. This only covers resizing that involves changing the effective size of member devices, not resizing that changes the number of devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct.NeilBrown1-4/+5
This new 'struct bitmap_storage' reflects the external storage of the bitmap. Having this clearly defined will make it easier to change the storage used while the array is active. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md/bitmap: allow a bitmap with no backing storage.NeilBrown1-2/+3
An md bitmap comprises two parts - internal counting of active writes per 'chunk'. - external storage of whether there are any active writes on each chunk The second requires the first, but the first doesn't require the second. Not having backing storage means that the bitmap cannot expedite resync after a crash, but it still allows us to expedite the recovery of a recently-removed device. So: allow a bitmap to exist even if there is no backing device. In that case we default to 128M chunks. A particular value of this is that we can remove and re-add a bitmap (possibly of a different granularity) on a degraded array, and not lose the information needed to fast-recover the missing device. We don't actually activate these bitmaps yet - that will come in a later patch. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md/bitmap: add new 'space' attribute for bitmaps.NeilBrown1-2/+31
If we are to allow bitmaps to be resized when the array is resized, we need to know how much space there is. So create an attribute to store this information and set appropriate defaults. It can be set more precisely via sysfs, or future metadata extensions may allow it to be recorded. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md: move freeing of badblocks.page into md_rdev_clearNeilBrown1-3/+2
This ensures that it is always freed - there were case where we failed to free the page. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22md: dm-raid should call helper function to clear rdev.NeilBrown1-4/+4
dm-raid currently open-codes the freeing of some members of and rdev. It is more maintainable to have it call common code from md.c which does this for all call-sites. So remove free_disk_sb to md_rdev_clear, export it, and use it in dm-raid.c Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: use resync_max_sectors for reshape as well as resync.NeilBrown1-3/+5
Some resync type operations need to act on the address space of the device, others on the address space of the array. This only affects RAID10, so it sets resync_max_sectors to the array size (it defaults to the device size), and that is currently used for resync only. However reshape of a RAID10 must be done against the array size, not device size, so change code to use resync_max_sectors for both the resync and the reshape cases. This does not affect RAID5 or RAID1, just RAID10. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: teach sync_page_io about new_data_offset.NeilBrown1-0/+4
Some code in raid1 and raid10 use sync_page_io to read/write pages when responding to read errors. As we will shortly support changing data_offset for raid10, this function must understand new_data_offset. So add that understanding. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: add possibility to change data-offset for devices.NeilBrown1-21/+196
When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: allow a reshape operation to be reversed.NeilBrown1-2/+65
Currently a reshape operation always progresses from the start of the array to the end unless the number of devices is being reduced, in which case it progressed in the opposite direction. To reverse a partial reshape which changes the number of devices you can stop the array and re-assemble with the raid-disks numbers reversed and it will undo. However for a reshape that does not change the number of devices it is not possible to reverse the reshape in the middle - you have to wait until it completes. So add a 'reshape_direction' attribute with is either 'forwards' or 'backwards' and can be explicitly set when delta_disks is zero. This will become more important when we allow the data_offset to change in a reshape. Then the explicit statement of what direction is being used will be more useful. This can be enabled in raid5 trivially as it already supports reverse reshape and just needs to use a different trigger to request it. Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21md: using GFP_NOIO to allocate bio for flush requestShaohua Li1-1/+1
A flush request is usually issued in transaction commit code path, so using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. This is suitable for any -stable kernel to which it applies as it avoids a possible deadlock. Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17MD: Add del_timer_sync to mddev_suspend (fix nasty panic)Jonathan Brassow1-0/+2
Use del_timer_sync to remove timer before mddev_suspend finishes. We don't want a timer going off after an mddev_suspend is called. This is especially true with device-mapper, since it can call the destructor function immediately following a suspend. This results in the removal (kfree) of the structures upon which the timer depends - resulting in a very ugly panic. Therefore, we add a del_timer_sync to mddev_suspend to prevent this. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24md: fix possible corruption of array metadata on shutdown.NeilBrown1-1/+2
commit c744a65c1e2d59acc54333ce8 md: don't set md arrays to readonly on shutdown. removed the possibility of a 'BUG' when data is written to an array that has just been switched to read-only, but also introduced the possibility that the array metadata could be corrupted. If, when md_notify_reboot gets the mddev lock, the array is in a state where it is assembled but hasn't been started (as can happen if the personality module is not available, or in other unusual situations), then incorrect metadata will be written out making it impossible to re-assemble the array. So only call __md_stop_writes() if the array has actually been activated. This patch is needed for any stable kernel which has had the above commit applied. Cc: stable@vger.kernel.org Reported-by: Christoph Nelles <evilazrael@evilazrael.de> Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24md: don't call ->add_disk unless there is good reason.NeilBrown1-2/+2
Commit 7bfec5f35c68121e7b18 md/raid5: If there is a spare and a want_replacement device, start replacement. cause md_check_recovery to call ->add_disk much more often. Instead of only when the array is degraded, it is now called whenever md_check_recovery finds anything useful to do, which includes updating the metadata for clean<->dirty transition. This causes unnecessary work, and causes info messages from ->add_disk to be reported much too often. So refine md_check_recovery to only do any actual recovery checking (including ->add_disk) if MD_RECOVERY_NEEDED is set. This fix is suitable for 3.3.y: Cc: stable@vger.kernel.org Reported-by: Jan Ceuleers <jan.ceuleers@computer.org> Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().majianpeng1-1/+1
If there are no unacked bad blocks, then there is no point searching for them to acknowledge them. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md: fix clearing of the 'changed' flags for the bad blocks list.NeilBrown1-1/+2
In super_1_sync (the first hunk) we need to clear 'changed' before checking read_seqretry(), otherwise we might race with other code adding a bad block and so won't retry later. In md_update_sb (the second hunk), in the case where there is no metadata (neither persistent nor external), we treat any bad blocks as an error. However we need to clear the 'changed' flag before calling md_ack_all_badblocks, else it won't do anything. This patch is suitable for -stable release 3.0 and later. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md/bitmap: move printing of bitmap status to bitmap.cNeilBrown1-22/+1
The part of /proc/mdstat which describes the bitmap should really be generated by code in bitmap.c. So move it there. Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md/raid10: handle merge_bvec_fn in member devices.NeilBrown1-0/+1
Currently we don't honour merge_bvec_fn in member devices so if there is one, we force all requests to be single-page at most. This is not ideal. So enhance the raid10 merge_bvec_fn to check that function in children as well. This introduces a small problem. There is no locking around calls the ->merge_bvec_fn and subsequent calls to ->make_request. So a device added between these could end up getting a request which violates its merge_bvec_fn. Currently the best we can do is synchronize_sched(). This will work providing no preemption happens. If there is preemption, we just have to hope that new devices are largely consistent with old devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md: tidy up rdev_for_each usage.NeilBrown1-37/+37
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicity list_for_each entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19md: don't set md arrays to readonly on shutdown.NeilBrown1-22/+15
It seems that with recent kernel, writeback can still be happening while shutdown is happening, and consequently data can be written after the md reboot notifier switches all arrays to read-only. This causes a BUG. So don't switch them to read-only - just mark them clean and set 'safemode' to '2' which mean that immediately after any write the array will be switch back to 'clean'. This could result in the shutdown happening when array is marked dirty, thus forcing a resync on reboot. However if you reboot without performing a "sync" first, you get to keep both halves. This is suitable for any stable kernel (though there might be some conflicts with obvious fixes in earlier kernels). Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-09Merge tag 'md-3.3-fixes' of git://neil.brown.name/mdLinus Torvalds1-2/+3
Some simple md-related fixes. 1/ two small fixes to ensure we handle an interrupted resync properly. 2/ avoid loading the bitmap multiple times in dm-raid * tag 'md-3.3-fixes' of git://neil.brown.name/md: md: two small fixes to handling interrupt resync. Prevent DM RAID from loading bitmap twice.
2012-02-07md: two small fixes to handling interrupt resync.NeilBrown1-2/+3
1/ If a resync is aborted we should record how far we got (recovery_cp) the last request that we know has completed (->curr_resync_completed) rather than the last request that was submitted (->curr_resync). 2/ When a resync aborts we still want to update the metadata with any changes, so set MD_CHANGE_DEVS even if we 'skip'. Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-16Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+1
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)
2012-01-12Merge tag 'md-3.3-fixes' of git://neil.brown.name/mdLinus Torvalds1-0/+6
Two bugfixes for md. One is a recently introduced regression that affects an unusual configuration with a guaranteed BUG_ON. Has been tagged for -stable. The other is minor missing functionality. * tag 'md-3.3-fixes' of git://neil.brown.name/md: md/raid1: perform bad-block tests for WriteMostly devices too. md: notify the 'degraded' sysfs attribute on failure.
2012-01-11block: Introduce blk_set_stacking_limits functionMartin K. Petersen1-0/+1
Stacking driver queue limits are typically bounded exclusively by the capabilities of the low level devices, not by the stacking driver itself. This patch introduces blk_set_stacking_limits() which has more liberal metrics than the default queue limits function. This allows us to inherit topology parameters from bottom devices without manually tweaking the default limits in each driver prior to calling the stacking function. Since there is now a clear distinction between stacking and low-level devices, blk_set_default_limits() has been modified to carry the more conservative values that we used to manually set in blk_queue_make_request(). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11md: notify the 'degraded' sysfs attribute on failure.NeilBrown1-0/+6
We currently only 'notify' changes to the 'degraded' attribute when it decreases, not when it increases. Notifying on failure is a little awkward as it happen in interrupt context. So instead, notify when we remove the failed device from the array, which is very soon afterwards. Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-09Merge tag 'md-3.3' of git://neil.brown.name/mdLinus Torvalds1-27/+80
md update for 3.3 Big change is new hot-replacement. A slot in an array can hold 2 devices - one that wants-replacement and one that is the replacement. Once the replacement is built - either from the original or (in the case of errors) from elsewhere, the wants-replacement device will be removed. * tag 'md-3.3' of git://neil.brown.name/md: (36 commits) md/raid1: Mark device want_replacement when we see a write error. md/raid1: If there is a spare and a want_replacement device, start replacement. md/raid1: recognise replacements when assembling arrays. md/raid1: handle activation of replacement device when recovery completes. md/raid1: Allow a failed replacement device to be removed. md/raid1: Allocate spare to store replacement devices and their bios. md/raid1: Replace use of mddev->raid_disks with conf->raid_disks. md/raid10: If there is a spare and a want_replacement device, start replacement. md/raid10: recognise replacements when assembling array. md/raid10: Allow replacement device to be replace old drive. md/raid10: handle recovery of replacement devices. md/raid10: Handle replacement devices during resync. md/raid10: writes should get directed to replacement as well as original. md/raid10: allow removal of failed replacement devices. md/raid10: preferentially read from replacement device if possible. md/raid10: change read_balance to return an rdev md/raid10: prepare data structures for handling replacement. md/raid5: Mark device want_replacement when we see a write error. md/raid5: If there is a spare and a want_replacement device, start replacement. md/raid5: recognise replacements when assembling array. ...
2012-01-04fs: move code out of buffer.cAl Viro1-2/+1
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-23md/raid5: If there is a spare and a want_replacement device, start replacement.NeilBrown1-16/+14
When attempting to add a spare to a RAID[456] array, also consider adding it as a replacement for a want_replacement device. This requires that common md code attempt hot_add even when the array is not formally degraded. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: create externally visible flags for supporting hot-replace.NeilBrown1-1/+54
hot-replace is a feature being added to md which will allow a device to be replaced without removing it from the array first. With hot-replace a spare can be activated and recovery can start while the original device is still in place, thus allowing a transition from an unreliable device to a reliable device without leaving the array degraded during the transition. It can also be use when the original device is still reliable but it not wanted for some reason. This will eventually be supported in RAID4/5/6 and RAID10. This patch adds a super-block flag to distinguish the replacement device. If an old kernel sees this flag it will reject the device. It also adds two per-device flags which are viewable and settable via sysfs. "want_replacement" can be set to request that a device be replaced. "replacement" is set to show that this device is replacing another device. The "rd%d" links in /sys/block/mdXx/md only apply to the original device, not the replacement. We currently don't make links for the replacement - there doesn't seem to be a need. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: change hot_remove_disk to take an rdev rather than a number.NeilBrown1-3/+3
Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: remove test for duplicate device when setting slot number.NeilBrown1-5/+0
When setting the slot number on a device in an active array we currently check that the number is not already in use. We then call into the personality's hot_add_disk function which performs the same test and returns the same error. Thus the common test is not needed. As we will shortly be changing some personalities to allow duplicates in some cases (to support hot-replace), the common test will become inconvenient. So remove the common test. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: allow non-privileged uses to GET_*_INFO about raid arrays.NeilBrown1-2/+9
The info is already available in /proc/mdstat and /sys/block in an accessible form so there is no point in putting a road-block in the ioctl for information gathering. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23md: don't give up looking for spares on first failure-to-addNeilBrown1-2/+1
Before performing a recovery we try to remove any spares that might not be working, then add any that might have become relevant. Currently we abort on the first spare that cannot be added. This is a false optimisation. It is conceivable that - depending on rules in the personality - a subsequent spare might be accepted. Also the loop does other things like count the available spares and reset the 'recovery_offset' value. If we abort early these might not happen properly. So remove the early abort. In particular if you have an array what is undergoing recovery and which has extra spares, then the recovery may not restart after as reboot as the could of 'spares' might end up as zero. Reported-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08md: ensure new badblocks are handled promptly.NeilBrown1-0/+1
When we mark blocks as bad we need them to be acknowledged by the metadata handler promptly. For an in-kernel metadata handler that was already being done. But for an external metadata handler we need to alert it of the change by sending a notification through the sysfs file. This adds that notification. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08md: bad blocks shouldn't cause a Blocked status on a Faulty device.NeilBrown1-1/+2
Once a device is marked Faulty the badblocks - whether acknowledged or not - become irrelevant. So they shouldn't cause the device to be marked as Blocked. Without this patch, a process might write "-blocked" to clear the Blocked status, but while that will correctly fail the device, it won't remove the apparent 'blocked' status. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08md: take a reference to mddev during sysfs access.NeilBrown1-1/+18
When we are accessing an mddev via sysfs we know that the mddev cannot disappear because it has an embedded kobj which is refcounted by sysfs. And we also take the mddev_lock. However this is not enough. The final mddev_put could have been called and the mddev_delayed_delete is waiting for sysfs to let go so it can destroy the kobj and mddev. In this state there are a lot of changes that should not be attempted. To to guard against this we: - initialise mddev->all_mddevs in on last put so the state can be easily detected. - in md_attr_show and md_attr_store, check ->all_mddevs under all_mddevs_lock and mddev_get the mddev if it still appears to be active. This means that if we get to sysfs as the mddev is being deleted we will get -EBUSY. rdev_attr_store and rdev_attr_show are similar but already have sufficient protection. They check that rdev->mddev still points to mddev after taking mddev_lock. As this is cleared before delayed removal which can only be requested under the mddev_lock, this ensure the rdev and mddev are still alive. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08md: refine interpretation of "hold_active == UNTIL_IOCTL".NeilBrown1-2/+2
We like md devices to disappear when they really are not needed. However it is not possible to tell from the current state whether it is needed or not. We can only tell from recent history of changes. In particular immediately after we create an md device it looks very similar to immediately after we have finished with it. So we always preserve a newly created md device until something significant happens. This state is stored in 'hold_active'. The normal case is to keep it until an ioctl happens, as that will normally either activate it, or explicitly de-activate it. If it doesn't then it was probably created by mistake and it is now time to get rid of it. We can also modify an array via sysfs (instead of via ioctl) and we currently treat any change via sysfs like an ioctl as a sign that if it now isn't more active, it should be destroyed. However this is not appropriate as changes made via sysfs are more gradual so we should look for a more definitive change. So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when the array_state is changed via sysfs. Other changes via sysfs are ignored. Signed-off-by: NeilBrown <neilb@suse.de>