path: root/drivers/md/raid1.c
Age | Commit message | Author | Files | Lines
2016-10-26 | Fix incomplete backport of commit 423f04d63cf4 | Zefan Li | 1 | -3/+0
Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-04-27 | raid1: include bio_end_io_list in nr_queued to prevent freeze_array hang | Nate Dailey | 1 | -2/+5
commit ccfc7bf1f09d6190ef86693ddc761d5fe3fa47cb upstream. If raid1d is handling a mix of read and write errors, handle_read_error's call to freeze_array can get stuck. This can happen because, though the bio_end_io_list is initially drained, writes can be added to it via handle_write_finished as the retry_list is processed. These writes contribute to nr_pending but are not included in nr_queued. If a later entry on the retry_list triggers a call to handle_read_error, freeze_array hangs waiting for nr_pending == nr_queued+extra. The writes on the bio_end_io_list aren't included in nr_queued, so the condition will never be satisfied. To prevent the hang, include bio_end_io_list writes in nr_queued. There's probably a better way to handle decrementing nr_queued, but this seemed like the safest way to avoid breaking surrounding code. I'm happy to supply the script I used to repro this hang. Fixes: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.") Signed-off-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
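As a rough illustration of the accounting this fix restores, here is a minimal user-space model (struct and function names are illustrative, not the raid1.c symbols): freeze_array() effectively waits for nr_pending == nr_queued + extra, so anything parked on bio_end_io_list has to be counted in nr_queued or the wait can never complete.

    #include <stdio.h>

    /* Simplified counters mirroring the names in the commit message. */
    struct conf_model {
        int nr_pending;          /* all requests handed to raid1   */
        int retry_list_len;      /* parked on retry_list           */
        int bio_end_io_list_len; /* parked on bio_end_io_list      */
    };

    /* freeze_array() effectively waits for this predicate; 'extra' covers
     * requests currently being handled (e.g. the read error that triggered
     * the freeze). */
    static int can_freeze(const struct conf_model *c, int extra, int count_end_io)
    {
        int nr_queued = c->retry_list_len +
                        (count_end_io ? c->bio_end_io_list_len : 0);
        return c->nr_pending == nr_queued + extra;
    }

    int main(void)
    {
        /* Three parked requests plus the read being retried right now. */
        struct conf_model c = { .nr_pending = 4, .retry_list_len = 2,
                                .bio_end_io_list_len = 1 };

        printf("old accounting: %s\n", can_freeze(&c, 1, 0) ? "freezes" : "hangs");
        printf("new accounting: %s\n", can_freeze(&c, 1, 1) ? "freezes" : "hangs");
        return 0;
    }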
2016-04-27 | md/raid1: don't clear bitmap bit when bad-block-list write fails. | NeilBrown | 1 | -3/+8
commit bd8688a199b864944bf62eebed0ca13b46249453 upstream. When a write fails and a bad-block-list is present, we can update the bad-block-list instead of writing the data. If this succeeds then it is OK to clear the relevant bitmap-bit as no further 'sync' of the block is needed. However if writing the bad-block-list fails then we need to treat the write as failed and particularly must not clear the bitmap bit. Otherwise the device can be re-added (after any hardware connection issues are resolved) and because the relevant bit in the bitmap is clear, that block will not be resynced. This leads to data corruption. We already delay the final bio_endio() on the write until the bad-block-list is written so that when the write returns: either that data is safe, the bad-block record is safe, or the fact that the device is faulty is safe. However we *don't* delay the clearing of the bitmap, so the bitmap bit can be recorded as cleared before we know if the bad-block-list was written safely. So: delay that until the write really is safe. i.e. move the call to close_write() until just before calling bio_endio(), and recheck the 'is array degraded' status before making that call. This bug goes back to v3.1 when bad-block-lists were introduced, though it only affects arrays created with mdadm-3.3 or later as only those have bad-block lists. Backports will require at least commit 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.") as well. I'll send that to 'stable' separately. Note that of the two tests of R1BIO_WriteError that this patch adds, the first is certain to fail and the second is certain to succeed. However doing it this way makes the patch more obviously correct. I will tidy the code up in a future merge window. Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Cc: Jes Sorensen <Jes.Sorensen@redhat.com> Fixes: cd5ff9a16f08 ("md/raid1: Handle write errors by updating badblock log.") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-04-27 | md/raid1: ensure device failure recorded before write request returns. | NeilBrown | 1 | -1/+28
commit 55ce74d4bfe1b9444436264c637f39a152d1e5ac upstream. When a write to one of the legs of a RAID1 fails, the failure is recorded in the metadata of the other leg(s) so that after a restart the data on the failed drive won't be trusted even if that drive seems to be working again (maybe a cable was unplugged). Similarly when we record a bad-block in response to a write failure, we must not let the write complete until the bad-block update is safe. Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a race to fit in, but it is theoretically possible and so should be closed. So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-03-21 | md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies | NeilBrown | 1 | -1/+6
commit 423f04d63cf421ea436bcc5be02543d549ce4b28 upstream. raid1_end_read_request() assumes that the In_sync bits are consistent with the ->degraded count. raid1_spare_active updates the In_sync bit before the ->degraded count and so exposes an inconsistency, as does error(). So extend the spinlock in raid1_spare_active() and error() to hide those inconsistencies. This should probably be part of commit 34cab6f42003 ("md/raid1: fix test for 'was read error from last working device'.") as it addresses the same issue. It fixes the same bug and should go to -stable for the same reasons. Fixes: 76073054c95b ("md/raid1: clean up read_balance.") Signed-off-by: NeilBrown <neilb@suse.com> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-03-21 | md/raid1: fix test for 'was read error from last working device'. | NeilBrown | 1 | -1/+1
commit 34cab6f42003cb06f48f86a86652984dec338ae9 upstream. When we get a read error from the last working device, we don't try to repair it, and don't fail the device. We simply report a read error to the caller. However the current test for 'is this the last working device' is wrong. When there is only one fully working device, it assumes that a non-faulty device is that device. However a spare which is rebuilding is also non-faulty without being a fully working device. So change the test from "!Faulty" to "In_sync". If ->degraded says there is only one fully working device and this device is in_sync, this must be the one. This bug has existed since we allowed read_balance to read from a recovering spare in v3.0. Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Fixes: 76073054c95b ("md/raid1: clean up read_balance.") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Zefan Li <lizefan@huawei.com>
2013-11-13 | md: Fix skipping recovery for read-only arrays. | Lukasz Dorau | 1 | -0/+1
commit 61e4947c99c4494336254ec540c50186d186150b upstream. Since commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 ("md: Allow devices to be re-added to a read-only array."), spares are activated on a read-only array. For the raid1 and raid10 personalities this causes not-in-sync devices to be marked in-sync without checking whether recovery has finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered), recovery will be skipped. This patch adds a check that recovery has finished before marking a device in-sync for the raid1 and raid10 personalities. For the raid5 personality such a check is already present (at raid5.c:6029). The bug was introduced in 3.10 and causes data corruption. Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20 | md/raid1,raid10: use freeze_array in place of raise_barrier in various places. | NeilBrown | 1 | -11/+11
commit e2d59925221cd562e07fee38ec8839f7209ae603 upstream. Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The latter has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> [adjusted context so it can be applied on top of 3.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-20 | md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. | Alex Lyakas | 1 | -1/+11
commit 3056e3aec8d8ba61a0710fb78b2d562600aa2ea7 upstream. Without that fix, the following scenario could happen:
 - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
 - Drive A fails
 - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate.
 - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1
 - raid_end_bio_io() calls call_bio_endio()
 - As a result, in call_bio_endio():
       if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
               clear_bit(BIO_UPTODATE, &bio->bi_flags);
   this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
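A small user-space sketch of the rule this fix enforces (illustrative names, not the raid1.c structures): a master WRITE should only be reported successful if at least one leg that is neither Faulty nor still rebuilding completed it.

    #include <stdbool.h>
    #include <stdio.h>

    /* Per-leg state for the scenario described above (illustrative). */
    struct leg {
        bool faulty;    /* device has failed               */
        bool in_sync;   /* fully recovered, not rebuilding */
        bool write_ok;  /* this leg acknowledged the write */
    };

    /* A WRITE counts as successful only if some non-Faulty,
     * non-rebuilding leg completed it. */
    static bool write_uptodate(const struct leg *legs, int n)
    {
        for (int i = 0; i < n; i++)
            if (legs[i].write_ok && !legs[i].faulty && legs[i].in_sync)
                return true;
        return false;
    }

    int main(void)
    {
        /* Drive A (in_sync) failed the write; rebuilding drive B succeeded. */
        struct leg legs[2] = {
            { .faulty = true,  .in_sync = true,  .write_ok = false },
            { .faulty = false, .in_sync = false, .write_ok = true  },
        };
        printf("report WRITE success: %s\n",
               write_uptodate(legs, 2) ? "yes" : "no");
        return 0;
    }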
2012-11-05 | md/raid1: Fix assembling of arrays containing Replacements. | NeilBrown | 1 | -1/+1
commit 02b898f2f04e418094f0093a3ad0b415bcdbe8eb upstream. setup_conf in raid1.c uses conf->raid_disks before assigning a value. It is used when including 'Replacement' devices. The consequence is that assembling an array which contains a replacement will misbehave and either not include the replacement, or not include the device being replaced. Though this doesn't lead directly to data corruption, it could lead to reduced data safety. So use mddev->raid_disks, which is initialised, instead. The bug was introduced in 3.3 by commit c19d57980b38a5bb613a898937a1cf85f422fb9b ("md/raid1: recognise replacements when assembling arrays."), so the fix is suitable for 3.3.y through 3.6.y. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15 | md/raid1: don't abort a resync on the first badblock. | NeilBrown | 1 | -1/+4
commit b7219ccb33aa0df9949a60c68b5e9f712615e56f upstream. If a resync of a RAID1 array with 2 devices finds a known bad block on one device, it will neither read from, nor write to, that device for this block offset. So there will be one read_target (the other device) and zero write targets. This condition causes md/raid1 to abort the resync, assuming that it has finished - without known bad blocks this would be true. When there are no write targets because of the presence of bad blocks we should only skip over the area covered by the bad block. RAID10 already gets this right, raid1 doesn't. Or didn't. As this can cause a 'sync' to abort early and appear to have succeeded it could lead to some data corruption, so it is suitable for -stable. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-29 | md/raid1: close some possible races on write errors during resync | NeilBrown | 1 | -2/+8
commit 58e94ae18478c08229626daece2fc108a4a23261 upstream. Commit 4367af556133723d0f443e14ca8170d9447317cb ("md/raid1: clear bad-block record when write succeeds.") added a possible 'reschedule_retry' call at the end of end_sync_write, but didn't add matching code at the end of sync_request_write. So if the writes complete very quickly, or scheduling makes it seem that way, then we can miss rescheduling the request and the resync could hang. Also, commit 73d5c38a9536142e062c35997b044e89166e063b ("md: avoid races when stopping resync.") fixed a race condition in this same code in end_sync_write but didn't make the change in sync_request_write. This patch updates sync_request_write to fix both of those. Patch is suitable for 3.1 and later kernels. Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-19 | md/raid1: fix use-after-free bug in RAID1 data-check code. | NeilBrown | 1 | -1/+2
commit 2d4f4f3384d4ef4f7c571448e803a1ce721113d5 upstream. This bug has been present ever since data-check was introduced in 2.6.16. However it would only fire if a data-check were done on a degraded array, which was only possible if the array has 3 or more devices. This is certainly possible, but is quite uncommon. Since hot-replace was added in 3.3 it can happen more often as the same condition can arise if not all possible replacements are present. The problem is that as soon as we submit the last read request, the 'r1_bio' structure could be freed at any time, so we really should stop looking at it. If the last device is being read from we will stop looking at it. However if the last device is not due to be read from, we will still check the bio pointer in the r1_bio, but the r1_bio might already be free. So use the read_targets counter to make sure we stop looking for bios to submit as soon as we have submitted them all. This fix is suitable for any -stable kernel since 2.6.16. Reported-by: Arnold Schulz <arnysch@gmx.net> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-09 | md: raid1/raid10: fix problem with merge_bvec_fn | NeilBrown | 1 | -0/+4
commit aba336bd1d46d6b0404b06f6915ed76150739057 upstream. The new merge_bvec_fn which calls the corresponding function in subsidiary devices requires that mddev->merge_check_needed be set if any child has a merge_bvec_fn. However we were only setting that when a device was hot-added, not when a device was present from the start. This bug was introduced in 3.4 so the patch is suitable for 3.4.y kernels. However there are conflicts in raid10.c so a separate patch will be needed for 3.4.y. Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-12 | md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery. | majianpeng | 1 | -1/+2
If r1bio->sectors % 8 != 0, then the memcmp and a later memcpy will omit the last bio_vec. This is suitable for any stable kernel since 3.1 when bad-block management was introduced. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
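The arithmetic behind the fix, as a sketch assuming 4 KiB pages and 512-byte sectors (8 sectors per page): the page count used for the memcmp and memcpy must round up, otherwise a tail of fewer than 8 sectors loses its bio_vec.

    #include <stdio.h>

    /* 4096-byte pages and 512-byte sectors give 8 sectors per page. */
    #define SECTORS_PER_PAGE 8

    int main(void)
    {
        int sectors = 13;   /* not a multiple of 8 */

        /* Truncating division drops the partial tail page... */
        int vcnt_truncated = sectors / SECTORS_PER_PAGE;
        /* ...while rounding up keeps a bio_vec for it. */
        int vcnt_rounded = (sectors + SECTORS_PER_PAGE - 1) / SECTORS_PER_PAGE;

        printf("truncated vcnt = %d, rounded-up vcnt = %d\n",
               vcnt_truncated, vcnt_rounded);
        return 0;
    }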
2012-04-03 | md/raid1,raid10: don't compare excess byte during consistency check. | NeilBrown | 1 | -1/+1
When comparing two pages read from different legs of a mirror, only compare the bytes that were read, not the whole page. In most cases we read a whole page, but in some cases with bad blocks or odd-sized devices we might read fewer than that. This bug has been present "forever" but at worst it might cause a report of too many mismatches and generate a little bit of extra resync IO, so there is no need to back-port to -stable kernels. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 | md/raid1: Remove unnecessary rcu_dereference(conf->mirrors[i].rdev). | majianpeng | 1 | -2/+1
Because rdev->nr_pending > 0, this disk cannot be removed. And in any case, we aren't holding rcu_read_lock(). Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 | md/raid1: If md_integrity_register() failed, run() must free the mem | majianpeng | 1 | -1/+7
Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 | md/raid1: handle merge_bvec_fn in member devices. | NeilBrown | 1 | -21/+56
Currently we don't honour merge_bvec_fn in member devices, so if there is one, we force all requests to be single-page at most. This is not ideal. So create a raid1 merge_bvec_fn to check that function in children as well. This introduces a small problem. There is no locking around calls to ->merge_bvec_fn and subsequent calls to ->make_request. So a device added between these could end up getting a request which violates its merge_bvec_fn. Currently the best we can do is synchronize_sched(). This will work providing no preemption happens. If there is preemption, we just have to hope that new devices are largely consistent with old devices. Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 | md: tidy up rdev_for_each usage. | NeilBrown | 1 | -2/+2
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicit list_for_each_entry. So:
 - rename rdev_for_each to rdev_for_each_safe
 - create a new rdev_for_each which uses the plain list_for_each_entry
 - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each.
Signed-off-by: NeilBrown <neilb@suse.de>
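As a sketch of why the 'safe' form needs the extra variable, here is a plain user-space analogue (not the md.h macros themselves): the safe iterator pre-loads the next pointer so the current node may be dropped mid-loop.

    #include <stdio.h>

    struct node { int val; struct node *next; };

    /* Plain walk: the current node must stay valid for the whole step. */
    #define node_for_each(pos, head) \
        for ((pos) = (head); (pos); (pos) = (pos)->next)

    /* "Safe" walk: 'tmp' remembers the next node, so 'pos' may be
     * unlinked or freed inside the loop body. */
    #define node_for_each_safe(pos, tmp, head) \
        for ((pos) = (head); (pos) && ((tmp) = (pos)->next, 1); (pos) = (tmp))

    int main(void)
    {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
        struct node *head = &a, *pos, *tmp;

        node_for_each(pos, head)            /* read-only walk */
            printf("%d ", pos->val);
        printf("\n");

        node_for_each_safe(pos, tmp, head) {
            if (pos->val == 2)              /* 'pos' could be dropped here */
                continue;
            printf("%d ", pos->val);
        }
        printf("\n");
        return 0;
    }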
2012-03-19 | md/raid1,raid10: avoid deadlock during resync/recovery. | NeilBrown | 1 | -2/+15
If RAID1 or RAID10 is used under LVM or some other stacking block device, it is possible to enter a deadlock during resync or recovery. This can happen if the upper level block device creates two requests to the RAID1 or RAID10. The first request gets processed, blocks recovery and queues requests for the underlying devices in current->bio_list. A resync request then starts which will wait for those requests and block new IO. But then the second request to the RAID1/10 will be attempted and it cannot progress until the resync request completes, which cannot progress until the underlying device requests complete, which are on a queue behind that second request. So allow that second request to proceed even though there is a resync request about to start. This is suitable for any -stable kernel. Cc: stable@vger.kernel.org Reported-by: Ray Morris <support@bettercgi.com> Tested-by: Ray Morris <support@bettercgi.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-13 | md/raid1: fix buglet in md_raid1_congested. | NeilBrown | 1 | -1/+1
Since we added 'replacement' capability, RAID1 can have twice as many devices as ->raid_disks indicates. So md_raid1_congested needs to check that many possible devices, not just ->raid_disks many. Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-11 | md/raid1: perform bad-block tests for WriteMostly devices too. | NeilBrown | 1 | -1/+10
We normally try to avoid reading from write-mostly devices, but when we do we really have to check for bad blocks and be sure not to try reading them. With the current code, best_good_sectors might not get set and that causes zero-length read requests to be sent down, which is very confusing. This bug was introduced in commit d2eb35acfdccbe2 and so the patch is suitable for 3.1.x and 3.2.x. Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org
2011-12-23 | md/raid1: Mark device want_replacement when we see a write error. | NeilBrown | 1 | -1/+15
Now that WantReplacement drives are replaced cleanly, mark a drive as want_replacement when we see a write error. It might get failed soon so the WantReplacement flag is irrelevant, but if the write error is recorded in the bad block log, we still want to activate any spare that might be available. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md/raid1: If there is a spare and a want_replacement device, start replacement. | NeilBrown | 1 | -2/+15
When attempting to add a spare to a RAID1 array, also consider adding it as a replacement for a want_replacement device. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md/raid1: recognise replacements when assembling arrays. | NeilBrown | 1 | -2/+23
If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md/raid1: handle activation of replacement device when recovery completes. | NeilBrown | 1 | -3/+33
When recovery completes ->spare_active is called. This checks if the replacement is ready and if so it fails the original. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md/raid1: Allow a failed replacement device to be removed. | NeilBrown | 1 | -0/+6
Replacement devices are stored at a different offset, so look there too. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md/raid1: Allocate spare to store replacement devices and their bios. | NeilBrown | 1 | -30/+34
In RAID1, a replacement is much like a normal device, so we just double the size of the relevant arrays and look at all possible devices for reads and writes. This means that the array looks like it is now double the size in some way - we need to be careful about that. In particular, when checking if the array is still degraded while creating a recovery request, we need to only consider the first 'half' - i.e. the real (non-replacement) devices. Signed-off-by: NeilBrown <neilb@suse.de>
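A user-space sketch of the layout idea (illustrative only, not the raid1.c data structures): slots 0..raid_disks-1 hold the original devices and slots raid_disks..2*raid_disks-1 hold their replacements, and only the first half counts for the degraded test.

    #include <stdbool.h>
    #include <stdio.h>

    #define RAID_DISKS 2

    struct slot { bool present; };

    /* The degraded test looks only at the first 'half' - the real
     * (non-replacement) devices. */
    static int missing_originals(const struct slot *mirrors)
    {
        int missing = 0;
        for (int i = 0; i < RAID_DISKS; i++)
            if (!mirrors[i].present)
                missing++;
        return missing;
    }

    int main(void)
    {
        /* Both originals present; one replacement attached for slot 0. */
        struct slot mirrors[2 * RAID_DISKS] = {
            { true }, { true }, { true }, { false },
        };

        /* Reads and writes may consider every slot, original or replacement... */
        for (int i = 0; i < 2 * RAID_DISKS; i++)
            printf("slot %d: %s\n", i, mirrors[i].present ? "usable" : "empty");

        /* ...but the degraded count must not be fooled by the doubling. */
        printf("missing originals: %d\n", missing_originals(mirrors));
        return 0;
    }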
2011-12-23 | md/raid1: Replace use of mddev->raid_disks with conf->raid_disks. | NeilBrown | 1 | -3/+4
In general mddev->raid_disks can change unexpectedly while conf->raid_disks will only change in a very controlled way. So change some uses of one to the other. The use of mddev->raid_disks will not actually cause problems, but this way is more consistent and safer in the long term. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 | md: change hot_remove_disk to take an rdev rather than a number. | NeilBrown | 1 | -4/+3
Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-11-07 | Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux | Linus Torvalds | 1 | -0/+1
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
  - drivers/media/dvb/frontends/dibx000_common.c
  - drivers/media/video/{mt9m111.c,ov6650.c}
  - drivers/mfd/ab3550-core.c
  - include/linux/dmaengine.h
2011-11-05 | Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -6/+3
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in
  - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
  - drivers/staging/zram/zram_drv.c
2011-11-01 | md: Add module.h to all files using it implicitly | Paul Gortmaker | 1 | -0/+1
A pending cleanup will mean that module.h won't be implicitly included everywhere anymore. Make sure the modular drivers in the md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-26 | md: Fix some bugs in recovery_disabled handling. | NeilBrown | 1 | -1/+3
In 3.0 we changed the way recovery_disabled was handled so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems:
 1/ one place in raid1 was missed and still sets to '1'.
 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does, so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled.
So initialise the 'conf' value to one less than the mddev value, so the two will only be the same when explicitly set that way. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-24 | block: Remove the control of complete cpu from bio. | Tao Ma | 1 | -1/+0
bio originally has the functionality to set the complete cpu, but it is broken. Christoph said that "This code is unused, and from all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controlling the complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 | Merge branch 'v3.1-rc10' into for-3.2/core | Jens Axboe | 1 | -7/+10
Conflicts:
  block/blk-core.c
  include/linux/blkdev.h
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-11 | md: add proper write-congestion reporting to RAID1 and RAID10. | NeilBrown | 1 | -0/+20
RAID1 and RAID10 handle write requests by queuing them for handling by a separate thread. This is because when a write-intent-bitmap is active we might need to update the bitmap first, so it is good to queue a lot of writes, then do one big bitmap update for them all. However writeback requires devices to appear to be congested after a while so it can make some guesstimate of throughput. The infinite queue defeats that (note that RAID5 already has a finite queue so it doesn't suffer from this problem). So impose a limit on the number of pending write requests. By default it is 1024, which seems to be generally suitable. Make it configurable via a module option just in case someone finds a regression. Signed-off-by: NeilBrown <neilb@suse.de>
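A minimal sketch of the idea with illustrative names (not the raid1.c symbols): count the queued writes and report congestion once the backlog reaches the configurable limit.

    #include <stdbool.h>
    #include <stdio.h>

    /* Default limit mentioned above; a module option in the real code. */
    static int max_queued_writes = 1024;

    struct write_backlog { int pending; };

    /* Report congestion once the backlog reaches the limit, so writeback
     * can estimate throughput instead of seeing an infinite queue. */
    static bool looks_congested(const struct write_backlog *b)
    {
        return b->pending >= max_queued_writes;
    }

    int main(void)
    {
        struct write_backlog b = { .pending = 2048 };
        printf("congested: %s\n", looks_congested(&b) ? "yes" : "no");
        return 0;
    }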
2011-10-11 | md: rename "mdk_personality" to "md_personality" | NeilBrown | 1 | -1/+1
"mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 | md/raid1: typedef removal: conf_t -> struct r1conf | NeilBrown | 1 | -47/+47
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 | md: remove typedefs: mirror_info_t -> struct mirror_info | NeilBrown | 1 | -5/+5
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 | md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio | NeilBrown | 1 | -30/+30
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 | md: remove typedefs: mddev_t -> struct mddev | NeilBrown | 1 | -26/+26
Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 | md: removing typedefs: mdk_rdev_t -> struct md_rdev | NeilBrown | 1 | -23/+23
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 | md: remove PRINTK and dprintk debugging and use pr_debug | NeilBrown | 1 | -13/+11
Being able to dynamically enable these makes them much more useful. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 | md/raid1: avoid bio search in end_sync_read() | NeilBrown | 1 | -2/+1
We know which device we just read from so we don't need to search the bios to find out. Just use ->read_disk. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 | md/raid1: factor out common bio handling code | Namhyung Kim | 1 | -20/+24
When a normal-write or sync-read/write bio completes, we should find out the disk number the bio belongs to. Factor that common code out into a separate function. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 | md: Avoid waking up a thread after it has been freed. | NeilBrown | 1 | -2/+1
Two related problems:
 1/ Some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash.
 2/ Most calls to md_wakeup_thread are protected against the thread disappearing either by:
    - holding the ->mutex
    - having an active request, so something else must be keeping the array active.
 However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protection.
So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
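A user-space sketch of the pattern adopted here (illustrative names; build with -pthread): the unregister helper takes a pointer to the thread pointer, clears it under a lock, and only then tears the thread down, so a concurrent wake-up path never sees a stale pointer.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct worker { pthread_t tid; };

    static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker_fn(void *arg) { (void)arg; return NULL; }

    /* Takes a pointer to the thread pointer, clears it under the lock,
     * then performs the actual teardown. */
    static void unregister_worker(struct worker **wp)
    {
        struct worker *w;

        pthread_mutex_lock(&thread_lock);
        w = *wp;
        *wp = NULL;                 /* wakers now see NULL, not a dying thread */
        pthread_mutex_unlock(&thread_lock);

        if (!w)
            return;
        pthread_join(w->tid, NULL); /* teardown only after the pointer is gone */
        free(w);
    }

    int main(void)
    {
        struct worker *w = malloc(sizeof(*w));

        if (!w)
            return 1;
        pthread_create(&w->tid, NULL, worker_fn, NULL);
        unregister_worker(&w);
        printf("thread pointer after unregister: %p\n", (void *)w);
        return 0;
    }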
2011-09-12 | block: remove support for bio remapping from ->make_request | Christoph Hellwig | 1 | -5/+3
There is very little benefit in allowing a ->make_request instance to update the bio's device and sector and loop around it in __generic_make_request when we can achieve the same by calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request wrong and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-10 | md/raid1,10: Remove use-after-free bug in make_request. | NeilBrown | 1 | -5/+9
A single request to RAID1 or RAID10 might result in multiple requests if there are known bad blocks that need to be avoided. To detect if we need to submit another write request we test: if (sectors_handled < (bio->bi_size >> 9)) { However this is after we call **_write_done() so the 'bio' no longer belongs to us - the writes could have completed and the bio freed. So move the **_write_done call until after the test against bio->bi_size. This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862 Reported-by: Bruno Wolff III <bruno@wolff.to> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: NeilBrown <neilb@suse.de>
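A tiny user-space model of the ordering the fix enforces (illustrative names): once the completion call runs, the request may already be freed, so the size test has to be evaluated first.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct request { unsigned int size_bytes; };

    static void write_done(struct request *r)
    {
        free(r);            /* after this, r must not be dereferenced */
    }

    int main(void)
    {
        struct request *r = malloc(sizeof(*r));
        unsigned int sectors_handled = 8;

        if (!r)
            return 1;
        r->size_bytes = 16 * 512;

        /* Correct order: evaluate the size test first, then complete. */
        bool need_more = sectors_handled < (r->size_bytes >> 9);
        write_done(r);

        printf("need another sub-request: %s\n", need_more ? "yes" : "no");
        return 0;
    }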