kernel/linux.git/drivers/md, branch v3.4.98

md: flush writes before starting a recovery.

2014-07-09T17:51:21+00:00

commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream. When we write to a degraded array which has a bitmap, we make sure the relevant bit in the bitmap remains set when the write completes (so a 're-add' can quickly rebuilt a temporarily-missing device). If, immediately after such a write starts, we incorporate a spare, commence recovery, and skip over the region where the write is happening (because the 'needs recovery' flag isn't set yet), then that write will not get to the new device. Once the recovery finishes the new device will be trusted, but will have incorrect data, leading to possible corruption. We cannot set the 'needs recovery' flag when we start the write as we do not know easily if the write will be "degraded" or not. That depends on details of the particular raid level and particular write request. This patch fixes a corruption issue of long standing and so it suitable for any -stable kernel. It applied correctly to 3.0 at least and will minor editing to earlier kernels. Reported-by: Bill Tested-by: Bill Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".

2014-06-11T19:04:11+00:00

commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream. If mddev->ro is set, md_to_sync will (correctly) abort. However in that case MD_RECOVERY_INTR isn't set. If a RESHAPE had been requested, then ->finish_reshape() will be called and it will think the reshape was successful even though nothing happened. Normally a resync will not be requested if ->ro is set, but if an array is stopped while a reshape is on-going, then when the array is started, the reshape will be restarted. If the array is also set read-only at this point, the reshape will instantly appear to success, resulting in data corruption. Consequently, this patch is suitable for any -stable kernel. Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

dm thin: fix discard corruption

2014-06-07T23:02:05+00:00

commit f046f89a99ccfd9408b94c653374ff3065c7edb3 upstream. Fix a bug in dm_btree_remove that could leave leaf values with incorrect reference counts. The effect of this was that removal of a shared block could result in the space maps thinking the block was no longer used. More concretely, if you have a thin device and a snapshot of it, sending a discard to a shared region of the thin could corrupt the snapshot. Thinp uses a 2-level nested btree to store it's mappings. This first level is indexed by thin device, and the second level by logical block. Often when we're removing an entry in this mapping tree we need to rebalance nodes, which can involve shadowing them, possibly creating a copy if the block is shared. If we do create a copy then children of that node need to have their reference counts incremented. In this way reference counts percolate down the tree as shared trees diverge. The rebalance functions were incrementing the children at the appropriate time, but they were always assuming the children were internal nodes. This meant the leaf values (in our case packed block/flags entries) were not being incremented. Signed-off-by: Joe Thornber Signed-off-by: Alasdair G Kergon [bwh: Backported to 3.2: bump target version numbers from 1.0.1 to 1.0.2] Signed-off-by: Ben Hutchings [xr: Backported to 3.4: bump target version numbers to 1.1.1] Signed-off-by: Rui Xiang Signed-off-by: Greg Kroah-Hartman

dm mpath: fix race condition between multipath_dtr and pg_init_done

2014-06-07T23:02:05+00:00

commit 954a73d5d3073df2231820c718fdd2f18b0fe4c9 upstream. Whenever multipath_dtr() is happening we must prevent queueing any further path activation work. Implement this by adding a new 'pg_init_disabled' flag to the multipath structure that denotes future path activation work should be skipped if it is set. By disabling pg_init and then re-enabling in flush_multipath_work() we also avoid the potential for pg_init to be initiated while suspending an mpath device. Without this patch a race condition exists that may result in a kernel panic: 1) If after pg_init_done() decrements pg_init_in_progress to 0, a call to wait_for_pg_init_completion() assumes there are no more pending path management commands. 2) If pg_init_required is set by pg_init_done(), due to retryable mode_select errors, then process_queued_ios() will again queue the path activation work. 3) If free_multipath() completes before activate_path() work is called a NULL pointer dereference like the following can be seen when accessing members of the recently destructed multipath: BUG: unable to handle kernel NULL pointer dereference at 0000000000000090 RIP: 0010:[] [] activate_path+0x1b/0x30 [dm_multipath] [] worker_thread+0x170/0x2a0 [] ? autoremove_wake_function+0x0/0x40 [switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer] Signed-off-by: Shiva Krishna Merla Reviewed-by: Krishnasamy Somasundaram Tested-by: Speagle Andy Acked-by: Junichi Nomura Signed-off-by: Mike Snitzer [bwh: Backported to 3.2: - Adjust context - Bump version to 1.3.2 not 1.6.0] Signed-off-by: Ben Hutchings [xr: Backported to 3.4: Adjust context] Signed-off-by: Rui Xiang Signed-off-by: Greg Kroah-Hartman

dm snapshot: avoid snapshot space leak on crash

2014-06-07T23:02:05+00:00

commit 230c83afdd9cd384348475bea1e14b80b3b6b1b8 upstream. There is a possible leak of snapshot space in case of crash. The reason for space leaking is that chunks in the snapshot device are allocated sequentially, but they are finished (and stored in the metadata) out of order, depending on the order in which copying finished. For example, supposed that the metadata contains the following records SUPERBLOCK METADATA (blocks 0 ... 250) DATA 0 DATA 1 DATA 2 ... DATA 250 Now suppose that you allocate 10 new data blocks 251-260. Suppose that copying of these blocks finish out of order (block 260 finished first and the block 251 finished last). Now, the snapshot device looks like this: SUPERBLOCK METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256) DATA 0 DATA 1 DATA 2 ... DATA 250 DATA 251 DATA 252 DATA 253 DATA 254 DATA 255 METADATA (blocks 255, 254, 253, 252, 251) DATA 256 DATA 257 DATA 258 DATA 259 DATA 260 Now, if the machine crashes after writing the first metadata block but before writing the second metadata block, the space for areas DATA 250-255 is leaked, it contains no valid data and it will never be used in the future. This patch makes dm-snapshot complete exceptions in the same order they were allocated, thus fixing this bug. Note: when backporting this patch to the stable kernel, change the version field in the following way: * if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0} * if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it to {1, 10, 2} Userspace reads the version to determine if the bug was fixed, so the version change is needed. Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer [xr: Backported to 3.4: adjust version] Signed-off-by: Rui Xiang Signed-off-by: Greg Kroah-Hartman

md/raid10: fix "enough" function for detecting if array is failed.

2014-06-07T23:02:05+00:00

commit 80b4812407c6b1f66a4f2430e69747a13f010839 upstream. The 'enough' function is written to work with 'near' arrays only in that is implicitly assumes that the offset from one 'group' of devices to the next is the same as the number of copies. In reality it is the number of 'near' copies. So change it to make this number explicit. This bug makes it possible to run arrays without enough drives present, which is dangerous. It is appropriate for an -stable kernel, but will almost certainly need to be modified for some of them. Reported-by: Jakub Husák Signed-off-by: NeilBrown [bwh: Backported to 3.2: s/geo->/conf->/] Signed-off-by: Ben Hutchings Cc: Rui Xiang Signed-off-by: Greg Kroah-Hartman

dm snapshot: add missing module aliases

2014-06-07T23:02:05+00:00

commit 23cb21092eb9dcec9d3604b68d95192b79915890 upstream. Add module aliases so that autoloading works correctly if the user tries to activate "snapshot-origin" or "snapshot-merge" targets. Reference: https://bugzilla.redhat.com/889973 Reported-by: Chao Yang Signed-off-by: Mikulas Patocka Signed-off-by: Alasdair G Kergon Signed-off-by: Ben Hutchings Cc: Rui Xiang Signed-off-by: Greg Kroah-Hartman

dm bufio: avoid a possible __vmalloc deadlock

2014-06-07T23:02:05+00:00

commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2 upstream. This patch uses memalloc_noio_save to avoid a possible deadlock in dm-bufio. (it could happen only with large block size, at most PAGE_SIZE << MAX_ORDER (typically 8MiB). __vmalloc doesn't fully respect gfp flags. The specified gfp flags are used for allocation of requested pages, structures vmap_area, vmap_block and vm_struct and the radix tree nodes. However, the kernel pagetables are allocated always with GFP_KERNEL. Thus the allocation of pagetables can recurse back to the I/O layer and cause a deadlock. This patch uses the function memalloc_noio_save to set per-process PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore it. When this flag is set, all allocations in the process are done with implied GFP_NOIO flag, thus the deadlock can't happen. This should be backported to stable kernels, but they don't have the PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore functions. So, PF_MEMALLOC should be set and restored instead. Signed-off-by: Mikulas Patocka Signed-off-by: Alasdair G Kergon [bwh: Backported to 3.2 as recommended] Signed-off-by: Ben Hutchings Cc: Rui Xiang Signed-off-by: Greg Kroah-Hartman

md: avoid possible spinning md thread at shutdown.

2014-06-07T23:02:01+00:00

commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream. If an md array with externally managed metadata (e.g. DDF or IMSM) is in use, then we should not set safemode==2 at shutdown because: 1/ this is ineffective: user-space need to be involved in any 'safemode' handling, 2/ The safemode management code doesn't cope with safemode==2 on external metadata and md_check_recover enters an infinite loop. Even at shutdown, an infinite-looping process can be problematic, so this could cause shutdown to hang. Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

dm thin: fix dangling bio in process_deferred_bios error path

2014-05-13T12:11:32+00:00

commit fe76cd88e654124d1431bb662a0fc6e99ca811a5 upstream. If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by: Mike Snitzer Acked-by: Joe Thornber Signed-off-by: Greg Kroah-Hartman