summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2017-02-16dm cache metadata: name the cache block that couldn't be loadedMike Snitzer1-4/+8
Improves __load_mapping_v1() and __load_mapping_v2() DMERR messages to explicitly name the cache block number whose mapping couldn't be loaded. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm cache metadata: add "metadata2" featureJoe Thornber3-53/+274
If "metadata2" is provided as a table argument when creating/loading a cache target a more compact metadata format, with separate dirty bits, is used. "metadata2" improves speed of shutting down a cache target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm cache metadata: use bitset cursor api to load discard bitsetJoe Thornber1-20/+28
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm bitset: introduce cursor apiJoe Thornber2-0/+91
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm btree: use GFP_NOFS in dm_btree_del()Joe Thornber1-1/+6
dm_btree_del() is called from an ioctl so don't recurse into FS. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm space map common: memcpy the disk root to ensure it's arch alignedJoe Thornber1-5/+11
The metadata_space_map_root passed to sm_ll_open_metadata() may or may not be arch aligned, use memcpy to ensure it is. This is not a fast path so the extra memcpy doesn't hurt us. Long-term it'd be better to use the kernel's alignment infrastructure to remove the memcpy()s that are littered across persistent-data (btree, array, space-maps, etc). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm block manager: add unlikely() annotations on dm_bufio error pathsJoe Thornber1-4/+4
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-16dm cache: fix corruption seen when using cache > 2TBJoe Thornber1-3/+3
A rounding bug due to compiler generated temporary being 32bit was found in remap_to_cache(). A localized cast in remap_to_cache() fixes the corruption but this preferred fix (changing from uint32_t to sector_t) eliminates potential for future rounding errors elsewhere. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: cleanup awkward branching in raid_message() option processingMike Snitzer1-3/+4
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: use mddev rather than rdev->mddevHeinz Mauelshagen1-1/+1
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: use read_disk_sb() throughoutHeinz Mauelshagen1-8/+9
For consistency, call read_disk_sb() from attempt_restore_of_faulty_devices() instead of calling sync_page_io() directly. Explicitly set device to faulty on superblock read error. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: add raid4/5/6 journaling supportHeinz Mauelshagen1-21/+140
Add md raid4/5/6 journaling support (upstream commit bac624f3f86a started the implementation) which closes the write hole (i.e. non-atomic updates to stripes) using a dedicated journal device. Background: raid4/5/6 stripes hold N data payloads per stripe plus one parity raid4/5 or two raid6 P/Q syndrome payloads in an in-memory stripe cache. Parity or P/Q syndromes used to recover any data payloads in case of a disk failure are calculated from the N data payloads and need to be updated on the different component devices of the raid device. Those are non-atomic, persistent updates. Hence a crash can cause failure to update all stripe payloads persistently and thus cause data loss during stripe recovery. This problem gets addressed by writing whole stripe cache entries (together with journal metadata) to a persistent journal entry on a dedicated journal device. Only if that journal entry is written successfully, the stripe cache entry is updated on the component devices of the raid device (i.e. writethrough type). In case of a crash, the entry can be recovered from the journal and be written again thus ensuring consistent stripe payload suitable to data recovery. Future dependencies: once writeback caching being worked on to compensate for the throughput implictions involved with writethrough overhead is supported with journaling in upstream, an additional patch based on this one will support it in dm-raid. Journal resilience related remarks: because stripes are recovered from the journal in case of a crash, the journal device better be resilient. Resilience becomes mandatory with future writeback support, because loosing the working set in the log means data loss as oposed to writethrough, were the loss of the journal device 'only' reintroduces the write hole. Fix comment on data offsets in parse_dev_params() and initialize new_data_offset as well. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: be prepared to accept arbitrary '- -' tuplesHeinz Mauelshagen1-5/+23
During raid set resize checks and setting up the recovery offset in case a raid set grows, calculated rd->md.dev_sectors is compared to rs->dev[0].rdev.sectors. Device 0 may not be defined in case userspace passes in '- -' for it (lvm2 doesn't do that so far), thus it's device sectors can't be taken authoritatively in this comparison and another valid device must be used to retrieve the device size. Use mddev->dev_sectors in checking for ongoing recovery for the same reason. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-25dm raid: fix transient device failure processingHeinz Mauelshagen1-49/+38
This fix addresses the following 3 failure scenarios: 1) If a (transiently) inaccessible metadata device is being passed into the constructor (e.g. a device tuple '254:4 254:5'), it is processed as if '- -' was given. This erroneously results in a status table line containing '- -', which mistakenly differs from what has been passed in. As a result, userspace libdevmapper puts the device tuple seperate from the RAID device thus not processing the dependencies properly. 2) False health status char 'A' instead of 'D' is emitted on the status status info line for the meta/data device tuple in this metadata device failure case. 3) If the metadata device is accessible when passed into the constructor but the data device (partially) isn't, that leg may be set faulty by the raid personality on access to the (partially) unavailable leg. Restore tried in a second raid device resume on such failed leg (status char 'D') fails after the (partial) leg returned. Fixes for aforementioned failure scenarios: - don't release passed in devices in the constructor thus allowing the status table line to e.g. contain '254:4 254:5' rather than '- -' - emit device status char 'D' rather than 'A' for the device tuple with the failed metadata device on the status info line - when attempting to restore faulty devices in a second resume, allow the device hot remove function to succeed by setting the device to not in-sync In case userspace intentionally passes '- -' into the constructor to avoid that device tuple (e.g. to split off a raid1 leg temporarily for later re-addition), the status table line will correctly show '- -' and the status info line will provide a '-' device health character for the non-defined device tuple. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-01-10md/raid5: Use correct IS_ERR() variation on pointer checkJes Sorensen1-1/+1
This fixes a build error on certain architectures, such as ppc64. Fixes: 6995f0b247e("md: takeover should clear unrelated bits") Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05md: cleanup mddev flag clear for takeoverShaohua Li4-7/+26
Commit 6995f0b (md: takeover should clear unrelated bits) clear unrelated bits, but it's quite fragile. To avoid error in the future, define a macro for unsupported mddev flags for each raid type and use it to clear unsupported mddev flags. This should be less error-prone. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05md/r5cache: fix spelling mistake on "recoverying"Colin Ian King1-1/+1
Trivial fix to spelling mistake "recoverying" to "recovering" in pr_dbg message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05md/r5cache: assign conf->log before r5l_load_log()Song Liu1-1/+3
r5l_load_log() calls functions that requires a proper conf->log, for example, r5c_is_writeback(). Therefore, we should set conf->log before calling r5l_load_log(). If r5l_load_log() fails, conf->log is set back to NULL. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05md/r5cache: simplify handling of sh->log_start in recoverySong Liu1-15/+12
We only need to update sh->log_start at the end of recovery, which is r5c_recovery_rewrite_data_only_stripes(), so it is not necessary to set it before that. In this patch, log_start is removed from r5c_recovery_alloc_stripe(). After updating all sh->log_start, rewrite_data_only_stripes() also updates log->next_checkpoints to the last sh->log_start. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-05md/raid5-cache: removes unnecessary write-through mode judgmentsJackieLiu1-3/+0
The write-through mode has been returned in front of the function, do not need to do it again. Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-03md/raid10: Refactor raid10_make_requestRobert LeBlanc1-105/+140
Refactor raid10_make_request into seperate read and write functions to clean up the code. Shaohua: add the recovery check back to read path Signed-off-by: Robert LeBlanc <robert@leblancnet.us> Signed-off-by: Shaohua Li <shli@fb.com>
2017-01-03md/raid1: Refactor raid1_make_requestRobert LeBlanc1-128/+139
Refactor raid1_make_request to make read and write code in their own functions to clean up the code. Signed-off-by: Robert LeBlanc <robert@leblancnet.us> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-17bcache: partition support: add 16 minors per bcacheN deviceEric Wheeler1-1/+4
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net> Tested-by: Wido den Hollander <wido@widodh.nl>
2016-12-17bcache: Make gc wakeup sane, remove set_task_state()Kent Overstreet5-26/+26
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2016-12-16linux: drop __bitwise__ everywhereMichael S. Tsirkin1-3/+3
__bitwise__ used to mean "yes, please enable sparse checks unconditionally", but now that we dropped __CHECK_ENDIAN__ __bitwise is exactly the same. There aren't many users, replace it by __bitwise everywhere. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Akced-by: Lee Duncan <lduncan@suse.com>
2016-12-14Merge tag 'dm-4.10-changes' of ↵Linus Torvalds19-173/+406
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and improvements to request-based DM and DM multipath - some locking improvements in DM bufio - add Kconfig option to disable the DM block manager's extra locking which mainly serves as a developer tool - a few bug fixes to DM's persistent-data - a couple changes to prepare for multipage biovec support in the block layer - various improvements and cleanups in the DM core, DM cache, DM raid and DM crypt - add ability to have DM crypt use keys from the kernel key retention service - add a new "error_writes" feature to the DM flakey target, reads are left unchanged in this mode * tag 'dm-4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits) dm flakey: introduce "error_writes" feature dm cache policy smq: use hash_32() instead of hash_32_generic() dm crypt: reject key strings containing whitespace chars dm space map: always set ev if sm_ll_mutate() succeeds dm space map metadata: skip useless memcpy in metadata_ll_init_index() dm space map metadata: fix 'struct sm_metadata' leak on failed create Documentation: dm raid: define data_offset status field dm raid: fix discard support regression dm raid: don't allow "write behind" with raid4/5/6 dm mpath: use hw_handler_params if attached hw_handler is same as requested dm crypt: add ability to use keys from the kernel key retention service dm array: remove a dead assignment in populate_ablock_with_values() dm ioctl: use offsetof() instead of open-coding it dm rq: simplify use_blk_mq initialization dm: use blk_set_queue_dying() in __dm_destroy() dm bufio: drop the lock when doing GFP_NOIO allocation dm bufio: don't take the lock in dm_bufio_shrink_count dm bufio: avoid sleeping while holding the dm_bufio lock dm table: simplify dm_table_determine_type() dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device ...
2016-12-13Merge branch 'md-next' into md-linusShaohua Li14-1232/+3168
2016-12-13dm flakey: introduce "error_writes" featureMike Snitzer1-9/+42
Recent dm-flakey fixes, to have reads error out during the "down" interval, made it so that the previous read behaviour is no longer available. It is useful to have reads complete like normal but have writes error out, so make it possible again with a new "error_writes" feature. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-13Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds20-105/+57
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
2016-12-09md: separate flags for superblock changesShaohua Li9-101/+106
The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-09md: MD_RECOVERY_NEEDED is set for mddev->recoveryShaohua Li1-1/+1
Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device removal.") Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-09md: takeover should clear unrelated bitsShaohua Li3-2/+14
When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit will be accidentally set, but raid5 doesn't support it. The same is true for the MD_HAS_JOURNAL bit. Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-12-09dm cache policy smq: use hash_32() instead of hash_32_generic()Mike Snitzer1-1/+1
Switch to using hash_32() because hash_32_generic() should only be used by the kernel's selftests. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm crypt: reject key strings containing whitespace charsOndrej Kozina1-0/+18
Unfortunately key_string may theoretically contain whitespace even after it's processed by dm_split_args(). The reason for this is DM core supports escaping of almost all chars including any whitespace. If userspace passes a key to the kernel in format ":32:logon:my_prefix:my\ key" dm-crypt will look up key "my_prefix:my key" in kernel keyring service. So far everything's fine. Unfortunately if userspace later calls DM_TABLE_STATUS ioctl, it will not receive back expected ":32:logon:my_prefix:my\ key" but the unescaped version instead. Also userpace (most notably cryptsetup) is not ready to parse single target argument containing (even escaped) whitespace chars and any whitespace is simply taken as delimiter of another argument. This effect is mitigated by the fact libdevmapper curently performs double escaping of '\' char. Any user input in format "x\ x" is transformed into "x\\ x" before being passed to the kernel. Nonetheless dm-crypt may be used without libdevmapper. Therefore the near-term solution to this is to reject any key string containing whitespace. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm space map: always set ev if sm_ll_mutate() succeedsBenjamin Marzinski1-1/+2
If no block was allocated or freed, sm_ll_mutate() wasn't setting *ev, leaving the variable unitialized. sm_ll_insert(), sm_disk_inc_block(), and sm_disk_new_block() all check ev to see if there was an allocation event in sm_ll_mutate(), possibly reading unitialized data. If no allocation event occured, sm_ll_mutate() should set *ev to SM_NONE. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm space map metadata: skip useless memcpy in metadata_ll_init_index()Benjamin Marzinski1-1/+0
When metadata_ll_init_index() is called by sm_ll_new_metadata(), ll->mi_le hasn't been initialized yet. So, when metadata_ll_init_index() copies the contents of ll->mi_le into the newly allocated bitmap_root, it is just copying garbage. ll->mi_le will be allocated later in sm_ll_extend() and copied into the bitmap_root, in sm_ll_commit(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm space map metadata: fix 'struct sm_metadata' leak on failed createBenjamin Marzinski1-8/+6
In dm_sm_metadata_create() we temporarily change the dm_space_map operations from 'ops' (whose .destroy function deallocates the sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't). If dm_sm_metadata_create() fails in sm_ll_new_metadata() or sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls dm_sm_destroy() with the intention of freeing the sm_metadata, but it doesn't (because the dm_space_map operations is still set to 'bootstrap_ops'). Fix this by setting the dm_space_map operations back to 'ops' if dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2016-12-08dm raid: fix discard support regressionHeinz Mauelshagen1-6/+3
Commit ecbfb9f118 ("dm raid: add raid level takeover support") moved the configure_discard_support() call from raid_ctr() to raid_preresume(). Enabling/disabling discard _must_ happen during table load (through the .ctr hook). Fix this regression by moving the configure_discard_support() call back to raid_ctr(). Fixes: ecbfb9f118 ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm raid: don't allow "write behind" with raid4/5/6Heinz Mauelshagen1-2/+0
Remove CTR_FLAG_MAX_WRITE_BEHIND from raid4/5/6's valid ctr flags. Only the md raid1 personality supports setting a maximum number of "write behind" write IOs on any legs set to "write mostly". "write mostly" enhances throughput with slow links/disks. Technically the "write behind" value is a write intent bitmap property only being respected by the raid1 personality. It allows a maximum number of "write behind" writes to any "write mostly" raid1 mirror legs to be delayed and avoids reads from such legs. No other MD personalities supported via dm-raid make use of "write behind", thus setting this property is superfluous; it wouldn't cause harm but it is correct to reject it. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm mpath: use hw_handler_params if attached hw_handler is same as requestedtang.junhui1-5/+9
Let the requested m->hw_handler_params be used if the attached hardware handler is the same handler as requested with m->hw_handler_name. Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm crypt: add ability to use keys from the kernel key retention serviceOndrej Kozina1-13/+146
The kernel key service is a generic way to store keys for the use of other subsystems. Currently there is no way to use kernel keys in dm-crypt. This patch aims to fix that. Instead of key userspace may pass a key description with preceding ':'. So message that constructs encryption mapping now looks like this: <cipher> [<key>|:<key_string>] <iv_offset> <dev_path> <start> [<#opt_params> <opt_params>] where <key_string> is in format: <key_size>:<key_type>:<key_description> Currently we only support two elementary key types: 'user' and 'logon'. Keys may be loaded in dm-crypt either via <key_string> or using classical method and pass the key in hex representation directly. dm-crypt device initialised with a key passed in hex representation may be replaced with key passed in key_string format and vice versa. (Based on original work by Andrey Ryabinin) Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm array: remove a dead assignment in populate_ablock_with_values()Bart Van Assche1-2/+0
A value is assigned to 'nr_entries' but is never used, remove it. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm ioctl: use offsetof() instead of open-coding itBart Van Assche1-1/+1
Subtracting sizes is a fragile approach because the result is only correct if the compiler has not added any padding at the end of the structure. Hence use offsetof() instead of size subtraction. An additional advantage of offsetof() is that it makes the intent more clear. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm rq: simplify use_blk_mq initializationBart Van Assche1-5/+1
Use a single statement to declare and initialize 'use_blk_mq' instead of two statements. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm: use blk_set_queue_dying() in __dm_destroy()Bart Van Assche1-3/+1
After QUEUE_FLAG_DYING has been set any code that is waiting in get_request() should be woken up. But to get this behaviour blk_set_queue_dying() must be used instead of only setting QUEUE_FLAG_DYING. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm bufio: drop the lock when doing GFP_NOIO allocationMikulas Patocka1-0/+10
If the first allocation attempt using GFP_NOWAIT fails, drop the lock and retry using GFP_NOIO allocation (lock is dropped because the allocation can take some time). Note that we won't do GFP_NOIO allocation when we loop for the second time, because the lock shouldn't be dropped between __wait_for_free_buffer and __get_unclaimed_buffer. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm bufio: don't take the lock in dm_bufio_shrink_countMikulas Patocka1-11/+2
dm_bufio_shrink_count() is called from do_shrink_slab to find out how many freeable objects are there. The reported value doesn't have to be precise, so we don't need to take the dm-bufio lock. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm bufio: avoid sleeping while holding the dm_bufio lockDouglas Anderson1-2/+3
We've seen in-field reports showing _lots_ (18 in one case, 41 in another) of tasks all sitting there blocked on: mutex_lock+0x4c/0x68 dm_bufio_shrink_count+0x38/0x78 shrink_slab.part.54.constprop.65+0x100/0x464 shrink_zone+0xa8/0x198 In the two cases analyzed, we see one task that looks like this: Workqueue: kverityd verity_prefetch_io __switch_to+0x9c/0xa8 __schedule+0x440/0x6d8 schedule+0x94/0xb4 schedule_timeout+0x204/0x27c schedule_timeout_uninterruptible+0x44/0x50 wait_iff_congested+0x9c/0x1f0 shrink_inactive_list+0x3a0/0x4cc shrink_lruvec+0x418/0x5cc shrink_zone+0x88/0x198 try_to_free_pages+0x51c/0x588 __alloc_pages_nodemask+0x648/0xa88 __get_free_pages+0x34/0x7c alloc_buffer+0xa4/0x144 __bufio_new+0x84/0x278 dm_bufio_prefetch+0x9c/0x154 verity_prefetch_io+0xe8/0x10c process_one_work+0x240/0x424 worker_thread+0x2fc/0x424 kthread+0x10c/0x114 ...and that looks to be the one holding the mutex. The problem has been reproduced on fairly easily: 0. Be running Chrome OS w/ verity enabled on the root filesystem 1. Pick test patch: http://crosreview.com/412360 2. Install launchBalloons.sh and balloon.arm from http://crbug.com/468342 ...that's just a memory stress test app. 3. On a 4GB rk3399 machine, run nice ./launchBalloons.sh 4 900 100000 ...that tries to eat 4 * 900 MB of memory and keep accessing. 4. Login to the Chrome web browser and restore many tabs With that, I've seen printouts like: DOUG: long bufio 90758 ms ...and stack trace always show's we're in dm_bufio_prefetch(). The problem is that we try to allocate memory with GFP_NOIO while we're holding the dm_bufio lock. Instead we should be using GFP_NOWAIT. Using GFP_NOIO can cause us to sleep while holding the lock and that causes the above problems. The current behavior explained by David Rientjes: It will still try reclaim initially because __GFP_WAIT (or __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO. This is the cause of contention on dm_bufio_lock() that the thread holds. You want to pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a mutex that can be contended by a concurrent slab shrinker (if count_objects didn't use a trylock, this pattern would trivially deadlock). This change significantly increases responsiveness of the system while in this state. It makes a real difference because it unblocks kswapd. In the bug report analyzed, kswapd was hung: kswapd0 D ffffffc000204fd8 0 72 2 0x00000000 Call trace: [<ffffffc000204fd8>] __switch_to+0x9c/0xa8 [<ffffffc00090b794>] __schedule+0x440/0x6d8 [<ffffffc00090bac0>] schedule+0x94/0xb4 [<ffffffc00090be44>] schedule_preempt_disabled+0x28/0x44 [<ffffffc00090d900>] __mutex_lock_slowpath+0x120/0x1ac [<ffffffc00090d9d8>] mutex_lock+0x4c/0x68 [<ffffffc000708e7c>] dm_bufio_shrink_count+0x38/0x78 [<ffffffc00030b268>] shrink_slab.part.54.constprop.65+0x100/0x464 [<ffffffc00030dbd8>] shrink_zone+0xa8/0x198 [<ffffffc00030e578>] balance_pgdat+0x328/0x508 [<ffffffc00030eb7c>] kswapd+0x424/0x51c [<ffffffc00023f06c>] kthread+0x10c/0x114 [<ffffffc000203dd0>] ret_from_fork+0x10/0x40 By unblocking kswapd memory pressure should be reduced. Suggested-by: David Rientjes <rientjes@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-12-08dm table: simplify dm_table_determine_type()Bart Van Assche1-13/+8
Use a single loop instead of two loops to determine whether or not all_blk_mq has to be set. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>