summaryrefslogtreecommitdiff
path: root/block/blk-cgroup.h
AgeCommit message (Collapse)AuthorFilesLines
2012-04-02blkcg: cfq doesn't need per-cpu dispatch statsTejun Heo1-6/+6
blkio_group_stats_cpu is used to count dispatch stats using per-cpu counters. This is used by both blk-throtl and cfq-iosched but the sharing is rather silly. * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates those stats while holding queue_lock. * blk-throtl needs per-cpu dispatch stats but only service_bytes and serviced. It doesn't make use of sectors. This patch makes cfq add and use global stats for service_bytes, serviced and sectors, removes per-cpu sectors counter and moves per-cpu stat printing code to blk-throttle.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: move statistics update code to policiesTejun Heo1-94/+0
As with conf/stats file handling code, there's no reason for stat update code to live in blkcg core with policies calling into update them. The current organization is both inflexible and complex. This patch moves stat update code to specific policies. All blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP stats are collapsed into their cfq_blkiocg_update_*_stats() counterparts. blkiocg_update_dispatch_stats() is used by both policies and duplicated as throtl_update_dispatch_stats() and cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: move conf/stat file handling code to policiesTejun Heo1-15/+0
blkcg conf/stat handling is convoluted in that details which belong to specific policy implementations are all out in blkcg core and then policies hook into core layer to access and manipulate confs and stats. This sadly achieves both inflexibility (confs/stats can't be modified without messing with blkcg core) and complexity (all the call-ins and call-backs). The previous patches restructured conf and stat handling code such that they can be separated out. This patch relocates the file handling part. All conf/stat file handling code which belongs to BLKIO_POLICY_PROP is moved to cfq-iosched.c and all BKLIO_POLICY_THROTL code to blk-throtl.c. The move is verbatim except for blkio_update_group_{weight|bps|iops}() callbacks which relays conf changes to policies. The configuration settings are handled in policies themselves so the relaying isn't necessary. Conf setting functions are modified to directly call per-policy update functions and the relaying mechanism is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: implement blkio_policy_type->cftypesTejun Heo1-0/+1
Add blkiop->cftypes which is added and removed together with the policy. This will be used to move conf/stat handling to the policies. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: export conf/stat helpers to prepare for reorganizationTejun Heo1-0/+27
conf/stat handling is about to be moved to policy implementation from blkcg core. Export conf/stat helpers from blkcg core so that blk-throttle and cfq-iosched can use them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: restructure blkio_group configruation settingTejun Heo1-13/+0
As part of userland interface restructuring, this patch updates per-blkio_group configuration setting. Instead of funneling everything through a master function which has hard-coded cases for each config file it may handle, the common part is factored into blkg_conf_prep() and blkg_conf_finish() and different configuration setters are implemented using the helpers. While this doesn't result in immediate LOC reduction, this enables further cleanups and more modular implementation. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: restructure configuration printingTejun Heo1-2/+1
Similarly to the previous stat restructuring, this patch restructures conf printing code such that, * Conf printing uses the same helpers as stat. * Printing function doesn't require hardcoded switching on the config being printed. Note that this isn't complete yet for throttle confs. The next patch will convert setting for these confs and will complete the transition. * Printing uses read_seq_string callback (other methods will be phased out). Note that blkio_group_conf.iops[2] is changed to u64 so that they can be manipulated with the same functions. This is transitional and will go away later. After this patch, per-device configurations - weight, bps and iops - use __blkg_prfill_u64() for printing which uses white space as delimiter instead of tab. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: restructure statistics printingTejun Heo1-56/+4
blkcg stats handling is a mess. None of the stats has much to do with blkcg core but they are all implemented in blkcg core. Code sharing is achieved by mixing common code with hard-coded cases for each stat counter. This patch restructures statistics printing such that * Common logic exists as helper functions and specific print functions use the helpers to implement specific cases. * Printing functions serving multiple counters don't require hardcoded switching on specific counters. * Printing uses read_seq_string callback (other methods will be phased out). This change enables further cleanups and relocating stats code to the policy implementation it belongs to. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: introduce blkg_stat and blkg_rwstatTejun Heo1-45/+166
blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcountersTejun Heo1-3/+6
BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters and is counted by blkio_group_stats_cpu->sectors; however, it still holds a member in blkio_group_stats_cpu->stat_arr_cpu. Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have subcounters. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02blkcg: remove unused @pol and @plid parametersTejun Heo1-6/+2
@pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no longer necessary. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-03-20blkcg: add blkcg->idTejun Heo1-0/+3
Add 64bit unique id to blkcg. This will be used by policies which want blkcg identity test to tell whether the associated blkcg has changed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: remove blkio_group->stats_lockTejun Heo1-2/+1
With recent plug merge updates, all non-percpu stat updates happen under queue_lock making stats_lock unnecessary to synchronize stat updates. The only synchronization necessary is stat reading, which can be done using u64_stats_sync instead. This patch removes blkio_group->stats_lock and adds blkio_group_stats->syncp for reader synchronization. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: restructure blkio_get_stat()Tejun Heo1-1/+5
Restructure blkio_get_stat() to prepare for removal of stats_lock. * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have subtypes instead of using BLKIO_STAT_QUEUED. * Separate out stat acquisition and printing. After this, there are only two users of blkio_fill_stat(). Just open code it. * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no need to subtract one. Use MAX_KEY_LEN consistently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: simplify stat resetTejun Heo1-2/+12
blkiocg_reset_stats() implements stat reset for blkio.reset_stats cgroupfs file. This feature is very unconventional and something which shouldn't have been merged. It's only useful when there's only one user or tool looking at the stats. As soon as multiple users and/or tools are involved, it becomes useless as resetting disrupts other usages. There are very good reasons why all other stats expect readers to read values at the start and end of a period and subtract to determine delta over the period. The implementation is rather complex - some fields shouldn't be cleared and it saves some fields, resets whole and restores for some reason. Reset of percpu stats is also racy. The comment points to 64bit store atomicity for the reason but even without that stores for zero can simply race with other CPUs doing RMW and get clobbered. Simplify reset by * Clear selectively instead of resetting and restoring. * Grouping debug stat fields to be reset and using memset() over them. * Not caring about stats_lock. * Using memset() to reset percpu stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: don't use percpu for merged statsTejun Heo1-3/+3
With recent plug merge updates, merged stats are no longer called for plug merges and now only updated while holding queue_lock. As stats_lock is scheduled to be removed, there's no reason to use percpu for merged stats. Don't use percpu for merged stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: alloc per cpu stats from worker thread in a delayed mannerVivek Goyal1-0/+2
Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07block: make block cgroup policies follow bio task associationTejun Heo1-2/+2
Implement bio_blkio_cgroup() which returns the blkcg associated with the bio if exists or %current's blkcg, and use it in blk-throttle and cfq-iosched propio. This makes both cgroup policies honor task association for the bio instead of always assuming %current. As nobody is using bio_set_task() yet, this doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: drop unnecessary RCU lockingTejun Heo1-2/+2
Now that blkg additions / removals are always done under both q and blkcg locks, the only places RCU locking is necessary are blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops unncessary RCU locking replacing it with plain blkcg locking as necessary. * blkiocg_pre_destroy() already perform proper locking and don't need RCU. Dropped. * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read lock. This isn't a hot path. * Now unnecessary synchronize_rcu() from queue exit paths removed. This makes q->nr_blkgs unnecessary. Dropped. * RCU annotation on blkg->q removed. -v2: Vivek pointed out that blkg_lookup_create() still needs to be called under rcu_read_lock(). Updated. -v3: After the update, stats_lock locking in blkio_read_blkg_stats() shouldn't be using _irq variant as it otherwise ends up enabling irq while blkcg->lock is locked. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: use double locking instead of RCU for blkg synchronizationTejun Heo1-4/+0
blkgs are chained from both blkcgs and request_queues and thus subjected to two locks - blkcg->lock and q->queue_lock. As both blkcg and q can go away anytime, locking during removal is tricky. It's currently solved by wrapping removal inside RCU, which makes the synchronization complex. There are three locks to worry about - the outer RCU, q lock and blkcg lock, and it leads to nasty subtle complications like conditional synchronize_rcu() on queue exit paths. For all other paths, blkcg lock is naturally nested inside q lock and the only exception is blkcg removal path, which is a very cold path and can be implemented as clumsy but conceptually-simple reverse double lock dancing. This patch updates blkg removal path such that blkgs are removed while holding both q and blkcg locks, which is trivial for request queue exit path - blkg_destroy_all(). The blkcg removal path, blkiocg_pre_destroy(), implements reverse double lock dancing essentially identical to ioc_release_fn(). This simplifies blkg locking - no half-dead blkgs to worry about. Now unnecessary RCU annotations will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: unify blkg's for blkcg policiesTejun Heo1-8/+7
Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: let blkcg core manage per-queue blkg list and counterTejun Heo1-10/+5
With the previous patch to move blkg list heads and counters to request_queue and blkg, logic to manage them in both policies are almost identical and can be moved to blkcg core. This patch moves blkg link logic into blkg_lookup_create(), implements common blkg unlink code in blkg_destroy(), and updates blkg_destory_all() so that it's policy specific and can skip root group. The updated blkg_destroy_all() is now used to both clear queue for bypassing and elv switching, and release all blkgs on q exit. This patch introduces a race window where policy [de]registration may race against queue blkg clearing. This can only be a problem on cfq unload and shouldn't be a real problem in practice (and we have many other places where this race already exists). Future patches will remove these unlikely races. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: move per-queue blkg list heads and counters to queue and blkgTejun Heo1-0/+1
Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: don't use blkg->plid in stat related functionsTejun Heo1-30/+50
blkg is scheduled to be unified for all policies and thus there won't be one-to-one mapping from blkg to policy. Update stat related functions to take explicit @pol or @plid arguments and not use blkg->plid. This is painful for now but most of specific stat interface functions will be replaced with a handful of generic helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: make blkg->pd an array and move configuration and stats into itTejun Heo1-9/+9
To prepare for unifying blkgs for different policies, make blkg->pd an array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats, and ->stats_cpu into blkg_policy_data. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: move refcnt to blkcg coreTejun Heo1-0/+35
Currently, blkcg policy implementations manage blkg refcnt duplicating mostly identical code in both policies. This patch moves refcnt to blkg and let blkcg core handle refcnt and freeing of blkgs. * cfq blkgs now also get freed via RCU. * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If necessary, we can add blkio_exit_group_fn() to resurrect this. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: let blkcg core handle policy private data allocationTejun Heo1-3/+50
Currently, blkg's are embedded in private data blkcg policy private data structure and thus allocated and freed by policies. This leads to duplicate codes in policies, hinders implementing common part in blkcg core with strong semantics, and forces duplicate blkg's for the same cgroup-q association. This patch introduces struct blkg_policy_data which is a separate data structure chained from blkg. Policies specifies the amount of private data it needs in its blkio_policy_type->pdata_size and blkcg core takes care of allocating them along with blkg which can be accessed using blkg_to_pdata(). blkg can be determined from pdata using pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated to blkio_init_group_fn(). For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in the reverse direction are added. Except that policy specific data now lives in a separate data structure from blkg, this patch doesn't introduce any functional difference. This will be used to unify blkg's for different policies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: add blkcg_{init|drain|exit}_queue()Tejun Heo1-0/+7
Currently block core calls directly into blk-throttle for init, drain and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps the blk-throttle functions. This is to give more control and visiblity to blkcg core layer for proper layering. Further patches will add logic common to blkcg policies to the functions. While at it, collapse blk_throtl_release() into blk_throtl_exit(). There's no reason to keep them separate. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: let blkio_group point to blkio_cgroup directlyTejun Heo1-1/+1
Currently, blkg points to the associated blkcg via its css_id. This unnecessarily complicates dereferencing blkcg. Let blkg hold a reference to the associated blkcg and point directly to it and disable css_id on blkio_subsys. This change requires splitting blkiocg_destroy() into blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be destroyed and all the blkcg references held by them dropped during cgroup removal. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: kill the mind-bending blkg->devTejun Heo1-2/+0
blkg->dev is dev_t recording the device number of the block device for the associated request_queue. It is used to identify the associated block device when printing out configuration or stats. This is redundant to begin with. A blkg is an association between a cgroup and a request_queue and it of course is possible to reach request_queue from blkg and synchronization conventions are in place for safe q dereferencing, so this shouldn't be necessary from the beginning. Furthermore, it's initialized by sscanf()ing the device name of backing_dev_info. The mind boggles. Anyways, if blkg is visible under rcu lock, we *know* that the associated request_queue hasn't gone away yet and its bdi is registered and alive - blkg can't be created for request_queue which hasn't been fully initialized and it can't go away before blkg is removed. Let stat and conf read functions get device name from blkg->q->backing_dev_info.dev and pass it down to printing functions and remove blkg->dev. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: kill blkio_policy_nodeTejun Heo1-32/+0
Now that blkcg configuration lives in blkg's, blkio_policy_node is no longer necessary. Kill it. blkio_policy_parse_and_set() now fails if invoked for missing device and functions to print out configurations are updated to print from blkg's. cftype_blkg_same_policy() is dropped along with other policy functions for consistency. Its one line is open coded in the only user - blkio_read_blkg_stats(). -v2: Update to reflect the retry-on-bypass logic change of the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: don't allow or retain configuration of missing devicesTejun Heo1-0/+9
blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unuseable if EXT_DEVT is fully used. It is wholely unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: factor out blkio_group creationTejun Heo1-15/+15
Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: add blkio_policy[] array and allow one policy per policy IDTejun Heo1-0/+2
Block cgroup policies are maintained in a linked list and, theoretically, multiple policies sharing the same policy ID are allowed. This patch temporarily restricts one policy per plid and adds blkio_policy[] array which indexes registered policy types by plid. Both the restriction and blkio_policy[] array are transitional and will be removed once API cleanup is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: use q and plid instead of opaque void * for blkio_group associationTejun Heo1-10/+12
blkgio_group is association between a block cgroup and a queue for a given policy. Using opaque void * for association makes things confusing and hinders factoring of common code. Use request_queue * and, if necessary, policy id instead. This will help block cgroup API cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: shoot down blkio_groups on elevator switchTejun Heo1-1/+4
Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-07blkcg: make CONFIG_BLK_CGROUP boolTejun Heo1-8/+2
Block cgroup core can be built as module; however, it isn't too useful as blk-throttle can only be built-in and cfq-iosched is usually the default built-in scheduler. Scheduled blkcg cleanup requires calling into blkcg from block core. To simplify that, disallow building blkcg as module by making CONFIG_BLK_CGROUP bool. If building blkcg core as module really matters, which I doubt, we can revisit it after blkcg API cleanup. -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to depend on BLK_CGROUP. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24block: fix a typo in the blk-cgroup.h fileJie Liu1-1/+1
byptes -> bytes. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-05-23cfq-iosched: Make IO merge related stats per cpuVivek Goyal1-2/+2
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking blkg->stats_lock. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20blk-cgroup: Make 64bit per cpu stats safe on 32bit archVivek Goyal1-0/+2
Some of the stats are 64bit and updation will be non atomic on 32bit architecture. Use sequence counters on 32bit arch to make reading of stats safe. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20blk-throttle: Make dispatch stats per cpuVivek Goyal1-6/+21
Currently we take blkg_stat lock for even updating the stats. So even if a group has no throttling rules (common case for root group), we end up taking blkg_lock, for updating the stats. Make dispatch stats per cpu so that these can be updated without taking blkg lock. If cpu goes offline, these stats simply disappear. No protection has been provided for that yet. Do we really need anything for that? Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20blk-cgroup: move some fields of unaccounted_time file under right config optionVivek Goyal1-3/+4
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y. there are some fields which are out side this config option. Fix that. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-16blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroupVivek Goyal1-0/+3
Currentlly we first map the task to cgroup and then cgroup to blkio_cgroup. There is a more direct way to get to blkio_cgroup from task using task_subsys_state(). Use that. The real reason for the fix is that it also avoids a race in generic cgroup code. During remount/umount rebind_subsystems() is called and it can do following with and rcu protection. cgrp->subsys[i] = NULL; That means if somebody got hold of cgroup under rcu and then it tried to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which is wrong. I was running into this race condition with ltp running on a upstream derived kernel and that lead to crash. So ideally we should also fix cgroup generic code to wait for rcu grace period before setting pointer to NULL. Li Zefan is not very keen on introducing synchronize_wait() as he thinks it will slow down moun/remount/umount operations. So for the time being atleast fix the kernel crash by taking a more direct route to blkio_cgroup. One tester had reported a crash while running LTP on a derived kernel and with this fix crash is no more seen while the test has been running for over 6 days. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-12blk-cgroup: Add unaccounted time to timeslice_used.Justin TerAvest1-2/+10
There are two kind of times that tasks are not charged for: the first seek and the extra time slice used over the allocated timeslice. Both of these exported as a new unaccounted_time stat. I think it would be good to have this reported in 'time' as well, but that is probably a separate discussion. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-08blk-cgroup: Lower minimum weight from 100 to 10.Justin TerAvest1-1/+1
We've found that we still get good, useful isolation at weights this low. I'd like to adjust the minimum so that any other changes can take these values into account. Signed-off-by: Justin TerAvest <teravest@google.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio-throttle: limit max iops value to UINT_MAXVivek Goyal1-0/+3
- Limit max iops value to UINT_MAX and return error to user if value is more than that instead of accepting bigger values and truncating implicitly. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-01blkio: Recalculate the throttled bio dispatch time upon throttle limit changeVivek Goyal1-10/+11
o Currently any cgroup throttle limit changes are processed asynchronousy and the change does not take affect till a new bio is dispatched from same group. o It might happen that a user sets a redicuously low limit on throttling. Say 1 bytes per second on reads. In such cases simple operations like mount a disk can wait for a very long time. o Once bio is throttled, there is no easy way to come out of that wait even if user increases the read limit later. o This patch fixes it. Now if a user changes the cgroup limits, we recalculate the bio dispatch time according to new limits. o Can't take queueu lock under blkcg_lock, hence after the change I wake up the dispatch thread again which recalculates the time. So there are some variables being synchronized across two threads without lock and I had to make use of barriers. Hoping I have used barriers correctly. Any review of memory barrier code especially will help. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16blk-cgroup: cgroup changes for IOPS limit supportVivek Goyal1-0/+13
o cgroup changes for IOPS throttling rules. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16blk-cgroup: Introduce cgroup changes for throttling policyVivek Goyal1-1/+29
o cgroup chagnes for throttle policy. o Introduces READ and WRITE bytes per second throttling rules. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16blk-cgroup: Prepare the base for supporting more than one IO control policiesVivek Goyal1-2/+34
o This patch prepares the base for introducing new IO control policies. Currently all the code is written knowing there is only one policy and that is proportional bandwidth. Creating infrastructure for newer policies to come in. o Also there were many functions which were generated using macro. It was very confusing. Got rid of those. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>