summaryrefslogtreecommitdiff
path: root/block/blk-cgroup.c
AgeCommit message (Collapse)AuthorFilesLines
2018-12-19block: save irq state in blkg_lookup_create()Ming Lei1-2/+4
blkg_lookup_create() may be called from pool_map() in which irq state is saved, so we have to do that in blkg_lookup_create(). Otherwise, the following lockdep warning can be triggered: [ 104.258537] ================================ [ 104.259129] WARNING: inconsistent lock state [ 104.259725] 4.20.0-rc6+ #545 Not tainted [ 104.260268] -------------------------------- [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.263747] {SOFTIRQ-ON-W} state was registered at: [ 104.264417] _raw_spin_unlock_irq+0x29/0x4c [ 104.265014] blkg_lookup_create+0xdc/0xe6 [ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f [ 104.266312] bio_associate_blkg+0x15a/0x1bb [ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool] [ 104.267572] __map_bio+0x98/0x29c [dm_mod] [ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod] [ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod] [ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod] [ 104.270973] generic_make_request+0x3f5/0x68b [ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool] [ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool] [ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool] [ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool] [ 104.274635] process_one_work+0x4eb/0x834 [ 104.275203] worker_thread+0x318/0x484 [ 104.275740] kthread+0x1d1/0x1e1 [ 104.276203] ret_from_fork+0x3a/0x50 [ 104.276714] irq event stamp: 170003 [ 104.277201] hardirqs last enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55 [ 104.280273] softirqs last enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73 [ 104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d [ 104.282744] [ 104.282744] other info that might help us debug this: [ 104.283640] Possible unsafe locking scenario: [ 104.283640] [ 104.284452] CPU0 [ 104.284803] ---- [ 104.285150] lock(&(&pool->lock)->rlock#3); [ 104.285762] <Interrupt> [ 104.286130] lock(&(&pool->lock)->rlock#3); [ 104.286750] [ 104.286750] *** DEADLOCK *** [ 104.286750] [ 104.287564] no locks held by swapper/49/0. [ 104.288129] [ 104.288129] stack backtrace: [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545 [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 104.290858] Call Trace: [ 104.291204] <IRQ> [ 104.291502] dump_stack+0x9a/0xe6 [ 104.291968] mark_lock+0x56c/0x7a6 [ 104.292442] ? check_usage_backwards+0x209/0x209 [ 104.293086] __lock_acquire+0x400/0x15bf [ 104.293662] ? check_chain_key+0x150/0x1aa [ 104.294236] lock_acquire+0x1a6/0x1e3 [ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.297031] _raw_spin_lock_irqsave+0x46/0x55 [ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool] [ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.299886] ? check_flags+0x20a/0x20a [ 104.300408] ? lock_acquire+0x1a6/0x1e3 [ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool] [ 104.301865] clone_endio+0x1bb/0x22d [dm_mod] [ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f [ 104.303836] ? bio_endio+0x2b2/0x2da [ 104.304349] clone_endio+0x1f3/0x22d [dm_mod] [ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f [ 104.306333] ? bio_endio+0x2b2/0x2da [ 104.306853] clone_endio+0x1f3/0x22d [dm_mod] [ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod] [ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f [ 104.308817] ? bio_endio+0x2b2/0x2da [ 104.309319] blk_update_request+0x2de/0x4cc [ 104.309927] blk_mq_end_request+0x2a/0x183 [ 104.310498] blk_done_softirq+0x16a/0x1a6 [ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2 [ 104.311653] ? __lock_is_held+0x2a/0x87 [ 104.312186] __do_softirq+0x250/0x4e8 [ 104.312705] irq_exit+0x7e/0x11d [ 104.313157] call_function_single_interrupt+0xf/0x20 [ 104.313860] </IRQ> [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3 [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8 [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04 [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7 [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001 [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000 [ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278 [ 104.323927] default_idle+0xd9/0x1a8 [ 104.324427] do_idle+0x162/0x2b2 [ 104.324891] ? arch_cpu_idle_exit+0x28/0x28 [ 104.325467] ? mark_held_locks+0x28/0x7f [ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b [ 104.326719] cpu_startup_entry+0x1d/0x1f [ 104.327261] start_secondary+0x2cb/0x308 [ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3 [ 104.328421] secondary_startup_64+0xa4/0xb0 Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking") Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13blkcg: handle dying request_queue when associating a blkgDennis Zhou1-0/+6
Between v3 [1] and v4 [2] of the blkg association series, the association point moved from generic_make_request_checks(), which is called after the request enters the queue, to bio_set_dev(), which is when the bio is formed before submit_bio(). When the request_queue goes away, the blkgs supporting the request_queue are destroyed and then the q->root_blkg is set to %NULL. This patch adds a %NULL check to blkg_tryget_closest() to prevent the NPE caused by the above. It also adds a guard to see if the request_queue is dying when creating a blkg to prevent creating a blkg for a dead request_queue. [1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/ [2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/ Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device") Reported-and-tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-08blkcg: rename blkg_try_get() to blkg_tryget()Dennis Zhou1-2/+1
blkg reference counting now uses percpu_ref rather than atomic_t. Let's make this consistent with css_tryget. This renames blkg_try_get to blkg_tryget and now returns a bool rather than the blkg or %NULL. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-08blkcg: change blkg reference counting to use percpu_refDennis Zhou1-2/+39
Every bio is now associated with a blkg putting blkg_get, blkg_try_get, and blkg_put on the hot path. Switch over the refcnt in blkg to use percpu_ref. Signed-off-by: Dennis Zhou <dennis@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-08blkcg: convert blkg_lookup_create() to find closest blkgDennis Zhou1-7/+16
There are several scenarios where blkg_lookup_create() can fail such as the blkcg dying, request_queue is dying, or simply being OOM. Most handle this by simply falling back to the q->root_blkg and calling it a day. This patch implements the notion of closest blkg. During blkg_lookup_create(), if it fails to create, return the closest blkg found or the q->root_blkg. blkg_try_get_closest() is introduced and used during association so a bio is always attached to a blkg. Signed-off-by: Dennis Zhou <dennis@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-08blkcg: update blkg_lookup_create() to do lockingDennis Zhou1-3/+25
To know when to create a blkg, the general pattern is to do a blkg_lookup() and if that fails, lock and do the lookup again, and if that fails finally create. It doesn't make much sense for everyone who wants to do creation to write this themselves. This changes blkg_lookup_create() to do locking and implement this pattern. The old blkg_lookup_create() is renamed to __blkg_lookup_create(). If a call site wants to do its own error handling or already owns the queue lock, they can use __blkg_lookup_create(). This will be used in upcoming patches. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-16block: add queue_is_mq() helperJens Axboe1-4/+4
Various spots check for q->mq_ops being non-NULL, but provide a helper to do this instead. Where the ->mq_ops != NULL check is redundant, remove it. Since mq == rq-based now that legacy is gone, get rid of the queue_is_rq_based() and just use queue_is_mq() everywhere. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15block: remove the queue_lock indirectionChristoph Hellwig1-31/+31
With the legacy request path gone there is no good reason to keep queue_lock as a pointer, we can always use the embedded lock now. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Fixed floppy and blk-cgroup missing conversions and half done edits. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15blk-cgroup: move locking into blkg_destroy_allChristoph Hellwig1-7/+2
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15blk-cgroup: consolidate error handling in blkcg_init_queueChristoph Hellwig1-12/+10
Use a goto label to merge two identical pieces of error handling code. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15block: remove a few unused exportsChristoph Hellwig1-6/+0
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-15block: remove QUEUE_FLAG_BYPASS and ->bypassChristoph Hellwig1-15/+0
Unused since the removal of the legacy request code. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07block: remove request_list codeJens Axboe1-47/+0
It's now dead code, nobody uses it. Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-cgroup: remove legacy queue bypassingJens Axboe1-8/+0
We only support mq devices now. Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-02blkcg: revert blkcg cleanups seriesDennis Zhou1-85/+38
This reverts a series committed earlier due to null pointer exception bug report in [1]. It seems there are edge case interactions that I did not consider and will need some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow up series in [3]. [1] https://www.spinics.net/lists/cgroups/msg20719.html [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/ [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/ This reverts the following commits: d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae, f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef, a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53 Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01Merge tag 'v4.19-rc6' into for-4.20/blockJens Axboe1-1/+3
Merge -rc6 in, for two reasons: 1) Resolve a trivial conflict in the blk-mq-tag.c documentation 2) A few important regression fixes went into upstream directly, so they aren't in the 4.20 branch. Signed-off-by: Jens Axboe <axboe@kernel.dk> * tag 'v4.19-rc6': (780 commits) Linux 4.19-rc6 MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c cpufreq: qcom-kryo: Fix section annotations perf/core: Add sanity check to deal with pinned event failure xen/blkfront: correct purging of persistent grants Revert "xen/blkfront: When purging persistent grants, keep them in the buffer" selftests/powerpc: Fix Makefiles for headers_install change blk-mq: I/O and timer unplugs are inverted in blktrace dax: Fix deadlock in dax_lock_mapping_entry() x86/boot: Fix kexec booting failure in the SEV bit detection code bcache: add separate workqueue for journal_write to avoid deadlock drm/amd/display: Fix Edid emulation for linux drm/amd/display: Fix Vega10 lightup on S3 resume drm/amdgpu: Fix vce work queue was not cancelled when suspend Revert "drm/panel: Add device_link from panel device to DRM device" xen/blkfront: When purging persistent grants, keep them in the buffer clocksource/drivers/timer-atmel-pit: Properly handle error cases block: fix deadline elevator drain for zoned block devices ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-22blkcg: rename blkg_try_get to blkg_trygetDennis Zhou (Facebook)1-2/+1
blkg reference counting now uses percpu_ref rather than atomic_t. Let's make this consistent with css_tryget. This renames blkg_try_get to blkg_tryget and now returns a bool rather than the blkg or NULL. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-22blkcg: change blkg reference counting to use percpu_refDennis Zhou (Facebook)1-25/+39
Now that every bio is associated with a blkg, this puts the use of blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over the refcnt in blkg to use percpu_ref. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-22blkcg: convert blkg_lookup_create to find closest blkgDennis Zhou (Facebook)1-8/+17
There are several scenarios where blkg_lookup_create can fail. Examples include the blkcg dying, request_queue is dying, or simply being OOM. At the end of the day, most handle this by simply falling back to the q->root_blkg and calling it a day. This patch implements the notion of closest blkg. During blkg_lookup_create, if it fails to create, return the closest blkg found or the q->root_blkg. blkg_try_get_closest is introduced and used during association so a bio is always attached to a blkg. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-22blkcg: update blkg_lookup_create to do lockingDennis Zhou (Facebook)1-3/+28
To know when to create a blkg, the general pattern is to do a blkg_lookup and if that fails, lock and then do a lookup again and if that fails finally create. It doesn't make much sense for everyone who wants to do creation to write this themselves. This changes blkg_lookup_create to do locking and implement this pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create. If a call site wants to do its own error handling or already owns the queue lock, they can use __blkg_lookup_create. This will be used in upcoming patches. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-11blk-cgroup: increase number of supported policiesJens Axboe1-1/+3
After merging the iolatency policy, we potentially now have 4 policies being registered, but only support 3. This causes one of them to fail loading. Takashi reports that BFQ no longer works for him, because it fails to load due to policy registration failure. Bump to 5 policies, and also add a warning for when we have exceeded the global amount. If we have to touch this again, we should switch to a dynamic scheme instead. Reported-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31blkcg: delay blkg destruction until after writeback has finishedDennis Zhou (Facebook)1-8/+45
Currently, blkcg destruction relies on a sequence of events: 1. Destruction starts. blkcg_css_offline() is called and blkgs release their reference to the blkcg. This immediately destroys the cgwbs (writeback). 2. With blkgs giving up their reference, the blkcg ref count should become zero and eventually call blkcg_css_free() which finally frees the blkcg. Jiufei Xue reported that there is a race between blkcg_bio_issue_check() and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent on the completion of all writeback associated with the blkcg. A count of the number of cgwbs is maintained and once that goes to zero, blkg destruction can follow. This should prevent premature blkg destruction related to writeback. The new process for blkcg cleanup is as follows: 1. Destruction starts. blkcg_css_offline() is called which offlines writeback. Blkg destruction is delayed on the cgwb_refcnt count to avoid punting potentially large amounts of outstanding writeback to root while maintaining any ongoing policies. Here, the base cgwb_refcnt is put back. 2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called and handles destruction of blkgs. This is where the css reference held by each blkg is released. 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called. This finally frees the blkg. It seems in the past blk-throttle didn't do the most understandable things with taking data from a blkg while associating with current. So, the simplification and unification of what blk-throttle is doing caused this. Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31Revert "blk-throttle: fix race between blkcg_bio_issue_check() and ↵Dennis Zhou (Facebook)1-62/+16
cgroup_rmdir()" This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0. Destroying blkgs is tricky because of the nature of the relationship. A blkg should go away when either a blkcg or a request_queue goes away. However, blkg's pin the blkcg to ensure they remain valid. To break this cycle, when a blkcg is offlined, blkgs put back their css ref. This eventually lets css_free() get called which frees the blkcg. The above commit (4c6994806f70) breaks this order of events by trying to destroy blkgs in css_free(). As the blkgs still hold references to the blkcg, css_free() is never called. The race between blkcg_bio_issue_check() and cgroup_rmdir() will be addressed in the following patch by delaying destruction of a blkg until all writeback associated with the blkcg has been finished. Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01blk-cgroup: hold the queue ref during throttlingJosef Bacik1-1/+1
The blkg lifetime is protected by the queue lifetime, so we need to put the queue _after_ we're done using the blkg. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18blkcg: Track DISCARD statistics and output them in cgroup io.statTejun Heo1-4/+9
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two fields, dbytes and dios, to respectively count the total bytes and number of discards are added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Cc: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09block: introduce blk-iolatency io controllerJosef Bacik1-0/+8
Current IO controllers for the block layer are less than ideal for our use case. The io.max controller is great at hard limiting, but it is not work conserving. This patch introduces io.latency. You provide a latency target for your group and we monitor the io in short windows to make sure we are not exceeding those latency targets. This makes use of the rq-qos infrastructure and works much like the wbt stuff. There are a few differences from wbt - It's bio based, so the latency covers the whole block layer in addition to the actual io. - We will throttle all IO types that comes in here if we need to. - We use the mean latency over the 100ms window. This is because writes can be particularly fast, which could give us a false sense of the impact of other workloads on our protected workload. - By default there's no throttling, we set the queue_depth to INT_MAX so that we can have as many outstanding bio's as we're allowed to. Only at throttle time do we pay attention to the actual queue depth. - We backcharge cgroups for root cg issued IO and induce artificial delays in order to deal with cases like metadata only or swap heavy workloads. In testing this has worked out relatively well. Protected workloads will throttle noisy workloads down to 1 io at time if they are doing normal IO on their own, or induce up to a 1 second delay per syscall if they are doing a lot of root issued IO (metadata/swap IO). Our testing has revolved mostly around our production web servers where we have hhvm (the web server application) in a protected group and everything else in another group. We see slightly higher requests per second (RPS) on the test tier vs the control tier, and much more stable RPS across all machines in the test tier vs the control tier. Another test we run is a slow memory allocator in the unprotected group. Before this would eventually push us into swap and cause the whole box to die and not recover at all. With these patches we see slight RPS drops (usually 10-15%) before the memory consumer is properly killed and things recover within seconds. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09blkcg: add generic throttling mechanismJosef Bacik1-0/+220
Since IO can be issued from literally anywhere it's almost impossible to do throttling without having some sort of adverse effect somewhere else in the system because of locking or other dependencies. The best way to solve this is to do the throttling when we know we aren't holding any other kernel resources. Do this by tracking throttling in a per-blkg basis, and if we require throttling flag the task that it needs to check before it returns to user space and possibly sleep there. This is to address the case where a process is doing work that is generating IO that can't be throttled, whether that is directly with a lot of REQ_META IO, or indirectly by allocating so much memory that it is swamping the disk with REQ_SWAP. We can't use task_add_work as we don't want to induce a memory allocation in the IO path, so simply saving the request queue in the task and flagging it to do the notify_resume thing achieves the same result without the overhead of a memory allocation. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09blk-cgroup: allow controllers to output their own statsJosef Bacik1-3/+44
blk-iolatency has a few stats that it would like to print out, and instead of adding a bunch of crap to the generic code just provide a helper so that controllers can add stuff to the stat line if they want to. Hide it behind a boot option since it changes the output of io.stat from normal, and these stats are only interesting to developers. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-19blkcg: init root blkcg_gq under lockJiang Biao1-6/+11
The initializing of q->root_blkg is currently outside of queue lock and rcu, so the blkg may be destroied before the initializing, which may cause dangling/null references. On the other side, the destroys of blkg are protected by queue lock or rcu. Put the initializing inside the queue lock and rcu to make it safer. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> CC: Tejun Heo <tj@kernel.org> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-19blkcg: small fix on comment in blkcg_init_queueJiang Biao1-5/+1
The comment before blkg_create() in blkcg_init_queue() was moved from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but it does not suit for the new context. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> CC: Tejun Heo <tj@kernel.org> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-18blkcg: don't hold blkcg lock when deactivating policyJiang Biao1-5/+0
As described in the comment of blkcg_activate_policy(), *Update of each blkg is protected by both queue and blkcg locks so that holding either lock and testing blkcg_policy_enabled() is always enough for dereferencing policy data.* with queue lock held, there is no need to hold blkcg lock in blkcg_deactivate_policy(). Similar case is in blkcg_activate_policy(), which has removed holding of blkcg lock in commit 4c55f4f9ad3001ac1fefdd8d8ca7641d18558e23. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> CC: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-16blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()Joseph Qi1-16/+62
We've triggered a WARNING in blk_throtl_bio() when throttling writeback io, which complains blkg->refcnt is already 0 when calling blkg_get(), and then kernel crashes with invalid page request. After investigating this issue, we've found it is caused by a race between blkcg_bio_issue_check() and cgroup_rmdir(), which is described below: writeback kworker cgroup_rmdir cgroup_destroy_locked kill_css css_killed_ref_fn css_killed_work_fn offline_css blkcg_css_offline blkcg_bio_issue_check rcu_read_lock blkg_lookup spin_trylock(q->queue_lock) blkg_destroy spin_unlock(q->queue_lock) blk_throtl_bio spin_lock_irq(q->queue_lock) ... spin_unlock_irq(q->queue_lock) rcu_read_unlock Since rcu can only prevent blkg from releasing when it is being used, the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule blkg release. Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING. And then the corresponding blkg_put() will schedule blkg release again, which result in double free. This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()"). Before this commit, it will lookup first and then try to lookup/create again with queue_lock. Since revive this logic is a bit drastic, so fix it by only offlining pd during blkcg_css_offline(), and move the rest destruction (especially blkg_put()) into blkcg_css_free(), which should be the right way as discussed. Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()") Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26genhd: Add helper put_disk_and_module()Jan Kara1-9/+2
Add a proper counterpart to get_disk_and_module() - put_disk_and_module(). Currently it is opencoded in several places. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04blkcg: add sanity check for blkcg policy operationsweiping zhang1-0/+5
blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs, otherwise policy would register fail. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blkcg: check pol->cpd_free_fn before free cpdweiping zhang1-2/+2
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25blkcg: avoid free blkcg_root when failed to alloc blkcg policyweiping zhang1-3/+5
this patch fix two errors, firstly avoid kfree blk_root, secondly not free(blkcg) ,if blkcg alloc fail(blkcg == NULL), just unlock that mutex; Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-01block: Avoid that blk_exit_rl() triggers a use-after-freeBart Van Assche1-1/+1
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is essential that the memory allocated for struct request_queue stays around until all blk_exit_rl() calls have finished. Hence make blk_init_rl() take a reference on struct request_queue. This patch fixes the following crash: general protection fault: 0000 [#2] SMP CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 task: ffff88013a108040 task.stack: ffffc9000071c000 RIP: 0010:free_request_size+0x1a/0x30 RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003 RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418 RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009 R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8 R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0 Call Trace: mempool_destroy.part.10+0x21/0x40 mempool_destroy+0xe/0x10 blk_exit_rl+0x12/0x20 blkg_free+0x4d/0xa0 __blkg_release_rcu+0x59/0x170 rcu_process_callbacks+0x260/0x4e0 __do_softirq+0x116/0x250 smpboot_thread_fn+0x123/0x1e0 kthread+0x109/0x140 ret_from_fork+0x31/0x40 Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29blkcg: allocate struct blkcg_gq outside request queue spinlockTahsin Erdogan1-25/+98
blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Do struct blkcg_gq allocations outside the queue spinlock to allow blocking during memory allocations. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29Revert "blkcg: allocate struct blkcg_gq outside request queue spinlock"Jens Axboe1-87/+51
I inadvertently applied the v5 version of this patch, whereas the agreed upon version was v5. Revert this one so we can apply the right one. This reverts commit 7fc6b87a9ff537e7df32b1278118ce9c5bcd6788.
2017-03-29blkcg: allocate struct blkcg_gq outside request queue spinlockTahsin Erdogan1-51/+87
blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Update blkg_create() function to temporarily drop the rcu and queue locks when it is allowed by gfp mask. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar1-0/+1
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-03blkcg: fix double free of new_blkg in blkcg_init_queueHou Tao1-3/+1
If blkg_create fails, new_blkg passed as an argument will be freed by blkg_create, so there is no need to free it again. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02block: Use pointer to backing_dev_info from request_queueJan Kara1-3/+3
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19blk-cgroup: don't quiesce the queue on policy activate/deactivateJens Axboe1-12/+8
There's no potential harm in quiescing the queue, but it also doesn't buy us anything. And we can't run the queue async for policy deactivate, since we could be in the path of tearing the queue down. If we schedule an async run of the queue at that time, we're racing with queue teardown AFTER having we've already torn most of it down. Reported-by: Omar Sandoval <osandov@fb.com> Fixes: 4d199c6f1c84 ("blk-cgroup: ensure that we clear the stop bit on quiesced queues") Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-18blk-cgroup: ensure that we clear the stop bit on quiesced queuesJens Axboe1-4/+6
If we call blk_mq_quiesce_queue() on a queue, we must remember to pair that with something that clears the stopped by on the queues later on. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-17blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe1-4/+20
This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
2016-11-22block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcgTejun Heo1-4/+5
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and when that fails falls back to operations which aren't specific to the cgroup. Occassional failures are expected under pressure and falling back to non-cgroup operation is the right thing to do. Unfortunately, I forgot to add __GFP_NOWARN to these allocations and these expected failures end up creating a lot of noise. Add __GFP_NOWARN. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marc MERLIN <marc@merlins.org> Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-30blkcg: Unlock blkcg_pol_mutex only once when cpd == NULLBart Van Assche1-3/+1
Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register() such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch avoids that smatch reports the following error: block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex' Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> # v4.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-14block/blk-cgroup.c: Declare local symbols staticBart Van Assche1-2/+2
Detected by sparse. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09block: fix module reference leak on put_disk() call for cgroups throttleRoman Pen1-0/+9
get_disk(),get_gendisk() calls have non explicit side effect: they increase the reference on the disk owner module. The following is the correct sequence how to get a disk reference and to put it: disk = get_gendisk(...); /* use disk */ owner = disk->fops->owner; put_disk(disk); module_put(owner); fs/block_dev.c is aware of this required module_put() call, but f.e. blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put a module reference. To see a leakage in action cgroups throttle config can be used. In the following script I'm removing throttle for /dev/ram0 (actually this is NOP, because throttle was never set for this device): # lsmod | grep brd brd 5175 0 # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \ /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \ done # lsmod | grep brd brd 5175 100 Now brd module has 100 references. The issue is fixed by calling module_put() just right away put_disk(). Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>