kernel/linux.git/block/blk-cgroup.c, branch v6.6.141

blk-cgroup: fix disk reference leak in blkcg_maybe_throttle_current()

2026-05-23T11:03:05+00:00

[ Upstream commit 23308af722fefed00af5f238024c11710938fba3 ] Add the missing put_disk() on the error path in blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or blkg_tryget() fails, the function jumps to the out label which only calls rcu_read_unlock() but does not release the disk reference acquired by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk is already set to NULL before the lookup, blkcg_exit() cannot release this reference either, causing the disk to never be freed. Restore the reference release that was present as blk_put_queue() in the original code but was inadvertently dropped during the conversion from request_queue to gendisk. Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct") Signed-off-by: Jackie Liu Acked-by: Tejun Heo Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: wait for blkcg cleanup before initializing new disk

2026-05-23T11:03:04+00:00

[ Upstream commit 3dbaacf6ab68f81e3375fe769a2ecdbd3ce386fd ] When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the previous disk's blkcg state is cleaned up asynchronously via disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk() runs before that cleanup finishes, we may overwrite q->root_blkg while the old one is still alive, and radix_tree_insert() in blkg_create() fails with -EEXIST because the old blkg entries still occupy the same queue id slot in blkcg->blkg_tree. This causes the sd probe to fail with -ENOMEM. Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL, which indicates the previous disk's blkcg cleanup has completed. Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") Cc: Yi Zhang Signed-off-by: Ming Lei Reviewed-by: Christoph Hellwig Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: fix possible deadlock while configuring policy

2025-11-24T09:29:22+00:00

[ Upstream commit 5d726c4dbeeddef612e6bed27edd29733f4d13af ] Following deadlock can be triggered easily by lockdep: WARNING: possible circular locking dependency detected 6.17.0-rc3-00124-ga12c2658ced0 #1665 Not tainted ------------------------------------------------------ check/1334 is trying to acquire lock: ff1100011d9d0678 (&q->sysfs_lock){+.+.}-{4:4}, at: blk_unregister_queue+0x53/0x180 but task is already holding lock: ff1100011d9d00e0 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: del_gendisk+0xba/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&q->q_usage_counter(queue)#3){++++}-{0:0}: blk_queue_enter+0x40b/0x470 blkg_conf_prep+0x7b/0x3c0 tg_set_limit+0x10a/0x3e0 cgroup_file_write+0xc6/0x420 kernfs_fop_write_iter+0x189/0x280 vfs_write+0x256/0x490 ksys_write+0x83/0x190 __x64_sys_write+0x21/0x30 x64_sys_call+0x4608/0x4630 do_syscall_64+0xdb/0x6b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&q->rq_qos_mutex){+.+.}-{4:4}: __mutex_lock+0xd8/0xf50 mutex_lock_nested+0x2b/0x40 wbt_init+0x17e/0x280 wbt_enable_default+0xe9/0x140 blk_register_queue+0x1da/0x2e0 __add_disk+0x38c/0x5d0 add_disk_fwnode+0x89/0x250 device_add_disk+0x18/0x30 virtblk_probe+0x13a3/0x1800 virtio_dev_probe+0x389/0x610 really_probe+0x136/0x620 __driver_probe_device+0xb3/0x230 driver_probe_device+0x2f/0xe0 __driver_attach+0x158/0x250 bus_for_each_dev+0xa9/0x130 driver_attach+0x26/0x40 bus_add_driver+0x178/0x3d0 driver_register+0x7d/0x1c0 __register_virtio_driver+0x2c/0x60 virtio_blk_init+0x6f/0xe0 do_one_initcall+0x94/0x540 kernel_init_freeable+0x56a/0x7b0 kernel_init+0x2b/0x270 ret_from_fork+0x268/0x4c0 ret_from_fork_asm+0x1a/0x30 -> #0 (&q->sysfs_lock){+.+.}-{4:4}: __lock_acquire+0x1835/0x2940 lock_acquire+0xf9/0x450 __mutex_lock+0xd8/0xf50 mutex_lock_nested+0x2b/0x40 blk_unregister_queue+0x53/0x180 __del_gendisk+0x226/0x690 del_gendisk+0xba/0x110 sd_remove+0x49/0xb0 [sd_mod] device_remove+0x87/0xb0 device_release_driver_internal+0x11e/0x230 device_release_driver+0x1a/0x30 bus_remove_device+0x14d/0x220 device_del+0x1e1/0x5a0 __scsi_remove_device+0x1ff/0x2f0 scsi_remove_device+0x37/0x60 sdev_store_delete+0x77/0x100 dev_attr_store+0x1f/0x40 sysfs_kf_write+0x65/0x90 kernfs_fop_write_iter+0x189/0x280 vfs_write+0x256/0x490 ksys_write+0x83/0x190 __x64_sys_write+0x21/0x30 x64_sys_call+0x4608/0x4630 do_syscall_64+0xdb/0x6b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &q->sysfs_lock --> &q->rq_qos_mutex --> &q->q_usage_counter(queue)#3 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)#3); lock(&q->rq_qos_mutex); lock(&q->q_usage_counter(queue)#3); lock(&q->sysfs_lock); Root cause is that queue_usage_counter is grabbed with rq_qos_mutex held in blkg_conf_prep(), while queue should be freezed before rq_qos_mutex from other context. The blk_queue_enter() from blkg_conf_prep() is used to protect against policy deactivation, which is already protected with blkcg_mutex, hence convert blk_queue_enter() to blkcg_mutex to fix this problem. Meanwhile, consider that blkcg_mutex is held after queue is freezed from policy deactivation, also convert blkg_alloc() to use GFP_NOIO. Signed-off-by: Yu Kuai Reviewed-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: Fix class @block_class's subsystem refcount leakage

2025-02-17T08:40:20+00:00

commit d1248436cbef1f924c04255367ff4845ccd9025e upstream. blkcg_fill_root_iostats() iterates over @block_class's devices by class_dev_iter_(init|next)(), but does not end iterating with class_dev_iter_exit(), so causes the class's subsystem refcount leakage. Fix by ending the iterating with class_dev_iter_exit(). Fixes: ef45fe470e1e ("blk-cgroup: show global disk stats in root cgroup io.stat") Reviewed-by: Michal Koutný Cc: Greg Kroah-Hartman Cc: stable@vger.kernel.org Acked-by: Tejun Heo Signed-off-by: Zijun Hu Link: https://lore.kernel.org/r/20250105-class_fix-v6-2-3a2f1768d4d4@quicinc.com Signed-off-by: Greg Kroah-Hartman Signed-off-by: Greg Kroah-Hartman

blk-cgroup: Fix UAF in blkcg_unpin_online()

2024-12-19T17:11:22+00:00

commit 86e6ca55b83c575ab0f2e105cf08f98e58d3d7af upstream. blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To walk up, it uses blkcg_parent(blkcg) but it was calling that after blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the following UAF: ================================================================== BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270 Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117 CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022 Workqueue: cgwb_release cgwb_release_workfn Call Trace: dump_stack_lvl+0x27/0x80 print_report+0x151/0x710 kasan_report+0xc0/0x100 blkcg_unpin_online+0x15a/0x270 cgwb_release_workfn+0x194/0x480 process_scheduled_works+0x71b/0xe20 worker_thread+0x82a/0xbd0 kthread+0x242/0x2c0 ret_from_fork+0x33/0x70 ret_from_fork_asm+0x1a/0x30 ... Freed by task 1944: kasan_save_track+0x2b/0x70 kasan_save_free_info+0x3c/0x50 __kasan_slab_free+0x33/0x50 kfree+0x10c/0x330 css_free_rwork_fn+0xe6/0xb30 process_scheduled_works+0x71b/0xe20 worker_thread+0x82a/0xbd0 kthread+0x242/0x2c0 ret_from_fork+0x33/0x70 ret_from_fork_asm+0x1a/0x30 Note that the UAF is not easy to trigger as the free path is indirected behind a couple RCU grace periods and a work item execution. I could only trigger it with artifical msleep() injected in blkcg_unpin_online(). Fix it by reading the parent pointer before destroying the blkcg's blkg's. Signed-off-by: Tejun Heo Reported-by: Abagail ren Suggested-by: Linus Torvalds Fixes: 4308a434e5e0 ("blkcg: don't offline parent blkcg first") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

blk-cgroup: Properly propagate the iostat update up the hierarchy

2024-06-12T09:12:46+00:00

[ Upstream commit 9d230c09964e6e18c8f6e4f0d41ee90eef45ec1c ] During a cgroup_rstat_flush() call, the lowest level of nodes are flushed first before their parents. Since commit 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()"), iostat propagation was still done to the parent. Grandparent, however, may not get the iostat update if the parent has no blkg_iostat_set queued in its lhead lockless list. Fix this iostat propagation problem by queuing the parent's global blkg->iostat into one of its percpu lockless lists to make sure that the delta will always be propagated up to the grandparent and so on toward the root blkcg. Note that successive calls to __blkcg_rstat_flush() are serialized by the cgroup_rstat_lock. So no special barrier is used in the reading and writing of blkg->iostat.lqueued. Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") Reported-by: Dan Schatzberg Closes: https://lore.kernel.org/lkml/ZkO6l%2FODzadSgdhC@dschatzberg-fedora-PF3DHTBV/ Signed-off-by: Waiman Long Reviewed-by: Ming Lei Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20240515143059.276677-1-longman@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: fix list corruption from reorder of WRITE ->lqueued

2024-06-12T09:12:45+00:00

[ Upstream commit d0aac2363549e12cc79b8e285f13d5a9f42fd08e ] __blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start is being executed. If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one stat instance being added in blk_cgroup_bio_start(), then the local list in __blkcg_rstat_flush() could be corrupted. Fix the issue by adding one barrier. Cc: Tejun Heo Cc: Waiman Long Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20240515013157.443672-3-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: fix list corruption from resetting io stat

2024-06-12T09:12:45+00:00

[ Upstream commit 6da6680632792709cecf2b006f2fe3ca7857e791 ] Since commit 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()"), each iostat instance is added to blkcg percpu list, so blkcg_reset_stats() can't reset the stat instance by memset(), otherwise the llist may be corrupted. Fix the issue by only resetting the counter part. Cc: Tejun Heo Cc: Waiman Long Cc: Jay Shin Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by: Ming Lei Acked-by: Tejun Heo Reviewed-by: Waiman Long Link: https://lore.kernel.org/r/20240515013157.443672-2-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: fix q->blkg_list corruption during disk rebind

2024-04-17T09:19:28+00:00

[ Upstream commit 8b8ace080319a866f5dfe9da8e665ae51d971c54 ] Multiple gendisk instances can allocated/added for single request queue in case of disk rebind. blkg may still stay in q->blkg_list when calling blkcg_init_disk() for rebind, then q->blkg_list becomes corrupted. Fix the list corruption issue by: - add blkg_init_queue() to initialize q->blkg_list & q->blkcg_mutex only - move calling blkg_init_queue() into blk_alloc_queue() The list corruption should be started since commit f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") which delays removing blkg from q->blkg_list into blkg_free_workfn(). Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler") Cc: Yu Kuai Cc: Tejun Heo Signed-off-by: Ming Lei Reviewed-by: Yu Kuai Link: https://lore.kernel.org/r/20240407125910.4053377-1-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-cgroup: bypass blkcg_deactivate_policy after destroying

2023-12-20T16:01:55+00:00

[ Upstream commit e63a57303599b17290cd8bc48e6f20b24289a8bc ] blkcg_deactivate_policy() can be called after blkg_destroy_all() returns, and it isn't necessary since blkg_destroy_all has covered policy deactivation. Signed-off-by: Ming Lei Link: https://lore.kernel.org/r/20231117023527.3188627-4-ming.lei@redhat.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin