<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/cgroup-defs.h, branch v7.1-rc5</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1-rc5</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1-rc5'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-05-04T18:52:26+00:00</updated>
<entry>
<title>cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated</title>
<updated>2026-05-04T18:52:26+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-05-01T18:31:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=93618edf753838a727dbff63c7c291dee22d656b'/>
<id>urn:sha1:93618edf753838a727dbff63c7c291dee22d656b</id>
<content type='text'>
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's -&gt;css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset-&gt;tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: -&gt;css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while -&gt;css_offline() still runs only after the cgroup is fully
drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and -&gt;css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
    wasn't actually broken (ordered cgroup_offline_wq + queue_work order
    in cgroup_task_dead() saved it) but the explicit ref removes the
    dependency on those non-obvious invariants. Also note the
    pre-existing cgroup_apply_control_disable() race in the description;
    a follow-up will defer kill_css_finish() there.

Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-and-tested-by: Martin Pitt &lt;martin@piware.de&gt;
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext</title>
<updated>2026-04-15T17:54:24+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-15T17:54:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5bdb4078e1efba9650c03753616866192d680718'/>
<id>urn:sha1:5bdb4078e1efba9650c03753616866192d680718</id>
<content type='text'>
Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork

   Multiple BPF schedulers can be attached to cgroups and the dispatch
   path is made hierarchical. This involves substantial restructuring of
   the core dispatch, bypass, watchdog, and dump paths to be
   per-scheduler, along with new infrastructure for scheduler ownership
   enforcement, lifecycle management, and cgroup subtree iteration

   The enqueue path is not yet updated and will follow in a later cycle

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs

   Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
   to local DSQs either run immediately or get reenqueued back through
   ops.enqueue(), giving schedulers tighter control over queueing
   latency

   Also useful for opportunistic CPU sharing across sub-schedulers

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from ops.select_cpu()

   Fixed to guarantee exactly one ops.dequeue() call when a task leaves
   BPF scheduler custody

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement

 - Idle SMT sibling prioritization in the idle CPU selection path

 - Documentation, selftest, and tooling updates. Misc bug fixes and
   cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -&gt; ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
</content>
</entry>
<entry>
<title>Merge tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2026-04-15T17:18:49+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-15T17:18:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b71f0be2d23d876648758d57bc6761500e3b9c70'/>
<id>urn:sha1:b71f0be2d23d876648758d57bc6761500e3b9c70</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - cgroup_file_notify() locking converted from a global lock to
   per-cgroup_file spinlock with a lockless fast-path when no
   notification is needed

 - Misc changes including exposing cgroup helpers for sched_ext and
   minor fixes

* tag 'cgroup-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/rdma: fix swapped arguments in pr_warn() format string
  cgroup/dmem: remove region parameter from dmemcg_parse_limit
  cgroup: replace global cgroup_file_kn_lock with per-cgroup_file lock
  cgroup: add lockless fast-path checks to cgroup_file_notify()
  cgroup: reduce cgroup_file_kn_lock hold time in cgroup_file_notify()
  cgroup: Expose some cgroup helpers
</content>
</entry>
<entry>
<title>cgroup: Wait for dying tasks to leave on rmdir</title>
<updated>2026-03-24T20:21:40+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-03-24T20:21:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1b164b876c36c3eb5561dd9b37702b04401b0166'/>
<id>urn:sha1:1b164b876c36c3eb5561dd9b37702b04401b0166</id>
<content type='text'>
a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.

Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().

v2: cgroup_is_populated() true to false transition happens under css_set_lock
    not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
    missed wakeups (Sebastian).

Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Cc: Bert Karwatzki &lt;spasswolf@web.de&gt;
Cc: Michal Koutny &lt;mkoutny@suse.com&gt;
Cc: cgroups@vger.kernel.org
</content>
</entry>
<entry>
<title>cgroup: replace global cgroup_file_kn_lock with per-cgroup_file lock</title>
<updated>2026-03-11T22:16:21+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeel.butt@linux.dev</email>
</author>
<published>2026-03-11T01:01:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4ef420b3450026b56807e5d53001f80eb495403c'/>
<id>urn:sha1:4ef420b3450026b56807e5d53001f80eb495403c</id>
<content type='text'>
Replace the global cgroup_file_kn_lock with a per-cgroup_file spinlock
to eliminate cross-cgroup contention as it is not really protecting
data shared between different cgroups.

The lock is initialized in cgroup_add_file() alongside timer_setup().
No lock acquisition is needed during initialization since the cgroup
directory is being populated under cgroup_mutex and no concurrent
accessors exist at that point.

Reported-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Introduce cgroup sub-sched support</title>
<updated>2026-03-06T17:58:03+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-03-06T17:58:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ebeca1f930eac8f11f815d58eb38fa5d07e7c16e'/>
<id>urn:sha1:ebeca1f930eac8f11f815d58eb38fa5d07e7c16e</id>
<content type='text'>
A system often runs multiple workloads especially in multi-tenant server
environments where a system is split into partitions servicing separate
more-or-less independent workloads each requiring an application-specific
scheduler. To support such and other use cases, sched_ext is in the process
of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an
oft-taken approach is hard partitioning the system using cpuset. While it
would be possible to tie sched_ext multiple scheduler support to cpuset
partitions, such an approach would have fundamental limitations stemming
from the lack of dynamism and flexibility.

Users often don't care which specific CPUs are assigned to which workload
and want to take advantage of optimizations which are enabled by running
workloads on a larger machine - e.g. opportunistic over-commit, improving
latency critical workload characteristics while maintaining bandwidth
fairness, employing control mechanisms based on different criteria than
on-CPU time for e.g. flexible memory bandwidth isolation, packing similar
parts from different workloads on same L3s to improve cache efficiency,
and so on.

As this sort of dynamic behaviors are impossible or difficult to implement
with hard partitioning, sched_ext is implementing cgroup sub-sched support
where schedulers can be attached to the cgroup hierarchy and a parent
scheduler is responsible for controlling the CPUs that each child can use
at any given moment. This makes CPU distribution dynamically controlled by
BPF allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero
  sub_cgroup_id indicates that the scheduler is to be attached to the
  identified cgroup. A sub-sched is attached to the cgroup iff the nearest
  ancestor scheduler implements .sub_attach() and grants the attachment. Max
  nesting depth is limited by SCX_SUB_MAX_DEPTH.

- When a scheduler exits, all its descendant schedulers are exited
  together. Also, cgroup.scx_sched added which points to the effective
  scheduler instance for the cgroup. This is updated on scheduler
  init/exit and inherited on cgroup online. When a cgroup is offlined, the
  attached scheduler is automatically exited.

- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is
  automatically enabled if both SCX and cgroups are enabled. Sub-sched
  support is not tied to the CPU controller but rather the cgroup
  hierarchy itself. This is intentional as the support for cpu.weight and
  cpu.max based resource control is orthogonal to sub-sched support. Note
  that CONFIG_CGROUPS around cgroup subtree iteration support for
  scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.

- This allows loading sub-scheds and most framework operations such as
  propagating disable down the hierarchy work. However, sub-scheds are not
  operational yet and all tasks stay with the root sched. This will serve
  as the basis for building up full sub-sched support.

- DSQs point to the scx_sched they belong to.

- scx_qmap is updated to allow attachment of sub-scheds and also serving
  as sub-scheds.

- scx_is_descendant() is added but not yet used in this patch. It is used by
  later changes in the series and placed here as this is where the function
  belongs.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2026-02-11T21:20:50+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-11T21:20:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ff661eeee26038f15ed9dd33c91809632e11d9eb'/>
<id>urn:sha1:ff661eeee26038f15ed9dd33c91809632e11d9eb</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - cpuset changes:

    - Continue separating v1 and v2 implementations by moving more
      v1-specific logic into cpuset-v1.c

    - Improve partition handling. Sibling partitions are no longer
      invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
      fail in v2, and effective_xcpus computation is made consistent

    - Fix partition effective CPUs overlap that caused a warning on
      cpuset removal when sibling partitions shared CPUs

 - Increase the maximum cgroup subsystem count from 16 to 32 to
   accommodate future subsystem additions

 - Misc cleanups and selftest improvements including switching to
   css_is_online() helper, removing dead code and stale documentation
   references, using lockdep_assert_cpuset_lock_held() consistently, and
   adding polling helpers for asynchronously updated cgroup statistics

* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: fix overlap of partition effective CPUs
  cgroup: increase maximum subsystem count from 16 to 32
  cgroup: Remove stale cpu.rt.max reference from documentation
  cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
  cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
  cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
  cgroup/cpuset: Don't fail cpuset.cpus change in v2
  cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
  cgroup/cpuset: Streamline rm_siblings_excl_cpus()
  cpuset: remove dead code in cpuset-v1.c
  cpuset: remove v1-specific code from generate_sched_domains
  cpuset: separate generate_sched_domains for v1 and v2
  cpuset: move update_domain_attr_tree to cpuset_v1.c
  cpuset: add cpuset1_init helper for v1 initialization
  cpuset: add cpuset1_online_css helper for v1-specific operations
  cpuset: add lockdep_assert_cpuset_lock_held helper
  cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
  cgroup: switch to css_is_online() helper
  selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
  selftests: cgroup: make test_memcg_sock robust against delayed sock stats
  ...
</content>
</entry>
<entry>
<title>cgroup: increase maximum subsystem count from 16 to 32</title>
<updated>2026-02-01T16:34:15+00:00</updated>
<author>
<name>Chen Ridong</name>
<email>chenridong@huawei.com</email>
</author>
<published>2026-01-31T03:05:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5eab8c588bf37b7eb498f23a2ac3fb135c258e17'/>
<id>urn:sha1:5eab8c588bf37b7eb498f23a2ac3fb135c258e17</id>
<content type='text'>
The current cgroup subsystem limit of 16 is insufficient, as the number of
existing subsystems has already reached this limit. When adding a new
subsystem that is not yet in the mainline kernel, building with
`make allmodconfig` requires first bypassing the
`BUILD_BUG_ON(CGROUP_SUBSYS_COUNT &gt; 16)` restriction to allow compilation
to succeed. However, the kernel still fails to boot afterward.

This patch increases the maximum number of supported cgroup subsystems from
16 to 32, providing enough room for future subsystem additions.

Signed-off-by: Chen Ridong &lt;chenridong@huawei.com&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
Tested-by: JP Kobryn &lt;inwardvessel@gmail.com&gt;
Acked-by: JP Kobryn &lt;inwardvessel@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Eliminate cgrp_ancestor_storage in cgroup_root</title>
<updated>2026-01-08T01:11:03+00:00</updated>
<author>
<name>Michal Koutný</name>
<email>mkoutny@suse.com</email>
</author>
<published>2026-01-07T16:59:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ef56578274d2b98423c8ef82bb450223f5811b59'/>
<id>urn:sha1:ef56578274d2b98423c8ef82bb450223f5811b59</id>
<content type='text'>
The cgrp_ancestor_storage has two drawbacks:
- it's not guaranteed that the member immediately follows struct cgrp in
  cgroup_root (root cgroup's ancestors[0] might thus point to a padding
  and not in cgrp_ancestor_storage proper),
- this idiom raises warnings with -Wflex-array-member-not-at-end.

Instead of relying on the auxiliary member in cgroup_root, define the
0-th level ancestor inside struct cgroup (needed for static allocation
of cgrp_dfl_root), deeper cgroups would allocate flexible
_low_ancestors[].  Unionized alias through ancestors[] will
transparently join the two ranges.

The above change would still leave the flexible array at the end of
struct cgroup inside cgroup_root, so move cgrp also towards the end of
cgroup_root to resolve the -Wflex-array-member-not-at-end.

Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/
Reported-by: Gustavo A. R. Silva &lt;gustavo@embeddedor.com&gt;
Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/
Cc: David Laight &lt;david.laight.linux@gmail.com&gt;
Acked-by: Gustavo A. R. Silva &lt;gustavoars@kernel.org&gt;
Signed-off-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Fix seqcount lockdep assertion in cgroup freezer</title>
<updated>2025-10-03T14:30:28+00:00</updated>
<author>
<name>Nirbhay Sharma</name>
<email>nirbhay.lkd@gmail.com</email>
</author>
<published>2025-10-03T11:45:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=93a4b36ef3cf4ce5e6a7e7a7686181de76e246a1'/>
<id>urn:sha1:93a4b36ef3cf4ce5e6a7e7a7686181de76e246a1</id>
<content type='text'>
The commit afa3701c0e45 ("cgroup: cgroup.stat.local time accounting")
introduced a seqcount to track freeze timing but initialized it as a
plain seqcount_t using seqcount_init().

However, the write-side critical section in cgroup_do_freeze() holds
the css_set_lock spinlock while calling write_seqcount_begin(). On
PREEMPT_RT kernels, spinlocks do not disable preemption, causing the
lockdep assertion for a plain seqcount_t, which checks for preemption
being disabled, to fail.

This triggers the following warning:
  WARNING: CPU: 0 PID: 9692 at include/linux/seqlock.h:221

Fix this by changing the type to seqcount_spinlock_t and initializing
it with seqcount_spinlock_init() to associate css_set_lock with the
seqcount. This allows lockdep to correctly validate that the spinlock
is held during write operations, resolving the assertion failure on all
kernel configurations.

Reported-by: syzbot+27a2519eb4dad86d0156@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=27a2519eb4dad86d0156
Fixes: afa3701c0e45 ("cgroup: cgroup.stat.local time accounting")
Signed-off-by: Nirbhay Sharma &lt;nirbhay.lkd@gmail.com&gt;
Link: https://lore.kernel.org/r/20251002165510.KtY3IT--@linutronix.de/
Acked-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
