<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/cgroup-defs.h, branch v6.6.131</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-02-21T12:57:17+00:00</updated>
<entry>
<title>cgroup: fix race between fork and cgroup.kill</title>
<updated>2025-02-21T12:57:17+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeel.butt@linux.dev</email>
</author>
<published>2025-01-31T00:05:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a00e607102ebd47dbeece2ce0730da811ebd2833'/>
<id>urn:sha1:a00e607102ebd47dbeece2ce0730da811ebd2833</id>
<content type='text'>
commit b69bb476dee99d564d65d418e9a20acca6f32c3f upstream.

Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
  I was looking at cgroup.kill implementation and wondering whether there
  could be a race window. So, __cgroup_kill() does the following:

   k1. Set CGRP_KILL.
   k2. Iterate tasks and deliver SIGKILL.
   k3. Clear CGRP_KILL.

  The copy_process() does the following:

   c1. Copy a bunch of stuff.
   c2. Grab siglock.
   c3. Check fatal_signal_pending().
   c4. Commit to forking.
   c5. Release siglock.
   c6. Call cgroup_post_fork() which puts the task on the css_set and tests
       CGRP_KILL.

  The intention seems to be that either a forking task gets SIGKILL and
  terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
  don't see what guarantees that k3 can't happen before c6. ie. After a
  forking task passes c5, k2 can take place and then before the forking task
  reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
  What am I missing?

This is indeed a race. One way to fix this race is by taking
cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
to cgroup_post_fork(). However that would be heavy handed as this adds
one more potential stall scenario for cgroup.kill which is usually
called under extreme situation like memory pressure.

To fix this race, let's maintain a sequence number per cgroup which gets
incremented on __cgroup_kill() call. On the fork() side, the
cgroup_can_fork() will cache the sequence number locally and recheck it
against the cgroup's sequence number at cgroup_post_fork() site. If the
sequence numbers mismatch, it means __cgroup_kill() can been called and
we should send SIGKILL to the newly created task.

Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Move rcu_head up near the top of cgroup_root</title>
<updated>2024-08-19T04:04:31+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2023-12-07T13:46:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f3c60ab676bb62e01d004d5b1cf2963a296c8e6a'/>
<id>urn:sha1:f3c60ab676bb62e01d004d5b1cf2963a296c8e6a</id>
<content type='text'>
commit a7fb0423c201ba12815877a0b5a68a6a1710b23a upstream.

Commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU
safe") adds a new rcu_head to the cgroup_root structure and kvfree_rcu()
for freeing the cgroup_root.

The current implementation of kvfree_rcu(), however, has the limitation
that the offset of the rcu_head structure within the larger data
structure must be less than 4096 or the compilation will fail. See the
macro definition of __is_kvfree_rcu_offset() in include/linux/rcupdate.h
for more information.

By putting rcu_head below the large cgroup structure, any change to the
cgroup structure that makes it larger run the risk of causing build
failure under certain configurations. Commit 77070eeb8821 ("cgroup:
Avoid false cacheline sharing of read mostly rstat_cpu") happens to be
the last straw that breaks it. Fix this problem by moving the rcu_head
structure up before the cgroup structure.

Fixes: d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe")
Reported-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Closes: https://lore.kernel.org/lkml/20231207143806.114e0a74@canb.auug.org.au/
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Reviewed-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Reviewed-by: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Make operations on the cgroup root_list RCU safe</title>
<updated>2024-08-19T04:04:25+00:00</updated>
<author>
<name>Yafang Shao</name>
<email>laoar.shao@gmail.com</email>
</author>
<published>2023-10-29T06:14:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dd9542ae7c7ca82ed2d7c185754ba9026361f6bc'/>
<id>urn:sha1:dd9542ae7c7ca82ed2d7c185754ba9026361f6bc</id>
<content type='text'>
commit d23b5c577715892c87533b13923306acc6243f93 upstream.

At present, when we perform operations on the cgroup root_list, we must
hold the cgroup_mutex, which is a relatively heavyweight lock. In reality,
we can make operations on this list RCU-safe, eliminating the need to hold
the cgroup_mutex during traversal. Modifications to the list only occur in
the cgroup root setup and destroy paths, which should be infrequent in a
production environment. In contrast, traversal may occur frequently.
Therefore, making it RCU-safe would be beneficial.

Signed-off-by: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
To: Michal Koutný &lt;mkoutny@suse.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cgroup: Fix incorrect css_set_rwsem reference in comment</title>
<updated>2023-10-12T17:22:56+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2023-10-12T17:09:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=13cc9ee8f8ed58e563294d87d74a62006be40f21'/>
<id>urn:sha1:13cc9ee8f8ed58e563294d87d74a62006be40f21</id>
<content type='text'>
Since commit f0d9a5f17575 ("cgroup: make css_set_rwsem a spinlock
and rename it to css_set_lock"), css_set_rwsem has been replaced by
css_set_lock. That commit, however, missed the css_set_rwsem reference
in include/linux/cgroup-defs.h. Fix that by changing it to css_set_lock
as well.

Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2023-09-01T22:58:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-09-01T22:58:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7716f383a58314378604eecdd66949ea2cd80ef3'/>
<id>urn:sha1:7716f383a58314378604eecdd66949ea2cd80ef3</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - Per-cpu cpu usage stats are now tracked

   This currently isn't printed out in the cgroupfs interface and can
   only be accessed through e.g. BPF. Should decide on a not-too-ugly
   way to show per-cpu stats in cgroupfs

 - cpuset received some cleanups and prepatory patches for the pending
   cpus.exclusive patchset which will allow cpuset partitions to be
   created below non-partition parents, which should ease the management
   of partition cpusets

 - A lot of code and documentation cleanup patches

 - tools/testing/selftests/cgroup/test_cpuset.c added

* tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits)
  cgroup: Avoid -Wstringop-overflow warnings
  cgroup:namespace: Remove unused cgroup_namespaces_init()
  cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
  cgroup: clean up if condition in cgroup_pidlist_start()
  cgroup: fix obsolete function name in cgroup_destroy_locked()
  Documentation: cgroup-v2.rst: Correct number of stats entries
  cgroup: fix obsolete function name above css_free_rwork_fn()
  cgroup/cpuset: fix kernel-doc
  cgroup: clean up printk()
  cgroup: fix obsolete comment above cgroup_create()
  docs: cgroup-v1: fix typo
  docs: cgroup-v1: correct the term of Page Cache organization in inode
  cgroup/misc: Store atomic64_t reads to u64
  cgroup/misc: Change counters to be explicit 64bit types
  cgroup/misc: update struct members descriptions
  cgroup: remove cgrp-&gt;kn check in css_populate_dir()
  cgroup: fix obsolete function name
  cgroup: use cached local variable parent in for loop
  cgroup: remove obsolete comment above struct cgroupstats
  cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED
  ...
</content>
</entry>
<entry>
<title>cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants</title>
<updated>2023-08-07T18:41:25+00:00</updated>
<author>
<name>Hao Jia</name>
<email>jiahao.os@bytedance.com</email>
</author>
<published>2023-08-07T03:29:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0437719c1a97791481c5fd59642494f2108701a8'/>
<id>urn:sha1:0437719c1a97791481c5fd59642494f2108701a8</id>
<content type='text'>
The member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.

Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.

So we add the subtree_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF/drgn, etc.
And we are still trying to determine how to expose it in the
cgroupfs interface.

Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Hao Jia &lt;jiahao.os@bytedance.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: add throttled time stat for throttled children</title>
<updated>2023-07-13T13:21:49+00:00</updated>
<author>
<name>Josh Don</name>
<email>joshdon@google.com</email>
</author>
<published>2023-06-20T18:32:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=677ea015f231aa38b3972aa7be54ecd2637e99fd'/>
<id>urn:sha1:677ea015f231aa38b3972aa7be54ecd2637e99fd</id>
<content type='text'>
We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each children cgroup has been throttled.

This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.

Expose this in a new interface, 'cpu.stat.local', which is similar to
how non-hierarchical events are accounted in 'memory.events.local'.

Signed-off-by: Josh Don &lt;joshdon@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
</content>
</entry>
<entry>
<title>Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next</title>
<updated>2022-11-02T15:18:27+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2022-11-02T15:18:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b54a0d4094f5c94eda1cafad44323306afe8f807'/>
<id>urn:sha1:b54a0d4094f5c94eda1cafad44323306afe8f807</id>
<content type='text'>
Daniel Borkmann says:

====================
bpf-next 2022-11-02

We've added 70 non-merge commits during the last 14 day(s) which contain
a total of 96 files changed, 3203 insertions(+), 640 deletions(-).

The main changes are:

1) Make cgroup local storage available to non-cgroup attached BPF programs
   such as tc BPF ones, from Yonghong Song.

2) Avoid unnecessary deadlock detection and failures wrt BPF task storage
   helpers, from Martin KaFai Lau.

3) Add LLVM disassembler as default library for dumping JITed code
   in bpftool, from Quentin Monnet.

4) Various kprobe_multi_link fixes related to kernel modules,
   from Jiri Olsa.

5) Optimize x86-64 JIT with emitting BMI2-based shift instructions,
   from Jie Meng.

6) Improve BPF verifier's memory type compatibility for map key/value
   arguments, from Dave Marchevsky.

7) Only create mmap-able data section maps in libbpf when data is exposed
   via skeletons, from Andrii Nakryiko.

8) Add an autoattach option for bpftool to load all object assets,
   from Wang Yufen.

9) Various memory handling fixes for libbpf and BPF selftests,
   from Xu Kuohai.

10) Initial support for BPF selftest's vmtest.sh on arm64,
    from Manu Bretelle.

11) Improve libbpf's BTF handling to dedup identical structs,
    from Alan Maguire.

12) Add BPF CI and denylist documentation for BPF selftests,
    from Daniel Müller.

13) Check BPF cpumap max_entries before doing allocation work,
    from Florian Lehner.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits)
  samples/bpf: Fix typo in README
  bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
  bpf: check max_entries before allocating memory
  bpf: Fix a typo in comment for DFS algorithm
  bpftool: Fix spelling mistake "disasembler" -&gt; "disassembler"
  selftests/bpf: Fix bpftool synctypes checking failure
  selftests/bpf: Panic on hard/soft lockup
  docs/bpf: Add documentation for new cgroup local storage
  selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x
  selftests/bpf: Add selftests for new cgroup local storage
  selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str
  bpftool: Support new cgroup local storage
  libbpf: Support new cgroup local storage
  bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  bpf: Refactor some inode/task/sk storage functions for reuse
  bpf: Make struct cgroup btf id global
  selftests/bpf: Tracing prog can still do lookup under busy lock
  selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection
  bpf: Add new bpf_task_storage_delete proto with no deadlock detection
  bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
  ...
====================

Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Implement cgroup storage available to non-cgroup-attached bpf progs</title>
<updated>2022-10-26T06:19:19+00:00</updated>
<author>
<name>Yonghong Song</name>
<email>yhs@fb.com</email>
</author>
<published>2022-10-26T04:28:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c4bcfb38a95edb1021a53f2d0356a78120ecfbe4'/>
<id>urn:sha1:c4bcfb38a95edb1021a53f2d0356a78120ecfbe4</id>
<content type='text'>
Similar to sk/inode/task storage, implement similar cgroup local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.

The life-cycle of storage is managed with the life-cycle of the
cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.

Typically, the following code is used to get the current cgroup:
    struct task_struct *task = bpf_get_current_task_btf();
    ... task-&gt;cgroups-&gt;dfl_cgrp ...
and in structure task_struct definition:
    struct task_struct {
        ....
        struct css_set __rcu            *cgroups;
        ....
    }
With sleepable program, accessing task-&gt;cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.

Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.

Acked-by: David Vernet &lt;void@manifault.com&gt;
Signed-off-by: Yonghong Song &lt;yhs@fb.com&gt;
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2022-10-14T20:03:00+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-10-14T20:03:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bd9a3dba185ce6701b41f0341470d3f53bbbbaed'/>
<id>urn:sha1:bd9a3dba185ce6701b41f0341470d3f53bbbbaed</id>
<content type='text'>
Pull PSI updates from Ingo Molnar:

 - Various performance optimizations, resulting in a 4%-9% speedup in
   the mmtests/config-scheduler-perfpipe micro-benchmark.

 - New interface to turn PSI on/off on a per cgroup level.

* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Per-cgroup PSI accounting disable/re-enable interface
  sched/psi: Cache parent psi_group to speed up group iteration
  sched/psi: Consolidate cgroup_psi()
  sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
  sched/psi: Remove NR_ONCPU task accounting
  sched/psi: Optimize task switch inside shared cgroups again
  sched/psi: Move private helpers to sched/stats.h
  sched/psi: Save percpu memory when !psi_cgroups_enabled
  sched/psi: Don't create cgroup PSI files when psi_disabled
  sched/psi: Fix periodic aggregation shut off
</content>
</entry>
</feed>
