<feed xmlns='http://www.w3.org/2005/Atom'>
<title>starfive-tech/linux.git/kernel/sched/psi.c, branch rt-linux-release</title>
<subtitle>StarFive Tech Linux Kernel for VisionFive (JH7110) boards (mirror)</subtitle>
<id>https://git.radix-linux.su/starfive-tech/linux.git/atom?h=rt-linux-release</id>
<link rel='self' href='https://git.radix-linux.su/starfive-tech/linux.git/atom?h=rt-linux-release'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/'/>
<updated>2023-11-06T11:24:37+00:00</updated>
<entry>
<title>printk: remove deferred printing</title>
<updated>2023-11-06T11:24:37+00:00</updated>
<author>
<name>John Ogness</name>
<email>john.ogness@linutronix.de</email>
</author>
<published>2020-11-30T00:36:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=5c17a5c18154f25e16336b4127f8d14916d5d028'/>
<id>urn:sha1:5c17a5c18154f25e16336b4127f8d14916d5d028</id>
<content type='text'>
Since printing occurs either atomically or from the printing
kthread, there is no need for any deferring or tracking possible
recursion paths. Remove all printk defer functions and context
tracking.

Signed-off-by: John Ogness &lt;john.ogness@linutronix.de&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup</title>
<updated>2021-07-02T00:22:14+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-07-02T00:22:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=3dbdb38e286903ec220aaf1fb29a8d94297da246'/>
<id>urn:sha1:3dbdb38e286903ec220aaf1fb29a8d94297da246</id>
<content type='text'>
Pull cgroup updates from Tejun Heo:

 - cgroup.kill is added which implements atomic killing of the whole
   subtree.

   Down the line, this should be able to replace the multiple userland
   implementations of "keep killing till empty".

 - PSI can now be turned off at boot time to avoid overhead for
   configurations which don't care about PSI.

* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: make per-cgroup pressure stall tracking configurable
  cgroup: Fix kernel-doc
  cgroup: inline cgroup_task_freeze()
  tests/cgroup: test cgroup.kill
  tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
  tests/cgroup: use cgroup.kill in cg_killall()
  docs/cgroup: add entry for cgroup.kill
  cgroup: introduce cgroup.kill
</content>
</entry>
<entry>
<title>psi: Fix race between psi_trigger_create/destroy</title>
<updated>2021-06-24T07:07:50+00:00</updated>
<author>
<name>Zhaoyang Huang</name>
<email>zhaoyang.huang@unisoc.com</email>
</author>
<published>2021-06-11T00:29:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=8f91efd870ea5d8bc10b0fcc9740db51cd4c0c83'/>
<id>urn:sha1:8f91efd870ea5d8bc10b0fcc9740db51cd4c0c83</id>
<content type='text'>
Race detected between psi_trigger_destroy/create as shown below, which
cause panic by accessing invalid psi_system-&gt;poll_wait-&gt;wait_queue_entry
and psi_system-&gt;poll_timer-&gt;entry-&gt;next. Under this modification, the
race window is removed by initialising poll_wait and poll_timer in
group_init which are executed only once at beginning.

  psi_trigger_destroy()                   psi_trigger_create()

  mutex_lock(trigger_lock);
  rcu_assign_pointer(poll_task, NULL);
  mutex_unlock(trigger_lock);
					  mutex_lock(trigger_lock);
					  if (!rcu_access_pointer(group-&gt;poll_task)) {
					    timer_setup(poll_timer, poll_timer_fn, 0);
					    rcu_assign_pointer(poll_task, task);
					  }
					  mutex_unlock(trigger_lock);

  synchronize_rcu();
  del_timer_sync(poll_timer); &lt;-- poll_timer has been reinitialized by
                                  psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of
group-&gt;poll_task but misses this race affecting poll_timer and
poll_wait.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai &lt;ziwei.dai@unisoc.com&gt;
Signed-off-by: ziwei.dai &lt;ziwei.dai@unisoc.com&gt;
Co-developed-by: ke.wang &lt;ke.wang@unisoc.com&gt;
Signed-off-by: ke.wang &lt;ke.wang@unisoc.com&gt;
Signed-off-by: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
</content>
</entry>
<entry>
<title>cgroup: make per-cgroup pressure stall tracking configurable</title>
<updated>2021-06-08T18:59:02+00:00</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2021-05-24T19:53:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=3958e2d0c34e18c41b60dc01832bd670a59ef70f'/>
<id>urn:sha1:3958e2d0c34e18c41b60dc01832bd670a59ef70f</id>
<content type='text'>
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.

Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>psi: Fix psi state corruption when schedule() races with cgroup move</title>
<updated>2021-05-06T13:33:26+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2021-05-03T17:49:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=d583d360a620e6229422b3455d0be082b8255f5e'/>
<id>urn:sha1:d583d360a620e6229422b3455d0be082b8255f5e</id>
<content type='text'>
4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p-&gt;in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p-&gt;in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task-&gt;psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task-&gt;psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
</content>
</entry>
<entry>
<title>sched,psi: Handle potential task count underflow bugs more gracefully</title>
<updated>2021-04-21T11:55:41+00:00</updated>
<author>
<name>Charan Teja Reddy</name>
<email>charante@codeaurora.org</email>
</author>
<published>2021-04-16T15:02:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=9d10a13d1e4c349b76f1c675a874a7f981d6d3b4'/>
<id>urn:sha1:9d10a13d1e4c349b76f1c675a874a7f981d6d3b4</id>
<content type='text'>
psi_group_cpu-&gt;tasks, represented by the unsigned int, stores the
number of tasks that could be stalled on a psi resource(io/mem/cpu).
Decrementing these counters at zero leads to wrapping which further
leads to the psi_group_cpu-&gt;state_mask is being set with the
respective pressure state. This could result into the unnecessary time
sampling for the pressure state thus cause the spurious psi events.
This can further lead to wrong actions being taken at the user land
based on these psi events.

Though psi_bug is set under these conditions but that just for debug
purpose. Fix it by decrementing the -&gt;tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy &lt;charante@codeaurora.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
</content>
</entry>
<entry>
<title>psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files</title>
<updated>2021-04-08T21:09:44+00:00</updated>
<author>
<name>Josh Hunt</name>
<email>johunt@akamai.com</email>
</author>
<published>2021-04-02T02:58:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=6db12ee0456d0e369c7b59788d46e15a56ad0294'/>
<id>urn:sha1:6db12ee0456d0e369c7b59788d46e15a56ad0294</id>
<content type='text'>
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt &lt;johunt@akamai.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
</content>
</entry>
<entry>
<title>psi: Reduce calls to sched_clock() in psi</title>
<updated>2021-03-23T15:01:58+00:00</updated>
<author>
<name>Shakeel Butt</name>
<email>shakeelb@google.com</email>
</author>
<published>2021-03-21T20:51:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=df77430639c9cf73559bac0f25084518bf9a812d'/>
<id>urn:sha1:df77430639c9cf73559bac0f25084518bf9a812d</id>
<content type='text'>
We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.516 [sec]

       3.516689 usecs/op
         284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.329 [sec]

       3.329820 usecs/op
         300316 ops/sec

Signed-off-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
</content>
</entry>
<entry>
<title>sched: Fix various typos</title>
<updated>2021-03-21T23:11:52+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2021-03-18T12:38:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=3b03706fa621ce31a3e9ef6307020fde4e6aae16'/>
<id>urn:sha1:3b03706fa621ce31a3e9ef6307020fde4e6aae16</id>
<content type='text'>
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Juri Lelli &lt;juri.lelli@redhat.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Cc: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: linux-kernel@vger.kernel.org
</content>
</entry>
<entry>
<title>psi: Optimize task switch inside shared cgroups</title>
<updated>2021-03-06T11:40:23+00:00</updated>
<author>
<name>Chengming Zhou</name>
<email>zhouchengming@bytedance.com</email>
</author>
<published>2021-03-03T03:46:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=4117cebf1a9fcbf35b9aabf0e37b6c5eea296798'/>
<id>urn:sha1:4117cebf1a9fcbf35b9aabf0e37b6c5eea296798</id>
<content type='text'>
The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
cgroups") only update cgroups whose state actually changes during a
task switch only in task preempt case, not in task sleep case.

We actually don't need to clear and set TSK_ONCPU state for common cgroups
of next and prev task in sleep case, that can save many psi_group_change
especially when most activity comes from one leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev &amp; next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
</content>
</entry>
</feed>
