<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/sched/ext.h, branch v6.19.11</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-11-14T21:11:08+00:00</updated>
<entry>
<title>sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs</title>
<updated>2025-11-14T21:11:08+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-14T01:33:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1dcb98bbb7538d4b9015d47c934acdf5ea86045c'/>
<id>urn:sha1:1dcb98bbb7538d4b9015d47c934acdf5ea86045c</id>
<content type='text'>
With the buddy lockup detector, smp_processor_id() returns the detecting CPU,
not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the
locked CPU number from watchdog_hardlockup_check() as a parameter instead.
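
A minimal userspace sketch of the shape of the change (the names mirror the
kernel's, but the bodies are illustrative stand-ins, not the actual patch):

  #include &lt;stdio.h&gt;

  /* Stand-in: pretend the detector runs on CPU 3 while CPU 7 is stuck. */
  static int smp_processor_id(void) { return 3; }

  /* Before: reports whichever CPU the buddy detector happens to run on. */
  static void scx_hardlockup_old(void)
  {
      printf("hardlockup on CPU %d\n", smp_processor_id());
  }

  /* After: the caller passes the CPU it actually found locked up. */
  static void scx_hardlockup(int cpu)
  {
      printf("hardlockup on CPU %d\n", cpu);
  }

  int main(void)
  {
      scx_hardlockup_old();    /* prints CPU 3 - misleading */
      scx_hardlockup(7);       /* prints CPU 7 - the locked CPU */
      return 0;
  }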

Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and
scx_rcu_cpu_stall() documenting their return value semantics.

Suggested-by: Doug Anderson &lt;dianders@chromium.org&gt;
Reviewed-by: Douglas Anderson &lt;dianders@chromium.org&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Reviewed-by: Emil Tsalapatis &lt;emil@etsalapatis.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR</title>
<updated>2025-11-12T16:43:44+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-11T19:18:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d2974cc79f7139cc851b84ad4f77805e93c40fe1'/>
<id>urn:sha1:d2974cc79f7139cc851b84ad4f77805e93c40fe1</id>
<content type='text'>
Factor out scx_dsq_list_node cursor initialization into the
INIT_DSQ_LIST_CURSOR macro in preparation for additional users.
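
A hypothetical, compilable sketch of the pattern; the struct layout and flag
below are assumptions, not the kernel's exact definitions:

  #include &lt;stdio.h&gt;

  /* Assumed node shape; the real struct lives in the kernel tree. */
  struct scx_dsq_list_node {
      unsigned int flags;
      unsigned int priv;
  };

  #define SCX_DSQ_LNODE_ITER_CURSOR 0x1    /* assumed flag name/value */

  /* One place to initialize a cursor node, ready for reuse. */
  #define INIT_DSQ_LIST_CURSOR(node, _flags, _priv)              \
      do {                                                       \
          (node).flags = SCX_DSQ_LNODE_ITER_CURSOR | (_flags);   \
          (node).priv = (_priv);                                 \
      } while (0)

  int main(void)
  {
      struct scx_dsq_list_node cursor;

      INIT_DSQ_LIST_CURSOR(cursor, 0, 0);
      printf("flags=%#x priv=%u\n", cursor.flags, cursor.priv);
      return 0;
  }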

Reviewed-by: Emil Tsalapatis &lt;emil@etsalapatis.com&gt;
Cc: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Hook up hardlockup detector</title>
<updated>2025-11-12T16:43:44+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-11T19:18:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=582f700e1bdc5978f41e3d8d65d3e16e34e9be8a'/>
<id>urn:sha1:582f700e1bdc5978f41e3d8d65d3e16e34e9be8a</id>
<content type='text'>
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
can be contended to the point where hardlockup triggers. Unfortunately,
hardlockup can be the first signal out of such situations, thus requiring
hardlockup handling.

Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. The
handling strategy can delay the watchdog taking its own action by one polling
period; however, given that the only remediation for a hardlockup is a crash,
this is likely an acceptable trade-off.
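
A rough userspace model of the recovery flow; the return-value convention
here is an assumption for illustration:

  #include &lt;stdbool.h&gt;
  #include &lt;stdio.h&gt;

  static bool scx_enabled = true;    /* stand-in: is a scheduler loaded? */

  /* Returns true if a scheduler was ejected, i.e. recovery was attempted. */
  static bool scx_hardlockup(int cpu)
  {
      if (!scx_enabled)
          return false;
      printf("kicking out BPF scheduler: hardlockup on CPU %d\n", cpu);
      scx_enabled = false;
      return true;
  }

  static void watchdog_hardlockup_check(int cpu)
  {
      if (scx_hardlockup(cpu))
          return;    /* handled; watchdog action delayed one period */
      printf("Watchdog detected hard LOCKUP on cpu %d\n", cpu);
  }

  int main(void)
  {
      watchdog_hardlockup_check(7);    /* first period: eject scheduler */
      watchdog_hardlockup_check(7);    /* next period: usual handling */
      return 0;
  }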

v2: Add missing dummy scx_hardlockup() definition for
    !CONFIG_SCHED_CLASS_EXT (kernel test bot).

Reported-by: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Cc: Emil Tsalapatis &lt;etsal@meta.com&gt;
Cc: Douglas Anderson &lt;dianders@chromium.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode</title>
<updated>2025-11-12T16:43:44+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-11T19:18:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=61debc251c1c9150c7bdfd5c028bc2d078e17d22'/>
<id>urn:sha1:61debc251c1c9150c7bdfd5c028bc2d078e17d22</id>
<content type='text'>
Bypass mode routes tasks through fallback dispatch queues. This was originally
a single global DSQ; commit b7b3b2dbae73 ("sched_ext: Split the global DSQ per
NUMA node") changed it to per-node DSQs to resolve NUMA-related livelocks.

Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.

Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
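
A toy model of the resulting structure (purely illustrative; real DSQs are
locked linked lists, not bare arrays):

  #include &lt;stdio.h&gt;

  #define NR_CPUS 4

  struct task { int pid; struct task *next; };

  /* One bypass DSQ per CPU: each CPU only scans its own queue, so it
   * never wades through tasks pinned elsewhere (the per-node livelock). */
  static struct task *bypass_dsq[NR_CPUS];

  static void bypass_enqueue(struct task *p, int cpu)
  {
      p-&gt;next = bypass_dsq[cpu];
      bypass_dsq[cpu] = p;    /* queue on the task's current CPU */
  }

  static struct task *bypass_dispatch(int cpu)
  {
      struct task *p = bypass_dsq[cpu];

      if (p)
          bypass_dsq[cpu] = p-&gt;next;    /* O(1), no cross-CPU scan */
      return p;
  }

  int main(void)
  {
      struct task t = { .pid = 42 };

      bypass_enqueue(&amp;t, 2);
      printf("CPU 2 dispatches pid %d\n", bypass_dispatch(2)-&gt;pid);
      return 0;
  }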

This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
kept separate from the local DSQ to enable such load balancing: local DSQs are
protected by rq locks, which prevents efficient scanning and transfer across
CPUs and is especially problematic when the system is already contended.

v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).

Reported-by: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Reviewed-by: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Reviewed-by: Emil Tsalapatis &lt;emil@etsalapatis.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Use shorter slice in bypass mode</title>
<updated>2025-11-12T16:43:43+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-11T19:18:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bfd3749d489ec0df27ed94ee3dfd9475fea27bf9'/>
<id>urn:sha1:bfd3749d489ec0df27ed94ee3dfd9475fea27bf9</id>
<content type='text'>
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode, where
the primary goal is ensuring that all tasks can make forward progress.

Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
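
A small model of the tunable's bounds; the validation shown is an assumption,
not the kernel's module-parameter plumbing:

  #include &lt;stdio.h&gt;

  #define SCX_SLICE_BYPASS_US 5000UL    /* 5ms default bypass slice */

  static unsigned long slice_bypass_us = SCX_SLICE_BYPASS_US;

  /* Accept only values in the documented 100us..100ms window. */
  static int set_slice_bypass_us(unsigned long us)
  {
      if (us &lt; 100UL || us &gt; 100000UL)
          return -1;
      slice_bypass_us = us;
      return 0;
  }

  int main(void)
  {
      if (set_slice_bypass_us(50))    /* rejected: below 100us */
          printf("50us rejected\n");
      set_slice_bypass_us(2500);      /* accepted: 2.5ms */
      printf("bypass slice: %luus\n", slice_bypass_us);
      return 0;
  }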

v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).

v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea).

Reviewed-by: Emil Tsalapatis &lt;emil@etsalapatis.com&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Cc: Dan Schatzberg &lt;schatzberg.dan@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()</title>
<updated>2025-11-03T21:57:30+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-11-03T20:25:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7900aa699c34401cf5d0c701d9ef72880ddc1a83'/>
<id>urn:sha1:7900aa699c34401cf5d0c701d9ef72880ddc1a83</id>
<content type='text'>
sched_ext_free() was called from __put_task_struct() when the last reference
to the task was dropped, which could be long after the task had finished
running. This caused cgroup-related problems:

- ops.init_task() can be called on a task whose cgroup never got
  ops.cgroup_init()'d during scheduler load, because the cgroup might be
  destroyed/unlinked while the zombie or dead task is still lingering on the
  scx_tasks list.

- ops.cgroup_exit() could be called before ops.exit_task() is called on all
  member tasks, leading to incorrect exit ordering.

Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class-&gt;task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.

By calling sched_ext_dead() before cgroup_task_dead() (sketched after this
list), we ensure that:

- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
  as cgroup_mutex prevents cgroup destruction while the task is still linked.

- All member tasks have ops.exit_task() called and are removed from scx_tasks
  before the cgroup can be destroyed and trigger ops.cgroup_exit().
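
A stand-in sketch of the resulting ordering; the types and printouts are
illustrative only:

  #include &lt;stdbool.h&gt;
  #include &lt;stdio.h&gt;

  struct task_struct { bool dead; };

  static void sched_ext_dead(struct task_struct *p)
  {
      (void)p;
      /* ops.exit_task() runs and the task leaves scx_tasks while its
       * cgroup is still guaranteed alive */
      printf("sched_ext_dead\n");
  }

  static void cgroup_task_dead(struct task_struct *p)
  {
      (void)p;
      /* only from here on may the cgroup be destroyed and trigger
       * ops.cgroup_exit() */
      printf("cgroup_task_dead\n");
  }

  /* Runs right after the final context switch away from a dying task. */
  static void finish_task_switch(struct task_struct *prev)
  {
      if (prev-&gt;dead) {
          sched_ext_dead(prev);
          cgroup_task_dead(prev);
      }
  }

  int main(void)
  {
      struct task_struct t = { .dead = true };

      finish_task_switch(&amp;t);
      return 0;
  }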

This fix is made possible by the cgroup_task_dead() split in the previous patch.

This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.

Reported-by: Dan Schatzberg &lt;dschatzberg@meta.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Add lockless peek operation for DSQs</title>
<updated>2025-10-15T16:46:25+00:00</updated>
<author>
<name>Ryan Newton</name>
<email>newton@meta.com</email>
</author>
<published>2025-10-15T15:50:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=44f5c8ec5b9ad8ed4ade08d727f803b2bb07f1c3'/>
<id>urn:sha1:44f5c8ec5b9ad8ed4ade08d727f803b2bb07f1c3</id>
<content type='text'>
The built-in DSQ data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.

This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. The cost is a few
extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.

This commit adds a first_task pointer field which is updated
atomically when the DSQ is modified, and allows any thread to peek at
the head of the queue without holding the lock.
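
A compilable C11 model of the technique; the field and function names are
illustrative, and the kernel implementation differs in detail:

  #include &lt;stdatomic.h&gt;
  #include &lt;stdio.h&gt;

  struct task { int pid; };

  struct dsq {
      /* list head and lock elided */
      _Atomic(struct task *) first_task;    /* published queue head */
  };

  /* Writer side: called under the DSQ lock after the list changes.
   * Release ordering supplies the barrier the commit mentions for ARM,
   * so the task is fully visible before the pointer to it is. */
  static void dsq_publish_head(struct dsq *q, struct task *head)
  {
      atomic_store_explicit(&amp;q-&gt;first_task, head, memory_order_release);
  }

  /* Reader side: any thread may peek without taking the lock. */
  static struct task *dsq_peek(struct dsq *q)
  {
      return atomic_load_explicit(&amp;q-&gt;first_task, memory_order_acquire);
  }

  int main(void)
  {
      struct task t = { .pid = 42 };
      struct dsq q = { .first_task = NULL };

      dsq_publish_head(&amp;q, &amp;t);
      printf("peeked pid %d\n", dsq_peek(&amp;q)-&gt;pid);
      return 0;
  }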

Signed-off-by: Ryan Newton &lt;newton@meta.com&gt;
Reviewed-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Reviewed-by: Christian Loehle &lt;christian.loehle@arm.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/ext: Implement cgroup_set_idle() callback</title>
<updated>2025-10-14T20:17:33+00:00</updated>
<author>
<name>zhidao su</name>
<email>suzhidao@xiaomi.com</email>
</author>
<published>2025-10-11T07:16:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=347ed2d566dabb06c7970fff01129c4f59995ed6'/>
<id>urn:sha1:347ed2d566dabb06c7970fff01129c4f59995ed6</id>
<content type='text'>
Implement the missing cgroup_set_idle() callback that was marked as a
TODO. This allows BPF schedulers to be notified when a cgroup's idle
state changes, enabling them to adjust their scheduling behavior
accordingly.

The implementation follows the same pattern as other cgroup callbacks
like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the
BPF scheduler has implemented the callback and invokes it with the
appropriate parameters.
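
A minimal model of that dispatch pattern; the ops struct and names below are
stand-ins, not the kernel definitions:

  #include &lt;stdbool.h&gt;
  #include &lt;stdio.h&gt;

  struct cgroup { int id; };

  struct sched_ext_ops {
      void (*cgroup_set_idle)(struct cgroup *cgrp, bool idle);
  };

  static void my_cgroup_set_idle(struct cgroup *cgrp, bool idle)
  {
      printf("cgroup %d idle=%d\n", cgrp-&gt;id, idle);
  }

  static struct sched_ext_ops ops = {
      .cgroup_set_idle = my_cgroup_set_idle,
  };

  /* Invoke the op only if the loaded scheduler implements it. */
  static void scx_cgroup_set_idle(struct cgroup *cgrp, bool idle)
  {
      if (ops.cgroup_set_idle)
          ops.cgroup_set_idle(cgrp, idle);
  }

  int main(void)
  {
      struct cgroup cg = { .id = 1 };

      scx_cgroup_set_idle(&amp;cg, true);
      return 0;
  }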

Also fixes a spelling error in the cgroup_set_bandwidth() documentation.

tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage.

Signed-off-by: zhidao su &lt;soolaugust@gmail.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Improve SCX_KF_DISPATCH comment</title>
<updated>2025-09-23T19:03:26+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2025-09-23T19:03:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=edf005fa274a0c224e550a52726aa7a426384e36'/>
<id>urn:sha1:edf005fa274a0c224e550a52726aa7a426384e36</id>
<content type='text'>
The comment for SCX_KF_DISPATCH was incomplete and didn't explain that
ops.dispatch() may temporarily release the rq lock, allowing ENQUEUE and
SELECT_CPU operations to be nested inside DISPATCH contexts.

Update the comment to clarify this nesting behavior and provide better
context for when these operations can occur within dispatch.
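
A toy illustration of context masks that nest this way; the names and values
are assumptions:

  #include &lt;stdio.h&gt;

  enum {
      SCX_KF_DISPATCH   = 1 &lt;&lt; 0,
      SCX_KF_ENQUEUE    = 1 &lt;&lt; 1,
      SCX_KF_SELECT_CPU = 1 &lt;&lt; 2,
  };

  /* While ops.dispatch() has dropped the rq lock, a nested enqueue adds
   * its own bit on top of the dispatch bit. */
  static unsigned int cur_kf_mask = SCX_KF_DISPATCH | SCX_KF_ENQUEUE;

  static int kf_allowed(unsigned int allowed)
  {
      return (cur_kf_mask &amp; allowed) != 0;
  }

  int main(void)
  {
      printf("enqueue kfuncs allowed: %d\n", kf_allowed(SCX_KF_ENQUEUE));
      return 0;
  }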

Acked-by: Andrea Righi &lt;arighi@nvidia.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched_ext: Drop kfuncs marked for removal in 6.15</title>
<updated>2025-06-25T23:13:20+00:00</updated>
<author>
<name>Jake Hillion</name>
<email>jake@hillion.co.uk</email>
</author>
<published>2025-06-25T17:05:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4ecf83741401c70d4420588ee1f3b1ca04ef58d5'/>
<id>urn:sha1:4ecf83741401c70d4420588ee1f3b1ca04ef58d5</id>
<content type='text'>
sched_ext performed a kfunc renaming pass in 6.13 and kept the old names
around for compatibility with old binaries. These were scheduled for
cleanup in 6.15 but were missed. Submitting for cleanup in for-next.

Removed the kfuncs, their flags, and any references I could find to them
in doc comments. Left the entries in include/scx/compat.bpf.h as they're
still useful to make new binaries compatible with old kernels.

Tested by applying to my kernel. It builds and a modern version of
scx_lavd loads fine.

Signed-off-by: Jake Hillion &lt;jake@hillion.co.uk&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
