Age | Commit message (Collapse) | Author | Files | Lines |
|
fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:
pick_next_task() := pick_task() + set_next_task(.first = true)
to:
pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)
making invoking put_prev_task() pick_next_task()'s responsibility. This
reordering allows pick_task() to be shared between regular and core-sched
paths and put_prev_task() to know the next task.
sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes,
70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
before picking the first task as a workaround.
Clean it up and adopt the conventions that other sched classes are
following.
The operation of keeping running the current task was spread and required
the task to be put on the local DSQ before picking:
- balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
runnable, hasn't exhausted its slice, and thus should keep running.
- put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
only runnable task. do_enqueue_task() in turn decided whether to use the
local DSQ depending on SCX_OPS_ENQ_LAST.
Consolidate the logic in balance_one() as it always knows whether it is
going to keep the current task. balance_one() now considers all conditions
where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from
the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
updated to pick the current task if SCX_TASK_BAL_KEEP is set.
The workaround put_prev_task[_scx]() calls are replaced with
put_prev_set_next_task().
This causes two behavior changes observable from the BPF scheduler:
- When a task keep running, it no longer goes through enqueue/dequeue cycle
and thus ops.stopping/running() transitions. The new behavior is better
and all the existing schedulers should be able to handle the new behavior.
- The BPF scheduler cannot keep executing the current task by enqueueing
SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
BPF scheduler is responsible for resuming execution after each
SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
decisions are not made on the local CPU - e.g. central or userspace-driven
schedulin - and the new behavior is more logical and shouldn't pose any
problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
doesn't fit that well anymore and the last task handling is moved to the
end of qmap_dispatch().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
|
|
- Resolve trivial context conflicts from dl_server clearing being moved
around.
- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
match sched/core.
- Merge sched_class->switch_class() addition from sched_ext with
tip/sched/core changes in __pick_next_task().
- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
behavior where sched_class->put_prev_task() was called before
sched_class->pick_next_task().
While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
touch_core_sched()
Since 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from
balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
function. This is simpler and in line with what other balance functions are
doing but it loses control over rq->clock_update_flags which makes
assert_clock_udpated() trigger if other CPUs pins the rq lock.
The only place this matters is touch_core_sched() which uses the timestamp
to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
be better to use per-core dispatch sequence number.
v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()")
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
|
|
When deciding whether a task can be migrated to a CPU,
dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
tests instead of using task_can_run_on_remote_rq(). This had two problems.
- It was missing is_migration_disabled() check and thus could try to migrate
a task which shouldn't leading to assertion and scheduling failures.
- It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
and thus failed to consider ISA compatibility.
Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():
- Move scx_ops_error() triggering into task_can_run_on_remote_rq().
- When migration isn't allowed, fall back to the global DSQ instead of the
source DSQ by returning DTL_INVALID. This is both simpler and an overall
better behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
The cfi stub for ops.tick was missing which will fail scheduler loading
after pending BPF changes. Add it.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Since dequeue_task() allowed to fail, there is a compile error:
kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’
3630 | .dequeue_task = dequeue_task_scx,
| ^~~~~~~~~~~~~~~~
Allow dequeue_task_scx to fail too.
Fixes: 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
consume_remote_task() and dispatch_to_local_dsq() use
move_task_to_local_dsq() to migrate the task to the target CPU. Currently,
move_task_to_local_dsq() expects the caller to lock both the source and
destination rq's. While this may save a few lock operations while the rq's
are not contended, under contention, the double locking can exacerbate the
situation significantly (refer to the linked message below).
Update the migration path so that double locking is not used.
move_task_to_local_dsq() now expects the caller to be locking the source rq,
drops it and then acquires the destination rq lock. Code is simpler this way
and, on a 2-way NUMA machine w/ Xeon Gold 6138, 'hackbench 100 thread 5000`
shows ~3% improvement with scx_simple.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20240806082716.GP37996@noisy.programming.kicks-ass.net
Acked-by: David Vernet <void@manifault.com>
|
|
`__bpf_ops_sched_ext_ops` was missing the initialization of some struct
attributes. With
https://lore.kernel.org/all/20240722183049.2254692-4-martin.lau@linux.dev/
every single attributes need to be initialized programs (like scx_layered)
will fail to load.
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero
05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero
05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524)
05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG --
attach to unsupported member dump of struct sched_ext_ops
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524
05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf'
05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524
Error: Failed to load BPF program
Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
sched_ext currently doesn't generate messages when the BPF scheduler is
enabled and disabled unless there are errors. It is useful to have paper
trail. Improve logging around enable/disable:
- Generate info messages on enable and non-error disable.
- Update error exit message formatting so that it's consistent with
non-error message. Also, prefix ei->msg with the BPF scheduler's name to
make it clear where the message is coming from.
- Shorten scx_exit_reason() strings for SCX_EXIT_UNREG* for brevity and
consistency.
v2: Use pr_*() instead of KERN_* consistently. (David)
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: David Vernet <void@manifault.com>
|
|
SCX_RQ_ONLINE
scx_rq_online() currently only tests SCX_RQ_ONLINE. This isn't fully correct
- e.g. consume_dispatch_q() uses task_run_on_remote_rq() which tests
scx_rq_online() to see whether the current rq can run the task, and, if so,
calls consume_remote_task() to migrate the task to @rq. While the test
itself was done while locking @rq, @rq can be temporarily unlocked by
consume_remote_task() and nothing prevents SCX_RQ_ONLINE from going offline
before the migration takes place.
To address the issue, add cpu_active() test to scx_rq_online(). There is a
synchronize_rcu() between cpu_active() being cleared and the rq going
offline, so if an on-going scheduling operation sees cpu_active(), the
associated rq is guaranteed to not go offline until the scheduling operation
is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
Acked-by: David Vernet <void@manifault.com>
|
|
process_ddsp_deferred_locals() executes deferred direct dispatches to the
local DSQs of remote CPUs. It iterates the tasks on
rq->scx.ddsp_deferred_locals list, removing and calling
dispatch_to_local_dsq() on each. However, the list is protected by the rq
lock that can be dropped by dispatch_to_local_dsq() temporarily, so the list
can be modified during the iteration, which can lead to oopses and other
failures.
Fix it by popping from the head of the list instead of iterating the list.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Acked-by: David Vernet <void@manifault.com>
|
|
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are
subtle differences. It currently open codes all the tests. This is
cumbersome to understand and error-prone in case the intersecting tests need
to be updated.
Factor out the common part - testing whether the task is allowed on the CPU
at all regardless of the CPU state - into task_allowed_on_cpu() and make
both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the
code is now linked between the two and each contains only the extra tests
that differ between them, it's less error-prone when the conditions need to
be updated. Also, improve the comment to explain why they are different.
v2: Replace accidental "extern inline" with "static inline" (Peter).
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
scx_task_iter_next_locked()
scx_task_iter_next_locked() skips tasks whose sched_class is
idle_sched_class. While it has a short comment explaining why it's testing
the sched_class directly isntead of using is_idle_task(), the comment
doesn't sufficiently explain what's going on and why. Improve the comment.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was
not available in UP, it instead called the internal balance function from
put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is
rather nasty.
Enabling sched_class->balance() on UP shouldn't cause any meaningful
overhead. Enable balance() on UP and drop the ugly workaround.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
update_curr_scx() is open coding runtime updates. Use update_curr_common()
instead and avoid unnecessary deviations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
|
|
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 31 Jul 2024 09:14:52 -1000
p->scx.disallow provides a way for the BPF scheduler to reject certain tasks
from attaching. It's currently allowed for both the load and fork paths;
however, the latter doesn't actually work as p->sched_class is already set
by the time scx_ops_init_task() is called during fork.
This is a convenience feature which is mostly useful from the load path
anyway. Allow it only from the load path.
v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit
easier for the BPF scheduler (David).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com>
Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com
Fixes: 7bb6f0810ecf ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT")
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_dump_task() uses stack_trace_save_tsk() which is only available when
CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if
the support is available and skip capturing stack trace if
!CONFIG_STACKTRACE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407161844.reewQQrR-lkp@intel.com/
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
We currently only allow calling sleepable scx kfuncs (i.e.
scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here
was that we'd never have to call scx_bpf_create_dsq() outside of a
sched_ext struct_ops callback, but that might not actually be true. For
example, a scheduler could do something like the following:
1. Open and load (not yet attach) a scheduler skel
2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space.
For example, to initialize an LLC domain, or some other global,
read-only state.
3. Attach the skel, which actually enables the scheduler
The advantage of doing this is that it can preclude having to do pretty
ugly boilerplate like initializing a read-only, statically sized array of
u64[]'s which the kernel consumes literally once at init time to then
create struct bpf_cpumask objects which are actually queried at runtime.
Doing the above is already possible given that we can invoke core BPF
kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We
already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs
(e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The type_id is defined as u32type, if(type_id<0) is invalid, hence
modified its type to s32.
./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().
While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in API forces schedulers into
work-arounds which are not straightforward or optimal such as skipping
direct dispatches in such cases.
Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().
v2: - Add missing return after queue_balance_callback() in
schedule_deferred(). (David).
- dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
and thus no longer takes @rf. Updated accordingly.
- UP build warning fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
|
|
SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to local DSQ of another CPU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
sched_ext often needs to migrate tasks across CPUs right before execution
and thus uses the balance path to dispatch tasks from the BPF scheduler.
balance_scx() is called with rq locked and pinned but is passed @rf and thus
allowed to unpin and unlock. Currently, @rf is passed down the call stack so
the rq lock is unpinned just when double locking is needed.
This creates unnecessary complications such as having to explicitly
manipulate lock pinning for core scheduling. We also want to use
dispatch_to_local_dsq_lock() from other paths which are called with rq
locked but unpinned.
rq lock handling in the dispatch path is straightforward outside the
migration implementation and extending the pinning protection down the
callstack doesn't add enough meaningful extra protection to justify the
extra complexity.
Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
and lock pinning handling from the inner functions. UP is updated to call
balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
balance_scx(). AS this makes balance_scx() unused in UP, it's put inside a
CONFIG_SMP block.
No functional changes intended outside of lock annotation updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
|
|
task_linked_on_dsq() exists as a helper because it used to test both the
rbtree and list nodes. It now only tests the list node and the list node
will soon be used for something else too. The helper doesn't improve
anything materially and the naming will become confusing. Open-code the list
node testing and remove task_linked_on_dsq()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
When a running task is migrated to another CPU, the stop_task is used to
preempt the running task and migrate it. This, expectedly, invokes
ops.cpu_release(). If the BPF scheduler then calls
scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ
including the task which is being migrated.
This creates an unnecessary re-enqueue of a task which is about to be
deactivated and re-activated for migration anyway. It can also cause
confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its
allowed CPUs may not agree while migration is pending.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Acked-by: David Vernet <void@manifault.com>
|
|
scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from
ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same
local DSQ, to avoid processing the same tasks repeatedly, it first takes the
number of queued tasks and processes the task at the head of the queue that
number of times.
This is incorrect as a task can be dispatched to the same local DSQ with
SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is
exhausted and the succeeding tasks won't be processed at all.
Fix it by first moving all candidate tasks to a private list and then
processing that list. While at it, remove the WARNs. They're rather
superflous as later steps will check them anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Acked-by: David Vernet <void@manifault.com>
|
|
DSQs are very opaque in the consumption path. The BPF scheduler has no way
of knowing which tasks are being considered and which is picked. This patch
adds BPF DSQ iterator.
- Allows iterating tasks queued on a DSQ in the dispatch order or reverse
from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs
directly.
- Has ordering guarantee where only tasks which were already queued when the
iteration started are visible and consumable during the iteration.
v5: - Add a comment to the naked list_empty(&dsq->list) test in
consume_dispatch_q() to explain the reasoning behind the lockless test
and by extension why nldsq_next_task() isn't used there.
- scx_qmap changes separated into its own patch.
v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong
type for the last argument (bool rev instead of u64 flags). Fix it.
v3: - Alexei pointed out that the iterator is too big to allocate on stack.
Added a prep patch to reduce the size of the cursor. Now
bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on
64bit.
- u32_before() comparison factored out.
v2: - scx_bpf_consume_task() is separated out into a separate patch.
- DSQ seq and iter flags don't need to be u64. Use u32.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
|
|
struct scx_dsq_node contains two data structure nodes to link the containing
task to a DSQ and a flags field that is protected by the lock of the
associated DSQ. One reason why they are grouped into a struct is to use the
type independently as a cursor node when iterating tasks on a DSQ. However,
when iterating, the cursor only needs to be linked on the FIFO list and the
rb_node part ends up inflating the size of the iterator data structure
unnecessarily making it potentially too expensive to place it on stack.
Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity
as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to
scx_dsq_list_node and the field names are renamed accordingly. This will
help implementing DSQ task iterator that can be allocated on stack.
No functional change intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: David Vernet <void@manifault.com>
|
|
- scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to
be global. Make it static.
- Relocate task_on_scx() so that the inline functions are located together.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
in effect
sched_domains regulate the load balancing for sched_classes. A machine can
be partitioned into multiple sections that are not load-balanced across
using either isolcpus= boot param or cpuset partitions. In such cases, tasks
that are in one partition are expected to stay within that partition.
cpuset configured partitions are always reflected in each member task's
cpumask. As SCX always honors the task cpumasks, the BPF scheduler is
automatically in compliance with the configured partitions.
However, for isolcpus= domain isolation, the isolated CPUs are simply
omitted from the top-level sched_domain[s] without further restrictions on
tasks' cpumasks, so, for example, a task currently running in an isolated
CPU may have more CPUs in its allowed cpumask while expected to remain on
the same CPU.
There is no straightforward way to enforce this partitioning preemptively on
BPF schedulers and erroring out after a violation can be surprising.
isolcpus= domain isolation is being replaced with cpuset partitions anyway,
so keep it simple and simply disallow loading a BPF scheduler if isolcpus=
domain isolation is in effect.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
|
|
scx_ops_enable_task()
When initializing p->scx.weight, scx_ops_enable_task() wasn't considering
whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the
source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole
user of set_task_scx_weight(). Open code it. @weight is going to be provided
by sched core in the future anyway.
v2: Use the newly available @lw->weight to set @p->scx.weight in
reweight_task_scx().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
|
|
rq contains many useful fields to implement a custom scheduler. For
example, various clock signals like clock_task and clock_pelt can be
used to track load. It also contains stats in other sched_classes, which
are useful to drive scheduling decisions in ext.
tj: Put the new helper below scx_bpf_task_*() helpers.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11
d32960528702 ("sched/fair: set_load_weight() must also call reweight_task()
for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is
called causing conflicts with e83edbf88f18 ("sched: Add
sched_class->reweight_task()"). Resolve the conflicts by taking
set_load_weight() changes from d32960528702 and updating
sched_class->reweight_task() to take pointer to struct load_weight instead
of int prio.
Signed-off-by: Tejun Heo<tj@kernel.org>
|
|
alloc_exit_info() calls kcalloc() but puts in the size of the element as the
first argument which triggers the following gcc warning:
kernel/sched/ext.c:3815:32: warning: ‘kmalloc_array_noprof’ sizes
specified with ‘sizeof’ in the earlier argument and not in the later
argument [-Wcalloc-transposed-args]
Fix it by swapping the positions of the first two arguments. No functional
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: http://lkml.kernel.org/r/ZoG6zreEtQhAUr_2@linux.ibm.com
|
|
Correct eight to weight in the description of the .set_weight()
operation in sched_ext_ops.
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative
performance target of a specified CPU. Commit d86adb4fc065 ("sched_ext: Add
cpuperf support") defined the @cpu argument to be unsigned. Let's update it
to be signed to match the norm for the rest of ext.c and the kernel.
Note that the kfunc declaration of scx_bpf_cpuperf_set() in the
common.bpf.h header in tools/sched_ext already listed the cpu as signed, so
this also fixes the build for tools/sched_ext and the sched_ext selftests
due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus
causing the compiler to error due to observing conflicting types).
Fixes: d86adb4fc065 ("sched_ext: Add cpuperf support")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.
Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:
- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
different CPUs in the system.
- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
relative to its max performance.
- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling frequency holding heuristics as it may
not apply well to sched_ext schedulers which may have a lot weaker
connection between tasks and their current / last CPU.
With cpuperf support added, there is no reason to block uclamp. Enable while
at it.
A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.
v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
to avoid factoring in stale util metric. (Christian)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Loehle <christian.loehle@arm.com>
|
|
sched_class->switch_class()
scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
a CPU is taken away by a task dispatched from a higher priority sched_class
so that the BPF scheduler can, e.g., punt the task[s] which was running or
were waiting for the CPU to other CPUs.
Replace the sched_ext specific hook scx_next_task_picked() with a new
sched_class operation switch_class().
The changes are straightforward and the code looks better afterwards.
However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook
which is unlikely to be useful to other sched_classes. For further
discussion on this subject, please refer to the following:
http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.
v6: - Add paragraph explaining debug dump.
v5: - Updated to reflect /sys/kernel interface change. Kconfig options
added.
v4: - README improved, reformatted in markdown and renamed to README.md.
v3: - Added tools/sched_ext/README.
- Dropped _example prefix from scheduler names.
v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
of them are addressed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
|
|
Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
consumed or executed earlier. While this is sufficient when dsq's are used
for simple staging areas for tasks which are ready to execute, it'd make
dsq's a lot more useful if they can implement custom ordering.
This patch adds a vtime-ordered priority queue to dsq's. When the BPF
scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
can specify the vtime tha the task should be inserted at and the task is
inserted into the priority queue in the dsq which is ordered according to
time_before64() comparison of the vtime values.
A DSQ can either be a FIFO or priority queue and automatically switches
between the two depending on whether scx_bpf_dispatch() or
scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
already has the other type queued is not allowed and triggers an ops error.
Built-in DSQs must always be FIFOs.
This makes it very easy for the BPF schedulers to implement proper vtime
based scheduling within each dsq very easy and efficient at a negligible
cost in terms of code complexity and overhead.
scx_simple and scx_example_flatcg are updated to default to weighted
vtime scheduling (the latter within each cgroup). FIFO scheduling can be
selected with -f option.
v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
led to unexpected starvations, DSQs now error out if both modes are
used at the same time and the built-in DSQs are no longer allowed to
be priority queues.
- Explicit type struct scx_dsq_node added to contain fields needed to be
linked on DSQs. This will be used to implement stateful iterator.
- Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
PRIQ mode. This confines PRIQ related complexities to the enqueue and
dequeue paths. Other paths only need to look at dsq->list. This will
also ease implementing BPF iterator.
- Print p->scx.dsq_flags in debug dump.
v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
flags in p->scx.flags. This led to flag corruption in some cases.
- Add comments explaining the interaction between using consumption of
p->scx.slice to determine vtime progress and yielding.
v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
migrations leading to some tasks being stalled for extended period of
time depending on how saturated the machine is. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
|
|
The core-sched support is composed of the following parts:
- task_struct->scx.core_sched_at is added. This is a timestamp which can be
used to order tasks. Depending on whether the BPF scheduler implements
custom ordering, it tracks either global FIFO ordering of all tasks or
local-DSQ ordering within the dispatched tasks on a CPU.
- prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
scx_prio_less() calls ops.core_sched_before() if available or uses the
core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
doesn't need to do anything. Otherwise, it should implement
ops.core_sched_before() which reflects the ordering.
- When core-sched is enabled, balance_scx() balances all SMT siblings so
that they all have tasks dispatched if necessary before pick_task_scx() is
called. pick_task_scx() picks between the current task and the first
dispatched task on the local DSQ based on availability and the
core_sched_at timestamps. Note that FIFO ordering is expected among the
already dispatched tasks whether running or on the local DSQ, so this path
always compares core_sched_at instead of calling into
ops.core_sched_before().
qmap_core_sched_before() is added to scx_qmap. It scales the
distances from the heads of the queues to compare the tasks across different
priority queues and seems to behave as expected.
v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi.
v2: Sched core added the const qualifiers to prio_less task arguments.
Explicitly drop them for ops.core_sched_before() task arguments. BPF
enforces access control through the verifier, so the qualifier isn't
actually operative and only gets in the way when interacting with
various helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Reviewed-by: Josh Don <joshdon@google.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
|
|
PM operations freeze userspace. Some BPF schedulers have active userspace
component and may misbehave as expected across PM events. While the system
is frozen, nothing too interesting is happening in terms of scheduling and
we can get by just fine with the fallback FIFO behavior. Let's make things
easier by always bypassing the BPF scheduler while PM events are in
progress.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
|
|
Add ops.cpu_online/offline() which are invoked when CPUs come online and
offline respectively. As the enqueue path already automatically bypasses
tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
to see tasks only on CPUs which are between online() and offline().
If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
scheduler is automatically exited with SCX_ECODE_RESTART |
SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support
trivially by simply reinitializing and reloading the scheduler.
scx_qmap is updated to print out online CPUs on hotplug events. Other
schedulers are updated to restart based on ecode.
v3: - The previous implementation added @reason to
sched_class.rq_on/offline() to distinguish between CPU hotplug events
and topology updates. This was buggy and fragile as the methods are
skipped if the current state equals the target state. Instead, add
scx_rq_[de]activate() which are directly called from
sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
sleep which can be useful.
- ops.dispatch() could be called on a CPU that the BPF scheduler was
told to be offline. The dispatch patch is updated to bypass in such
cases.
v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI
block and enabled eariler during scx_ope_enable() so that
cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
- Auto exit with ECODE added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events are necessary for use cases includling
strict core-scheduling and latency management.
This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
opposite for the latter. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.
scx_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.
scx_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.
v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
access control through the verifier, so the qualifier isn't actually
operative and only gets in the way when interacting with various
helpers.
v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
from ops.cpu_release() nested inside ops.init() and other sleepable
operations.
Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
the kicked cpu to enter the scheduler. See the following for example usage:
https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
- Include SCX_KICK_WAIT related information in debug dump.
Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
When some SCX operations are in flight, it is known that the subject task's
rq lock is held throughout which makes it safe to access certain fields of
the task - e.g. its current task_group. We want to add SCX kfunc helpers
that can make use of this guarantee - e.g. to help determining the currently
associated CPU cgroup from the task's current task_group.
As it'd be dangerous call such a helper on a task which isn't rq lock
protected, the helper should be able to verify the input task and reject
accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the
tasks which are currently being operated on by a terminal SCX operation. The
new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations
which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be
used by kfunc helpers to verify the input task status.
Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
is currently only limited to terminal SCX operations. If needed in the
future, this restriction can be removed by moving the tracking to the task
side with a couple per-task counters.
v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
|
|
Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into
tickless operation.
scx_central is updated to use tickless operations for all tasks and
instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
and task state tracking added by the previous patches.
Currently, there is no way to pin the timer on the central CPU, so it may
end up on one of the worker CPUs; however, outside of that, the worker CPUs
can go tickless both while running sched_ext tasks and idling.
With schbench running, scx_central shows:
root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 142024 656 664 449 Local timer interrupts
LOC: 161663 663 665 449 Local timer interrupts
Without it:
root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
LOC: 188778 3142 3793 3993 Local timer interrupts
LOC: 198993 5314 6323 6438 Local timer interrupts
While scx_central itself is too barebone to be useful as a
production scheduler, a more featureful central scheduler can be built using
the same approach. Google's experience shows that such an approach can have
significant benefits for certain applications such as VM hosting.
v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.
v3: Pin the central scheduler's timer on the central_cpu using
BPF_F_TIMER_CPU_PIN.
v2: Convert to BPF inline iterators.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
Being able to track the task runnable and running state transitions are
useful for a variety of purposes including latency tracking and load factor
calculation.
Currently, BPF schedulers don't have a good way of tracking these
transitions. Becoming runnable can be determined from ops.enqueue() but
becoming quiescent can only be inferred from the lack of subsequent enqueue.
Also, as the local dsq can have multiple tasks and some events are handled
in the sched_ext core, it's difficult to determine when a given task starts
and stops executing.
This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
.quiescent() operations to track the task runnable and running state
transitions. They're mostly self explanatory; however, we want to ensure
that running <-> stopping transitions are always contained within runnable
<-> quiescent transitions which is a bit different from how the scheduler
core behaves. This adds a bit of complication. See the comment in
dequeue_task_scx().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
The dispatch path retries if the local DSQ is still empty after
ops.dispatch() either dispatched or consumed a task. This is both out of
necessity and for convenience. It has to retry because the dispatch path
might lose the tasks to dequeue while the rq lock is released while trying
to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch()
implementation easier as it only needs to make some forward progress each
iteration.
However, this makes it possible for ops.dispatch() to stall CPUs by
repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
the watchdog or sysrq handler can't run and the system can't be saved. Let's
address the issue by breaking out of the dispatch loop after 32 iterations.
It is unlikely but not impossible for ops.dispatch() to legitimately go over
the iteration limit. We want to come back to the dispatch path in such cases
as not doing so risks stalling the CPU by idling with runnable tasks
pending. As the previous task is still current in balance_scx(),
resched_curr() doesn't do anything - it will just get cleared. Let's instead
use scx_kick_bpf() which will trigger reschedule after switching to the next
task which will likely be the idle task.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
|
|
It's often useful to wake up and/or trigger reschedule on other CPUs. This
patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
kick the target CPU into the scheduling path.
As a sched_ext task relinquishes its CPU only after its slice is depleted,
this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
slice of the target CPU's current task to guarantee that sched_ext's
scheduling path runs on the CPU.
If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
to guarantee that the target CPU will go through at least one full sched_ext
scheduling cycle after the kicking. This can be used to wake up idle CPUs
without incurring unnecessary overhead if it isn't currently idle.
As a demonstration of how backward compatibility can be supported using BPF
CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
__COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or
becomes a regular kicking otherwise. This allows schedulers to use the new
SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
temporarily use compat helpers to ease API updates and drop them after a few
kernel releases.
v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
schedulers so that they can support kernels without SCX_KICK_IDLE.
This is useful as a demonstration of how new feature flags can be
added in a backward compatible way.
- kick_cpus_irq_workfn() reimplemented so that it touches the pending
cpumasks only as necessary to reduce kicking overhead on machines with
a lot of CPUs.
- tools/sched_ext/include/scx/compat.bpf.h added.
v4: - Move example scheduler to its own patch.
v3: - Make scx_example_central switch all tasks by default.
- Convert to BPF inline iterators.
v2: - Julia Lawall reported that scx_example_central can overflow the
dispatch buffer and malfunction. As scheduling for other CPUs can't be
handled by the automatic retry mechanism, fix by implementing an
explicit overflow and retry handling.
- Updated to use generic BPF cpumask helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
|
|
If a BPF scheduler triggers an error, the scheduler is aborted and the
system is reverted to the built-in scheduler. In the process, a lot of
information which may be useful for figuring out what happened can be lost.
This patch adds debug dump which captures information which may be useful
for debugging including runqueue and runnable thread states at the time of
failure. The following shows a debug dump after triggering the watchdog:
root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
stats : enq=1 dsp=0 delta=1 deq=0
stats : enq=90 dsp=90 delta=0 deq=0
stats : enq=156 dsp=156 delta=0 deq=0
stats : enq=218 dsp=218 delta=0 deq=0
stats : enq=255 dsp=255 delta=0 deq=0
stats : enq=271 dsp=271 delta=0 deq=0
stats : enq=284 dsp=284 delta=0 deq=0
stats : enq=293 dsp=293 delta=0 deq=0
DEBUG DUMP
================================================================================
kworker/u32:12[320] triggered exit kind 1026:
runnable task stall (stress[1530] failed to run for 6.841s)
Backtrace:
scx_watchdog_workfn+0x136/0x1c0
process_scheduled_works+0x2b5/0x600
worker_thread+0x269/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30
QMAP FIFO[0]:
QMAP FIFO[1]:
QMAP FIFO[2]: 1436
QMAP FIFO[3]:
QMAP FIFO[4]:
CPU states
----------
CPU 0 : nr_run=1 ops_qseq=244
curr=swapper/0[0] class=idle_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
R stress[1530] -6841ms
scx_state/flags=3/0x1 ops_state/qseq=2/20
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff
QMAP: force_local=0
asm_sysvec_apic_timer_interrupt+0x16/0x20
CPU 2 : nr_run=2 ops_qseq=142
curr=swapper/2[0] class=idle_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
R sshd[1703] -5905ms
scx_state/flags=3/0x9 ops_state/qseq=2/88
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff
QMAP: force_local=1
__x64_sys_ppoll+0xf6/0x120
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
R fish[1539] -4141ms
scx_state/flags=3/0x9 ops_state/qseq=2/124
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff
QMAP: force_local=1
futex_wait+0x60/0xe0
do_futex+0x109/0x180
__x64_sys_futex+0x117/0x190
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
CPU 3 : nr_run=2 ops_qseq=162
curr=kworker/u32:12[320] class=ext_sched_class
QMAP: dsp_idx=1 dsp_cnt=0
*R kworker/u32:12[320] +0ms
scx_state/flags=3/0xd ops_state/qseq=0/0
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=ff
QMAP: force_local=0
scx_dump_state+0x613/0x6f0
scx_ops_error_irq_workfn+0x1f/0x40
irq_work_run_list+0x82/0xd0
irq_work_run+0x14/0x30
__sysvec_irq_work+0x40/0x140
sysvec_irq_work+0x60/0x70
asm_sysvec_irq_work+0x16/0x20
scx_watchdog_workfn+0x15f/0x1c0
process_scheduled_works+0x2b5/0x600
worker_thread+0x269/0x360
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30
R kworker/3:2[1436] +0ms
scx_state/flags=3/0x9 ops_state/qseq=2/160
sticky/holding_cpu=-1/-1 dsq_id=(n/a)
cpus=08
QMAP: force_local=0
kthread+0xeb/0x110
ret_from_fork+0x36/0x40
ret_from_fork_asm+0x1a/0x30
CPU 7 : nr_run=0 ops_qseq=76
curr=swapper/7[0] class=idle_sched_class
================================================================================
EXIT: runnable task stall (stress[1530] failed to run for 6.841s)
It shows that CPU 3 was running the watchdog when it triggered the error
condition and the scx_qmap thread has been queued on CPU 0 for over 5
seconds but failed to run. It also prints out scx_qmap specific information
- e.g. which tasks are queued on each FIFO and so on using the dump_*() ops.
This dump has proved pretty useful for developing and debugging BPF
schedulers.
Debug dump is generated automatically when the BPF scheduler exits due to an
error. The debug buffer used in such cases is determined by
sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
the available buffer, the output is truncated and marked accordingly.
Debug dump output can also be read through the sched_ext_dump tracepoint.
When read through the tracepoint, there is no length limit.
SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
is loaded. This is non-destructive - the scheduler keeps running afterwards.
The output can be read through the sched_ext_dump tracepoint.
v2: - The size of exit debug dump buffer can now be customized using
sched_ext_ops.exit_dump_len.
- sched_ext_ops.dump*() added to enable dumping of BPF scheduler
specific information.
- Tracpoint output and SysRq-D triggering added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
|