kernel/linux.git/include/linux/sched/ext.h, branch v7.2-rc1

Merge branch 'for-7.1-fixes' into for-7.2

2026-05-10T22:51:09+00:00

Conflict between: [1] 41e3312861ea ("sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup") [2] c941d7391f25 ("sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN") in scx_root_enable_workfn()'s post-init block. [1] added a tid hash insertion under a scoped_guard() for scx_tasks_lock; [2] wraps the same region in task_rq_lock() for a DEAD recheck. A naive merge would invert the iter's outer/inner order. [3] f25ad1e3cbaa ("sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn()") was added to for-7.2 for a clean resolution: scx_task_iter_relock(iter, p) takes both scx_tasks_lock and @p's rq lock in iter order. Resolved by routing both sides through [3]'s dual-lock helper: the post-init region runs under a single scx_task_iter_relock() acquisition, with [2]'s state machine and [1]'s hash insert in sequence inside it. Signed-off-by: Tejun Heo

sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN

2026-05-10T20:08:16+00:00

scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before @p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL, exit_task), or observes SCX_TASK_NONE during the unlocked init window and skips cleanup so exit_task() never runs. Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN -> INIT -> READY. sched_ext_dead() that wins the rq-lock race observes INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck unwinds via scx_sub_init_cancel_task(). scx_fork() runs single-threaded against sched_ext_dead() (the task is not on scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure. The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s migration writes INIT_BEGIN as a synthetic predecessor to satisfy the tightened verification. The sub-sched paths still race with sched_ext_dead() during the unlocked init window. This will be fixed by the next patch. Reported-by: zhidao su Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state

2026-05-10T20:08:16+00:00

SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup task iteration would skip them. This can be expressed better with a task state. Replace the flag with SCX_TASK_DEAD. scx_disable_and_exit_task() resets state to NONE on its way out, so sched_ext_dead() now sets DEAD after the wrapper returns. The validation matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's predecessor to INIT or ENABLED so the new DEAD value cannot silently transition to READY. Prepares for the following enable vs dead race fix. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()

2026-05-04T19:06:03+00:00

scx_task_iter's cgroup-scoped mode can return tasks whose sched_ext_dead() has already completed: cgroup_task_dead() removes from cset->tasks after sched_ext_dead() in finish_task_switch() and is irq-work deferred on PREEMPT_RT. The global mode is fine - sched_ext_dead() removes from scx_tasks via list_del_init() first. Callers (sub-sched enable prep/abort/apply, scx_sub_disable(), scx_fail_parent()) assume returned tasks are still on @sch and trip WARN_ON_ONCE() or operate on torn-down state otherwise. Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and have scx_task_iter_next_locked() skip flagged tasks under the same lock. Setter and reader serialize on the per-task rq lock - no race. Signed-off-by: Tejun Heo

sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup

2026-04-20T16:55:33+00:00

BPF schedulers that can't hold task_struct pointers (arena-backed ones in particular) key tasks by pid. During exit, pid is released before the task finishes passing through scheduler callbacks, so a dying task becomes invisible to the BPF side mid-schedule. scx_qmap hits this: an exiting task's dispatch callback can't recover its queue entry, stalling dispatch until SCX_EXIT_ERROR_STALL. Add a unique non-zero u64 p->scx.tid assigned at fork that survives the full task lifetime including exit. scx_bpf_tid_to_task() looks up the task; unlike bpf_task_from_pid(), it handles exiting tasks. The lookup costs an rhashtable insert/remove under scx_tasks_lock, so root schedulers opt in via SCX_OPS_TID_TO_TASK. Sub-schedulers that set the flag to declare a dependency are rejected at attach if root didn't opt in. scx_qmap converted: keys tasks by tid and enables SCX_OPS_ENQ_EXITING. Pre-patch it stalls within seconds under a non-leader-exec workload; with the patch it runs cleanly. v3: Warn on rhashtable_lookup_insert_fast() failure via new scx_tid_hash_insert() helper (Cheng-Yang Chou). v2: Guard scx_root deref in scx_bpf_tid_to_task() error path. The kfunc is registered via scx_kfunc_set_any and reachable from tracing and syscall programs when no scheduler is attached (Cheng-Yang Chou). Signed-off-by: Tejun Heo Reviewed-by: Cheng-Yang Chou Reviewed-by: Andrea Righi

sched_ext: Remove runtime kfunc mask enforcement

2026-04-10T17:54:06+00:00

Now that scx_kfunc_context_filter enforces context-sensitive kfunc restrictions at BPF load time, the per-task runtime enforcement via scx_kf_mask is redundant. Remove it entirely: - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along with the higher_bits()/highest_bit() helpers they used. - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET macros and update every call site. Reflow call sites that were wrapped only to fit the old 5-arg form and now collapse onto a single line under ~100 cols. - Remove the in-kfunc scx_kf_allowed() runtime checks from scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call guard inside select_cpu_from_kfunc(). scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already cleaned up in the "drop redundant rq-locked check" patch. scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" patch. No further changes to those helpers here. Co-developed-by: Juntong Deng Signed-off-by: Juntong Deng Signed-off-by: Cheng-Yang Chou Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Implement SCX_ENQ_IMMED

2026-03-13T19:43:22+00:00

Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class. rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered. SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ. This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use. v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded. - Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue(). - Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT). - Misc renames, documentation. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Add SCX_TASK_REENQ_REASON flags

2026-03-07T15:29:50+00:00

SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of p->scx.flags to communicate the reason during ops.enqueue(): - NONE: Not being reenqueued - KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends More reasons will be added. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Simplify task state handling

2026-03-07T15:29:50+00:00

Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with unshifted values and then shifted when stored in scx_entity.flags. Simplify by defining them as pre-shifted values directly in scx_ent_flags and removing the separate scx_task_state enum. This removes the need for shifting when reading/writing state values. scx_get_task_state() now returns the masked flags value directly. scx_set_task_state() accepts the pre-shifted state value. scx_dump_task() shifts down for display to maintain readable output. No functional changes. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi

sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs

2026-03-07T15:29:50+00:00

scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling. Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi