summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)AuthorFilesLines
2023-12-03sched/fair: Fix the decision for load balanceKeisuke Nishimura1-3/+7
[ Upstream commit 6d7e4782bcf549221b4ccfffec2cf4d1a473f1a3 ] should_we_balance is called for the decision to do load-balancing. When sched ticks invoke this function, only one CPU should return true. However, in the current code, two CPUs can return true. The following situation, where b means busy and i means idle, is an example, because CPU 0 and CPU 2 return true. [0, 1] [2, 3] b b i b This fix checks if there exists an idle CPU with busy sibling(s) after looking for a CPU on an idle core. If some idle CPUs with busy siblings are found, just the first one should do load-balancing. Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03sched/eevdf: Fix vruntime adjustment on reweightAbel Wu1-23/+128
[ Upstream commit eab03c23c2a162085b13200d7942fc5a00b5ccc8 ] vruntime of the (on_rq && !0-lag) entity needs to be adjusted when it gets re-weighted, and the calculations can be simplified based on the fact that re-weight won't change the w-average of all the entities. Please check the proofs in comments. But adjusting vruntime can also cause position change in RB-tree hence require re-queue to fix up which might be costly. This might be avoided by deferring adjustment to the time the entity actually leaves tree (dequeue/pick), but that will negatively affect task selection and probably not good enough either. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28sched/core: Fix RQCF_ACT_SKIP leakHao Jia1-4/+1
commit 5ebde09d91707a4a9bec1e3d213e3c12ffde348f upstream. Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning. This warning may be triggered in the following situations: CPU0 CPU1 __schedule() *rq->clock_update_flags <<= 1;* unregister_fair_sched_group() pick_next_task_fair+0x4a/0x410 destroy_cfs_bandwidth() newidle_balance+0x115/0x3e0 for_each_possible_cpu(i) *i=0* rq_unpin_lock(this_rq, rf) __cfsb_csd_unthrottle() raw_spin_rq_unlock(this_rq) rq_lock(*CPU0_rq*, &rf) rq_clock_start_loop_update() rq->clock_update_flags & RQCF_ACT_SKIP <-- raw_spin_rq_lock(this_rq) The purpose of RQCF_ACT_SKIP is to skip the update rq clock, but the update is very early in __schedule(), but we clear RQCF_*_SKIP very late, causing it to span that gap above and triggering this warning. In __schedule() we can clear the RQCF_*_SKIP flag immediately after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning. And set rq->clock_update_flags to RQCF_UPDATED to avoid rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later. Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com Reported-by: Igor Raits <igor.raits@gmail.com> Reported-by: Bagas Sanjaya <bagasdotme@gmail.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20sched: Fix stop_one_cpu_nowait() vs hotplugPeter Zijlstra4-3/+17
[ Upstream commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 ] Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) <PREEMPT: cpu_stopper_thread() MULTI_STOP_PREPARE ... __set_cpus_allowed_ptr_locked() affine_move_task() task_rq_unlock(); <PREEMPT: cpu_stopper_thread()\> ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0Qais Yousef1-17/+1
[ Upstream commit 23c9519def98ee0fa97ea5871535e9b136f522fc ] find_energy_efficient_cpu() bails out early if effective util of the task is 0 as the delta at this point will be zero and there's nothing for EAS to do. When uclamp is being used, this could lead to wrong decisions when uclamp_max is set to 0. In this case the task is capped to performance point 0, but it is actually running and consuming energy and we can benefit from EAS energy calculations. Rework the condition so that it bails out when both util and uclamp_min are 0. We can do that without needing to use uclamp_task_util(); remove it. Fixes: d81304bc6193 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition") Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230916232955.2099394-3-qyousef@layalina.io Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0Qais Yousef1-6/+5
[ Upstream commit 6b00a40147653c8ea748e8f4396510f252763364 ] When uclamp_max is being used, the util of the task could be higher than the spare capacity of the CPU, but due to uclamp_max value we force-fit it there. The way the condition for checking for max_spare_cap in find_energy_efficient_cpu() was constructed; it ignored any CPU that has its spare_cap less than or _equal_ to max_spare_cap. Since we initialize max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and hence ending up never performing compute_energy() for this cluster and missing an opportunity for a better energy efficient placement to honour uclamp_max setting. max_spare_cap = 0; cpu_cap = capacity_of(cpu) - cpu_util(p); // 0 if cpu_util(p) is high ... util_fits_cpu(...); // will return true if uclamp_max forces it to fit ... // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 if (cpu_cap > max_spare_cap) { max_spare_cap = cpu_cap; max_spare_cap_cpu = cpu; } prev_spare_cap suffers from a similar problem. Fix the logic by converting the variables into long and treating -1 value as 'not populated' instead of 0 which is a viable and correct spare capacity value. We need to be careful signed comparison is used when comparing with cpu_cap in one of the conditions. Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230916232955.2099394-2-qyousef@layalina.io Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20sched/fair: Fix cfs_rq_is_decayed() on !SMPChengming Zhou1-1/+1
[ Upstream commit c0490bc9bb62d9376f3dd4ec28e03ca0fef97152 ] We don't need to maintain per-queue leaf_cfs_rq_list on !SMP, since it's used for cfs_rq load tracking & balancing on SMP. But sched debug interface uses it to print per-cfs_rq stats. This patch fixes the !SMP version of cfs_rq_is_decayed(), so the per-queue leaf_cfs_rq_list is also maintained correctly on !SMP, to fix the warning in assert_list_leaf_cfs_rq(). Fixes: 0a00a354644e ("sched/fair: Delete useless condition in tg_unthrottle_up()") Reported-by: Leo Yu-Chi Liang <ycliang@andestech.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Leo Yu-Chi Liang <ycliang@andestech.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Closes: https://lore.kernel.org/all/ZN87UsqkWcFLDxea@swlinux02/ Link: https://lore.kernel.org/r/20230913132031.2242151-1-chengming.zhou@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20sched/topology: Fix sched_numa_find_nth_cpu() in CPU-less caseYury Norov1-1/+5
[ Upstream commit 617f2c38cb5ce60226042081c09e2ee3a90d03f8 ] When the node provided by user is CPU-less, corresponding record in sched_domains_numa_masks is not set. Trying to dereference it in the following code leads to kernel crash. To avoid it, start searching from the nearest node with CPUs. Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()") Reported-by: Yicong Yang <yangyicong@hisilicon.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20230819141239.287290-4-yury.norov@gmail.com Closes: https://lore.kernel.org/lkml/CAAH8bW8C5humYnfpW3y5ypwx0E-09A3QxFE1JFzR66v+mO4XfA@mail.gmail.com/T/ Closes: https://lore.kernel.org/lkml/ZMHSNQfv39HN068m@yury-ThinkPad/T/#mf6431cb0b7f6f05193c41adeee444bc95bf2b1c4 Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-18sched/eevdf: Fix heap corruption morePeter Zijlstra1-1/+2
Because someone is a flaming idiot... and forgot we have current as se->on_rq but not actually in the tree itself, and walking rb_parent() on an entry not in the tree is 'funky' and KASAN complains. Fixes: 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap integrity") Reported-by: 0599jiangyc@gmail.com Reported-by: Dmitry Safonov <0x7f454c46@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020 Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
2023-10-09sched/eevdf: Fix pick_eevdf()Benjamin Segall1-14/+58
The old pick_eevdf() could fail to find the actual earliest eligible deadline when it descended to the right looking for min_deadline, but it turned out that that min_deadline wasn't actually eligible. In that case we need to go back and search through any left branches we skipped looking for the actual best _eligible_ min_deadline. This is more expensive, but still O(log n), and at worst should only involve descending two branches of the rbtree. I've run this through a userspace stress test (thank you tools/lib/rbtree.c), so hopefully this implementation doesn't miss any corner cases. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
2023-10-09sched/eevdf: Fix min_deadline heap integrityPeter Zijlstra1-0/+1
Marek and Biju reported instances of: "EEVDF scheduling fail, picking leftmost" which Mike correlated with cgroup scheduling and the min_deadline heap getting corrupted; some trace output confirms: > And yeah, min_deadline is hosed somehow: > > validate_cfs_rq: --- / > __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr) > __print_se: ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31) > __print_se: ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33) > validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256 Turns out that reweight_entity(), which tries really hard to be fast, does not do the normal dequeue+update+enqueue pattern but *does* scale the deadline. However, it then fails to propagate the updated deadline value up the heap. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Biju Das <biju.das.jz@bp.renesas.com> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Tested-by: Mike Galbraith <efault@gmx.de> Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
2023-10-05cpufreq: schedutil: Update next_freq when cpufreq_limits changeXuewen Yan1-1/+2
When cpufreq's policy is 'single', there is a scenario that will cause sg_policy's next_freq to be unable to update. When the CPU's util is always max, the cpufreq will be max, and then if we change the policy's scaling_max_freq to be a lower freq, indeed, the sg_policy's next_freq need change to be the lower freq, however, because the cpu_is_busy, the next_freq would keep the max_freq. For example: The cpu7 is a single CPU: unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737 pid 4737's current affinity mask: ff pid 4737's new affinity mask: 80 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq 2301000 unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq 2171000 At this time, the sg_policy's next_freq would stay at 2301000, which is wrong. To fix this, add a check for the ->need_freq_update flag. [ mingo: Clarified the changelog. ] Co-developed-by: Guohua Yan <guohua.yan@unisoc.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Guohua Yan <guohua.yan@unisoc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
2023-10-03sched/eevdf: Fix avg_vruntime()Peter Zijlstra1-1/+9
The expectation is that placing a task at avg_vruntime() makes it eligible. Turns out there is a corner case where this is not the case. Specifically, avg_vruntime() relies on the fact that integer division is a flooring function (eg. it discards the remainder). By this property the value returned is slightly left of the true average. However! when the average is a negative (relative to min_vruntime) the effect is flipped and it becomes a ceil, with the result that the returned value is just right of the average and thus not eligible. Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-10-03sched/eevdf: Also update slice on placementPeter Zijlstra1-2/+4
Tasks that never consume their full slice would not update their slice value. This means that tasks that are spawned before the sysctl scaling keep their original (UP) slice length. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
2023-09-28sched/rt: Fix live lock between select_fallback_rq() and RT pushJoel Fernandes (Google)1-0/+1
During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up. On tracing, I found that there is a live lock happening between 2 CPUs. One CPU has an RT task running, while another CPU is being offlined which also has an RT task running. During this offlining, all threads are migrated. The migration thread is repeatedly scheduled to migrate actively running tasks on the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on only to get pushed back to the CPU being offlined. It is anyway pointless to pick CPUs for pushing tasks to if they are being offlined only to get migrated away to somewhere else. This could also add unwanted latency to this task. Fix these issues by not selecting CPUs in RT if they are not 'active' for scheduling, using the cpu_active_mask. Other parts in core.c already use cpu_active_mask to prevent tasks from being put on CPUs going offline. With this fix I ran the tests for days and could not reproduce the hang. Without the patch, I hit it in a few hours. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
2023-09-19kernel/sched: Modify initial boot task idle setupLiam R. Howlett2-1/+2
Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by setting the PF_IDLE flag later in the boot sequence. Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
2023-09-13sched/fair: Fix SMT4 group_smt_balance handlingTim Chen1-1/+11
For SMT4, any group with more than 2 tasks will be marked as group_smt_balance. Retain the behaviour of group_has_spare by marking the busiest group as the group which has the least number of idle_cpus. Also, handle rounding effect of adding (ncores_local + ncores_busy) when the local is fully idle and busy group imbalance is less than 2 tasks. Local group should try to pull at least 1 task in this case so imbalance should be set to 2 instead. Fixes: fee1759e4f04 ("sched/fair: Determine active load balance for SMT sched groups") Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
2023-09-02sched/fair: Optimize should_we_balance() for large SMT systemsShrikanth Hegde1-1/+14
should_we_balance() is called in load_balance() to find out if the CPU that is trying to do the load balance is the right one or not. With commit: b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance") the code tries to find an idle core to do the load balancing and falls back on an idle sibling CPU if there is no idle core. However, on larger SMT systems, it could be needlessly iterating to find a idle by scanning all the CPUs in an non-idle core. If the core is not idle, and first SMT sibling which is idle has been found, then its not needed to check other SMT siblings for idleness Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE. balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and same process would be repeated for CPU4 and CPU6 but this is unnecessary. Since calling is_core_idle loops through all CPU's in the SMT mask, effect is multiplied by weight of smt_mask. For example,when say 1 CPU is busy, we would skip loop for 2 CPU's and skip iterating over 8CPU's. That effect would be more in DIE/NUMA domain where there are more cores. Testing and performance evaluation ================================== The test has been done on this system which has 12 cores, i.e 24 small cores with SMT=4: lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Model name: POWER10 (architected), altivec supported Thread(s) per core: 8 Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For base tip/sched/core the time taken is collected by making the should_we_balance() noinline. time is in nanoseconds. The values are collected by running the funclatency tracer for 60 seconds. values are average of 3 such runs. This represents the expected reduced time with patch. tip/sched/core was at commit: 2f88c8e802c8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well") Results: ------------------------------------------------------------------------------ workload tip/sched/core with_patch(%gain) ------------------------------------------------------------------------------ idle system 809.3 695.0(16.45) stress ng – 12 threads -l 100 1013.5 893.1(13.49) stress ng – 24 threads -l 100 1073.5 980.0(9.54) stress ng – 48 threads -l 100 683.0 641.0(6.55) stress ng – 96 threads -l 100 2421.0 2300(5.26) stress ng – 96 threads -l 15 375.5 377.5(-0.53) stress ng – 96 threads -l 25 635.5 637.5(-0.31) stress ng – 96 threads -l 35 934.0 891.0(4.83) Ran schbench(old), hackbench and stress_ng to evaluate the workload performance between tip/sched/core and with patch. No modification to tip/sched/core TL;DR: Good improvement is seen with schbench. when hackbench and stress_ng runs for longer good improvement is seen. ------------------------------------------------------------------------------ schbench(old) tip +patch(%gain) 10 iterations sched/core ------------------------------------------------------------------------------ 1 Threads 50.0th: 8.00 9.00(-12.50) 75.0th: 9.60 9.00(6.25) 90.0th: 11.80 10.20(13.56) 95.0th: 12.60 10.40(17.46) 99.0th: 13.60 11.90(12.50) 99.5th: 14.10 12.60(10.64) 99.9th: 15.90 14.60(8.18) 2 Threads 50.0th: 9.90 9.20(7.07) 75.0th: 12.60 10.10(19.84) 90.0th: 15.50 12.00(22.58) 95.0th: 17.70 14.00(20.90) 99.0th: 21.20 16.90(20.28) 99.5th: 22.60 17.50(22.57) 99.9th: 30.40 19.40(36.18) 4 Threads 50.0th: 12.50 10.60(15.20) 75.0th: 15.30 12.00(21.57) 90.0th: 18.60 14.10(24.19) 95.0th: 21.30 16.20(23.94) 99.0th: 26.00 20.70(20.38) 99.5th: 27.60 22.50(18.48) 99.9th: 33.90 31.40(7.37) 8 Threads 50.0th: 16.30 14.30(12.27) 75.0th: 20.20 17.40(13.86) 90.0th: 24.50 21.90(10.61) 95.0th: 27.30 24.70(9.52) 99.0th: 35.00 31.20(10.86) 99.5th: 46.40 33.30(28.23) 99.9th: 89.30 57.50(35.61) 16 Threads 50.0th: 22.70 20.70(8.81) 75.0th: 30.10 27.40(8.97) 90.0th: 36.00 32.80(8.89) 95.0th: 39.60 36.40(8.08) 99.0th: 49.20 44.10(10.37) 99.5th: 64.90 50.50(22.19) 99.9th: 143.50 100.60(29.90) 32 Threads 50.0th: 34.60 35.50(-2.60) 75.0th: 48.20 50.50(-4.77) 90.0th: 59.20 62.40(-5.41) 95.0th: 65.20 69.00(-5.83) 99.0th: 80.40 83.80(-4.23) 99.5th: 102.10 98.90(3.13) 99.9th: 727.10 506.80(30.30) schbench does improve in general. There is some run to run variation with schbench. Did a validation run to confirm that trend is similar. ------------------------------------------------------------------------------ hackbench tip +patch(%gain) 20 iterations, 50000 loops sched/core ------------------------------------------------------------------------------ Process 10 groups : 11.74 11.70(0.34) Process 20 groups : 22.73 22.69(0.18) Process 30 groups : 33.39 33.40(-0.03) Process 40 groups : 43.73 43.61(0.27) Process 50 groups : 53.82 54.35(-0.98) Process 60 groups : 64.16 65.29(-1.76) thread 10 Time : 12.81 12.79(0.16) thread 20 Time : 24.63 24.47(0.65) Process(Pipe) 10 Time : 6.40 6.34(0.94) Process(Pipe) 20 Time : 10.62 10.63(-0.09) Process(Pipe) 30 Time : 15.09 14.84(1.66) Process(Pipe) 40 Time : 19.42 19.01(2.11) Process(Pipe) 50 Time : 24.04 23.34(2.91) Process(Pipe) 60 Time : 28.94 27.51(4.94) thread(Pipe) 10 Time : 6.96 6.87(1.29) thread(Pipe) 20 Time : 11.74 11.73(0.09) hackbench shows slight improvement with pipe. Slight degradation in process. ------------------------------------------------------------------------------ stress_ng tip +patch(%gain) 10 iterations 100000 cpu_ops sched/core ------------------------------------------------------------------------------ --cpu=96 -util=100 Time taken : 5.30, 5.01(5.47) --cpu=48 -util=100 Time taken : 7.94, 6.73(15.24) --cpu=24 -util=100 Time taken : 11.67, 8.75(25.02) --cpu=12 -util=100 Time taken : 15.71, 15.02(4.39) --cpu=96 -util=10 Time taken : 22.71, 22.19(2.29) --cpu=96 -util=20 Time taken : 12.14, 12.37(-1.89) --cpu=96 -util=30 Time taken : 8.76, 8.86(-1.14) --cpu=96 -util=40 Time taken : 7.13, 7.14(-0.14) --cpu=96 -util=50 Time taken : 6.10, 6.13(-0.49) --cpu=96 -util=60 Time taken : 5.42, 5.41(0.18) --cpu=96 -util=70 Time taken : 4.94, 4.94(0.00) --cpu=96 -util=80 Time taken : 4.56, 4.53(0.66) --cpu=96 -util=90 Time taken : 4.27, 4.26(0.23) Good improvement seen with 24 CPUs. In this case only one CPU is busy, and no core is idle. Decent improvement with 100% utilization case. no difference in other utilization. Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance") Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
2023-08-29sched/fair: Make update_entity_lag() staticHao Jia1-1/+1
The function update_entity_lag() is only used inside the kernel/sched/fair.c file. Make it static. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230829030325.69128-1-jiahao.os@bytedance.com
2023-08-29Merge tag 'x86-cleanups-2023-08-28' of ↵Linus Torvalds2-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 cleanups from Ingo Molnar: "The following commit deserves special mention: 22dc02f81cddd Revert "sched/fair: Move unused stub functions to header" This is in x86/cleanups, because the revert is a re-application of a number of cleanups that got removed inadvertedly" [ This also effectively undoes the amd_check_microcode() microcode declaration change I had done in my microcode loader merge in commit 42a7f6e3ffe0 ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]"). I picked the declaration change by Arnd from this branch instead, which put it in <asm/processor.h> instead of <asm/microcode.h> like I had done in my merge resolution - Linus ] * tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy() x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy() x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy() x86/qspinlock-paravirt: Fix missing-prototype warning x86/paravirt: Silence unused native_pv_lock_init() function warning x86/alternative: Add a __alt_reloc_selftest() prototype x86/purgatory: Include header for warn() declaration x86/asm: Avoid unneeded __div64_32 function definition Revert "sched/fair: Move unused stub functions to header" x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP x86/cpu: Fix amd_check_microcode() declaration
2023-08-29Merge tag 'sched-core-2023-08-28' of ↵Linus Torvalds8-895/+1092
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change is introduction of a new iteration of the SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual Deadline First") scheduler EEVDF too is a virtual-time scheduler, with two parameters (weight and relative deadline), compared to CFS that had weight only. It completely reworks the base scheduler: placement, preemption, picking -- everything LWN.net, as usual, has a terrific writeup about EEVDF: https://lwn.net/Articles/925371/ Preemption (both tick and wakeup) is driven by testing against a fresh pick. Because the tree is now effectively an interval tree, and the selection is no longer the 'leftmost' task, over-scheduling is less of a problem. A lot of the CFS heuristics are removed or replaced by more natural latency-space parameters & constructs In terms of expected performance regressions: we will and can fix everything where a 'good' workload misbehaves with the new scheduler, but EEVDF inevitably changes workload scheduling in a binary fashion, hopefully for the better in the overwhelming majority of cases, but in some cases it won't, especially in adversarial loads that got lucky with the previous code, such as some variants of hackbench. We are trying hard to err on the side of fixing all performance regressions, but we expect some inevitable post-release iterations of that process - Improve load-balancing on hybrid x86 systems: enable cluster scheduling (again) - Improve & fix bandwidth-scheduling on nohz systems - Improve bandwidth-throttling - Use lock guards to simplify and de-goto-ify control flow - Misc improvements, cleanups and fixes * tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well sched/eevdf: Curb wakeup-preemption sched: Simplify sched_core_cpu_{starting,deactivate}() sched: Simplify try_steal_cookie() sched: Simplify sched_tick_remote() sched: Simplify sched_exec() sched: Simplify ttwu() sched: Simplify wake_up_if_idle() sched: Simplify: migrate_swap_stop() sched: Simplify sysctl_sched_uclamp_handler() sched: Simplify get_nohz_timer_target() sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset sched/rt: Fix sysctl_sched_rr_timeslice intial value sched/fair: Block nohz tick_stop when cfs bandwidth in use sched, cgroup: Restore meaning to hierarchical_quota MAINTAINERS: Add Peter explicitly to the psi section sched/psi: Select KERNFS as needed sched/topology: Align group flags when removing degenerate domain sched/fair: remove util_est boosting sched/fair: Propagate enqueue flags into place_entity() ...
2023-08-17sched/eevdf: Curb wakeup-preemptionPeter Zijlstra2-0/+13
Mike and others noticed that EEVDF does like to over-schedule quite a bit -- which does hurt performance of a number of benchmarks / workloads. In particular, what seems to cause over-scheduling is that when lag is of the same order (or larger) than the request / slice then placement will not only cause the task to be placed left of current, but also with a smaller deadline than current, which causes immediate preemption. [ notably, lag bounds are relative to HZ ] Mike suggested we stick to picking 'current' for as long as it's eligible to run, giving it uninterrupted runtime until it reaches parity with the pack. Augment Mike's suggestion by only allowing it to exhaust it's initial request. One random data point: echo NO_RUN_TO_PARITY > /debug/sched/features perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000 3,723,554 context-switches ( +- 0.56% ) 9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% ) echo RUN_TO_PARITY > /debug/sched/features perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000 2,556,535 context-switches ( +- 0.51% ) 9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% ) Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
2023-08-14sched: Simplify sched_core_cpu_{starting,deactivate}()Peter Zijlstra1-15/+12
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.371787909@infradead.org
2023-08-14sched: Simplify try_steal_cookie()Peter Zijlstra1-12/+9
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.304154828@infradead.org
2023-08-14sched: Simplify sched_tick_remote()Peter Zijlstra1-23/+16
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.236247952@infradead.org
2023-08-14sched: Simplify sched_exec()Peter Zijlstra1-12/+9
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.168490417@infradead.org
2023-08-14sched: Simplify ttwu()Peter Zijlstra1-112/+109
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.101069260@infradead.org
2023-08-14sched: Simplify wake_up_if_idle()Peter Zijlstra2-14/+21
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org
2023-08-14sched: Simplify: migrate_swap_stop()Peter Zijlstra2-16/+27
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org
2023-08-14sched: Simplify sysctl_sched_uclamp_handler()Peter Zijlstra1-7/+4
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.896559109@infradead.org
2023-08-14sched: Simplify get_nohz_timer_target()Peter Zijlstra1-9/+6
Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org
2023-08-14sched/rt: sysctl_sched_rr_timeslice show default timeslice after resetCyril Hrubis1-0/+3
The sched_rr_timeslice can be reset to default by writing value that is <= 0. However after reading from this file we always got the last value written, which is not useful at all. $ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms $ cat /proc/sys/kernel/sched_rr_timeslice_ms -1 Fix this by setting the variable that holds the sysctl file value to the jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Vorel <pvorel@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
2023-08-14sched/rt: Fix sysctl_sched_rr_timeslice intial valueCyril Hrubis1-1/+1
There is a 10% rounding error in the intial value of the sysctl_sched_rr_timeslice with CONFIG_HZ_300=y. This was found with LTP test sched_rr_get_interval01: sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90 sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90 What this test does is to compare the return value from the sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and fails if they do not match. The problem it found is the intial sysctl file value which was computed as: static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE; which works fine as long as MSEC_PER_SEC is multiple of HZ, however it introduces 10% rounding error for CONFIG_HZ_300: (MSEC_PER_SEC / HZ) * (100 * HZ / 1000) (1000 / 300) * (100 * 300 / 1000) 3 * 30 = 90 This can be easily fixed by reversing the order of the multiplication and division. After this fix we get: (MSEC_PER_SEC * (100 * HZ / 1000)) / HZ (1000 * (100 * 300 / 1000)) / 300 (1000 * 30) / 300 = 100 Fixes: 975e155ed873 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds") Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Vorel <pvorel@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
2023-08-10Merge branch 'sched/eevdf' into sched/coreIngo Molnar5-648/+531
Pick up the EEVDF work into the main branch - it's looking good so far. Conflicts: kernel/sched/features.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-08-02sched/fair: Block nohz tick_stop when cfs bandwidth in usePhil Auld4-1/+81
CFS bandwidth limits and NOHZ full don't play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Currently, when presented with these conflicting requirements the scheduler is favoring nohz_full and letting the tick be stopped. However, nohz tick stopping is already best-effort, there are a number of conditions that can prevent it, whereas cfs runtime bandwidth is expected to be enforced. Make the scheduler favor bandwidth over stopping the tick by setting TICK_DEP_BIT_SCHED when the only running task is a cfs task with runtime limit enabled. We use cfs_b->hierarchical_quota to determine if the task requires the tick. Add check in pick_next_task_fair() as well since that is where we have a handle on the task that is actually going to be running. Add check in sched_can_stop_tick() to cover some edge cases such as nr_running going from 2->1 and the 1 remains the running task. Reviewed-By: Ben Segall <bsegall@google.com> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
2023-08-02sched, cgroup: Restore meaning to hierarchical_quotaPhil Auld3-8/+14
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task groups due to the previous fix simply taking the min. It should reflect a limit imposed at that level or by an ancestor. Even though cgroupv2 does not require child quota to be less than or equal to that of its ancestors the task group will still be constrained by such a quota so this should be shown here. Cgroupv1 continues to set this correctly. In both cases, add initialization when a new task group is created based on the current parent's value (or RUNTIME_INF in the case of root_task_group). Otherwise, the field is wrong until a quota is changed after creation and __cfs_schedulable() is called. Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors") Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
2023-07-31Revert "sched/fair: Move unused stub functions to header"Peter Zijlstra2-3/+5
Revert commit 7aa55f2a5902 ("sched/fair: Move unused stub functions to header"), for while it has the right Changelog, the actual patch content a revert of the previous 4 patches: f7df852ad6db ("sched: Make task_vruntime_update() prototype visible") c0bdfd72fbfb ("sched/fair: Hide unused init_cfs_bandwidth() stub") 378be384e01f ("sched: Add schedule_user() declaration") d55ebae3f312 ("sched: Hide unused sched_update_scaling()") So in effect this is a revert of a revert and re-applies those patches. Fixes: 7aa55f2a5902 ("sched/fair: Move unused stub functions to header") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-26sched/topology: Align group flags when removing degenerate domainChen Yu1-2/+1
The flags of the child of a given scheduling domain are used to initialize the flags of its scheduling groups. When the child of a scheduling domain is degenerated, the flags of its local scheduling group need to be updated to align with the flags of its new child domain. The flag SD_SHARE_CPUCAPACITY was aligned in Commit bf2dc42d6beb ("sched/topology: Propagate SMT flags when removing degenerate domain"). Further generalize this alignment so other flags can be used later, such as in cluster-based task wakeup. [1] Reported-by: Yicong Yang <yangyicong@huawei.com> Suggested-by: Ricardo Neri <ricardo.neri@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
2023-07-26sched/fair: remove util_est boostingVincent Guittot1-3/+0
There is no need to use runnable_avg when estimating util_est and that even generates wrong behavior because one includes blocked tasks whereas the other one doesn't. This can lead to accounting twice the waking task p, once with the blocked runnable_avg and another one when adding its util_est. cpu's runnable_avg is already used when computing util_avg which is then compared with util_est. In some situation, feec will not select prev_cpu but another one on the same performance domain because of higher max_util Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
2023-07-19sched/fair: Propagate enqueue flags into place_entity()Peter Zijlstra2-5/+6
This allows place_entity() to consider ENQUEUE_WAKEUP and ENQUEUE_MIGRATED. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
2023-07-19sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slicePeter Zijlstra4-10/+10
EEVDF uses this tunable as the base request/slice -- make sure the name reflects this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
2023-07-19sched/fair: Commit to EEVDFPeter Zijlstra4-450/+38
EEVDF is a better defined scheduling policy, as a result it has less heuristics/tunables. There is no compelling reason to keep CFS around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
2023-07-19sched/smp: Use lag to simplify cross-runqueue placementPeter Zijlstra1-126/+19
Using lag is both more correct and simpler when moving between runqueues. Notable, min_vruntime() was invented as a cheap approximation of avg_vruntime() for this very purpose (SMP migration). Since we now have the real thing; use it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
2023-07-19sched/fair: Commit to lag based placementPeter Zijlstra2-66/+1
Removes the FAIR_SLEEPERS code in favour of the new LAG based placement. Specifically, the whole FAIR_SLEEPER thing was a very crude approximation to make up for the lack of lag based placement, specifically the 'service owed' part. This is important for things like 'starve' and 'hackbench'. One side effect of FAIR_SLEEPER is that it caused 'small' unfairness, specifically, by always ignoring up-to 'thresh' sleeptime it would have a 50%/50% time distribution for a 50% sleeper vs a 100% runner, while strictly speaking this should (of course) result in a 33%/67% split (as CFS will also do if the sleep period exceeds 'thresh'). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
2023-07-19sched/fair: Implement an EEVDF-like scheduling policyPeter Zijlstra5-48/+304
Where CFS is currently a WFQ based scheduler with only a single knob, the weight. The addition of a second, latency oriented parameter, makes something like WF2Q or EEVDF based a much better fit. Specifically, EEVDF does EDF like scheduling in the left half of the tree -- those entities that are owed service. Except because this is a virtual time scheduler, the deadlines are in virtual time as well, which is what allows over-subscription. EEVDF has two parameters: - weight, or time-slope: which is mapped to nice just as before - request size, or slice length: which is used to compute the virtual deadline as: vd_i = ve_i + r_i/w_i Basically, by setting a smaller slice, the deadline will be earlier and the task will be more eligible and ran earlier. Tick driven preemption is driven by request/slice completion; while wakeup preemption is driven by the deadline. Because the tree is now effectively an interval tree, and the selection is no longer 'leftmost', over-scheduling is less of a problem. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
2023-07-19sched/fair: Add lag based placementPeter Zijlstra3-38/+139
With the introduction of avg_vruntime, it is possible to approximate lag (the entire purpose of introducing it in fact). Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
2023-07-19sched/fair: Remove sched_feat(START_DEBIT)Peter Zijlstra2-26/+1
With the introduction of avg_vruntime() there is no need to use worse approximations. Take the 0-lag point as starting point for inserting new tasks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
2023-07-19sched/fair: Add cfs_rq::avg_vruntimePeter Zijlstra3-20/+154
In order to move to an eligibility based scheduling policy, we need to have a better approximation of the ideal scheduler. Specifically, for a virtual time weighted fair queueing based scheduler the ideal scheduler will be the weighted average of the individual virtual runtimes (math in the comment). As such, compute the weighted average to approximate the ideal scheduler -- note that the approximation is in the individual task behaviour, which isn't strictly conformant. Specifically consider adding a task with a vruntime left of center, in this case the average will move backwards in time -- something the ideal scheduler would of course never do. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
2023-07-19Merge tag 'v6.5-rc2' into sched/core, to pick up fixesIngo Molnar2-9/+22
Sync with upstream fixes before applying EEVDF. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-18sched: add a few helpers to wake up tasks on the current cpuAndrei Vagin4-13/+28
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake up tasks on the current CPU. These two helpers are useful when the task needs to make a synchronous context switch to another task. In this context, synchronous means it wakes up the target task and falls asleep right after that. One example of such workloads is seccomp user notifies. This mechanism allows the supervisor process handles system calls on behalf of a target process. While the supervisor is handling an intercepted system call, the target process will be blocked in the kernel, waiting for a response to come back. On-CPU context switches are much faster than regular ones. Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com Signed-off-by: Kees Cook <keescook@chromium.org>