summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-08-17sched/fair: Implement ENQUEUE_DELAYEDPeter Zijlstra1-2/+31
Doing a wakeup on a delayed dequeue task is about as simple as it sounds -- remove the delayed mark and enjoy the fact it was actually still on the runqueue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
2024-08-17sched/fair: Prepare pick_next_task() for delayed dequeuePeter Zijlstra1-4/+19
Delayed dequeue's natural end is when it gets picked again. Ensure pick_next_task() knows what to do with delayed tasks. Note, this relies on the earlier patch that made pick_next_task() state invariant -- it will restart the pick on dequeue, because obviously the just dequeued task is no longer eligible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org
2024-08-17sched/fair: Prepare exit/cleanup paths for delayed_dequeuePeter Zijlstra1-13/+46
When dequeue_task() is delayed it becomes possible to exit a task (or cgroup) that is still enqueued. Ensure things are dequeued before freeing. Thanks to Valentin for asking the obvious questions and making switched_from_fair() less weird. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.631948434@infradead.org
2024-08-17sched/fair: Assert {set_next,put_prev}_entity() are properly balancedPeter Zijlstra1-0/+2
Just a little sanity test.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.486423066@infradead.org
2024-08-17sched/uclamg: Handle delayed dequeuePeter Zijlstra1-1/+15
Delayed dequeue has tasks sit around on the runqueue that are not actually runnable -- specifically, they will be dequeued the moment they get picked. One side-effect is that such a task can get migrated, which leads to a 'nested' dequeue_task() scenario that messes up uclamp if we don't take care. Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on the runqueue. This however will have removed the task from uclamp -- per uclamp_rq_dec() in dequeue_task(). So far so good. However, if at that point the task gets migrated -- or nice adjusted or any of a myriad of operations that does a dequeue-enqueue cycle -- we'll pass through dequeue_task()/enqueue_task() again. Without modification this will lead to a double decrement for uclamp, which is wrong. Reported-by: Luis Machado <luis.machado@arm.com> Reported-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
2024-08-17sched: Prepare generic code for delayed dequeuePeter Zijlstra3-1/+16
While most of the delayed dequeue code can be done inside the sched_class itself, there is one location where we do not have an appropriate hook, namely ttwu_runnable(). Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking delayed dequeue tasks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
2024-08-17sched: Split DEQUEUE_SLEEP from deactivate_task()Peter Zijlstra2-10/+27
As a preparation for dequeue_task() failing, and a second code-path needing to take care of the 'success' path, split out the DEQEUE_SLEEP path from deactivate_task(). Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING ordering fail. Fixed-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
2024-08-17sched/fair: Re-organize dequeue_task_fair()Peter Zijlstra1-21/+41
Working towards delaying dequeue, notably also inside the hierachy, rework dequeue_task_fair() such that it can 'resume' an interrupted hierarchy walk. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org
2024-08-17sched: Allow sched_class::dequeue_task() to failPeter Zijlstra7-9/+20
Change the function signature of sched_class::dequeue_task() to return a boolean, allowing future patches to 'fail' dequeue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
2024-08-17sched/fair: Unify pick_{,next_}_task_fair()Peter Zijlstra1-52/+8
Implement pick_next_task_fair() in terms of pick_task_fair() to de-duplicate the pick loop. More importantly, this makes all the pick loops use the state-invariant form, which is useful to introduce further re-try conditions in later patches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.725062368@infradead.org
2024-08-17sched/fair: Cleanup pick_task_fair()'s currPeter Zijlstra1-8/+2
With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from pick_next_entity()") curr is no longer being used, so no point in clearing it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.614707623@infradead.org
2024-08-17sched/fair: Cleanup pick_task_fair() vs throttlePeter Zijlstra1-3/+3
Per 54d27365cae8 ("sched/fair: Prevent throttling in early pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the 'if (curr)' check is to ensure the (downward) traversal does not result in an empty cfs_rq. But then the pick_task_fair() 'copy' of all this made it restart the traversal anyway, so that seems to solve the issue too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.501679876@infradead.org
2024-08-17sched/eevdf: Remove min_vruntime_copyPeter Zijlstra2-7/+2
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement") the min_vruntime_copy is no longer used. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
2024-08-17sched/eevdf: Add feature commentsPeter Zijlstra1-0/+7
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.287790895@infradead.org
2024-08-07sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()Qais Yousef13-23/+23
Some find the name realtime overloaded. Use rt_or_dl() as an alternative, hopefully better, name. Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07sched/rt, dl: Convert functions to return boolQais Yousef2-15/+9
{rt, realtime, dl}_{task, prio}() functions' return value is actually a bool. Convert their return type to reflect that. Suggested-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20240610192018.1567075-3-qyousef@layalina.io
2024-08-07sched/rt: Clean up usage of rt_task()Qais Yousef15-21/+49
rt_task() checks if a task has RT priority. But depends on your dictionary, this could mean it belongs to RT class, or is a 'realtime' task, which includes RT and DL classes. Since this has caused some confusion already on discussion [1], it seemed a clean up is due. I define the usage of rt_task() to be tasks that belong to RT class. Make sure that it returns true only for RT class and audit the users and replace the ones required the old behavior with the new realtime_task() which returns true for RT and DL classes. Introduce similar realtime_prio() to create similar distinction to rt_prio() and update the users that required the old behavior to use the new function. Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. Document the functions to make it more obvious what is the difference between them. PI-boosted tasks is a factor that must be taken into account when choosing which function to use. Rename task_is_realtime() to realtime_task_policy() as the old name is confusing against the new realtime_task(). No functional changes were intended. [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/ Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
2024-08-07sched/debug: Fix fair_server_period_max valueDan Carpenter1-1/+1
This code has an integer overflow or sign extension bug which was caught by gcc-13: kernel/sched/debug.c:341:57: error: integer overflow in expression of type 'long int' results in '-100663296' [-Werror=overflow] 341 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */ The result is that "fair_server_period_max" is set to 0xfffffffffa000000 (585 years) instead of instead of 0xfa000000 (4 seconds) that was intended. Fix this by changing the type to shift from (1 << 22) to (1UL << 22). Closes: https://lore.kernel.org/all/CA+G9fYtE2GAbeqU+AOCffgo2oH0RTJUxU+=Pi3cFn4di_KgBAQ@mail.gmail.com/ Fixes: d741f297bcea ("sched/fair: Fair server interface") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: Arnd Bergmann <arnd@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a936b991-e464-4bdf-94ab-08e25d364986@stanley.mountain
2024-08-07sched/fair: Make balance_fair() test sched_fair_runnable() instead of ↵Tejun Heo1-1/+1
rq->nr_running balance_fair() skips newidle balancing if rq->nr_running - there are already tasks on the rq, so no need to try to pull tasks. This tests the total number of queued tasks on the CPU instead of only the fair class, but is still correct as the rq can currently only have fair class tasks while balance_fair() is running. However, with the addition of sched_ext below the fair class, this will not hold anymore and make put_prev_task_balance() skip sched_ext's balance() incorrectly as, when a CPU has only lower priority class tasks, rq->nr_running would still be positive and balance_fair() would return 1 even when fair doesn't have any tasks to run. Update balance_fair() to use sched_fair_runnable() which tests rq->cfs.nr_running which is updated by bandwidth throttling. Note that pick_next_task_fair() already uses sched_fair_runnable() in its optimized path for the same purpose. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/ZrFUjlCf7x3TNXB8@slm.duckdns.org
2024-07-29sched/fair: Cleanup fair_serverPeter Zijlstra1-15/+17
The throttle interaction made my brain hurt, make it consistently about 0 transitions of h_nr_running. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29sched/rt: Remove default bandwidth controlPeter Zijlstra5-142/+120
Now that fair_server exists, we no longer need RT bandwidth control unless RT_GROUP_SCHED. Enable fair_server with parameters equivalent to RT throttling. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Fix picking of tasks for core scheduling with DL serverJoel Fernandes (Google)4-9/+47
* Use simple CFS pick_task for DL pick_task DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong because core scheduling's pick_task only calls CFS's pick_task() for evaluation / checking of the CFS task (comparing across CPUs), not for actually affirmatively picking the next task. This causes RB tree corruption issues in CFS that were found by syzbot. * Make pick_task_fair clear DL server A DL task pick might set ->dl_server, but it is possible the task will never run (say the other HT has a stop task). If the CFS task is picked in the future directly (say without DL server), ->dl_server will be set. So clear it in pick_task_fair(). This fixes the KASAN issue reported by syzbot in set_next_entity(). (DL refactoring suggestions by Vineeth Pillai). Reported-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Fix priority checking for DL server picksJoel Fernandes (Google)1-2/+21
In core scheduling, a DL server pick (which is CFS task) should be given higher priority than tasks in other classes. Not doing so causes CFS starvation. A kselftest is added later to demonstrate this. A CFS task that is competing with RT tasks can be completely starved without this and the DL server's boosting completely ignored. Fix these problems. Reported-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29sched/fair: Fair server interfaceDaniel Bristot de Oliveira4-17/+256
Add an interface for fair server setup on debugfs. Each CPU has two files under /debug/sched/fair_server/cpu{ID}: - runtime: set runtime in ns - period: set period in ns This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set bounds on admission control. The interface also add the server to the dl bandwidth accounting. Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
2024-07-29sched/deadline: Deferrable dl serverDaniel Bristot de Oliveira5-45/+298
Among the motivations for the DL servers is the real-time throttling mechanism. This mechanism works by throttling the rt_rq after running for a long period without leaving space for fair tasks. The base dl server avoids this problem by boosting fair tasks instead of throttling the rt_rq. The point is that it boosts without waiting for potential starvation, causing some non-intuitive cases. For example, an IRQ dispatches two tasks on an idle system, a fair and an RT. The DL server will be activated, running the fair task before the RT one. This problem can be avoided by deferring the dl server activation. By setting the defer option, the dl_server will dispatch an SCHED_DEADLINE reservation with replenished runtime, but throttled. The dl_timer will be set for the defer time at (period - runtime) ns from start time. Thus boosting the fair rq at defer time. If the fair scheduler has the opportunity to run while waiting for defer time, the dl server runtime will be consumed. If the runtime is completely consumed before the defer time, the server will be replenished while still in a throttled state. Then, the dl_timer will be reset to the new defer time If the fair server reaches the defer time without consuming its runtime, the server will start running, following CBS rules (thus without breaking SCHED_DEADLINE). Then the server will continue the running state (without deferring) until it fair tasks are able to execute as regular fair scheduler (end of the starvation). Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
2024-07-29sched/fair: Add trivial fair serverPeter Zijlstra4-0/+62
Use deadline servers to service fair tasks. This patch adds a fair_server deadline entity which acts as a container for fair entities and can be used to fix starvation when higher priority (wrt fair) tasks are monopolizing CPU(s). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Clear prev->dl_server in CFS pick fast pathYoussef Esmat1-0/+7
In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Youssef Esmat <youssefesmat@google.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
2024-07-29sched/core: Add clearing of ->dl_server in put_prev_task_balance()Joel Fernandes (Google)1-8/+8
Paths using put_prev_task_balance() need to do a pick shortly after. Make sure they also clear the ->dl_server on prev as a part of that. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
2024-07-29sched/deadline: Comment sched_dl_entity::dl_server variableDaniel Bristot de Oliveira1-0/+2
Add an explanation for the newly added variable. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/147f7aa8cb8fd925f36aa8059af6a35aad08b45a.1716811044.git.bristot@kernel.org
2024-07-29sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchyTianchen Ding1-13/+9
Consider the following cgroup: root | ------------------------ | | normal_cgroup idle_cgroup | | SCHED_IDLE task_A SCHED_NORMAL task_B According to the cgroup hierarchy, A should preempt B. But current check_preempt_wakeup_fair() treats cgroup se and task separately, so B will preempt A unexpectedly. Unify the wakeup logic by {c,p}se_is_idle only. This makes SCHED_IDLE of a task a relative policy that is effective only within its own cgroup, similar to the behavior of NICE. Also fix se_is_idle() definition when !CONFIG_FAIR_GROUP_SCHED. Fixes: 304000390f88 ("sched: Cgroup SCHED_IDLE support") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20240626023505.1332596-1-dtcccc@linux.alibaba.com
2024-07-29sched: remove HZ_BW feature hedgePhil Auld3-4/+2
As a hedge against unexpected user issues commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use") included a scheduler feature to disable the new functionality. It's been a few releases (v6.6) and no screams, so remove it. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com
2024-07-29sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clockChuyi Zhou2-10/+0
nr_spread_over tracks the number of instances where the difference between a scheduling entity's virtual runtime and the minimum virtual runtime in the runqueue exceeds three times the scheduler latency, indicating significant disparity in task scheduling. Commit that removed its usage: 5e963f2bd: sched/fair: Commit to EEVDF cfs_rq->exec_clock was used to account for time spent executing tasks. Commit that removed its usage: 5d69eca542ee1 sched: Unify runtime accounting across classes cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore in eevdf. Remove them from struct cfs_rq. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
2024-07-29sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable()Peilin He1-3/+15
Background ========== When repeated migrate_disable() calls are made with missing the corresponding migrate_enable() calls, there is a risk of 'migration_disabled' going upper overflow because 'migration_disabled' is a type of unsigned short whose max value is 65535. In PREEMPT_RT kernel, if 'migration_disabled' goes upper overflow, it may make the migrate_disable() ineffective within local_lock_irqsave(). This is because, during the scheduling procedure, the value of 'migration_disabled' will be checked, which can trigger CPU migration. Consequently, the count of 'rcu_read_lock_nesting' may leak due to local_lock_irqsave() and local_unlock_irqrestore() occurring on different CPUs. Usecase ======== For example, When I developed a driver, I encountered a warning like "WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315 rcu_note_context_switch+0xa8/0x4e8" warning. It took me half a month to locate this issue. Ultimately, I discovered that the lack of upper overflow detection mechanism in migrate_disable() was the root cause, leading to a significant amount of time spent on problem localization. If the upper overflow detection mechanism was added to migrate_disable(), the root cause could be very quickly and easily identified. Effect ====== Using WARN_ON_ONCE() to check if 'migration_disabled' is upper overflow can help developers identify the issue quickly. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peilin He<he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn> Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: Fan Yu <fan.yu9@zte.com.cn> Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
2024-07-29sched: Initialize the vruntime of a new task when it is first enqueuedZhang Qiao2-16/+1
When creating a new task, we initialize vruntime of the newly task at sched_cgroup_fork(). However, the timing of executing this action is too early and may not be accurate. Because it uses current CPU to init the vruntime, but the new task actually runs on the cpu which be assigned at wake_up_new_task(). To optimize this case, we pass ENQUEUE_INITIAL flag to activate_task() in wake_up_new_task(), in this way, when place_entity is called in enqueue_entity(), the vruntime of the new task will be initialized. In addition, place_entity() in task_fork_fair() was introduced for two reasons: 1. Previously, the __enqueue_entity() was in task_new_fair(), in order to provide vruntime for enqueueing the newly task, the vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was introduced by commit e9acbff6484d ("sched: introduce se->vruntime"). This is the initial state of place_entity(). 2. commit 4d78e7b656aa ("sched: new task placement for vruntime") added child_runs_first task placement feature which based on vruntime, this also requires the new task's vruntime value. After removing the child_runs_first and enqueue_entity() from task_fork_fair(), this place_entity() no longer makes sense, so remove it also. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
2024-07-29sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()Yang Yingliang1-0/+1
If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback. Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29sched/core: Introduce sched_set_rq_on/offline() helperYang Yingliang1-14/+26
Introduce sched_set_rq_on/offline() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
2024-07-29sched/smt: Fix unbalance sched_smt_present dec/incYang Yingliang1-0/+1
I got the following warn report while doing stress test: jump label: negative count! WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0 Call Trace: <TASK> __static_key_slow_dec_cpuslocked+0x16/0x70 sched_cpu_deactivate+0x26e/0x2a0 cpuhp_invoke_callback+0x3ad/0x10d0 cpuhp_thread_fun+0x3f5/0x680 smpboot_thread_fn+0x56d/0x8d0 kthread+0x309/0x400 ret_from_fork+0x41/0x70 ret_from_fork_asm+0x1b/0x30 </TASK> Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the cpu offline failed, but sched_smt_present is decremented before calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path. Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
2024-07-29sched/smt: Introduce sched_smt_present_inc/dec() helperYang Yingliang1-7/+19
Introduce sched_smt_present_inc/dec() helper, so it can be called in normal or error path simply. No functional changed. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
2024-07-29sched/cputime: Fix mul_u64_u64_div_u64() precision for cputimeZheng Zucheng1-0/+6
In extreme test scenarios: the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, utime = 18446744073709518790 ns, rtime = 135989749728000 ns In cputime_adjust() process, stime is greater than rtime due to mul_u64_u64_div_u64() precision problem. before call mul_u64_u64_div_u64(), stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. after call mul_u64_u64_div_u64(), stime = 135989949653530 unsigned reversion occurs because rtime is less than stime. utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790 Trigger condition: 1). User task run in kernel mode most of time 2). ARM64 architecture 3). TICK_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
2024-07-29Linux 6.11-rc1v6.11-rc1Linus Torvalds1-2/+2
2024-07-29Merge tag 'kbuild-fixes-v6.11' of ↵Linus Torvalds4-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix RPM package build error caused by an incorrect locale setup - Mark modules.weakdep as ghost in RPM package - Fix the odd combination of -S and -c in stack protector scripts, which is an error with the latest Clang * tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: Fix '-S -c' in x86 stack protector scripts kbuild: rpm-pkg: ghost modules.weakdep file kbuild: rpm-pkg: Fix C locale setup
2024-07-28minmax: simplify and clarify min_t()/max_t() implementationLinus Torvalds1-8/+11
This simplifies the min_t() and max_t() macros by no longer making them work in the context of a C constant expression. That means that you can no longer use them for static initializers or for array sizes in type definitions, but there were only a couple of such uses, and all of them were converted (famous last words) to use MIN_T/MAX_T instead. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28minmax: add a few more MIN_T/MAX_T usersLinus Torvalds7-10/+10
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28Merge tag 'ubifs-for-linus-6.11-rc1-take2' of ↵Linus Torvalds21-214/+135
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI and UBIFS updates from Richard Weinberger: - Many fixes for power-cut issues by Zhihao Cheng - Another ubiblock error path fix - ubiblock section mismatch fix - Misc fixes all over the place * tag 'ubifs-for-linus-6.11-rc1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: Fix ubi_init() ubiblock_exit() section mismatch ubifs: add check for crypto_shash_tfm_digest ubifs: Fix inconsistent inode size when powercut happens during appendant writing ubi: block: fix null-pointer-dereference in ubiblock_create() ubifs: fix kernel-doc warnings ubifs: correct UBIFS_DFS_DIR_LEN macro definition and improve code clarity mtd: ubi: Restore missing cleanup on ubi_init() failure path ubifs: dbg_orphan_check: Fix missed key type checking ubifs: Fix unattached inode when powercut happens in creating ubifs: Fix space leak when powercut happens in linking tmpfile ubifs: Move ui->data initialization after initializing security ubifs: Fix adding orphan entry twice for the same inode ubifs: Remove insert_dead_orphan from replaying orphan process Revert "ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path" ubifs: Don't add xattr inode into orphan area ubifs: Fix unattached xattr inode if powercut happens after deleting mtd: ubi: avoid expensive do_div() on 32-bit machines mtd: ubi: make ubi_class constant ubi: eba: properly rollback inside self_check_eba
2024-07-28kbuild: Fix '-S -c' in x86 stack protector scriptsNathan Chancellor2-2/+2
After a recent change in clang to stop consuming all instances of '-S' and '-c' [1], the stack protector scripts break due to the kernel's use of -Werror=unused-command-line-argument to catch cases where flags are not being properly consumed by the compiler driver: $ echo | clang -o - -x c - -S -c -Werror=unused-command-line-argument clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument] This results in CONFIG_STACKPROTECTOR getting disabled because CONFIG_CC_HAS_SANE_STACKPROTECTOR is no longer set. '-c' and '-S' both instruct the compiler to stop at different stages of the pipeline ('-S' after compiling, '-c' after assembling), so having them present together in the same command makes little sense. In this case, the test wants to stop before assembling because it is looking at the textual assembly output of the compiler for either '%fs' or '%gs', so remove '-c' from the list of arguments to resolve the error. All versions of GCC continue to work after this change, along with versions of clang that do or do not contain the change mentioned above. Cc: stable@vger.kernel.org Fixes: 4f7fd4d7a791 ("[PATCH] Add the -fstack-protector option to the CFLAGS") Fixes: 60a5317ff0f4 ("x86: implement x86_32 stack protector") Link: https://github.com/llvm/llvm-project/commit/6461e537815f7fa68cef06842505353cf5600e9c [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-07-28ubi: Fix ubi_init() ubiblock_exit() section mismatchRichard Weinberger1-1/+1
Since ubiblock_exit() is now called from an init function, the __exit section no longer makes sense. Cc: Ben Hutchings <bwh@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407131403.wZJpd8n2-lkp@intel.com/ Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
2024-07-28Merge tag 'v6.11-merge' of ↵Linus Torvalds5-495/+2274
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull turbostat updates from Len Brown: - Enable turbostat extensions to add both perf and PMT (Intel Platform Monitoring Technology) counters via the cmdline - Demonstrate PMT access with built-in support for Meteor Lake's Die C6 counter * tag 'v6.11-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: tools/power turbostat: version 2024.07.26 tools/power turbostat: Include umask=%x in perf counter's config tools/power turbostat: Document PMT in turbostat.8 tools/power turbostat: Add MTL's PMT DC6 builtin counter tools/power turbostat: Add early support for PMT counters tools/power turbostat: Add selftests for added perf counters tools/power turbostat: Add selftests for SMI, APERF and MPERF counters tools/power turbostat: Move verbose counter messages to level 2 tools/power turbostat: Move debug prints from stdout to stderr tools/power turbostat: Fix typo in turbostat.8 tools/power turbostat: Add perf added counter example to turbostat.8 tools/power turbostat: Fix formatting in turbostat.8 tools/power turbostat: Extend --add option with perf counters tools/power turbostat: Group SMI counter with APERF and MPERF tools/power turbostat: Add ZERO_ARRAY for zero initializing builtin array tools/power turbostat: Replace enum rapl_source and cstate_source with counter_source tools/power turbostat: Remove anonymous union from rapl_counter_info_t tools/power/turbostat: Switch to new Intel CPU model defines
2024-07-28Merge tag 'cxl-for-6.11' of ↵Linus Torvalds19-208/+438
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull CXL updates from Dave Jiang: "Core: - A CXL maturity map has been added to the documentation to detail the current state of CXL enabling. It provides the status of the current state of various CXL features to inform current and future contributors of where things are and which areas need contribution. - A notifier handler has been added in order for a newly created CXL memory region to trigger the abstract distance metrics calculation. This should bring parity for CXL memory to the same level vs hotplugged DRAM for NUMA abstract distance calculation. The abstract distance reflects relative performance used for memory tiering handling. - An addition for XOR math has been added to address the CXL DPA to SPA translation. CXL address translation did not support address interleave math with XOR prior to this change. Fixes: - Fix to address race condition in the CXL memory hotplug notifier - Add missing MODULE_DESCRIPTION() for CXL modules - Fix incorrect vendor debug UUID define Misc: - A warning has been added to inform users of an unsupported configuration when mixing CXL VH and RCH/RCD hierarchies - The ENXIO error code has been replaced with EBUSY for inject poison limit reached via debugfs and cxl-test support - Moving the PCI config read in cxl_dvsec_rr_decode() to avoid unnecessary PCI config reads - A refactor to a common struct for DRAM and general media CXL events" * tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/core/pci: Move reading of control register to immediately before usage cxl: Remove defunct code calculating host bridge target positions cxl/region: Verify target positions using the ordered target list cxl: Restore XOR'd position bits during address translation cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa() cxl/test: Replace ENXIO with EBUSY for inject poison limit reached cxl/memdev: Replace ENXIO with EBUSY for inject poison limit reached cxl/acpi: Warn on mixed CXL VH and RCH/RCD Hierarchy cxl/core: Fix incorrect vendor debug UUID define Documentation: CXL Maturity Map cxl/region: Simplify cxl_region_nid() cxl/region: Support to calculate memory tier abstract distance cxl/region: Fix a race condition in memory hotplug notifier cxl: add missing MODULE_DESCRIPTION() macros cxl/events: Use a common struct for DRAM and General Media events
2024-07-28Merge tag 'unicode-next-6.11' of ↵Linus Torvalds3-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode update from Gabriel Krisman Bertazi: "Two small fixes to silence the compiler and static analyzers tools from Ben Dooks and Jeff Johnson" * tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: add MODULE_DESCRIPTION() macros unicode: make utf8 test count static
2024-07-28kbuild: rpm-pkg: ghost modules.weakdep fileJose Ignacio Tornos Martinez1-1/+1
In the same way as for other similar files, mark as ghost the new file generated by depmod for configured weak dependencies for modules, modules.weakdep, so that although it is not included in the package, claim the ownership on it. Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>