summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)AuthorFilesLines
2022-04-13sched: Teach the forced-newidle balancer about CPU affinity limitation.Sebastian Andrzej Siewior1-1/+1
commit 386ef214c3c6ab111d05e1790e79475363abaa05 upstream. try_steal_cookie() looks at task_struct::cpus_mask to decide if the task could be moved to `this' CPU. It ignores that the task might be in a migration disabled section while not on the CPU. In this case the task must not be moved otherwise per-CPU assumption are broken. Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a task can be moved. Fixes: d2dfa17bc7de6 ("sched: Trivial forced-newidle balancer") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08sched/core: Export pelt_thermal_tpQais Yousef1-0/+1
[ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ] We can't use this tracepoint in modules without having the symbol exported first, fix that. Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-08sched: Fix yet more sched_fork() racesPeter Zijlstra1-13/+21
commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream. Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-08sched/fair: Fix fault in reweight_entityTadeusz Struk1-5/+6
[ Upstream commit 13765de8148f71fa795e0a6607de37c49ea5915a ] Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-16sched: Avoid double preemption in __cond_resched_*lock*()Peter Zijlstra1-9/+3
[ Upstream commit 7e406d1ff39b8ee574036418a5043c86723170cf ] For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a preemption, no point in then calling preempt_schedule_common() *again*. Use _cond_resched() instead, since this is a NOP for the preemptible configs while it provide a preemption point for the others. Reported-by: xuhaifeng <xuhaifeng@oppo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08sched/uclamp: Fix rq->uclamp_max not set on first enqueueQais Yousef1-1/+1
[ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ] Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit d81ae8aac85c changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08preempt/dynamic: Fix setup_preempt_mode() return valueAndrew Halaney1-2/+2
[ Upstream commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86 ] __setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-01sched/scs: Reset task stack state in bringup_cpu()Mark Rutland1-4/+0
[ Upstream commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 ] To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames, and when shadow call stacks are in use. When shadow call stacks (SCS) are in use the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined than onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to so so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUS with: | while true; do | for C in /sys/devices/system/cpu/cpu*/online; do | echo 0 > $C; | echo 1 > $C; | done | done Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25sched/fair: Prevent dead task groups from regaining cfs_rq'sMathias Krause1-9/+35
[ Upstream commit b027789e5e50494c2325cc70c8642e7fd6059479 ] Kevin is reporting crashes which point to a use-after-free of a cfs_rq in update_blocked_averages(). Initial debugging revealed that we've live cfs_rq's (on_list=1) in an about to be kfree()'d task group in free_fair_sched_group(). However, it was unclear how that can happen. His kernel config happened to lead to a layout of struct sched_entity that put the 'my_q' member directly into the middle of the object which makes it incidentally overlap with SLUB's freelist pointer. That, in combination with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a reliable access violation in form of a #GP which made the UAF fail fast. Michal seems to have run into the same issue[1]. He already correctly diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") is causing the preconditions for the UAF to happen by re-adding cfs_rq's also to task groups that have no more running tasks, i.e. also to dead ones. His analysis, however, misses the real root cause and it cannot be seen from the crash backtrace only, as the real offender is tg_unthrottle_up() getting called via sched_cfs_period_timer() via the timer interrupt at an inconvenient time. When unregister_fair_sched_group() unlinks all cfs_rq's from the dying task group, it doesn't protect itself from getting interrupted. If the timer interrupt triggers while we iterate over all CPUs or after unregister_fair_sched_group() has finished but prior to unlinking the task group, sched_cfs_period_timer() will execute and walk the list of task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the dying task group. These will later -- in free_fair_sched_group() -- be kfree()'ed while still being linked, leading to the fireworks Kevin and Michal are seeing. To fix this race, ensure the dying task group gets unlinked first. However, simply switching the order of unregistering and unlinking the task group isn't sufficient, as concurrent RCU walkers might still see it, as can be seen below: CPU1: CPU2: : timer IRQ: : do_sched_cfs_period_timer(): : : : distribute_cfs_runtime(): : rcu_read_lock(); : : : unthrottle_cfs_rq(): sched_offline_group(): : : walk_tg_tree_from(…,tg_unthrottle_up,…): list_del_rcu(&tg->list); : (1) : list_for_each_entry_rcu(child, &parent->children, siblings) : : (2) list_del_rcu(&tg->siblings); : : tg_unthrottle_up(): unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; : : list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : : : : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) (3) : list_add_leaf_cfs_rq(cfs_rq); : : : : : : : : : : (4) : rcu_read_unlock(); CPU 2 walks the task group list in parallel to sched_offline_group(), specifically, it'll read the soon to be unlinked task group entry at (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of these at (3), which is the cause of the UAF later on. To prevent this additional race from happening, we need to wait until walk_tg_tree_from() has finished traversing the task groups, i.e. after the RCU read critical section ends in (4). Afterwards we're safe to call unregister_fair_sched_group(), as each new walk won't see the dying task group any more. On top of that, we need to wait yet another RCU grace period after unregister_fair_sched_group() to ensure print_cfs_stats(), which might run concurrently, always sees valid objects, i.e. not already free'd ones. This patch survives Michal's reproducer[2] for 8h+ now, which used to trigger within minutes before. [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") [peterz: shuffle code around a bit] Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()Vincent Donnefort1-0/+3
[ Upstream commit 42dc938a590c96eeb429e1830123fef2366d9c80 ] Nothing protects the access to the per_cpu variable sd_llc_id. When testing the same CPU (i.e. this_cpu == that_cpu), a race condition exists with update_top_cache_domain(). One scenario being: CPU1 CPU2 ================================================================== per_cpu(sd_llc_id, CPUX) => 0 partition_sched_domains_locked() detach_destroy_domains() cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX) per_cpu(sd_llc_id, CPUX) => 0 per_cpu(sd_llc_id, CPUX) = CPUX per_cpu(sd_llc_id, CPUX) => CPUX return false ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result is a warning triggered from ttwu_queue_wakelist(). Avoid a such race in cpus_share_cache() by always returning true when this_cpu == that_cpu. Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries") Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18kernel/sched: Fix sched_fork() access an invalid sched_task_groupZhang Qiao1-21/+22
[ Upstream commit 4ef0c5c6b5ba1f38f0ea1cedad0cad722f00c14a ] There is a small race between copy_process() and sched_fork() where child->sched_task_group point to an already freed pointer. parent doing fork() | someone moving the parent | to another cgroup -------------------------------+------------------------------- copy_process() + dup_task_struct()<1> parent move to another cgroup, and free the old cgroup. <2> + sched_fork() + __set_task_cpu()<3> + task_fork_fair() + sched_slice()<4> In the worst case, this bug can lead to "use-after-free" and cause panic as shown above: (1) parent copy its sched_task_group to child at <1>; (2) someone move the parent to another cgroup and free the old cgroup at <2>; (3) the sched_task_group and cfs_rq that belong to the old cgroup will be accessed at <3> and <4>, which cause a panic: [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0 [] Oops: 0000 [#1] SMP PTI [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1 [] RIP: 0010:sched_slice+0x84/0xc0 [] Call Trace: [] task_fork_fair+0x81/0x120 [] sched_fork+0x132/0x240 [] copy_process.part.5+0x675/0x20e0 [] ? __handle_mm_fault+0x63f/0x690 [] _do_fork+0xcd/0x3b0 [] do_syscall_64+0x5d/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x65/0xca [] RIP: 0033:0x7f04418cd7e1 Between cgroup_can_fork() and cgroup_post_fork(), the cgroup membership and thus sched_task_group can't change. So update child's sched_task_group at sched_post_fork() and move task_fork() and __set_task_cpu() (where accees the sched_task_group) from sched_fork() to sched_post_fork(). Fixes: 8323f26ce342 ("sched: Fix race in task_group") Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-19sched/scs: Reset the shadow stack when idle_task_exitWoody Lin1-0/+1
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
2021-09-12Merge tag 'sched_urgent_for_v5.15_rc1' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the idle timer expires in hardirq context, on PREEMPT_RT - Make sure the run-queue balance callback is invoked only on the outgoing CPU * tag 'sched_urgent_for_v5.15_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Prevent balance_push() on remote runqueues sched/idle: Make the idle timer expire in hard interrupt context
2021-09-09sched: Prevent balance_push() on remote runqueuesThomas Gleixner1-3/+3
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance callback after changing priorities or the scheduling class of a task. The run-queue for which the callback is invoked can be local or remote. That's not a problem for the regular rq::push_work which is serialized with a busy flag in the run-queue struct, but for the balance_push() work which is only valid to be invoked on the outgoing CPU that's wrong. It not only triggers the debug warning, but also leaves the per CPU variable push_work unprotected, which can result in double enqueues on the stop machine list. Remove the warning and validate that the function is invoked on the outgoing CPU. Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
2021-08-31Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds1-17/+92
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-30Merge tag 'sched-core-2021-08-30' of ↵Linus Torvalds1-94/+346
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ...
2021-08-30Merge branch 'core-rcu.2021.08.28a' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: "RCU changes for this cycle were: - Documentation updates - Miscellaneous fixes - Offloaded-callbacks updates - Updates to the nolibc library - Tasks-RCU updates - In-kernel torture-test updates - Torture-test scripting, perhaps most notably the pinning of torture-test guest OSes so as to force differences in memory latency. For example, in a two-socket system, a four-CPU guest OS will have one pair of its CPUs pinned to threads in a single core on one socket and the other pair pinned to threads in a single core on the other socket. This approach proved able to force race conditions that earlier testing missed. Some of these race conditions are still being tracked down" * 'core-rcu.2021.08.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (61 commits) torture: Replace deprecated CPU-hotplug functions. rcu: Replace deprecated CPU-hotplug functions rcu: Print human-readable message for schedule() in RCU reader rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCU rcu: Use per_cpu_ptr to get the pointer of per_cpu variable rcu: Remove useless "ret" update in rcu_gp_fqs_loop() rcu: Mark accesses in tree_stall.h rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stack rcu: Mark lockless ->qsmask read in rcu_check_boost_fail() srcutiny: Mark read-side data races rcu: Start timing stall repetitions after warning complete rcu: Do not disable GP stall detection in rcu_cpu_stall_reset() rcu/tree: Handle VM stoppage in stall detection rculist: Unify documentation about missing list_empty_rcu() rcu: Mark accesses to ->rcu_read_lock_nesting rcu: Weaken ->dynticks accesses and updates rcu: Remove special bit at the bottom of the ->dynticks counter rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock rcu: Fix to include first blocked task in stall warning torture: Make kvm-test-1-run-qemu.sh check for reboot loops ...
2021-08-20sched: Introduce dl_task_check_affinity() to check proposed affinityWill Deacon1-17/+29
In preparation for restricting the affinity of a task during execve() on arm64, introduce a new dl_task_check_affinity() helper function to give an indication as to whether the restricted mask is admissible for a deadline task. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lore.kernel.org/r/20210730112443.23245-10-will@kernel.org
2021-08-20sched: Allow task CPU affinity to be restricted on asymmetric systemsWill Deacon1-18/+180
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Although userspace can carefully manage the affinity masks for such tasks, one place where it is particularly problematic is execve() because the CPU on which the execve() is occurring may be incompatible with the new application image. In such a situation, it is desirable to restrict the affinity mask of the task and ensure that the new image is entered on a compatible CPU. From userspace's point of view, this looks the same as if the incompatible CPUs have been hotplugged off in the task's affinity mask. Similarly, if a subsequent execve() reverts to a compatible image, then the old affinity is restored if it is still valid. In preparation for restricting the affinity mask for compat tasks on arm64 systems without uniform support for 32-bit applications, introduce {force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict and restore the affinity mask for a task based on the compatible CPUs. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
2021-08-20sched: Split the guts of sched_setaffinity() into a helper functionWill Deacon1-48/+57
In preparation for replaying user affinity requests using a saved mask, split sched_setaffinity() up so that the initial task lookup and security checks are only performed when the request is coming directly from userspace. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-8-will@kernel.org
2021-08-20sched: Introduce task_struct::user_cpus_ptr to track requested affinityWill Deacon1-0/+20
In preparation for saving and restoring the user-requested CPU affinity mask of a task, add a new cpumask_t pointer to 'struct task_struct'. If the pointer is non-NULL, then the mask is copied across fork() and freed on task exit. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-7-will@kernel.org
2021-08-20sched: Reject CPU affinity changes based on task_cpu_possible_mask()Will Deacon1-1/+8
Reject explicit requests to change the affinity mask of a task via set_cpus_allowed_ptr() if the requested mask is not a subset of the mask returned by task_cpu_possible_mask(). This ensures that the 'cpus_mask' for a given task cannot contain CPUs which are incapable of executing it, except in cases where the affinity is forced. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-6-will@kernel.org
2021-08-20cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()Will Deacon1-2/+1
select_fallback_rq() only needs to recheck for an allowed CPU if the affinity mask of the task has changed since the last check. Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether the affinity mask was updated, and use this to elide the allowed check when the mask has been left alone. No functional change. Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
2021-08-20sched: Introduce task_cpu_possible_mask() to limit fallback rq selectionWill Deacon1-6/+3
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. On such a system, we must take care not to migrate a task to an unsupported CPU when forcefully moving tasks in select_fallback_rq() in response to a CPU hot-unplug operation. Introduce a task_cpu_possible_mask() hook which, given a task argument, allows an architecture to return a cpumask of CPUs that are capable of executing that task. The default implementation returns the cpu_possible_mask, since sane machines do not suffer from per-cpu ISA limitations that affect scheduling. The new mask is used when selecting the fallback runqueue as a last resort before forcing a migration to the first active CPU. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-2-will@kernel.org
2021-08-20sched: Cgroup SCHED_IDLE supportJosh Don1-0/+25
This extends SCHED_IDLE to cgroups. Interface: cgroup/cpu.idle. 0: default behavior 1: SCHED_IDLE Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight. There are two key differences between the per-task and per-cgroup SCHED_IDLE interface: - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves. - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity. A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing matching sched_entity in the waken preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task). For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities. The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
2021-08-20sched: Fix Core-wide rq->lock for uninitialized CPUsPeter Zijlstra1-26/+117
Eugene tripped over the case where rq_lock(), as called in a for_each_possible_cpu() loop came apart because rq->core hadn't been setup yet. This is a somewhat unusual, but valid case. Rework things such that rq->core is initialized to point at itself. IOW initialize each CPU as a single threaded Core. CPU online will then join the new CPU (thread) to an existing Core where needed. For completeness sake, have CPU offline fully undo the state so as to not presume the topology will match the next time it comes online. Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock") Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Eugene Syromiatnikov <esyr@redhat.com> Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
2021-08-17sched/core: Provide a scheduling point for RT locksThomas Gleixner1-1/+19
RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based on rtmutexes. Blocking on such a lock is similar to preemption versus: - I/O scheduling and worker handling, because these functions might block on another substituted lock, or come from a lock contention within these functions. - RCU considers this like a preemption, because the task might be in a read side critical section. Add a separate scheduling point for this, and hand a new scheduling mode argument to __schedule() which allows, along with separate mode masks, to handle this gracefully from within the scheduler, without proliferating that to other subsystems like RCU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.372319055@linutronix.de
2021-08-17sched/core: Rework the __schedule() preempt argumentThomas Gleixner1-11/+23
PREEMPT_RT needs to hand a special state into __schedule() when a task blocks on a 'sleeping' spin/rwlock. This is required to handle rcu_note_context_switch() correctly without having special casing in the RCU code. From an RCU point of view the blocking on the sleeping spinlock is equivalent to preemption, because the task might be in a read side critical section. schedule_debug() also has a check which would trigger with the !preempt case, but that could be handled differently. To avoid adding another argument and extra checks which cannot be optimized out by the compiler, the following solution has been chosen: - Replace the boolean 'preempt' argument with an unsigned integer 'sched_mode' argument and define constants to hand in: (0 == no preemption, 1 = preemption). - Add two masks to apply on that mode: one for the debug/rcu invocations, and one for the actual scheduling decision. For a non RT kernel these masks are UINT_MAX, i.e. all bits are set, which allows the compiler to optimize the AND operation out, because it is not masking out anything. IOW, it's not different from the boolean. RT enabled kernels will define these masks separately. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.315473019@linutronix.de
2021-08-17sched/wakeup: Prepare for RT sleeping spin/rwlocksThomas Gleixner1-0/+33
Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state preserving. Any wakeup which matches the state is valid. RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates an issue vs. task::__state. In order to block on the lock, the task has to overwrite task::__state and a consecutive wakeup issued by the unlocker sets the state back to TASK_RUNNING. As a consequence the task loses the state which was set before the lock acquire and also any regular wakeup targeted at the task while it is blocked on the lock. To handle this gracefully, add a 'saved_state' member to task_struct which is used in the following way: 1) When a task blocks on a 'sleeping' spinlock, the current state is saved in task::saved_state before it is set to TASK_RTLOCK_WAIT. 2) When the task unblocks and after acquiring the lock, it restores the saved state. 3) When a regular wakeup happens for a task while it is blocked then the state change of that wakeup is redirected to operate on task::saved_state. This is also required when the task state is running because the task might have been woken up from the lock wait and has not yet restored the saved state. To make it complete, provide the necessary helpers to save and restore the saved state along with the necessary documentation how the RT lock blocking is supposed to work. For non-RT kernels there is no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.258751046@linutronix.de
2021-08-17sched/wakeup: Split out the wakeup ->__state checkThomas Gleixner1-6/+18
RT kernels have a slightly more complicated handling of wakeups due to 'sleeping' spin/rwlocks. If a task is blocked on such a lock then the original state of the task is preserved over the blocking period, and any regular (non lock related) wakeup has to be targeted at the saved state to ensure that these wakeups are not lost. Once the task acquires the lock it restores the task state from the saved state. To avoid cluttering try_to_wake_up() with that logic, split the wakeup state check out into an inline helper and use it at both places where task::__state is checked against the state argument of try_to_wake_up(). No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.088945085@linutronix.de
2021-08-10sched: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior1-2/+2
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210803141621.780504-33-bigeasy@linutronix.de
2021-08-06rcu: Explain why rcu_all_qs() is a stub in preemptible TREE RCUFrederic Weisbecker1-0/+11
The cond_resched() function reports an RCU quiescent state only in non-preemptible TREE RCU implementation. This commit therefore adds a comment explaining why cond_resched() does nothing in preemptible kernels. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMSQuentin Perret1-6/+13
SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that the call must not touch scheduling parameters (nice or priority). This is particularly handy for uclamp when used in conjunction with SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only impacts uclamp values. However, sched_setattr always checks whether the priorities and nice values passed in sched_attr are valid first, even if those never get used down the line. This is useless at best since userspace can trivially bypass this check to set the uclamp values by specifying low priorities. However, it is cumbersome to do so as there is no single expression of this that skips both RT and CFS checks at once. As such, userspace needs to query the task policy first with e.g. sched_getattr and then set sched_attr.sched_priority accordingly. This is racy and slower than a single call. As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS is specified, simply inherit them in this case to match the policy inheritance of SCHED_FLAG_KEEP_POLICY. Reported-by: Wei Wang <wvw@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Link: https://lore.kernel.org/r/20210805102154.590709-3-qperret@google.com
2021-08-06sched: Fix UCLAMP_FLAG_IDLE settingQuentin Perret1-6/+19
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last uclamp active task (that is, when buckets.tasks reaches 0 for all buckets) to maintain the last uclamp.max and prevent blocked util from suddenly becoming visible. However, there is an asymmetry in how the flag is set and cleared which can lead to having the flag set whilst there are active tasks on the rq. Specifically, the flag is cleared in the uclamp_rq_inc() path, which is called at enqueue time, but set in uclamp_rq_dec_id() which is called both when dequeueing a task _and_ in the update_uclamp_active() path. As a result, when both uclamp_rq_{dec,ind}_id() are called from update_uclamp_active(), the flag ends up being set but not cleared, hence leaving the runqueue in a broken state. Fix this by clearing the flag in update_uclamp_active() as well. Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
2021-08-04sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()Quentin Perret1-0/+1
SCHED_FLAG_SUGOV is supposed to be a kernel-only flag that userspace cannot interact with. However, sched_getattr() currently reports it in sched_flags if called on a sugov worker even though it is not actually defined in a UAPI header. To avoid this, make sure to clean-up the sched_flags field in sched_getattr() before returning to userspace. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210727101103.2729607-3-qperret@google.com
2021-08-04sched: remove redundant on_rq status changeWang Hui1-2/+0
activate_task/deactivate_task will change on_rq status, no need to do it again. Signed-off-by: Wang Hui <john.wanghui@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210721091109.1406043-1-john.wanghui@huawei.com
2021-08-04sched/rt: Fix double enqueue caused by rt_effective_prioPeter Zijlstra1-55/+35
Double enqueues in rt runqueues (list) have been reported while running a simple test that spawns a number of threads doing a short sleep/run pattern while being concurrently setscheduled between rt and fair class. WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360 CPU: 3 PID: 2825 Comm: setsched__13 RIP: 0010:enqueue_task_rt+0x355/0x360 Call Trace: __sched_setscheduler+0x581/0x9d0 _sched_setscheduler+0x63/0xa0 do_sched_setscheduler+0xa0/0x150 __x64_sys_sched_setscheduler+0x1a/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40, next=ffff98679fc67ca0. kernel BUG at lib/list_debug.c:31! invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI CPU: 3 PID: 2825 Comm: setsched__13 RIP: 0010:__list_add_valid+0x41/0x50 Call Trace: enqueue_task_rt+0x291/0x360 __sched_setscheduler+0x581/0x9d0 _sched_setscheduler+0x63/0xa0 do_sched_setscheduler+0xa0/0x150 __x64_sys_sched_setscheduler+0x1a/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae __sched_setscheduler() uses rt_effective_prio() to handle proper queuing of priority boosted tasks that are setscheduled while being boosted. rt_effective_prio() is however called twice per each __sched_setscheduler() call: first directly by __sched_setscheduler() before dequeuing the task and then by __setscheduler() to actually do the priority change. If the priority of the pi_top_task is concurrently being changed however, it might happen that the two calls return different results. If, for example, the first call returned the same rt priority the task was running at and the second one a fair priority, the task won't be removed by the rt list (on_list still set) and then enqueued in the fair runqueue. When eventually setscheduled back to rt it will be seen as enqueued already and the WARNING/BUG be issued. Fix this by calling rt_effective_prio() only once and then reusing the return value. While at it refactor code as well for clarity. Concurrent priority inheritance handling is still safe and will eventually converge to a new state by following the inheritance chain(s). Fixes: 0782e63bc6fe ("sched: Handle priority boosted tasks proper in setscheduler()") [squashed Peterz changes; added changelog] Reported-by: Mark Simmons <msimmons@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com
2021-07-01cpufreq: CPPC: Add support for frequency invarianceViresh Kumar1-0/+1
The Frequency Invariance Engine (FIE) is providing a frequency scaling correction factor that helps achieve more accurate load-tracking. Normally, this scaling factor can be obtained directly with the help of the cpufreq drivers as they know the exact frequency the hardware is running at. But that isn't the case for CPPC cpufreq driver. Another way of obtaining that is using the arch specific counter support, which is already present in kernel, but that hardware is optional for platforms. This patch updates the CPPC driver to register itself with the topology core to provide its own implementation (cppc_scale_freq_tick()) of topology_scale_freq_tick() which gets called by the scheduler on every tick. Note that the arch specific counters have higher priority than CPPC counters, if available, though the CPPC driver doesn't need to have any special handling for that. On an invocation of cppc_scale_freq_tick(), we schedule an irq work (since we reach here from hard-irq context), which then schedules a normal work item and cppc_scale_freq_workfn() updates the per_cpu arch_freq_scale variable based on the counter updates since the last tick. To allow platforms to disable this CPPC counter-based frequency invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE, which is enabled by default. This also exports sched_setattr_nocheck() as the CPPC driver can be built as a module. Cc: linux-acpi@vger.kernel.org Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2021-06-28Merge tag 'timers-nohz-2021-06-28' of ↵Linus Torvalds1-1/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers/nohz updates from Ingo Molnar: - Micro-optimize tick_nohz_full_cpu() - Optimize idle exit tick restarts to be less eager - Optimize tick_nohz_dep_set_task() to only wake up a single CPU. This reduces IPIs and interruptions on nohz_full CPUs. - Optimize tick_nohz_dep_set_signal() in a similar fashion. - Skip IPIs in tick_nohz_kick_task() when trying to kick a non-running task. - Micro-optimize tick_nohz_task_switch() IRQ flags handling to reduce context switching costs. - Misc cleanups and fixes * tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add myself as context tracking maintainer tick/nohz: Call tick_nohz_task_switch() with interrupts disabled tick/nohz: Kick only _queued_ task whose tick dependency is updated tick/nohz: Change signal tick dependency to wake up CPUs of member tasks tick/nohz: Only wake up a single target cpu when kicking a task tick/nohz: Update nohz_full Kconfig help tick/nohz: Update idle_exittime on actual idle exit tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE tick/nohz: Conditionally restart tick on idle exit tick/nohz: Evaluate the CPU expression after the static key
2021-06-28Merge tag 'sched-core-2021-06-28' of ↵Linus Torvalds1-132/+1006
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler udpates from Ingo Molnar: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There are new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/doc: Update the CPU capacity asymmetry bits sched/topology: Rework CPU capacity asymmetry detection sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag psi: Fix race between psi_trigger_create/destroy sched/fair: Introduce the burstable CFS controller sched/uclamp: Fix uclamp_tg_restrict() sched/rt: Fix Deadline utilization tracking during policy change sched/rt: Fix RT utilization tracking during policy change sched: Change task_struct::state sched,arch: Remove unused TASK_STATE offsets sched,timer: Use __set_current_state() sched: Add get_current_state() sched,perf,kvm: Fix preemption condition sched: Introduce task_is_running() sched: Unbreak wakeups sched/fair: Age the average idle time sched/cpufreq: Consider reduced CPU capacity in energy calculation sched/fair: Take thermal pressure into account while estimating energy thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure sched/fair: Return early from update_tg_cfs_load() if delta == 0 ...
2021-06-28sched: Optimize housekeeping_cpumask() in for_each_cpu_and()Yuan ZhaoXiong1-2/+4
On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and the others are used for housekeeping. When many housekeeping cpus are in idle state, we can observe huge time burn in the loop for searching nearest busy housekeeper cpu by ftrace. 9) | get_nohz_timer_target() { 9) | housekeeping_test_cpu() { 9) 0.390 us | housekeeping_get_mask.part.1(); 9) 0.561 us | } 9) 0.090 us | __rcu_read_lock(); 9) 0.090 us | housekeeping_cpumask(); 9) 0.521 us | housekeeping_cpumask(); 9) 0.140 us | housekeeping_cpumask(); ... 9) 0.500 us | housekeeping_cpumask(); 9) | housekeeping_any_cpu() { 9) 0.090 us | housekeeping_get_mask.part.1(); 9) 0.100 us | sched_numa_find_closest(); 9) 0.491 us | } 9) 0.100 us | __rcu_read_unlock(); 9) + 76.163 us | } for_each_cpu_and() is a micro function, so in get_nohz_timer_target() function the for_each_cpu_and(i, sched_domain_span(sd), housekeeping_cpumask(HK_FLAG_TIMER)) equals to below: for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd), housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;) That will cause that housekeeping_cpumask() will be invoked many times. The housekeeping_cpumask() function returns a const value, so it is unnecessary to invoke it every time. This patch can minimize the worst searching time from ~76us to ~16us in my testing. Similarly, the find_new_ilb() function has the same problem. Co-developed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
2021-06-24sched/fair: Introduce the burstable CFS controllerHuaixin Chang1-6/+62
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired. We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded. Traditional (UP-EDF) bandwidth control is something like: (U = \Sum u_i) <= 1 This guaranteeds both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail. This work observes that a workload doesn't always executes the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average. That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs. At the same time, we can say that the worst case deadline miss, will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET). The benefit of burst is seen when testing with schbench. Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used. mkdir /sys/fs/cgroup/cpu/test echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10 The average CPU usage is at 80%. I run this for 10 times, and got long tail latency for 6 times and got throttled for 8 times. Tail latencies are shown below, and it wasn't the worst case. Latency percentiles (usec) 50.0000th: 19872 75.0000th: 21344 90.0000th: 22176 95.0000th: 22496 *99.0000th: 22752 99.5000th: 22752 99.9000th: 22752 min=0, max=22727 rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44% The interferenece when using burst is valued by the possibilities for missing the deadline and the average WCET. Test results showed that when there many cgroups or CPU is under utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/ Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-22sched/uclamp: Fix uclamp_tg_restrict()Qais Yousef1-31/+18
Now cpu.uclamp.min acts as a protection, we need to make sure that the uclamp request of the task is within the allowed range of the cgroup, that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX]. As reported by Xuewen [1] we can have some corner cases where there's inversion between uclamp requested by task (p) and the uclamp values of the taskgroup it's attached to (tg). Following table demonstrates 2 corner cases: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 60% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 20% -----------+-----+------+----------- With this fix we get: | p | tg | effective -----------+-----+------+----------- CASE 1 -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- CASE 2 -----------+-----+------+----------- uclamp_min | 0% | 30% | 30% -----------+-----+------+----------- uclamp_max | 20% | 50% | 30% -----------+-----+------+----------- Additionally uclamp_update_active_tasks() must now unconditionally update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for instance could have an impact on the effective UCLAMP_MIN of the tasks. | p | tg | effective -----------+-----+------+----------- old -----------+-----+------+----------- uclamp_min | 60% | 0% | 50% -----------+-----+------+----------- uclamp_max | 80% | 50% | 50% -----------+-----+------+----------- *new* -----------+-----+------+----------- uclamp_min | 60% | 0% | *60%* -----------+-----+------+----------- uclamp_max | 80% |*70%* | *70%* -----------+-----+------+----------- [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/ Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min") Reported-by: Xuewen Yan <xuewen.yan94@gmail.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
2021-06-18sched: Change task_struct::statePeter Zijlstra1-25/+28
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18sched: Add get_current_state()Peter Zijlstra1-3/+3
Remove yet another few p->state accesses. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org
2021-06-18sched: Introduce task_is_running()Peter Zijlstra1-3/+3
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-06-17sched/fair: Age the average idle timePeter Zijlstra1-0/+5
This is a partial forward-port of Peter Ziljstra's work first posted at: https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/ Currently select_idle_cpu()'s proportional scheme uses the average idle time *for when we are idle*, that is temporally challenged. When a CPU is not at all idle, we'll happily continue using whatever value we did see when the CPU goes idle. To fix this, introduce a separate average idle and age it (the existing value still makes sense for things like new-idle balancing, which happens when we do go idle). The overall goal is to not spend more time scanning for idle CPUs than we're idle for. Otherwise we're inhibiting work. This means that we need to consider the cost over all the wake-ups between consecutive idle periods. To track this, the scan cost is subtracted from the estimated average idle time. The impact of this patch is related to workloads that have domains that are fully busy or overloaded. Without the patch, the scan depth may be too high because a CPU is not reaching idle. Due to the nature of the patch, this is a regression magnet. It potentially wins when domains are almost fully busy or overloaded -- at that point searches are likely to fail but idle is not being aged as CPUs are active so search depth is too large and useless. It will potentially show regressions when there are idle CPUs and a deep search is beneficial. This tbench result on a 2-socket broadwell machine partially illustates the problem 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Hmean 1 445.02 ( 0.00%) 451.36 * 1.42%* Hmean 2 830.69 ( 0.00%) 846.03 * 1.85%* Hmean 4 1350.80 ( 0.00%) 1505.56 * 11.46%* Hmean 8 2888.88 ( 0.00%) 2586.40 * -10.47%* Hmean 16 5248.18 ( 0.00%) 5305.26 * 1.09%* Hmean 32 8914.03 ( 0.00%) 9191.35 * 3.11%* Hmean 64 10663.10 ( 0.00%) 10192.65 * -4.41%* Hmean 128 18043.89 ( 0.00%) 18478.92 * 2.41%* Hmean 256 16530.89 ( 0.00%) 17637.16 * 6.69%* Hmean 320 16451.13 ( 0.00%) 17270.97 * 4.98%* Note that 8 was a regression point where a deeper search would have helped but it gains for high thread counts when searches are useless. Hackbench is a more extreme example although not perfect as the tasks idle rapidly hackbench-process-pipes 5.13.0-rc2 5.13.0-rc2 vanilla sched-avgidle-v1r5 Amean 1 0.3950 ( 0.00%) 0.3887 ( 1.60%) Amean 4 0.9450 ( 0.00%) 0.9677 ( -2.40%) Amean 7 1.4737 ( 0.00%) 1.4890 ( -1.04%) Amean 12 2.3507 ( 0.00%) 2.3360 * 0.62%* Amean 21 4.0807 ( 0.00%) 4.0993 * -0.46%* Amean 30 5.6820 ( 0.00%) 5.7510 * -1.21%* Amean 48 8.7913 ( 0.00%) 8.7383 ( 0.60%) Amean 79 14.3880 ( 0.00%) 13.9343 * 3.15%* Amean 110 21.2233 ( 0.00%) 19.4263 * 8.47%* Amean 141 28.2930 ( 0.00%) 25.1003 * 11.28%* Amean 172 34.7570 ( 0.00%) 30.7527 * 11.52%* Amean 203 41.0083 ( 0.00%) 36.4267 * 11.17%* Amean 234 47.7133 ( 0.00%) 42.0623 * 11.84%* Amean 265 53.0353 ( 0.00%) 47.7720 * 9.92%* Amean 296 60.0170 ( 0.00%) 53.4273 * 10.98%* Stddev 1 0.0052 ( 0.00%) 0.0025 ( 51.57%) Stddev 4 0.0357 ( 0.00%) 0.0370 ( -3.75%) Stddev 7 0.0190 ( 0.00%) 0.0298 ( -56.64%) Stddev 12 0.0064 ( 0.00%) 0.0095 ( -48.38%) Stddev 21 0.0065 ( 0.00%) 0.0097 ( -49.28%) Stddev 30 0.0185 ( 0.00%) 0.0295 ( -59.54%) Stddev 48 0.0559 ( 0.00%) 0.0168 ( 69.92%) Stddev 79 0.1559 ( 0.00%) 0.0278 ( 82.17%) Stddev 110 1.1728 ( 0.00%) 0.0532 ( 95.47%) Stddev 141 0.7867 ( 0.00%) 0.0968 ( 87.69%) Stddev 172 1.0255 ( 0.00%) 0.0420 ( 95.91%) Stddev 203 0.8106 ( 0.00%) 0.1384 ( 82.92%) Stddev 234 1.1949 ( 0.00%) 0.1328 ( 88.89%) Stddev 265 0.9231 ( 0.00%) 0.0820 ( 91.11%) Stddev 296 1.0456 ( 0.00%) 0.1327 ( 87.31%) Again, higher thread counts benefit and the standard deviation shows that results are also a lot more stable when the idle time is aged. The patch potentially matters when a socket was multiple LLCs as the maximum search depth is lower. However, some of the test results were suspiciously good (e.g. specjbb2005 gaining 50% on a Zen1 machine) and other results were not dramatically different to other mcahines. Given the nature of the patch, Peter's full series is not being forward ported as each part should stand on its own. Preferably they would be merged at different times to reduce the risk of false bisections. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
2021-06-14Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar1-1/+0
This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-04sched/debug: Remove obsolete init_schedstats()Eric Dumazet1-17/+2
Revert commit 4698f88c06b8 ("sched/debug: Fix 'schedstats=enable' cmdline option"). After commit 6041186a3258 ("init: initialize jump labels before command line option parsing") we can rely on jump label infra being ready for use when setup_schedstats() is called. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20210602112108.1709635-1-eric.dumazet@gmail.com
2021-06-01sched: Don't defer CPU pick to migration_cpu_stop()Valentin Schneider1-8/+12
Will reported that the 'XXX __migrate_task() can fail' in migration_cpu_stop() can happen, and it *is* sort of a big deal. Looking at it some more, one will note there is a glaring hole in the deferred CPU selection: (w/ CONFIG_CPUSET=n, so that the affinity mask passed via taskset doesn't get AND'd with cpu_online_mask) $ taskset -pc 0-2 $PID # offline CPUs 3-4 $ taskset -pc 3-5 $PID `\ $PID may stay on 0-2 due to the cpumask_any_distribute() picking an offline CPU and __migrate_task() refusing to do anything due to cpu_is_allowed(). set_cpus_allowed_ptr() goes to some length to pick a dest_cpu that matches the right constraints vs affinity and the online/active state of the CPUs. Reuse that instead of discarding it in the affine_move_task() case. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210526205751.842360-2-valentin.schneider@arm.com