Age | Commit message (Collapse) | Author | Files | Lines |
|
TASK_DEAD is special in that the final schedule call from do_exit()
must be done with preemption disabled.
This means we end up scheduling with a preempt_count() higher than
usual (3 instead of the 'expected' 2).
Since future patches will want to rely on an invariant
preempt_count() value during schedule, fix this up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
So the problem this patch is trying to address is as follows:
CPU0 CPU1
context_switch(A, B)
ttwu(A)
LOCK A->pi_lock
A->on_cpu == 0
finish_task_switch(A)
prev_state = A->state <-.
WMB |
A->on_cpu = 0; |
UNLOCK rq0->lock |
| context_switch(C, A)
`-- A->state = TASK_DEAD
prev_state == TASK_DEAD
put_task_struct(A)
context_switch(A, C)
finish_task_switch(A)
A->state == TASK_DEAD
put_task_struct(A)
The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.
Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.
We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.
The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single bug fix for the scheduler to prevent dequeueing of the idle
task when setting the cpus allowed mask"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix crash trying to dequeue/enqueue the idle thread
|
|
The 'sched_domain_topology' variable is only used within kernel/sched/core.c.
Make it static.
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442918939-9907-1-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The return value of (do_)balance_runtime() is not consumed by anybody.
Make them return void.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Move dl_time_before() static definition in include/linux/sched/deadline.h
so that it can be used by different parties without being re-defined.
Reported-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
__wake_up_locked_key"
This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
fs/userfaultfd.c to use the old version of that function.
It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull KVM fixes from Paolo Bonzini:
"Mostly stable material, a lot of ARM fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
sched: access local runqueue directly in single_task_running
arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS'
arm64: KVM: Remove all traces of the ThumbEE registers
arm: KVM: Disable virtual timer even if the guest is not using it
arm64: KVM: Disable virtual timer even if the guest is not using it
arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources
KVM: s390: Replace incorrect atomic_or with atomic_andnot
arm: KVM: Fix incorrect device to IPA mapping
arm64: KVM: Fix user access for debug registers
KVM: vmx: fix VPID is 0000H in non-root operation
KVM: add halt_attempted_poll to VCPU stats
kvm: fix zero length mmio searching
kvm: fix double free for fast mmio eventfd
kvm: factor out core eventfd assign/deassign logic
kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
KVM: make the declaration of functions within 80 characters
KVM: arm64: add workaround for Cortex-A57 erratum #852523
KVM: fix polling for guest halt continued even if disable it
arm/arm64: KVM: Fix PSCI affinity info return value for non valid cores
arm64: KVM: set {v,}TCR_EL2 RES1 bits
...
|
|
Commit 2ee507c47293 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id. When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the currrent task is
bound to the local cpu (e.g. kernel worker).
With commit f78195129963 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.
To avoid adding preemption in that cases, as it would limit the usefulness,
we change single_task_running to access directly the cpu local runqueue.
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: 2ee507c472939db4b146d545352b8a7c79ef47f8
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The group_classify() function does not use the "env" parameter, so remove it.
Also unify code to always use group_classify() to calculate group's
load type.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314605-14838-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Macro LOAD_AVG_MAX is defined far away from the precompuated tables
for decay calculation in code; So explicitly comments for this.
Also fix one typo: s/LOAD_MAX_AVG/LOAD_AVG_MAX.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442314657-14949-1-git-send-email-leo.yan@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.
It will skip over arbitarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.
This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (eg. they are still initializing their data).
Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.
While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Most of the policy-tests are done via the <class>_policy() helpers with
the notable exception of idle. A new wrapper for valid_policy() has also
been added to improve readability in set_load_weight().
This commit does not change the logical behavior of the scheduler core.
Signed-off-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.us
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Sasha reports that his virtual machine tries to schedule the idle
thread since commit 6c37067e2786 ("sched: Change the
sched_class::set_cpus_allowed() calling context").
Hit trace shows this happening from idle_thread_get()->init_idle(),
which is the _second_ init_idle() invocation on that task_struct, the
first being done through idle_init()->fork_idle(). (this code is
insane...)
Because we call init_idle() twice in a row, its ->sched_class ==
&idle_sched_class and ->on_rq = TASK_ON_RQ_QUEUED. This means
do_set_cpus_allowed() think we're queued and will call dequeue_task(),
which is implemented with BUG() for the idle class, seeing how
dequeueing the idle task is a daft thing.
Aside of the whole insanity of calling init_idle() _twice_, change the
code to call set_cpus_allowed_common() instead as this is 'obviously'
before the idle task gets ran etc..
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6c37067e2786 ("sched: Change the sched_class::set_cpus_allowed() calling context")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"A migrate_tasks() locking fix, and a late-coming nohz change plus a
nohz debug check"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: 'Annotate' migrate_tasks()
nohz: Assert existing housekeepers when nohz full enabled
nohz: Affine unpinned timers to housekeepers
|
|
Currently the load_{sum,avg} and util_{sum,avg} tracking is asymmetric
in that load tracking gets a 2^10 unit from the weight, but util gets
no such factor.
This results in more lost bits for util scaling and asymmetric scaling
rules.
Fix this by removing shifts, such that we gain the 2^10 factor from
scaling. There is no risk of overflowing the u32 as the max value is
now LOAD_AVG_MAX << 10, which is still well below UINT_MAX.
This further entangles the assumption that both LOAD and CAPACITY
shifts are the same (and 10) so put in an assertion for that.
This fixes the math for the LOAD_RESOLUTION != 0 case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDA2E9.8040900@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Prior to this patch; the line:
scaled_delta_w = (delta_w * 1024) >> 10;
which is the result of the default arch_scale_freq_capacity()
function, turns into:
1b03: 49 89 d1 mov %rdx,%r9
1b06: 49 c1 e1 0a shl $0xa,%r9
1b0a: 49 c1 e9 0a shr $0xa,%r9
Which is silly; when made unsigned int, GCC recognises this as
pointless ops and fails to emit them (confirmed on 4.9.3 and 5.1.1).
Furthermore, afaict unsigned is actually the correct type for these
fields anyway, as we've explicitly ruled out negative delta's earlier
in this function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Rename scale() to cap_scale() to better reflect its purpose, it is
after all not a general purpose scale function, it has
SCHED_CAPACITY_SHIFT hardcoded in it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Task load or utilization is not currently considered in
select_task_rq_fair(), but if we want that in the future we should make
sure it is not zero for new tasks.
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steve Muckle <steve.muckle@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDAF43.30500@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Use the advent of the per-entity load tracking rewrite to streamline the
naming of utilization related data and functions by using
{prefix_}util{_suffix} consistently. Moreover call both signals
({se,cfs}.avg.util_avg) utilization.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per seconds) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.
Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.
With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.
In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.
While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Apply frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the CPU
is running slower which affects load-balancing decisions.
Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.
Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Variable sched_numa_balancing toggles numa_balancing feature. Hence
moving from a simple read mostly variable to a more apt static_branch.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439310261-16124-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Variable sched_numa_balancing is available for both CONFIG_SCHED_DEBUG
and !CONFIG_SCHED_DEBUG. All code paths now check for
sched_numa_balancing. Hence remove sched_feat(NUMA).
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit 2a1ed24 ("sched/numa: Prefer NUMA hotness over cache hotness")
sets sched feature NUMA to true. However this can enable NUMA hinting
faults on a UMA system.
This commit ensures that NUMA hinting faults occur only on a NUMA system
by setting/resetting sched_numa_balancing.
This commit:
- Makes sched_numa_balancing common to CONFIG_SCHED_DEBUG and
!CONFIG_SCHED_DEBUG. Earlier it was only in !CONFIG_SCHED_DEBUG.
- Checks for sched_numa_balancing instead of sched_feat(NUMA).
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Simple rename of the 'numabalancing_enabled' variable to 'sched_numa_balancing'.
No functional changes.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439290813-6683-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Since commit:
d4573c3e1c99 ("sched: Improve load balancing in the presence of idle CPUs")
the ILB CPU starts with the idle load balancing of other idle CPUs and
finishes with itself in order to speed up the spread of tasks in all
idle CPUs.
The this_rq->next_balance is still used in nohz_idle_balance() as an
intermediate step to gather the shortest next balance before updating
nohz.next_balance. But the former has not been updated yet and is likely to
be set with the current jiffies. As a result, the nohz.next_balance will be
set with current jiffies instead of the real next balance date. This
generates spurious kicks of nohz ilde balance.
nohz_idle_balance() must set the nohz.next_balance without taking into
account this_rq->next_balance which is not updated yet. Then, this_rq will
update nohz.next_update with its next_balance once updated and if necessary.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438595750-20455-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
cgroup_exit() is not called from copy_process() after commit:
e8604cb43690 ("cgroup: fix spurious lockdep warning in cgroup_exit()")
from do_exit(). So this check is useless and the comment is obsolete.
Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The previous patches made the second argument go unused, remove it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
By observing that switched_from_fair() detaches from a runqueue, and
switched_to_fair() attaches to a runqueue, we can see that
task_move_group_fair() is one followed by the other with flipping the
runqueue in between.
Therefore extract all the common bits and implement all three
functions in terms of them.
This should fix a few corner cases wrt. vruntime normalization; where,
when we take a task off of a runqueue we convert to an approximation
of lag by subtracting min_vruntime, and when placing a task on the a
runqueue to the reverse.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-6-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In case there are problems with the aging on attach, provide a debug
knob to turn it off.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Where switched_from_fair() will remove the entity's load from the
runqueue, switched_to_fair() does not currently add it back. This
means that when a task leaves the fair class for a short duration; say
because of PI; we loose its load contribution.
This can ripple forward and disturb the load tracking because other
operations (enqueue, dequeue) assume its factored in. Only once the
runqueue empties will the load tracking recover.
When we add it back in, age the per entity average to match up with
the runqueue age. This has the obvious problem that if the task leaves
the fair class for a significant time, the load will age to 0.
Employ the normal migration rule for inter-runqueue moves in
task_move_group_fair(). Again, there is the obvious problem of the
task migrating while not in the fair class.
The alternative solution would be to to omit the chunk in
attach_entity_load_avg(), which would effectively reset the timestamp
and use whatever avg there was.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog and comments. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-5-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
runqueue
Since we attach the entity load to the new runqueue, we should also
detatch the entity load from the old runqueue, otherwise load can
accumulate.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-4-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
to the runqueue
Currently we conditionally add the entity load to the rq when moving
the task between cgroups.
This doesn't make sense as we always 'migrate' the task between
cgroups, so we should always migrate the load too.
[ The history here is that we used to only migrate the blocked load
which was only meaningfull when !queued. ]
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
the runqueue
Currently we open-code the addition/subtraction of the per entity load
to/from the runqueue, factor this out into helper functions.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Rewrote the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1440069720-27038-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Kernel testing triggered this warning:
| WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
| Modules linked in:
| CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
| Call Trace:
| dump_stack+0x4b/0x75
| warn_slowpath_common+0x8b/0xc0
| warn_slowpath_null+0x22/0x30
| do_set_cpus_allowed+0x7e/0x80
| cpuset_cpus_allowed_fallback+0x7c/0x170
| select_fallback_rq+0x221/0x280
| migration_call+0xe3/0x250
| notifier_call_chain+0x53/0x70
| __raw_notifier_call_chain+0x1e/0x30
| cpu_notify+0x28/0x50
| take_cpu_down+0x22/0x40
| multi_cpu_stop+0xd5/0x140
| cpu_stopper_thread+0xbc/0x170
| smpboot_thread_fn+0x174/0x2f0
| kthread+0xc4/0xe0
| ret_from_kernel_thread+0x21/0x30
As Peterz pointed out:
| So the normal rules for changing task_struct::cpus_allowed are holding
| both pi_lock and rq->lock, such that holding either stabilizes the mask.
|
| This is so that wakeup can happen without rq->lock and load-balance
| without pi_lock.
|
| From this we already get the relaxation that we can omit acquiring
| rq->lock if the task is not on the rq, because in that case
| load-balancing will not apply to it.
|
| ** these are the rules currently tested in do_set_cpus_allowed() **
|
| Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
| unconditionally acquires both locks, we could get away with holding just
| rq->lock when on_rq for modification because that'd still exclude
| __set_cpus_allowed_ptr(), it would also work against
| __kthread_bind_mask() because that assumes !on_rq.
|
| That said, this is all somewhat fragile.
|
| Now, I don't think dropping rq->lock is quite as disastrous as it
| usually is because !cpu_active at this point, which means load-balance
| will not interfere, but that too is somewhat fragile.
|
| So we end up with a choice of two fragile..
This patch fixes it by following the rules for changing
task_struct::cpus_allowed with both pi_lock and rq->lock held.
Reported-by: kernel test robot <ying.huang@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
of the current hardcoded 1 (that would wake just the first waitqueue in
the head list).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and atomic updates from Ingo Molnar:
"Main changes in this cycle are:
- Extend atomic primitives with coherent logic op primitives
(atomic_{or,and,xor}()) and deprecate the old partial APIs
(atomic_{set,clear}_mask())
The old ops were incoherent with incompatible signatures across
architectures and with incomplete support. Now every architecture
supports the primitives consistently (by Peter Zijlstra)
- Generic support for 'relaxed atomics':
- _acquire/release/relaxed() flavours of xchg(), cmpxchg() and {add,sub}_return()
- atomic_read_acquire()
- atomic_set_release()
This came out of porting qwrlock code to arm64 (by Will Deacon)
- Clean up the fragile static_key APIs that were causing repeat bugs,
by introducing a new one:
DEFINE_STATIC_KEY_TRUE(name);
DEFINE_STATIC_KEY_FALSE(name);
which define a key of different types with an initial true/false
value.
Then allow:
static_branch_likely()
static_branch_unlikely()
to take a key of either type and emit the right instruction for the
case. To be able to know the 'type' of the static key we encode it
in the jump entry (by Peter Zijlstra)
- Static key self-tests (by Jason Baron)
- qrwlock optimizations (by Waiman Long)
- small futex enhancements (by Davidlohr Bueso)
- ... and misc other changes"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
jump_label/x86: Work around asm build bug on older/backported GCCs
locking, ARM, atomics: Define our SMP atomics in terms of _relaxed() operations
locking, include/llist: Use linux/atomic.h instead of asm/cmpxchg.h
locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics
locking/qrwlock: Implement queue_write_unlock() using smp_store_release()
locking/lockref: Remove homebrew cmpxchg64_relaxed() macro definition
locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'
locking, asm-generic: Rework atomic-long.h to avoid bulk code duplication
locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations
locking, compiler.h: Cast away attributes in the WRITE_ONCE() magic
locking/static_keys: Make verify_keys() static
jump label, locking/static_keys: Update docs
locking/static_keys: Provide a selftest
jump_label: Provide a self-test
s390/uaccess, locking/static_keys: employ static_branch_likely()
x86, tsc, locking/static_keys: Employ static_branch_likely()
locking/static_keys: Add selftest
locking/static_keys: Add a new static_key interface
locking/static_keys: Rework update logic
locking/static_keys: Add static_key_{en,dis}able() helpers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- a new PIDs controller is added. It turns out that PIDs are actually
an independent resource from kmem due to the limited PID space.
- more core preparations for the v2 interface. Once cpu side interface
is settled, it should be ready for lifting the devel mask.
for-4.3-unified-base was temporarily branched so that other trees
(block) can pull cgroup core changes that blkcg changes depend on.
- a non-critical idr_preload usage bug fix.
* 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: pids: fix invalid get/put usage
cgroup: introduce cgroup_subsys->legacy_name
cgroup: don't print subsystems for the default hierarchy
cgroup: make cftype->private a unsigned long
cgroup: export cgrp_dfl_root
cgroup: define controller file conventions
cgroup: fix idr_preload usage
cgroup: add documentation for the PIDs controller
cgroup: implement the PIDs subsystem
cgroup: allow a cgroup subsystem to reject a fork
|
|
The problem addressed in this patch is about affining unpinned
timers. Adaptive or Full Dynticks CPUs are currently disturbed
by unnecessary jitter due to firing of such timers on them.
This patch will affine timers to online CPUs which are not full
dynticks in NOHZ_FULL configured systems. It should not
introduce overhead in nohz full off case due to static keys.
Signed-off-by: Vatika Harlalka <vatikaharlalka@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1441119060-2230-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar:
"The main changes, mostly written by Frederic Weisbecker, include:
- Fix some jiffies based cputime assumptions. (No real harm because
the concerned code isn't used by full dynticks.)
- Simplify jiffies <-> usecs conversions. Remove dead code.
- Remove early hacks on nohz full code that avoided messing up idle
nohz internals. Now nohz integrates well full and idle and such
hack have become needless.
- Restart nohz full tick from irq exit. (A simplification and a
preparation for future optimization on scheduler kick to nohz
full)
- Code cleanups.
- Tile driver isolation enhancement on top of nohz. (Chris Metcalf)"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: Remove useless argument on tick_nohz_task_switch()
nohz: Move tick_nohz_restart_sched_tick() above its users
nohz: Restart nohz full tick from irq exit
nohz: Remove idle task special case
nohz: Prevent tilegx network driver interrupts
alpha: Fix jiffies based cputime assumption
apm32: Fix cputime == jiffies assumption
jiffies: Remove HZ > USEC_PER_SEC special case
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"This is a leftover scheduler fix from the v4.2 cycle"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix cpu_active_mask/cpu_online_mask race
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The biggest change in this cycle is the rewrite of the main SMP load
balancing metric: the CPU load/utilization. The main goal was to make
the metric more precise and more representative - see the changelog of
this commit for the gory details:
9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
It is done in a way that significantly reduces complexity of the code:
5 files changed, 249 insertions(+), 494 deletions(-)
and the performance testing results are encouraging. Nevertheless we
need to keep an eye on potential regressions, since this potentially
affects every SMP workload in existence.
This work comes from Yuyang Du.
Other changes:
- SCHED_DL updates. (Andrea Parri)
- Simplify architecture callbacks by removing finish_arch_switch().
(Peter Zijlstra et al)
- cputime accounting: guarantee stime + utime == rtime. (Peter
Zijlstra)
- optimize idle CPU wakeups some more - inspired by Facebook server
loads. (Mike Galbraith)
- stop_machine fixes and updates. (Oleg Nesterov)
- Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)
- sched/numa tweaks. (Srikar Dronamraju)
- misc fixes and small cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched/deadline: Fix comment in enqueue_task_dl()
sched/deadline: Fix comment in push_dl_tasks()
sched: Change the sched_class::set_cpus_allowed() calling context
sched: Make sched_class::set_cpus_allowed() unconditional
sched: Fix a race between __kthread_bind() and sched_setaffinity()
sched: Ensure a task has a non-normalized vruntime when returning back to CFS
sched/numa: Fix NUMA_DIRECT topology identification
tile: Reorganize _switch_to()
sched, sparc32: Update scheduler comments in copy_thread()
sched: Remove finish_arch_switch()
sched, tile: Remove finish_arch_switch
sched, sh: Fold finish_arch_switch() into switch_to()
sched, score: Remove finish_arch_switch()
sched, avr32: Remove finish_arch_switch()
sched, MIPS: Get rid of finish_arch_switch()
sched, arm: Remove finish_arch_switch()
sched/fair: Clean up load average references
sched/fair: Provide runnable_load_avg back to cfs_rq
sched/fair: Remove task and group entity load when they are dead
sched/fair: Init cfs_rq's sched_entity load average
...
|
|
There is a race condition in SMP bootup code, which may result
in
WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
or
kernel BUG at kernel/smpboot.c:135!
It can be triggered with a bit of luck in Linux guests running
on busy hosts.
CPU0 CPUn
==== ====
_cpu_up()
__cpu_up()
start_secondary()
set_cpu_online()
cpumask_set_cpu(cpu,
to_cpumask(cpu_online_bits));
cpu_notify(CPU_ONLINE)
<do stuff, see below>
cpumask_set_cpu(cpu,
to_cpumask(cpu_active_bits));
During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.
Variant 1:
cpu_notify(CPU_ONLINE)
workqueue_cpu_up_callback()
rebind_workers()
set_cpus_allowed_ptr()
This call fails because it requires an active CPU; rebind_workers()
ends with a warning:
WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
Variant 2:
cpu_notify(CPU_ONLINE)
smpboot_thread_call()
smpboot_unpark_threads()
..
__kthread_unpark()
__kthread_bind()
wake_up_state()
..
select_task_rq()
select_fallback_rq()
The ->wake_cpu of the unparked thread is not allowed, making a call
to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
find an allowed, active CPU and promptly resets the allowed CPUs, so
that the task in question ends up on CPU0.
When those unparked tasks are eventually executed, they run
immediately into a BUG:
kernel BUG at kernel/smpboot.c:135!
Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.
Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")
Together, these give us the following non-working solutions:
- secondary CPU sets active before online, because active is assumed to
be a subset of online;
- secondary CPU sets online before active, because the primary CPU
assumes that an online CPU is also active;
- secondary CPU sets online and waits for primary CPU to set active,
because it might deadlock.
Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.
Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.
set_cpus_allowed_ptr()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Wilson <msw@amazon.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:
- The combination of tree geometry-initialization simplifications
and OS-jitter-reduction changes to expedited grace periods.
These two are stacked due to the large number of conflicts
that would otherwise result.
[ With one addition, a temporary commit to silence a lockdep false
positive. Additional changes to the expedited grace-period
primitives (queued for 4.4) remove the cause of this false
positive, and therefore include a revert of this temporary commit. ]
- Documentation updates.
- Torture-test updates.
- Miscellaneous fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|