<feed xmlns='http://www.w3.org/2005/Atom'>
<title>BMC/Intel-BMC/linux.git/kernel/sched, branch dev-4.7</title>
<subtitle>Intel OpenBMC Linux kernel source tree (mirror)</subtitle>
<id>https://git.radix-linux.su/BMC/Intel-BMC/linux.git/atom?h=dev-4.7</id>
<link rel='self' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/atom?h=dev-4.7'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/'/>
<updated>2016-10-07T13:21:18+00:00</updated>
<entry>
<title>sched/cputime: Fix prev steal time accounting during CPU hotplug</title>
<updated>2016-10-07T13:21:18+00:00</updated>
<author>
<name>Wanpeng Li</name>
<email>wanpeng.li@hotmail.com</email>
</author>
<published>2016-06-13T10:32:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=afca668faa80cbd97ca767d41c2845a175d931c2'/>
<id>urn:sha1:afca668faa80cbd97ca767d41c2845a175d931c2</id>
<content type='text'>
commit 3d89e5478bf550a50c99e93adf659369798263b0 upstream.

Commit:

  e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

... set rq-&gt;prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq-&gt;prev_steal_time.

However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch.  Worse, commit e9532e69b8d1
itself triggers a bug on CPU hot-unplug/plug operations: because
rq-&gt;prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.
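
The double accounting follows from how the steal baseline is used. Roughly,
in steal_account_process_tick() (a simplified sketch, not the exact code):

	u64 steal;

	steal  = paravirt_steal_clock(smp_processor_id());
	steal -= this_rq()-&gt;prev_steal_time;	/* delta since the saved baseline */

	/*
	 * If rq-&gt;prev_steal_time was reset to 0 on hotplug, 'steal' now
	 * covers the CPU's entire history and is accounted a second time.
	 */
	account_steal_time(steal);
	this_rq()-&gt;prev_steal_time += steal;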

Since the root cause has been fixed, we can just revert commit e9532e69b8d1.

Signed-off-by: Wanpeng Li &lt;wanpeng.li@hotmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Radim Krčmář &lt;rkrcmar@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/core: Fix a race between try_to_wake_up() and a woken up task</title>
<updated>2016-09-24T08:09:36+00:00</updated>
<author>
<name>Balbir Singh</name>
<email>bsingharora@gmail.com</email>
</author>
<published>2016-09-05T03:16:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=00f179bf9ca41d436c13246d449d3504512db3c3'/>
<id>urn:sha1:00f179bf9ca41d436c13246d449d3504512db3c3</id>
<content type='text'>
commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is a missing memory barrier
between the check of task-&gt;state and the check of task-&gt;on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule();
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker actually gets stuck doing the following in
try_to_wake_up():

	while (p-&gt;on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&amp;waiter.list);

CPU2 wakes up CPU1, but before CPU1 can take the wait_lock and set its
state to TASK_RUNNING, the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p-&gt;pi_lock)
   ..
   if (p-&gt;on_rq &amp;&amp; ttwu_wakeup())
   ..
   while (p-&gt;on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds it on the
wait_queue. CPU1 is spinning on the wait_lock, but CPU3 acquired it
immediately after CPU2.

CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly, even though
p-&gt;on_rq is checked under pi_lock, I've noticed that try_to_wake_up()
finds p-&gt;on_rq to be 0. This was the most confusing bit of the
analysis, but since p-&gt;on_rq is changed under the runqueue lock,
rq_lock, the p-&gt;on_rq check is not reliable without this fix, IMHO.
The race is visible (based on the analysis) only when ttwu_queue()
does a remote wakeup via ttwu_queue_remote(), in which case the
p-&gt;on_rq change is not done under the pi_lock.

The result is that after a while the entire system locks up on
raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
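
A minimal sketch of the ordering the fix enforces in try_to_wake_up(),
just before the on_rq check (illustrative placement, paraphrased comment):

	/*
	 * Ensure we load p-&gt;on_rq _after_ p-&gt;state; without a read
	 * barrier here the waker can falsely observe p-&gt;on_rq == 0
	 * and end up spinning on p-&gt;on_cpu.
	 */
	smp_rmb();

	if (p-&gt;on_rq &amp;&amp; ttwu_remote(p, wake_flags))
		goto stat;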

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads, with available memory tweaked down very low, while running the
stress-ng mmapfork memory test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but that's work in progress. I am still running the
changes through my test loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh &lt;bsingharora@gmail.com&gt;
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Alexey Kardashevskiy &lt;aik@ozlabs.ru&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Nicholas Piggin &lt;nicholas.piggin@gmail.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression</title>
<updated>2016-09-07T06:34:49+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-08-15T16:38:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=dac02e35f3112548ea8faa34735fa93c665f8d3c'/>
<id>urn:sha1:dac02e35f3112548ea8faa34735fa93c665f8d3c</id>
<content type='text'>
commit 173be9a14f7b2e901cf77c18b1aafd4d672e9d9e upstream.

Mike reports:

 Roughly 10% of the time, ltp testcase getrusage04 fails:
 getrusage04    0  TINFO  :  Expected timers granularity is 4000 us
 getrusage04    0  TINFO  :  Using 1 as multiply factor for max [us]time increment (1000+4000us)!
 getrusage04    0  TINFO  :  utime:           0us; stime:         179us
 getrusage04    0  TINFO  :  utime:        3751us; stime:           0us
 getrusage04    1  TFAIL  :  getrusage04.c:133: stime increased &gt; 5000us:

And tracked it down to the case where the task simply doesn't get
_any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information,
thus ensuring a task that elides the tick gets time accounted.
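
The resulting behaviour in cputime_adjust() is roughly the following
(a sketch of the idea, not the verbatim patch):

	/*
	 * If the task got no [us]time ticks at all, don't leave rtime
	 * unaccounted: attribute all of it to utime (and symmetrically,
	 * to stime when only utime is missing).
	 */
	if (stime == 0) {
		utime = rtime;
		goto update;
	}

	if (utime == 0) {
		stime = rtime;
		goto update;
	}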

Reported-by: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Tested-by: Mike Galbraith &lt;umgwanakikbuti@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: Fredrik Markstrom &lt;fredrik.markstrom@gmail.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Radim Krčmář &lt;rkrcmar@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Stephane Eranian &lt;eranian@google.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Vince Weaver &lt;vincent.weaver@maine.edu&gt;
Cc: Wanpeng Li &lt;wanpeng.li@hotmail.com&gt;
Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sched/core: Correct off by one bug in load migration calculation</title>
<updated>2016-07-13T12:58:20+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2016-07-12T16:33:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=d60585c5766e9620d5d83e2b25dc042c7bdada2c'/>
<id>urn:sha1:d60585c5766e9620d5d83e2b25dc042c7bdada2c</id>
<content type='text'>
The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
account that the function is now called from a thread running on the outgoing
CPU. As a result a CPU unplug leaks a load of 1 into the global load
accounting mechanism.

Fix it by adjusting for the currently running thread which calls
calc_load_migrate().
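
Roughly, the fix makes the fold subtract the caller itself (a sketch of
the idea; the exact signature may differ):

	long calc_load_fold_active(struct rq *this_rq, long adjust)
	{
		long nr_active, delta = 0;

		/* The thread doing the unplug is not real load: subtract it. */
		nr_active  = this_rq-&gt;nr_running - adjust;
		nr_active += (long)this_rq-&gt;nr_uninterruptible;

		if (nr_active != this_rq-&gt;calc_load_active) {
			delta = nr_active - this_rq-&gt;calc_load_active;
			this_rq-&gt;calc_load_active = nr_active;
		}

		return delta;
	}

	void calc_load_migrate(struct rq *rq)
	{
		long delta = calc_load_fold_active(rq, 1);

		if (delta)
			atomic_long_add(delta, &amp;calc_load_tasks);
	}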

Reported-by: Anton Blanchard &lt;anton@samba.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Vaidyanathan Srinivasan &lt;svaidy@linux.vnet.ibm.com&gt;
Cc: rt@linutronix.de
Cc: shreyas@linux.vnet.ibm.com
Fixes: e9cd8fa4fcfd ("sched/migration: Move calc_load_migrate() into CPU_DYING")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion</title>
<updated>2016-06-27T09:18:37+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-06-24T14:11:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=ea1dc6fc6242f991656e35e2ed3d90ec1cd13418'/>
<id>urn:sha1:ea1dc6fc6242f991656e35e2ed3d90ec1cd13418</id>
<content type='text'>
Commit:

  fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

did something non-obvious, and did it in a way that was buggy yet latent.

The problem was exposed for real by a later commit in the v4.7 merge window:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

... after which tg-&gt;load_avg and cfs_rq-&gt;load.weight had different
units (10 bit fixed point and 20 bit fixed point resp.).

Add a comment to explain the use of cfs_rq-&gt;load.weight over the
'natural' cfs_rq-&gt;avg.load_avg and add scale_load_down() to correct
for the difference in unit.
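
The core of that correction, as a sketch (helper names as used by the
kernel's load-scaling code):

	/*
	 * cfs_rq-&gt;load.weight is in the increased (20-bit) resolution on
	 * 64-bit kernels while tg-&gt;load_avg is 10-bit fixed point, so
	 * scale the weight down before mixing the two.
	 */
	load = max(scale_load_down(cfs_rq-&gt;load.weight), cfs_rq-&gt;avg.load_avg);

	tg_weight  = atomic_long_read(&amp;tg-&gt;load_avg);
	tg_weight -= cfs_rq-&gt;tg_load_avg_contrib;
	tg_weight += load;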

Since this is (now, as per a previous commit) the only user of
calc_tg_weight(), collapse it.

The effects of this bug should be randomly inconsistent SMP-balancing
of cgroups workloads.

Reported-by: Jirka Hladky &lt;jhladky@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix effective_load() to consistently use smoothed load</title>
<updated>2016-06-27T09:18:36+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-06-24T13:53:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=7dd4912594daf769a46744848b05bd5bc6d62469'/>
<id>urn:sha1:7dd4912594daf769a46744848b05bd5bc6d62469</id>
<content type='text'>
Starting with the following commit:

  fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j &gt;= rw_i we cannot use tg-&gt;load_avg directly, since that might be
lagging a correction on the current cfs_rq-&gt;avg.load_avg value.
Therefore we use tg-&gt;load_avg - cfs_rq-&gt;tg_load_avg_contrib +
cfs_rq-&gt;avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq-&gt;avg.load_avg, which is what is later used for @w, but uses
cfs_rq-&gt;load.weight instead.

So stop using calc_tg_weight() and do it explicitly.
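
In effective_load() terms that amounts to roughly the following
(an illustrative sketch, not the verbatim diff):

	long w = cfs_rq_load_avg(cfs_rq);	/* rw_i, the smoothed load */

	/* W = @wg + \Sum rw_j */
	W = wg + atomic_long_read(&amp;tg-&gt;load_avg);

	/* Ensure \Sum rw_j &gt;= rw_i */
	W -= cfs_rq-&gt;tg_load_avg_contrib;
	W += w;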

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v4.3+
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/core: Allow kthreads to fall back to online &amp;&amp; !active cpus</title>
<updated>2016-06-24T06:26:53+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>htejun@gmail.com</email>
</author>
<published>2016-06-16T19:35:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=feb245e304f343cf5e4f9123db36354144dce8a4'/>
<id>urn:sha1:feb245e304f343cf5e4f9123db36354144dce8a4</id>
<content type='text'>
During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
online but not active.  A CPU_ONLINE callback may create or bind a
kthread so that its cpus_allowed mask only allows the CPU which is
being brought online.  The kthread may start executing before the CPU
is made active and can end up in select_fallback_rq().

In such cases, the expected behavior is selecting the CPU which is
coming online; however, because select_fallback_rq() only chooses from
active CPUs, it determines that the task doesn't have any viable CPU
in its allowed mask and ends up overriding it to cpu_possible_mask.

CPU_ONLINE callbacks should be able to put kthreads on the CPU which
is coming online.  Update select_fallback_rq() so that it follows
cpu_online() rather than cpu_active() for kthreads.
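
A sketch of the resulting check order in select_fallback_rq()
(illustrative, not the verbatim diff):

	for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
		/* Regular tasks still need an active CPU ... */
		if (!(p-&gt;flags &amp; PF_KTHREAD) &amp;&amp; !cpu_active(dest_cpu))
			continue;
		/* ... but kthreads may use any online CPU in their mask. */
		if (!cpu_online(dest_cpu))
			continue;
		goto out;
	}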

Reported-by: Gautham R Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Tested-by: Gautham R. Shenoy &lt;ego@linux.vnet.ibm.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Abdul Haleem &lt;abdhalee@linux.vnet.ibm.com&gt;
Cc: Aneesh Kumar &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: kernel-team@fb.com
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Do not announce throttled next buddy in dequeue_task_fair()</title>
<updated>2016-06-24T06:26:45+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2016-06-16T12:57:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=754bd598be9bbc953bc709a9e8ed7f3188bfb9d7'/>
<id>urn:sha1:754bd598be9bbc953bc709a9e8ed7f3188bfb9d7</id>
<content type='text'>
The hierarchy could already be throttled at this point. A throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().
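
The guard this implies in dequeue_task_fair(), as a sketch:

	/* Don't nominate a next buddy inside a throttled hierarchy. */
	if (task_sleep &amp;&amp; se &amp;&amp; !throttled_hierarchy(cfs_rq))
		set_next_buddy(se);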

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Ben Segall &lt;bsegall@google.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Initialize throttle_count for new task-groups lazily</title>
<updated>2016-06-24T06:26:44+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2016-06-16T12:57:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=094f469172e00d6ab0a3130b0e01c83b3cf3a98d'/>
<id>urn:sha1:094f469172e00d6ab0a3130b0e01c83b3cf3a98d</id>
<content type='text'>
A cgroup created inside a throttled group must inherit the current
throttle_count. A broken throttle_count allows throttled entries to be
nominated as the next buddy, which later leads to a NULL pointer
dereference in pick_next_task_fair().

This patch initializes cfs_rq-&gt;throttle_count at the first enqueue:
this laziness allows us to skip locking all runqueues at group creation.
The lazy approach would also allow skipping a full sub-tree scan when
throttling a hierarchy (not done in this patch).
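
Conceptually the lazy sync at the first enqueue looks roughly like this
(a sketch; the 'uptodate' flag and the parent lookup are illustrative):

	/* First enqueue of this cfs_rq: inherit the parent's throttle state. */
	if (unlikely(!cfs_rq-&gt;throttle_uptodate)) {
		struct cfs_rq *pcfs_rq =
			cfs_rq-&gt;tg-&gt;parent-&gt;cfs_rq[cpu_of(rq_of(cfs_rq))];

		cfs_rq-&gt;throttle_count    = pcfs_rq-&gt;throttle_count;
		cfs_rq-&gt;throttle_uptodate = 1;
	}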

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/fair: Fix cfs_rq avg tracking underflow</title>
<updated>2016-06-20T09:29:09+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2016-06-16T08:50:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=8974189222159154c55f24ddad33e3613960521a'/>
<id>urn:sha1:8974189222159154c55f24ddad33e3613960521a</id>
<content type='text'>
As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

&gt; the code generated from update_cfs_rq_load_avg():
&gt;
&gt; 	if (atomic_long_read(&amp;cfs_rq-&gt;removed_load_avg)) {
&gt; 		s64 r = atomic_long_xchg(&amp;cfs_rq-&gt;removed_load_avg, 0);
&gt; 		sa-&gt;load_avg = max_t(long, sa-&gt;load_avg - r, 0);
&gt; 		sa-&gt;load_sum = max_t(s64, sa-&gt;load_sum - r * LOAD_AVG_MAX, 0);
&gt; 		removed_load = 1;
&gt; 	}
&gt;
&gt; turns into:
&gt;
&gt; ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
&gt; ffffffff8108706b:       48 85 c0                test   %rax,%rax
&gt; ffffffff8108706e:       74 40                   je     ffffffff810870b0 &lt;update_blocked_averages+0xc0&gt;
&gt; ffffffff81087070:       4c 89 f8                mov    %r15,%rax
&gt; ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
&gt; ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
&gt; ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
&gt; ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
&gt; ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
&gt; ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
&gt;
&gt; Which you'll note ends up with sa-&gt;load_avg -= r in memory at
&gt; ffffffff8108707a.

So I _should_ have looked at other unserialized users of -&gt;load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the sign
bit. This reduces the effective width of the variables; IOW, it's
effectively the same as having these variables be of signed type.

This patch switches to a different means of unsigned underflow
detection that does not rely on the sign bit. This allows the variables
to use the 'full' unsigned range. And it does so with an explicit LOAD -
STORE to ensure any intermediate value is never visible in
memory, allowing these unserialized loads.
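
The new detection, roughly (a sketch of a sub_positive()-style
load-store helper and its use):

	/*
	 * Unsigned subtract clamped at 0, done as an explicit load-store
	 * so the intermediate (possibly underflowed) value never hits
	 * memory and lockless readers never observe it.
	 */
	#define sub_positive(_ptr, _val) do {			\
		typeof(_ptr) ptr = (_ptr);			\
		typeof(*ptr) val = (_val);			\
		typeof(*ptr) res, var = READ_ONCE(*ptr);	\
		res = var - val;				\
		if (res &gt; var)					\
			res = 0;				\
		WRITE_ONCE(*ptr, res);				\
	} while (0)

	/* e.g. in update_cfs_rq_load_avg(): */
	sub_positive(&amp;sa-&gt;load_avg, r);
	sub_positive(&amp;sa-&gt;load_sum, r * LOAD_AVG_MAX);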

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Yuyang Du &lt;yuyang.du@intel.com&gt;
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
</feed>
