Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch allows for reading the current (leftover) runtime and
absolute deadline of a SCHED_DEADLINE task through /proc/*/sched
(entries dl.runtime and dl.deadline), while debugging/testing.
Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Acked-by: Daniel Bistrot de Oliveira <danielbristot@gmail.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1477473437-10346-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When switching between the unstable and stable variants it is
currently possible that clock discontinuities occur.
And while these will mostly be 'small', attempt to do better.
As observed on my IVB-EP, the sched_clock() is ~1.5s ahead of the
ktime_get_ns() based timeline at the point of switchover
(sched_clock_init_late()) after SMP bringup.
Equally, when the TSC is later found to be unstable -- typically
because SMM tries to hide its SMI latencies by mucking with the TSC --
we want to avoid large jumps.
Since the clocksource watchdog reports the issue after the fact we
cannot exactly fix up time, but since SMI latencies are typically
small (~10ns range), the discontinuity is mainly due to drift between
sched_clock() and ktime_get_ns() (which on my desktop is ~79s over
24days).
I dislike this patch because it adds overhead to the good case in
favour of dealing with badness. But given the widespread failure of
TSC stability this is worth it.
Note that in case the TSC makes drastic jumps after SMP bringup we're
still hosed. There's just not much we can do in that case without
stupid overhead.
If we were to somehow expose tsc_clocksource_reliable (which is hard
because this code is also used on ia64 and parisc) we could avoid some
of the newly introduced overhead.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently we switch to the stable sched_clock if we guess the TSC is
usable, and then switch back to the unstable path if it turns out TSC
isn't stable during SMP bringup after all.
Delay switching to the stable path until after SMP bringup is
complete. This way we'll avoid switching during the time we detect the
worst of the TSC offences.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
sched_clock was still using the deprecated static_key interface.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
PeterZ reported that we'd fail to mark the TSC unstable when the
clocksource watchdog finds it unsuitable.
Allow a clocksource to run a custom action when its being marked
unstable and hook up the TSC unstable code.
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There's no diagnostic checks for figuring out when we've accidentally
missed update_rq_clock() calls. Let's add some by piggybacking on the
rq_*pin_lock() wrappers.
The idea behind the diagnostic checks is that upon pining rq lock the
rq clock should be updated, via update_rq_clock(), before anybody
reads the clock with rq_clock() or rq_clock_task().
The exception to this rule is when updates have explicitly been
disabled with the rq_clock_skip_update() optimisation.
There are some functions that only unpin the rq lock in order to grab
some other lock and avoid deadlock. In that case we don't need to
update the clock again and the previous diagnostic state can be
carried over in rq_repin_lock() by saving the state in the rq_flags
context.
Since this patch adds a new clock update flag and some already exist
in rq::clock_skip_update, that field has now been renamed. An attempt
has been made to keep the flag manipulation code small and fast since
it's used in the heart of the __schedule() fast path.
For the !CONFIG_SCHED_DEBUG case the only object code change (other
than addresses) is the following change to reset RQCF_ACT_SKIP inside
of __schedule(),
- c7 83 38 09 00 00 00 movl $0x0,0x938(%rbx)
- 00 00 00
+ 83 a3 38 09 00 00 fc andl $0xfffffffc,0x938(%rbx)
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-8-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Address this rq-clock update bug:
WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
set_next_entity()
? _raw_spin_lock()
set_curr_task_fair()
set_user_nice.part.85()
set_user_nice()
create_worker()
worker_thread()
kthread()
ret_from_fork()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.
WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
assert_clock_updated.isra.63.part.64()
can_migrate_task()
load_balance()
pick_next_task_fair()
__schedule()
schedule()
worker_thread()
kthread()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.
WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
detach_task_cfs_rq()
switched_from_fair()
__sched_setscheduler()
_sched_setscheduler()
sched_set_stop_task()
cpu_stop_create()
__smpboot_create_thread.part.2()
smpboot_register_percpu_thread_cpumask()
cpu_stop_init()
do_one_initcall()
? print_cpu_info()
kernel_init_freeable()
? rest_init()
kernel_init()
ret_from_fork()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Address this rq-clock update bug:
WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
__warn()
post_init_entity_util_avg()
wake_up_new_task()
_do_fork()
kernel_thread()
rest_init()
start_kernel()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Future patches will emit warnings if rq_clock() is called before
update_rq_clock() inside a rq_pin_lock()/rq_unpin_lock() pair.
Since there is only one caller of idle_balance() we can push the
unpin/repin there.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-7-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
rq_clock() is called from sched_info_{depart,arrive}() after resetting
RQCF_ACT_SKIP but prior to a call to update_rq_clock().
In preparation for pending patches that check whether the rq clock has
been updated inside of a pin context before rq_clock() is called, move
the reset of rq->clock_skip_update immediately before unpinning the rq
lock.
This will avoid the new warnings which check if update_rq_clock() is
being actively skipped.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-6-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In preparation for adding diagnostic checks to catch missing calls to
update_rq_clock(), provide wrappers for (re)pinning and unpinning
rq->lock.
Because the pending diagnostic checks allow state to be maintained in
rq_flags across pin contexts, swap the 'struct pin_cookie' arguments
for 'struct rq_flags *'.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/20160921133813.31976-5-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y used to accumulate user time and
account it on ticks and context switches only through the
vtime_account_user() function.
Now this model has been generalized on the 3 archs for all kind of
cputime (system, irq, ...) and all the cputime flushing happens under
vtime_account_user().
So let's rename this function to better reflect its new role.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-11-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The account_system_time() function is called with a cputime that
occurred while running in the kernel. The function detects which
context the CPU is currently running in and accounts the time to
the correct bucket. This forces the arch code to account the
cputime for hardirq and softirq immediately.
Such accounting function can be costly and perform unwelcome divisions
and multiplications, among others.
The arch code can delay the accounting for system time. For s390
the accounting is done once per timer tick and for each task switch.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Rebase against latest linus tree and move account_system_index_scaled(). ]
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-10-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y accounts the cputime on
any context boundary: irq entry/exit, guest entry/exit, context switch,
etc...
Calling functions such as account_system_time(), account_user_time()
and such can be costly, especially if they are called on many fastpath
such as twice per IRQ. Those functions do more than just accounting to
kcpustat and task cputime. Depending on the config, some subsystems can
perform unpleasant multiplications and divisions, among other things.
So lets accumulate the cputime instead and delay the accounting on ticks
and context switches only.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-9-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
tick/task switch
Currently CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y accounts the cputime on
any context boundary: irq entry/exit, guest entry/exit, context switch,
etc...
Calling functions such as account_system_time(), account_user_time()
and such can be costly, especially if they are called on many fastpath
such as twice per IRQ. Those functions do more than just accounting to
kcpustat and task cputime. Depending on the config, some subsystems can
perform unpleasant multiplications and divisions, among other things.
So lets accumulate the cputime instead and delay the accounting on ticks
and context switches only.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
That in order to gather all cputime accumulation to the same place.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, provide finegrained accumulators to
powerpc in order to store the cputime until flushing.
While at it, normalize the name of several fields according to common
cputime naming.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's allow archs to account cputime
directly to gtime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In order to prepare for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y to delay
cputime accounting to the tick, let's provide APIs to account system
time to precise contexts: hardirq, softirq, pure system, ...
Inspired-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On task switch we must initialize the current cputime of the next task
using the value of the previous task which got freshly updated.
But we are confusing that with doing the opposite, which should result
in incorrect cputime accounting.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On context switch with powerpc32, the cputime is accumulated in the
thread_info struct. So the switching-in task must move forward its
start time snapshot to the current time in order to later compute the
delta spent in system mode.
This is what we do for the normal cputime by initializing the starttime
field to the value of the previous task's starttime which got freshly
updated.
But we are missing the update of the scaled cputime start time. As a
result we may be accounting too much scaled cputime later.
Fix this by initializing the scaled cputime the same way we do for
normal cputime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1483636310-6557-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are all over the place.
The tracepoint part of the pull fixes a crash and adds a little more
information to two tracepoints, while the rest are good old fashioned
fixes"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: make tracepoint format strings more compact
Btrfs: add truncated_len for ordered extent tracepoints
Btrfs: add 'inode' for extent map tracepoint
btrfs: fix crash when tracepoint arguments are freed by wq callbacks
Btrfs: adjust outstanding_extents counter properly when dio write is split
Btrfs: fix lockdep warning about log_mutex
Btrfs: use down_read_nested to make lockdep silent
btrfs: fix locking when we put back a delayed ref that's too new
btrfs: fix error handling when run_delayed_extent_op fails
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
|
|
Pull ceph fixes from Ilya Dryomov:
"Two small fixups for the filesystem changes that went into this merge
window"
* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
ceph: fix get_oldest_context()
ceph: fix mds cluster availability check
|
|
Pull VFIO fixes from Alex Williamson:
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
- Export and make use of has_capability() to fix incorrect use of
ns_capable() for testing task capabilities (Jike Song)
* tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
vfio/type1: Remove pid_namespace.h include
vfio iommu type1: fix the testing of capability for remote task
capability: export has_capability
vfio-mdev: remove some dead code
vfio-mdev: buffer overflow in ioctl()
vfio-mdev: return -EFAULT if copy_to_user() fails
|
|
Pull KVM fixes from Paolo Bonzini:
- fix for module unload vs deferred jump labels (note: there might be
other buggy modules!)
- two NULL pointer dereferences from syzkaller
- also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
made worse during this merge window, "just" kernel memory leak on
releases
- fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix emulation of "MOV SS, null selector"
KVM: x86: fix NULL deref in vcpu_scan_ioapic
KVM: eventfd: fix NULL deref irqbypass consumer
KVM: x86: Introduce segmented_write_std
KVM: x86: flush pending lapic jump label updates on module unload
jump_labels: API for flushing deferred jump label updates
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix huge_ptep_set_access_flags() to return "changed" when any of the
ptes in the contiguous range is changed, not just the last one
- Fix the adr_l assembly macro to work in modules under KASLR
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: assembler: make adr_l work in modules under KASLR
arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The major fix is the bfa firmware, since the latest 10Gb cards fail
probing with the current firmware.
The rest is a set of minor fixes: one missed Kconfig dependency
causing randconfig failures, a missed error return on an error leg, a
change for how multiqueue waits on a blocked device and a don't reset
while in reset fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: bfa: Increase requested firmware version to 3.2.5.1
scsi: snic: Return error code on memory allocation failure
scsi: fnic: Avoid sending reset to firmware when another reset is in progress
scsi: qedi: fix build, depends on UIO
scsi: scsi-mq: Wait for .queue_rq() if necessary
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"Small driver fixups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: elants_i2c - avoid divide by 0 errors on bad touchscreen data
Input: adxl34x - make it enumerable in ACPI environment
Input: ALPS - fix TrackStick Y axis handling for SS5 hardware
Input: synaptics-rmi4 - fix F03 build error when serio is module
Input: xpad - use correct product id for x360w controllers
Input: synaptics_i2c - change msleep to usleep_range for small msecs
Input: i8042 - add Pegatron touchpad to noloop table
Input: joydev - remove unused linux/miscdevice.h include
|
|
Using has_capability() rather than ns_capable(), we're no longer using
this header.
Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised. ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.
Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This time we got a few more fixes than the previous rc's, and most of
commits were about ASoC.
The only significant change in the core side is the regression fix wrt
the aux device list handling, and all the rest are driver-specific
small / trivial fixes"
* tag 'sound-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add a quirk for Plantronics BT600
ASoC: rt5645: set sel_i2s_pre_div1 to 2
ASoC: dpcm: Avoid putting stream state to STOP when FE stream is paused
ASoC: Intel: Skylake: Release FW ctx in cleanup
ASoC: Intel: bytcr-rt5640: fix settings in internal clock mode
ASoC: fsl_ssi: set fifo watermark to more reliable value
ASoC: nau8825: fix invalid configuration in Pre-Scalar of FLL
ASoC: nau8825: correct the function name of register
ASoC: Intel: Skylake: Fix to fail safely if module not available in path
ASoC: tlv320aic3x: Mark the RESET register as volatile
ASoC: Fix binding and probing of auxiliary components
ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
ASoC: Intel: bytcr_rt5640: fallback mechanism if MCLK is not enabled
ASoC: hdmi-codec: use unsigned type to structure members with bit-field
ASoC: topology: kfree kcontrol->private_value before freeing kcontrol
ASoC: rsnd: don't double free kctrl
ASoC: dwc: Fix PIO mode initialization
|
|
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"As promised last week, here's some stability fixes from Christoph and
Jan Kara:
- fix free space request handling when low on disk space
- remove redundant log failure error messages
- free truncated dirty pages instead of letting them build up
forever"
* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Timely free truncated dirty pages
xfs: don't print warnings when xfs_log_force fails
xfs: don't rely on ->total in xfs_alloc_space_available
xfs: adjust allocation length in xfs_alloc_space_available
xfs: fix bogus minleft manipulations
xfs: bump up reserved blocks in xfs_alloc_set_aside
|
|
Pull remoteproc fixes from Bjorn Andersson:
"This fixes two regressions that have been reported to be introduced in
v4.10-rc1.
- correct an incorrect usage of the kref api
- revert the change to make the resource table read-only. As the
space each vdev resource is used as virtio device config space it
must be shared with the remote"
* tag 'rproc-v4.10-fixes' of git://github.com/andersson/remoteproc:
Revert "remoteproc: Merge table_ptr and cached_table pointers"
remoteproc: fix vdev reference management
|
|
Pull rpmsg fixes from Bjorn Andersson:
"This fixes a regression introduced in v4.10-rc1 that prohibits
multiple channels with the same name but different endpoint addresses
to be used"
* tag 'rpmsg-v4.10-fixes' of git://github.com/andersson/remoteproc:
rpmsg: virtio_rpmsg_bus: fix channel creation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- device descriptor length validation fix to hid-cypress driver from
Greg
- introduction of a short delay into i2c-hid, which is not really
mandated by the spec, but fixes Asus Touchpads
- Petzl USB connectable flashlight quirk from myself
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: i2c-hid: Add sleep between POWER ON and RESET
HID: hid-cypress: validate length of report
HID: ignore Petzl USB headlamp
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux
Pull scsi target fixes from Bart Van Assche:
- a series of bug fixes for the XCOPY implementation from David
Disseldorp
- one bug fix for the ibmvscsis driver, a driver that is used for
communication between partitions on IBM POWER systems.
* 'scsi-target-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/bvanassche/linux:
ibmvscsis: Fix srp_transfer_data fail return code
target: support XCOPY requests without parameters
target: check for XCOPY parameter truncation
target: use XCOPY segment descriptor CSCD IDs
target: check XCOPY segment descriptor CSCD IDs
target: simplify XCOPY wwn->se_dev lookup helper
target: return UNSUPPORTED TARGET/SEGMENT DESC TYPE CODE sense
target: bounds check XCOPY total descriptor list length
target: bounds check XCOPY segment descriptor list
target: use XCOPY TOO MANY TARGET DESCRIPTORS sense
target: add XCOPY target/segment desc sense codes
|
|
For no snapshot case, we should use ci->truncate_{seq,size}.
Fixes: 5f743e456606 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
We should apply the check after getting the initial mdsmap.
Fixes: e9e427f0a14f ("ceph: check availability of mds cluster on mount")
Link: http://tracker.ceph.com/issues/18161
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Pull md fixes from Shaohua Li:
"Basically one fix for raid5 cache which is merged in this cycle,
others are trival fixes"
* tag 'md/4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/raid5: Use correct IS_ERR() variation on pointer check
md: cleanup mddev flag clear for takeover
md/r5cache: fix spelling mistake on "recoverying"
md/r5cache: assign conf->log before r5l_load_log()
md/r5cache: simplify handling of sh->log_start in recovery
md/raid5-cache: removes unnecessary write-through mode judgments
md/raid10: Refactor raid10_make_request
md/raid1: Refactor raid1_make_request
|
|
When CONFIG_RANDOMIZE_MODULE_REGION_FULL=y, the offset between loaded
modules and the core kernel may exceed 4 GB, putting symbols exported
by the core kernel out of the reach of the ordinary adrp/add instruction
pairs used to generate relative symbol references. So make the adr_l
macro emit a movz/movk sequence instead when executing in module context.
While at it, remove the pointless special case for the stack pointer.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
has_capability() is sometimes needed by modules to test capability
for specified task other than current, so export it.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
IP: _raw_spin_lock+0xc/0x30
PGD 3e28eb067
PUD 3f0ac6067
PMD 0
Oops: 0002 [#1] SMP
CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
Call Trace:
? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
? pick_next_task_fair+0xe1/0x4e0
? kvm_arch_vcpu_load+0xea/0x260 [kvm]
kvm_vcpu_ioctl+0x33a/0x600 [kvm]
? hrtimer_try_to_cancel+0x29/0x130
? do_nanosleep+0x97/0xf0
do_vfs_ioctl+0xa1/0x5d0
? __hrtimer_init+0x90/0x90
? do_nanosleep+0x5b/0xf0
SyS_ioctl+0x79/0x90
do_syscall_64+0x6e/0x180
entry_SYSCALL64_slow_path+0x25/0x25
RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip. The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <linux/kvm.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#ifndef KVM_CAP_HYPERV_SYNIC
#define KVM_CAP_HYPERV_SYNIC 123
#endif
void* thr(void* arg)
{
struct kvm_enable_cap cap;
cap.flags = 0;
cap.cap = KVM_CAP_HYPERV_SYNIC;
ioctl((long)arg, KVM_ENABLE_CAP, &cap);
return 0;
}
int main()
{
void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
int kvmfd = open("/dev/kvm", 0);
int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
struct kvm_userspace_memory_region memreg;
memreg.slot = 0;
memreg.flags = 0;
memreg.guest_phys_addr = 0;
memreg.memory_size = 0x1000;
memreg.userspace_addr = (unsigned long)host_mem;
host_mem[0] = 0xf4;
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
struct kvm_sregs sregs;
ioctl(cpufd, KVM_GET_SREGS, &sregs);
sregs.cr0 = 0;
sregs.cr4 = 0;
sregs.efer = 0;
sregs.cs.selector = 0;
sregs.cs.base = 0;
ioctl(cpufd, KVM_SET_SREGS, &sregs);
struct kvm_regs regs = { .rflags = 2 };
ioctl(cpufd, KVM_SET_REGS, ®s);
ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
pthread_t th;
pthread_create(&th, 0, thr, (void*)(long)cpufd);
usleep(rand() % 10000);
ioctl(cpufd, KVM_RUN, 0);
pthread_join(th, 0);
return 0;
}
This patch fixes it by failing ENABLE_CAP if without an irqchip.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reported syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
PGD 0
Oops: 0002 [#1] SMP
CPU: 1 PID: 125 Comm: kworker/1:1 Not tainted 4.9.0+ #1
Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
task: ffff9bbe0dfbb900 task.stack: ffffb61802014000
RIP: 0010:irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
Call Trace:
irqfd_shutdown+0x66/0xa0 [kvm]
process_one_work+0x16b/0x480
worker_thread+0x4b/0x500
kthread+0x101/0x140
? process_one_work+0x480/0x480
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x25/0x30
RIP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass] RSP: ffffb61802017e20
CR2: 0000000000000008
The syzkaller folks reported a NULL pointer dereference that due to
unregister an consumer which fails registration before. The syzkaller
creates two VMs w/ an equal eventfd occasionally. So the second VM
fails to register an irqbypass consumer. It will make irqfd as inactive
and queue an workqueue work to shutdown irqfd and unregister the irqbypass
consumer when eventfd is closed. However, the second consumer has been
initialized though it fails registration. So the token(same as the first
VM's) is taken to unregister the consumer through the workqueue, the
consumer of the first VM is found and unregistered, then NULL deref incurred
in the path of deleting consumer from the consumers list.
This patch fixes it by making irq_bypass_register/unregister_consumer()
looks for the consumer entry based on consumer pointer itself instead of
token matching.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduces segemented_write_std.
Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding
kernel memory leak.
Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c819194c37a8367624b285be10297eca
Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.
Use the new static_key_deferred_flush() API to flush pending updates on
module unload.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Modules that use static_key_deferred need a way to synchronize with
any delayed work that is still pending when the module is unloaded.
Introduce static_key_deferred_flush() which flushes any pending
jump label updates.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Support for the Asus Touchpad was recently added. It turns out this
device can fail initialisation (and become unusable) when the RESET
command is sent too soon after the POWER ON command.
Unfortunately the i2c-hid specification does not specify the need for
a delay between these two commands. But it was discovered the Windows
driver has a 1ms delay.
As a result, this patch modifies the i2c-hid module to add a sleep
inbetween the POWER ON and RESET commands which lasts between 1ms and 5ms.
See https://github.com/vlasenko/hid-asus-dkms/issues/24 for further
details.
Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|