summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2017-11-24rcu: Fix up pending cbs check in rcu_prepare_for_idleNeeraj Upadhyay1-1/+1
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream. The pending-callbacks check in rcu_prepare_for_idle() is backwards. It should accelerate if there are pending callbacks, but the check rather uselessly accelerates only if there are no callbacks. This commit therefore inverts this check. Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-15workqueue: Fix NULL pointer dereferenceLi Bin1-1/+2
commit cef572ad9bd7f85035ba8272e5352040e8be0152 upstream. When queue_work() is used in irq (not in task context), there is a potential case that trigger NULL pointer dereference. ---------------------------------------------------------------- worker_thread() |-spin_lock_irq() |-process_one_work() |-worker->current_pwq = pwq |-spin_unlock_irq() |-worker->current_func(work) |-spin_lock_irq() |-worker->current_pwq = NULL |-spin_unlock_irq() //interrupt here |-irq_handler |-__queue_work() //assuming that the wq is draining |-is_chained_work(wq) |-current_wq_worker() //Here, 'current' is the interrupted worker! |-current->current_pwq is NULL here! |-schedule() ---------------------------------------------------------------- Avoid it by checking for task context in current_wq_worker(), and if not in task context, we shouldn't use the 'current' to check the condition. Reported-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Li Bin <huawei.libin@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-08perf/cgroup: Fix perf cgroup hierarchy supportTejun Heo1-2/+4
commit be96b316deff35e119760982c43af74e606fa143 upstream. The following commit: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups") made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event targets %current's cgroup. This breaks perf_event's hierarchical support because events which target one of the ancestors get ignored. Fix it by using cgroup_is_descendant() test instead of equality. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Fixes: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups") Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-08futex: Fix more put_pi_state() vs. exit_pi_state_list() racesPeter Zijlstra1-3/+20
commit 153fbd1226fb30b8630802aa5047b8af5ef53c9f upstream. Dmitry (through syzbot) reported being able to trigger the WARN in get_pi_state() and a use-after-free on: raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); Both are due to this race: exit_pi_state_list() put_pi_state() lock(&curr->pi_lock) while() { pi_state = list_first_entry(head); hb = hash_futex(&pi_state->key); unlock(&curr->pi_lock); dec_and_test(&pi_state->refcount); lock(&hb->lock) lock(&pi_state->pi_mutex.wait_lock) // uaf if pi_state free'd lock(&curr->pi_lock); .... unlock(&curr->pi_lock); get_pi_state(); // WARN; refcount==0 The problem is we take the reference count too late, and don't allow it being 0. Fix it by using inc_not_zero() and simply retrying the loop when we fail to get a refcount. In that case put_pi_state() should remove the entry from the list. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Gratian Crisan <gratian.crisan@ni.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: dvhart@infradead.org Cc: syzbot <bot+2af19c9e1ffe4d4ee1d16c56ae7580feaee75765@syzkaller.appspotmail.com> Cc: syzkaller-bugs@googlegroups.com Fixes: c74aef2d06a9 ("futex: Fix pi_state->owner serialization") Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02workqueue: replace pool->manager_arb mutex with a flagTejun Heo1-22/+15
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream. Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep: [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted [ 1270.473240] ----------------------------------------------------- [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 [ 1270.474994] [ 1270.474994] and this task is already holding: [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0 [ 1270.476046] which would create a new lock dependency: [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} [ 1270.476949] [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: [ 1270.477553] (&pool->lock/1){-.-.} ... [ 1270.488900] to a HARDIRQ-irq-unsafe lock: [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} ... [ 1270.494735] Possible interrupt unsafe locking scenario: [ 1270.494735] [ 1270.495250] CPU0 CPU1 [ 1270.495600] ---- ---- [ 1270.495947] lock(&(&lock->wait_lock)->rlock); [ 1270.496295] local_irq_disable(); [ 1270.496753] lock(&pool->lock/1); [ 1270.497205] lock(&(&lock->wait_lock)->rlock); [ 1270.497744] <Interrupt> [ 1270.497948] lock(&pool->lock/1); , which will cause a irq inversion deadlock if the above lock scenario happens. The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking mutex while holding an irq spinlock was never safe and this problem has been around forever but it never got noticed because the only time the mutex is usually trylocked while holding irqlock making actual failures very unlikely and lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail"). Using mutex for pool->manager_arb has always been a bit of stretch. It primarily is an mechanism to arbitrate managership between workers which can easily be done with a pool flag. The only reason it became a mutex is that pool destruction path wants to exclude parallel managing operations. This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and make the destruction path wait for the current manager on a wait queue. v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18genirq/cpuhotplug: Add sanity check for effective affinity maskThomas Gleixner1-1/+27
commit 60b09c51bb4fb46e2331fdbb39f91520f31d35f7 upstream. The effective affinity mask handling has no safety net when the mask is not updated by the interrupt chip or the mask contains offline CPUs. If that happens the CPU unplug code fails to migrate interrupts. Add sanity checks and emit a warning when the mask contains only offline CPUs. Fixes: 415fcf1a2293 ("genirq/cpuhotplug: Use effective affinity mask") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18genirq/cpuhotplug: Enforce affinity setting on startup of managed irqsThomas Gleixner2-1/+4
commit e43b3b58548051f8809391eb7bec7a27ed3003ea upstream. Managed interrupts can end up in a stale state on CPU hotplug. If the interrupt is not targeting a single CPU, i.e. the affinity mask spawns multiple CPUs then the following can happen: After boot: dstate: 0x01601200 IRQD_ACTIVATED IRQD_IRQ_STARTED IRQD_SINGLE_TARGET IRQD_AFFINITY_SET IRQD_AFFINITY_MANAGED node: 0 affinity: 24-31 effectiv: 24 pending: 0 After offlining CPU 31 - 24 dstate: 0x01a31000 IRQD_IRQ_DISABLED IRQD_IRQ_MASKED IRQD_SINGLE_TARGET IRQD_AFFINITY_SET IRQD_AFFINITY_MANAGED IRQD_MANAGED_SHUTDOWN node: 0 affinity: 24-31 effectiv: 24 pending: 0 Now CPU 25 gets onlined again, so it should get the effective interrupt affinity for this interruopt, but due to the x86 interrupt affinity setter restrictions this ends up after restarting the interrupt with: dstate: 0x01601300 IRQD_ACTIVATED IRQD_IRQ_STARTED IRQD_SINGLE_TARGET IRQD_AFFINITY_SET IRQD_SETAFFINITY_PENDING IRQD_AFFINITY_MANAGED node: 0 affinity: 24-31 effectiv: 24 pending: 24-31 So the interrupt is still affine to CPU 24, which was the last CPU to go offline of that affinity set and the move to an online CPU within 24-31, in this case 25, is pending. This mechanism is x86/ia64 specific as those architectures cannot move interrupts from thread context and do this when an interrupt is actually handled. So the move is set to pending. Whats worse is that offlining CPU 25 again results in: dstate: 0x01601300 IRQD_ACTIVATED IRQD_IRQ_STARTED IRQD_SINGLE_TARGET IRQD_AFFINITY_SET IRQD_SETAFFINITY_PENDING IRQD_AFFINITY_MANAGED node: 0 affinity: 24-31 effectiv: 24 pending: 24-31 This means the interrupt has not been shut down, because the outgoing CPU is not in the effective affinity mask, but of course nothing notices that the effective affinity mask is pointing at an offline CPU. In the case of restarting a managed interrupt the move restriction does not apply, so the affinity setting can be made unconditional. This needs to be done _before_ the interrupt is started up as otherwise the condition for moving it from thread context would not longer be fulfilled. With that change applied onlining CPU 25 after offlining 31-24 results in: dstate: 0x01600200 IRQD_ACTIVATED IRQD_IRQ_STARTED IRQD_SINGLE_TARGET IRQD_AFFINITY_MANAGED node: 0 affinity: 24-31 effectiv: 25 pending: And after offlining CPU 25: dstate: 0x01a30000 IRQD_IRQ_DISABLED IRQD_IRQ_MASKED IRQD_SINGLE_TARGET IRQD_AFFINITY_MANAGED IRQD_MANAGED_SHUTDOWN node: 0 affinity: 24-31 effectiv: 25 pending: which is the correct and expected result. Fixes: 761ea388e8c4 ("genirq: Handle managed irqs gracefully in irq_startup()") Reported-by: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: axboe@kernel.dk Cc: linux-scsi@vger.kernel.org Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: mpe@ellerman.id.au Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: keith.busch@intel.com Cc: peterz@infradead.org Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-14waitid(): Add missing access_ok() checksKees Cook1-0/+6
commit 96ca579a1ecc943b75beba58bebb0356f6cc4b51 upstream. Adds missing access_ok() checks. CVE-2017-5123 Reported-by: Chris Salls <chrissalls5@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12ftrace: Fix kmemleak in unregister_ftrace_graphShu Wang1-14/+0
commit 2b0b8499ae75df91455bbeb7491d45affc384fb0 upstream. The trampoline allocated by function tracer was overwriten by function_graph tracer, and caused a memory leak. The save_global_trampoline should have saved the previous trampoline in register_ftrace_graph() and restored it in unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was only used in unregister_ftrace_graph as default value 0, and it overwrote the previous trampoline's value. Causing the previous allocated trampoline to be lost. kmmeleak backtrace: kmemleak_vmalloc+0x77/0xc0 __vmalloc_node_range+0x1b5/0x2c0 module_alloc+0x7c/0xd0 arch_ftrace_update_trampoline+0xb5/0x290 ftrace_startup+0x78/0x210 register_ftrace_function+0x8b/0xd0 function_trace_init+0x4f/0x80 tracing_set_tracer+0xe6/0x170 tracing_set_trace_write+0x90/0xd0 __vfs_write+0x37/0x170 vfs_write+0xb2/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x67/0x180 return_from_SYSCALL_64+0x0/0x6a [ Looking further into this, I found that this was left over from when the function and function graph tracers shared the same ftrace_ops. But in commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer together"), the two were separated, and the save_global_trampoline no longer was necessary (and it may have been broken back then too). -- Steven Rostedt ] Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together") Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returnsWaiman Long1-0/+8
commit c4fa6c43ce4b427350cfbb659436bfe3d9e09a1d upstream. The cgroup_taskset structure within the larger cgroup_mgctx structure is supposed to be used once and then discarded. That is not really the case in the hotplug code path: cpuset_hotplug_workfn() - cgroup_transfer_tasks() - cgroup_migrate() - cgroup_migrate_add_task() - cgroup_migrate_execute() In this case, the cgroup_migrate() function is called multiple time with the same cgroup_mgctx structure to transfer the tasks from one cgroup to another one-by-one. The second time cgroup_migrate() is called, the cgroup_taskset will be in an incorrect state and so may cause the system to panic. For example, [ 150.888410] Faulting instruction address: 0xc0000000001db648 [ 150.888414] Oops: Kernel access of bad area, sig: 11 [#1] [ 150.888417] SMP NR_CPUS=2048 [ 150.888417] NUMA [ 150.888419] pSeries : [ 150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0 [ 150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0 [ 150.888551] Call Trace: [ 150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable) [ 150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0 [ 150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370 [ 150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0 [ 150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0 [ 150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0 [ 150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0 [ 150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74 To allow reuse of the cgroup_mgctx structure, some fields in that structure are now re-initialized at the end of cgroup_migrate_execute() function call so that the structure can be reused again in a later iteration without causing problem. This bug was introduced in the commit e595cd706982 ("group: track migration context in cgroup_mgctx") in 4.11. This commit moves the cgroup_taskset initialization out of cgroup_migrate(). The commit 10467270fb3 ("cgroup: don't call migration methods if there are no tasks to migrate") helped, but did not completely resolve the problem. Fixes: e595cd706982bff0211e6fafe5a108421e747fbc ("group: track migration context in cgroup_mgctx") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf: fix bpf_tail_call() x64 JITAlexei Starovoitov1-1/+1
[ Upstream commit 90caccdd8cc0215705f18b92771b449b01e2474a ] - bpf prog_array just like all other types of bpf array accepts 32-bit index. Clarify that in the comment. - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes - tighten corresponding check in the interpreter to stay consistent The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay zero and though JIT code is wrong it will check bounds correctly. Hence two fixes tags. All other JITs don't have this problem. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation") Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf: one perf event close won't free bpf program attached by another perf eventYonghong Song1-1/+2
[ Upstream commit ec9dd352d591f0c90402ec67a317c1ed4fb2e638 ] This patch fixes a bug exhibited by the following scenario: 1. fd1 = perf_event_open with attr.config = ID1 2. attach bpf program prog1 to fd1 3. fd2 = perf_event_open with attr.config = ID1 <this will be successful> 4. user program closes fd2 and prog1 is detached from the tracepoint. 5. user program with fd1 does not work properly as tracepoint no output any more. The issue happens at step 4. Multiple perf_event_open can be called successfully, but only one bpf prog pointer in the tp_event. In the current logic, any fd release for the same tp_event will free the tp_event->prog. The fix is to free tp_event->prog only when the closing fd corresponds to the one which registered the program. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf: do not disable/enable BH in bpf_map_free_id()Eric Dumazet1-2/+4
[ Upstream commit 930651a75bf1ba6893a8b8475270664ebdb6cf4a ] syzkaller reported following splat [1] Since hard irq are disabled by the caller, bpf_map_free_id() should not try to enable/disable BH. Another solution would be to change htab_map_delete_elem() to defer the free_htab_elem() call after raw_spin_unlock_irqrestore(&b->lock, flags), but this might be not enough to cover other code paths. [1] WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip +0x1e/0x160 kernel/softirq.c:161 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+ #23 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x417 kernel/panic.c:181 __warn+0x1c4/0x1d9 kernel/panic.c:542 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178 do_trap_no_signal arch/x86/kernel/traps.c:212 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:261 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905 RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161 RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046 RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000 RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0 RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6 R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47 R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001 __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline] _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207 spin_unlock_bh include/linux/spinlock.h:361 [inline] bpf_map_free_id kernel/bpf/syscall.c:197 [inline] __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227 bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235 bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96 free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658 htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063 map_delete_elem kernel/bpf/syscall.c:633 [inline] SYSC_bpf kernel/bpf/syscall.c:1479 [inline] SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451 entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: f3f1c054c288 ("bpf: Introduce bpf_map ID") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree1-1/+2
[ Upstream commit e67b8a685c7c984e834e3181ef4619cd7025a136 ] Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05futex: Fix pi_state->owner serializationPeter Zijlstra1-11/+22
commit c74aef2d06a9f59cece89093eecc552933cba72a upstream. There was a reported suspicion about a race between exit_pi_state_list() and put_pi_state(). The same report mentioned the comment with put_pi_state() said it should be called with hb->lock held, and it no longer is in all places. As it turns out, the pi_state->owner serialization is indeed broken. As per the new rules: 734009e96d19 ("futex: Change locking rules") pi_state->owner should be serialized by pi_state->pi_mutex.wait_lock. For the sites setting pi_state->owner we already hold wait_lock (where required) but exit_pi_state_list() and put_pi_state() were not and raced on clearing it. Fixes: 734009e96d19 ("futex: Change locking rules") Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dvhart@infradead.org Link: https://lkml.kernel.org/r/20170922154806.jd3ffltfk24m4o4y@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05irq/generic-chip: Don't replace domain's nameJeffy Chen1-1/+0
commit 72364d320644c12948786962673772f271039a4a upstream. When generic irq chips are allocated for an irq domain the domain name is set to the irq chip name. That was done to have named domains before the recent changes which enforce domain naming were done. Since then the overwrite causes a memory leak when the domain name is dynamically allocated and even worse it would cause the domain free code to free the wrong name pointer, which might point to a constant. Remove the name assignment to prevent this. Fixes: d59f6617eef0 ("genirq: Allow fwnode to carry name information only") Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20170928043731.4764-1-jeffy.chen@rock-chips.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05sched/sysctl: Check user input value of sysctl_sched_time_avgEthan Zhao1-1/+2
commit 5ccba44ba118a5000cccc50076b0344632459779 upstream. System will hang if user set sysctl_sched_time_avg to 0: [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3 ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000 0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8 ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08 Call Trace: <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200 [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600 [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530 [<ffffffff810c5b84>] ? load_balance+0x194/0x900 [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0 [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290 [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0 [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320 [<ffffffff8108ac85>] ? irq_exit+0x125/0x130 [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160 [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30 [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80 <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230 [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230 [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20 [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420 [<ffffffff81053373>] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function: update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hitting invalid input, and set the minimum limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: ethan.kernel@gmail.com Cc: keescook@chromium.org Cc: mcgrof@kernel.org Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05fix infoleak in waitid(2)Al Viro1-13/+10
commit 6c85501f2fabcfc4fc6ed976543d252c4eaf4be9 upstream. kernel_waitid() can return a PID, an error or 0. rusage is filled in the first case and waitid(2) rusage should've been copied out exactly in that case, *not* whenever kernel_waitid() has not returned an error. Compat variant shares that braino; none of kernel_wait4() callers do, so the below ought to fix it. Reported-and-tested-by: Alexander Potapenko <glider@google.com> Fixes: ce72a16fa705 ("wait4(2)/waitid(2): separate copying rusage to userland") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()Oleg Nesterov1-7/+16
commit 66a733ea6b611aecf0119514d2dddab5f9d6c01e upstream. As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end up using different filters. Once we drop ->siglock it is possible for task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC. Fixes: f8e529ed941b ("seccomp, ptrace: add support for dumping seccomp filters") Reported-by: Chris Salls <chrissalls5@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> [tycho: add __get_seccomp_filter vs. open coding refcount_inc()] Signed-off-by: Tycho Andersen <tycho@docker.com> [kees: tweak commit log] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05extable: Enable RCU if it is not watching in kernel_text_address()Steven Rostedt (VMware)1-5/+30
commit e8cac8b1d10589be45671a5ade0926a639b543b7 upstream. If kernel_text_address() is called when RCU is not watching, it can cause an RCU bug because is_module_text_address(), the is_kprobe_*insn_slot() and is_bpf_text_address() functions require the use of RCU. Only enable RCU if it is not currently watching before it calls is_module_text_address(). The use of rcu_nmi_enter() is used to enable RCU because kernel_text_address() can happen pretty much anywhere (like an NMI), and even from within an NMI. It is called via save_stack_trace() that can be called by any WARN() or tracing function, which can happen while RCU is not watching (for example, going to or coming from idle, or during CPU take down or bring up). Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05extable: Consolidate *kernel_text_address() functionsSteven Rostedt (VMware)1-9/+1
commit 9aadde91b3c035413c806619beb3e3ef6e697953 upstream. The functionality between kernel_text_address() and _kernel_text_address() is the same except that _kernel_text_address() does a little more (that function needs a rename, but that can be done another time). Instead of having duplicate code in both, simply have _kernel_text_address() calls kernel_text_address() instead. This is marked for stable because there's an RCU bug that can happen if one of these functions gets called while RCU is not watching. That fix depends on this fix to keep from having to write the fix twice. Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05rcu: Allow for page faults in NMI handlersPaul E. McKenney1-0/+10
commit 28585a832602747cbfa88ad8934013177a3aae38 upstream. A number of architecture invoke rcu_irq_enter() on exception entry in order to allow RCU read-side critical sections in the exception handler when the exception is from an idle or nohz_full CPU. This works, at least unless the exception happens in an NMI handler. In that case, rcu_nmi_enter() would already have exited the extended quiescent state, which would mean that rcu_irq_enter() would (incorrectly) cause RCU to think that it is again in an extended quiescent state. This will in turn result in lockdep splats in response to later RCU read-side critical sections. This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to take no action if there is an rcu_nmi_enter() in effect, thus avoiding the unscheduled return to RCU quiescent state. This in turn should make the kernel safe for on-demand RCU voyeurism. Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05tracing: Remove RCU work arounds from stack tracerSteven Rostedt (VMware)1-15/+0
commit 15516c89acce948debc4c598e03c3fee53045797 upstream. Currently the stack tracer calls rcu_irq_enter() to make sure RCU is watching when it records a stack trace. But if the stack tracer is triggered while tracing inside of a rcu_irq_enter(), calling rcu_irq_enter() unconditionally can be problematic. The reason for having rcu_irq_enter() in the first place has been fixed from within the saving of the stack trace code, and there's no reason for doing it in the stack tracer itself. Just remove it. Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05tracing: Erase irqsoff trace with empty writeBo Yan1-2/+8
commit 8dd33bcb7050dd6f8c1432732f930932c9d3a33e upstream. One convenient way to erase trace is "echo > trace". However, this is currently broken if the current tracer is irqsoff tracer. This is because irqsoff tracer use max_buffer as the default trace buffer. Set the max_buffer as the one to be cleared when it's the trace buffer currently in use. Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com Cc: <mingo@redhat.com> Fixes: 4acd4d00f ("tracing: give easy way to clear trace buffer") Signed-off-by: Bo Yan <byan@nvidia.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05tracing: Fix trace_pipe behavior for instance tracesTahsin Erdogan1-1/+1
commit 75df6e688ccd517e339a7c422ef7ad73045b18a2 upstream. When reading data from trace_pipe, tracing_wait_pipe() performs a check to see if tracing has been turned off after some data was read. Currently, this check always looks at global trace state, but it should be checking the trace instance where trace_pipe is located at. Because of this bug, cat instances/i1/trace_pipe in the following script will immediately exit instead of waiting for data: cd /sys/kernel/debug/tracing echo 0 > tracing_on mkdir -p instances/i1 echo 1 > instances/i1/tracing_on echo 1 > instances/i1/events/sched/sched_process_exec/enable cat instances/i1/trace_pipe Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com Fixes: 10246fa35d4f ("tracing: give easy way to clear trace buffer") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05genirq: Fix cpumask check in __irq_startup_managed()Thomas Gleixner1-1/+1
commit 9cb067ef8a10bb13112e4d1c0ea996ec96527422 upstream. The result of cpumask_any_and() is invalid when result greater or equal nr_cpu_ids. The current check is checking for greater only. Fix it. Fixes: 761ea388e8c4 ("genirq: Handle managed irqs gracefully in irq_startup()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05genirq/msi: Fix populating multiple interruptsJohn Keeping1-2/+3
commit 596a7a1d0989c621c3ae49be73a1d1f9de22eb5a upstream. On allocating the interrupts routed via a wire-to-MSI bridge, the allocator iterates over the MSI descriptors to build the hierarchy, but fails to use the descriptor interrupt number, and instead uses the base number, generating the wrong IRQ domain mappings. The fix is to use the MSI descriptor interrupt number when setting up the interrupt instead of the base interrupt for the allocation range. The only saving grace is that although the MSI descriptors are allocated in bulk, the wired interrupts are only allocated one by one (so desc->irq == virq) and the bug went unnoticed so far. Fixes: 2145ac9310b60 ("genirq/msi: Add msi_domain_populate_irqs") Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/20170906103540.373864a2.john@metanate.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05genirq: Make sparse_irq_lock protect what it should protectThomas Gleixner1-17/+7
commit 12ac1d0f6c3e95732d144ffa65c8b20fbd9aa462 upstream. for_each_active_irq() iterates the sparse irq allocation bitmap. The caller must hold sparse_irq_lock. Several code pathes expect that an active bit in the sparse bitmap also has a valid interrupt descriptor. Unfortunately that's not true. The (de)allocation is a two step process, which holds the sparse_irq_lock only across the queue/remove from the radix tree and the set/clear in the allocation bitmap. If a iteration locks sparse_irq_lock between the two steps, then it might see an active bit but the corresponding irq descriptor is NULL. If that is dereferenced unconditionally, then the kernel oopses. Of course, all iterator sites could be audited and fixed, but.... There is no reason why the sparse_irq_lock needs to be dropped between the two steps, in fact the code becomes simpler when the mutex is held across both and the semantics become more straight forward, so future problems of missing NULL pointer checks in the iteration are avoided and all existing sites are fixed in one go. Expand the lock held sections so both operations are covered and the bitmap and the radixtree are in sync. Fixes: a05a900a51c7 ("genirq: Make sparse_lock a mutex") Reported-and-tested-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27sched/cpuset/pm: Fix cpuset vs. suspend-resume bugsPeter Zijlstra3-6/+22
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream. Cpusets vs. suspend-resume is _completely_ broken. And it got noticed because it now resulted in non-cpuset usage breaking too. On suspend cpuset_cpu_inactive() doesn't call into cpuset_update_active_cpus() because it doesn't want to move tasks about, there is no need, all tasks are frozen and won't run again until after we've resumed everything. But this means that when we finally do call into cpuset_update_active_cpus() after resuming the last frozen cpu in cpuset_cpu_active(), the top_cpuset will not have any difference with the cpu_active_mask and this it will not in fact do _anything_. So the cpuset configuration will not be restored. This was largely hidden because we would unconditionally create identity domains and mobile users would not in fact use cpusets much. And servers what do use cpusets tend to not suspend-resume much. An addition problem is that we'd not in fact wait for the cpuset work to finish before resuming the tasks, allowing spurious migrations outside of the specified domains. Fix the rebuild by introducing cpuset_force_rebuild() and fix the ordering with cpuset_wait_for_hotplug(). Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling") Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27tracing: Apply trace_clock changes to instance max bufferBaohong Liu1-1/+1
commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 upstream. Currently trace_clock timestamps are applied to both regular and max buffers only for global trace. For instance trace, trace_clock timestamps are applied only to regular buffer. But, regular and max buffers can be swapped, for example, following a snapshot. So, for instance trace, bad timestamps can be seen following a snapshot. Let's apply trace_clock timestamps to instance max buffer as well. Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers") Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27tracing: Fix clear of RECORDED_TGID flag when disabling trace eventChunyu Hu1-1/+1
commit 7685ab6c58557c6234f3540260195ecbee7fc4b3 upstream. When disabling one trace event, the RECORDED_TGID flag in the event file is not correctly cleared. It's clearing RECORDED_CMD flag when it should clear RECORDED_TGID flag. Link: http://lkml.kernel.org/r/1504589806-8425-1-git-send-email-chuhu@redhat.com Cc: Joel Fernandes <joelaf@google.com> Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27tracing: Add barrier to trace_printk() buffer nesting modificationSteven Rostedt (VMware)1-1/+7
commit 3d9622c12c8873911f4cc0ccdabd0362c2fca06b upstream. trace_printk() uses 4 buffers, one for each context (normal, softirq, irq and NMI), such that it does not need to worry about one context preempting the other. There's a nesting counter that gets incremented to figure out which buffer to use. If the context gets preempted by another context which calls trace_printk() it will increment the counter and use the next buffer, and restore the counter when it is finished. The problem is that gcc may optimize the modification of the buffer nesting counter and it may not be incremented in memory before the buffer is used. If this happens, and the context gets interrupted by another context, it could pick the same buffer and corrupt the one that is being used. Compiler barriers need to be added after the nesting variable is incremented and before it is decremented to prevent usage of the context buffers by more than one context at the same time. Cc: Andy Lutomirski <luto@kernel.org> Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count") Hat-tip-to: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27ftrace: Fix memleak when unregistering dynamic ops when tracing disabledSteven Rostedt (VMware)1-4/+6
commit edb096e00724f02db5f6ec7900f3bbd465c6c76f upstream. If function tracing is disabled by the user via the function-trace option or the proc sysctl file, and a ftrace_ops that was allocated on the heap is unregistered, then the shutdown code exits out without doing the proper clean up. This was found via kmemleak and running the ftrace selftests, as one of the tests unregisters with function tracing disabled. # cat kmemleak unreferenced object 0xffffffffa0020000 (size 4096): comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s) hex dump (first 32 bytes): 55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89 U.t$.UH...t$.UH. e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c .H......H.D$PH.L backtrace: [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0 [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0 [<ffffffff8109697f>] module_alloc+0x4f/0x90 [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420 [<ffffffff81249947>] ftrace_startup+0xe7/0x300 [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90 [<ffffffff81263786>] trace_selftest_ops+0x204/0x397 [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624 [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7 [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192 [<ffffffff81002230>] do_one_initcall+0x90/0x1e2 [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe [<ffffffff81d61ec3>] kernel_init+0x13/0x122 [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27ftrace: Fix selftest goto location on errorSteven Rostedt (VMware)1-1/+1
commit 46320a6acc4fb58f04bcf78c4c942cc43b20f986 upstream. In the second iteration of trace_selftest_ops(), the error goto label is wrong in the case where trace_selftest_test_global_cnt is off. In the case of error, it leaks the dynamic ops that was allocated. Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27smp/hotplug: Handle removal correctly in cpuhp_store_callbacks()Ethan Barnes1-1/+11
commit 0c96b27305faf06c068b45e07d28336c80dac286 upstream. If cpuhp_store_callbacks() is called for CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN, which are the indicators for dynamically allocated states, then cpuhp_store_callbacks() allocates a new dynamic state. The first allocation in each range returns CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN. If cpuhp_remove_state() is invoked for one of these states, then there is no protection against the allocation mechanism. So the removal, which should clear the callbacks and the name, gets a new state assigned and clears that one. As a consequence the state which should be cleared stays initialized. A consecutive CPU hotplug operation dereferences the state callbacks and accesses either freed or reused memory, resulting in crashes. Add a protection against this by checking the name argument for NULL. If it's NULL it's a removal. If not, it's an allocation. [ tglx: Added a comment and massaged changelog ] Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface") Signed-off-by: Ethan Barnes <ethan.barnes@sandisk.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.or> Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu> Cc: Sebastian Siewior <bigeasy@linutronix.d> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/DM2PR04MB398242FC7776D603D9F99C894A60@DM2PR04MB398.namprd04.prod.outlook.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27srcu: Provide ordering for CPU not involved in grace periodPaul E. McKenney1-0/+9
commit 35732cf9dd38b1efb0f2f22c91c61b51337d1ac3 upstream. Tree RCU guarantees that every online CPU has a memory barrier between any given grace period and any of that CPU's RCU read-side sections that must be ordered against that grace period. Since RCU doesn't always know where read-side critical sections are, the actual implementation guarantees order against prior and subsequent non-idle non-offline code, whether in an RCU read-side critical section or not. As a result, there does not need to be a memory barrier at the end of synchronize_rcu() and friends because the ordering internal to the grace period has ordered every CPU's post-grace-period execution against each CPU's pre-grace-period execution, again for all non-idle online CPUs. In contrast, SRCU can have non-idle online CPUs that are completely uninvolved in a given SRCU grace period, for example, a CPU that never runs any SRCU read-side critical sections and took no part in the grace-period processing. It is in theory possible for a given synchronize_srcu()'s wakeup to be delivered to a CPU that was completely uninvolved in the prior SRCU grace period, which could mean that the code following that synchronize_srcu() would end up being unordered with respect to both the grace period and any pre-existing SRCU read-side critical sections. This commit therefore adds an smp_mb() to the end of __synchronize_srcu(), which prevents this scenario from occurring. Reported-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Lance Roy <ldr709@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-09ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.Martijn Coenen1-0/+1
commit 9e18d0c82f0c07f2a41898d4adbb698a953403ee upstream. These will be required going forward. Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-03Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for a thinko in the raw timekeeper update which causes clock MONOTONIC_RAW to run with erratically increased frequency" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix ktime_get_raw() incorrect base accumulation
2017-09-03Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds5-10/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: - Prevent a potential inconistency in the perf user space access which might lead to evading sanity checks. - Prevent perf recording function trace entries twice * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/ftrace: Fix double traces of perf on ftrace:function perf/core: Fix potential double-fetch bug
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-13/+17
Pull networking fixes from David Miller: 1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel Borkmann. 2) IPSEC ESP error paths leak memory, from Steffen Klassert. 3) We need an RCU grace period before freeing fib6_node objects, from Wei Wang. 4) Must check skb_put_padto() return value in HSR driver, from FLorian Fainelli. 5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew Jeffery. 6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric Dumazet. 7) Use after free when tcf_chain_destroy() called multiple times, from Jiri Pirko. 8) Fix KSZ DSA tag layer multiple free of SKBS, from Florian Fainelli. 9) Fix leak of uninitialized memory in sctp_get_sctp_info(), inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill(). From Stefano Brivio. 10) L2TP tunnel refcount fixes from Guillaume Nault. 11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi Kauperman. 12) Revert a PHY layer change wrt. handling of PHY_HALTED state in phy_stop_machine(), it causes regressions for multiple people. From Florian Fainelli. 13) When packets are sent out of br0 we have to clear the offload_fwdq_mark value. 14) Several NULL pointer deref fixes in packet schedulers when their ->init() routine fails. From Nikolay Aleksandrov. 15) Aquantium devices cannot checksum offload correctly when the packet is <= 60 bytes. From Pavel Belous. 16) Fix vnet header access past end of buffer in AF_PACKET, from Benjamin Poirier. 17) Double free in probe error paths of nfp driver, from Dan Carpenter. 18) QOS capability not checked properly in DCB init paths of mlx5 driver, from Huy Nguyen. 19) Fix conflicts between firmware load failure and health_care timer in mlx5, also from Huy Nguyen. 20) Fix dangling page pointer when DMA mapping errors occur in mlx5, from Eran Ben ELisha. 21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly, from Michael Chan. 22) Missing MSIX vector free in bnxt_en, also from Michael Chan. 23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo Colitti. 24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann. 25) bpf_setsockopts() erroneously always returns -EINVAL even on success. Fix from Yuchung Cheng. 26) tipc_rcv() needs to linearize the SKB before parsing the inner headers, from Parthasarathy Bhuvaragan. 27) Fix deadlock between link status updates and link removal in netvsc driver, from Stephen Hemminger. 28) Missed locking of page fragment handling in ESP output, from Steffen Klassert. 29) Fix refcnt leak in ebpf congestion control code, from Sabrina Dubroca. 30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value, from Christophe Jaillet. 31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during early demux, from Paolo Abeni. 32) Several info leaks in xfrm_user layer, from Mathias Krause. 33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio. 34) Properly propagate obsolete state of route upwards in ipv6 so that upper holders like xfrm can see it. From Xin Long. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits) udp: fix secpath leak bridge: switchdev: Clear forward mark when transmitting packet mlxsw: spectrum: Forbid linking to devices that have uppers wl1251: add a missing spin_lock_init() Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278 kcm: do not attach PF_KCM sockets to avoid deadlock sch_tbf: fix two null pointer dereferences on init failure sch_sfq: fix null pointer dereference on init failure sch_netem: avoid null pointer deref on init failure sch_fq_codel: avoid double free on init failure sch_cbq: fix null pointer dereferences on init failure sch_hfsc: fix null pointer deref and double free on init failure sch_hhf: fix null pointer dereference on init failure sch_multiq: fix double free on init failure sch_htb: fix crash on init failure net/mlx5e: Fix CQ moderation mode not set properly net/mlx5e: Fix inline header size for small packets net/mlx5: E-Switch, Unload the representors in the correct order net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address ...
2017-09-01mm, uprobes: fix multiple free of ->uprobes_state.xol_areaEric Biggers2-2/+8
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before the new mm_struct's ->uprobes_state.xol_area has been set to NULL after being copied from the old mm_struct by the memcpy in dup_mm(). For a task that has previously hit a uprobe tracepoint, this resulted in the 'struct xol_area' being freed multiple times if the task was killed at just the right time while forking. Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather than in uprobe_dup_mmap(). With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of ->exe_file causing use-after-free"), provided that a uprobe tracepoint has been set on the fork_thread() function. For example: $ gcc reproducer.c -o reproducer -lpthread $ nm reproducer | grep fork_thread 0000000000400719 t fork_thread $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable $ ./reproducer Here is the use-after-free reported by KASAN: BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200 Read of size 8 at addr ffff8800320a8b88 by task reproducer/198 CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Call Trace: dump_stack+0xdb/0x185 print_address_description+0x7e/0x290 kasan_report+0x23b/0x350 __asan_report_load8_noabort+0x19/0x20 uprobe_clear_state+0x1c4/0x200 mmput+0xd6/0x360 do_exit+0x740/0x1670 do_group_exit+0x13f/0x380 get_signal+0x597/0x17d0 do_signal+0x99/0x1df0 exit_to_usermode_loop+0x166/0x1e0 syscall_return_slowpath+0x258/0x2c0 entry_SYSCALL_64_fastpath+0xbc/0xbe ... Allocated by task 199: save_stack_trace+0x1b/0x20 kasan_kmalloc+0xfc/0x180 kmem_cache_alloc_trace+0xf3/0x330 __create_xol_area+0x10f/0x780 uprobe_notify_resume+0x1674/0x2210 exit_to_usermode_loop+0x150/0x1e0 prepare_exit_to_usermode+0x14b/0x180 retint_user+0x8/0x20 Freed by task 199: save_stack_trace+0x1b/0x20 kasan_slab_free+0xa8/0x1a0 kfree+0xba/0x210 uprobe_clear_state+0x151/0x200 mmput+0xd6/0x360 copy_process.part.8+0x605f/0x65d0 _do_fork+0x1a5/0xbd0 SyS_clone+0x19/0x20 do_syscall_64+0x22f/0x660 return_from_SYSCALL_64+0x0/0x7a Note: without KASAN, you may instead see a "Bad page state" message, or simply a general protection fault. Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers <ebiggers@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01kernel/kthread.c: kthread_worker: don't hog the cpuShaohua Li1-0/+1
If the worker thread continues getting work, it will hog the cpu and rcu stall complains. Make it a good citizen. This is triggered in a loop block device test. Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-29Merge branch 'for-4.13-fixes' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "A late but obvious fix for cgroup. I broke the 'cpuset.memory_pressure' file a long time ago (v4.4) by accidentally deleting its file index, which made it a duplicate of the 'cpuset.memory_migrate' file. Spotted and fixed by Waiman" * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix incorrect memory_pressure control file mapping
2017-08-29perf/ftrace: Fix double traces of perf on ftrace:functionZhou Chengming5-10/+17
When running perf on the ftrace:function tracepoint, there is a bug which can be reproduced by: perf record -e ftrace:function -a sleep 20 & perf record -e ftrace:function ls perf script ls 10304 [005] 171.853235: ftrace:function: perf_output_begin ls 10304 [005] 171.853237: ftrace:function: perf_output_begin ls 10304 [005] 171.853239: ftrace:function: task_tgid_nr_ns ls 10304 [005] 171.853240: ftrace:function: task_tgid_nr_ns ls 10304 [005] 171.853242: ftrace:function: __task_pid_nr_ns ls 10304 [005] 171.853244: ftrace:function: __task_pid_nr_ns We can see that all the function traces are doubled. The problem is caused by the inconsistency of the register function perf_ftrace_event_register() with the probe function perf_ftrace_function_call(). The former registers one probe for every perf_event. And the latter handles all perf_events on the current cpu. So when two perf_events on the current cpu, the traces of them will be doubled. So this patch adds an extra parameter "event" for perf_tp_event, only send sample data to this event when it's not NULL. Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: huawei.libin@huawei.com Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29perf/core: Fix potential double-fetch bugMeng Xu1-0/+2
While examining the kernel source code, I found a dangerous operation that could turn into a double-fetch situation (a race condition bug) where the same userspace memory region are fetched twice into kernel with sanity checks after the first fetch while missing checks after the second fetch. 1. The first fetch happens in line 9573 get_user(size, &uattr->size). 2. Subsequently the 'size' variable undergoes a few sanity checks and transformations (line 9577 to 9584). 3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size) 4. Given that 'uattr' can be fully controlled in userspace, an attacker can race condition to override 'uattr->size' to arbitrary value (say, 0xFFFFFFFF) after the first fetch but before the second fetch. The changed value will be copied to 'attr->size'. 5. There is no further checks on 'attr->size' until the end of this function, and once the function returns, we lose the context to verify that 'attr->size' conforms to the sanity checks performed in step 2 (line 9577 to 9584). 6. My manual analysis shows that 'attr->size' is not used elsewhere later, so, there is no working exploit against it right now. However, this could easily turns to an exploitable one if careless developers start to use 'attr->size' later. To fix this, override 'attr->size' from the second fetch to the one from the first fetch, regardless of what is actually copied in. In this way, it is assured that 'attr->size' is consistent with the checks performed after the first fetch. Signed-off-by: Meng Xu <mengxu.gatech@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: meng.xu@gatech.edu Cc: sanidhya@gatech.edu Cc: taesoo@gatech.edu Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-27Minor page waitqueue cleanupsLinus Torvalds1-3/+4
Tim Chen and Kan Liang have been battling a customer load that shows extremely long page wakeup lists. The cause seems to be constant NUMA migration of a hot page that is shared across a lot of threads, but the actual root cause for the exact behavior has not been found. Tim has a patch that batches the wait list traversal at wakeup time, so that we at least don't get long uninterruptible cases where we traverse and wake up thousands of processes and get nasty latency spikes. That is likely 4.14 material, but we're still discussing the page waitqueue specific parts of it. In the meantime, I've tried to look at making the page wait queues less expensive, and failing miserably. If you have thousands of threads waiting for the same page, it will be painful. We'll need to try to figure out the NUMA balancing issue some day, in addition to avoiding the excessive spinlock hold times. That said, having tried to rewrite the page wait queues, I can at least fix up some of the braindamage in the current situation. In particular: (a) we don't want to continue walking the page wait list if the bit we're waiting for already got set again (which seems to be one of the patterns of the bad load). That makes no progress and just causes pointless cache pollution chasing the pointers. (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway. Also add a comment about the layout of 'struct wait_page_key' - there is an external user of it in the cachefiles code that means that it has to match the layout of 'struct wait_bit_key' in the two first members. It so happens to match, because 'struct page *' and 'unsigned long *' end up having the same values simply because the page flags are the first member in struct page. Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Christopher Lameter <cl@linux.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-26Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-9/+41
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a timer granularity handling race+bug, which would manifest itself by spuriously increasing timeouts of some timers (from 1 jiffy to ~500 jiffies in the worst case measured) in certain nohz states" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix excessive granularity of new timers after a nohz idle
2017-08-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-20/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A single fix to not allow nonsensical event groups that result in kernel warnings" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix group {cpu,task} validation
2017-08-26time: Fix ktime_get_raw() incorrect base accumulationJohn Stultz1-3/+1
In comqit fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time handling"), the following code got mistakenly added to the update of the raw timekeeper: /* Update the monotonic raw base */ seconds = tk->raw_sec; nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift); tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec); Which adds the raw_sec value and the shifted down raw xtime_nsec to the base value. But the read function adds the shifted down tk->tkr_raw.xtime_nsec value another time, The result of this is that ktime_get_raw() users (which are all internal users) see the raw time move faster then it should (the rate at which can vary with the current size of tkr_raw.xtime_nsec), which has resulted in at least problems with graphics rendering performance. The change tried to match the monotonic base update logic: seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec); nsec = (u32) tk->wall_to_monotonic.tv_nsec; tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec); Which adds the wall_to_monotonic.tv_nsec value, but not the tk->tkr_mono.xtime_nsec value to the base. To fix this, simplify the tkr_raw.base accumulation to only accumulate the raw_sec portion, and do not include the tkr_raw.xtime_nsec portion, which will be added at read time. Fixes: fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time handling") Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1503701824-1645-1-git-send-email-john.stultz@linaro.org
2017-08-26fork: fix incorrect fput of ->exe_file causing use-after-freeEric Biggers1-0/+1
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before a reference is taken on the mm_struct's ->exe_file. Since the ->exe_file of the new mm_struct was already set to the old ->exe_file by the memcpy() in dup_mm(), it was possible for the mmput() in the error path of dup_mm() to drop a reference to ->exe_file which was never taken. This caused the struct file to later be freed prematurely. Fix it by updating mm_init() to NULL out the ->exe_file, in the same place it clears other things like the list of mmaps. This bug was found by syzkaller. It can be reproduced using the following C program: #define _GNU_SOURCE #include <pthread.h> #include <stdlib.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void *mmap_thread(void *_arg) { for (;;) { mmap(NULL, 0x1000000, PROT_READ, MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); } } static void *fork_thread(void *_arg) { usleep(rand() % 10000); fork(); } int main(void) { fork(); fork(); fork(); for (;;) { if (fork() == 0) { pthread_t t; pthread_create(&t, NULL, mmap_thread, NULL); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } No special kernel config options are needed. It usually causes a NULL pointer dereference in __remove_shared_vm_struct() during exit, or in dup_mmap() (which is usually inlined into copy_process()) during fork. Both are due to a vm_area_struct's ->vm_file being used after it's already been freed. Google Bug Id: 64772007 Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [v4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>