path: root/kernel
2017-06-01  bpf: split bpf core interpreter  (Alexei Starovoitov)  1 file, -7/+14
Split the __bpf_prog_run() interpreter into stack allocation and execution parts. The code section shrinks, which helps interpreter performance in some cases.

   text    data     bss     dec     hex filename
  26350   10328     624   37302    91b6 kernel/bpf/core.o.before
  25777   10328     624   36729    8f79 kernel/bpf/core.o.after

Very short programs got slower (due to the extra function call):

Before:
test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 7 PASS
test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 8 PASS
test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 7 PASS
test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 11 PASS
test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 7 PASS

After:
test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 11 PASS
test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 11 PASS
test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 11 PASS
test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 14 PASS
test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 10 PASS

Longer programs got faster:

Before:
test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 20286 20513 PASS
test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 31853 31768 PASS
test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9815 PASS
test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 6 PASS
test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13959 PASS
test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 210 PASS
test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 21724 PASS
test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19118 PASS

After:
test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 19008 18827 PASS
test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 29238 28450 PASS
test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9485 PASS
test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 12 PASS
test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13257 PASS
test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 213 PASS
test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 19389 PASS
test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19583 PASS

For real-world production programs the difference is noise. This patch is the first step towards reducing interpreter stack consumption.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
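In outline, the split looks like this (a sketch based on the description above; FP and ARG1 are the register-accessor macros from kernel/bpf/core.c, and details may differ from the exact patch):

  /* the execution loop no longer owns the interpreter stack; a thin
   * wrapper allocates stack and regs, then calls into the loop */
  static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn,
                             u64 *stack)
  {
          /* the large switch()-based instruction dispatch loop moves
           * here, operating on caller-provided regs and stack */
          return 0; /* placeholder for the BPF_EXIT return value */
  }

  static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
  {
          u64 stack[MAX_BPF_STACK / sizeof(u64)];
          u64 regs[MAX_BPF_REG];

          FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
          ARG1 = (u64) (unsigned long) ctx;
          return ___bpf_prog_run(regs, insn, stack);
  }

The wrapper explains both measurements: tiny programs pay for the extra call, while long programs benefit from the smaller, hotter dispatch function.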
2017-06-01  bpf: free up BPF_JMP | BPF_CALL | BPF_X opcode  (Alexei Starovoitov)  2 files, -2/+2
free up BPF_JMP | BPF_CALL | BPF_X opcode to be used by actual indirect call by register and use kernel internal opcode to mark call instruction into bpf_tail_call() helper. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31  audit: add ambient capabilities to CAPSET and BPRM_FCAPS records  (Richard Guy Briggs)  2 files, -3/+10
Capabilities were augmented to include ambient capabilities in v4.3 commit 58319057b784 ("capabilities: ambient capabilities"). Add ambient capabilities to the audit BPRM_FCAPS and CAPSET records. The record contains fields "old_pp", "old_pi", "old_pe", "new_pp", "new_pi", "new_pe" so in keeping with the previous record normalizations, change the "new_*" variants to simply drop the "new_" prefix.

A sample of the replaced BPRM_FCAPS record:

RAW: type=BPRM_FCAPS msg=audit(1491468034.252:237): fver=2
fp=0000000000200000 fi=0000000000000000 fe=1 old_pp=0000000000000000
old_pi=0000000000000000 old_pe=0000000000000000 old_pa=0000000000000000
pp=0000000000200000 pi=0000000000000000 pe=0000000000200000
pa=0000000000000000

INTERPRET: type=BPRM_FCAPS msg=audit(04/06/2017 04:40:34.252:237):
fver=2 fp=sys_admin fi=none fe=chown old_pp=none old_pi=none
old_pe=none old_pa=none pp=sys_admin pi=none pe=sys_admin pa=none

A sample of the replaced CAPSET record:

RAW: type=CAPSET msg=audit(1491469502.371:242): pid=833
cap_pi=0000003fffffffff cap_pp=0000003fffffffff cap_pe=0000003fffffffff
cap_pa=0000000000000000

INTERPRET: type=CAPSET msg=audit(04/06/2017 05:05:02.371:242) : pid=833
cap_pi=chown,dac_override,dac_read_search,fowner,fsetid,kill,
setgid,setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,sys_time,
sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pp=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,
setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,
sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pe=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,
setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,
net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,
sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,
sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,
mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read
cap_pa=none

See: https://github.com/linux-audit/audit-kernel/issues/40

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-30  nohz: Reset next_tick cache even when the timer has no regs  (Frederic Weisbecker)  1 file, -0/+2
Handle tick interrupts whose regs are NULL, out of general paranoia. It happens when hrtimer_interrupt() is called from non-interrupt contexts, such as hotplug CPU down events. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-27  do_sigaltstack(): lift copying to/from userland into callers  (Al Viro)  1 file, -61/+46
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27  take compat_sys_old_getrlimit() to native syscall  (Al Viro)  2 files, -29/+24
... and sanitize the ifdefs in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  1 file, -8/+16
Pull timer fixlet from Thomas Gleixner:
 "Silence dmesg spam by making the posix cpu timer printks depend on print_fatal_signals"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timers: Make signal printks conditional
2017-05-27  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  1 file, -6/+18
Pull locking fix from Thomas Gleixner:
 "A fix for a state leak which was introduced in the recent rework of futex/rtmutex interaction"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock()
2017-05-27  Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  1 file, -5/+12
Pull kthread fix from Thomas Gleixner:
 "A single fix which prevents a use after free when kthread fork fails"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Fix use-after-free if kthread fork fails
2017-05-27  Merge tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)  2 files, -2/+2
Pull ftrace fixes from Steven Rostedt:
 "There's been a few memory issues found with ftrace.

  One was simply a memory leak where not all was being freed that should have been in releasing a file pointer on set_graph_function.

  Then Thomas found that the ftrace trampolines were marked for read/write as well as execute. To shrink the possible attack surface, he added calls to set them to ro. Which also uncovered some other issues with freeing module allocated memory that had its permissions changed.

  Kprobes had a similar issue which is fixed and a selftest was added to trigger that issue again"

* tag 'trace-v4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  x86/ftrace: Make sure that ftrace trampolines are not RWX
  x86/mm/ftrace: Do not bug in early boot on irqs_disabled in cpu_flush_range()
  selftests/ftrace: Add a testcase for many kprobe events
  kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
  ftrace: Fix memory leak in ftrace_graph_release()
2017-05-27  alarmtimer: Fix posix-timer constification fallout  (Thomas Gleixner)  1 file, -0/+2
Some freezer-related variables are only used when either CONFIG_POSIX_TIMER or CONFIG_RTC_CLASS are enabled. Hide them when both are off. Fixes: d3ba5a9a345b ("posix-timers: Make posix_clocks immutable") Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Hellwig <hch@lst.de>
2017-05-27  posix-timers: Make posix_clocks immutable  (Christoph Hellwig)  4 files, -170/+146
There are no more modular users providing a posix clock. The register function is now pointless so the posix clock array can be initialized statically at compile time and the array including the various k_clock structs can be marked 'const'. Inspired by changes in the Grsecurity patch set, but done proper. [ tglx: Massaged changelog and fixed the POSIX_TIMER=n case ] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Travis <mike.travis@hpe.com> Cc: Dimitri Sivanich <sivanich@hpe.com> Link: http://lkml.kernel.org/r/20170526090311.3377-3-hch@lst.de
2017-05-27  kprobes/x86: Fix to set RWX bits correctly before releasing trampoline  (Masami Hiramatsu)  1 file, -1/+1
Fix kprobes to set (recover) RWX bits correctly on the trampoline buffer before releasing it. Releasing a read-only page to module_memfree() crashes the kernel. Without this fix, if a kprobes user registers a bunch of kprobes in function bodies (kprobes on function entry usually use ftrace) and then unregisters them, the kernel hits a BUG and crashes. Link: http://lkml.kernel.org/r/149570868652.3518.14120169373590420503.stgit@devbox Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: d0381c81c2f7 ("kprobes/x86: Set kprobes pages read-only") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-27  ftrace: Fix memory leak in ftrace_graph_release()  (Luis Henriques)  1 file, -1/+1
ftrace_hash is being kfree'ed in ftrace_graph_release(), however the ->buckets field is not. This results in a memory leak that is easily captured by kmemleak:

unreferenced object 0xffff880038afe000 (size 8192):
  comm "trace-cmd", pid 238, jiffies 4294916898 (age 9.736s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff815f561e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8113964d>] __kmalloc+0x12d/0x1a0
    [<ffffffff810bf6d1>] alloc_ftrace_hash+0x51/0x80
    [<ffffffff810c0523>] __ftrace_graph_open.isra.39.constprop.46+0xa3/0x100
    [<ffffffff810c05e8>] ftrace_graph_open+0x68/0xa0
    [<ffffffff8114003d>] do_dentry_open.isra.1+0x1bd/0x2d0
    [<ffffffff81140df7>] vfs_open+0x47/0x60
    [<ffffffff81150f95>] path_openat+0x2a5/0x1020
    [<ffffffff81152d6a>] do_filp_open+0x8a/0xf0
    [<ffffffff811411df>] do_sys_open+0x12f/0x200
    [<ffffffff811412ce>] SyS_open+0x1e/0x20
    [<ffffffff815fa6e0>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Link: http://lkml.kernel.org/r/20170525152038.7661-1-lhenriques@suse.com
Cc: stable@vger.kernel.org
Fixes: b9b0c831bed2 ("ftrace: Convert graph filter to use hash tables")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
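The one-line fix swaps the bare kfree() for the helper that also frees the buckets array (a sketch of the relevant hunk in ftrace_graph_release(); surrounding code elided):

  /* free_ftrace_hash() frees hash->buckets and its entries as well;
   * a plain kfree(fgd->new_hash) leaked the buckets array */
  free_ftrace_hash(fgd->new_hash);
  kfree(fgd);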
2017-05-27  livepatch: Make livepatch dependent on !TRIM_UNUSED_KSYMS  (Miroslav Benes)  1 file, -0/+1
If TRIM_UNUSED_KSYMS is enabled, all unneeded exported symbols are made unexported. Two-pass build of the kernel is done to find out which symbols are needed based on a configuration. This effectively complicates things for out-of-tree modules. Livepatch exports functions to (un)register and enable/disable a live patch. The only in-tree module which uses these functions is a sample in samples/livepatch/. If the sample is disabled, the functions are trimmed and out-of-tree live patches cannot be built. Note that live patches are intended to be built out-of-tree. Suggested-by: Michal Marek <mmarek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Jessica Yu <jeyu@redhat.com> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-05-26  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)  4 files, -30/+29
Pull networking fixes from David Miller:

 1) Fix state pruning in bpf verifier wrt. alignment, from Daniel Borkmann.
 2) Handle non-linear SKBs properly in SCTP ICMP parsing, from Davide Caratti.
 3) Fix bit field definitions for rss_hash_type of descriptors in mlx5 driver, from Jesper Brouer.
 4) Defer slave->link updates until bonding is ready to do a full commit to the new settings, from Nithin Sujir.
 5) Properly reference count ipv4 FIB metrics to avoid use after free situations, from Eric Dumazet and several others including Cong Wang and Julian Anastasov.
 6) Fix races in llc_ui_bind(), from Lin Zhang.
 7) Fix regression of ESP UDP encapsulation for TCP packets, from Steffen Klassert.
 8) Fix mdio-octeon driver Kconfig deps, from Randy Dunlap.
 9) Fix regression in setting DSCP on ipv6/GRE encapsulation, from Peter Dawson.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  ipv4: add reference counting to metrics
  net: ethernet: ax88796: don't call free_irq without request_irq first
  ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
  sctp: fix ICMP processing if skb is non-linear
  net: llc: add lock_sock in llc_ui_bind to avoid a race condition
  bonding: Don't update slave->link until ready to commit
  test_bpf: Add a couple of tests for BPF_JSGE.
  bpf: add various verifier test cases
  bpf: fix wrong exposure of map_flags into fdinfo for lpm
  bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data
  bpf: properly reset caller saved regs after helper call and ld_abs/ind
  bpf: fix incorrect pruning decision when alignment must be tracked
  arp: fixed -Wuninitialized compiler warning
  tcp: avoid fastopen API to be used on AF_UNSPEC
  net: move somaxconn init from sysctl code
  net: fix potential null pointer dereference
  geneve: fix fill_info when using collect_metadata
  virtio-net: enable TSO/checksum offloads for Q-in-Q vlans
  be2net: Fix offload features for Q-in-Q packets
  vlan: Fix tcp checksum offloads in Q-in-Q vlans
  ...
2017-05-26  genirq: Make early_irq_init() print out more informative  (Vincent Legoll)  1 file, -2/+3
The printk in early_irq_init() is cryptic and badly formatted:

  NR_IRQS:33024 nr_irqs:968 16

The last number is the number of preallocated interrupts, so add a prefix to it:

  NR_IRQS: 33024, nr_irqs: 968, preallocated irqs: 16

Cleanup the formatting for better readability as well.

Signed-off-by: Vincent Legoll <vincent.legoll@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1494318849-6733-1-git-send-email-vincent.legoll@gmail.com
2017-05-26  cpuhotplug: Link lock stacks for hotplug callbacks  (Thomas Gleixner)  1 file, -0/+13
The CPU hotplug callbacks are not covered by lockdep versus the cpu hotplug rwsem.

  CPU0                                  CPU1
  cpuhp_setup_state(STATE, startup, teardown);
    cpus_read_lock();
      invoke_callback_on_ap();
        kick_hotplug_thread(ap);
        wait_for_completion();          hotplug_thread_fn()
                                          lock(m);
                                          do_stuff();
                                          unlock(m);

Lockdep does not know about this dependency and will not trigger on the following code sequence:

  lock(m);
  cpus_read_lock();

Add a lockdep map and connect the initiators lock chain with the hotplug thread lock chain, so potential deadlocks can be detected.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170524081549.709375845@linutronix.de
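The linkage can be sketched like this (assuming a static lockdep map named cpuhp_state_lock_map; the helper names here are illustrative placeholders, not the patch's exact code):

  static struct lockdep_map cpuhp_state_lock_map =
          STATIC_LOCKDEP_MAP_INIT("cpuhp_state", &cpuhp_state_lock_map);

  /* initiator side: acquired while the hotplug lock is held */
  lock_map_acquire(&cpuhp_state_lock_map);
  kick_hotplug_thread_and_wait();            /* hypothetical helper */
  lock_map_release(&cpuhp_state_lock_map);

  /* hotplug thread side: acquiring the same map connects both lock
   * chains, so lockdep can now see lock(m) -> cpus_read_lock() cycles */
  lock_map_acquire(&cpuhp_state_lock_map);
  run_hotplug_callbacks();                   /* hypothetical; may take lock(m) */
  lock_map_release(&cpuhp_state_lock_map);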
2017-05-26  cpu/hotplug: Convert hotplug locking to percpu rwsem  (Thomas Gleixner)  1 file, -94/+13
There are no more (known) nested calls to get_online_cpus() and all observed lock ordering problems have been addressed. Replace the magic nested 'rwsem' hackery with a percpu-rwsem. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081549.447014063@linutronix.de
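The core of the conversion reads roughly like this (a sketch matching the cpus_read|write_[un]lock() API introduced earlier in this series):

  DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);

  void cpus_read_lock(void)
  {
          percpu_down_read(&cpu_hotplug_lock);    /* cheap, scalable reader side */
  }

  void cpus_read_unlock(void)
  {
          percpu_up_read(&cpu_hotplug_lock);
  }

  void cpus_write_lock(void)
  {
          percpu_down_write(&cpu_hotplug_lock);   /* hotplug operations */
  }

  void cpus_write_unlock(void)
  {
          percpu_up_write(&cpu_hotplug_lock);
  }

A percpu rwsem is not recursive, which is exactly why the preceding patches had to eliminate every nested get_online_cpus() call first.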
2017-05-26  kprobes: Cure hotplug lock ordering issues  (Thomas Gleixner)  1 file, -27/+32
Converting the cpu hotplug locking to a percpu rwsem unearthed hidden lock ordering problems. There is a wide range of locks involved in this: kprobe_mutex, jump_label_mutex, ftrace_lock, text_mutex, event_mutex, module_mutex, func_hash->regex_lock and a gazillion of lock order permutations with nested get_online_cpus() calls. Some of those permutations are potential deadlocks even with the current nesting hotplug locking scheme, but they can't be discovered by lockdep. The conversion of the hotplug locking to a percpu rwsem requires preventing nested locking, so it's required to take the hotplug rwsem early in the call chain and establish a proper lock order. After quite some analysis and going down the wrong road several times, the following lock order has been chosen:

  kprobe_mutex -> cpus_rwsem -> jump_label_mutex -> text_mutex

For kprobes which hook on an ftrace function trace point, it's required to drop cpus_rwsem before calling into the ftrace code to avoid a deadlock on the func_hash->regex_lock.

[ Steven: Ftrace interaction fixes ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170524081549.104864779@linutronix.de
2017-05-26  jump_label: Reorder hotplug lock and jump_label_lock  (Thomas Gleixner)  1 file, -6/+14
The conversion of the hotplug locking to a percpu rwsem unearthed lock ordering issues all over the place. The jump_label code has two issues:

 1) Nested get_online_cpus() invocations
 2) Ordering problems vs. the cpus rwsem and the jump_label_mutex

To cure these, the following lock order has been established:

  cpus_rwsem -> jump_label_lock -> text_mutex

Even if not all architectures need protection against CPU hotplug, taking cpus_rwsem before jump_label_lock is now mandatory in code paths which actually modify code and therefore need text_mutex protection. Move the get_online_cpus() invocations into the core jump label code and establish the proper lock order where required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Link: http://lkml.kernel.org/r/20170524081549.025830817@linutronix.de
2017-05-26  perf/tracing/cpuhotplug: Fix locking order  (Thomas Gleixner)  1 file, -30/+76
perf, tracing, kprobes and jump_labels have a gazillion of ways to create dependency lock chains. Some of those involve nested invocations of get_online_cpus(). The conversion of the hotplug locking to a percpu rwsem requires to avoid such nested calls. sys_perf_event_open() protects most of the syscall logic against cpu hotplug. This causes nested calls and lock inversions versus ftrace and kprobes in various interesting ways. It's impossible to move the hotplug locking to the outer end of all call chains in the involved facilities, so the hotplug protection in sys_perf_event_open() needs to be solved differently. Introduce 'pmus_mutex' which protects a perf private online cpumask. This mutex is taken when the mask is updated in the cpu hotplug callbacks and can be taken in sys_perf_event_open() to protect the swhash setup/teardown code and when the final judgement about a valid event has to be made. [ tglx: Produced changelog and fixed the swhash interaction ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: http://lkml.kernel.org/r/20170524081548.930941109@linutronix.de
2017-05-26  cpu/hotplug: Use stop_machine_cpuslocked() in takedown_cpu()  (Sebastian Andrzej Siewior)  1 file, -1/+1
takedown_cpu() is a cpu hotplug function invoking stop_machine(). The cpu hotplug machinery holds the hotplug lock for write. stop_machine() invokes get_online_cpus() as well. This is correct, but prevents the conversion of the hotplug locking to a percpu rwsem. Use stop_machine_cpuslocked() to avoid the nested call. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081548.423292433@linutronix.de
2017-05-26  padata: Avoid nested calls to cpus_read_lock() in pcrypt_init_padata()  (Sebastian Andrzej Siewior)  1 file, -5/+6
pcrypt_init_padata()
  cpus_read_lock()
  padata_alloc_possible()
    padata_alloc()
      cpus_read_lock()

The nested call to cpus_read_lock() works with the current implementation, but prevents the conversion to a percpu rwsem. The other caller of padata_alloc_possible() is pcrypt_init_padata() which calls from a cpus_read_lock() protected region as well. Remove the cpus_read_lock() call in padata_alloc() and document the calling convention.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-crypto@vger.kernel.org
Link: http://lkml.kernel.org/r/20170524081547.571278910@linutronix.de
2017-05-26  padata: Make padata_alloc() static  (Thomas Gleixner)  1 file, -16/+16
No users outside of padata.c. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-crypto@vger.kernel.org Link: http://lkml.kernel.org/r/20170524081547.491457256@linutronix.de
2017-05-26  stop_machine: Provide stop_machine_cpuslocked()  (Sebastian Andrzej Siewior)  1 file, -4/+7
Some call sites of stop_machine() are within a get_online_cpus() protected region. stop_machine() calls get_online_cpus() as well, which is possible in the current implementation but prevents converting the hotplug locking to a percpu rwsem. Provide stop_machine_cpuslocked() to avoid nested calls to get_online_cpus(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.400700852@linutronix.de
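The shape of such a change is the usual one for _cpuslocked variants (a sketch, not the verbatim patch): the locked variant carries the old body, and the original entry point becomes a lock-taking wrapper:

  /* old stop_machine() body moves here; caller must hold the hotplug lock */
  int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
                              const struct cpumask *cpus);

  int stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cpus)
  {
          int ret;

          /* No CPUs can come up or down during this. */
          cpus_read_lock();
          ret = stop_machine_cpuslocked(fn, data, cpus);
          cpus_read_unlock();
          return ret;
  }

Call sites that already hold the hotplug lock switch to the _cpuslocked variant; everyone else keeps using stop_machine() unchanged.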
2017-05-26  cpu/hotplug: Add __cpuhp_state_add_instance_cpuslocked()  (Thomas Gleixner)  1 file, -3/+15
Add cpuslocked() variants for the multi instance registration so this can be called from a cpus_read_lock() protected region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.321782217@linutronix.de
2017-05-26  cpu/hotplug: Provide cpuhp_setup/remove_state[_nocalls]_cpuslocked()  (Sebastian Andrzej Siewior)  1 file, -11/+36
Some call sites of cpuhp_setup/remove_state[_nocalls]() are within a cpus_read locked region. cpuhp_setup/remove_state[_nocalls]() call cpus_read_lock() as well, which is possible in the current implementation but prevents converting the hotplug locking to a percpu rwsem. Provide locked versions of the interfaces to avoid nested calls to cpus_read_lock(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.239600868@linutronix.de
2017-05-26  cpu/hotplug: Provide cpus_read|write_[un]lock()  (Thomas Gleixner)  1 file, -18/+18
The counting 'rwsem' hackery of get|put_online_cpus() is going to be replaced by percpu rwsem. Rename the functions to make it clear that it's locking and not some refcount style interface. These new functions will be used for the preparatory patches which make the code ready for the percpu rwsem conversion. Rename all instances in the cpu hotplug code while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170524081547.080397752@linutronix.de
2017-05-25  kernel/locking: Fix compile error with qrwlock.c  (Babu Moger)  1 file, -0/+1
Saw these compile errors on SPARC when queued rwlock feature is enabled:

  CC kernel/locking/qrwlock.o
  kernel/locking/qrwlock.c: In function ‘queued_read_lock_slowpath’:
  kernel/locking/qrwlock.c:89: error: implicit declaration of function ‘arch_spin_lock’
  kernel/locking/qrwlock.c:102: error: implicit declaration of function ‘arch_spin_unlock’
  make[4]: *** [kernel/locking/qrwlock.o] Error 1

Include spinlock.h in qrwlock.c to fix it.

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25  bpf: fix wrong exposure of map_flags into fdinfo for lpm  (Daniel Borkmann)  3 files, -0/+3
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via attr->map_flags, since it does not support preallocation yet. We check the flag, but we never copy the flag into trie->map.map_flags, which is later on exposed into fdinfo and used by loaders such as iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test whether a pinned map has the same spec as the one from the BPF obj file and if not, bails out, which is currently the case for lpm since it exposes always 0 as flags. Also copy over flags in array_map_alloc() and stack_map_alloc(). They always have to be 0 right now, but we should make sure to not miss to copy them over at a later point in time when we add actual flags for them to use. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reported-by: Jarno Rajahalme <jarno@covalent.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
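The fix itself is a one-liner per map type; a sketch for the lpm trie case (allocation and flag checks elided):

  static struct bpf_map *trie_alloc(union bpf_attr *attr)
  {
          struct lpm_trie *trie;

          /* ... BPF_F_NO_PREALLOC check and trie allocation as before ... */
          trie->map.map_flags = attr->map_flags;  /* previously never copied,
                                                   * so fdinfo always showed 0 */
          /* ... remaining map spec setup ... */
          return &trie->map;
  }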
2017-05-25  bpf: properly reset caller saved regs after helper call and ld_abs/ind  (Daniel Borkmann)  1 file, -21/+16
Currently, after performing helper calls, we clear all caller saved registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto specification. The way we reset these regs can affect pruning decisions in later paths, since we only reset register's imm to 0 and type to NOT_INIT. However, we leave out clearing of other variables such as id, min_value, max_value, etc, which can later on lead to pruning mismatches due to stale data. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
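Conceptually, the reset becomes a full wipe of the register state rather than a partial one (a sketch following the description above; the exact helper and fields in the patch may differ):

  static void mark_reg_not_init(struct bpf_reg_state *regs, u32 regno)
  {
          /* wipe id, min_value, max_value, alignment info, etc. in one
           * go, instead of only resetting imm and type */
          memset(&regs[regno], 0, sizeof(regs[regno]));
          regs[regno].type = NOT_INIT;
  }

  /* after a helper call or ld_abs/ind, for r0..r5: */
  for (i = 0; i < CALLER_SAVED_REGS; i++)
          mark_reg_not_init(regs, caller_saved[i]);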
2017-05-25  bpf: fix incorrect pruning decision when alignment must be tracked  (Daniel Borkmann)  1 file, -9/+10
Currently, when we enforce alignment tracking on direct packet access, the verifier lets the following program pass despite doing a packet write with unaligned access:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (bf) r0 = r2
  4: (07) r0 += 14
  5: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end
   R7=inv,min_value=0,max_value=1 R10=fp
  6: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end
   R7=inv,min_value=0,max_value=1 R10=fp
  7: (63) *(u32 *)(r0 -4) = r0
  8: (b7) r0 = 0
  9: (95) exit

  from 6 to 8: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
  8: (b7) r0 = 0
  9: (95) exit

  from 5 to 10: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2 R10=fp
  10: (07) r0 += 1
  11: (05) goto pc-6
  6: safe  <----- here, wrongly found safe
  processed 15 insns

However, if we enforce a pruning mismatch by adding state into r8 which is then being mismatched in states_equal(), we find that for the otherwise same program, the verifier detects a misaligned packet access when actually walking that path:

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (61) r7 = *(u32 *)(r1 +8)
  3: (b7) r8 = 1
  4: (bf) r0 = r2
  5: (07) r0 += 14
  6: (25) if r7 > 0x1 goto pc+4
   R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0) R3=pkt_end
   R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14) R3=pkt_end
   R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  9: (b7) r0 = 0
  10: (95) exit

  from 7 to 9: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=0,max_value=1
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  9: (b7) r0 = 0
  10: (95) exit

  from 6 to 11: R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
   R3=pkt_end R7=inv,min_value=2
   R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
  11: (07) r0 += 1
  12: (b7) r8 = 0
  13: (05) goto pc-7  <----- mismatch due to r8
  7: (2d) if r0 > r3 goto pc+1
   R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15) R3=pkt_end
   R7=inv,min_value=2
   R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
  8: (63) *(u32 *)(r0 -4) = r0
  misaligned packet access off 2+15+-4 size 4

The reason why we fail to see it in states_equal() is that the third test in compare_ptrs_to_packet() ...

  if (old->off <= cur->off &&
      old->off >= old->range && cur->off >= cur->range)
          return true;

... will let the above pass. The situation we run into is that old->off <= cur->off (14 <= 15), meaning that prior walked paths went with a smaller offset, which was later used in the packet access after a successful packet range check and found to be safe already.

For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14 as in the above program results in R0=pkt(id=0,off=14,r=0) before the packet range test. Now, testing this against R3=pkt_end with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14) for the case when we're within bounds. A write into the packet at offset *(u32 *)(r0 -4), that is, 2 + 14 - 4, is valid and aligned (2 is for NET_IP_ALIGN). After processing this with all fall-through paths, we later on check paths from branches. When the above skb->mark test is true, then we jump near the end of the program, perform r0 += 1, and jump back to the 'if r0 > r3 goto out' test we've visited earlier already. This time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune that part because this time we have a larger safe packet range, and we already found that with off=14 all further insns were safe, so it's safe as well with a larger off.

However, the problem is that the subsequent write into the packet with 2 + 15 - 4 is then unaligned, and not caught by the alignment tracking. Note that min_align, aux_off, and aux_off_align were all 0 in this example.

Since we cannot tell at this time what kind of packet access was performed in the prior walk and what minimal requirements it has (we might do so in the future, but that requires more complexity), fix it to disable this pruning case for strict alignment for now, and let the verifier check such paths instead. With that applied, the test cases pass and reject the program due to misalignment.

Fixes: d1174416747d ("bpf: Track alignment of register values in the verifier.")
Reference: http://patchwork.ozlabs.org/patch/761909/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25  kobject: support passing in variables for synthetic uevents  (Peter Rajnoha)  1 file, -4/+1
This patch makes it possible to pass additional arguments, in addition to the uevent action name, when writing the /sys/.../uevent attribute. These additional arguments are then inserted into the generated synthetic uevent as additional environment variables.

Before, we were not able to pass any additional uevent environment variables for synthetic uevents. This made it hard to identify such uevents properly in userspace to make a proper distinction between genuine uevents originating from the kernel and synthetic uevents triggered from userspace. Also, it was not possible to pass any additional information which would make it possible to optimize and change the way the synthetic uevents are processed back in userspace, based on the originating environment of the triggering action in userspace. With the extra variables, we are able to pass through this extra information and also to synchronize with such synthetic uevents, as they can be clearly identified back in userspace.

The format for writing the uevent attribute is the following:

  ACTION [UUID [KEY=VALUE ...]]

There's no change in how "ACTION" is recognized - it stays the same ("add", "change", "remove"). The "ACTION" is the only argument required to generate a synthetic uevent; the rest of the arguments, which this patch adds support for, are optional.

The "UUID" is considered a transaction identifier, so it's possible to use the same UUID value for one or more synthetic uevents, in which case we logically group these uevents together for any userspace listeners. The "UUID" is expected to be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" format where "x" is a hex digit. The value appears in the uevent as the "SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment variable.

The "KEY=VALUE" pairs can contain alphanumeric characters only. It's possible to define zero or more pairs - each pair is delimited by a space character " ". Each pair appears in synthetic uevents as a "SYNTH_ARG_KEY=VALUE" environment variable. That means the KEY name gains a "SYNTH_ARG_" prefix to avoid possible collisions with existing variables.

To pass the "KEY=VALUE" pairs, it's also required to pass in the "UUID" part for the synthetic uevent first. If "UUID" is not passed in, the generated synthetic uevent gains the "SYNTH_UUID=0" environment variable automatically, so it's possible to identify this situation in userspace when reading the generated uevent and we can still make a difference between genuine and synthetic uevents.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
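A minimal userspace usage sketch of the interface described above (the device path, UUID, and keys below are made-up examples):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* hypothetical device; any /sys/.../uevent attribute works */
          const char *attr = "/sys/class/block/sda/uevent";
          /* action, transaction UUID, then KEY=VALUE pairs */
          const char *msg =
                  "change 11111111-2222-3333-4444-555555555555 SUBSYS=test SEQ=42";
          int fd = open(attr, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* a listener then sees SYNTH_UUID=1111...5555,
           * SYNTH_ARG_SUBSYS=test and SYNTH_ARG_SEQ=42 in the uevent */
          if (write(fd, msg, strlen(msg)) < 0)
                  perror("write");
          close(fd);
          return 0;
  }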
2017-05-24  cpuset: consider dying css as offline  (Tejun Heo)  1 file, -2/+2
In most cases, a cgroup controller doesn't care about the lifetimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user-visible behavior oddities where operations which should succeed after cgroup removal fail for some time period: the effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying(), which tests whether offlining is pending, and updates is_cpuset_online() so that the function returns false also while offlining is pending. This gets rid of the userland-visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
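The test reads roughly like this (a sketch following the description; a css whose base reference has already been killed is dying even though ->css_offline() has not run yet):

  static inline bool css_is_dying(struct cgroup_subsys_state *css)
  {
          /* either already offlined, or offlining is pending */
          return !(css->flags & CSS_ONLINE) ||
                 percpu_ref_is_dying(&css->refcnt);
  }

  /* cpuset then treats a dying css as offline: */
  static inline bool is_cpuset_online(struct cpuset *cs)
  {
          return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
  }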
2017-05-24  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace  (Linus Torvalds)  1 file, -7/+13
Pull ptrace fix from Eric Biederman:
 "This fixes a brown paper bag bug.

  When I fixed the ptrace interaction with user namespaces I added a new field ptracer_cred in struct_task and I failed to properly initialize it on fork. This dangling pointer wound up breaking running setuid applications run from the enlightenment window manager.

  As this is the worst sort of bug, a regression breaking user space for no good reason, let's get this fixed"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Properly initialize ptracer_cred on fork
2017-05-24  sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable()  (Peter Zijlstra)  1 file, -1/+8
The more strict early boot preemption warnings found that __set_sched_clock_stable() was incorrectly assuming we'd still be running on a single CPU:

  BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
  caller is debug_smp_processor_id+0x1c/0x1e
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1
  Call Trace:
   dump_stack+0x110/0x192
   check_preemption_disabled+0x10c/0x128
   ? set_debug_rodata+0x25/0x25
   debug_smp_processor_id+0x1c/0x1e
   sched_clock_init_late+0x27/0x87
  [...]

Fix it by disabling IRQs.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: lkp@01.org
Cc: tipbuild@zytor.com
Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-24  posix-timers: Make signal printks conditional  (Thomas Gleixner)  1 file, -8/+16
A recent commit added extra printks for CPU/RT limits. This can result in excessive spam in dmesg. Make the printks conditional on print_fatal_signals. Fixes: e7ea7c9806a2 ("rlimits: Print more information when CPU/RT limits are exceeded") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arun Raghavan <arun@arunraghavan.net>
2017-05-24  module: Add module name to modinfo  (Kees Cook)  1 file, -7/+22
Accessing the mod structure (e.g. for mod->name) prior to having completed check_modstruct_version() can result in writing garbage to the error logs if the layout of the mod structure loaded from disk doesn't match the running kernel's mod structure layout. This kind of mismatch will become much more likely if a kernel is built with different randomization seed for the struct layout randomization plugin. Instead, add and use a new modinfo string for logging the module name. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-05-24  module: Pass struct load_info into symbol checks  (Kees Cook)  1 file, -12/+10
Since we're already using values from struct load_info, just pass this pointer in directly and use what's needed as we need it. This allows us to access future fields in struct load_info too. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-05-23  audit: unswing cap_* fields in PATH records  (Richard Guy Briggs)  1 file, -16/+4
The cap_* fields swing in and out of PATH records. If no capabilities are set, the cap_* fields are completely missing and when one of the cap_fi or cap_fp values is empty, that field is omitted.

Original:

type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL
type=PATH msg=audit(04/20/2017 12:17:11.222:193) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fe=1 cap_fver=2

Normalize the PATH record by always printing all 4 cap_* fields.

Fixed:

type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=1 name=/lib64/ld-linux-x86-64.so.2 inode=787694 dev=08:03 mode=file,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:ld_so_t:s0 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=PATH msg=audit(04/20/2017 13:01:31.679:201) : item=0 name=/home/sleep inode=1319469 dev=08:03 mode=file,suid,755 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:bin_t:s0 nametype=NORMAL cap_fp=sys_admin cap_fi=none cap_fe=1 cap_fver=2

See: https://github.com/linux-audit/audit-kernel/issues/42

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-23  ptrace: Properly initialize ptracer_cred on fork  (Eric W. Biederman)  1 file, -7/+13
When I introduced ptracer_cred I failed to consider the weirdness of fork, where the task_struct copies the old value by default. This winds up leaving ptracer_cred set even when a process forks and the child process does not wind up being ptraced. Because ptracer_cred is not set on non-ptraced processes whose parents were ptraced, this has broken the ability of the enlightenment window manager to start setuid children. Fix this by properly initializing ptracer_cred in ptrace_init_task. This must be done with a little bit of care to preserve the current value of ptracer_cred when ptrace carries through fork. Re-reading the ptracer_cred from the ptracing process at this point is inconsistent with how PT_PTRACE_CAP has been maintained all of these years. Tested-by: Takashi Iwai <tiwai@suse.de> Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
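In ptrace_init_task() the fix amounts to clearing the field on the untraced path and carrying the existing tracer creds through on the traced one (a sketch; surrounding initialization elided, details may differ from the exact patch):

  static inline void ptrace_init_task(struct task_struct *child, bool ptrace)
  {
          /* ... list heads, jobctl, parent initialization ... */
          if (unlikely(ptrace) && current->ptrace) {
                  child->ptrace = current->ptrace;
                  /* preserve the creds captured at ptrace attach time */
                  __ptrace_link(child, current->parent, current->ptracer_cred);
                  /* ... PT_SEIZED / SIGSTOP handling ... */
          } else {
                  /* not traced: don't inherit the parent's stale copy */
                  child->ptracer_cred = NULL;
          }
  }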
2017-05-23  sched/core: Enable might_sleep() and smp_processor_id() checks early  (Thomas Gleixner)  1 file, -1/+3
might_sleep() and smp_processor_id() checks are enabled after the boot process is done. That hides bugs in the SMP bringup and driver initialization code. Enable it right when the scheduler starts working, i.e. when init task and kthreadd have been created and right before the idle task enables preemption. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23  printk: Adjust system_state checks  (Thomas Gleixner)  1 file, -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in boot_delay_msec() to handle the extra states. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184736.027534895@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23  extable: Adjust system_state checks  (Thomas Gleixner)  1 file, -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in core_kernel_text() to handle the extra states, i.e. to cover init text up to the point where the system switches to state RUNNING. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184735.949992741@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23  async: Adjust system_state checks  (Thomas Gleixner)  1 file, -4/+4
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in async_run_entry_fn() and async_synchronize_cookie_domain() to handle the extra states. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184735.865155020@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23  sched/numa: Use down_read_trylock() for the mmap_sem  (Vlastimil Babka)  1 file, -1/+2
A customer has reported a soft-lockup when running an intensive memory stress test, where the trace on multiple CPUs looks like this:

  RIP: 0010:[<ffffffff810c53fe>] [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
  ...
  Call Trace:
   [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
   [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
   [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
   [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
   [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock(). The task_numa_work() function makes sure that only one thread is let to perform the work in a single scan period (via cmpxchg), but if there's a thread with mmap_sem locked for writing for several periods, multiple threads in task_numa_work() can build up a convoy waiting for mmap_sem for read and then all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the build-up. For a workload experiencing mmap_sem contention, it's probably better to postpone the NUMA balancing work anyway. This seems to have fixed the soft lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
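The change itself is tiny; a sketch of the entry of task_numa_work():

  /* back off instead of queueing behind a writer and forming a convoy;
   * the NUMA scan will simply be retried in a later period */
  if (!down_read_trylock(&mm->mmap_sem))
          return;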
2017-05-23  sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()  (Dave Kleikamp)  1 file, -0/+11
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially takes each CPU's rq->lock. On a large, busy system, the cumulative time it takes to acquire each lock can be excessive, even triggering a watchdog timeout. If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does nothing while holding the lock, so don't bother taking it at all. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
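A sketch of the skip logic inside the per-CPU loop of do_sched_rt_period_timer() (close to the patch as described; exact placement may differ):

  int skip;

  /* peek at the cheap per-rt_rq state before paying for rq->lock */
  raw_spin_lock(&rt_rq->rt_runtime_lock);
  skip = !rt_rq->rt_time && !rt_rq->rt_nr_running;
  raw_spin_unlock(&rt_rq->rt_runtime_lock);
  if (skip)
          continue;       /* nothing to replenish on this CPU */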
2017-05-23  sched/core: Allow __sched_setscheduler() in interrupts when PI is not used  (Steven Rostedt (VMware))  1 file, -2/+2
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI is not a common occurrence, lockdep will likely never trigger if sched_setscheduler was called from interrupt context. A BUG_ON() was added to trigger if __sched_setscheduler() was ever called from interrupt context because there was a possibility to take the wait_lock. Today the wait_lock is irq safe, but the path to taking it in sched_setscheduler() is the same as the path to taking it from normal context. The wait_lock is taken with raw_spin_lock_irq() and released with raw_spin_unlock_irq(), which will indiscriminately enable interrupts, which would be bad in interrupt context. The problem is that normalize_rt_tasks(), which is called by triggering the sysrq nice-all-RT-tasks, was changed to call __sched_setscheduler(), and this is done from interrupt context! Now __sched_setscheduler() takes a "pi" parameter that is used to know if the priority inheritance should be called or not. As the BUG_ON() only cares about calling the PI code, it should only bug if called from interrupt context with the "pi" parameter set to true. Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@osdl.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dbc7f069b93a ("sched: Use replace normalize_task() with __sched_setscheduler()") Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
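The resulting check in __sched_setscheduler() (a sketch):

  /* the PI code takes the rt_mutex wait_lock with raw_spin_lock_irq(),
   * which (re)enables interrupts on unlock, so that path must never be
   * entered from interrupt context */
  BUG_ON(pi && in_interrupt());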
2017-05-23  sched/deadline: Remove unnecessary condition in push_dl_task()  (Byungchul Park)  1 file, -1/+1
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
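A sketch of the simplified retry path in push_dl_task():

  task = pick_next_pushable_dl_task(rq);
  if (task == next_task) {
          /* was: task_cpu(next_task) == rq->cpu && task == next_task;
           * pick_next_pushable_dl_task() already guarantees the cpu
           * match via its BUG_ON(rq->cpu != task_cpu(task)) */
          goto out;
  }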