summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2019-05-28srcu: Allocate per-CPU data for DEFINE_SRCU() in modulesPaul E. McKenney2-0/+70
Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires that the size of the reserved region be increased, which is not something we want to be doing all that often. One approach would be to require that loadable modules define an srcu_struct and invoke init_srcu_struct() from their module_init function and cleanup_srcu_struct() from their module_exit function. However, this is more than a bit user unfriendly. This commit therefore creates an ___srcu_struct_ptrs linker section, and pointers to srcu_struct structures created by DEFINE_SRCU() and DEFINE_STATIC_SRCU() within a module are placed into that module's ___srcu_struct_ptrs section. The required init_srcu_struct() and cleanup_srcu_struct() functions are then automatically invoked as needed when that module is loaded and unloaded, thus allowing modules to continue to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need to increase the size of the reserved region. Many of the algorithms and some of the code was cheerfully cherry-picked from other code making use of linker sections, perhaps most notably from tracepoints. All bugs are nevertheless the sole property of the author. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> [ paulmck: Use __section() and use "default" in srcu_module_notify()'s "switch" statement as suggested by Joel Fernandes. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-28rcu: Set a maximum limit for back-to-back callback invocationPaul E. McKenney1-3/+4
Currently, if a CPU has more than 10,000 callbacks pending, it will increase rdp->blimit to LONG_MAX. If you are lucky, LONG_MAX is only about two billion, but this is still a bit too many callbacks to invoke back-to-back while otherwise ignoring the world. This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT, which is set to 10,000, for rdp->blimit. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28rcu: Correctly unlock root node in rcu_check_gp_start_stall()Neeraj Upadhyay1-1/+3
On systems whose rcu_node tree has only one node, the rcu_check_gp_start_stall() function's values of rnp and rnp_root will be identical. In this case, it clearly does not make sense to release both rnp->lock and rnp_root->lock, but that is exactly what this function does in the last early exit. This commit therefore unlocks only rnp->lock when rnp and rnp_root are equal. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28rcu: Dump specified number of blocked tasksNeeraj Upadhyay1-1/+1
The dump_blkd_tasks() function dumps at most 10 blocked tasks, ignoring the value of the ncheck parameter. This commit therefore substitutes the value of ncheck for the hard-coded value of 10. Because all callers currently pass 10 as the number, this patch does not change behavior, but it is clearly an accident waiting to happen. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28rcu: Remove unused rdp local from synchronize_rcu_expedited()Jiang Biao1-2/+0
Because rdp is initialized but never used in synchronize_rcu_expedited(), this commit removes it. Signed-off-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28rcu: Rename rcu_data's ->deferred_qs to ->exp_deferred_qsPaul E. McKenney3-12/+12
The rcu_data structure's ->deferred_qs field is used to indicate that the current CPU is blocking an expedited grace period (perhaps a future one). Given that it is used only for expedited grace periods, its current name is misleading, so this commit renames it to ->exp_deferred_qs. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28rcu: Add checks for dynticks counters in rcu_is_cpu_rrupt_from_idle()Joel Fernandes (Google)1-4/+17
It would be good to combine the dynticks and dynticks_nesting counters in order to simplify the code. Unfortunately, there are concerns about usermode upcalls appearing to RCU as half of an interrupt, as Byungchul learned [1]. The "half" in "half interrupt" is due to an unpaired rcu_irq_enter(): Normally, each rcu_irq_enter() has a later call to rcu_irq_exit(). Out of an abundance of caution, Paul added warnings [2] in the RCU code which if not fired by 2021 will be interpreted as meaning that this half-interrupt scenario cannot happen any more, thus permitting simplification of this code. In the meantime, this commit makes the following changes: (1) Combining these two counters requires that rcu_rrupt_from_idle() is invoked only from hard-interrupt contexts as discussed here [3]. This commit therefore adds the required lockdep_assert_in_irq() to check this constraint. (2) Furthermore, rcu_rrupt_from_idle() is not explicit about how it is using the counters which can lead to weird future bugs. This commit therefore adds comments indicating the meaning and use of each counter. (3) Lastly, this commit checks for counter underflows as another check that half interrupts don't occur. (Previously, the function would simply return true upon underflow.) All these checks checks are NOOPs if PROVE_LOCKING (and thus PROVE_RCU) are disabled. [1] https://lore.kernel.org/patchwork/patch/952349/ [2] Commit e11ec65cc8d6 ("rcu: Add warning to detect half-interrupts") [3] https://lore.kernel.org/lkml/20190312150514.GB249405@google.com/ Cc: byungchul.park@lge.com Cc: kernel-team@android.com Cc: rcu@vger.kernel.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESSSteven Rostedt (VMware)1-13/+4
Commit c19fa94a8fed ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for architectures that required 64bit aligned access for all 64bit words. As the ftrace ring buffer stores data on 4 byte alignment, this config option was used to force it to store data on 8 byte alignment to make sure the data being stored and written directly into the ring buffer was 8 byte aligned as it would cause issues trying to write an 8 byte word on a 4 not 8 byte aligned memory location. But with the removal of the metag architecture, which was the only architecture to use this, there is no architecture supported by Linux that requires 8 byte aligne access for all 8 byte words (4 byte alignment is good enough). Removing this config can simplify the code a bit. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28bpf: check signal validity in nmi for bpf_send_signal() helperYonghong Song1-0/+6
Commit 8b401f9ed244 ("bpf: implement bpf_send_signal() helper") introduced bpf_send_signal() helper. If the context is nmi, the sending signal work needs to be deferred to irq_work. If the signal is invalid, the error will appear in irq_work and it won't be propagated to user. This patch did an early check in the helper itself to notify user invalid signal, as suggested by Daniel. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-27signal: Remove task parameter from force_sig_mceerrEric W. Biederman1-2/+2
All of the callers pass current into force_sig_mceer so remove the task parameter to make this obvious. This also makes it clear that force_sig_mceerr passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27signal: Remove task parameter from force_sigEric W. Biederman3-6/+6
All of the remaining callers pass current into force_sig so remove the task parameter to make this obvious and to make misuse more difficult in the future. This also makes it clear force_sig passes current into force_sig_info. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27signal: Remove task parameter from force_sigsegvEric W. Biederman2-3/+5
The function force_sigsegv is always called on the current task so passing in current is redundant and not passing in current makes this fact obvious. This also makes it clear force_sigsegv always calls force_sig on the current task. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sigEric W. Biederman1-1/+1
The locking in force_sig_info is not prepared to deal with a task that exits or execs (as sighand may change). The is not a locking problem in force_sig as force_sig is only built to handle synchronous exceptions. Further the function force_sig_info changes the signal state if the signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the delivery of the signal. The signal SIGKILL can not be ignored and can not be blocked and SIGNAL_UNKILLABLE won't prevent it from being delivered. So using force_sig rather than send_sig for SIGKILL is confusing and pointless. Because it won't impact the sending of the signal and and because using force_sig is wrong, replace force_sig with send_sig. Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Serge Hallyn <serge@hallyn.com> Cc: Oleg Nesterov <oleg@redhat.com> Fixes: cf3f89214ef6 ("pidns: add reboot_pid_ns() to handle the reboot syscall") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27ACPI: PM: Call pm_set_suspend_via_firmware() during hibernationRafael J. Wysocki1-2/+2
On systems with ACPI platform firmware the last stage of hibernation is analogous to system suspend to S3 (suspend-to-RAM), so it should be handled analogously. In particular, pm_suspend_via_firmware() should return 'true' in that stage to let the callers of it know that control will be passed to the platform firmware going forward, so pm_set_suspend_via_firmware() needs to be called then in analogy with acpi_suspend_begin(). However, the platform hibernation ->begin() callback is invoked during the "freeze" transition (before creating a snapshot image of system memory) as well as during the "hibernate" transition which is the last stage of it and pm_set_suspend_via_firmware() should be invoked by that callback in the latter stage only. In order to implement that redefine the hibernation ->begin() callback to take a pm_message_t argument to indicate which stage of hibernation is taking place and rework acpi_hibernation_begin() and acpi_hibernation_begin_old() to take it into account as needed. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-26Merge tag 'trace-v5.2-rc1-2' of ↵Linus Torvalds3-10/+20
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing warning fix from Steven Rostedt: "Make the GCC 9 warning for sub struct memset go away. GCC 9 now warns about calling memset() on partial structures when it goes across multiple fields. This adds a helper for the place in tracing that does this type of clearing of a structure" * tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Silence GCC 9 array bounds warning
2019-05-26ftrace: Enable trampoline when rec count returns back to oneCheng Jian1-13/+15
Custom trampolines can only be enabled if there is only a single ops attached to it. If there's only a single callback registered to a function, and the ops has a trampoline registered for it, then we can call the trampoline directly. This is very useful for improving the performance of ftrace and livepatch. If more than one callback is registered to a function, the general trampoline is used, and the custom trampoline is not restored back to the direct call even if all the other callbacks were unregistered and we are back to one callback for the function. To fix this, set FTRACE_FL_TRAMP flag if rec count is decremented to one, and the ops that left has a trampoline. Testing After this patch : insmod livepatch_unshare_files.ko cat /sys/kernel/debug/tracing/enabled_functions unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0 echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter echo function > /sys/kernel/debug/tracing/current_tracer cat /sys/kernel/debug/tracing/enabled_functions unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150 echo nop > /sys/kernel/debug/tracing/current_tracer cat /sys/kernel/debug/tracing/enabled_functions unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0 Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdlineSteven Rostedt (VMware)1-0/+8
When having kprobe trace event start up tests enabled and adding a kprobe_event on the kernel command line, it produced the following: trace_kprobe: Testing kprobe tracing: WARNING: CPU: 5 PID: 1 at kernel/trace/trace_kprobe.c:1724 kprobe_trace_self_tests_init+0x32d/0x36b Modules linked in: CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc1-test+ #249 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:kprobe_trace_self_tests_init+0x32d/0x36b Code: b7 e8 4f 8d a2 fe 85 c0 74 10 0f 0b 48 c7 c7 c8 1b 0d b7 ff c3 e8 19 af 99 fe 48 c7 c7 40 93 27 b7 e8 7f 1a a5 fe 85 c0 74 10 <0f> 0b 48 c7 c7 f8 1b 0d b7 ff c3 e8 f9 ae 9 a0 fe 85 RSP: 0018:ffffb36e40653e08 EFLAGS: 00010286 RAX: 00000000fffffff0 RBX: 0000000000000000 RCX: ffffb36e40653d5c RDX: 0000000000000000 RSI: ffffffffb72776e0 RDI: 0000000000000246 RBP: ffff98414fe58ff8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: ffff98415d8aa940 R12: 0000000000000000 R13: ffffffffb737c1b0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff98415ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f959ce741b8 CR3: 000000011a210002 CR4: 00000000001606e0 Call Trace: ? init_kprobe_trace+0x19e/0x19e ? do_early_param+0x8e/0x8e do_one_initcall+0x6f/0x2b4 ? do_early_param+0x8e/0x8e kernel_init_freeable+0x21d/0x2c6 ? rest_init+0x146/0x146 kernel_init+0xa/0x10a ret_from_fork+0x3a/0x50 ---[ end trace 488430c083a4c956 ]--- As with the trace events, if a trace event is set on the kernel command line, the trace events start up tests are suspended. The kprobe start up tests should do the same when a kprobe is enabled on the kernel command line. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing: Make a separate config for trace event self testsSteven Rostedt (VMware)2-2/+12
The trace event self tests enable loop through *all* events, enables each one, one at a time, runs some code to trigger various events (not necessarily the same events), and checks if anything went wrong. The issue is that trace events are usually the least likely start up test to cause a problem, but they take the longest to run (because there are so many events). When one of the other tests trigger a bug, the trace event start up tests causes the bisect to take much longer, because it takes 10s of seconds to get through the trace event tests. By making them a separate config (even though they are enabled by default if start up tests are set), it is possible to turn them off and still run the other tracing start up tests much quicker. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing/kprobe: Add kprobe_event= boot parameterMasami Hiramatsu1-0/+54
Add kprobe_event= boot parameter to define kprobe events at boot time. The definition syntax is similar to tracefs/kprobe_events interface, but use ',' and ';' instead of ' ' and '\n' respectively. e.g. kprobe_event=p,vfs_read,$arg1,$arg2 This puts a probe on vfs_read with argument1 and 2, and enable the new event. Link: http://lkml.kernel.org/r/155851395498.15728.830529496248543583.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26kprobes: Initialize kprobes at postcore_initcallMasami Hiramatsu1-2/+1
Initialize kprobes at postcore_initcall level instead of module_init since kprobes is not a module, and it depends on only subsystems initialized in core_initcall. This will allow ftrace kprobe event to add new events when it is initializing because ftrace kprobe event is initialized at later initcall level. Link: http://lkml.kernel.org/r/155851394736.15728.13626739508905120098.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing/kprobe: Cast user-space address correctlyMasami Hiramatsu1-1/+3
Cast user-space address correctly to pass to probe_user_read(). Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing: Use correct function name in trace_filter_add_remove_task() commentMatthias Kaehlcke1-1/+1
The comment of trace_filter_add_remove_task() refers to the function as 'trace_pid_filter_add_remove_task', use the correct name. Link: http://lkml.kernel.org/r/20190523192628.134406-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing/probe: Support user-space dereferenceMasami Hiramatsu6-14/+53
Support user-space dereference syntax for probe event arguments to dereference the data-structure or array in user-space. The syntax is just adding 'u' before an offset value. +|-u<OFFSET>(<FETCHARG>) e.g. +u8(%ax), +u0(+0(%si)) For example, if you probe do_sched_setscheduler(pid, policy, param) and record param->sched_priority, you can add new probe as below; p do_sched_setscheduler priority=+u0($arg3) Note that kprobe event provides this and it doesn't change the dereference method automatically because we do not know whether the given address is in userspace or kernel on some archs. So as same as "ustring", this is an option for user, who has to carefully choose the dereference method. Link: http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2 Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing/probe: Add ustring type for user-space stringMasami Hiramatsu6-8/+77
Add "ustring" type for fetching user-space string from kprobe event. User can specify ustring type at uprobe event, and it is same as "string" for uprobe. Note that probe-event provides this option but it doesn't choose the correct type automatically since we have not way to decide the address is in user-space or not on some arch (and on some other arch, you can fetch the string by "string" type). So user must carefully check the target code (e.g. if you see __user on the target variable) and use this new type. Link: http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2 Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26ftrace: Make enable and update parameters bool when applicableSteven Rostedt (VMware)1-10/+10
The code modification functions have "enable" and "update" variables that are sometimes "int" but used as "bool". Remove the ambiguity and make them "bool" when they are only used for true or false values. Link: http://lkml.kernel.org/r/e1429923d9eda92a3cf5ee9e33c7eacce539781d.1558115654.git.naveen.n.rao@linux.vnet.ibm.com Reported-by: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26tracing: Silence GCC 9 array bounds warningMiguel Ojeda3-10/+20
Starting with GCC 9, -Warray-bounds detects cases when memset is called starting on a member of a struct but the size to be cleared ends up writing over further members. Such a call happens in the trace code to clear, at once, all members after and including `seq` on struct trace_iterator: In function 'memset', inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3: ./include/linux/string.h:344:9: warning: '__builtin_memset' offset [8505, 8560] from the object at 'iter' is out of the bounds of referenced subobject 'seq' with type 'struct trace_seq' at offset 4368 [-Warray-bounds] 344 | return __builtin_memset(p, c, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ In order to avoid GCC complaining about it, we compute the address ourselves by adding the offsetof distance instead of referring directly to the member. Since there are two places doing this clear (trace.c and trace_kdb.c), take the chance to move the workaround into a single place in the internal header. Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [ Removed unnecessary parenthesis around "iter" ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-26cpuset: move mount -t cpuset logics into cgroup.cAl Viro2-60/+48
... and get rid of the weird dances in ->get_tree() - that logics can be easily handled in ->init_fs_context(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-26no need to protect against put_user_ns(NULL)Al Viro1-2/+1
it's a no-op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-26rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()Paul E. McKenney1-6/+29
The sync_sched_exp_online_cleanup() is invoked at online time to handle the case where the start of an expedited grace period ran concurrently with a CPU being taken offline and then immediately being placed online. It checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. However, it is quite possible that sync_sched_exp_online_cleanup() is running on that CPU, in which case it is considerably less overhead to simply request the quiescent state locally instead of simulating a self-IPI. This commit therefore places the last few lines of rcu_exp_handler() into a new rcu_exp_need_qs() function, which is invoked both by rcu_exp_handler() and by sync_sched_exp_online_cleanup() in the self-IPI case. This also reduces the rcu_exp_handler() function's state space by removing the direct call that this smp_call_function_single() uses to emulate the requested self-IPI. This in turn will allow tighter error checking in rcu_is_cpu_rrupt_from_idle(). Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-26rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()Paul E. McKenney1-0/+5
Although sync_rcu_exp_select_node_cpus() treats the current CPU as being in a quiescent state, it might well migrate to some other CPU before reaching the smp_call_function_single(), which could then result in an unnecessary simulated self-IPI. This commit therefore instead simply refuses to invoke smp_call_function_single() on the current CPU, which causes the later rcu_report_exp_cpu_mult() to report this CPU's quiescent state with less overhead. This also reduces the rcu_exp_handler() function's state space by removing the direct call that this smp_call_function_single() uses to emulate the requested self-IPI. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Use get_cpu() instead of preempt_disable() per Joel Fernandes. ]
2019-05-26rcu: Inline invoke_rcu_callbacks() into its sole remaining callerPaul E. McKenney1-17/+3
This commit saves a few lines of code by inlining invoke_rcu_callbacks() into its sole remaining caller. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-26rcu: Use irq_work to get scheduler's attention in clean contextPaul E. McKenney2-0/+22
When rcu_read_unlock_special() is invoked with interrupts disabled, is either not in an interrupt handler or is not using RCU_SOFTIRQ, is not the first RCU read-side critical section in the chain, and either there is an expedited grace period in flight or this is a NO_HZ_FULL kernel, the end of the grace period can be unduly delayed. The reason for this is that it is not safe to do wakeups in this situation. This commit fixes this problem by using the irq_work subsystem to force a later interrupt handler in a clean environment. Because set_tsk_need_resched(current) and set_preempt_need_resched() are invoked prior to this, the scheduler will force a context switch upon return from this interrupt (though perhaps at the end of any interrupted preempt-disable or BH-disable region of code), which will invoke rcu_note_context_switch() (again in a clean environment), which will in turn give RCU the chance to report the deferred quiescent state. Of course, by then this task might be within another RCU read-side critical section. But that will be detected at that time and reporting will be further deferred to the outermost rcu_read_unlock(). See rcu_preempt_need_deferred_qs() and rcu_preempt_deferred_qs() for more details on the checking. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-26rcu: Allow rcu_read_unlock_special() to raise_softirq() if in_irq()Paul E. McKenney1-1/+1
When running in an interrupt handler, raise_softirq() and raise_softirq_irqoff() have extremely low overhead: They simply set a bit in a per-CPU mask, which is checked upon exit from that interrupt handler. Therefore, if rcu_read_unlock_special() is invoked within an interrupt handler and RCU_SOFTIRQ is in use, this commit make use of raise_softirq_irqoff() even if there is no expedited grace period in flight and even if this is not a nohz_full CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-26rcu: Only do rcu_read_unlock_special() wakeups if expeditedPaul E. McKenney1-2/+10
Currently, rcu_read_unlock_special() will do wakeups whenever it is safe to do so. However, wakeups are expensive, and they are only really needed when the just-ended RCU read-side critical section is blocking an expedited grace period (in which case speed is of the essence) or on a nohz_full CPU (where it might be a good long time before an interrupt arrives). This commit therefore checks for these conditions, and does the expensive wakeups only if doing so would be useful. Note it can be rather expensive to determine whether or not the current task (as opposed to the current CPU) is blocking the current expedited grace period. Doing so requires traversing the ->blkd_tasks list, which can be quite long. This commit therefore cheats: If the current task is on a given ->blkd_tasks list, and some task on that list is blocking the current expedited grace period, the code assumes that the current task is blocking that expedited grace period. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-26rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()Paul E. McKenney1-5/+14
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc kthreads, a full and unconditional wakeup is required to initiate RCU core processing. In contrast, when RCU core processing is carried out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are situations where raise_softirq() does a full wakeup, but these do not occur with normal usage of rcu_read_unlock(). The reason that full wakeups can be problematic is that the scheduler sometimes invokes rcu_read_unlock() with its pi or rq locks held, which can of course result in deadlock in CONFIG_PREEMPT=y kernels when rcu_read_unlock() invokes the scheduler. Scheduler invocations can happen in the following situations: (1) The just-ended reader has been subjected to RCU priority boosting, in which case rcu_read_unlock() must deboost, (2) Interrupts were disabled across the call to rcu_read_unlock(), so the quiescent state must be deferred, requiring a wakeup of the rcuc kthread corresponding to the current CPU. Now, the scheduler may hold one of its locks across rcu_read_unlock() only if preemption has been disabled across the entire RCU read-side critical section, which in the days prior to RCU flavor consolidation meant that rcu_read_unlock() never needed to do wakeups. However, this is no longer the case for any but the first rcu_read_unlock() following a condition (e.g., preempted RCU reader) requiring special rcu_read_unlock() attention. For example, an RCU read-side critical section might be preempted, but preemption might be disabled across the rcu_read_unlock(). The rcu_read_unlock() must defer the quiescent state, and therefore leaves the task queued on its leaf rcu_node structure. If a scheduler interrupt occurs, the scheduler might well invoke rcu_read_unlock() with one of its locks held. However, the preempted task is still queued, so rcu_read_unlock() will attempt to defer the quiescent state once more. When RCU core processing is carried out by RCU_SOFTIRQ, this works just fine: The raise_softirq() function simply sets a bit in a per-CPU mask and the RCU core processing will be undertaken upon return from interrupt. Not so when RCU core processing is carried out by the rcuc kthread: In this case, the required wakeup can result in deadlock. The initial solution to this problem was to use set_tsk_need_resched() and set_preempt_need_resched() to force a future context switch, which allows rcu_preempt_note_context_switch() to report the deferred quiescent state to RCU's core processing. Unfortunately for expedited grace periods, there can be a significant delay between the call for a context switch and the actual context switch. This commit therefore introduces a ->deferred_qs flag to the task_struct structure's rcu_special structure. This flag is initially false, and is set to true by the first call to rcu_read_unlock() requiring special attention, then finally reset back to false when the quiescent state is finally reported. Then rcu_read_unlock() attempts full wakeups only when ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring special attention. Note that a chain of RCU readers linked by some other sort of reader may find that a later rcu_read_unlock() is once again able to do a full wakeup, courtesy of an intervening preemption: rcu_read_lock(); /* preempted */ local_irq_disable(); rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */ rcu_read_lock(); local_irq_enable(); preempt_disable() rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */ rcu_read_lock(); preempt_enable(); /* preempted, >deferred_qs reset. */ local_irq_disable(); rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */ Such linked RCU readers do not yet seem to appear in the Linux kernel, and it is probably best if they don't. However, RCU needs to handle them, and some variations on this theme could make even raise_softirq() unsafe due to the possibility of its doing a full wakeup. This commit therefore also avoids invoking raise_softirq() when the ->deferred_qs set flag is set. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-05-26rcu: Enable elimination of Tree-RCU softirq processingSebastian Andrzej Siewior3-134/+140
Some workloads need to change kthread priority for RCU core processing without affecting other softirq work. This commit therefore introduces the rcutree.use_softirq kernel boot parameter, which moves the RCU core work from softirq to a per-CPU SCHED_OTHER kthread named rcuc. Use of SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing to from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority. Note that rcutree.use_softirq=0 must be specified to move RCU core processing to the rcuc kthreads: rcutree.use_softirq=1 is the default. Reported-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> [ paulmck: Adjust for invoke_rcu_callbacks() only ever being invoked from RCU core processing, in contrast to softirq->rcuc transition in old mainline RCU priority boosting. ] [ paulmck: Avoid wakeups when scheduler might have invoked rcu_read_unlock() while holding rq or pi locks, also possibly fixing a pre-existing latent bug involving raise_softirq()-induced wakeups. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25Merge tag 'trace-v5.2-rc1' of ↵Linus Torvalds2-3/+11
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Tom Zanussi sent me some small fixes and cleanups to the histogram code and I forgot to incorporate them. I also added a small clean up patch that was sent to me a while ago and I just noticed it" * tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: kernel/trace/trace.h: Remove duplicate header of trace_seq.h tracing: Add a check_val() check before updating cond_snapshot() track_val tracing: Check keys for variable references in expressions too tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
2019-05-25bpf: verifier: randomize high 32-bit when BPF_F_TEST_RND_HI32 is setJiong Wang1-11/+57
This patch randomizes high 32-bit of a definition when BPF_F_TEST_RND_HI32 is set. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25bpf: introduce new bpf prog load flags "BPF_F_TEST_RND_HI32"Jiong Wang1-1/+3
x86_64 and AArch64 perhaps are two arches that running bpf testsuite frequently, however the zero extension insertion pass is not enabled for them because of their hardware support. It is critical to guarantee the pass correction as it is supposed to be enabled at default for a couple of other arches, for example PowerPC, SPARC, arm, NFP etc. Therefore, it would be very useful if there is a way to test this pass on for example x86_64. The test methodology employed by this set is "poisoning" useless bits. High 32-bit of a definition is randomized if it is identified as not used by any later insn. Such randomization is only enabled under testing mode which is gated by the new bpf prog load flags "BPF_F_TEST_RND_HI32". Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25bpf: verifier: insert zero extension according to analysis resultJiong Wang2-0/+50
After previous patches, verifier will mark a insn if it really needs zero extension on dst_reg. It is then for back-ends to decide how to use such information to eliminate unnecessary zero extension code-gen during JIT compilation. One approach is verifier insert explicit zero extension for those insns that need zero extension in a generic way, JIT back-ends then do not generate zero extension for sub-register write at default. However, only those back-ends which do not have hardware zero extension want this optimization. Back-ends like x86_64 and AArch64 have hardware zero extension support that the insertion should be disabled. This patch introduces new target hook "bpf_jit_needs_zext" which returns false at default, meaning verifier zero extension insertion is disabled at default. A back-end could override this hook to return true if it doesn't have hardware support and want verifier insert zero extension explicitly. Offload targets do not use this native target hook, instead, they could get the optimization results using bpf_prog_offload_ops.finalize. NOTE: arches could have diversified features, it is possible for one arch to have hardware zero extension support for some sub-register write insns but not for all. For example, PowerPC, SPARC have zero extended loads, but not for alu32. So when verifier zero extension insertion enabled, these JIT back-ends need to peephole insns to remove those zero extension inserted for insn that actually has hardware zero extension support. The peephole could be as simple as looking the next insn, if it is a special zero extension insn then it is safe to eliminate it if the current insn has hardware zero extension support. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25bpf: verifier: mark patched-insn with sub-register zext flagJiong Wang1-4/+33
Patched insns do not go through generic verification, therefore doesn't has zero extension information collected during insn walking. We don't bother analyze them at the moment, for any sub-register def comes from them, just conservatively mark it as needing zero extension. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25bpf: verifier: mark verified-insn with sub-register zext flagJiong Wang1-13/+160
eBPF ISA specification requires high 32-bit cleared when low 32-bit sub-register is written. This applies to destination register of ALU32 etc. JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and AArch64 ISA has the same semantics, so the corresponding JIT back-end doesn't need to do extra work. However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches (PowerPC, SPARC etc) need to do explicit zero extension to meet this requirement, otherwise code like the following will fail. u64_value = (u64) u32_value ... other uses of u64_value This is because compiler could exploit the semantic described above and save those zero extensions for extending u32_value to u64_value, these JIT back-ends are expected to guarantee this through inserting extra zero extensions which however could be a significant increase on the code size. Some benchmarks show there could be ~40% sub-register writes out of total insns, meaning at least ~40% extra code-gen. One observation is these extra zero extensions are not always necessary. Take above code snippet for example, it is possible u32_value will never be casted into a u64, the value of high 32-bit of u32_value then could be ignored and extra zero extension could be eliminated. This patch implements this idea, insns defining sub-registers will be marked when the high 32-bit of the defined sub-register matters. For those unmarked insns, it is safe to eliminate high 32-bit clearnace for them. Algo: - Split read flags into READ32 and READ64. - Record index of insn that does sub-register write. Keep the index inside reg state and update it during verifier insn walking. - A full register read on a sub-register marks its definition insn as needing zero extension on dst register. A new sub-register write overrides the old one. - When propagating read64 during path pruning, also mark any insn defining a sub-register that is read in the pruned path as full-register. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25Merge tag 'spdx-5.2-rc2-2' of ↵Linus Torvalds4-17/+4
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pule more SPDX updates from Greg KH: "Here is another set of reviewed patches that adds SPDX tags to different kernel files, based on a set of rules that are being used to parse the comments to try to determine that the license of the file is "GPL-2.0-or-later". Only the "obvious" versions of these matches are included here, a number of "non-obvious" variants of text have been found but those have been postponed for later review and analysis. These patches have been out for review on the linux-spdx@vger mailing list, and while they were created by automatic tools, they were hand-verified by a bunch of different people, all whom names are on the patches are reviewers" * tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98 ...
2019-05-25bpf: implement bpf_send_signal() helperYonghong Song1-0/+72
This patch tries to solve the following specific use case. Currently, bpf program can already collect stack traces through kernel function get_perf_callchain() when certain events happens (e.g., cache miss counter or cpu clock counter overflows). But such stack traces are not enough for jitted programs, e.g., hhvm (jited php). To get real stack trace, jit engine internal data structures need to be traversed in order to get the real user functions. bpf program itself may not be the best place to traverse the jit engine as the traversing logic could be complex and it is not a stable interface either. Instead, hhvm implements a signal handler, e.g. for SIGALARM, and a set of program locations which it can dump stack traces. When it receives a signal, it will dump the stack in next such program location. Such a mechanism can be implemented in the following way: . a perf ring buffer is created between bpf program and tracing app. . once a particular event happens, bpf program writes to the ring buffer and the tracing app gets notified. . the tracing app sends a signal SIGALARM to the hhvm. But this method could have large delays and causing profiling results skewed. This patch implements bpf_send_signal() helper to send a signal to hhvm in real time, resulting in intended stack traces. Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-25locking/lock_events: Use this_cpu_add() when necessaryWaiman Long1-2/+40
The kernel test robot has reported that the use of __this_cpu_add() causes bug messages like: BUG: using __this_cpu_add() in preemptible [00000000] code: ... Given the imprecise nature of the count and the possibility of resetting the count and doing the measurement again, this is not really a big problem to use the unprotected __this_cpu_*() functions. To make the preemption checking code happy, the this_cpu_*() functions will be used if CONFIG_DEBUG_PREEMPT is defined. The imprecise nature of the locking counts are also documented with the suggestion that we should run the measurement a few times with the counts reset in between to get a better picture of what is going on under the hood. Fixes: a8654596f0371 ("locking/rwsem: Enable lock event counting") Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-24kheaders: Do not regenerate archive if config is not changedJoel Fernandes (Google)1-4/+11
Linus reported an issue that doing an allmodconfig was causing the kheaders archive to be regenerated even though the config is the same. This patch fixes the issue by ignoring the config-related header files for "knowing when to regenerate based on timestamps". Instead, if the CONFIG_X_Y option really changes, then we there are the include/config/X/Y.h which will already tells us "if a config really changed". So we don't really need these files for regeneration detection anyway, and ignoring them fixes Linus's issue. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24kheaders: Move from proc to sysfsJoel Fernandes (Google)3-27/+19
The kheaders archive consisting of the kernel headers used for compiling bpf programs is in /proc. However there is concern that moving it here will make it permanent. Let us move it to /sys/kernel as discussed [1]. [1] https://lore.kernel.org/patchwork/patch/1067310/#1265969 Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 38Thomas Gleixner1-2/+1
Based on 1 normalized pattern(s): this file is released under the gplv2 and any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170857.732920462@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36Thomas Gleixner3-15/+3
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public licence as published by the free software foundation either version 2 of the licence or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 114 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170857.552531963@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24locking/lockdep: Remove the unused print_lock_trace() functionAnders Roxell1-4/+0
gcc warns that function print_lock_trace() is unused if CONFIG_PROVE_LOCKING isn't set: ../kernel/locking/lockdep.c:2820:13: warning: ‘print_lock_trace’ defined but not used [-Wunused-function] Rework so we remove the function if CONFIG_PROVE_LOCKING isn't set. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: will.deacon@arm.com Fixes: c120bce78065 ("lockdep: Simplify stack trace handling") Link: http://lkml.kernel.org/r/20190516191326.27003-1-anders.roxell@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>