path: root/kernel
2021-03-09  rcutorture: Replace rcu_torture_stall string with %s  (Stephen Zhang; 1 file changed, +3/-3)
This commit replaces a hard-coded "rcu_torture_stall" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
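As a rough illustration of the "%s"/__func__ pattern (the message text below is made up rather than copied from rcutorture), letting the compiler supply the function name keeps the output correct even if the function is later renamed:

    /* Before: function name hard-coded into the format string. */
    pr_alert("rcu_torture_stall: stall-test task started\n");

    /* After: the printed name tracks the function automatically. */
    pr_alert("%s: stall-test task started\n", __func__);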
2021-03-09  torture: Replace torture_init_begin string with %s  (Stephen Zhang; 1 file changed, +3/-3)
This commit replaces a hard-coded "torture_init_begin" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu-tasks: Add block comment laying out RCU Tasks Trace design  (Paul E. McKenney; 1 file changed, +36/-0)
This commit adds a block comment that gives a high-level overview of how RCU tasks trace grace periods progress. It also adds a note about how exiting tasks are handled, plus it gives an overview of the memory ordering. Reported-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> [ paulmck: Fix commit log per Mathieu Desnoyers feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu-tasks: Rectify kernel-doc for struct rcu_tasks  (Lukas Bulwahn; 1 file changed, +2/-2)
The command 'find ./kernel/rcu/ | xargs ./scripts/kernel-doc -none' reported an issue with the kernel-doc of struct rcu_tasks. This commit rectifies the kernel-doc, such that no issues remain for ./kernel/rcu/. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: Make rcu_read_unlock_special() expedite strict grace periods  (Paul E. McKenney; 1 file changed, +1/-0)
In kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, every grace period is an expedited grace period. However, rcu_read_unlock_special() does not treat them that way, instead allowing the deferred quiescent state to be reported whenever. This commit therefore adds a check of this Kconfig option that causes rcu_read_unlock_special() to treat all grace periods as expedited for CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
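A minimal sketch of the kind of check described above (the variable name is illustrative, not taken from the patch); IS_ENABLED() collapses to a compile-time constant, so non-strict kernels pay nothing for it:

    /* Treat every grace period as expedited on strict-GP kernels. */
    if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
            expedite_deferred_qs = true;    /* illustrative flag consumed by the deferred-QS path */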
2021-03-09  rcutorture: Fix testing of RCU priority boosting  (Paul E. McKenney; 1 file changed, +22/-14)
Currently, rcutorture refuses to test RCU priority boosting in CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on x86 these days. This commit therefore updates rcutorture's tests of RCU priority boosting to make them safe for CPU hotplug. However, these tests will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not happen in current mainline. This commit therefore also refuses to test RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y. While in the area, this commit adds some debug output at boost-fail time that helps diagnose the cause of the failure, for example, failing to run TIMER_SOFTIRQ at realtime priority. Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: Expedite deboost in case of deferred quiescent state  (Paul E. McKenney; 1 file changed, +14/-12)
Historically, a task that has been subjected to RCU priority boosting is deboosted at rcu_read_unlock() time. However, with the advent of deferred quiescent states, if the outermost rcu_read_unlock() was invoked with either bottom halves, interrupts, or preemption disabled, the deboosting will be delayed for some time. During this time, a low-priority process might be incorrectly running at a high real-time priority level. Fortunately, rcu_read_unlock_special() already provides mechanisms for forcing a minimal deferral of quiescent states, at least for kernels built with CONFIG_IRQ_WORK=y. These mechanisms are currently used when expedited grace periods are pending that might be blocked by the current task. This commit therefore causes those mechanisms to also be used in cases where the current task has been or might soon be subjected to RCU priority boosting. Note that this applies to all kernels built with CONFIG_RCU_BOOST=y, regardless of whether or not they are also built with CONFIG_PREEMPT_RT=y. This approach assumes that kernels build for use with aggressive real-time applications are built with CONFIG_IRQ_WORK=y. It is likely to be far simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting scheme that works correctly in its absence. While in the area, alphabetize the rcu_preempt_deferred_qs_handler() function's local variables. Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Scott Wood <swood@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading  (Frederic Weisbecker; 1 file changed, +5/-4)
The name nocb_gp_update_state() is unenlightening, so this commit changes it to nocb_gp_update_state_deoffloading(). This function now does what its name says, updates state and returns true if the CPU corresponding to the specified rcu_data structure is in the process of being de-offloaded. Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu/nocb: Only (re-)initialize segcblist when needed on CPU up  (Frederic Weisbecker; 1 file changed, +4/-5)
At the start of a CPU-hotplug operation, the incoming CPU's callback list can be in a number of states: 1. Disabled and empty. This is the case when the boot CPU has not invoked call_rcu(), when a non-boot CPU first comes online, and when a non-offloaded CPU comes back online. In this case, it is both necessary and permissible to initialize ->cblist. Because either the CPU is currently running with interrupts disabled (boot CPU) or is not yet running at all (other CPUs), it is not necessary to acquire ->nocb_lock. In this case, initialization is required. 2. Disabled and non-empty. This cannot occur, because early boot call_rcu() invocations enable the callback list before enqueuing their callback. 3. Enabled, whether empty or not. In this case, the callback list has already been initialized. This case occurs when the boot CPU has executed an early boot call_rcu() and also when an offloaded CPU comes back online. In both cases, there is no need to initialize the callback list: In the boot-CPU case, the CPU has not (yet) gone offline, and in the offloaded case, the rcuo kthreads are taking care of business. Because it is not necessary to initialize the callback list, it is also not necessary to acquire ->nocb_lock. Therefore, checking if the segcblist is enabled suffices. This commit therefore initializes the callback list at rcutree_prepare_cpu() time only if that list is disabled. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
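As a rough sketch of the resulting guard (the segcblist helpers are existing RCU helpers, but this is the idea rather than the literal hunk):

    /* Case 1 above: only a disabled ->cblist needs (re-)initialization. */
    if (!rcu_segcblist_is_enabled(&rdp->cblist))
            rcu_segcblist_init(&rdp->cblist);   /* CPU not yet running, so no ->nocb_lock needed */
    /* Cases 2-3: impossible or already initialized, so leave ->cblist alone. */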
2021-03-09  rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep  (Frederic Weisbecker; 1 file changed, +4/-3)
The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to true after invoking the callbacks, and then sets it back to false if it finds more callbacks that are ready to invoke. This is confusing and will become unsafe if this flag is ever read locklessly. This commit therefore writes it only once, based on the state after both callback invocation and checking. Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
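A minimal sketch of the single-write approach (the ready-callbacks helper is an existing segcblist accessor, used here for illustration): compute the final value first, then store it exactly once so a lockless reader can never observe a transient value.

    bool can_sleep;

    /* After invoking callbacks, decide once whether to sleep. */
    can_sleep = !rcu_segcblist_ready_cbs(&rdp->cblist);
    WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep);      /* the flag is written exactly once */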
2021-03-09  rcu/nocb: Forbid NOCB toggling on offline CPUs  (Frederic Weisbecker; 2 files changed, +22/-38)
It makes no sense to de-offload an offline CPU because that CPU will never invoke any remaining callbacks. It also makes little sense to offload an offline CPU because any pending RCU callbacks were migrated when that CPU went offline. Yes, it is in theory possible to use a number of tricks to permit offloading and deoffloading offline CPUs in certain cases, but in practice it is far better to have the simple and deterministic rule "Toggling the offload state of an offline CPU is forbidden". For but one example, consider that an offloaded offline CPU might have millions of callbacks queued. Best to just say "no". This commit therefore forbids toggling of the offloaded state of offline CPUs. Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu/nocb: Comment the reason behind BH disablement on batch processing  (Frederic Weisbecker; 1 file changed, +6/-0)
This commit explains why softirqs need to be disabled while invoking callbacks, even when callback processing has been offloaded. After all, invoking callbacks concurrently is one thing, but concurrently invoking the same callback is quite another. Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
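In code form, the protocol the new comment documents looks roughly like this (illustrative only; rcu_do_batch() names the callback-invocation step here):

    local_bh_disable();
    rcu_do_batch(rdp);      /* each callback runs BH-disabled, so it cannot race with itself */
    local_bh_enable();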
2021-03-09  rcu/nocb: Detect unsafe checks for offloaded rdp  (Frederic Weisbecker; 2 files changed, +87/-24)
Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading the offloaded state of an rdp in a safe and stable way, and to prevent its value from being changed under us. We must either hold the barrier mutex, the cpu-hotplug lock (read or write) or the nocb lock. Local non-preemptible reads are also safe. NOCB kthreads and timers have their own means of synchronization against the offloaded state updaters. Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcutorture: Add crude tests for mem_dump_obj()  (Paul E. McKenney; 1 file changed, +39/-0)
This commit adds a few crude tests for mem_dump_obj() to rcutorture runs. Just to prevent bitrot, you understand! Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcuscale: Add kfree_rcu() single-argument scale test  (Uladzislau Rezki (Sony); 1 file changed, +14/-1)
The single-argument variant of kfree_rcu() is currently not tested by any member of the rcutorture test suite. This commit therefore adds rcuscale code to test it. This testing is controlled by two new boolean module parameters, kfree_rcu_test_single and kfree_rcu_test_double. If one is set and the other not, only the corresponding variant is tested, otherwise both are tested, with the variant to be tested determined randomly on each invocation. Both of these module parameters are initialized to false, so setting either to true will test only that variant. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
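A sketch of the selection rule described above; the helper function is hypothetical, but the module-parameter names follow the commit text:

    static bool kfree_rcu_test_single;
    static bool kfree_rcu_test_double;
    module_param(kfree_rcu_test_single, bool, 0444);
    module_param(kfree_rcu_test_double, bool, 0444);

    /* Returns true to test the single-argument variant on this invocation. */
    static bool kfree_rcu_test_single_variant(void)
    {
            if (kfree_rcu_test_single == kfree_rcu_test_double)
                    return prandom_u32() & 0x1;     /* neither (or both) set: choose randomly */
            return kfree_rcu_test_single;
    }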
2021-03-09  kvfree_rcu: Use same set of GFP flags as does single-argument  (Uladzislau Rezki (Sony); 1 file changed, +1/-1)
Running an rcuscale stress-suite can lead to "Out of memory" of a system. This can happen under high memory pressure with a small amount of physical memory. For example, a KVM test configuration with 64 CPUs and 512 megabytes can result in OOM when running rcuscale with below parameters: ../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \ --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \ rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make <snip> [ 12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0 [ 12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 [ 12.056485] Workqueue: events_highpri fill_page_cache_func [ 12.056485] Call Trace: [ 12.056485] dump_stack+0x57/0x6a [ 12.056485] dump_header+0x4c/0x30a [ 12.056485] ? del_timer_sync+0x20/0x30 [ 12.056485] out_of_memory.cold.47+0xa/0x7e [ 12.056485] __alloc_pages_slowpath.constprop.123+0x82f/0xc00 [ 12.056485] __alloc_pages_nodemask+0x289/0x2c0 [ 12.056485] __get_free_pages+0x8/0x30 [ 12.056485] fill_page_cache_func+0x39/0xb0 [ 12.056485] process_one_work+0x1ed/0x3b0 [ 12.056485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] worker_thread+0x28/0x3c0 [ 12.060485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] kthread+0x138/0x160 [ 12.060485] ? kthread_park+0x80/0x80 [ 12.060485] ret_from_fork+0x22/0x30 [ 12.062156] Mem-Info: [ 12.062350] active_anon:0 inactive_anon:0 isolated_anon:0 [ 12.062350] active_file:0 inactive_file:0 isolated_file:0 [ 12.062350] unevictable:0 dirty:0 writeback:0 [ 12.062350] slab_reclaimable:2797 slab_unreclaimable:80920 [ 12.062350] mapped:1 shmem:2 pagetables:8 bounce:0 [ 12.062350] free:10488 free_pcp:1227 free_cma:0 ... [ 12.101610] Out of memory and no killable processes... [ 12.102042] Kernel panic - not syncing: System is deadlocked on memory [ 12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 <snip> Because kvfree_rcu() has a fallback path, memory allocation failure is not the end of the world. Furthermore, the added overhead of aggressive GFP settings must be balanced against the overhead of the fallback path, which is a cache miss for double-argument kvfree_rcu() and a call to synchronize_rcu() for single-argument kvfree_rcu(). The current choice of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call to synchronize_rcu(), so less-tenacious GFP flags would be helpful. Here is the tradeoff that must be balanced: a) Minimize use of the fallback path, b) Avoid pushing the system into OOM, c) Bound allocation latency to that of synchronize_rcu(), and d) Leave the emergency reserves to use cases lacking fallbacks. This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN. This combination leaves the emergency reserves alone and can initiate reclaim, but will not invoke the OOM killer. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
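The resulting flag combination, with a hedged sketch of how an allocation on this path might use it (the surrounding flow is illustrative, not the actual hunk):

    gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN;
    struct page *page;

    page = alloc_page(gfp);     /* may reclaim, but never OOM-kills or taps the reserves */
    if (!page)
            return;             /* rely on kvfree_rcu()'s fallback path instead */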
2021-03-09  kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY  (Uladzislau Rezki (Sony); 1 file changed, +13/-1)
__GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this can be wasted effort given that there is a fallback code path in case memory allocation fails. __GFP_NORETRY does perform some light-weight reclaim, but it will fail under OOM conditions, allowing the fallback to be taken as an alternative to hard-OOMing the system. There is a four-way tradeoff that must be balanced: 1) Minimize use of the fallback path; 2) Avoid full-up OOM; 3) Do a light-weight allocation request; 4) Avoid dipping into the emergency reserves. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()  (Paul E. McKenney; 1 file changed, +1/-2)
The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately followed by a local_irq_restore(). This commit saves a line of code by merging them into a raw_spin_unlock_irqrestore(). This transformation also reduces scheduling latency because raw_spin_unlock_irqrestore() responds immediately to a reschedule request. In contrast, local_irq_restore() does a scheduling-oblivious enabling of interrupts. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
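In sketch form, the before and after (the krcp lock naming follows the kvfree_rcu() code described above, but treat the snippet as illustrative):

    /* Before: two steps; a pending reschedule is noticed only at the second one. */
    raw_spin_unlock(&krcp->lock);
    local_irq_restore(flags);

    /* After: one step that honors a pending reschedule immediately. */
    raw_spin_unlock_irqrestore(&krcp->lock, flags);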
2021-03-09  kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()  (Paul E. McKenney; 1 file changed, +1/-1)
This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations carried out by the single-argument variant of kvfree_rcu(), thus preventing this can-sleep code path from dipping into the emergency reserves. Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  kvfree_rcu: Directly allocate page for single-argument case  (Uladzislau Rezki (Sony); 1 file changed, +26/-16)
Single-argument kvfree_rcu() must be invoked from sleepable contexts, so we can directly allocate pages. Furthermore, the fallback in case of page-allocation failure is the high-latency synchronize_rcu(), so it makes sense to do these page allocations from the fastpath, and even to permit limited sleeping within the allocator. This commit therefore allocates if needed on the fastpath using GFP_KERNEL|__GFP_RETRY_MAYFAIL. This also has the beneficial effect of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant of kvfree_rcu(), given that the double-argument variant cannot directly invoke the allocator. [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ] Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()  (Zhouyi Zhou; 1 file changed, +0/-1)
In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the second branch of the "if" statement. Oddly enough, "objtool check -f vmlinux.o" fails to complain because it is unable to correctly cover all cases. Instead, objtool visits the third branch first, which marks the following trace_rcu_dyntick() as visited. This commit therefore removes the spurious instrumentation_end(). Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr") Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: Fix CPU-offline trace in rcutree_dying_cpu  (Neeraj Upadhyay; 1 file changed, +1/-1)
The condition in the trace_rcu_grace_period() call in rcutree_dying_cpu() is backwards, so that it uses the string "cpuofl" when the offline CPU is blocking the current grace period and "cpuofl-bgp" otherwise. Given that the "-bgp" stands for "blocking grace period", this is at best misleading. This commit therefore switches these strings in order to correctly trace whether the outgoing CPU blocks the current grace period. Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: Remove superfluous rdp fetch  (Frederic Weisbecker; 1 file changed, +0/-1)
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar<mingo@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-09  rcu: deprecate "all" option to rcu_nocbs=  (Paul Gortmaker; 1 file changed, +2/-4)
With the core bitmap support now accepting "N" as a placeholder for the end of the bitmap, "all" can be represented as "0-N" and has the advantage of not being specific to RCU (or any other subsystem). So deprecate the use of "all" by removing documentation references to it. The support itself needs to remain for now, since we don't know how many people out there are using it currently, but since it is in an __init area anyway, it isn't worth losing sleep over. Cc: Yury Norov <yury.norov@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
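For example, on the kernel command line the deprecated spelling and its generic bitmap-based replacement look like this:

    rcu_nocbs=all    # deprecated: RCU-specific shorthand
    rcu_nocbs=0-N    # equivalent: "N" is the generic bitmap placeholder for the last CPU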
2021-03-08  irqdomain: Remove debugfs_file from struct irq_domain  (Greg Kroah-Hartman; 1 file changed, +4/-5)
There's no need to keep around a dentry pointer to a simple file that debugfs itself can look up when we need to remove it from the system. So simplify the code by deleting the variable and cleaning up the logic around the debugfs file. Cc: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/YCvYV53ZdzQSWY6w@kroah.com
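The resulting pattern is roughly as follows (a sketch; the field and directory names are assumed from context rather than copied from the patch):

    /* Before: struct irq_domain carried a cached dentry just for removal. */
    debugfs_remove(d->debugfs_file);

    /* After: let debugfs look the file up by name at removal time. */
    debugfs_remove(debugfs_lookup(d->name, domain_dir));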
2021-03-08  bpf: Change inode_storage's lookup_elem return value from NULL to -EBADF  (Tal Lossos; 1 file changed, +1/-1)
bpf_fd_inode_storage_lookup_elem() returned NULL when given a bad FD, which caused bpf_map_copy_value() to return -ENOENT. An -EBADF error describes a bad FD better than -ENOENT does. The patch was partially contributed by CyberArk Software, Inc. Fixes: 8ea636848aca ("bpf: Implement bpf_local_storage for inodes") Signed-off-by: Tal Lossos <tallossos@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20210307120948.61414-1-tallossos@gmail.com
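A sketch consistent with the description above (not necessarily the exact hunk): the lookup now reports the bad descriptor explicitly instead of letting it degrade into -ENOENT.

    struct file *f;
    int fd = *(int *)key;

    f = fget_raw(fd);
    if (!f)
            return ERR_PTR(-EBADF);     /* previously: return NULL, surfacing as -ENOENT */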
2021-03-08  bpf: Don't allow vmlinux BTF to be used in map_create and prog_load.  (Alexei Starovoitov; 2 files changed, +9/-0)
Syzbot obtained an FD for the vmlinux BTF and passed it into map_create, which caused a crash in btf_type_id_size() when it tried to access resolved_ids. The vmlinux BTF doesn't have 'resolved_ids' and 'resolved_sizes' initialized, in order to save memory. To avoid such issues, disallow using the vmlinux BTF in the prog_load and map_create commands. Fixes: 5329722057d4 ("bpf: Assign ID to vmlinux BTF and return extra info for BTF in GET_OBJ_INFO") Reported-by: syzbot+8bab8ed346746e7540e8@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210307225248.79031-1-alexei.starovoitov@gmail.com
2021-03-08  printk: console: remove unnecessary safe buffer usage  (John Ogness; 1 file changed, +3/-7)
Upon registering a console, safe buffers are activated when setting up the sequence number to replay the log. However, these are already protected by @console_sem and @syslog_lock. Remove the unnecessary safe buffer usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-16-john.ogness@linutronix.de
2021-03-08  printk: kmsg_dump: remove _nolock() variants  (John Ogness; 2 files changed, +12/-56)
kmsg_dump_rewind() and kmsg_dump_get_line() are lockless, so there is no need for _nolock() variants. Remove these functions and switch all callers of the _nolock() variants. The functions without _nolock() were chosen because they are already exported to kernel modules. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-15-john.ogness@linutronix.de
2021-03-08  printk: remove logbuf_lock  (John Ogness; 3 files changed, +46/-97)
Since the ringbuffer is lockless, there is no need for it to be protected by @logbuf_lock. Remove @logbuf_lock. @console_seq, @exclusive_console_stop_seq, @console_dropped are protected by @console_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-14-john.ogness@linutronix.de
2021-03-08  printk: introduce a kmsg_dump iterator  (John Ogness; 2 files changed, +37/-36)
Rather than storing the iterator information in the registered kmsg_dumper structure, create a separate iterator structure. The kmsg_dump_iter structure can reside on the stack of the caller, thus allowing lockless use of the kmsg_dump functions. Update code that accesses the kernel logs using the kmsg_dumper structure to use the new kmsg_dump_iter structure. For kmsg_dumpers, this also means adding a call to kmsg_dump_rewind() to initialize the iterator. All this is in preparation for removal of @logbuf_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> # pstore Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-13-john.ogness@linutronix.de
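A hedged sketch of the new calling convention (the consumer function is hypothetical): the iterator lives on the caller's stack, so no locking and no per-dumper state are required.

    struct kmsg_dump_iter iter;
    char line[256];
    size_t len;

    kmsg_dump_rewind(&iter);                /* initialize the stack-resident iterator */
    while (kmsg_dump_get_line(&iter, true, line, sizeof(line), &len))
            write_record_to_backend(line, len);     /* hypothetical consumer */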
2021-03-08  printk: kmsg_dumper: remove @active field  (John Ogness; 2 files changed, +2/-10)
All 6 kmsg_dumpers do not benefit from the @active flag: (provide their own synchronization) - arch/powerpc/kernel/nvram_64.c - arch/um/kernel/kmsg_dump.c - drivers/mtd/mtdoops.c - fs/pstore/platform.c (only dump on KMSG_DUMP_PANIC, which does not require synchronization) - arch/powerpc/platforms/powernv/opal-kmsg.c - drivers/hv/vmbus_drv.c The other 2 kmsg_dump users also do not rely on @active: (hard-code @active to always be true) - arch/powerpc/xmon/xmon.c - kernel/debug/kdb/kdb_main.c Therefore, @active can be removed. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-12-john.ogness@linutronix.de
2021-03-08  printk: add syslog_lock  (John Ogness; 1 file changed, +37/-4)
The global variables @syslog_seq, @syslog_partial, @syslog_time and write access to @clear_seq are protected by @logbuf_lock. Once @logbuf_lock is removed, these variables will need their own synchronization method. Introduce @syslog_lock for this purpose. @syslog_lock is a raw_spin_lock for now. This simplifies the transition to removing @logbuf_lock. Once @logbuf_lock and the safe buffers are removed, @syslog_lock can change to spin_lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-11-john.ogness@linutronix.de
2021-03-08  printk: use atomic64_t for devkmsg_user.seq  (John Ogness; 1 file changed, +12/-12)
@user->seq is indirectly protected by @logbuf_lock. Once @logbuf_lock is removed, @user->seq will be no longer safe from an atomicity point of view. In preparation for the removal of @logbuf_lock, change it to atomic64_t to provide this safety. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-10-john.ogness@linutronix.de
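Sketched in isolation with simplified names, these are the atomic64_t accessors that replace the plain 64-bit loads and stores:

    static atomic64_t seq = ATOMIC64_INIT(0);   /* stand-in for devkmsg_user.seq */

    u64 cur = atomic64_read(&seq);      /* safe lockless read, even on 32-bit */
    atomic64_set(&seq, cur + 1);        /* safe lockless update */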
2021-03-08  printk: use seqcount_latch for clear_seq  (John Ogness; 1 file changed, +50/-8)
kmsg_dump_rewind_nolock() locklessly reads @clear_seq. However, this is not done atomically. Since @clear_seq is 64-bit, this cannot be an atomic operation for all platforms. Therefore, use a seqcount_latch to allow readers to always read a consistent value. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-9-john.ogness@linutronix.de
2021-03-08  printk: introduce CONSOLE_LOG_MAX  (John Ogness; 1 file changed, +12/-8)
Instead of using "LOG_LINE_MAX + PREFIX_MAX" for temporary buffer sizes, introduce CONSOLE_LOG_MAX. This represents the maximum size that is allowed to be printed to the console for a single record. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-8-john.ogness@linutronix.de
2021-03-08  printk: consolidate kmsg_dump_get_buffer/syslog_print_all code  (John Ogness; 1 file changed, +50/-37)
The logic for finding records to fit into a buffer is the same for kmsg_dump_get_buffer() and syslog_print_all(). Introduce a helper function find_first_fitting_seq() to handle this logic. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-7-john.ogness@linutronix.de
2021-03-08  printk: refactor kmsg_dump_get_buffer()  (John Ogness; 1 file changed, +33/-29)
kmsg_dump_get_buffer() requires nearly the same logic as syslog_print_all(), but uses different variable names and does not make use of the ringbuffer loop macros. Modify kmsg_dump_get_buffer() so that the implementation is as similar to syslog_print_all() as possible. A follow-up commit will move this common logic into a separate helper function. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-6-john.ogness@linutronix.de
2021-03-08  printk: limit second loop of syslog_print_all  (John Ogness; 1 file changed, +8/-1)
The second loop of syslog_print_all() subtracts lengths that were added in the first loop. With commit b031a684bfd0 ("printk: remove logbuf_lock writer-protection of ringbuffer") it is possible that records are (over)written during syslog_print_all(). This allows the possibility of the second loop subtracting lengths that were never added in the first loop. This situation can result in syslog_print_all() filling the buffer starting from a later record, even though there may have been room to fit the earlier record(s) as well. Fixes: b031a684bfd0 ("printk: remove logbuf_lock writer-protection of ringbuffer") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210303101528.29901-4-john.ogness@linutronix.de
2021-03-08  hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()  (Anna-Maria Behnsen; 1 file changed, +39/-21)
hrtimer_force_reprogram() and hrtimer_interrupt() invoke __hrtimer_get_next_event() to find the earliest expiry time of hrtimer bases. __hrtimer_get_next_event() does not update cpu_base::[softirq_]_expires_next to preserve reprogramming logic. That needs to be done at the callsites. hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when the first expiring timer is a softirq timer and the soft interrupt is not activated. That's wrong because cpu_base::softirq_expires_next is left stale when the first expiring timer of all bases is a timer which expires in hard interrupt context. hrtimer_interrupt() never updates cpu_base::softirq_expires_next, which is wrong too. That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that timer before the stale cpu_base::softirq_expires_next. cpu_base::softirq_expires_next is cached to make the check for raising the soft interrupt fast. In the above case the soft interrupt won't be raised until clock monotonic reaches the stale cpu_base::softirq_expires_next value. That's incorrect, but what's worse is that if the softirq timer becomes the first expiring timer of all clock bases after the hard expiry timer has been handled the reprogramming of the clockevent from hrtimer_interrupt() will result in an interrupt storm. That happens because the reprogramming does not use cpu_base::softirq_expires_next, it uses __hrtimer_get_next_event() which returns the actual expiry time. Once clock MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is raised and the storm subsides. Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard bases separately, update softirq_expires_next and handle the case when a soft expiring timer is the first of all bases by comparing the expiry times and updating the required cpu base fields. Split this functionality into a separate function to be able to use it in hrtimer_interrupt() as well without copy-pasting. Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers") Reported-by: Mikael Beckius <mikael.beckius@windriver.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Beckius <mikael.beckius@windriver.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-maria@linutronix.de
2021-03-06  Merge branch 'locking/core' into x86/mm, to resolve conflict  (Ingo Molnar; 3 files changed, +277/-19)
There's a non-trivial conflict between the parallel TLB flush framework and the IPI flush debugging code - merge them manually. Conflicts: kernel/smp.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-06  smp: Micro-optimize smp_call_function_many_cond()  (Peter Zijlstra; 1 file changed, +1/-1)
Call the generic send_call_function_single_ipi() function, which will avoid the IPI when @last_cpu is idle. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-06  smp: Inline on_each_cpu_cond() and on_each_cpu()  (Nadav Amit; 2 files changed, +1/-93)
Simplify the code and avoid having an additional function on the stack by inlining on_each_cpu_cond() and on_each_cpu(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Nadav Amit <namit@vmware.com> [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210220231712.2475218-10-namit@vmware.com
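Roughly what the inlined wrapper ends up looking like (a sketch based on the description, not a verbatim copy of the header change):

    static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
    {
            /* One call into the generic cond/mask helper instead of a kernel/smp.c function. */
            on_each_cpu_cond_mask(NULL, func, info, wait, cpu_online_mask);
    }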
2021-03-06  smp: Run functions concurrently in smp_call_function_many_cond()  (Nadav Amit; 1 file changed, +88/-68)
Currently, on_each_cpu() and similar functions do not exploit the potential of concurrency: the function is first executed remotely and only then is it executed locally. Functions such as TLB flush can take considerable time, so this provides an opportunity for performance optimization. To do so, modify smp_call_function_many_cond() to allow the callers to provide a function that should be executed (remotely/locally), and run them concurrently. Keep the other smp_call_function_many() semantics as they are today for backward compatibility: in that case the called function is not executed locally. smp_call_function_many_cond() does not use the optimized version for a single remote target that smp_call_function_single() implements. For a synchronous function call, smp_call_function_single() keeps a call_single_data (which is used for synchronization) on the stack. Interestingly, it seems that not using this optimization provides greater performance improvements (greater speedup with a single remote target than with multiple ones). Presumably, holding data structures that are intended for synchronization on the stack can introduce overheads due to TLB misses and false-sharing when the stack is used for other purposes. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-2-namit@vmware.com
2021-03-06  perf/core: Flush PMU internal buffers for per-CPU events  (Kan Liang; 1 file changed, +38/-4)
Sometimes the PMU internal buffers have to be flushed for per-CPU events during a context switch, e.g., large PEBS. Otherwise, the perf tool may report samples in locations that do not belong to the process in which the samples are processed, because PEBS does not tag samples with PID/TID. The current code only flushes the buffers for a per-task event. It doesn't check a per-CPU event. Add a new event state flag, PERF_ATTACH_SCHED_CB, to indicate that the PMU internal buffers have to be flushed for this event during a context switch. Add sched_cb_entry and perf_sched_cb_usages back to track the PMU/cpuctx which is required to be flushed. This patch only needs to invoke sched_task() for per-CPU events. The per-task events have been handled in perf_event_context_sched_in/out already. Fixes: 9c964efa4330 ("perf/x86/intel: Drain the PEBS buffer during context switches") Reported-by: Gabriel Marin <gmx@google.com> Originally-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201130193842.10569-1-kan.liang@linux.intel.com
2021-03-06  lockdep: Add lockdep lock state defines  (Shuah Khan; 1 file changed, +6/-5)
Add defines for the lock state returns from lock_is_held_type() based on Johannes Berg's suggestions, as they make it easier to read and maintain the lock states. These are defines and an enum to avoid changing the lock_is_held_type() and lockdep_is_held() return types. Update lock_is_held_type() and __lock_is_held() to use the new defines. Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/linux-wireless/871rdmu9z9.fsf@codeaurora.org/
2021-03-06  lockdep: Add lockdep_assert_not_held()  (Shuah Khan; 1 file changed, +5/-1)
Some kernel functions must be called without holding a specific lock. Add lockdep_assert_not_held() to be used in these functions to detect incorrect calls while holding a lock. lockdep_assert_not_held() provides the opposite functionality of lockdep_assert_held(), which is used to assert calls that require holding a specific lock. Incorporates suggestions from Peter Zijlstra to avoid misfires when lockdep_off() is employed. The need for lockdep_assert_not_held() came up in a discussion on an ath10k patch. ath10k_drain_tx() and i915_vma_pin_ww() are examples of functions that can use lockdep_assert_not_held(). Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/linux-wireless/871rdmu9z9.fsf@codeaurora.org/
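A hypothetical usage sketch (the device type and mutex field below are made up): assert that a function is never entered with a lock it must not hold.

    static void example_drain_tx(struct example_dev *dev)
    {
            lockdep_assert_not_held(&dev->conf_mutex);  /* fires if the caller holds it */
            /* ... flush work that itself acquires dev->conf_mutex ... */
    }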
2021-03-06  locking/csd_lock: Add more data to CSD lock debugging  (Juergen Gross; 1 file changed, +222/-4)
In order to help identifying problems with IPI handling and remote function execution add some more data to IPI debugging code. There have been multiple reports of CPUs looping long times (many seconds) in smp_call_function_many() waiting for another CPU executing a function like tlb flushing. Most of these reports have been for cases where the kernel was running as a guest on top of KVM or Xen (there are rumours of that happening under VMWare, too, and even on bare metal). Finding the root cause hasn't been successful yet, even after more than 2 years of chasing this bug by different developers. Commit: 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout diagnostics") tried to address this by adding some debug code and by issuing another IPI when a hang was detected. This helped mitigating the problem (the repeated IPI unlocks the hang), but the root cause is still unknown. Current available data suggests that either an IPI wasn't sent when it should have been, or that the IPI didn't result in the target CPU executing the queued function (due to the IPI not reaching the CPU, the IPI handler not being called, or the handler not seeing the queued request). Try to add more diagnostic data by introducing a global atomic counter which is being incremented when doing critical operations (before and after queueing a new request, when sending an IPI, and when dequeueing a request). The counter value is stored in percpu variables which can be printed out when a hang is detected. The data of the last event (consisting of sequence counter, source CPU, target CPU, and event type) is stored in a global variable. When a new event is to be traced, the data of the last event is stored in the event related percpu location and the global data is updated with the new event's data. This allows to track two events in one data location: one by the value of the event data (the event before the current one), and one by the location itself (the current event). A typical printout with a detected hang will look like this: csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410). csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request. csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty) csd: cnt(00008cd): ffff->0006 idle csd: cnt(0003668): 0001->0006 queue csd: cnt(0003669): 0001->0006 ipi csd: cnt(0003e0f): 0007->000a queue csd: cnt(0003e10): 0001->ffff ping csd: cnt(0003e71): 0003->0000 ping csd: cnt(0003e72): ffff->0006 gotipi csd: cnt(0003e73): ffff->0006 handle csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty) csd: cnt(0003e7f): 0004->0006 ping csd: cnt(0003e80): 0001->ffff pinged csd: cnt(0003eb2): 0005->0001 noipi csd: cnt(0003eb3): 0001->0006 queue csd: cnt(0003eb4): 0001->0006 noipi csd: cnt now: 0003f00 The idea is to print only relevant entries. Those are all events which are associated with the hang (so sender side events for the source CPU of the hanging request, and receiver side events for the target CPU), and the related events just before those (for adding data needed to identify a possible race). Printing all available data would be possible, but this would add large amounts of data printed on larger configurations. Signed-off-by: Juergen Gross <jgross@suse.com> [ Minor readability edits. Breaks col80 but is far more readable. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Paul E. 
McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20210301101336.7797-4-jgross@suse.com
2021-03-06  locking/csd_lock: Prepare more CSD lock debugging  (Juergen Gross; 1 file changed, +10/-6)
In order to be able to easily add more CSD lock debugging data to struct call_function_data->csd, move the call_single_data_t element into a sub-structure. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210301101336.7797-3-jgross@suse.com
2021-03-06  locking/csd_lock: Add boot parameter for controlling CSD lock debugging  (Juergen Gross; 1 file changed, +34/-4)
Currently CSD lock debugging can be switched on and off via a kernel config option only. Unfortunately there is at least one problem with CSD lock handling pending for about 2 years now, which has been seen in different environments (mostly when running virtualized under KVM or Xen, at least once on bare metal). Multiple attempts to catch this issue have finally led to introduction of CSD lock debug code, but this code is not in use in most distros as it has some impact on performance. In order to be able to ship kernels with CONFIG_CSD_LOCK_WAIT_DEBUG enabled even for production use, add a boot parameter for switching the debug functionality on. This will reduce any performance impact of the debug coding to a bare minimum when not being used. Signed-off-by: Juergen Gross <jgross@suse.com> [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210301101336.7797-2-jgross@suse.com