summaryrefslogtreecommitdiff
path: root/kernel/rcu/tree_plugin.h
AgeCommit message (Collapse)AuthorFilesLines
2022-10-24rcu: Avoid triggering strict-GP irq-work when RCU is idleZqiang1-1/+2
[ Upstream commit 621189a1fe93cb2b34d62c5cdb9e258bca044813 ] Kernels built with PREEMPT_RCU=y and RCU_STRICT_GRACE_PERIOD=y trigger irq-work from rcu_read_unlock(), and the resulting irq-work handler invokes rcu_preempt_deferred_qs_handle(). The point of this triggering is to force grace periods to end quickly in order to give tools like KASAN a better chance of detecting RCU usage bugs such as leaking RCU-protected pointers out of an RCU read-side critical section. However, this irq-work triggering is unconditional. This works, but there is no point in doing this irq-work unless the current grace period is waiting on the running CPU or task, which is not the common case. After all, in the common case there are many rcu_read_unlock() calls per CPU per grace period. This commit therefore triggers the irq-work only when the current grace period is waiting on the running CPU or task. This change was tested as follows on a four-CPU system: echo rcu_preempt_deferred_qs_handler > /sys/kernel/debug/tracing/set_ftrace_filter echo 1 > /sys/kernel/debug/tracing/function_profile_enabled insmod rcutorture.ko sleep 20 rmmod rcutorture.ko echo 0 > /sys/kernel/debug/tracing/function_profile_enabled echo > /sys/kernel/debug/tracing/set_ftrace_filter This procedure produces results in this per-CPU set of files: /sys/kernel/debug/tracing/trace_stat/function* Sample output from one of these files is as follows: Function Hit Time Avg s^2 -------- --- ---- --- --- rcu_preempt_deferred_qs_handle 838746 182650.3 us 0.217 us 0.004 us The baseline sum of the "Hit" values (the number of calls to this function) was 3,319,015. With this commit, that sum was 1,140,359, for a 2.9x reduction. The worst-case variance across the CPUs was less than 25%, so this large effect size is statistically significant. The raw data is available in the Link: URL. Link: https://lore.kernel.org/all/20220808022626.12825-1-qiang1.zhang@intel.com/ Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-03Merge branches 'docs.2022.04.20a', 'fixes.2022.04.20a', 'nocb.2022.04.11b', ↵Paul E. McKenney1-18/+10
'rcu-tasks.2022.04.11b', 'srcu.2022.05.03a', 'torture.2022.04.11b', 'torture-tasks.2022.04.20a' and 'torturescript.2022.04.20a' into HEAD docs.2022.04.20a: Documentation updates. fixes.2022.04.20a: Miscellaneous fixes. nocb.2022.04.11b: Callback-offloading updates. rcu-tasks.2022.04.11b: RCU-tasks updates. srcu.2022.05.03a: Put SRCU on a memory diet. torture.2022.04.11b: Torture-test updates. torture-tasks.2022.04.20a: Avoid torture testing changing RCU configuration. torturescript.2022.04.20a: Torture-test scripting updates.
2022-04-21rcu: Use IRQ_WORK_INIT_HARD() to avoid rcu_read_unlock() hangsZqiang1-1/+7
When booting kernels built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y and CONFIG_PREEMPT_RT=y, the rcu_read_unlock_special() function's invocation of irq_work_queue_on() the init_irq_work() causes the rcu_preempt_deferred_qs_handler() function to work execute in SCHED_FIFO irq_work kthreads. Because rcu_read_unlock_special() is invoked on each rcu_read_unlock() in such kernels, the amount of work just keeps piling up, resulting in a boot-time hang. This commit therefore avoids this hang by using IRQ_WORK_INIT_HARD() instead of init_irq_work(), but only in kernels built with both CONFIG_PREEMPT_RT=y and CONFIG_RCU_STRICT_GRACE_PERIOD=y. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-21rcu: Check for successful spawn of ->boost_kthread_taskZqiang1-1/+2
For the spawning of the priority-boost kthreads can fail, improbable though this might seem. This commit therefore refrains from attemoting to initiate RCU priority boosting when The ->boost_kthread_task pointer is NULL. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-12rcu: Fix rcu_preempt_deferred_qs_irqrestore() strict QS reportingPaul E. McKenney1-0/+1
Suppose we have a kernel built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y and CONFIG_PREEMPT=y. Suppose further that an RCU reader from which RCU core needs a quiescent state ends in rcu_preempt_deferred_qs_irqrestore(). This function will then invoke rcu_report_qs_rdp() in order to immediately report that quiescent state. Unfortunately, it will not have cleared that reader's CPU's rcu_data structure's ->cpu_no_qs.b.norm field. As a result, rcu_report_qs_rdp() will take an early exit because it will believe that this CPU has not yet encountered a quiescent state, and there will be no reporting of the current quiescent state. This commit therefore causes rcu_preempt_deferred_qs_irqrestore() to clear the ->cpu_no_qs.b.norm field before invoking rcu_report_qs_rdp(). Kudos to Boqun Feng and Neeraj Upadhyay for helping with analysis of this issue! Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-04-12rcu: Initialize boost kthread only for boot node prior SMP initializationFrederic Weisbecker1-16/+0
The rcu_spawn_gp_kthread() function is called as an early initcall, which means that SMP initialization hasn't happened yet and only the boot CPU is online. Therefore, create only the boost kthread for the leaf node of the boot CPU. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-03-23Merge tag 'sched-core-2022-03-22' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Cleanups for SCHED_DEADLINE - Tracing updates/fixes - CPU Accounting fixes - First wave of changes to optimize the overhead of the scheduler build, from the fast-headers tree - including placeholder *_api.h headers for later header split-ups. - Preempt-dynamic using static_branch() for ARM64 - Isolation housekeeping mask rework; preperatory for further changes - NUMA-balancing: deal with CPU-less nodes - NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD) - Updates to RSEQ UAPI in preparation for glibc usage - Lots of RSEQ/selftests, for same - Add Suren as PSI co-maintainer * tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits) sched/headers: ARM needs asm/paravirt_api_clock.h too sched/numa: Fix boot crash on arm64 systems headers/prep: Fix header to build standalone: <linux/psi.h> sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y cgroup: Fix suspicious rcu_dereference_check() usage warning sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity() sched/deadline,rt: Remove unused functions for !CONFIG_SMP sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy() sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file sched/deadline: Remove unused def_dl_bandwidth sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE sched/tracing: Don't re-read p->state when emitting sched_switch event sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race sched/cpuacct: Remove redundant RCU read lock sched/cpuacct: Optimize away RCU read lock sched/cpuacct: Fix charge percpu cpuusage sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies ...
2022-02-24Merge branches 'exp.2022.02.24a', 'fixes.2022.02.14a', ↵Paul E. McKenney1-13/+18
'rcu_barrier.2022.02.08a', 'rcu-tasks.2022.02.08a', 'rt.2022.02.01b', 'torture.2022.02.01b' and 'torturescript.2022.02.08a' into HEAD exp.2022.02.24a: Expedited grace-period updates. fixes.2022.02.14a: Miscellaneous fixes. rcu_barrier.2022.02.08a: Make rcu_barrier() no longer exclude CPU hotplug. rcu-tasks.2022.02.08a: RCU-tasks updates. rt.2022.02.01b: Real-time-related updates. torture.2022.02.01b: Torture-test updates. torturescript.2022.02.08a: Torture-test scripting updates.
2022-02-16sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker1-3/+3
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-14rcu: Replace cpumask_weight with cpumask_empty where appropriateYury Norov1-1/+1
In some places, RCU code calls cpumask_weight() to check if any bit of a given cpumask is set. We can do it more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally. Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14rcu: Add mutex for rcu boost kthread spawning and affinity settingDavid Woodhouse1-2/+8
As we handle parallel CPU bringup, we will need to take care to avoid spawning multiple boost threads, or race conditions when setting their affinity. Spotted by Paul McKenney. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08rcu: Create and use an rcu_rdp_cpu_online()Paul E. McKenney1-4/+2
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently in RCU code in order to determine whether rdp->cpu is online from an RCU perspective. This commit therefore creates an rcu_rdp_cpu_online() function to replace it. [ paulmck: Apply kernel test robot unused-variable feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-02rcu: Add per-CPU rcuc task dumps to RCU CPU stall warningsZqiang1-0/+3
When the rcutree.use_softirq kernel boot parameter is set to zero, all RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads. If these kthreads are being starved, quiescent states will not be reported, which in turn means that the grace period will not end, which can in turn trigger RCU CPU stall warnings. This commit therefore dumps stack traces of stalled CPUs' rcuc kthreads, which can help identify what is preventing those kthreads from running. Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-02rcu: Don't deboost before reporting expedited quiescent statePaul E. McKenney1-4/+4
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx before reporting the expedited quiescent state. Under heavy real-time load, this can result in this function being preempted before the quiescent state is reported, which can in turn prevent the expedited grace period from completing. Tim Murray reports that the resulting expedited grace periods can take hundreds of milliseconds and even more than one second, when they should normally complete in less than a millisecond. This was fine given that there were no particular response-time constraints for synchronize_rcu_expedited(), as it was designed for throughput rather than latency. However, some users now need sub-100-millisecond response-time constratints. This patch therefore follows Neeraj's suggestion (seconded by Tim and by Uladzislau Rezki) of simply reversing the two operations. Reported-by: Tim Murray <timmurray@google.com> Reported-by: Joel Fernandes <joelaf@google.com> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Tim Murray <timmurray@google.com> Cc: Todd Kjos <tkjos@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-02rcu: Remove unused rcu_state.boostNeeraj Upadhyay1-2/+0
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-09Merge branches 'doc.2021.11.30c', 'exp.2021.12.07a', 'fastnohz.2021.11.30c', ↵Paul E. McKenney1-217/+33
'fixes.2021.11.30c', 'nocb.2021.12.09a', 'nolibc.2021.11.30c', 'tasks.2021.12.09a', 'torture.2021.12.07a' and 'torturescript.2021.11.30c' into HEAD doc.2021.11.30c: Documentation updates. exp.2021.12.07a: Expedited-grace-period fixes. fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ. fixes.2021.11.30c: Miscellaneous fixes. nocb.2021.12.09a: No-CB CPU updates. nolibc.2021.11.30c: Tiny in-kernel library updates. tasks.2021.12.09a: RCU-tasks updates, including update-side scalability. torture.2021.12.07a: Torture-test in-kernel module updates. torturescript.2021.11.30c: Torture-test scripting updates.
2021-12-08rcu: Make idle entry report expedited quiescent statesPaul E. McKenney1-1/+12
In non-preemptible kernels, an unfortunately timed expedited grace period can result in the rcu_exp_handler() IPI handler setting the rcu_data structure's cpu_no_qs.b.exp field just as the target CPU enters idle. There are situations in which this field will not be checked until after that CPU exits idle. The resulting grace-period latency does not qualify as "expedited". This commit therefore checks this field upon non-preemptible idle entry in the rcu_preempt_deferred_qs() function. It also qualifies the rcu_core() preempt_count() check with IS_ENABLED(CONFIG_PREEMPT_COUNT) to prevent false-positive quiescent states from count-free kernels. Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-08rcu: Remove rcu_data.exp_deferred_qs and convert to rcu_data.cpu no_qs.b.expFrederic Weisbecker1-6/+6
Having two fields for the same purpose with subtle differences on different RCU flavours is confusing, especially when both fields always exist on both RCU flavours. Fortunately, it is now safe for preemptible RCU to rely on the rcu_data structure's ->cpu_no_qs.b.exp field, just like non-preemptible RCU. This commit therefore removes the ad-hoc ->exp_deferred_qs field. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-08rcu: Move rcu_data.cpu_no_qs.b.exp reset to rcu_export_exp_rdp()Frederic Weisbecker1-4/+2
On non-preemptible RCU, move clearing of the rcu_data structure's ->cpu_no_qs.b.exp filed to the actual expedited quiescent state report function, matching hw preemptible RCU handles the ->exp_deferred_qs field. This prepares for removing ->exp_deferred_qs in favor of ->cpu_no_qs.b.exp for both preemptible and non-preemptible RCU. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-08rcu: Ignore rdp.cpu_no_qs.b.exp on preemptible RCU's rcu_qs()Frederic Weisbecker1-1/+5
Preemptible RCU does not use the rcu_data structure's ->cpu_no_qs.b.exp, instead using a separate ->exp_deferred_qs field to record the need for an expedited quiescent state. In fact ->cpu_no_qs.b.exp should never be set in preemptible RCU because preemptible RCU's expedited grace periods use other mechanisms to record quiescent states. This commit therefore removes the implicit rcu_qs() reference to ->cpu_no_qs.b.exp in favor of a direct reference to ->cpu_no_qs.b.norm. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-01rcu: Avoid running boost kthreads on isolated CPUsZqiang1-1/+2
When the boost kthreads are created on systems with nohz_full CPUs, the cpus_allowed_ptr is set to housekeeping_cpumask(HK_FLAG_KTHREAD). However, when the rcu_boost_kthread_setaffinity() is called, the original affinity will be changed and these kthreads can subsequently run on nohz_full CPUs. This commit makes rcu_boost_kthread_setaffinity() restrict these boost kthreads to housekeeping CPUs. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-01rcu: Improve tree_plugin.h comments and add code cleanupsZhouyi Zhou1-6/+5
This commit cleans up some comments and code in kernel/rcu/tree_plugin.h. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-01rcu: in_irq() cleanupChangbin Du1-1/+1
This commit replaces the obsolete and ambiguous macro in_irq() with its shiny new in_hardirq() equivalent. Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-01rcu: Move rcu_needs_cpu() to tree.cPaul E. McKenney1-16/+0
Now that RCU_FAST_NO_HZ is no more, there is but one implementation of the rcu_needs_cpu() function. This commit therefore moves this function from kernel/rcu/tree_plugin.c to kernel/rcu/tree.c. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-12-01rcu: Remove the RCU_FAST_NO_HZ Kconfig optionPaul E. McKenney1-183/+2
All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional allways-taken early exit. If this is the only use case, then this Kconfig option nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-10-07rcu: Always inline rcu_dynticks_task*_{enter,exit}()Peter Zijlstra1-4/+4
RCU managed to grow a few noinstr violations: vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section Fix them by adding __always_inline to the relevant trivial functions. Also replace the noinstr with __always_inline for the existing rcu_dynticks_task_*() functions since noinstr would force noinline them, even when empty, which seems silly. Fixes: 7d0c9c50c5a1 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-09-16rcu: Avoid unneeded function call in rcu_read_unlock()Waiman Long1-2/+1
Since commit aa40c138cc8f3 ("rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict() is invoked by the inlined rcu_read_unlock() function. However, rcu_read_unlock_strict() is an empty function in production kernels, which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n. There is a mention of rcu_read_unlock_strict() in the BPF verifier, but this is in a deny-list, meaning that BPF does not care whether rcu_read_unlock_strict() is ever called. This commit therefore provides a slight performance improvement by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the pointless call to an empty function. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-31Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-17locking/rtmutex: Split out the inner parts of 'struct rtmutex'Peter Zijlstra1-3/+3
RT builds substitutions for rwsem, mutex, spinlock and rwlock around rtmutexes. Split the inner working out so each lock substitution can use them with the appropriate lockdep annotations. This avoids having an extra unused lockdep map in the wrapped rtmutex. No functional change. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.784739994@linutronix.de
2021-08-10Merge branches 'doc.2021.07.20c', 'fixes.2021.08.06a', 'nocb.2021.07.20c', ↵Paul E. McKenney1-1495/+11
'nolibc.2021.07.20c', 'tasks.2021.07.20c', 'torture.2021.07.27a' and 'torturescript.2021.07.27a' into HEAD doc.2021.07.20c: Documentation updates. fixes.2021.08.06a: Miscellaneous fixes. nocb.2021.07.20c: Callback-offloading (NOCB CPU) updates. nolibc.2021.07.20c: Tiny userspace library updates. tasks.2021.07.20c: Tasks RCU updates. torture.2021.07.27a: In-kernel torture-test updates. torturescript.2021.07.27a: Torture-test scripting updates.
2021-08-06rcu: Print human-readable message for schedule() in RCU readerPaul E. McKenney1-1/+1
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of rcu_note_context_switch() triggers when there is a voluntary context switch in an RCU read-side critical section, but there is quite a gap between the output of that WARN_ON_ONCE() and this RCU-usage error. This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE() that explicitly describes the problem in its message. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Mark accesses to ->rcu_read_lock_nestingPaul E. McKenney1-3/+6
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but in the past, the overhead of marked accesses was excessive. However, that was long ago, and much has changed since then, both in terms of hardware and of compilers. Here is data taken on an eight-core laptop using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built using gcc version 9.3.0, with all data in nanoseconds. Unmarked accesses (status quo), measured by three refscale runs: Minimum reader duration: 3.286 2.851 3.395 Median reader duration: 3.698 3.531 3.4695 Maximum reader duration: 4.481 5.215 5.157 Marked accesses, also measured by three refscale runs: Minimum reader duration: 3.501 3.677 3.580 Median reader duration: 4.053 3.723 3.895 Maximum reader duration: 7.307 4.999 5.511 This focused microbenhmark shows only sub-nanosecond differences which are unlikely to be visible at the system level. This commit therefore marks data-racing accesses to ->rcu_read_lock_nesting. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu: Fix macro name CONFIG_TASKS_RCU_TRACEZhouyi Zhou1-4/+4
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from memory-ordering bugs that could result in false-positive quiescent states and too-short grace periods. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu/nocb: Start moving nocb code to its own plugin fileFrederic Weisbecker1-1487/+0
The kernel/rcu/tree_plugin.h file contains not only the plugins for preemptible RCU, but also many other features including rcu_nocbs callback offloading. This offloading has become large and complex, so it is time to put it in its own file. This commit starts that process. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Rename to tree_nocb.h, add Frederic as author. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-04Merge branch 'core-rcu-2021.07.04' of ↵Linus Torvalds1-126/+113
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Bitmap parsing support for "all" as an alias for all bits - Documentation updates - Miscellaneous fixes, including some that overlap into mm and lockdep - kvfree_rcu() updates - mem_dump_obj() updates, with acks from one of the slab-allocator maintainers - RCU NOCB CPU updates, including limited deoffloading - SRCU updates - Tasks-RCU updates - Torture-test updates * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits) tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states rcu: Add missing __releases() annotation rcu: Remove obsolete rcu_read_unlock() deadlock commentary rcu: Improve comments describing RCU read-side critical sections rcu: Create an unrcu_pointer() to remove __rcu from a pointer srcu: Early test SRCU polling start rcu: Fix various typos in comments rcu/nocb: Unify timers rcu/nocb: Prepare for fine-grained deferred wakeup rcu/nocb: Only cancel nocb timer if not polling rcu/nocb: Delete bypass_timer upon nocb_gp wakeup rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup rcu/nocb: Allow de-offloading rdp leader rcu/nocb: Directly call __wake_nocb_gp() from bypass timer rcu: Don't penalize priority boosting when there is nothing to boost rcu: Point to documentation of ordering guarantees rcu: Make rcu_gp_cleanup() be noinline for tracing rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP ...
2021-06-18sched: Introduce task_is_running()Peter Zijlstra1-1/+1
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-05-18Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', ↵Paul E. McKenney1-120/+110
'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD bitmaprange.2021.05.10c: Allow "all" for bitmap ranges. doc.2021.05.10c: Documentation updates. fixes.2021.05.13a: Miscellaneous fixes. kvfree_rcu.2021.05.10c: kvfree_rcu() updates. mmdumpobj.2021.05.10c: mem_dump_obj() updates. nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading. srcu.2021.05.12a: SRCU updates. tasks.2021.05.18a: Tasks-RCU updates. torture.2021.05.10c: Torture-test updates.
2021-05-12rcu: Fix various typos in commentsIngo Molnar1-1/+1
Fix ~12 single-word typos in RCU code comments. [ paulmck: Apply feedback from Randy Dunlap. ] Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Unify timersFrederic Weisbecker1-53/+39
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar, this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wake ups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early. The timer also unconditionally executes a full barrier so as to order timer_pending() and callback enqueue although the path performing RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also test against the rdp leader instead of the current rdp. This unconditional full barrier shouldn't bring visible overhead since these timers almost never fire. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Prepare for fine-grained deferred wakeupFrederic Weisbecker1-8/+9
Tuning the deferred wakeup level must be done from a safe wakeup point. Currently those sites are: * ->nocb_timer * user/idle/guest entry * CPU down * softirq/rcuc All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE. In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early. To prepare for that, this commit specifies the per-callsite wakeup level/limit. Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Only cancel nocb timer if not pollingFrederic Weisbecker1-7/+7
This commit refrains deleting the ->nocb_timer if rcu_nocb is polling because it should not ever have been queued in the polling case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Delete bypass_timer upon nocb_gp wakeupFrederic Weisbecker1-0/+2
A NOCB-gp wake p can safely delete the ->nocb_bypass_timer because nocb_gp_wait() will recheck again the bypass state and rearm the bypass timer if necessary. This commit therefore deletes this timer. Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Cancel nocb_timer upon nocb_gp wakeupFrederic Weisbecker1-0/+4
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer around because this function will traverse the whole rdp list. Any update performed before the timer was armed will now be visible after the ->nocb_gp_lock acquire. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Allow de-offloading rdp leaderFrederic Weisbecker1-4/+0
The only thing that prevented an rdp leader from being de-offloaded was the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader. If an rdp gets de-offloaded, it will subtlely ignore rcu_nocb_lock() calls and do its job in the timer unsafely. Worse yet: If it gets re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to unlock, leaving it imbalanced. Now that the nocb_bypass_timer doesn't use the nocb_lock anymore, de-offloading the rdp leader is now safe. This commit therefore allows the rdp leader to be de-offloaded. Reported-by: Paul E. McKenney <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Directly call __wake_nocb_gp() from bypass timerFrederic Weisbecker1-2/+3
The bypass timer calls __call_rcu_nocb_wake() instead of directly calling __wake_nocb_gp(). The only difference here is that rdp->qlen_last_fqs_check gets overridden. But resetting the deferred force quiescent state base shouldn't be relevant for that timer. In fact the bypass queue in question can be for any rdp from the group and not necessarily the rdp leader on which the bypass timer is attached. This commit therefore calls __wake_nocb_gp() directly. This way we don't even need to lock the ->nocb_lock. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-11rcu: Make RCU priority boosting work on single-CPU rcu_node structuresPaul E. McKenney1-22/+7
When any CPU comes online, it checks to see if an RCU-boost kthread has already been created for that CPU's leaf rcu_node structure, and if not, it creates one. Unfortunately, it also verifies that this leaf rcu_node structure actually has at least one online CPU, and if not, it declines to create the kthread. Although this behavior makes sense during early boot, especially on systems that claim far more CPUs than they actually have, it makes no sense for the first CPU to come online for a given rcu_node structure. There is no point in checking because we know there is a CPU on its way in. The problem is that timing differences can cause this incoming CPU to not yet be reflected in the various bit masks even at rcutree_online_cpu() time, and there is no chance at rcutree_prepare_cpu() time. Plus it would be better to create the RCU-boost kthread at rcutree_prepare_cpu() to handle the case where the CPU is involved in an RCU priority inversion very shortly after it comes online. This commit therefore moves the checking to rcu_prepare_kthreads(), which is called only at early boot, when the check is appropriate. In addition, it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which no longer does any checking for online CPUs. With this change, RCU priority boosting tests now pass for short rcutorture runs, even with single-CPU leaf rcu_node structures. Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-11rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() outputPaul E. McKenney1-0/+1
This commit adds each rcu_node structure's ->qsmask and "bBEG" output indicating whether: (1) There is a boost kthread, (2) A reader needs to be (or is in the process of being) boosted, (3) A reader is blocking an expedited grace period, and (4) A reader is blocking a normal grace period. This helps diagnose RCU priority boosting failures. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-11rcu/nocb: Use the rcuog CPU's ->nocb_timerFrederic Weisbecker1-63/+77
Currently each CPU has its own ->nocb_timer queued when the nocb_gp wakeup must be deferred. This approach has many drawbacks, compared to a solution based on a single timer per NOCB group: * There are a lot of timers to maintain. * The per-rdp ->nocb_lock must be held to queue and cancel the timer and this lock can already be heavily contended. * One timer firing doesn't cancel the other timers in the same group: - These other timers can thus cause spurious wakeups - Each rdp that queued a timer must lock both ->nocb_lock and then ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode. * We can't cancel all of them if we detect an unflushed bypass in nocb_gp_wait(). In fact currently we only ever cancel the ->nocb_timer of the leader group. * The leader group's nocb_timer is cancelled without locking ->nocb_lock in nocb_gp_wait(). This currently appears to be safe but is an accident waiting to happen. * Since the timer acquires ->nocb_lock, it requires extra care in the NOCB (de-)offloading process, requiring that it be either enabled or disabled and then flushed. This commit instead uses the rcuog kthread's CPU's ->nocb_timer instead. It is protected by nocb_gp_lock, which is _way_ less contended and remains so even after this change. As a matter of fact, the nocb_timer almost never fires and the deferred wakeup is mostly carried out upon idle/user/guest entry. Now the early check performed at this point in do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which is of course racy. However, this raciness is harmless because we only need the guarantee that the timer is queued if we were the last one to queue it. Any other situation (another CPU has queued it and we either see it or not) is fine. This solves all the issues listed above. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-11rcu: Fix typo in comment: kthead -> kthreadRolf Eike Beer1-1/+1
Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-11rcu/tree_plugin: Don't handle the case of 'all' CPU rangeYury Norov1-6/+3
The 'all' semantics is now supported by the bitmap_parselist() so we can drop supporting it as a special case in RCU code. Since 'all' is properly supported in core bitmap code, also drop legacy comment in RCU for it. This patch does not make any functional changes for existing users. Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>