author | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-08-08 03:23:20 +0300
---|---|---
committer | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-10-10 00:23:36 +0300
commit | bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa (patch)
tree | 9707a2bac694c443b95341fd777d3d287cbbe6c5 /Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
parent | 8a5776a5f49812d29fe4b2d0a2d71675c3facf3f (diff)
download | linux-bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa.tar.xz
documentation: RCU grace-period memory ordering guarantees
This commit provides text and diagrams showing how Tree RCU implements
its grace-period memory ordering guarantees.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html')
-rw-r--r-- | Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | 707 |
1 files changed, 707 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
new file mode 100644
index 000000000000..8651b0b4fd79
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -0,0 +1,707 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+        "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
+<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
+
+<p>August 8, 2017</p>
+<p>This article was contributed by Paul E. McKenney</p>
+
+<h3>Introduction</h3>
+
+<p>This document gives a rough visual overview of how Tree RCU's
+grace-period memory ordering guarantee is provided.
+
+<ol>
+<li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
+	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
+<li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
+	Tree RCU Grace Period Memory Ordering Building Blocks</a>
+<li>	<a href="#Tree RCU Grace Period Memory Ordering Components">
+	Tree RCU Grace Period Memory Ordering Components</a>
+<li>	<a href="#Putting It All Together">Putting It All Together</a>
+</ol>
+
+<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
+What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
+
+<p>RCU grace periods provide extremely strong memory-ordering guarantees
+for non-idle non-offline code.
+Any code that happens after the end of a given RCU grace period is guaranteed
+to see the effects of all accesses prior to the beginning of that grace
+period that are within RCU read-side critical sections.
+Similarly, any code that happens before the beginning of a given RCU grace
+period is guaranteed to not see the effects of any accesses following the
+end of that grace period that are within RCU read-side critical sections.
+
+<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
+for which RCU-sched read-side critical sections include any region
+of code for which preemption is disabled.
+Given that each individual machine instruction can be thought of as
+an extremely small region of preemption-disabled code, one can think of
+<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.
+
+<p>RCU updaters use this guarantee by splitting their updates into
+two phases, one of which is executed before the grace period and
+the other of which is executed after the grace period.
+In the most common use case, phase one removes an element from
+a linked RCU-protected data structure, and phase two frees that element.
+For this to work, any readers that have witnessed state prior to the
+phase-one update (in the common case, removal) must not witness state
+following the phase-two update (in the common case, freeing).
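+
+<p>As a concrete illustration, the following sketch shows this
+two-phase pattern removing an element from an RCU-protected linked
+list (phase one) and freeing it only after a grace period has
+elapsed (phase two).
+The <tt>struct foo</tt> type, the <tt>foo_list</tt> list, and the
+<tt>foo_lock</tt> spinlock are hypothetical, invented for this example:
+
+<pre>
+struct foo {
+	struct list_head list;
+	int key;
+};
+
+LIST_HEAD(foo_list);		/* RCU-protected list of foo structures. */
+DEFINE_SPINLOCK(foo_lock);	/* Serializes updates to foo_list. */
+
+void foo_remove(struct foo *fp)
+{
+	spin_lock(&foo_lock);
+	list_del_rcu(&fp->list);	/* Phase one:  unlink the element. */
+	spin_unlock(&foo_lock);
+	synchronize_rcu();		/* Wait for pre-existing readers. */
+	kfree(fp);			/* Phase two:  free the element. */
+}
+</pre>
+
+<p>Any reader that witnessed the element on the list before the
+<tt>list_del_rcu()</tt> is guaranteed to have exited its RCU read-side
+critical section before the <tt>kfree()</tt> executes.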
+
+<p>The RCU implementation provides this guarantee using a network
+of lock-based critical sections, memory barriers, and per-CPU
+processing, as is described in the following sections.
+
+<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
+Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
+
+<p>The workhorse for RCU's grace-period memory ordering is the
+critical section for the <tt>rcu_node</tt> structure's
+<tt>->lock</tt>.
+These critical sections use helper functions for lock acquisition, including
+<tt>raw_spin_lock_rcu_node()</tt>,
+<tt>raw_spin_lock_irq_rcu_node()</tt>, and
+<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
+Their lock-release counterparts are
+<tt>raw_spin_unlock_rcu_node()</tt>,
+<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
+<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
+respectively.
+For completeness, a
+<tt>raw_spin_trylock_rcu_node()</tt>
+is also provided.
+The key point is that the lock-acquisition functions, including
+<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
+<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
+acquisition of the lock.
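+
+<p>A minimal sketch of the acquisition side of this pattern appears
+below.
+The in-kernel helpers are macros that also handle lockdep and
+private-lock annotations, but the ordering-relevant part is simply
+the pairing of lock acquisition with
+<tt>smp_mb__after_unlock_lock()</tt>:
+
+<pre>
+static void sketch_raw_spin_lock_rcu_node(struct rcu_node *rnp)
+{
+	raw_spin_lock(&rnp->lock);	/* Acquire the lock... */
+	smp_mb__after_unlock_lock();	/* ...then upgrade to full ordering. */
+}
+</pre>
+
+<p>On architectures whose unlock-then-lock sequences already imply
+full ordering, <tt>smp_mb__after_unlock_lock()</tt> emits no code at
+all; on others, for example PowerPC, it emits a full memory barrier.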
+
+<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
+happening before one of the above lock-release functions will be seen
+by all CPUs as happening before any access happening after a later
+one of the above lock-acquisition functions.
+Furthermore, any access happening before one of the
+above lock-release functions on any given CPU will be seen by all
+CPUs as happening before any access happening after a later one
+of the above lock-acquisition functions executing on that same CPU,
+even if the lock-release and lock-acquisition functions are operating
+on different <tt>rcu_node</tt> structures.
+Tree RCU uses these two ordering guarantees to form an ordering
+network among all CPUs that were in any way involved in the grace
+period, including any CPUs that came online or went offline during
+the grace period in question.
+
+<p>The following litmus test exhibits the ordering effects of these
+lock-acquisition and lock-release functions:
+
+<pre>
+ 1 int x, y, z;
+ 2
+ 3 void task0(void)
+ 4 {
+ 5   raw_spin_lock_rcu_node(rnp);
+ 6   WRITE_ONCE(x, 1);
+ 7   r1 = READ_ONCE(y);
+ 8   raw_spin_unlock_rcu_node(rnp);
+ 9 }
+10
+11 void task1(void)
+12 {
+13   raw_spin_lock_rcu_node(rnp);
+14   WRITE_ONCE(y, 1);
+15   r2 = READ_ONCE(z);
+16   raw_spin_unlock_rcu_node(rnp);
+17 }
+18
+19 void task2(void)
+20 {
+21   WRITE_ONCE(z, 1);
+22   smp_mb();
+23   r3 = READ_ONCE(x);
+24 }
+25
+26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
+</pre>
+
+<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”,
+after all changes have propagated throughout the system.
+Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
+acquisition functions, this <tt>WARN_ON()</tt> could trigger, for
+example, on PowerPC.
+The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
+<tt>WARN_ON()</tt> from triggering.
+
+<p>This approach must be extended to include idle CPUs, which need
+RCU's grace-period memory ordering guarantee to extend to any
+RCU read-side critical sections preceding and following the current
+idle sojourn.
+This case is handled by calls to the strongly ordered
+<tt>atomic_add_return()</tt> read-modify-write atomic operation that
+is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
+time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
+The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
+<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
+an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
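+
+<p>The following greatly simplified sketch shows the shape of this
+mechanism.
+The names and the even-value-means-idle convention are illustrative
+rather than a copy of the actual implementation, but the key property
+matches the text above:  every transition and every sample is a fully
+ordered <tt>atomic_add_return()</tt>:
+
+<pre>
+static DEFINE_PER_CPU(atomic_t, dynticks_sketch) = ATOMIC_INIT(1);
+
+static void sketch_eqs_enter(void)	/* At idle entry. */
+{
+	atomic_add_return(1, this_cpu_ptr(&dynticks_sketch)); /* -> even */
+}
+
+static void sketch_eqs_exit(void)	/* At idle exit. */
+{
+	atomic_add_return(1, this_cpu_ptr(&dynticks_sketch)); /* -> odd */
+}
+
+static int sketch_snap(int cpu)		/* From the grace-period kthread. */
+{
+	/* An atomic_add_return() of zero samples with full ordering. */
+	return atomic_add_return(0, &per_cpu(dynticks_sketch, cpu));
+}
+
+static bool sketch_in_eqs_since(int cpu, int snap)
+{
+	/* An idle snapshot, or any later transition, implies a QS. */
+	return !(snap & 0x1) || snap != sketch_snap(cpu);
+}
+</pre>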
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But what about CPUs that remain offline for the entire
+	grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	Such CPUs will be offline at the beginning of the grace period,
+	so the grace period won't expect quiescent states from them.
+	Races between grace-period start and CPU-hotplug operations
+	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
+	<tt>->lock</tt> as described above.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>The approach must be extended to handle one final case, that
+of waking a task blocked in <tt>synchronize_rcu()</tt>.
+This task might be affinitized to a CPU that is not yet aware that
+the grace period has ended, and thus might not yet be subject to
+the grace period's memory ordering.
+Therefore, there is an <tt>smp_mb()</tt> after the return from
+<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
+code path.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	What? Where???
+	I don't see any <tt>smp_mb()</tt> after the return from
+	<tt>wait_for_completion()</tt>!!!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	That would be because I spotted the need for that
+	<tt>smp_mb()</tt> during the creation of this documentation,
+	and it is therefore unlikely to hit mainline before v4.14.
+	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
+	Jonathan Cameron for asking questions that sensitized me
+	to the rather elaborate sequence of events that demonstrate
+	the need for this memory barrier.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
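+
+<p>In the abstract, the resulting pattern is as shown in the following
+sketch; the names are illustrative, and the kernel's actual code path
+runs through <tt>wait_rcu_gp()</tt>:
+
+<pre>
+struct completion gp_done;	/* Completed when the grace period ends. */
+
+static void sketch_synchronize_rcu_tail(void)
+{
+	wait_for_completion(&gp_done);	/* Wait out the grace period. */
+	smp_mb();	/* Order later accesses even if this task now runs */
+			/* on a CPU that is late to notice the GP's end.  */
+}
+</pre>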
+
+<p>Tree RCU's grace-period memory-ordering guarantees rely most
+heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt>
+field, so much so that it is necessary to abbreviate this pattern
+in the diagrams in the next section.
+For example, consider the <tt>rcu_prepare_for_idle()</tt> function
+shown below, which is one of several functions that enforce ordering
+of newly arrived RCU callbacks against future grace periods:
+
+<pre>
+ 1 static void rcu_prepare_for_idle(void)
+ 2 {
+ 3   bool needwake;
+ 4   struct rcu_data *rdp;
+ 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ 6   struct rcu_node *rnp;
+ 7   struct rcu_state *rsp;
+ 8   int tne;
+ 9
+10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+11       rcu_is_nocb_cpu(smp_processor_id()))
+12     return;
+13   tne = READ_ONCE(tick_nohz_active);
+14   if (tne != rdtp->tick_nohz_enabled_snap) {
+15     if (rcu_cpu_has_callbacks(NULL))
+16       invoke_rcu_core();
+17     rdtp->tick_nohz_enabled_snap = tne;
+18     return;
+19   }
+20   if (!tne)
+21     return;
+22   if (rdtp->all_lazy &&
+23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+24     rdtp->all_lazy = false;
+25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+26     invoke_rcu_core();
+27     return;
+28   }
+29   if (rdtp->last_accelerate == jiffies)
+30     return;
+31   rdtp->last_accelerate = jiffies;
+32   for_each_rcu_flavor(rsp) {
+33     rdp = this_cpu_ptr(rsp->rda);
+34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
+35       continue;
+36     rnp = rdp->mynode;
+37     raw_spin_lock_rcu_node(rnp);
+38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+39     raw_spin_unlock_rcu_node(rnp);
+40     if (needwake)
+41       rcu_gp_kthread_wake(rsp);
+42   }
+43 }
+</pre>
+
+<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
+matters for this discussion is lines 37–39.
+We will therefore abbreviate this function as follows:
+
+</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
+
+<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt>
+critical section, with the double line on top representing the additional
+<tt>smp_mb__after_unlock_lock()</tt>.
+
+<h3><a name="Tree RCU Grace Period Memory Ordering Components">
+Tree RCU Grace Period Memory Ordering Components</a></h3>
+
+<p>Tree RCU's grace-period memory-ordering guarantee is provided by
+a number of RCU components:
+
+<ol>
+<li>	<a href="#Callback Registry">Callback Registry</a>
+<li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a>
+<li>	<a href="#Self-Reported Quiescent States">
+	Self-Reported Quiescent States</a>
+<li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
+<li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
+<li>	<a href="#Forcing Quiescent States">Forcing Quiescent States</a>
+<li>	<a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
+<li>	<a href="#Callback Invocation">Callback Invocation</a>
+</ol>
+
+<p>Each of the following sections looks at the corresponding component
+in detail.
+
+<h4><a name="Callback Registry">Callback Registry</a></h4>
+
+<p>If RCU's grace-period guarantee is to mean anything at all, any
+access that happens before a given invocation of <tt>call_rcu()</tt>
+must also happen before the corresponding grace period.
+The implementation of this portion of RCU's grace period guarantee
+is shown in the following figure:
+
+</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
+
+<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
+it provides no ordering guarantees, either for itself or for
+phase one of the update (which again will usually be removal of
+an element from an RCU-protected data structure).
+It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
+which cannot become associated with a grace period until a later
+call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
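+
+<p>For reference, the asynchronous counterpart of the earlier
+<tt>synchronize_rcu()</tt> sketch uses <tt>call_rcu()</tt>, embedding
+an <tt>rcu_head</tt> in the hypothetical <tt>struct foo</tt> from that
+sketch (shown here with the added field, and reusing its
+<tt>foo_lock</tt>) so that phase two runs from the callback once a
+grace period has elapsed:
+
+<pre>
+struct foo {
+	struct list_head list;
+	struct rcu_head rh;	/* Enqueued on the per-CPU callback list. */
+	int key;
+};
+
+static void foo_reclaim(struct rcu_head *rhp)
+{
+	kfree(container_of(rhp, struct foo, rh));	/* Phase two. */
+}
+
+void foo_remove_async(struct foo *fp)
+{
+	spin_lock(&foo_lock);
+	list_del_rcu(&fp->list);	/* Phase one. */
+	spin_unlock(&foo_lock);
+	call_rcu(&fp->rh, foo_reclaim);	/* Returns immediately. */
+}
+</pre>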
+
+<p>One set of code paths shown on the left invokes
+<tt>rcu_accelerate_cbs()</tt> via
+<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
+the current CPU is inundated with queued <tt>rcu_head</tt> structures)
+or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
+Another code path in the middle is taken only in kernels built with
+<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
+<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
+The final code path on the right is taken only in kernels built with
+<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
+<tt>rcu_accelerate_cbs()</tt> via
+<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
+<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
+which in turn is invoked on a surviving CPU after the outgoing
+CPU has been completely offlined.
+
+<p>There are a few other code paths within grace-period processing
+that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
+However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
+structures are associated with a future grace-period number under
+the protection of the CPU's leaf <tt>rcu_node</tt> structure's
+<tt>->lock</tt>.
+In all cases, there is full ordering against any prior critical section
+for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and
+also full ordering against any of the current task's or CPU's prior critical
+sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>.
+
+<p>The next section will show how this ordering ensures that any
+accesses prior to the <tt>call_rcu()</tt> (particularly including phase
+one of the update)
+happen before the start of the corresponding grace period.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But what about <tt>synchronize_rcu()</tt>?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	The <tt>synchronize_rcu()</tt> function passes <tt>call_rcu()</tt>
+	to <tt>wait_rcu_gp()</tt>, which invokes it.
+	So either way, it eventually comes down to <tt>call_rcu()</tt>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
+
+<p>Grace-period initialization is carried out by
+the grace-period kernel thread, which makes several passes over the
+<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
+This means that showing the full flow of ordering through the
+grace-period computation will require duplicating this tree.
+If you find this confusing, please note that the state of the
+<tt>rcu_node</tt> tree changes over time, just like Heraclitus's river.
+However, to keep the <tt>rcu_node</tt> river tractable, the
+grace-period kernel thread's traversals are presented in multiple
+parts, starting in this section with the various phases of
+grace-period initialization.
+
+<p>The first ordering-related grace-period initialization action is to
+increment the <tt>rcu_state</tt> structure's <tt>->gpnum</tt>
+grace-period-number counter, as shown below:
+
+</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
+
+<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
+which helps reject false-positive RCU CPU stall detection.
+Note that only the root <tt>rcu_node</tt> structure is touched.
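+
+<p>In other words, the start of the new grace period is recorded with
+release semantics, along the following lines (a sketch, using the
+field names from this document):
+
+<pre>
+/* In rcu_gp_init(), with the root rcu_node structure's ->lock held. */
+smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
+</pre>
+
+<p>The release semantics ensure that a CPU seeing the new
+<tt>->gpnum</tt> value also sees the writes preceding it, which is one
+ingredient in avoiding false-positive RCU CPU stall warnings.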
+
+<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
+based on CPUs having come online or gone offline since the start of
+the previous grace period.
+In the common case where the number of online CPUs for this <tt>rcu_node</tt>
+structure has not transitioned to or from zero,
+this pass will scan only the leaf <tt>rcu_node</tt> structures.
+However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
+structure has transitioned from zero,
+<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
+Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
+structure has transitioned to zero,
+<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
+The diagram below shows the path of ordering if the leftmost
+<tt>rcu_node</tt> structure onlines its first CPU and if the next
+<tt>rcu_node</tt> structure has no online CPUs
+(or, alternatively, if the leftmost <tt>rcu_node</tt> structure offlines
+its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
+
+</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">
+
+<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
+tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
+<tt>->gpnum</tt> field to the newly incremented value from the
+<tt>rcu_state</tt> structure, as shown in the following diagram.
+
+</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
+
+<p>This change will also cause each CPU's next call to
+<tt>__note_gp_changes()</tt>
+to notice that a new grace period has started, as described in the next
+section.
+But because the grace-period kthread started the grace period at the
+root (with the increment of the <tt>rcu_state</tt> structure's
+<tt>->gpnum</tt> field) before setting each leaf <tt>rcu_node</tt>
+structure's <tt>->gpnum</tt> field, each CPU's observation of
+the start of the grace period will happen after the actual start
+of the grace period.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But what about the CPU that started the grace period?
+	Why wouldn't it see the start of the grace period right when
+	it started that grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	In some deep philosophical and overly anthropomorphized
+	sense, yes, the CPU starting the grace period is immediately
+	aware of having done so.
+	However, if we instead assume that RCU is not self-aware,
+	then even the CPU starting the grace period does not really
+	become aware of the start of this grace period until its
+	first call to <tt>__note_gp_changes()</tt>.
+	On the other hand, this CPU potentially gets early notification
+	because it invokes <tt>__note_gp_changes()</tt> during its
+	last <tt>rcu_gp_init()</tt> pass through its leaf
+	<tt>rcu_node</tt> structure.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h4><a name="Self-Reported Quiescent States">
+Self-Reported Quiescent States</a></h4>
+
+<p>When all entities that might block the grace period have reported
+quiescent states (or as described in a later section, had quiescent
+states reported on their behalf), the grace period can end.
+Online non-idle CPUs report their own quiescent states, as shown
+in the following diagram:
+
+</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
+
+<p>This is for the last CPU to report a quiescent state, which signals
+the end of the grace period.
+Earlier quiescent states would push up the <tt>rcu_node</tt> tree
+only until they encountered an <tt>rcu_node</tt> structure that
+is waiting for additional quiescent states.
+However, ordering is nevertheless preserved because some later quiescent
+state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>.
+
+<p>Any number of events can lead up to a CPU invoking
+<tt>note_gp_changes()</tt> (or alternatively, directly invoking
+<tt>__note_gp_changes()</tt>), at which point that CPU will notice
+the start of a new grace period while holding its leaf
+<tt>rcu_node</tt> lock.
+Therefore, all execution shown in this diagram happens after the
+start of the grace period.
+In addition, this CPU will consider any RCU read-side critical
+section that started before the invocation of <tt>__note_gp_changes()</tt>
+to have started before the grace period, and thus a critical
+section that the grace period must wait on.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But an RCU read-side critical section might have started
+	after the beginning of the grace period
+	(the <tt>->gpnum++</tt> from earlier), so why should
+	the grace period wait on such a critical section?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	It is indeed not necessary for the grace period to wait on such
+	a critical section.
+	However, it is permissible to wait on it.
+	And it is furthermore important to wait on it, as this
+	lazy approach is far more scalable than a “big bang”
+	all-at-once grace-period start could possibly be.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>If the CPU does a context switch, a quiescent state will be
+noted by <tt>rcu_note_context_switch()</tt> on the left.
+On the other hand, if the CPU takes a scheduler-clock interrupt
+while executing in usermode, a quiescent state will be noted by
+<tt>rcu_check_callbacks()</tt> on the right.
+Either way, the passage through a quiescent state will be noted
+in a per-CPU variable.
+
+<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
+this CPU (for example, after the next scheduler-clock
+interrupt), <tt>__rcu_process_callbacks()</tt> will invoke
+<tt>rcu_check_quiescent_state()</tt>, which will notice the
+recorded quiescent state, and invoke
+<tt>rcu_report_qs_rdp()</tt>.
+If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
+really does apply to the current grace period, it invokes
+<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
+tree as shown at the bottom of the diagram, clearing bits from
+each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field,
+and propagating up the tree when the result is zero.
+
+<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
+structure only if the current CPU is reporting the last quiescent
+state for the subtree headed by that <tt>rcu_node</tt> structure.
+A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
+structure, then there will be a later traversal by another CPU
+(or perhaps the same one) that proceeds upwards
+from that point, and the <tt>rcu_node</tt> <tt>->lock</tt>
+guarantees that the first CPU's quiescent state happens before the
+remainder of the second CPU's traversal.
+Applying this line of thought repeatedly shows that all CPUs'
+quiescent states happen before the last CPU traverses through
+the root <tt>rcu_node</tt> structure, the “last CPU”
+being the one that clears the last bit in the root <tt>rcu_node</tt>
+structure's <tt>->qsmask</tt> field.
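+
+<p>The shape of this upward propagation is sketched below.
+This is a simplification of <tt>rcu_report_qs_rnp()</tt>:
+grace-period-number checking and the final report that awakens the
+grace-period kthread are omitted, but the lock-to-lock handoff up the
+tree is the ordering-relevant part:
+
+<pre>
+static void sketch_report_qs_rnp(unsigned long mask, struct rcu_node *rnp)
+{
+	for (;;) {
+		raw_spin_lock_irq_rcu_node(rnp);
+		rnp->qsmask &= ~mask;	/* Clear our bit(s)... */
+		if (rnp->qsmask != 0 || rnp->parent == NULL) {
+			/* Still waiting on others, or at the root. */
+			raw_spin_unlock_irq_rcu_node(rnp);
+			return;
+		}
+		raw_spin_unlock_irq_rcu_node(rnp);
+		mask = rnp->grpmask;	/* ...and ours in the parent. */
+		rnp = rnp->parent;
+	}
+}
+</pre>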
+
+<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
+
+<p>Due to energy-efficiency considerations, RCU is forbidden from
+disturbing idle CPUs.
+CPUs are therefore required to notify RCU when entering or leaving the
+idle state, which they do via fully ordered value-returning atomic
+operations on a per-CPU variable.
+The ordering effects are as shown below:
+
+</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
+
+<p>The RCU grace-period kernel thread samples the per-CPU idleness
+variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
+structure's <tt>->lock</tt>.
+This means that any RCU read-side critical sections that precede the
+idle period (the oval near the top of the diagram above) will happen
+before the end of the current grace period.
+Similarly, the beginning of the current grace period will happen before
+any RCU read-side critical sections that follow the
+idle period (the oval near the bottom of the diagram above).
+
+<p>Plumbing this into the full grace-period execution is described
+<a href="#Forcing Quiescent States">below</a>.
+
+<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
+
+<p>RCU is also forbidden from disturbing offline CPUs, which might well
+be powered off and removed from the system completely.
+CPUs are therefore required to notify RCU of their comings and goings
+as part of the corresponding CPU hotplug operations.
+The ordering effects are shown below:
+
+</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
+
+<p>Because CPU hotplug operations are much less frequent than idle transitions,
+they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
+structure's <tt>->lock</tt> and update this structure's
+<tt>->qsmaskinitnext</tt>.
+The RCU grace-period kernel thread samples this mask to detect CPUs
+having gone offline since the beginning of this grace period.
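+
+<p>A sketch of this notification appears below; it is a simplification
+of what <tt>rcu_cpu_starting()</tt> and its offline-side counterpart
+do, with grace-period-number bookkeeping omitted:
+
+<pre>
+/* Runs on the incoming CPU, or on a surviving CPU for offlining. */
+static void sketch_hotplug_update(struct rcu_node *rnp,
+				  unsigned long cpumask_bit, bool online)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	if (online)
+		rnp->qsmaskinitnext |= cpumask_bit;  /* Future GPs wait on us. */
+	else
+		rnp->qsmaskinitnext &= ~cpumask_bit; /* Future GPs need not.   */
+	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+}
+</pre>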
+
+<p>Plumbing this into the full grace-period execution is described
+<a href="#Forcing Quiescent States">below</a>.
+
+<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
+
+<p>As noted above, idle and offline CPUs cannot report their own
+quiescent states, and therefore the grace-period kernel thread
+must do the reporting on their behalf.
+This process is called “forcing quiescent states”; it is
+repeated every few jiffies, and its ordering effects are shown below:
+
+</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
+
+<p>Each pass of quiescent state forcing is guaranteed to traverse the
+leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
+states due to recently idled and/or offlined CPUs, then only the
+leaves are traversed.
+However, if there is a newly offlined CPU as illustrated on the left
+or a newly idled CPU as illustrated on the right, the corresponding
+quiescent state will be driven up towards the root.
+As with self-reported quiescent states, the upwards driving stops
+once it reaches an <tt>rcu_node</tt> structure that has quiescent
+states outstanding from other CPUs.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	The leftmost drive to root stopped before it reached
+	the root <tt>rcu_node</tt> structure, which means that
+	there are still CPUs subordinate to that structure on
+	which the current grace period is waiting.
+	Given that, how is it possible that the rightmost drive
+	to root ended the grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	Good analysis!
+	It is in fact impossible in the absence of bugs in RCU.
+	But this diagram is complex enough as it is, so simplicity
+	overrode accuracy.
+	You can think of it as poetic license, or you can think of
+	it as misdirection that is resolved in the
+	<a href="#Putting It All Together">stitched-together diagram</a>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
+
+<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
+breadth-first, setting all the <tt>->completed</tt> fields equal
+to the number of the newly completed grace period, then it sets
+the <tt>rcu_state</tt> structure's <tt>->completed</tt> field,
+again to the number of the newly completed grace period.
+The ordering effects are shown below:
+
+</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
+
+<p>As indicated by the oval at the bottom of the diagram, once
+grace-period cleanup is complete, the next grace period can begin.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+	But when precisely does the grace period end?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+	There is no useful single point at which the grace period
+	can be said to end.
+	The earliest reasonable candidate is as soon as the last
+	CPU has reported its quiescent state, but it may be some
+	milliseconds before RCU becomes aware of this.
+	The latest reasonable candidate is once the <tt>rcu_state</tt>
+	structure's <tt>->completed</tt> field has been updated,
+	but it is quite possible that some CPUs have already completed
+	phase two of their updates by that time.
+	In short, if you are going to work with RCU, you need to
+	learn to embrace uncertainty.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h4><a name="Callback Invocation">Callback Invocation</a></h4>
+
+<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
+<tt>->completed</tt> field has been updated, that CPU can begin
+invoking its RCU callbacks that were waiting for this grace period
+to end.
+These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
+which is usually invoked by <tt>__note_gp_changes()</tt>.
+As shown in the diagram below, this invocation can be triggered by
+the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on
+the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
+the right, but only for kernels built with
+<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
+Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
+<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
+allows those callbacks to carry out (either directly or indirectly
+via wakeup) the needed phase-two processing for each update.
+
+</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
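+
+<p>The invocation step itself is simple, as shown in the following
+sketch, which is a stripped-down version of the <tt>rcu_do_batch()</tt>
+loop with its batch limiting and accounting omitted.
+Each <tt>rcu_head</tt> is passed to the function that was registered
+by the corresponding <tt>call_rcu()</tt>:
+
+<pre>
+static void sketch_do_batch(struct rcu_head *list)
+{
+	struct rcu_head *next;
+
+	while (list) {
+		next = list->next;
+		list->func(list);	/* Phase two runs here... */
+		list = next;		/* ...for each ready callback. */
+	}
+}
+</pre>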
+
+<p>Please note that callback invocation can also be prompted by any
+number of corner-case code paths, for example, when a CPU notes that
+it has excessive numbers of callbacks queued.
+In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
+<tt>->lock</tt> before invoking callbacks, which preserves the
+required ordering against the newly completed grace period.
+
+<p>However, if the callback function communicates to other CPUs,
+for example, doing a wakeup, then it is that function's responsibility
+to maintain ordering.
+For example, if the callback function wakes up a task that runs on
+some other CPU, proper ordering must be in place in both the callback
+function and the task being awakened.
+To see why this is important, consider the top half of the
+<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
+The callback might be running on a CPU corresponding to the leftmost
+leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
+a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
+and the grace-period kernel thread might not yet have reached the
+rightmost leaf.
+In this case, the grace period's memory ordering might not yet have
+reached that CPU, so again the callback function and the awakened
+task must supply proper ordering.
+
+<h3><a name="Putting It All Together">Putting It All Together</a></h3>
+
+<p>A stitched-together diagram is
+<a href="Tree-RCU-Diagram.html">here</a>.
+
+<h3><a name="Legal Statement">
+Legal Statement</a></h3>
+
+<p>This work represents the view of the author and does not necessarily
+represent the view of IBM.
+
+</p><p>Linux is a registered trademark of Linus Torvalds.
+
+</p><p>Other company, product, and service names may be trademarks or
+service marks of others.
+
+</body></html>