From d956028e99b30726b0bce0ca684b40b1ad67b514 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 31 Mar 2015 09:39:41 +0100 Subject: documentation: memory-barriers: Fix smp_mb__before_spinlock() semantics Our current documentation claims that, when followed by an ACQUIRE, smp_mb__before_spinlock() orders prior loads against subsequent loads and stores, which isn't the intent. This commit therefore fixes the documentation to state that this sequence orders only prior stores against subsequent loads and stores. In addition, the original intent of smp_mb__before_spinlock() was to only order prior loads against subsequent stores, however, people have started using it as if it ordered prior loads against subsequent loads and stores. This commit therefore also updates smp_mb__before_spinlock()'s header comment to reflect this new reality. Cc: Oleg Nesterov Cc: "Paul E. McKenney" Cc: Peter Zijlstra Signed-off-by: Will Deacon Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f95746189b5d..1f362fd2ecb4 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1784,10 +1784,9 @@ for each construct. These operations all imply certain barriers: Memory operations issued before the ACQUIRE may be completed after the ACQUIRE operation has completed. An smp_mb__before_spinlock(), - combined with a following ACQUIRE, orders prior loads against - subsequent loads and stores and also orders prior stores against - subsequent stores. Note that this is weaker than smp_mb()! The - smp_mb__before_spinlock() primitive is free on many architectures. + combined with a following ACQUIRE, orders prior stores against + subsequent loads and stores. Note that this is weaker than smp_mb()! + The smp_mb__before_spinlock() primitive is free on many architectures. (2) RELEASE operation implication: -- cgit v1.2.3 From ee7c29be3695996536395f647e8a2ed6b1ab3a0d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 7 Apr 2015 12:45:41 -0700 Subject: documentation: Update rcu_dereference.txt based on WG21 discussions This commit provides another caveat for the care and feeding of pointers returned by rcu_dereference() that was pointed out in discussions within the C++ standards committee. Signed-off-by: Paul E. McKenney Reviewed-by: Mathieu Desnoyers --- Documentation/RCU/rcu_dereference.txt | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation') diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt index ceb05da5a5ac..2d05c9241a33 100644 --- a/Documentation/RCU/rcu_dereference.txt +++ b/Documentation/RCU/rcu_dereference.txt @@ -193,6 +193,11 @@ o Be very careful about comparing pointers obtained from pointer. Note that the volatile cast in rcu_dereference() will normally prevent the compiler from knowing too much. + However, please note that if the compiler knows that the + pointer takes on only one of two values, a not-equal + comparison will provide exactly the information that the + compiler needs to deduce the value of the pointer. + o Disable any value-speculation optimizations that your compiler might provide, especially if you are making use of feedback-based optimizations that take data collected from prior runs. 
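To illustrate the comparison hazard described in the paragraph just added, here is a minimal sketch; the structure and variable names are hypothetical, not taken from the document:

	struct foo { int a; };
	static struct foo variant_a, variant_b;
	static struct foo __rcu *gp;	/* always points to one of the two */

	int reader(void)
	{
		struct foo *p;
		int ret = 0;

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p != &variant_a)
			/*
			 * The comparison told the compiler that p must equal
			 * &variant_b, so it may load variant_b.a through that
			 * constant address rather than through p, discarding
			 * the address dependency that rcu_dereference()
			 * relies on for ordering.
			 */
			ret = p->a;
		rcu_read_unlock();
		return ret;
	}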
Such -- cgit v1.2.3 From ed38446424dd531f1b7a167677232a6d400d69d5 Mon Sep 17 00:00:00 2001 From: Milos Vyletel Date: Fri, 17 Apr 2015 16:38:04 +0200 Subject: documentation: State that rcu_dereference() reloads pointer Make a note stating that repeated calls of rcu_dereference() may not return the same pointer if an update happens during the critical section. Reported-by: Jeff Haran Signed-off-by: Milos Vyletel Reviewed-by: Steven Rostedt Signed-off-by: Paul E. McKenney --- Documentation/RCU/whatisRCU.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 88dfce182f66..16622c9e86b5 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -256,7 +256,9 @@ rcu_dereference() If you are going to be fetching multiple fields from the RCU-protected structure, using the local variable is of course preferred. Repeated rcu_dereference() calls look - ugly and incur unnecessary overhead on Alpha CPUs. + ugly, do not guarantee that the same pointer will be returned + if an update happened while in the critical section, and incur + unnecessary overhead on Alpha CPUs. Note that the value returned by rcu_dereference() is valid only within the enclosing RCU read-side critical section.
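To make the newly documented caveat concrete, here is a minimal sketch of a reader that calls rcu_dereference() twice; the structure name and the do_something_with() helper are hypothetical:

	struct foo { int a; };
	struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

	void reader(void)
	{
		struct foo *p1, *p2;

		rcu_read_lock();
		p1 = rcu_dereference(gp);
		do_something_with(p1->a);
		p2 = rcu_dereference(gp);	/* reloads gp, so if an updater
						 * published a new structure in
						 * the meantime, p2 != p1 */
		do_something_with(p2->a);
		rcu_read_unlock();
	}

Fetching the pointer once into a local variable and using that variable throughout the critical section avoids both the reload and the Alpha overhead mentioned above.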
-- cgit v1.2.3 From 5af4692a75daf08dddc93dbb4cd2a1b3d3b617af Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 25 Apr 2015 12:48:29 -0700 Subject: smp: Make control dependencies work on Alpha, improve documentation The current formulation of control dependencies fails on DEC Alpha, which does not respect dependencies of any kind unless an explicit memory barrier is provided. This commit therefore creates a READ_ONCE_CTRL() that has the same overhead on non-Alpha systems, but causes Alpha to produce the needed ordering. This commit also applies READ_ONCE_CTRL() to the one known use of control dependencies. Use of READ_ONCE_CTRL() also has the beneficial effect of adding a bit of self-documentation to control dependencies. Signed-off-by: Paul E. McKenney Acked-by: Peter Zijlstra (Intel) --- Documentation/memory-barriers.txt | 55 +++++++++++++++++++++++---------------- include/linux/compiler.h | 16 ++++++++++++ kernel/events/ring_buffer.c | 2 +- 3 files changed, 50 insertions(+), 23 deletions(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f95746189b5d..a3014bcc5b08 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -617,16 +617,16 @@ case what's actually required is: However, stores are not speculated. This means that ordering -is- provided for load-store control dependencies, as in the following example: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); if (q) { ACCESS_ONCE(b) = p; } -Control dependencies pair normally with other types of barriers. -That said, please note that ACCESS_ONCE() is not optional! Without the -ACCESS_ONCE(), might combine the load from 'a' with other loads from -'a', and the store to 'b' with other stores to 'b', with possible highly -counterintuitive effects on ordering. +Control dependencies pair normally with other types of barriers. That +said, please note that READ_ONCE_CTRL() is not optional! Without the +READ_ONCE_CTRL(), the compiler might combine the load from 'a' with +other loads from 'a', and the store to 'b' with other stores to 'b', +with possible highly counterintuitive effects on ordering. Worse yet, if the compiler is able to prove (say) that the value of variable 'a' is always non-zero, it would be well within its rights @@ -636,12 +636,15 @@ as follows: q = a; b = p; /* BUG: Compiler and CPU can both reorder!!! */ -So don't leave out the ACCESS_ONCE(). +Finally, the READ_ONCE_CTRL() includes an smp_read_barrier_depends() +that DEC Alpha needs in order to respect control dependencies. + +So don't leave out the READ_ONCE_CTRL(). It is tempting to try to enforce ordering on identical stores on both branches of the "if" statement as follows: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); if (q) { barrier(); ACCESS_ONCE(b) = p; @@ -655,7 +658,7 @@ branches of the "if" statement as follows: Unfortunately, current compilers will transform this as follows at high optimization levels: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); barrier(); ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */ if (q) { @@ -685,7 +688,7 @@ memory barriers, for example, smp_store_release(): In contrast, without explicit memory barriers, two-legged-if control ordering is guaranteed only when the stores differ, for example: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); if (q) { ACCESS_ONCE(b) = p; do_something(); @@ -694,14 +697,14 @@ ordering is guaranteed only when the stores differ, for example: do_something_else(); } -The initial ACCESS_ONCE() is still required to prevent the compiler from -proving the value of 'a'. +The initial READ_ONCE_CTRL() is still required to prevent the compiler +from proving the value of 'a'. In addition, you need to be careful what you do with the local variable 'q', otherwise the compiler might be able to guess the value and again remove the needed conditional. For example: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); if (q % MAX) { ACCESS_ONCE(b) = p; do_something(); @@ -714,7 +717,7 @@ If MAX is defined to be 1, then the compiler knows that (q % MAX) is equal to zero, in which case the compiler is within its rights to transform the above code into the following: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); ACCESS_ONCE(b) = p; do_something_else(); @@ -725,7 +728,7 @@ is gone, and the barrier won't bring it back. Therefore, if you are relying on this ordering, you should make sure that MAX is greater than one, perhaps as follows: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */ if (q % MAX) { ACCESS_ONCE(b) = p; @@ -742,14 +745,15 @@ of the 'if' statement. You must also be careful not to rely too much on boolean short-circuit evaluation.
Consider this example: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); if (a || 1 > 0) ACCESS_ONCE(b) = 1; -Because the second condition is always true, the compiler can transform -this example as following, defeating control dependency: +Because the first condition cannot fault and the second condition is +always true, the compiler can transform this example as follows, +defeating the control dependency: - q = ACCESS_ONCE(a); + q = READ_ONCE_CTRL(a); ACCESS_ONCE(b) = 1; This example underscores the need to ensure that the compiler cannot @@ -762,8 +766,8 @@ demonstrated by two related examples, with the initial values of x and y both being zero: CPU 0 CPU 1 - ===================== ===================== - r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y); + ======================= ======================= + r1 = READ_ONCE_CTRL(x); r2 = READ_ONCE_CTRL(y); if (r1 > 0) if (r2 > 0) ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1; @@ -783,7 +787,8 @@ But because control dependencies do -not- provide transitivity, the above assertion can fail after the combined three-CPU example completes. If you need the three-CPU example to provide ordering, you will need smp_mb() between the loads and stores in the CPU 0 and CPU 1 code fragments, -that is, just before or just after the "if" statements. +that is, just before or just after the "if" statements. Furthermore, +the original two-CPU example is very fragile and should be avoided. These two examples are the LB and WWC litmus tests from this paper: http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html. In summary: + (*) Control dependencies must be headed by READ_ONCE_CTRL(). + Or, as a much less preferable alternative, they may be headed + by a READ_ONCE() or an ACCESS_ONCE() read, with + smp_read_barrier_depends() interposed between that read and + the control-dependent write. + (*) Control dependencies can order prior loads against later stores. However, they do -not- guarantee any other sort of ordering: Not prior loads against later loads, nor prior stores against diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 867722591be2..5d66777914db 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -252,6 +252,22 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s #define WRITE_ONCE(x, val) \ ({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; }) +/** + * READ_ONCE_CTRL - Read a value heading a control dependency + * @x: The value to be read, heading the control dependency + * + * Control dependencies are tricky. See Documentation/memory-barriers.txt + * for important information on how to use them. Note that in many cases, + * use of smp_load_acquire() will be much simpler. Control dependencies + * should be avoided except on the hottest of hotpaths. + */ +#define READ_ONCE_CTRL(x) \ +({ \ + typeof(x) __val = READ_ONCE(x); \ + smp_read_barrier_depends(); /* Enforce control dependency. */ \ + __val; \ +}) + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */
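As a usage sketch of the new macro, the following shows the intended load-store pattern; the shared variables are hypothetical, not taken from the patch:

	int flag, data;		/* hypothetical shared variables */

	void writer_side(void)
	{
		if (READ_ONCE_CTRL(flag)) {
			/*
			 * The conditional, together with the
			 * smp_read_barrier_depends() inside READ_ONCE_CTRL(),
			 * orders the load from 'flag' before this store,
			 * even on DEC Alpha.
			 */
			ACCESS_ONCE(data) = 1;
		}
	}

As the header comment notes, smp_load_acquire() would provide the same load-store ordering more simply; the control-dependency form is for the hottest of hotpaths.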
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 232f00f273cb..17fcb73c4a50 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -141,7 +141,7 @@ int perf_output_begin(struct perf_output_handle *handle, perf_output_get_handle(handle); do { - tail = ACCESS_ONCE(rb->user_page->data_tail); + tail = READ_ONCE_CTRL(rb->user_page->data_tail); offset = head = local_read(&rb->head); if (!rb->overwrite && unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size)) -- cgit v1.2.3 From 0f41c0ddadfb3d5baffe62351c380e2881aacd58 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 10 Mar 2015 18:33:20 -0700 Subject: rcu: Provide diagnostic option to slow down grace-period scans Grace-period scans of the rcu_node combining tree normally proceed quite quickly, so that it is very difficult to reproduce races against them. This commit therefore allows grace-period pre-initialization and cleanup to be artificially slowed down, increasing race-reproduction probability. Two pairs of new Kconfig parameters are provided: RCU_TORTURE_TEST_SLOW_PREINIT to enable slowing down the propagation of CPU-hotplug changes up the combining tree, along with RCU_TORTURE_TEST_SLOW_PREINIT_DELAY to specify the delay in jiffies; and RCU_TORTURE_TEST_SLOW_CLEANUP to enable slowing down the end-of-grace-period cleanup scan, along with RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY to specify the delay in jiffies. Boot-time parameters named rcutree.gp_preinit_delay and rcutree.gp_cleanup_delay allow these delays to be specified at boot time. Signed-off-by: Paul E. McKenney --- Documentation/kernel-parameters.txt | 16 ++++++- kernel/rcu/tree.c | 29 ++++++++++-- lib/Kconfig.debug | 54 +++++++++++++++++++++- .../selftests/rcutorture/configs/rcu/CFcommon | 2 + 4 files changed, 93 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 61ab1628a057..10a4fb80c033 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2992,11 +2992,23 @@ bytes respectively. Such letter suffixes can also be entirely omitted. Set maximum number of finished RCU callbacks to process in one batch. + rcutree.gp_cleanup_delay= [KNL] + Set the number of jiffies to delay each step of + RCU grace-period cleanup. This only has effect + when CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP is set. + rcutree.gp_init_delay= [KNL] Set the number of jiffies to delay each step of RCU grace-period initialization. This only has - effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT is - set. + effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT + is set. + + rcutree.gp_preinit_delay= [KNL] + Set the number of jiffies to delay each step of + RCU grace-period pre-initialization, that is, + the propagation of recent CPU-hotplug changes up + the rcu_node combining tree. This only has effect + when CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT is set. rcutree.rcu_fanout_leaf= [KNL] Increase the number of CPUs assigned to each diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9b076b284695..2f3cb5513ca3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -163,6 +163,14 @@ static int kthread_prio = CONFIG_RCU_KTHREAD_PRIO; module_param(kthread_prio, int, 0644); /* Delay in jiffies for grace-period initialization delays, debug only.
*/ + +#ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT +static int gp_preinit_delay = CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY; +module_param(gp_preinit_delay, int, 0644); +#else /* #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT */ +static const int gp_preinit_delay; +#endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT */ + #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT static int gp_init_delay = CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY; module_param(gp_init_delay, int, 0644); @@ -170,6 +178,13 @@ module_param(gp_init_delay, int, 0644); static const int gp_init_delay; #endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT */ +#ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP +static int gp_cleanup_delay = CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY; +module_param(gp_cleanup_delay, int, 0644); +#else /* #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP */ +static const int gp_cleanup_delay; +#endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP */ + /* * Number of grace periods between delays, normalized by the duration of * the delay. The longer the delay, the more the grace periods between @@ -1742,6 +1757,13 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp) rcu_gp_kthread_wake(rsp); } +static void rcu_gp_slow(struct rcu_state *rsp, int delay) +{ + if (delay > 0 && + !(rsp->gpnum % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay))) + schedule_timeout_uninterruptible(delay); +} + /* * Initialize a new grace period. Return 0 if no grace period required. */ @@ -1784,6 +1806,7 @@ static int rcu_gp_init(struct rcu_state *rsp) * will handle subsequent offline CPUs. */ rcu_for_each_leaf_node(rsp, rnp) { + rcu_gp_slow(rsp, gp_preinit_delay); raw_spin_lock_irq(&rnp->lock); smp_mb__after_unlock_lock(); if (rnp->qsmaskinit == rnp->qsmaskinitnext && @@ -1840,6 +1863,7 @@ static int rcu_gp_init(struct rcu_state *rsp) * process finishes, because this kthread handles both. */ rcu_for_each_node_breadth_first(rsp, rnp) { + rcu_gp_slow(rsp, gp_init_delay); raw_spin_lock_irq(&rnp->lock); smp_mb__after_unlock_lock(); rdp = this_cpu_ptr(rsp->rda); @@ -1857,10 +1881,6 @@ static int rcu_gp_init(struct rcu_state *rsp) raw_spin_unlock_irq(&rnp->lock); cond_resched_rcu_qs(); WRITE_ONCE(rsp->gp_activity, jiffies); - if (gp_init_delay > 0 && - !(rsp->gpnum % - (rcu_num_nodes * PER_RCU_NODE_PERIOD * gp_init_delay))) - schedule_timeout_uninterruptible(gp_init_delay); } return 1; @@ -1955,6 +1975,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) raw_spin_unlock_irq(&rnp->lock); cond_resched_rcu_qs(); WRITE_ONCE(rsp->gp_activity, jiffies); + rcu_gp_slow(rsp, gp_cleanup_delay); } rnp = rcu_get_root(rsp); raw_spin_lock_irq(&rnp->lock);
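To see how the new rcu_gp_slow() helper gates its delays, here is a standalone userspace sketch of the arithmetic; the values of PER_RCU_NODE_PERIOD, rcu_num_nodes, and delay below are hypothetical stand-ins chosen only for illustration:

	/* Standalone sketch: which grace periods does rcu_gp_slow() delay? */
	#include <stdio.h>

	#define PER_RCU_NODE_PERIOD 10		/* hypothetical stand-in value */

	int main(void)
	{
		int rcu_num_nodes = 3, delay = 3;	/* hypothetical values */
		unsigned long gpnum;

		for (gpnum = 0; gpnum < 200; gpnum++)
			if (delay > 0 &&
			    !(gpnum % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay)))
				printf("gp %lu: sleep %d jiffies per scan step\n",
				       gpnum, delay);
		return 0;	/* prints gp 0, 90, and 180 only */
	}

The longer the configured delay, the larger the modulus, so the delay is injected into proportionally fewer grace periods, keeping the overall slowdown roughly constant.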
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ba2b0c87e65b..e1af93ae246b 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1261,12 +1261,38 @@ config RCU_TORTURE_TEST_RUNNABLE Say N here if you want the RCU torture tests to start only after being manually enabled via /proc. +config RCU_TORTURE_TEST_SLOW_PREINIT + bool "Slow down RCU grace-period pre-initialization to expose races" + depends on RCU_TORTURE_TEST + help + This option delays grace-period pre-initialization (the + propagation of CPU-hotplug changes up the rcu_node combining + tree) for a few jiffies between initializing each pair of + consecutive rcu_node structures. This helps to expose races + involving grace-period pre-initialization, in other words, it + makes your kernel less stable. It can also greatly increase + grace-period latency, especially on systems with large numbers + of CPUs. This is useful when torture-testing RCU, but in + almost no other circumstance. + + Say Y here if you want your system to crash and hang more often. + Say N if you want a sane system. + +config RCU_TORTURE_TEST_SLOW_PREINIT_DELAY + int "How much to slow down RCU grace-period pre-initialization" + range 0 5 + default 3 + depends on RCU_TORTURE_TEST_SLOW_PREINIT + help + This option specifies the number of jiffies to wait between + each rcu_node structure pre-initialization step. + config RCU_TORTURE_TEST_SLOW_INIT bool "Slow down RCU grace-period initialization to expose races" depends on RCU_TORTURE_TEST help - This option makes grace-period initialization block for a - few jiffies between initializing each pair of consecutive + This option delays grace-period initialization for a few + jiffies between initializing each pair of consecutive rcu_node structures. This helps to expose races involving grace-period initialization, in other words, it makes your kernel less stable. It can also greatly increase grace-period @@ -1286,6 +1312,30 @@ config RCU_TORTURE_TEST_SLOW_INIT_DELAY This option specifies the number of jiffies to wait between each rcu_node structure initialization. +config RCU_TORTURE_TEST_SLOW_CLEANUP + bool "Slow down RCU grace-period cleanup to expose races" + depends on RCU_TORTURE_TEST + help + This option delays grace-period cleanup for a few jiffies + between cleaning up each pair of consecutive rcu_node + structures. This helps to expose races involving grace-period + cleanup, in other words, it makes your kernel less stable. + It can also greatly increase grace-period latency, especially + on systems with large numbers of CPUs. This is useful when + torture-testing RCU, but in almost no other circumstance. + + Say Y here if you want your system to crash and hang more often. + Say N if you want a sane system. + +config RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY + int "How much to slow down RCU grace-period cleanup" + range 0 5 + default 3 + depends on RCU_TORTURE_TEST_SLOW_CLEANUP + help + This option specifies the number of jiffies to wait between + each rcu_node structure cleanup operation. + config RCU_CPU_STALL_TIMEOUT int "RCU CPU stall timeout in seconds" depends on RCU_STALL_COMMON diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon index 49701218dc62..f824b4c9d9d9 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon +++ b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon @@ -1,3 +1,5 @@ CONFIG_RCU_TORTURE_TEST=y CONFIG_PRINTK_TIME=y +CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y +CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y -- cgit v1.2.3 From 7fa270010e0ddd3693381431f373b3e3135b0695 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 20 Apr 2015 10:27:15 -0700 Subject: rcu: Convert CONFIG_RCU_FANOUT_EXACT to boot parameter The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and perhaps only) by rcutorture to verify that RCU works correctly in specific rcu_node combining-tree configurations. It therefore does not make much sense to have this as a question to people attempting to configure their kernels. So this commit creates an rcutree.rcu_fanout_exact= boot parameter that rcutorture can use, and eliminates the original CONFIG_RCU_FANOUT_EXACT Kconfig parameter. Reported-by: Ingo Molnar Signed-off-by: Paul E.
McKenney Reviewed-by: Pranith Kumar --- Documentation/kernel-parameters.txt | 6 ++++++ init/Kconfig | 14 -------------- kernel/rcu/tree.c | 7 +++++-- kernel/rcu/tree_plugin.h | 2 +- 4 files changed, 12 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 10a4fb80c033..f5582dcdf80d 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3010,6 +3010,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. the rcu_node combining tree. This only has effect when CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT is set. + rcutree.rcu_fanout_exact= [KNL] + Disable autobalancing of the rcu_node combining + tree. This is used by rcutorture, and might + possibly be useful for architectures having high + cache-to-cache transfer latencies. + rcutree.rcu_fanout_leaf= [KNL] Increase the number of CPUs assigned to each leaf rcu_node structure. Useful for very large diff --git a/init/Kconfig b/init/Kconfig index 927210810189..0ec82362cfc0 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -611,20 +611,6 @@ config RCU_FANOUT_LEAF Take the default if unsure. -config RCU_FANOUT_EXACT - bool "Disable tree-based hierarchical RCU auto-balancing" - depends on TREE_RCU || PREEMPT_RCU - default n - help - This option forces use of the exact RCU_FANOUT value specified, - regardless of imbalances in the hierarchy. This is useful for - testing RCU itself, and might one day be useful on systems with - strong NUMA behavior. - - Without RCU_FANOUT_EXACT, the code will balance the hierarchy. - - Say N if unsure. - config RCU_FAST_NO_HZ bool "Accelerate last non-dyntick-idle CPU's grace periods" depends on NO_HZ_COMMON && SMP diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 2f3cb5513ca3..b49c474e1fff 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -113,6 +113,9 @@ RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh); static struct rcu_state *rcu_state_p; LIST_HEAD(rcu_struct_flavors); +/* Control rcu_node-tree auto-balancing at boot time. */ +static bool rcu_fanout_exact; +module_param(rcu_fanout_exact, bool, 0444); /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */ static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF; module_param(rcu_fanout_leaf, int, 0444); @@ -3956,13 +3959,13 @@ void rcu_scheduler_starting(void) /* * Compute the per-level fanout, either using the exact fanout specified - * or balancing the tree, depending on CONFIG_RCU_FANOUT_EXACT. + * or balancing the tree, depending on the rcu_fanout_exact boot parameter. 
*/ static void __init rcu_init_levelspread(struct rcu_state *rsp) { int i; - if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT)) { + if (rcu_fanout_exact) { rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf; for (i = rcu_num_lvls - 2; i >= 0; i--) rsp->levelspread[i] = CONFIG_RCU_FANOUT; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 58b1ebdc4387..eb460ec747ef 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -64,7 +64,7 @@ static void __init rcu_bootup_announce_oddness(void) (!IS_ENABLED(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 32)) pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n", CONFIG_RCU_FANOUT); - if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT)) + if (rcu_fanout_exact) pr_info("\tHierarchical RCU autobalancing is disabled.\n"); if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ)) pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); -- cgit v1.2.3 From a3dc2948cec80f20a89e9b646d0c01b121e48e02 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 20 Apr 2015 11:40:50 -0700 Subject: rcu: Enable diagnostic dump of rcu_node combining tree The purpose of this commit is to make it easier to verify that RCU's combining tree is set up correctly, which is useful to have when making changes in how that tree is initialized. Signed-off-by: Paul E. McKenney Reviewed-by: Pranith Kumar [ paulmck: Fold fix found by Fengguang's 0-day test robot. ] --- Documentation/kernel-parameters.txt | 5 +++++ kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++ 2 files changed, 32 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index f5582dcdf80d..a1cb88d9864e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2992,6 +2992,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted. Set maximum number of finished RCU callbacks to process in one batch. + rcutree.dump_tree= [KNL] + Dump the structure of the rcu_node combining tree + out at early boot. This is used for diagnostic + purposes, to verify correct tree setup. + rcutree.gp_cleanup_delay= [KNL] Set the number of jiffies to delay each step of RCU grace-period cleanup. This only has effect diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b49c474e1fff..1bc14c670641 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -113,6 +113,9 @@ RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh); static struct rcu_state *rcu_state_p; LIST_HEAD(rcu_struct_flavors); +/* Dump rcu_node combining tree at boot to verify correct setup. */ +static bool dump_tree; +module_param(dump_tree, bool, 0444); /* Control rcu_node-tree auto-balancing at boot time. */ static bool rcu_fanout_exact; module_param(rcu_fanout_exact, bool, 0444); @@ -4144,6 +4147,28 @@ static void __init rcu_init_geometry(void) rcu_num_nodes -= n; } +/* + * Dump out the structure of the rcu_node combining tree associated + * with the rcu_state structure referenced by rsp. 
+ */ +static void __init rcu_dump_rcu_node_tree(struct rcu_state *rsp) +{ + int level = 0; + struct rcu_node *rnp; + + pr_info("rcu_node tree layout dump\n"); + pr_info(" "); + rcu_for_each_node_breadth_first(rsp, rnp) { + if (rnp->level != level) { + pr_cont("\n"); + pr_info(" "); + level = rnp->level; + } + pr_cont("%d:%d ^%d ", rnp->grplo, rnp->grphi, rnp->grpnum); + } + pr_cont("\n"); +} + void __init rcu_init(void) { int cpu; @@ -4154,6 +4179,8 @@ void __init rcu_init(void) rcu_init_geometry(); rcu_init_one(&rcu_bh_state, &rcu_bh_data); rcu_init_one(&rcu_sched_state, &rcu_sched_data); + if (dump_tree) + rcu_dump_rcu_node_tree(&rcu_sched_state); __rcu_init_preempt(); open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); -- cgit v1.2.3 From 3838cc1850ccd09f93e729e9047ec1995026f83e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 12 Mar 2015 13:55:48 -0700 Subject: rcutorture: Allow negative values of nreaders to oversubscribe By default, with rcutorture.nreaders equal to -1, rcutorture provisions N-1 reader kthreads, where N is the number of CPUs. This avoids rcutorture-induced stalls, but also avoids heavier levels of torture. This commit therefore allows negative values of rcutorture.nreaders to specify larger numbers of reader kthreads, so that for example rcutorture.nreaders=-2 provisions N kthreads and rcutorture.nreaders=-5 provisions N+3 kthreads. Signed-off-by: Paul E. McKenney [ paulmck: Update documentation, as suggested by Josh Triplett. ] --- Documentation/kernel-parameters.txt | 6 +++++- kernel/rcu/rcutorture.c | 2 +- 2 files changed, 6 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 61ab1628a057..04b811086dca 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -3101,7 +3101,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted. test, hence the "fake". rcutorture.nreaders= [KNL] - Set number of RCU readers. + Set number of RCU readers. The value -1 selects + N-1, where N is the number of CPUs. A value + "n" less than -1 selects N-n-2, where N is again + the number of CPUs. For example, -2 selects N + (the number of CPUs), -3 selects N+1, and so on. rcutorture.object_debug= [KNL] Enable debug-object double-call_rcu() testing. diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index a67ef6ff86b0..7294d605c481 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1701,7 +1701,7 @@ rcu_torture_init(void) if (nreaders >= 0) { nrealreaders = nreaders; } else { - nrealreaders = num_online_cpus() - 1; + nrealreaders = num_online_cpus() - 2 - nreaders; if (nrealreaders <= 0) nrealreaders = 1; } -- cgit v1.2.3
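As a closing illustration of the nreaders mapping, here is a standalone sketch mirroring the rcu_torture_init() logic above; the function and variable names are hypothetical:

	#include <stdio.h>

	/* Sketch of the rcutorture.nreaders-to-kthreads mapping. */
	static int nrealreaders(int nreaders, int ncpus)
	{
		if (nreaders >= 0)
			return nreaders;		/* explicit count */
		nreaders = ncpus - 2 - nreaders;	/* -1 -> N-1, -2 -> N, -5 -> N+3 */
		return nreaders <= 0 ? 1 : nreaders;
	}

	int main(void)
	{
		int v, ncpus = 8;	/* hypothetical CPU count */

		for (v = -5; v <= -1; v++)
			printf("nreaders=%d -> %d reader kthreads\n",
			       v, nrealreaders(v, ncpus));
		return 0;	/* prints 11, 10, 9, 8, 7 for N = 8 */
	}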