path: root/kernel/rcu
2019-12-09  rcu: Use lockdep rather than comment to enforce lock held  (Paul E. McKenney; 1 file, -2/+2)
The rcu_preempt_check_blocked_tasks() function has a comment that states that the rcu_node structure's ->lock must be held, which might be informative, but which carries little weight if not read. This commit therefore removes this comment in favor of raw_lockdep_assert_held_rcu_node(), which will complain quite visibly if the required lock is not held. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
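The same pattern can be sketched outside the kernel. Below is a minimal userspace analogue (compile with -pthread) in which a tracked-owner lock plus an assert_held() macro plays the role that raw_lockdep_assert_held_rcu_node() plays in the commit above; all names here are illustrative stand-ins, not kernel code.

#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Replace a "caller must hold the lock" comment with an assertion that
 * fires loudly when violated.  The kernel uses lockdep for this; here a
 * tracked owner plays that role. */
struct checked_lock {
        pthread_mutex_t mutex;
        pthread_t owner;
        int held;
};

static void checked_lock_acquire(struct checked_lock *l)
{
        pthread_mutex_lock(&l->mutex);
        l->owner = pthread_self();
        l->held = 1;
}

static void checked_lock_release(struct checked_lock *l)
{
        l->held = 0;
        pthread_mutex_unlock(&l->mutex);
}

/* The assertion that replaces the comment. */
#define assert_held(l) \
        assert((l)->held && pthread_equal((l)->owner, pthread_self()))

static void check_blocked_tasks(struct checked_lock *node_lock)
{
        assert_held(node_lock);  /* complain visibly if the lock is not held */
        /* ... inspect per-node state that the lock protects ... */
}

int main(void)
{
        struct checked_lock l = { .mutex = PTHREAD_MUTEX_INITIALIZER };

        checked_lock_acquire(&l);
        check_blocked_tasks(&l);
        checked_lock_release(&l);
        puts("ok");
        return 0;
}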
2019-12-09  rcu: Avoid data-race in rcu_gp_fqs_check_wake()  (Eric Dumazet; 1 file, -5/+6)
The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp() to read ->gp_tasks while other CPUs might overwrite this field. We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler tricks and KCSAN splats like the following:

BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore

write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
 rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
 rcu_read_unlock include/linux/rcupdate.h:645 [inline]
 __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
 ip_queue_xmit+0x45/0x60 include/net/ip.h:236
 __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
 __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
 tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
 tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
 tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1864 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414

read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
 rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
 rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
 rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
 rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck: Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
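The fix pattern generalizes: any field that one CPU writes while another reads it locklessly needs a matching WRITE_ONCE()/READ_ONCE() pair. A self-contained userspace sketch of that pairing follows; the macro definitions are simplified stand-ins for the kernel's, and gp_tasks_shadow is an illustrative analogue of ->gp_tasks, not the real field.

#include <stdio.h>

/* Simplified stand-ins for the kernel macros: force exactly one untorn
 * access and keep the compiler from caching or refetching the value. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct task { int id; };

/* Illustrative analogue of the rcu_node structure's ->gp_tasks pointer. */
static struct task *gp_tasks_shadow;

/* Writer side, cf. rcu_preempt_deferred_qs_irqrestore() in the splat. */
static void writer_clears_gp_tasks(void)
{
        WRITE_ONCE(gp_tasks_shadow, NULL);
}

/* Reader side, cf. rcu_gp_fqs_check_wake() polling from the GP kthread. */
static int reader_sees_blocked_readers(void)
{
        return READ_ONCE(gp_tasks_shadow) != NULL;
}

int main(void)
{
        static struct task t = { .id = 1 };

        WRITE_ONCE(gp_tasks_shadow, &t);
        printf("blocked readers? %d\n", reader_sees_blocked_readers());
        writer_clears_gp_tasks();
        printf("blocked readers? %d\n", reader_sees_blocked_readers());
        return 0;
}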
2019-12-09  rcu/nocb: Fix dump_tree hierarchy print always active  (Stefan Reiter; 1 file, -5/+17)
Commit 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree") added print statements to rcu_organize_nocb_kthreads for debugging, but incorrectly guarded them, causing the function to always spew out its message. This patch fixes it by guarding both pr_alert statements with dump_tree, while also changing the second pr_alert to a pr_cont, to print the hierarchy in a single line (assuming that's how it was supposed to work). Fixes: 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree") Signed-off-by: Stefan Reiter <stefan@pimaker.at> [ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Enable tick for nohz_full CPUs slow to provide expedited QS  (Paul E. McKenney; 2 files, -7/+46)
An expedited grace period can be stalled by a nohz_full CPU looping in kernel context. This possibility is currently handled by some carefully crafted checks in rcu_read_unlock_special() that enlist help from ksoftirqd when permitted by the scheduler. However, it is exactly these checks that require the scheduler avoid holding any of its rq or pi locks across rcu_read_unlock() without also having held them across the entire RCU read-side critical section. It would therefore be very nice if expedited grace periods could handle nohz_full CPUs looping in kernel context without such checks. This commit therefore adds code to the expedited grace period's wait and cleanup code that forces the scheduler-clock interrupt on for CPUs that fail to quickly supply a quiescent state. "Quickly" is currently a hard-coded single-jiffy delay. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Replace synchronize_sched_expedited_wait() "_sched" with "_rcu"  (Paul E. McKenney; 1 file, -2/+2)
After RCU flavor consolidation, synchronize_sched_expedited_wait() does both RCU-preempt and RCU-sched, whichever happens to have been built into the running kernel. This commit therefore changes this function's name to synchronize_rcu_expedited_wait() to reflect its new generic nature. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Update tree_exp.h function-header comments  (Paul E. McKenney; 1 file, -12/+13)
The function-header comments in kernel/rcu/tree_exp.h have gotten a bit out of date, so this commit updates a number of them. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Rename sync_rcu_preempt_exp_done() to sync_rcu_exp_done()  (Paul E. McKenney; 2 files, -12/+11)
Now that the RCU flavors have been consolidated, there is one common function for checking to see if an expedited RCU grace period has completed, namely sync_rcu_preempt_exp_done(). Because this function is no longer specific to RCU-preempt, this commit removes the "_preempt" from its name. This commit also changes sync_rcu_preempt_exp_done_unlocked() to sync_rcu_exp_done_unlocked() for the same reason. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Allow only one expedited GP to run concurrently with wakeups  (Neeraj Upadhyay; 1 file, -6/+5)
The current expedited RCU grace-period code expects that a task requesting an expedited grace period cannot awaken until that grace period has reached the wakeup phase. However, it is possible for a long preemption to result in the waiting task never sleeping. For example, consider the following sequence of events:

1. Task A starts an expedited grace period by invoking synchronize_rcu_expedited(). It proceeds normally up to the wait_event() near the end of that function, and is then preempted (or interrupted or whatever).

2. The expedited grace period completes, and a kworker task starts the awaken phase, having incremented the counter and acquired the rcu_state structure's .exp_wake_mutex. This kworker task is then preempted or interrupted or whatever.

3. Task A resumes and enters wait_event(), which notes that the expedited grace period has completed, and thus doesn't sleep.

4. Task B starts an expedited grace period exactly as did Task A, complete with the preemption (or whatever delay) just before the call to wait_event().

5. The expedited grace period completes, and another kworker task starts the awaken phase, having incremented the counter. However, it blocks when attempting to acquire the rcu_state structure's .exp_wake_mutex because step 2's kworker task has not yet released it.

6. Steps 4 and 5 repeat, resulting in overflow of the rcu_node structure's ->exp_wq[] array.

In theory, this is harmless. Tasks waiting on the various ->exp_wq[] array entries will just be spuriously awakened, but they will just sleep again on noting that the rcu_state structure's ->expedited_sequence value has not advanced far enough. In practice, this wastes CPU time and is an accident waiting to happen. This commit therefore moves the rcu_exp_gp_seq_end() call that officially ends the expedited grace period (along with associated tracing) until after the ->exp_wake_mutex has been acquired. This prevents Task A from awakening prematurely, thus preventing more than one expedited grace period from being in flight during a previous expedited grace period's wakeup phase.

Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: Added updated comment. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Fix missed wakeup of exp_wq waiters  (Neeraj Upadhyay; 1 file, -1/+1)
Tasks waiting within exp_funnel_lock() for an expedited grace period to elapse can be starved due to the following sequence of events:

1. Tasks A and B both attempt to start an expedited grace period at about the same time. This grace period will have completed when the lower four bits of the rcu_state structure's ->expedited_sequence field are 0b'0100', for example, when the initial value of this counter is zero. Task A wins, and thus does the actual work of starting the grace period, including acquiring the rcu_state structure's .exp_mutex and setting the counter to 0b'0001'.

2. Because task B lost the race to start the grace period, it waits on ->expedited_sequence to reach 0b'0100' inside of exp_funnel_lock(). This task therefore blocks on the rcu_node structure's ->exp_wq[1] field, keeping in mind that the end-of-grace-period value of ->expedited_sequence (0b'0100') is shifted down two bits before indexing the ->exp_wq[] field.

3. Task C attempts to start another expedited grace period, but blocks on ->exp_mutex, which is still held by Task A.

4. The aforementioned expedited grace period completes, so that ->expedited_sequence now has the value 0b'0100'. A kworker task therefore acquires the rcu_state structure's ->exp_wake_mutex and starts awakening any tasks waiting for this grace period.

5. One of the first tasks awakened happens to be Task A. Task A therefore releases the rcu_state structure's ->exp_mutex, which allows Task C to start the next expedited grace period, which causes the lower four bits of the rcu_state structure's ->expedited_sequence field to become 0b'0101'.

6. Task C's expedited grace period completes, so that the lower four bits of the rcu_state structure's ->expedited_sequence field now become 0b'1000'.

7. The kworker task from step 4 above continues its wakeups. Unfortunately, the wake_up_all() refetches the rcu_state structure's .expedited_sequence field:

   wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

   This results in the wakeup being applied to the rcu_node structure's ->exp_wq[2] field, which is unfortunate given that Task B is instead waiting on ->exp_wq[1].

On a busy system, no harm is done (or at least no permanent harm is done). Some later expedited grace period will redo the wakeup. But on a quiet system, such as many embedded systems, it might be a good long time before there was another expedited grace period. On such embedded systems, this situation could therefore result in a system hang.

This issue manifested as DPM device timeout during suspend (which usually qualifies as a quiet time) due to a SCSI device being stuck in _synchronize_rcu_expedited(), with the following stack trace:

   schedule()
   synchronize_rcu_expedited()
   synchronize_rcu()
   scsi_device_quiesce()
   scsi_bus_suspend()
   dpm_run_callback()
   __device_suspend()

This commit therefore prevents such delays, timeouts, and hangs by making rcu_exp_wait_wake() use its "s" argument consistently instead of refetching from rcu_state.expedited_sequence.

Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
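The indexing arithmetic at the heart of this fix can be checked in isolation. The sketch below re-implements the counter helpers for illustration only (the real ones are the kernel's rcu_seq_* helpers) and reproduces the scenario above: the waiter sleeps on exp_wq[1], a refetched sequence value steers the wakeup to exp_wq[2], and the snapshot "s" steers it back to exp_wq[1].

#include <stdio.h>

/* The low-order bits of the expedited sequence counter hold the
 * grace-period phase; the remaining bits count grace periods. */
#define RCU_SEQ_CTR_SHIFT 2

static unsigned long rcu_seq_ctr(unsigned long s)
{
        return s >> RCU_SEQ_CTR_SHIFT;
}

/* Index into the four-entry ->exp_wq[] array for grace period "s". */
static int exp_wq_index(unsigned long s)
{
        return rcu_seq_ctr(s) & 0x3;
}

int main(void)
{
        /* Snapshot taken when the waiter queued itself (end value 0b'0100). */
        unsigned long s = 0x4;
        /* Value of ->expedited_sequence by the time the wakeup runs, after a
         * later expedited grace period has completed (0b'1000). */
        unsigned long refetched = 0x8;

        printf("waiter sleeps on exp_wq[%d]\n", exp_wq_index(s));
        printf("refetched wakeup hits exp_wq[%d] (missed wakeup)\n",
               exp_wq_index(refetched));
        printf("wakeup using snapshot s hits exp_wq[%d]\n", exp_wq_index(s));
        return 0;
}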
2019-12-09  rcu: Substitute lookup for bit-twiddling in sync_rcu_exp_select_node_cpus()  (Paul E. McKenney; 1 file, -2/+2)
The code in sync_rcu_exp_select_node_cpus() calculates the current CPU's mask within its rcu_node structure's bitmasks, but this has already been computed in the ->grpmask field of that CPU's rcu_data structure. This commit therefore just uses this ->grpmask field. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09  rcu: Fix data-race due to atomic_t copy-by-value  (Marco Elver; 1 file, -5/+6)
This fixes a data-race where `atomic_t dynticks` is copied by value. The copy is performed non-atomically, resulting in a data-race if `dynticks` is updated concurrently. This data-race was found with KCSAN:

==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
 atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
 rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
 dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
 force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
 rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
 rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
 rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
 kthread+0x1b5/0x200 kernel/kthread.c:255
 <snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
 rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
 rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
 irq_enter+0x5/0x50 kernel/softirq.c:347
 <snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
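A userspace C11 sketch of the same bug shape is below: copying a structure that embeds an atomic counter copies the counter with a plain load, whereas a snapshot should go through an atomic read. The struct and field names are illustrative analogues, not the kernel's types.

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative analogue of an atomic counter embedded in per-CPU state. */
struct dynticks_state {
        atomic_int dynticks;
};

/* Racy shape removed by the commit: a by-value copy of the containing
 * structure reads the embedded atomic with a plain, non-atomic load. */
static struct dynticks_state snapshot_by_value(struct dynticks_state *p)
{
        return *p;      /* data race if ->dynticks is updated concurrently */
}

/* Fixed shape: take the snapshot through an atomic read of the counter. */
static int snapshot_counter(struct dynticks_state *p)
{
        return atomic_load(&p->dynticks);
}

int main(void)
{
        struct dynticks_state s, copy;

        atomic_init(&s.dynticks, 42);
        copy = snapshot_by_value(&s);
        printf("by-value copy sees %d, atomic snapshot sees %d\n",
               atomic_load(&copy.dynticks), snapshot_counter(&s));
        return 0;
}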
2019-12-09  rcu: Avoid modifying mask_ofl_ipi in sync_rcu_exp_select_node_cpus()  (Boqun Feng; 1 file, -7/+6)
The "mask_ofl_ipi" variable is used to track which CPUs get IPIed. However, in the IPI-sending loop, "mask_ofl_ipi" (along with another variable, "mask_ofl_test") might also get modified to record which CPUs' quiescent states must be reported by sync_rcu_exp_select_node_cpus() before it returns. This overlap of roles can be confusing, so this patch cleans things up a little by using "mask_ofl_ipi" solely to determine which CPUs must be IPIed and "mask_ofl_test" solely to determine on behalf of which CPUs sync_rcu_exp_select_node_cpus() must report a quiescent state. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Marco Elver <elver@google.com>
2019-12-09  rcu: Use *_ONCE() to protect lockless ->expmask accesses  (Paul E. McKenney; 1 file, -10/+9)
The rcu_node structure's ->expmask field is accessed locklessly when starting a new expedited grace period and when reporting an expedited RCU CPU stall warning. This commit therefore handles the former by taking a snapshot of ->expmask while the lock is held and the latter by applying READ_ONCE() to lockless reads and WRITE_ONCE() to the corresponding updates. Link: https://lore.kernel.org/lkml/CANpmjNNmSOagbTpffHr4=Yedckx9Rm2NuGqC9UqE+AOz5f1-ZQ@mail.gmail.com Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Marco Elver <elver@google.com>
2019-10-30  Merge branches 'doc.2019.10.29a', 'fixes.2019.10.30a', 'nohz.2019.10.28a', 'replace.2019.10.30a', 'torture.2019.10.05a' and 'lkmm.2019.10.05a' into HEAD  (Paul E. McKenney; 7 files, -52/+94)
doc.2019.10.29a: RCU documentation updates.
fixes.2019.10.30a: RCU miscellaneous fixes.
nohz.2019.10.28a: RCU NO_HZ and NO_HZ_FULL updates.
replace.2019.10.30a: Replace rcu_swap_protected() with rcu_replace().
torture.2019.10.05a: RCU torture-test updates.
lkmm.2019.10.05a: Linux kernel memory model updates.
2019-10-30  rcu: Suppress levelspread uninitialized messages  (Paul E. McKenney; 1 file, -0/+2)
New tools bring new warnings, and with v5.3 comes:

kernel/rcu/srcutree.c: warning: 'levelspread[<U aa0>]' may be used uninitialized in this function [-Wuninitialized]: => 121:34

This commit suppresses this warning by initializing the full array to INT_MIN, which will result in failures should any out-of-bounds references appear.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
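A tiny sketch of the suppression technique follows, with an illustrative array size rather than the kernel's configuration; the point is that the poison value quiets the warning while keeping misuse loud.

#include <limits.h>
#include <stdio.h>

#define NUM_LVLS 4      /* illustrative depth, not the kernel's RCU_NUM_LVLS */

int main(void)
{
        int levelspread[NUM_LVLS];
        int i;

        /* Poison-initialize: quiets -Wuninitialized, and any slot that a
         * later computation fails to fill stays at INT_MIN, which makes
         * downstream use fail loudly instead of silently. */
        for (i = 0; i < NUM_LVLS; i++)
                levelspread[i] = INT_MIN;

        /* ... real code would now compute the per-level fanout ... */
        for (i = 0; i < NUM_LVLS; i++)
                printf("levelspread[%d] = %d\n", i, levelspread[i]);
        return 0;
}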
2019-10-30  rcu: Fix uninitialized variable in nocb_gp_wait()  (Dan Carpenter; 1 file, -1/+1)
We never set this to false. This probably doesn't affect most people's runtime because GCC will automatically initialize it to false at certain common optimization levels. But that behavior is related to a bug in GCC and obviously should not be relied on. Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI  (Joel Fernandes (Google); 1 file, -0/+1)
The RCU-specific resched_cpu() function sends a resched IPI to the specified CPU, which can be used to force the tick on for a given nohz_full CPU. This is needed when this nohz_full CPU is looping in the kernel while blocking the current grace period. However, for the tick to actually be forced on in all cases, that CPU's rcu_data structure's ->rcu_urgent_qs flag must be set beforehand. This commit therefore causes rcu_implicit_dynticks_qs() to set this flag prior to invoking resched_cpu() on a holdout nohz_full CPU. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30  rcu: Several rcu_segcblist functions can be static  (kbuild test robot; 1 file, -3/+3)
None of rcu_segcblist_set_len(), rcu_segcblist_add_len(), or rcu_segcblist_xchg_len() are used outside of kernel/rcu/rcu_segcblist.c. This commit therefore makes them static. Fixes: eda669a6a2c5 ("rcu/nocb: Atomic ->len field in rcu_segcblist structure") Signed-off-by: kbuild test robot <lkp@intel.com> [ paulmck: "Fixes:" updated per Stephen Rothwell feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28  rcu: Make kernel-mode nohz_full CPUs invoke the RCU core processing  (Paul E. McKenney; 1 file, -5/+5)
If a nohz_full CPU is idle or executing in userspace, it makes good sense to keep it out of RCU core processing. After all, the RCU grace-period kthread can see its quiescent states and all of its callbacks are offloaded, so there is nothing for RCU core processing to do. However, if a nohz_full CPU is executing in kernel space, the RCU grace-period kthread cannot do anything for it, so such a CPU must report its own quiescent states. This commit therefore makes nohz_full CPUs skip RCU core processing only if the scheduler-clock interrupt caught them in idle or in userspace. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28  rcu: Confine ->core_needs_qs accesses to the corresponding CPU  (Paul E. McKenney; 1 file, -4/+4)
Commit 671a63517cf9 ("rcu: Avoid unnecessary softirq when system is idle") fixed a bug that could result in an indefinite number of unnecessary invocations of the RCU_SOFTIRQ handler at the trailing edge of a scheduler-clock interrupt. However, the fix introduced off-CPU stores to ->core_needs_qs. These writes did not conflict with the on-CPU stores because the CPU's leaf rcu_node structure's ->lock was held across all such stores. However, the loads from ->core_needs_qs were not promoted to READ_ONCE() and, worse yet, the code loading from ->core_needs_qs was written assuming that it was only ever updated by the corresponding CPU. So operation has been robust, but only by luck. This situation is therefore an accident waiting to happen. This commit therefore takes a different approach. Instead of clearing ->core_needs_qs from the grace-period kthread's force-quiescent-state processing, it modifies the rcu_pending() function to suppress the rcu_sched_clock_irq() function's call to invoke_rcu_core() if there is no grace period in progress. This avoids the infinite needless RCU_SOFTIRQ handlers while still keeping all accesses to ->core_needs_qs local to the corresponding CPU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28  rcu: Reset CPU hints when reporting a quiescent state  (Joel Fernandes (Google); 1 file, -7/+10)
In some cases, tracing shows that need_heavy_qs is still set even though urgent_qs was cleared upon reporting of a quiescent state. One such case is when the softirq reports that a CPU has passed quiescent state. Commit 671a63517cf9 ("rcu: Avoid unnecessary softirq when system is idle") fixed a bug where core_needs_qs was not being cleared. In order to avoid running into similar situations with the urgent-grace-period flags, this commit causes rcu_disable_urgency_upon_qs(), previously rcu_disable_tick_upon_qs(), to clear the urgency hints, ->rcu_urgent_qs and ->rcu_need_heavy_qs. Note that it is possible for CPUs to go offline with these urgency hints still set. This is handled because rcu_disable_urgency_upon_qs() is also invoked during the online process. Because these hints can be cleared both by the corresponding CPU and by the grace-period kthread, this commit also adds a number of READ_ONCE() and WRITE_ONCE() calls. Tested overnight with rcutorture running for 60 minutes on all configurations of RCU. Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> [ paulmck: Clear urgency flags in rcu_disable_urgency_upon_qs(). ] [ paulmck: Remove ->core_needs_qs from the set cleared at quiescent state. ] [ paulmck: Make rcu_disable_urgency_upon_qs static per kbuild test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28  rcu: Force nohz_full tick on upon irq enter instead of exit  (Paul E. McKenney; 1 file, -6/+5)
There is interrupt-exit code that forces on the tick for nohz_full CPUs failing to respond to the current grace period in a timely fashion. However, this code must compare ->dynticks_nmi_nesting to the value 2 in the interrupt-exit fastpath. This commit therefore moves this code to the interrupt-entry fastpath, where a lighter-weight comparison to zero may be used. Reported-by: Joel Fernandes <joel@joelfernandes.org> [ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28  rcu: Force tick on for nohz_full CPUs not reaching quiescent states  (Paul E. McKenney; 2 files, -7/+32)
CPUs running for long time periods in the kernel in nohz_full mode might leave the scheduling-clock interrupt disabled for the full duration of their in-kernel execution. This can (among other things) delay grace periods. This commit therefore forces the tick back on for any nohz_full CPU that is failing to pass through a quiescent state upon return from interrupt, which the resched_cpu() will induce. Reported-by: Joel Fernandes <joel@joelfernandes.org> [ paulmck: Clear ->rcu_forced_tick as reported by Joel Fernandes testing. ] [ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcutorture: Make in-kernel-loop testing more brutal  (Paul E. McKenney; 1 file, -1/+1)
The rcu_torture_fwd_prog_nr() function tests the ability of RCU to tolerate in-kernel busy loops. It invokes rcu_torture_fwd_prog_cond_resched() within its delay loop, which, in PREEMPT && NO_HZ_FULL kernels, results in the occasional direct call to schedule(). Now, this direct call to schedule() is appropriate for call_rcu() flood testing, in which either the kernel should restrain itself or userspace transitions will supply the needed restraint. But in pure in-kernel loops, the occasional cond_resched() should do the job. This commit therefore makes rcu_torture_fwd_prog_nr() use cond_resched() instead of rcu_torture_fwd_prog_cond_resched() in order to increase the brutality of this aspect of rcutorture testing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcutorture: Separate warnings for each failure type  (Paul E. McKenney; 1 file, -6/+9)
Currently, each of six different types of failure triggers a single WARN_ON_ONCE(), and it is then necessary to stare at the rcu_torture_stats(), Reader Pipe, and Reader Batch lines looking for inappropriately non-zero values. This can be annoying and error-prone, so this commit provides a separate WARN_ON_ONCE() for each of the six error conditions and adds short comments to each to ease error identification. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcu: Remove unused variable rcu_perf_writer_state  (Ethan Hansen; 1 file, -16/+0)
The variable rcu_perf_writer_state is declared and initialized, but is never actually referenced. Remove it to clean code. Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com> [ paulmck: Also removed unused macros assigned to that variable. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcu: Remove unused function rcutorture_record_progress()  (Ethan Hansen; 1 file, -2/+0)
The function rcutorture_record_progress() is declared in rcu.h, but is never used. This commit therefore removes rcutorture_record_progress() to clean code. Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcutorture: Emulate dyntick aspect of userspace nohz_full sojourn  (Paul E. McKenney; 2 files, -0/+12)
During an actual call_rcu() flood, there would be frequent trips to userspace (in-kernel call_rcu() floods must be otherwise housebroken). Userspace execution on nohz_full CPUs implies an RCU dyntick idle/not-idle transition pair, so this commit adds emulation of that pair. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcu: Make CPU-hotplug removal operations enable tick  (Paul E. McKenney; 1 file, -0/+9)
CPU-hotplug removal operations run the multi_cpu_stop() function, which relies on the scheduler to gain control from whatever is running on the various online CPUs, including any nohz_full CPUs running long loops in kernel-mode code. Lack of the scheduler-clock interrupt on such CPUs can delay multi_cpu_stop() for several minutes and can also result in RCU CPU stall warnings. This commit therefore causes CPU-hotplug removal operations to enable the scheduler-clock interrupt on all online CPUs. [ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ] [ paulmck: Apply simplifications suggested by Frederic Weisbecker. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  stop_machine: Provide RCU quiescent state in multi_cpu_stop()  (Paul E. McKenney; 1 file, -1/+1)
When multi_cpu_stop() loops waiting for other tasks, it can trigger an RCU CPU stall warning. This can be misleading because what is instead needed is information on whatever task is blocking multi_cpu_stop(). This commit therefore inserts an RCU quiescent state into the multi_cpu_stop() function's waitloop. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcutorture: Force on tick for readers and callback flooders  (Paul E. McKenney; 1 file, -6/+10)
Readers and callback flooders in the rcutorture stress-test suite run for extended time periods by design. They do take pains to relinquish the CPU from time to time, but in some cases this relies on the scheduler being active, which in turn relies on the scheduler-clock interrupt firing from time to time. This commit therefore forces scheduling-clock interrupts within these loops. While in the area, this commit also prevents rcu_torture_reader()'s occasional timed sleeps from delaying shutdown. [ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05  rcu: Force on tick when invoking lots of callbacks  (Paul E. McKenney; 1 file, -0/+2)
Callback invocation can run for a significant time period, and within CONFIG_NO_HZ_FULL=y kernels, this period will be devoid of scheduler-clock interrupts. In-kernel execution without such interrupts can cause all manner of malfunction, with RCU CPU stall warnings being but one result. This commit therefore forces scheduling-clock interrupts on whenever more than a few RCU callbacks are invoked. Because offloaded callback invocation can be preempted, this forcing is withdrawn on each context switch. This in turn requires that the loop invoking RCU callbacks reiterate the forcing periodically. [ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ] [ paulmck: Remove NO_HZ_FULL check per Frederic Weisbecker feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-09-17  Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 3 files, -13/+13)
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann, Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers. As perf and the scheduler is getting bigger and more complex, document the status quo of current responsibilities and interests, and spread the review pain^H^H^H^H fun via an increase in the Cc: linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets rebuilds the topology, or when it gets deadline-throttled while it's being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from setscheduler() system calls without creating global serialization. Add new synchronization between cpuset topology-changing events and the deadline acceptance tests in setscheduler(), which were broken before.

 - Rework the active_mm state machine to be less confusing and more optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16  Merge branch 'sched/rt' into sched/core, to pick up -rt changes  (Ingo Molnar; 3 files, -10/+10)
Pick up the first couple of patches working towards PREEMPT_RT. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-14  rcu: Allow rcu_do_batch() to dynamically adjust batch sizes  (Eric Dumazet; 1 file, -1/+19)
The bimodal behavior of rcu_do_batch() is not really suited to Google applications like gfe servers. When a process with millions of sockets exits, closing all files queues two RCU callbacks per socket. This eventually reaches the point where RCU enters an emergency mode, in which rcu_do_batch() does not return until the whole queue is flushed. Each RCU callback lasts at least 70 nsec, so with millions of elements, we easily spend more than 100 msec without rescheduling. The goal of this patch is to avoid the infamous message like the following: "need_resched set for > 51999388 ns (52 ticks) without schedule". We dynamically adjust the number of elements we process: instead of the 10 / INFINITE choices, we use a floor of ~1% of the current entries. If the number is above 1000, we switch to a time-based limit of 3 msec per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns. Signed-off-by: Eric Dumazet <edumazet@google.com> [ paulmck: Forward-port and remove debug statements. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
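A rough, self-contained sketch of the adaptive limit described above follows. The constants and helper names are assumptions made for illustration (the real tunable is the rcu_resched_ns module parameter mentioned in the message), and the callback invocation itself is stubbed out.

#include <stdio.h>
#include <time.h>

#define BATCH_MIN       10              /* old fixed batch size */
#define TIME_LIMIT_NS   3000000LL       /* ~3 msec, cf. rcu_resched_ns */

/* Floor of roughly 1% of the currently queued callbacks. */
static long batch_limit(long queued)
{
        long lim = queued / 100;

        return lim < BATCH_MIN ? BATCH_MIN : lim;
}

static long long now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void do_batch(long queued)
{
        long lim = batch_limit(queued);
        long long deadline = now_ns() + TIME_LIMIT_NS;
        long done = 0;

        while (done < queued) {
                /* invoke_one_callback(); */
                done++;
                /* Small queues stop on the count limit; large queues (limit
                 * above 1000) fall back to the time budget instead. */
                if (done >= lim && (lim <= 1000 || now_ns() > deadline))
                        break;
        }
        printf("queued=%ld limit=%ld invoked=%ld\n", queued, lim, done);
}

int main(void)
{
        do_batch(500);
        do_batch(2000000);
        return 0;
}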
2019-08-14  rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload  (Paul E. McKenney; 1 file, -2/+5)
When under overload conditions, __call_rcu_nocb_wake() will wake the no-CBs GP kthread any time the no-CBs CB kthread is asleep or there are no ready-to-invoke callbacks, but only after a timer delay. If the no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup from __call_rcu_nocb_wake() is redundant. This commit therefore makes __call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if ->nocb_bypass_timer is pending. This requires adding a bit of ordering of timer actions. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention  (Paul E. McKenney; 1 file, -3/+10)
Currently, __call_rcu_nocb_wake() advances callbacks each time that it detects excessive numbers of callbacks, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed and advancing has not yet been done during this jiffy. Note that this decision does not take the presence of new callbacks into account. That is because on this code path, there will always be at least one new callback, namely the one we just enqueued. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention  (Paul E. McKenney; 1 file, -1/+4)
Currently, nocb_cb_wait() advances callbacks on each pass through its loop, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed. Note that nocb_cb_wait() doesn't worry about callbacks that have not yet been assigned a grace period. The idea is that the only reason for nocb_cb_wait() to advance callbacks is to allow it to continue invoking callbacks. Time will tell whether this is the correct choice. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()  (Paul E. McKenney; 1 file, -0/+1)
The rcutree_migrate_callbacks() function invokes rcu_advance_cbs() on both the offlined CPU's ->cblist and that of the surviving CPU, then merges them. However, after the merge, any of the offlined CPU's callbacks that were not ready to be invoked will no longer be associated with a grace-period number. This commit therefore invokes rcu_advance_cbs() one more time on the merged ->cblist in order to assign a grace-period number to these callbacks. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()  (Paul E. McKenney; 1 file, -14/+5)
When callbacks are in full flow, the common case is waiting for a grace period, and this grace period will normally take a few jiffies to complete. It therefore isn't all that helpful for __call_rcu_nocb_wake() to do a synchronous wakeup in this case. This commit therefore turns this into a timer-based deferred wakeup of the no-CBs grace-period kthread. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed  (Paul E. McKenney; 4 files, -1/+94)
This commit causes locking, sleeping, and callback state to be printed for no-CBs CPUs when the rcutorture writer is delayed sufficiently for rcutorture to complain. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended  (Paul E. McKenney; 1 file, -1/+3)
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Add bypass callback queueing  (Paul E. McKenney; 5 files, -41/+395)
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs takes advantage of unrelated grace periods, thus reducing the memory footprint in the face of floods of call_rcu() invocations. However, the ->cblist field is a more-complex rcu_segcblist structure which must be protected via locking. Even though there are only three entities which can acquire this lock (the CPU invoking call_rcu(), the no-CBs grace-period kthread, and the no-CBs callbacks kthread), the contention on this lock is excessive under heavy stress. This commit therefore greatly reduces contention by provisioning an rcu_cblist structure field named ->nocb_bypass within the rcu_data structure. Each no-CBs CPU is permitted only a limited number of enqueues onto the ->cblist per jiffy, controlled by a new nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is exceeded, the CPU instead enqueues onto the new ->nocb_bypass. The ->nocb_bypass is flushed into the ->cblist every jiffy or when the number of callbacks on ->nocb_bypass exceeds qhimark, whichever happens first. During call_rcu() floods, this flushing is carried out by the CPU during the course of its call_rcu() invocations. However, a CPU could simply stop invoking call_rcu() at any time. The no-CBs grace-period kthread therefore carries out less-aggressive flushing (every few jiffies or when the number of callbacks on ->nocb_bypass exceeds (2 * qhimark), whichever comes first). This means that the no-CBs grace-period kthread cannot be permitted to do unbounded waits while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is used to provide the needed wakeups. [ paulmck: Apply Coverity feedback reported by Colin Ian King. ] Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
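The queueing policy is easier to see in a toy model. The sketch below keeps only the counters: a small per-jiffy budget of direct enqueues, a bypass list for the overflow, and flushing when the bypass grows past a high-water mark or a new jiffy begins. All names and limits are illustrative stand-ins, not the kernel's data structures.

#include <stdio.h>

#define NOBYPASS_LIM_PER_JIFFY 16      /* cf. the ~16-per-msec default above */
#define QHIMARK                10000   /* illustrative high-water mark */

struct toy_cpu {
        unsigned long jiffy_snap;      /* jiffy of the first enqueue this tick */
        int nobypass_count;            /* direct enqueues during this jiffy */
        long cblist_len;               /* "main" segmented callback list */
        long bypass_len;               /* bypass list */
};

static void flush_bypass(struct toy_cpu *cpu)
{
        cpu->cblist_len += cpu->bypass_len;
        cpu->bypass_len = 0;
}

static void toy_call_rcu(struct toy_cpu *cpu, unsigned long jiffies)
{
        if (jiffies != cpu->jiffy_snap) {
                /* New jiffy: flush the bypass and reset the rate limit. */
                flush_bypass(cpu);
                cpu->jiffy_snap = jiffies;
                cpu->nobypass_count = 0;
        }
        if (++cpu->nobypass_count <= NOBYPASS_LIM_PER_JIFFY) {
                cpu->cblist_len++;              /* fast path: main list */
        } else {
                cpu->bypass_len++;              /* overload: bypass list */
                if (cpu->bypass_len > QHIMARK)
                        flush_bypass(cpu);
        }
}

int main(void)
{
        struct toy_cpu cpu = { 0 };

        for (unsigned long j = 1; j <= 2; j++)
                for (int i = 0; i < 30000; i++)
                        toy_call_rcu(&cpu, j);
        printf("cblist=%ld bypass=%ld\n", cpu.cblist_len, cpu.bypass_len);
        return 0;
}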
2019-08-14  rcu/nocb: Atomic ->len field in rcu_segcblist structure  (Paul E. McKenney; 2 files, -8/+90)
Upcoming ->nocb_lock contention-reduction work requires that the rcu_segcblist structure's ->len field be concurrently manipulated, but only if there are no-CBs CPUs in the kernel. This commit therefore makes this ->len field be an atomic_long_t, but only in CONFIG_RCU_NOCB_CPU=y kernels. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
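A compact userspace illustration of the configuration-dependent field type follows, with C11 atomic_long standing in for the kernel's atomic_long_t and the USE_ATOMIC_LEN macro standing in for CONFIG_RCU_NOCB_CPU=y; the accessors hide the difference from the rest of the code, as described above.

#include <stdio.h>
#ifdef USE_ATOMIC_LEN                   /* stand-in for CONFIG_RCU_NOCB_CPU=y */
#include <stdatomic.h>
#endif

struct toy_segcblist {
#ifdef USE_ATOMIC_LEN
        atomic_long len;                /* lockless updaters exist */
#else
        long len;                       /* plain field is sufficient */
#endif
};

static void toy_set_len(struct toy_segcblist *sc, long v)
{
#ifdef USE_ATOMIC_LEN
        atomic_store(&sc->len, v);
#else
        sc->len = v;
#endif
}

static long toy_get_len(struct toy_segcblist *sc)
{
#ifdef USE_ATOMIC_LEN
        return atomic_load(&sc->len);
#else
        return sc->len;
#endif
}

int main(void)
{
        struct toy_segcblist sc;

        toy_set_len(&sc, 3L);
        printf("len=%ld\n", toy_get_len(&sc));
        return 0;
}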
2019-08-14  rcu/nocb: Unconditionally advance and wake for excessive CBs  (Paul E. McKenney; 1 file, -5/+11)
When there are excessive numbers of callbacks, and when either the corresponding no-CBs callback kthread is asleep or there are no more ready-to-invoke callbacks, and when at least one callback is pending, __call_rcu_nocb_wake() will advance the callbacks, but refrain from awakening the corresponding no-CBs grace-period kthread. However, because rcu_advance_cbs_nowake() is used, it is possible (if a bit unlikely) that the needed advancement could not happen due to a grace period not being in progress. Plus there will always be at least one pending callback due to one having just now been enqueued. This commit therefore attempts to advance callbacks and awakens the no-CBs grace-period kthread when there are excessive numbers of callbacks posted and when the no-CBs callback kthread is not in a position to do anything helpful. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock  (Paul E. McKenney; 2 files, -5/+7)
The sleep/wakeup of the no-CBs grace-period kthreads is synchronized using the ->nocb_lock of the first CPU corresponding to that kthread. This commit provides a separate ->nocb_gp_lock for this purpose, thus reducing contention on ->nocb_lock. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Reduce contention at no-CBs invocation-done time  (Paul E. McKenney; 1 file, -3/+4)
Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node ->lock to advance callbacks when done invoking the previous batch. It does this while holding ->nocb_lock, which means that contention on the leaf rcu_node ->lock visits itself on the ->nocb_lock. This commit therefore makes this lock acquisition conditional, forgoing callback advancement when the leaf rcu_node ->lock is not immediately available. (In this case, the no-CBs grace-period kthread will eventually do any needed callback advancement.) Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Reduce contention at no-CBs registry-time CB advancement  (Paul E. McKenney; 2 files, -5/+4)
Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node structure's ->lock, and only afterwards does rcu_advance_cbs_nowake() check to see if it is possible to advance callbacks without potentially needing to awaken the grace-period kthread. Given that the no-awaken check can be done locklessly, this commit reverses the order, so that rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period state before conditionally acquiring that lock, thus reducing the number of needless acquisitions of the leaf rcu_node structure's ->lock. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-14  rcu/nocb: Round down for number of no-CBs grace-period kthreads  (Paul E. McKenney; 1 file, -1/+1)
Currently, when the square root of the number of CPUs is rounded down by int_sqrt(), this round-down is applied to the number of callback kthreads per grace-period kthreads. This makes almost no difference for large systems, but results in oddities such as three no-CBs grace-period kthreads for a five-CPU system, which is a bit excessive. This commit therefore causes the round-down to apply to the number of no-CBs grace-period kthreads, so that systems with from four to eight CPUs have only two no-CBs grace period kthreads. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
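The before/after arithmetic can be sanity-checked with a few lines of C. This is only one way to read the description above, with a local isqrt() standing in for the kernel's int_sqrt(); the formulas are a reconstruction for illustration, not the kernel code.

#include <stdio.h>

/* Integer square root, standing in for the kernel's int_sqrt(). */
static int isqrt(int x)
{
        int r = 0;

        while ((r + 1) * (r + 1) <= x)
                r++;
        return r;
}

static int div_round_up(int a, int b)
{
        return (a + b - 1) / b;
}

int main(void)
{
        for (int cpus = 2; cpus <= 9; cpus++) {
                /* Before: round-down applied to CB kthreads per GP kthread. */
                int old_per_gp = isqrt(cpus);
                int old_gp = div_round_up(cpus, old_per_gp);
                /* After: round-down applied to the GP kthreads themselves,
                 * so four to eight CPUs get exactly two GP kthreads. */
                int new_gp = isqrt(cpus);
                int new_per_gp = div_round_up(cpus, new_gp);

                printf("%d CPUs: old %d GP kthread(s), new %d GP kthread(s) (%d CBs each)\n",
                       cpus, old_gp, new_gp, new_per_gp);
        }
        return 0;
}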
2019-08-14  rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU  (Paul E. McKenney; 2 files, -24/+62)
A given rcu_data structure's ->nocb_lock can be acquired very frequently by the corresponding CPU and occasionally by the corresponding no-CBs grace-period and callbacks kthreads. In particular, these two kthreads will have frequent gaps between ->nocb_lock acquisitions that are roughly a grace period in duration. This means that any excessive ->nocb_lock contention will be due to the CPU's acquisitions, and this in turn enables a very naive contention-avoidance strategy to be quite effective. This commit therefore modifies rcu_nocb_lock() to first attempt a raw_spin_trylock(), and to atomically increment a separate ->nocb_lock_contended across a raw_spin_lock(). This new ->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when interrupts are enabled, with a spin-wait for contending acquisitions to complete, thus allowing the kthreads a chance to acquire the lock. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
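A userspace sketch of this contention-avoidance shape (compile with -pthread) follows; the struct, field, and function names are illustrative, with pthread primitives standing in for raw_spin_trylock()/raw_spin_lock() and the atomic counter standing in for ->nocb_lock_contended.

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

struct toy_nocb_lock {
        pthread_mutex_t lock;           /* stands in for ->nocb_lock */
        atomic_int contended;           /* stands in for ->nocb_lock_contended */
};

/* Acquire path: try the lock first; on failure, advertise the contention
 * across the blocking acquisition, as the commit describes rcu_nocb_lock()
 * doing. */
static void toy_nocb_lock(struct toy_nocb_lock *l)
{
        if (pthread_mutex_trylock(&l->lock) == 0)
                return;                         /* uncontended fast path */
        atomic_fetch_add(&l->contended, 1);
        pthread_mutex_lock(&l->lock);
        atomic_fetch_sub(&l->contended, 1);
}

static void toy_nocb_unlock(struct toy_nocb_lock *l)
{
        pthread_mutex_unlock(&l->lock);
}

/* CPU-side check, cf. __call_rcu_nocb_wake(): when contenders are waiting,
 * pause briefly so the kthreads get a chance at the lock. */
static void toy_wait_contended(struct toy_nocb_lock *l)
{
        while (atomic_load(&l->contended))
                sched_yield();
}

int main(void)
{
        struct toy_nocb_lock l = { .lock = PTHREAD_MUTEX_INITIALIZER };

        atomic_init(&l.contended, 0);
        toy_nocb_lock(&l);
        toy_nocb_unlock(&l);
        toy_wait_contended(&l);
        puts("ok");
        return 0;
}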