path: root/kernel
Age | Commit message | Author | Files | Lines
2023-11-06 | cpupri: a work around for non-rt test panic (RTLINUX_5.15_v0.1.0, rt-linux-release) | Minda Chen | 1 | -0/+3
kernel BUG at kernel/sched/cpupri.c:151! The same issue can be seen at the link below. https://www.spinics.net/lists/kernel/msg4184866.html Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
2023-11-06 | x86: Support for lazy preemption | Thomas Gleixner | 1 | -1/+1
Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | x86/entry: Use should_resched() in idtentry_exit_cond_resched() | Sebastian Andrzej Siewior | 1 | -1/+1
The TIF_NEED_RESCHED bit is inlined on x86 into the preemption counter. By using should_resched(0) instead of need_resched(), the same check can be performed while reading the same variable as the preempt_count() check issued just before. Use should_resched(0) instead of need_resched(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | signal/x86: Delay calling signals in atomic | Oleg Nesterov | 2 | -0/+36
On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per-CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack are possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored in the task's task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: also needed on 32bit as per Yang Shi <yang.shi@linaro.org>] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
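A rough sketch of the delayed-delivery scheme described above (illustrative only; the forced_info field, the ARCH_RT_DELAYS_SIGNAL_SEND guard and the exact placement follow the description rather than the literal patch):

    /* In the signal-sending path, when the architecture opts in: */
    #ifdef ARCH_RT_DELAYS_SIGNAL_SEND
            if (in_atomic()) {
                    /* Defer: stash the siginfo, let exit-to-user deliver it. */
                    t->forced_info = *info;
                    set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
                    return 0;
            }
    #endif

    /* On exit of the trap, once preemption is enabled again: */
            if (unlikely(current->forced_info.si_signo)) {
                    force_sig_info(&current->forced_info);
                    current->forced_info.si_signo = 0;
            }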
2023-11-06 | sysfs: Add /sys/kernel/realtime entry | Clark Williams | 1 | -0/+12
Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules: udev needs to evaluate whether it's a PREEMPT_RT kernel a few thousand times, and parsing uname output is reportedly too slow. Are there better solutions? Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
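A minimal sketch of what such an entry can look like (kernel/ksysfs.c style; names and placement are illustrative, not necessarily the literal patch). On !RT the attribute would simply not be created, which is one possible answer to the "return 0 on !-rt?" question:

    #ifdef CONFIG_PREEMPT_RT
    static ssize_t realtime_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
    {
            return sprintf(buf, "%d\n", 1);
    }
    KERNEL_ATTR_RO(realtime);
    #endif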
2023-11-06 | sched: Add support for lazy preemption | Thomas Gleixner | 8 | -30/+145
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which right away preempt the waking task while the waking task holds a lock on which the woken task will block right after having preempted the waker. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect, as for laziness reasons I coupled this to migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and re-enabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non-RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
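A conceptual sketch of the wakeup-side decision described above (the wrapper and the class check are illustrative; resched_curr_lazy() and the NEED_RESCHED_LAZY flag are the mechanisms introduced by the RT patches):

    /* Called from the fair-class wakeup-preemption path, so the woken
     * task is SCHED_OTHER. If current is SCHED_OTHER too, only request
     * a lazy reschedule; otherwise fall back to the normal path. */
    static void resched_curr_maybe_lazy(struct rq *rq)
    {
            if (IS_ENABLED(CONFIG_PREEMPT_LAZY) &&
                rq->curr->sched_class == &fair_sched_class)
                    resched_curr_lazy(rq);  /* sets NEED_RESCHED_LAZY */
            else
                    resched_curr(rq);       /* sets NEED_RESCHED */
    }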
2023-11-06 | genirq: update irq_set_irqchip_state documentation | Josh Cartwright | 1 | -1/+1
On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210917103055.92150-1-bigeasy@linutronix.de
2023-11-06 | random: Make it work on rt | Thomas Gleixner | 2 | -2/+14
Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | panic: skip get_random_bytes for RT_FULL in init_oops_id | Thomas Gleixner | 1 | -0/+2
Disable on -RT. If this is invoked from irq-context we will have problems acquiring the sleeping lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
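A sketch of the resulting logic (mainline picks a random oops_id once and then increments it; the RT variant below always increments, which is the behaviour the change describes):

    static u64 oops_id;

    static int init_oops_id(void)
    {
            /* get_random_bytes() may take a sleeping lock on RT and this
             * can be reached from IRQ context: skip it there. */
            if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !oops_id)
                    get_random_bytes(&oops_id, sizeof(oops_id));
            else
                    oops_id++;

            return 0;
    }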
2023-11-06 | rcu: Delay RCU-selftests | Sebastian Andrzej Siewior | 1 | -7/+2
Delay RCU-selftests until ksoftirqd is up and running. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | fs/dcache: use swait_queue instead of waitqueue | Sebastian Andrzej Siewior | 1 | -0/+1
__d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | ptrace: fix ptrace vs tasklist_lock race | Sebastian Andrzej Siewior | 2 | -9/+33
As explained by Alexander Fyodorov <halcy@yandex.ru>:
|read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel,
|and it can remove __TASK_TRACED from task->state (by moving it to
|task->saved_state). If parent does wait() on child followed by a sys_ptrace
|call, the following race can happen:
|
|- child sets __TASK_TRACED in ptrace_stop()
|- parent does wait() which eventually calls wait_task_stopped() and returns
|  child's pid
|- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves
|  __TASK_TRACED flag to saved_state
|- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive()
The patch is based on his initial patch, where an additional check is added in case __TASK_TRACED was moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. [ Fix for ptrace_unfreeze_traced() by Oleg Nesterov ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | signal: Revert ptrace preempt magic | Thomas Gleixner | 1 | -8/+0
Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merely a cosmetic bandaid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | lockdep: Make it RT aware | Thomas Gleixner | 1 | -0/+2
There is not really a softirq context on PREEMPT_RT. Softirqs on PREEMPT_RT are always invoked within the context of a threaded interrupt handler or within ksoftirqd. The "in-softirq" context is preemptible and is protected by a per-CPU lock to ensure mutual exclusion. There is no difference on PREEMPT_RT between spin_lock_irq() and spin_lock() because the former does not disable interrupts. Therefore, if a lock is used in in_softirq() context and is also locked once with spin_lock_irq(), lockdep will report this with "inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage". Teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
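Conceptually the change boils down to not accounting softirq context on RT, roughly along these lines (illustrative; the real knob lives in the lockdep/irqflags helpers):

    #if defined(CONFIG_TRACE_IRQFLAGS) && !defined(CONFIG_PREEMPT_RT)
    # define lockdep_softirq_enter()  do { current->softirq_context++; } while (0)
    # define lockdep_softirq_exit()   do { current->softirq_context--; } while (0)
    #else
    /* No softirq context accounting on PREEMPT_RT */
    # define lockdep_softirq_enter()  do { } while (0)
    # define lockdep_softirq_exit()   do { } while (0)
    #endif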
2023-11-06 | rtmutex: Add rt_mutex_lock_nest_lock() and rt_mutex_lock_killable(). | Sebastian Andrzej Siewior | 1 | -4/+26
The locking selftest for ww-mutex expects to operate directly on the base-mutex which becomes a rtmutex on PREEMPT_RT. Add rt_mutex_lock_nest_lock(), follows mutex_lock_nest_lock() for rtmutex. Add rt_mutex_lock_killable(), follows mutex_lock_killable() for rtmutex. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06 | rtmutex: Add a special case for ww-mutex handling. | Sebastian Andrzej Siewior | 1 | -1/+19
The lockdep selftest for ww-mutex assumes, in a few cases, that ww_ctx->contending_lock is assigned via __ww_mutex_check_kill(), which does not happen if the rtmutex detects the deadlock early. The testcase passes if the deadlock handling here is removed. This means that it will work if multiple threads/tasks are involved and not just a single one. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06 | sched: Trigger warning if ->migration_disabled counter underflows. | Sebastian Andrzej Siewior | 1 | -0/+2
If migrate_enable() is used more often than its counterpart then it remains undetected and rq::nr_pinned will underflow, too. Add a warning if migrate_enable() is attempted without a matching migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
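A sketch of the check (the rest of the migrate_enable() body is elided; only the added warning matters here):

    void migrate_enable(void)
    {
            struct task_struct *p = current;

            /* Unbalanced migrate_enable()? Warn instead of underflowing. */
            if (WARN_ON_ONCE(p->migration_disabled == 0))
                    return;

            if (p->migration_disabled > 1) {
                    p->migration_disabled--;
                    return;
            }

            /* ... last level: restore cpus_ptr, drop rq::nr_pinned ... */
    }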
2023-11-06 | locking: Remove rt_rwlock_is_contended() | Sebastian Andrzej Siewior | 1 | -6/+0
rt_rwlock_is_contended() has no users. It makes no sense to use it as rwlock_is_contended() because it is a sleeping lock on RT and preemption is possible. It always reports != 0 when held by a writer, and even if there is a waiter the lock might not be handed over if the current owner has the highest priority. Remove rt_rwlock_is_contended(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06 | cgroup: use irqsave in cgroup_rstat_flush_locked() | Sebastian Andrzej Siewior | 1 | -2/+3
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock either with spin_lock_irq() or spin_lock_irqsave(). cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which is a raw_spinlock_t. This lock is also acquired in cgroup_rstat_updated() in IRQ context and therefore requires the _irqsave() locking suffix in cgroup_rstat_flush_locked(). Since there is no difference between spinlock_t and raw_spinlock_t on !RT, lockdep does not complain here. On RT lockdep complains because the interrupts were not disabled here and a deadlock is possible. Acquire the raw_spinlock_t with disabled interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://www.spinics.net/lists/cgroups/msg23051.html
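The fix follows the usual pattern of switching a raw spinlock acquisition to the irqsave variant, roughly:

    raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
    unsigned long flags;

    /* was: raw_spin_lock(cpu_lock); */
    raw_spin_lock_irqsave(cpu_lock, flags);
    /* ... walk and flush the per-CPU updated tree ... */
    raw_spin_unlock_irqrestore(cpu_lock, flags);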
2023-11-06 | sched: Move mmdrop to RCU on RT | Thomas Gleixner | 2 | -1/+14
mmdrop() is invoked from finish_task_switch() by the incoming task to drop the mm which was handed over by the previous task. mmdrop() can be quite expensive which prevents an incoming real-time task from getting useful work done. Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels it delegates the eventually required invocation of __mmdrop() to RCU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210928122411.648582026@linutronix.de
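A sketch of the shape of mmdrop_sched() (the rcu_head field and the RCU callback name are illustrative assumptions; on !RT it is just mmdrop()):

    #ifdef CONFIG_PREEMPT_RT
    static inline void mmdrop_sched(struct mm_struct *mm)
    {
            /* Defer the potentially expensive __mmdrop() to RCU. */
            if (atomic_dec_and_test(&mm->mm_count))
                    call_rcu(&mm->delayed_drop, __mmdrop_delayed);
    }
    #else
    static inline void mmdrop_sched(struct mm_struct *mm)
    {
            mmdrop(mm);
    }
    #endif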
2023-11-06 | sched: Delay task stack freeing on RT | Sebastian Andrzej Siewior | 3 | -3/+15
Anything which is done on behalf of a dead task at the end of finish_task_switch() is preventing the incoming task from doing useful work. While it is beneficial for fork-heavy workloads to recycle the task stack quickly, this is a latency source for real-time tasks. Therefore delay the stack cleanup on RT enabled kernels. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210928122411.593486363@linutronix.de
2023-11-06 | sched: Move kprobes cleanup out of finish_task_switch() | Thomas Gleixner | 3 | -10/+6
Doing cleanups in the tail of schedule() is a latency punishment for the incoming task. The point of invoking kprobe_flush_task() for a dead task is that the instances are returned and cannot leak when __schedule() is kprobed. Move it into the delayed cleanup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210928122411.537994026@linutronix.de
2023-11-06 | sched: Disable TTWU_QUEUE on RT | Thomas Gleixner | 1 | -0/+5
The queued remote wakeup mechanism has turned out to be suboptimal for RT enabled kernels. The maximum latencies go up by a factor of > 5x in certain scenarios. This is caused by either long wake lists or by a large number of TTWU IPIs which are processed back to back. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210928122411.482262764@linutronix.de
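The disable is essentially a scheduler feature default along these lines:

    /* kernel/sched/features.h */
    #ifdef CONFIG_PREEMPT_RT
    SCHED_FEAT(TTWU_QUEUE, false)
    #else
    /*
     * Queue remote wakeups on the target CPU and process them
     * using the scheduler IPI. Reduces rq->lock contention/bounces.
     */
    SCHED_FEAT(TTWU_QUEUE, true)
    #endif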
2023-11-06 | sched: Limit the number of task migrations per batch on RT | Thomas Gleixner | 1 | -0/+4
Batched task migrations are a source of large latencies as they keep the scheduler from running while processing the migrations. Limit the batch size to 8 instead of 32 when running on a RT enabled kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210928122411.425097596@linutronix.de
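In code this is roughly a one-line default change:

    /*
     * Number of tasks to iterate in a single balance run.
     * Limited because this is done with IRQs disabled.
     */
    const_debug unsigned int sysctl_sched_nr_migrate =
                    IS_ENABLED(CONFIG_PREEMPT_RT) ? 8 : 32;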
2023-11-06 | locking/rt: Take RCU nesting into account for __might_resched() | Thomas Gleixner | 1 | -3/+14
The general rule that rcu_read_lock() held sections cannot voluntarily sleep does apply even on RT kernels. Though the substitution of spin/rw locks on RT enabled kernels has to be exempt from that rule. On !RT a spin_lock() can obviously nest inside an RCU read side critical section as the lock acquisition is not going to block, but on RT this is no longer the case due to the 'sleeping' spinlock substitution. The RT patches contained a cheap hack to ignore the RCU nesting depth in might_sleep() checks, which was a pragmatic but incorrect workaround. Instead of generally ignoring the RCU nesting depth in __might_sleep() and __might_resched() checks, pass the rcu_preempt_depth() via the offsets argument to __might_resched() from spin/read/write_lock(), which makes the checks work correctly even in RCU read side critical sections. The actual blocking on such a substituted lock within an RCU read side critical section is already handled correctly in __schedule() by treating it as a "preemption" of the RCU read side critical section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165358.368305497@linutronix.de
2023-11-06 | sched: Make RCU nest depth distinct in __might_resched() | Thomas Gleixner | 1 | -12/+16
For !RT kernels the RCU nest depth in __might_resched() is always expected to be 0, but on RT kernels it can be non-zero while the preempt count is expected to be always 0. Instead of playing magic games in interpreting the 'preempt_offset' argument, rename it to 'offsets' and use the lower 8 bits for the expected preempt count, allow the expected RCU nest depth to be handed in via the upper bits, and adapt the __might_resched() code and related checks and printks. The affected call sites are updated in subsequent steps. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165358.243232823@linutronix.de
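A sketch of the encoding described above (the shift and mask names follow the mainline scheme but are shown here only for illustration):

    #define MIGHT_RESCHED_RCU_SHIFT     8
    #define MIGHT_RESCHED_PREEMPT_MASK  ((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)

    void __might_resched(const char *file, int line, unsigned int offsets)
    {
            unsigned int expected_preempt_count = offsets & MIGHT_RESCHED_PREEMPT_MASK;
            unsigned int expected_rcu_depth     = offsets >> MIGHT_RESCHED_RCU_SHIFT;

            /* ... compare against preempt_count() and rcu_preempt_depth(),
             *     report both pairs in the splat ... */
    }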
2023-11-06 | sched: Make might_sleep() output less confusing | Thomas Gleixner | 1 | -5/+22
might_sleep() output is pretty informative, but can be confusing at times, especially with PREEMPT_RCU, when the check triggers due to a voluntary sleep inside an RCU read side critical section:
BUG: sleeping function called from invalid context at kernel/test.c:110
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
Preemption disabled at: migrate_disable+0x33/0xa0
in_atomic() is 0, but it still tells that preemption was disabled at migrate_disable(), which is completely useless because preemption is not disabled. But the interesting information to decode the above, i.e. the RCU nesting depth, is not printed. That becomes even more confusing when might_sleep() is invoked from cond_resched_lock() within an RCU read side critical section. Here the expected preemption count is 1 and not 0.
BUG: sleeping function called from invalid context at kernel/test.c:131
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
Preemption disabled at: test_cond_lock+0xf3/0x1c0
So in_atomic() is set, which is expected as the caller holds a spinlock, but it's unclear why this is broken and the preempt disable IP is just pointing at the correct place, i.e. spin_lock(), which is obviously not helpful either. Make that more useful in general:
- Print preempt_count() and the expected value
and for the CONFIG_PREEMPT_RCU case:
- Print the RCU read side critical section nesting depth
- Print the preempt disable IP only when preempt count does not have the expected value.
So the might_sleep() dump from within a preemptible RCU read side critical section becomes:
BUG: sleeping function called from invalid context at kernel/test.c:110
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
and the cond_resched_lock() case becomes:
BUG: sleeping function called from invalid context at kernel/test.c:141
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 415, name: kworker/u112:52
preempt_count: 1, expected: 1
RCU nest depth: 1, expected: 0
which makes it pretty obvious what's going on. For all other cases the preempt disable IP is still printed as before:
BUG: sleeping function called from invalid context at kernel/test.c: 156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffffffff82b48326>] test_might_sleep+0xbe/0xf8
BUG: sleeping function called from invalid context at kernel/test.c: 163
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
Preemption disabled at:
[<ffffffff82b48326>] test_might_sleep+0x1e4/0x280
This also prepares to provide a better debugging output for RT enabled kernels and their spinlock substitutions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165358.181022656@linutronix.de
2023-11-06 | sched: Cleanup might_sleep() printks | Thomas Gleixner | 1 | -8/+6
Convert them to pr_*(). No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165358.117496067@linutronix.de
2023-11-06 | sched: Remove preempt_offset argument from __might_sleep() | Thomas Gleixner | 1 | -2/+2
All callers hand in 0 and never will hand in anything else. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165358.054321586@linutronix.de
2023-11-06 | sched: Clean up the might_sleep() underscore zoo | Thomas Gleixner | 2 | -6/+6
__might_sleep() vs. ___might_sleep() is hard to distinguish. Aside from that, the three-underscore variant is exposed to provide a checkpoint for rescheduling points which are distinct from blocking points. They are semantically a preemption point, which means that scheduling is state preserving. A real blocking operation, e.g. mutex_lock() or wait*(), cannot preserve a task state which is not equal to RUNNING. While technically blocking on a "sleeping" spinlock in RT enabled kernels falls into the voluntary scheduling category because it has to wait until the contended spin/rw lock becomes available, the RT lock substitution can semantically be mapped to a voluntary preemption because the RT lock substitution code and the scheduler are providing mechanisms to preserve the task state and to take regular non-lock related wakeups into account. Rename ___might_sleep() to __might_resched() to make the distinction between these functions clear. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165357.928693482@linutronix.de
2023-11-06 | smp: Wake ksoftirqd on PREEMPT_RT instead do_softirq(). | Sebastian Andrzej Siewior | 1 | -2/+12
The softirq implementation on PREEMPT_RT does not provide do_softirq(). The other user of do_softirq() is replaced with a local_bh_disable() + enable() around the possible raise-softirq invocation. This cannot be done here because migration_cpu_stop() is invoked with disabled preemption. Wake the softirq thread on PREEMPT_RT if there are any pending softirqs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210927073814.x5h6osr4dgiu44sc@linutronix.de
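A sketch of the resulting branch (this_cpu_ksoftirqd() and wake_up_process() are existing kernel interfaces; the exact call site is elided):

    if (local_softirq_pending()) {
            if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
                    do_softirq();
            } else {
                    struct task_struct *ksoftirqd = this_cpu_ksoftirqd();

                    if (ksoftirqd && !task_is_running(ksoftirqd))
                            wake_up_process(ksoftirqd);
            }
    }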
2023-11-06 | irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT | Sebastian Andrzej Siewior | 1 | -2/+4
On PREEMPT_RT most items are processed as LAZY via softirq context. Avoid spin-waiting for them because the irq_work_sync() caller could have a higher priority and not allow the irq-work to be completed. Wait additionally for !IRQ_WORK_HARD_IRQ irq_work items on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-5-bigeasy@linutronix.de
2023-11-06 | irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT | Sebastian Andrzej Siewior | 1 | -12/+106
The irq_work callback is invoked in hard IRQ context. By default all callbacks are scheduled for invocation right away (provided the architecture supports it) except for the ones marked IRQ_WORK_LAZY, which are delayed until the next timer tick. While looking over the callbacks, some of them may acquire locks (spinlock_t, rwlock_t) which are transformed into sleeping locks on PREEMPT_RT and must not be acquired in hard IRQ context. Changing the locks into locks which could be acquired in this context will lead to other problems such as increased latencies if everything in the chain has IRQ-off locks. This will not solve all the issues, as one callback has been noticed which invokes kref_put() whose callback invokes kfree(), and this cannot be invoked in hardirq context. Some callbacks are required to be invoked in hardirq context even on PREEMPT_RT to work properly. This includes for instance the NO_HZ callback which needs to be able to observe the idle context. The callbacks which are required to run in hardirq context have already been marked. Use this information to split the callbacks onto two lists on PREEMPT_RT:
- lazy_list: Work items which are not marked with IRQ_WORK_HARD_IRQ will be added to this list. Callbacks on this list will be invoked from a per-CPU thread. The handler here may acquire sleeping locks such as spinlock_t and invoke kfree().
- raised_list: Work items which are marked with IRQ_WORK_HARD_IRQ will be added to this list. They will be invoked in hardirq context and must not acquire any sleeping locks.
The wake up of the per-CPU thread occurs from the irq_work handler / hardirq context. The thread runs with the lowest RT priority to ensure it runs before any SCHED_OTHER tasks do. [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant. Collected fixes over time from Steven Rostedt and Mike Galbraith. Move to per-CPU threads instead of softirq as suggested by PeterZ.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211007092646.uhshe3ut2wkrcfzv@linutronix.de
2023-11-06 | irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. | Sebastian Andrzej Siewior | 1 | -0/+10
irq_work triggers an interrupt instantly if supported by the architecture. Otherwise the work will be processed on the next timer tick. In the worst case irq_work_sync() could spin up to a jiffy. irq_work_sync() is usually used in tear-down context which is fully preemptible. Based on review, irq_work_sync() is invoked from preemptible context and there is one waiter at a time. This qualifies it to use rcuwait for synchronisation. Let irq_work_sync() synchronize with rcuwait if the architecture processes irqwork via the timer tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-3-bigeasy@linutronix.de
2023-11-06 | sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ | Sebastian Andrzej Siewior | 1 | -1/+1
The push-IPI logic for RT tasks expects to be invoked from hardirq context. One reason is that an RT task on the remote CPU would block the softirq processing on PREEMPT_RT and thus prevent pulling / balancing the RT tasks as intended. Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
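The annotation itself is essentially a one-liner in the root domain setup, along these lines:

    /* init_rootdomain(): mark the push irq_work as hard-IRQ work */
    rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);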
2023-11-06 | kcov: Replace local_irq_save() with a local_lock_t. | Sebastian Andrzej Siewior | 1 | -13/+17
The kcov code mixes local_irq_save() and spin_lock() in kcov_remote_{start|end}(). This creates a warning on PREEMPT_RT because local_irq_save() disables interrupts and spinlock_t is turned into a sleeping lock which can not be acquired in a section with disabled interrupts. The kcov_remote_lock is used to synchronize the access to the hash-list kcov_remote_map. The local_irq_save() block protects access to the per-CPU data kcov_percpu_data. There is no compelling reason to change the lock type to raw_spinlock_t to make it work with local_irq_save(). Changing it would require moving the memory allocation (in kcov_remote_add()) and deallocation outside of the locked section. Adding an unlimited amount of entries to the hashlist will increase the IRQ-off time during lookup. It could be argued that this is debug code and the latency does not matter. There is however no need to do so, and avoiding it allows this facility to be used in an RT enabled build. Using a local_lock_t instead of local_irq_save() has the benefit of adding a protection scope within the source which makes it obvious what is protected. On a !PREEMPT_RT && !LOCKDEP build the local_lock_irqsave() maps directly to local_irq_save(), so there is no additional overhead at runtime. Replace the local_irq_save() section with a local_lock_t. Reported-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-6-bigeasy@linutronix.de
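A sketch of the conversion pattern (struct layout abridged; the lock is embedded in the existing per-CPU data):

    struct kcov_percpu_data {
            void            *irq_area;
            local_lock_t    lock;
            /* ... */
    };

    static DEFINE_PER_CPU(struct kcov_percpu_data, kcov_percpu_data) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    /* in kcov_remote_start()/kcov_remote_stop(): */
    /* was: local_irq_save(flags); */
    local_lock_irqsave(&kcov_percpu_data.lock, flags);
    /* ... access this_cpu_ptr(&kcov_percpu_data) ... */
    local_unlock_irqrestore(&kcov_percpu_data.lock, flags);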
2023-11-06 | kcov: Avoid enable+disable interrupts if !in_task(). | Sebastian Andrzej Siewior | 1 | -3/+3
kcov_remote_start() may need to allocate memory in the in_task() case (otherwise per-CPU memory has been pre-allocated) and therefore requires enabled interrupts. The interrupts are enabled before checking if the allocation is required so if no allocation is required then the interrupts are needlessly enabled and disabled again. Enable interrupts only if memory allocation is performed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-5-bigeasy@linutronix.de
2023-11-06 | kcov: Allocate per-CPU memory on the relevant node. | Sebastian Andrzej Siewior | 1 | -2/+2
During boot kcov allocates per-CPU memory which is used later if remote/ softirq processing is enabled. Allocate the per-CPU memory on the CPU local node to avoid cross node memory access. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-4-bigeasy@linutronix.de
2023-11-06 | lockdep: Let lock_is_held_type() detect recursive read as read | Sebastian Andrzej Siewior | 1 | -1/+1
lock_is_held_type(, 1) detects acquired read locks. It only recognizes locks acquired with lock_acquire_shared(). Read locks acquired with lock_acquire_shared_recursive() are not recognized because a '2' is stored as the read value. Rework the check to additionally recognise a lock's read value of one and two as a read-held lock. Fixes: e918188611f07 ("locking: More accurate annotations for read_lock()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://lkml.kernel.org/r/20210903084001.lblecrvz4esl4mrr@linutronix.de
2023-11-06 | genirq: Disable irqfixup/poll on PREEMPT_RT. | Ingo Molnar | 1 | -0/+8
The support for misrouted IRQs is used on old / legacy systems and is not feasible on PREEMPT_RT. Polling for interrupts reduces the overall system performance. Additionally the interrupt latency depends on the polling frequency and delays are not desired for real time workloads. Disable IRQ polling on PREEMPT_RT and let the user know that it is not enabled. The compiler will optimize the real fixup/poll code out. [ bigeasy: Update changelog and switch to IS_ENABLED() ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210917223841.c6j6jcaffojrnot3@linutronix.de
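A sketch of how the boot-parameter handler rejects the option on RT (the warning text is illustrative):

    static int __init irqfixup_setup(char *str)
    {
            if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
                    pr_warn("irqfixup boot option not supported on PREEMPT_RT\n");
                    return 1;
            }
            irqfixup = 1;
            printk(KERN_WARNING "Misrouted IRQ fixup support enabled.\n");
            printk(KERN_WARNING "This may impact system performance.\n");

            return 1;
    }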
2023-11-06 | genirq: Move prio assignment into the newly created thread | Thomas Gleixner | 1 | -2/+2
With enabled threaded interrupts the nouveau driver reported the following:
| Chain exists of:
|   &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
|
| Possible unsafe locking scenario:
|
|        CPU0                     CPU1
|        ----                     ----
|   lock(&cpuset_rwsem);
|                                 lock(&device->mutex);
|                                 lock(&cpuset_rwsem);
|   lock(&mm->mmap_lock#2);
The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority assignment to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: Patch description] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
2023-11-06 | kthread: Move prio/affinite change into the newly created thread | Sebastian Andrzej Siewior | 1 | -8/+8
With enabled threaded interrupts the nouveau driver reported the following:
| Chain exists of:
|   &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
|
| Possible unsafe locking scenario:
|
|        CPU0                     CPU1
|        ----                     ----
|   lock(&cpuset_rwsem);
|                                 lock(&device->mutex);
|                                 lock(&cpuset_rwsem);
|   lock(&mm->mmap_lock#2);
The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority reset to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
2023-11-06 | rcutorture: Avoid problematic critical section nesting on PREEMPT_RT | Scott Wood | 1 | -12/+36
rcutorture is generating some nesting scenarios that are not compatible with PREEMPT_RT. For example:
preempt_disable();
rcu_read_lock_bh();
preempt_enable();
rcu_read_unlock_bh();
The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder the locking: start with BH locking and only then continue with disabling preemption or interrupts. In the unlocking path do it in reverse: first enable interrupts and preemption, and release the BH lock at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210820074236.2zli4nje7bof62rh@linutronix.de
2023-11-06 | sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARD | Sebastian Andrzej Siewior | 1 | -1/+1
With PREEMPT_RT enabled all hrtimer callbacks will be invoked in softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD. During boot kthread_bind() is used for the creation of per-CPU threads and then hangs in wait_task_inactive() if the ksoftirqd is not yet up and running. The hang disappeared since commit 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()") but enabling function trace on boot reliably leads to the freeze on boot behaviour again. The timer in wait_task_inactive() can not be used directly by a user interface to abuse it and create a mass wakeup of several tasks at the same time, which would lead to long sections with disabled interrupts. Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD. Switch the timer to HRTIMER_MODE_REL_HARD. Cc: stable-rt@vger.kernel.org Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06 | printk: Enhance the condition check of msleep in pr_flush() | Chao Qin | 1 | -1/+3
There is an msleep() in pr_flush(). If WARN() is called in an early boot stage, such as in an early_initcall, pr_flush() will run into msleep() while the process scheduler is not ready yet, and then the system will sleep forever. Before the system_state is SYSTEM_RUNNING, make sure pr_flush() does not sleep. Fixes: c0b395bd0fe3 ("printk: add pr_flush()") Signed-off-by: Chao Qin <chao.qin@intel.com> Signed-off-by: Lili Li <lili.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20210719022649.3444072-1-chao.qin@intel.com
2023-11-06 | printk: add pr_flush() | John Ogness | 2 | -11/+98
Provide a function to allow waiting for console printers to catch up to the latest logged message. Use pr_flush() to give console printers a chance to finish in critical situations if no atomic console is available. For now pr_flush() is only used in the most common error paths: panic(), print_oops_end_marker(), report_bug(), kmsg_dump(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | printk: add console handover | John Ogness | 1 | -2/+13
If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | printk: remove deferred printing | John Ogness | 15 | -198/+66
Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or tracking possible recursion paths. Remove all printk defer functions and context tracking. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | printk: move console printing to kthreads | John Ogness | 1 | -492/+223
Create a kthread for each console to perform console printing. Now all console printing is fully asynchronous except for the boot console and when the kernel enters sync mode (and there are atomic consoles available). The console_lock() and console_unlock() functions now only do what their name says... locking and unlocking of the console. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06 | printk: introduce kernel sync mode | John Ogness | 1 | -10/+168
When the kernel performs an OOPS, enter into "sync mode": - only atomic consoles (write_atomic() callback) will print - printing occurs within vprintk_store() instead of console_unlock() CONSOLE_LOG_MAX is moved to printk.h to support the per-console buffer used in sync mode. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>