summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-11-06sched: Make cond_resched_*lock() variants consistent vs. might_sleep()Thomas Gleixner1-6/+6
Commit 3427445afd26 ("sched: Exclude cond_resched() from nested sleep test") removed the task state check of __might_sleep() for cond_resched_lock() because cond_resched_lock() is not a voluntary scheduling point which blocks. It's a preemption point which requires the lock holder to release the spin lock. The same rationale applies to cond_resched_rwlock_read/write(), but those were not touched. Make it consistent and use the non-state checking __might_resched() there as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165357.991262778@linutronix.de
2023-11-06sched: Clean up the might_sleep() underscore zooThomas Gleixner4-13/+13
__might_sleep() vs. ___might_sleep() is hard to distinguish. Aside of that the three underscore variant is exposed to provide a checkpoint for rescheduling points which are distinct from blocking points. They are semantically a preemption point which means that scheduling is state preserving. A real blocking operation, e.g. mutex_lock(), wait*(), which cannot preserve a task state which is not equal to RUNNING. While technically blocking on a "sleeping" spinlock in RT enabled kernels falls into the voluntary scheduling category because it has to wait until the contended spin/rw lock becomes available, the RT lock substitution code can semantically be mapped to a voluntary preemption because the RT lock substitution code and the scheduler are providing mechanisms to preserve the task state and to take regular non-lock related wakeups into account. Rename ___might_sleep() to __might_resched() to make the distinction of these functions clear. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210923165357.928693482@linutronix.de
2023-11-06fscache: Use only one fscache_object_cong_wait.Sebastian Andrzej Siewior3-15/+5
In the commit mentioned below, fscache was converted from slow-work to workqueue. slow_work_enqueue() and slow_work_sleep_till_thread_needed() did not use a per-CPU workqueue. They choose from two global waitqueues depending on the SLOW_WORK_VERY_SLOW bit which was not set so it always one waitqueue. I can't find out how it is ensured that a waiter on certain CPU is woken up be the other side. My guess is that the timeout in schedule_timeout() ensures that it does not wait forever (or a random wake up). fscache_object_sleep_till_congested() must be invoked from preemptible context in order for schedule() to work. In this case this_cpu_ptr() should complain with CONFIG_DEBUG_PREEMPT enabled except the thread is bound to one CPU. wake_up() wakes only one waiter and I'm not sure if it is guaranteed that only one waiter exists. Replace the per-CPU waitqueue with one global waitqueue. Fixes: 8b8edefa2fffb ("fscache: convert object to use workqueue instead of slow-work") Reported-by: Gregor Beck <gregor.beck@gmail.com> Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20211029083839.xwwt7jgzru3kcpii@linutronix.de
2023-11-06fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.Sebastian Andrzej Siewior1-2/+18
The MNT_WRITE_HOLD flag is used to hold back any new writers while the mount point is about to be made read-only. __mnt_want_write() then loops with disabled preemption until this flag disappears. Callers of mnt_hold_writers() (which sets the flag) hold the spinlock_t of mount_lock (seqlock_t) which disables preemption on !PREEMPT_RT and ensures the task is not scheduled away so that the spinning side spins for a long time. On PREEMPT_RT the spinlock_t does not disable preemption and so it is possible that the task setting MNT_WRITE_HOLD is preempted by task with higher priority which then spins infinitely waiting for MNT_WRITE_HOLD to get removed. Acquire mount_lock::lock which is held by setter of MNT_WRITE_HOLD. This will PI-boost the owner and wait until the lock is dropped and which means that MNT_WRITE_HOLD is cleared again. Link: https://lkml.kernel.org/r/20211025152218.opvcqfku2lhqvp4o@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06smp: Wake ksoftirqd on PREEMPT_RT instead do_softirq().Sebastian Andrzej Siewior1-2/+12
The softirq implementation on PREEMPT_RT does not provide do_softirq(). The other user of do_softirq() is replaced with a local_bh_disable() + enable() around the possible raise-softirq invocation. This can not be done here because migration_cpu_stop() is invoked with disabled preemption. Wake the softirq thread on PREEMPT_RT if there are any pending softirqs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210927073814.x5h6osr4dgiu44sc@linutronix.de
2023-11-06irq_poll: Use raise_softirq_irqoff() in cpu_dead notifierSebastian Andrzej Siewior1-0/+2
__raise_softirq_irqoff() adds a bit to the pending sofirq mask and this is it. The softirq won't be handled in a deterministic way but randomly when an interrupt fires and handles softirq in its irq_exit() routine or if something randomly checks and handles pending softirqs in the call chain before the CPU goes idle. Add a local_bh_disable/enable() around the IRQ-off section which will handle pending softirqs. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210930103754.2128949-1-bigeasy@linutronix.de
2023-11-06irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RTSebastian Andrzej Siewior2-2/+9
On PREEMPT_RT most items are processed as LAZY via softirq context. Avoid to spin-wait for them because irq_work_sync() could have higher priority and not allow the irq-work to be completed. Wait additionally for !IRQ_WORK_HARD_IRQ irq_work items on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-5-bigeasy@linutronix.de
2023-11-06irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RTSebastian Andrzej Siewior1-12/+106
The irq_work callback is invoked in hard IRQ context. By default all callbacks are scheduled for invocation right away (given supported by the architecture) except for the ones marked IRQ_WORK_LAZY which are delayed until the next timer-tick. While looking over the callbacks, some of them may acquire locks (spinlock_t, rwlock_t) which are transformed into sleeping locks on PREEMPT_RT and must not be acquired in hard IRQ context. Changing the locks into locks which could be acquired in this context will lead to other problems such as increased latencies if everything in the chain has IRQ-off locks. This will not solve all the issues as one callback has been noticed which invoked kref_put() and its callback invokes kfree() and this can not be invoked in hardirq context. Some callbacks are required to be invoked in hardirq context even on PREEMPT_RT to work properly. This includes for instance the NO_HZ callback which needs to be able to observe the idle context. The callbacks which require to be run in hardirq have already been marked. Use this information to split the callbacks onto the two lists on PREEMPT_RT: - lazy_list Work items which are not marked with IRQ_WORK_HARD_IRQ will be added to this list. Callbacks on this list will be invoked from a per-CPU thread. The handler here may acquire sleeping locks such as spinlock_t and invoke kfree(). - raised_list Work items which are marked with IRQ_WORK_HARD_IRQ will be added to this list. They will be invoked in hardirq context and must not acquire any sleeping locks. The wake up of the per-CPU thread occurs from irq_work handler/ hardirq context. The thread runs with lowest RT priority to ensure it runs before any SCHED_OTHER tasks do. [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant. Collected fixes over time from Steven Rostedt and Mike Galbraith. Move to per-CPU threads instead of softirq as suggested by PeterZ.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211007092646.uhshe3ut2wkrcfzv@linutronix.de
2023-11-06irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.Sebastian Andrzej Siewior2-0/+13
irq_work() triggers instantly an interrupt if supported by the architecture. Otherwise the work will be processed on the next timer tick. In worst case irq_work_sync() could spin up to a jiffy. irq_work_sync() is usually used in tear down context which is fully preemptible. Based on review irq_work_sync() is invoked from preemptible context and there is one waiter at a time. This qualifies it to use rcuwait for synchronisation. Let irq_work_sync() synchronize with rcuwait if the architecture processes irqwork via the timer tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-3-bigeasy@linutronix.de
2023-11-06sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQSebastian Andrzej Siewior1-1/+1
The push-IPI logic for RT tasks expects to be invoked from hardirq context. One reason is that a RT task on the remote CPU would block the softirq processing on PREEMPT_RT and so avoid pulling / balancing the RT tasks as intended. Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
2023-11-06net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding.Sebastian Andrzej Siewior1-6/+37
Since the rework, the statistics code always adds up the byte and packet value(s). On 32bit architectures a seqcount_t is used in gnet_stats_basic_sync to ensure that the 64bit values are not modified during the read since two 32bit loads are required. The usage of a seqcount_t requires a lock to ensure that only one writer is active at a time. This lock leads to disabled preemption during the update. The lack of disabling preemption is now creating a warning as reported by Naresh since the query done by gnet_stats_copy_basic() is in preemptible context. For ___gnet_stats_copy_basic() there is no need to disable preemption since the update is performed on stack and can't be modified by another writer. Instead of disabling preemption, to avoid the warning, simply create a read function to just read the values and return as u64. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: 67c9e6270f301 ("net: sched: Protect Qdisc::bstats with u64_stats") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211021095919.bi3szpt3c2kcoiso@linutronix.de
2023-11-06net: sched: remove one pair of atomic operationsEric Dumazet1-4/+8
__QDISC_STATE_RUNNING is only set/cleared from contexts owning qdisc lock. Thus we can use less expensive bit operations, as we were doing before commit f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: 29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ahmed S. Darwish <a.darwish@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06net: sched: fix logic error in qdisc_run_begin()Eric Dumazet1-1/+1
For non TCQ_F_NOLOCK qdisc, qdisc_run_begin() tries to set __QDISC_STATE_RUNNING and should return true if the bit was not set. test_and_set_bit() returns old bit value, therefore we need to invert. Fixes: 29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ahmed S. Darwish <a.darwish@linutronix.de> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06net: sched: Allow statistics reads from softirq.Sebastian Andrzej Siewior1-1/+1
Eric reported that the rate estimator reads statics from the softirq which in turn triggers a warning introduced in the statistics rework. The warning is too cautious. The updates happen in the softirq context so reads from softirq are fine since the writes can not be preempted. The updates/writes happen during qdisc_run() which ensures one writer and the softirq context. The remaining bad context for reading statistics remains in hard-IRQ because it may preempt a writer. Fixes: 29cbcd8582837 ("net: sched: Remove Qdisc::running sequence counter") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06net: sched: Remove Qdisc::running sequence counterAhmed S. Darwish21-134/+102
The Qdisc::running sequence counter has two uses: 1. Reliably reading qdisc's tc statistics while the qdisc is running (a seqcount read/retry loop at gnet_stats_add_basic()). 2. As a flag, indicating whether the qdisc in question is running (without any retry loops). For the first usage, the Qdisc::running sequence counter write section, qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what is actually needed: the raw qdisc's bstats update. A u64_stats sync point was thus introduced (in previous commits) inside the bstats structure itself. A local u64_stats write section is then started and stopped for the bstats updates. Use that u64_stats sync point mechanism for the bstats read/retry loop at gnet_stats_add_basic(). For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag, accessed with atomic bitops, is sufficient. Using a bit flag instead of a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads to the SMP barriers implicitly added through raw_read_seqcount() and write_seqcount_begin/end() getting removed. All call sites have been surveyed though, and no required ordering was identified. Now that the qdisc->running sequence counter is no longer used, remove it. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the qdisc tc statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data typesAhmed S. Darwish30-160/+155
The only factor differentiating per-CPU bstats data type (struct gnet_stats_basic_cpu) from the packed non-per-CPU one (struct gnet_stats_basic_packed) was a u64_stats sync point inside the former. The two data types are now equivalent: earlier commits added a u64_stats sync point to the latter. Combine both data types into "struct gnet_stats_basic_sync". This eliminates redundancy and simplifies the bstats read/write APIs. Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit architectures, u64_stats sync points do not use sequence counter protection. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06net: sched: Use _bstats_update/set() instead of raw writesAhmed S. Darwish5-21/+26
The Qdisc::running sequence counter, used to protect Qdisc::bstats reads from parallel writes, is in the process of being removed. Qdisc::bstats read/writes will synchronize using an internal u64_stats sync point instead. Modify all bstats writes to use _bstats_update(). This ensures that the internal u64_stats sync point is always acquired and released as appropriate. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06net: sched: Protect Qdisc::bstats with u64_statsAhmed S. Darwish17-10/+39
The not-per-CPU variant of qdisc tc (traffic control) statistics, Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running sequence counter. This sequence counter is used for reliably protecting bstats reads from parallel writes. Meanwhile, the seqcount's write section covers a much wider area than bstats update: qdisc_run_begin() => qdisc_run_end(). That read/write section asymmetry can lead to needless retries of the read section. To prepare for removing the Qdisc::running sequence counter altogether, introduce a u64_stats sync point inside bstats instead. Modify _bstats_update() to start/end the bstats u64_stats write section. For bisectability, and finer commits granularity, the bstats read section is still protected with a Qdisc::running read/retry loop and qdisc_run_begin/end() still starts/ends that seqcount write section. Once all call sites are modified to use _bstats_update(), the Qdisc::running seqcount will be removed and bstats read/retry loop will be modified to utilize the internal u64_stats sync point. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. [bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.] Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06u64_stats: Introduce u64_stats_set()Ahmed S. Darwish1-0/+10
Allow to directly set a u64_stats_t value which is used to provide an init function which sets it directly to zero intead of memset() the value. Add u64_stats_set() to the u64_stats API. [bigeasy: commit message. ] Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06gen_stats: Move remaining users to gnet_stats_add_queue().Sebastian Andrzej Siewior3-43/+4
The gnet_stats_queue::qlen member is only used in the SMP-case. qdisc_qstats_qlen_backlog() needs to add qdisc_qlen() to qstats.qlen to have the same value as that provided by qdisc_qlen_sum(). gnet_stats_copy_queue() needs to overwritte the resulting qstats.qlen field whith the caller submitted qlen value. It might be differ from the submitted value. Let both functions use gnet_stats_add_queue() and remove unused __gnet_stats_copy_queue(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06mq, mqprio: Use gnet_stats_add_queue().Sebastian Andrzej Siewior2-57/+18
gnet_stats_add_basic() and gnet_stats_add_queue() add up the statistics so they can be used directly for both the per-CPU and global case. gnet_stats_add_queue() copies either Qdisc's per-CPU gnet_stats_queue::qlen or the global member. The global gnet_stats_queue::qlen isn't touched in the per-CPU case so there is no need to consider it in the global-case. In the per-CPU case, the sum of global gnet_stats_queue::qlen and the per-CPU gnet_stats_queue::qlen was assigned to sch->q.qlen and sch->qstats.qlen. Now both fields are copied individually. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06gen_stats: Add gnet_stats_add_queue().Sebastian Andrzej Siewior2-0/+35
This function will replace __gnet_stats_copy_queue(). It reads all arguments and adds them into the passed gnet_stats_queue argument. In contrast to __gnet_stats_copy_queue() it also copies the qlen member. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06gen_stats: Add instead Set the value in __gnet_stats_copy_basic().Sebastian Andrzej Siewior5-27/+28
__gnet_stats_copy_basic() always assigns the value to the bstats argument overwriting the previous value. The later added per-CPU version always accumulated the values in the returning gnet_stats_basic_packed argument. Based on review there are five users of that function as of today: - est_fetch_counters(), ___gnet_stats_copy_basic() memsets() bstats to zero, single invocation. - mq_dump(), mqprio_dump(), mqprio_dump_class_stats() memsets() bstats to zero, multiple invocation but does not use the function due to !qdisc_is_percpu_stats(). Add the values in __gnet_stats_copy_basic() instead overwriting. Rename the function to gnet_stats_add_basic() to make it more obvious. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-06net/sched: sch_ets: properly init all active DRR list handlesDavide Caratti1-3/+9
leaf classes of ETS qdiscs are served in strict priority or deficit round robin (DRR), depending on the value of 'nstrict'. Since this value can be changed while traffic is running, we need to be sure that the active list of DRR classes can be updated at any time, so: 1) call INIT_LIST_HEAD(&alist) on all leaf classes in .init(), before the first packet hits any of them. 2) ensure that 'alist' is not overwritten with zeros when a leaf class is no more strict priority nor DRR (i.e. array elements beyond 'nbands'). Link: https://lore.kernel.org/netdev/YS%2FoZ+f0Nr8eQkzH@dcaratti.users.ipa.redhat.com Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06kcov: Replace local_irq_save() with a local_lock_t.Sebastian Andrzej Siewior1-13/+17
The kcov code mixes local_irq_save() and spin_lock() in kcov_remote_{start|end}(). This creates a warning on PREEMPT_RT because local_irq_save() disables interrupts and spin_lock_t is turned into a sleeping lock which can not be acquired in a section with disabled interrupts. The kcov_remote_lock is used to synchronize the access to the hash-list kcov_remote_map. The local_irq_save() block protects access to the per-CPU data kcov_percpu_data. There no compelling reason to change the lock type to raw_spin_lock_t to make it work with local_irq_save(). Changing it would require to move memory allocation (in kcov_remote_add()) and deallocation outside of the locked section. Adding an unlimited amount of entries to the hashlist will increase the IRQ-off time during lookup. It could be argued that this is debug code and the latency does not matter. There is however no need to do so and it would allow to use this facility in an RT enabled build. Using a local_lock_t instead of local_irq_save() has the befit of adding a protection scope within the source which makes it obvious what is protected. On a !PREEMPT_RT && !LOCKDEP build the local_lock_irqsave() maps directly to local_irq_save() so there is overhead at runtime. Replace the local_irq_save() section with a local_lock_t. Reported-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-6-bigeasy@linutronix.de
2023-11-06kcov: Avoid enable+disable interrupts if !in_task().Sebastian Andrzej Siewior1-3/+3
kcov_remote_start() may need to allocate memory in the in_task() case (otherwise per-CPU memory has been pre-allocated) and therefore requires enabled interrupts. The interrupts are enabled before checking if the allocation is required so if no allocation is required then the interrupts are needlessly enabled and disabled again. Enable interrupts only if memory allocation is performed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-5-bigeasy@linutronix.de
2023-11-06kcov: Allocate per-CPU memory on the relevant node.Sebastian Andrzej Siewior1-2/+2
During boot kcov allocates per-CPU memory which is used later if remote/ softirq processing is enabled. Allocate the per-CPU memory on the CPU local node to avoid cross node memory access. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-4-bigeasy@linutronix.de
2023-11-06Documentation/kcov: Define `ip' in the example.Sebastian Andrzej Siewior1-0/+2
The example code uses the variable `ip' but never declares it. Declare `ip' as a 64bit variable which is the same type as the array from which it loads its value. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-3-bigeasy@linutronix.de
2023-11-06Documentation/kcov: Include types.h in the example.Sebastian Andrzej Siewior1-0/+3
The first example code has includes at the top, the following two example share that part. The last example (remote coverage collection) requires the linux/types.h header file due its __aligned_u64 usage. Add the linux/types.h to the top most example and a comment that the header files from above are required as it is done in the second example. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210830172627.267989-2-bigeasy@linutronix.de
2023-11-06x86/softirq: Disable softirq stacks on PREEMPT_RTThomas Gleixner2-0/+5
PREEMPT_RT preempts softirqs and the current implementation avoids do_softirq_own_stack() and only uses __do_softirq(). Disable the unused softirqs stacks on PREEMPT_RT to safe some memory and ensure that do_softirq_own_stack() is not used which is not expected. [bigeasy: commit description.] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210924161245.2357247-1-bigeasy@linutronix.de
2023-11-06mm: Disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on ↵Sebastian Andrzej Siewior2-2/+2
PREEMPT_RT TRANSPARENT_HUGEPAGE: There are potential non-deterministic delays to an RT thread if a critical memory region is not THP-aligned and a non-RT buffer is located in the same hugepage-aligned region. It's also possible for an unrelated thread to migrate pages belonging to an RT task incurring unexpected page faults due to memory defragmentation even if khugepaged is disabled. Regular HUGEPAGEs are not affected by this can be used. NUMA_BALANCING: There is a non-deterministic delay to mark PTEs PROT_NONE to gather NUMA fault samples, increased page faults of regions even if mlocked and non-deterministic delays when migrating pages. [Mel Gorman worded 99% of the commit description]. Link: https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/ Link: https://lore.kernel.org/all/20211026165100.ahz5bkx44lrrw5pt@linutronix.de/ Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20211028143327.hfbxjze7palrpfgp@linutronix.de
2023-11-06mm/scatterlist: Replace the !preemptible warning in sg_miter_stop()Thomas Gleixner1-7/+4
sg_miter_stop() checks for disabled preemption before unmapping a page via kunmap_atomic(). The kernel doc mentions under context that preemption must be disabled if SG_MITER_ATOMIC is set. There is no active requirement for the caller to have preemption disabled before invoking sg_mitter_stop(). The sg_mitter_*() implementation itself has no such requirement. In fact, preemption is disabled by kmap_atomic() as part of sg_miter_next() and remains disabled as long as there is an active SG_MITER_ATOMIC mapping. This is a consequence of kmap_atomic() and not a requirement for sg_mitter_*() itself. The user chooses SG_MITER_ATOMIC because it uses the API in a context where blocking is not possible or blocking is possible but he chooses a lower weight mapping which is not available on all CPUs and so it might need less overhead to setup at a price that now preemption will be disabled. The kmap_atomic() implementation on PREEMPT_RT does not disable preemption. It simply disables CPU migration to ensure that the task remains on the same CPU while the caller remains preemptible. This in turn triggers the warning in sg_miter_stop() because preemption is allowed. The PREEMPT_RT and !PREEMPT_RT implementation of kmap_atomic() disable pagefaults as a requirement. It is sufficient to check for this instead of disabled preemption. Check for disabled pagefault handler in the SG_MITER_ATOMIC case. Remove the "preemption disabled" part from the kernel doc as the sg_milter*() implementation does not care. [bigeasy: commit description. ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211015211409.cqopacv3pxdwn2ty@linutronix.de
2023-11-06mm: page_alloc: Use migrate_disable() in drain_local_pages_wq()Sebastian Andrzej Siewior1-2/+2
drain_local_pages_wq() disables preemption to avoid CPU migration during CPU hotplug and can't use cpus_read_lock(). Using migrate_disable() works here, too. The scheduler won't take the CPU offline until the task left the migrate-disable section. The problem with disabled preemption here is that drain_local_pages() acquires locks which are turned into sleeping locks on PREEMPT_RT and can't be acquired with disabled preemption. Use migrate_disable() in drain_local_pages_wq(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
2023-11-06mm: Allow only SLUB on PREEMPT_RTIngo Molnar1-0/+2
Memory allocators may disable interrupts or preemption as part of the allocation and freeing process. For PREEMPT_RT it is important that these sections remain deterministic and short and therefore don't depend on the size of the memory to allocate/ free or the inner state of the algorithm. Until v3.12-RT the SLAB allocator was an option but involved several changes to meet all the requirements. The SLUB design fits better with PREEMPT_RT model and so the SLAB patches were dropped in the 3.12-RT patchset. Comparing the two allocator, SLUB outperformed SLAB in both throughput (time needed to allocate and free memory) and the maximal latency of the system measured with cyclictest during hackbench. SLOB was never evaluated since it was unlikely that it preforms better than SLAB. During a quick test, the kernel crashed with SLOB enabled during boot. Disable SLAB and SLOB on PREEMPT_RT. [bigeasy: commit description.] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20211015210336.gen3tib33ig5q2md@linutronix.de
2023-11-06crypto: testmgr - Only disable migration in crypto_disable_simd_for_test()Sebastian Andrzej Siewior1-2/+2
crypto_disable_simd_for_test() disables preemption in order to receive a stable per-CPU variable which it needs to modify in order to alter crypto_simd_usable() results. This can also be achived by migrate_disable() which forbidds CPU migrations but allows the task to be preempted. The latter is important for PREEMPT_RT since operation like skcipher_walk_first() may allocate memory which must not happen with disabled preemption on PREEMPT_RT. Use migrate_disable() in crypto_disable_simd_for_test() to achieve a stable per-CPU pointer. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210928115401.441339-1-bigeasy@linutronix.de
2023-11-06samples/kfifo: Rename read_lock/write_lockSebastian Andrzej Siewior3-18/+18
The variables names read_lock and write_lock can clash with functions used for read/writer locks. Rename read_lock to read_access and write_lock to write_access to avoid a name collision. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20210806152551.qio7c3ho6pexezup@linutronix.de
2023-11-06net/core: disable NET_RX_BUSY_POLL on PREEMPT_RTSebastian Andrzej Siewior1-1/+1
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption which would be required while __napi_poll() invokes the callback of the driver. A threaded interrupt performing the NAPI-poll can be preempted on PREEMPT_RT. A RT thread on another CPU may observe NAPIF_STATE_SCHED bit set and busy-spin until it is cleared or its spin time runs out. Given it is the task with the highest priority it will never observe the NEED_RESCHED bit set. In this case the time is better spent by simply sleeping. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero). Disabling NET_RX_BUSY_POLL on PREEMPT_RT to avoid wrong locking context in case it is used. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20211001145841.2308454-1-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-06mm: Disable zsmalloc on PREEMPT_RTSebastian Andrzej Siewior1-1/+2
For efficiency reasons, zsmalloc is using a slim `handle'. The value is the address of a memory allocation of 4 or 8 bytes depending on the size of the long data type. The lowest bit in that allocated memory is used as a bit spin lock. The usage of the bit spin lock is problematic because with the bit spin lock held zsmalloc acquires a rwlock_t and spinlock_t which are both sleeping locks on PREEMPT_RT and therefore must not be acquired with disabled preemption. There is a patch which extends the handle on PREEMPT_RT so that a full spinlock_t fits (even with lockdep enabled) and then eliminates the bit spin lock. I'm not sure how sensible zsmalloc on PREEMPT_RT is given that it is used to store compressed user memory. Disable ZSMALLOC on PREEMPT_RT. If there is need for it, we can try to get it to work. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210923170121.1860133-1-bigeasy@linutronix.de
2023-11-06efi: Allow efi=runtimeSebastian Andrzej Siewior1-0/+3
In case the command line option "efi=noruntime" is default at built-time, the user could overwrite its state by `efi=runtime' and allow it again. This is useful on PREEMPT_RT where "efi=noruntime" is default and the user might need to alter the boot order for instance. Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210924134919.1913476-3-bigeasy@linutronix.de
2023-11-06efi: Disable runtime services on RTSebastian Andrzej Siewior1-1/+1
Based on measurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. These 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firmware will erase flash blocks on NOR. The time-functions are used by efi-rtc and can be triggered during run-time (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers on PREEMPT_RT. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210924134919.1913476-2-bigeasy@linutronix.de
2023-11-06lockdep: Let lock_is_held_type() detect recursive read as readSebastian Andrzej Siewior1-1/+1
lock_is_held_type(, 1) detects acquired read locks. It only recognized locks acquired with lock_acquire_shared(). Read locks acquired with lock_acquire_shared_recursive() are not recognized because a `2' is stored as the read value. Rework the check to additionally recognise lock's read value one and two as a read held lock. Fixes: e918188611f07 ("locking: More accurate annotations for read_lock()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://lkml.kernel.org/r/20210903084001.lblecrvz4esl4mrr@linutronix.de
2023-11-06genirq: Disable irqfixup/poll on PREEMPT_RT.Ingo Molnar1-0/+8
The support for misrouted IRQs is used on old / legacy systems and is not feasible on PREEMPT_RT. Polling for interrupts reduces the overall system performance. Additionally the interrupt latency depends on the polling frequency and delays are not desired for real time workloads. Disable IRQ polling on PREEMPT_RT and let the user know that it is not enabled. The compiler will optimize the real fixup/poll code out. [ bigeasy: Update changelog and switch to IS_ENABLED() ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210917223841.c6j6jcaffojrnot3@linutronix.de
2023-11-06genirq: Move prio assignment into the newly created threadThomas Gleixner1-2/+2
With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority assignment to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: Patch description] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
2023-11-06kthread: Move prio/affinite change into the newly created threadSebastian Andrzej Siewior1-8/+8
With enabled threaded interrupts the nouveau driver reported the following: | Chain exists of: | &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem | | Possible unsafe locking scenario: | | CPU0 CPU1 | ---- ---- | lock(&cpuset_rwsem); | lock(&device->mutex); | lock(&cpuset_rwsem); | lock(&mm->mmap_lock#2); The device->mutex is nvkm_device::mutex. Unblocking the lockchain at `cpuset_rwsem' is probably the easiest thing to do. Move the priority reset to the start of the newly created thread. Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
2023-11-06rcutorture: Avoid problematic critical section nesting on PREEMPT_RTFrom: Scott Wood1-12/+36
rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and continue with then with disabling preemption or interrupts. In the unlocking do it reverse by first enabling interrupts and preemption and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20210820074236.2zli4nje7bof62rh@linutronix.de
2023-11-06sched: Switch wait_task_inactive to HRTIMER_MODE_REL_HARDSebastian Andrzej Siewior1-1/+1
With PREEMPT_RT enabled all hrtimers callbacks will be invoked in softirq mode unless they are explicitly marked as HRTIMER_MODE_HARD. During boot kthread_bind() is used for the creation of per-CPU threads and then hangs in wait_task_inactive() if the ksoftirqd is not yet up and running. The hang disappeared since commit 26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()") but enabling function trace on boot reliably leads to the freeze on boot behaviour again. The timer in wait_task_inactive() can not be directly used by an user interface to abuse it and create a mass wake of several tasks at the same time which would to long sections with disabled interrupts. Therefore it is safe to make the timer HRTIMER_MODE_REL_HARD. Switch the timer to HRTIMER_MODE_REL_HARD. Cc: stable-rt@vger.kernel.org Link: https://lkml.kernel.org/r/20210826170408.vm7rlj7odslshwch@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2023-11-06printk: Enhance the condition check of msleep in pr_flush()Chao Qin1-1/+3
There is msleep in pr_flush(). If call WARN() in the early boot stage such as in early_initcall, pr_flush() will run into msleep when process scheduler is not ready yet. And then the system will sleep forever. Before the system_state is SYSTEM_RUNNING, make sure DO NOT sleep in pr_flush(). Fixes: c0b395bd0fe3("printk: add pr_flush()") Signed-off-by: Chao Qin <chao.qin@intel.com> Signed-off-by: Lili Li <lili.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20210719022649.3444072-1-chao.qin@intel.com
2023-11-06printk: add pr_flush()John Ogness4-11/+106
Provide a function to allow waiting for console printers to catch up to the latest logged message. Use pr_flush() to give console printers a chance to finish in critical situations if no atomic console is available. For now pr_flush() is only used in the most common error paths: panic(), print_oops_end_marker(), report_bug(), kmsg_dump(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06printk: add console handoverJohn Ogness2-2/+14
If earlyprintk is used, a boot console will print directly to the console immediately. The boot console will unregister itself as soon as a non-boot console registers. However, the non-boot console does not begin printing until its kthread has started. Since this happens much later, there is a long pause in the console output. If the ringbuffer is small, messages could even be dropped during the pause. Add a new CON_HANDOVER console flag to be used internally by printk in order to track which non-boot console took over from a boot console. If handover consoles have implemented write_atomic(), they are allowed to print directly to the console until their kthread can take over. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-11-06printk: remove deferred printingJohn Ogness26-265/+83
Since printing occurs either atomically or from the printing kthread, there is no need for any deferring or tracking possible recursion paths. Remove all printk defer functions and context tracking. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>