summaryrefslogtreecommitdiff
path: root/Documentation/trace/rv/monitor_sched.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace/rv/monitor_sched.rst')
-rw-r--r--Documentation/trace/rv/monitor_sched.rst402
1 files changed, 402 insertions, 0 deletions
diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
new file mode 100644
index 000000000000..3f8381ad9ec7
--- /dev/null
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -0,0 +1,402 @@
+Scheduler monitors
+==================
+
+- Name: sched
+- Type: container for multiple monitors
+- Author: Gabriele Monaco <gmonaco@redhat.com>, Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+Monitors describing complex systems, such as the scheduler, can easily grow to
+the point where they are just hard to understand because of the many possible
+state transitions.
+Often it is possible to break such descriptions into smaller monitors,
+sharing some or all events. Enabling those smaller monitors concurrently is,
+in fact, testing the system as if we had one single larger monitor.
+Splitting models into multiple specification is not only easier to
+understand, but gives some more clues when we see errors.
+
+The sched monitor is a set of specifications to describe the scheduler behaviour.
+It includes several per-cpu and per-task monitors that work independently to verify
+different specifications the scheduler should follow.
+
+To make this system as straightforward as possible, sched specifications are *nested*
+monitors, whereas sched itself is a *container*.
+From the interface perspective, sched includes other monitors as sub-directories,
+enabling/disabling or setting reactors to sched, propagates the change to all monitors,
+however single monitors can be used independently as well.
+
+It is important that future modules are built after their container (sched, in
+this case), otherwise the linker would not respect the order and the nesting
+wouldn't work as expected.
+To do so, simply add them after sched in the Makefile.
+
+Specifications
+--------------
+
+The specifications included in sched are currently a work in progress, adapting the ones
+defined in by Daniel Bristot in [1].
+
+Currently we included the following:
+
+Monitor sco
+~~~~~~~~~~~
+
+The scheduling context operations (sco) monitor ensures changes in a task state
+happen only in thread context::
+
+
+ |
+ |
+ v
+ sched_set_state +------------------+
+ +------------------ | |
+ | | thread_context |
+ +-----------------> | | <+
+ +------------------+ |
+ | |
+ | schedule_entry | schedule_exit
+ v |
+ |
+ scheduling_context -+
+
+Monitor snroc
+~~~~~~~~~~~~~
+
+The set non runnable on its own context (snroc) monitor ensures changes in a
+task state happens only in the respective task's context. This is a per-task
+monitor::
+
+ |
+ |
+ v
+ +------------------+
+ | other_context | <+
+ +------------------+ |
+ | |
+ | sched_switch_in | sched_switch_out
+ v |
+ sched_set_state |
+ +------------------ |
+ | own_context |
+ +-----------------> -+
+
+Monitor scpd
+~~~~~~~~~~~~
+
+The schedule called with preemption disabled (scpd) monitor ensures schedule is
+called with preemption disabled::
+
+ |
+ |
+ v
+ +------------------+
+ | cant_sched | <+
+ +------------------+ |
+ | |
+ | preempt_disable | preempt_enable
+ v |
+ schedule_entry |
+ schedule_exit |
+ +----------------- can_sched |
+ | |
+ +----------------> -+
+
+Monitor snep
+~~~~~~~~~~~~
+
+The schedule does not enable preempt (snep) monitor ensures a schedule call
+does not enable preemption::
+
+ |
+ |
+ v
+ preempt_disable +------------------------+
+ preempt_enable | |
+ +------------------ | non_scheduling_context |
+ | | |
+ +-----------------> | | <+
+ +------------------------+ |
+ | |
+ | schedule_entry | schedule_exit
+ v |
+ |
+ scheduling_contex -+
+
+Monitor sts
+~~~~~~~~~~~
+
+The schedule implies task switch (sts) monitor ensures a task switch happens
+only in scheduling context and up to once, as well as scheduling occurs with
+interrupts enabled but no task switch can happen before interrupts are
+disabled. When the next task picked for execution is the same as the previously
+running one, no real task switch occurs but interrupts are disabled nonetheless::
+
+ irq_entry |
+ +----+ |
+ v | v
+ +------------+ irq_enable #===================# irq_disable
+ | | ------------> H H irq_entry
+ | cant_sched | <------------ H H irq_enable
+ | | irq_disable H can_sched H --------------+
+ +------------+ H H |
+ H H |
+ +---------------> H H <-------------+
+ | #===================#
+ | |
+ schedule_exit | schedule_entry
+ | v
+ | +-------------------+ irq_enable
+ | | scheduling | <---------------+
+ | +-------------------+ |
+ | | |
+ | | irq_disable +--------+ irq_entry
+ | v | | --------+
+ | +-------------------+ irq_entry | in_irq | |
+ | | | -----------> | | <-------+
+ | | disable_to_switch | +--------+
+ | | | --+
+ | +-------------------+ |
+ | | |
+ | | sched_switch |
+ | v |
+ | +-------------------+ |
+ | | switching | | irq_enable
+ | +-------------------+ |
+ | | |
+ | | irq_enable |
+ | v |
+ | +-------------------+ |
+ +-- | enable_to_exit | <-+
+ +-------------------+
+ ^ | irq_disable
+ | | irq_entry
+ +---------------+ irq_enable
+
+Monitor nrp
+-----------
+
+The need resched preempts (nrp) monitor ensures preemption requires
+``need_resched``. Only kernel preemption is considered, since preemption
+while returning to userspace, for this monitor, is indistinguishable from
+``sched_switch_yield`` (described in the sssw monitor).
+A kernel preemption is whenever ``__schedule`` is called with the preemption
+flag set to true (e.g. from preempt_enable or exiting from interrupts). This
+type of preemption occurs after the need for ``rescheduling`` has been set.
+This is not valid for the *lazy* variant of the flag, which causes only
+userspace preemption.
+A ``schedule_entry_preempt`` may involve a task switch or not, in the latter
+case, a task goes through the scheduler from a preemption context but it is
+picked as the next task to run. Since the scheduler runs, this clears the need
+to reschedule. The ``any_thread_running`` state does not imply the monitored
+task is not running as this monitor does not track the outcome of scheduling.
+
+In theory, a preemption can only occur after the ``need_resched`` flag is set. In
+practice, however, it is possible to see a preemption where the flag is not
+set. This can happen in one specific condition::
+
+ need_resched
+ preempt_schedule()
+ preempt_schedule_irq()
+ __schedule()
+ !need_resched
+ __schedule()
+
+In the situation above, standard preemption starts (e.g. from preempt_enable
+when the flag is set), an interrupt occurs before scheduling and, on its exit
+path, it schedules, which clears the ``need_resched`` flag.
+When the preempted task runs again, the standard preemption started earlier
+resumes, although the flag is no longer set. The monitor considers this a
+``nested_preemption``, this allows another preemption without re-setting the
+flag. This condition relaxes the monitor constraints and may catch false
+negatives (i.e. no real ``nested_preemptions``) but makes the monitor more
+robust and able to validate other scenarios.
+For simplicity, the monitor starts in ``preempt_irq``, although no interrupt
+occurred, as the situation above is hard to pinpoint::
+
+ schedule_entry
+ irq_entry #===========================================#
+ +-------------------------- H H
+ | H H
+ +-------------------------> H any_thread_running H
+ H H
+ +-------------------------> H H
+ | #===========================================#
+ | schedule_entry | ^
+ | schedule_entry_preempt | sched_need_resched | schedule_entry
+ | | schedule_entry_preempt
+ | v |
+ | +----------------------+ |
+ | +--- | | |
+ | sched_need_resched | | rescheduling | -+
+ | +--> | |
+ | +----------------------+
+ | | irq_entry
+ | v
+ | +----------------------+
+ | | | ---+
+ | ---> | | | sched_need_resched
+ | | preempt_irq | | irq_entry
+ | | | <--+
+ | | | <--+
+ | +----------------------+ |
+ | | schedule_entry | sched_need_resched
+ | | schedule_entry_preempt |
+ | v |
+ | +-----------------------+ |
+ +-------------------------- | nested_preempt | --+
+ +-----------------------+
+ ^ irq_entry |
+ +-------------------+
+
+Due to how the ``need_resched`` flag on the preemption count works on arm64,
+this monitor is unstable on that architecture, as it often records preemption
+when the flag is not set, even in presence of the workaround above.
+For the time being, the monitor is disabled by default on arm64.
+
+Monitor sssw
+------------
+
+The set state sleep and wakeup (sssw) monitor ensures ``set_state`` to
+sleepable leads to sleeping and sleeping tasks require wakeup. It includes the
+following types of switch:
+
+* ``switch_suspend``:
+ a task puts itself to sleep, this can happen only after explicitly setting
+ the task to ``sleepable``. After a task is suspended, it needs to be woken up
+ (``waking`` state) before being switched in again.
+ Setting the task's state to ``sleepable`` can be reverted before switching if it
+ is woken up or set to ``runnable``.
+* ``switch_blocking``:
+ a special case of a ``switch_suspend`` where the task is waiting on a
+ sleeping RT lock (``PREEMPT_RT`` only), it is common to see wakeup and set
+ state events racing with each other and this leads the model to perceive this
+ type of switch when the task is not set to sleepable. This is a limitation of
+ the model in SMP system and workarounds may slow down the system.
+* ``switch_preempt``:
+ a task switch as a result of kernel preemption (``schedule_entry_preempt`` in
+ the nrp model).
+* ``switch_yield``:
+ a task explicitly calls the scheduler or is preempted while returning to
+ userspace. It can happen after a ``yield`` system call, from the idle task or
+ if the ``need_resched`` flag is set. By definition, a task cannot yield while
+ ``sleepable`` as that would be a suspension. A special case of a yield occurs
+ when a task in ``TASK_INTERRUPTIBLE`` calls the scheduler while a signal is
+ pending. The task doesn't go through the usual blocking/waking and is set
+ back to runnable, the resulting switch (if there) looks like a yield to the
+ ``signal_wakeup`` state and is followed by the signal delivery. From this
+ state, the monitor expects a signal even if it sees a wakeup event, although
+ not necessary, to rule out false negatives.
+
+This monitor doesn't include a running state, ``sleepable`` and ``runnable``
+are only referring to the task's desired state, which could be scheduled out
+(e.g. due to preemption). However, it does include the event
+``sched_switch_in`` to represent when a task is allowed to become running. This
+can be triggered also by preemption, but cannot occur after the task got to
+``sleeping`` before a ``wakeup`` occurs::
+
+ +--------------------------------------------------------------------------+
+ | |
+ | |
+ | switch_suspend | |
+ | switch_blocking | |
+ v v |
+ +----------+ #==========================# set_state_runnable |
+ | | H H wakeup |
+ | | H H switch_in |
+ | | H H switch_yield |
+ | sleeping | H H switch_preempt |
+ | | H H signal_deliver |
+ | | switch_ H H ------+ |
+ | | _blocking H runnable H | |
+ | | <----------- H H <-----+ |
+ +----------+ H H |
+ | wakeup H H |
+ +---------------------> H H |
+ H H |
+ +---------> H H |
+ | #==========================# |
+ | | ^ |
+ | | | set_state_runnable |
+ | | | wakeup |
+ | set_state_sleepable | +------------------------+
+ | v | |
+ | +--------------------------+ set_state_sleepable
+ | | | switch_in
+ | | | switch_preempt
+ signal_deliver | sleepable | signal_deliver
+ | | | ------+
+ | | | |
+ | | | <-----+
+ | +--------------------------+
+ | | ^
+ | switch_yield | set_state_sleepable
+ | v |
+ | +---------------+ |
+ +---------- | signal_wakeup | -+
+ +---------------+
+ ^ | switch_in
+ | | switch_preempt
+ | | switch_yield
+ +-----------+ wakeup
+
+Monitor opid
+------------
+
+The operations with preemption and irq disabled (opid) monitor ensures
+operations like ``wakeup`` and ``need_resched`` occur with interrupts and
+preemption disabled or during interrupt context, in such case preemption may
+not be disabled explicitly.
+``need_resched`` can be set by some RCU internals functions, in which case it
+doesn't match a task wakeup and might occur with only interrupts disabled::
+
+ | sched_need_resched
+ | sched_waking
+ | irq_entry
+ | +--------------------+
+ v v |
+ +------------------------------------------------------+
+ +----------- | disabled | <+
+ | +------------------------------------------------------+ |
+ | | ^ |
+ | | preempt_disable sched_need_resched |
+ | preempt_enable | +--------------------+ |
+ | v | v | |
+ | +------------------------------------------------------+ |
+ | | irq_disabled | |
+ | +------------------------------------------------------+ |
+ | | | ^ |
+ | irq_entry irq_entry | | |
+ | sched_need_resched v | irq_disable |
+ | sched_waking +--------------+ | | |
+ | +----- | | irq_enable | |
+ | | | in_irq | | | |
+ | +----> | | | | |
+ | +--------------+ | | irq_disable
+ | | | | |
+ | irq_enable | irq_enable | | |
+ | v v | |
+ | #======================================================# |
+ | H enabled H |
+ | #======================================================# |
+ | | ^ ^ preempt_enable | |
+ | preempt_disable preempt_enable +--------------------+ |
+ | v | |
+ | +------------------+ | |
+ +----------> | preempt_disabled | -+ |
+ +------------------+ |
+ | |
+ +-------------------------------------------------------+
+
+This monitor is designed to work on ``PREEMPT_RT`` kernels, the special case of
+events occurring in interrupt context is a shortcut to identify valid scenarios
+where the preemption tracepoints might not be visible, during interrupts
+preemption is always disabled. On non- ``PREEMPT_RT`` kernels, the interrupts
+might invoke a softirq to set ``need_resched`` and wake up a task. This is
+another special case that is currently not supported by the monitor.
+
+References
+----------
+
+[1] - https://bristot.me/linux-task-model