Diffstat (limited to 'Documentation/trace/rv/monitor_sched.rst')
-rw-r--r-- | Documentation/trace/rv/monitor_sched.rst | 402 |
1 files changed, 402 insertions, 0 deletions
diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
new file mode 100644
index 000000000000..3f8381ad9ec7
--- /dev/null
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -0,0 +1,402 @@
+Scheduler monitors
+==================
+
+- Name: sched
+- Type: container for multiple monitors
+- Author: Gabriele Monaco <gmonaco@redhat.com>, Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+Monitors describing complex systems, such as the scheduler, can easily grow to
+the point where they are hard to understand because of the many possible
+state transitions.
+Often it is possible to break such descriptions into smaller monitors,
+sharing some or all events. Enabling those smaller monitors concurrently is,
+in fact, testing the system as if we had one single larger monitor.
+Splitting models into multiple specifications not only makes them easier to
+understand, but also gives more clues when we see errors.
+
+The sched monitor is a set of specifications to describe the scheduler behaviour.
+It includes several per-cpu and per-task monitors that work independently to verify
+different specifications the scheduler should follow.
+
+To make this system as straightforward as possible, sched specifications are *nested*
+monitors, whereas sched itself is a *container*.
+From the interface perspective, sched includes other monitors as sub-directories;
+enabling/disabling sched or setting reactors on it propagates the change to all monitors.
+However, single monitors can be used independently as well.
+
+It is important that future modules are built after their container (sched, in
+this case), otherwise the linker would not respect the order and the nesting
+wouldn't work as expected.
+To do so, simply add them after sched in the Makefile.
+
+Specifications
+--------------
+
+The specifications included in sched are currently a work in progress, adapting the ones
+defined by Daniel Bristot in [1].
+
+Currently we included the following:
+
+Monitor sco
+~~~~~~~~~~~
+
+The scheduling context operations (sco) monitor ensures changes in a task state
+happen only in thread context::
+
+                        |
+                        |
+                        v
+    sched_set_state   +------------------+
+  +------------------ |                  |
+  |                   |  thread_context  |
+  +-----------------> |                  | <+
+                      +------------------+  |
+                        |                   |
+                        | schedule_entry    | schedule_exit
+                        v                   |
+                                            |
+                       scheduling_context  -+
+
+Monitor snroc
+~~~~~~~~~~~~~
+
+The set non runnable on its own context (snroc) monitor ensures changes in a
+task state happen only in the respective task's context. This is a per-task
+monitor::
+
+                        |
+                        |
+                        v
+                      +------------------+
+                      |  other_context   | <+
+                      +------------------+  |
+                        |                   |
+                        | sched_switch_in   | sched_switch_out
+                        v                   |
+    sched_set_state                         |
+  +------------------                       |
+  |                     own_context         |
+  +----------------->                      -+
+
+Monitor scpd
+~~~~~~~~~~~~
+
+The schedule called with preemption disabled (scpd) monitor ensures schedule is
+called with preemption disabled::
+
+                        |
+                        |
+                        v
+                      +------------------+
+                      |    cant_sched    | <+
+                      +------------------+  |
+                        |                   |
+                        | preempt_disable   | preempt_enable
+                        v                   |
+    schedule_entry                          |
+    schedule_exit                           |
+  +-----------------     can_sched          |
+  |                                         |
+  +---------------->                       -+
+
+Monitor snep
+~~~~~~~~~~~~
+
+The schedule does not enable preempt (snep) monitor ensures a schedule call
+does not enable preemption::
+
+                        |
+                        |
+                        v
+    preempt_disable   +------------------------+
+    preempt_enable    |                        |
+  +------------------ | non_scheduling_context |
+  |                   |                        |
+  +-----------------> |                        | <+
+                      +------------------------+  |
+                        |                         |
+                        | schedule_entry          | schedule_exit
+                        v                         |
+                                                  |
+                          scheduling_contex      -+
+
+Monitor sts
+~~~~~~~~~~~
+
+The schedule implies task switch (sts) monitor ensures a task switch happens
+only in scheduling context and at most once, and that scheduling occurs with
+interrupts enabled but no task switch can happen before interrupts are disabled.
+When the next task picked for execution is the same as the previously
+running one, no real task switch occurs but interrupts are disabled nonetheless::
+
+     irq_entry                           |
+      +----+                             |
+      v    |                             v
+  +------------+ irq_enable    #===================#   irq_disable
+  |            | ------------> H                   H   irq_entry
+  | cant_sched | <------------ H                   H   irq_enable
+  |            | irq_disable   H     can_sched     H --------------+
+  +------------+               H                   H               |
+                               H                   H               |
+             +---------------> H                   H <-------------+
+             |                 #===================#
+             |                   |
+       schedule_exit             | schedule_entry
+             |                   v
+             |   +-------------------+     irq_enable
+             |   |    scheduling     | <---------------+
+             |   +-------------------+                 |
+             |     |                                   |
+             |     | irq_disable                    +--------+  irq_entry
+             |     v                                |        | --------+
+             |   +-------------------+  irq_entry   | in_irq |         |
+             |   |                   | -----------> |        | <-------+
+             |   | disable_to_switch |              +--------+
+             |   |                   | --+
+             |   +-------------------+   |
+             |     |                     |
+             |     | sched_switch        |
+             |     v                     |
+             |   +-------------------+   |
+             |   |     switching     |   | irq_enable
+             |   +-------------------+   |
+             |     |                     |
+             |     | irq_enable          |
+             |     v                     |
+             |   +-------------------+   |
+             +-- |  enable_to_exit   | <-+
+                 +-------------------+
+                   ^               | irq_disable
+                   |               | irq_entry
+                   +---------------+ irq_enable
+
+Monitor nrp
+~~~~~~~~~~~
+
+The need resched preempts (nrp) monitor ensures preemption requires
+``need_resched``. Only kernel preemption is considered, since preemption
+while returning to userspace, for this monitor, is indistinguishable from
+``sched_switch_yield`` (described in the sssw monitor).
+A kernel preemption occurs whenever ``__schedule`` is called with the preemption
+flag set to true (e.g. from preempt_enable or exiting from interrupts). This
+type of preemption occurs after the need for ``rescheduling`` has been set.
+This is not valid for the *lazy* variant of the flag, which causes only
+userspace preemption.
+A ``schedule_entry_preempt`` may or may not involve a task switch; in the latter
+case, a task goes through the scheduler from a preemption context but it is
+picked as the next task to run.
+Since the scheduler runs, this clears the need to reschedule. The
+``any_thread_running`` state does not imply the monitored task is not running,
+as this monitor does not track the outcome of scheduling.
+
+In theory, a preemption can only occur after the ``need_resched`` flag is set. In
+practice, however, it is possible to see a preemption where the flag is not
+set. This can happen in one specific condition::
+
+  need_resched
+                   preempt_schedule()
+                                           preempt_schedule_irq()
+                                                   __schedule()
+  !need_resched
+                   __schedule()
+
+In the situation above, standard preemption starts (e.g. from preempt_enable
+when the flag is set), an interrupt occurs before scheduling and, on its exit
+path, it schedules, which clears the ``need_resched`` flag.
+When the preempted task runs again, the standard preemption started earlier
+resumes, although the flag is no longer set. The monitor considers this a
+``nested_preemption``; this allows another preemption without re-setting the
+flag. This condition relaxes the monitor constraints and may admit false
+negatives (i.e. no real ``nested_preemptions``) but makes the monitor more
+robust and able to validate other scenarios.
+For simplicity, the monitor starts in ``preempt_irq``, although no interrupt
+occurred, as the situation above is hard to pinpoint::
+
+        schedule_entry
+        irq_entry             #===========================================#
+  +-------------------------- H                                           H
+  |                           H                                           H
+  +-------------------------> H            any_thread_running             H
+                              H                                           H
+  +-------------------------> H                                           H
+  |                           #===========================================#
+  | schedule_entry                |                     ^
+  | schedule_entry_preempt        | sched_need_resched  | schedule_entry
+  |                               |                     | schedule_entry_preempt
+  |                               v                     |
+  |                           +----------------------+  |
+  |                      +--- |                      |  |
+  |   sched_need_resched |    |     rescheduling     | -+
+  |                      +--> |                      |
+  |                           +----------------------+
+  |                               | irq_entry
+  |                               v
+  |                           +----------------------+
+  |                           |                      | ---+
+  |                      ---> |                      |    | sched_need_resched
+  |                           |     preempt_irq      |    | irq_entry
+  |                           |                      | <--+
+  |                           |                      | <--+
+  |                           +----------------------+    |
+  |                             | schedule_entry          | sched_need_resched
+  |                             | schedule_entry_preempt  |
+  |                             v                         |
+  |                           +-----------------------+   |
+  +-------------------------- |    nested_preempt     | --+
+                              +-----------------------+
+                                ^ irq_entry         |
+                                +-------------------+
+
+Due to how the ``need_resched`` flag on the preemption count works on arm64,
+this monitor is unstable on that architecture, as it often records preemption
+when the flag is not set, even in the presence of the workaround above.
+For the time being, the monitor is disabled by default on arm64.
+
+Monitor sssw
+~~~~~~~~~~~~
+
+The set state sleep and wakeup (sssw) monitor ensures ``set_state`` to
+sleepable leads to sleeping and sleeping tasks require wakeup. It includes the
+following types of switch:
+
+* ``switch_suspend``:
+  a task puts itself to sleep; this can happen only after explicitly setting
+  the task to ``sleepable``. After a task is suspended, it needs to be woken up
+  (``waking`` state) before being switched in again.
+  Setting the task's state to ``sleepable`` can be reverted before switching if it
+  is woken up or set to ``runnable``.
+* ``switch_blocking``:
+  a special case of a ``switch_suspend`` where the task is waiting on a
+  sleeping RT lock (``PREEMPT_RT`` only). It is common to see wakeup and set
+  state events racing with each other and this leads the model to perceive this
+  type of switch when the task is not set to sleepable. This is a limitation of
+  the model on SMP systems and workarounds may slow down the system.
+* ``switch_preempt``:
+  a task switch as a result of kernel preemption (``schedule_entry_preempt`` in
+  the nrp model).
+* ``switch_yield``:
+  a task explicitly calls the scheduler or is preempted while returning to
+  userspace. It can happen after a ``yield`` system call, from the idle task or
+  if the ``need_resched`` flag is set. By definition, a task cannot yield while
+  ``sleepable`` as that would be a suspension. A special case of a yield occurs
+  when a task in ``TASK_INTERRUPTIBLE`` calls the scheduler while a signal is
+  pending. The task doesn't go through the usual blocking/waking and is set
+  back to runnable; the resulting switch (if any) looks like a yield to the
+  ``signal_wakeup`` state and is followed by the signal delivery. From this
+  state, the monitor expects a signal even if it sees a wakeup event, although
+  not necessary, to rule out false negatives.
+
+This monitor doesn't include a running state; ``sleepable`` and ``runnable``
+refer only to the task's desired state, which could be scheduled out
+(e.g. due to preemption). However, it does include the event
+``sched_switch_in`` to represent when a task is allowed to become running.
+This can be triggered also by preemption, but cannot occur after the task got to
+``sleeping`` before a ``wakeup`` occurs::
+
+    +--------------------------------------------------------------------------+
+    |                                                                          |
+    |                                                                          |
+    | switch_suspend                     |                                     |
+    | switch_blocking                    |                                     |
+    v                                    v                                     |
+  +----------+              #==========================#  set_state_runnable   |
+  |          |              H                          H  wakeup               |
+  |          |              H                          H  switch_in            |
+  |          |              H                          H  switch_yield         |
+  | sleeping |              H                          H  switch_preempt       |
+  |          |              H                          H  signal_deliver       |
+  |          | switch_      H                          H ------+               |
+  |          |  _blocking   H         runnable         H       |               |
+  |          | <----------- H                          H <-----+               |
+  +----------+              H                          H                       |
+    |    wakeup             H                          H                       |
+    +---------------------> H                          H                       |
+                            H                          H                       |
+                +---------> H                          H                       |
+                |           #==========================#                       |
+                |             |                   ^                            |
+                |             |                   | set_state_runnable         |
+                |             |                   | wakeup                     |
+                |    set_state_sleepable          |   +------------------------+
+                |             v                   |   |
+                |           +--------------------------+  set_state_sleepable
+                |           |                          |  switch_in
+                |           |                          |  switch_preempt
+         signal_deliver     |        sleepable         |  signal_deliver
+                |           |                          | ------+
+                |           |                          |       |
+                |           |                          | <-----+
+                |           +--------------------------+
+                |                 |            ^
+                |           switch_yield       | set_state_sleepable
+                |                 v            |
+                |           +---------------+  |
+                +---------- | signal_wakeup | -+
+                            +---------------+
+                              ^           | switch_in
+                              |           | switch_preempt
+                              |           | switch_yield
+                              +-----------+ wakeup
+
+Monitor opid
+~~~~~~~~~~~~
+
+The operations with preemption and irq disabled (opid) monitor ensures
+operations like ``wakeup`` and ``need_resched`` occur with interrupts and
+preemption disabled or during interrupt context; in the latter case, preemption
+may not be disabled explicitly.
+``need_resched`` can be set by some RCU internal functions, in which case it
+doesn't match a task wakeup and might occur with only interrupts disabled::
+
+                      |                     sched_need_resched
+                      |                     sched_waking
+                      |                     irq_entry
+                      |                   +--------------------+
+                      v                   v                    |
+               +------------------------------------------------------+
+  +----------- |                       disabled                       | <+
+  |            +------------------------------------------------------+  |
+  |                   |          ^                                       |
+  |                   |   preempt_disable        sched_need_resched      |
+  |            preempt_enable    |             +--------------------+    |
+  |                   v          |             v                    |    |
+  |            +------------------------------------------------------+  |
+  |            |                     irq_disabled                     |  |
+  |            +------------------------------------------------------+  |
+  |                                  |              |          ^         |
+  | irq_entry                    irq_entry          |          |         |
+  | sched_need_resched               v              |     irq_disable    |
+  | sched_waking              +--------------+      |          |         |
+  |                    +----- |              | irq_enable      |         |
+  |                    |      |    in_irq    |      |          |         |
+  |                    +----> |              |      |          |         |
+  |                           +--------------+      |          |    irq_disable
+  |                                  |              |          |         |
+  | irq_enable                       | irq_enable   |          |         |
+  |                                  v              v          |         |
+  |            #======================================================#  |
+  |            H                       enabled                        H  |
+  |            #======================================================#  |
+  |                   |              ^         ^ preempt_enable     |    |
+  |            preempt_disable preempt_enable  +--------------------+    |
+  |                   v              |                                   |
+  |            +------------------+  |                                   |
+  +----------> | preempt_disabled | -+                                   |
+               +------------------+                                      |
+                 |                                                       |
+                 +-------------------------------------------------------+
+
+This monitor is designed to work on ``PREEMPT_RT`` kernels. The special case of
+events occurring in interrupt context is a shortcut to identify valid scenarios
+where the preemption tracepoints might not be visible; during interrupts,
+preemption is always disabled. On non-``PREEMPT_RT`` kernels, the interrupts
+might invoke a softirq to set ``need_resched`` and wake up a task. This is
+another special case that is currently not supported by the monitor.
+
+References
+----------
+
+[1] - https://bristot.me/linux-task-model
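
As a reading aid for the state diagrams in this patch, the sco automaton can be transcribed into a transition table and replayed against a list of event names. This is a minimal Python sketch, not the kernel implementation (the in-kernel monitors are C code). The `Monitor` class, the `first_violation` helper and the sample traces are hypothetical names introduced only for illustration. An event with no entry for the current state is treated as a violation.

```python
# Illustrative model only; names below are assumptions, not kernel APIs.
# Transition table transcribed from the sco diagram: state -> event -> next state.
SCO = {
    "thread_context": {
        "sched_set_state": "thread_context",
        "schedule_entry": "scheduling_context",
    },
    "scheduling_context": {
        "schedule_exit": "thread_context",
    },
}


class Monitor:
    """Hypothetical helper: a deterministic automaton driven by event names."""

    def __init__(self, table, initial):
        self.table = table
        self.state = initial

    def handle(self, event):
        """Advance on event; return False if the model forbids it in this state."""
        next_state = self.table[self.state].get(event)
        if next_state is None:
            return False
        self.state = next_state
        return True


def first_violation(monitor, events):
    """Return the index of the first rejected event, or None if all are accepted."""
    for index, event in enumerate(events):
        if not monitor.handle(event):
            return index
    return None


# Sample traces (made up for this sketch).
ok_trace = ["sched_set_state", "schedule_entry", "schedule_exit"]
# Setting the task state while in scheduling context is not allowed by sco.
bad_trace = ["schedule_entry", "sched_set_state", "schedule_exit"]

print(first_violation(Monitor(SCO, "thread_context"), ok_trace))
print(first_violation(Monitor(SCO, "thread_context"), bad_trace))
```

The same helper can replay the other two-state diagrams (snroc, scpd, snep) by swapping in their tables. Feeding one event stream to several such monitors mirrors the idea from the description of enabling smaller monitors concurrently.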