summaryrefslogtreecommitdiff
path: root/drivers/cpuidle/governors
AgeCommit message (Collapse)AuthorFilesLines
2019-11-15cpuidle: teo: Avoid code duplication in conditionalsRafael J. Wysocki1-5/+8
There are three places in teo_select() where a given amount of time is compared with TICK_NSEC if tick_nohz_tick_stopped() returns true, which is a bit of duplicated code. Avoid that code duplication by defining a helper function to do the check and using it in all of the places in question. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-14cpuidle: teo: Avoid using "early hits" incorrectlyRafael J. Wysocki1-4/+17
If the current state with the maximum "early hits" metric in teo_select() is also the one "matching" the expected idle duration, it will be used as the candidate one for selection even if its "misses" metric is greater than its "hits" metric, which is not correct. In that case, the candidate state should be shallower than the current one and its "early hits" metric should be the maximum among the idle states shallower than the current one. To make that happen, modify teo_select() to save the index of the state whose "early hits" metric is the maximum for the range of states below the current one and go back to that state if it turns out that the current one should be rejected. Fixes: 159e48560f51 ("cpuidle: teo: Fix "early hits" handling for disabled idle states") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-14cpuidle: teo: Exclude cpuidle overhead from computationsRafael J. Wysocki1-1/+8
One purpose of the computations in teo_update() is to determine whether or not the (saved) time till the next timer event and the measured idle duration fall into the same "bin", so avoid using values that include the cpuidle overhead to obtain the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-11cpuidle: Use nanoseconds as the unit of timeRafael J. Wysocki4-124/+107
Currently, the cpuidle subsystem uses microseconds as the unit of time which (among other things) causes the idle loop to incur some integer division overhead for no clear benefit. In order to allow cpuidle to measure time in nanoseconds, add two new fields, exit_latency_ns and target_residency_ns, to represent the exit latency and target residency of an idle state in nanoseconds, respectively, to struct cpuidle_state and initialize them with the help of the corresponding values in microseconds provided by drivers. Additionally, change cpuidle_governor_latency_req() to return the idle state exit latency constraint in nanoseconds. Also meeasure idle state residency (last_residency_ns in struct cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds and update the cpuidle core and governors accordingly. However, the menu governor still computes typical intervals in microseconds to avoid integer overflows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-06cpuidle: Consolidate disabled state checksRafael J. Wysocki3-11/+6
There are two reasons why CPU idle states may be disabled: either because the driver has disabled them or because they have been disabled by user space via sysfs. In the former case, the state's "disabled" flag is set once during the initialization of the driver and it is never cleared later (it is read-only effectively). In the latter case, the "disable" field of the given state's cpuidle_state_usage struct is set and it may be changed via sysfs. Thus checking whether or not an idle state has been disabled involves reading these two flags every time. In order to avoid the additional check of the state's "disabled" flag (which is effectively read-only anyway), use the value of it at the init time to set a (new) flag in the "disable" field of that state's cpuidle_state_usage structure and use the sysfs interface to manipulate another (new) flag in it. This way the state is disabled whenever the "disable" field of its cpuidle_state_usage structure is nonzero, whatever the reason, and it is the only place to look into to check whether or not the state has been disabled. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-10-14cpuidle: teo: Fix "early hits" handling for disabled idle statesRafael J. Wysocki1-9/+26
The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "early hits" metric measures the likelihood of a situation in which the idle duration measured after wakeup falls into to given bin, but the time till the next timer (sleep length) falls into a bin corresponding to one of the deeper idle states. It is used when the "hits" and "misses" metrics indicate that the state "matching" the sleep length should not be selected, so that the state with the maximum "early hits" value is selected instead of it. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. As far as the "early hits" metric is concerned, teo_select() tries to take disabled states into account, but the state index corresponding to the maximum "early hits" value computed by it may be incorrect. Namely, it always uses the index of the previous maximum "early hits" state then, but there may be enabled idle states closer to the disabled one in question. In particular, if the current candidate state (whose index is the idx value) is closer to the disabled one and the "early hits" value of the disabled state is greater than the current maximum, the index of the current candidate state (idx) should replace the "maximum early hits state" index. Modify the code to handle that case correctly. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
2019-10-14cpuidle: teo: Consider hits and misses metrics of disabled statesRafael J. Wysocki1-4/+21
The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "hits" and "misses" metrics measure the likelihood of a situation in which both the time till the next timer (sleep length) and the idle duration measured after wakeup fall into the given bin. Namely, if the "hits" value is greater than the "misses" one, that situation is more likely than the one in which the sleep length falls into the given bin, but the idle duration measured after wakeup falls into a bin corresponding to one of the shallower idle states. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. For this reason, make teo_select() always use the "hits" and "misses" values of the idle duration range that the sleep length falls into even if the specific idle state corresponding to it is disabled and if the "hits" values is greater than the "misses" one, select the closest enabled shallower idle state in that case. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
2019-10-14cpuidle: teo: Rename local variable in teo_select()Rafael J. Wysocki1-10/+9
Rename a local variable in teo_select() in preparation for subsequent code modifications, no intentional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
2019-10-14cpuidle: teo: Ignore disabled idle states that are too deepRafael J. Wysocki1-0/+7
Prevent disabled CPU idle state with target residencies beyond the anticipated idle duration from being taken into account by the TEO governor. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
2019-09-11cpuidle-haltpoll: set haltpoll as preferred governorJoao Martins1-1/+1
Right now, guest current governors have the following ratings: * ladder -> 10 * teo -> 19 * menu -> 20 * haltpoll -> 21 * ladder + nohz=off -> 25 haltpoll governor got introduced and it is now the default governor given its highest rating -- with ladder+nohz being the exception -- regardless of idle driver in the guest. An example of an undesirable case is x86 KVM guests with MWAIT which have intel_idle registered first, and consequently will have haltpoll be used as governor which would get limited to a poll state and state 1 and the other states wouldn't get used. To keep the previous defaults we decrease rating of governor to 9 (below current lowest rating) and thus rely on @governor switch on cpuidle_register_driver() to tie in haltpoll idle driver and governor together. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-10cpuidle: teo: Get rid of redundant check in teo_update()Rafael J. Wysocki1-12/+4
Notice that setting measured_us to UINT_MAX in teo_update() earlier doesn't change the behavior of the following code, so do that and eliminate a redundant check used for setting measured_us to UINT_MAX. This change is not expected to alter functionality. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-05cpuidle: teo: Allow tick to be stopped if PM QoS is usedRafael J. Wysocki1-16/+16
The TEO goveror prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary. However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.] To address that issue, modify the TEO governor to carry out the entire search for the most suitable idle state (from the target residency perspective) even if a latency constraint is present, to allow it to determine the expected idle duration in all cases. Also, when using the last several measured idle duration values to refine the idle state selection, make it compare those values with the current expected idle duration value (instead of comparing them with the target residency of the idle state selected so far) which should prevent the tick from being retained when it makes sense to stop it sometimes (especially in the presence of PM QoS latency constraints). Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-05cpuidle: menu: Allow tick to be stopped if PM QoS is usedRafael J. Wysocki1-11/+5
After commit 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") the menu governor prevents the scheduler tick from being stopped (unless stopped already) if there is a PM QoS latency constraint for the given CPU and the target residency of the deepest idle state matching that constraint is below the tick boundary. However, that is problematic if CPUs with PM QoS latency constraints are idle for long times, because it effectively causes the tick to run on them all the time which is wasteful. [It is also confusing and questionable if they are full dynticks CPUs.] To address that issue, make the menu governor allow the tick to be stopped only if the idle duration predicted by it is beyond the tick boundary, except when the shallowest idle state is selected upfront and it is not a "polling" one. Fixes: 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") Link: https://lore.kernel.org/lkml/79b247b3-e056-610e-9a07-e685dfdaa6c9@gmail.com/ Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Tested-by: Thomas Lindroth <thomas.lindroth@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-07-30cpuidle: add haltpoll governorMarcelo Tosatti2-0/+151
The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle driver, allows guest vcpus to poll for a specified amount of time before halting. This provides the following benefits to host side polling: 1) The POLL flag is set while polling is performed, which allows a remote vCPU to avoid sending an IPI (and the associated cost of handling the IPI) when performing a wakeup. 2) The VM-exit cost can be avoided. The downside of guest side polling is that polling is performed even with other runnable tasks in the host. Results comparing halt_poll_ns and server/client application where a small packet is ping-ponged: host --> 31.33 halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%) halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%) For the SAP HANA benchmarks (where idle_spin is a parameter of the previous version of the patch, results should be the same): hpns == halt_poll_ns idle_spin=0/ idle_spin=800/ idle_spin=0/ hpns=200000 hpns=0 hpns=800000 DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%) InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%) DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%) UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%) Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-07-30governors: unify last_state_idxMarcelo Tosatti3-20/+18
Since this field is shared by all governors, move it to cpuidle device structure. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215Thomas Gleixner1-3/+1
Based on 1 normalized pattern(s): this code is licenced under the gpl version 2 as described in the copying file that acompanies the linux kernel extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Steve Winslow <swinslow@gmail.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171439.466585205@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner1-0/+1
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-07cpuidle: menu: Avoid overflows when computing varianceRafael J. Wysocki1-1/+1
The variance computation in get_typical_interval() may overflow if the square of the value of diff exceeds the maximum for the int64_t data type value which basically is the case when it is of the order of UINT_MAX. However, data points so far in the future don't matter for idle state selection anyway, so change the initial threshold value in get_typical_interval() to INT_MAX which will cause more "outlying" data points to be discarded without affecting the selection result. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-01-17cpuidle: New timer events oriented governor for tickless systemsRafael J. Wysocki2-0/+445
The venerable menu governor does some things that are quite questionable in my view. First, it includes timer wakeups in the pattern detection data and mixes them up with wakeups from other sources which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (because it knows the time until the next timer event and that is later than the expected wakeup time). Second, it uses the extra exit latency limit based on the predicted idle duration and depending on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep length correction factors depend on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular, and they are not correlated to the list of available idle states in any way whatever. Also, the pattern detection code in menu may end up considering values that are too large to matter at all, in which cases running it is a waste of time. A major rework of the menu governor would be required to address these issues and the performance of at least some workloads (tuned specifically to the current behavior of the menu governor) is likely to suffer from that. It is thus better to introduce an entirely new governor without them and let everybody use the governor that works better with their actual workloads. The new governor introduced here, the timer events oriented (TEO) governor, uses the same basic strategy as menu: it always tries to find the deepest idle state that can be used in the given conditions. However, it applies a different approach to that problem. First, it doesn't use "correction factors" for the time till the closest timer, but instead it tries to correlate the measured idle duration values with the available idle states and use that information to pick up the idle state that is most likely to "match" the upcoming CPU idle interval. Second, it doesn't take the number of "I/O waiters" into account at all and the pattern detection code in it avoids taking timer wakeups into account. It also only uses idle duration values less than the current time till the closest timer (with the tick excluded) for that purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-30Merge tag 'pm-4.20-rc1-2' of ↵Linus Torvalds1-19/+6
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These remove a questionable heuristic from the menu cpuidle governor, fix a recent build regression in the intel_pstate driver, clean up ARM big-Little support in cpufreq and fix up hung task watchdog's interaction with system-wide power management transitions. Specifics: - Fix build regression in the intel_pstate driver that doesn't build without CONFIG_ACPI after recent changes (Dominik Brodowski). - One of the heuristics in the menu cpuidle governor is based on a function returning 0 most of the time, so drop it and clean up the scheduler code related to it (Daniel Lezcano). - Prevent the arm_big_little cpufreq driver from being used on ARM64 which is not suitable for it and drop the arm_big_little_dt driver that is not used any more (Sudeep Holla). - Prevent the hung task watchdog from triggering during resume from system-wide sleep states by disabling it before freezing tasks and enabling it again after they have been thawed (Vitaly Kuznetsov)" * tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: kernel: hung_task.c: disable on suspend cpufreq: remove unused arm_big_little_dt driver cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64 cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI cpuidle: menu: Remove get_loadavg() from the performance multiplier sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-27sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner1-4/+0
There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-25cpuidle: menu: Remove get_loadavg() from the performance multiplierDaniel Lezcano1-19/+6
The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit 372ba8cb46b2 (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does no matter anymore. Removing it. [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-18cpuidle: menu: Avoid computations when result will be discardedRafael J. Wysocki1-3/+16
If the minimum interval taken into account in the average computation loop in get_typical_interval() is less than the expected idle duration determined so far, the resultant average cannot be greater than that value as well and the entire return result of the function is going to be discarded anyway going forward. In that case, it is a waste of time to carry out the remaining computations in get_typical_interval(), so avoid that by returning early if the minimum interval is not below the expected idle duration. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-18cpuidle: menu: Drop redundant comparisonRafael J. Wysocki1-6/+1
Since the correction factor cannot be greater than RESOLUTION * DECAY, the result of the predicted_us computation in menu_select() cannot be greater than data->next_timer_us, so it is not necessary to compare the "typical interval" value coming from get_typical_interval() with data->next_timer_us separately. It is sufficient to copmare predicted_us with the return value of get_typical_interval() directly, so do that and drop the now redundant expected_interval variable. No intentional changes of behavior. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-12cpuidle: menu: Simplify checks related to the polling stateRafael J. Wysocki1-4/+4
After some recent menu governor changes, the promotion of the "polling" state to a physical one is mostly controlled by the latency limit (resulting from the "interactivity" factor) and not by the time to the closest timer event, so it should be sufficient to check the exit latency of that state for this purpose (of course, its target residency still needs to be within the next timer event range for energy-efficiency). Also, the physical state the "polling" one is promoted to need not be the next one in principle (in case the next state is disabled, for example). For these reasons, simplify the checks made to decide whether or not to promote the "polling" state to a physical one and update the target idle duration when it is promoted in case the residency of the new state turns out to be above the tick boundary (in which case there is no reason to stop the tick). Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-04cpuidle: menu: Move the latency_req == 0 special case checkRafael J. Wysocki1-7/+1
It is better to always update data->bucket before returning from menu_select() to avoid updating the correction factor for a stale bucket, so combine the latency_req == 0 special check with the more general check below. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-04cpuidle: menu: Avoid computations for very close timersRafael J. Wysocki1-0/+12
If the next timer event (not including the tick) is closer than the target residency of the second state or the PM QoS latency constraint is below its exit latency, state[0] will be used regardless of any other factors, so skip the computations in menu_select() then and return 0 straight away from it. Still, do that after the bucket has been determined to avoid updating the correction factor for a stale bucket. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-04cpuidle: menu: Do not update last_state_idx in menu_select()Rafael J. Wysocki1-5/+2
It is not necessary to update data->last_state_idx in menu_select() as it only is used in menu_update() which only runs when data->needs_update is set and that is set only when updating data->last_state_idx in menu_reflect(). Accordingly, drop the update of data->last_state_idx from menu_select() and get rid of the (now redundant) "out" label from it. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Get rid of first_idx from menu_select()Rafael J. Wysocki1-18/+14
Rearrange the code in menu_select() so that the loop over idle states always starts from 0 and get rid of the first_idx variable. While at it, add two empty lines to separate conditional statements from one another. No intentional behavior changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Compute first_idx when latency_req is knownRafael J. Wysocki1-16/+16
Since menu_select() can only set first_idx to 1 if the exit latency of the second state is not greater than the latency limit, it should first determine that limit. Thus first_idx should be computed after the "interactivity" factor has been taken into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewedy-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-04cpuidle: menu: Fix wakeup statistics updates for polling stateRafael J. Wysocki1-0/+10
If the CPU exits the "polling" state due to the time limit in the loop in poll_idle(), this is not a real wakeup and it just means that the "polling" state selection was not adequate. The governor mispredicted short idle duration, but had a more suitable state been selected, the CPU might have spent more time in it. In fact, there is no reason to expect that there would have been a wakeup event earlier than the next timer in that case. Handling such cases as regular wakeups in menu_update() may cause the menu governor to make suboptimal decisions going forward, but ignoring them altogether would not be correct either, because every time menu_select() is invoked, it makes a separate new attempt to predict the idle duration taking distinct time to the closest timer event as input and the outcomes of all those attempts should be recorded. For this reason, make menu_update() always assume that if the "polling" state was exited due to the time limit, the next proper wakeup event for the CPU would be the next timer event (not including the tick). Fixes: a37b969a61c1 "cpuidle: poll_state: Add time limit to poll_idle()" Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-10-03cpuidle: menu: Replace data->predicted_us with local variableRafael J. Wysocki1-12/+11
The predicted_us field in struct menu_device is only accessed in menu_select(), so replace it with a local variable in that function. With that, stop using expected_interval instead of predicted_us to store the new predicted idle duration value if it is set to the selected state's target residency which is quite confusing. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-09-18cpuidle: Remove unnecessary wrapper cpuidle_get_last_residency()Fieah Lim2-2/+2
cpuidle_get_last_residency() is just a wrapper for retrieving the last_residency member of struct cpuidle_device. It's also weirdly the only wrapper function for accessing cpuidle_* struct member (by my best guess is it could be a leftover from v2.x). Anyhow, since the only two users (the ladder and menu governors) can access dev->last_residency directly, and it's more intuitive to do it that way, let's just get rid of the wrapper. This patch tidies up CPU idle code a bit without functional changes. Signed-off-by: Fieah Lim <kw@fieahl.im> [ rjw: Changelog cleanup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-25cpuidle: menu: Retain tick when shallow state is selectedRafael J. Wysocki1-1/+12
The case addressed by commit 5ef499cd571c (cpuidle: menu: Handle stopped tick more aggressively) in the stopped tick case is present when the tick has not been stopped yet too. Namely, if only two CPU idle states, shallow state A with target residency significantly below the tick boundary and deep state B with target residency significantly above it, are available and the predicted idle duration is above the tick boundary, but below the target residency of state B, state A will be selected and the CPU may spend indefinite amount of time in it, which is not quite energy-efficient. However, if the tick has not been stopped yet and the governor is about to select a shallow idle state for the CPU even though the idle duration predicted by it is above the tick boundary, it should be fine to wake up the CPU early, so the tick can be retained then and the governor will have a chance to select a deeper state when it runs next time. [Note that when this really happens, it will make the idle duration predictor believe that the CPU might be idle longer than predicted, which will make it more likely to predict longer idle durations going forward, but that will also cause deeper idle states to be selected going forward, on average, which is what's needed here.] Fixes: 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+: 5ef499cd571c (cpuidle: menu: Handle ...) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-20cpuidle: menu: Handle stopped tick more aggressivelyRafael J. Wysocki1-12/+24
Commit 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) missed the case when the target residencies of deep idle states of CPUs are above the tick boundary which may cause the CPU to get stuck in a shallow idle state for a long time. Say there are two CPU idle states available: one shallow, with the target residency much below the tick boundary and one deep, with the target residency significantly above the tick boundary. In that case, if the tick has been stopped already and the expected next timer event is relatively far in the future, the governor will assume the idle duration to be equal to TICK_USEC and it will select the idle state for the CPU accordingly. However, that will cause the shallow state to be selected even though it would have been more energy-efficient to select the deep one. To address this issue, modify the governor to always use the time till the closest timer event instead of the predicted idle duration if the latter is less than the tick period length and the tick has been stopped already. Also make it extend the search for a matching idle state if the tick is stopped to avoid settling on a shallow state if deep states with target residencies above the tick period length are available. In addition, make it always indicate that the tick should be stopped if it has been stopped already for consistency. Fixes: 87c9fe6ee495 (cpuidle: menu: Avoid selecting shallow states with stopped tick) Reported-by: Leo Yan <leo.yan@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-17cpuidle: menu: Update stale polling override commentRafael J. Wysocki1-3/+2
The comment to explain why the menu governor uses idle state 1 instead of idle state 0 as the first one sometimes is stale (among other things it mentions a user setting not present any more), so update it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-15cpuidle: menu: Fix white spaceRafael J. Wysocki1-2/+2
Fix some damaged white space in menu_select(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-31cpuidle: governors: Consolidate PM QoS handlingRafael J. Wysocki2-16/+2
There is some code duplication related to the PM QoS handling between the existing cpuidle governors, so move that code to a common helper function and call that from the governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-31cpuidle: governors: Drop redundant checks related to PM QoSRafael J. Wysocki2-4/+2
PM_QOS_RESUME_LATENCY_NO_CONSTRAINT is defined as the 32-bit integer maximum, so it is not necessary to test the return value of dev_pm_qos_raw_read_value() against it directly in the menu and ladder cpuidle governors. Drop these redundant checks. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-09cpuidle: menu: Avoid selecting shallow states with stopped tickRafael J. Wysocki1-7/+22
If the scheduler tick has been stopped already and the governor selects a shallow idle state, the CPU can spend a long time in that state if the selection is based on an inaccurate prediction of idle time. That effect turns out to be relevant, so it needs to be mitigated. To that end, modify the menu governor to discard the result of the idle time prediction if the tick is stopped and the predicted idle time is less than the tick period length, unless the tick timer is going to expire soon. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-09cpuidle: menu: Refine idle state selection for running tickRafael J. Wysocki1-2/+25
If the tick isn't stopped, the target residency of the state selected by the menu governor may be greater than the actual time to the next tick and that means lost energy. To avoid that, make tick_nohz_get_sleep_length() return the current time to the next event (before stopping the tick) in addition to the estimated one via an extra pointer argument and make menu_select() use that value to refine the state selection when necessary. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-06cpuidle: Return nohz hint from cpuidle_select()Rafael J. Wysocki2-13/+49
Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there. Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false" through it if: (1) the idle exit latency is constrained at 0, or (2) the selected state is a polling one, or (3) the expected idle period duration is within the tick period range. In addition to that, the correction factor computations in the menu governor need to take the possibility that the tick may not be stopped into account to avoid artificially small correction factor values. To that end, add a mechanism to record tick wakeups, as suggested by Peter Zijlstra, and use it to modify the menu_update() behavior when tick wakeup occurs. Namely, if the CPU is woken up by the tick and the return value of tick_nohz_get_sleep_length() is not within the tick boundary, the predicted idle duration is likely too short, so make menu_update() try to compensate for that by updating the governor statistics as though the CPU was idle for a long time. Since the value returned through the new argument pointer of cpuidle_select() is not used by its caller yet, this change by itself is not expected to alter the functionality of the code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2017-11-08cpuidle: ladder: Add per CPU PM QoS resume latency supportRamesh Thomas1-0/+7
Individual CPUs may have special requirements to not enter deep idle states. For example, a CPU running real time applications would not want to enter deep idle states to avoid latency impacts. At the same time other CPUs that do not have such a requirement could allow deep idle states to save power. This was already implemented in the menu governor. Implementing similar changes in the ladder governor which gets selected when CONFIG_NO_HZ and CONFIG_NO_HZ_IDLE are not set. Refer following commits for the menu governor changes. Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08PM / QoS: Fix device resume latency frameworkRafael J. Wysocki1-2/+2
The special value of 0 for device resume latency PM QoS means "no restriction", but there are two problems with that. First, device resume latency PM QoS requests with 0 as the value are always put in front of requests with positive values in the priority lists used internally by the PM QoS framework, causing 0 to be chosen as an effective constraint value. However, that 0 is then interpreted as "no restriction" effectively overriding the other requests with specific restrictions which is incorrect. Second, the users of device resume latency PM QoS have no way to specify that *any* resume latency at all should be avoided, which is an artificial limitation in general. To address these issues, modify device resume latency PM QoS to use S32_MAX as the "no constraint" value and 0 as the "no latency at all" one and rework its users (the cpuidle menu governor, the genpd QoS governor and the runtime PM framework) to follow these changes. Also add a special "n/a" value to the corresponding user space I/F to allow user space to indicate that it cannot accept any resume latencies at all for the given device. Fixes: 85dc0b8a4019 (PM / QoS: Make it possible to expose PM QoS latency constraints) Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323 Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Tero Kristo <t-kristo@ti.com> Reviewed-by: Ramesh Thomas <ramesh.thomas@intel.com>
2017-08-30cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbolRafael J. Wysocki2-14/+13
On some architectures the first (index 0) idle state is a polling one and it doesn't really save energy, so there is the CPUIDLE_DRIVER_STATE_START symbol allowing some pieces of cpuidle code to avoid using that state. However, this makes the code rather hard to follow. It is better to explicitly avoid the polling state, so add a new cpuidle state flag CPUIDLE_FLAG_POLLING to mark it and make the relevant code check that flag for the first state instead of using the CPUIDLE_DRIVER_STATE_START symbol. In the ACPI processor driver that cannot always rely on the state flags (like before the states table has been set up) define a new internal symbol ACPI_IDLE_STATE_START equivalent to the CPUIDLE_DRIVER_STATE_START one and drop the latter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2017-06-29cpuidle: menu: allow state 0 to be disabledNicholas Piggin1-5/+15
The menu driver does not allow state0 to be disabled completely. If it is disabled but other enabled states don't meet latency requirements, it is still used. Fix this by starting with the first enabled idle state. Fall back to state 0 if no idle states are enabled (arguably this should be -EINVAL if it is attempted, but this is the minimal fix). Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/stat.h> We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/stat.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/loadavg.h> We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which will have to be picked up from a couple of .c files. Create a trivial placeholder <linux/sched/topology.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27cpuidle: menu: Avoid taking spinlock for accessing QoS valuesRafael J. Wysocki1-1/+1
After commit 9908859acaa9 (cpuidle/menu: add per CPU PM QoS resume latency consideration) the cpuidle menu governor calls dev_pm_qos_read_value() on CPU devices to read the current resume latency QoS constraint values for them. That function takes a spinlock to prevent the device's power.qos pointer from becoming NULL during the access which is a problem for the RT patchset where spinlocks are converted into mutexes and the idle loop stops working. However, it is not even necessary for the menu governor to take that spinlock, because the power.qos pointer accessed under it cannot be modified during the access anyway. For this reason, introduce a "raw" routine for accessing device QoS resume latency constraints without locking and use it in the menu governor. Fixes: 9908859acaa9 (cpuidle/menu: add per CPU PM QoS resume latency consideration) Acked-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>