summaryrefslogtreecommitdiff
path: root/drivers/cpuidle/governors
AgeCommit message (Collapse)AuthorFilesLines
2014-05-01cpuidle / menu: move repeated correction factor check to initChander Kashyap1-7/+8
In menu_select function we check for correction factor every time. If it is zero we are initializing to unity. Hence move it to init function and initialise by unity, hence avoid repeated comparisons. Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org> Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-01cpuidle / menu: Return (-1) if there are no suitable statesRafael J. Wysocki1-1/+1
If there is a PM QoS latency limit and all of the sufficiently shallow C-states are disabled, the cpuidle menu governor returns 0 which on some systems is CPUIDLE_DRIVER_STATE_START and shouldn't be returned if that C-state has been disabled. Fix the issue by modifying the menu governor to return (-1) in such situations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Move perf multiplier calculation out of the selection looptuukka.tikkanen@linaro.org1-5/+10
The menu governor performance multiplier defines a minimum predicted idle duration to latency ratio. Instead of checking this separately in every iteration of the state selection loop, adjust the overall latency restriction for the whole loop if this restriction is tighter than what is set by the QoS subsystem. The original test s->exit_latency * multiplier > data->predicted_us becomes s->exit_latency > data->predicted_us / multiplier by dividing both sides of the comparison by "multiplier". While division is likely to be several times slower than multiplication, the minor performance hit allows making a generic sleep state selection function based on (sleep duration, maximum latency) tuple. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Do not substract exit latency from assumed sleep lengthtuukka.tikkanen@linaro.org1-18/+26
The menu governor statistics update function tries to determine the amount of time between entry to low power state and the occurrence of the wakeup event. However, the time measured by the framework includes exit latency on top of the desired value. This exit latency is substracted from the measured value to obtain the desired value. When measured value is not available, the menu governor assumes the wakeup was caused by the timer and the time is equal to remaining timer length. No exit latency should be substracted from this value. This patch prevents the erroneous substraction and clarifies the associated comment. It also removes one intermediate variable that serves no purpose. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Ensure menu coefficients stay within domaintuukka.tikkanen@linaro.org1-0/+3
The menu governor uses coefficients as one method of actual idle period length estimation. The coefficients are, as detailed below, multipliers giving expected idle period length from time until next timer expiry. The multipliers are supposed to have domain of (0..1]. The coefficients are fractions where only the numerators are stored and denominators are a shared constant RESOLUTION*DECAY. Since the value of the coefficient should always be greater than 0 and less than or equal to 1, the numerator must have a value greater than 0 and less than or equal to RESOLUTION*DECAY. If the coefficients are updated with measured idle durations exceeding timer length, the multiplier may reach values exceeding unity (i.e. the stored numerator exceeds RESOLUTION*DECAY). This patch ensures that the multipliers are updated with durations capped to timer length. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: Use actual state latency in menu governortuukka.tikkanen@linaro.org1-5/+2
Currently menu governor records the exit latency of the state it has chosen for the idle period. The stored latency value is then later used to calculate the actual length of the idle period. This value may however be incorrect, as the entered state may not be the one chosen by the governor. The entered state information is available, so we can use that to obtain the real exit latency. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-06cpuidle: rename expected_us to next_timer_us in menu governortuukka.tikkanen@linaro.org1-9/+9
The field expected_us is used to store the time remaining until next timer expiry. The name is inaccurate, as we really do not expect all wakeups to be caused by timers. In addition, another field with a very similar name (predicted_us) is used to store the predicted time remaining until any wakeup source being active. This patch renames expected_us to next_timer_us in order to better reflect the contained information. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Change struct menu_device field typesTuukka Tikkanen1-11/+17
Field predicted_us value can never exceed expected_us value, but it has a potentially larger type. As there is no need for additional 32 bits of zeroes on 32 bit plaforms, change the type of predicted_us to match the type of expected_us. Field correction_factor is used to store a value that cannot exceed the product of RESOLUTION and DECAY (default 1024*8 = 8192). The constants cannot in practice be incremented to such values, that they'd overflow unsigned int even on 32 bit systems, so the type is changed to avoid unnecessary 64 bit arithmetic on 32 bit systems. One multiplication of (now) 32 bit values needs an added cast to avoid truncation of the result and has been added. In order to avoid another multiplication from 32 bit domain to 64 bit domain, the new correction_factor calculation has been changed from new = old * (DECAY-1) / DECAY to new = old - old / DECAY, which with infinite precision would yeild exactly the same result, but now changes the direction of rounding. The impact is not significant as the maximum accumulated difference cannot exceed the value of DECAY, which is relatively small compared to product of RESOLUTION and DECAY (8 / 8192). Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Add a comment warning about possible overflowTuukka Tikkanen1-0/+9
The menu governor has a number of tunable constants that may be changed in the source. If certain combination of values are chosen, an overflow is possible when the correction_factor is being recalculated. This patch adds a warning regarding this possibility and describes the change needed for fixing the issue. The change should not be permanently enabled, as it will hurt performance when it is not needed. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Fix variable domains in get_typical_interval()Tuukka Tikkanen1-6/+9
The menu governor uses a static function get_typical_interval() to try to detect a repeating pattern of wakeups. The previous interval durations are stored as an array of unsigned ints, but the arithmetic in the function is performed exclusively as 64 bit values, even when the value stored in a variable is known not to exceed unsigned int, which may be smaller and more efficient on some platforms. This patch changes the types of varibles used to store some intermediates, the maximum and and the cutoff threshold to unsigned ints. Average and standard deviation are still treated as 64 bit values, even when the values are known to be within the domain of unsigned int, to avoid casts to ensure correct integer promotion for arithmetic operations. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Fix menu_device->intervals typeTuukka Tikkanen1-1/+1
Struct menu_device member intervals is declared as u32, but the value stored is (unsigned) int. The type is changed to match the value being stored. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: CodingStyle: Break up multiple assignments on single lineTuukka Tikkanen1-3/+6
The function get_typical_interval() initializes a number of variables that are immediately after declarations assigned constant values. In addition, there are multiple assignments on a single line, which is explicitly forbidden by Documentation/CodingStyle. This patch removes redundant initial values for the variables and breaks up the multiple assignment line. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Check called function parameter in get_typical_interval()Tuukka Tikkanen1-5/+13
get_typical_interval() uses int_sqrt() in calculation of standard deviation. The formal parameter of int_sqrt() is unsigned long, which may on some platforms be smaller than the 64 bit unsigned integer used as the actual parameter. The overflow can occur frequently when actual idle period lengths are in hundreds of milliseconds. This patch adds a check for such overflow and rejects the candidate average when an overflow would occur. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Rearrange code and comments in get_typical_interval()Tuukka Tikkanen1-13/+15
This patch rearranges a if-return-elsif-goto-fi-return sequence into if-return-fi-if-return-fi-goto sequence. The functionality remains the same. Also, a lengthy comment that did not describe the functionality in the order it occurs is split into half and top half is moved closer to actual implementation it describes. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-23cpuidle: Ignore interval prediction result when timer is shorterTuukka Tikkanen1-1/+4
This patch prevents cpuidle menu governor from using repeating interval prediction result if the idle period predicted is longer than the one allowed by shortest running timer. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-15Merge back earlier 'pm-cpuidle' material.Rafael J. Wysocki2-22/+2
2013-07-29Revert "cpuidle: Quickly notice prediction failure for repeat mode"Rafael J. Wysocki1-69/+4
Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for repeat mode), because it has been identified as the source of a significant performance regression in v3.8 and later as explained by Jeremy Eder: We believe we've identified a particular commit to the cpuidle code that seems to be impacting performance of variety of workloads. The simplest way to reproduce is using netperf TCP_RR test, so we're using that, on a pair of Sandy Bridge based servers. We also have data from a large database setup where performance is also measurably/positively impacted, though that test data isn't easily share-able. Included below are test results from 3 test kernels: kernel reverts ----------------------------------------------------------- 1) vanilla upstream (no reverts) 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c 3) test reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 e11538d1f03914eb92af5a1a378375c05ae8520c In summary, netperf TCP_RR numbers improve by approximately 4% after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never seems to get above 40%. Taking that patch out gets C0 near 100% quite often, and performance increases. The below data are histograms representing the %c0 residency @ 1-second sample rates (using turbostat), while under netperf test. - If you look at the first 4 histograms, you can see %c0 residency almost entirely in the 30,40% bin. - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4, shows %c0 in the 80,90,100% bins. Below each kernel name are netperf TCP_RR trans/s numbers for the particular kernel that can be disclosed publicly, comparing the 3 test kernels. We ran a 4th test with the vanilla kernel where we've also set /dev/cpu_dma_latency=0 to show overall impact boosting single-threaded TCP_RR performance over 11% above baseline. 3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0): TCP_RR trans/s 54323.78 ----------------------------------------------------------- 3.10-rc2 vanilla RX (no reverts) TCP_RR trans/s 48192.47 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 59]: *********************************************************** 40.0000 - 50.0000 [ 1]: * 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: Sender %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 11]: *********** 40.0000 - 50.0000 [ 49]: ************************************************* 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: ----------------------------------------------------------- 3.10-rc2 perfteam2 RX (reverts commit e11538d1f03914eb92af5a1a378375c05ae8520c) TCP_RR trans/s 49698.69 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 1]: * 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 59]: *********************************************************** 40.0000 - 50.0000 [ 0]: 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: Sender %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 2]: ** 40.0000 - 50.0000 [ 58]: ********************************************************** 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 0]: ----------------------------------------------------------- 3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and e11538d1f03914eb92af5a1a378375c05ae8520c) TCP_RR trans/s 47766.95 Receiver %c0 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 1]: * 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 27]: *************************** 40.0000 - 50.0000 [ 2]: ** 50.0000 - 60.0000 [ 0]: 60.0000 - 70.0000 [ 2]: ** 70.0000 - 80.0000 [ 0]: 80.0000 - 90.0000 [ 0]: 90.0000 - 100.0000 [ 28]: **************************** Sender: 0.0000 - 10.0000 [ 1]: * 10.0000 - 20.0000 [ 0]: 20.0000 - 30.0000 [ 0]: 30.0000 - 40.0000 [ 11]: *********** 40.0000 - 50.0000 [ 0]: 50.0000 - 60.0000 [ 1]: * 60.0000 - 70.0000 [ 0]: 70.0000 - 80.0000 [ 3]: *** 80.0000 - 90.0000 [ 7]: ******* 90.0000 - 100.0000 [ 38]: ************************************** These results demonstrate gaining back the tendency of the CPU to stay in more responsive, performant C-states (and thus yield measurably better performance), by reverting commit 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. Requested-by: Jeremy Eder <jeder@redhat.com> Tested-by: Len Brown <len.brown@intel.com> Cc: 3.8+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-29Revert "cpuidle: Quickly notice prediction failure in general case"Rafael J. Wysocki1-34/+1
Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in general case), since it depends on commit 69a37be (cpuidle: Quickly notice prediction failure for repeat mode) that has been identified as the source of a significant performance regression in v3.8 and later. Requested-by: Jeremy Eder <jeder@redhat.com> Tested-by: Len Brown <len.brown@intel.com> Cc: 3.8+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-15cpuidle: Make it clear that governors cannot be modulesDaniel Lezcano2-22/+2
cpufreq governors are defined as modules in the code, but the Kconfig options do not allow them to be built as modules. This is not really a problem, but the cpuidle init ordering is: the cpuidle init functions (framework and driver) and then the governors. That leads to some weirdness in the cpuidle framework. Namely, cpuidle_register_device() calls cpuidle_enable_device() which fails at the first attempt, because governors have not been registered yet. When a governor is registered, the framework calls cpuidle_enable_device() again which runs __cpuidle_register_device() only then. Of course, for that to work, the cpuidle_enable_device() return value has to be ignored by cpuidle_register_device(). Instead of having this cyclic call graph and relying on a positive side effects of the hackish back and forth cpuidle_enable_device() calls it is better to fix the cpuidle init ordering. To that end, replace the module init code with postcore_initcall() so we have: * cpuidle framework : core_initcall * cpuidle governors : postcore_initcall * cpuidle drivers : device_initcall and remove the corresponding module exit code as it is dead anyway (governors can't be built as modules). [rjw: Changelog] Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-01-15cpuidle: remove the power_specified field in the driverDaniel Lezcano1-6/+2
We realized that the power usage field is never filled and when it is filled for tegra, the power_specified flag is not set causing all of these values to be reset when the driver is initialized with set_power_state(). However, the power_specified flag can be simply removed under the assumption that the states are always backward sorted, which is the case with the current code. This change allows the menu governor select function and the cpuidle_play_dead() to be simplified. Moreover, the set_power_states() function can removed as it does not make sense any more. Drop the power_specified flag from struct cpuidle_driver and make the related changes as described above. As a consequence, this also fixes the bug where on the dynamic C-states system, the power fields are not initialized. [rjw: Changelog] References: https://bugzilla.kernel.org/show_bug.cgi?id=42870 References: https://bugzilla.kernel.org/show_bug.cgi?id=43349 References: https://lkml.org/lkml/2012/10/16/518 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-01-03cpuidle: Fix finding state with min power_usageSivaram Nair1-1/+1
Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead of -1) to init the local copies so that functions that tries to find cpuidle states with minimum power usage works correctly even if they use non-negative values. Signed-off-by: Sivaram Nair <sivaramn@nvidia.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-23cpuidle: fix a suspicious RCU usage in menu governorLi Zhong1-4/+6
I saw this suspicious RCU usage on the next tree of 11/15 [ 67.123404] =============================== [ 67.123413] [ INFO: suspicious RCU usage. ] [ 67.123423] 3.7.0-rc5-next-20121115-dirty #1 Not tainted [ 67.123434] ------------------------------- [ 67.123444] include/trace/events/timer.h:186 suspicious rcu_dereference_check() usage! [ 67.123458] [ 67.123458] other info that might help us debug this: [ 67.123458] [ 67.123474] [ 67.123474] RCU used illegally from idle CPU! [ 67.123474] rcu_scheduler_active = 1, debug_locks = 0 [ 67.123493] RCU used illegally from extended quiescent state! [ 67.123507] 1 lock held by swapper/1/0: [ 67.123516] #0: (&cpu_base->lock){-.-...}, at: [<c0000000000979b0>] .__hrtimer_start_range_ns+0x28c/0x524 [ 67.123555] [ 67.123555] stack backtrace: [ 67.123566] Call Trace: [ 67.123576] [c0000001e2ccb920] [c00000000001275c] .show_stack+0x78/0x184 (unreliable) [ 67.123599] [c0000001e2ccb9d0] [c0000000000c15a0] .lockdep_rcu_suspicious+0x120/0x148 [ 67.123619] [c0000001e2ccba70] [c00000000009601c] .enqueue_hrtimer+0x1c0/0x1c8 [ 67.123639] [c0000001e2ccbb00] [c000000000097aa0] .__hrtimer_start_range_ns+0x37c/0x524 [ 67.123660] [c0000001e2ccbc20] [c0000000005c9698] .menu_select+0x508/0x5bc [ 67.123678] [c0000001e2ccbd20] [c0000000005c740c] .cpuidle_idle_call+0xa8/0x6e4 [ 67.123699] [c0000001e2ccbdd0] [c0000000000459a0] .pSeries_idle+0x10/0x34 [ 67.123717] [c0000001e2ccbe40] [c000000000014dc8] .cpu_idle+0x130/0x280 [ 67.123738] [c0000001e2ccbee0] [c0000000006ffa8c] .start_secondary+0x378/0x384 [ 67.123758] [c0000001e2ccbf90] [c00000000000936c] .start_secondary_prolog+0x10/0x14 hrtimer_start was added in 198fd638 and ae515197. The patch below tries to use RCU_NONIDLE around it to avoid the above report. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Get typical recent sleep intervalYouquan Song1-23/+46
The function detect_repeating_patterns was not very useful for workloads with alternating long and short pauses, for example virtual machines handling network requests for each other (say a web and database server). Instead, try to find a recent sleep interval that is somewhere between the median and the mode sleep time, by discarding outliers to the up side and recalculating the average and standard deviation until that is no longer required. This should do something sane with a sleep interval series like: 200 180 210 10000 30 1000 170 200 The current code would simply discard such a series, while the new code will guess a typical sleep interval just shy of 200. The original patch come from Rik van Riel <riel@redhat.com>. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Quickly notice prediction failure in general caseYouquan Song1-1/+33
The prediction for future is difficult and when the cpuidle governor prediction fails and govenor possibly choose the shallower C-state than it should. How to quickly notice and find the failure becomes important for power saving. The patch extends to general case that prediction logic get a small predicted residency, so it choose a shallow C-state though the expected residency is large . Once the prediction will be fail, the CPU will keep staying at shallow C-state for a long time. Acutally, the CPU has change enter into deep C-state. So when the expected residency is long enough but governor choose a shallow C-state, an timer will be added in order to monitor if the prediction failure. When C-state is waken up prior to the adding timer, the timer will be cancelled initiatively. When the timer is triggered and menu governor will quickly notice prediction failure and re-evaluates deeper C-states possibility. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15cpuidle: Quickly notice prediction failure for repeat modeYouquan Song1-5/+70
The prediction for future is difficult and when the cpuidle governor prediction fails and govenor possibly choose the shallower C-state than it should. How to quickly notice and find the failure becomes important for power saving. cpuidle menu governor has a method to predict the repeat pattern if there are 8 C-states residency which are continuous and the same or very close, so it will predict the next C-states residency will keep same residency time. There is a real case that turbostat utility (tools/power/x86/turbostat) at kernel 3.3 or early. turbostat utility will read 10 registers one by one at Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu governor will predict it is repeat mode and there is another IPI wake up idle CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally idle. However, in the turbostat, following 10 registers reading is sleep 5 seconds by default, so the idle CPU will keep at C1 for a long time though it is idle until break event occurs. In a idle Sandybridge system, run "./turbostat -v", we will notice that deep C-state dangles between "70% ~ 99%". After patched the kernel, we will notice deep C-state stays at >99.98%. In the patch, a timer is added when menu governor detects a repeat mode and choose a shallow C-state. The timer is set to a time out value that greater than predicted time, and we conclude repeat mode prediction failure if timer is triggered. When repeat mode happens as expected, the timer is not triggered and CPU waken up from C-states and it will cancel the timer initiatively. When repeat mode does not happen, the timer will be time out and menu governor will quickly notice that the repeat mode prediction fails and then re-evaluates deeper C-states possibility. Below is another case which will clearly show the patch much benefit: #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/time.h> #include <time.h> #include <pthread.h> volatile int * shutdown; volatile long * count; int delay = 20; int loop = 8; void usage(void) { fprintf(stderr, "Usage: idle_predict [options]\n" " --help -h Print this help\n" " --thread -n Thread number\n" " --loop -l Loop times in shallow Cstate\n" " --delay -t Sleep time (uS)in shallow Cstate\n"); } void *simple_loop() { int idle_num = 1; while (!(*shutdown)) { *count = *count + 1; if (idle_num % loop) usleep(delay); else { /* sleep 1 second */ usleep(1000000); idle_num = 0; } idle_num++; } } static void sighand(int sig) { *shutdown = 1; } int main(int argc, char *argv[]) { sigset_t sigset; int signum = SIGALRM; int i, c, er = 0, thread_num = 8; pthread_t pt[1024]; static char optstr[] = "n:l:t:h:"; while ((c = getopt(argc, argv, optstr)) != EOF) switch (c) { case 'n': thread_num = atoi(optarg); break; case 'l': loop = atoi(optarg); break; case 't': delay = atoi(optarg); break; case 'h': default: usage(); exit(1); } printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay); count = malloc(sizeof(long)); shutdown = malloc(sizeof(int)); *count = 0; *shutdown = 0; sigemptyset(&sigset); sigaddset(&sigset, signum); sigprocmask (SIG_BLOCK, &sigset, NULL); signal(SIGINT, sighand); signal(SIGTERM, sighand); for(i = 0; i < thread_num ; i++) pthread_create(&pt[i], NULL, simple_loop, NULL); for (i = 0; i < thread_num; i++) pthread_join(pt[i], NULL); exit(0); } Get powertop V2 from git://github.com/fenrus75/powertop, build powertop. After build the above test application, then run it. Test plaform can be Intel Sandybridge or other recent platforms. #./idle_predict -l 10 & #./powertop We will find that deep C-state will dangle between 40%~100% and much time spent on C1 state. It is because menu governor wrongly predict that repeat mode is kept, so it will choose the C1 shallow C-state even though it has chance to sleep 1 second in deep C-state. While after patched the kernel, we find that deep C-state will keep >99.6%. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-09-04PM / cpuidle: Make ladder governor use the "disabled" state flagRafael J. Wysocki1-1/+3
For the mechanism introduced by commit cbc9ef0 (PM / Domains: Add preliminary support for cpuidle, v2) to work with the ladder governor, that governor should respect the "disabled" state flag added by that commit. Change the ladder governor accordingly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-09-04Honor state disabling in the cpuidle ladder governorCarsten Emde1-1/+3
There are two cpuidle governors ladder and menu. While the ladder governor is always available, if CONFIG_CPU_IDLE is selected, the menu governor additionally requires CONFIG_NO_HZ. A particular C state can be disabled by writing to the sysfs file /sys/devices/system/cpu/cpuN/cpuidle/stateN/disable, but this mechanism is only implemented in the menu governor. Thus, in a system where CONFIG_NO_HZ is not selected, the ladder governor becomes default and always will walk through all sleep states - irrespective of whether the C state was disabled via sysfs or not. The only way to select a specific C state was to write the related latency to /dev/cpu_dma_latency and keep the file open as long as this setting was required - not very practical and not suitable for setting a single core in an SMP system. With this patch, the ladder governor only will promote to the next C state, if it has not been disabled, and it will demote, if the current C state was disabled. Note that the patch does not make the setting of the sysfs variable "disable" coherent, i.e. if one is disabling a light state, then all deeper states are disabled as well, but the "disable" variable does not reflect it. Likewise, if one enables a deep state but a lighter state still is disabled, then this has no effect. A related section has been added to the documentation. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-03PM / Domains: Add preliminary support for cpuidle, v2Rafael J. Wysocki1-1/+2
On some systems there are CPU cores located in the same power domains as I/O devices. Then, power can only be removed from the domain if all I/O devices in it are not in use and the CPU core is idle. Add preliminary support for that to the generic PM domains framework. First, the platform is expected to provide a cpuidle driver with one extra state designated for use with the generic PM domains code. This state should be initially disabled and its exit_latency value should be set to whatever time is needed to bring up the CPU core itself after restoring power to it, not including the domain's power on latency. Its .enter() callback should point to a procedure that will remove power from the domain containing the CPU core at the end of the CPU power transition. The remaining characteristics of the extra cpuidle state, referred to as the "domain" cpuidle state below, (e.g. power usage, target residency) should be populated in accordance with the properties of the hardware. Next, the platform should execute genpd_attach_cpuidle() on the PM domain containing the CPU core. That will cause the generic PM domains framework to treat that domain in a special way such that: * When all devices in the domain have been suspended and it is about to be turned off, the states of the devices will be saved, but power will not be removed from the domain. Instead, the "domain" cpuidle state will be enabled so that power can be removed from the domain when the CPU core is idle and the state has been chosen as the target by the cpuidle governor. * When the first I/O device in the domain is resumed and __pm_genpd_poweron(() is called for the first time after power has been removed from the domain, the "domain" cpuidle state will be disabled to avoid subsequent surprise power removals via cpuidle. The effective exit_latency value of the "domain" cpuidle state depends on the time needed to bring up the CPU core itself after restoring power to it as well as on the power on latency of the domain containing the CPU core. Thus the "domain" cpuidle state's exit_latency has to be recomputed every time the domain's power on latency is updated, which may happen every time power is restored to the domain, if the measured power on latency is greater than the latency stored in the corresponding generic_pm_domain structure. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reviewed-by: Kevin Hilman <khilman@ti.com>
2012-07-03cpuidle: move field disable from per-driver to per-cpuShuoX Liu1-2/+3
Andrew J.Schorr raises a question. When he changes the disable setting on a single CPU, it affects all the other CPUs. Basically, currently, the disable field is per-driver instead of per-cpu. All the C states of the same driver are shared by all CPU in the same machine. The patch changes the `disable' field to per-cpu, so we could set this separately for each cpu. Signed-off-by: ShuoX Liu <shuox.liu@intel.com> Reported-by: Andrew J.Schorr <aschorr@telemetry-investments.com> Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-03-30cpuidle: power_usage should be declared signed integerBoris Ostrovsky1-1/+1
power_usage is always assigned a negative value and should be declared a signed integer Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: Len Brown <len.brown@intel.com>
2012-03-30cpuidle: add a sysfs entry to disable specific C state for debug purpose.ShuoX Liu1-1/+4
Some C states of new CPU might be not good. One reason is BIOS might configure them incorrectly. To help developers root cause it quickly, the patch adds a new sysfs entry, so developers could disable specific C state manually. In addition, C state might have much impact on performance tuning, as it takes much time to enter/exit C states, which might delay interrupt processing. With the new debug option, developers could check if a deep C state could impact performance and how much impact it could cause. Also add this option in Documentation/cpuidle/sysfs.txt. [akpm@linux-foundation.org: check kstrtol return value] Signed-off-by: ShuoX Liu <shuox.liu@intel.com> Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com> Reviewed-and-Tested-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com>
2011-11-07Merge branch 'release' of ↵Linus Torvalds2-23/+47
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: cpuidle: Single/Global registration of idle states cpuidle: Split cpuidle_state structure and move per-cpu statistics fields cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare() cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state ACPI: Fix CONFIG_ACPI_DOCK=n compiler warning ACPI: Export FADT pm_profile integer value to userspace thermal: Prevent polling from happening during system suspend ACPI: Drop ACPI_NO_HARDWARE_INIT ACPI atomicio: Convert width in bits to bytes in __acpi_ioremap_fast() PNPACPI: Simplify disabled resource registration ACPI: Fix possible recursive locking in hwregs.c ACPI: use kstrdup() mrst pmu: update comment tools/power turbostat: less verbose debugging
2011-11-07cpuidle: Single/Global registration of idle statesDeepthi Dharwar2-19/+29
This patch makes the cpuidle_states structure global (single copy) instead of per-cpu. The statistics needed on per-cpu basis by the governor are kept per-cpu. This simplifies the cpuidle subsystem as state registration is done by single cpu only. Having single copy of cpuidle_states saves memory. Rare case of asymmetric C-states can be handled within the cpuidle driver and architectures such as POWER do not have asymmetric C-states. Having single/global registration of all the idle states, dynamic C-state transitions on x86 are handled by the boot cpu. Here, the boot cpu would disable all the devices, re-populate the states and later enable all the devices, irrespective of the cpu that would receive the notification first. Reference: https://lkml.org/lkml/2011/4/25/83 Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com> Tested-by: Jean Pihet <j-pihet@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-11-07cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()Deepthi Dharwar1-2/+0
The cpuidle_device->prepare() mechanism causes updates to the cpuidle_state[].flags, setting and clearing CPUIDLE_FLAG_IGNORE to tell the governor not to chose a state on a per-cpu basis at run-time. State demotion is now handled by the driver and it returns the actual state entered. Hence, this mechanism is not required. Also this removes per-cpu flags from cpuidle_state enabling it to be made global. Reference: https://lkml.org/lkml/2011/3/25/52 Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm> Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com> Tested-by: Jean Pihet <j-pihet@ti.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-11-07cpuidle: Move dev->last_residency update to driver enter routine; remove ↵Deepthi Dharwar2-2/+18
dev->last_state Cpuidle governor only suggests the state to enter using the governor->select() interface, but allows the low level driver to override the recommended state. The actual entered state may be different because of software or hardware demotion. Software demotion is done by the back-end cpuidle driver and can be accounted correctly. Current cpuidle code uses last_state field to capture the actual state entered and based on that updates the statistics for the state entered. Ideally the driver enter routine should update the counters, and it should return the state actually entered rather than the time spent there. The generic cpuidle code should simply handle where the counters live in the sysfs namespace, not updating the counters. Reference: https://lkml.org/lkml/2011/3/25/52 Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com> Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com> Tested-by: Jean Pihet <j-pihet@ti.com> Reviewed-by: Kevin Hilman <khilman@ti.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-11-01cpuidle: ladder.c needs module.h and not just moduleparam.hPaul Gortmaker1-1/+1
This file has module_init/exit and MODULE_LICENSE, and so it needs the full module.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-11-01cpuidle: Add module.h to drivers/cpuidle files as required.Paul Gortmaker1-0/+1
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-08-25PM QoS: Move and rename the implementation filesJean Pihet2-2/+2
The PM QoS implementation files are better named kernel/power/qos.c and include/linux/pm_qos.h. The PM QoS support is compiled under the CONFIG_PM option. Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-05-29cpuidle: menu: fixed wrapping timers at 4.294 secondsTero Kristo1-1/+3
Cpuidle menu governor is using u32 as a temporary datatype for storing nanosecond values which wrap around at 4.294 seconds. This causes errors in predicted sleep times resulting in higher than should be C state selection and increased power consumption. This also breaks cpuidle state residency statistics. cc: stable@kernel.org # .32.x through .39.x Signed-off-by: Tero Kristo <tero.kristo@nokia.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-09-29cpuidle: Fix typosLucas De Marchi1-1/+1
Signed-off-by: Len Brown <len.brown@intel.com>
2010-08-10cpuidle: extend cpuidle and menu governor to handle dynamic statesAi Li1-7/+16
On some SoC chips, HW resources may be in use during any particular idle period. As a consequence, the cpuidle states that the SoC is safe to enter can change from idle period to idle period. In addition, the latency and threshold of each cpuidle state can vary, depending on the operating condition when the CPU becomes idle, e.g. the current cpu frequency, the current state of the HW blocks, etc. cpuidle core and the menu governor, in the current form, are geared towards cpuidle states that are static, i.e. the availabiltiy of the states, their latencies, their thresholds are non-changing during run time. cpuidle does not provide any hook that cpuidle drivers can use to adjust those values on the fly for the current idle period before the menu governor selects the target cpuidle state. This patch extends cpuidle core and the menu governor to handle states that are dynamic. There are three additions in the patch and the patch maintains backwards-compatibility with existing cpuidle drivers. 1) add prepare() to struct cpuidle_device. A cpuidle driver can hook into the callback and cpuidle will call prepare() before calling the governor's select function. The callback gives the cpuidle driver a chance to update the dynamic information of the cpuidle states for the current idle period, e.g. state availability, latencies, thresholds, power values, etc. 2) add CPUIDLE_FLAG_IGNORE as one of the state flags. In the prepare() function, a cpuidle driver can set/clear the flag to indicate to the menu governor whether a cpuidle state should be ignored, i.e. not available, during the current idle period. 3) add power_specified bit to struct cpuidle_device. The menu governor currently assumes that the cpuidle states are arranged in the order of increasing latency, threshold, and power savings. This is true or can be made true for static states. Once the state parameters are dynamic, the latencies, thresholds, and power savings for the cpuidle states can increase or decrease by different amounts from idle period to idle period. So the assumption of increasing latency, threshold, and power savings from Cn to C(n+1) can no longer be guaranteed. It can be straightforward to calculate the power consumption of each available state and to specify it in power_usage for the idle period. Using the power_usage fields, the menu governor then selects the state that has the lowest power consumption and that still satisfies all other critieria. The power_specified bit defaults to 0. For existing cpuidle drivers, cpuidle detects that power_specified is 0 and fills in a dummy set of power_usage values. Signed-off-by: Ai Li <aili@codeaurora.org> Cc: Len Brown <len.brown@intel.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-01sched: Cure nr_iowait_cpu() usersPeter Zijlstra1-2/+2
Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us()) broke things by not making sure preemption was indeed disabled by the callers of nr_iowait_cpu() which took the iowait value of the current cpu. This resulted in a heap of preempt warnings. Cure this by making nr_iowait_cpu() take a cpu number and fix up the callers to pass in the right number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: linux-pm@lists.linux-foundation.org LKML-Reference: <1277968037.1868.120.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-25cpuidle: add a repeating pattern detector to the menu governorArjan van de Ven1-1/+59
Currently, the menu governor uses the (corrected) next timer as key item for predicting the idle duration. It turns out that there are specific cases where this breaks down: There are cases where we have a very repetitive pattern of idle durations, where the idle period is pretty much the same, for reasons completely unrelated to the next timer event. Examples of such repeating patterns are network loads with irq mitigation, the mouse moving but in theory also the wifi beacons. This patch adds a relatively simple detector for such repeating patterns, where the standard deviation of the last 8 idle periods is compared to a threshold. With this extra predictor in place, measurements show that the DECAY factor can now be increased (the decaying average will now decay slower) to get an even more stable result. [arjan@infradead.org: fix bug identified by Frank] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Corrado Zoccolo <czoccolo@gmail.com> Cc: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11PM QOS updateMark Gross2-2/+2
This patch changes the string based list management to a handle base implementation to help with the hot path use of pm-qos, it also renames much of the API to use "request" as opposed to "requirement" that was used in the initial implementation. I did this because request more accurately represents what it actually does. Also, I added a string based ABI for users wanting to use a string interface. So if the user writes 0xDDDDDDDD formatted hex it will be accepted by the interface. (someone asked me for it and I don't think it hurts anything.) This patch updates some documentation input I got from Randy. Signed-off-by: markgross <mgross@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10cpuidle: Fix incorrect optimizationArjan van de Ven1-5/+4
commit 672917dcc78 ("cpuidle: menu governor: reduce latency on exit") added an optimization, where the analysis on the past idle period moved from the end of idle, to the beginning of the new idle. Unfortunately, this optimization had a bug where it zeroed one key variable for new use, that is needed for the analysis. The fix is simple, zero the variable after doing the work from the previous idle. During the audit of the code that found this issue, another issue was also found; the ->measured_us data structure member is never set, a local variable is always used instead. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Corrado Zoccolo <czoccolo@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06cpuidle menu: remove 8 bytes of padding on 64 bit buildsRichard Kennedy1-1/+1
Reorder struct menu_device to remove 8 bytes of padding on 64 bit builds. Size drops from 136 to 128 bytes, so possibly needing one fewer cache lines. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11drivers/cpuidle/governors/menu.c: fix undefined reference to `__udivdi3'Stephen Hemminger1-3/+9
menu: use proper 64 bit math The new menu governor is incorrectly doing a 64 bit divide. Compile tested only Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15drivers/cpuidle: Move dereference after NULL testJulia Lawall1-3/+0
It does not seem possible that ldev can be NULL, so drop the unnecessary test. If ldev can somehow be NULL, then the initialization of last_idx should be moved below the test. A simplified version of the semantic match that detects this problem is as follows (http://coccinelle.lip6.fr/): // <smpl> @match exists@ expression x, E; identifier fld; @@ * x->fld ... when != \(x = E\|&x\) * x == NULL // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22cpuidle: menu governor: reduce latency on exitCorrado Zoccolo1-1/+19
Move the state residency accounting and statistics computation off the hot exit path. On exit, the need to recompute statistics is recorded, and new statistics will be computed when menu_select is called again. The expected effect is to reduce processor wakeup latency from sleep (C-states). We are speaking of few hundreds of cycles reduction out of a several microseconds latency (determined by the hardware transition), so it is difficult to measure. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Adam Belay <abelay@novell.com Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22cpuidle: fix the menu governor to boost IO performanceArjan van de Ven1-39/+212
Fix the menu idle governor which balances power savings, energy efficiency and performance impact. The reason for a reworked governor is that there have been serious performance issues reported with the existing code on Nehalem server systems. To show this I'm sure Andrew wants to see benchmark results: (benchmark is "fio", "no cstates" is using "idle=poll") no cstates current linux new algorithm 1 disk 107 Mb/s 85 Mb/s 105 Mb/s 2 disks 215 Mb/s 123 Mb/s 209 Mb/s 12 disks 590 Mb/s 320 Mb/s 585 Mb/s In various power benchmark measurements, no degredation was found by our measurement&diagnostics team. Obviously a small percentage more power was used in the "fio" benchmark, due to the much higher performance. While it would be a novel idea to describe the new algorithm in this commit message, I cheaped out and described it in comments in the code instead. [changes since first post: spelling fixes from akpm, review feedback, folded menu-tng into menu.c] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>