diff options
Diffstat (limited to 'Documentation/cpu-freq/governors.txt')
-rw-r--r-- | Documentation/cpu-freq/governors.txt | 322 |
1 files changed, 177 insertions, 145 deletions
diff --git a/Documentation/cpu-freq/governors.txt b/Documentation/cpu-freq/governors.txt index c15aa75f5227..61b3184b6c24 100644 --- a/Documentation/cpu-freq/governors.txt +++ b/Documentation/cpu-freq/governors.txt @@ -10,6 +10,8 @@ Dominik Brodowski <linux@brodo.de> some additions and corrections by Nico Golde <nico@ngolde.de> + Rafael J. Wysocki <rafael.j.wysocki@intel.com> + Viresh Kumar <viresh.kumar@linaro.org> @@ -28,32 +30,27 @@ Contents: 2.3 Userspace 2.4 Ondemand 2.5 Conservative +2.6 Schedutil 3. The Governor Interface in the CPUfreq Core +4. References 1. What Is A CPUFreq Governor? ============================== Most cpufreq drivers (except the intel_pstate and longrun) or even most -cpu frequency scaling algorithms only offer the CPU to be set to one -frequency. In order to offer dynamic frequency scaling, the cpufreq -core must be able to tell these drivers of a "target frequency". So -these specific drivers will be transformed to offer a "->target/target_index" -call instead of the existing "->setpolicy" call. For "longrun", all -stays the same, though. +cpu frequency scaling algorithms only allow the CPU frequency to be set +to predefined fixed values. In order to offer dynamic frequency +scaling, the cpufreq core must be able to tell these drivers of a +"target frequency". So these specific drivers will be transformed to +offer a "->target/target_index/fast_switch()" call instead of the +"->setpolicy()" call. For set_policy drivers, all stays the same, +though. How to decide what frequency within the CPUfreq policy should be used? -That's done using "cpufreq governors". Two are already in this patch --- they're the already existing "powersave" and "performance" which -set the frequency statically to the lowest or highest frequency, -respectively. At least two more such governors will be ready for -addition in the near future, but likely many more as there are various -different theories and models about dynamic frequency scaling -around. Using such a generic interface as cpufreq offers to scaling -governors, these can be tested extensively, and the best one can be -selected for each specific use. +That's done using "cpufreq governors". Basically, it's the following flow graph: @@ -71,7 +68,7 @@ CPU can be set to switch independently | CPU can only be set / the limits of policy->{min,max} / \ / \ - Using the ->setpolicy call, Using the ->target/target_index call, + Using the ->setpolicy call, Using the ->target/target_index/fast_switch call, the limits and the the frequency closest "policy" is set. to target_freq is set. It is assured that it @@ -109,114 +106,159 @@ directory. 2.4 Ondemand ------------ -The CPUfreq governor "ondemand" sets the CPU depending on the -current usage. To do this the CPU must have the capability to -switch the frequency very quickly. There are a number of sysfs file -accessible parameters: - -sampling_rate: measured in uS (10^-6 seconds), this is how often you -want the kernel to look at the CPU usage and to make decisions on -what to do about the frequency. Typically this is set to values of -around '10000' or more. It's default value is (cmp. with users-guide.txt): -transition_latency * 1000 -Be aware that transition latency is in ns and sampling_rate is in us, so you -get the same sysfs value by default. -Sampling rate should always get adjusted considering the transition latency -To set the sampling rate 750 times as high as the transition latency -in the bash (as said, 1000 is default), do: -echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) \ - >ondemand/sampling_rate - -sampling_rate_min: -The sampling rate is limited by the HW transition latency: -transition_latency * 100 -Or by kernel restrictions: -If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. -If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the -limits depend on the CONFIG_HZ option: -HZ=1000: min=20000us (20ms) -HZ=250: min=80000us (80ms) -HZ=100: min=200000us (200ms) -The highest value of kernel and HW latency restrictions is shown and -used as the minimum sampling rate. - -up_threshold: defines what the average CPU usage between the samplings -of 'sampling_rate' needs to be for the kernel to make a decision on -whether it should increase the frequency. For example when it is set -to its default value of '95' it means that between the checking -intervals the CPU needs to be on average more than 95% in use to then -decide that the CPU frequency needs to be increased. - -ignore_nice_load: this parameter takes a value of '0' or '1'. When -set to '0' (its default), all processes are counted towards the -'cpu utilisation' value. When set to '1', the processes that are -run with a 'nice' value will not count (and thus be ignored) in the -overall usage calculation. This is useful if you are running a CPU -intensive calculation on your laptop that you do not care how long it -takes to complete as you can 'nice' it and prevent it from taking part -in the deciding process of whether to increase your CPU frequency. - -sampling_down_factor: this parameter controls the rate at which the -kernel makes a decision on when to decrease the frequency while running -at top speed. When set to 1 (the default) decisions to reevaluate load -are made at the same interval regardless of current clock speed. But -when set to greater than 1 (e.g. 100) it acts as a multiplier for the -scheduling interval for reevaluating load when the CPU is at its top -speed due to high load. This improves performance by reducing the overhead -of load evaluation and helping the CPU stay at its top speed when truly -busy, rather than shifting back and forth in speed. This tunable has no -effect on behavior at lower speeds/lower CPU loads. - -powersave_bias: this parameter takes a value between 0 to 1000. It -defines the percentage (times 10) value of the target frequency that -will be shaved off of the target. For example, when set to 100 -- 10%, -when ondemand governor would have targeted 1000 MHz, it will target -1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 -(disabled) by default. -When AMD frequency sensitivity powersave bias driver -- -drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter -defines the workload frequency sensitivity threshold in which a lower -frequency is chosen instead of ondemand governor's original target. -The frequency sensitivity is a hardware reported (on AMD Family 16h -Processors and above) value between 0 to 100% that tells software how -the performance of the workload running on a CPU will change when -frequency changes. A workload with sensitivity of 0% (memory/IO-bound) -will not perform any better on higher core frequency, whereas a -workload with sensitivity of 100% (CPU-bound) will perform better -higher the frequency. When the driver is loaded, this is set to 400 -by default -- for CPUs running workloads with sensitivity value below -40%, a lower frequency is chosen. Unloading the driver or writing 0 -will disable this feature. +The CPUfreq governor "ondemand" sets the CPU frequency depending on the +current system load. Load estimation is triggered by the scheduler +through the update_util_data->func hook; when triggered, cpufreq checks +the CPU-usage statistics over the last period and the governor sets the +CPU accordingly. The CPU must have the capability to switch the +frequency very quickly. + +Sysfs files: + +* sampling_rate: + + Measured in uS (10^-6 seconds), this is how often you want the kernel + to look at the CPU usage and to make decisions on what to do about the + frequency. Typically this is set to values of around '10000' or more. + It's default value is (cmp. with users-guide.txt): transition_latency + * 1000. Be aware that transition latency is in ns and sampling_rate + is in us, so you get the same sysfs value by default. Sampling rate + should always get adjusted considering the transition latency to set + the sampling rate 750 times as high as the transition latency in the + bash (as said, 1000 is default), do: + + $ echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate + +* sampling_rate_min: + + The sampling rate is limited by the HW transition latency: + transition_latency * 100 + + Or by kernel restrictions: + - If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. + - If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is + used, the limits depend on the CONFIG_HZ option: + HZ=1000: min=20000us (20ms) + HZ=250: min=80000us (80ms) + HZ=100: min=200000us (200ms) + + The highest value of kernel and HW latency restrictions is shown and + used as the minimum sampling rate. + +* up_threshold: + + This defines what the average CPU usage between the samplings of + 'sampling_rate' needs to be for the kernel to make a decision on + whether it should increase the frequency. For example when it is set + to its default value of '95' it means that between the checking + intervals the CPU needs to be on average more than 95% in use to then + decide that the CPU frequency needs to be increased. + +* ignore_nice_load: + + This parameter takes a value of '0' or '1'. When set to '0' (its + default), all processes are counted towards the 'cpu utilisation' + value. When set to '1', the processes that are run with a 'nice' + value will not count (and thus be ignored) in the overall usage + calculation. This is useful if you are running a CPU intensive + calculation on your laptop that you do not care how long it takes to + complete as you can 'nice' it and prevent it from taking part in the + deciding process of whether to increase your CPU frequency. + +* sampling_down_factor: + + This parameter controls the rate at which the kernel makes a decision + on when to decrease the frequency while running at top speed. When set + to 1 (the default) decisions to reevaluate load are made at the same + interval regardless of current clock speed. But when set to greater + than 1 (e.g. 100) it acts as a multiplier for the scheduling interval + for reevaluating load when the CPU is at its top speed due to high + load. This improves performance by reducing the overhead of load + evaluation and helping the CPU stay at its top speed when truly busy, + rather than shifting back and forth in speed. This tunable has no + effect on behavior at lower speeds/lower CPU loads. + +* powersave_bias: + + This parameter takes a value between 0 to 1000. It defines the + percentage (times 10) value of the target frequency that will be + shaved off of the target. For example, when set to 100 -- 10%, when + ondemand governor would have targeted 1000 MHz, it will target + 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 + (disabled) by default. + + When AMD frequency sensitivity powersave bias driver -- + drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter + defines the workload frequency sensitivity threshold in which a lower + frequency is chosen instead of ondemand governor's original target. + The frequency sensitivity is a hardware reported (on AMD Family 16h + Processors and above) value between 0 to 100% that tells software how + the performance of the workload running on a CPU will change when + frequency changes. A workload with sensitivity of 0% (memory/IO-bound) + will not perform any better on higher core frequency, whereas a + workload with sensitivity of 100% (CPU-bound) will perform better + higher the frequency. When the driver is loaded, this is set to 400 by + default -- for CPUs running workloads with sensitivity value below + 40%, a lower frequency is chosen. Unloading the driver or writing 0 + will disable this feature. 2.5 Conservative ---------------- The CPUfreq governor "conservative", much like the "ondemand" -governor, sets the CPU depending on the current usage. It differs in -behaviour in that it gracefully increases and decreases the CPU speed -rather than jumping to max speed the moment there is any load on the -CPU. This behaviour more suitable in a battery powered environment. -The governor is tweaked in the same manner as the "ondemand" governor -through sysfs with the addition of: - -freq_step: this describes what percentage steps the cpu freq should be -increased and decreased smoothly by. By default the cpu frequency will -increase in 5% chunks of your maximum cpu frequency. You can change this -value to anywhere between 0 and 100 where '0' will effectively lock your -CPU at a speed regardless of its load whilst '100' will, in theory, make -it behave identically to the "ondemand" governor. - -down_threshold: same as the 'up_threshold' found for the "ondemand" -governor but for the opposite direction. For example when set to its -default value of '20' it means that if the CPU usage needs to be below -20% between samples to have the frequency decreased. - -sampling_down_factor: similar functionality as in "ondemand" governor. -But in "conservative", it controls the rate at which the kernel makes -a decision on when to decrease the frequency while running in any -speed. Load for frequency increase is still evaluated every -sampling rate. +governor, sets the CPU frequency depending on the current usage. It +differs in behaviour in that it gracefully increases and decreases the +CPU speed rather than jumping to max speed the moment there is any load +on the CPU. This behaviour is more suitable in a battery powered +environment. The governor is tweaked in the same manner as the +"ondemand" governor through sysfs with the addition of: + +* freq_step: + + This describes what percentage steps the cpu freq should be increased + and decreased smoothly by. By default the cpu frequency will increase + in 5% chunks of your maximum cpu frequency. You can change this value + to anywhere between 0 and 100 where '0' will effectively lock your CPU + at a speed regardless of its load whilst '100' will, in theory, make + it behave identically to the "ondemand" governor. + +* down_threshold: + + Same as the 'up_threshold' found for the "ondemand" governor but for + the opposite direction. For example when set to its default value of + '20' it means that if the CPU usage needs to be below 20% between + samples to have the frequency decreased. + +* sampling_down_factor: + + Similar functionality as in "ondemand" governor. But in + "conservative", it controls the rate at which the kernel makes a + decision on when to decrease the frequency while running in any speed. + Load for frequency increase is still evaluated every sampling rate. + + +2.6 Schedutil +------------- + +The "schedutil" governor aims at better integration with the Linux +kernel scheduler. Load estimation is achieved through the scheduler's +Per-Entity Load Tracking (PELT) mechanism, which also provides +information about the recent load [1]. This governor currently does +load based DVFS only for tasks managed by CFS. RT and DL scheduler tasks +are always run at the highest frequency. Unlike all the other +governors, the code is located under the kernel/sched/ directory. + +Sysfs files: + +* rate_limit_us: + + This contains a value in microseconds. The governor waits for + rate_limit_us time before reevaluating the load again, after it has + evaluated the load once. + +For an in-depth comparison with the other governors refer to [2]. + 3. The Governor Interface in the CPUfreq Core ============================================= @@ -225,26 +267,10 @@ A new governor must register itself with the CPUfreq core using "cpufreq_register_governor". The struct cpufreq_governor, which has to be passed to that function, must contain the following values: -governor->name - A unique name for this governor -governor->governor - The governor callback function -governor->owner - .THIS_MODULE for the governor module (if - appropriate) - -The governor->governor callback is called with the current (or to-be-set) -cpufreq_policy struct for that CPU, and an unsigned int event. The -following events are currently defined: - -CPUFREQ_GOV_START: This governor shall start its duty for the CPU - policy->cpu -CPUFREQ_GOV_STOP: This governor shall end its duty for the CPU - policy->cpu -CPUFREQ_GOV_LIMITS: The limits for CPU policy->cpu have changed to - policy->min and policy->max. - -If you need other "events" externally of your driver, _only_ use the -cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the -CPUfreq core to ensure proper locking. +governor->name - A unique name for this governor. +governor->owner - .THIS_MODULE for the governor module (if appropriate). +plus a set of hooks to the functions implementing the governor's logic. The CPUfreq governor may call the CPU processor driver using one of these two functions: @@ -258,12 +284,18 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy, unsigned int relation); target_freq must be within policy->min and policy->max, of course. -What's the difference between these two functions? When your governor -still is in a direct code path of a call to governor->governor, the -per-CPU cpufreq lock is still held in the cpufreq core, and there's -no need to lock it again (in fact, this would cause a deadlock). So -use __cpufreq_driver_target only in these cases. In all other cases -(for example, when there's a "daemonized" function that wakes up -every second), use cpufreq_driver_target to lock the cpufreq per-CPU -lock before the command is passed to the cpufreq processor driver. +What's the difference between these two functions? When your governor is +in a direct code path of a call to governor callbacks, like +governor->start(), the policy->rwsem is still held in the cpufreq core, +and there's no need to lock it again (in fact, this would cause a +deadlock). So use __cpufreq_driver_target only in these cases. In all +other cases (for example, when there's a "daemonized" function that +wakes up every second), use cpufreq_driver_target to take policy->rwsem +before the command is passed to the cpufreq driver. + +4. References +============= + +[1] Per-entity load tracking: https://lwn.net/Articles/531853/ +[2] Improvements in CPU frequency management: https://lwn.net/Articles/682391/ |