From a00ec3874e7d326ab2dffbed92faddf6a77a84e9 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 26 Mar 2020 20:24:35 +0100 Subject: cpufreq: intel_pstate: Select schedutil as the default governor Modify cpufreq Kconfig to select schedutil as the default governor if the intel_pstate driver has been selected and SMP support is enabled (because schedutil depends on SMP). Also select schedutil as well as the performance governor from the intel_pstate Kconfig section to ensure the equivalence of the passive and active mode governor configuration options. Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/Kconfig | 3 ++- drivers/cpufreq/Kconfig.x86 | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index bff5295016ae..9f0e7e79ed14 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -37,10 +37,11 @@ config CPU_FREQ_STAT choice prompt "Default CPUFreq governor" default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help This option sets which CPUFreq governor shall be loaded at - startup. If in doubt, select 'performance'. + startup. If in doubt, use the default setting. config CPU_FREQ_DEFAULT_GOV_PERFORMANCE bool "performance" diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86 index a6528388952e..758c69a2e1bf 100644 --- a/drivers/cpufreq/Kconfig.x86 +++ b/drivers/cpufreq/Kconfig.x86 @@ -8,6 +8,8 @@ config X86_INTEL_PSTATE depends on X86 select ACPI_PROCESSOR if ACPI select ACPI_CPPC_LIB if X86_64 && ACPI && SCHED_MC_PRIO + select CPU_FREQ_GOV_PERFORMANCE + select CPU_FREQ_GOV_SCHEDUTIL if SMP help This driver provides a P state for Intel core processors. The driver implements an internal governor and will become -- cgit v1.2.3 From 3704a6a445790e6621c19be25d85dfadbeb16a69 Mon Sep 17 00:00:00 2001 From: Dexuan Cui Date: Tue, 31 Mar 2020 08:55:25 -0700 Subject: PM: hibernate: Propagate the return value of hibernation_restore() If hibernation_restore() fails, the 'error' should not be zero. Signed-off-by: Dexuan Cui Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 6dbeedb7354c..86aba8706b16 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -678,7 +678,7 @@ static int load_image_and_restore(void) error = swsusp_read(&flags); swsusp_close(FMODE_READ); if (!error) - hibernation_restore(flags & SF_PLATFORM_MODE); + error = hibernation_restore(flags & SF_PLATFORM_MODE); pr_err("Failed to load image, recovering.\n"); swsusp_free(); -- cgit v1.2.3 From b5252a6cbbdaefb2648abe8807beda8a28ec7603 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 29 Mar 2020 16:11:18 +0200 Subject: PM: sleep: core: Drop racy and redundant checks from device_prepare() Alan Stern points out that the WARN_ON() check in device_prepare() is racy (because the PM-runtime API can be disabled briefly for any device at any time and system suspend can start at any time too) and the pm_runtime_suspended() check in the computation of the direct_complete flag value is redundant (because it will be repeated later anyway). Drop both these checks accordingly. Reported-by: Alan Stern Signed-off-by: Rafael J. Wysocki --- drivers/base/power/main.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 6d1dee7051eb..fdd508a78ffd 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1922,10 +1922,6 @@ static int device_prepare(struct device *dev, pm_message_t state) if (dev->power.syscore) return 0; - WARN_ON(!pm_runtime_enabled(dev) && - dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND | - DPM_FLAG_LEAVE_SUSPENDED)); - /* * If a device's parent goes into runtime suspend at the wrong time, * it won't be possible to resume the device. To prevent this we @@ -1973,8 +1969,7 @@ unlock: */ spin_lock_irq(&dev->power.lock); dev->power.direct_complete = state.event == PM_EVENT_SUSPEND && - ((pm_runtime_suspended(dev) && ret > 0) || - dev->power.no_pm_callbacks) && + (ret > 0 || dev->power.no_pm_callbacks) && !dev_pm_test_driver_flags(dev, DPM_FLAG_NEVER_SKIP); spin_unlock_irq(&dev->power.lock); return 0; -- cgit v1.2.3 From db96a75946d3a2a41092c83992e65f727cd35529 Mon Sep 17 00:00:00 2001 From: Chen Yu Date: Thu, 2 Apr 2020 15:56:52 +0800 Subject: PM: sleep: Add pm_debug_messages kernel command line option Debug messages from the system suspend/hibernation infrastructure are disabled by default, and can only be enabled after the system has boot up via /sys/power/pm_debug_messages. This makes the hibernation resume hard to track as it involves system boot up across hibernation. There's no chance for software_resume() to track the resume process, for example. Add a kernel command line option to set pm_debug_messages during boot up. Signed-off-by: Chen Yu [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/kernel-parameters.txt | 3 +++ kernel/power/main.c | 7 +++++++ 2 files changed, 10 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4e3885e38179..df2c2b97f417 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3719,6 +3719,9 @@ Override pmtimer IOPort with a hex value. e.g. pmtmr=0x508 + pm_debug_messages [SUSPEND,KNL] + Enable suspend/resume debug messages during boot up. + pnp.debug=1 [PNP] Enable PNP debug messages (depends on the CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time diff --git a/kernel/power/main.c b/kernel/power/main.c index 69b7a8aeca3b..40f86ec4ab30 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -535,6 +535,13 @@ static ssize_t pm_debug_messages_store(struct kobject *kobj, power_attr(pm_debug_messages); +static int __init pm_debug_messages_setup(char *str) +{ + pm_debug_messages_on = true; + return 1; +} +__setup("pm_debug_messages", pm_debug_messages_setup); + /** * __pm_pr_dbg - Print a suspend debug message to the kernel log. * @defer: Whether or not to use printk_deferred() to print the message. -- cgit v1.2.3 From 8fdcca8e254ad980d083a03c3c60d8225b320ffb Mon Sep 17 00:00:00 2001 From: Linus Walleij Date: Thu, 2 Apr 2020 10:02:39 +0200 Subject: cpufreq: Select schedutil when using big.LITTLE When we are using a system with big.LITTLE HMP configuration, we need to use EAS to schedule the system. As can be seen from kernel/sched/topology.c: "EAS can be used on a root domain if it meets all the following conditions: 1. an Energy Model (EM) is available; 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy. 3. no SMT is detected. 4. the EM complexity is low enough to keep scheduling overheads low; 5. schedutil is driving the frequency of all CPUs of the rd;" This means that at the very least, schedutil needs to be available as a scheduling policy for EAS to work on these systems. Make this explicit by defaulting to the schedutil governor if BIG_LITTLE is selected. Currently users of the TC2 board (like me) has to figure these dependencies out themselves and it is not helpful. Suggested-by: Arnd Bergmann Signed-off-by: Linus Walleij Acked-by: Sudeep Holla Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- drivers/cpufreq/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index 9f0e7e79ed14..c3e6bd59e920 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -37,6 +37,7 @@ config CPU_FREQ_STAT choice prompt "Default CPUFreq governor" default CPU_FREQ_DEFAULT_GOV_USERSPACE if ARM_SA1100_CPUFREQ || ARM_SA1110_CPUFREQ + default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if BIG_LITTLE default CPU_FREQ_DEFAULT_GOV_SCHEDUTIL if X86_INTEL_PSTATE && SMP default CPU_FREQ_DEFAULT_GOV_PERFORMANCE help -- cgit v1.2.3 From 4506c531f11860b99f699173a45d932094e96e04 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Thu, 2 Apr 2020 14:45:16 +0200 Subject: Documentation: PM: sleep: Document system-wide suspend code flows Add a document describing high-level system-wide suspend code flows in Linux. Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/suspend-flows.rst | 270 +++++++++++++++++++++++++ Documentation/admin-guide/pm/system-wide.rst | 1 + 2 files changed, 271 insertions(+) create mode 100644 Documentation/admin-guide/pm/suspend-flows.rst diff --git a/Documentation/admin-guide/pm/suspend-flows.rst b/Documentation/admin-guide/pm/suspend-flows.rst new file mode 100644 index 000000000000..c479d7462647 --- /dev/null +++ b/Documentation/admin-guide/pm/suspend-flows.rst @@ -0,0 +1,270 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================= +System Suspend Code Flows +========================= + +:Copyright: |copy| 2020 Intel Corporation + +:Author: Rafael J. Wysocki + +At least one global system-wide transition needs to be carried out for the +system to get from the working state into one of the supported +:doc:`sleep states `. Hibernation requires more than one +transition to occur for this purpose, but the other sleep states, commonly +referred to as *system-wide suspend* (or simply *system suspend*) states, need +only one. + +For those sleep states, the transition from the working state of the system into +the target sleep state is referred to as *system suspend* too (in the majority +of cases, whether this means a transition or a sleep state of the system should +be clear from the context) and the transition back from the sleep state into the +working state is referred to as *system resume*. + +The kernel code flows associated with the suspend and resume transitions for +different sleep states of the system are quite similar, but there are some +significant differences between the :ref:`suspend-to-idle ` code flows +and the code flows related to the :ref:`suspend-to-RAM ` and +:ref:`standby ` sleep states. + +The :ref:`suspend-to-RAM ` and :ref:`standby ` sleep states +cannot be implemented without platform support and the difference between them +boils down to the platform-specific actions carried out by the suspend and +resume hooks that need to be provided by the platform driver to make them +available. Apart from that, the suspend and resume code flows for these sleep +states are mostly identical, so they both together will be referred to as +*platform-dependent suspend* states in what follows. + + +.. _s2idle_suspend: + +Suspend-to-idle Suspend Code Flow +================================= + +The following steps are taken in order to transition the system from the working +state to the :ref:`suspend-to-idle ` sleep state: + + 1. Invoking system-wide suspend notifiers. + + Kernel subsystems can register callbacks to be invoked when the suspend + transition is about to occur and when the resume transition has finished. + + That allows them to prepare for the change of the system state and to clean + up after getting back to the working state. + + 2. Freezing tasks. + + Tasks are frozen primarily in order to avoid unchecked hardware accesses + from user space through MMIO regions or I/O registers exposed directly to + it and to prevent user space from entering the kernel while the next step + of the transition is in progress (which might have been problematic for + various reasons). + + All user space tasks are intercepted as though they were sent a signal and + put into uninterruptible sleep until the end of the subsequent system resume + transition. + + The kernel threads that choose to be frozen during system suspend for + specific reasons are frozen subsequently, but they are not intercepted. + Instead, they are expected to periodically check whether or not they need + to be frozen and to put themselves into uninterruptible sleep if so. [Note, + however, that kernel threads can use locking and other concurrency controls + available in kernel space to synchronize themselves with system suspend and + resume, which can be much more precise than the freezing, so the latter is + not a recommended option for kernel threads.] + + 3. Suspending devices and reconfiguring IRQs. + + Devices are suspended in four phases called *prepare*, *suspend*, + *late suspend* and *noirq suspend* (see :ref:`driverapi_pm_devices` for more + information on what exactly happens in each phase). + + Every device is visited in each phase, but typically it is not physically + accessed in more than two of them. + + The runtime PM API is disabled for every device during the *late* suspend + phase and high-level ("action") interrupt handlers are prevented from being + invoked before the *noirq* suspend phase. + + Interrupts are still handled after that, but they are only acknowledged to + interrupt controllers without performing any device-specific actions that + would be triggered in the working state of the system (those actions are + deferred till the subsequent system resume transition as described + `below `_). + + IRQs associated with system wakeup devices are "armed" so that the resume + transition of the system is started when one of them signals an event. + + 4. Freezing the scheduler tick and suspending timekeeping. + + When all devices have been suspended, CPUs enter the idle loop and are put + into the deepest available idle state. While doing that, each of them + "freezes" its own scheduler tick so that the timer events associated with + the tick do not occur until the CPU is woken up by another interrupt source. + + The last CPU to enter the idle state also stops the timekeeping which + (among other things) prevents high resolution timers from triggering going + forward until the first CPU that is woken up restarts the timekeeping. + That allows the CPUs to stay in the deep idle state relatively long in one + go. + + From this point on, the CPUs can only be woken up by non-timer hardware + interrupts. If that happens, they go back to the idle state unless the + interrupt that woke up one of them comes from an IRQ that has been armed for + system wakeup, in which case the system resume transition is started. + + +.. _s2idle_resume: + +Suspend-to-idle Resume Code Flow +================================ + +The following steps are taken in order to transition the system from the +:ref:`suspend-to-idle ` sleep state into the working state: + + 1. Resuming timekeeping and unfreezing the scheduler tick. + + When one of the CPUs is woken up (by a non-timer hardware interrupt), it + leaves the idle state entered in the last step of the preceding suspend + transition, restarts the timekeeping (unless it has been restarted already + by another CPU that woke up earlier) and the scheduler tick on that CPU is + unfrozen. + + If the interrupt that has woken up the CPU was armed for system wakeup, + the system resume transition begins. + + 2. Resuming devices and restoring the working-state configuration of IRQs. + + Devices are resumed in four phases called *noirq resume*, *early resume*, + *resume* and *complete* (see :ref:`driverapi_pm_devices` for more + information on what exactly happens in each phase). + + Every device is visited in each phase, but typically it is not physically + accessed in more than two of them. + + The working-state configuration of IRQs is restored after the *noirq* resume + phase and the runtime PM API is re-enabled for every device whose driver + supports it during the *early* resume phase. + + 3. Thawing tasks. + + Tasks frozen in step 2 of the preceding `suspend `_ + transition are "thawed", which means that they are woken up from the + uninterruptible sleep that they went into at that time and user space tasks + are allowed to exit the kernel. + + 4. Invoking system-wide resume notifiers. + + This is analogous to step 1 of the `suspend `_ transition + and the same set of callbacks is invoked at this point, but a different + "notification type" parameter value is passed to them. + + +Platform-dependent Suspend Code Flow +==================================== + +The following steps are taken in order to transition the system from the working +state to platform-dependent suspend state: + + 1. Invoking system-wide suspend notifiers. + + This step is the same as step 1 of the suspend-to-idle suspend transition + described `above `_. + + 2. Freezing tasks. + + This step is the same as step 2 of the suspend-to-idle suspend transition + described `above `_. + + 3. Suspending devices and reconfiguring IRQs. + + This step is analogous to step 3 of the suspend-to-idle suspend transition + described `above `_, but the arming of IRQs for system + wakeup generally does not have any effect on the platform. + + There are platforms that can go into a very deep low-power state internally + when all CPUs in them are in sufficiently deep idle states and all I/O + devices have been put into low-power states. On those platforms, + suspend-to-idle can reduce system power very effectively. + + On the other platforms, however, low-level components (like interrupt + controllers) need to be turned off in a platform-specific way (implemented + in the hooks provided by the platform driver) to achieve comparable power + reduction. + + That usually prevents in-band hardware interrupts from waking up the system, + which must be done in a special platform-dependent way. Then, the + configuration of system wakeup sources usually starts when system wakeup + devices are suspended and is finalized by the platform suspend hooks later + on. + + 4. Disabling non-boot CPUs. + + On some platforms the suspend hooks mentioned above must run in a one-CPU + configuration of the system (in particular, the hardware cannot be accessed + by any code running in parallel with the platform suspend hooks that may, + and often do, trap into the platform firmware in order to finalize the + suspend transition). + + For this reason, the CPU offline/online (CPU hotplug) framework is used + to take all of the CPUs in the system, except for one (the boot CPU), + offline (typically, the CPUs that have been taken offline go into deep idle + states). + + This means that all tasks are migrated away from those CPUs and all IRQs are + rerouted to the only CPU that remains online. + + 5. Suspending core system components. + + This prepares the core system components for (possibly) losing power going + forward and suspends the timekeeping. + + 6. Platform-specific power removal. + + This is expected to remove power from all of the system components except + for the memory controller and RAM (in order to preserve the contents of the + latter) and some devices designated for system wakeup. + + In many cases control is passed to the platform firmware which is expected + to finalize the suspend transition as needed. + + +Platform-dependent Resume Code Flow +=================================== + +The following steps are taken in order to transition the system from a +platform-dependent suspend state into the working state: + + 1. Platform-specific system wakeup. + + The platform is woken up by a signal from one of the designated system + wakeup devices (which need not be an in-band hardware interrupt) and + control is passed back to the kernel (the working configuration of the + platform may need to be restored by the platform firmware before the + kernel gets control again). + + 2. Resuming core system components. + + The suspend-time configuration of the core system components is restored and + the timekeeping is resumed. + + 3. Re-enabling non-boot CPUs. + + The CPUs disabled in step 4 of the preceding suspend transition are taken + back online and their suspend-time configuration is restored. + + 4. Resuming devices and restoring the working-state configuration of IRQs. + + This step is the same as step 2 of the suspend-to-idle suspend transition + described `above `_. + + 5. Thawing tasks. + + This step is the same as step 3 of the suspend-to-idle suspend transition + described `above `_. + + 6. Invoking system-wide resume notifiers. + + This step is the same as step 4 of the suspend-to-idle suspend transition + described `above `_. diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst index 2b1f987b34f0..1a1924d71006 100644 --- a/Documentation/admin-guide/pm/system-wide.rst +++ b/Documentation/admin-guide/pm/system-wide.rst @@ -8,3 +8,4 @@ System-Wide Power Management :maxdepth: 2 sleep-states + suspend-flows -- cgit v1.2.3