From 33aa46f252c703e42c81a76696cd0c240f2281e4 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 25 Mar 2020 15:03:35 +0100 Subject: cpufreq: intel_pstate: Use passive mode by default without HWP After recent changes allowing scale-invariant utilization to be used on x86, the schedutil governor on top of intel_pstate in the passive mode should be on par with (or better than) the active mode "powersave" algorithm of intel_pstate on systems in which hardware-managed P-states (HWP) are not used, so it should not be necessary to use the internal scaling algorithm in those cases. Accordingly, modify intel_pstate to start in the passive mode by default if the processor at hand does not support HWP of if the driver is requested to avoid using HWP through the kernel command line. Among other things, that will allow utilization clamps and the support for RT/DL tasks in the schedutil governor to be utilized on systems in which intel_pstate is used. Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/intel_pstate.rst | 32 ++++++++++++++++----------- 1 file changed, 19 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index ad392f3aee06..39d80bc29ccd 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -62,9 +62,10 @@ on the capabilities of the processor. Active Mode ----------- -This is the default operation mode of ``intel_pstate``. If it works in this -mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` -policies contains the string "intel_pstate". +This is the default operation mode of ``intel_pstate`` for processors with +hardware-managed P-states (HWP) support. If it works in this mode, the +``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies +contains the string "intel_pstate". In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and provides its own scaling algorithms for P-state selection. Those algorithms @@ -138,12 +139,13 @@ internal P-state selection logic to be less performance-focused. Active Mode Without HWP ~~~~~~~~~~~~~~~~~~~~~~~ -This is the default operation mode for processors that do not support the HWP -feature. It also is used by default with the ``intel_pstate=no_hwp`` argument -in the kernel command line. However, in this mode ``intel_pstate`` may refuse -to work with the given processor if it does not recognize it. [Note that -``intel_pstate`` will never refuse to work with any processor with the HWP -feature enabled.] +This operation mode is optional for processors that do not support the HWP +feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel in +the command line. The active mode is used in those cases if the +``intel_pstate=active`` argument is passed to the kernel in the command line. +In this mode ``intel_pstate`` may refuse to work with processors that are not +recognized by it. [Note that ``intel_pstate`` will never refuse to work with +any processor with the HWP feature enabled.] In this mode ``intel_pstate`` registers utilization update callbacks with the CPU scheduler in order to run a P-state selection algorithm, either @@ -188,10 +190,14 @@ is not set. Passive Mode ------------ -This mode is used if the ``intel_pstate=passive`` argument is passed to the -kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too). -Like in the active mode without HWP support, in this mode ``intel_pstate`` may -refuse to work with the given processor if it does not recognize it. +This is the default operation mode of ``intel_pstate`` for processors without +hardware-managed P-states (HWP) support. It is always used if the +``intel_pstate=passive`` argument is passed to the kernel in the command line +regardless of whether or not the given processor supports HWP. [Note that the +``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is used +without ``intel_pstate=active``.] Like in the active mode without HWP support, +in this mode ``intel_pstate`` may refuse to work with processors that are not +recognized by it. If the driver works in this mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq". -- cgit v1.2.3 From 6e176bf8d46194353163c2cb660808bc633b45d9 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:52:08 +0200 Subject: PM: sleep: core: Do not skip callbacks in the resume phase The current code in device_resume_noirq() causes the entire early resume and resume phases of device suspend to be skipped for devices for which the noirq resume phase have been skipped (due to the LEAVE_SUSPENDED flag being set) on the premise that those devices should stay in runtime-suspend after system-wide resume. However, that may not be correct in two situations. First, the middle layer (subsystem) noirq resume callback may be missing for a given device, but its early resume callback may be present and it may need to do something even if it decides to skip the driver callback. Second, if the device's wakeup settings were adjusted in the suspend phase without resuming the device (that was in runtime suspend at that time), they most likely need to be adjusted again in the resume phase and so the driver callback in that phase needs to be run. For the above reason, modify the core to allow the middle layer ->resume_late callback to run even if its ->resume_noirq callback is missing (and the core has skipped the driver-level callback in that phase) and to allow all device callbacks to run in the resume phase. Also make the core set the PM-runtime status of devices with SMART_SUSPEND set whose resume callbacks are not skipped to "active" in the "noirq" resume phase and update the affected subsystems (PCI and ACPI) accordingly. After this change, middle-layer (subsystem) callbacks will always be invoked in all phases of system suspend and resume and driver callbacks will always run in the prepare, suspend, resume, and complete phases for all devices. For devices with SMART_SUSPEND set, driver callbacks will be skipped in the late and noirq phases of system suspend if those devices remain in runtime suspend in __device_suspend_late(). Driver callbacks will also be skipped for them during the noirq and early phases of the "thaw" transition related to hibernation in that case. Setting LEAVE_SUSPENDED means that the driver allows its callbacks to be skipped in the noirq and early phases of system resume, but some additional conditions need to be met for that to happen (among other things, the power.may_skip_resume flag needs to be set for the device during system suspend for the driver callbacks to be skipped during the subsequent resume transition). For all devices with SMART_SUSPEND set whose driver callbacks are invoked during system resume, the PM-runtime status will be set to "active" (by the core). Signed-off-by: Rafael J. Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/power/pci.rst | 5 +-- drivers/acpi/acpi_lpss.c | 6 ++-- drivers/acpi/device_pm.c | 15 ++++---- drivers/base/power/main.c | 85 ++++++++++++++++++++++----------------------- drivers/pci/pci-driver.c | 18 +++++----- 5 files changed, 62 insertions(+), 67 deletions(-) (limited to 'Documentation') diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index 0924d29636ad..a39b2461919a 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1035,10 +1035,7 @@ This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during the "noirq" phase of system-wide suspend and analogous transitions) and next it uses the dev_pm_may_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() early, as the PM core will skip the remaining resume -callbacks for the device during the transition under way and will set its -runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for -it. +pci_pm_resume_noirq() and pci_pm_resume_early() upfront. 3.2. Device Runtime Power Management ------------------------------------ diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index dee999938213..c4a84df6cc98 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -1093,6 +1093,9 @@ static int acpi_lpss_resume_early(struct device *dev) if (pdata->dev_desc->resume_from_noirq) return 0; + if (dev_pm_may_skip_resume(dev)) + return 0; + return acpi_lpss_do_resume_early(dev); } @@ -1105,9 +1108,6 @@ static int acpi_lpss_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - ret = pm_generic_resume_noirq(dev); if (ret) return ret; diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index b2263ec67b43..399684085f85 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1132,14 +1132,6 @@ static int acpi_subsys_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - /* - * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend - * during system suspend, so update their runtime PM status to "active" - * as they will be put into D0 going forward. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - return pm_generic_resume_noirq(dev); } @@ -1153,7 +1145,12 @@ static int acpi_subsys_resume_noirq(struct device *dev) */ static int acpi_subsys_resume_early(struct device *dev) { - int ret = acpi_dev_resume(dev); + int ret; + + if (dev_pm_may_skip_resume(dev)) + return 0; + + ret = acpi_dev_resume(dev); return ret ? ret : pm_generic_resume_early(dev); } diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 75d7cdb4de9c..25b0302188d8 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -565,12 +565,22 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) * dev_pm_may_skip_resume - System-wide device resume optimization check. * @dev: Target device. * - * Checks whether or not the device may be left in suspend after a system-wide - * transition to the working state. + * Return: + * - %false if the transition under way is RESTORE. + * - The return value of dev_pm_smart_suspend_and_suspended() if the transition + * under way is THAW. + * - The logical negation of %power.must_resume otherwise (that is, when the + * transition under way is RESUME). */ bool dev_pm_may_skip_resume(struct device *dev) { - return !dev->power.must_resume && pm_transition.event != PM_EVENT_RESTORE; + if (pm_transition.event == PM_EVENT_RESTORE) + return false; + + if (pm_transition.event == PM_EVENT_THAW) + return dev_pm_smart_suspend_and_suspended(dev); + + return !dev->power.must_resume; } /** @@ -601,6 +611,22 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; + skip_resume = dev_pm_may_skip_resume(dev); + /* + * If the driver callback is skipped below or by the middle layer + * callback and device_resume_early() also skips the driver callback for + * this device later, it needs to appear as "suspended" to PM-runtime, + * so change its status accordingly. + * + * Otherwise, the device is going to be resumed, so set its PM-runtime + * status to "active", but do that only if DPM_FLAG_SMART_SUSPEND is set + * to avoid confusing drivers that don't use it. + */ + if (skip_resume) + pm_runtime_set_suspended(dev); + else if (dev_pm_smart_suspend_and_suspended(dev)) + pm_runtime_set_active(dev); + if (dev->pm_domain) { info = "noirq power domain "; callback = pm_noirq_op(&dev->pm_domain->ops, state); @@ -614,35 +640,12 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn info = "noirq bus "; callback = pm_noirq_op(dev->bus->pm, state); } - if (callback) { - skip_resume = false; + if (callback) goto Run; - } - skip_resume = dev_pm_may_skip_resume(dev); if (skip_resume) goto Skip; - /* - * If "freeze" driver callbacks have been skipped during hibernation, - * because the device was runtime-suspended in __device_suspend_late(), - * the corresponding "thaw" callbacks must be skipped too, because - * running them for a runtime-suspended device may not be valid. - */ - if (dev_pm_smart_suspend_and_suspended(dev) && - state.event == PM_EVENT_THAW) { - skip_resume = true; - goto Skip; - } - - /* - * The device is going to be resumed, so set its PM-runtime status to - * "active", but do that only if DPM_FLAG_SMART_SUSPEND is set to avoid - * confusing drivers that don't use it. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - if (dev->driver && dev->driver->pm) { info = "noirq driver "; callback = pm_noirq_op(dev->driver->pm, state); @@ -654,20 +657,6 @@ Run: Skip: dev->power.is_noirq_suspended = false; - if (skip_resume) { - /* Make the next phases of resume skip the device. */ - dev->power.is_late_suspended = false; - dev->power.is_suspended = false; - /* - * The device is going to be left in suspend, but it might not - * have been in runtime suspend before the system suspended, so - * its runtime PM status needs to be updated to avoid confusing - * the runtime PM framework when runtime PM is enabled for the - * device again. - */ - pm_runtime_set_suspended(dev); - } - Out: complete_all(&dev->power.completion); TRACE_RESUME(error); @@ -804,15 +793,25 @@ static int device_resume_early(struct device *dev, pm_message_t state, bool asyn } else if (dev->bus && dev->bus->pm) { info = "early bus "; callback = pm_late_early_op(dev->bus->pm, state); - } else if (dev->driver && dev->driver->pm) { + } + if (callback) + goto Run; + + if (dev_pm_may_skip_resume(dev)) + goto Skip; + + if (dev->driver && dev->driver->pm) { info = "early driver "; callback = pm_late_early_op(dev->driver->pm, state); } +Run: error = dpm_run_callback(callback, dev, state, info); + +Skip: dev->power.is_late_suspended = false; - Out: +Out: TRACE_RESUME(error); pm_runtime_enable(dev); diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 0454ca0e4e3f..685fbf044911 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -896,14 +896,6 @@ static int pci_pm_resume_noirq(struct device *dev) if (dev_pm_may_skip_resume(dev)) return 0; - /* - * Devices with DPM_FLAG_SMART_SUSPEND may be left in runtime suspend - * during system suspend, so update their runtime PM status to "active" - * as they are going to be put into D0 shortly. - */ - if (dev_pm_smart_suspend_and_suspended(dev)) - pm_runtime_set_active(dev); - /* * In the suspend-to-idle case, devices left in D0 during suspend will * stay in D0, so it is not necessary to restore or update their @@ -928,6 +920,14 @@ static int pci_pm_resume_noirq(struct device *dev) return 0; } +static int pci_pm_resume_early(struct device *dev) +{ + if (dev_pm_may_skip_resume(dev)) + return 0; + + return pm_generic_resume_early(dev); +} + static int pci_pm_resume(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); @@ -961,6 +961,7 @@ static int pci_pm_resume(struct device *dev) #define pci_pm_suspend_late NULL #define pci_pm_suspend_noirq NULL #define pci_pm_resume NULL +#define pci_pm_resume_early NULL #define pci_pm_resume_noirq NULL #endif /* !CONFIG_SUSPEND */ @@ -1358,6 +1359,7 @@ static const struct dev_pm_ops pci_dev_pm_ops = { .suspend = pci_pm_suspend, .suspend_late = pci_pm_suspend_late, .resume = pci_pm_resume, + .resume_early = pci_pm_resume_early, .freeze = pci_pm_freeze, .thaw = pci_pm_thaw, .poweroff = pci_pm_poweroff, -- cgit v1.2.3 From 76c70cb58ce30264af4b714109ee756da25d830a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:52:30 +0200 Subject: PM: sleep: core: Rename dev_pm_may_skip_resume() The name of dev_pm_may_skip_resume() may be easily confused with the power.may_skip_resume flag which is not checked by that function, so rename the former as dev_pm_skip_resume(). No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/power/pci.rst | 2 +- drivers/acpi/acpi_lpss.c | 4 ++-- drivers/acpi/device_pm.c | 4 ++-- drivers/base/power/main.c | 8 ++++---- drivers/pci/pci-driver.c | 4 ++-- include/linux/pm.h | 2 +- 6 files changed, 12 insertions(+), 12 deletions(-) (limited to 'Documentation') diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index a39b2461919a..aa1c7fce6cd0 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1034,7 +1034,7 @@ device to be left in suspend after system-wide transitions to the working state. This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +uses the dev_pm_skip_resume() helper to decide whether or not to return from pci_pm_resume_noirq() and pci_pm_resume_early() upfront. 3.2. Device Runtime Power Management diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c index c4a84df6cc98..7632df1a5be3 100644 --- a/drivers/acpi/acpi_lpss.c +++ b/drivers/acpi/acpi_lpss.c @@ -1093,7 +1093,7 @@ static int acpi_lpss_resume_early(struct device *dev) if (pdata->dev_desc->resume_from_noirq) return 0; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return acpi_lpss_do_resume_early(dev); @@ -1105,7 +1105,7 @@ static int acpi_lpss_resume_noirq(struct device *dev) int ret; /* Follow acpi_subsys_resume_noirq(). */ - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; ret = pm_generic_resume_noirq(dev); diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 1b02d7dc7d34..8c2a091728a9 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1127,7 +1127,7 @@ EXPORT_SYMBOL_GPL(acpi_subsys_suspend_noirq); */ static int acpi_subsys_resume_noirq(struct device *dev) { - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return pm_generic_resume_noirq(dev); @@ -1145,7 +1145,7 @@ static int acpi_subsys_resume_early(struct device *dev) { int ret; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; ret = acpi_dev_resume(dev); diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 5adf0be6aa47..f98eced0f200 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -562,7 +562,7 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) /*------------------------- Resume routines -------------------------*/ /** - * dev_pm_may_skip_resume - System-wide device resume optimization check. + * dev_pm_skip_resume - System-wide device resume optimization check. * @dev: Target device. * * Return: @@ -572,7 +572,7 @@ static void dpm_watchdog_clear(struct dpm_watchdog *wd) * - The logical negation of %power.must_resume otherwise (that is, when the * transition under way is RESUME). */ -bool dev_pm_may_skip_resume(struct device *dev) +bool dev_pm_skip_resume(struct device *dev) { if (pm_transition.event == PM_EVENT_RESTORE) return false; @@ -611,7 +611,7 @@ static int device_resume_noirq(struct device *dev, pm_message_t state, bool asyn if (!dpm_wait_for_superior(dev, async)) goto Out; - skip_resume = dev_pm_may_skip_resume(dev); + skip_resume = dev_pm_skip_resume(dev); /* * If the driver callback is skipped below or by the middle layer * callback and device_resume_early() also skips the driver callback for @@ -797,7 +797,7 @@ static int device_resume_early(struct device *dev, pm_message_t state, bool asyn if (callback) goto Run; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) goto Skip; if (dev->driver && dev->driver->pm) { diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index ce220b1987df..decf82595340 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -891,7 +891,7 @@ static int pci_pm_resume_noirq(struct device *dev) pci_power_t prev_state = pci_dev->current_state; bool skip_bus_pm = pci_dev->skip_bus_pm; - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; /* @@ -920,7 +920,7 @@ static int pci_pm_resume_noirq(struct device *dev) static int pci_pm_resume_early(struct device *dev) { - if (dev_pm_may_skip_resume(dev)) + if (dev_pm_skip_resume(dev)) return 0; return pm_generic_resume_early(dev); diff --git a/include/linux/pm.h b/include/linux/pm.h index e057d1fa2469..d89b7099f241 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -758,7 +758,7 @@ extern int pm_generic_poweroff_late(struct device *dev); extern int pm_generic_poweroff(struct device *dev); extern void pm_generic_complete(struct device *dev); -extern bool dev_pm_may_skip_resume(struct device *dev); +extern bool dev_pm_skip_resume(struct device *dev); extern bool dev_pm_smart_suspend_and_suspended(struct device *dev); #else /* !CONFIG_PM_SLEEP */ -- cgit v1.2.3 From e07515563d010d8b32967634e8dc2fdc732c1aa6 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:53:01 +0200 Subject: PM: sleep: core: Rename DPM_FLAG_NEVER_SKIP Rename DPM_FLAG_NEVER_SKIP to DPM_FLAG_NO_DIRECT_COMPLETE which matches its purpose more closely. No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Bjorn Helgaas # for PCI parts Acked-by: Jeff Kirsher Acked-by: Alan Stern Acked-by: Bjorn Helgaas Acked-by: Alex Deucher --- Documentation/driver-api/pm/devices.rst | 6 +++--- Documentation/power/pci.rst | 10 +++++----- drivers/base/power/main.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +- drivers/gpu/drm/i915/intel_runtime_pm.c | 2 +- drivers/gpu/drm/radeon/radeon_kms.c | 2 +- drivers/misc/mei/pci-me.c | 2 +- drivers/misc/mei/pci-txe.c | 2 +- drivers/net/ethernet/intel/e1000e/netdev.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igc/igc_main.c | 2 +- drivers/pci/pcie/portdrv_pci.c | 2 +- include/linux/pm.h | 6 +++--- 13 files changed, 21 insertions(+), 21 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index f66c7b9126ea..4ace0eba4506 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -361,9 +361,9 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. runtime PM disabled. This feature also can be controlled by device drivers by using the - ``DPM_FLAG_NEVER_SKIP`` and ``DPM_FLAG_SMART_PREPARE`` driver power - management flags. [Typically, they are set at the time the driver is - probed against the device in question by passing them to the + ``DPM_FLAG_NO_DIRECT_COMPLETE`` and ``DPM_FLAG_SMART_PREPARE`` driver + power management flags. [Typically, they are set at the time the driver + is probed against the device in question by passing them to the :c:func:`dev_pm_set_driver_flags` helper function.] If the first of these flags is set, the PM core will not apply the direct-complete procedure described above to the given device and, consequenty, to any diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index aa1c7fce6cd0..9e1408121bea 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1004,11 +1004,11 @@ including the PCI bus type. The flags should be set once at the driver probe time with the help of the dev_pm_set_driver_flags() function and they should not be updated directly afterwards. -The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete -mechanism allowing device suspend/resume callbacks to be skipped if the device -is in runtime suspend when the system suspend starts. That also affects all of -the ancestors of the device, so this flag should only be used if absolutely -necessary. +The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the +direct-complete mechanism allowing device suspend/resume callbacks to be skipped +if the device is in runtime suspend when the system suspend starts. That also +affects all of the ancestors of the device, so this flag should only be used if +absolutely necessary. The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a positive value from pci_pm_prepare() if the ->prepare callback provided by the diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 3170d93e29f9..dbc1e5e7346b 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1844,7 +1844,7 @@ unlock: spin_lock_irq(&dev->power.lock); dev->power.direct_complete = state.event == PM_EVENT_SUSPEND && (ret > 0 || dev->power.no_pm_callbacks) && - !dev_pm_test_driver_flags(dev, DPM_FLAG_NEVER_SKIP); + !dev_pm_test_driver_flags(dev, DPM_FLAG_NO_DIRECT_COMPLETE); spin_unlock_irq(&dev->power.lock); return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index fd1dc3236eca..a9086ea1ab60 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -191,7 +191,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags) } if (adev->runpm) { - dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_use_autosuspend(dev->dev); pm_runtime_set_autosuspend_delay(dev->dev, 5000); pm_runtime_set_active(dev->dev); diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c index ad719c9602af..9cb2d7548daa 100644 --- a/drivers/gpu/drm/i915/intel_runtime_pm.c +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c @@ -549,7 +549,7 @@ void intel_runtime_pm_enable(struct intel_runtime_pm *rpm) * becaue the HDA driver may require us to enable the audio power * domain during system suspend. */ - dev_pm_set_driver_flags(kdev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(kdev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_set_autosuspend_delay(kdev, 10000); /* 10s */ pm_runtime_mark_last_busy(kdev); diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c index 58176db85952..372962358a18 100644 --- a/drivers/gpu/drm/radeon/radeon_kms.c +++ b/drivers/gpu/drm/radeon/radeon_kms.c @@ -158,7 +158,7 @@ int radeon_driver_load_kms(struct drm_device *dev, unsigned long flags) } if (radeon_is_px(dev)) { - dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_use_autosuspend(dev->dev); pm_runtime_set_autosuspend_delay(dev->dev, 5000); pm_runtime_set_active(dev->dev); diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c index 3d21c38e2dbb..53f16f3bd091 100644 --- a/drivers/misc/mei/pci-me.c +++ b/drivers/misc/mei/pci-me.c @@ -240,7 +240,7 @@ static int mei_me_probe(struct pci_dev *pdev, const struct pci_device_id *ent) * MEI requires to resume from runtime suspend mode * in order to perform link reset flow upon system suspend. */ - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); /* * ME maps runtime suspend/resume to D0i states, diff --git a/drivers/misc/mei/pci-txe.c b/drivers/misc/mei/pci-txe.c index beacf2a2f2b5..4bf26ce61044 100644 --- a/drivers/misc/mei/pci-txe.c +++ b/drivers/misc/mei/pci-txe.c @@ -128,7 +128,7 @@ static int mei_txe_probe(struct pci_dev *pdev, const struct pci_device_id *ent) * MEI requires to resume from runtime suspend mode * in order to perform link reset flow upon system suspend. */ - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); /* * TXE maps runtime suspend/resume to own power gating states, diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 177c6da80c57..2730b1c7dddb 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -7549,7 +7549,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent) e1000_print_device_info(adapter); - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); if (pci_dev_run_wake(pdev) && hw->mac.type < e1000_pch_cnp) pm_runtime_put_noidle(&pdev->dev); diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index b46bff8fe056..8bb3db2cbd41 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -3445,7 +3445,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) } } - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_put_noidle(&pdev->dev); return 0; diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c index 69fa1ce1f927..59fc0097438f 100644 --- a/drivers/net/ethernet/intel/igc/igc_main.c +++ b/drivers/net/ethernet/intel/igc/igc_main.c @@ -4825,7 +4825,7 @@ static int igc_probe(struct pci_dev *pdev, pcie_print_link_status(pdev); netdev_info(netdev, "MAC: %pM\n", netdev->dev_addr); - dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NEVER_SKIP); + dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_NO_DIRECT_COMPLETE); pm_runtime_put_noidle(&pdev->dev); diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index 160d67c59310..3acf151ae015 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -115,7 +115,7 @@ static int pcie_portdrv_probe(struct pci_dev *dev, pci_save_state(dev); - dev_pm_set_driver_flags(&dev->dev, DPM_FLAG_NEVER_SKIP | + dev_pm_set_driver_flags(&dev->dev, DPM_FLAG_NO_DIRECT_COMPLETE | DPM_FLAG_SMART_SUSPEND); if (pci_bridge_d3_possible(dev)) { diff --git a/include/linux/pm.h b/include/linux/pm.h index 8c59a7f0bcf4..cdb8fbd6ab18 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -544,7 +544,7 @@ struct pm_subsys_data { * These flags can be set by device drivers at the probe time. They need not be * cleared by the drivers as the driver core will take care of that. * - * NEVER_SKIP: Do not skip all system suspend/resume callbacks for the device. + * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. @@ -554,7 +554,7 @@ struct pm_subsys_data { * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in * other words, the system suspend/resume callbacks can only be skipped for the * device if its driver doesn't object against that). This flag has no effect - * if NEVER_SKIP is set. + * if NO_DIRECT_COMPLETE is set. * * Setting SMART_SUSPEND instructs bus types and PM domains which may want to * runtime resume the device upfront during system suspend that doing so is not @@ -565,7 +565,7 @@ struct pm_subsys_data { * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the * driver prefers the device to be left in suspend after system resume. */ -#define DPM_FLAG_NEVER_SKIP BIT(0) +#define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) #define DPM_FLAG_SMART_SUSPEND BIT(2) #define DPM_FLAG_LEAVE_SUSPENDED BIT(3) -- cgit v1.2.3 From 2a3f34750b8b07df42ab4b30b70e029d46e0d7f3 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:53:20 +0200 Subject: PM: sleep: core: Rename DPM_FLAG_LEAVE_SUSPENDED Rename DPM_FLAG_LEAVE_SUSPENDED to DPM_FLAG_MAY_SKIP_RESUME which matches its purpose more closely. No functional impact. Suggested-by: Alan Stern Signed-off-by: Rafael J. Wysocki Acked-by: Wolfram Sang # for I2C Acked-by: Alan Stern Acked-by: Bjorn Helgaas --- Documentation/driver-api/pm/devices.rst | 4 ++-- Documentation/power/pci.rst | 2 +- drivers/acpi/acpi_tad.c | 2 +- drivers/base/power/main.c | 2 +- drivers/i2c/busses/i2c-designware-platdrv.c | 4 ++-- include/linux/pm.h | 6 +++--- 6 files changed, 10 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 4ace0eba4506..f342c7549b4c 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -803,7 +803,7 @@ general.] However, it often is desirable to leave devices in suspend after system transitions to the working state, especially if those devices had been in runtime suspend before the preceding system-wide suspend (or analogous) -transition. Device drivers can use the ``DPM_FLAG_LEAVE_SUSPENDED`` flag to +transition. Device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to indicate to the PM core (and middle-layer code) that they prefer the specific devices handled by them to be left suspended and they have no problems with skipping their system-wide resume callbacks for this reason. Whether or not the @@ -825,7 +825,7 @@ device really can be left in suspend. For devices whose "noirq", "late" and "early" driver callbacks are invoked directly by the PM core, all of the system-wide resume callbacks are skipped if -``DPM_FLAG_LEAVE_SUSPENDED`` is set and the device is in runtime suspend during +``DPM_FLAG_MAY_SKIP_RESUME`` is set and the device is in runtime suspend during the ``suspend_noirq`` (or analogous) phase or the transition under way is a proper system suspend (rather than anything related to hibernation) and the device's wakeup settings are suitable for runtime PM (that is, it cannot diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index 9e1408121bea..f09b382b4621 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1029,7 +1029,7 @@ into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), the function will set the power.direct_complete flag for it (to make the PM core skip the subsequent "thaw" callbacks for it) and return. -Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver prefers the device to be left in suspend after system-wide transitions to the working state. This flag is checked by the PM core, but the PCI bus type informs the PM core which devices may be left in suspend from its perspective (that happens during diff --git a/drivers/acpi/acpi_tad.c b/drivers/acpi/acpi_tad.c index 33a4bcdaa4d7..7d45cce0c3c1 100644 --- a/drivers/acpi/acpi_tad.c +++ b/drivers/acpi/acpi_tad.c @@ -624,7 +624,7 @@ static int acpi_tad_probe(struct platform_device *pdev) */ device_init_wakeup(dev, true); dev_pm_set_driver_flags(dev, DPM_FLAG_SMART_SUSPEND | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); /* * The platform bus type layer tells the ACPI PM domain powers up the * device, so set the runtime PM status of it to "active". diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index dbc1e5e7346b..aaa4aaf41d27 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1247,7 +1247,7 @@ Skip: * to be skipped. */ if (atomic_read(&dev->power.usage_count) > 1 || - !(dev_pm_test_driver_flags(dev, DPM_FLAG_LEAVE_SUSPENDED) && + !(dev_pm_test_driver_flags(dev, DPM_FLAG_MAY_SKIP_RESUME) && dev->power.may_skip_resume)) dev->power.must_resume = true; diff --git a/drivers/i2c/busses/i2c-designware-platdrv.c b/drivers/i2c/busses/i2c-designware-platdrv.c index 5536673060cc..c429d664f655 100644 --- a/drivers/i2c/busses/i2c-designware-platdrv.c +++ b/drivers/i2c/busses/i2c-designware-platdrv.c @@ -357,12 +357,12 @@ static int dw_i2c_plat_probe(struct platform_device *pdev) if (dev->flags & ACCESS_NO_IRQ_SUSPEND) { dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_SMART_PREPARE | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); } else { dev_pm_set_driver_flags(&pdev->dev, DPM_FLAG_SMART_PREPARE | DPM_FLAG_SMART_SUSPEND | - DPM_FLAG_LEAVE_SUSPENDED); + DPM_FLAG_MAY_SKIP_RESUME); } /* The code below assumes runtime PM to be disabled. */ diff --git a/include/linux/pm.h b/include/linux/pm.h index cdb8fbd6ab18..35796fc49e7a 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -547,7 +547,7 @@ struct pm_subsys_data { * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. * SMART_PREPARE: Check the return value of the driver's ->prepare callback. * SMART_SUSPEND: No need to resume the device from runtime suspend. - * LEAVE_SUSPENDED: Avoid resuming the device during system resume if possible. + * MAY_SKIP_RESUME: Avoid resuming the device during system resume if possible. * * Setting SMART_PREPARE instructs bus types and PM domains which may want * system suspend/resume callbacks to be skipped for the device to return 0 from @@ -562,13 +562,13 @@ struct pm_subsys_data { * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by * the driver if they decide to leave the device in runtime suspend. * - * Setting LEAVE_SUSPENDED informs the PM core and middle-layer code that the + * Setting MAY_SKIP_RESUME informs the PM core and middle-layer code that the * driver prefers the device to be left in suspend after system resume. */ #define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) #define DPM_FLAG_SMART_SUSPEND BIT(2) -#define DPM_FLAG_LEAVE_SUSPENDED BIT(3) +#define DPM_FLAG_MAY_SKIP_RESUME BIT(3) struct dev_pm_info { pm_message_t power_state; -- cgit v1.2.3 From 2fff3f73e8c27801b84d2315e1a49bce96b00eff Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 18 Apr 2020 18:55:32 +0200 Subject: Documentation: PM: sleep: Update driver flags documentation Update the documentation of the driver flags for system-wide power management to reflect the current code flows and be more consistent. Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst | 139 +++++++++++++++++++++----------- Documentation/power/pci.rst | 43 +++++----- include/linux/pm.h | 24 ++---- 3 files changed, 119 insertions(+), 87 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index f342c7549b4c..782cb37073a3 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -772,62 +772,107 @@ the state of devices (possibly except for resuming them from runtime suspend) from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). +.. _smart_suspend_flag: + +The ``DPM_FLAG_SMART_SUSPEND`` Driver Flag +------------------------------------------ + Some bus types and PM domains have a policy to resume all devices from runtime suspend upfront in their ``->suspend`` callbacks, but that may not be really necessary if the driver of the device can cope with runtime-suspended devices. The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in -:c:member:`power.driver_flags` at the probe time, by passing it to the -:c:func:`dev_pm_set_driver_flags` helper. That also may cause middle-layer code +:c:member:`power.driver_flags` at the probe time with the help of the +:c:func:`dev_pm_set_driver_flags` helper routine. + +However, setting that flag also causes the PM core and middle-layer code (bus types, PM domains etc.) to skip the ``->suspend_late`` and ``->suspend_noirq`` callbacks provided by the driver if the device remains in -runtime suspend at the beginning of the ``suspend_late`` phase of system-wide -suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM -has been disabled for it, under the assumption that its state should not change -after that point until the system-wide transition is over (the PM core itself -does that for devices whose "noirq", "late" and "early" system-wide PM callbacks -are executed directly by it). If that happens, the driver's system-wide resume -callbacks, if present, may still be invoked during the subsequent system-wide -resume transition and the device's runtime power management status may be set -to "active" before enabling runtime PM for it, so the driver must be prepared to -cope with the invocation of its system-wide resume callbacks back-to-back with -its ``->runtime_suspend`` one (without the intervening ``->runtime_resume`` and -so on) and the final state of the device must reflect the "active" runtime PM -status in that case. +runtime suspend during the ``suspend_late`` phase of system-wide suspend (or +during the ``poweroff_late`` or ``freeze_late`` phase of hibernation), +after runtime PM was disabled for it. [Without doing that, the same driver +callback might be executed twice in a row for the same device, which would not +be valid in general.] If the middle-layer system-wide PM callbacks are present +for the device, they are responsible for doing the above, and the PM core takes +care of it otherwise. + +In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_late`` +and ``->thaw_noirq`` callbacks are skipped if the device remained in runtime +suspend during the preceding "freeze" transition related to hibernation. +Again, if the middle-layer callbacks are present for the device, they are +responsible for doing that, or the PM core takes care of it otherwise. + + +The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag +-------------------------------------------- During system-wide resume from a sleep state it's easiest to put devices into the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`. [Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in -general.] - -However, it often is desirable to leave devices in suspend after system -transitions to the working state, especially if those devices had been in +general.] However, it often is desirable to leave devices in suspend after +system transitions to the working state, especially if those devices had been in runtime suspend before the preceding system-wide suspend (or analogous) -transition. Device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to -indicate to the PM core (and middle-layer code) that they prefer the specific -devices handled by them to be left suspended and they have no problems with -skipping their system-wide resume callbacks for this reason. Whether or not the -devices will actually be left in suspend may depend on their state before the -given system suspend-resume cycle and on the type of the system transition under -way. In particular, devices are not left suspended if that transition is a -restore from hibernation, as device states are not guaranteed to be reflected -by the information stored in the hibernation image in that case. - -The middle-layer code involved in the handling of the device is expected to -indicate to the PM core if the device may be left in suspend by setting its -:c:member:`power.may_skip_resume` status bit which is checked by the PM core -during the "noirq" phase of the preceding system-wide suspend (or analogous) -transition. The middle layer is then responsible for handling the device as -appropriate in its "noirq" resume callback, which is executed regardless of -whether or not the device is left suspended, but the other resume callbacks -(except for ``->complete``) will be skipped automatically by the PM core if the -device really can be left in suspend. - -For devices whose "noirq", "late" and "early" driver callbacks are invoked -directly by the PM core, all of the system-wide resume callbacks are skipped if -``DPM_FLAG_MAY_SKIP_RESUME`` is set and the device is in runtime suspend during -the ``suspend_noirq`` (or analogous) phase or the transition under way is a -proper system suspend (rather than anything related to hibernation) and the -device's wakeup settings are suitable for runtime PM (that is, it cannot -generate wakeup signals at all or it is allowed to wake up the system from -sleep). +transition. + +To that end, device drivers can use the ``DPM_FLAG_MAY_SKIP_RESUME`` flag to +indicate to the PM core and middle-layer code that they allow their "noirq" and +"early" resume callbacks to be skipped if the device can be left in suspend +after system-wide PM transitions to the working state. Whether or not that is +the case generally depends on the state of the device before the given system +suspend-resume cycle and on the type of the system transition under way. +In particular, the "restore" and "thaw" transitions related to hibernation are +not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All devices are always +resumed during the "restore" transition and whether or not any driver callbacks +are skipped during the "freeze" transition depends whether or not the +``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_).] + +The ``DPM_FLAG_MAY_SKIP_RESUME`` flag is taken into account in combination with +the :c:member:`power.may_skip_resume` status bit set by the PM core during the +"suspend" phase of suspend-type transitions. If the driver or the middle layer +has a reason to prevent the driver's "noirq" and "early" resume callbacks from +being skipped during the subsequent resume transition of the system, it should +clear :c:member:`power.may_skip_resume` in its ``->suspend``, ``->suspend_late`` +or ``->suspend_noirq`` callback. [Note that the drivers setting +``DPM_FLAG_SMART_SUSPEND`` need to clear :c:member:`power.may_skip_resume` in +their ``->suspend`` callback in case the other two are skipped.] + +Setting the :c:member:`power.may_skip_resume` status bit along with the +``DPM_FLAG_MAY_SKIP_RESUME`` flag is necessary, but generally not sufficient, +for the driver's "noirq" and "early" resume callbacks to be skipped. Whether or +not they should be skipped can be determined by evaluating the +:c:func:`dev_pm_skip_resume` helper function. + +If that function returns ``true``, the driver's "noirq" and "early" resume +callbacks should be skipped and the device's runtime PM status will be set to +"suspended" by the PM core. Otherwise, if the device was runtime-suspended +during the preceding system-wide suspend transition and +``DPM_FLAG_SMART_SUSPEND`` is set for it, its runtime PM status will be set to +"active" by the PM core. [Hence, the drivers that do not set +``DPM_FLAG_SMART_SUSPEND`` should not expect the runtime PM status of their +devices to be changed from "suspended" to "active" by the PM core during +system-wide resume-type transitions.] + +If the ``DPM_FLAG_MAY_SKIP_RESUME`` flag is not set for a device, but +``DPM_FLAG_SMART_SUSPEND`` is set and the driver's "late" and "noirq" suspend +callbacks are skipped, its system-wide "noirq" and "early" resume callbacks, if +present, are invoked as usual and the device's runtime PM status is set to +"active" by the PM core before enabling runtime PM for it. In that case, the +driver must be prepared to cope with the invocation of its system-wide resume +callbacks back-to-back with its ``->runtime_suspend`` one (without the +intervening ``->runtime_resume`` and system-wide suspend callbacks) and the +final state of the device must reflect the "active" runtime PM status in that +case. [Note that this is not a problem at all if the driver's +``->suspend_late`` callback pointer points to the same function as its +``->runtime_suspend`` one and its ``->resume_early`` callback pointer points to +the same function as the ``->runtime_resume`` one, while none of the other +system-wide suspend-resume callbacks of the driver are present, for example.] + +Likewise, if ``DPM_FLAG_MAY_SKIP_RESUME`` is set for a device, its driver's +system-wide "noirq" and "early" resume callbacks may be skipped while its "late" +and "noirq" suspend callbacks may have been executed (in principle, regardless +of whether or not ``DPM_FLAG_SMART_SUSPEND`` is set). In that case, the driver +needs to be able to cope with the invocation of its ``->runtime_resume`` +callback back-to-back with its "late" and "noirq" suspend ones. [For instance, +that is not a concern if the driver sets both ``DPM_FLAG_SMART_SUSPEND`` and +``DPM_FLAG_MAY_SKIP_RESUME`` and uses the same pair of suspend/resume callback +functions for runtime PM and system-wide suspend/resume.] diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst index f09b382b4621..1831e431f725 100644 --- a/Documentation/power/pci.rst +++ b/Documentation/power/pci.rst @@ -1010,32 +1010,33 @@ if the device is in runtime suspend when the system suspend starts. That also affects all of the ancestors of the device, so this flag should only be used if absolutely necessary. -The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a -positive value from pci_pm_prepare() if the ->prepare callback provided by the +The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive +value from pci_pm_prepare() only if the ->prepare callback provided by the driver of the device returns a positive value. That allows the driver to opt -out from using the direct-complete mechanism dynamically. +out from using the direct-complete mechanism dynamically (whereas setting +DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out). The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's perspective the device can be safely left in runtime suspend during system suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() -to skip resuming the device from runtime suspend unless there are PCI-specific -reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), -pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early -if the device remains in runtime suspend in the beginning of the "late" phase -of the system-wide transition under way. Moreover, if the device is in -runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime -power management status will be changed to "active" (as it is going to be put -into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), -the function will set the power.direct_complete flag for it (to make the PM core -skip the subsequent "thaw" callbacks for it) and return. - -Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver prefers the -device to be left in suspend after system-wide transitions to the working state. -This flag is checked by the PM core, but the PCI bus type informs the PM core -which devices may be left in suspend from its perspective (that happens during -the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() and pci_pm_resume_early() upfront. +to avoid resuming the device from runtime suspend unless there are PCI-specific +reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and +pci_pm_poweroff_late/noirq() to return early if the device remains in runtime +suspend during the "late" phase of the system-wide transition under way. +Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or +pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it +is going to be put into D0 going forward). + +Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its +"noirq" and "early" resume callbacks to be skipped if the device can be left +in suspend after a system-wide transition into the working state. This flag is +taken into consideration by the PM core along with the power.may_skip_resume +status bit of the device which is set by pci_pm_suspend_noirq() in certain +situations. If the PM core determines that the driver's "noirq" and "early" +resume callbacks should be skipped, the dev_pm_skip_resume() helper function +will return "true" and that will cause pci_pm_resume_noirq() and +pci_pm_resume_early() to return upfront without touching the device and +executing the driver callbacks. 3.2. Device Runtime Power Management ------------------------------------ diff --git a/include/linux/pm.h b/include/linux/pm.h index 35796fc49e7a..121c104a4090 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -545,25 +545,11 @@ struct pm_subsys_data { * cleared by the drivers as the driver core will take care of that. * * NO_DIRECT_COMPLETE: Do not apply direct-complete optimization to the device. - * SMART_PREPARE: Check the return value of the driver's ->prepare callback. - * SMART_SUSPEND: No need to resume the device from runtime suspend. - * MAY_SKIP_RESUME: Avoid resuming the device during system resume if possible. - * - * Setting SMART_PREPARE instructs bus types and PM domains which may want - * system suspend/resume callbacks to be skipped for the device to return 0 from - * their ->prepare callbacks if the driver's ->prepare callback returns 0 (in - * other words, the system suspend/resume callbacks can only be skipped for the - * device if its driver doesn't object against that). This flag has no effect - * if NO_DIRECT_COMPLETE is set. - * - * Setting SMART_SUSPEND instructs bus types and PM domains which may want to - * runtime resume the device upfront during system suspend that doing so is not - * necessary from the driver's perspective. It also may cause them to skip - * invocations of the ->suspend_late and ->suspend_noirq callbacks provided by - * the driver if they decide to leave the device in runtime suspend. - * - * Setting MAY_SKIP_RESUME informs the PM core and middle-layer code that the - * driver prefers the device to be left in suspend after system resume. + * SMART_PREPARE: Take the driver ->prepare callback return value into account. + * SMART_SUSPEND: Avoid resuming the device from runtime suspend. + * MAY_SKIP_RESUME: Allow driver "noirq" and "early" callbacks to be skipped. + * + * See Documentation/driver-api/pm/devices.rst for details. */ #define DPM_FLAG_NO_DIRECT_COMPLETE BIT(0) #define DPM_FLAG_SMART_PREPARE BIT(1) -- cgit v1.2.3 From 598cc93005636e32a98ed003b0158329827c0ccb Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Sat, 25 Apr 2020 16:35:40 -0400 Subject: PM: sleep: Helpful edits for devices.rst documentation Here are some minor edits of the devices.rst documentation file, intended to improve the clarity and add a couple of missing details. Signed-off-by: Alan Stern Signed-off-by: Rafael J. Wysocki --- Documentation/driver-api/pm/devices.rst | 94 ++++++++++++++++++--------------- 1 file changed, 52 insertions(+), 42 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 782cb37073a3..946ad0b94e31 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -349,7 +349,7 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. PM core will skip the ``suspend``, ``suspend_late`` and ``suspend_noirq`` phases as well as all of the corresponding phases of the subsequent device resume for all of these devices. In that case, - the ``->complete`` callback will be invoked directly after the + the ``->complete`` callback will be the next one invoked after the ``->prepare`` callback and is entirely responsible for putting the device into a consistent state as appropriate. @@ -383,11 +383,15 @@ the phases are: ``prepare``, ``suspend``, ``suspend_late``, ``suspend_noirq``. ``->suspend`` methods provided by subsystems (bus types and PM domains in particular) must follow an additional rule regarding what can be done to the devices before their drivers' ``->suspend`` methods are called. - Namely, they can only resume the devices from runtime suspend by - calling :c:func:`pm_runtime_resume` for them, if that is necessary, and + Namely, they may resume the devices from runtime suspend by + calling :c:func:`pm_runtime_resume` for them, if that is necessary, but they must not update the state of the devices in any other way at that time (in case the drivers need to resume the devices from runtime - suspend in their ``->suspend`` methods). + suspend in their ``->suspend`` methods). In fact, the PM core prevents + subsystems or drivers from putting devices into runtime suspend at + these times by calling :c:func:`pm_runtime_get_noresume` before issuing + the ``->prepare`` callback (and calling :c:func:`pm_runtime_put` after + issuing the ``->complete`` callback). 3. For a number of devices it is convenient to split suspend into the "quiesce device" and "save device state" phases, in which cases @@ -459,22 +463,22 @@ When resuming from freeze, standby or memory sleep, the phases are: Note, however, that new children may be registered below the device as soon as the ``->resume`` callbacks occur; it's not necessary to wait - until the ``complete`` phase with that. + until the ``complete`` phase runs. Moreover, if the preceding ``->prepare`` callback returned a positive number, the device may have been left in runtime suspend throughout the - whole system suspend and resume (the ``suspend``, ``suspend_late``, - ``suspend_noirq`` phases of system suspend and the ``resume_noirq``, - ``resume_early``, ``resume`` phases of system resume may have been - skipped for it). In that case, the ``->complete`` callback is entirely + whole system suspend and resume (its ``->suspend``, ``->suspend_late``, + ``->suspend_noirq``, ``->resume_noirq``, + ``->resume_early``, and ``->resume`` callbacks may have been + skipped). In that case, the ``->complete`` callback is entirely responsible for putting the device into a consistent state after system suspend if necessary. [For example, it may need to queue up a runtime resume request for the device for this purpose.] To check if that is the case, the ``->complete`` callback can consult the device's - ``power.direct_complete`` flag. Namely, if that flag is set when the - ``->complete`` callback is being run, it has been called directly after - the preceding ``->prepare`` and special actions may be required - to make the device work correctly afterward. + ``power.direct_complete`` flag. If that flag is set when the + ``->complete`` callback is being run then the direct-complete mechanism + was used, and special actions may be required to make the device work + correctly afterward. At the end of these phases, drivers should be as functional as they were before suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are @@ -575,10 +579,12 @@ and the phases are similar. The ``->poweroff``, ``->poweroff_late`` and ``->poweroff_noirq`` callbacks should do essentially the same things as the ``->suspend``, ``->suspend_late`` -and ``->suspend_noirq`` callbacks, respectively. The only notable difference is +and ``->suspend_noirq`` callbacks, respectively. A notable difference is that they need not store the device register values, because the registers should already have been stored during the ``freeze``, ``freeze_late`` or -``freeze_noirq`` phases. +``freeze_noirq`` phases. Also, on many machines the firmware will power-down +the entire system, so it is not necessary for the callback to put the device in +a low-power state. Leaving Hibernation @@ -764,11 +770,10 @@ device driver in question. If it is necessary to resume a device from runtime suspend during a system-wide transition into a sleep state, that can be done by calling -:c:func:`pm_runtime_resume` for it from the ``->suspend`` callback (or its -couterpart for transitions related to hibernation) of either the device's driver -or a subsystem responsible for it (for example, a bus type or a PM domain). -That is guaranteed to work by the requirement that subsystems must not change -the state of devices (possibly except for resuming them from runtime suspend) +:c:func:`pm_runtime_resume` from the ``->suspend`` callback (or the ``->freeze`` +or ``->poweroff`` callback for transitions related to hibernation) of either the +device's driver or its subsystem (for example, a bus type or a PM domain). +However, subsystems must not otherwise change the runtime status of devices from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before* invoking device drivers' ``->suspend`` callbacks (or equivalent). @@ -779,27 +784,29 @@ The ``DPM_FLAG_SMART_SUSPEND`` Driver Flag Some bus types and PM domains have a policy to resume all devices from runtime suspend upfront in their ``->suspend`` callbacks, but that may not be really -necessary if the driver of the device can cope with runtime-suspended devices. -The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in -:c:member:`power.driver_flags` at the probe time with the help of the +necessary if the device's driver can cope with runtime-suspended devices. +The driver can indicate this by setting ``DPM_FLAG_SMART_SUSPEND`` in +:c:member:`power.driver_flags` at probe time, with the assistance of the :c:func:`dev_pm_set_driver_flags` helper routine. -However, setting that flag also causes the PM core and middle-layer code +Setting that flag causes the PM core and middle-layer code (bus types, PM domains etc.) to skip the ``->suspend_late`` and ``->suspend_noirq`` callbacks provided by the driver if the device remains in -runtime suspend during the ``suspend_late`` phase of system-wide suspend (or -during the ``poweroff_late`` or ``freeze_late`` phase of hibernation), -after runtime PM was disabled for it. [Without doing that, the same driver +runtime suspend throughout those phases of the system-wide suspend (and +similarly for the "freeze" and "poweroff" parts of system hibernation). +[Otherwise the same driver callback might be executed twice in a row for the same device, which would not be valid in general.] If the middle-layer system-wide PM callbacks are present -for the device, they are responsible for doing the above, and the PM core takes -care of it otherwise. +for the device then they are responsible for skipping these driver callbacks; +if not then the PM core skips them. The subsystem callback routines can +determine whether they need to skip the driver callbacks by testing the return +value from the :c:func:`dev_pm_skip_suspend` helper function. -In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_late`` -and ``->thaw_noirq`` callbacks are skipped if the device remained in runtime -suspend during the preceding "freeze" transition related to hibernation. -Again, if the middle-layer callbacks are present for the device, they are -responsible for doing that, or the PM core takes care of it otherwise. +In addition, with ``DPM_FLAG_SMART_SUSPEND`` set, the driver's ``->thaw_noirq`` +and ``->thaw_early`` callbacks are skipped in hibernation if the device remained +in runtime suspend throughout the preceding "freeze" transition. Again, if the +middle-layer callbacks are present for the device, they are responsible for +doing this, otherwise the PM core takes care of it. The ``DPM_FLAG_MAY_SKIP_RESUME`` Driver Flag @@ -820,17 +827,20 @@ indicate to the PM core and middle-layer code that they allow their "noirq" and after system-wide PM transitions to the working state. Whether or not that is the case generally depends on the state of the device before the given system suspend-resume cycle and on the type of the system transition under way. -In particular, the "restore" and "thaw" transitions related to hibernation are -not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All devices are always -resumed during the "restore" transition and whether or not any driver callbacks -are skipped during the "freeze" transition depends whether or not the -``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_).] +In particular, the "thaw" and "restore" transitions related to hibernation are +not affected by ``DPM_FLAG_MAY_SKIP_RESUME`` at all. [All callbacks are +issued during the "restore" transition regardless of the flag settings, +and whether or not any driver callbacks +are skipped during the "thaw" transition depends whether or not the +``DPM_FLAG_SMART_SUSPEND`` flag is set (see `above `_). +In addition, a device is not allowed to remain in runtime suspend if any of its +children will be returned to full power.] The ``DPM_FLAG_MAY_SKIP_RESUME`` flag is taken into account in combination with the :c:member:`power.may_skip_resume` status bit set by the PM core during the "suspend" phase of suspend-type transitions. If the driver or the middle layer has a reason to prevent the driver's "noirq" and "early" resume callbacks from -being skipped during the subsequent resume transition of the system, it should +being skipped during the subsequent system resume transition, it should clear :c:member:`power.may_skip_resume` in its ``->suspend``, ``->suspend_late`` or ``->suspend_noirq`` callback. [Note that the drivers setting ``DPM_FLAG_SMART_SUSPEND`` need to clear :c:member:`power.may_skip_resume` in @@ -845,8 +855,8 @@ not they should be skipped can be determined by evaluating the If that function returns ``true``, the driver's "noirq" and "early" resume callbacks should be skipped and the device's runtime PM status will be set to "suspended" by the PM core. Otherwise, if the device was runtime-suspended -during the preceding system-wide suspend transition and -``DPM_FLAG_SMART_SUSPEND`` is set for it, its runtime PM status will be set to +during the preceding system-wide suspend transition and its +``DPM_FLAG_SMART_SUSPEND`` is set, its runtime PM status will be set to "active" by the PM core. [Hence, the drivers that do not set ``DPM_FLAG_SMART_SUSPEND`` should not expect the runtime PM status of their devices to be changed from "suspended" to "active" by the PM core during -- cgit v1.2.3 From 7395683a2498c7000120cdee8e4fb0c632e5561b Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:24 +0800 Subject: Documentation: cpuidle: update the document Update the document after the remove of cpuidle_sysfs_switch. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Signed-off-by: Rafael J. Wysocki --- Documentation/ABI/testing/sysfs-devices-system-cpu | 24 ++++++++-------------- Documentation/admin-guide/pm/cpuidle.rst | 20 ++++++++---------- Documentation/driver-api/pm/cpuidle.rst | 5 ++--- 3 files changed, 20 insertions(+), 29 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2e0e3b45d02a..6b5dafab950c 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -106,10 +106,10 @@ Description: CPU topology files that describe a logical CPU's relationship See Documentation/admin-guide/cputopology.rst for more information. -What: /sys/devices/system/cpu/cpuidle/current_driver - /sys/devices/system/cpu/cpuidle/current_governer_ro - /sys/devices/system/cpu/cpuidle/available_governors +What: /sys/devices/system/cpu/cpuidle/available_governors + /sys/devices/system/cpu/cpuidle/current_driver /sys/devices/system/cpu/cpuidle/current_governor + /sys/devices/system/cpu/cpuidle/current_governer_ro Date: September 2007 Contact: Linux kernel mailing list Description: Discover cpuidle policy and mechanism @@ -119,24 +119,18 @@ Description: Discover cpuidle policy and mechanism consumption during idle. Idle policy (governor) is differentiated from idle mechanism - (driver) - - current_driver: (RO) displays current idle mechanism - - current_governor_ro: (RO) displays current idle policy - - With the cpuidle_sysfs_switch boot option enabled (meant for - developer testing), the following three attributes are visible - instead: - - current_driver: same as described above + (driver). available_governors: (RO) displays a space separated list of - available governors + available governors. + + current_driver: (RO) displays current idle mechanism. current_governor: (RW) displays current idle policy. Users can switch the governor at runtime by writing to this file. + current_governor_ro: (RO) displays current idle policy. + See Documentation/admin-guide/pm/cpuidle.rst and Documentation/driver-api/pm/cpuidle.rst for more information. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 5605cc6f9560..a96a423e3779 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -159,17 +159,15 @@ governor uses that information depends on what algorithm is implemented by it and that is the primary reason for having more than one governor in the ``CPUIdle`` subsystem. -There are three ``CPUIdle`` governors available, ``menu``, `TEO `_ -and ``ladder``. Which of them is used by default depends on the configuration -of the kernel and in particular on whether or not the scheduler tick can be -`stopped by the idle loop `_. It is possible to change the -governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has -been passed to the kernel, but that is not safe in general, so it should not be -done on production systems (that may change in the future, though). The name of -the ``CPUIdle`` governor currently used by the kernel can be read from the -:file:`current_governor_ro` (or :file:`current_governor` if -``cpuidle_sysfs_switch`` is present in the kernel command line) file under -:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``. +There are four ``CPUIdle`` governors available, ``menu``, `TEO `_, +``ladder`` and ``haltpoll``. Which of them is used by default depends on the +configuration of the kernel and in particular on whether or not the scheduler +tick can be `stopped by the idle loop `_. Available +governors can be read from the :file:`available_governors`, and the governor +can be changed at runtime. The name of the ``CPUIdle`` governor currently +used by the kernel can be read from the :file:`current_governor_ro` or +:file:`current_governor` file under :file:`/sys/devices/system/cpu/cpuidle/` +in ``sysfs``. Which ``CPUIdle`` driver is used, on the other hand, usually depends on the platform the kernel is running on, but there are platforms with more than one diff --git a/Documentation/driver-api/pm/cpuidle.rst b/Documentation/driver-api/pm/cpuidle.rst index 006cf6db40c6..3588bf078566 100644 --- a/Documentation/driver-api/pm/cpuidle.rst +++ b/Documentation/driver-api/pm/cpuidle.rst @@ -68,9 +68,8 @@ only one in the list (that is, the list was empty before) or the value of its governor currently in use, or the name of the new governor was passed to the kernel as the value of the ``cpuidle.governor=`` command line parameter, the new governor will be used from that point on (there can be only one ``CPUIdle`` -governor in use at a time). Also, if ``cpuidle_sysfs_switch`` is passed to the -kernel in the command line, user space can choose the ``CPUIdle`` governor to -use at run time via ``sysfs``. +governor in use at a time). Also, user space can choose the ``CPUIdle`` +governor to use at run time via ``sysfs``. Once registered, ``CPUIdle`` governors cannot be unregistered, so it is not practical to put them into loadable kernel modules. -- cgit v1.2.3 From a0bd8a2780fab2c8008e128e8a55995d8923e638 Mon Sep 17 00:00:00 2001 From: Hanjun Guo Date: Tue, 19 May 2020 14:25:25 +0800 Subject: Documentation: ABI: make current_governer_ro as a candidate for removal Since both current_governor and current_governor_ro co-exist under /sys/devices/system/cpu/cpuidle/ file, and it's duplicate, make current_governer_ro as a candidate for removal. Signed-off-by: Hanjun Guo Reviewed-by: Doug Smythies Acked-by: Daniel Lezcano Signed-off-by: Rafael J. Wysocki --- Documentation/ABI/obsolete/sysfs-cpuidle | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 Documentation/ABI/obsolete/sysfs-cpuidle (limited to 'Documentation') diff --git a/Documentation/ABI/obsolete/sysfs-cpuidle b/Documentation/ABI/obsolete/sysfs-cpuidle new file mode 100644 index 000000000000..e398fb5e542f --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-cpuidle @@ -0,0 +1,9 @@ +What: /sys/devices/system/cpu/cpuidle/current_governor_ro +Date: April, 2020 +Contact: linux-pm@vger.kernel.org +Description: + current_governor_ro shows current using cpuidle governor, but read only. + with the update that cpuidle governor can be changed at runtime in default, + both current_governor and current_governor_ro co-exist under + /sys/devices/system/cpu/cpuidle/ file, it's duplicate so make + current_governor_ro obselete. -- cgit v1.2.3 From 213081dadd308d5d41a5110c2dc87fd8ba42de5e Mon Sep 17 00:00:00 2001 From: Srinivas Pandruvada Date: Mon, 11 May 2020 16:57:31 -0700 Subject: Documentation: admin-guide: pm: Document intel-speed-select Added documentation to configure servers to use Intel(R) Speed Select Technology using intel-speed-select tool. Signed-off-by: Srinivas Pandruvada Acked-by: Andriy Shevchenko Signed-off-by: Rafael J. Wysocki --- .../admin-guide/pm/intel-speed-select.rst | 917 +++++++++++++++++++++ Documentation/admin-guide/pm/working-state.rst | 1 + 2 files changed, 918 insertions(+) create mode 100644 Documentation/admin-guide/pm/intel-speed-select.rst (limited to 'Documentation') diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst new file mode 100644 index 000000000000..b2ca601c21c6 --- /dev/null +++ b/Documentation/admin-guide/pm/intel-speed-select.rst @@ -0,0 +1,917 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Intel(R) Speed Select Technology User Guide +============================================================ + +The Intel(R) Speed Select Technology (Intel(R) SST) provides a powerful new +collection of features that give more granular control over CPU performance. +With Intel(R) SST, one server can be configured for power and performance for a +variety of diverse workload requirements. + +Refer to the links below for an overview of the technology: + +- https://www.intel.com/content/www/us/en/architecture-and-technology/speed-select-technology-article.html +- https://builders.intel.com/docs/networkbuilders/intel-speed-select-technology-base-frequency-enhancing-performance.pdf + +These capabilities are further enhanced in some of the newer generations of +server platforms where these features can be enumerated and controlled +dynamically without pre-configuring via BIOS setup options. This dynamic +configuration is done via mailbox commands to the hardware. One way to enumerate +and configure these features is by using the Intel Speed Select utility. + +This document explains how to use the Intel Speed Select tool to enumerate and +control Intel(R) SST features. This document gives example commands and explains +how these commands change the power and performance profile of the system under +test. Using this tool as an example, customers can replicate the messaging +implemented in the tool in their production software. + +intel-speed-select configuration tool +====================================== + +Most Linux distribution packages may include the "intel-speed-select" tool. If not, +it can be built by downloading the Linux kernel tree from kernel.org. Once +downloaded, the tool can be built without building the full kernel. + +From the kernel tree, run the following commands:: + +# cd tools/power/x86/intel-speed-select/ +# make +# make install + +Getting Help +------------ + +To get help with the tool, execute the command below:: + +# intel-speed-select --help + +The top-level help describes arguments and features. Notice that there is a +multi-level help structure in the tool. For example, to get help for the feature "perf-profile":: + +# intel-speed-select perf-profile --help + +To get help on a command, another level of help is provided. For example for the command info "info":: + +# intel-speed-select perf-profile info --help + +Summary of platform capability +------------------------------ +To check the current platform and driver capaibilities, execute:: + +#intel-speed-select --info + +For example on a test system:: + + # intel-speed-select --info + Intel(R) Speed Select Technology + Executing on CPU model: X + Platform: API version : 1 + Platform: Driver version : 1 + Platform: mbox supported : 1 + Platform: mmio supported : 1 + Intel(R) SST-PP (feature perf-profile) is supported + TDP level change control is unlocked, max level: 4 + Intel(R) SST-TF (feature turbo-freq) is supported + Intel(R) SST-BF (feature base-freq) is not supported + Intel(R) SST-CP (feature core-power) is supported + +Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) +------------------------------------------------------------------------ + +This feature allows configuration of a server dynamically based on workload +performance requirements. This helps users during deployment as they do not have +to choose a specific server configuration statically. This Intel(R) Speed Select +Technology - Performance Profile (Intel(R) SST-PP) feature introduces a mechanism +that allows multiple optimized performance profiles per system. Each profile +defines a set of CPUs that need to be online and rest offline to sustain a +guaranteed base frequency. Once the user issues a command to use a specific +performance profile and meet CPU online/offline requirement, the user can expect +a change in the base frequency dynamically. This feature is called +"perf-profile" when using the Intel Speed Select tool. + +Number or performance levels +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There can be multiple performance profiles on a system. To get the number of +profiles, execute the command below:: + + # intel-speed-select perf-profile get-config-levels + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-config-levels:4 + package-1 + die-0 + cpu-14 + get-config-levels:4 + +On this system under test, there are 4 performance profiles in addition to the +base performance profile (which is performance level 0). + +Lock/Unlock status +~~~~~~~~~~~~~~~~~~ + +Even if there are multiple performance profiles, it is possible that that they +are locked. If they are locked, users cannot issue a command to change the +performance state. It is possible that there is a BIOS setup to unlock or check +with your system vendor. + +To check if the system is locked, execute the following command:: + + # intel-speed-select perf-profile get-lock-status + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-lock-status:0 + package-1 + die-0 + cpu-14 + get-lock-status:0 + +In this case, lock status is 0, which means that the system is unlocked. + +Properties of a performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get properties of a specific performance level (For example for the level 0, below), execute the command below:: + + # intel-speed-select perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + cpu-count:28 + enable-cpu-mask:000003ff,f0003fff + enable-cpu-list:0,1,2,3,4,5,6,7,8,9,10,11,12,13,28,29,30,31,32,33,34,35,36,37,38,39,40,41 + thermal-design-power-ratio:26 + base-frequency(MHz):2600 + speed-select-turbo-freq:disabled + speed-select-base-freq:disabled + ... + ... + +Here -l option is used to specify a performance level. + +If the option -l is omitted, then this command will print information about all +the performance levels. The above command is printing properties of the +performance level 0. + +For this performance profile, the list of CPUs displayed by the +"enable-cpu-mask/enable-cpu-list" at the max can be "online." When that +condition is met, then base frequency of 2600 MHz can be maintained. To +understand more, execute "intel-speed-select perf-profile info" for performance +level 4:: + + # intel-speed-select perf-profile info -l 4 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-4 + cpu-count:28 + enable-cpu-mask:000000fa,f0000faf + enable-cpu-list:0,1,2,3,5,7,8,9,10,11,28,29,30,31,33,35,36,37,38,39 + thermal-design-power-ratio:28 + base-frequency(MHz):2800 + speed-select-turbo-freq:disabled + speed-select-base-freq:unsupported + ... + ... + +There are fewer CPUs in the "enable-cpu-mask/enable-cpu-list". Consequently, if +the user only keeps these CPUs online and the rest "offline," then the base +frequency is increased to 2.8 GHz compared to 2.6 GHz at performance level 0. + +Get current performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the current performance level, execute:: + + # intel-speed-select perf-profile get-config-current-level + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + get-config-current_level:0 + +First verify that the base_frequency displayed by the cpufreq sysfs is correct:: + + # cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency + 2600000 + +This matches the base-frequency (MHz) field value displayed from the +"perf-profile info" command for performance level 0(cpufreq frequency is in +KHz). + +To check if the average frequency is equal to the base frequency for a 100% busy +workload, disable turbo:: + +# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo + +Then runs a busy workload on all CPUs, for example:: + +#stress -c 64 + +To verify the base frequency, run turbostat:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + + Package Core CPU Bzy_MHz + - - 2600 + 0 0 0 2600 + 0 1 1 2600 + 0 2 2 2600 + 0 3 3 2600 + 0 4 4 2600 + . . . . + + +Changing performance level +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To the change the performance level to 4, execute:: + + # intel-speed-select -d perf-profile set-config-level -l 4 -o + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile + set_tdp_level:success + +In the command above, "-o" is optional. If it is specified, then it will also +offline CPUs which are not present in the enable_cpu_mask for this performance +level. + +Now if the base_frequency is checked:: + + #cat /sys/devices/system/cpu/cpu0/cpufreq/base_frequency + 2800000 + +Which shows that the base frequency now increased from 2600 MHz at performance +level 0 to 2800 MHz at performance level 4. As a result, any workload, which can +use fewer CPUs, can see a boost of 200 MHz compared to performance level 0. + +Check presence of other Intel(R) SST features +--------------------------------------------- + +Each of the performance profiles also specifies weather there is support of +other two Intel(R) SST features (Intel(R) Speed Select Technology - Base Frequency +(Intel(R) SST-BF) and Intel(R) Speed Select Technology - Turbo Frequency (Intel +SST-TF)). + +For example, from the output of "perf-profile info" above, for level 0 and level +4: + +For level 0:: + speed-select-turbo-freq:disabled + speed-select-base-freq:disabled + +For level 4:: + speed-select-turbo-freq:disabled + speed-select-base-freq:unsupported + +Given these results, the "speed-select-base-freq" (Intel(R) SST-BF) in level 4 +changed from "disabled" to "unsupported" compared to performance level 0. + +This means that at performance level 4, the "speed-select-base-freq" feature is +not supported. However, at performance level 0, this feature is "supported", but +currently "disabled", meaning the user has not activated this feature. Whereas +"speed-select-turbo-freq" (Intel(R) SST-TF) is supported at both performance +levels, but currently not activated by the user. + +The Intel(R) SST-BF and the Intel(R) SST-TF features are built on a foundation +technology called Intel(R) Speed Select Technology - Core Power (Intel(R) SST-CP). +The platform firmware enables this feature when Intel(R) SST-BF or Intel(R) SST-TF +is supported on a platform. + +Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) +--------------------------------------------------------------- + +Intel(R) Speed Select Technology Core Power (Intel(R) SST-CP) is an interface that +allows users to define per core priority. This defines a mechanism to distribute +power among cores when there is a power constrained scenario. This defines a +class of service (CLOS) configuration. + +The user can configure up to 4 class of service configurations. Each CLOS group +configuration allows definitions of parameters, which affects how the frequency +can be limited and power is distributed. Each CPU core can be tied to a class of +service and hence an associated priority. The granularity is at core level not +at per CPU level. + +Enable CLOS based prioritization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To use CLOS based prioritization feature, firmware must be informed to enable +and use a priority type. There is a default per platform priority type, which +can be changed with optional command line parameter. + +To enable and check the options, execute:: + + # intel-speed-select core-power enable --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Enable core-power for a package/die + Clos Enable: Specify priority type with [--priority|-p] + 0: Proportional, 1: Ordered + +There are two types of priority types: + +- Ordered + +Priority for ordered throttling is defined based on the index of the assigned +CLOS group. Where CLOS0 gets highest priority (throttled last). + +Priority order is: +CLOS0 > CLOS1 > CLOS2 > CLOS3. + +- Proportional + +When proportional priority is used, there is an additional parameter called +frequency_weight, which can be specified per CLOS group. The goal of +proportional priority is to provide each core with the requested min., then +distribute all remaining (excess/deficit) budgets in proportion to a defined +weight. This proportional priority can be configured using "core-power config" +command. + +To enable with the platform default priority type, execute:: + + # intel-speed-select core-power enable + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + enable:success + package-1 + die-0 + cpu-6 + core-power + enable:success + +The scope of this enable is per package or die scoped when a package contains +multiple dies. To check if CLOS is enabled and get priority type, "core-power +info" command can be used. For example to check the status of core-power feature +on CPU 0, execute:: + + # intel-speed-select -c 0 core-power info + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + support-status:supported + enable-status:enabled + clos-enable-status:enabled + priority-type:proportional + package-1 + die-0 + cpu-24 + core-power + support-status:supported + enable-status:enabled + clos-enable-status:enabled + priority-type:proportional + +Configuring CLOS groups +~~~~~~~~~~~~~~~~~~~~~~~ + +Each CLOS group has its own attributes including min, max, freq_weight and +desired. These parameters can be configured with "core-power config" command. +Defaults will be used if user skips setting a parameter except clos id, which is +mandatory. To check core-power config options, execute:: + + # intel-speed-select core-power config --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Set core-power configuration for one of the four clos ids + Specify targeted clos id with [--clos|-c] + Specify clos Proportional Priority [--weight|-w] + Specify clos min in MHz with [--min|-n] + Specify clos max in MHz with [--max|-m] + +For example:: + + # intel-speed-select core-power config -c 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + clos epp is not specified, default: 0 + clos frequency weight is not specified, default: 0 + clos min is not specified, default: 0 MHz + clos max is not specified, default: 25500 MHz + clos desired is not specified, default: 0 + package-0 + die-0 + cpu-0 + core-power + config:success + package-1 + die-0 + cpu-6 + core-power + config:success + +The user has the option to change defaults. For example, the user can change the +"min" and set the base frequency to always get guaranteed base frequency. + +Get the current CLOS configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To check the current configuration, "core-power get-config" can be used. For +example, to get the configuration of CLOS 0:: + + # intel-speed-select core-power get-config -c 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + core-power + clos:0 + epp:0 + clos-proportional-priority:0 + clos-min:0 MHz + clos-max:Max Turbo frequency + clos-desired:0 MHz + package-1 + die-0 + cpu-24 + core-power + clos:0 + epp:0 + clos-proportional-priority:0 + clos-min:0 MHz + clos-max:Max Turbo frequency + clos-desired:0 MHz + +Associating a CPU with a CLOS group +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To associate a CPU to a CLOS group "core-power assoc" command can be used:: + + # intel-speed-select core-power assoc --help + Intel(R) Speed Select Technology + Executing on CPU model: X + Associate a clos id to a CPU + Specify targeted clos id with [--clos|-c] + + +For example to associate CPU 10 to CLOS group 3, execute:: + + # intel-speed-select -c 10 core-power assoc -c 3 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-10 + core-power + assoc:success + +Once a CPU is associated, its sibling CPUs are also associated to a CLOS group. +Once associated, avoid changing Linux "cpufreq" subsystem scaling frequency +limits. + +To check the existing association for a CPU, "core-power get-assoc" command can +be used. For example, to get association of CPU 10, execute:: + + # intel-speed-select -c 10 core-power get-assoc + Intel(R) Speed Select Technology + Executing on CPU model: X + package-1 + die-0 + cpu-10 + get-assoc + clos:3 + +This shows that CPU 10 is part of a CLOS group 3. + + +Disable CLOS based prioritization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To disable, execute:: + +# intel-speed-select core-power disable + +Some features like Intel(R) SST-TF can only be enabled when CLOS based prioritization +is enabled. For this reason, disabling while Intel(R) SST-TF is enabled can cause +Intel(R) SST-TF to fail. This will cause the "disable" command to display an error +if Intel(R) SST-TF is already enabled. In turn, to disable, the Intel(R) SST-TF +feature must be disabled first. + +Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) +------------------------------------------------------------------- + +The Intel(R) Speed Select Technology - Base Frequency (Intel(R) SST-BF) feature lets +the user control base frequency. If some critical workload threads demand +constant high guaranteed performance, then this feature can be used to execute +the thread at higher base frequency on specific sets of CPUs (high priority +CPUs) at the cost of lower base frequency (low priority CPUs) on other CPUs. +This feature does not require offline of the low priority CPUs. + +The support of Intel(R) SST-BF depends on the Intel(R) Speed Select Technology - +Performance Profile (Intel(R) SST-PP) performance level configuration. It is +possible that only certain performance levels support Intel(R) SST-BF. It is also +possible that only base performance level (level = 0) has support of Intel +SST-BF. Consequently, first select the desired performance level to enable this +feature. + +In the system under test here, Intel(R) SST-BF is supported at the base +performance level 0, but currently disabled. For example for the level 0:: + + # intel-speed-select -c 0 perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + + speed-select-base-freq:disabled + ... + +Before enabling Intel(R) SST-BF and measuring its impact on a workload +performance, execute some workload and measure performance and get a baseline +performance to compare against. + +Here the user wants more guaranteed performance. For this reason, it is likely +that turbo is disabled. To disable turbo, execute:: + +#echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo + +Based on the output of the "intel-speed-select perf-profile info -l 0" base +frequency of guaranteed frequency 2600 MHz. + + +Measure baseline performance for comparison +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To compare, pick a multi-threaded workload where each thread can be scheduled on +separate CPUs. "Hackbench pipe" test is a good example on how to improve +performance using Intel(R) SST-BF. + +Below, the workload is measuring average scheduler wakeup latency, so a lower +number means better performance:: + + # taskset -c 3,4 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 6.102 [sec] + 6.102445 usecs/op + 163868 ops/sec + +While running the above test, if we take turbostat output, it will show us that +2 of the CPUs are busy and reaching max. frequency (which would be the base +frequency as the turbo is disabled). The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 1000 + 0 1 1 1005 + 0 2 2 1000 + 0 3 3 2600 + 0 4 4 2600 + 0 5 5 1000 + 0 6 6 1000 + 0 7 7 1005 + 0 8 8 1005 + 0 9 9 1000 + 0 10 10 1000 + 0 11 11 995 + 0 12 12 1000 + 0 13 13 1000 + +From the above turbostat output, both CPU 3 and 4 are very busy and reaching +full guaranteed frequency of 2600 MHz. + +Intel(R) SST-BF Capabilities +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get capabilities of Intel(R) SST-BF for the current performance level 0, +execute:: + + # intel-speed-select base-freq info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + speed-select-base-freq + high-priority-base-frequency(MHz):3000 + high-priority-cpu-mask:00000216,00002160 + high-priority-cpu-list:5,6,8,13,33,34,36,41 + low-priority-base-frequency(MHz):2400 + tjunction-temperature(C):125 + thermal-design-power(W):205 + +The above capabilities show that there are some CPUs on this system that can +offer base frequency of 3000 MHz compared to the standard base frequency at this +performance levels. Nevertheless, these CPUs are fixed, and they are presented +via high-priority-cpu-list/high-priority-cpu-mask. But if this Intel(R) SST-BF +feature is selected, the low priorities CPUs (which are not in +high-priority-cpu-list) can only offer up to 2400 MHz. As a result, if this +clipping of low priority CPUs is acceptable, then the user can enable Intel +SST-BF feature particularly for the above "sched pipe" workload since only two +CPUs are used, they can be scheduled on high priority CPUs and can get boost of +400 MHz. + +Enable Intel(R) SST-BF +~~~~~~~~~~~~~~~~~~~~~~ + +To enable Intel(R) SST-BF feature, execute:: + + # intel-speed-select base-freq enable -a + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + base-freq + enable:success + package-1 + die-0 + cpu-14 + base-freq + enable:success + +In this case, -a option is optional. This not only enables Intel(R) SST-BF, but it +also adjusts the priority of cores using Intel(R) Speed Select Technology Core +Power (Intel(R) SST-CP) features. This option sets the minimum performance of each +Intel(R) Speed Select Technology - Performance Profile (Intel(R) SST-PP) class to +maximum performance so that the hardware will give maximum performance possible +for each CPU. + +If -a option is not used, then the following steps are required before enabling +Intel(R) SST-BF: + +- Discover Intel(R) SST-BF and note low and high priority base frequency +- Note the high prioity CPU list +- Enable CLOS using core-power feature set +- Configure CLOS parameters. Use CLOS.min to set to minimum performance +- Subscribe desired CPUs to CLOS groups + +With this configuration, if the same workload is executed by pinning the +workload to high priority CPUs (CPU 5 and 6 in this case):: + + #taskset -c 5,6 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.627 [sec] + 5.627922 usecs/op + 177685 ops/sec + +This way, by enabling Intel(R) SST-BF, the performance of this benchmark is +improved (latency reduced) by 7.79%. From the turbostat output, it can be +observed that the high priority CPUs reached 3000 MHz compared to 2600 MHz. +The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 2151 + 0 1 1 2166 + 0 2 2 2175 + 0 3 3 2175 + 0 4 4 2175 + 0 5 5 3000 + 0 6 6 3000 + 0 7 7 2180 + 0 8 8 2662 + 0 9 9 2176 + 0 10 10 2175 + 0 11 11 2176 + 0 12 12 2176 + 0 13 13 2661 + +Disable Intel(R) SST-BF +~~~~~~~~~~~~~~~~~~~~~~~ + +To disable the Intel(R) SST-BF feature, execute:: + +# intel-speed-select base-freq disable -a + + +Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF) +-------------------------------------------------------------------- + +This feature enables the ability to set different "All core turbo ratio limits" +to cores based on the priority. By using this feature, some cores can be +configured to get higher turbo frequency by designating them as high priority at +the cost of lower or no turbo frequency on the low priority cores. + +For this reason, this feature is only useful when system is busy utilizing all +CPUs, but the user wants some configurable option to get high performance on +some CPUs. + +The support of Intel(R) Speed Select Technology - Turbo Frequency (Intel(R) SST-TF) +depends on the Intel(R) Speed Select Technology - Performance Profile (Intel +SST-PP) performance level configuration. It is possible that only a certain +performance level supports Intel(R) SST-TF. It is also possible that only the base +performance level (level = 0) has the support of Intel(R) SST-TF. Hence, first +select the desired performance level to enable this feature. + +In the system under test here, Intel(R) SST-TF is supported at the base +performance level 0, but currently disabled:: + + # intel-speed-select -c 0 perf-profile info -l 0 + Intel(R) Speed Select Technology + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + ... + speed-select-turbo-freq:disabled + ... + ... + + +To check if performance can be improved using Intel(R) SST-TF feature, get the turbo +frequency properties with Intel(R) SST-TF enabled and compare to the base turbo +capability of this system. + +Get Base turbo capability +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the base turbo capability of performance level 0, execute:: + + # intel-speed-select perf-profile info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + perf-profile-level-0 + ... + ... + turbo-ratio-limits-sse + bucket-0 + core-count:2 + max-turbo-frequency(MHz):3200 + bucket-1 + core-count:4 + max-turbo-frequency(MHz):3100 + bucket-2 + core-count:6 + max-turbo-frequency(MHz):3100 + bucket-3 + core-count:8 + max-turbo-frequency(MHz):3100 + bucket-4 + core-count:10 + max-turbo-frequency(MHz):3100 + bucket-5 + core-count:12 + max-turbo-frequency(MHz):3100 + bucket-6 + core-count:14 + max-turbo-frequency(MHz):3100 + bucket-7 + core-count:16 + max-turbo-frequency(MHz):3100 + +Based on the data above, when all the CPUS are busy, the max. frequency of 3100 +MHz can be achieved. If there is some busy workload on cpu 0 - 11 (e.g. stress) +and on CPU 12 and 13, execute "hackbench pipe" workload:: + + # taskset -c 12,13 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.705 [sec] + 5.705488 usecs/op + 175269 ops/sec + +The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + 0 0 0 3000 + 0 1 1 3000 + 0 2 2 3000 + 0 3 3 3000 + 0 4 4 3000 + 0 5 5 3100 + 0 6 6 3100 + 0 7 7 3000 + 0 8 8 3100 + 0 9 9 3000 + 0 10 10 3000 + 0 11 11 3000 + 0 12 12 3100 + 0 13 13 3100 + +Based on turbostat output, the performance is limited by frequency cap of 3100 +MHz. To check if the hackbench performance can be improved for CPU 12 and CPU +13, first check the capability of the Intel(R) SST-TF feature for this performance +level. + +Get Intel(R) SST-TF Capability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To get the capability, the "turbo-freq info" command can be used:: + + # intel-speed-select turbo-freq info -l 0 + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-0 + speed-select-turbo-freq + bucket-0 + high-priority-cores-count:2 + high-priority-max-frequency(MHz):3200 + high-priority-max-avx2-frequency(MHz):3200 + high-priority-max-avx512-frequency(MHz):3100 + bucket-1 + high-priority-cores-count:4 + high-priority-max-frequency(MHz):3100 + high-priority-max-avx2-frequency(MHz):3000 + high-priority-max-avx512-frequency(MHz):2900 + bucket-2 + high-priority-cores-count:6 + high-priority-max-frequency(MHz):3100 + high-priority-max-avx2-frequency(MHz):3000 + high-priority-max-avx512-frequency(MHz):2900 + speed-select-turbo-freq-clip-frequencies + low-priority-max-frequency(MHz):2600 + low-priority-max-avx2-frequency(MHz):2400 + low-priority-max-avx512-frequency(MHz):2100 + +Based on the output above, there is an Intel(R) SST-TF bucket for which there are +two high priority cores. If only two high priority cores are set, then max. +turbo frequency on those cores can be increased to 3200 MHz. This is 100 MHz +more than the base turbo capability for all cores. + +In turn, for the hackbench workload, two CPUs can be set as high priority and +rest as low priority. One side effect is that once enabled, the low priority +cores will be clipped to a lower frequency of 2600 MHz. + +Enable Intel(R) SST-TF +~~~~~~~~~~~~~~~~~~~~~~ + +To enable Intel(R) SST-TF, execute:: + + # intel-speed-select -c 12,13 turbo-freq enable -a + Intel(R) Speed Select Technology + Executing on CPU model: X + package-0 + die-0 + cpu-12 + turbo-freq + enable:success + package-0 + die-0 + cpu-13 + turbo-freq + enable:success + package--1 + die-0 + cpu-63 + turbo-freq --auto + enable:success + +In this case, the option "-a" is optional. If set, it enables Intel(R) SST-TF +feature and also sets the CPUs to high and and low priority using Intel Speed +Select Technology Core Power (Intel(R) SST-CP) features. The CPU numbers passed +with "-c" arguments are marked as high priority, including its siblings. + +If -a option is not used, then the following steps are required before enabling +Intel(R) SST-TF: + +- Discover Intel(R) SST-TF and note buckets of high priority cores and maximum frequency + +- Enable CLOS using core-power feature set - Configure CLOS parameters + +- Subscribe desired CPUs to CLOS groups making sure that high priority cores are set to the maximum frequency + +If the same hackbench workload is executed, schedule hackbench threads on high +priority CPUs:: + + #taskset -c 12,13 perf bench -r 100 sched pipe + # Running 'sched/pipe' benchmark: + # Executed 1000000 pipe operations between two processes + Total time: 5.510 [sec] + 5.510165 usecs/op + 180826 ops/sec + +This improved performance by around 3.3% improvement on a busy system. Here the +turbostat output will show that the CPU 12 and CPU 13 are getting 100 MHz boost. +The turbostat output:: + + #turbostat -c 0-13 --show Package,Core,CPU,Bzy_MHz -i 1 + Package Core CPU Bzy_MHz + ... + 0 12 12 3200 + 0 13 13 3200 diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index 0a38cdf39df1..f40994c422dc 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -13,3 +13,4 @@ Working-State Power Management intel_pstate cpufreq_drivers intel_epb + intel-speed-select -- cgit v1.2.3