path: root/drivers/cpuidle
Age | Commit message | Author | Files | Lines
2026-03-04 | cpuidle: Skip governor when only one idle state is available | Aboorva Devarajan | 1 | -0/+10
[ Upstream commit e5c9ffc6ae1bcdb1062527d611043681ac301aca ] On certain platforms (PowerNV systems without a power-mgt DT node), cpuidle may register only a single idle state. In cases where that single state is a polling state (state 0), the ladder governor may incorrectly treat state 1 as the first usable state and pass an out-of-bounds index. This can lead to a NULL enter callback being invoked, ultimately resulting in a system crash. [ 13.342636] cpuidle-powernv : Only Snooze is available [ 13.351854] Faulting instruction address: 0x00000000 [ 13.376489] NIP [0000000000000000] 0x0 [ 13.378351] LR [c000000001e01974] cpuidle_enter_state+0x2c4/0x668 Fix this by adding a bail-out in cpuidle_select() that returns state 0 directly when state_count <= 1, bypassing the governor and keeping the tick running. Fixes: dc2251bf98c6 ("cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol") Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260216185005.1131593-2-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
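The bail-out described above can be sketched in a few lines of userspace C. This is a minimal illustration, not the kernel's `cpuidle_select()`: the `idle_driver` struct, the `governor_select_fn` callback type, and both function names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct cpuidle_driver. */
struct idle_driver {
    int state_count; /* number of idle states the driver registered */
};

/* Hypothetical governor callback: returns a state index, sets *stop_tick. */
typedef int (*governor_select_fn)(const struct idle_driver *drv, bool *stop_tick);

/*
 * With zero or one registered state there is nothing to choose between,
 * so return state 0, keep the tick running, and never call the governor.
 */
static int select_idle_state(const struct idle_driver *drv,
                             governor_select_fn governor, bool *stop_tick)
{
    assert(drv != NULL && governor != NULL && stop_tick != NULL);

    if (drv->state_count <= 1) {
        *stop_tick = false;
        return 0;
    }
    return governor(drv, stop_tick);
}

/* A governor that naively assumes state 1 always exists (illustration only). */
static int naive_ladder_like_governor(const struct idle_driver *drv, bool *stop_tick)
{
    (void)drv;
    *stop_tick = true;
    return 1;
}
```

With `state_count == 1`, `select_idle_state()` returns 0 without consulting the governor, so the out-of-bounds index 1 is never produced.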
2026-03-04 | cpuidle: governors: menu: Always check timers with tick stopped | Rafael J. Wysocki | 1 | -11/+11
[ Upstream commit 80606f4eb8d7484ab7f7d6f0fd30d71e6fbcf328 ] After commit 5484e31bbbff ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases"), if the return value of get_typical_interval() multiplied by NSEC_PER_USEC is not greater than RESIDENCY_THRESHOLD_NS, the menu governor will skip computing the time till the closest timer. If that happens when the tick has been stopped already, the selected idle state may be too deep due to the subsequent check comparing predicted_ns with TICK_NSEC and causing its value to be replaced with the expected time till the closest timer, which is KTIME_MAX in that case. That will cause the deepest enabled idle state to be selected, but the time till the closest timer very well may be shorter than the target residency of that state, in which case a shallower state should be used. Address this by making menu_select() always compute the time till the closest timer when the tick has been stopped. Also move the predicted_ns check mentioned above into the branch in which the time till the closest timer is determined because it only needs to be done in that case. Fixes: 5484e31bbbff ("cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5959091.DvuYhMxLoT@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org>
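The decision of whether to compute the time till the closest timer can be sketched as a small predicate. This is an illustration under assumptions, not the code in `menu_select()`: the `prediction` struct and the function name are hypothetical, and the 15 us value used for `RESIDENCY_THRESHOLD_NS` is assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
/* Assumed value for illustration; check the definition in your kernel tree. */
#define RESIDENCY_THRESHOLD_NS (15 * NSEC_PER_USEC)

/* Hypothetical bundle of the inputs the decision depends on. */
struct prediction {
    unsigned int typical_interval_us; /* result of a get_typical_interval()-like helper */
    bool tick_stopped;                /* has the scheduler tick been stopped already? */
};

/*
 * Returns true when the time till the closest timer must be computed.
 * With the tick stopped the query is never skipped; otherwise it is
 * skipped when the typical interval is at or below the threshold.
 */
static bool must_query_next_timer(const struct prediction *p)
{
    assert(p != NULL);

    if (p->tick_stopped)
        return true;
    return (uint64_t)p->typical_interval_us * NSEC_PER_USEC > RESIDENCY_THRESHOLD_NS;
}
```

A short typical interval with a running tick still skips the query, which preserves the optimization from commit 5484e31bbbff in the case where it is safe.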
2026-01-08 | cpuidle: governors: teo: Drop misguided target residency check | Rafael J. Wysocki | 1 | -5/+2
commit a03b2011808ab02ccb7ab6b573b013b77fbb5921 upstream. When the target residency of the current candidate idle state is greater than the expected time till the closest timer (the sleep length), it does not matter whether or not the tick has already been stopped or if it is going to be stopped. The closest timer will trigger anyway at its due time, so if an idle state with target residency above the sleep length is selected, energy will be wasted and there may be excess latency. Of course, if the closest timer were canceled before it could trigger, a deeper idle state would be more suitable, but this is not expected to happen (generally speaking, hrtimers are not expected to be canceled as a rule). Accordingly, the teo_state_ok() check done in that case causes energy to be wasted more often than it allows any energy to be saved (if it allows any energy to be saved at all), so drop it and let the governor use the teo_find_shallower_state() return value as the new candidate idle state index. Fixes: 21d28cd2fa5f ("cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5955081.DvuYhMxLoT@rafael.j.wysocki Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
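A rough sketch of the resulting behavior: when the candidate's target residency exceeds the sleep length, the shallower state found by a `teo_find_shallower_state()`-like helper is used directly, with no further check. The `idle_state` struct and both function names below are hypothetical simplifications, not the teo governor's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified idle state description. */
struct idle_state {
    int64_t target_residency_ns;
    bool disabled;
};

/*
 * Walk from the candidate towards state 0 and return the deepest enabled
 * state whose target residency fits within duration_ns. If none fits, the
 * shallowest enabled state below the candidate is returned; if every
 * shallower state is disabled, the candidate itself is kept.
 */
static int find_shallower_state(const struct idle_state *states, int candidate,
                                int64_t duration_ns)
{
    int idx = candidate;

    assert(states != NULL && candidate >= 0);

    for (int i = candidate - 1; i >= 0; i--) {
        if (states[i].disabled)
            continue;
        idx = i;
        if (states[i].target_residency_ns <= duration_ns)
            break;
    }
    return idx;
}

/* The shallower state replaces the candidate unconditionally. */
static int pick_state(const struct idle_state *states, int candidate,
                      int64_t sleep_length_ns)
{
    if (states[candidate].target_residency_ns > sleep_length_ns)
        return find_shallower_state(states, candidate, sleep_length_ns);
    return candidate;
}
```

For example, with target residencies of 0, 5 us, 200 us and 2 ms and a sleep length of 100 us, a candidate of state 3 is replaced by state 1.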
2026-01-08 | cpuidle: menu: Use residency threshold in polling state override decisions | Aboorva Devarajan | 1 | -4/+5
[ Upstream commit 07d815701274d156ad8c7c088a52e01642156fb8 ] On virtualized PowerPC (pseries) systems, where only one polling state (Snooze) and one deep state (CEDE) are available, selecting CEDE when the predicted idle duration is less than the target residency of CEDE state can hurt performance. In such cases, the entry/exit overhead of CEDE outweighs the power savings, leading to unnecessary state transitions and higher latency. Menu governor currently contains a special-case rule that prioritizes the first non-polling state over polling, even when its target residency is much longer than the predicted idle duration. On PowerPC/pseries, where the gap between the polling state (Snooze) and the first non-polling state (CEDE) is large, this behavior causes performance regressions. Refine that special case by adding an extra requirement: the first non-polling state can only be chosen if its target residency is below the defined RESIDENCY_THRESHOLD_NS. If this condition is not satisfied, polling is allowed instead, avoiding suboptimal non-polling state entries. This change is limited to the single special-case rule for the first non-polling state. The general non-polling state selection logic in the menu governor remains unchanged. 
Performance improvement observed with pgbench on PowerPC (pseries) system:

+---------------------------+------------+------------+------------+
| Metric                    | Baseline   | Patched    | Change (%) |
+---------------------------+------------+------------+------------+
| Transactions/sec (TPS)    | 495,210    | 536,982    | +8.45%     |
| Avg latency (ms)          | 0.163      | 0.150      | -7.98%     |
+---------------------------+------------+------------+------------+

CPUIdle state usage:

+--------------+--------------+-------------+
| Metric       | Baseline     | Patched     |
+--------------+--------------+-------------+
| Total usage  | 12,735,820   | 13,918,442  |
| Above usage  | 11,401,520   | 1,598,210   |
| Below usage  | 20,145       | 702,395     |
+--------------+--------------+-------------+

Above/Total and Below/Total usage percentages:

+------------------------+-----------+---------+
| Metric                 | Baseline  | Patched |
+------------------------+-----------+---------+
| Above % (Above/Total)  | 89.56%    | 11.49%  |
| Below % (Below/Total)  | 0.16%     | 5.05%   |
| Total cpuidle miss (%) | 89.72%    | 16.54%  |
+------------------------+-----------+---------+

The results indicate that restricting CEDE selection to cases where its residency matches the predicted idle time reduces mispredictions, lowers unnecessary state transitions, and improves overall throughput.

Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> [ rjw: Changelog edits, rebase ] Link: https://patch.msgid.link/20251006013954.17972-1-aboorvad@linux.ibm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
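The refined special case can be sketched as a predicate. This is only an illustration: the struct, the function name, the 15 us value of `RESIDENCY_THRESHOLD_NS`, and the "next timer" condition are assumptions made for the example, and only the threshold comparison is taken from the changelog above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
/* Assumed value for illustration; check the definition in your kernel tree. */
#define RESIDENCY_THRESHOLD_NS (15 * NSEC_PER_USEC)

/* Hypothetical, simplified idle state description. */
struct idle_state {
    uint64_t target_residency_ns;
    bool polling;
};

/*
 * Decide whether the first non-polling state ("next") should replace the
 * polling state ("current") even though its target residency exceeds the
 * predicted idle duration. The new requirement is the threshold check:
 * a state with a large target residency no longer overrides polling.
 */
static bool override_polling(const struct idle_state *current,
                             const struct idle_state *next,
                             uint64_t next_timer_ns)
{
    assert(current != NULL && next != NULL);

    return current->polling &&
           next->target_residency_ns < RESIDENCY_THRESHOLD_NS &&
           next->target_residency_ns <= next_timer_ns; /* assumed extra condition */
}
```

With a CEDE-like state whose target residency is well above the threshold (120 us is a made-up figure here), the predicate is false and polling is kept, which is the behavior the changelog argues for.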
2025-11-13 | cpuidle: Fail cpuidle device registration if there is one already | Rafael J. Wysocki | 1 | -1/+7
[ Upstream commit 7b1b7961170e4fcad488755e5ffaaaf9bd527e8f ] Refuse to register a cpuidle device if the given CPU has a cpuidle device already and print a message regarding it. Without this, an attempt to register a new cpuidle device without unregistering the existing one leads to the removal of the existing cpuidle device without removing its sysfs interface. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
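The refusal to register a second device for the same CPU can be sketched as follows. The per-CPU table, the `idle_device` struct, the function name, the message text, and the `-EEXIST` return value are all assumptions made for this userspace illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CPUS 4 /* arbitrary size for the sketch */

/* Hypothetical, simplified stand-in for struct cpuidle_device. */
struct idle_device {
    int cpu;
};

/* One slot per CPU; NULL means no device is registered for that CPU. */
static struct idle_device *registered[MAX_CPUS];

/*
 * Register dev for its CPU. If a device is already present, report it and
 * fail instead of silently replacing the existing one.
 */
static int register_idle_device(struct idle_device *dev)
{
    assert(dev != NULL);

    if (dev->cpu < 0 || dev->cpu >= MAX_CPUS)
        return -EINVAL;

    if (registered[dev->cpu] != NULL) {
        fprintf(stderr, "CPU%d: idle device already registered\n", dev->cpu);
        return -EEXIST;
    }

    registered[dev->cpu] = dev;
    return 0;
}
```

Because the second call fails early, the first device and anything attached to it (its sysfs interface, in the kernel's case) stay consistent.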
2025-11-13 | cpuidle: governors: menu: Select polling state in some more cases | Rafael J. Wysocki | 1 | -2/+5
[ Upstream commit db86f55bf81a3a297be05ee8775ae9a8c6e3a599 ] A throughput regression of 11% introduced by commit 779b1a1cb13a ("cpuidle: governors: menu: Avoid selecting states with too much latency") has been reported and it is related to the case when the menu governor checks if selecting a proper idle state instead of a polling one makes sense. In particular, it is questionable to do so if the exit latency of the idle state in question exceeds the predicted idle duration, so add a check for that, which is sufficient to make the reported regression go away, and update the related code comment accordingly. Fixes: 779b1a1cb13a ("cpuidle: governors: menu: Avoid selecting states with too much latency") Closes: https://lore.kernel.org/linux-pm/004501dc43c9$ec8aa930$c59ffb90$@telus.net/ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/12786727.O9o76ZdvQC@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13 | cpuidle: governors: menu: Rearrange main loop in menu_select() | Rafael J. Wysocki | 1 | -34/+36
[ Upstream commit 17224c1d2574d29668c4879e1fbf36d6f68cd22b ] Reduce the indentation level in the main loop of menu_select() by rearranging some checks and assignments in it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2389215.ElGaqSPkdT@rafael.j.wysocki Stable-dep-of: db86f55bf81a ("cpuidle: governors: menu: Select polling state in some more cases") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29 | Revert "cpuidle: menu: Avoid discarding useful information" | Rafael J. Wysocki | 1 | -12/+9
commit 10fad4012234a7dea621ae17c0c9486824f645a0 upstream. It is reported that commit 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information") led to a performance regression on Intel Jasper Lake systems because it reduced the time spent by CPUs in idle state C7 which is correlated to the maximum frequency the CPUs can get to because of an average running power limit [1]. Before that commit, get_typical_interval() would have returned UINT_MAX whenever it had been unable to make a high-confidence prediction which had led to selecting the deepest available idle state too often and both power and performance had been inadequate as a result of that on some systems. However, this had not been a problem on systems with relatively aggressive average running power limits, like the Jasper Lake systems in question, because on those systems it was compensated by the ability to run CPUs faster. It was addressed by causing get_typical_interval() to return a number based on the recent idle duration information available to it even if it could not make a high-confidence prediction, but that clearly did not take the possible correlation between idle power and available CPU capacity into account. For this reason, revert most of the changes made by commit 85975daeaa4d, except for one cosmetic cleanup, and add a comment explaining the rationale for returning UINT_MAX from get_typical_interval() when it is unable to make a high-confidence prediction. Fixes: 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information") Closes: https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/ [1] Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3663603.iIbC2pHGDl@rafael.j.wysocki Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15 | cpuidle: qcom-spm: fix device and OF node leaks at probe | Johan Hovold | 1 | -2/+5
[ Upstream commit cdc06f912670c8c199d5fa9e78b64b7ed8e871d0 ] Make sure to drop the reference to the saw device taken by of_find_device_by_node() after retrieving its driver data during probe(). Also drop the reference to the CPU node sooner to avoid leaking it in case there is no saw node or device. Fixes: 60f3692b5f0b ("cpuidle: qcom_spm: Detach state machine from main SPM handling") Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28 | cpuidle: governors: menu: Avoid selecting states with too much latency | Rafael J. Wysocki | 1 | -17/+12
[ Upstream commit 779b1a1cb13ae17028aeddb2fbbdba97357a1e15 ] Occasionally, the exit latency of the idle state selected by the menu governor may exceed the PM QoS CPU wakeup latency limit. Namely, if the scheduler tick has been stopped already and predicted_ns is greater than the tick period length, the governor may return an idle state whose exit latency exceeds latency_req because that decision is made before checking the current idle state's exit latency. For instance, say that there are 3 idle states, 0, 1, and 2. For idle states 0 and 1, the exit latency is equal to the target residency and the values are 0 and 5 us, respectively. State 2 is deeper and has the exit latency and target residency of 200 us and 2 ms (which is greater than the tick period length), respectively. Say that predicted_ns is equal to TICK_NSEC and the PM QoS latency limit is 20 us. After the first two iterations of the main loop in menu_select(), idx becomes 1 and in the third iteration of it the target residency of the current state (state 2) is greater than predicted_ns. State 2 is not a polling one and predicted_ns is not less than TICK_NSEC, so the check on whether or not the tick has been stopped is done. Say that the tick has been stopped already and there are no imminent timers (that is, delta_tick is greater than the target residency of state 2). In that case, idx becomes 2 and it is returned immediately, but the exit latency of state 2 exceeds the latency limit. Address this issue by modifying the code to compare the exit latency of the current idle state (idle state i) with the latency limit before comparing its target residency with predicted_ns, which allows one more exit_latency_ns check that becomes redundant to be dropped. However, after the above change, latency_req cannot take the predicted_ns value any more, which takes place after commit 38f83090f515 ("cpuidle: menu: Remove iowait influence"), because it may cause a polling state to be returned prematurely. 
In the context of the previous example say that predicted_ns is 3000 and the PM QoS latency limit is still 20 us. Additionally, say that idle state 0 is a polling one. Moving the exit_latency_ns check before the target_residency_ns one causes the loop to terminate in the second iteration, before the target_residency_ns check, so idle state 0 will be returned even though previously state 1 would be returned if there were no imminent timers. For this reason, remove the assignment of the predicted_ns value to latency_req from the code. Fixes: 5ef499cd571c ("cpuidle: menu: Handle stopped tick more aggressively") Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5043159.31r3eYUQgx@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
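The first three-state example in this changelog can be replayed with a simplified model of the selection loop. Everything below is a sketch under assumptions: the structs and function names are hypothetical, polling-state handling is left out, and a 1 ms tick (HZ=1000) is assumed so that the 2 ms target residency of state 2 exceeds the tick period, as the changelog requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define TICK_NSEC (1000 * NSEC_PER_USEC) /* assumed 1 ms tick (HZ=1000) */

struct idle_state {
    uint64_t exit_latency_ns;
    uint64_t target_residency_ns;
};

struct select_input {
    const struct idle_state *states;
    int state_count;
    uint64_t predicted_ns;   /* predicted idle duration */
    uint64_t latency_req_ns; /* PM QoS wakeup latency limit */
    uint64_t delta_tick_ns;  /* time till the closest timer */
    bool tick_stopped;
};

/* State i has a target residency above predicted_ns: keep idx or take i. */
static int residency_too_long(const struct select_input *in, int idx, int i)
{
    const struct idle_state *s = in->states;

    if (in->predicted_ns < TICK_NSEC || !in->tick_stopped)
        return idx;
    /* Tick already stopped and no imminent timer: go for the deeper state. */
    if (s[idx].target_residency_ns < TICK_NSEC &&
        s[i].target_residency_ns <= in->delta_tick_ns)
        return i;
    return idx;
}

/* Old ordering: target residency is checked before exit latency. */
static int select_residency_first(const struct select_input *in)
{
    int idx = 0;

    assert(in != NULL && in->state_count > 0);
    for (int i = 1; i < in->state_count; i++) {
        if (in->states[i].target_residency_ns > in->predicted_ns)
            return residency_too_long(in, idx, i);
        if (in->states[i].exit_latency_ns > in->latency_req_ns)
            break;
        idx = i;
    }
    return idx;
}

/* New ordering: exit latency is checked first, so the limit always holds. */
static int select_latency_first(const struct select_input *in)
{
    int idx = 0;

    assert(in != NULL && in->state_count > 0);
    for (int i = 1; i < in->state_count; i++) {
        if (in->states[i].exit_latency_ns > in->latency_req_ns)
            break;
        if (in->states[i].target_residency_ns > in->predicted_ns)
            return residency_too_long(in, idx, i);
        idx = i;
    }
    return idx;
}
```

With states (0, 0), (5 us, 5 us) and (200 us, 2 ms), `predicted_ns` equal to `TICK_NSEC`, a 20 us latency limit, a stopped tick and a distant timer, the residency-first ordering returns state 2, whose 200 us exit latency breaks the limit, while the latency-first ordering stops at state 1.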
2025-08-28 | cpuidle: menu: Remove iowait influence | Christian Loehle | 1 | -67/+9
[ Upstream commit 38f83090f515b4b5d59382dfada1e7457f19aa47 ] Remove CPU iowaiters influence on idle state selection. Remove the menu notion of performance multiplier which increased with the number of tasks that went to iowait sleep on this CPU and haven't woken up yet. Relying on iowait for cpuidle is problematic for a few reasons: 1. There is no guarantee that an iowaiting task will wake up on the same CPU. 2. The task being in iowait says nothing about the idle duration, we could be selecting shallower states for a long time. 3. The task being in iowait doesn't always imply a performance hit with increased latency. 4. If there is such a performance hit, the number of iowaiting tasks doesn't directly correlate. 5. The definition of iowait altogether is vague at best, it is sprinkled across kernel code. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20240905092645.2885200-2-christian.loehle@arm.com [ rjw: Minor edits in the changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 779b1a1cb13a ("cpuidle: governors: menu: Avoid selecting states with too much latency") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20 | cpuidle: governors: menu: Avoid using invalid recent intervals data | Rafael J. Wysocki | 1 | -4/+17
[ Upstream commit fa3fa55de0d6177fdcaf6fc254f13cc8f33c3eed ] Marc has reported that commit 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information") caused the number of wakeup interrupts to increase on an idle system [1], which was not expected to happen after merely allowing shallower idle states to be selected by the governor in some cases. However, on the system in question, all of the idle states deeper than WFI are rejected by the driver due to a firmware issue [2]. This causes the governor to only consider the recent interval duration data corresponding to attempts to enter WFI that are successful and the recent intervals table is filled with values lower than the scheduler tick period. Consequently, the governor predicts an idle duration below the scheduler tick period length and avoids stopping the tick more often which leads to the observed symptom. Address it by modifying the governor to update the recent intervals table also when entering the previously selected idle state fails, so it knows that the short idle intervals might have been the minority had the selected idle states been actually entered every time. Fixes: 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information") Link: https://lore.kernel.org/linux-pm/86o6sv6n94.wl-maz@kernel.org/ [1] Link: https://lore.kernel.org/linux-pm/7ffcb716-9a1b-48c2-aaa4-469d0df7c792@arm.com/ [2] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/2793874.mvXUDI8C0e@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-24 | cpuidle: psci: Fix cpuhotplug routine with PREEMPT_RT=y | Daniel Lezcano | 1 | -11/+12
commit 621a88dbfe9006c318a0cafbd12e677ccfe006e7 upstream. Currently cpu hotplug with the PREEMPT_RT option set in the kernel is not supported because the underlying generic power domain functions used in the cpu hotplug callbacks are incompatible from a lock point of view. This situation prevents the suspend to idle to reach the deepest idle state for the "cluster" as identified in the undermentioned commit. Use the compatible ones when PREEMPT_RT is enabled and remove the boolean disabling the hotplug callbacks with this option. With this change the platform can reach the deepest idle state allowing at suspend time to consume less power.

Tested-on Lenovo T14s with the following script:

    echo 0 > /sys/devices/system/cpu/cpu3/online
    BEFORE=$(cat /sys/kernel/debug/pm_genpd/power-domain-cpu-cluster0/idle_states | grep S0 | awk '{ print $3 }') ;
    rtcwake -s 1 -m mem;
    AFTER=$(cat /sys/kernel/debug/pm_genpd/power-domain-cpu-cluster0/idle_states | grep S0 | awk '{ print $3 }');
    if [ $BEFORE -lt $AFTER ]; then
        echo "Test successful"
    else
        echo "Test failed"
    fi
    echo 1 > /sys/devices/system/cpu/cpu3/online

Fixes: 1c4b2932bd62 ("cpuidle: psci: Enable the hierarchical topology for s2idle on PREEMPT_RT") Cc: Raghavendra Kakarla <quic_rkakarla@quicinc.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250709154728.733920-1-daniel.lezcano@linaro.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29 | cpuidle: menu: Avoid discarding useful information | Rafael J. Wysocki | 1 | -1/+12
[ Upstream commit 85975daeaa4d6ec560bfcd354fc9c08ad7f38888 ] When giving up on making a high-confidence prediction, get_typical_interval() always returns UINT_MAX which means that the next idle interval prediction will be based entirely on the time till the next timer. However, the information represented by the most recent intervals may not be completely useless in those cases. Namely, the largest recent idle interval is an upper bound on the recently observed idle duration, so it is reasonable to assume that the next idle duration is unlikely to exceed it. Moreover, this is still true after eliminating the suspected outliers if the sample set still under consideration is at least as large as 50% of the maximum sample set size. Accordingly, make get_typical_interval() return the current maximum recent interval value in that case instead of UINT_MAX. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Link: https://patch.msgid.link/7770672.EvYhyI6sBW@rjwysocki.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-23 | cpuidle: teo: Update documentation after previous changes | Rafael J. Wysocki | 1 | -43/+48
[ Upstream commit 5a597a19a2148d1c5cd987907a60c042ab0f62d5 ] After previous changes, the description of the teo governor in the documentation comment does not match the code any more, so update it as appropriate. Fixes: 449914398083 ("cpuidle: teo: Remove recent intercepts metric") Fixes: 2662342079f5 ("cpuidle: teo: Gather statistics regarding whether or not to stop the tick") Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/6120335.lOV4Wx5bFT@rjwysocki.net [ rjw: Corrected 3 typos found by Christian ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-17 | cpuidle: riscv-sbi: fix device node release in early exit of for_each_possible_cpu | Javier Carrasco | 1 | -2/+2
[ Upstream commit 7e25044b804581b9c029d5a28d8800aebde18043 ] The 'np' device_node is initialized via of_cpu_device_node_get(), which requires explicit calls to of_node_put() when it is no longer required to avoid leaking the resource. Instead of adding the missing calls to of_node_put() in all execution paths, use the cleanup attribute for 'np' by means of the __free() macro, which automatically calls of_node_put() when the variable goes out of scope. Given that 'np' is only used within the for_each_possible_cpu() loop, reduce its scope to release the node after every iteration of the loop. Fixes: 6abf32f1d9c5 ("cpuidle: Add RISC-V SBI CPU idle driver") Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com> Link: https://lore.kernel.org/r/20241116-cpuidle-riscv-sbi-cleanup-v3-1-a3a46372ce08@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
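The scope-based release pattern can be demonstrated in userspace with the GCC/Clang `cleanup` variable attribute, which is the compiler feature the kernel's `__free()` macro builds on. The `node` struct, the get/put helpers, and the `AUTO_NODE_PUT` macro below are hypothetical stand-ins for `struct device_node`, `of_cpu_device_node_get()`, `of_node_put()` and `__free(device_node)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct device_node. */
struct node {
    int id;
    int refcount;
};

static struct node *node_get(struct node *n)
{
    assert(n != NULL);
    n->refcount++;
    return n;
}

static void node_put(struct node *n)
{
    if (n != NULL) {
        assert(n->refcount > 0);
        n->refcount--;
    }
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void node_put_cleanup(struct node **np)
{
    node_put(*np);
}

/* GCC/Clang extension: run node_put_cleanup() when the variable leaves scope. */
#define AUTO_NODE_PUT __attribute__((cleanup(node_put_cleanup)))

/* Returns the index of the node with wanted_id, or -1 if there is none. */
static int find_node_index(struct node *nodes, int count, int wanted_id)
{
    for (int i = 0; i < count; i++) {
        /* Declared inside the loop body, so it is released every iteration. */
        struct node *np AUTO_NODE_PUT = node_get(&nodes[i]);

        if (np->id == wanted_id)
            return i; /* early exit: the reference is still dropped */
    }
    return -1;
}
```

Both the early return and the normal end of each iteration drop the reference, so no explicit `node_put()` call is needed on any path.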
2024-09-18 | Merge tag 'pmdomain-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm | Linus Torvalds | 3 | -27/+30
Pull pmdomain updates from Ulf Hansson:

"pmdomain core:
  - Add support for s2idle for CPU PM domains on PREEMPT_RT
  - Add device managed version of dev_pm_domain_attach|detach_list()
  - Improve layout of the debugfs summary table

 pmdomain providers:
  - amlogic: Remove obsolete vpu domain driver
  - bcm: raspberrypi: Add support for devices used as wakeup-sources
  - imx: Fixup clock handling for imx93 at driver remove
  - rockchip: Add gating support for RK3576
  - rockchip: Add support for RK3576 SoC
  - Some OF parsing simplifications
  - Some simplifications by using dev_err_probe() and guard()

 pmdomain consumers:
  - qcom/media/venus: Convert to the device managed APIs for PM domains

 cpuidle-psci:
  - Add support for s2idle/s2ram for the hierarchical topology on PREEMPT_RT
  - Some OF parsing simplifications"

* tag 'pmdomain-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: (39 commits)
  pmdomain: core: Reduce debug summary table width
  pmdomain: core: Move mode_status_str()
  pmdomain: core: Fix "managed by" alignment in debug summary
  pmdomain: core: Harden inter-column space in debug summary
  pmdomain: rockchip: Add gating masks for rk3576
  pmdomain: rockchip: Add gating support
  pmdomain: rockchip: Simplify dropping OF node reference
  pmdomain: mediatek: make use of dev_err_cast_probe()
  pmdomain: imx93-pd: drop the context variable "init_off"
  pmdomain: imx93-pd: don't unprepare clocks on driver remove
  pmdomain: imx93-pd: replace dev_err() with dev_err_probe()
  pmdomain: qcom: rpmpd: Simplify locking with guard()
  pmdomain: qcom: rpmhpd: Simplify locking with guard()
  pmdomain: qcom: cpr: Simplify locking with guard()
  pmdomain: qcom: cpr: Simplify with dev_err_probe()
  pmdomain: imx: gpcv2: Simplify with scoped for each OF child loop
  pmdomain: imx: gpc: Simplify with scoped for each OF child loop
  pmdomain: rockchip: Simplify locking with guard()
  pmdomain: rockchip: Simplify with scoped for each OF child loop
  pmdomain: qcom-cpr: Use scope based of_node_put() to simplify code.
  ...
2024-08-22 | cpuidle: remove dead code from cpuidle_enter_state() | Dhruva Gole | 1 | -4/+1
Checking for index < 0 is useless because the find_deepest_state() function never really returns a negative value. Since this hasn't been reported in over 9 years it's dead code, so remove it. Signed-off-by: Dhruva Gole <d-gole@ti.com> Link: https://patch.msgid.link/20240821114250.1416421-1-d-gole@ti.com [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20 | cpuidle: riscv-sbi: Simplify with scoped for each OF child loop | Krzysztof Kozlowski | 1 | -5/+2
Use scoped for_each_child_of_node_scoped() when iterating over device nodes to make code a bit simpler. Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://patch.msgid.link/20240820094023.61155-2-krzysztof.kozlowski@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20 | cpuidle: riscv-sbi: Use scoped device node handling to fix missing of_node_put | Krzysztof Kozlowski | 1 | -14/+7
Two return statements in sbi_cpuidle_dt_init_states() did not drop the OF node reference count. Solve the issue and simplify entire error handling with scoped/cleanup.h. Fixes: 6abf32f1d9c5 ("cpuidle: Add RISC-V SBI CPU idle driver") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://patch.msgid.link/20240820094023.61155-1-krzysztof.kozlowski@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-20 | cpuidle: dt_idle_genpd: Simplify with scoped for each OF child loop | Krzysztof Kozlowski | 1 | -10/+4
Use scoped for_each_child_of_node_scoped() when iterating over device nodes to make code a bit simpler. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240816150931.142208-4-krzysztof.kozlowski@linaro.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-08-20 | cpuidle: psci: Simplify with scoped for each OF child loop | Krzysztof Kozlowski | 1 | -5/+2
Use scoped for_each_child_of_node_scoped() when iterating over device nodes to make code a bit simpler. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240816150931.142208-1-krzysztof.kozlowski@linaro.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-08-05 | cpuidle: psci: Enable the hierarchical topology for s2idle on PREEMPT_RT | Ulf Hansson | 1 | -7/+6
To enable the domain-idle-states to be used during s2idle on a PREEMPT_RT based configuration, let's allow the re-assignment of the ->enter_s2idle() callback to psci_enter_s2idle_domain_idle_state(). Similar to s2ram, let's leave the support for CPU hotplug outside PREEMPT_RT, as it's depending on using runtime PM. For s2idle, this means that an offline CPU's PM domain will remain powered-on. In practice, this may cause a shallower idle-state than necessary to be selected, which shouldn't be an issue (besides wasting power). Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240527142557.321610-8-ulf.hansson@linaro.org
2024-08-05 | cpuidle: psci: Enable the hierarchical topology for s2ram on PREEMPT_RT | Ulf Hansson | 1 | -5/+15
The hierarchical PM domain topology is currently disabled on a PREEMPT_RT based configuration. As a first step to enable it to be used, let's try to attach the CPU devices to their PM domains on PREEMPT_RT. In this way the syscore ops become available, allowing the PM domain topology to be managed during s2ram. For the moment let's leave the support for CPU hotplug outside PREEMPT_RT, as it's depending on using runtime PM. For s2ram, this isn't a problem as all CPUs are managed via the syscore ops. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240527142557.321610-7-ulf.hansson@linaro.org
2024-08-05 | cpuidle: psci: Drop redundant assignment of CPUIDLE_FLAG_RCU_IDLE | Ulf Hansson | 1 | -1/+0
When using the hierarchical topology and PSCI OSI-mode we may end up overriding the deepest idle-state's ->enter|enter_s2idle() callbacks, but there is no point to also re-assign the CPUIDLE_FLAG_RCU_IDLE for the idle-state in question, as that has already been set when parsing the states from DT. See init_state_node(). Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240527142557.321610-6-ulf.hansson@linaro.org
2024-08-05 | cpuidle: psci-domain: Enable system-wide suspend on PREEMPT_RT | Ulf Hansson | 1 | -3/+7
The domain-idle-states are currently disabled on a PREEMPT_RT based configuration for the cpuidle-psci-domain. To enable them to be used for system-wide suspend and in particular during s2idle, let's set the GENPD_FLAG_RPM_ALWAYS_ON instead of GENPD_FLAG_ALWAYS_ON for the corresponding genpd provider. In this way, the runtime PM path remains disabled in genpd for its attached devices, while powering-on/off the PM domain during system-wide suspend becomes allowed. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Raghavendra Kakarla <quic_rkakarla@quicinc.com> # qcm6490 with PREEMPT_RT set Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240527142557.321610-5-ulf.hansson@linaro.org
2024-07-01cpuidle: teo: Don't count non-existent interceptsChristian Loehle1-0/+11
When bailing out early, teo will not query the sleep length anymore since commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") with an expected sleep_length_ns value of KTIME_MAX. This led to state0 accumulating lots of 'intercepts' because the actually measured sleep length was < KTIME_MAX, so query the sleep length instead for teo to recognize if it still is in an intercept-likely scenario without alternating between the two modes. Fundamentally we can only do one of the two: 1. Skip sleep_length_ns query when we think intercept is likely. 2. Have accurate data if sleep_length_ns is actually intercepted when we believe it is currently intercepted. Previously teo did the former while this patch chooses the latter as the additional time it takes to query the sleep length was found to be negligible and the variants of option 1 (count all unknowns as misses or count all unknowns as hits) had significant regressions (as misses had lots of too shallow idle state selections and as hits had terrible performance in intercept-heavy workloads). Fixes: 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") Link: https://patch.msgid.link/c40acf72-010f-4a8b-80e4-33f133ba266b@arm.com Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28cpuidle: teo: Remove recent intercepts metricChristian Loehle1-63/+13
The logic for recent intercepts didn't work: there is an underflow of the 'recent' value that can be observed during boot already, which teo usually doesn't recover from, making the entire logic pointless. Furthermore the recent intercepts also were never reset, thus not actually being very 'recent'. Having underflowed 'recent' values led to teo always acting as if we were in a scenario where the expected sleep length based on timers is too high, and it therefore unnecessarily selected shallower states. Experiments show that the remaining 'intercept' logic is enough to quickly react to scenarios in which teo cannot rely on the timer expected sleep length. See also here: https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/ Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment") Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28Revert: "cpuidle: teo: Introduce util-awareness"Christian Loehle1-105/+0
This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b. Util-awareness was reported to be too aggressive in selecting shallower states. Additionally a single threshold was found to not be suitable for reasoning about sleep length as, for all practical purposes, almost arbitrary sleep lengths are still possible for any load value. Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness") Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com Reported-by: Qais Yousef <qyousef@layalina.io> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-21cpuidle: governors: teo: Fix a typo in a commentAtul Kumar Pant1-1/+1
"terget" -> "target" Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> Link: https://patch.msgid.link/20240616124025.16477-1-atulpant.linux@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-14cpuidle: haltpoll: add missing MODULE_DESCRIPTION() macroJeff Johnson1-0/+1
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/cpuidle/cpuidle-haltpoll.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-07cpuidle: menu: Cleanup after loadavg removalChristian Loehle1-12/+5
The performance impact of loadavg was removed with commit a7fe5190c03f ("cpuidle: menu: Remove get_loadavg() from the performance multiplier"). With only iowait remaining the description can be simplified. Also remove the no longer needed includes. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-05-16Merge tag 'pmdomain-v6.10' of ↵Linus Torvalds3-23/+5
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm Pull pmdomain updates from Ulf Hansson: "pmdomain core: - Don't clear suspended_count at genpd_prepare() - Update the rejected/usage counters at system suspend too pmdomain providers: - ti-sci: Fix duplicate PD referrals - mediatek: Add MT8188 buck isolation setting - renesas: Add R-Car M3-W power-off delay quirk - renesas: Split R-Car M3-W and M3-W+ sub-drivers cpuidle-psci: - Update MAINTAINERS to set a git for DT IDLE PM DOMAIN/ARM PSCI PM DOMAIN - Update init level to core_initcall() - Drop superfluous wrappers psci_dt_attach|detach_cpu()" * tag 'pmdomain-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: pmdomain: ti-sci: Fix duplicate PD referrals pmdomain: core: Don't clear suspended_count at genpd_prepare() pmdomain: core: Update the rejected/usage counters at system suspend too pmdomain: renesas: rcar-sysc: Add R-Car M3-W power-off delay quirk pmdomain: renesas: rcar-sysc: Remove rcar_sysc_nullify() helper pmdomain: renesas: rcar-sysc: Split R-Car M3-W and M3-W+ sub-drivers pmdomain: renesas: rcar-sysc: Absorb rcar_sysc_ch into rcar_sysc_pd MAINTAINERS: Add a git for the DT IDLE PM DOMAIN MAINTAINERS: Add a git for the ARM PSCI PM DOMAIN cpuidle: psci: Update init level to core_initcall() cpuidle: psci: Drop superfluous wrappers psci_dt_attach|detach_cpu() pmdomain: mediatek: Add MT8188 buck isolation setting pmdomain: mediatek: scpsys: drop driver owner assignment
2024-05-14Merge tag 'pm-6.10-rc1' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These are mostly cpufreq updates, including a significant intel-pstate driver update and several amd-pstate improvements plus some updates of ARM cpufreq drivers, general fixes and cleanups. Also included are changes related to system sleep, power capping updates adding support for a new platform and a new hardware feature (among other things), a Samsung exynos-asv driver update allowing it to change its Energy Model after adjusting voltage, minor cpuidle and devfreq updates and a small documentation cleanup. Specifics: - Rework the handling of disabled turbo in the intel_pstate driver and make it update the maximum CPU frequency consistently regardless of the reason on top of a number of cleanups (Rafael Wysocki) - Add missing checks for NULL .exit() cpufreq driver callback to the cpufreq core (Viresh Kumar) - Prevent policy->max from going above the frequency QoS maximum value when cpufreq_frequency_table_verify() is used (Xuewen Yan) - Prevent a negative CPU number or frequency value from being printed if they are really large (Joshua Yeong) - Update MAINTAINERS entry for amd-pstate to add two new submaintainers and a designated reviewer (Huang Rui) - Clean up the amd-pstate driver and update its documentation (Gautham Shenoy) - Fix the highest frequency issue in the amd-pstate driver which limits performance (Perry Yuan) - Enable CPPC v2 for certain processors in the family 17H, as requested by TR40 processor users who expect improved performance and lower system temperature (Perry Yuan) - Change latency and delay values to be read from platform firmware firstly for more accurate timing (Perry Yuan) - A new quirk is introduced for supporting amd-pstate on legacy processors which either lack CPPC capability, or only have CPPC v2 capability (Perry Yuan) - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and general cleanups (Andre 
Przywara, Martin Botka, Brandon Cheo Fusi, Dan Carpenter, Viresh Kumar) - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr Mishin) - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco, Shivani Gupta) - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens) - mediatek cpufreq: Add support for MT7988A (Sam Shih) - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei Fan) - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate cpufreq driver (Jeff Johnson) - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson) - Convert the cpuidle kirkwood driver to platform remove callback returning void (Yangtao Li) - Replace deprecated strncpy() with strscpy() in the hibernation core code (Justin Stitt) - Use %ps to simplify debug output in the core system-wide suspend and resume code (Len Brown) - Remove unnecessary else from device_init_wakeup() and make device_wakeup_disable() return void (Dhruva Gole) - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui) - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang Rui) - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li) - Make the Samsung exynos-asv driver update the Energy Model after adjusting voltage on top of some preliminary changes of the OPP and Energy Model generic code (Lukasz Luba) - Remove a reference to a function that has been dropped from the power management documentation (Bjorn Helgaas) - Convert the platform remove callback to .remove_new for the exynos-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and rk3399_dmc devfreq drivers (Uwe Kleine-König) - Use DEFINE_SIMPLE_PM_OPS for exynos-bus.c driver (Anand Moon)" * tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (68 commits) PM / devfreq: exynos: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions PM / devfreq: rk3399_dmc: Convert to platform remove callback returning void PM / devfreq: sun8i-a33-mbus: Convert to platform remove 
callback returning void PM / devfreq: mtk-cci: Convert to platform remove callback returning void PM / devfreq: exynos-ppmu: Convert to platform remove callback returning void PM / devfreq: exynos-nocp: Convert to platform remove callback returning void cpufreq: amd-pstate: fix the highest frequency issue which limits performance cpufreq: intel_pstate: fix struct cpudata::epp_cached kernel-doc cpuidle: ladder: fix ladder_do_selection() kernel-doc powercap: intel_rapl_tpmi: Enable PMU support powercap: intel_rapl: Introduce APIs for PMU support PM: hibernate: replace deprecated strncpy() with strscpy() cpufreq: Fix up printing large CPU numbers and frequency values MAINTAINERS: cpufreq: amd-pstate: Add co-maintainers and reviewer cpufreq: amd-pstate: remove unused variable lowest_nonlinear_freq cpufreq: amd-pstate: fix code format problems cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing cppc_acpi: print error message if CPPC is unsupported cpufreq: amd-pstate: get transition delay and latency value from ACPI tables cpufreq: amd-pstate: Bail out if min/max/nominal_freq is 0 ...
2024-05-07cpuidle: ladder: fix ladder_do_selection() kernel-docJeff Johnson1-0/+1
make C=1 reports: warning: Function parameter or struct member 'dev' not described in 'ladder_do_selection' Document 'dev' for this function. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-04-24cpuidle: Avoid explicit cpumask allocation on stackDawei Li1-10/+3
In general it's preferable to avoid placing cpumasks on the stack, as for large values of NR_CPUS these can consume significant amounts of stack space and make stack overflows more likely. Use cpumask_first_and_and() and cpumask_weight_and() to avoid the need for a temporary cpumask on the stack. Signed-off-by: Dawei Li <dawei.li@shingroup.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240416085454.3547175-8-dawei.li@shingroup.cn
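The idea of avoiding a temporary mask can be illustrated outside the kernel. The sketch below is a userspace analogy, not the kernel's cpumask API: `mask_weight_and()`, `popcount64()` and `MASK_WORDS` are hypothetical names, and a mask is modelled as a plain array of 64-bit words. It counts the bits set in both masks word by word, so no intermediate mask has to be stored on the stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask size: 4 words cover 256 CPUs in this sketch. */
#define MASK_WORDS 4

/* Count the set bits in one 64-bit word (portable, no compiler builtins). */
static unsigned int popcount64(uint64_t w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical analogue of cpumask_weight_and(): number of bits set in
 * both a and b, computed without materialising (a & b) as a third mask.
 */
static unsigned int mask_weight_and(const uint64_t *a, const uint64_t *b)
{
	unsigned int weight = 0;
	size_t i;

	for (i = 0; i < MASK_WORDS; i++)
		weight += popcount64(a[i] & b[i]);
	return weight;
}
```

The stack cost here is a few scalars regardless of `MASK_WORDS`, whereas a temporary mask would grow with the number of CPUs the build supports.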
2024-04-23cpuidle: kirkwood: Convert to platform remove callback returning voidYangtao Li1-3/+2
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is (mostly) ignored and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/20230712094014.41787-1-frank.li@vivo.com
2024-04-04cpuidle: psci: Update init level to core_initcall()Maulik Shah1-1/+1
Clients like regulators, interconnects and clocks depend on rpmh-rsc to vote on resources and rpmh-rsc depends on psci power-domains to complete probe. All of them are in core_initcall(). Change psci domain init level to core_initcall() to avoid probe defer from all of the above. Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com> Link: https://lore.kernel.org/r/20240217-init_level-v1-2-bde9e11f8317@quicinc.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2024-04-04cpuidle: psci: Drop superfluous wrappers psci_dt_attach|detach_cpu()Ulf Hansson3-22/+4
To simplify the code, let's drop psci_dt_attach|detach_cpu() and use the common dt_idle_attach|detach_cpu() directly instead. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20240228151139.2650258-1-ulf.hansson@linaro.org
2024-03-22Merge tag 'riscv-for-linus-6.9-mw2' of ↵Linus Torvalds1-44/+5
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for various vector-accelerated crypto routines - Hibernation is now enabled for portable kernel builds - mmap_rnd_bits_max is larger on systems with larger VAs - Support for fast GUP - Support for membarrier-based instruction cache synchronization - Support for the Andes hart-level interrupt controller and PMU - Some cleanups around unaligned access speed probing and Kconfig settings - Support for ACPI LPI and CPPC - Various cleanups related to barriers - A handful of fixes * tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits) riscv: Fix syscall wrapper for >word-size arguments crypto: riscv - add vector crypto accelerated AES-CBC-CTS crypto: riscv - parallelize AES-CBC decryption riscv: Only flush the mm icache when setting an exec pte riscv: Use kcalloc() instead of kzalloc() riscv/barrier: Add missing space after ',' riscv/barrier: Consolidate fence definitions riscv/barrier: Define RISCV_FULL_BARRIER riscv/barrier: Define __{mb,rmb,wmb} RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ cpufreq: Move CPPC configs to common Kconfig and add RISC-V ACPI: RISC-V: Add CPPC driver ACPI: Enable ACPI_PROCESSOR for RISC-V ACPI: RISC-V: Add LPI driver cpuidle: RISC-V: Move few functions to arch/riscv riscv: Introduce set_compat_task() in asm/compat.h riscv: Introduce is_compat_thread() into compat.h riscv: add compile-time test into is_compat_task() riscv: Replace direct thread flag check with is_compat_task() riscv: Improve arch_get_mmap_end() macro ...
2024-03-20cpuidle: RISC-V: Move few functions to arch/riscvSunil V L1-44/+5
To support ACPI Low Power Idle (LPI), a few functions are required which are currently static functions in the DT based cpuidle driver. Hence, move them under arch/riscv so that the ACPI driver can also use them. Since they are no longer static functions, prepend the "riscv_" prefix to the function names. Signed-off-by: Sunil V L <sunilvl@ventanamicro.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20240118062930.245937-2-sunilvl@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-03-15Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. 
- Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. 
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on architectures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This is a step along the path to implementation of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". 
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
2024-02-22x86/mm: delete unused cpu argument to leave_mm()Yosry Ahmed1-1/+1
The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm"), delete it. Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-12cpuidle: Avoid potential overflow in integer multiplicationC Cheng1-1/+2
In detail: In C, when you perform a multiplication, if both operands are of int type, the multiplication is performed on the int type, and then the result is converted to the target type. This means that if the product of the int multiplication exceeds the range that the int type can represent, an overflow will occur even if you store the result in a variable of int64_t type. For a multiplication of two int values, it is better to use mul_u32_u32() rather than s->exit_latency_ns = s->exit_latency * NSEC_PER_USEC to avoid a potential overflow. Signed-off-by: C Cheng <C.Cheng@mediatek.com> Signed-off-by: Bo Ye <bo.ye@mediatek.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [ rjw: New subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
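The pitfall described above can be reproduced in userspace. The following is a minimal sketch, not the kernel code: `sketch_mul_u32_u32()` is a hypothetical stand-in for mul_u32_u32(), `SKETCH_NSEC_PER_USEC` is an assumed constant, and unsigned 32-bit operands are used so that the wrap-around is well defined (signed overflow would be undefined behaviour).

```c
#include <stdint.h>

/* Assumed for this sketch: nanoseconds per microsecond as a 32-bit value. */
#define SKETCH_NSEC_PER_USEC 1000u

/*
 * Narrow multiply: the product is computed in 32 bits and wraps modulo 2^32
 * before it is widened, so the 64-bit return type does not help.
 */
static uint64_t latency_ns_narrow(uint32_t latency_us)
{
	uint32_t product = latency_us * SKETCH_NSEC_PER_USEC;

	return product;
}

/* Hypothetical stand-in for mul_u32_u32(): widen one operand first. */
static uint64_t sketch_mul_u32_u32(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}

/* Wide multiply: the full 64-bit product is kept. */
static uint64_t latency_ns_wide(uint32_t latency_us)
{
	return sketch_mul_u32_u32(latency_us, SKETCH_NSEC_PER_USEC);
}
```

With a latency of 5,000,000 us, the narrow variant yields 705,032,704 (5,000,000,000 modulo 2^32), while the wide variant yields the expected 5,000,000,000 ns.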
2024-02-12cpuidle: haltpoll: do not shrink guest poll_limit_ns below grow_startParshuram Sangle1-2/+7
While adjusting guest halt poll limit, grow block starts at guest_halt_poll_grow_start without taking intermediate values. Similar behavior is expected while shrinking the value. This avoids short interval values which are really not required. VCPU1 trace (guest_halt_poll_shrink equals 2): VCPU1 grow 10000 VCPU1 shrink 5000 VCPU1 shrink 2500 VCPU1 shrink 1250 VCPU1 shrink 625 VCPU1 shrink 312 VCPU1 shrink 156 VCPU1 shrink 78 VCPU1 shrink 39 VCPU1 shrink 19 VCPU1 shrink 9 VCPU1 shrink 4 Similar change is done in KVM halt poll flow with below patch: Link: https://lore.kernel.org/kvm/20211006133021.271905-3-sashal@kernel.org/ Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
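The shrink behaviour described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the driver's code: `sketch_shrink_poll_limit()` is a hypothetical name, and it is assumed that a value shrunk below the grow-start threshold is reset to 0 (rather than clamped), and that a shrink factor of 0 also means "stop polling".

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the next poll limit when shrinking.
 * poll_limit_ns  - current limit
 * shrink         - divisor applied on each shrink step (assumed: 0 disables polling)
 * grow_start_ns  - smallest non-zero limit the grow path would ever set
 */
static uint64_t sketch_shrink_poll_limit(uint64_t poll_limit_ns,
					 unsigned int shrink,
					 uint64_t grow_start_ns)
{
	uint64_t val;

	if (shrink == 0)
		return 0;

	val = poll_limit_ns / shrink;

	/* Assumption: skip the tail of tiny values by dropping straight to 0. */
	if (val < grow_start_ns)
		val = 0;

	return val;
}
```

With a grow-start of 10000 and a shrink factor of 2, a limit of 10000 goes directly to 0 in one step instead of walking through 5000, 2500, 1250 and so on as in the trace above.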
2023-12-29cpuidle: haltpoll: Do not enable interrupts when entering idleBorislav Petkov (AMD)1-5/+4
The cpuidle drivers' ->enter() methods are supposed to be IRQ invariant: 5e26aa933911 ("cpuidle/poll: Ensure IRQs stay disabled after cpuidle_state::enter() calls") bb7b11258561 ("cpuidle: Move IRQ state validation") Do that in the haltpoll driver too. Fixes: 5e26aa933911 ("cpuidle/poll: Ensure IRQs stay disabled after cpuidle_state::enter() calls") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218245 Reported-by: <forza@tnonline.net> Tested-by: <forza@tnonline.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-09-30cpuidle: dt: Replace deprecated strncpy() with strscpy()Justin Stitt1-2/+2
`strncpy` is deprecated for use on NUL-terminated destination strings [1]. We should prefer more robust and less ambiguous string interfaces. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer. With this, we can also drop the now unnecessary `CPUIDLE_(NAME|DESC)_LEN - 1` pieces. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230913-strncpy-drivers-cpuidle-dt_idle_states-c-v1-1-d16a0dbe5658@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
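The NUL-termination guarantee mentioned above is what makes the `- 1` adjustments unnecessary. The sketch below is a userspace approximation, not the kernel's strscpy(): `sketch_strscpy()` is a hypothetical name, and it returns -1 on truncation as a stand-in for the kernel's error code.

```c
#include <stddef.h>

/*
 * Hypothetical strscpy()-like helper: copy at most size - 1 characters
 * and always NUL-terminate dst when size is non-zero.
 * Returns the number of characters copied, or -1 if src was truncated
 * or size is 0.
 */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;

	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source did not end here, something was cut off. */
	return src[i] == '\0' ? (long)i : -1;
}
```

A caller can therefore pass the full buffer size, for example `sketch_strscpy(name, src, sizeof(name))`, and detect truncation from the return value instead of terminating the buffer by hand.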
2023-08-31Merge tag 'powerpc-6.6-1' of ↵Linus Torvalds1-7/+1
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the configured SMT state when hotplugging CPUs into the system - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the Radix MMU to avoid a broadcast TLBIE flush on exit - Drop the exclusion between ptrace/perf watchpoints, and drop the now unused associated arch hooks - Add support for the "nohlt" command line option to disable CPU idle - Add support for -fpatchable-function-entry for ftrace, with GCC >= 13.1 - Rework memory block size determination, and support 256MB size on systems with GPUs that have hotpluggable memory - Various other small features and fixes Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar, Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain, Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai. 
* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits) macintosh/ams: linux/platform_device.h is needed powerpc/xmon: Reapply "Relax frame size for clang" powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled powerpc/iommu: Fix notifiers being shared by PCI and VIO buses powerpc/mpc5xxx: Add missing fwnode_handle_put() powerpc/config: Disable SLAB_DEBUG_ON in skiroot powerpc/pseries: Remove unused hcall tracing instruction powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n powerpc: dts: add missing space before { powerpc/eeh: Use pci_dev_id() to simplify the code powerpc/64s: Move CPU -mtune options into Kconfig powerpc/powermac: Fix unused function warning powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT powerpc: Don't include lppaca.h in paca.h powerpc/pseries: Move hcall_vphn() prototype into vphn.h powerpc/pseries: Move VPHN constants into vphn.h cxl: Drop unused detach_spa() powerpc: Drop zalloc_maybe_bootmem() powerpc/powernv: Use struct opal_prd_msg in more places ...
2023-08-25Merge branches 'pm-cpuidle' and 'pm-cpufreq'Rafael J. Wysocki3-111/+203
Merge CPU power management updates for 6.6-rc1: - Rework the menu and teo cpuidle governors to avoid calling tick_nohz_get_sleep_length(), which is likely to become quite expensive going forward, too often and improve making decisions regarding whether or not to stop the scheduler tick in the teo governor (Rafael Wysocki). - Improve the performance of cpufreq_stats_create_table() in some cases (Liao Chang). - Fix two issues in the amd-pstate-ut cpufreq driver (Swapnil Sapkal). - Use clamp() helper macro to improve the code readability in cpufreq_verify_within_limits() (Liao Chang). - Set stale CPU frequency to minimum in intel_pstate (Doug Smythies). * pm-cpuidle: cpuidle: teo: Avoid unnecessary variable assignments cpuidle: menu: Skip tick_nohz_get_sleep_length() call in some cases cpuidle: teo: Gather statistics regarding whether or not to stop the tick cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases cpuidle: teo: Do not call tick_nohz_get_sleep_length() upfront cpuidle: teo: Drop utilized from struct teo_cpu cpuidle: teo: Avoid stopping the tick unnecessarily when bailing out cpuidle: teo: Update idle duration estimate when choosing shallower state * pm-cpufreq: cpufreq: amd-pstate-ut: Fix kernel panic when loading the driver cpufreq: amd-pstate-ut: Remove module parameter access cpufreq: Use clamp() helper macro to improve the code readability cpufreq: intel_pstate: set stale CPU frequency to minimum cpufreq: stats: Improve the performance of cpufreq_stats_create_table()
2023-08-24powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPTRussell Currey1-7/+1
lppaca_shared_proc() takes a pointer to the lppaca which is typically accessed through get_lppaca(). With DEBUG_PREEMPT enabled, this leads to checking if preemption is enabled, for example: BUG: using smp_processor_id() in preemptible [00000000] code: grep/10693 caller is lparcfg_data+0x408/0x19a0 CPU: 4 PID: 10693 Comm: grep Not tainted 6.5.0-rc3 #2 Call Trace: dump_stack_lvl+0x154/0x200 (unreliable) check_preemption_disabled+0x214/0x220 lparcfg_data+0x408/0x19a0 ... This isn't actually a problem however, as it does not matter which lppaca is accessed, the shared proc state will be the same. vcpudispatch_stats_procfs_init() already works around this by disabling preemption, but the lparcfg code does not, erroring any time /proc/powerpc/lparcfg is accessed with DEBUG_PREEMPT enabled. Instead of disabling preemption on the caller side, rework lppaca_shared_proc() to not take a pointer and instead directly access the lppaca, bypassing any potential preemption checks. Fixes: f13c13a00512 ("powerpc: Stop using non-architected shared_proc field in lppaca") Signed-off-by: Russell Currey <ruscur@russell.cc> [mpe: Rework to avoid needing a definition in paca.h and lppaca.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230823055317.751786-4-mpe@ellerman.id.au