summaryrefslogtreecommitdiff
path: root/lib
diff options
context:
space:
mode:
authorDouglas Anderson <dianders@chromium.org>2023-05-19 20:18:38 +0300
committerAndrew Morton <akpm@linux-foundation.org>2023-06-10 03:44:21 +0300
commit1f423c905a6b43b493df1b259e6e6267e5624e62 (patch)
tree0de5c948d85fde528cdd629090c893352414d687 /lib
parentd9b3629ade8ebffb0075e311409796a56bac8282 (diff)
downloadlinux-1f423c905a6b43b493df1b259e6e6267e5624e62.tar.xz
watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
Implement a hardlockup detector that doesn't doesn't need any extra arch-specific support code to detect lockups. Instead of using something arch-specific we will use the buddy system, where each CPU watches out for another one. Specifically, each CPU will use its softlockup hrtimer to check that the next CPU is processing hrtimer interrupts by verifying that a counter is increasing. NOTE: unlike the other hard lockup detectors, the buddy one can't easily show what's happening on the CPU that locked up just by doing a simple backtrace. It relies on some other mechanism in the system to get information about the locked up CPUs. This could be support for NMI backtraces like [1], it could be a mechanism for printing the PC of locked CPUs at panic time like [2] / [3], or it could be something else. Even though that means we still rely on arch-specific code, this arch-specific code seems to often be implemented even on architectures that don't have a hardlockup detector. This style of hardlockup detector originated in some downstream Android trees and has been rebased on / carried in ChromeOS trees for quite a long time for use on arm and arm64 boards. Historically on these boards we've leveraged mechanism [2] / [3] to get information about hung CPUs, but we could move to [1]. Although the original motivation for the buddy system was for use on systems without an arch-specific hardlockup detector, it can still be useful to use even on systems that _do_ have an arch-specific hardlockup detector. On x86, for instance, there is a 24-part patch series [4] in progress switching the arch-specific hard lockup detector from a scarce perf counter to a less-scarce hardware resource. Potentially the buddy system could be a simpler alternative to free up the perf counter but still get hard lockup detection. Overall, pros (+) and cons (-) of the buddy system compared to an arch-specific hardlockup detector (which might be implemented using perf): + The buddy system is usable on systems that don't have an arch-specific hardlockup detector, like arm32 and arm64 (though it's being worked on for arm64 [5]). + The buddy system may free up scarce hardware resources. + If a CPU totally goes out to lunch (can't process NMIs) the buddy system could still detect the problem (though it would be unlikely to be able to get a stack trace). + The buddy system uses the same timer function to pet the hardlockup detector on the running CPU as it uses to detect hardlockups on other CPUs. Compared to other hardlockup detectors, this means it generates fewer interrupts and thus is likely better able to let CPUs stay idle longer. - If all CPUs are hard locked up at the same time the buddy system can't detect it. - If we don't have SMP we can't use the buddy system. - The buddy system needs an arch-specific mechanism (possibly NMI backtrace) to get info about the locked up CPU. [1] https://lore.kernel.org/r/20230419225604.21204-1-dianders@chromium.org [2] https://issuetracker.google.com/172213129 [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html [4] https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calderon@linux.intel.com/ [5] https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.chen@mediatek.com/ Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Tzung-Bi Shih <tzungbi@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ian Rogers <irogers@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Sumit Garg <sumit.garg@linaro.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'lib')
-rw-r--r--lib/Kconfig.debug52
1 files changed, 49 insertions, 3 deletions
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce51d4dc6803..abcad0513a32 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1035,10 +1035,55 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
Say N if unsure.
-config HARDLOCKUP_DETECTOR_PERF
+# Both the "perf" and "buddy" hardlockup detectors count hrtimer
+# interrupts. This config enables functions managing this common code.
+config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
bool
select SOFTLOCKUP_DETECTOR
+config HARDLOCKUP_DETECTOR_PERF
+ bool
+ depends on HAVE_HARDLOCKUP_DETECTOR_PERF
+ select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
+
+config HARDLOCKUP_DETECTOR_BUDDY
+ bool
+ depends on SMP
+ select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
+
+# For hardlockup detectors you can have one directly provided by the arch
+# or use a "non-arch" one. If you're using a "non-arch" one that is
+# further divided the perf hardlockup detector (which, confusingly, needs
+# arch-provided perf support) and the buddy hardlockup detector (which just
+# needs SMP). In either case, using the "non-arch" code conflicts with
+# the NMI watchdog code (which is sometimes used directly and sometimes used
+# by the arch-provided hardlockup detector).
+config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+ bool
+ depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG
+ default y
+
+config HARDLOCKUP_DETECTOR_PREFER_BUDDY
+ bool "Prefer the buddy CPU hardlockup detector"
+ depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH && HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
+ help
+ Say Y here to prefer the buddy hardlockup detector over the perf one.
+
+ With the buddy detector, each CPU uses its softlockup hrtimer
+ to check that the next CPU is processing hrtimer interrupts by
+ verifying that a counter is increasing.
+
+ This hardlockup detector is useful on systems that don't have
+ an arch-specific hardlockup detector or if resources needed
+ for the hardlockup detector are better used for other things.
+
+# This will select the appropriate non-arch hardlockdup detector
+config HARDLOCKUP_DETECTOR_NON_ARCH
+ bool
+ depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+ select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
+ select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
+
#
# Enables a timestamp based low pass filter to compensate for perf based
# hard lockup detection which runs too fast due to turbo modes.
@@ -1053,9 +1098,10 @@ config HARDLOCKUP_CHECK_TIMESTAMP
config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
depends on DEBUG_KERNEL && !S390
- depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH
+ depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || HAVE_HARDLOCKUP_DETECTOR_ARCH
select LOCKUP_DETECTOR
- select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF
+ select HARDLOCKUP_DETECTOR_NON_ARCH if HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
+
help
Say Y here to enable the kernel to act as a watchdog to detect
hard lockups.