diff options
author | Quan Nguyen <quan@os.amperecomputing.com> | 2022-10-31 05:44:41 +0300 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2022-11-10 21:02:43 +0300 |
commit | 4a4a4e9ebaa3ce903a3cdf8bb173eeaf87828cea (patch) | |
tree | a4f8d334be1e8ec88b8bb68d1e9b69af8114603f /Documentation | |
parent | c08645ea215c446ceb21029fe5416e6a62cbbed7 (diff) | |
download | linux-4a4a4e9ebaa3ce903a3cdf8bb173eeaf87828cea.tar.xz |
misc: smpro-errmon: Add Ampere's SMpro error monitor driver
Add Ampere's SMpro error monitor driver for monitoring and reporting
RAS-related errors as reported by SMpro co-processor found on Ampere's
Altra processor family.
Signed-off-by: Quan Nguyen <quan@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20221031024442.2490881-3-quan@os.amperecomputing.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro | 264 |
1 files changed, 264 insertions, 0 deletions
diff --git a/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro new file mode 100644 index 000000000000..2b84dc8c3149 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro @@ -0,0 +1,264 @@ +What: /sys/bus/platform/devices/smpro-errmon.*/error_[core|mem|pcie|other]_[ce|ue] +KernelVersion: 6.1 +Contact: Quan Nguyen <quan@os.amperecomputing.com> +Description: + (RO) Contains the 48-byte Ampere (Vendor-Specific) Error Record printed + in hex format according to the table below: + + +--------+---------------+-------------+------------------------------------------------------------+ + | Offset | Field | Size (byte) | Description | + +--------+---------------+-------------+------------------------------------------------------------+ + | 00 | Error Type | 1 | See :ref:`the table below <smpro-error-types>` for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 01 | Subtype | 1 | See :ref:`the table below <smpro-error-types>` for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 02 | Instance | 2 | See :ref:`the table below <smpro-error-types>` for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 04 | Error status | 4 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 08 | Error Address | 8 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 16 | Error Misc 0 | 8 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 24 | Error Misc 1 | 8 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 32 | Error Misc 2 | 8 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + | 40 | Error Misc 3 | 8 | See ARM RAS specification for details | + +--------+---------------+-------------+------------------------------------------------------------+ + + The table below defines the value of error types, their subtype, subcomponent and instance: + + .. _smpro-error-types: + + +-----------------+------------+----------+----------------+----------------------------------------+ + | Error Group | Error Type | Sub type | Sub component | Instance | + +-----------------+------------+----------+----------------+----------------------------------------+ + | CPM (core) | 0 | 0 | Snoop-Logic | CPM # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | CPM (core) | 0 | 2 | Armv8 Core 1 | CPM # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 1 | ERR1 | MCU # \| SLOT << 11 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 2 | ERR2 | MCU # \| SLOT << 11 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 3 | ERR3 | MCU # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 4 | ERR4 | MCU # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 5 | ERR5 | MCU # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 6 | ERR6 | MCU # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | MCU (mem) | 1 | 7 | Link Error | MCU # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | Mesh (other) | 2 | 0 | Cross Point | X \| (Y << 5) \| NS <<11 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | Mesh (other) | 2 | 1 | Home Node(IO) | X \| (Y << 5) \| NS <<11 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | Mesh (other) | 2 | 2 | Home Node(Mem) | X \| (Y << 5) \| NS <<11 \| device<<12 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | Mesh (other) | 2 | 4 | CCIX Node | X \| (Y << 5) \| NS <<11 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | 2P Link (other) | 3 | 0 | N/A | Altra 2P Link # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 0 | ERR0 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 1 | ERR1 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 2 | ERR2 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 3 | ERR3 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 4 | ERR4 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 5 | ERR5 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 6 | ERR6 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 7 | ERR7 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 8 | ERR8 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 9 | ERR9 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 10 | ERR10 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 11 | ERR11 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 12 | ERR12 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | GIC (other) | 5 | 13-21 | ERR13 | RC # + 1 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TCU | 100 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU0 | 0 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU1 | 1 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU2 | 2 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU3 | 3 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU4 | 4 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU5 | 5 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU6 | 6 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU7 | 7 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU8 | 8 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMMU (other) | 6 | TBU9 | 9 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PCIe AER (pcie) | 7 | Root | 0 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PCIe AER (pcie) | 7 | Device | 1 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PCIe RC (pcie) | 8 | RCA HB | 0 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PCIe RC (pcie) | 8 | RCB HB | 1 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PCIe RC (pcie) | 8 | RASDP | 8 | RC # | + +-----------------+------------+----------+----------------+----------------------------------------+ + | OCM (other) | 9 | ERR0 | 0 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | OCM (other) | 9 | ERR1 | 1 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | OCM (other) | 9 | ERR2 | 2 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMpro (other) | 10 | ERR0 | 0 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMpro (other) | 10 | ERR1 | 1 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | SMpro (other) | 10 | MPA_ERR | 2 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PMpro (other) | 11 | ERR0 | 0 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PMpro (other) | 11 | ERR1 | 1 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + | PMpro (other) | 11 | MPA_ERR | 2 | 0 | + +-----------------+------------+----------+----------------+----------------------------------------+ + + Example:: + + # cat error_other_ue + 880807001e004010401040101500000001004010401040100c0000000000000000000000000000000000000000000000 + + The detail of each sysfs entries is as below: + + +-------------+---------------------------------------------------------+----------------------------------+ + | Error | Sysfs entry | Description (when triggered) | + +-------------+---------------------------------------------------------+----------------------------------+ + | Core's CE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ce | Core has CE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | Core's UE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ue | Core has UE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ce | Memory has CE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ue | Memory has UE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ce | any PCIe controller has CE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ue | any PCIe controller has UE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | Other's CE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ce | any other CE error | + +-------------+---------------------------------------------------------+----------------------------------+ + | Other's UE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ue | any other UE error | + +-------------+---------------------------------------------------------+----------------------------------+ + + UE: Uncorrect-able Error + CE: Correct-able Error + + For details, see section `3.3 Ampere (Vendor-Specific) Error Record Formats, + Altra Family RAS Supplement`. + + +What: /sys/bus/platform/devices/smpro-errmon.*/overflow_[core|mem|pcie|other]_[ce|ue] +KernelVersion: 6.1 +Contact: Quan Nguyen <quan@os.amperecomputing.com> +Description: + (RO) Return the overflow status of each type HW error reported: + + - 0 : No overflow + - 1 : There is an overflow and the oldest HW errors are dropped + + The detail of each sysfs entries is as below: + + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Overflow | Sysfs entry | Description | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Core's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ce | Core CE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Core's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ue | Core UE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ce | Memory CE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ue | Memory UE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ce | any PCIe controller CE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ue | any PCIe controller UE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Other's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ce| any other CE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + | Other's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ue| other UE error overflow | + +-------------+-----------------------------------------------------------+---------------------------------------+ + + where: + + - UE: Uncorrect-able Error + - CE: Correct-able Error + +What: /sys/bus/platform/devices/smpro-errmon.*/[error|warn]_[smpro|pmpro] +KernelVersion: 6.1 +Contact: Quan Nguyen <quan@os.amperecomputing.com> +Description: + (RO) Contains the internal firmware error/warning printed as hex format. + + The detail of each sysfs entries is as below: + + +---------------+------------------------------------------------------+--------------------------+ + | Error | Sysfs entry | Description | + +---------------+------------------------------------------------------+--------------------------+ + | SMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_smpro | system has SMpro error | + +---------------+------------------------------------------------------+--------------------------+ + | SMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_smpro | system has SMpro warning | + +---------------+------------------------------------------------------+--------------------------+ + | PMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_pmpro | system has PMpro error | + +---------------+------------------------------------------------------+--------------------------+ + | PMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_pmpro | system has PMpro warning | + +---------------+------------------------------------------------------+--------------------------+ + + For details, see section `5.10 RAS Internal Error Register Definitions, + Altra Family Soc BMC Interface Specification`. + +What: /sys/bus/platform/devices/smpro-errmon.*/event_[vrd_warn_fault|vrd_hot|dimm_hot] +KernelVersion: 6.1 +Contact: Quan Nguyen <quan@os.amperecomputing.com> +Description: + (RO) Contains the detail information in case of VRD/DIMM warning/hot events + in hex format as below:: + + AAAA + + where: + + - ``AAAA``: The event detail information data + + The detail of each sysfs entries is as below: + + +---------------+---------------------------------------------------------------+---------------------+ + | Event | Sysfs entry | Description | + +---------------+---------------------------------------------------------------+---------------------+ + | VRD HOT | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_hot | VRD Hot | + +---------------+---------------------------------------------------------------+---------------------+ + | VR Warn/Fault | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_warn_fault | VR Warning or Fault | + +---------------+---------------------------------------------------------------+---------------------+ + | DIMM HOT | /sys/bus/platform/devices/smpro-errmon.*/event_dimm_hot | DIMM Hot | + +---------------+---------------------------------------------------------------+---------------------+ + + For more details, see section `5.7 GPI Status Registers, + Altra Family Soc BMC Interface Specification`. + |