From 81aa5206f9a7c9793e2f7971400351664e40b04f Mon Sep 17 00:00:00 2001 From: Rajat Jain Date: Thu, 21 Jun 2018 16:48:28 -0700 Subject: PCI/AER: Add sysfs attributes to provide AER stats and breakdown Add sysfs attributes to provide total and breakdown of the AERs seen, into different type of correctable, fatal and nonfatal errors: /sys/bus/pci/devices//aer_dev_correctable /sys/bus/pci/devices//aer_dev_fatal /sys/bus/pci/devices//aer_dev_nonfatal Signed-off-by: Rajat Jain Signed-off-by: Bjorn Helgaas --- .../ABI/testing/sysfs-bus-pci-devices-aer_stats | 94 ++++++++++++++++++++++ Documentation/PCI/pcieaer-howto.txt | 5 ++ 2 files changed, 99 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats new file mode 100644 index 000000000000..3a784297cfed --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats @@ -0,0 +1,94 @@ +========================== +PCIe Device AER statistics +========================== +These attributes show up under all the devices that are AER capable. These +statistical counters indicate the errors "as seen/reported by the device". +Note that this may mean that if an endpoint is causing problems, the AER +counters may increment at its link partner (e.g. root port) because the +errors may be "seen" / reported by the link partner and not the +problematic endpoint itself (which may report all counters as 0 as it never +saw any problems). + +Where: /sys/bus/pci/devices//aer_dev_correctable +Date: July 2018 +Kernel Version: 4.19.0 +Contact: linux-pci@vger.kernel.org, rajatja@google.com +Description: List of correctable errors seen and reported by this + PCI device using ERR_COR. Note that since multiple errors may + be reported using a single ERR_COR message, thus + TOTAL_ERR_COR at the end of the file may not match the actual + total of all the errors in the file. Sample output: +------------------------------------------------------------------------- +localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable +Receiver Error 2 +Bad TLP 0 +Bad DLLP 0 +RELAY_NUM Rollover 0 +Replay Timer Timeout 0 +Advisory Non-Fatal 0 +Corrected Internal Error 0 +Header Log Overflow 0 +TOTAL_ERR_COR 2 +------------------------------------------------------------------------- + +Where: /sys/bus/pci/devices//aer_dev_fatal +Date: July 2018 +Kernel Version: 4.19.0 +Contact: linux-pci@vger.kernel.org, rajatja@google.com +Description: List of uncorrectable fatal errors seen and reported by this + PCI device using ERR_FATAL. Note that since multiple errors may + be reported using a single ERR_FATAL message, thus + TOTAL_ERR_FATAL at the end of the file may not match the actual + total of all the errors in the file. Sample output: +------------------------------------------------------------------------- +localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal +Undefined 0 +Data Link Protocol 0 +Surprise Down Error 0 +Poisoned TLP 0 +Flow Control Protocol 0 +Completion Timeout 0 +Completer Abort 0 +Unexpected Completion 0 +Receiver Overflow 0 +Malformed TLP 0 +ECRC 0 +Unsupported Request 0 +ACS Violation 0 +Uncorrectable Internal Error 0 +MC Blocked TLP 0 +AtomicOp Egress Blocked 0 +TLP Prefix Blocked Error 0 +TOTAL_ERR_FATAL 0 +------------------------------------------------------------------------- + +Where: /sys/bus/pci/devices//aer_dev_nonfatal +Date: July 2018 +Kernel Version: 4.19.0 +Contact: linux-pci@vger.kernel.org, rajatja@google.com +Description: List of uncorrectable nonfatal errors seen and reported by this + PCI device using ERR_NONFATAL. Note that since multiple errors + may be reported using a single ERR_FATAL message, thus + TOTAL_ERR_NONFATAL at the end of the file may not match the + actual total of all the errors in the file. Sample output: +------------------------------------------------------------------------- +localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal +Undefined 0 +Data Link Protocol 0 +Surprise Down Error 0 +Poisoned TLP 0 +Flow Control Protocol 0 +Completion Timeout 0 +Completer Abort 0 +Unexpected Completion 0 +Receiver Overflow 0 +Malformed TLP 0 +ECRC 0 +Unsupported Request 0 +ACS Violation 0 +Uncorrectable Internal Error 0 +MC Blocked TLP 0 +AtomicOp Egress Blocked 0 +TLP Prefix Blocked Error 0 +TOTAL_ERR_NONFATAL 0 +------------------------------------------------------------------------- diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt index acd0dddd6bb8..48ce7903e3c6 100644 --- a/Documentation/PCI/pcieaer-howto.txt +++ b/Documentation/PCI/pcieaer-howto.txt @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends the error message to root port. Pls. refer to pci express specs for other fields. +2.4 AER Statistics / Counters + +When PCIe AER errors are captured, the counters / statistics are also exposed +in the form of sysfs attributes which are documented at +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats 3. Developer Guide -- cgit v1.2.3