Diffstat (limited to 'Documentation/drivers/edac')
-rw-r--r-- Documentation/drivers/edac/edac.txt | 152
1 files changed, 27 insertions, 125 deletions
diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt
index 70d96a62e5e1..7b3d969d2964 100644
--- a/Documentation/drivers/edac/edac.txt
+++ b/Documentation/drivers/edac/edac.txt
@@ -35,15 +35,14 @@ the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.
-The PCI Parity EDAC device has the ability to "skip" known flaky
-cards during the parity scan. These are set by the parity "blacklist"
-interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
-section below.) There is also a parity "whitelist" which is used as
-an explicit list of devices to scan, while the blacklist is a list
-of devices to skip.
+[There are patches in the kernel queue which will allow for storage of
+quirks of PCI devices reporting false parity positives. The 2.6.18
+kernel should have those patches included. When that becomes available,
+then EDAC will be patched to utilize that information to "skip" such
+devices.]
-EDAC will have future error detectors that will be added or integrated
-into EDAC in the following list:
+EDAC will have future error detectors that will be integrated with
+EDAC or added to it, in the following list:
MCE Machine Check Exception
MCA Machine Check Architecture
@@ -93,22 +92,24 @@ EDAC lives in the /sys/devices/system/edac directory. Within this directory
there currently reside 2 'edac' components:
mc memory controller(s) system
- pci PCI status system
+ pci PCI control and status system
============================================================================
Memory Controller (mc) Model
First a background on the memory controller's model abstracted in EDAC.
-Each mc device controls a set of DIMM memory modules. These modules are
+Each 'mc' device controls a set of DIMM memory modules. These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
-be multiple csrows and two channels.
+be multiple csrows and multiple channels.
Memory controllers allow for several csrows, with 8 csrows being a typical value.
Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.
Dual channels allows for 128 bit data transfers to the CPU from memory.
+Some newer chipsets, such as those for Fully Buffered DIMMs (FB-DIMMs), allow
+for more than 2 channels. The following example will assume 2 channels:
Channel 0 Channel 1
@@ -234,23 +235,15 @@ Polling period control file:
The time period, in milliseconds, for polling for error information.
Too small a value wastes resources. Too large a value might delay
necessary handling of errors and might loose valuable information for
- locating the error. 1000 milliseconds (once each second) is about
- right for most uses.
+ locating the error. 1000 milliseconds (once each second) is the current
+ default. Systems that need all the bandwidth they can get may
+ increase this value.
LOAD TIME: module/kernel parameter: poll_msec=[0|1]
RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
-Module Version read-only attribute file:
-
- 'mc_version'
-
- The EDAC CORE module's version and compile date are shown here to
- indicate what EDAC is running.
-
-
-
============================================================================
'mcX' DIRECTORIES
@@ -284,35 +277,6 @@ Seconds since last counter reset control file:
-DIMM capability attribute file:
-
- 'edac_capability'
-
- The EDAC (Error Detection and Correction) capabilities/modes of
- the memory controller hardware.
-
-
-DIMM Current Capability attribute file:
-
- 'edac_current_capability'
-
- The EDAC capabilities available with the hardware
- configuration. This may not be the same as "EDAC capability"
- if the correct memory is not used. If a memory controller is
- capable of EDAC, but DIMMs without check bits are in use, then
- Parity, SECDED, S4ECD4ED capabilities will not be available
- even though the memory controller might be capable of those
- modes with the proper memory loaded.
-
-
-Memory Type supported on this controller attribute file:
-
- 'supported_mem_type'
-
- This attribute file displays the memory type, usually
- buffered and unbuffered DIMMs.
-
-
Memory Controller name attribute file:
'mc_name'
@@ -321,16 +285,6 @@ Memory Controller name attribute file:
that is being utilized.
-Memory Controller Module name attribute file:
-
- 'module_name'
-
- This attribute file displays the memory controller module name,
- version and date built. The name of the memory controller
- hardware - some drivers work with multiple controllers and
- this field shows which hardware is present.
-
-
Total memory managed by this memory controller attribute file:
'size_mb'
@@ -432,6 +386,9 @@ Memory Type attribute file:
This attribute file will display what type of memory is currently
on this csrow. Normally, either buffered or unbuffered memory.
+ Examples:
+ Registered-DDR
+ Unbuffered-DDR
EDAC Mode of operation attribute file:
@@ -446,8 +403,13 @@ Device type attribute file:
'dev_type'
- This attribute file will display what type of DIMM device is
- being utilized. Example: x4
+ This attribute file will display what type of DRAM device is
+ being utilized on this DIMM.
+ Examples:
+ x1
+ x2
+ x4
+ x8
Channel 0 CE Count attribute file:
@@ -522,10 +484,10 @@ SYSTEM LOGGING
If logging for UEs and CEs are enabled then system logs will have
error notices indicating errors that have been detected:
-MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
+EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac
-MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
+EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac
@@ -610,64 +572,4 @@ Parity Count:
-PCI Device Whitelist:
-
- 'pci_parity_whitelist'
-
- This control file allows for an explicit list of PCI devices to be
- scanned for parity errors. Only devices found on this list will
- be examined. The list is a line of hexadecimal VENDOR and DEVICE
- ID tuples:
-
- 1022:7450,1434:16a6
-
- One or more can be inserted, separated by a comma.
-
- To write the above list doing the following as one command line:
-
- echo "1022:7450,1434:16a6"
- > /sys/devices/system/edac/pci/pci_parity_whitelist
-
-
-
- To display what the whitelist is, simply 'cat' the same file.
-
-
-PCI Device Blacklist:
-
- 'pci_parity_blacklist'
-
- This control file allows for a list of PCI devices to be
- skipped for scanning.
- The list is a line of hexadecimal VENDOR and DEVICE ID tuples:
-
- 1022:7450,1434:16a6
-
- One or more can be inserted, separated by a comma.
-
- To write the above list doing the following as one command line:
-
- echo "1022:7450,1434:16a6"
- > /sys/devices/system/edac/pci/pci_parity_blacklist
-
-
- To display what the whitelist currently contains,
- simply 'cat' the same file.
-
=======================================================================
-
-PCI Vendor and Devices IDs can be obtained with the lspci command. Using
-the -n option lspci will display the vendor and device IDs. The system
-administrator will have to determine which devices should be scanned or
-skipped.
-
-
-
-The two lists (white and black) are prioritized. blacklist is the lower
-priority and will NOT be utilized when a whitelist has been set.
-Turn OFF a whitelist by an empty echo command:
-
- echo > /sys/devices/system/edac/pci/pci_parity_whitelist
-
-and any previous blacklist will be utilized.
-
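
The sysfs attributes that remain documented after this patch (poll_msec, mc_name, size_mb, dev_type) can be read with a short shell script. The following is a minimal sketch, not part of the patch: the function name edac_summary and the EDAC_ROOT override are hypothetical, and the layout (mcX directories under mc/, csrowX directories under each mcX, dev_type inside each csrowX) is assumed from the documentation text. Attributes that are missing are reported as "unknown".

```shell
#!/bin/sh
# Sketch only: summarise the EDAC sysfs attributes named in edac.txt.
# EDAC_ROOT is a hypothetical override so the script can be tried on a
# machine without EDAC loaded; the default is the path from the document.
EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac}"

edac_summary() {
    root="$1"

    # Polling period control file, in milliseconds.
    if [ -r "$root/mc/poll_msec" ]; then
        echo "poll_msec: $(cat "$root/mc/poll_msec")"
    fi

    # One 'mcX' directory per memory controller.
    for mc in "$root"/mc/mc[0-9]*; do
        [ -d "$mc" ] || continue
        name=$(cat "$mc/mc_name" 2>/dev/null)
        size=$(cat "$mc/size_mb" 2>/dev/null)
        echo "${mc##*/}: name=${name:-unknown} size_mb=${size:-unknown}"

        # One 'csrowX' directory per chip-select row (assumed layout).
        for row in "$mc"/csrow[0-9]*; do
            [ -d "$row" ] || continue
            dev=$(cat "$row/dev_type" 2>/dev/null)
            echo "  ${row##*/}: dev_type=${dev:-unknown}"
        done
    done
}

edac_summary "$EDAC_ROOT"
```

On a system without EDAC loaded the script prints nothing. Pointing EDAC_ROOT at a directory tree that mimics the sysfs layout is enough to exercise it.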