summaryrefslogtreecommitdiff
path: root/drivers/edac/mce_amd.c
AgeCommit message (Collapse)AuthorFilesLines
2019-05-21treewide: Add SPDX license identifier for more missed filesThomas Gleixner1-0/+1
Add SPDX license identifiers to all files which: - Have no license information of any form - Have MODULE_LICENCE("GPL*") inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-23x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h modelsYazen Ghannam1-2/+2
AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA errors under certain conditions. The errors are benign and can safely be ignored. However, the high error rate may cause the MCA threshold counter to overflow causing a high rate of thresholding interrupts. In addition, users may see the errors reported through the AMD MCE decoder module, even with the interrupt disabled, due to MCA polling. Clear the "Counter Present" bit in the Instruction Fetch bank's MCA_MISC0 register. This will prevent enabling MCA thresholding on this bank which will prevent the high interrupt rate due to this error. Define an AMD-specific function to filter these errors from the MCE event pool so that they don't get reported during early boot. Rename filter function in EDAC/mce_amd to avoid a naming conflict, while at it. [ bp: Move function prototype to the internal header and massage/cleanup, fix typos. ] Reported-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "clemej@gmail.com" <clemej@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Cc: <stable@vger.kernel.org> # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models Cc: <stable@vger.kernel.org> # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk Cc: <stable@vger.kernel.org> # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in <asm/mce.h> Cc: <stable@vger.kernel.org> # 5.0.x Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@amd.com
2019-02-15EDAC/mce_amd: Decode MCA_STATUS in bit definition orderYazen Ghannam1-12/+12
Sort the MCA_STATUS bits in decode output to follow how they are defined in the register. The order is as follows: Bit | Decode ------------ 62 | Over 61 | UC 59 | MiscV 58 | AddrV 57 | PCC 55 | TCC 53 | SyndV 46 | CECC 45 | UECC 44 | Deferred 43 | Poison 40 | Scrub [ bp: Massage a bit. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190212212417.107049-2-Yazen.Ghannam@amd.com
2019-02-15EDAC/mce_amd: Decode MCA_STATUS[Scrub] bitYazen Ghannam1-0/+3
Previous AMD systems have had a bit in MCA_STATUS to indicate that an error was detected on a scrub operation. However, this bit was defined differently within different banks and families/models. Starting with Family 17h, MCA_STATUS[40] is either Reserved/Read-as-Zero or defined as "Scrub", for all MCA banks and CPU models. Therefore, this bit can be defined as the "Scrub" bit. Define MCA_STATUS[40] as "Scrub" and decode it in the AMD MCE decoding module for Family 17h and newer systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190212212417.107049-1-Yazen.Ghannam@amd.com
2019-02-04EDAC, mce_amd: Print ExtErrorCode and description on a single lineYazen Ghannam1-3/+2
Save a log line by printing the extended error code and the description on a single line. This is similar to how errors are printed in other subsystems, e.g. "#, description". If we don't have a valid description then only the number/code is printed. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190201225534.8177-6-Yazen.Ghannam@amd.com
2019-02-03EDAC, mce_amd: Match error descriptions to latest documentationYazen Ghannam1-83/+83
Update the error descriptions to match the latest documentation for easier searching. In some cases the changes are small and in other cases the changes may be total rewording of the description. No functional changes. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Link: https://lkml.kernel.org/r/20190201225534.8177-5-Yazen.Ghannam@amd.com
2019-02-03x86/MCE/AMD, EDAC/mce_amd: Add new error descriptions for some SMCA bank typesYazen Ghannam1-1/+5
Some SMCA bank types on future systems will report new error types even though the bank type is not treated as a new version. These new error types will reported by bits that are reserved in past systems. Add the new error descriptions to the lists in edac_mce_amd. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-4-Yazen.Ghannam@amd.com
2019-02-03x86/MCE/AMD, EDAC/mce_amd: Add new McaTypes for CS, PSP, and SMU unitsYazen Ghannam1-0/+55
The existing CS, PSP, and SMU SMCA bank types will see new versions (as indicated by their McaTypes) in future SMCA systems. Add the new (HWID, MCATYPE) tuples for these new versions. Reuse the same names as the older versions, since they are logically the same to the user. SMCA systems won't mix and match IP blocks with different McaType versions in the same system, so there isn't a need to distinguish them. The MCA_IPID register is saved when logging an MCA error, and that can be used to triage the error. Also, add the new error descriptions to edac_mce_amd. Some error types (positions in the list) are overloaded compared to the previous McaTypes. Therefore, just create new lists of the error descriptions to keep things simple even if some of the error descriptions are the same between versions. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-3-Yazen.Ghannam@amd.com
2019-02-03x86/MCE/AMD, EDAC/mce_amd: Add new MP5, NBIO, and PCIE SMCA bank typesYazen Ghannam1-0/+32
Add the (HWID, MCATYPE) tuples and names for the new MP5, NBIO, and PCIE SMCA bank types. Also, add their respective error descriptions to the MCE decoding module edac_mce_amd. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Shirish S <Shirish.S@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190201225534.8177-2-Yazen.Ghannam@amd.com
2018-09-27EDAC, amd64: Add Hygon Dhyana supportPu Wen1-1/+3
Add support for Hygon Dhyana CPU to EDAC. Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: mchehab@kernel.org Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: thomas.lendacky@amd.com Cc: linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/9d71061301177822bc55b3bfd44f91057458d886.1537533369.git.puwen@hygon.cn
2018-02-21x86/mce/AMD, EDAC/mce_amd: Enumerate Reserved SMCA bank typeYazen Ghannam1-4/+7
Currently, bank 4 is reserved on Fam17h, so we chose not to initialize bank 4 in the smca_banks array. This means that when we check if a bank is initialized, like during boot or resume, we will see that bank 4 is not initialized and try to initialize it. This will cause a call trace, when resuming from suspend, due to rdmsr_*on_cpu() calls in the init path. The rdmsr_*on_cpu() calls issue an IPI but we're running with interrupts disabled. This triggers: WARNING: CPU: 0 PID: 11523 at kernel/smp.c:291 smp_call_function_single+0xdc/0xe0 ... Reserved banks will be read-as-zero, so their MCA_IPID register will be zero. So, like the smca_banks array, the threshold_banks array will not have an entry for a reserved bank since all its MCA_MISC* registers will be zero. Enumerate a "Reserved" bank type that matches on a HWID_MCATYPE of 0,0. Use the "Reserved" type when checking if a bank is reserved. It's possible that other bank numbers may be reserved on future systems. Don't try to find the block address on reserved banks. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.14.x Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180221101900.10326-7-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-21EDAC, mce_amd: Get rid of local var in amd_filter_mce()Borislav Petkov1-3/+1
... and use the macro for that. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>
2017-08-21EDAC, mce_amd: Get rid of most struct cpuinfo_x86 usesBorislav Petkov1-10/+12
struct mce.cpuid contains CPUID(1).EAX which contains family, model and stepping and thus has enough information for our purposes. Thus get rid of some external dependencies which are not really needed. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>
2017-08-21EDAC, mce_amd: Rename decode_smca_errors() to decode_smca_error()Borislav Petkov1-2/+2
Singular fits better because it decodes a single error. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de>
2017-07-17EDAC, mce_amd: Use cpu_to_node() to find the node IDYazen Ghannam1-5/+1
Using the homegrown amd_get_nb_id() to find a node ID on AMD was fine while the L3 to node mapping was 1:1. And Zen topology broke this. So let's start slowly moving away from it and use the topology interfaces instead. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1490041614-90057-2-git-send-email-Yazen.Ghannam@amd.com [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2017-06-12EDAC, mce_amd: Fix typo in SMCA error descriptionYazen Ghannam1-1/+1
Fix typo in "poison consumption" error description. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1497286703-62853-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2017-02-20Merge branch 'ras-core-for-linus' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Ingo Molnar: "The main changes in this cycle were: - Assign notifier chain priorities for all RAS related handlers to make the ordering explicit (Borislav Petkov) - Improve the AMD MCA banks sysfs output (Yazen Ghannam) - Various cleanups and restructuring of the x86 RAS code (Borislav Petkov)" * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority x86/ras: Get rid of mce_process_work() EDAC/mce/amd: Dump TSC value EDAC/mce/amd: Unexport amd_decode_mce() x86/ras/amd/inj: Change dependency x86/ras: Flip the TSC-adding logic x86/ras/amd: Make sysfs names of banks more user-friendly x86/ras/therm_throt: Do not log a fake MCE for thermal events x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
2017-02-16EDAC, mce_amd: Print IPID and Syndrome on a separate lineYazen Ghannam1-5/+4
Currently, the IPID and Syndrome are printed on the same line as the Address. There are cases when we can have a valid Syndrome but not a valid Address. For example, the MCA_SYND register can be used to hold more detailed error info that the hardware folks can use. It's not just DRAM ECC syndromes. There are some error types that aren't related to memory that may have valid syndromes, like some errors related to links in the Data Fabric, etc. In these cases, the IPID and Syndrome are not printed at the same log level as the rest of the stanza, so users won't see them on the console. Console: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Dmesg: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b , Syndrome: 0x000000010b404000, IPID: 0x0001002e00000002 [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Print the IPID first and on a new line. The IPID should always be printed on SMCA systems. The Syndrome will then be printed with the IPID and at the same log level when valid: [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000010b404000 [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2 Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1487192182-2474-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2017-01-28EDAC, mce_amd: Give more context to deferred error messageYazen Ghannam1-1/+1
Users may not be familiar with the concept of deferred errors. There is no action for users to take on this type of error, so give more context in the error message to make this more clear. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1485297149-13733-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2017-01-24x86/ras, EDAC, acpi: Assign MCE notifier handlers a priorityBorislav Petkov1-0/+1
Assign all notifiers on the MCE decode chain a priority so that they get called in the correct order. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24EDAC/mce/amd: Dump TSC valueBorislav Petkov1-0/+3
Dump the TSC value of the time when the MCE got logged. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24EDAC/mce/amd: Unexport amd_decode_mce()Borislav Petkov1-2/+2
It is not used outside of the driver anymore. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-28EDAC, mce_amd: Don't report poison bit on Fam15h, bank 4Yazen Ghannam1-4/+7
MCA_STATUS[43] has been defined as "Poison" or "Reserved" for every bank since Fam15h except for Fam15h, bank 4 in which case it's defined as part of the McaStatSubCache bitfield. Filter out that case. Reported-by: Dean Liberty <Dean.Liberty@amd.com> Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1479478222-19896-1-git-send-email-Yazen.Ghannam@amd.com [ Split an almost unparseable ternary conditional, add a comment. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2016-11-23Merge tip:ras/core to pick up dependent changesBorislav Petkov1-6/+6
tip:ras/core contains the respective Fam17h x86 RAS bits which amd64_edac is going to use. So merge it into the EDAC branch. Signed-off-by: Borislav Petkov <bp@suse.de>
2016-11-21EDAC, mce_amd: Rename nb_bus_decoder to dram_ecc_decoderYazen Ghannam1-7/+14
nb_bus_decoder() is only used for DRAM ECC errors so rename it so that the name is more generic and descriptive. Also, call it for DRAM ECC errors on SMCA systems. [ Boris: rename it to real function name with a verb in it. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1479423463-8536-4-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2016-11-08x86/RAS: Hide SMCA bank namesBorislav Petkov1-1/+1
Add accessor functions and hide the smca_names array. Also, add a sanity-check to bank HWID assignment in get_smca_bank_info(). Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Rename smca_bank_names to smca_namesBorislav Petkov1-1/+1
Make it differ more from struct smca_bank_name for better readability. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-08x86/RAS: Simplify SMCA HWID descriptor structBorislav Petkov1-5/+5
Call it simply smca_hwid and call local variables "hwid". More readable. Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/MCE/AMD, EDAC: Handle reserved bank 4 on Fam17h properlyYazen Ghannam1-0/+3
Bank 4 is reserved on family 0x17 and shouldn't generate any MCE records. However, broken hardware and software is not something unheard of so warn about bank 4 errors. They shouldn't be coming from bank 4 naturally but users can still use mce_amd_inj to simulate errors from it for testing purposed. Also, avoid special handling in the injector mce_amd_inj like it is being done on the older families. [ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com> Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/mce, EDAC/mce_amd: Print MCA_SYND and MCA_IPID during MCE on SMCA systemsYazen Ghannam1-0/+2
The MCA_SYND and MCA_IPID registers contain valuable information and should be included in MCE output. The MCA_SYND register contains syndrome and other error information, and the MCA_IPID register will uniquely identify the MCA bank's type without having to rely on system software. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/mce/AMD, EDAC/mce_amd: Define and use tables for known SMCA IP typesYazen Ghannam1-158/+36
Scalable MCA defines a number of IP types. An MCA bank on an SMCA system is defined as one of these IP types. A bank's type is uniquely identified by the combination of the HWID and MCATYPE values read from its MCA_IPID register. Add the required tables in order to be able to lookup error descriptions based on a bank's type and the error's extended error code. [ bp: Align comments, simplify a bit. ] Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13EDAC/mce_amd: Use SMCA prefix for error descriptions arraysYazen Ghannam1-40/+40
The error descriptions defined for Fam17h can be reused for other SMCA systems, so their names should reflect this. Change f17h prefix to smca for error descriptions. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13EDAC/mce_amd: Add missing SMCA error descriptionsYazen Ghannam1-0/+7
Add missing SMCA error descriptions to the error descriptions arrays. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472673994-12235-3-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13EDAC/mce_amd: Print syndrome register value on SMCA systemsYazen Ghannam1-3/+11
Print SyndV bit status and print the raw value of the MCA_SYND register. Further decoding of the syndrome from struct mce.synd can be done in other places where appropriate, e.g. DRAM ECC. Boris: make the error stanza more compact by putting the error address and syndrome on the same line: [Hardware Error]: Corrected error, no action required. [Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117 [Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000 [Hardware Error]: Invalid IP block specified. [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-12EDAC, mce_amd: Detect SMCA using X86_FEATURE_SMCAYazen Ghannam1-6/+3
Use X86_FEATURE_SMCA when detecting if SMCA is available instead of directly using CPUID 0x80000007_EBX. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errorsAravind Gopalakrishnan1-3/+332
For Scalable MCA enabled processors, errors are listed per IP block. And since it is not required for an IP to map to a particular bank, we need to use HWID and McaType values from the MCx_IPID register to figure out which IP a given bank represents. We also have a new bit (TCC) in the MCx_STATUS register to indicate Task context is corrupt. Add logic here to decode errors from all known IP blocks for Fam17h Model 00-0fh and to print TCC errors. [ Minor fixups. ] Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-14EDAC, mce_amd: Don't emit 'CE' for Deferred errorAravind Gopalakrishnan1-1/+2
Currently, when decoding an MCE, we display 'CE' for a Deferred error, like this: [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00095080813 When the 'UC' bit in the MCx_STATUS register is clear, the error status is either a Corrected error or Deferred error as determined by the 'Deferred' bit. So do not print 'CE' on a deferred error. Refer to AMD Error Scope Hierarchy table in a newer BKDG (example: 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"). Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1436788382-6463-1-git-send-email-aravind.gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2014-11-25EDAC, MCE, AMD: Correct formatting of decoded textBorislav Petkov1-3/+3
Write out MCx_ADDR into the more humanly readable "MCx Error Address" and remove double colon in the output. Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2014-11-04EDAC, MCE, AMD: Add decoding table for MC6 xecAravind Gopalakrishnan1-30/+11
Extended error code meanings are tabulated for other banks. Extend that tradition for MC6 too. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1415122868-10969-1-git-send-email-aravind.gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2014-07-14EDAC, MCE, AMD: Add MCE decoding for F15h M60hAravind Gopalakrishnan1-4/+40
Add decoding logic for new Fam15h model 60h. Tested using mce_amd_inj module and works fine. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1405098795-4678-1-git-send-email-Aravind.Gopalakrishnan@amd.com [ Boris: simplify a bit. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2014-05-08EDAC, MCE, AMD: Remove leftover unused maskBorislav Petkov1-2/+0
295d8cda2689 ("EDAC, MCE, AMD: Drop local coreid reporting") removed the code snippet which used that mask but forgot to drop the mask itself. Do that now. Signed-off-by: Borislav Petkov <bp@suse.de>
2014-02-24MCE, AMD: Fix decoding module loading on unsupported hwBorislav Petkov1-32/+33
We want to still be able to issue some error information on systems for which there is no decoding support (think older distro kernels here, for example). Therefore, we allow module registration but skip the per-family bank-specific decoders and issue the general information only, i.e.: [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error. [ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f [ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out) with the hope that it still contains helpful useful bits. Suggested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Link: http://lkml.kernel.org/r/1392659391-2411-1-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2013-06-08EDAC, MCE, AMD: Add an MCE signature for new Fam15h modelsAravind Gopalakrishnan1-2/+3
Add a new error signature for Family 15h, models 30h-3fh. Patch has been tested on Fam15h using mce_amd_inj facility and has been verified to work correctly. Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> [ cleanup commit message and error string ] Signed-off-by: Borislav Petkov <bp@suse.de>
2013-01-23EDAC, MCE, AMD: Remove unneeded exportsBorislav Petkov1-11/+6
Initially, those strings describing different parts of an MCE message were shared with amd64_edac and were therefore exported to modules. However, all except pp_msgs are used only in one place right now so hide them and make them static. No functionality change. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Borislav Petkov <bp@alien8.de>
2013-01-23EDAC, MCE, AMD: Add MCE decoding support for Family 16hJacob Shin1-17/+78
Add MCE decoding logic for AMD Family 16h processors. Boris: - drop unneeded uu_msgs export - exit early in cat_mc1_mce and save us an indentation level Signed-off-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: Borislav Petkov <bp@alien8.de>
2013-01-23EDAC, MCE, AMD: Make MC2 decoding per-familyJacob Shin1-27/+29
Currently only AMD Family 15h processors have special handling for MC2 errors. Since upcoming Family 16h will also need unique handling, let's make MC2 handling part of amd_decoder_ops. Signed-off-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: Borislav Petkov <bp@alien8.de>
2012-11-28MCE, AMD: Dump error statusBorislav Petkov1-2/+20
Dump error status after decoding the error which describes the error disposition. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2012-11-28MCE, AMD: Report decoded error type firstBorislav Petkov1-25/+25
Instead of starting with the error details, report the decoded, readable error type first. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2012-11-28MCE, AMD: Dump CPU f/m/s triple with the errorBorislav Petkov1-4/+6
It is very useful to have the family/model/stepping with the reported error so dump it. This saves us asking the bug reporter about it. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2012-11-28MCE, AMD: Remove functional unit referencesBorislav Petkov1-94/+92
Having the functional unit names in each bank decode is only misleading as this code supports multiple families and there's no guarantee the mapping between FUs and MCE banks will stay the same. And also, knowing the functional unit name doesn't help much since you end up looking at the respective BKDG anyway. So drop all FU references and use the MC bank numbers instead. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>