path: root/drivers/edac
Age | Commit message | Author | Files | Lines
2020-08-04 | Merge tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras | Linus Torvalds | 7 | -141/+204
Pull EDAC updates from Tony Luck: "Boris is on vacation and asked me to send you the EDAC changes"
* tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC: Fix reference count leaks
  EDAC: Remove edac_get_dimm_by_index()
  EDAC/ghes: Scan the system once on driver init
  EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt
  EDAC/ghes: Setup DIMM label from DMI and use it in error reports
  EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations
  EDAC/mc: Call edac_inc_ue_error() before panic
  EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier
2020-08-04 | Merge tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+3
Pull x86 RAS updates from Ingo Molnar: "Boris is on vacation and he asked us to send you the pending RAS bits:
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()"
* tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce, EDAC/mce_amd: Print PPIN in machine check records
  x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
  x86/mce/inject: Fix a wrong assignment of i_mce.status
2020-06-23 | x86/mce, EDAC/mce_amd: Print PPIN in machine check records | Smita Koralahalli | 1 | -0/+3
Print the Protected Processor Identification Number (PPIN) on processors which support it. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
2020-06-22 | Merge branch 'edac-ghes' into edac-for-next | Borislav Petkov | 1 | -130/+193
2020-06-18 | EDAC/amd64: Read back the scrub rate PCI register on F15h | Borislav Petkov | 1 | -0/+2
Commit da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") added support for F15h, model 0x60 CPUs but, in doing so, failed to read back the SCRCTRL PCI config register on F15h CPUs which are *not* model 0x60. Add that read so that doing
  $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
can show the previously set DRAM scrub rate.
Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") Reported-by: Anders Andersson <pipatron@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> #v4.4.. Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
2020-06-17 | EDAC: Fix reference count leaks | Qiushi Wu | 2 | -1/+2
When kobject_init_and_add() returns an error, the error must be handled because kobject_init_and_add() takes a reference even when it fails. If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Therefore, replace the kfree() call with kobject_put() and add a missing kobject_put() in the edac_device_register_sysfs_main_kobj() error path. [ bp: Massage and merge into a single patch. ] Fixes: b2ed215a3338 ("Kobject: change drivers/edac to use kobject_init_and_add") Signed-off-by: Qiushi Wu <wu000273@umn.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu
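The ownership rule behind this fix can be modelled outside the kernel. The following is a minimal Python sketch, not kernel code: `Kobject`, `init_and_add()`, `put()` and `register()` are hypothetical stand-ins for a refcounted object whose release step frees what it owns. It assumes, as the commit message states, that a reference is held even when the init call fails.

```python
class Kobject:
    """Toy model of a refcounted object (hypothetical, for illustration only)."""

    def __init__(self):
        self.refcount = 0
        self.name = None          # resource owned by the object
        self.released = False

    def init_and_add(self, name, fail=False):
        # As described above: a reference is taken and resources are
        # attached even if the call then reports failure.
        self.refcount = 1
        self.name = name
        return -1 if fail else 0

    def put(self):
        # Drop one reference; the release step runs when the count hits zero.
        self.refcount -= 1
        if self.refcount == 0:
            self.name = None
            self.released = True


def register(name, fail=False):
    """Return (kobj, ok). On failure the reference is dropped via put()."""
    kobj = Kobject()
    if kobj.init_and_add(name, fail=fail) != 0:
        kobj.put()                # error path: drop the reference, not a bare free
        return kobj, False
    return kobj, True
```

In this model, skipping `put()` on the error path would leave `refcount` at 1 and `name` still attached, which is the leak the patch addresses.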
2020-06-16 | EDAC/ghes: Scan the system once on driver init | Borislav Petkov | 1 | -113/+166
Change the hardware scanning and figuring out how many DIMMs a machine has to a single, one-time thing which happens once on driver init. After that scanning completes, struct ghes_hw_desc contains a representation of the hardware which the driver can then use for later initialization. Then, copy the DIMM information into the respective EDAC core representation of those. Get rid of ghes_edac_dimm_fill and use a struct dimm_info array directly. This way, hw detection and further driver initialization are nicely and logically split. Further additions should all be added to ghes_scan_system() and the hw representation extended as needed. There should be no functionality change resulting from this patch. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-06-16 | EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt | Robert Richter | 1 | -8/+5
The struct members list and ghes of struct ghes_edac_pvt are unused, remove them. On that occasion, rename it to the shorter name struct ghes_pvt. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com
2020-06-16 | EDAC/ghes: Setup DIMM label from DMI and use it in error reports | Robert Richter | 1 | -11/+24
The ghes driver reports errors with 'unknown label' even if the actual DIMM label is known, e.g.:
  EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory)
Fix this by using struct dimm_info's label string in error reports:
  EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory)
The labels are initialized by reading the bank and device strings from DMI. Now, the label information can also be read from sysfs. E.g. a ThunderX2 system will show the following:
  /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
  /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
  /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
  /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
  /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
  /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
  /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
  /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
  /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
  /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
  /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
  /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
  /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
  /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
  /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
  /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0
Since dimm_labels can be rewritten, that label will be used in a later error report:
  # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
  # # some error injection here
  # dmesg | grep foobar
  [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory)
[ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]
Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
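The sysfs layout shown in this commit message can be read programmatically. The helper below is a small userspace sketch; the function name `read_dimm_labels` and the `edac_root` parameter are made up for illustration, and the default path is simply the one quoted in the commit message.

```python
from pathlib import Path


def read_dimm_labels(edac_root="/sys/devices/system/edac/mc"):
    """Return a dict such as {"mc0/dimm0": "N0 DIMM_A0"} for each dimm_label file."""
    labels = {}
    for path in sorted(Path(edac_root).glob("mc*/dimm*/dimm_label")):
        # Key is "<controller>/<dimm>", taken from the two parent directories.
        key = f"{path.parent.parent.name}/{path.parent.name}"
        labels[key] = path.read_text().strip()
    return labels
```

Calling `read_dimm_labels()` on a machine without an EDAC driver loaded is expected to return an empty dict, since no matching files exist.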
2020-06-16 | EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations | Qiuxu Zhuo | 2 | -8/+6
Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU stepping specific configurations to {skx,i10nm}_init(), so that the CPU stepping check can be deleted from i10nm_init(). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com
2020-06-15 | EDAC/mc: Call edac_inc_ue_error() before panic | Zhenzhong Duan | 1 | -2/+2
By calling edac_inc_ue_error() before panic, we get a correct UE error count for core dump analysis. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200610065846.3626-2-zhenzhong.duan@gmail.com
2020-06-15 | EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier | Zhenzhong Duan | 1 | -0/+1
Avoid giving it MCE_PRIO_LOWEST priority by default. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200610065846.3626-1-zhenzhong.duan@gmail.com
2020-06-13 | Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild | Linus Torvalds | 1 | -1/+1
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- convert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  treewide: replace '---help---' in Kconfig files with 'help'
  kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
  samples: binderfs: really compile this sample and fix build issues
2020-06-13 | treewide: replace '---help---' in Kconfig files with 'help' | Masahiro Yamada | 1 | -1/+1
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the following command:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
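The effect of the sed expression can be reproduced on a string for experimentation. This is a rough Python equivalent, not the command the author ran: `convert_help` is a hypothetical name, and the pattern only matches spaces and tabs, whereas sed's `[[:space:]]` also covers other whitespace characters.

```python
import re

# Leading spaces/tabs followed by '---help---' at the start of any line.
HELP_RE = re.compile(r"^[ \t]*---help---", flags=re.MULTILINE)


def convert_help(text):
    """Replace leading whitespace + '---help---' with one tab + 'help'."""
    return HELP_RE.sub("\thelp", text)
```

Using `[ \t]*` rather than `\s*` matters in multi-line mode: `\s` would also match newlines and could swallow blank lines that precede a `---help---` line.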
2020-06-11 | Merge branch 'x86/entry' into ras/core | Thomas Gleixner | 9 | -54/+73
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow up patches can be applied without creating a horrible merge conflict afterwards.
2020-06-01 | Merge branches 'edac-i10nm' and 'edac-misc' into edac-updates-for-5.8 | Borislav Petkov | 9 | -54/+73
Signed-off-by: Borislav Petkov <bp@suse.de>
2020-05-29 | EDAC/amd64: Remove redundant assignment to variable ret in hw_info_get() | Colin Ian King | 1 | -1/+1
The variable ret is being assigned with a value that is never read and it is being updated later with a new value. The initialization is redundant so remove it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
2020-05-22 | EDAC/amd64: Add AMD family 17h model 60h PCI IDs | Alexander Monakov | 2 | -0/+17
Add support for AMD Renoir (4000-series Ryzen CPUs). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru
2020-05-20 | EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable | Qiuxu Zhuo | 4 | -17/+13
The skx_edac driver wrongly uses the mtr register to retrieve two fields close_pg and bank_xor_enable. Fix it by using the correct mcmtr register to get the two fields. Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Matthew Riley <mattdr@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
2020-04-27 | EDAC/i10nm: Update driver to support different bus number config register offsets | Qiuxu Zhuo | 1 | -4/+14
The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >=4, the offset for bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the steppings use the updated 0xd0 offset. Fix the issue by using the appropriate offset for bus number configuration register according to the CPU model number and stepping. Reported-by: Jerry Chen <jerry.t.chen@intel.com> Reported-and-tested-by: Jin Wen <wen.jin@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
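The selection rule described in this commit message can be written down as a small lookup. The sketch below is only a restatement of the prose: the model strings and the function name `busno_cfg_offset` are hypothetical, and the real driver keys on CPU model numbers rather than names.

```python
OLD_BUSNO_CFG_OFFSET = 0xCC   # offset used by earlier steppings
NEW_BUSNO_CFG_OFFSET = 0xD0   # updated offset


def busno_cfg_offset(model, stepping):
    """Pick the bus number config register offset for a model/stepping pair.

    `model` is one of the illustrative strings "icelake", "tremont"
    (Tremont/Jacobsville) or "icelake-d".
    """
    if model == "icelake-d":
        return NEW_BUSNO_CFG_OFFSET          # all steppings use 0xd0
    if model in ("icelake", "tremont") and stepping >= 4:
        return NEW_BUSNO_CFG_OFFSET          # stepping >= 4 moved to 0xd0
    return OLD_BUSNO_CFG_OFFSET
```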
2020-04-27 | EDAC, {skx,i10nm}: Make some configurations CPU model specific | Qiuxu Zhuo | 4 | -14/+38
The device ID for configuration agent PCI device and the offset for bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
2020-04-24 | EDAC/amd8131: Remove defined but not used bridge_str | Jason Yan | 1 | -8/+0
Fix the following gcc warning: drivers/edac/amd8131_edac.c:47:21: warning: ‘bridge_str’ defined but not used [-Wunused-const-variable=] static char * const bridge_str[] = { ^~~~~~~~~~ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Robert Richter <rrichter@marvell.com> Link: https://lkml.kernel.org/r/20200415085006.6732-1-yanaijie@huawei.com
2020-04-23 | EDAC/thunderx: Make symbols static | Zou Wei | 1 | -4/+4
Make a couple of symbols static, as reported by sparse. [ bp: Massage. ] Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1587624744-97240-1-git-send-email-zou_wei@huawei.com
2020-04-14 | EDAC: Drop the EDAC report status checks | Tony Luck | 4 | -71/+0
When acpi_extlog was added, we were worried that the same error would be reported more than once by different subsystems. But in the ensuing years I've seen complaints that people could not find an error log (because this mechanism suppressed the log they were looking for). Rip it all out. People are smart enough to notice the same address from different reporting mechanisms. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
2020-04-14 | x86/mce: Fix all mce notifiers to update the mce->kflags bitmask | Tony Luck | 5 | -6/+19
If the handler took any action to log or deal with the error, set a bit in mce->kflags so that the default handler on the end of the machine check chain can see what has been done. Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers skip over errors already processed by CEC. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
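The notifier scheme described here can be pictured with a toy chain. Everything in this sketch is an assumption made for illustration: the flag names and values, the dict standing in for a machine check record, and the rule that the CEC stand-in only claims correctable errors are not taken from the kernel sources.

```python
# Illustrative bit values; the commit message does not list the real flags.
HANDLED_CEC = 1 << 0
HANDLED_EDAC = 1 << 1
HANDLED_MCELOG = 1 << 2


def cec_handler(mce):
    if mce["correctable"]:                 # assumed policy for this sketch
        mce["kflags"] |= HANDLED_CEC


def edac_handler(mce):
    if mce["kflags"] & HANDLED_CEC:
        return                             # already processed by CEC: skip
    mce["kflags"] |= HANDLED_EDAC


def mcelog_handler(mce):
    if mce["kflags"] & HANDLED_CEC:
        return                             # already processed by CEC: skip
    mce["kflags"] |= HANDLED_MCELOG


def run_chain(mce, handlers):
    """Run every handler (none stops the chain), then the default handler.

    Returns True if the default handler still has to report the error,
    i.e. no earlier handler recorded any action in kflags.
    """
    for handler in handlers:
        handler(mce)
    return mce["kflags"] == 0
```

The point of the bitmask over a stop-the-chain return value is that later handlers still run and can decide for themselves, based on what has already been done.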
2020-04-14 | x86/mce/amd, edac: Remove report_gart_errors | Borislav Petkov | 3 | -34/+0
... because no one should be interested in spurious MCEs anyway. Make the filtering unconditional and move it to amd_filter_mce(). Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
2020-04-14 | EDAC/xgene: Remove set but not used address local var | Jason Yan | 1 | -3/+0
Fix the following gcc warning: drivers/edac/xgene_edac.c:1486:7: warning: variable ‘address’ set but not used [-Wunused-but-set-variable] u32 address; ^~~~~~~ Remove the unused macro RBERRADDR_RD while at it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200409093259.20069-1-yanaijie@huawei.com
2020-04-14 | EDAC/armada_xp: Fix some log messages | Christophe JAILLET | 1 | -7/+7
Fix spelling (s/Aramda/Armada/) in a log message and in a comment. While at it, add a trailing '\n' in messages. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jan Luebbe <jlu@pengutronix.de> Link: https://lkml.kernel.org/r/20200413041556.3514-1-christophe.jaillet@wanadoo.fr
2020-03-31 | Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 5 | -21/+21
Pull perf updates from Ingo Molnar: "The main changes in this cycle were:
Kernel side changes:
- A couple of x86/cpu cleanups and changes were grandfathered in due to patch dependencies. These clean up the set of CPU model/family matching macros with a consistent namespace and C99 initializer style.
- A bunch of updates to various low level PMU drivers:
  * AMD Family 19h L3 uncore PMU
  * Intel Tiger Lake uncore support
  * misc fixes to LBR TOS sampling
- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes
Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
  cpufreq/intel_pstate: Fix wrong macro conversion
  x86/cpu: Cleanup the now unused CPU match macros
  hwrng: via_rng: Convert to new X86 CPU match macros
  crypto: Convert to new CPU match macros
  ASoC: Intel: Convert to new X86 CPU match macros
  powercap/intel_rapl: Convert to new X86 CPU match macros
  PCI: intel-mid: Convert to new X86 CPU match macros
  mmc: sdhci-acpi: Convert to new X86 CPU match macros
  intel_idle: Convert to new X86 CPU match macros
  extcon: axp288: Convert to new X86 CPU match macros
  thermal: Convert to new X86 CPU match macros
  hwmon: Convert to new X86 CPU match macros
  platform/x86: Convert to new CPU match macros
  EDAC: Convert to new X86 CPU match macros
  cpufreq: Convert to new X86 CPU match macros
  ACPI: Convert to new X86 CPU match macros
  x86/platform: Convert to new CPU match macros
  x86/kernel: Convert to new CPU match macros
  x86/kvm: Convert to new CPU match macros
  x86/perf/events: Convert to new CPU match macros
  ...
2020-03-30 | Merge branches 'edac-mc-cleanup', 'edac-misc', 'edac-drivers' and 'edac-urgent' into edac-updates-for-5.7 | Borislav Petkov | 11 | -374/+984
Signed-off-by: Borislav Petkov <bp@suse.de>
2020-03-25 | Merge branch 'x86/cpu' into perf/core, to resolve conflict | Ingo Molnar | 5 | -21/+21
Conflicts: arch/x86/events/intel/uncore.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-24 | EDAC: Convert to new X86 CPU match macros | Thomas Gleixner | 5 | -21/+21
The new macro set has a consistent namespace and uses C99 initializers instead of the grufty C89 ones. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
2020-03-17 | EDAC/armada_xp: Use scnprintf() for avoiding potential buffer overflow | Takashi Iwai | 1 | -13/+13
Since snprintf() returns the would-be-output size instead of the actual output size, the succeeding calls may go beyond the given buffer limit. Fix it by replacing with scnprintf(). Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jan Luebbe <jlu@pengutronix.de> Link: https://lkml.kernel.org/r/20200311071728.4541-1-tiwai@suse.de
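The difference between the two return values is easy to see with a small model. The Python functions below are not the C library or kernel implementations; they only mimic the documented return values (would-be length for snprintf(), characters actually stored for scnprintf()), and `final_offset` imitates the common `len += fn(buf + len, size - len, ...)` accumulation pattern. All names are made up for this sketch.

```python
def snprintf_ret(size, text):
    """Model of snprintf()'s return value: the length the output would have had."""
    return len(text)


def scnprintf_ret(size, text):
    """Model of scnprintf()'s return value: characters stored, excluding the NUL."""
    if size <= 0:
        return 0
    return min(len(text), size - 1)


def final_offset(ret_fn, bufsize, parts):
    """Accumulate the write offset the way chained formatting calls do."""
    offset = 0
    for part in parts:
        offset += ret_fn(bufsize - offset, part)
    return offset
```

With a 16-byte buffer and three parts totalling 35 characters, the snprintf-style offset ends up past the end of the buffer, while the scnprintf-style offset stops at 15, the last position that still leaves room for the terminator.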
2020-03-17 | EDAC/synopsys: Do not dump uninitialized pinf->col | Sherry Sun | 1 | -11/+11
On the ZynqMP platform, zynqmp_get_error_info() is used to read out error information. In this function, the pinf->col parameter is not used (it is only used by the Zynq platform's zynq_get_error_info()). So there's no need to print pinf->col on ZynqMP. In order to differentiate on which platform handle_error() is executed, use DDR_ECC_INTR_SUPPORT as the check condition to distinguish between Zynq and ZynqMP platforms. [ bp: Massage. ] Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller") Signed-off-by: Sherry Sun <sherry.sun@nxp.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Manish Narani <manish.narani@xilinx.com> Link: https://lkml.kernel.org/r/1584365679-27443-1-git-send-email-sherry.sun@nxp.com
2020-02-27 | EDAC/synopsys: Do not print an error with back-to-back snprintf() calls | Sherry Sun | 1 | -15/+7
handle_error() currently calls snprintf() a couple of times in succession to output the message for a CE/UE, therefore overwriting each part of the message which was formatted with the previous snprintf() call. As a result, only the part of the message from the last snprintf() call will be printed. The simplest and most effective way to fix this problem is to combine the whole string into one and supply it to a single snprintf() call. [ bp: Massage. ] Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller") Signed-off-by: Sherry Sun <sherry.sun@nxp.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Cc: Manish Narani <manish.narani@xilinx.com> Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com
2020-02-19 | EDAC: Add EDAC driver for DMC520 | Lei Wang | 3 | -0/+664
The driver supports error detection and correction on devices with an ARM DMC-520 memory controller. Signed-off-by: Lei Wang <leiwang_git@outlook.com> Signed-off-by: Shiping Ji <shiping.linux@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Link: https://lkml.kernel.org/r/83b48c70-dc06-d0d4-cae9-a2187fca628b@gmail.com
2020-02-18 | EDAC/mce_amd: Print !SMCA processor warning only once | Prarit Bhargava | 1 | -1/+1
This warning is output for every virtual CPU in a guest on an EPYC 2 system because kvm doesn't enable SMCA. Once is enough too. [ bp: Massage. ] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200217134627.19765-1-prarit@redhat.com
2020-02-17 | EDAC/mc: Remove per layer counters | Robert Richter | 2 | -63/+24
Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it turns out that only the leaves in the memory hierarchy are consumed (in sysfs), but not the intermediate layers, e.g.: count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx]; These unused counters only add complexity, remove them. The error counter values are directly stored in struct dimm_info now. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-11-rrichter@marvell.com
2020-02-17 | EDAC/mc: Remove detail[] string and cleanup error string generation | Robert Richter | 1 | -58/+32
The error descriptor is passed to the error reporting functions, so the error details can be directly generated there. Move string generation from edac_raw_mc_handle_error() to edac_ce_error() and edac_ue_error(). The intermediate detail[] string can be removed then. Also, cleanup the string generation by switching to a single variant only using the ternary operator. [ bp: put ternary operators on a separate line for better readability and use the short-form "inline if" in edac_mc_handle_error(). ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-10-rrichter@marvell.com
2020-02-17 | EDAC/mc: Pass the error descriptor to error reporting functions | Robert Richter | 1 | -58/+42
Most arguments of error reporting functions are already stored in the struct edac_raw_error_desc error descriptor. Pass the error descriptor to the functions and reduce the functions' argument list. [ bp: Sort function args in reverse fir tree order. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-9-rrichter@marvell.com
2020-02-17 | EDAC/mc: Remove enable_per_layer_report function argument | Robert Richter | 2 | -26/+19
Many functions carry the enable_per_layer_report argument. This is a bool value indicating the error information contains some location data where the error occurred. This can easily be determined by checking the pos[] array for values. Negative values indicate there is no location available. So if the top layer is negative, the error location is unknown. Just check if the top layer is negative and remove enable_per_layer_report as function argument and also from struct edac_raw_error_desc. [ bp: Reflow comments to 80 columns, while at it. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-8-rrichter@marvell.com
2020-02-17 | EDAC/mc: Report "unknown memory" on too many DIMM labels found | Robert Richter | 1 | -10/+11
There is a limitation to report only EDAC_MAX_LABELS in e->label of the error descriptor. This is to prevent a potential string overflow. The current implementation falls back to "any memory" in this case and also stops all further processing to find a unique row and channel of the possible error location. Reporting "any memory" is wrong as the memory controller reported an error location for one of the layers. Instead, report "unknown memory" and also do not break early in the loop to further check row and channel for uniqueness. [ bp: Massage commit message. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-7-rrichter@marvell.com
2020-02-17 | EDAC/mc: Carve out error increment into a separate function | Robert Richter | 1 | -15/+25
Carve out the error_count increment into a separate function edac_inc_csrow(). This better separates code and reduces the indentation level. Implementation note: The function edac_inc_csrow() counts the same as before, ->ce_count is only incremented if row >= 0. This is esp. true for the case of (!e->enable_per_layer_report). Here, a DIMM was not found, variable row still has a value of -1 and ->ce_count is not incremented. [ bp: Massage commit message. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200214141757.8976-1-rrichter@marvell.com
2020-02-17 | EDAC/mc: Determine mci pointer from the error descriptor | Robert Richter | 3 | -7/+10
Each struct mci has its own error descriptor. Create a function error_desc_to_mci() to determine the corresponding mci from an error descriptor. This removes @mci from the parameter list of edac_raw_mc_handle_error() as the mci pointer does not need to be passed any longer. [ bp: Massage commit message. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-5-rrichter@marvell.com
2020-02-17 | EDAC: Store error type in struct edac_raw_error_desc | Robert Richter | 3 | -14/+11
Store the error type in struct edac_raw_error_desc. This makes the type parameter of edac_raw_mc_handle_error() obsolete. [ kernel-doc typo ] Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-4-rrichter@marvell.com
2020-02-17 | EDAC/mc: Reorder functions edac_mc_alloc*() | Robert Richter | 1 | -106/+103
Reorder the newly created functions edac_mc_alloc_csrows() and edac_mc_alloc_dimms() and move them before edac_mc_alloc(). No further code changes. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-3-rrichter@marvell.com
2020-02-17 | EDAC/mc: Split edac_mc_alloc() into smaller functions | Robert Richter | 1 | -35/+70
edac_mc_alloc() is huge. Factor out code by moving it to the two new functions edac_mc_alloc_csrows() and edac_mc_alloc_dimms(). Do not move code yet for better review. [ bp: sort local args in reversed fir tree order. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-2-rrichter@marvell.com
2020-02-17 | EDAC/mc: Change mci device removal to use put_device() | Robert Richter | 3 | -56/+47
There are dimm and csrow devices linked to the mci device, esp. to show up in sysfs. It must be guaranteed that child devices are removed before their mci parent. Thus, the release functions must be called in the correct order and may not miss any child before releasing its parent. In the current implementation this is only guaranteed by the correct order of release functions. A much better approach is to use put_device(), which releases the device only after all users are gone. It is the recommended way to release a device and free its memory. The function uses the device's refcount and only frees it if there are no users of it anymore such as children. So implement a mci_release() function to remove mci devices, use put_device() to free them and early initialize the mci device right after its struct has been allocated. Change the release function so that it can be universally used no matter if the device is registered or not. Since subsequent dimm and csrow sysfs links are implemented as children devices, their refcounts will keep the parent mci device from being removed as long as sysfs entries exist and until all users have been unregistered in edac_remove_sysfs_mci_device(). Remove edac_unregister_sysfs() and merge mci sysfs removal into edac_remove_sysfs_mci_device(). There is only a single instance now that removes the sysfs entries. The function can now be used in the error paths for cleanup. Also, create device release functions for all involved devices (dev->release), remove device_type release functions (dev_type->release) and also use dev->init_name instead of dev_set_name(). [ bp: Massage commit message and comments. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200212120340.4764-5-rrichter@marvell.com
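The ordering guarantee that refcounting provides can be shown with a toy parent/child model. This is a sketch of the idea only: the `Device` class and its `get()`/`put()` methods are hypothetical and merely echo the roles of get_device()/put_device(); the assumption that each child holds exactly one reference on its parent is made for the example.

```python
class Device:
    """Toy refcounted device; each child holds one reference on its parent."""

    def __init__(self, name, parent=None, log=None):
        self.name = name
        self.parent = parent
        self.refcount = 1                     # reference held by the creator
        self.log = log if log is not None else []
        if parent is not None:
            parent.get()                      # child pins its parent

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.log.append(self.name)        # release callback runs here
            if self.parent is not None:
                self.parent.put()             # drop the child's pin on the parent
```

Even if the creator drops its reference on the parent first, the parent's release is deferred until the last child is gone, so the release order no longer depends on the order of the calls.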
2020-02-13 | EDAC/sysfs: Remove csrow objects on errors | Robert Richter | 1 | -2/+1
All created csrow objects must be removed in the error path of edac_create_csrow_objects(). The objects have been added as devices. They need to be removed by doing a device_del() *and* put_device() call to also free their memory. The missing put_device() leaves a memory leak. Use device_unregister() instead of device_del() which properly unregisters the device doing both. Fixes: 7adc05d2dc3a ("EDAC/sysfs: Drop device references properly") Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200212120340.4764-4-rrichter@marvell.com
2020-02-13 | EDAC/mc: Fix use-after-free and memleaks during device removal | Robert Richter | 2 | -21/+6
A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and DEBUG_KMEMLEAK set revealed several issues when removing an mci device:
1) Use-after-free:
On 27.11.19 17:07:33, John Garry wrote:
> [ 22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180
The use-after-free is caused by the mci_for_each_dimm() macro called in edac_remove_sysfs_mci_device(). The iterator was introduced with c498afaf7df8 ("EDAC: Introduce an mci_for_each_dimm() iterator"). The iterator loop calls device_unregister(&dimm->dev), which removes the sysfs entry of the device, but also frees the dimm struct in dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(), the dimm struct is accessed again, after having been freed already. The fix is to free all the mci device's subsequent dimm and csrow objects at a later point, in _edac_mc_free(), when the mci device itself is being freed. This keeps the data structures intact and the mci device can be fully used until its removal. The change allows the safe usage of mci_for_each_dimm() to release dimm devices from sysfs.
2) Memory leaks:
The following memory leaks have been detected:
  # grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
  1 [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0 # mci->csrows
  16 [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0 # csr->channels
  16 [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0 # csr->channels[chn]
  1 [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0 # mci->dimms
  34 [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()
All leaks are from memory allocated by edac_mc_alloc(). Note: The test above shows that edac_mc_alloc() was called here from ghes_edac_register(), thus both functions show up in the stack trace but the module causing the leaks is edac_mc. The comments with the data structures involved were made manually by analyzing the objdump.
The data structures listed above and created by edac_mc_alloc() are not properly removed during device removal, which is done in edac_mc_free(). There are two paths implemented to remove the device depending on device registration, _edac_mc_free() is called if the device is not registered and edac_unregister_sysfs() otherwise. The implementations differ. For the sysfs case, the mci device removal lacks the removal of subsequent data structures (csrows, channels, dimms). This causes the memory leaks (see mci_attr_release()).
[ bp: Massage commit message. ]
Fixes: c498afaf7df8 ("EDAC: Introduce an mci_for_each_dimm() iterator") Fixes: faa2ad09c01c ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.") Fixes: 7a623c039075 ("edac: rewrite the sysfs code to use struct device") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200212120340.4764-3-rrichter@marvell.com
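The use-after-free pattern from point 1) can be reproduced in miniature. The sketch below is a loose Python model, not the kernel macro: it assumes an iterator whose increment step reads the current element's index after the loop body has run, and it raises an exception where real code would touch freed memory. All class and function names are hypothetical.

```python
class Dimm:
    def __init__(self, idx):
        self._idx = idx
        self.freed = False
        self.in_sysfs = True

    @property
    def idx(self):
        if self.freed:
            raise RuntimeError("use-after-free: dimm accessed after release")
        return self._idx


def for_each_dimm(dimms, body):
    """Loop whose increment step reads the current dimm's idx after the body."""
    dimm = dimms[0] if dimms else None
    while dimm is not None:
        body(dimm)
        nxt = dimm.idx + 1                    # touches dimm again after body ran
        dimm = dimms[nxt] if nxt < len(dimms) else None


def unregister_and_free(dimm):
    """Old behaviour in this model: unregistering also frees the struct."""
    dimm.in_sysfs = False
    dimm.freed = True


def unregister_only(dimm):
    """Fixed behaviour in this model: only the sysfs entry goes away."""
    dimm.in_sysfs = False


def free_all(dimms):
    """Stands in for freeing the dimms later, together with the mci."""
    for dimm in dimms:
        dimm.freed = True
```

Running `for_each_dimm(dimms, unregister_and_free)` fails on the first increment, while `for_each_dimm(dimms, unregister_only)` followed by `free_all(dimms)` completes, which mirrors the deferral of the free that the fix describes.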