kernel/linux.git/drivers/gpu/drm/xe/xe_hw_error.c, branch v7.1

drm/xe/hw_error: Use HW_ERR prefix in log

2026-06-10T16:33:25+00:00

Hardware errors should be logged with HW_ERR prefix. Make them consistent
with existing logs.

Fixes: 01aab7e1c9d4 ("drm/xe/xe_hw_error: Add support for PVC SoC errors")
Signed-off-by: Raag Jadav
Reviewed-by: Riana Tauro
Link: https://patch.msgid.link/20260602044919.702209-5-raag.jadav@intel.com
Signed-off-by: Matt Roper
(cherry picked from commit ad60a618c49fef07d1860bfb1091140d29f5eddb)
Signed-off-by: Matthew Brost
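To illustrate what the prefix looks like in a log line, here is a minimal userspace sketch, not the driver code. The HW_ERR string matches the kernel's definition in include/linux/printk.h; the helper `format_hw_error()` and its parameters are hypothetical names chosen for this example, and the message layout is modeled on the dmesg lines quoted in the CSC commit further down this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same text as the kernel's HW_ERR macro in include/linux/printk.h. */
#define HW_ERR "[Hardware Error]: "

/*
 * Hypothetical helper (not part of xe): format a tile-level error report
 * with the HW_ERR prefix. Returns what snprintf() returns, i.e. the number
 * of characters the full message needs, excluding the terminating NUL.
 */
static int format_hw_error(char *buf, size_t len, unsigned int tile,
                           const char *severity, unsigned long status)
{
    assert(buf != NULL && severity != NULL);

    /* Adjacent string literals are concatenated at compile time. */
    return snprintf(buf, len, HW_ERR "Tile%u reported %s error 0x%lx",
                    tile, severity, status);
}
```

Because the prefix is a string literal, it is glued onto the format string at compile time, so every call site that uses it produces the same searchable tag in dmesg.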

drm/xe/xe_hw_error: Add support for PVC SoC errors

2026-03-06T00:38:56+00:00

Report the SoC nonfatal/fatal hardware error and update the counters.

    $ sudo ynl --family drm_ras --do get-error-counter \
        --json '{"node-id":0, "error-id":2}'
    {'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}

Co-developed-by: Himal Prasad Ghimiray
Signed-off-by: Himal Prasad Ghimiray
Signed-off-by: Riana Tauro
Reviewed-by: Raag Jadav
Link: https://patch.msgid.link/20260304074412.464435-12-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi

drm/xe/xe_hw_error: Add support for Core-Compute errors

2026-03-06T00:38:56+00:00

PVC supports GT error reporting via vector registers along with the error
status register. Add support to report these errors and update the
respective counters. In case of a Subslice error reported by a vector
register, process the error status register for applicable bits.

The counter is embedded in the xe drm ras structure and is exposed to
userspace using the drm_ras generic netlink interface.

    $ sudo ynl --family drm_ras --do get-error-counter \
        --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}

Co-developed-by: Himal Prasad Ghimiray
Signed-off-by: Himal Prasad Ghimiray
Signed-off-by: Riana Tauro
Reviewed-by: Raag Jadav
Link: https://patch.msgid.link/20260304074412.464435-11-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi
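The two-level lookup described in the commit above (vector registers first, error status register only for the Subslice case) can be sketched roughly as follows. This is a standalone illustration under loud assumptions: the vector indices, the `SKETCH_STATUS_SUBSLICE_MASK` value, the function names, and the choice to count one error per set bit are all hypothetical and are not taken from the driver or from Bspec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector register layout, for illustration only. */
enum sketch_vector {
    SKETCH_VEC_SUBSLICE = 0,
    SKETCH_VEC_OTHER = 1,
    SKETCH_VEC_COUNT = 2,
};

/* Hypothetical mask of status bits that apply to Subslice errors. */
#define SKETCH_STATUS_SUBSLICE_MASK 0x0000000fu

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned int sketch_popcount32(uint32_t v)
{
    unsigned int count = 0;

    while (v) {
        v &= v - 1;
        count++;
    }
    return count;
}

/*
 * Walk the vector registers. For a non-zero Subslice vector, consult the
 * status register and count only the applicable bits; for any other
 * non-zero vector, count the bits of the vector itself.
 */
static unsigned int sketch_count_errors(const uint32_t vec[SKETCH_VEC_COUNT],
                                        uint32_t status)
{
    unsigned int errors = 0;
    size_t i;

    assert(vec != NULL);
    for (i = 0; i < SKETCH_VEC_COUNT; i++) {
        if (!vec[i])
            continue;
        if (i == SKETCH_VEC_SUBSLICE)
            errors += sketch_popcount32(status & SKETCH_STATUS_SUBSLICE_MASK);
        else
            errors += sketch_popcount32(vec[i]);
    }
    return errors;
}
```

In this sketch the returned value would be added to a per-error-class counter, which is the kind of value the `get-error-counter` query above reads back as `error-value`.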

drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling

2026-03-06T00:38:56+00:00

Initialize DRM RAS in hw error init. Map the UAPI error severities to the
hardware error severities and refactor the file.

Signed-off-by: Riana Tauro
Reviewed-by: Raag Jadav
Link: https://patch.msgid.link/20260304074412.464435-10-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi

drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

2025-08-26T14:11:34+00:00

Add a debugfs fault handler to trigger csc error handler that wedges the
device and enables runtime survivability mode.

v2: add debugfs only for bmg (Umesh)
v3: do not use csc_fault attribute if debugfs is not enabled
v4: rebase

Cc: Lucas De Marchi
Signed-off-by: Riana Tauro
Reviewed-by: Raag Jadav
Link: https://lore.kernel.org/r/20250826063419.3022216-11-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi

drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors

2025-08-26T14:11:34+00:00

Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.

The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash.

System admin/userspace is notified of the necessity of a firmware flash
with a combination of the vendor-specific drm device wedged uevent, dmesg
logs and runtime survivability sysfs. It is the responsibility of the
consumer to verify all the actions and then trigger a firmware flash using
tools like fwupd.

    $ udevadm monitor --property --kernel
    monitor will print the received events for:
    KERNEL - the kernel uevent

    KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
    ACTION=change
    DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
    SUBSYSTEM=drm
    WEDGED=vendor-specific
    DEVNAME=/dev/dri/card0
    DEVTYPE=drm_minor
    SEQNUM=5973
    MAJOR=226
    MINOR=0

Logs

    xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: Tile0 reported NONFATAL error 0x20000
    xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set
    xe 0000:03:00.0: Runtime Survivability mode enabled
    xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
    IOCTLs and executions are blocked. Only a rebind may clear the failure
    Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
    xe 0000:03:00.0: [drm] device wedged, needs recovery
    xe 0000:03:00.0: Firmware flash required, Please refer to the userspace documentation for more details!

Runtime survivability Sysfs:

    /sys/bus/pci/devices//survivability_mode

v2: use vendor recovery method with runtime survivability
    (Christian, Rodrigo, Raag)
v3: move declare wedged to runtime survivability mode (Rodrigo)
v4: update commit message

Signed-off-by: Riana Tauro
Reviewed-by: Umesh Nerlige Ramappa
Link: https://lore.kernel.org/r/20250826063419.3022216-10-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi
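A consumer of the uevent shown in the commit above has to pick `WEDGED=vendor-specific` out of the property block before deciding on a firmware flash. The following is a minimal sketch of that check in plain C, assuming the properties arrive as newline-separated `KEY=VALUE` text as printed by `udevadm monitor --property`. The function names `uevent_get()` and `needs_vendor_recovery()` are hypothetical; a real consumer would more likely use libudev than parse text by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up key in a newline-separated "KEY=VALUE" property block.
 * Copies the value into out and returns true when the key is present
 * and the value (plus terminating NUL) fits in outlen bytes.
 */
static bool uevent_get(const char *props, const char *key,
                       char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *line = props;

    assert(outlen > 0);
    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t llen = eol ? (size_t)(eol - line) : strlen(line);

        /* Match "key=" exactly, so DEVNAME does not match DEVPATH. */
        if (llen > klen && line[klen] == '=' &&
            strncmp(line, key, klen) == 0) {
            size_t vlen = llen - klen - 1;

            if (vlen >= outlen)
                return false;
            memcpy(out, line + klen + 1, vlen);
            out[vlen] = '\0';
            return true;
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}

/* True when a drm uevent carries WEDGED=vendor-specific. */
static bool needs_vendor_recovery(const char *props)
{
    char subsystem[16], wedged[32];

    return uevent_get(props, "SUBSYSTEM", subsystem, sizeof(subsystem)) &&
           strcmp(subsystem, "drm") == 0 &&
           uevent_get(props, "WEDGED", wedged, sizeof(wedged)) &&
           strcmp(wedged, "vendor-specific") == 0;
}
```

As the commit message notes, the uevent alone is not meant to trigger a flash: the consumer is expected to cross-check the dmesg logs and the survivability sysfs entry before acting.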

drm/xe: Add support to handle hardware errors

2025-08-26T14:11:34+00:00

Gfx device reports two classes of errors: uncorrectable and correctable.
Depending on the severity, uncorrectable errors are further classified as
Non-Fatal and Fatal.

Correctable and Non-Fatal errors: These errors are reported as MSI. Bits in
the Master Interrupt Register indicate the class of the error. The source
of the error is then read from the Device Error Source Register.

Fatal errors: These are reported as PCIe errors. When a PCIe error is
asserted, the OS will perform an SBR (Secondary Bus Reset), which causes the
driver to reload. The error registers are sticky and the values are
maintained through SBR.

Add basic support to handle these errors.

Bspec: 50875, 53073, 53074, 53075, 53076

v2: Format commit message (Umesh)
v3: fix documentation (Stuart)

Cc: Stuart Summers
Co-developed-by: Himal Prasad Ghimiray
Signed-off-by: Himal Prasad Ghimiray
Signed-off-by: Riana Tauro
Reviewed-by: Umesh Nerlige Ramappa
Link: https://lore.kernel.org/r/20250826063419.3022216-9-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi
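The classification in the commit above can be written down as a small lookup. This is a sketch only: the enum and function names are hypothetical and do not claim to match the identifiers used in xe_hw_error.c. It encodes just what the message states, namely that correctable and non-fatal errors arrive as MSI while fatal errors surface as PCIe errors.

```c
#include <assert.h>
#include <stdbool.h>

/* Error classes named in the commit message (hypothetical identifiers). */
enum sketch_severity {
    SKETCH_SEV_CORRECTABLE,
    SKETCH_SEV_NONFATAL,
    SKETCH_SEV_FATAL,
};

/* How each class reaches the driver, per the commit message. */
enum sketch_report_path {
    SKETCH_PATH_MSI,   /* correctable and non-fatal errors */
    SKETCH_PATH_PCIE,  /* fatal errors, followed by an SBR and reload */
};

/* Non-fatal and fatal are the two uncorrectable sub-classes. */
static bool sketch_is_uncorrectable(enum sketch_severity sev)
{
    return sev == SKETCH_SEV_NONFATAL || sev == SKETCH_SEV_FATAL;
}

static enum sketch_report_path sketch_report_path(enum sketch_severity sev)
{
    return sev == SKETCH_SEV_FATAL ? SKETCH_PATH_PCIE : SKETCH_PATH_MSI;
}

/* Human-readable label, in the style of the "NONFATAL" tag seen in dmesg. */
static const char *sketch_severity_str(enum sketch_severity sev)
{
    switch (sev) {
    case SKETCH_SEV_CORRECTABLE:
        return "CORRECTABLE";
    case SKETCH_SEV_NONFATAL:
        return "NONFATAL";
    case SKETCH_SEV_FATAL:
        return "FATAL";
    }
    assert(!"unknown severity");
    return "UNKNOWN";
}
```

Since the error registers are sticky across SBR, the fatal path in a real driver is handled after reload by reading the preserved register values rather than from an interrupt handler.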