<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/gpu/drm/xe/xe_hw_error.c, branch v7.1</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-06-10T16:33:25+00:00</updated>
<entry>
<title>drm/xe/hw_error: Use HW_ERR prefix in log</title>
<updated>2026-06-10T16:33:25+00:00</updated>
<author>
<name>Raag Jadav</name>
<email>raag.jadav@intel.com</email>
</author>
<published>2026-06-02T04:48:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=381b3576a87f4ed6e76adb78d7d9400428f8f4b7'/>
<id>urn:sha1:381b3576a87f4ed6e76adb78d7d9400428f8f4b7</id>
<content type='text'>
Hardware errors should be logged with HW_ERR prefix. Make them
consistent with existing logs.

Fixes: 01aab7e1c9d4 ("drm/xe/xe_hw_error: Add support for PVC SoC errors")
Signed-off-by: Raag Jadav &lt;raag.jadav@intel.com&gt;
Reviewed-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Link: https://patch.msgid.link/20260602044919.702209-5-raag.jadav@intel.com
Signed-off-by: Matt Roper &lt;matthew.d.roper@intel.com&gt;
(cherry picked from commit ad60a618c49fef07d1860bfb1091140d29f5eddb)
Signed-off-by: Matthew Brost &lt;matthew.brost@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe/xe_hw_error: Add support for PVC SoC errors</title>
<updated>2026-03-06T00:38:56+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2026-03-04T07:44:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=01aab7e1c9d450a5bc7f09e6132b8dbbacf87586'/>
<id>urn:sha1:01aab7e1c9d450a5bc7f09e6132b8dbbacf87586</id>
<content type='text'>
Report the SoC nonfatal/fatal hardware error and update the counters.

$ sudo ynl --family drm_ras --do get-error-counter \
--json '{"node-id":0, "error-id":2}'
{'error-id': 2, 'error-name': 'soc-internal', 'error-value': 0}

Co-developed-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Raag Jadav &lt;raag.jadav@intel.com&gt;
Link: https://patch.msgid.link/20260304074412.464435-12-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe/xe_hw_error: Add support for Core-Compute errors</title>
<updated>2026-03-06T00:38:56+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2026-03-04T07:44:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9b2c89ddd961c188ddd8252a1b01d62ab50f6b95'/>
<id>urn:sha1:9b2c89ddd961c188ddd8252a1b01d62ab50f6b95</id>
<content type='text'>
PVC supports GT error reporting via vector registers along with the
error status register. Add support to report these errors and
update the respective counters. In case of a Subslice error reported
by a vector register, process the error status register
for applicable bits.

The counter is embedded in the xe drm ras structure and is
exposed to the userspace using the drm_ras generic netlink
interface.

$ sudo ynl --family drm_ras --do get-error-counter \
--json '{"node-id":0, "error-id":1}'
{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}
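
A minimal sketch of how a script might consume the counter line printed
above. It assumes the tool prints a Python dict literal exactly as in the
sample; parse_counter is a hypothetical helper, not part of ynl or the driver.

```python
import ast

# Sample line copied from the ynl output shown above.
line = "{'error-id': 1, 'error-name': 'core-compute', 'error-value': 0}"


def parse_counter(text):
    # Hypothetical helper: literal_eval parses Python literals
    # without executing arbitrary code.
    record = ast.literal_eval(text.strip())
    return record["error-name"], record["error-value"]


name, value = parse_counter(line)
print(name, value)
```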

Co-developed-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Raag Jadav &lt;raag.jadav@intel.com&gt;
Link: https://patch.msgid.link/20260304074412.464435-11-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe/xe_hw_error: Integrate DRM RAS with hardware error handling</title>
<updated>2026-03-06T00:38:56+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2026-03-04T07:44:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=63f687362fbabe13006713208bf4131d897b821e'/>
<id>urn:sha1:63f687362fbabe13006713208bf4131d897b821e</id>
<content type='text'>
Initialize DRM RAS in hw error init. Map the UAPI error severities
to the hardware error severities and refactor the file.

Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Raag Jadav &lt;raag.jadav@intel.com&gt;
Link: https://patch.msgid.link/20260304074412.464435-10-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe/xe_hw_error: Add fault injection to trigger csc error handler</title>
<updated>2025-08-26T14:11:34+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2025-08-26T06:34:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d1f51a4f953da4ada15fa585c800ca98d627ce47'/>
<id>urn:sha1:d1f51a4f953da4ada15fa585c800ca98d627ce47</id>
<content type='text'>
Add a debugfs fault handler to trigger the CSC error handler, which
wedges the device and enables runtime survivability mode.

v2: add debugfs only for bmg (Umesh)
v3: do not use csc_fault attribute if debugfs is not enabled
v4: rebase

Cc: Lucas De Marchi &lt;lucas.demarchi@intel.com&gt;
Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Raag Jadav &lt;raag.jadav@intel.com&gt;
Link: https://lore.kernel.org/r/20250826063419.3022216-11-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors</title>
<updated>2025-08-26T14:11:34+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2025-08-26T06:34:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a7df563b45b03ac8aa480051cce8d49d7b489ec6'/>
<id>urn:sha1:a7df563b45b03ac8aa480051cce8d49d7b489ec6</id>
<content type='text'>
Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.

Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash.

System admin/userspace is notified of the necessity of a firmware flash
with a combination of vendor-specific drm device wedged uevent, dmesg logs
and runtime survivability sysfs. It is the responsibility of the consumer
to verify all the actions and then trigger a firmware flash using tools
like fwupd.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
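
A minimal sketch of how a consumer might check the properties printed
above before acting. The function names are hypothetical and only the
KEY=VALUE lines shown in this message are assumed; this is not driver or
udev code.

```python
def parse_uevent_properties(text):
    # Hypothetical helper: collect KEY=VALUE lines into a dict,
    # skipping blank lines and any line without an "=" sign.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key] = value
    return props


def needs_vendor_recovery(props):
    # "vendor-specific" is the WEDGED value shown in the uevent above.
    return (props.get("SUBSYSTEM") == "drm"
            and props.get("WEDGED") == "vendor-specific")


sample = """\
ACTION=change
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
"""
print(needs_vendor_recovery(parse_uevent_properties(sample)))
```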

Logs

xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: Tile0 reported NONFATAL error 0x20000
xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set
xe 0000:03:00.0: Runtime Survivability mode enabled
xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
               IOCTLs and executions are blocked. Only a rebind may clear the failure
               Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
xe 0000:03:00.0: [drm] device wedged, needs recovery
xe 0000:03:00.0: Firmware flash required, Please refer to the userspace documentation for more details!

Runtime survivability Sysfs:

/sys/bus/pci/devices/&lt;device&gt;/survivability_mode

v2: use vendor recovery method with
    runtime survivability (Christian, Rodrigo, Raag)
v3: move declare wedged to runtime survivability mode (Rodrigo)
v4: update commit message

Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Umesh Nerlige Ramappa &lt;umesh.nerlige.ramappa@intel.com&gt;
Link: https://lore.kernel.org/r/20250826063419.3022216-10-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
<entry>
<title>drm/xe: Add support to handle hardware errors</title>
<updated>2025-08-26T14:11:34+00:00</updated>
<author>
<name>Riana Tauro</name>
<email>riana.tauro@intel.com</email>
</author>
<published>2025-08-26T06:34:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0a2a873d615a39e8a87d3f15285ed888341ddce8'/>
<id>urn:sha1:0a2a873d615a39e8a87d3f15285ed888341ddce8</id>
<content type='text'>
The Gfx device reports two classes of errors: uncorrectable and
correctable. Depending on the severity, uncorrectable errors are further
classified as Non-Fatal and Fatal.

Correctable and Non-Fatal errors: These errors are reported as MSI. Bits in
the Master Interrupt Register indicate the class of the error.
The source of the error is then read from the Device Error Source
Register.

Fatal errors: These are reported as PCIe errors.
When a PCIe error is asserted, the OS will perform an SBR (Secondary
Bus Reset) which causes the driver to reload. The error registers are
sticky and the values are maintained through SBR.
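
The classification described above can be sketched as follows. The enum
and function names are hypothetical illustrations of the description, not
the identifiers used in the driver.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "nonfatal"
    FATAL = "fatal"


def is_uncorrectable(sev):
    # Non-Fatal and Fatal are the two uncorrectable severities.
    return sev in (Severity.NONFATAL, Severity.FATAL)


def reporting_path(sev):
    # Per the description: correctable and non-fatal errors arrive
    # as MSI, fatal errors are reported as PCIe errors.
    return "pcie" if sev is Severity.FATAL else "msi"


for sev in Severity:
    print(sev.value, is_uncorrectable(sev), reporting_path(sev))
```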

Add basic support to handle these errors.

Bspec: 50875, 53073, 53074, 53075, 53076

v2: Format commit message (Umesh)
v3: fix documentation (Stuart)

Cc: Stuart Summers &lt;stuart.summers@intel.com&gt;
Co-developed-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Himal Prasad Ghimiray &lt;himal.prasad.ghimiray@intel.com&gt;
Signed-off-by: Riana Tauro &lt;riana.tauro@intel.com&gt;
Reviewed-by: Umesh Nerlige Ramappa &lt;umesh.nerlige.ramappa@intel.com&gt;
Link: https://lore.kernel.org/r/20250826063419.3022216-9-riana.tauro@intel.com
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
</content>
</entry>
</feed>
