kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c, branch v6.19.11

drm/amdgpu: mark invalid records with U64_MAX

2026-03-04T12:20:08+00:00

[ Upstream commit 0028b86b52f7609e36af635ef6cb908925306233 ] Set retired_page of invalid RAS records to U64_MAX, and skip them when reading RAS records. Signed-off-by: Gangliang Xie Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin
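
The sentinel approach described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct ras_record`, `mark_invalid` and `read_valid_pages` are hypothetical names, not the driver's actual types or functions, and `UINT64_MAX` stands in for the kernel's `U64_MAX`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a RAS record; not the kernel's structure. */
struct ras_record {
    uint64_t retired_page;
};

/* Plays the role of U64_MAX in this user-space model. */
#define INVALID_PAGE UINT64_MAX

/* Mark a record as invalid by storing the sentinel value. */
void mark_invalid(struct ras_record *rec)
{
    rec->retired_page = INVALID_PAGE;
}

/*
 * Copy the retired pages of valid records into out[], skipping every
 * record that carries the sentinel. Returns the number of pages copied.
 */
size_t read_valid_pages(const struct ras_record *recs, size_t n,
                        uint64_t *out, size_t out_cap)
{
    size_t copied = 0;

    for (size_t i = 0; i < n && copied < out_cap; i++) {
        if (recs[i].retired_page == INVALID_PAGE)
            continue; /* invalid record: skip it */
        out[copied++] = recs[i].retired_page;
    }
    return copied;
}
```

A reader that filters on the sentinel never has to consult a separate validity flag, which is presumably why an out-of-range value such as U64_MAX is a convenient marker.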

drm/amdgpu/ras: Move ras data alloc before bad page check

2026-03-04T12:19:50+00:00

[ Upstream commit bd68a1404b6fa2e7e9957b38ba22616faba43e75 ] In the rare event that the EEPROM has only invalid address entries, allocation is skipped, which causes the following NULL pointer dereference:
[ 547.103445] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 547.118897] #PF: supervisor read access in kernel mode
[ 547.130292] #PF: error_code(0x0000) - not-present page
[ 547.141689] PGD 124757067 P4D 0
[ 547.148842] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 547.158504] CPU: 49 PID: 8167 Comm: cat Tainted: G OE 6.8.0-38-generic #38-Ubuntu
[ 547.177998] Hardware name: Supermicro AS -8126GS-TNMR/H14DSG-OD, BIOS 1.7 09/12/2025
[ 547.195178] RIP: 0010:amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu]
[ 547.210375] Code: e8 63 78 82 c0 45 31 d2 45 3b 75 08 48 8b 45 a0 73 44 44 89 f1 48 8b 7d 88 48 89 ca 48 c1 e2 05 48 29 ca 49 8b 4d 00 48 01 d1 <48> 83 79 10 00 74 17 49 63 f2 48 8b 49 08 41 83 c2 01 48 8d 34 76
[ 547.252045] RSP: 0018:ffa0000067287ac0 EFLAGS: 00010246
[ 547.263636] RAX: ff11000167c28130 RBX: ff11000127600000 RCX: 0000000000000000
[ 547.279467] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000125b1c800
[ 547.295298] RBP: ffa0000067287b50 R08: 0000000000000000 R09: 0000000000000000
[ 547.311129] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 547.326959] R13: ff11000217b1de00 R14: 0000000000000000 R15: 0000000000000092
[ 547.342790] FS: 0000746e59d14740(0000) GS:ff11017dfda80000(0000) knlGS:0000000000000000
[ 547.360744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 547.373489] CR2: 0000000000000010 CR3: 000000019585e001 CR4: 0000000000f71ef0
[ 547.389321] PKRU: 55555554
[ 547.395316] Call Trace:
[ 547.400737]
[ 547.405386] ? show_regs+0x6d/0x80
[ 547.412929] ? __die+0x24/0x80
[ 547.419697] ? page_fault_oops+0x99/0x1b0
[ 547.428588] ? do_user_addr_fault+0x2ee/0x6b0
[ 547.438249] ? exc_page_fault+0x83/0x1b0
[ 547.446949] ? asm_exc_page_fault+0x27/0x30
[ 547.456225] ? amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu]
[ 547.470040] ? mas_wr_modify+0xcd/0x140
[ 547.478548] sysfs_kf_bin_read+0x63/0xb0
[ 547.487248] kernfs_file_read_iter+0xa1/0x190
[ 547.496909] kernfs_fop_read_iter+0x25/0x40
[ 547.506182] vfs_read+0x255/0x390
This also results in the space left being assigned negative values. Moving the data allocation call before the bad page check resolves both issues. Signed-off-by: Asad Kamal Suggested-by: Lijo Lazar Reviewed-by: Hawking Zhang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin
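
The ordering problem can be illustrated with a small user-space model. This is a sketch, not the driver code: `struct bad_page_store`, `store_reserve`, `store_load`, `store_space_left` and the `max_addr` validity rule are all assumptions made for the example. The point it shows is that reserving storage before filtering entries leaves the array non-NULL and the remaining space non-negative even when every entry is rejected.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for bad page addresses. */
struct bad_page_store {
    uint64_t *pages;   /* NULL until storage is reserved */
    size_t count;      /* entries currently stored */
    size_t capacity;   /* entries the storage can hold */
};

/* Make sure storage exists for at least `want` entries. */
int store_reserve(struct bad_page_store *s, size_t want)
{
    if (s->pages && want <= s->capacity)
        return 0;

    uint64_t *p = realloc(s->pages, want * sizeof(*p));
    if (!p)
        return -1;
    s->pages = p;
    s->capacity = want;
    return 0;
}

/*
 * Load addresses, treating any address >= max_addr as invalid.
 * Storage is reserved BEFORE the validity check, so s->pages is
 * allocated even if every entry turns out to be invalid.
 */
int store_load(struct bad_page_store *s, const uint64_t *addrs,
               size_t n, uint64_t max_addr)
{
    size_t want = s->count + n;

    if (want == 0)
        want = 1;
    if (store_reserve(s, want))
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (addrs[i] >= max_addr)
            continue; /* invalid entry: skipped, but storage exists */
        s->pages[s->count++] = addrs[i];
    }
    return 0;
}

/* count never exceeds capacity, so this cannot wrap or go negative. */
size_t store_space_left(const struct bad_page_store *s)
{
    return s->capacity - s->count;
}
```

With this ordering, a later reader that walks `pages` sees an allocated array with a count of zero rather than a NULL pointer.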

drm/amdgpu: fix the calculation of RAS bad page number

2026-03-04T12:19:50+00:00

[ Upstream commit f752e79d38857011f1293fcb6c810409c3b669ee ] __amdgpu_ras_restore_bad_pages is responsible for maintaining the bad page number, so drop the unnecessary bad page number update in the error handling path of add_bad_pages. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin
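
The idea of keeping the counter under a single owner can be sketched as follows. This is a simplified model, not the driver's code: `struct ras_state`, `restore_bad_page`, `add_bad_pages` and `MAX_PAGES` are hypothetical, and the commit message does not say what the dropped update looked like, so the sketch only shows a caller whose error path leaves the count alone.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGES 4 /* arbitrary limit for the model */

/* Hypothetical bookkeeping state. */
struct ras_state {
    uint64_t pages[MAX_PAGES];
    size_t bad_page_num;
};

/*
 * Sole owner of bad_page_num: the count changes only here, and only
 * when a page is actually stored. Returns 0 on success, -1 when full.
 */
int restore_bad_page(struct ras_state *st, uint64_t page)
{
    if (st->bad_page_num >= MAX_PAGES)
        return -1;
    st->pages[st->bad_page_num++] = page;
    return 0;
}

/*
 * Add a batch of pages. On failure, return the error without touching
 * bad_page_num: the helper above already keeps the count exact, so a
 * second adjustment here would make it wrong.
 */
int add_bad_pages(struct ras_state *st, const uint64_t *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = restore_bad_page(st, pages[i]);

        if (ret)
            return ret;
    }
    return 0;
}
```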

drm/amdgpu: Fix memory leak in amdgpu_ras_init()

2026-02-26T23:01:35+00:00

[ Upstream commit ee41e5b63c8210525c936ee637a2c8d185ce873c ] When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function returns directly without freeing the allocated con structure, leading to a memory leak. Fix this by jumping to the release_con label to properly clean up the allocated memory before returning the error code. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init") Reviewed-by: Tao Zhou Signed-off-by: Zilin Guan Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin
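
The leak pattern and its fix can be modelled in plain C. The sketch below is not the driver's `amdgpu_ras_init()`: `struct ras_context`, `nbio_sw_init_model`, `ras_init_model` and the `live_allocations` counter are hypothetical names used only to show an early `return` being replaced by a jump to a cleanup label. Only the label name `release_con` is taken from the commit message.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical context object allocated during init. */
struct ras_context {
    int features;
};

/* Counts allocations still outstanding, so a leak is observable. */
int live_allocations;

/* Stand-in for a sub-initialisation step that may fail. */
int nbio_sw_init_model(int should_fail)
{
    return should_fail ? -EINVAL : 0;
}

int ras_init_model(struct ras_context **out, int nbio_fails)
{
    struct ras_context *con = calloc(1, sizeof(*con));
    int r;

    if (!con)
        return -ENOMEM;
    live_allocations++;

    r = nbio_sw_init_model(nbio_fails);
    if (r)
        goto release_con; /* a bare "return r;" here would leak con */

    *out = con;
    return 0;

release_con:
    free(con);
    live_allocations--;
    return r;
}
```

Routing every failure after the allocation through one label keeps the cleanup in a single place, which is the usual kernel idiom for error unwinding.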

drm/amdgpu: Unregister mce notifier

2025-11-18T15:52:19+00:00

Unregister mce notifier on unload. Signed-off-by: Lijo Lazar Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: get RAS bad page address from MCA address

2025-11-12T02:54:14+00:00

Get the RAS bad page address from the MCA address instead of from the physical address. v2: add a comment to make the code more readable. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: load RAS bad page from PMFW in page retirement

2025-11-12T02:53:26+00:00

In the legacy path, bad pages are queried from MCA registers; switch to getting them from PMFW when PMFW manages the EEPROM data. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher
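
The source selection described above amounts to a dispatch on who owns the EEPROM data. The following is a minimal sketch under assumptions: `struct ras_backend`, `load_bad_pages` and the two stub query functions are invented for illustration and do not correspond to real driver or firmware interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical query callback: fills out[] and returns the entry count. */
typedef size_t (*bad_page_query_fn)(uint64_t *out, size_t cap);

/* Hypothetical description of the available bad page sources. */
struct ras_backend {
    bool pmfw_manages_eeprom;
    bad_page_query_fn query_pmfw; /* firmware-managed records */
    bad_page_query_fn query_mca;  /* legacy register-based path */
};

/* Pick the source based on who manages the EEPROM data. */
size_t load_bad_pages(const struct ras_backend *be, uint64_t *out, size_t cap)
{
    if (be->pmfw_manages_eeprom)
        return be->query_pmfw(out, cap);
    return be->query_mca(out, cap);
}

/* Stub sources with made-up addresses, used only to exercise the model. */
size_t stub_query_pmfw(uint64_t *out, size_t cap)
{
    if (cap == 0)
        return 0;
    out[0] = 0x1000;
    return 1;
}

size_t stub_query_mca(uint64_t *out, size_t cap)
{
    if (cap == 0)
        return 0;
    out[0] = 0x2000;
    return 1;
}
```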

drm/amdgpu: support to load RAS bad pages from PMFW

2025-11-06T15:01:14+00:00

PMFW manages the EEPROM bad page records, so update bad page loading accordingly. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: suspend ras module before gpu reset

2025-11-04T16:53:59+00:00

During gpu reset, all GPU-related resources are inaccessible. To avoid affecting ras functionality, suspend ras module before gpu reset and resume it after gpu reset is complete. V2: Rename functions to avoid misunderstanding. V3: Move flush_delayed_work to amdgpu_ras_process_pause, Move schedule_delayed_work to amdgpu_ras_process_unpause. V4: Rename functions. V5: Move the function to amdgpu_ras.c. Signed-off-by: YiPeng Chai Reviewed-by: Tao Zhou Reviewed-by: Hawking Zhang Acked-by: Lijo Lazar Signed-off-by: Alex Deucher
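
The pause and unpause sequence around a reset can be sketched with a toy state machine. This is an assumption-laden model: `struct ras_module`, `ras_process_pause_model`, `ras_process_unpause_model`, `ras_submit`, `gpu_reset_model` and `stub_reset` are hypothetical, and the integer counters merely imitate the flush and reschedule of delayed work mentioned in the V3 note.

```c
#include <stdbool.h>

/* Hypothetical RAS module state. */
struct ras_module {
    bool paused;
    int pending_work; /* queued but not yet processed */
    int processed;    /* completed work items */
};

/* Flush outstanding work, then stop accepting new work. */
void ras_process_pause_model(struct ras_module *ras)
{
    ras->processed += ras->pending_work; /* models flushing delayed work */
    ras->pending_work = 0;
    ras->paused = true;
}

/* Accept work again; a real driver would reschedule its worker here. */
void ras_process_unpause_model(struct ras_module *ras)
{
    ras->paused = false;
}

/* Queue one work item unless the module is paused. */
bool ras_submit(struct ras_module *ras)
{
    if (ras->paused)
        return false;
    ras->pending_work++;
    return true;
}

/* Stand-in for the actual reset; always succeeds in this model. */
int stub_reset(void)
{
    return 0;
}

/* Bracket the reset so RAS never touches the GPU while it is down. */
int gpu_reset_model(struct ras_module *ras, int (*do_reset)(void))
{
    int ret;

    ras_process_pause_model(ras);
    ret = do_reset();
    ras_process_unpause_model(ras);
    return ret;
}
```

Flushing before setting the paused flag means no work item is left half-queued across the reset, which matches the intent stated in the message.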

drm/amdgpu: add function to check if pmfw eeprom is supported

2025-11-04T16:53:58+00:00

Add a function to check whether PMFW EEPROM is supported, and skip the EEPROM check and recovery when it is. Signed-off-by: Gangliang Xie Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher