kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c, branch v6.18.36

drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)

2026-06-19T11:44:13+00:00

commit e47b0056a08dc70430ffc44bbf62197e7d1ff8ea upstream. Problem: While developing the amd_close_race IGT test (which intentionally triggers execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce zero diagnostic output. The GPU simply hangs silently for ~10s until the scheduler timeout fires. There is no way to distinguish an execute permission fault from any other type of GPU hang. Root cause: GFX 10.1.x defaults to noretry=0, which sets RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers (gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE, wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the translation indefinitely, expecting software to eventually fix the permission bits (as happens in SVM/HMM recovery). No interrupt of any kind reaches the IH ring. This is different from invalid-page faults (V=0) which DO generate a retry fault interrupt that the driver can escalate to a no-retry fault. Permission faults with valid PTEs loop silently forever in hardware. GFX 10.3+ already defaults to noretry=1, which makes permission faults generate immediate L2 protection fault interrupts. GFX 10.1.x was inadvertently left out of this default. Fix: Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer generations. With noretry=1, the existing non-retry fault handler (gmc_v10_0_process_interrupt) already decodes and prints the full GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS, faulting address, VMID, PASID, and process name. No additional logging code is needed — the fix is purely routing permission faults to the existing, fully-capable non-retry interrupt handler. v2: Dropped GFX10-specific logging from gmc_v10_0.c and kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry fault handler, but with noretry=1 permission faults take the non-retry path — the v1 retry handler code was dead and would never execute. Tested on Navi10 (GFX 10.1.10): - Execute permission faults now produce immediate, clear output: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592) Process amd_close_race pid 13380 thread amd_close_race pid 13384 in page at address 0x40001000 from client 0x1b (UTCL2) GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881 PERMISSION_FAULTS: 0x8 - No regressions with properly-mapped GPU workloads Cc: Christian Koenig Cc: Alex Deucher Cc: Felix Kuehling Signed-off-by: Vitaly Prosyak Acked-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit eb21edd24c40d81066753f8ac6f23bce15745395) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdgpu/gmc: Fix AMDGPU_GART_PLACEMENT_LOW to not overlap with VRAM

2026-05-23T11:07:07+00:00

[ Upstream commit 36d65da7570bf72ce28504fa9a81abfc728e6d96 ] When the GART placement is set to AMDGPU_GART_PLACEMENT_LOW: Make sure that GART does not overlap with VRAM when VRAM is configured to be in the low address space. Solve this according to the following logic: - When GART fits before VRAM, use zero address for GART - Otherwise, put GART after the end of VRAM, aligned to 4 GiB Previously, I had assumed this was not possible so it was OK to not handle it, but now we got a report from a user who has a board that is configured this way. Fixes: 917f91d8d8e8 ("drm/amdgpu/gmc: add a way to force a particular placement for GART") Signed-off-by: Timur Kristóf Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 3d9de5d86a1658cadb311461b001eb1df67263ad) Signed-off-by: Sasha Levin

drm/amdgpu: Fix validating flush_gpu_tlb_pasid()

2026-05-17T15:15:35+00:00

commit e3a6eff92bbd960b471966d9afccb4d584546d17 upstream. When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf Signed-off-by: Prike Liang Reviewed-by: Prike Liang Reviewed-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: validate the flush_gpu_tlb_pasid()

2026-05-17T15:15:35+00:00

commit f4db9913e4d3dabe9ff3ea6178f2c1bc286012b8 upstream. Validate flush_gpu_tlb_pasid() availability before flushing tlb. Signed-off-by: Prike Liang Reviewed-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: keep vga memory on MacBooks with switchable graphics

2026-03-04T12:21:34+00:00

[ Upstream commit 096bb75e13cc508d3915b7604e356bcb12b17766 ] On Intel MacBookPros with switchable graphics, when the iGPU is enabled, the address of VRAM gets put at 0 in the dGPU's virtual address space. This is non-standard and seems to cause issues with the cursor if it ends up at 0. We have the framework to reserve memory at 0 in the address space, so enable it here if the vram start address is 0. Reviewed-and-tested-by: Mario Kleiner Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302 Cc: stable@vger.kernel.org Cc: Mario Kleiner Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges()

2026-02-26T22:59:42+00:00

[ Upstream commit 0c44d61945c4a80775292d96460aa2f22e62f86c ] amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocation. Using kfree() to release vmalloc memory will lead to a memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar Signed-off-by: Zilin Guan Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove

2026-02-06T15:57:44+00:00

commit 8b1ecc9377bc641533cd9e76dfa3aee3cd04a007 upstream. On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf Reviewed-by: Philip Yang Signed-off-by: Jon Doron Signed-off-by: Alex Deucher (cherry picked from commit 6ce8d536c80aa1f059e82184f0d1994436b1d526) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amd/amdgpu: reserve vm invalidation engine for uni_mes

2025-11-24T18:25:31+00:00

Reserve vm invalidation engine 6 when uni_mes enabled. It is used in processing tlb flush request from host. Signed-off-by: Michael Chen Acked-by: Alex Deucher Reviewed-by: Shaoyun liu Signed-off-by: Alex Deucher (cherry picked from commit 873373739b9b150720ea2c5390b4e904a4d21505) Cc: stable@vger.kernel.org

drm/amdgpu: give each kernel job a unique id

2025-09-01T05:19:31+00:00

Userspace jobs have drm_file.client_id as a unique identifier as job's owners. For kernel jobs, we can allocate arbitrary values - the risk of overlap with userspace ids is small (given that it's a u64 value). In the unlikely case the overlap happens, it'll only impact trace events. Since this ID is traced in the gpu_scheduler trace events, this allows to determine the source of each job sent to the hardware. To make grepping easier, the IDs are defined as they will appear in the trace output. Signed-off-by: Pierre-Eric Pelloux-Prayer Acked-by: Alex Deucher Signed-off-by: Arunpravin Paneer Selvam Link: https://lore.kernel.org/r/20250604122827.2191-1-pierre-eric.pelloux-prayer@amd.com

drm/amdgpu: Convert init_mem_ranges into common helpers

2025-06-24T14:04:16+00:00

They can be shared across multiple products Signed-off-by: Hawking Zhang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher