kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c, branch v7.0.13

drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)

2026-06-19T11:48:10+00:00

commit e47b0056a08dc70430ffc44bbf62197e7d1ff8ea upstream. Problem: While developing the amd_close_race IGT test (which intentionally triggers execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce zero diagnostic output. The GPU simply hangs silently for ~10s until the scheduler timeout fires. There is no way to distinguish an execute permission fault from any other type of GPU hang. Root cause: GFX 10.1.x defaults to noretry=0, which sets RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers (gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE, wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the translation indefinitely, expecting software to eventually fix the permission bits (as happens in SVM/HMM recovery). No interrupt of any kind reaches the IH ring. This is different from invalid-page faults (V=0) which DO generate a retry fault interrupt that the driver can escalate to a no-retry fault. Permission faults with valid PTEs loop silently forever in hardware. GFX 10.3+ already defaults to noretry=1, which makes permission faults generate immediate L2 protection fault interrupts. GFX 10.1.x was inadvertently left out of this default. Fix: Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer generations. With noretry=1, the existing non-retry fault handler (gmc_v10_0_process_interrupt) already decodes and prints the full GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS, faulting address, VMID, PASID, and process name. No additional logging code is needed — the fix is purely routing permission faults to the existing, fully-capable non-retry interrupt handler. v2: Dropped GFX10-specific logging from gmc_v10_0.c and kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry fault handler, but with noretry=1 permission faults take the non-retry path — the v1 retry handler code was dead and would never execute. Tested on Navi10 (GFX 10.1.10): - Execute permission faults now produce immediate, clear output: [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592) Process amd_close_race pid 13380 thread amd_close_race pid 13384 in page at address 0x40001000 from client 0x1b (UTCL2) GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881 PERMISSION_FAULTS: 0x8 - No regressions with properly-mapped GPU workloads Cc: Christian Koenig Cc: Alex Deucher Cc: Felix Kuehling Signed-off-by: Vitaly Prosyak Acked-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit eb21edd24c40d81066753f8ac6f23bce15745395) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdgpu/gmc: Fix AMDGPU_GART_PLACEMENT_LOW to not overlap with VRAM

2026-05-23T11:09:27+00:00

[ Upstream commit 36d65da7570bf72ce28504fa9a81abfc728e6d96 ] When the GART placement is set to AMDGPU_GART_PLACEMENT_LOW: Make sure that GART does not overlap with VRAM when VRAM is configured to be in the low address space. Solve this according to the following logic: - When GART fits before VRAM, use zero address for GART - Otherwise, put GART after the end of VRAM, aligned to 4 GiB Previously, I had assumed this was not possible so it was OK to not handle it, but now we got a report from a user who has a board that is configured this way. Fixes: 917f91d8d8e8 ("drm/amdgpu/gmc: add a way to force a particular placement for GART") Signed-off-by: Timur Kristóf Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 3d9de5d86a1658cadb311461b001eb1df67263ad) Signed-off-by: Sasha Levin

Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses

2026-02-22T16:26:33+00:00

Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook Signed-off-by: Linus Torvalds

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

drm/amdgpu: keep vga memory on MacBooks with switchable graphics

2026-02-19T17:16:11+00:00

On Intel MacBookPros with switchable graphics, when the iGPU is enabled, the address of VRAM gets put at 0 in the dGPU's virtual address space. This is non-standard and seems to cause issues with the cursor if it ends up at 0. We have the framework to reserve memory at 0 in the address space, so enable it here if the vram start address is 0. Reviewed-and-tested-by: Mario Kleiner Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302 Cc: stable@vger.kernel.org Cc: Mario Kleiner Signed-off-by: Alex Deucher

drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges()

2026-02-05T22:18:33+00:00

amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocation. Using kfree() to release vmalloc memory will lead to a memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar Signed-off-by: Zilin Guan Signed-off-by: Alex Deucher

drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_remove

2026-01-27T23:09:04+00:00

On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf Reviewed-by: Philip Yang Signed-off-by: Jon Doron Signed-off-by: Alex Deucher

drm/amdgpu: Fix validating flush_gpu_tlb_pasid()

2026-01-21T19:24:27+00:00

When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf Signed-off-by: Prike Liang Reviewed-by: Prike Liang Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: validate the flush_gpu_tlb_pasid()

2026-01-14T19:28:59+00:00

Validate flush_gpu_tlb_pasid() availability before flushing tlb. Signed-off-by: Prike Liang Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: add support for GC IP version 11.5.4

2026-01-05T21:27:34+00:00

This initializes GC IP version 11.5.4. v2: squash in RLC offset fix Reviewed-by: Mario Limonciello Signed-off-by: Tim Huang Signed-off-by: Alex Deucher