summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2026-03-12drm/amdgpu: Fix error handling in slot resetLijo Lazar1-7/+10
[ Upstream commit b57c4ec98c17789136a4db948aec6daadceb5024 ] If the device has not recovered after slot reset is called, it goes to out label for error handling. There it could make decision based on uninitialized hive pointer and could result in accessing an uninitialized list. Initialize the list and hive properly so that it handles the error situation and also releases the reset domain lock which is acquired during error_detected callback. Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit bb71362182e59caa227e4192da5a612b09349696) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12drm/amdgpu: Fix locking bugs in error pathsBart Van Assche1-5/+7
[ Upstream commit 480ad5f6ead4a47b969aab6618573cd6822bb6a4 ] Do not unlock psp->ras_context.mutex if it has not been locked. This has been detected by the Clang thread-safety analyzer. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: amd-gfx@lists.freedesktop.org Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory") Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6fa01b4335978051d2cd80841728fd63cc597970) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12drm/amdgpu: Unlock a mutex before destroying itBart Van Assche1-0/+1
[ Upstream commit 5e0bcc7b88bcd081aaae6f481b10d9ab294fcb69 ] Mutexes must be unlocked before these are destroyed. This has been detected by the Clang thread-safety analyzer. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Yang Wang <kevinyang.wang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: amd-gfx@lists.freedesktop.org Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework") Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 270258ba320beb99648dceffb67e86ac76786e55) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12drm/amdgpu/userq: Do not allow userspace to trivially triger kernel warningsTvrtko Ursulin1-4/+4
[ Upstream commit 7b7d7693a55d606d700beb9549c9f7f0e5d9c24f ] Userspace can either deliberately pass in the too small num_fences, or the required number can legitimately grow between the two calls to the userq wait ioctl. In both cases we do not want the emit the kernel warning backtrace since nothing is wrong with the kernel and userspace will simply get an errno reported back. So lets simply drop the WARN_ONs. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL") Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2c333ea579de6cc20ea7bc50e9595ef72863e65c) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and ↵Srinivasan Shanmugam1-36/+37
Timeline Management v7 [ Upstream commit efdc66fe12b07e7b7d28650bd8d4f7e3bb92c5d4 ] When GPU memory mappings are updated, the driver returns a fence so userspace knows when the update is finished. The previous refactor could pick the wrong fence or rely on checks that are not safe for GPU mappings that stay valid even when memory is missing. In some cases this could return an invalid fence or cause fence reference counting problems. Fix this by (v5,v6, per Christian): - Starting from the VM’s existing last update fence, so a valid and meaningful fence is always returned even when no new work is required. - Selecting the VM-level fence only for always-valid / PRT mappings using the required combined bo_va + bo guard. - Using the per-BO page table update fence for normal MAP and REPLACE operations. - For UNMAP and CLEAR, returning the fence provided by amdgpu_vm_clear_freed(), which may remain unchanged when nothing needs clearing. - Keeping fence reference counting balanced. v7: Drop the extra bo_va/bo NULL guard since amdgpu_vm_is_bo_always_valid() handles NULL BOs correctly (including PRT). (Christian) This makes VM timeline fences correct and prevents crashes caused by incorrect fence handling. Fixes: bd8150a1b337 ("drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v4") Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: keep vga memory on MacBooks with switchable graphicsAlex Deucher1-0/+10
[ Upstream commit 096bb75e13cc508d3915b7604e356bcb12b17766 ] On Intel MacBookPros with switchable graphics, when the iGPU is enabled, the address of VRAM gets put at 0 in the dGPU's virtual address space. This is non-standard and seems to cause issues with the cursor if it ends up at 0. We have the framework to reserve memory at 0 in the address space, so enable it here if the vram start address is 0. Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302 Cc: stable@vger.kernel.org Cc: Mario Kleiner <mario.kleiner.de@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notifyPierre-Eric Pelloux-Prayer1-1/+8
[ Upstream commit b18fc0ab837381c1a6ef28386602cd888f2d9edf ] Invalidating a dmabuf will impact other users of the shared BO. In the scenario where process A moves the BO, it needs to inform process B about the move and process B will need to update its page table. The commit fixes a synchronisation bug caused by the use of the ticket: it made amdgpu_vm_handle_moved behave as if updating the page table immediately was correct but in this case it's not. An example is the following scenario, with 2 GPUs and glxgears running on GPU0 and Xorg running on GPU1, on a system where P2P PCI isn't supported: glxgears: export linear buffer from GPU0 and import using GPU1 submit frame rendering to GPU0 submit tiled->linear blit Xorg: copy of linear buffer The sequence of jobs would be: drm_sched_job_run # GPU0, frame rendering drm_sched_job_queue # GPU0, blit drm_sched_job_done # GPU0, frame rendering drm_sched_job_run # GPU0, blit move linear buffer for GPU1 access # amdgpu_dma_buf_move_notify -> update pt # GPU0 It this point the blit job on GPU0 is still running and would likely produce a page fault. Cc: stable@vger.kernel.org Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2") Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Use 5-level paging if gmc support 57-bit VAPhilip Yang1-17/+0
[ Upstream commit 3b948dd0366a0b64c02e4ed1aefdf7825942e803 ] Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if gmc init with 57-bit address space support, because ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging support 47-bit VA, require 5-level paging on GPU to support ARM64. NPA address space 52-bit mapping on NPA GPU VM require 5-level paging. Debugger trap get device snapshot expect LDS and Scratch base, limit above 57-bit, which is set only for 5-level paging. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.19.x Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: GPU vm support 5-level page tablePhilip Yang3-1/+23
[ Upstream commit f6b1c1f5fd7237f77fc3880603ea54dcf0371a20 ] If GPU supports 5-level page table, but CPU disable 5-level page table by using boot option no5lvl or CPU feature not available, the virtual address will be 48bit, not needed to enable 5-level page table on GPU vm. If adev->vm_manager.num_level, number of pde levels, set to 4, then gfxhub and mmhub register VM_CONTEXTx_CNTL/PAGE_TABLE_DEPTH will set to 4 to enable 5-level page table in page table walker. Set vm_manager.root_level to AMDGPU_VM_PDE3, then update GPU mapping will allocate and update PDE3/PDE2/PDE1/PDE0/PTB 5-level page tables. If max_level is not 4, no change for the logic to support features needed by old ASICs. v2: squash in CONFIG fix Signed-off-by: Philip Yang <Philip.Yang@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: 3b948dd0366a ("drm/amdgpu: Use 5-level paging if gmc support 57-bit VA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Protect GPU register accesses in powergated state in some pathsYifan Zhang1-3/+6
[ Upstream commit 39fc2bc4da0082c226cbee331f0a5d44db3997da ] Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all() and amdgpu_fence_driver_hw_fini(). Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: avoid sdma ring reset in sriovVictor Zhao1-0/+3
[ Upstream commit 5cc7bbd9f1b74d9fe2f7ac08d6ba0477e8d2d65f ] sdma ring reset is not supported in SRIOV. kfd driver does not check reset mask, and could queue sdma ring reset during unmap_queues_cpsch. Avoid the ring reset for sriov. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Adjust usleep_range in fence waitCe Sun1-1/+1
[ Upstream commit 3ee1c72606bd2842f0f377fd4b118362af0323ae ] Tune the sleep interval in the PSP fence wait loop from 10-100us to 60-100us.This adjustment results in an overall wait window of 1.2s (60us * 20000 iterations) to 2 seconds (100us * 20000 iterations), which guarantees that we can retrieve the correct fence value Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: return when ras table checksum is errorGangliang Xie1-1/+3
[ Upstream commit 044f8d3b1fac6ac89c560f61415000e6bdab3a03 ] end the function flow when ras table checksum is error Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Skip vcn poison irq release on VFLijo Lazar1-1/+3
[ Upstream commit 8980be03b3f9a4b58197ef95d3b37efa41a25331 ] VF doesn't enable VCN poison irq in VCNv2.5. Skip releasing it and avoid call trace during deinitialization. [ 71.913601] [drm] clean up the vf2pf work item [ 71.915088] ------------[ cut here ]------------ [ 71.915092] WARNING: CPU: 3 PID: 1079 at /tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu] [ 71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp mii ttm aesni_intel crypto_simd cryptd [ 71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G OE 6.8.0-87-generic #88~22.04.1-Ubuntu [ 71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014 [ 71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu] [ 71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00 [ 71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246 [ 71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000 [ 71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000 [ 71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000 [ 71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000 [ 71.915792] FS: 000070cff87c4c40(0000) GS:ffff893abfb80000(0000) knlGS:0000000000000000 [ 71.915795] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0 [ 71.915800] PKRU: 55555554 [ 71.915802] Call Trace: [ 71.915805] <TASK> [ 71.915809] vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu] Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: validate user queue size constraintsJesse.Zhang1-0/+11
[ Upstream commit 8079b87c02e531cc91601f72ea8336dd2262fdf1 ] Add validation to ensure user queue sizes meet hardware requirements: - Size must be a power of two for efficient ring buffer wrapping - Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations This prevents invalid configurations that could lead to GPU faults or unexpected behavior. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: mark invalid records with U64_MAXGangliang Xie1-0/+6
[ Upstream commit 0028b86b52f7609e36af635ef6cb908925306233 ] set retired_page of invalid ras records to U64_MAX, and skip them when reading ras records Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Skip loading SDMA_RS64 in VFYuBiao Wang1-0/+1
[ Upstream commit 39c21b81112321cbe1267b02c77ecd2161ce19aa ] VFs use the PF SDMA ucode and are unable to load SDMA_RS64. Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com> Signed-off-by: Victor Skvortsov <Victor.Skvortsov@amd.com> Reviewed-by: Gavin Wan <gavin.wan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and ↵Srinivasan Shanmugam1-53/+82
Timeline Management v4 [ Upstream commit bd8150a1b3370a9f7761c5814202a3fe5a79f44f ] This commit simplifies the amdgpu_gem_va_ioctl function, key updates include: - Moved the logic for managing the last update fence directly into amdgpu_gem_va_update_vm. - Introduced checks for the timeline point to enable conditional replacement or addition of fences. v2: Addressed review comments from Christian. v3: Updated comments (Christian). v4: The previous version selected the fence too early and did not manage its reference correctly, which could lead to stale or freed fences being used. This resulted in refcount underflows and could crash when updating GPU timelines. The fence is now chosen only after the VA mapping work is completed, and its reference is taken safely. After exporting it to the VM timeline syncobj, the driver always drops its local fence reference, ensuring balanced refcounting and avoiding use-after-free on dma_fence. Crash signature: [ 205.828135] refcount_t: underflow; use-after-free. [ 205.832963] WARNING: CPU: 30 PID: 7274 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110 ... [ 206.074014] Call Trace: [ 206.076488] <TASK> [ 206.078608] amdgpu_gem_va_ioctl+0x6ea/0x740 [amdgpu] [ 206.084040] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu] [ 206.089994] drm_ioctl_kernel+0x86/0xe0 [drm] [ 206.094415] drm_ioctl+0x26e/0x520 [drm] [ 206.098424] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu] [ 206.104402] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] [ 206.109387] __x64_sys_ioctl+0x96/0xe0 [ 206.113156] do_syscall_64+0x66/0x2d0 ... [ 206.553351] BUG: unable to handle page fault for address: ffffffffc0dfde90 ... [ 206.553378] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0 ... [ 206.553405] Call Trace: [ 206.553409] <IRQ> [ 206.553415] ? __pfx_drm_sched_fence_free_rcu+0x10/0x10 [gpu_sched] [ 206.553424] dma_fence_signal+0x30/0x60 [ 206.553427] drm_sched_job_done.isra.0+0x123/0x150 [gpu_sched] [ 206.553434] dma_fence_signal_timestamp_locked+0x6e/0xe0 [ 206.553437] dma_fence_signal+0x30/0x60 [ 206.553441] amdgpu_fence_process+0xd8/0x150 [amdgpu] [ 206.553854] sdma_v4_0_process_trap_irq+0x97/0xb0 [amdgpu] [ 206.554353] edac_mce_amd(E) ee1004(E) [ 206.554270] amdgpu_irq_dispatch+0x150/0x230 [amdgpu] [ 206.554702] amdgpu_ih_process+0x6a/0x180 [amdgpu] [ 206.555101] amdgpu_irq_handler+0x23/0x60 [amdgpu] [ 206.555500] __handle_irq_event_percpu+0x4a/0x1c0 [ 206.555506] handle_irq_event+0x38/0x80 [ 206.555509] handle_edge_irq+0x92/0x1e0 [ 206.555513] __common_interrupt+0x3e/0xb0 [ 206.555519] common_interrupt+0x80/0xa0 [ 206.555525] </IRQ> [ 206.555527] <TASK> ... [ 206.555650] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0 ... [ 206.555667] Kernel panic - not syncing: Fatal exception in interrupt Link: https://patchwork.freedesktop.org/patch/654669/ Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: avoid a warning in timedout job handlerAlex Deucher1-1/+2
[ Upstream commit c8cf9ddc549fb93cb5a35f3fe23487b1e6707e74 ] Only set an error on the fence if the fence is not signalled. We can end up with a warning if the per queue reset path signals the fence and sets an error as part of the reset, but fails to recover. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: add support for HDP IP version 6.1.1Tim Huang1-0/+1
[ Upstream commit e2fd14f579b841f54a9b7162fef15234d8c0627a ] This initializes HDP IP version 6.1.1. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu/ras: Move ras data alloc before bad page checkAsad Kamal1-5/+5
[ Upstream commit bd68a1404b6fa2e7e9957b38ba22616faba43e75 ] In the rare event if eeprom has only invalid address entries, allocation is skipped, this causes following NULL pointer issue [ 547.103445] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 547.118897] #PF: supervisor read access in kernel mode [ 547.130292] #PF: error_code(0x0000) - not-present page [ 547.141689] PGD 124757067 P4D 0 [ 547.148842] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 547.158504] CPU: 49 PID: 8167 Comm: cat Tainted: G OE 6.8.0-38-generic #38-Ubuntu [ 547.177998] Hardware name: Supermicro AS -8126GS-TNMR/H14DSG-OD, BIOS 1.7 09/12/2025 [ 547.195178] RIP: 0010:amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu] [ 547.210375] Code: e8 63 78 82 c0 45 31 d2 45 3b 75 08 48 8b 45 a0 73 44 44 89 f1 48 8b 7d 88 48 89 ca 48 c1 e2 05 48 29 ca 49 8b 4d 00 48 01 d1 <48> 83 79 10 00 74 17 49 63 f2 48 8b 49 08 41 83 c2 01 48 8d 34 76 [ 547.252045] RSP: 0018:ffa0000067287ac0 EFLAGS: 00010246 [ 547.263636] RAX: ff11000167c28130 RBX: ff11000127600000 RCX: 0000000000000000 [ 547.279467] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000125b1c800 [ 547.295298] RBP: ffa0000067287b50 R08: 0000000000000000 R09: 0000000000000000 [ 547.311129] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 547.326959] R13: ff11000217b1de00 R14: 0000000000000000 R15: 0000000000000092 [ 547.342790] FS: 0000746e59d14740(0000) GS:ff11017dfda80000(0000) knlGS:0000000000000000 [ 547.360744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 547.373489] CR2: 0000000000000010 CR3: 000000019585e001 CR4: 0000000000f71ef0 [ 547.389321] PKRU: 55555554 [ 547.395316] Call Trace: [ 547.400737] <TASK> [ 547.405386] ? show_regs+0x6d/0x80 [ 547.412929] ? __die+0x24/0x80 [ 547.419697] ? page_fault_oops+0x99/0x1b0 [ 547.428588] ? do_user_addr_fault+0x2ee/0x6b0 [ 547.438249] ? exc_page_fault+0x83/0x1b0 [ 547.446949] ? asm_exc_page_fault+0x27/0x30 [ 547.456225] ? amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu] [ 547.470040] ? mas_wr_modify+0xcd/0x140 [ 547.478548] sysfs_kf_bin_read+0x63/0xb0 [ 547.487248] kernfs_file_read_iter+0xa1/0x190 [ 547.496909] kernfs_fop_read_iter+0x25/0x40 [ 547.506182] vfs_read+0x255/0x390 This also result in space left assigned to negative values. Moving data alloc call before bad page check resolves both the issue. Signed-off-by: Asad Kamal <asad.kamal@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: fix the calculation of RAS bad page numberTao Zhou1-4/+0
[ Upstream commit f752e79d38857011f1293fcb6c810409c3b669ee ] __amdgpu_ras_restore_bad_pages is responsible for the maintenance of bad page number, drop the unnecessary bad page number update in the error handling path of add_bad_pages. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: fix NULL pointer issue buffer funcsLikun Gao1-1/+2
[ Upstream commit 9877a865d62c9c3e0f4cc369dc9ca9f7f24f5ee9 ] If SDMA block not enabled, buffer_funcs will not initialize, fix the null pointer issue if buffer_funcs not initialized. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Fix missing unwind in amdgpu_ib_schedule() error pathSrinivasan Shanmugam1-1/+1
[ Upstream commit ba038065655c45728be346d0b174a6da08d8a5c5 ] amdgpu_ib_schedule() returns early after calling amdgpu_ring_undo(). This skips the common free_fence cleanup path. Other error paths were already changed to use goto free_fence, but this one was missed. Change the early return to goto free_fence so all error paths clean up the same way. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c:232 amdgpu_ib_schedule() warn: missing unwind goto? drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c 124 int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs, 125 struct amdgpu_ib *ibs, struct amdgpu_job *job, 126 struct dma_fence **f) 127 { ... 224 225 if (ring->funcs->insert_start) 226 ring->funcs->insert_start(ring); 227 228 if (job) { 229 r = amdgpu_vm_flush(ring, job, need_pipe_sync); 230 if (r) { 231 amdgpu_ring_undo(ring); --> 232 return r; The patch changed the other error paths to goto free_fence but this one was accidentally skipped. 233 } 234 } 235 236 amdgpu_ring_ib_begin(ring); ... 338 339 free_fence: 340 if (!job) 341 kfree(af); 342 return r; 343 } Fixes: f903b85ed0f1 ("drm/amdgpu: fix possible fence leaks from job structure") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: clean up the amdgpu_cs_parser_bosSunil Khatri1-2/+4
[ Upstream commit f025a2b8d93358467b8e8f4b3a617e88c5f02fab ] In low memory conditions, kmalloc can fail. In such conditions unlock the mutex for a clean exit. We do not need to amdgpu_bo_list_put as it's been handled in the amdgpu_cs_parser_fini. Fixes: 737da5363cc0 ("drm/amdgpu: update the functions to use amdgpu version of hmm") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202602030017.7E0xShmH-lkp@intel.com/ Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu/sdma6: enable queue resets unconditionallyAlex Deucher1-12/+3
[ Upstream commit 56423871e9eef1dd069bddef895207fa5ce275fe ] There is no firmware version dependency. This also enables sdma queue resets on all SDMA 6.x based chips. Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu/sdma5.2: enable queue resets unconditionallyAlex Deucher1-19/+3
[ Upstream commit 314d30ad50622fc0d70da71509f9dff21545be14 ] There is no firmware version dependency. This also enables sdma queue resets on all SDMA 5.2.x based chips. Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu/sdma5: enable queue resets unconditionallyAlex Deucher1-12/+3
[ Upstream commit 46a2cb7d24f21132e970cab52359210c3f5ea3c6 ] There is no firmware version dependency. Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Fix memory leak in amdgpu_ras_init()Zilin Guan1-1/+1
[ Upstream commit ee41e5b63c8210525c936ee637a2c8d185ce873c ] When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function returns directly without freeing the allocated con structure, leading to a memory leak. Fix this by jumping to the release_con label to properly clean up the allocated memory before returning the error code. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges()Zilin Guan1-1/+1
[ Upstream commit 0c44d61945c4a80775292d96460aa2f22e62f86c ] amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocation. Using kfree() to release vmalloc memory will lead to a memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Fix memory leak in amdgpu_acpi_enumerate_xcc()Zilin Guan1-1/+3
[ Upstream commit c9be63d565789b56ca7b0197e2cb78a3671f95a8 ] In amdgpu_acpi_enumerate_xcc(), if amdgpu_acpi_dev_init() returns -ENOMEM, the function returns directly without releasing the allocated xcc_info, resulting in a memory leak. Fix this by ensuring that xcc_info is properly freed in the error paths. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: 4d5275ab0b18 ("drm/amdgpu: Add parsing of acpi xcc objects") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Drop MMIO_REMAP domain bit and keep it InternalChristian König4-44/+59
[ Upstream commit 96e97a562d067a6d867862db79864cc66aae99c2 ] "AMDGPU_GEM_DOMAIN_MMIO_REMAP" - Never activated as UAPI and it turned out that this was to inflexible. Allocate the MMIO_REMAP buffer object as a regular GEM BO and explicitly move it into the fixed AMDGPU_PL_MMIO_REMAP placement at the TTM level. This avoids relying on GEM domain bits for MMIO_REMAP, keeps the placement purely internal, and makes the lifetime and pinning of the global MMIO_REMAP BO explicit. The BO is pinned in TTM so it cannot be migrated or evicted. The corresponding free path relies on normal DRM teardown ordering, where no further user ioctls can access the global BO once TTM teardown begins. v2 (Srini): - Updated patch title. - Drop use of AMDGPU_GEM_DOMAIN_MMIO_REMAP in amdgpu_ttm.c. The MMIO_REMAP domain bit is removed from UAPI, so keep the MMIO_REMAP BO allocation domain-less (bp.domain = 0) and rely on the TTM placement (AMDGPU_PL_MMIO_REMAP) for backing/pinning. - Keep fdinfo/mem-stats visibility for MMIO_REMAP by classifying BOs based on bo->tbo.resource->mem_type == AMDGPU_PL_MMIO_REMAP, since the domain bit is removed. v3: Squash patches #1 & #3 Fixes: 056132483724 ("drm/amdgpu/uapi: Introduce AMDGPU_GEM_DOMAIN_MMIO_REMAP") Fixes: 2a7a794eb82c ("drm/amdgpu/ttm: Allocate/Free 4K MMIO_REMAP Singleton") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Leo Liu <leo.liu@amd.com> Cc: Ruijing Dong <ruijing.dong@amd.com> Cc: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu/ttm: Pin 4K MMIO_REMAP Singleton BO at Init v2Srinivasan Shanmugam1-0/+32
[ Upstream commit de8955508b72f8a88afc3f2cbb62334d5f79ccc3 ] MMIO_REMAP (HDP flush page) is a hardware I/O window exposed via a PCI BAR. It must not migrate or be evicted. Allocate a single 4 KB GEM BO in AMDGPU_GEM_DOMAIN_MMIO_REMAP during TTM initialization when the hardware exposes a remap bus address and the host page size is <= 4 KiB. Reserve the BO and pin it at the TTM level so it remains fixed for its lifetime. No CPU mapping is established here. On teardown, reserve, unpin, and free the BO if present. This prepares the object to be shared (e.g., via dma-buf) without triggering placement changes or no CPU-access migration v2: Added extra NULL checks Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: 96e97a562d06 ("drm/amdgpu: Drop MMIO_REMAP domain bit and keep it Internal") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amdgpu: Use explicit VCN instance 0 in SR-IOV initSrinivasan Shanmugam1-22/+23
[ Upstream commit af26fa751c2eef66916acbf0d3c3e9159da56186 ] vcn_v2_0_start_sriov() declares a local variable "i" initialized to zero and uses it only as the instance index in SOC15_REG_OFFSET(UVD, i, ...). The value is never changed and all other fields are taken from adev->vcn.inst[0], so this path only ever programs VCN instance 0. This triggered a Smatch: warn: iterator 'i' not incremented Replace the dummy iterator with an explicit instance index of 0 in SOC15_REG_OFFSET() calls. Fixes: dd26858a9cd8 ("drm/amdgpu: implement initialization part on VCN2.0 for SRIOV") Reported by: Dan Carpenter <dan.carpenter@linaro.org> Cc: darlington Opara <darlington.opara@amd.com> Cc: Jinage Zhao <jiange.zhao@amd.com> Cc: Monk Liu <Monk.Liu@amd.com> Cc: Emily Deng <Emily.Deng@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27drm/amd: Drop "amdgpu kernel modesetting enabled" messageMario Limonciello (AMD)1-1/+0
[ Upstream commit 8644084a74a4573278d6f454c6638ccd5965f4e2 ] The behavior for amdgpu was changed with commit e00e5c223878 ("drm/amdgpu: adjust drm_firmware_drivers_only() handling") to potentially allow loading even if nomodeset was set, so the message is no longer accurate. Just drop it to avoid confusion. Fixes: e00e5c223878 ("drm/amdgpu: adjust drm_firmware_drivers_only() handling") Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-04drm/amdgpu: Fix double deletion of validate_listHarish Kasiviswanathan1-7/+7
If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. This means when process is terminated, kfd_process_free_outstanding_kfd_bos() will call amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again resulting in double deletion. To avoid this - (a) Check if list is empty before deleting it (b) Rearragne amdgpu_amdkfd_gpuvm_free_memory_of_gpu() such that it can be safely called again if it returns failure the first time. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ba60345f45eaf7cb4f89105d26083a4b9fd1cba)
2026-02-04Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem"Bert Karwatzki1-3/+0
This reverts commit 7294863a6f01248d72b61d38478978d638641bee. This commit was erroneously applied again after commit 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard to debug crashes, when used with a system with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 97a9689300eb2b393ba5efc17c8e5db835917080) Cc: stable@vger.kernel.org
2026-02-04drm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52Mario Limonciello1-1/+1
commit f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") caused a dependency on new enough MES firmware to use amdgpu. This was fixed on most gfx11 and gfx12 hardware with commit 0180e0a5dd5c ("drm/amdgpu/mes: add compatibility checks for set_hw_resource_1"), but this left out that GC 11.0.4 had breakage at MES 0x51. Bump the requirement to 0x52 instead. Reported-by: danijel@nausys.com Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4576 Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c2d2ccc85faf8cc6934d50c18e43097eb453ade2) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx12: adjust KGQ reset sequenceAlex Deucher1-10/+13
Kernel gfx queues do not need to be reinitialized or remapped after a reset. Align with gfx11. v2: preserve init and remap for MMIO case. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0a6d6ed694d72b66b0ed7a483d5effa01acd3951) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx11: adjust KGQ reset sequenceAlex Deucher1-10/+13
Kernel gfx queues do not need to be reinitialized or remapped after a reset. This fixes queue reset failures on APUs. v2: preserve init and remap for MMIO case. Fixes: b3e9bfd86658 ("drm/amdgpu/gfx11: add ring reset callbacks") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b340ff216fdabfe71ba0cdd47e9835a141d08e10) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx12: fix wptr reset in KGQ initAlex Deucher1-1/+1
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a2918f958d3f677ea93c0ac257cb6ba69b7abb7c) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx11: fix wptr reset in KGQ initAlex Deucher1-1/+1
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1f16866bdb1daed7a80ca79ae2837a9832a74fbc) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/gfx10: fix wptr reset in KGQ initAlex Deucher1-1/+1
wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e80b1d1aa1073230b6c25a1a72e88f37e425ccda) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()Alex Deucher1-2/+3
The EXEC_COUNT field must be > 0. In the gfx shadow handling we always emit a cond_exec packet after the gfx_shadow packet, but the EXEC_COUNT never gets patched. This leads to a hang when we try and reset queues on gfx11 APUs. Fixes: c68cbbfd54c6 ("drm/amdgpu: cleanup conditional execution") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ba205ac3d6e83f56c4f824f23f1b4522cb844ff3) Cc: stable@vger.kernel.org
2026-01-29drm/amdgpu/soc21: fix xclk for APUsAlex Deucher1-1/+7
The reference clock is supposed to be 100Mhz, but it appears to actually be slightly lower (99.81Mhz). Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 637fee3954d4bd509ea9d95ad1780fc174489860) Cc: stable@vger.kernel.org
2026-01-28drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_removeJon Doron1-1/+6
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ce8d536c80aa1f059e82184f0d1994436b1d526) Cc: stable@vger.kernel.org
2026-01-21drm/amdgpu: fix type for wptr in ring backupAlex Deucher1-1/+1
Needs to be a u64. Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 56fff1941abd3ca3b6f394979614ca7972552f7f)
2026-01-21drm/amdgpu: Fix validating flush_gpu_tlb_pasid()Timur Kristóf1-2/+4
When a function holds a lock and we return without unlocking it, it deadlocks the kernel. We should always unlock before returning. This commit fixes suspend/resume on SI. Tested on two Tahiti GPUs: FirePro W9000 and R9 280X. Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/ Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e3a6eff92bbd960b471966d9afccb4d584546d17)
2026-01-21drm/amdgpu: fix error handling in ib_schedule()Alex Deucher1-1/+1
If fence emit fails, free the fence if necessary. Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5eb680a06007f2f6ea333d11a4e29039da90614b)
2026-01-21drm/amdgpu: free hw_vm_fence when fail in amdgpu_job_allocJiqian Chen1-2/+5
If drm_sched_job_init fails, hw_vm_fence is not freed currently, then cause memory leak. Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5d42ee457ccd1fb5da4c7f817825b2806ec36956)