summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
AgeCommit message (Collapse)AuthorFilesLines
2026-04-02drm/amdgpu: prevent immediate PASID reuse caseEric Huang1-0/+1
commit 14b81abe7bdc25f8097906fc2f91276ffedb2d26 upstream. PASID resue could cause interrupt issue when process immediately runs into hw state left by previous process exited with the same PASID, it's possible that page faults are still pending in the IH ring buffer when the process exits and frees up its PASID. To prevent the case, it uses idr cyclic allocator same as kernel pid's. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8f1de51f49be692de137c8525106e0fce2d1912d) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25drm/amdgpu: rework how we handle TLB fencesAlex Deucher1-1/+6
commit e9f58ff991dd4be13fd7a651bbf64329c090af09 upstream. Add a new VM flag to indicate whether or not we need a TLB fence. Userqs (KFD or KGD) require a TLB fence. A TLB fence is not strictly required for kernel queues, but it shouldn't hurt. That said, enabling this unconditionally should be fine, but it seems to tickle some issues in KIQ/MES. Only enable them for KFD, or when KGD userq queues are enabled (currently via module parameter). Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4798 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4749 Fixes: f3854e04b708 ("drm/amdgpu: attach tlb fence to the PTs update") Cc: Christian König <christian.koenig@amd.com> Cc: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 69c5fbd2b93b5ced77c6e79afe83371bca84c788) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04drm/amdgpu: Use 5-level paging if gmc support 57-bit VAPhilip Yang1-17/+0
[ Upstream commit 3b948dd0366a0b64c02e4ed1aefdf7825942e803 ] Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if gmc init with 57-bit address space support, because ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging support 47-bit VA, require 5-level paging on GPU to support ARM64. NPA address space 52-bit mapping on NPA GPU VM require 5-level paging. Debugger trap get device snapshot expect LDS and Scratch base, limit above 57-bit, which is set only for 5-level paging. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.19.x Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: GPU vm support 5-level page tablePhilip Yang1-0/+20
[ Upstream commit f6b1c1f5fd7237f77fc3880603ea54dcf0371a20 ] If GPU supports 5-level page table, but CPU disable 5-level page table by using boot option no5lvl or CPU feature not available, the virtual address will be 48bit, not needed to enable 5-level page table on GPU vm. If adev->vm_manager.num_level, number of pde levels, set to 4, then gfxhub and mmhub register VM_CONTEXTx_CNTL/PAGE_TABLE_DEPTH will set to 4 to enable 5-level page table in page table walker. Set vm_manager.root_level to AMDGPU_VM_PDE3, then update GPU mapping will allocate and update PDE3/PDE2/PDE1/PDE0/PTB 5-level page tables. If max_level is not 4, no change for the logic to support features needed by old ASICs. v2: squash in CONFIG fix Signed-off-by: Philip Yang <Philip.Yang@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: 3b948dd0366a ("drm/amdgpu: Use 5-level paging if gmc support 57-bit VA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-14Revert "drm/amdgpu: don't attach the tlb fence for SI"Prike Liang1-3/+1
This reverts commit 820b3d376e8a102c6aeab737ec6edebbbb710e04. It’s better to validate VM TLB flushes in the flush‑TLB backend rather than in the generic VM layer. Reverting this patch depends on commit fa7c231fc2b0 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()") being present in the tree. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 9163fe4d790fb4e16d6b0e23f55b43cddd3d4a65)
2025-12-08drm/amdgpu: don't attach the tlb fence for SIAlex Deucher1-1/+3
SI hardware doesn't support pasids, user mode queues, or KIQ/MES so there is no need for this. Doing so results in a segfault as these callbacks are non-existent for SI. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744 Fixes: f3854e04b708 ("drm/amdgpu: attach tlb fence to the PTs update") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 820b3d376e8a102c6aeab737ec6edebbbb710e04)
2025-12-02drm/amdgpu: Forward VMID reservation errorsNatalie Vock1-2/+1
Otherwise userspace may be fooled into believing it has a reserved VMID when in reality it doesn't, ultimately leading to GPU hangs when SPM is used. Fixes: 80e709ee6ecc ("drm/amdgpu: add option params to enforce process isolation between graphics and compute") Cc: stable@vger.kernel.org Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Natalie Vock <natalie.vock@gmx.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-26drm/amdgpu: attach tlb fence to the PTs updatePrike Liang1-1/+1
Ensure the userq TLB flush is emitted only after the VM update finishes and the PT BOs have been annotated with bookkeeping fences. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-20drm/amdgpu/vm: Check PRT uAPI flag instead of PTE flagTimur Kristóf1-2/+2
This fixes sparse mappings (aka. partially resident textures). Check the correct flags. Since a recent refactor, the code works with uAPI flags (for mapping buffer objects), and not PTE (page table entry) flags. Fixes: 6716a823d18d ("drm/amdgpu: rework how PTE flags are generated v3") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-14drm/amdgpu: avoid memory allocation in the critical code path v3Christian König1-7/+0
When we run out of VMIDs we need to wait for some to become available. Previously we were using a dma_fence_array for that, but this means that we have to allocate memory. Instead just wait for the first not signaled fence from the least recently used VMID to signal. That is not as efficient since we end up in this function multiple times again, but allocating memory can easily fail or deadlock if we have to wait for memory to become available. v2: remove now unused VM manager fields v3: fix dma_fence reference Signed-off-by: Christian König <christian.koenig@amd.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4258 Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: fix possible fence leaks from job structureAlex Deucher1-0/+2
If we don't end up initializing the fences, free them when we free the job. We can't set the hw_fence to NULL after emitting it because we need it in the cleanup path for the submit direct case. v2: take a reference to the fences if we emit them v3: handle non-job fence in error paths Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling") Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> (v1) Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04drm/amdgpu: grab a BO reference in vm_lock_done_list.Christian König1-2/+6
Otherwise it is possible that between dropping the status lock and locking the BO that the BO is freed up. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: clean up and unify hw fence handlingAlex Deucher1-5/+2
Decouple the amdgpu fence from the amdgpu_job structure. This lets us clean up the separate fence ops for the embedded fence and other fences. This also allows us to allocate the vm fence up front when we allocate the job. v2: Additional cleanup suggested by Christian v3: Additional cleanups suggested by Christian v4: Additional cleanups suggested by David and vm fence fix v5: cast seqno (David) Cc: David.Wu3@amd.com Cc: christian.koenig@amd.com Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13drm/amdgpu: validate userq va for GEM unmapPrike Liang1-0/+12
When a user unmaps a userq VA, the driver must ensure the queue has no in-flight jobs. If there is pending work, the kernel should wait for the attached eviction (bookkeeping) fence to signal before deleting the mapping. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07drm/amdgpu: partially revert "revert to old status lock handling v3"Christian König1-55/+91
The CI systems are pointing out list corruptions, so we still need to fix something here. Keep the asserts, but revert the lock changes for now. Fixes: 59e4405e9ee2 ("drm/amdgpu: revert to old status lock handling v3") Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machineJesse.Zhang1-1/+1
After GPU reset with VRAM loss, a general protection fault occurs during user queue restoration when accessing vm_bo->vm after spinlock release in amdgpu_vm_bo_reset_state_machine. The root cause is that vm_bo points to the last entry from the list_for_each_entry loop, but this becomes invalid after the spinlock is released. Accessing vm_bo->vm at this point leads to memory corruption. Crash log shows: [ 326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI [ 326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G E 6.16.0+ #25 PREEMPT(voluntary) [ 326.981826] Tainted: [E]=UNSIGNED_MODULE [ 326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024 [ 326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu] [ 326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu] [ 326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05 [ 326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206 [ 326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000 [ 326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a [ 326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001 [ 326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000 [ 326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000 [ 326.982110] FS: 0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000 [ 326.982112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0 [ 326.982116] PKRU: 55555554 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_initJesse.Zhang1-44/+21
As KFD no longer uses a separate PASID, the global amdgpu_vm_set_pasid()function is no longer necessary. Merge its functionality directly intoamdgpu_vm_init() to simplify code flow and eliminate redundant locking. v2: remove superflous check adjust amdgpu_vm_fin and remove amdgpu_vm_set_pasid (Chritian) v3: drop amdgpu_vm_assert_locked (Chritian) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4614 Fixes: 59e4405e9ee2 ("drm/amdgpu: revert to old status lock handling v3") Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25drm/amdgpu: revert "rework reserved VMID handling" v2Christian König1-13/+4
This reverts commit e44a0fe630c58b0a87d8281f5c1077a3479e5fce. Initially we used VMID reservation to enforce isolation between processes. That has now been replaced by proper fence handling. Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID for SPM, so restore that approach by reverting back to only allowing a single process to use the reserved VMID. Only compile tested for now. v2: use -ENOENT instead of -EINVAL if VMID is not available Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-22Merge tag 'amd-drm-next-6.18-2025-09-19' of ↵Dave Airlie1-94/+111
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-19: amdgpu: - Fence drv clean up fix - DPC fixes - Misc display fixes - Support the MMIO remap page as a ttm pool - JPEG parser updates - UserQ updates - VCN ctx handling fixes - Documentation updates - Misc cleanups - SMU 13.0.x updates - SI DPM updates - GC 11.x cleaner shader updates - DMCUB updates - DML fixes - Improve fallback handling for pixel encoding - VCN reset improvements - DCE6 DC updates - DSC fixes - Use devm for i2c buses - GPUVM locking updates - GPUVM documentation improvements - Drop non-DC DCE11 code - S0ix fixes - Backlight fix - SR-IOV fixes amdkfd: - SVM updates Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com
2025-09-18drm/amdgpu: revert to old status lock handling v3Christian König1-88/+80
It turned out that protecting the status of each bo_va with a spinlock was just hiding problems instead of solving them. Revert the whole approach, add a separate stats_lock and lockdep assertions that the correct reservation lock is held all over the place. This not only allows for better checks if a state transition is properly protected by a lock, but also switching back to using list macros to iterate over the state of lists protected by the dma_resv lock of the root PD. v2: re-add missing check v3: split into two patches Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-18drm/amdgpu: add missing comment for the new argumentSunil Khatri1-0/+1
In function 'amdgpu_vm_lock_done_list' update the comment for the new argument 'vm'. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509180211.UAqME0zj-lkp@intel.com/ Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-17drm/amdgpu: remove check for BO reservation add assert insteadChristian König1-12/+1
We should leave such checks to lockdep and not implement something manually. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-17drm/amdgpu: fix userq VM validation v4Christian König1-0/+35
That was actually complete nonsense and not validating the BOs at all. The code just cleared all VM areas were it couldn't grab the lock for a BO. Try to fix this. Only compile tested at the moment. v2: fix fence slot reservation as well as pointed out by Sunil. also validate PDs, PTs, per VM BOs and update PDEs v3: grab the status_lock while working with the done list. v4: rename functions, add some comments, fix waiting for updates to complete. v4: rename amdgpu_vm_lock_done_list(), add some more comments Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-15Merge tag 'v6.17-rc6' into drm-nextDave Airlie1-1/+1
This is a backmerge of Linux 6.17-rc6, needed for msm, also requested by misc. Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-09-05Merge tag 'drm-misc-next-2025-09-04' of ↵Dave Airlie1-2/+4
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.18: Cross-subsystem Changes: - Update a number of DT bindings for STM32MP25 Arm SoC Core Changes: gem: - Simplify locking for GPUVM panel-backlight-quirks: - Add additional quirks for EDID, DMI, brightness sched: - Fix race condition in trace code - Clean up sysfb: - Clean up Driver Changes: amdgpu: - Give kernel jobs a unique id for better tracing amdxdna: - Improve error reporting bridge: - Improve ref counting on bridge management - adv7511: Provide SPD and HDMI infoframes - it6505: Replace crypto_shash with sha() - synopsys: Add support for DW DPTX Controller plus DT bindings gud: - Replace simple-KMS pipe with regular atomic helpers imagination: - Improve power management - Add support for TH1520 GPU - Support Risc-V architectures ivpu: - Clean up nouveau: - Improve error reporting panthor: - Fail VM bind if BO has offset - Clean up rcar-du: - Make number of lanes configurable rockchip: - Add support for RK3588 DPTX output rocket: - Use kfree() and sizeof() correctly - Test DMA status - Clean up sitronix: - st7571-i2c: Add support for inverted displays and 2-bit grayscale - Clean up stm: - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings tidss: - Convert to kernel's FIELD_ macros v3d: - Improve job management and locking Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250904090932.GA193997@linux.fritz.box
2025-09-01drm/amdgpu: give each kernel job a unique idPierre-Eric Pelloux-Prayer1-2/+4
Userspace jobs have drm_file.client_id as a unique identifier as job's owners. For kernel jobs, we can allocate arbitrary values - the risk of overlap with userspace ids is small (given that it's a u64 value). In the unlikely case the overlap happens, it'll only impact trace events. Since this ID is traced in the gpu_scheduler trace events, this allows to determine the source of each job sent to the hardware. To make grepping easier, the IDs are defined as they will appear in the trace output. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://lore.kernel.org/r/20250604122827.2191-1-pierre-eric.pelloux-prayer@amd.com
2025-08-20Merge drm/drm-fixes into drm-misc-fixesMaxime Ripard1-4/+11
Update drm-misc-fixes to -rc2. Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-08-13Revert "drm/amdgpu: Use dma_buf from GEM object instance"Thomas Zimmermann1-1/+1
This reverts commit 515986100d176663d0a03219a3056e4252f729e6. The dma_buf field in struct drm_gem_object is not stable over the object instance's lifetime. The field becomes NULL when user space releases the final GEM handle on the buffer object. This resulted in a NULL-pointer deref. Workarounds in commit 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers") and commit f6bfc9afc751 ("drm/framebuffer: Acquire internal references on GEM handles") only solved the problem partially. They especially don't work for buffer objects without a DRM framebuffer associated. Hence, this revert to going back to using .import_attach->dmabuf. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250715082635.34974-1-tzimmermann@suse.de
2025-08-12drm/amdgpu: fix task hang from failed job submission during process killLiu01 Tong1-4/+11
During process kill, drm_sched_entity_flush() will kill the vm entities. The following job submissions of this process will fail, and the resources of these jobs have not been released, nor have the fences been signalled, causing tasks to hang and timeout. Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to stopped entity. v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling(). Fixes: 1f02f2044bda ("drm/amdgpu: Avoid extra evict-restore process.") Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com> Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f101c13a8720c73e67f8f9d511fbbeda95bcedb1)
2025-08-12drm/amdgpu: fix task hang from failed job submission during process killLiu01 Tong1-4/+11
During process kill, drm_sched_entity_flush() will kill the vm entities. The following job submissions of this process will fail, and the resources of these jobs have not been released, nor have the fences been signalled, causing tasks to hang and timeout. Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to stopped entity. v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling(). Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com> Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04drm/amdgpu: rework how PTE flags are generated v3Christian König1-8/+9
Previously we tried to keep the HW specific PTE flags in each mapping, but for CRIU that isn't sufficient any more since the original value is needed for the checkpoint procedure. So rework the whole handling, nuke the early mapping function, keep the UAPI flags in each mapping instead of the HW flags and translate them to the HW flags while filling in the PTEs. Only tested on Navi 23 for now, so probably needs quite a bit of more work. v2: fix KFD and SVN handling v3: one more SVN fix pointed out by Felix v4: squash in gfx12 fix from David Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amdgpu: Avoid extra evict-restore process.Gang Ba1-4/+2
If vm belongs to another process, this is fclose after fork, wait may enable signaling KFD eviction fence and cause parent process queue evicted. [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu] [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0 [677852.634820] dma_fence_wait_timeout+0x3a/0x140 [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl] [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu] [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu] [677852.635208] filp_flush+0x38/0x90 [677852.635213] filp_close+0x14/0x30 [677852.635216] do_close_on_exec+0xdd/0x130 [677852.635221] begin_new_exec+0x1da/0x490 [677852.635225] load_elf_binary+0x307/0xea0 [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5 [677852.635235] ? ima_bprm_check+0xa2/0xd0 [677852.635240] search_binary_handler+0xda/0x260 [677852.635245] exec_binprm+0x58/0x1a0 [677852.635249] bprm_execve.part.0+0x16f/0x210 [677852.635254] bprm_execve+0x45/0x80 [677852.635257] do_execveat_common.isra.0+0x190/0x200 Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Gang Ba <Gang.Ba@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-16drm/amdgpu: track ring state associated with a fenceAlex Deucher1-0/+4
We need to know the wptr and sequence number associated with a fence so that we can re-emit the unprocessed state after a ring reset. Pre-allocate storage space for the ring buffer contents and add helpers to save off and re-emit the unprocessed state so that it can be re-emitted after the queue is reset. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-04Merge tag 'amd-drm-next-6.17-2025-07-01' of ↵Dave Airlie1-13/+14
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-01: amdgpu: - FAMS2 fixes - OLED fixes - Misc cleanups - AUX fixes - DMCUB updates - SR-IOV hibernation support - RAS updates - DP tunneling fixes - DML2 fixes - Backlight improvements - Suspend improvements - Use scaling for non-native modes on eDP - SDMA 4.4.x fixes - PCIe DPM fixes - SDMA 5.x fixes - Cleaner shader updates for GC 9.x - Remove fence slab - ISP genpd support - Parition handling rework - SDMA FW checks for userq support - Add missing firmware declaration - Fix leak in amdgpu_ctx_mgr_entity_fini() - Freesync fix - Ring reset refactoring - Legacy dpm verbosity changes amdkfd: - GWS fix - mtype fix for ext coherent system memory - MMU notifier fix - gfx7/8 fix radeon: - CS validation support for additional GL extensions - Bump driver version for new CS validation checks From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-06-30drm/amdgpu: Use dma_buf from GEM object instanceThomas Zimmermann1-1/+1
Avoid dereferencing struct drm_gem_object.import_attach for the imported dma-buf. The dma_buf field in the GEM object instance refers to the same buffer. Prepares to make import_attach an implementation detail of PRIME. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: Test for imported buffers with drm_gem_is_imported()Thomas Zimmermann1-2/+2
Instead of testing import_attach for imported GEM buffers, invoke drm_gem_is_imported() to do the test. v2: - keep amdgpu_bo_print_info() as-is (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: Convert from DRM_* to dev_*Lijo Lazar1-10/+11
Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-17drm: amdgpu: Use struct drm_wedge_task_info inside of struct amdgpu_task_infoAndré Almeida1-6/+6
To avoid a cast when calling drm_dev_wedged_event(), replace pid and task name inside of struct amdgpu_task_info with struct drm_wedge_task_info. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-6-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: amdgpu: Create amdgpu_vm_print_task_info()André Almeida1-0/+9
To avoid repetitive code in amdgpu, create a function that prints the content of struct amdgpu_task_info. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: amdgpu: Allow NULL pointers at amdgpu_vm_put_task_info()André Almeida1-1/+2
Allow NULL pointers at amdgpu_vm_put_task_info() as it common practice for "put" or "free" functions. This avoid an extra check for NULL for callers. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-2-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-04-11drm/amdgpu: adjust enforce_isolation handlingAlex Deucher1-1/+2
Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation: - Disabled (0) - Enabled (serialization and cleaner shader) (1) - Enabled in legacy mode (no serialization or cleaner shader) (2) This provides better flexibility for more use cases. Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: remove is_mes_queue flagAlex Deucher1-1/+1
This was leftover from MES bring up when we had MES user queues in the kernel. It's no longer used so remove it. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-21drm/amdgpu: add cleaner shader trace pointChristian König1-0/+1
Note when the cleaner shader is executed. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-21drm/amdgpu: rework how the cleaner shader is emitted v3Christian König1-6/+21
Instead of emitting the cleaner shader for every job which has the enforce_isolation flag set only emit it for the first submission from every client. v2: add missing NULL check v3: fix another NULL pointer deref Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdkfd: Fix pasid value leakXiaogang Chen1-14/+0
Curret kfd does not allocate pasid values, instead uses pasid value for each vm from graphic driver. So should not prevent graphic driver from releasing pasid values since the values are allocated by graphic driver, not kfd driver anymore. This patch does not stop graphic driver release pasid values. Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver") Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdgpu: Unlocked unmap only clear page table leavesPhilip Yang1-4/+0
SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for range length >= huge page and free PTB bo, update mapping to alloc new PT bo. There is race bug that the freed entry bo maybe still on the pt_free list, reused when updating mapping and then freed, leave invalid PDE entry and cause GPU page fault. By setting the update to clear only one PDE entry or clear PTB, to avoid unmap to free PTE bo. This fixes the race bug and improve the unmap and map to GPU performance. Update mapping to huge page will still free the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09Merge tag 'drm-misc-next-2025-01-06' of ↵Dave Airlie1-48/+161
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.14: UAPI Changes: - Clarify drm memory stats documentation Cross-subsystem Changes: Core Changes: - sched: Documentation fixes, Driver Changes: - amdgpu: Track BO memory stats at runtime - amdxdna: Various fixes - hisilicon: New HIBMC driver - bridges: - Provide default implementation of atomic_check for HDMI bridges - it605: HDCP improvements, MCCS Support Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250106-augmented-kakapo-of-action-0cf000@houat
2024-12-19drm/amdgpu: track bo memory stats at runtimeYunxiang Li1-45/+160
Before, every time fdinfo is queried we try to lock all the BOs in the VM and calculate memory usage from scratch. This works okay if the fdinfo is rarely read and the VMs don't have a ton of BOs. If either of these conditions is not true, we get a massive performance hit. In this new revision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-6-Yunxiang.Li@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-19drm/amdgpu: remove unused function parameterYunxiang Li1-3/+1
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-5-Yunxiang.Li@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-18drm/amdgpu: Handle NULL bo->tbo.resource (again) in amdgpu_vm_bo_updateMichel Dänzer1-4/+3
Third time's the charm, I hope? Fixes: d3116756a710 ("drm/ttm: rename bo->mem and make it a pointer") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3837 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>