summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
AgeCommit message (Collapse)AuthorFilesLines
2025-08-20Merge drm/drm-fixes into drm-misc-fixesMaxime Ripard1-4/+11
Update drm-misc-fixes to -rc2. Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-08-13Revert "drm/amdgpu: Use dma_buf from GEM object instance"Thomas Zimmermann1-1/+1
This reverts commit 515986100d176663d0a03219a3056e4252f729e6. The dma_buf field in struct drm_gem_object is not stable over the object instance's lifetime. The field becomes NULL when user space releases the final GEM handle on the buffer object. This resulted in a NULL-pointer deref. Workarounds in commit 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers") and commit f6bfc9afc751 ("drm/framebuffer: Acquire internal references on GEM handles") only solved the problem partially. They especially don't work for buffer objects without a DRM framebuffer associated. Hence, this revert to going back to using .import_attach->dmabuf. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250715082635.34974-1-tzimmermann@suse.de
2025-08-12drm/amdgpu: fix task hang from failed job submission during process killLiu01 Tong1-4/+11
During process kill, drm_sched_entity_flush() will kill the vm entities. The following job submissions of this process will fail, and the resources of these jobs have not been released, nor have the fences been signalled, causing tasks to hang and timeout. Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to stopped entity. v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling(). Fixes: 1f02f2044bda ("drm/amdgpu: Avoid extra evict-restore process.") Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com> Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f101c13a8720c73e67f8f9d511fbbeda95bcedb1)
2025-07-28drm/amdgpu: Avoid extra evict-restore process.Gang Ba1-4/+2
If vm belongs to another process, this is fclose after fork, wait may enable signaling KFD eviction fence and cause parent process queue evicted. [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu] [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0 [677852.634820] dma_fence_wait_timeout+0x3a/0x140 [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl] [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu] [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu] [677852.635208] filp_flush+0x38/0x90 [677852.635213] filp_close+0x14/0x30 [677852.635216] do_close_on_exec+0xdd/0x130 [677852.635221] begin_new_exec+0x1da/0x490 [677852.635225] load_elf_binary+0x307/0xea0 [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5 [677852.635235] ? ima_bprm_check+0xa2/0xd0 [677852.635240] search_binary_handler+0xda/0x260 [677852.635245] exec_binprm+0x58/0x1a0 [677852.635249] bprm_execve.part.0+0x16f/0x210 [677852.635254] bprm_execve+0x45/0x80 [677852.635257] do_execveat_common.isra.0+0x190/0x200 Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Gang Ba <Gang.Ba@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-16drm/amdgpu: track ring state associated with a fenceAlex Deucher1-0/+4
We need to know the wptr and sequence number associated with a fence so that we can re-emit the unprocessed state after a ring reset. Pre-allocate storage space for the ring buffer contents and add helpers to save off and re-emit the unprocessed state so that it can be re-emitted after the queue is reset. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-04Merge tag 'amd-drm-next-6.17-2025-07-01' of ↵Dave Airlie1-13/+14
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-01: amdgpu: - FAMS2 fixes - OLED fixes - Misc cleanups - AUX fixes - DMCUB updates - SR-IOV hibernation support - RAS updates - DP tunneling fixes - DML2 fixes - Backlight improvements - Suspend improvements - Use scaling for non-native modes on eDP - SDMA 4.4.x fixes - PCIe DPM fixes - SDMA 5.x fixes - Cleaner shader updates for GC 9.x - Remove fence slab - ISP genpd support - Parition handling rework - SDMA FW checks for userq support - Add missing firmware declaration - Fix leak in amdgpu_ctx_mgr_entity_fini() - Freesync fix - Ring reset refactoring - Legacy dpm verbosity changes amdkfd: - GWS fix - mtype fix for ext coherent system memory - MMU notifier fix - gfx7/8 fix radeon: - CS validation support for additional GL extensions - Bump driver version for new CS validation checks From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-06-30drm/amdgpu: Use dma_buf from GEM object instanceThomas Zimmermann1-1/+1
Avoid dereferencing struct drm_gem_object.import_attach for the imported dma-buf. The dma_buf field in the GEM object instance refers to the same buffer. Prepares to make import_attach an implementation detail of PRIME. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: Test for imported buffers with drm_gem_is_imported()Thomas Zimmermann1-2/+2
Instead of testing import_attach for imported GEM buffers, invoke drm_gem_is_imported() to do the test. v2: - keep amdgpu_bo_print_info() as-is (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: Convert from DRM_* to dev_*Lijo Lazar1-10/+11
Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-17drm: amdgpu: Use struct drm_wedge_task_info inside of struct amdgpu_task_infoAndré Almeida1-6/+6
To avoid a cast when calling drm_dev_wedged_event(), replace pid and task name inside of struct amdgpu_task_info with struct drm_wedge_task_info. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-6-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: amdgpu: Create amdgpu_vm_print_task_info()André Almeida1-0/+9
To avoid repetitive code in amdgpu, create a function that prints the content of struct amdgpu_task_info. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: amdgpu: Allow NULL pointers at amdgpu_vm_put_task_info()André Almeida1-1/+2
Allow NULL pointers at amdgpu_vm_put_task_info() as it common practice for "put" or "free" functions. This avoid an extra check for NULL for callers. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-2-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-04-11drm/amdgpu: adjust enforce_isolation handlingAlex Deucher1-1/+2
Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation: - Disabled (0) - Enabled (serialization and cleaner shader) (1) - Enabled in legacy mode (no serialization or cleaner shader) (2) This provides better flexibility for more use cases. Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: remove is_mes_queue flagAlex Deucher1-1/+1
This was leftover from MES bring up when we had MES user queues in the kernel. It's no longer used so remove it. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-21drm/amdgpu: add cleaner shader trace pointChristian König1-0/+1
Note when the cleaner shader is executed. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-21drm/amdgpu: rework how the cleaner shader is emitted v3Christian König1-6/+21
Instead of emitting the cleaner shader for every job which has the enforce_isolation flag set only emit it for the first submission from every client. v2: add missing NULL check v3: fix another NULL pointer deref Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdkfd: Fix pasid value leakXiaogang Chen1-14/+0
Curret kfd does not allocate pasid values, instead uses pasid value for each vm from graphic driver. So should not prevent graphic driver from releasing pasid values since the values are allocated by graphic driver, not kfd driver anymore. This patch does not stop graphic driver release pasid values. Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver") Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdgpu: Unlocked unmap only clear page table leavesPhilip Yang1-4/+0
SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for range length >= huge page and free PTB bo, update mapping to alloc new PT bo. There is race bug that the freed entry bo maybe still on the pt_free list, reused when updating mapping and then freed, leave invalid PDE entry and cause GPU page fault. By setting the update to clear only one PDE entry or clear PTB, to avoid unmap to free PTE bo. This fixes the race bug and improve the unmap and map to GPU performance. Update mapping to huge page will still free the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09Merge tag 'drm-misc-next-2025-01-06' of ↵Dave Airlie1-48/+161
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.14: UAPI Changes: - Clarify drm memory stats documentation Cross-subsystem Changes: Core Changes: - sched: Documentation fixes, Driver Changes: - amdgpu: Track BO memory stats at runtime - amdxdna: Various fixes - hisilicon: New HIBMC driver - bridges: - Provide default implementation of atomic_check for HDMI bridges - it605: HDCP improvements, MCCS Support Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250106-augmented-kakapo-of-action-0cf000@houat
2024-12-19drm/amdgpu: track bo memory stats at runtimeYunxiang Li1-45/+160
Before, every time fdinfo is queried we try to lock all the BOs in the VM and calculate memory usage from scratch. This works okay if the fdinfo is rarely read and the VMs don't have a ton of BOs. If either of these conditions is not true, we get a massive performance hit. In this new revision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-6-Yunxiang.Li@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-19drm/amdgpu: remove unused function parameterYunxiang Li1-3/+1
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-5-Yunxiang.Li@amd.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-18drm/amdgpu: Handle NULL bo->tbo.resource (again) in amdgpu_vm_bo_updateMichel Dänzer1-4/+3
Third time's the charm, I hope? Fixes: d3116756a710 ("drm/ttm: rename bo->mem and make it a pointer") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3837 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10drm/amdgpu: fix when the cleaner shader is emittedChristian König1-6/+7
Emitting the cleaner shader must come after the check if a VM switch is necessary or not. Otherwise we will emit the cleaner shader every time and not just when it is necessary because we switched between applications. This can otherwise crash on gang submit and probably decreases performance quite a bit. v2: squash in fix from Srini (Alex) Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: ee7a846ea27b ("drm/amdgpu: Emit cleaner shader at end of IB submission") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2024-11-05drm/amdgpu: stop syncing PRT map operationsChristian König1-1/+5
Requested by both Bas and Friedrich. Mapping PTEs as PRT doesn't need to sync for anything. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-08drm/amdgpu: Use drm_print_memory_stats helper from fdinfoTvrtko Ursulin1-9/+11
Convert fdinfo memory stats to use the common drm_print_memory_stats helper. This achieves alignment with the common keys as documented in drm-usage-stats.rst, adding specifically drm-total- key the driver was missing until now. Additionally I made the code stop skipping total size for objects which currently do not have a backing store, and I added resident, active and purgeable reporting. Legacy keys have been preserved, with the outlook of only potentially removing only the drm-memory- when the time gets right. The example output now looks like this: pos: 0 flags: 02100002 mnt_id: 24 ino: 1239 drm-driver: amdgpu drm-client-id: 4 drm-pdev: 0000:04:00.0 pasid: 32771 drm-total-cpu: 0 drm-shared-cpu: 0 drm-active-cpu: 0 drm-resident-cpu: 0 drm-purgeable-cpu: 0 drm-total-gtt: 2392 KiB drm-shared-gtt: 0 drm-active-gtt: 0 drm-resident-gtt: 2392 KiB drm-purgeable-gtt: 0 drm-total-vram: 44564 KiB drm-shared-vram: 31952 KiB drm-active-vram: 0 drm-resident-vram: 44564 KiB drm-purgeable-vram: 0 drm-memory-vram: 44564 KiB drm-memory-gtt: 2392 KiB drm-memory-cpu: 0 KiB amd-memory-visible-vram: 44564 KiB amd-evicted-vram: 0 KiB amd-evicted-visible-vram: 0 KiB amd-requested-vram: 44564 KiB amd-requested-visible-vram: 11952 KiB amd-requested-gtt: 2392 KiB drm-engine-compute: 46464671 ns v2: * Track purgeable via AMDGPU_GEM_CREATE_DISCARDABLE. Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <robdclark@chromium.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25drm/amdgpu: sync to KFD fences before clearing PTEsChristian König1-0/+6
This patch tries to solve the basic problem we also need to sync to the KFD fences of the BO because otherwise it can be that we clear PTEs while the KFD queues are still running. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18drm/amdgpu: nuke the VM PD/PT shadow handlingChristian König1-17/+0
This was only used as workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_state_machine() started to do the same. Compute never used shadows either, so the only proplematic case left is SVM and that is most likely not recoverable in any way when VRAM is lost. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-07drm/amdgpu: revert "use CPU for page table update if SDMA is unavailable"Christian König1-6/+0
That is clearly not something we should do upstream. The SDMA is mandatory for the driver to work correctly. We could do this for emulation and bringup, but in those cases the engineer should probably enabled CPU based updates manually. This reverts commit 62eefd10ac1c7e976bda47ff311bd87cee40ab8d. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-07drm/amdgpu/: Add missing kdoc entry in amdgpu_vm_handle_fault functionSrinivasan Shanmugam1-0/+1
This commit adds a description for the 'ts' parameter in the amdgpu_vm_handle_fault function's comment block. Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2781: warning: Function parameter or struct member 'ts' not described in 'amdgpu_vm_handle_fault' Cc: Xiaogang.Chen <Xiaogang.Chen@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408251419.vgZHg3GV-lkp@intel.com/ Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-07drm/amdgpu: use clamp() in amdgpu_vm_adjust_size()Li Zetao1-1/+1
When it needs to get a value within a certain interval, using clamp() makes the code easier to understand than min(max()). Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-07drm/amdgpu: fix invalid fence handling in amdgpu_vm_tlb_flushLang Yu1-2/+4
CPU based update doesn't produce a fence, handle such cases properly. Fixes: d8a3f0a0348d ("drm/amdgpu: implement TLB flush fence") Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-07drm/amdgpu: re-work VM syncingChristian König1-33/+51
Rework how VM operations synchronize to submissions. Provide an amdgpu_sync container to the backends instead of an reservation object and fill in the amdgpu_sync object in the higher layers of the code. No intended functional change, just prepares for upcomming changes. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Friedrich Vock <friedrich.vock@gmx.de> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-27Merge tag 'amd-drm-next-6.12-2024-08-26' of ↵Daniel Vetter1-2/+14
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.12-2024-08-26: amdgpu: - SDMA devcoredump support - DCN 4.0.1 updates - DC SUBVP fixes - Refactor OPP in DC - Refactor MMHUBBUB in DC - DC DML 2.1 updates - DC FAMS2 updates - RAS updates - GFX12 updates - VCN 4.0.3 updates - JPEG 4.0.3 updates - Enable wave kill (soft recovery) for compute queues - Clean up CP error interrupt handling - Enable CP bad opcode interrupts - VCN 4.x fixes - VCN 5.x fixes - GPU reset fixes - Fix vbios embedded EDID size handling - SMU 14.x updates - Misc code cleanups and spelling fixes - VCN devcoredump support - ISP MFD i2c support - DC vblank fixes - GFX 12 fixes - PSR fixes - Convert vbios embedded EDID to drm_edid - DCN 3.5 updates - DMCUB updates - Cursor fixes - Overdrive support for SMU 14.x - GFX CP padding optimizations - DCC fixes - DSC fixes - Preliminary per queue reset infrastructure - Initial per queue reset support for GFX 9 - Initial per queue reset support for GFX 7, 8 - DCN 3.2 fixes - DP MST fixes - SR-IOV fixes - GFX 9.4.3/4 devcoredump support - Add process isolation framework - Enable process isolation support for GFX 9.4.3/4 - Take IOMMU remapping into account for P2P DMA checks amdkfd: - CRIU fixes - Improved input validation for user queues - HMM fix - Enable process isolation support for GFX 9.4.3/4 - Initial per queue reset support for GFX 9 - Allow users to target recommended SDMA engines radeon: - remove .load and drm_dev_alloc - Fix vbios embedded EDID size handling - Convert vbios embedded EDID to drm_edid - Use GEM references instead of TTM - r100 cp init cleanup - Fix potential overflows in evergreen CS offset tracking UAPI: - KFD support for targetting queues on recommended SDMA engines Proposed userspace: https://github.com/ROCm/ROCR-Runtime/commit/2f588a24065f41c208c3701945e20be746d8faf7 https://github.com/ROCm/ROCR-Runtime/commit/eb30a5bbc7719c6ffcf2d2dd2878bc53a47b3f30 drm/buddy: - Add start address support for trim function From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240826201528.55307-1-alexander.deucher@amd.com
2024-08-23drm/amdkfd: Change kfd/svm page fault drain handlingXiaogang Chen1-2/+2
When app unmap vm ranges(munmap) kfd/svm starts drain pending page fault and not handle any incoming pages fault of this process until a deferred work item got executed by default system wq. The time period of "not handle page fault" can be long and is unpredicable. That is advese to kfd performance on page faults recovery. This patch uses time stamp of incoming page fault to decide to drop or recover page fault. When app unmap vm ranges kfd records each gpu device's ih ring current time stamp. These time stamps are used at kfd page fault recovery routine. Any page fault happened on unmapped ranges after unmap events is application bug that accesses vm range after unmap. It is not driver work to cover that. By using time stamp of page fault do not need drain page faults at deferred work. So, the time period that kfd does not handle page faults is reduced and can be controlled. Signed-off-by: Xiaogang.Chen <Xiaogang.Chen@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-21drm/amdgpu: Enforce isolation as part of the jobSrinivasan Shanmugam1-1/+2
This patch adds a new parameter 'enforce_isolation' to the amdgpu_job structure. This parameter is used to determine whether shader isolation should be enforced for a job. The enforce_isolation parameter is then stored in the amdgpu_job structure and used when flushing the VM. The enforce_isolation field of the amdgpu_job structure is set directly after the job is allocated This change allows more fine-grained control over shader isolation, making it possible to enforce isolation on a per-job basis rather than globally. This can be useful in scenarios where only certain jobs require isolation. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-16drm/amdgpu: Emit cleaner shader at end of IB submissionAlex Deucher1-0/+5
This commit introduces the emission of a cleaner shader at the end of the IB submission process. This is achieved by adding a new function pointer, `emit_cleaner_shader`, to the `amdgpu_ring_funcs` structure. If the `emit_cleaner_shader` function is set in the ring functions, it is called during the VM flush process. The cleaner shader is only emitted if the `enable_cleaner_shader` flag is set in the `amdgpu_device` structure. This allows the cleaner shader emission to be controlled on a per-device basis. By emitting a cleaner shader at the end of the IB submission, we can ensure that the VM state is properly cleaned up after each submission. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Christian König <christian.koenig@amd.com>
2024-08-06drm/amdgpu: use CPU for page table update if SDMA is unavailableYifan Zhang1-0/+6
avoid using SDMA if it is unavailable. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Tim Huang <tim.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-29Merge drm/drm-next into drm-misc-nextThomas Zimmermann1-25/+37
Backmerging to get a late RC of v6.10 before moving into v6.11. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-07-25drm/amdgpu: reset vm state machine after gpu reset(vram lost)ZhenGuo Yin1-4/+5
[Why] Page table of compute VM in the VRAM will lost after gpu reset. VRAM won't be restored since compute VM has no shadows. [How] Use higher 32-bit of vm->generation to record a vram_lost_counter. Reset the VM state machine when vm->genertaion is not equal to the new generation token. v2: Check vm->generation instead of calling drm_sched_entity_error in amdgpu_vm_validate. v3: Use new generation token instead of vram_lost_counter for check. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org (cherry picked from commit 47c0388b0589cb481c294dcb857d25a214c46eb3)
2024-07-24drm/amdgpu: reset vm state machine after gpu reset(vram lost)ZhenGuo Yin1-4/+5
[Why] Page table of compute VM in the VRAM will lost after gpu reset. VRAM won't be restored since compute VM has no shadows. [How] Use higher 32-bit of vm->generation to record a vram_lost_counter. Reset the VM state machine when vm->genertaion is not equal to the new generation token. v2: Check vm->generation instead of calling drm_sched_entity_error in amdgpu_vm_validate. v3: Use new generation token instead of vram_lost_counter for check. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-09drm/ttm, drm/amdgpu, drm/xe: Consider hitch moves within bulk sublist movesThomas Hellström1-0/+4
To address the problem with hitches moving when bulk move sublists are lru-bumped, register the list cursors with the ttm_lru_bulk_move structure when traversing its list, and when lru-bumping the list, move the cursor hitch to the tail. This also means it's mandatory for drivers to call ttm_lru_bulk_move_init() and ttm_lru_bulk_move_fini() when initializing and finalizing the bulk move structure, so add those calls to the amdgpu- and xe driver. Compared to v1 this is slightly more code but less fragile and hopefully easier to understand. Changes in previous series: - Completely rework the functionality - Avoid a NULL pointer dereference assigning manager->mem_type - Remove some leftover code causing build problems v2: - For hitch bulk tail moves, store the mem_type in the cursor instead of with the manager. v3: - Remove leftover mem_type member from change in v2. v6: - Add some lockdep asserts (Matthew Brost) - Avoid NULL pointer dereference (Matthew Brost) - No need to check bo->resource before dereferencing bo->bulk_move (Matthew Brost) Cc: Christian König <christian.koenig@amd.com> Cc: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <dri-devel@lists.freedesktop.org> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240705153206.68526-5-thomas.hellstrom@linux.intel.com Signed-off-by: Christian König <christian.koenig@amd.com>
2024-05-20drm/amdgpu: Fix amdgpu_vm_is_bo_always_valid kerneldocTvrtko Ursulin1-1/+1
Align kerneldoc with the function argument name. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 26e20235ce00 ("drm/amdgpu: Add amdgpu_bo_is_vm_bo helper") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-18drm/amdgpu: Add amdgpu_bo_is_vm_bo helperTvrtko Ursulin1-15/+26
Help code readability by replacing a bunch of: bo->tbo.base.resv == vm->root.bo->tbo.base.resv With: amdgpu_vm_is_bo_always_valid(vm, bo) No functional changes. v2: * Rename helper and move to amdgpu_vm. (Christian) v3: * Use Christian's kerneldoc. v4: * Fixed logic inversion in amdgpu_vm_bo_get_memory. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30drm/amdgpu: support gfx v12 specific pte/pde fieldsHawking Zhang1-6/+6
Add gfx v12 pte/pde support to gmc common helper. v2: squash in fixes (Alex) Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17drm/amdgpu: validate the parameters of bo mapping operations more clearlyxinhui pan1-26/+46
Verify the parameters of amdgpu_vm_bo_(map/replace_map/clearing_mappings) in one common place. Fixes: dc54d3d1744d ("drm/amdgpu: implement AMDGPU_VA_OP_CLEAR v2") Cc: stable@vger.kernel.org Reported-by: Vlad Stolyarov <hexed@google.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22drm/amdgpu: sync page table freeing with tlb flushShashank Sharma1-0/+3
The idea behind this patch is to delay the freeing of PT entry objects until the TLB flush is done. This patch: - Adds a tlb_flush_waitlist in amdgpu_vm_update_params which will keep the objects that need to be freed after tlb_flush. - Adds PT entries in this list in amdgpu_vm_ptes_update after finding the PT entry. - Changes functionality of amdgpu_vm_pt_free_dfs from (df_search + free) to simply freeing of the BOs, also renames it to amdgpu_vm_pt_free_list to reflect this same. - Exports function amdgpu_vm_pt_free_list to be called directly. - Calls amdgpu_vm_pt_free_list directly from amdgpu_vm_update_range. V2: rebase V4: Addressed review comments from Christian - add only locked PTEs entries in TLB flush waitlist. - do not create a separate function for list flush. - do not create a new lock for TLB flush. - there is no need to wait on tlb_flush_fence exclusively. V5: Addressed review comments from Christian - change the amdgpu_vm_pt_free_dfs's functionality to simple freeing of the objects and rename it. - add all the PTE objects in params->tlb_flush_waitlist - let amdgpu_vm_pt_free_root handle the freeing of BOs independently - call amdgpu_vm_pt_free_list directly V6: Rebase V7: Rebase V8: Added a NULL check to fix this backtrace issue: [ 415.351447] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 415.359245] #PF: supervisor write access in kernel mode [ 415.365081] #PF: error_code(0x0002) - not-present page [ 415.370817] PGD 101259067 P4D 101259067 PUD 10125a067 PMD 0 [ 415.377140] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 415.382004] CPU: 0 PID: 25481 Comm: test_with_MPI.e Tainted: G OE 5.18.2-mi300-build-140423-ubuntu-22.04+ #24 [ 415.394437] Hardware name: AMD Corporation Sh51p/Sh51p, BIOS RMO1001AS 02/21/2024 [ 415.402797] RIP: 0010:amdgpu_vm_ptes_update+0x6fd/0xa10 [amdgpu] [ 415.409648] Code: 4c 89 ff 4d 8d 66 30 e8 f1 ed ff ff 48 85 db 74 42 48 39 5d a0 74 40 48 8b 53 20 48 8b 4b 18 48 8d 43 18 48 8d 75 b0 4c 89 ff <48 > 89 51 08 48 89 0a 49 8b 56 30 48 89 42 08 48 89 53 18 4c 89 63 [ 415.430621] RSP: 0018:ffffc9000401f990 EFLAGS: 00010287 [ 415.436456] RAX: ffff888147bb82f0 RBX: ffff888147bb82d8 RCX: 0000000000000000 [ 415.444426] RDX: 0000000000000000 RSI: ffffc9000401fa30 RDI: ffff888161f80000 [ 415.452397] RBP: ffffc9000401fa80 R08: 0000000000000000 R09: ffffc9000401fa00 [ 415.460368] R10: 00000007f0cc0000 R11: 00000007f0c85000 R12: ffffc9000401fb20 [ 415.468340] R13: 00000007f0d00000 R14: ffffc9000401faf0 R15: ffff888161f80000 [ 415.476312] FS: 00007f132ff89840(0000) GS:ffff889f87c00000(0000) knlGS:0000000000000000 [ 415.485350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 415.491767] CR2: 0000000000000008 CR3: 0000000161d46003 CR4: 0000000000770ef0 [ 415.499738] PKRU: 55555554 [ 415.502750] Call Trace: [ 415.505482] <TASK> [ 415.507825] amdgpu_vm_update_range+0x32a/0x880 [amdgpu] [ 415.513869] amdgpu_vm_clear_freed+0x117/0x250 [amdgpu] [ 415.519814] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x18c/0x250 [amdgpu] [ 415.527729] kfd_ioctl_unmap_memory_from_gpu+0xed/0x340 [amdgpu] [ 415.534551] kfd_ioctl+0x3b6/0x510 [amdgpu] V9: Addressed review comments from Christian - No NULL check reqd for root PT freeing - Free PT list regardless of needs_flush - Move adding BOs in list in a separate function V10: Added Christian's RB V11: squash in list fix Cc: Christian König <Christian.Koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Reviewed-by: Christian König <Christian.Koenig@amd.com> Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20drm/amdgpu: implement TLB flush fenceChristian Koenig1-15/+47
The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out by invalidating the TLB before the PT can be freed. Solve this by moving the TLB flush into a DMA-fence object which can be used to delay the freeing of the PT BOs until it is signaled. V2: (Shashank) - rebase - set dma_fence_error only in case of error - add tlb_flush fence only when PT/PD BO is locked (Felix) - use vm->pasid when f is NULL (Mukul) V4: - add a wait for (f->dependency) in tlb_fence_work (Christian) - move the misplaced fence_create call to the end (Philip) V5: - free the f->dependency properly V6: (Shashank) - light code movement, moved all the clean-up in previous patch - introduce params.needs_flush and its usage in this patch - rebase without TLB HW sequence patch V7: - Keep the vm->last_update_fence and tlb_cb code until we can fix the HW sequencing (Christian) - Move all the tlb_fence related code in a separate function so that its easier to read and review V9: Addressed review comments from Christian - start PT update only when we have callback memory allocated V10: - handle device unlock in OOM case (Christian, Mukul) - added Christian's R-B Cc: Christian Koenig <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Reviewed-by: Shashank Sharma <shashank.sharma@amd.com> Reviewed-by: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20drm/amdgpu: add recent pagefault info in vm_managerSunil Khatri1-0/+8
Currently page fault information is stored per vm and which could be freed or stale during reset. Add it pagefault information in the vm_manager which is a global space for vm's and remains valid across. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04drm/amdgpu: cleanup conditional executionChristian König1-4/+4
First of all calculating the number of dw to patch into a conditional execution is not something HW generation specific. This is just standard ring buffer calculations. While at it also reduce the BUG_ON() into WARN_ON(). Then instead of a random bit pattern use 0 as default value for the number of dw skipped, this way it's not mandatory any more to patch the conditional execution. And last make the address to check a parameter of the conditional execution instead of getting this from the ring. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04drm/amdgpu: change vm->task_info handlingShashank Sharma1-44/+115
This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm. This patch also does logistical changes required for existing usage of vm->task_info. V2: Do not block all the prints when task_info not found (Felix) V3: Fixed review comments from Felix - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini() V4: Fixed review comments from Felix - fix double reference increment in create_task_info - change amdgpu_vm_get_task_info_pasid - additional changes in amdgpu_gem.c while porting Cc: Christian Koenig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>