kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c, branch v6.1.60

drm/amdgpu: disable GFXOFF during compute for GFX11

2022-11-02T21:16:11+00:00

Temporary workaround to fix issues observed in some compute applications when GFXOFF is enabled on GFX11. Signed-off-by: Graham Sider Acked-by: Alex Deucher Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 6.0.x

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

2022-10-19T02:08:33+00:00

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher

drm/amdgpu: Correct amdgpu_amdkfd_total_mem_size calculation

2022-10-06T16:08:18+00:00

amdkfd_total_mem_size is the size of total GPUs vram plus system memory to estimate page tables memory usage and leave enough VRAM room for page tables allocation. Calculate amdkfd_total_mem_size in amdgpu_amdkfd_device_probe is incorrect because adev->gmc.real_vram_size is still 0 called from amdgpu_device_ip_early_init. Move the calculation to amdgpu_amdkfd_device_init to get the correct VRAM size. Do reverse calculation in amdgpu_amdkfd_device_fini_sw to support hot-unplugging GPUs. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu: add page retirement handling for CPU RAS

2022-09-29T13:42:36+00:00

Do RAS page retirement in poison consumption handler unconditionally. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: add gang submit frontend v6

2022-09-20T16:40:46+00:00

Allows submitting jobs as gang which needs to run on multiple engines at the same time. All members of the gang get the same implicit, explicit and VM dependencies. So no gang member will start running until everything else is ready. The last job is considered the gang leader (usually a submission to the GFX ring) and used for signaling output dependencies. Each job is remembered individually as user of a buffer object, so there is no joining of work at the end. v2: rebase and fix review comments from Andrey and Yogesh v3: use READ instead of BOOKKEEP for now because of VM unmaps, set gang leader only when necessary v4: fix order of pushing jobs and adding fences found by Trigger. v5: fix job index calculation and adding IBs to jobs v6: fix typo found by Alex Signed-off-by: Christian König Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: cleanup coding style in amdgpu_amdkfd.c

2022-09-13T18:32:58+00:00

Fix everything checkpatch.pl complained about in amdgpu_amdkfd.c Signed-off-by: Jingyu Wang Signed-off-by: Alex Deucher

drm/amdgpu: let mode2 reset fallback to default when failure

2022-08-16T22:14:31+00:00

- introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed v2: move this part out from the asic specific part Signed-off-by: Victor Zhao Acked-by: Andrey Grodzovsky Signed-off-by: Alex Deucher

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17+00:00

Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, force to use full reset method. Otherwise, try soft reset by default if the related ASIC supportted, if soft reset failed, will use full reset. Signed-off-by: Likun Gao Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-28T15:24:41+00:00

Align refcount behaviour for amdgpu_job embedded HW fence with classic pointer style HW fences by increasing refcount each time emit is called so amdgpu code doesn't need to make workarounds using amdgpu_job.job_run_counter to keep the HW fence refcount balanced. Also since in the previous patch we resumed setting s_fence->parent to NULL in drm_sched_stop switch to directly checking if job->hw_fence is signaled to short circuit reset if already signed. Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao Acked-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: To flush tlb for MMHUB of RAVEN series

2022-06-23T21:21:53+00:00

amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:8 pasid:32769, for process test_basic pid 3305 thread test_basic pid 3305) amdgpu: in page starting at address 0x00007ff990003000 from IH client 0x12 (VMC) amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00840051 amdgpu: Faulty UTCL2 client ID: MP1 (0x0) amdgpu: MORE_FAULTS: 0x1 amdgpu: WALKER_ERROR: 0x0 amdgpu: PERMISSION_FAULTS: 0x5 amdgpu: MAPPING_ERROR: 0x0 amdgpu: RW: 0x1 When memory is allocated by kfd, no one triggers the tlb flush for MMHUB0. There is page fault from MMHUB0. v2:fix indentation v3:change subject and fix indentation Signed-off-by: Ruili Ji Reviewed-by: Philip Yang Reviewed-by: Aaron Liu Acked-by: Alex Deucher Signed-off-by: Alex Deucher