kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c, branch v6.1.168

drm/amdgpu: skip fence GFX interrupts disable/enable for S0ix

2023-08-23T15:52:40+00:00

commit f1740b1ab2703b2a057da7cf33b03297e0381aa0 upstream.

GFX v11.0.1 reported a "fence fallback timer expired" issue on SDMA and GFX rings after S0ix resume. The cause is that EOP interrupts are disabled on S0ix suspend but fail to be re-enabled on resume because GFX is in GFXOFF.

[ 203.349571] [drm] Fence fallback timer expired on ring sdma0
[ 203.349572] [drm] Fence fallback timer expired on ring gfx_0.0.0
[ 203.861635] [drm] Fence fallback timer expired on ring gfx_0.0.0

For S0ix, GFX is in the GFXOFF state, so avoid touching the GFX registers to configure the fence driver interrupts for rings that belong to GFX. The interrupt configuration will be restored by GFXOFF exit.

Signed-off-by: Tim Huang
Reviewed-by: Mario Limonciello
Signed-off-by: Alex Deucher
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
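The decision described in this commit message can be sketched as a small predicate. This is a simplified model, not the driver code: the `ring_type` enum, `struct device_state`, the `sdma_in_gfx_domain` flag, and both function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring types for this sketch; not the driver's real enum. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_SDMA, RING_VCN };

struct device_state {
    bool in_s0ix;            /* true while suspending/resuming through S0ix */
    bool sdma_in_gfx_domain; /* assumption: SDMA shares the GFX power domain */
};

/* Returns true when the ring's interrupt registers are covered by GFXOFF. */
static bool ring_in_gfx_power_domain(const struct device_state *dev,
                                     enum ring_type type)
{
    assert(dev != NULL);
    switch (type) {
    case RING_GFX:
    case RING_COMPUTE:
        return true;
    case RING_SDMA:
        return dev->sdma_in_gfx_domain;
    default:
        return false;
    }
}

/*
 * The fence driver would disable/enable a ring's interrupt only when this
 * returns true. For S0ix and a GFX-domain ring it returns false, leaving
 * the restore to GFXOFF exit.
 */
static bool fence_should_touch_ring_interrupt(const struct device_state *dev,
                                              enum ring_type type)
{
    return !(dev->in_s0ix && ring_in_gfx_power_domain(dev, type));
}
```

In this model, a non-GFX ring such as VCN is still handled normally during S0ix, and every ring is handled normally outside S0ix.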

drm/amdgpu: skip disabling fence driver src_irqs when device is unplugged

2023-06-09T08:34:16+00:00

[ Upstream commit c1a322a7a4a96cd0a3dde32ce37af437a78bf8cd ]

When performing device unbind or halt, all irqs have already been disabled at the very beginning, e.g. in amdgpu_pci_remove or amdgpu_device_halt. So amdgpu_irq_put should no longer be called for the irqs stored in the fence driver; otherwise, the call trace below appears.

[ 139.114088] WARNING: CPU: 2 PID: 1550 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:616 amdgpu_irq_put+0xf6/0x110 [amdgpu]
[ 139.114655] Call Trace:
[ 139.114655]
[ 139.114657] amdgpu_fence_driver_hw_fini+0x93/0x130 [amdgpu]
[ 139.114836] amdgpu_device_fini_hw+0xb6/0x350 [amdgpu]
[ 139.114955] amdgpu_driver_unload_kms+0x51/0x70 [amdgpu]
[ 139.115075] amdgpu_pci_remove+0x63/0x160 [amdgpu]
[ 139.115193] ? __pm_runtime_resume+0x64/0x90
[ 139.115195] pci_device_remove+0x3a/0xb0
[ 139.115197] device_remove+0x43/0x70
[ 139.115198] device_release_driver_internal+0xbd/0x140

Signed-off-by: Guchun Chen
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
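A minimal sketch of the guard this commit describes, using mock types rather than the real amdgpu structures. The names `struct irq_src`, `struct ring`, `struct device`, `irq_put` and `fence_hw_fini_ring` are hypothetical; the `-1` return stands in for the WARNING raised on an unbalanced put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct irq_src {
    int refcount;              /* number of outstanding "get" calls */
};

struct ring {
    struct irq_src *irq_src;   /* may be NULL for rings without a fence irq */
};

struct device {
    bool unplugged;            /* stand-in for the DRM "unplugged" state */
};

/* Drops one reference; returns -1 on an unbalanced put (models the WARNING). */
static int irq_put(struct irq_src *src)
{
    assert(src != NULL);
    if (src->refcount <= 0)
        return -1;
    src->refcount--;
    return 0;
}

/*
 * Hardware teardown for one ring. When the device is already unplugged,
 * all irqs were disabled earlier, so the put is skipped entirely.
 */
static int fence_hw_fini_ring(const struct device *dev, struct ring *ring)
{
    if (dev->unplugged || ring->irq_src == NULL)
        return 0;
    return irq_put(ring->irq_src);
}
```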

drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

2023-04-20T10:35:11+00:00

[ Upstream commit 033c56474acf567a450f8bafca50e0b610f2b716 ]

[Why]
For engines not supporting soft reset, i.e. VCN, there will be a failed ib test before mode 1 reset during asic reset. The fences in this case are never signaled and next time when we try to free the sa_bo, kernel will hang.

[How]
During pre_asic_reset, driver will clear job fences and afterwards the fences' refcount will be reduced to 1. For drm_sched_jobs it will be released in job_free_cb, and for non-sched jobs like ib_test, it's meant to be released in sa_bo_free but only when the fences are signaled. So we have to force signal the non_sched bad job's fence during pre_asic_reset or the clear is not complete.

Signed-off-by: YuBiao Wang
Acked-by: Luben Tuikov
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
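The [How] part can be illustrated with a toy model. Everything here is an assumption made for illustration: `struct hw_fence`, its `has_sched_fence` flag (standing in for "this fence belongs to a drm_sched job"), and the functions `clear_job_fence` and `sa_bo_can_free` are hypothetical names, not driver APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hw_fence {
    int refcount;
    bool signaled;
    bool has_sched_fence;  /* false for non-sched jobs such as an ib test */
};

/*
 * Models clearing one job fence during pre-reset: fences of non-sched jobs
 * are force-signaled, then the reference held by the fence array is dropped.
 */
static void clear_job_fence(struct hw_fence *f)
{
    assert(f != NULL && f->refcount > 0);
    if (!f->has_sched_fence && !f->signaled)
        f->signaled = true;
    f->refcount--;
}

/* Models the condition described above: the sa_bo is freed only once signaled. */
static bool sa_bo_can_free(const struct hw_fence *f)
{
    return f->signaled;
}
```

Without the force-signal step, the ib test fence in this model would stay unsignaled forever and `sa_bo_can_free` would never return true, which corresponds to the hang described in [Why].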

drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-14T18:11:55+00:00

commit 5ad7bbf3dba5c4a684338df1f285080f2588b535 upstream.

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine. That function is expected to be called only after the respective init function, drm_sched_init(), has executed successfully.

It so happens that we recently hit a driver probe failure on the Steam Deck, and drm_sched_fini() was called even though its counterpart had not been called before, causing the following oops:

amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
devm_drm_dev_init_release+0x49/0x70
[...]

To prevent that, check whether the drm_sched was properly initialized for a given ring before calling its fini counterpart.

Note that ideally we'd use sched.ready for that; this field is set as the last thing in drm_sched_init(). But amdgpu seems to "override" the meaning of this field: in the above oops, for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine regardless of the state of the DRM scheduler. Hence, we ended up using sched.ops as per Christian's suggestion [0], and also removed the no_scheduler check [1].

[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König
Cc: Guchun Chen
Cc: Luben Tuikov
Cc: Mario Limonciello
Reviewed-by: Luben Tuikov
Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Alex Deucher
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
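The init/fini pairing check can be sketched with a mock scheduler. This is only a model under stated assumptions: `struct scheduler`, `sched_init`, `sched_fini` and `fence_sw_fini_ring` are hypothetical stand-ins, and the only property borrowed from the commit message is that a non-NULL `ops` pointer is used as evidence that init ran.

```c
#include <assert.h>
#include <stddef.h>

struct sched_ops {
    int placeholder;               /* contents are irrelevant to the sketch */
};

struct scheduler {
    const struct sched_ops *ops;   /* set by init; NULL if init never ran */
    int fini_calls;                /* counts teardown calls, for illustration */
};

/* Mock init: records the ops pointer. */
static int sched_init(struct scheduler *s, const struct sched_ops *ops)
{
    s->ops = ops;
    return 0;
}

/* Mock fini: only valid after a successful init. */
static void sched_fini(struct scheduler *s)
{
    assert(s->ops != NULL);
    s->fini_calls++;
}

/* Software teardown for one ring: skip fini when init never happened. */
static void fence_sw_fini_ring(struct scheduler *s)
{
    if (s->ops != NULL)
        sched_fini(s);
}
```

A zero-initialized scheduler, as left behind by a failed probe in this model, is simply skipped instead of being torn down.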

drm/amdgpu: Remove fence_process in count_emitted

2022-09-27T22:03:09+00:00

The function amdgpu_fence_count_emitted, which is used in a work handler, should not call amdgpu_fence_process, which must be used in the irq handler.

Reviewed-by: Christian König
Signed-off-by: Jiadong.Zhu
Signed-off-by: Alex Deucher

drm/amd: Fix typo 'the the' in comment

2022-07-25T13:31:05+00:00

Replace 'the the' with 'the' in the comment.

Signed-off-by: Slark Xiao
Signed-off-by: Alex Deucher

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17+00:00

Move reset_context out of the gpu recover function to make it configurable for different reset purposes. When reset is triggered through the gpu_recovery sysfs entry, force the full reset method. Otherwise, try soft reset by default if the ASIC supports it, and fall back to full reset if soft reset fails.

Signed-off-by: Likun Gao
Reviewed-by: Hawking Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-28T15:24:41+00:00

Align refcount behaviour for the amdgpu_job embedded HW fence with classic pointer style HW fences by increasing the refcount each time emit is called, so amdgpu code doesn't need to make workarounds using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also, since in the previous patch we resumed setting s_fence->parent to NULL in drm_sched_stop, switch to directly checking if job->hw_fence is signaled to short circuit reset if it is already signaled.

Signed-off-by: Andrey Grodzovsky
Tested-by: Yiqing Yao
Acked-by: Christian König
Signed-off-by: Alex Deucher

drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-28T15:24:24+00:00

Problem:
After we start handling timed out jobs we assume their fences won't be signaled, but we cannot be sure and sometimes they fire late. We need to prevent concurrent accesses to the fence array from amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process from a late EOP interrupt.

Fix:
Before accessing the fence array during GPU reset, disable the EOP interrupt and flush all pending interrupt handlers for the amdgpu device's interrupt line.

v2: Switch from irq_get/put to full enable/disable_irq for amdgpu

Signed-off-by: Andrey Grodzovsky
Acked-by: Christian König
Signed-off-by: Alex Deucher

drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences

2022-06-28T15:24:18+00:00

This function should drop the fence refcount when it extracts the fence from the fence array, just as it's done in amdgpu_fence_process.

Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
Signed-off-by: Alex Deucher
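The refcount rule stated here can be shown with a toy fence array. This sketch assumes a fixed-size array and a plain integer refcount; `struct fence`, `struct fence_array`, `NUM_SLOTS`, the `is_job_fence` flag, `fence_put` and `clear_job_fences` are hypothetical names that do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4

struct fence {
    int refcount;
    bool is_job_fence;   /* true for fences embedded in a job */
};

struct fence_array {
    struct fence *slots[NUM_SLOTS];
};

/* Drops one reference; a NULL fence is ignored. */
static void fence_put(struct fence *f)
{
    if (f == NULL)
        return;
    assert(f->refcount > 0);
    f->refcount--;
}

/*
 * Removes every job fence from the array. Each removed fence loses the
 * reference that the array held on it. Returns the number removed.
 */
static int clear_job_fences(struct fence_array *arr)
{
    int cleared = 0;

    for (size_t i = 0; i < NUM_SLOTS; i++) {
        struct fence *old = arr->slots[i];

        if (old != NULL && old->is_job_fence) {
            arr->slots[i] = NULL;  /* extract from the array ... */
            fence_put(old);        /* ... and drop the array's reference */
            cleared++;
        }
    }
    return cleared;
}
```

If the `fence_put` call were omitted, each extracted fence would keep one reference that nobody owns any more, which is the leak this commit addresses.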