kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch linux-6.0.y

drm/amdgpu: skip MES for S0ix as well since it's part of GFX

2023-01-07T10:15:42+00:00

commit afa6646b1c5d3affd541f76bd7476e4b835a9174 upstream. It's also part of gfxoff. Cc: stable@vger.kernel.org # 6.0, 6.1 Reviewed-by: Mario Limonciello Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: fix pci device refcount leak

2022-12-31T12:26:01+00:00

[ Upstream commit b85e285e3d6352b02947fc1b72303673dfacb0aa ] As comment of pci_get_domain_bus_and_slot() says, it returns a pci device with refcount increment, when finish using it, the caller must decrement the reference count by calling pci_dev_put(). So before returning from amdgpu_device_resume|suspend_display_audio(), pci_dev_put() is called to avoid refcount leak. Fixes: 3f12acc8d6d4 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3") Reviewed-by: Evan Quan Signed-off-by: Yang Yingliang Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amd: Fail the suspend if resources can't be evicted

2022-11-26T08:27:20+00:00

[ Upstream commit 8d4de331f1b24a22d18e3c6116aa25228cf54854 ] If a system does not have swap and memory is under 100% usage, amdgpu will fail to evict resources. Currently the suspend carries on proceeding to reset the GPU: ``` [drm] evicting device resources failed [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -12 [drm] free PSP TMR buffer [TTM] Failed allocating page table [drm] evicting device resources failed amdgpu 0000:03:00.0: amdgpu: MODE1 reset amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset ``` At this point if the suspend actually succeeded I think that amdgpu would have recovered because the GPU would have power cut off and restored. However the kernel fails to continue the suspend from the memory pressure and amdgpu fails to run the "resume" from the aborted suspend. ``` ACPI: PM: Preparing to enter system sleep state S3 SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO) cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0 node 0: slabs: 22, objs: 1122, free: 0 ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651) [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed! [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block failed -62 amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62 amdgpu 0000:03:00.0: PM: failed to resume async: error -62 ``` To avoid this series of unfortunate events, fail amdgpu's suspend when the memory eviction fails. This will let the system gracefully recover and the user can try suspend again when the memory pressure is relieved. Reported-by: post@davidak.de Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223 Signed-off-by: Mario Limonciello Reviewed-by: Alex Deucher Acked-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-11-03T15:00:21+00:00

commit d61e1d1d5225a9baeb995bcbdb904f66f70ed87e upstream. In the S2idle suspend/resume phase the gfxoff is keeping functional so some IP blocks will be likely to reinitialize at gfxoff entry and that will result in failing to program GC registers.Therefore, let disallow gfxoff until AMDGPU IPs reinitialized completely. Signed-off-by: Prike Liang Acked-by: Alex Deucher Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 5.15.x Signed-off-by: Greg Kroah-Hartman

drm/amd/pm: disable cstate feature for gpu reset scenario

2022-10-26T10:22:56+00:00

commit 3059cd8c5f797ad83d2b194ae66339f5c007ca43 upstream. Suggested by PMFW team and same as what did for gfxoff feature. This can address some Mode1Reset failures observed on SMU13.0.0. Signed-off-by: Evan Quan Reviewed-by: Hawking Zhang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 6.0.x Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV

2022-09-27T22:03:36+00:00

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor during the suspend time. Furthermore, we cannot request a mode 1 reset under SRIOV as VF. Therefore, we will skip it as it is called in suspend_noirq() function. - In the resume code path, we need to send REQ_GPU_INIT to the hypervisor and also resume PSP IP block under SRIOV. Signed-off-by: Bokun Zhang Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org

drm/amdgpu: make sure to init common IP before gmc

2022-09-14T18:21:49+00:00

Move common IP init before GMC init so that HDP gets remapped before GMC init which uses it. This fixes the Unsupported Request error reported through AER during driver load. The error happens as a write happens to the remap offset before real remapping is done. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373 The error was unnoticed before and got visible because of the commit referenced below. This doesn't fix anything in the commit below, rather fixes the issue in amdgpu exposed by the commit. The reference is only to associate this commit with below one so that both go together. Fixes: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()") Acked-by: Christian König Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org

drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks

2022-08-30T21:07:43+00:00

[Why] Devices with CPU XGMI iolink do not support PCIe peer access. Signed-off-by: Alex Sierra Acked-by: Alex Deucher Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu: Remove the additional kfd pre reset call for sriov

2022-08-19T21:06:38+00:00

The additional call is caused by merge conflict Reviewed-by: Felix Kuehling Signed-off-by: shaoyunl Signed-off-by: Alex Deucher

drm/amdgpu: fix hive reference leak when adding xgmi device

2022-08-19T21:06:21+00:00

Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive which will leak the hive reference. Signed-off-by: YiPeng Chai Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher