kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch v6.1.2

drm/amdgpu: Fix potential double free and null pointer dereference

2022-12-31T12:33:03+00:00

[ Upstream commit dfd0287bd3920e132a8dae2a0ec3d92eaff5f2dd ] In amdgpu_get_xgmi_hive(), we should not call kfree() after kobject_put() as the PUT will call kfree(). In amdgpu_device_ip_init(), we need to check the returned *hive* which can be NULL before we dereference it. Signed-off-by: Liang He Reviewed-by: Luben Tuikov Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: fix pci device refcount leak

2022-12-31T12:32:15+00:00

[ Upstream commit b85e285e3d6352b02947fc1b72303673dfacb0aa ] As comment of pci_get_domain_bus_and_slot() says, it returns a pci device with refcount increment, when finish using it, the caller must decrement the reference count by calling pci_dev_put(). So before returning from amdgpu_device_resume|suspend_display_audio(), pci_dev_put() is called to avoid refcount leak. Fixes: 3f12acc8d6d4 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3") Reviewed-by: Evan Quan Signed-off-by: Yang Yingliang Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: there is no vbios fb on devices with no display hw (v2)

2022-11-15T18:24:40+00:00

If we enable virtual display functionality on parts with no display hardware we can end up trying to check for and reserve the vbios FB area on devices where it doesn't exist. Check if display hardware is actually present on the hardware before trying to reserve the memory. v2: move the check into common code Acked-by: Evan Quan Signed-off-by: Alex Deucher

drm/amd: Fail the suspend if resources can't be evicted

2022-11-02T21:00:39+00:00

If a system does not have swap and memory is under 100% usage, amdgpu will fail to evict resources. Currently the suspend carries on proceeding to reset the GPU: ``` [drm] evicting device resources failed [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -12 [drm] free PSP TMR buffer [TTM] Failed allocating page table [drm] evicting device resources failed amdgpu 0000:03:00.0: amdgpu: MODE1 reset amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset ``` At this point if the suspend actually succeeded I think that amdgpu would have recovered because the GPU would have power cut off and restored. However the kernel fails to continue the suspend from the memory pressure and amdgpu fails to run the "resume" from the aborted suspend. ``` ACPI: PM: Preparing to enter system sleep state S3 SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO) cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0 node 0: slabs: 22, objs: 1122, free: 0 ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651) [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed! [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block failed -62 amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62). PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62 amdgpu 0000:03:00.0: PM: failed to resume async: error -62 ``` To avoid this series of unfortunate events, fail amdgpu's suspend when the memory eviction fails. This will let the system gracefully recover and the user can try suspend again when the memory pressure is relieved. Reported-by: post@davidak.de Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223 Signed-off-by: Mario Limonciello Reviewed-by: Alex Deucher Acked-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume

2022-10-26T21:48:43+00:00

In the S2idle suspend/resume phase the gfxoff is keeping functional so some IP blocks will be likely to reinitialize at gfxoff entry and that will result in failing to program GC registers.Therefore, let disallow gfxoff until AMDGPU IPs reinitialized completely. Signed-off-by: Prike Liang Acked-by: Alex Deucher Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 5.15.x

drm/amdgpu: skip mes self test for gc 11.0.3 in recover

2022-10-24T18:44:03+00:00

Temporary disable mes self teset for gc 11.0.3 during gpu_recovery. Signed-off-by: YuBiao Wang Acked-by: Luben Tuikov Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amd/pm: disable cstate feature for gpu reset scenario

2022-10-19T02:12:20+00:00

Suggested by PMFW team and same as what did for gfxoff feature. This can address some Mode1Reset failures observed on SMU13.0.0. Signed-off-by: Evan Quan Reviewed-by: Hawking Zhang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 6.0.x Signed-off-by: Alex Deucher

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

2022-10-19T02:08:33+00:00

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV

2022-09-29T13:41:46+00:00

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor during the suspend time. Furthermore, we cannot request a mode 1 reset under SRIOV as VF. Therefore, we will skip it as it is called in suspend_noirq() function. - In the resume code path, we need to send REQ_GPU_INIT to the hypervisor and also resume PSP IP block under SRIOV. Signed-off-by: Bokun Zhang Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Use simplified API for p2p dist calc

2022-09-29T13:41:42+00:00

Use the simpified API that calculates distance between two devices. Signed-off-by: Lijo Lazar Reviewed-by: Christian König Reviewed-by: Guchun Chen Signed-off-by: Alex Deucher