summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
AgeCommit message (Collapse)AuthorFilesLines
2025-01-23Revert "drm/amdgpu: rework resume handling for display (v2)"Greg Kroah-Hartman1-43/+2
This reverts commit 2daba7d857e48035d71cdd95964350b6d0d51545 which is commit 73dae652dcac776296890da215ee7dec357a1032 upstream. The original patch 73dae652dcac (drm/amdgpu: rework resume handling for display (v2)), was only targeted at kernels 6.11 and newer. It did not apply cleanly to 6.12 so I backported it and it backport landed as 99a02eab8251 ("drm/amdgpu: rework resume handling for display (v2)"), however there was a bug in the backport that was subsequently fixed in 063d380ca28e ("drm/amdgpu: fix backport of commit 73dae652dcac"). None of this was intended for kernels older than 6.11, however the original backport eventually landed in 6.6, 6.1, and 5.15. Please revert the change from kernels 6.6, 6.1, and 5.15. Link: https://lore.kernel.org/r/BL1PR12MB5144D5363FCE6F2FD3502534F7E72@BL1PR12MB5144.namprd12.prod.outlook.com Link: https://lore.kernel.org/r/BL1PR12MB51449ADCFBF2314431F8BCFDF7132@BL1PR12MB5144.namprd12.prod.outlook.com Reported-by: Salvatore Bonaccorso <carnil@debian.org> Reported-by: Christian König <christian.koenig@amd.com> Reported-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14drm/amdgpu: rework resume handling for display (v2)Alex Deucher1-2/+43
commit 73dae652dcac776296890da215ee7dec357a1032 upstream. Split resume into a 3rd step to handle displays when DCC is enabled on DCN 4.0.1. Move display after the buffer funcs have been re-enabled so that the GPU will do the move and properly set the DCC metadata for DCN. v2: fix fence irq resume ordering Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.11.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-14drm/amdgpu: skip amdgpu_device_cache_pci_state under sriovVictor Zhao1-0/+3
[ Upstream commit afe260df55ac280cd56306248cb6d8a6b0db095c ] Under sriov, host driver will save and restore vf pci cfg space during reset. And during device init, under sriov, pci_restore_state happens after fullaccess released, and it can have race condition with mmio protection enable from host side leading to missing interrupts. So skip amdgpu_device_cache_pci_state for sriov. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09drm/amdgpu: fix usage slab after freeVitaly Prosyak1-1/+1
commit b61badd20b443eabe132314669bb51a263982e5c upstream. [ +0.000021] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.000027] Read of size 8 at addr ffff8881b8605f88 by task amd_pci_unplug/2147 [ +0.000023] CPU: 6 PID: 2147 Comm: amd_pci_unplug Not tainted 6.10.0+ #1 [ +0.000016] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.000016] Call Trace: [ +0.000008] <TASK> [ +0.000009] dump_stack_lvl+0x76/0xa0 [ +0.000017] print_report+0xce/0x5f0 [ +0.000017] ? drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.000019] ? srso_return_thunk+0x5/0x5f [ +0.000015] ? kasan_complete_mode_report_info+0x72/0x200 [ +0.000016] ? drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.000019] kasan_report+0xbe/0x110 [ +0.000015] ? drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.000023] __asan_report_load8_noabort+0x14/0x30 [ +0.000014] drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.000020] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? __kasan_check_write+0x14/0x30 [ +0.000016] ? __pfx_drm_sched_entity_flush+0x10/0x10 [gpu_sched] [ +0.000020] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? __kasan_check_write+0x14/0x30 [ +0.000013] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? enable_work+0x124/0x220 [ +0.000015] ? __pfx_enable_work+0x10/0x10 [ +0.000013] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? free_large_kmalloc+0x85/0xf0 [ +0.000016] drm_sched_entity_destroy+0x18/0x30 [gpu_sched] [ +0.000020] amdgpu_vce_sw_fini+0x55/0x170 [amdgpu] [ +0.000735] ? __kasan_check_read+0x11/0x20 [ +0.000016] vce_v4_0_sw_fini+0x80/0x110 [amdgpu] [ +0.000726] amdgpu_device_fini_sw+0x331/0xfc0 [amdgpu] [ +0.000679] ? mutex_unlock+0x80/0xe0 [ +0.000017] ? __pfx_amdgpu_device_fini_sw+0x10/0x10 [amdgpu] [ +0.000662] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? __kasan_check_write+0x14/0x30 [ +0.000013] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? mutex_unlock+0x80/0xe0 [ +0.000016] amdgpu_driver_release_kms+0x16/0x80 [amdgpu] [ +0.000663] drm_minor_release+0xc9/0x140 [drm] [ +0.000081] drm_release+0x1fd/0x390 [drm] [ +0.000082] __fput+0x36c/0xad0 [ +0.000018] __fput_sync+0x3c/0x50 [ +0.000014] __x64_sys_close+0x7d/0xe0 [ +0.000014] x64_sys_call+0x1bc6/0x2680 [ +0.000014] do_syscall_64+0x70/0x130 [ +0.000014] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? irqentry_exit_to_user_mode+0x60/0x190 [ +0.000015] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? irqentry_exit+0x43/0x50 [ +0.000012] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? exc_page_fault+0x7c/0x110 [ +0.000015] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000014] RIP: 0033:0x7ffff7b14f67 [ +0.000013] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff [ +0.000026] RSP: 002b:00007fffffffe378 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ +0.000019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67 [ +0.000014] RDX: 0000000000000000 RSI: 00007ffff7f6f47a RDI: 0000000000000003 [ +0.000014] RBP: 00007fffffffe3a0 R08: 0000555555569890 R09: 0000000000000000 [ +0.000014] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8 [ +0.000013] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040 [ +0.000020] </TASK> [ +0.000016] Allocated by task 383 on cpu 7 at 26.880319s: [ +0.000014] kasan_save_stack+0x28/0x60 [ +0.000008] kasan_save_track+0x18/0x70 [ +0.000007] kasan_save_alloc_info+0x38/0x60 [ +0.000007] __kasan_kmalloc+0xc1/0xd0 [ +0.000007] kmalloc_trace_noprof+0x180/0x380 [ +0.000007] drm_sched_init+0x411/0xec0 [gpu_sched] [ +0.000012] amdgpu_device_init+0x695f/0xa610 [amdgpu] [ +0.000658] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ +0.000662] amdgpu_pci_probe+0x361/0xf30 [amdgpu] [ +0.000651] local_pci_probe+0xe7/0x1b0 [ +0.000009] pci_device_probe+0x248/0x890 [ +0.000008] really_probe+0x1fd/0x950 [ +0.000008] __driver_probe_device+0x307/0x410 [ +0.000007] driver_probe_device+0x4e/0x150 [ +0.000007] __driver_attach+0x223/0x510 [ +0.000006] bus_for_each_dev+0x102/0x1a0 [ +0.000007] driver_attach+0x3d/0x60 [ +0.000006] bus_add_driver+0x2ac/0x5f0 [ +0.000006] driver_register+0x13d/0x490 [ +0.000008] __pci_register_driver+0x1ee/0x2b0 [ +0.000007] llc_sap_close+0xb0/0x160 [llc] [ +0.000009] do_one_initcall+0x9c/0x3e0 [ +0.000008] do_init_module+0x241/0x760 [ +0.000008] load_module+0x51ac/0x6c30 [ +0.000006] __do_sys_init_module+0x234/0x270 [ +0.000007] __x64_sys_init_module+0x73/0xc0 [ +0.000006] x64_sys_call+0xe3/0x2680 [ +0.000006] do_syscall_64+0x70/0x130 [ +0.000007] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000015] Freed by task 2147 on cpu 6 at 160.507651s: [ +0.000013] kasan_save_stack+0x28/0x60 [ +0.000007] kasan_save_track+0x18/0x70 [ +0.000007] kasan_save_free_info+0x3b/0x60 [ +0.000007] poison_slab_object+0x115/0x1c0 [ +0.000007] __kasan_slab_free+0x34/0x60 [ +0.000007] kfree+0xfa/0x2f0 [ +0.000007] drm_sched_fini+0x19d/0x410 [gpu_sched] [ +0.000012] amdgpu_fence_driver_sw_fini+0xc4/0x2f0 [amdgpu] [ +0.000662] amdgpu_device_fini_sw+0x77/0xfc0 [amdgpu] [ +0.000653] amdgpu_driver_release_kms+0x16/0x80 [amdgpu] [ +0.000655] drm_minor_release+0xc9/0x140 [drm] [ +0.000071] drm_release+0x1fd/0x390 [drm] [ +0.000071] __fput+0x36c/0xad0 [ +0.000008] __fput_sync+0x3c/0x50 [ +0.000007] __x64_sys_close+0x7d/0xe0 [ +0.000007] x64_sys_call+0x1bc6/0x2680 [ +0.000007] do_syscall_64+0x70/0x130 [ +0.000007] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000014] The buggy address belongs to the object at ffff8881b8605f80 which belongs to the cache kmalloc-64 of size 64 [ +0.000020] The buggy address is located 8 bytes inside of freed 64-byte region [ffff8881b8605f80, ffff8881b8605fc0) [ +0.000028] The buggy address belongs to the physical page: [ +0.000011] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1b8605 [ +0.000008] anon flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) [ +0.000007] page_type: 0xffffefff(slab) [ +0.000009] raw: 0017ffffc0000000 ffff8881000428c0 0000000000000000 dead000000000001 [ +0.000006] raw: 0000000000000000 0000000000200020 00000001ffffefff 0000000000000000 [ +0.000006] page dumped because: kasan: bad access detected [ +0.000012] Memory state around the buggy address: [ +0.000011] ffff8881b8605e80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ +0.000015] ffff8881b8605f00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc [ +0.000015] >ffff8881b8605f80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ +0.000013] ^ [ +0.000011] ffff8881b8606000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [ +0.000014] ffff8881b8606080: fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb fb [ +0.000013] ================================================================== The issue reproduced on VG20 during the IGT pci_unplug test. The root cause of the issue is that the function drm_sched_fini is called before drm_sched_entity_kill. In drm_sched_fini, the drm_sched_rq structure is freed, but this structure is later accessed by each entity within the run queue, leading to invalid memory access. To resolve this, the order of cleanup calls is updated: Before: amdgpu_fence_driver_sw_fini amdgpu_device_ip_fini After: amdgpu_device_ip_fini amdgpu_fence_driver_sw_fini This updated order ensures that all entities in the IPs are cleaned up first, followed by proper cleanup of the schedulers. Additional Investigation: During debugging, another issue was identified in the amdgpu_vce_sw_fini function. The vce.vcpu_bo buffer must be freed only as the final step in the cleanup process to prevent any premature access during earlier cleanup stages. v2: Using Christian suggestion call drm_sched_entity_destroy before drm_sched_fini. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-08drm/amdgpu: fix dereference after null checkJesse Zhang1-1/+1
[ Upstream commit b1f7810b05d1950350ac2e06992982974343e441 ] check the pointer hive before use. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-08drm/amd/amdgpu: Check tbo resource pointerAsad Kamal1-1/+2
[ Upstream commit 6cd2b872643bb29bba01a8ac739138db7bd79007 ] Validate tbo resource pointer, skip if NULL Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14drm/amdgpu: Add lock around VF RLCG interfaceVictor Skvortsov1-0/+1
[ Upstream commit e864180ee49b4d30e640fd1e1d852b86411420c9 ] flush_gpu_tlb may be called from another thread while device_gpu_recover is running. Both of these threads access registers through the VF RLCG interface during VF Full Access. Add a lock around this interface to prevent race conditions between these threads. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03drm/amdgpu: Check if NBIO funcs are NULL in amdgpu_device_baco_exitFriedrich Vock1-1/+1
[ Upstream commit 0cdb3f9740844b9d95ca413e3fcff11f81223ecf ] The special case for VM passthrough doesn't check adev->nbio.funcs before dereferencing it. If GPUs that don't have an NBIO block are passed through, this leads to a NULL pointer dereference on startup. Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Fixes: 1bece222eabe ("drm/amdgpu: Clear doorbell interrupt status for Sienna Cichlid") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05drm/amdgpu: Fix pci state save during mode-1 resetLijo Lazar1-2/+5
[ Upstream commit 74fa02c4a5ea1ade5156a6ce494d3ea83881c2d8 ] Cache the PCI state before bus master is disabled. The saved state is later used for other cases like restoring config space after mode-2 reset. Fixes: 5c03e5843e6b ("drm/amdgpu:add smu mode1/2 support for aldebaran") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13Revert "drm/amd/amdgpu: Fix potential ioremap() memory leaks in ↵Ma Jun1-10/+6
amdgpu_device_init()" commit 03c6284df179de3a4a6e0684764b1c71d2a405e2 upstream. This patch causes the following iounmap erorr and calltrace iounmap: bad address 00000000d0b3631f The original patch was unjustified because amdgpu_device_fini_sw() will always cleanup the rmmio mapping. This reverts commit eb4f139888f636614dab3bcce97ff61cefc4b3a7. Signed-off-by: Ma Jun <Jun.Ma2@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13drm/amd/amdgpu: Fix potential ioremap() memory leaks in amdgpu_device_init()Srinivasan Shanmugam1-6/+10
[ Upstream commit eb4f139888f636614dab3bcce97ff61cefc4b3a7 ] This ensures that the memory mapped by ioremap for adev->rmmio, is properly handled in amdgpu_device_init(). If the function exits early due to an error, the memory is unmapped. If the function completes successfully, the memory remains mapped. Reported by smatch: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4337 amdgpu_device_init() warn: 'adev->rmmio' from ioremap() not released on lines: 4035,4045,4051,4058,4068,4337 Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10drm/amd: Flush GFXOFF requests in prepare stageMario Limonciello1-0/+2
[ Upstream commit ca299b4512d4b4f516732a48ce9aa19d91f4473e ] If the system hasn't entered GFXOFF when suspend starts it can cause hangs accessing GC and RLC during the suspend stage. Cc: <stable@vger.kernel.org> # 6.1.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") Cc: <stable@vger.kernel.org> # 6.1.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks") Cc: <stable@vger.kernel.org> # 6.1.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()") Cc: <stable@vger.kernel.org> # 6.1.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend") Cc: <stable@vger.kernel.org> # 6.6.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") Cc: <stable@vger.kernel.org> # 6.6.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks") Cc: <stable@vger.kernel.org> # 6.6.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()") Cc: <stable@vger.kernel.org> # 6.6.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend") Cc: <stable@vger.kernel.org> # 6.1+ Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3132 Fixes: ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10drm/amd: Add concept of running prepare_suspend() sequence for IP blocksMario Limonciello1-1/+11
[ Upstream commit cb11ca3233aa3303dc11dca25977d2e7f24be00f ] If any IP blocks allocate memory during their hw_fini() sequence this can cause the suspend to fail under memory pressure. Introduce a new phase that IP blocks can use to allocate memory before suspend starts so that it can potentially be evicted into swap instead. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: ca299b4512d4 ("drm/amd: Flush GFXOFF requests in prepare stage") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10drm/amd: Evict resources during PM ops prepare() callbackMario Limonciello1-5/+26
[ Upstream commit 5095d5418193eb2748c7d8553c7150b8f1c44696 ] Linux PM core has a prepare() callback run before suspend. If the system is under high memory pressure, the resources may need to be evicted into swap instead. If the storage backing for swap is offlined during the suspend() step then such a call may fail. So move this step into prepare() to move evict majority of resources and update all non-pmops callers to call the same callback. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: ca299b4512d4 ("drm/amd: Flush GFXOFF requests in prepare stage") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23Revert "drm/amd: flush any delayed gfxoff on suspend entry"Mario Limonciello1-1/+0
commit 916361685319098f696b798ef1560f69ed96e934 upstream. commit ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") caused GFXOFF control to be used more heavily and the codepath that was removed from commit 0dee72639533 ("drm/amd: flush any delayed gfxoff on suspend entry") now can be exercised at suspend again. Users report that by using GNOME to suspend the lockscreen trigger will cause SDMA traffic and the system can deadlock. This reverts commit 0dee726395333fea833eaaf838bc80962df886c8. Acked-by: Alex Deucher <alexander.deucher@amd.com> Fixes: ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-05drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()'Srinivasan Shanmugam1-0/+1
[ Upstream commit 8a44fdd3cf91debbd09b43bd2519ad2b2486ccf4 ] In function 'amdgpu_device_need_post(struct amdgpu_device *adev)' - 'adev->pm.fw' may not be released before return. Using the function release_firmware() to release adev->pm.fw. Thus fixing the below: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1571 amdgpu_device_need_post() warn: 'adev->pm.fw' from request_firmware() not released on lines: 1554. Cc: Monk Liu <Monk.Liu@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-10drm/amdgpu: skip gpu_info fw loading on navi12Alex Deucher1-9/+2
commit 21f6137c64c65d6808c4a81006956197ca203383 upstream. It's no longer required. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2318 Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13drm/amdgpu: disable MCBP by defaultJiadong Zhu1-4/+0
[ Upstream commit d6a57588666301acd9d42d3b00d74240964f07f6 ] Disable MCBP(mid command buffer preemption) by default as old Mesa hangs with it. We shall not enable the feature that breaks old usermode driver. Fixes: 50a7c8765ca6 ("drm/amdgpu: enable mcbp by default on gfx9") Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28drm/amdgpu: don't use pci_is_thunderbolt_attached()Alex Deucher1-4/+4
commit 7b1c6263eaf4fd64ffe1cafdc504a42ee4bfbb33 upstream. It's only valid on Intel systems with the Intel VSEC. Use dev_is_removable() instead. This should do the right thing regardless of the platform. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925 Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28drm/amd: Disable PP_PCIE_DPM_MASK when dynamic speed switching not supportedMario Limonciello1-0/+2
[ Upstream commit fbf1035b033a51eee48d5f42e781b02fff272ca0 ] Rather than individual ASICs checking for the quirk, set the quirk at the driver level. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28drm/amdgpu: Fix potential null pointer derefernceStanley.Yang1-1/+2
[ Upstream commit 80285ae1ec8717b597b20de38866c29d84d321a1 ] The amdgpu_ras_get_context may return NULL if device not support ras feature, so add check before using. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-05drm/amd: Fix detection of _PR3 on the PCIe root portMario Limonciello1-1/+1
On some systems with Navi3x dGPU will attempt to use BACO for runtime PM but fails to resume properly. This is because on these systems the root port goes into D3cold which is incompatible with BACO. This happens because in this case dGPU is connected to a bridge between root port which causes BOCO detection logic to fail. Fix the intent of the logic by looking at root port, not the immediate upstream bridge for _PR3. Cc: stable@vger.kernel.org Suggested-by: Jun Ma <Jun.Ma2@amd.com> Tested-by: David Perry <David.Perry@amd.com> Fixes: b10c1c5b3a4e ("drm/amdgpu: add check for ACPI power resources") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-12Revert "drm/amd: Disable S/G for APUs when 64GB or more host memory"Hamza Mahfooz1-26/+0
This reverts commit 70e64c4d522b732e31c6475a3be2349de337d321. Since, we now have an actual fix for this issue, we can get rid of this workaround as it can cause pin failures if enough VRAM isn't carved out by the BIOS. Cc: stable@vger.kernel.org # 6.1+ Acked-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Hamza Mahfooz <hamza.mahfooz@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-01drm/amdgpu: Allocate coredump memory in a nonblocking wayAndré Almeida1-1/+1
During a GPU reset, a normal memory reclaim could block to reclaim memory. Giving that coredump is a best effort mechanism, it shouldn't disturb the reset path. Change its memory allocation flag to a nonblocking one. Signed-off-by: André Almeida <andrealmeid@igalia.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-01drm/amdgpu: Fix the return for gpu mode1_resetHawking Zhang1-2/+11
amdgpu_device_mode1_reset will return gpu mode1_reset succeed (ret = 0) as long as wait_for_bootloader call succeed, regardless of the status reported by smu or psp firmware. This results to driver continue executing recovery even smu or psp fail to perform mode1 reset. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-01drm/amdgpu: Add bootloader status checkLijo Lazar1-3/+14
Add a function to wait till bootloader has reached steady state. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Tested-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amd: flush any delayed gfxoff on suspend entryMario Limonciello1-0/+1
DCN 3.1.4 is reported to hang on s2idle entry if graphics activity is happening during entry. This is because GFXOFF was scheduled as delayed but RLC gets disabled in s2idle entry sequence which will hang GFX IP if not already in GFXOFF. To help this problem, flush any delayed work for GFXOFF early in s2idle entry sequence to ensure that it's off when RLC is changed. commit 4b31b92b143f ("drm/amdgpu: complete gfxoff allow signal during suspend without delay") modified power gating flow so that if called in s0ix that it ensured that GFXOFF wasn't put in work queue but instead processed immediately. This is dead code due to commit 10cb67eb8a1b ("drm/amdgpu: skip CG/PG for gfx during S0ix") because GFXOFF will now not be explicitly called as part of the suspend entry code. Remove that dead code. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amdgpu: disable mcbp if parameter zero is setJiadong Zhu1-4/+5
The parameter amdgpu_mcbp shall have priority against the default value calculated from the chip version. User could disable mcbp by setting the parameter mcbp as zero. v2: do not trigger preemption in sw ring muxer when mcbp is disabled. Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amdgpu: Fix missing comment for mb() in 'amdgpu_device_aper_access'Srinivasan Shanmugam1-0/+6
This patch adds the missing code comment for memory barrier WARNING: memory barrier without comment + mb(); WARNING: memory barrier without comment + mb(); Cc: Guchun Chen <guchun.chen@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-11drm/amdkfd: drop IOMMUv2 supportAlex Deucher1-9/+0
Now that we use the dGPU path for all APUs, drop the IOMMUv2 support. v2: drop the now unused queue manager functions for gfx7/8 APUs Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Tested-by: Mike Lothian <mike@fireburn.co.uk> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09drm/amdgpu/irq: Move irq resume to the beginningEmily Deng1-1/+1
Need to move irq resume to the beginning of reset sriov, or if one interrupt occurs before irq resume, then the irq won't work anymore. Signed-off-by: Emily Deng <Emily.Deng@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09drm/amdgpu: Add FRU sysfs nodes only if neededLijo Lazar1-68/+3
Create sysfs nodes for FRU data only if FRU data is available. Move the logic to FRU specific file. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07drm/amd: Disable S/G for APUs when 64GB or more host memoryMario Limonciello1-0/+26
Users report a white flickering screen on multiple systems that is tied to having 64GB or more memory. When S/G is enabled pages will get pinned to both VRAM carve out and system RAM leading to this. Until it can be fixed properly, disable S/G when 64GB of memory or more is detected. This will force pages to be pinned into VRAM. This should fix white screen flickers but if VRAM pressure is encountered may lead to black screens. It's a trade-off for now. Fixes: 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)") Cc: Hamza Mahfooz <Hamza.Mahfooz@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: <stable@vger.kernel.org> # 6.1.y: bf0207e172703 ("drm/amdgpu: add S/G display parameter") Cc: <stable@vger.kernel.org> # 6.4.y Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2735 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2354 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-27drm/amdgpu: Fix ENOSYS means 'invalid syscall nr' in amdgpu_device.cSrinivasan Shanmugam1-29/+31
ENOSYS should be used for nonexistent syscalls only, replace ENOSYS with EOPNOTSUPP for reset handlers that are not implemented for respective ASIC. WARNING: ENOSYS means 'invalid syscall nr' and nothing else + if (r == -ENOSYS) WARNING: ENOSYS means 'invalid syscall nr' and nothing else + if (r == -ENOSYS) And other following style fixes in amdgpu_device.c: WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'. WARNING: Block comments should align the * on each line WARNING: Missing a blank line after declarations WARNING: braces {} are not necessary for single statement blocks Cc: Lijo Lazar <lijo.lazar@amd.com> Cc: Kent Russell <kent.russell@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-25drm/amdgpu: set sw state to gfxoff after SR-IOV resetHorace Chen1-0/+3
[Why] Current SR-IOV will not set GC to off state, while it is a real GC hard reset. Whthout GFX off flag, driver may do gfxhub invalidation before firmware load and gfxhub gart enable. This operation may cause CP to become busy because GC is not in the right state for invalidation. [How] Add a function for SR-IOV to clean up some sw state before recover. Set adev->gfx.is_poweron to false to prevent gfxhub invalidation before gfx firmware autoload complete. Signed-off-by: Horace Chen <horace.chen@amd.com> Reviewed-by: HaiJun Chang <HaiJun.Chang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18drm/amdgpu: Add RLCG interface driver implementation for gfx v9.4.3 (v3)Victor Lu1-2/+3
Add RLCG interface support for gfx v9.4.3 and multiple XCCs. Do not enable it yet. v2: Fix amdgpu_rlcg_reg_access_ctrl init, add support for multiple XCCs in amdgpu_mm_wreg_mmio_rlc v3: Use GET_INST() when indexing amdgpu_rlcg_reg_access_ctrl Signed-off-by: Victor Lu <victorchengchi.lu@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18drm/amdgpu: create a new file for doorbell managerShashank Sharma1-170/+5
This patch: - creates a new file for doorbell management. - moves doorbell code from amdgpu_device.c to this file. V2: - remove doc from function declaration (Christian) - remove 'device' from function names to make it consistent (Alex) - add SPDX license identifier (Luben) V3: - change license to MIT license(Christian) Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-12drm/amd: Move helper for dynamic speed switch check out of smu13Mario Limonciello1-0/+19
This helper is used for checking if the connected host supports the feature, it can be moved into generic code to be used by other smu implementations as well. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-12drm/amdgpu: avoid integer overflow warning in amdgpu_device_resize_fb_bar()Arnd Bergmann1-0/+3
On 32-bit architectures comparing a resource against a value larger than U32_MAX can cause a warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1344:18: error: result of comparison of constant 4294967296 with expression of type 'resource_size_t' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare] res->start > 0x100000000ull) ~~~~~~~~~~ ^ ~~~~~~~~~~~~~~ As gcc does not warn about this in dead code, add an IS_ENABLED() check at the start of the function. This will always return success but not actually resize the BAR on 32-bit architectures without high memory, which is exactly what we want here, as the driver can fall back to bank switching the VRAM access. Fixes: 31b8adab3247 ("drm/amdgpu: require a root bus window above 4GB for BAR resize") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-07drm/amd: Use attribute groups for PSP flashing attributesMario Limonciello1-10/+0
Individually creating attributes can be racy, instead make attributes using attribute groups and control their visibility with an is_visible callback to only show when using appropriate products. v2: squash in fix for PSP 13.0.10 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30drm/amdgpu: enable mcbp by default on gfx9Alex Deucher1-0/+5
It's required for high priority queues. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2535 Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30drm/amdgpu: make mcbp a per device settingAlex Deucher1-4/+15
So we can selectively enable it on certain devices. No intended functional change. Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-23drm/amdgpu: Add vbios attribute only if supportedLijo Lazar1-0/+5
Not all devices carry VBIOS version information. Add the device attribute only if supported. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Fix up kdoc in amdgpu_device.cSrinivasan Shanmugam1-4/+0
Fix these warnings by deleting the deviant arguments. gcc with W=1 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:799: warning: Excess function parameter 'pcie_index' description in 'amdgpu_device_indirect_wreg' drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:799: warning: Excess function parameter 'pcie_data' description in 'amdgpu_device_indirect_wreg' drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:870: warning: Excess function parameter 'pcie_index' description in 'amdgpu_device_indirect_wreg64' drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:870: warning: Excess function parameter 'pcie_data' description in 'amdgpu_device_indirect_wreg64' Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: bypass bios dependent operationsShiwu Zhang1-25/+41
Since bios reading does not work currently so just bypass all operations related to bios v2: hardcode the vram info for APP_APU case (hawking) v3: correct the vram_width with channel number * channel size (lijo) Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: fix incorrect pcie_gen_mask in passthrough caseTong Liu011-3/+3
[why] Passthrough case is treated as root bus and pcie_gen_mask is set as default value that does not support GEN 3 and GEN 4 for PCIe link speed. So PCIe link speed will be downgraded at smu hw init in passthrough condition [how] Move get pci info after detect virtualization and check if it is passthrough case when set pcie_gen_mask Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdkfd: Refactor migrate init to support partition switchPhilip Yang1-1/+3
Rename smv_migrate_init to a better name kgd2kfd_init_zone_device because it setup zone devive pgmap for page migration and keep it in kfd_migrate.c to access static functions svm_migrate_pgmap_ops. Call it only once in amdgpu_device_ip_init after adev ip blocks are initialized, but before amdgpu_amdkfd_device_init initialize kfd nodes which enable SVM support based on pgmap. svm_range_set_max_pages is called by kgd2kfd_device_init everytime after switching compute partition mode. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: add partition scheduler list updateJames Zhu1-0/+2
Add partition scheduler list update in late init and xcp partition mode switch. Signed-off-by: James Zhu <James.Zhu@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: support partition drm devicesJames Zhu1-0/+1
Support partition drm devices on GC_HWIP IP_VERSION(9, 4, 3). This is a temporary solution and will be superceded. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-and-tested-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09drm/amdgpu: Checked if the pointer NULL before use it.Gavin Wan1-8/+12
For SRIOV on some parts, the host driver does not post VBIOS. So the guest cannot get bios information. Therefore, adev->virt.fw_reserve.p_pf2vf and adev->mode_info.atom_context are NULL. Signed-off-by: Gavin Wan <Gavin.Wan@amd.com> Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>