summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2021-07-19drm/amdkfd: use allowed domain for vmbo validationNirmoy Das1-17/+4
[ Upstream commit bc05716d4fdd065013633602c5960a2bf1511b9c ] Fixes handling when page tables are in system memory. v3: remove struct amdgpu_vm_parser. v2: remove unwanted variable. change amdgpu_amdkfd_validate instead of amdgpu_amdkfd_bo_validate. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19drm/amd/amdgpu/sriov disable all ip hw status by defaultJack Zhang1-1/+1
[ Upstream commit 95ea3dbc4e9548d35ab6fbf67675cef8c293e2f5 ] Disable all ip's hw status to false before any hw_init. Only set it to true until its hw_init is executed. The old 5.9 branch has this change but somehow the 5.11 kernrel does not have this fix. Without this change, sriov tdr have gfx IB test fail. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Review-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-30Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full ↵Yifan Zhang1-5/+1
doorbell." commit baacf52a473b24e10322b67757ddb92ab8d86717 upstream. This reverts commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3. Reason for revert: Side effect of enlarging CP_MEC_DOORBELL_RANGE may cause some APUs fail to enter gfxoff in certain user cases. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."Yifan Zhang1-5/+1
commit ee5468b9f1d3bf48082eed351dace14598e8ca39 upstream. This reverts commit 4cbbe34807938e6e494e535a68d5ff64edac3f20. Reason for revert: side effect of enlarging CP_MEC_DOORBELL_RANGE may cause some APUs fail to enter gfxoff in certain user cases. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue.Yifan Zhang1-1/+5
commit 4cbbe34807938e6e494e535a68d5ff64edac3f20 upstream. If GC has entered CGPG, ringing doorbell > first page doesn't wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround this issue. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell.Yifan Zhang1-1/+5
commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3 upstream. If GC has entered CGPG, ringing doorbell > first page doesn't wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround this issue. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10drm/amdgpu: make sure we unpin the UVD BONirmoy Das1-0/+1
commit 07438603a07e52f1c6aa731842bd298d2725b7be upstream. Releasing pinned BOs is illegal now. UVD 6 was missing from: commit 2f40801dc553 ("drm/amdgpu: make sure we unpin the UVD BO") Fixes: 2f40801dc553 ("drm/amdgpu: make sure we unpin the UVD BO") Cc: stable@vger.kernel.org Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10drm/amdgpu: Don't query CE and UE errorsLuben Tuikov1-16/+0
commit dce3d8e1d070900e0feeb06787a319ff9379212c upstream. On QUERY2 IOCTL don't query counts of correctable and uncorrectable errors, since when RAS is enabled and supported on Vega20 server boards, this takes insurmountably long time, in O(n^3), which slows the system down to the point of it being unusable when we have GUI up. Fixes: ae363a212b14 ("drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2") Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03drm/amd/amdgpu: fix a potential deadlock in gpu resetLang Yu1-1/+0
[ Upstream commit 9c2876d56f1ce9b6b2072f1446fb1e8d1532cb3d ] When amdgpu_ib_ring_tests failed, the reset logic called amdgpu_device_ip_suspend twice, then deadlock occurred. Deadlock log: [ 805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110). [ 806.290952] [drm] free PSP TMR buffer [ 806.319406] ============================================ [ 806.320315] WARNING: possible recursive locking detected [ 806.321225] 5.11.0-custom #1 Tainted: G W OEL [ 806.322135] -------------------------------------------- [ 806.323043] cat/2593 is trying to acquire lock: [ 806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.325668] but task is already holding lock: [ 806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.328430] other info that might help us debug this: [ 806.329539] Possible unsafe locking scenario: [ 806.330549] CPU0 [ 806.330983] ---- [ 806.331416] lock(&adev->dm.dc_lock); [ 806.332086] lock(&adev->dm.dc_lock); [ 806.332738] *** DEADLOCK *** [ 806.333747] May be due to missing lock nesting notation [ 806.334899] 3 locks held by cat/2593: [ 806.335537] #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_read+0x4e/0x110 [ 806.337009] #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu] [ 806.339018] #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.340869] stack backtrace: [ 806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G W OEL 5.11.0-custom #1 [ 806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS WLD0C23N_Weekly_20_12_2 12/23/2020 [ 806.344413] Call Trace: [ 806.344849] dump_stack+0x93/0xbd [ 806.345435] __lock_acquire.cold+0x18a/0x2cf [ 806.346179] lock_acquire+0xca/0x390 [ 806.346807] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.347813] __mutex_lock+0x9b/0x930 [ 806.348454] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.349434] ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu] [ 806.350581] ? _raw_spin_unlock_irqrestore+0x47/0x50 [ 806.351437] ? dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.352437] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.353252] ? rcu_read_lock_sched_held+0x4f/0x80 [ 806.354064] mutex_lock_nested+0x1b/0x20 [ 806.354747] ? mutex_lock_nested+0x1b/0x20 [ 806.355457] dm_suspend+0xb8/0x1d0 [amdgpu] [ 806.356427] ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu] [ 806.357736] amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu] [ 806.360394] amdgpu_device_ip_suspend+0x21/0x70 [amdgpu] [ 806.362926] amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu] [ 806.365560] amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu] Signed-off-by: Lang Yu <Lang.Yu@amd.com> Acked-by: Christian KÃnig <christian.koenig@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03drm/amdgpu: Fix a use-after-freexinhui pan1-0/+1
[ Upstream commit 1e5c37385097c35911b0f8a0c67ffd10ee1af9a2 ] looks like we forget to set ttm->sg to NULL. Hit panic below [ 1235.844104] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b7b4b: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI [ 1235.989074] Call Trace: [ 1235.991751] sg_free_table+0x17/0x20 [ 1235.995667] amdgpu_ttm_backend_unbind.cold+0x4d/0xf7 [amdgpu] [ 1236.002288] amdgpu_ttm_backend_destroy+0x29/0x130 [amdgpu] [ 1236.008464] ttm_tt_destroy+0x1e/0x30 [ttm] [ 1236.013066] ttm_bo_cleanup_memtype_use+0x51/0xa0 [ttm] [ 1236.018783] ttm_bo_release+0x262/0xa50 [ttm] [ 1236.023547] ttm_bo_put+0x82/0xd0 [ttm] [ 1236.027766] amdgpu_bo_unref+0x26/0x50 [amdgpu] [ 1236.032809] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7aa/0xd90 [amdgpu] [ 1236.040400] kfd_ioctl_alloc_memory_of_gpu+0xe2/0x330 [amdgpu] [ 1236.046912] kfd_ioctl+0x463/0x690 [amdgpu] Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03drm/amd/amdgpu: fix refcount leakJingwen Chen1-0/+3
[ Upstream commit fa7e6abc75f3d491bc561734312d065dc9dc2a77 ] [Why] the gem object rfb->base.obj[0] is get according to num_planes in amdgpufb_create, but is not put according to num_planes [How] put rfb->base.obj[0] in amdgpu_fbdev_destroy according to num_planes Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03drm/amdgpu/vcn2.5: add cancel_delayed_work_sync before power gateJames Zhu1-0/+2
commit 2fb536ea42d557f39f70c755f68e1aa1ad466c55 upstream. Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03drm/amdgpu/vcn2.0: add cancel_delayed_work_sync before power gateJames Zhu1-0/+2
commit 0c6013377b4027e69d8f3e63b6bf556b6cb87802 upstream. Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-03drm/amdgpu/vcn1: add cancel_delayed_work_sync before power gateJames Zhu1-1/+5
commit b95f045ea35673572ef46d6483ad8bd6d353d63c upstream. Add cancel_delayed_work_sync before set power gating state to avoid race condition issue when power gating. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-26drm/amdgpu: update sdma golden setting for Navi12Guchun Chen1-0/+4
commit 77194d8642dd4cb7ea8ced77bfaea55610574c38 upstream. Current golden setting is out of date. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-26drm/amdgpu: update gc golden setting for Navi12Guchun Chen1-2/+4
commit 99c45ba5799d6b938bd9bd20edfeb6f3e3e039b9 upstream. Current golden setting is out of date. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-26drm/amdgpu: disable 3DCGCG on picasso/raven1 to avoid compute hangChangfeng2-5/+7
commit dbd1003d1252db5973dddf20b24bb0106ac52aa2 upstream. There is problem with 3DCGCG firmware and it will cause compute test hang on picasso/raven1. It needs to disable 3DCGCG in driver to avoid compute hang. Signed-off-by: Changfeng <Changfeng.Zhu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11drm/amdgpu: fix NULL pointer dereferenceGuchun Chen1-1/+1
[ Upstream commit 3c3dc654333f6389803cdcaf03912e94173ae510 ] ttm->sg needs to be checked before accessing its child member. Call Trace: amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] ttm_bo_cleanup_memtype_use+0x3a/0x60 [ttm] ttm_bo_release+0x17d/0x300 [ttm] amdgpu_bo_unref+0x1a/0x30 [amdgpu] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x78b/0x8b0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x118/0x220 [amdgpu] kfd_ioctl+0x222/0x400 [amdgpu] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu] __x64_sys_ioctl+0x8e/0xd0 ? __context_tracking_exit+0x52/0x90 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f97f264d317 Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffdb402c338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f97f3cc63a0 RCX: 00007f97f264d317 RDX: 00007ffdb402c380 RSI: 00000000c0284b16 RDI: 0000000000000003 RBP: 00007ffdb402c380 R08: 00007ffdb402c428 R09: 00000000c4000004 R10: 00000000c4000004 R11: 0000000000000246 R12: 00000000c0284b16 R13: 0000000000000003 R14: 00007f97f3cc63a0 R15: 00007f8836200000 Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-11amdgpu: avoid incorrect %hu format stringArnd Bergmann1-1/+1
[ Upstream commit 7d98d416c2cc1c1f7d9508e887de4630e521d797 ] clang points out that the %hu format string does not match the type of the variables here: drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c:263:7: warning: format specifies type 'unsigned short' but the argument has type 'unsigned int' [-Wformat] version_major, version_minor); ^~~~~~~~~~~~~ include/drm/drm_print.h:498:19: note: expanded from macro 'DRM_ERROR' __drm_err(fmt, ##__VA_ARGS__) ~~~ ^~~~~~~~~~~ Change it to a regular %u, the same way a previous patch did for another instance of the same warning. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Tom Rix <trix@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-11drm/amdgpu : Fix asic reset regression issue introduce by 8f211fe8ac7c4fshaoyunl1-1/+1
[ Upstream commit c8941550aa66b2a90f4b32c45d59e8571e33336e ] This recent change introduce SDMA interrupt info printing with irq->process function. These functions do not require a set function to enable/disable the irq Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-11drm/amdgpu: mask the xgmi number of hops reported from psp to kfdJonathan Kim1-1/+8
[ Upstream commit 4ac5617c4b7d0f0a8f879997f8ceaa14636d7554 ] The psp supplies the link type in the upper 2 bits of the psp xgmi node information num_hops field. With a new link type, Aldebaran has these bits set to a non-zero value (1 = xGMI3) so the KFD topology will report the incorrect IO link weights without proper masking. The actual number of hops is located in the 3 least significant bits of this field so mask if off accordingly before passing it to the KFD. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Amber Lin <amber.lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07drm/amdgpu: check alignment on CPU page for bo mapXℹ Ruoyao1-4/+4
commit e3512fb67093fabdf27af303066627b921ee9bd8 upstream. The page table of AMDGPU requires an alignment to CPU page so we should check ioctl parameters for it. Return -EINVAL if some parameter is unaligned to CPU page, instead of corrupt the page table sliently. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()Nirmoy Das1-1/+1
commit 5e61b84f9d3ddfba73091f9fbc940caae1c9eb22 upstream. Offset calculation wasn't correct as start addresses are in pfn not in bytes. CC: stable@vger.kernel.org Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-30drm/amdgpu: fb BO should be ttm_bo_type_deviceNirmoy Das1-1/+1
[ Upstream commit 521f04f9e3ffc73ef96c776035f8a0a31b4cdd81 ] FB BO should not be ttm_bo_type_kernel type and amdgpufb_create_pinned_object() pins the FB BO anyway. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-09drm/amdgpu: fix parameter error of RREG32_PCIE() in amdgpu_regs_pcieKevin Wang1-2/+2
commit 1aa46901ee51c1c5779b3b239ea0374a50c6d9ff upstream. the register offset isn't needed division by 4 to pass RREG32_PCIE() Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07drm/amdgpu: Add check to prevent IH overflowDefang Bo3-39/+71
[ Upstream commit e4180c4253f3f2da09047f5139959227f5cf1173 ] Similar to commit <b82175750131>("drm/amdgpu: fix IH overflow on Vega10 v2"). When an ring buffer overflow happens the appropriate bit is set in the WPTR register which is also written back to memory. But clearing the bit in the WPTR doesn't trigger another memory writeback. So what can happen is that we end up processing the buffer overflow over and over again because the bit is never cleared. Resulting in a random system lockup because of an infinite loop in an interrupt handler. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Defang Bo <bodefang@126.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04drm/amdgpu: Set reference clock to 100Mhz on Renoir (v2)Alex Deucher1-0/+2
commit 6e80fb8ab04f6c4f377e2fd422bdd1855beb7371 upstream. Fixes the rlc reference clock used for GPU timestamps. Value is 100Mhz. Confirmed with hardware team. v2: reword commit message. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1480 Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04drm/amdgpu: Prevent shift wrapping in amdgpu_read_mask()Dan Carpenter1-3/+3
[ Upstream commit c915ef890d5dc79f483e1ca3b3a5b5f1a170690c ] If the user passes a "level" value which is higher than 31 then that leads to shift wrapping. The undefined behavior will lead to a syzkaller stack dump. Fixes: 5632708f4452 ("drm/amd/powerplay: add dpm force multiple levels on cz/tonga/fiji/polaris (v2)") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04drm/amdgpu: Fix macro name _AMDGPU_TRACE_H_ in preprocessor if conditionChenyang Li1-1/+1
[ Upstream commit 956e20eb0fbb206e5e795539db5469db099715c8 ] Add an underscore in amdgpu_trace.h line 24 "_AMDGPU_TRACE_H". Fixes: d38ceaf99ed0 ("drm/amdgpu: add core driver (v4)") Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Chenyang Li <lichenyang@loongson.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27drm/amdgpu/psp: fix psp gfx ctrl cmdsVictor Zhao1-1/+1
[ Upstream commit f14a5c34d143f6627f0be70c0de1d962f3a6ff1c ] psp GFX_CTRL_CMD_ID_CONSUME_CMD different for windows and linux, according to psp, linux cmds are not correct. v2: only correct GFX_CTRL_CMD_ID_CONSUME_CMD. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19drm/amdgpu: fix a GPU hang issue when remove deviceDennis Li1-2/+2
[ Upstream commit 88e21af1b3f887d217f2fb14fc7e7d3cd87ebf57 ] When GFXOFF is enabled and GPU is idle, driver will fail to access some registers. Therefore change to disable power gating before all access registers with MMIO. Dmesg log is as following: amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device. amdgpu: cp queue pipe 4 queue 0 preemption failed amdgpu 0000:03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2 amdgpu 0000:03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 amdgpu 0000:03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2 amdgpu 0000:03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 Signed-off-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18amd/amdgpu: Disable VCN DPG mode for PicassoVeerabadhran Gopalakrishnan1-2/+1
[ Upstream commit c6d2b0fbb893d5c7dda405aa0e7bcbecf1c75f98 ] Concurrent operation of VCN and JPEG decoder in DPG mode is causing ring timeout due to power state. Signed-off-by: Veerabadhran Gopalakrishnan <veerabadhran.gopalakrishnan@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18drm/amdgpu: perform srbm soft reset always on SDMA resumeEvan Quan1-15/+12
[ Upstream commit 253475c455eb5f8da34faa1af92709e7bb414624 ] This can address the random SDMA hang after pci config reset seen on Hawaii. Signed-off-by: Evan Quan <evan.quan@amd.com> Tested-by: Sandeep Raghuraman <sandy.8925@gmail.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-10drm/amdgpu: add DID for navi10 blockchain SKUTianci.Yin1-0/+1
[ Upstream commit 8942881144a7365143f196f5eafed24783a424a3 ] Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Tianci.Yin <tianci.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-05drm/amdgpu: increase the reserved VM size to 2MBChristian König1-2/+2
commit 55bb919be4e4973cd037a04f527ecc6686800437 upstream. Ideally this should be a multiple of the VM block size. 2MB should at least fit for Vega/Navi. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Madhav Chauhan <madhav.chauhan@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05drm/amdgpu: correct the gpu reset handling for job != NULL caseEvan Quan1-1/+1
commit 207ac684792560acdb9e06f9d707ebf63c84b0e0 upstream. Current code wrongly treat all cases as job == NULL. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-and-tested-by: Jane Jian <Jane.Jian@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05drm/amdgpu: don't map BO in reserved regionMadhav Chauhan1-0/+10
commit c4aa8dff6091cc9536aeb255e544b0b4ba29faf4 upstream. 2MB area is reserved at top inside VM. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Madhav Chauhan <madhav.chauhan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14drm/amdgpu: prevent double kfree ttm->sgPhilip Yang1-0/+1
[ Upstream commit 1d0e16ac1a9e800598dcfa5b6bc53b704a103390 ] Set ttm->sg to NULL after kfree, to avoid memory corruption backtrace: [ 420.932812] kernel BUG at /build/linux-do9eLF/linux-4.15.0/mm/slub.c:295! [ 420.934182] invalid opcode: 0000 [#1] SMP NOPTI [ 420.935445] Modules linked in: xt_conntrack ipt_MASQUERADE [ 420.951332] Hardware name: Dell Inc. PowerEdge R7525/0PYVT1, BIOS 1.5.4 07/09/2020 [ 420.952887] RIP: 0010:__slab_free+0x180/0x2d0 [ 420.954419] RSP: 0018:ffffbe426291fa60 EFLAGS: 00010246 [ 420.955963] RAX: ffff9e29263e9c30 RBX: ffff9e29263e9c30 RCX: 000000018100004b [ 420.957512] RDX: ffff9e29263e9c30 RSI: fffff3d33e98fa40 RDI: ffff9e297e407a80 [ 420.959055] RBP: ffffbe426291fb00 R08: 0000000000000001 R09: ffffffffc0d39ade [ 420.960587] R10: ffffbe426291fb20 R11: ffff9e49ffdd4000 R12: ffff9e297e407a80 [ 420.962105] R13: fffff3d33e98fa40 R14: ffff9e29263e9c30 R15: ffff9e2954464fd8 [ 420.963611] FS: 00007fa2ea097780(0000) GS:ffff9e297e840000(0000) knlGS:0000000000000000 [ 420.965144] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.966663] CR2: 00007f16bfffefb8 CR3: 0000001ff0c62000 CR4: 0000000000340ee0 [ 420.968193] Call Trace: [ 420.969703] ? __page_cache_release+0x3c/0x220 [ 420.971294] ? amdgpu_ttm_tt_unpopulate+0x5e/0x80 [amdgpu] [ 420.972789] kfree+0x168/0x180 [ 420.974353] ? amdgpu_ttm_tt_set_user_pages+0x64/0xc0 [amdgpu] [ 420.975850] ? kfree+0x168/0x180 [ 420.977403] amdgpu_ttm_tt_unpopulate+0x5e/0x80 [amdgpu] [ 420.978888] ttm_tt_unpopulate.part.10+0x53/0x60 [amdttm] [ 420.980357] ttm_tt_destroy.part.11+0x4f/0x60 [amdttm] [ 420.981814] ttm_tt_destroy+0x13/0x20 [amdttm] [ 420.983273] ttm_bo_cleanup_memtype_use+0x36/0x80 [amdttm] [ 420.984725] ttm_bo_release+0x1c9/0x360 [amdttm] [ 420.986167] amdttm_bo_put+0x24/0x30 [amdttm] [ 420.987663] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 420.989165] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x9ca/0xb10 [amdgpu] [ 420.990666] kfd_ioctl_alloc_memory_of_gpu+0xef/0x2c0 [amdgpu] Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-07drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_configJean Delvare1-1/+1
commit a39d0d7bdf8c21ac7645c02e9676b5cb2b804c31 upstream. A recent attempt to fix a ref count leak in amdgpu_display_crtc_set_config() turned out to be doing too much and "fixed" an intended decrease as if it were a leak. Undo that part to restore the proper balance. This is the very nature of this function to increase or decrease the power reference count depending on the situation. Consequences of this bug is that the power reference would eventually get down to 0 while the display was still in use, resulting in that display switching off unexpectedly. Signed-off-by: Jean Delvare <jdelvare@suse.de> Fixes: e008fa6fb415 ("drm/amdgpu: fix ref count leak in amdgpu_display_crtc_set_config") Cc: stable@vger.kernel.org Cc: Navid Emamdoost <navid.emamdoost@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01drm/amdkfd: fix restore worker race conditionPhilip Yang1-3/+3
[ Upstream commit f7646585a30ed8ef5ab300d4dc3b0c1d6afbe71d ] In free memory of gpu path, remove bo from validate_list to make sure restore worker don't access the BO any more, then unregister bo MMU interval notifier. Otherwise, the restore worker will crash in the middle of validating BO user pages if MMU interval notifer is gone. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu resetJack Zhang3-0/+8
[ Upstream commit 04bef61e5da18c2b301c629a209ccdba4d4c6fbb ] kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate Without this change, sriov tdr code path will never free those allocated memories and get memory leak. v2:add a bugfix for kiq ring test fail Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01drm/amdgpu/vcn2.0: stall DPG when WPTR/RPTR resetJames Zhu1-0/+16
[ Upstream commit ef563ff403404ef2f234abe79bdd9f04ab6481c9 ] Add vcn dpg harware synchronization to fix race condition issue between vcn driver and hardware. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01PCI: Use ioremap(), not phys_to_virt() for platform ROMMikel Rychliski1-13/+18
[ Upstream commit 72e0ef0e5f067fd991f702f0b2635d911d0cf208 ] On some EFI systems, the video BIOS is provided by the EFI firmware. The boot stub code stores the physical address of the ROM image in pdev->rom. Currently we attempt to access this pointer using phys_to_virt(), which doesn't work with CONFIG_HIGHMEM. On these systems, attempting to load the radeon module on a x86_32 kernel can result in the following: BUG: unable to handle page fault for address: 3e8ed03c #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP CPU: 0 PID: 317 Comm: systemd-udevd Not tainted 5.6.0-rc3-next-20200228 #2 Hardware name: Apple Computer, Inc. MacPro1,1/Mac-F4208DC8, BIOS MP11.88Z.005C.B08.0707021221 07/02/07 EIP: radeon_get_bios+0x5ed/0xe50 [radeon] Code: 00 00 84 c0 0f 85 12 fd ff ff c7 87 64 01 00 00 00 00 00 00 8b 47 08 8b 55 b0 e8 1e 83 e1 d6 85 c0 74 1a 8b 55 c0 85 d2 74 13 <80> 38 55 75 0e 80 78 01 aa 0f 84 a4 03 00 00 8d 74 26 00 68 dc 06 EAX: 3e8ed03c EBX: 00000000 ECX: 3e8ed03c EDX: 00010000 ESI: 00040000 EDI: eec04000 EBP: eef3fc60 ESP: eef3fbe0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206 CR0: 80050033 CR2: 3e8ed03c CR3: 2ec77000 CR4: 000006d0 Call Trace: r520_init+0x26/0x240 [radeon] radeon_device_init+0x533/0xa50 [radeon] radeon_driver_load_kms+0x80/0x220 [radeon] drm_dev_register+0xa7/0x180 [drm] radeon_pci_probe+0x10f/0x1a0 [radeon] pci_device_probe+0xd4/0x140 Fix the issue by updating all drivers which can access a platform provided ROM. Instead of calling the helper function pci_platform_rom() which uses phys_to_virt(), call ioremap() directly on the pdev->rom. radeon_read_platform_bios() previously directly accessed an __iomem pointer. Avoid this by calling memcpy_fromio() instead of kmemdup(). pci_platform_rom() now has no remaining callers, so remove it. Link: https://lore.kernel.org/r/20200319021623.5426-1-mikel@mikelr.com Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01drm/amdgpu: increase atombios cmd timeoutJohn Clements1-2/+2
[ Upstream commit 1b3460a8b19688ad3033b75237d40fa580a5a953 ] mitigates race condition on BACO reset between GPU bootcode and driver reload Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01drm/amdgpu: fix calltrace during kmd unload(v3)Monk Liu5-144/+6
[ Upstream commit 82a829dc8c2bb03cc9b7e5beb1c5479aa3ba7831 ] issue: kernel would report a warning from a double unpin during the driver unloading on the CSB bo why: we unpin it during hw_fini, and there will be another unpin in sw_fini on CSB bo. fix: actually we don't need to pin/unpin it during hw_init/fini since it is created with kernel pinned, we only need to fullfill the CSB again during hw_init to prevent CSB/VRAM lost after S3 v2: get_csb in init_rlc so hw_init() will make CSIB content back even after reset or s3 v3: use bo_create_kernel instead of bo_create_reserved for CSB otherwise the bo_free_kernel() on CSB is not aligned and would lead to its internal reserve pending there forever take care of gfx7/8 as well Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Xiaojie Yuan <xiaojie.yuan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03drm/amdgpu/gfx10: refine mgcg settingJiansong Chen1-4/+2
commit de7a1b0b8753e1b0000084f0e339ffab295d27ef upstream. 1. enable ENABLE_CGTS_LEGACY to fix specviewperf11 random hang. 2. remove obsolete RLC_CGTT_SCLK_OVERRIDE workaround. Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03drm/amdgpu: Fix buffer overflow in INFO ioctlAlex Deucher1-0/+4
commit b5b97cab55eb71daba3283c8b1d2cce456d511a1 upstream. The values for "se_num" and "sh_num" come from the user in the ioctl. They can be in the 0-255 range but if they're more than AMDGPU_GFX_MAX_SE (4) or AMDGPU_GFX_MAX_SH_PER_SE (2) then it results in an out of bounds read. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03drm/amdgpu/display: fix ref count leak when pm_runtime_get_sync failsNavid Emamdoost1-4/+12
[ Upstream commit f79f94765f8c39db0b7dec1d335ab046aac03f20 ] The call to pm_runtime_get_sync increments the counter even in case of failure, leading to incorrect ref count. In case of failure, decrement the ref count before returning. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03drm/amdgpu: fix ref count leak in amdgpu_display_crtc_set_configNavid Emamdoost1-2/+3
[ Upstream commit e008fa6fb41544b63973a529b704ef342f47cc65 ] in amdgpu_display_crtc_set_config, the call to pm_runtime_get_sync increments the counter even in case of failure, leading to incorrect ref count. In case of failure, decrement the ref count before returning. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03drm/amd/display: fix ref count leak in amdgpu_drm_ioctlNavid Emamdoost1-1/+2
[ Upstream commit 5509ac65f2fe5aa3c0003237ec629ca55024307c ] in amdgpu_drm_ioctl the call to pm_runtime_get_sync increments the counter even in case of failure, leading to incorrect ref count. In case of failure, decrement the ref count before returning. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>