kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message	Author	Files	Lines
2021-04-09	drm/amdgpu: page retire over debugfs mechanism	John Clements	1	-0/+67
	added support in RAS debugfs to add bad page for isolated page retirement testing Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: enable ras eeprom on aldebaran	John Clements	1	-1/+7
	enable ras eeprom loading by default on aldebaran Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: RAS harvest on driver load	John Clements	1	-0/+29
	In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: add DMUB outbox event IRQ source define/complete/debug flag	Jude Shih	1	-0/+1
	[Why & How] We use outbox interrupt that allows us to do the AUX via DMUB Therefore, we need to add some irq source related definition in the header files; Signed-off-by: Jude Shih <shenshih@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Fix size overflow	xinhui pan	1	-1/+1
	ttm->num_pages is uint32. Hit overflow when << PAGE_SHIFT directly Fixes: 230c079fdcf4 ("drm/ttm: make num_pages uint32_t") Signed-off-by: xinhui pan <xinhui.pan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
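The overflow described in this commit message can be reproduced outside the driver. The following is a minimal sketch, assuming 4 KiB pages and a platform where `uint32_t` is `unsigned int`; `DEMO_PAGE_SHIFT`, `size_truncated` and `size_widened` are hypothetical names used only for illustration and are not taken from the patch. Widening before the shift is one common way to avoid the truncation.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Problematic form: the shift is evaluated in 32 bits, so the high bits
 * are discarded before the result is widened to 64 bits. */
uint64_t size_truncated(uint32_t num_pages)
{
	return num_pages << DEMO_PAGE_SHIFT;
}

/* Widened form: convert to 64 bits first, then shift. */
uint64_t size_widened(uint32_t num_pages)
{
	return (uint64_t)num_pages << DEMO_PAGE_SHIFT;
}
```

With `num_pages` equal to `1u << 20` (4 GiB worth of 4 KiB pages), the first helper wraps to zero while the second returns the full byte count.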
2021-04-09	drm/amdgpu: move mmhub ras_func init to ip specific file	Hawking Zhang	2	-19/+19
	mmhub ras is always owned by gpu driver. ras_funcs initialization shall be done at ip level, instead of putting it in common gmc interface file Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Remove unused function amdgpu_bo_fbdev_mmap()	Thomas Zimmermann	2	-21/+0
	Remove an unused function. Mapping the fbdev framebuffer is apparently not supported. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	Revert "drm/amdgpu: Ensure that the modifier requested is supported by plane."	Qingqing Zhuo	1	-13/+0
	This reverts commit f4a9be998c8ee39a30a68cb775c91928fe10a384. The original commit was found to cause the following two issues on sienna cichlid: 1. Refresh rate locked during vrrdemo 2. Display sticks on flipped landscape mode after changing orientation, and cannot be changed back to regular landscape Signed-off-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: split gfx callbacks into ras and non-ras ones	Hawking Zhang	8	-87/+92
	gfx ras is only available in certain ip generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: split mmhub callbacks into ras and non-ras ones	Hawking Zhang	13	-29/+74
	mmhub ras is only available in certain mmhub ip generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: do not register df_mca interrupt in certain config	Hawking Zhang	1	-2/+4
	df/mca ras is not managed by gpu driver when gpu is connected to cpu through xgmi. gpu driver should register x86 mca notifier for umc ras error notification Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: split umc callbacks to ras and non-ras ones	Hawking Zhang	12	-32/+51
	umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: move xgmi ras functions to xgmi_ras_funcs	Hawking Zhang	6	-17/+42
	xgmi ras is not managed by gpu driver when gpu is connected to cpu through xgmi. move all xgmi ras functions to xgmi_ras_funcs so gpu driver only initializes xgmi ras functions when it manages xgmi ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: split nbio callbacks into ras and non-ras ones	Hawking Zhang	6	-30/+63
	nbio ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split nbio callbacks into ras and non-ras ones so gpu driver only initializes nbio ras callbacks when it manages nbio ras. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: implement query_ras_error_address callback	Hawking Zhang	1	-0/+90
	query_ras_error_address will be invoked to query bad page address when there is poison data in HBM consumed by GPU engines. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: implement umc query error count callback	Hawking Zhang	2	-0/+92
	umc query_ras_error_count will be invoked to query umc correctable and uncorrectable error. It will reset the umc ras error counter after the query. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: add helper funtion to query umc ras error	Hawking Zhang	2	-0/+77
	Add helper functions to query correctable and uncorrectable umc ras error. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: create umc_v6_7_funcs for aldebaran	Hawking Zhang	3	-1/+58
	umc_v6_7_funcs are callbacks to support umc ras functionalities in aldebaran Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: initialze ras caps per paltform config	Hawking Zhang	1	-12/+23
	Driver only manages GFX/SDMA/MMHUB RAS in platforms that gpu node is connected to cpu through XGMI, other than that, it queries VBIOS for RAS capabilities. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: drop some unused atombios functions	Alex Deucher	2	-163/+0
	These were leftover from the old CI dpm code which was retired a while ago. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amd: cleanup coding style a bit	Bernard Zhao	1	-4/+3
	Fix patch check warning: WARNING: suspect code indent for conditional statements (8, 17) + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); WARNING: braces {} are not necessary for single statement blocks + if (obj && obj->use < 0) { + DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name); + } Signed-off-by: Bernard Zhao <bernard@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: support sdma error injection	Stanley.Yang	1	-0/+1
	Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reivewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: reserve fence slot to update page table	Philip Yang	1	-2/+8
	Forgot to reserve a fence slot to use sdma to update page table, cause below kernel BUG backtrace to handle vm retry fault while application is exiting. [ 133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281! [ 133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu] [ 133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280 [ 133.048672] amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu] [ 133.048788] amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu] [ 133.048905] amdgpu_vm_handle_fault+0x202/0x370 [amdgpu] [ 133.049031] gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu] [ 133.049165] ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu] [ 133.049289] ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049408] amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049534] amdgpu_ih_process+0x9b/0x1c0 [amdgpu] [ 133.049657] amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu] [ 133.049669] process_one_work+0x29f/0x640 [ 133.049678] worker_thread+0x39/0x3f0 [ 133.049685] ? process_one_work+0x640/0x640 Signed-off-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou	5	-63/+164
	1. expand rlcg interface for gc & mmhub indirect access 2. add rlcg interface for no kiq v2: squash in fix for gfx9 (Changfeng) Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou	2	-0/+19
	using the control bits got from host to control registers access. Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou	2	-0/+13
	get pf2vf msg info at its earliest time so that guest driver can use this info to decide whether register indirect access is enabled. Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: indirect register access for nv12 sriov	Peng Ju Zhou	1	-3/+3
	unify host driver and guest driver indirect access control bits names Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com> Reviewed-by: Emily.Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: check alignment on CPU page for bo map	Xi Ruoyao	1	-4/+4
	The page table of AMDGPU requires an alignment to CPU page so we should check ioctl parameters for it. Return -EINVAL if some parameter is unaligned to CPU page, instead of silently corrupting the page table. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
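The kind of check this commit message describes can be sketched as a standalone function. This is an illustration under stated assumptions, not the driver's code: `check_bo_map_alignment` and its parameter names are hypothetical, and the CPU page size is passed in explicitly and assumed to be a power of two.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Returns 0 when every parameter is a multiple of the CPU page size and
 * -EINVAL otherwise. cpu_page_size must be a non-zero power of two. */
int check_bo_map_alignment(uint64_t va, uint64_t offset, uint64_t size,
			   uint64_t cpu_page_size)
{
	uint64_t mask = cpu_page_size - 1;

	assert(cpu_page_size != 0 && (cpu_page_size & mask) == 0);

	if ((va & mask) || (offset & mask) || (size & mask))
		return -EINVAL;
	return 0;
}
```

On a system with 64 KiB CPU pages, an address aligned only to 4 KiB is rejected up front instead of being mapped.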
2021-04-09	drm/amdgpu: Set a suitable dev_info.gart_page_size	Huacai Chen	1	-2/+2
	In Mesa, dev_info.gart_page_size is used for alignment and it was set to AMDGPU_GPU_PAGE_SIZE(4KB). However, the page table of AMDGPU driver requires an alignment on CPU pages. So, for non-4KB page system, gart_page_size should be max_t(u32, PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE). Signed-off-by: Rui Wang <wangr@lemote.com> Signed-off-by: Huacai Chen <chenhc@lemote.com> Link: https://github.com/loongson-community/linux-stable/commit/caa9c0a1 [Xi: rebased for drm-next, use max_t for checkpatch, and reworded commit message.] Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Tested-by: Dan Horák <dan@danny.cz> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
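The rule stated in this commit message, taking the larger of the CPU page size and the 4 KiB GPU page size, can be written as a small helper. This is a sketch only: `DEMO_GPU_PAGE_SIZE` and `demo_gart_page_size` are hypothetical stand-ins, and the CPU page size is a parameter here rather than the kernel's `PAGE_SIZE` macro.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_GPU_PAGE_SIZE 4096u /* stand-in for the 4 KiB GPU page size */

/* Alignment reported to user space: never smaller than a CPU page. */
uint32_t demo_gart_page_size(uint32_t cpu_page_size)
{
	return cpu_page_size > DEMO_GPU_PAGE_SIZE ? cpu_page_size
						   : DEMO_GPU_PAGE_SIZE;
}
```

On a 4 KiB-page system the value is unchanged; on a 16 KiB or 64 KiB system it grows to the CPU page size.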
2021-04-09	drm/amdgpu: fix compiler warning(v2)	Guchun Chen	1	-3/+1
	warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] int write = !(gtt->userflags & AMDGPU_GEM_USERPTR_READONLY); v2: put short variable declaration last Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: fix NULL pointer dereference	Guchun Chen	1	-1/+1
	ttm->sg needs to be checked before accessing its child member. Call Trace: amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] ttm_bo_cleanup_memtype_use+0x3a/0x60 [ttm] ttm_bo_release+0x17d/0x300 [ttm] amdgpu_bo_unref+0x1a/0x30 [amdgpu] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x78b/0x8b0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x118/0x220 [amdgpu] kfd_ioctl+0x222/0x400 [amdgpu] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu] __x64_sys_ioctl+0x8e/0xd0 ? __context_tracking_exit+0x52/0x90 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f97f264d317 Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffdb402c338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f97f3cc63a0 RCX: 00007f97f264d317 RDX: 00007ffdb402c380 RSI: 00000000c0284b16 RDI: 0000000000000003 RBP: 00007ffdb402c380 R08: 00007ffdb402c428 R09: 00000000c4000004 R10: 00000000c4000004 R11: 0000000000000246 R12: 00000000c0284b16 R13: 0000000000000003 R14: 00007f97f3cc63a0 R15: 00007f8836200000 Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add new PF2VF flags for VF register access method	Rohit Khaire	2	-2/+26
	Add 3 sub flags to notify guest for indirect reg access of gc, mmhub and ih The host sets these flags depending on L1 RAP version, asic and other scenarios. These flags ensure that there is compatibility between different guest/host/vbios versions. Signed-off-by: Rohit Khaire <rohit.khaire@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran	Feifei Xu	1	-2/+1
	This message is not needed on Aldebaran. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amd/amdgpu: set MP1 state to UNLOAD before reload its FW for vega20/ALDEBARAN	Chengming Gui	1	-1/+3
	When resume from gpu reset, need set MP1 state to UNLOAD before reload SMU FW otherwise will cause following errors: [ 121.642772] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR [ 123.801051] [drm] failed to load ucode id (24) [ 123.801055] [drm] psp command (0x6) failed and response status is (0x0) [ 123.801214] [drm:psp_load_smu_fw [amdgpu]] ERROR PSP load smu failed! [ 123.801398] [drm:psp_resume [amdgpu]] ERROR PSP resume failed [ 123.801536] [drm:amdgpu_device_fw_loading [amdgpu]] ERROR resume of IP block <psp> failed -22 [ 123.801632] amdgpu 0000:04:00.0: amdgpu: GPU reset(9) failed [ 123.801691] amdgpu 0000:07:00.0: amdgpu: GPU reset(9) failed [ 123.802899] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -22 v2: add error info and including ALDEBARAN also Signed-off-by: Chengming Gui <Jack.Gui@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Reset error code for 'no handler' case	Lijo Lazar	1	-3/+8
	If reset handler is not implemented, reset error before proceeding. Fixes issue with the following trace - [ 106.508592] amdgpu 0000:b1:00.0: amdgpu: ASIC reset failed with error, -38 for drm dev, 0000:b1:00.0 [ 106.508972] amdgpu 0000:b1:00.0: amdgpu: GPU reset succeeded, trying to resume [ 106.509116] [drm] PCIE GART of 512M enabled. [ 106.509120] [drm] PTB located at 0x0000008000000000 [ 106.509136] [drm] VRAM is lost due to GPU reset! [ 106.509332] [drm] PSP is resuming... Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: ih reroute for newer asics than vega20	Alex Sierra	1	-3/+3
	Starting Arcturus, it supports ih reroute through mmio directly in bare metal environment. This is also valid for newer asics such as Aldebaran. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()	Nirmoy Das	1	-1/+1
	Offset calculation wasn't correct as start addresses are in pfn not in bytes. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
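The unit mismatch named in this commit message (start addresses in page frames, offsets in bytes) can be illustrated with a small sketch. The structure and helper below are hypothetical, 4 KiB pages are assumed, and the code shows only the general conversion, not the exact line the patch changes.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

struct demo_mapping {
	uint64_t start_pfn; /* first page frame covered by the mapping */
	uint64_t last_pfn;  /* last page frame covered by the mapping */
	uint64_t offset;    /* byte offset into the backing buffer */
};

/* Byte offset for the tail of a mapping that is split at new_start_pfn.
 * The difference between two page frame numbers is a page count, so it
 * has to be converted to bytes before it is added to a byte offset. */
uint64_t demo_tail_offset(const struct demo_mapping *m, uint64_t new_start_pfn)
{
	assert(new_start_pfn >= m->start_pfn);
	return m->offset + ((new_start_pfn - m->start_pfn) << DEMO_PAGE_SHIFT);
}
```

Adding the raw page-frame difference without the shift would leave the tail offset short by a factor of the page size.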
2021-04-09	drm/amd/pm: unify the interface for gfx state setting	Evan Quan	1	-10/+6
	No need to have special handling for swSMU supported ASICs. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amd/pm: unify the interface for loading SMU microcode	Evan Quan	1	-8/+2
	No need to have special handling for swSMU supported ASICs. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Fix build warnings	Lijo Lazar	2	-2/+2
	Fix header guard and make internal functions static. Fixes the below warnings: drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu_reset.h:24:9: warning: '__AMDUGPU_RESET_H__' is used as a header guard here, followed by #define of a different macro [-Wheader-guard] drivers/gpu/drm/amd/amdgpu/aldebaran.c:110:6: warning: no previous prototype for function 'aldebaran_async_reset' [-Wmissing-prototypes] drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/aldebaran_ppt.c:1435:5: warning: no previous prototype for function 'aldebaran_mode2_reset' [-Wmissing-prototypes] Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Enable recovery on aldebaran	Lijo Lazar	1	-0/+1
	Add aldebaran to devices which support recovery Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add mode2 reset support for aldebaran	Lijo Lazar	5	-2/+463
	v1: Aldebaran uses reset control to support mode2 reset. The sequences to reset and restore hardware context are specific to a particular configuration. v2: Clear bus mastering before reset. Fix coding style issues, drop unwanted variables and info log. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Make set PG/CG state functions public	Lijo Lazar	2	-3/+9
	Expose PG/CG set states functions for other clients Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add PSP public function to load a list of FWs	Lijo Lazar	2	-0/+19
	v1: Adds a function to load a list of FWs as passed by the caller. This is needed as only a select few need to be loaded for some use cases. v2: Omit unrelated change, remove info log, fix return value when count is 0 Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add reset control handling to reset workflow	Lijo Lazar	3	-39/+97
	This prefers reset control based handling if it's implemented for a particular ASIC. If not, it takes the legacy path. It uses the legacy method of preparing environment (job, scheduler tasks) and restoring environment. v2: remove unused variable (Alex) Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: Add reset control to amdgpu_device	Lijo Lazar	4	-0/+175
	v1: Add generic amdgpu_reset_control to handle different types of resets. It may be added at device, hive or ip level. Each reset control has a list of handlers associated with it to handle different types of reset. Reset control is responsible for choosing the right handler given a particular reset context. Handler objects may implement a set of functions on how to handle a particular type of reset. prepare_env = Prepare environment/software context (not used currently). prepare_hwcontext = Prepare hardware context for the reset. perform_reset = Perform the type of reset. restore_hwcontext = Restore the hw context after reset. restore_env = Restore the environment after reset (not used currently). Reset context carries the context of reset, as of now this is based on the parameters used for current set of resets. v2: Fix coding style Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
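The handler layout described in this commit message can be sketched with a table of function pointers. Only the five callback names (`prepare_env`, `prepare_hwcontext`, `perform_reset`, `restore_hwcontext`, `restore_env`) come from the message; every `demo_` type and function, the method enum, and the choice to return `-ENOSYS` when no handler matches are assumptions made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_reset_method { DEMO_RESET_MODE1, DEMO_RESET_MODE2 };

struct demo_reset_context {
	enum demo_reset_method method;
	int steps_run; /* bookkeeping for this demo only */
};

typedef int (*demo_step_fn)(struct demo_reset_context *ctx);

struct demo_reset_handler {
	enum demo_reset_method method;
	demo_step_fn prepare_env;       /* may be NULL */
	demo_step_fn prepare_hwcontext;
	demo_step_fn perform_reset;
	demo_step_fn restore_hwcontext;
	demo_step_fn restore_env;       /* may be NULL */
};

struct demo_reset_control {
	const struct demo_reset_handler *handlers;
	size_t num_handlers;
};

/* A missing callback counts as a successful no-op. */
static int run_step(demo_step_fn step, struct demo_reset_context *ctx)
{
	return step ? step(ctx) : 0;
}

/* Pick the handler matching the context and run its steps in order,
 * stopping at the first failure. */
int demo_reset(const struct demo_reset_control *ctl,
	       struct demo_reset_context *ctx)
{
	for (size_t i = 0; i < ctl->num_handlers; i++) {
		const struct demo_reset_handler *h = &ctl->handlers[i];
		int r;

		if (h->method != ctx->method)
			continue;
		if ((r = run_step(h->prepare_env, ctx)) != 0)
			return r;
		if ((r = run_step(h->prepare_hwcontext, ctx)) != 0)
			return r;
		if ((r = run_step(h->perform_reset, ctx)) != 0)
			return r;
		if ((r = run_step(h->restore_hwcontext, ctx)) != 0)
			return r;
		return run_step(h->restore_env, ctx);
	}
	return -ENOSYS; /* assumption: no handler for this reset type */
}

/* Trivial step used to exercise the sketch. */
int demo_count_step(struct demo_reset_context *ctx)
{
	ctx->steps_run++;
	return 0;
}
```

A caller would fill in a handler for each reset type it supports and fall back to another path when `demo_reset` reports that no handler matched.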
2021-04-09	drm/amd/amdgpu implement tdr advanced mode	Jack Zhang	2	-1/+82
	[Why] Previous tdr design treats the first job in job_timeout as the bad job. But sometimes a later bad compute job can block a good gfx job and cause an unexpected gfx job timeout because gfx and compute ring share internal GC HW mutually. [How] This patch implements an advanced tdr mode. It involves an additional synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit step in order to find the real bad job. 1. At Step0 Resubmit stage, it synchronously submits and pends for the first job being signaled. If it gets timeout, we identify it as guilty and do hw reset. After that, we would do the normal resubmit step to resubmit left jobs. 2. For whole gpu reset(vram lost), do resubmit as the old way. v2: squash in build fix (Alex) Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: make BO type check less restrictive	Nirmoy Das	1	-4/+4
	BO with ttm_bo_type_sg type can also have tiling_flag and metadata. So do the BO type check only for ttm_bo_type_kernel. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reported-by: Tom StDenis <Tom.StDenis@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: use amdgpu_bo_user bo for metadata and tiling flag	Nirmoy Das	3	-22/+35
	Tiling flag and metadata are only needed for BOs created by amdgpu_gem_object_create(), so we can remove those from the base class. v2: * squash tiling_flags and metadata related patches into one * use BUG_ON for non ttm_bo_type_device type when accessing tiling_flags and metadata. v3: * include to_amdgpu_bo_user Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09	drm/amdgpu: use amdgpu_bo_create_user() for when possible	Nirmoy Das	2	-2/+6
	Use amdgpu_bo_create_user() for all the BO allocations for ttm_bo_type_device type. v2: include amdgpu_amdkfd_alloc_gws() as well it calls amdgpu_bo_create() for ttm_bo_type_device Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>