path: root/drivers/gpu/drm/amd/amdkfd
Age | Commit message | Author | Files | Lines
2022-02-08 | drm/amdkfd: CRIU prepare for svm resume | Rajneesh Bhardwaj | 4 | -2/+73
During the CRIU restore phase, the VMAs for the virtual address ranges are not yet at their final location, so at this stage only cache the data required to successfully resume the svm ranges during the imminent CRIU resume phase. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Save Shared Virtual Memory ranges | Rajneesh Bhardwaj | 3 | -1/+108
During the checkpoint stage, save the shared virtual memory ranges and attributes for the target process. A process may contain a number of svm ranges and each range might contain a number of attributes. Not all attributes may be applicable to a given prange, but during checkpoint we store all possible values for the maximum possible number of attribute types. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Discover svm ranges | Rajneesh Bhardwaj | 4 | -6/+81
A KFD process may contain a number of virtual address ranges for shared virtual memory management, and each such range can have many SVM attributes spanning various nodes within the process boundary. This change reports the total number of such SVM ranges and their total private data size by extending the PROCESS_INFO op of the CRIU IOCTL to discover the svm ranges in the target process. Future patches bring in the required support for checkpoint and restore of SVM ranges. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: use user_gpu_id for svm ranges | Rajneesh Bhardwaj | 1 | -2/+2
Currently the SVM ranges use actual_gpu_id, but with Checkpoint Restore support it's possible that the SVM ranges are resumed on another node where the actual_gpu_id may not be the same as the original (user_gpu_id) gpu id. So modify the svm code to use user_gpu_id. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU allow external mm for svm ranges | Rajneesh Bhardwaj | 1 | -8/+9
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct from current but for a Checkpoint or Restore operation, the current->mm will fetch the mm for the CRIU master process. So modify these helpers to accept the task mm for a target kfd process to support Checkpoint Restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU checkpoint and restore xnack mode | Rajneesh Bhardwaj | 2 | -0/+16
Recoverable page faults are represented by the xnack mode setting inside a kfd process and are used to represent the device page faults. For CR, we don't consider negative values which are typically used for querying the current xnack mode without modifying it. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU export BOs as prime dmabuf objects | Rajneesh Bhardwaj | 1 | -2/+69
KFD buffer objects have no GEM handle associated with them, so they cannot be used directly with libdrm to initiate a system DMA (sDMA) operation to speed up checkpoint and restore. Export them as dmabuf objects and use them with the libdrm helper (amdgpu_bo_import) to process the sDMA command submissions. With sDMA, we see a huge improvement in checkpoint and restore operations compared to the generic PCI-based access via the host data path. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU implement gpu_id remapping | David Yat Sin | 5 | -160/+414
When doing a restore on a different node, the gpu_id's on the restore node may be different, but the user space application will still use the original gpu_id's in its ioctl calls. Add code to create a gpu id mapping so that kfd can determine the actual gpu_id during the user ioctl's. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
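The gpu_id remapping described in this entry can be pictured as a small per-process lookup table. The following is only a userspace sketch under stated assumptions: the struct, the function name `map_user_gpu_id`, and the linear search are all hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one per GPU visible to the process. */
struct gpu_id_map_entry {
    uint32_t user_gpu_id;   /* id the application saw at checkpoint time */
    uint32_t actual_gpu_id; /* id of the device on the restore node */
};

/*
 * Translate a user-visible gpu_id into the actual one.
 * Returns 0 on success, -1 if the id is unknown to this process.
 */
static int map_user_gpu_id(const struct gpu_id_map_entry *map, size_t n,
                           uint32_t user_gpu_id, uint32_t *actual_gpu_id)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].user_gpu_id == user_gpu_id) {
            *actual_gpu_id = map[i].actual_gpu_id;
            return 0;
        }
    }
    return -1;
}
```

In this sketch every ioctl that receives a gpu_id from user space would first pass it through the lookup, so the application can keep using the ids it recorded before the checkpoint.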
2022-02-08 | drm/amdkfd: CRIU checkpoint and restore events | David Yat Sin | 3 | -89/+280
Add support to existing CRIU ioctl's to save and restore events during criu checkpoint and restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU checkpoint and restore queue control stack | David Yat Sin | 11 | -53/+138
Checkpoint contents of queue control stacks on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU checkpoint and restore queue mqds | David Yat Sin | 11 | -26/+516
Checkpoint contents of queue MQD's on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU restore queue doorbell id | David Yat Sin | 1 | -19/+41
When re-creating queues during CRIU restore, restore the queue with the same doorbell id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU restore sdma id for queues | David Yat Sin | 3 | -15/+40
When re-creating queues during CRIU restore, restore the queue with the same sdma id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU restore queue ids | David Yat Sin | 4 | -9/+34
When re-creating queues during CRIU restore, restore the queue with the same queue id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU add queues support | David Yat Sin | 3 | -8/+353
Add support to existing CRIU ioctl's to save number of queues and queue properties for each queue during checkpoint and re-create queues on restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Implement KFD unpause operation | David Yat Sin | 3 | -1/+39
Introduce the UNPAUSE op. After the CRIU amdgpu plugin performs a PROCESS_INFO op, the queues will stay in an evicted state. Once the plugin is done draining BO contents, it is safe to perform an UNPAUSE op for the queues to resume. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Implement KFD resume ioctl | Rajneesh Bhardwaj | 3 | -10/+67
This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing a CRIU restore, MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process, i.e. the criu_resume ioctl op is received and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since it's called by the CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Implement KFD restore ioctl | Rajneesh Bhardwaj | 1 | -1/+297
This implements the KFD CRIU Restore ioctl that lays the basic foundation for the CRIU restore operation. It provides support to create the buffer objects corresponding to the checkpointed image. This ioctl creates various types of buffer objects such as VRAM, MMIO, Doorbell, GTT based on the data sent from the userspace plugin. The data mostly contains the previously checkpointed KFD images from some KFD process. While restoring a criu process, attach old IDR values to newly created BOs. This also adds the minimal gpu mapping support for a single gpu checkpoint restore use case. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Implement KFD checkpoint ioctl | Rajneesh Bhardwaj | 2 | -2/+179
This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Implement KFD process_info ioctl | Rajneesh Bhardwaj | 1 | -1/+55
This IOCTL op is expected to be called as a precursor to the actual Checkpoint operation. It does the basic discovery into the target process seized by CRIU and relays the information to userspace, which uses it to start the Checkpoint operation via another dedicated IOCTL op. The process_info IOCTL op determines the number of GPUs and buffer objects that are associated with the target process, and its process id in the caller's namespace. The namespace pid is needed because the /proc/pid/mem interface may be used to drain the contents of the discovered buffer objects in userspace, while getpid returns the pid of the CRIU dumper process. Also, the pid of a process inside a container might differ from its global pid, so return the ns pid. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-08 | drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs | Rajneesh Bhardwaj | 2 | -2/+161
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can snapshot a running process and later restore it on the same or a remote machine, but it expects processes that have a device file (e.g. GPU) associated with them to provide the necessary driver support to assist CRIU and its extensible plugin interface. Thus, in order to support the Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute kernel driver needs to provide a set of new APIs that provide the necessary VRAM metadata and its contents to a userspace component (CRIU plugin) that can store it in the form of image files. This introduces some new ioctls which will be used to Checkpoint-Restore any KFD bound user process. KFD only allows ioctl calls from the same process that opened the KFD file descriptor. Since these ioctls are expected to be called from a KFD criu plugin, which has elevated ptrace-attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached to the file descriptors, modify KFD to allow such calls. (API redesigned by David Yat Sin) Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-03 | drm/amdkfd: Fix variable set but not used warning | Philip Yang | 1 | -3/+0
All warnings (new ones prefixed by >>):
   drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c: In function 'svm_range_deferred_list_work':
>> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2067:22: warning: variable 'p' set but not used [-Wunused-but-set-variable]
    2067 | struct kfd_process *p;
         |
Fixes: 367c9b0f1b8750 ("drm/amdkfd: Ensure mm remain valid in svm deferred_list work") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-By: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 | drm/amdkfd: svm range restore work deadlock when process exit | Philip Yang | 2 | -7/+9
kfd_process_notifier_release flushes svm_range_restore_work, which calls svm_range_list_lock_and_flush_work to flush the deferred_list work. But if the deferred_list work's mmput releases the last user, it will call exit_mmap -> notifier_release, which deadlocks with the backtrace below. Move the flush of svm_range_restore_work to kfd_process_wq_release to avoid the deadlock. svm_range_restore_work then takes a task->mm reference so that the mm is not gone while validating and mapping ranges to the GPU.
Workqueue: events svm_range_deferred_list_work [amdgpu]
Call Trace:
 wait_for_completion+0x94/0x100
 __flush_work+0x12a/0x1e0
 __cancel_work_timer+0x10e/0x190
 cancel_delayed_work_sync+0x13/0x20
 kfd_process_notifier_release+0x98/0x2a0 [amdgpu]
 __mmu_notifier_release+0x74/0x1f0
 exit_mmap+0x170/0x200
 mmput+0x5d/0x130
 svm_range_deferred_list_work+0x104/0x230 [amdgpu]
 process_one_work+0x220/0x3c0
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reported-by: Ruili Ji <ruili.ji@amd.com> Tested-by: Ruili Ji <ruili.ji@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 | drm/amdkfd: Ensure mm remain valid in svm deferred_list work | Philip Yang | 1 | -26/+36
svm_deferred_list work should continue to handle deferred_range_list, which may be split into child ranges, to avoid leaking child ranges, and should remove the ranges' mmu interval notifiers to avoid leaking the mm's mm_count. So take an mm reference when adding a range to the deferred list, to ensure the mm is valid in the scheduled deferred_list_work, and drop the mm reference after the range is handled. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reported-by: Ruili Ji <ruili.ji@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
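The reference pattern in this entry (pin the mm when the range is queued, unpin it once the worker has handled the range) can be sketched in userspace as follows. This is an illustration only: `fake_mm`, `deferred_item` and both helpers are hypothetical stand-ins, and a plain integer replaces the kernel's atomic reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an mm_struct with a user reference count (hypothetical). */
struct fake_mm {
    int users;
};

/* Stand-in for a deferred range work item. */
struct deferred_item {
    struct fake_mm *mm;
};

/* Queue side: pin the mm so it cannot go away before the worker runs. */
static void deferred_item_init(struct deferred_item *item, struct fake_mm *mm)
{
    assert(mm->users > 0);   /* caller must already hold a reference */
    mm->users++;
    item->mm = mm;
}

/* Worker side: use the mm, then drop the reference taken at queue time. */
static void deferred_item_handle(struct deferred_item *item)
{
    /* ... work that dereferences item->mm would go here ... */
    item->mm->users--;
    item->mm = NULL;
}
```

Because the reference is owned by the queued item, the worker remains safe even if the process that queued the range drops its own reference before the work runs.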
2022-01-27 | drm/amdkfd: Don't take process mutex for svm ioctls | Philip Yang | 1 | -4/+0
SVM ioctls take the proper svms->lock to handle race conditions and don't need to take the process mutex to serialize ioctls. This also fixes a circular locking warning:
WARNING: possible circular locking dependency detected
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&svms->deferred_list_work));
                               lock(&process->mutex);
                               lock((work_completion)(&svms->deferred_list_work));
  lock(&process->mutex);
 *** DEADLOCK ***
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 | drm/amdkfd: enable heavy-weight TLB flush on Vega20 | Eric Huang | 1 | -1/+2
It is to meet the requirement for memory allocation optimization on MI50. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-20 | drm/amdkfd: enable heavy-weight TLB flush on Arcturus | Eric Huang | 1 | -2/+8
SDMA FW fixes the hang issue for adding heavy-weight TLB flush on Arcturus, so we can enable it. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-20 | drm/amdgpu: remove gart.ready flag | Christian König | 1 | -4/+1
That's just a leftover from the old radeon days and was preventing CS and GART bindings before the hardware was initialized. But nowadays that is perfectly valid. The only thing we need to warn about is GART binding before the table is even allocated. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-15 | drm/amdkfd: Fix indentation on switch statement | Graham Sider | 1 | -28/+27
Cases should be same indentation as switch. Also fix string spanning across multiple lines. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-15 | drm/amd/pm: do not expose implementation details to other blocks out of power | Evan Quan | 1 | -1/+1
Those implementation details (whether swsmu is supported, whether some ppt_funcs are supported, access to internal statistics ...) should be kept internal. Exposing implementation details is not good practice and is even error prone. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 | drm/amdkfd: Fix ASIC name typos | Kent Russell | 1 | -3/+3
Three misspelled ASICs in comments here, so fix the spelling Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 | drm/amdkfd: Fix DQM asserts on Hawaii | Felix Kuehling | 1 | -3/+6
start_nocpsch would never set dqm->sched_running on Hawaii due to an early return statement. This would trigger asserts in other functions and end up in inconsistent states. Bug: https://github.com/RadeonOpenCompute/ROCm/issues/1624 Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 | drm/amdkfd: Use prange->update_list head for remove_list | Felix Kuehling | 2 | -6/+2
The remove_list head was only used for keeping track of existing ranges that are to be removed from the svms->list. The update_list was used for new or existing ranges that need updated attributes. These two cases are mutually exclusive (i.e. the same range will never be on both lists). Therefore we can use the update_list head to track the remove_list and save another 16 bytes in the svm_range struct. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 | drm/amdkfd: Use prange->list head for insert_list | Felix Kuehling | 2 | -11/+8
There are seven list_heads in struct svm_range: list, update_list, remove_list, insert_list, svm_bo_list, deferred_list, child_list. This patch and the next one remove two of them that are redundant. The insert_list head was only used for new ranges that are not on the svms->list yet. So we can use that list head for keeping track of new ranges before they get added, and use list_move_tail to move them to the svms->list when ready. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
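The list-head reuse in this entry relies on a node being a member of only one list at a time, so a single embedded head can serve first the staging list and then the final list. The sketch below uses a minimal userspace copy of the kernel's circular list; the reduced `struct range` and the local variable names used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

/* Same semantics as the kernel's list_move_tail(): unlink, then append. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
    list_del_entry(e);
    list_add_tail(e, head);
}

/* Hypothetical, heavily reduced range: one node serves both lists in turn. */
struct range {
    unsigned long start, last;
    struct list_head list;
};
```

A list head is two pointers, which on a 64-bit build is the 16 bytes per removed head quoted in the entry for the remove_list change.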
2022-01-11 | drm/amdkfd: Check for null pointer after calling kmemdup | Jiasheng Jiang | 1 | -0/+3
Because the allocation may fail, kmemdup() may return a NULL pointer. Therefore, check 'props2' in order to prevent a NULL pointer dereference. Fixes: 3a87177eb141 ("drm/amdkfd: Add topology support for dGPUs") Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
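The check added by this entry follows the usual duplicate-then-verify pattern. A userspace sketch, assuming `malloc` plus `memcpy` as a stand-in for kmemdup(); `dup_bytes`, `struct link_props` and `make_reverse_link` are hypothetical names chosen for illustration, and only the name `props2` comes from the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kmemdup(): duplicate len bytes, NULL on failure. */
static void *dup_bytes(const void *src, size_t len)
{
    void *dst = malloc(len);

    if (dst)
        memcpy(dst, src, len);
    return dst;
}

struct link_props { int node_from, node_to, weight; };  /* hypothetical */

/* Build the reverse link; returns 0 on success, -1 if allocation fails. */
static int make_reverse_link(const struct link_props *props,
                             struct link_props **out)
{
    struct link_props *props2;

    assert(props != NULL && out != NULL);
    props2 = dup_bytes(props, sizeof(*props));
    if (!props2)          /* the kind of check the commit adds */
        return -1;
    props2->node_from = props->node_to;
    props2->node_to = props->node_from;
    *out = props2;
    return 0;
}
```

Without the check, the two assignments after the duplication would dereference NULL whenever the allocation fails.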
2022-01-11 | drm/amdkfd: use default_groups in kobj_type | Greg Kroah-Hartman | 1 | -1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the amdkfd sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-08 | drm/amdkfd: enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9 | yipechai | 1 | -0/+1
Enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9. Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 | drm/amdkfd: correct sdma queue number in kfd device init (v3) | Guchun Chen | 1 | -9/+71
This patch keeps the sdma queue number setting the same as it was before the recent KFD code refactor. Additionally, use a switch statement over the IP version to complete the kfd device_info structure for IH version assignment. This is consistent with the IP parse code in amdgpu_discovery.c. v2: use dev_warn for the default switch case; set default sdma queue per engine(8) and IH handler to v9. (Jonathan) v3: Fix missed IP version check of Raven. Fixes: f0dc99a6f742bc ("drm/amdkfd: add kfd_device_info_init function") Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-29 | drm/amdkfd: reset queue which consumes RAS poison (v2) | Tao Zhou | 2 | -4/+42
CP supports unmapping a queue with reset mode, which destroys only the specific queue without affecting others. Replacing the whole gpu reset with reset queue mode for RAS poison consumption saves much time, and we can still fall back to the gpu reset solution if the queue reset fails. v2: Return directly if process is NULL; the reset queue solution is not applicable to SDMA, so fall back to the legacy way; call kfd_unref_process after looking up the process. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
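The recovery policy in this entry (try the per-queue reset first, use the full gpu reset for SDMA or when the queue reset fails) can be sketched as a small decision function. All names below are hypothetical, and the callback merely stands in for whatever performs the real queue reset.

```c
#include <assert.h>
#include <stddef.h>

enum queue_type { QUEUE_COMPUTE, QUEUE_SDMA };          /* hypothetical */
enum recovery { RECOVERY_QUEUE_RESET, RECOVERY_GPU_RESET };

/* Callback standing in for the per-queue reset; returns 0 on success. */
typedef int (*reset_queue_fn)(void);

/* Example stubs modelling a reset that works and one that fails. */
static int reset_ok(void) { return 0; }
static int reset_fail(void) { return -1; }

static enum recovery handle_poison(enum queue_type type,
                                   reset_queue_fn reset_queue)
{
    assert(reset_queue != NULL);
    /* Queue reset does not apply to SDMA: use the legacy full reset. */
    if (type == QUEUE_SDMA)
        return RECOVERY_GPU_RESET;
    /* Try the cheap path first, fall back if it fails. */
    if (reset_queue() == 0)
        return RECOVERY_QUEUE_RESET;
    return RECOVERY_GPU_RESET;
}
```

Keeping the full reset as the fallback means the cheaper path can only reduce recovery time; it never removes the existing recovery mechanism.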
2021-12-29 | drm/amdkfd: add reset queue function for RAS poison (v2) | Tao Zhou | 2 | -0/+21
The new interface unmaps queues with reset mode for the process that consumes RAS poison; it applies only to compute queues. v2: rename the function to reset_queues. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-29 | drm/amdkfd: add reset parameter for unmap queues | Tao Zhou | 1 | -6/+6
So we can set reset mode for unmap operation, no functional change. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-16 | drm/amdkfd: use max() and min() to make code cleaner | Changcheng Deng | 1 | -2/+2
Use max() and min() in order to make code cleaner. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-16 | drm/amdkfd: fix svm_bo release invalid wait context warning | Philip Yang | 3 | -6/+30
Add svm_range_bo_unref_async to schedule work that waits for the svm_bo eviction work to finish and then frees the svm_bo. __do_munmap put_page runs in atomic context, so call svm_range_bo_unref_async there to avoid the invalid wait context warning. Other, non-atomic contexts call svm_range_bo_unref. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdgpu: remove unnecessary variables | Isabella Basso | 1 | -4/+0
This fixes the warnings below, and also drops the display_count variable, as it's unused.
In function 'svm_range_map_to_gpu':
warning: variable 'bo_va' set but not used [-Wunused-but-set-variable]
 1172 | struct amdgpu_bo_va bo_va;
      | ^~~~~
...
In function 'dcn201_update_clocks':
warning: variable 'enter_display_off' set but not used [-Wunused-but-set-variable]
 132 | bool enter_display_off = false;
     | ^~~~~~~~~~~~~~~~~
Changes since v1:
- As suggested by Rodrigo Siqueira: 1. Drop display_count variable.
- As suggested by Felix Kuehling: 1. Remove block surrounding amdgpu_xgmi_same_hive.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Isabella Basso <isabbasso@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdgpu: disable default navi2x co-op kernel support | Jonathan Kim | 1 | -4/+1
This patch reverts the following: commit 48733b224fa7ba ("drm/amdkfd: add Navi2x to GWS init conditions") Disable GWS usage in default settings for now due to FW bugs. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdkfd: add Navi2x to GWS init conditions | Graham Sider | 1 | -1/+4
Initialize GWS on Navi2x with mec2_fw_version >= 0x42. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-and-tested-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdkfd: Make KFD support on Hawaii experimental | Felix Kuehling | 1 | -1/+5
Hawaii support is mostly untested these days. ROCm user mode also depends on custom firmware for AQL packet processing, that was never pushed upstream due to quality regressions in graphics driver testing. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdkfd: Don't split unchanged SVM ranges | Felix Kuehling | 1 | -10/+13
If an existing SVM range overlaps an svm_range_set_attr call, we would normally split it in order to update only the overlapping part. However, if the attributes of the existing range would not be changed, splitting it is unnecessary. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14 | drm/amdkfd: Fix svm_range_is_same_attrs | Felix Kuehling | 1 | -11/+56
The existing function doesn't compare the access bitmaps and flags. This can result in failure to update those attributes in existing ranges when all other attributes remained unchanged. Because the access and flags attributes modify only some bits in the respective bitmaps, we cannot compare them directly. Instead we need to check whether applying the attributes to a particular range would change the bitmaps. A PREFETCH_LOC attribute must always trigger a migration, even if the attribute value remains unchanged. E.g. if some pages were migrated due to a CPU page fault, a prefetch must still be executed to migrate pages back to VRAM. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
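The comparison rule in this entry can be sketched with plain bit operations: a set-bits or clear-bits attribute counts as "same" only if applying it would leave the current bitmap untouched, while a prefetch attribute always counts as a change. The enum and function names are hypothetical, and a 32-bit mask stands in for the real bitmaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if setting `bits` in `cur` would leave it unchanged. */
static bool set_flags_is_noop(uint32_t cur, uint32_t bits)
{
    return (cur | bits) == cur;     /* all requested bits already set */
}

/* True if clearing `bits` in `cur` would leave it unchanged. */
static bool clr_flags_is_noop(uint32_t cur, uint32_t bits)
{
    return (cur & ~bits) == cur;    /* none of the bits currently set */
}

enum attr_kind { ATTR_SET_FLAGS, ATTR_CLR_FLAGS, ATTR_PREFETCH_LOC };

/* Would applying this attribute require updating the range? */
static bool attr_would_change(enum attr_kind kind, uint32_t cur, uint32_t value)
{
    switch (kind) {
    case ATTR_SET_FLAGS:
        return !set_flags_is_noop(cur, value);
    case ATTR_CLR_FLAGS:
        return !clr_flags_is_noop(cur, value);
    case ATTR_PREFETCH_LOC:
        return true;                /* a prefetch must always migrate */
    }
    assert(0 && "unknown attribute kind");
    return true;
}
```

Comparing `cur` with `value` directly would be wrong here, because the attribute carries only the bits to set or clear, not the full resulting bitmap.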
2021-12-14 | drm/amdkfd: Fix error handling in svm_range_add | Felix Kuehling | 1 | -89/+49
Add null-pointer check after the last svm_range_new call. This was originally reported by Zhou Qingyang <zhou1615@umn.edu> based on a static analyzer. To avoid duplicating the unwinding code from svm_range_handle_overlap, I merged the two functions into one. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Zhou Qingyang <zhou1615@umn.edu> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>