summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd
AgeCommit message (Collapse)AuthorFilesLines
2021-12-14drm/amdkfd: fix function scopesIsabella Basso1-2/+2
This turns previously global functions into static, thus removing compile-time warnings such as: warning: no previous prototype for 'pm_set_resources_vi' [-Wmissing-prototypes] 113 | int pm_set_resources_vi(struct packet_manager *pm, uint32_t *buffer, | ^~~~~~~~~~~~~~~~~~~ Signed-off-by: Isabella Basso <isabbasso@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14drm/amd: fix improper docstring syntaxIsabella Basso3-5/+11
This fixes various warnings relating to erroneous docstring syntax, of which some are listed below: warning: Function parameter or member 'adev' not described in 'amdgpu_atomfirmware_ras_rom_addr' ... warning: expecting prototype for amdgpu_atpx_validate_functions(). Prototype was for amdgpu_atpx_validate() instead ... warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new' ... warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of waves in-flight on this device ... warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Signed-off-by: Isabella Basso <isabbasso@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-07drm/amdkfd: Correct the value of the no_atomic_fw_version variablechen gong1-2/+2
145: navi10 IP_VERSION(10, 1, 10) navi12 IP_VERSION(10, 1, 2) navi14 IP_VERSION(10, 1, 1) 92: sienna_cichlid IP_VERSION(10, 3, 0) navy_flounder IP_VERSION(10, 3, 2) vangogh IP_VERSION(10, 3, 1) dimgrey_cavefish IP_VERSION(10, 3, 4) beige_goby IP_VERSION(10, 3, 5) yellow_carp IP_VERSION(10, 3, 3) Signed-off-by: chen gong <curry.gong@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: process_info lock not needed for svmPhilip Yang1-9/+0
process_info->lock is used to protect kfd_bo_list, vm_list_head, n_vms and userptr valid/inval list, svm_range_restore_work and svm_range_set_attr don't access those, so do not need to take process_info lock. This will avoid potential circular locking issue. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: remove hardcoded device_info structsGraham Sider1-452/+17
With device_info initialization being handled in kfd_device_info_init, these structs may be removed. Also add comments to help matching IP versions to asic names. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: add kfd_device_info_init functionGraham Sider11-141/+163
Initializes kfd->device_info given either asic_type (enum) if GFX version is less than GFX9, or GC IP version if greater. Also takes in vf and the target compiler gfx version. Uses SDMA version to determine num_sdma_queues_per_engine. Convert device_info to a non-pointer member of kfd, change references accordingly. Change unsupported asic condition to only probe f2g, move device_info initialization post-switch. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: replace asic_name with amdgpu_asic_nameGraham Sider3-32/+8
device_info->asic_name and amdgpu_asic_name[adev->asic_type] both provide asic name strings, with the only difference being casing. Remove asic_name from device_info and replace sysfs entry with lowercase amdgpu_asic_name[]. Ensures string is null-terminated so that this doesn't break if dev->node_props.name ever gets set anywhere else. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: set "r = 0" explicitly before gotoPhilip Yang1-0/+4
To silence the following Smatch static checker warning: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2615 svm_range_restore_pages() warn: missing error code here? 'get_task_mm()' failed. 'r' = '0' Signed-off-by: Philip Yang <Philip.Yang@amd.com> Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdgpu: handle IH ring1 overflowPhilip Yang1-1/+1
IH ring1 is used to process GPU retry fault, overflow is enabled to drain retry fault because we want receive other interrupts while handling retry fault to recover range. There is no overflow flag set when wptr pass rptr. Use timestamp of rptr and wptr to handle overflow and drain retry fault. If fault timestamp goes backward, the fault is filtered and should not be processed. Drain fault is finished if processed_timestamp is equal to or larger than checkpoint timestamp. Add amdgpu_ih_functions interface decode_iv_ts for different chips to get timestamp from IV entry with different iv size and timestamp offset. amdgpu_ih_decode_iv_ts_helper is used for vega10, vega20, navi10. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: Slighly optimize 'init_doorbell_bitmap()'Christophe JAILLET1-3/+3
The 'doorbell_bitmap' bitmap has just been allocated. So we can use the non-atomic '__set_bit()' function to save a few cycles as no concurrent access can happen. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02drm/amdkfd: Use bitmap_zalloc() when applicableChristophe JAILLET2-8/+6
'doorbell_bitmap' and 'queue_slot_bitmap' are bitmaps. So use 'bitmap_zalloc()' to simplify code, improve the semantic and avoid some open-coded arithmetic in allocator arguments. Also change the corresponding 'kfree()' into 'bitmap_free()' to keep consistency. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24drm/amdkfd: simplify drain retry faultPhilip Yang2-9/+23
unmap range always increase atomic svms->drain_pagefaults to simplify both parent range and child range unmap, page fault handle ignores the retry fault if svms->drain_pagefaults is set to speed up interrupt handling. svm_range_drain_retry_fault restart draining if another range unmap from cpu. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24drm/amdkfd: handle VMA remove racePhilip Yang1-9/+13
VMA may be removed before unmap notifier callback, and deferred list work remove range, return success for this special case as we are handling stale retry fault. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-24drm/amdkfd: process exit and retry fault racePhilip Yang1-27/+36
kfd_process_wq_release drain retry fault to ensure no retry fault comes after removing kfd process from the hash table, otherwise svm page fault handler will fail to recover the fault and dump GPU vm fault log. Refactor deferred list work to get_task_mm and take mmap write lock to handle all ranges, and avoid mm is gone while inserting mmu notifier. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22drm/amdkfd: Remove unused entries in tableAmber Lin2-60/+0
Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines and num_sdma_queues_per_engine. They are calculated in kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Graham Sider <Graham.Sider@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22drm/amdkfd: Retrieve SDMA numbers from amdgpuAmber Lin4-22/+37
Instead of hard coding the number of sdma engines and the number of sdma_xgmi engines in the device_info table, get the number of toal SDMA instances from amdgpu. The first two engines are sdma engines and the rest are sdma-xgmi engines unless the ASIC doesn't support XGMI. v2: add kfd_ prefix to non static function names Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered againshaoyunl1-0/+5
In SRIOV configuration, the reset may failed to bring asic back to normal but stop cpsch already been called, the start_cpsch will not be called since there is no resume in this case. When reset been triggered again, driver should avoid to do uninitialization again. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace asic_family with asic_typeGraham Sider10-53/+22
asic_family was a duplicate of asic_type, both of type amd_asic_type. Replace all instances of device_info->asic_family with adev->asic_type and remove asic_family from device_info. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: convert misc checks to IP version checkingGraham Sider10-35/+33
Switch to IP version checking instead of asic_type on various KFD version checks. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: convert switches to IP version checkingGraham Sider6-155/+102
Converts KFD switch statements to use IP version checking instead of asic_type. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: convert KFD_IS_SOC to IP version checkingGraham Sider4-5/+6
Defined as GC HWIP >= IP_VERSION(9, 0, 1). Also defines KFD_GC_VERSION to return GC HWIP version. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace trivial funcs with direct accessGraham Sider4-11/+10
These get funcs simply return an adev field. Replace funcs/calls with direct field accesses instead. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: Add sysfs bitfields and enums to uAPIFelix Kuehling1-45/+1
These bits are de-facto part of the uAPI, so declare them in a uAPI header. The corresponding bit-fields and enums in user mode are defined in https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/include/hsakmttypes.h HSA_CAP_... -> HSA_CAPABILITY HSA_MEM_HEAP_TYPE_... -> HSA_HEAPTYPE HSA_MEM_FLAGS_... -> HSA_MEMORYPROPERTY HSA_CACHE_TYPE_... -> HsaCacheType HSA_IOLINK_TYPE_... -> HSA_IOLINKTYPE HSA_IOLINK_FLAGS_... -> HSA_LINKPROPERTY Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: remove kgd_dev declaration and initializationGraham Sider2-4/+1
Completes removal of kgd_dev. Direct references to amdgpu_device objects should now be used instead. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace/remove remaining kgd_dev referencesGraham Sider8-50/+32
Remove get_amdgpu_device and other remaining kgd_dev references aside from declaration/kfd struct entry and initialization. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace kgd_dev in gpuvm amdgpu_amdkfd funcsGraham Sider3-27/+27
Modified definitions: - amdgpu_amdkfd_gpuvm_acquire_process_vm - amdgpu_amdkfd_gpuvm_release_process_vm - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu - amdgpu_amdkfd_gpuvm_free_memory_of_gpu - amdgpu_amdkfd_gpuvm_map_memory_to_gpu - amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu - amdgpu_amdkfd_gpuvm_sync_memory - amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel - amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel - amdgpu_amdkfd_gpuvm_get_vm_fault_info - amdgpu_amdkfd_gpuvm_import_dmabuf - amdgpu_amdkfd_get_tile_config Removed: - get_amdgpu_device Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcsGraham Sider7-35/+35
Modified definitions: - amdgpu_amdkfd_get_fw_version - amdgpu_amdkfd_get_local_mem_info - amdgpu_amdkfd_get_gpu_clock_counter - amdgpu_amdkfd_get_max_engine_clock_in_mhz - amdgpu_amdkfd_get_cu_info - amdgpu_amdkfd_get_dmabuf_info - amdgpu_amdkfd_get_vram_usage - amdgpu_amdkfd_get_hive_id - amdgpu_amdkfd_get_unique_id - amdgpu_amdkfd_get_mmio_remap_phys_addr - amdgpu_amdkfd_get_num_gws - amdgpu_amdkfd_get_asic_rev_id - amdgpu_amdkfd_get_noretry - amdgpu_amdkfd_get_xgmi_hops_count - amdgpu_amdkfd_get_xgmi_bandwidth_mbytes - amdgpu_amdkfd_get_pcie_bandwidth_mbytes Also replaces kfd_device_by_kgd with kfd_device_by_adev, now searching via adev rather than kgd. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcsGraham Sider6-25/+23
Modified definitions: - amdgpu_amdkfd_submit_ib - amdgpu_amdkfd_set_compute_idle - amdgpu_amdkfd_have_atomics_support - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_flush_gpu_tlb_pasid - amdgpu_amdkfd_gpu_reset - amdgpu_amdkfd_alloc_gtt_mem - amdgpu_amdkfd_free_gtt_mem - amdgpu_amdkfd_alloc_gws - amdgpu_amdkfd_free_gws - amdgpu_amdkfd_ras_poison_consumption_handler Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace kgd_dev in various kfd2kgd funcsGraham Sider5-20/+20
Modified definitions: - program_sh_mem_settings - set_pasid_vmid_mapping - init_interrupts - address_watch_disable - address_watch_execute - wave_control_execute - address_watch_get_offset - get_atc_vmid_pasid_mapping_info - set_scratch_backing_va - set_vm_context_page_table_base - read_vmid_from_vmfault_reg - get_cu_occupancy - program_trap_handler_settings Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: replace kgd_dev in hqd/mqd kfd2kgd funcsGraham Sider5-29/+29
Modified definitions: - hqd_load - hiq_mqd_load - hqd_sdma_load - hqd_dump - hqd_sdma_dump - hqd_is_occupied - hqd_destroy - hqd_sdma_is_occupied - hqd_sdma_destroy Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-18drm/amdkfd: add amdgpu_device entry to kfd_devGraham Sider2-0/+2
Patch series to remove kgd_dev struct and replace all instances with amdgpu_device objects. amdgpu_device needs to be declared in kgd_kfd_interface.h to be visible to kfd2kgd_calls. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-12Merge tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drmLinus Torvalds7-30/+104
Pull more drm updates from Dave Airlie: "I missed a drm-misc-next pull for the main pull last week. It wasn't that major and isn't the bulk of this at all. This has a bunch of fixes all over, a lot for amdgpu and i915. bridge: - HPD improvments for lt9611uxc - eDP aux-bus support for ps8640 - LVDS data-mapping selection support ttm: - remove huge page functionality (needs reworking) - fix a race condition during BO eviction panels: - add some new panels fbdev: - fix double-free - remove unused scrolling acceleration - CONFIG_FB dep improvements locking: - improve contended locking logging - naming collision fix dma-buf: - add dma_resv_for_each_fence iterator - fix fence refcounting bug - name locking fixesA prime: - fix object references during mmap nouveau: - various code style changes - refcount fix - device removal fixes - protect client list with a mutex - fix CE0 address calculation i915: - DP rates related fixes - Revert disabling dual eDP that was causing state readout problems - put the cdclk vtables in const data - Fix DVO port type for older platforms - Fix blankscreen by turning DP++ TMDS output buffers on encoder->shutdown - CCS FBs related fixes - Fix recursive lock in GuC submission - Revert guc_id from i915_request tracepoint - Build fix around dmabuf amdgpu: - GPU reset fix - Aldebaran fix - Yellow Carp fixes - DCN2.1 DMCUB fix - IOMMU regression fix for Picasso - DSC display fixes - BPC display calculation fixes - Other misc display fixes - Don't allow partial copy from user for DC debugfs - SRIOV fixes - GFX9 CSB pin count fix - Various IP version check fixes - DP 2.0 fixes - Limit DCN1 MPO fix to DCN1 amdkfd: - SVM fixes - Fix gfx version for renoir - Reset fixes udl: - timeout fix imx: - circular locking fix virtio: - NULL ptr deref fix" * tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm: (126 commits) drm/ttm: Double check mem_type of BO while eviction drm/amdgpu: add missed support for UVD IP_VERSION(3, 0, 64) drm/amdgpu: drop jpeg IP initialization in SRIOV case drm/amd/display: reject both non-zero src_x and src_y only for DCN1x drm/amd/display: Add callbacks for DMUB HPD IRQ notifications drm/amd/display: Don't lock connection_mutex for DMUB HPD drm/amd/display: Add comment where CONFIG_DRM_AMD_DC_DCN macro ends drm/amdkfd: Fix retry fault drain race conditions drm/amdkfd: lower the VAs base offset to 8KB drm/amd/display: fix exit from amdgpu_dm_atomic_check() abruptly drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov drm/amdgpu: fix uvd crash on Polaris12 during driver unloading drm/i915/adlp/fb: Prevent the mapping of redundant trailing padding NULL pages drm/i915/fb: Fix rounding error in subsampled plane size calculation drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown() drm/locking: fix __stack_depot_* name conflict drm/virtio: Fix NULL dereference error in virtio_gpu_poll drm/amdgpu: fix SI handling in amdgpu_device_asic_has_dc_support() drm/amdgpu: Fix dangling kfd_bo pointer for shared BOs drm/amd/amdkfd: Don't sent command to HWS on kfd reset ...
2021-11-11mm/migrate.c: remove MIGRATE_PFN_LOCKEDAlistair Popple1-2/+0
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then the a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-10drm/amdkfd: Fix retry fault drain race conditionsFelix Kuehling2-5/+20
The check for whether to drain retry faults must be under the mmap write lock to serialize with munmap notifier callbacks. We were also missing checks on child ranges. To fix that, simplify the logic by using a flag rather than checking on each prange. That also allows draining less freqeuntly when many ranges are unmapped at once. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Alex Sierra <Alex.Sierra@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-10drm/amdkfd: lower the VAs base offset to 8KBAlex Sierra1-1/+1
The low 16MB of virtual address space are currently reserved for kernel mode allocations mapped into user virtual address space. This causes conflicts with HMM/SVM mappings at low virtual addresses. We tried to move those kernel mode allocations to the upper half of the 64-bit virtual address space for GFX9, which is naturally reserved for kernel use. However, TBA (trap handler code) has problems to access addresses in the high virtual space. We have decided to set this to 8KB of the lower address space as a temporary fix, while investigate TBA address problem. It is very unlikely for user space to map memory at this low region. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05drm/amd/amdkfd: Don't sent command to HWS on kfd resetshaoyunl2-2/+6
When kfd need to be reset, sent command to HWS might cause hang and get unnecessary timeout. This change try not to touch HW in pre_reset and keep queues to be in the evicted state when the reset is done, so they are not put back on the runlist. These queues will be destroied on process termination. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-05drm/amdkfd: avoid recursive lock in migrations back to RAMAlex Sierra3-0/+8
[Why]: When we call hmm_range_fault to map memory after a migration, we don't expect memory to be migrated again as a result of hmm_range_fault. The driver ensures that all memory is in GPU-accessible locations so that no migration should be needed. However, there is one corner case where hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE back to system memory due to a write-fault when a system memory page in the same range was mapped read-only (e.g. COW). Ranges with individual pages in different locations are usually the result of failed page migrations (e.g. page lock contention). The unexpected migration back to system memory causes a deadlock from recursive locking in our driver. [How]: Creating a task reference new member under svm_range_list struct. Setting this with "current" reference, right before the hmm_range_fault is called. This member is checked against "current" reference at svm_migrate_to_ram callback function. If equal, the migration will be ignored. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03drm/amdkfd: update gfx target version for RenoirGraham Sider1-1/+1
Previously Renoir compiler gfx target version was forced to Raven. Update driver side for completeness. Signed-off-by: Graham Sider <Graham.Sider@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03drm/amdkfd: Handle incomplete migration to system memoryFelix Kuehling2-13/+40
If some pages fail to migrate to system memory, don't update prange->actual_loc = 0. This prevents endless CPU page faults after partial migration failures due to contested page locks. Migration to RAM must be complete during migrations from VRAM to VRAM and during evictions. Implement retry and fail if the migration to RAM fails. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03drm/amdkfd: Avoid thrashing of stack and heapFelix Kuehling1-5/+17
Stack and heap pages tend to be shared by many small allocations. Concurrent access by CPU and GPU is therefore likely, which can lead to thrashing. Avoid this by setting the preferred location to system memory. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-03drm/amdkfd: Fix SVM_ATTR_PREFERRED_LOCFelix Kuehling1-3/+11
The preferred location should be used as the migration destination whenever it is accessible by the faulting GPU. System memory is always accessible. Peer memory is accessible if it's in the same XGMI hive. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28drm/amdkfd: Remove cu mask from struct queue_properties(v2)Lang Yu8-54/+57
Actually, cu_mask has been copied to mqd memory and does't have to persist in queue_properties. Remove it from queue_properties. And use struct mqd_update_info to store such properties, then pass it to update queue operation. v2: * Rename pqm_update_queue to pqm_update_queue_properties. * Rename struct queue_update_info to struct mqd_update_info. * Rename pqm_set_cu_mask to pqm_update_mqd. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28drm/amdkfd: Add an optional argument into update queue operation(v2)Lang Yu9-35/+52
Currently, queue is updated with data in queue_properties. And all allocated resource in queue_properties will not be freed until the queue is destroyed. But some properties(e.g., cu mask) bring some memory management headaches(e.g., memory leak) and make code complex. Actually they have been copied to mqd and don't have to persist in queue_properties. Add an argument into update queue to pass such properties, then we can remove them from queue_properties. v2: Don't use void *. Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-28drm/amdkfd: Separate pinned BOs destruction from general routineLang Yu3-37/+106
Currently, all kfd BOs use same destruction routine. But pinned BOs are not unpinned properly. Separate them from general routine. v2 (Felix): Add safeguard to prevent user space from freeing signal BO. Kunmap signal BO in the event of setting event page error. Just kunmap signal BO to avoid duplicating the code. Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-22drm/amdkfd: debug message to count successfully migrated pagesPhilip Yang1-0/+21
Not all migrate.cpages returned from migrate_vma_setup can be migrated, for example non anonymous page, or out of device memory. So after migrate_vma_pages returns, add debug message to count pages are successfully migrated which has MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-22drm/amdkfd: clarify the origin of cpages returned by migration functionsPhilip Yang1-21/+22
cpages is only updated by migrate_vma_setup. So capture its value at that point to clarify the significance of the number. The next patch will add counting of actually migrated pages after migrate_vma_pages for debug purposes. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-20drm/amdkfd: protect raven_device_info with KFD_SUPPORT_IOMMU_V2Alex Deucher1-1/+1
raven_device_info is not used when KFD_SUPPORT_IOMMU_V2 is not set. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-20drm/amdkfd: protect hawaii_device_info with CONFIG_DRM_AMDGPU_CIKAlex Deucher1-0/+2
hawaii_device_info is not used when CONFIG_DRM_AMDGPU_CIK is not set. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-20drm/amdkfd: map gpu hive id to xgmi connected cpuJonathan Kim1-1/+18
ROCr needs to be able to identify all devices that have direct access to fine grain memory, which should include CPUs that are connected to GPUs over xGMI. The GPU hive ID can be mapped onto the CPU hive ID since the CPU is part of the hive. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-14drm/amdkfd: unregistered svm range not overlap with TTM rangePhilip Yang1-5/+90
When creating unregistered new svm range to recover retry fault, avoid new svm range to overlap with ranges or userptr ranges managed by TTM, otherwise svm migration will trigger TTM or userptr eviction, to evict user queues unexpectedly. Change helper amdgpu_ttm_tt_affect_userptr to return userptr which is inside the range. Add helper svm_range_check_vm_userptr to scan all userptr of the vm, and return overlap userptr bo start, last. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>