summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd
AgeCommit message (Collapse)AuthorFilesLines
2020-09-03drm/amdkfd: fix ref count leak when pm_runtime_get_sync failsAlex Deucher1-1/+3
[ Upstream commit 1c1ada37af6ee6fb9cfc8da6a56cc83208cd8d6f ] The call to pm_runtime_get_sync increments the counter even in case of failure, leading to incorrect ref count. In case of failure, decrement the ref count before returning. Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03drm/amdkfd: Fix reference count leaks.Qiushi Wu1-5/+15
[ Upstream commit 20eca0123a35305e38b344d571cf32768854168c ] kobject_init_and_add() takes reference even when it fails. If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Signed-off-by: Qiushi Wu <wu000273@umn.edu> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-25drm/amd: fix potential memleak in err branchBernard Zhao1-0/+1
The function kobject_init_and_add alloc memory like: kobject_init_and_add->kobject_add_varg->kobject_set_name_vargs ->kvasprintf_const->kstrdup_const->kstrdup->kmalloc_track_caller ->kmalloc_slab, in err branch this memory not free. If use kmemleak, this path maybe catched. These changes are to add kobject_put in kobject_init_and_add failed branch, fix potential memleak. Signed-off-by: Bernard Zhao <bernard@vivo.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2020-06-19Merge tag 'amd-drm-fixes-5.8-2020-06-17' of ↵Dave Airlie1-1/+2
git://people.freedesktop.org/~agd5f/linux into drm-fixes amd-drm-fixes-5.8-2020-06-17: amdgpu: - Fix kvfree/kfree mixup - Fix hawaii device id in powertune configuration - Display FP fixes - Documentation fixes amdkfd: - devcgroup check fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20200617220733.3773183-1-alexander.deucher@amd.com
2020-06-18drm/amdkfd: Use correct major in devcgroup checkLorenz Brun1-1/+2
The existing code used the major version number of the DRM driver instead of the device major number of the DRM subsystem for validating access for a devices cgroup. This meant that accesses allowed by the devices cgroup weren't permitted and certain accesses denied by the devices cgroup were permitted (if they matched the wrong major device number). Signed-off-by: Lorenz Brun <lorenz@brun.one> Fixes: 6b855f7b83d2f ("drm/amdkfd: Check against device cgroup") Reviewed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse1-2/+2
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03Merge tag 'drm-next-2020-06-02' of git://anongit.freedesktop.org/drm/drmLinus Torvalds15-36/+235
Pull drm updates from Dave Airlie: "Highlights: - Core DRM had a lot of refactoring around managed drm resources to make drivers simpler. - Intel Tigerlake support is on by default - amdgpu now support p2p PCI buffer sharing and encrypted GPU memory Details: core: - uapi: error out EBUSY when existing master - uapi: rework SET/DROP MASTER permission handling - remove drm_pci.h - drm_pci* are now legacy - introduced managed DRM resources - subclassing support for drm_framebuffer - simple encoder helper - edid improvements - vblank + writeback documentation improved - drm/mm - optimise tree searches - port drivers to use devm_drm_dev_alloc dma-buf: - add flag for p2p buffer support mst: - ACT timeout improvements - remove drm_dp_mst_has_audio - don't use 2nd TX slot - spec recommends against it bridge: - dw-hdmi various improvements - chrontel ch7033 support - fix stack issues with old gcc hdmi: - add unpack function for drm infoframe fbdev: - misc fbdev driver fixes i915: - uapi: global sseu pinning - uapi: OA buffer polling - uapi: remove generated perf code - uapi: per-engine default property values in sysfs - Tigerlake GEN12 enabled. - Lots of gem refactoring - Tigerlake enablement patches - move to drm_device logging - Icelake gamma HW readout - push MST link retrain to hotplug work - bandwidth atomic helpers - ICL fixes - RPS/GT refactoring - Cherryview full-ppgtt support - i915 locking guidelines documented - require linear fb stride to be 512 multiple on gen9 - Tigerlake SAGV support amdgpu: - uapi: encrypted GPU memory handling - uapi: add MEM_SYNC IB flag - p2p dma-buf support - export VRAM dma-bufs - FRU chip access support - RAS/SR-IOV updates - Powerplay locking fixes - VCN DPG (powergating) enablement - GFX10 clockgating fixes - DC fixes - GPU reset fixes - navi SDMA fix - expose FP16 for modesetting - DP 1.4 compliance fixes - gfx10 soft recovery - Improved Critical Thermal Faults handling - resizable BAR on gmc10 amdkfd: - uapi: GWS resource management - track GPU memory per process - report PCI domain in topology radeon: - safe reg list generator fixes nouveau: - HD audio fixes on recent systems - vGPU detection (fail probe if we're on one, for now) - Interlaced mode fixes (mostly avoidance on Turing, which doesn't support it) - SVM improvements/fixes - NVIDIA format modifier support - Misc other fixes. adv7511: - HDMI SPDIF support ast: - allocate crtc state size - fix double assignment - fix suspend bochs: - drop connector register cirrus: - move to tiny drivers. exynos: - fix imported dma-buf mapping - enable runtime PM - fixes and cleanups mediatek: - DPI pin mode swap - config mipi_tx current/impedance lima: - devfreq + cooling device support - task handling improvements - runtime PM support pl111: - vexpress init improvements - fix module auto-load rcar-du: - DT bindings conversion to YAML - Planes zpos sanity check and fix - MAINTAINERS entry for LVDS panel driver mcde: - fix return value mgag200: - use managed config init stm: - read endpoints from DT vboxvideo: - use PCI managed functions - drop WC mtrr vkms: - enable cursor by default rockchip: - afbc support virtio: - various cleanups qxl: - fix cursor notify port hisilicon: - 128-byte stride alignment fix sun4i: - improved format handling" * tag 'drm-next-2020-06-02' of git://anongit.freedesktop.org/drm/drm: (1401 commits) drm/amd/display: Fix potential integer wraparound resulting in a hang drm/amd/display: drop cursor position check in atomic test drm/amdgpu: fix device attribute node create failed with multi gpu drm/nouveau: use correct conflicting framebuffer API drm/vblank: Fix -Wformat compile warnings on some arches drm/amdgpu: Sync with VM root BO when switching VM to CPU update mode drm/amd/display: Handle GPU reset for DC block drm/amdgpu: add apu flags (v2) drm/amd/powerpay: Disable gfxoff when setting manual mode on picasso and raven drm/amdgpu: fix pm sysfs node handling (v2) drm/amdgpu: move gpu_info parsing after common early init drm/amdgpu: move discovery gfx config fetching drm/nouveau/dispnv50: fix runtime pm imbalance on error drm/nouveau: fix runtime pm imbalance on error drm/nouveau: fix runtime pm imbalance on error drm/nouveau/debugfs: fix runtime pm imbalance on error drm/nouveau/nouveau/hmm: fix migrate zero page to GPU drm/nouveau/nouveau/hmm: fix nouveau_dmem_chunk allocations drm/nouveau/kms/nv50-: Share DP SST mode_valid() handling with MST drm/nouveau/kms/nv50-: Move 8BPC limit for MST into nv50_mstc_get_modes() ...
2020-05-21drm/amdkfd: report the real PCI bus numberEvan Quan1-1/+1
Since the PCI bus number retrieved by PCI_BUS_NUM(pdev->devfn) is wrong. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-21drm/amdkfd: Fix boolreturn.cocci warningsAishwarya Ramakrishnan1-2/+2
Return statements in functions returning bool should use true/false instead of 1/0. drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c:40:9-10: WARNING: return of 0/1 in function 'event_interrupt_isr_v9' with return type bool Generated by: scripts/coccinelle/misc/boolreturn.cocci Signed-off-by: Aishwarya Ramakrishnan <aishwaryarj100@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-01drm/amdkfd: Use a systematic method to calculate queue mask bitYong Zhao1-1/+3
The queue mask used for set_resources always assumes the queue number per pipe is 8, so KFD needs to align with that by using function amdgpu_queue_mask_bit_to_set_resource_bit(). Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-01drm/amdkfd: Fix comment formattingFelix Kuehling1-2/+2
Corrected two function names. Added a missing space. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-01drm/amdkfd: Report domain with topologyOri Messinger2-0/+4
PCI domain has moved to 32-bits to accommodate virtualization, so a 32-bit integer is exposed for domain to reflect this change. Domain can be found in here: /sys/class/kfd/kfd/topology/nodes/X/properties Where X is the card number Signed-off-by: Ori Messinger <ori.messinger@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-30drm/amdkfd: Track GPU memory utilization per processMukul Joshi3-10/+67
Track GPU VRAM usage on a per process basis and report it through sysfs. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28drm/amdkfd: Enable over-subscription with >1 GWS queueJoseph Greathouse8-6/+62
The current GWS usage model will only allows a single GWS-enabled process to be active on the GPU at once. This ensures that a barrier-using kernel gets a known amount of GPU hardware, to prevent deadlock due to inability to go beyond the GWS barrier. The HWS watches how many GWS entries are assigned to each process, and goes into over-subscription mode when two processes need more than the 64 that are available. The current KFD method for working with this is to allocate all 64 GWS entries to each GWS-capable process. When more than one GWS-enabled process is in the runlist, we must make sure the runlist is in over-subscription mode, so that the HWS gets a chained RUN_LIST packet and continues scheduling kernels. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28drm/amdkfd: Enable GWS based on FW SupportJoseph Greathouse4-13/+38
Rather than only enabling GWS support based on the hws_gws_support modparm, also check whether the GPU's HWS firmware supports GWS. Leave the old modparm in place in case users want to test GWS on GPUs not yet in the support list. v2: fix broken syntax from the first patch. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28drm/amdkfd: New IOCTL to allocate queue GWS (v2)Oak Zeng3-0/+50
Add a new kfd ioctl to allocate queue GWS. Queue GWS is released on queue destroy. v2: re-introduce this API with the following fixes squashed in: - drm/amdkfd: fix null pointer dereference on dev - drm/amdkfd: Return proper error code for gws alloc API - drm/amdkfd: Remove GPU ID in GWS queue creation Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28drm/amdkfd: Put ASIC revision into HSA capabilityJoseph Greathouse2-1/+8
In order to surface the ASIC revision to user level, we want to put it into the HSA topology. This can be because different ASIC revisions may require user-level software to do different things (e.g. patch code for things that are changed in later hardware revisions). The ASIC revision from the hardware is maximum of 4 bits at this time, so put it into 4 of the open bits in the HSA capability. Then user-level software can use this capability information to know -- for each ASIC -- what revision-based things must be done. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-23drm/amdkfd: Adjust three kfd dmesg printings during initializationYong Zhao2-3/+1
Delete two printings which are not very useful, and change one from pr_info() to pr_debug(). Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-13device_cgroup: Cleanup cgroup eBPF device filter codeOdin Ugedal1-1/+1
Original cgroup v2 eBPF code for filtering device access made it possible to compile with CONFIG_CGROUP_DEVICE=n and still use the eBPF filtering. Change commit 4b7d4d453fc4 ("device_cgroup: Export devcgroup_check_permission") reverted this, making it required to set it to y. Since the device filtering (and all the docs) for cgroup v2 is no longer a "device controller" like it was in v1, someone might compile their kernel with CONFIG_CGROUP_DEVICE=n. Then (for linux 5.5+) the eBPF filter will not be invoked, and all processes will be allowed access to all devices, no matter what the eBPF filter says. Signed-off-by: Odin Ugedal <odin@ugedal.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-04-01drm/amdkfd: kfree the wrong pointerJack Zhang1-2/+2
Originally, it kfrees the wrong pointer for mem_obj. It would cause memory leak under stress test. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-19drm: amd: fix spelling mistake "shoudn't" -> "shouldn't"Colin Ian King1-1/+1
There are spelling mistakes in pr_err messages and a comment. Fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10drm/amdkfd: Consolidate duplicated bo alloc flagsYong Zhao1-6/+7
ALLOC_MEM_FLAGS_* used are the same as the KFD_IOC_ALLOC_MEM_FLAGS_*, but they are interweavedly used in kernel driver, resulting in bad readability. For example, KFD_IOC_ALLOC_MEM_FLAGS_COHERENT is not referenced in kernel, and it functions implicitly in kernel through ALLOC_MEM_FLAGS_COHERENT, causing unnecessary confusion. Replace all occurrences of ALLOC_MEM_FLAGS_* with KFD_IOC_ALLOC_MEM_FLAGS_* to solve the problem. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10drm/amdkfd: Use pr_debug to print the message of reaching event limitYong Zhao1-1/+1
People are inclined to think of the previous pr_warn message as an error, so use pre_debug instead. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06drm/amdkfd: Signal eviction fence on process destruction (v2)Felix Kuehling1-0/+5
Otherwise BOs may wait for the fence indefinitely and never be destroyed. v2: Signal the fence right after destroying queues to avoid unnecessary delaye-delete in kfd_process_wq_release Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: xinhui pan <xinhui.pan@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06drm/amdkfd: Add more comments on GFX9 user CP queue MQD workaroundYong Zhao1-3/+15
Because too many things are involved in this workaround, we need more comments to avoid pitfalls. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-05drm/amdkfd: fix indentation issueColin Ian King1-1/+1
There is a statement that is indented with spaces instead of a tab. Replace spaces with a tab. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-29drm/amdkfd: change SDMA MQD memory typeEric Huang1-1/+1
SDMA MQD memory type is NC that causes MQD data overwritten accidentally by an old stable cache line. Changing it to UC default for GART will fix the issue. The mqd_gfx9 parameter is meant for control stacks that are allocated together with user mode queue MQDs. Setting mqd_gfx9 to true maps the control stack pages as NC. Here it was accidentally applied to SDMA MQDs, which are allocated together with the HIQ MQD. Setting the mqd_gfx9 to false avoids that. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-29drm/amdkfd: Make get_tile_config() genericYong Zhao1-1/+1
Given we can query all the asic specific information from amdgpu_gfx_config, we can make get_tile_config() generic. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Delete unnecessary unmap queue package submissionsYong Zhao3-68/+29
The previous way of using SDMA queue count to infer whether we should unmap SDMA engines has bugs. The reason it did not cause issues is because MEC firmware unmaps all queues (CP + SDMA) when a unmap package for compute engine is received. Becasue of that, only one unmap queue package is needed, instead of one unmap queue package for CP and each SDMA engine, which results in much simpler driver code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Delete excessive printingsYong Zhao2-5/+1
Those printings are duplicated or useless. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Fix a memory leak in queue creation error handlingYong Zhao1-0/+3
When the queue creation failed, some resources were not freed. Fix it. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Count active CP queues directlyYong Zhao3-16/+35
The previous code of calculating active CP queues is problematic if some SDMA queues are inactive. Fix that by counting CP queues directly. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Avoid ambiguity by indicating it's cp queueYong Zhao5-10/+10
The queues represented in queue_bitmap are only CP queues. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amdkfd: Rename queue_count to active_queue_countYong Zhao4-23/+23
The name is easier to understand the code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26drm/amd: Extend ROCt to surface UUID for devices that have themDivya Shikre4-0/+10
Devices from Arcturus onwards will have their UUID exposed to Thunk. Adding neccessary functions to the kernel to propagate the uuid. Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-13drm/amdkfd: refactor runtime pm for bacoRajneesh Bhardwaj3-14/+58
So far the kfd driver implemented same routines for runtime and system wide suspend and resume (s2idle or mem). During system wide suspend the kfd aquires an atomic lock that prevents any more user processes to create queues and interact with kfd driver and amd gpu. This mechanism created problem when amdgpu device is runtime suspended with BACO enabled. Any application that relies on kfd driver fails to load because the driver reports a locked kfd device since gpu is runtime suspended. However, in an ideal case, when gpu is runtime suspended the kfd driver should be able to: - auto resume amdgpu driver whenever a client requests compute service - prevent runtime suspend for amdgpu while kfd is in use This change refactors the amdgpu and amdkfd drivers to support BACO and runtime power management. Reviewed-by: Oak Zeng <oak.zeng@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-13drm/amdkfd: show warning when kfd is lockedRajneesh Bhardwaj1-0/+2
During system suspend the kfd driver aquires a lock that prohibits further kfd actions unless the gpu is resumed. This adds some info which can be useful while debugging. Reviewed-by: Oak Zeng <oak.zeng@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-06drm/amdkfd: Add queue information to sysfsAmber Lin3-0/+99
Provide compute queues information in sysfs under /sys/class/kfd/kfd/proc. The format is /sys/class/kfd/kfd/proc/<pid>/queues/<queue id>/XX where XX are size, type, and gpuid three files to represent queue size, queue type, and the GPU this queue uses. <queue id> folder and files underneath are generated when a queue is created. They are removed when the queue is destroyed. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-04drm/amdkfd: Fix a bug in SDMA RLC queue counting under HWS modeYong Zhao1-4/+6
The sdma_queue_count increment should be done before execute_queues_cpsch(), which calls pm_calc_rlib_size() where sdma_queue_count is used to calculate whether over_subscription is triggered. With the previous code, when a SDMA queue is created, compute_queue_count in pm_calc_rlib_size() is one more than the actual compute queue number, because the queue_count has been incremented while sdma_queue_count has not. This patch fixes that. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/amdkfd: Add a message when SW scheduler is usedYong Zhao1-0/+1
SW scheduler is previously called non HW scheduler, or non HWS. This message is useful when triaging issues from dmesg. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/amdkfd: use map_queues for hiq on gfx v10 as wellHuang Rui1-1/+9
To align with gfx v9, we use the map_queues packet to load hiq MQD. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/amdkfd: use kiq to load the mqd of hiq queue for gfx v9 (v6)Aaron Liu1-1/+9
There is an issue that CP will check the HIQ queue to be configured and mapped with KIQ ring, otherwise, it will be unable to read back the secure buffer while the gfxoff is enabled even with trusted IP blocks. v1 -> v2: - Fix to remove surplus set_resources packets. - Fill the whole configuration in MQD. - Change the author as Aaron because he addressed the key point of this issue. - Add kiq ring lock. v2 -> v3: - Free the lock while in error return case. - Remove the programming only needed by the queue is unmapped. v3 -> v4: - Remove doorbell programming because it's used for restarting queue. - Remove CP scheduler programming because map_queue packet will handle this. v4 -> v5: - Remove cp_hqd_active because mec ucode will enable it while use map_queues. - Revise goto out_unlock. - Correct the right doorbell offset for HIQ that kfd driver assigned in the packet. v5 -> v6: - Merge Arcturus fix into this patch because it will get oops in Arcturus platform. Reported-by: Lisa Saturday <Lisa.Saturday@amd.com> Signed-off-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-and-Tested-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16drm/amdgpu: GPU TLB flush API moved to amdgpu_amdkfdAlex Sierra1-3/+5
[Why] TLB flush method has been deprecated using kfd2kgd interface. This implementation is now on the amdgpu_amdkfd API. [How] TLB flush functions now implemented in amdgpu_amdkfd. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-10drm/amdkfd: Improve kfd_process lookup in kfd_ioctlFelix Kuehling2-4/+28
Use filep->private_data to store a pointer to the kfd_process data structure. Take an extra reference for that, which gets released in the kfd_release callback. Check that the process calling kfd_ioctl is the same that opened the file descriptor. Return -EBADF if it's not, so that this error can be distinguished in user mode. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07drm/amdkfd: Avoid hanging hardware in stop_cpschFelix Kuehling5-13/+17
Don't use the HWS if it's known to be hanging. In a reset also don't try to destroy the HIQ because that may hang on SRIOV if the KIQ is unresponsive. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07drm/amdkfd: Improve HWS hang detection and handlingFelix Kuehling3-6/+26
Move HWS hang detection into unmap_queues_cpsch to catch hangs in all cases. If this happens during a reset, don't schedule another reset because the reset already in progress is expected to take care of it. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07drm/amdkfd: Remove unused variableFelix Kuehling2-2/+0
dqm->pipeline_mem wasn't used anywhere. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07drm/amdkfd: Fix permissions of hang_hwsFelix Kuehling1-1/+1
Reading from /sys/kernel/debug/kfd/hang_hws would cause a kernel oops because we didn't implement a read callback. Set the permission to write-only to prevent that. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-19drm/amdkfd: expose num_cp_queues data field to topology node (v2)Huang Rui2-0/+4
Thunk driver would like to know the num_cp_queues data, however this data relied on different asic specific. So it's better to get it from kfd driver. v2: don't update name size. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-19drm/amdkfd: expose num_sdma_queues_per_engine data field to topology node (v2)Huang Rui2-0/+5
Thunk driver would like to know the num_sdma_queues_per_engine data, however this data relied on different asic specific. So it's better to get it from kfd driver. v2: don't update the name size. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>