summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd
AgeCommit message (Collapse)AuthorFilesLines
2022-04-19drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM tooLang Yu1-1/+3
The idea is from commit a50fe7078035 ("drm/amdkfd: Only apply heavy-weight TLB flush on Aldebaran") and commit f61c40c0757a ("drm/amdkfd: enable heavy-weight TLB flush on Arcturus"). At the moment, heavy-weight TLB could cause problems on ASICs except Aldebaran and Arcturus. A simple hipMallocManaged/hipFree program could trigger this issue. [ 97.787657] amdgpu 0000:01:00.0: amdgpu: wait for kiq fence error: 0. [ 106.868758] amdgpu: qcm fence wait loop timeout expired [ 106.868966] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption [ 106.869203] amdgpu: Failed to evict process queues [ 106.869261] amdgpu: Failed to quiesce KFD Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-19drm/amdkfd: move kfd_flush_tlb_after_unmap into kfd_priv.hLang Yu2-8/+8
To make kfd_flush_tlb_after_unmap visible in kfd_svm.c, move it into kfd_priv.h. And change it to an inline function. Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-14drm/amdkfd: fix race condition in kfd_wait_on_eventsFelix Kuehling1-21/+5
Add the waiters to the wait queue during initialization, while holding the event spinlock. Otherwise the waiter will not get activated if the event signals before being added to the wait queue. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-14drm/amdkfd: potential NULL dereference in kfd_set/reset_event()Dan Carpenter1-2/+12
If lookup_event_by_id() returns a NULL "ev" pointer then the spin_lock(&ev->lock) will crash. This was detected by Smatch: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_events.c:644 kfd_set_event() error: we previously assumed 'ev' could be null (see line 639) Fixes: 5273e82c5f47 ("drm/amdkfd: Improve concurrency of event handling") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-13drm/amdkfd: Cleanup IO links during KFD device removalMukul Joshi3-11/+78
Currently, the IO-links to the device being removed from topology, are not cleared. As a result, there would be dangling links left in the KFD topology. This patch aims to fix the following: 1. Cleanup all IO links to the device being removed. 2. Ensure that node numbering in sysfs and nodes proximity domain values are consistent after the device is removed: a. Adding a device and removing a GPU device are made mutually exclusive. b. The global proximity domain counter is no longer required to be an atomic counter. A normal 32-bit counter can be used instead. 3. Update generation_count to let user-mode know that topology has changed due to device removal. CC: Shuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-13drm/amdkfd: shrink bitmap size in struct svm_validate_contextLang Yu1-1/+1
A MAX_GPU_INSTANCE bits bitmap will suffice. Signed-off-by: Lang Yu <Lang.Yu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-12drm/amdkfd: Asynchronously free eventsFelix Kuehling2-2/+3
The synchronize_rcu call in destroy_events can take several ms, which noticeably slows down applications destroying many events. Use kfree_rcu to free the event structure asynchronously and eliminate the synchronize_rcu call in the user thread. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-12Merge tag 'drm-misc-next-2022-04-07' of ↵Dave Airlie1-1/+1
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 5.19: UAPI Changes: Cross-subsystem Changes: Core Changes: - atomic: Add atomic_print_state to private objects - edid: Constify the EDID parsing API, rework of the API - dma-buf: Add dma_resv_replace_fences, dma_resv_get_singleton, make dma_resv_excl_fence private - format: Support monochrome formats - fbdev: fixes for cfb_imageblit and sys_imageblit, pagelist corruption fix - selftests: several small fixes - ttm: Rework bulk move handling Driver Changes: - Switch all relevant drivers to drm_mode_copy or drm_mode_duplicate - bridge: conversions to devm_drm_of_get_bridge and panel_bridge, autosuspend for analogix_dp, audio support for it66121, DSI to DPI support for tc358767, PLL fixes and I2C support for icn6211 - bridge_connector: Enable HPD if supported - etnaviv: fencing improvements - gma500: GEM and GTT improvements, connector handling fixes - komeda: switch to plane reset helper - mediatek: MIPI DSI improvements - omapdrm: GEM improvements - panel: DT bindings fixes for st7735r, few fixes for ssd130x, new panels: ltk035c5444t, B133UAN01, NV3052C - qxl: Allow to run on arm64 - sysfb: Kconfig rework, support for VESA graphic mode selection - vc4: Add a tracepoint for CL submissions, HDMI YUV output, HDMI and clock improvements - virtio: Remove restriction of non-zero blob_flags, - vmwgfx: support for CursorMob and CursorBypass 4, various improvements and small fixes [airlied: fixup conflict with newvision panel callbacks] Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/20220407085940.pnflvjojs4qw4b77@houat
2022-04-11drm/amdkfd: Handle drain retry fault race with XNACK mode changePhilip Yang1-5/+6
Application could change XNACK enabled to disabled while KFD is draining stale retry fault, therefore the check for whether to drain retry faults must be before the check for whether xnack_enabled, to avoid report incorrect vm fault after application changes XNACK mode. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-07drm/amdkfd: Improve concurrency of event handlingFelix Kuehling3-43/+88
Use rcu_read_lock to read p->event_idr concurrently with other readers and writers. Use p->event_mutex only for creating and destroying events and in kfd_wait_on_events. Protect the contents of the kfd_event structure with a per-event spinlock that can be taken inside the rcu_read_lock critical section. This eliminates contention of p->event_mutex in set_event, which tends to be on the critical path for dispatch latency even when busy waiting is used. It also eliminates lock contention in event interrupt handlers. Since the p->event_mutex is now used much less, the impact of requiring it in kfd_wait_on_events should also be much smaller. This should improve event handling latency for processes using multiple GPUs concurrently. v2: Reschedule the worker periodically to avoid soft lockup warnings Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Sean Keely <Sean.Keely@amd.com> # v1 Tested-by: Sanjay Tripathi <sanjay.tripathi@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-07Merge tag 'amd-drm-fixes-5.18-2022-04-06' of ↵Dave Airlie1-9/+15
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-5.18-2022-04-06: amdgpu: - VCN 3.0 fixes - DCN 3.1.5 fix - Misc display fixes - GC 10.3 golden register fix - Suspend fix - SMU 10 fix amdkfd: - Event fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220406170441.5779-1-alexander.deucher@amd.com
2022-04-07Merge tag 'amd-drm-next-5.18-2022-03-25' of ↵Dave Airlie2-9/+7
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-next-5.18-2022-03-25: amdgpu: - GFX 10.3.7 fixes - noretry updates - VCN fixes - TMDS fix - zstate fix for freesync video - DCN 3.1.5 fix - Display stack size fix - Audio fix - DCN 3.1 pstate fix - TMZ VCN fix - APU passthrough fix - Misc other fixes amdkfd: - Error handling fix - xgmi p2p fix - HWS VMIDs fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220325183602.5718-1-alexander.deucher@amd.com
2022-04-06drm/amdkfd: Create file descriptor after client is added to smi_clients listLee Jones1-9/+15
This ensures userspace cannot prematurely clean-up the client before it is fully initialised which has been proven to cause issues in the past. Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2022-04-06dma-buf/drivers: make reserving a shared slot mandatory v4Christian König1-1/+1
Audit all the users of dma_resv_add_excl_fence() and make sure they reserve a shared slot also when only trying to add an exclusive fence. This is the next step towards handling the exclusive fence like a shared one. v2: fix missed case in amdgpu v3: and two more radeon, rename function v4: add one more case to TTM, fix i915 after rebase Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20220406075132.3263-2-christian.koenig@amd.com
2022-04-06drm/amdkfd: Add missing NULL check in svm_range_map_to_gpuPhilip Yang1-1/+1
bo_adev is NULL for system memory mapping to GPU. Fixes: 30671b44aa570a ("drm/amdgpu: fix TLB flushing during eviction") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-05drm/amdgpu: fix TLB flushing during evictionChristian König1-9/+9
Testing the valid bit is not enough to figure out if we need to invalidate the TLB or not. During eviction it is quite likely that we move a BO from VRAM to GTT and update the page tables immediately to the new GTT address. Rework the whole function to get all the necessary parameters directly as value. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-01drm/amdkfd: Create file descriptor after client is added to smi_clients listLee Jones1-9/+15
This ensures userspace cannot prematurely clean-up the client before it is fully initialised which has been proven to cause issues in the past. Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-01drm/amdkfd: Use atomic64_t type for pdd->tlb_seqPhilip Yang2-4/+8
To support multi-thread update page table. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdgpu: remove table_freed param from the VM codeChristian König1-3/+2
Better to leave the decision when to flush the VM changes in the TLB to the VM code. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: use tlb_seq from the VM subsystem for SVM as well v2Christian König3-16/+12
Instead of hand rolling the table_freed parameter. v2: add some changes suggested by Philip Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: start using tlb_seq from the VM subsystemChristian König2-0/+8
Instead of trying to figure out if a TLB flush is necessary or not use the information provided by the VM subsystem now. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: print unmap queue status for RAS poison consumption (v3)Tao Zhou1-4/+9
Print the status out when it passes, and also tell user gpu reset is triggered when we fall back to legacy way. v2: make the message more explicit. v3: change succeeds to succeeded. replace pr_warn with dev_warn. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: add RAS poison consumption handling for UTCL2 (v2)Tao Zhou1-0/+6
Do RAS page retirement and use gpu reset as fallback in UTCL2 fault handler. v2: replace vm fault event with posion consumed event in UTCL2 poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: replace source_id with client_id for RAS poison consumptionTao Zhou1-7/+16
Client ID is more accruate here and we can deal with more different cases with client ID. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: refine event_interrupt_poison_consumptionTao Zhou1-6/+5
Combine reading and setting poison flag as one atomic operation and add print message for the function. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: Check for potential null return of kmalloc_array()QintaoShen1-0/+2
As the kmalloc_array() may return null, the 'event_waiters[i].wait' would lead to null-pointer dereference. Therefore, it is better to check the return value of kmalloc_array() to avoid this confusion. Signed-off-by: QintaoShen <unSimple1993@163.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: Check use_xgmi_p2p before reporting hive_idDivya Shikre1-1/+2
Recently introduced commit 158a05a0b885 ("drm/amdgpu: Add use_xgmi_p2p module parameter") did not update XGMI iolinks when use_xgmi_p2p is disabled. Add fix to not create XGMI iolinks in KFD topology when this parameter is disabled. Fixes: 158a05a0b885 ("drm/amdgpu: Add use_xgmi_p2p module parameter") Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25drm/amdkfd: Fix Incorrect VMIDs passed to HWSTushar Patel1-8/+3
Compute-only GPUs have more than 8 VMIDs allocated to KFD. Fix this by passing correct number of VMIDs to HWS v2: squash in warning fix (Alex) Signed-off-by: Tushar Patel <tushar.patel@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25Merge tag 'drm-next-2022-03-24' of git://anongit.freedesktop.org/drm/drmLinus Torvalds55-2662/+3426
Pull drm updates from Dave Airlie: "Lots of work all over, Intel improving DG2 support, amdkfd CRIU support, msm new hw support, and faster fbdev support. dma-buf: - rename dma-buf-map to iosys-map core: - move buddy allocator to core - add pci/platform init macros - improve EDID parser deep color handling - EDID timing type 7 support - add GPD Win Max quirk - add yes/no helpers to string_helpers - flatten syncobj chains - add nomodeset support to lots of drivers - improve fb-helper clipping support - add default property value interface fbdev: - improve fbdev ops speed ttm: - add a backpointer from ttm bo->ttm resource dp: - move displayport headers - add a dp helper module bridge: - anx7625 atomic support, HDCP support panel: - split out panel-lvds and lvds bindings - find panels in OF subnodes privacy: - add chromeos privacy screen support fb: - hot unplug fw fb on forced removal simpledrm: - request region instead of marking ioresource busy - add panel oreintation property udmabuf: - fix oops with 0 pages amdgpu: - power management code cleanup - Enable freesync video mode by default - RAS code cleanup - Improve VRAM access for debug using SDMA - SR-IOV rework special register access and fixes - profiling power state request ioctl - expose IP discovery via sysfs - Cyan skillfish updates - GC 10.3.7, SDMA 5.2.7, DCN 3.1.6 updates - expose benchmark tests via debugfs - add module param to disable XGMI for testing - GPU reset debugfs register dumping support amdkfd: - CRIU support - SDMA queue fixes radeon: - UVD suspend fix - iMac backlight fix i915: - minimal parallel submission for execlists - DG2-G12 subplatform added - DG2 programming workarounds - DG2 accelerated migration support - flat CCS and CCS engine support for XeHP - initial small BAR support - drop fake LMEM support - ADL-N PCH support - bigjoiner updates - introduce VMA resources and async unbinding - register definitions cleanups - multi-FBC refactoring - DG1 OPROM over SPI support - ADL-N platform enabling - opregion mailbox #5 support - DP MST ESI improvements - drm device based logging - async flip optimisation for DG2 - CPU arch abstraction fixes - improve GuC ADS init to work on aarch64 - tweak TTM LRU priority hint - GuC 69.0.3 support - remove short term execbuf pins nouveau: - higher DP/eDP bitrates - backlight fixes msm: - dpu + dp support for sc8180x - dp support for sm8350 - dpu + dsi support for qcm2290 - 10nm dsi phy tuning support - bridge support for dp encoder - gpu support for additional 7c3 SKUs ingenic: - HDMI support for JZ4780 - aux channel EDID support ast: - AST2600 support - add wide screen support - create DP/DVI connectors omapdrm: - fix implicit dma_buf fencing vc4: - add CSC + full range support - better display firmware handoff panfrost: - add initial dual-core GPU support stm: - new revision support - fb handover support mediatek: - transfer display binding document to yaml format. - add mt8195 display device binding. - allow commands to be sent during video mode. - add wait_for_event for crtc disable by cmdq. tegra: - YUV format support rcar-du: - LVDS support for M3-W+ (R8A77961) exynos: - BGR pixel format for FIMD device" * tag 'drm-next-2022-03-24' of git://anongit.freedesktop.org/drm/drm: (1529 commits) drm/i915/display: Do not re-enable PSR after it was marked as not reliable drm/i915/display: Fix HPD short pulse handling for eDP drm/amdgpu: Use drm_mode_copy() drm/radeon: Use drm_mode_copy() drm/amdgpu: Use ternary operator in `vcn_v1_0_start()` drm/amdgpu: Remove pointless on stack mode copies drm/amd/pm: fix indenting in __smu_cmn_reg_print_error() drm/amdgpu/dc: fix typos in comments drm/amdgpu: fix typos in comments drm/amd/pm: fix typos in comments drm/amdgpu: Add stolen reserved memory for MI25 SRIOV. drm/amdgpu: Merge get_reserved_allocation to get_vbios_allocations. drm/amdkfd: evict svm bo worker handle error drm/amdgpu/vcn: fix vcn ring test failure in igt reload test drm/amdgpu: only allow secure submission on rings which support that drm/amdgpu: fixed the warnings reported by kernel test robot drm/amd/display: 3.2.177 drm/amd/display: [FW Promotion] Release 0.0.108.0 drm/amd/display: Add save/restore PANEL_PWRSEQ_REF_DIV2 drm/amd/display: Wait for hubp read line for Pollock ...
2022-03-15drm/amdkfd: evict svm bo worker handle errorPhilip Yang2-14/+36
Migrate vram to ram may fail to find the vma if process is exiting and vma is removed, evict svm bo worker sets prange->svm_bo to NULL and warn svm_bo ref count != 1 only if migrating vram to ram successfully. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15drm/amdkfd: CRIU export dmabuf handles for GTT BOsDavid Yat Sin1-4/+14
Export dmabuf handles for GTT BOs so that their contents can be accessed using SDMA during checkpoint/restore. v2: Squash in fix from David to set dmabuf handle to invalid for BOs that cannot be accessed using SDMA during checkpoint/restore. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by : Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15drm/amdkfd: CRIU Refactor restore BO functionDavid Yat Sin1-142/+129
Refactor CRIU restore BO to reduce identation. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15drm/amdkfd: CRIU remove sync and TLB flush on restoreDavid Yat Sin1-29/+1
When the process is getting restored, the queues are not mapped yet, so there is no VMID assigned for this process and no TLBs to flush. Signed-off-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-10drm/amdkfd: bail out early if no get_atc_vmid_pasid_mapping_infoYifan Zhang1-9/+12
it makes no sense to continue with an undefined vmid. Fixes: c8b0507f40deea ("drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before call") Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-07drm/amdkfd: Add format attribute to kfd_smi_event_addPhilip Yang1-0/+1
To enable compiler type-checked against the format string in callers. All warnings (new ones prefixed by >>): >> warning: function 'kfd_smi_event_add' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format] Fixes: d58b8a99cbb8 ("drm/amdkfd: Add SMI add event helper") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-04drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before callYifan Zhang1-8/+10
Fix the NULL point issue: [ 3076.255609] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 3076.255624] #PF: supervisor instruction fetch in kernel mode [ 3076.255637] #PF: error_code(0x0010) - not-present page [ 3076.255649] PGD 0 P4D 0 [ 3076.255660] Oops: 0010 [#1] SMP NOPTI [ 3076.255669] CPU: 20 PID: 2415 Comm: kfdtest Tainted: G W OE 5.11.0-41-generic #45~20.04.1-Ubuntu [ 3076.255691] Hardware name: AMD Splinter/Splinter-RPL, BIOS VS2326337N.FD 02/07/2022 [ 3076.255706] RIP: 0010:0x0 [ 3076.255718] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [ 3076.255732] RSP: 0018:ffffb64283e3fc10 EFLAGS: 00010297 [ 3076.255744] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000027 [ 3076.255759] RDX: ffffb64283e3fc1e RSI: 0000000000000008 RDI: ffff8c7a87f60000 [ 3076.255776] RBP: ffffb64283e3fc78 R08: ffff8c7d88518ac0 R09: ffffb64283e3fa60 [ 3076.255791] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000000f [ 3076.255805] R13: ffff8c7bdcea5800 R14: ffff8c7a9f3f3000 R15: ffff8c7a8696bc00 [ 3076.255820] FS: 0000000000000000(0000) GS:ffff8c7d88500000(0000) knlGS:0000000000000000 [ 3076.255839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3076.255851] CR2: ffffffffffffffd6 CR3: 0000000109e3c000 CR4: 0000000000750ee0 [ 3076.255866] PKRU: 55555554 [ 3076.255873] Call Trace: [ 3076.255884] dbgdev_wave_reset_wavefronts+0x72/0x160 [amdgpu] [ 3076.256025] process_termination_cpsch.cold+0x26/0x2f [amdgpu] [ 3076.256182] ? ktime_get_mono_fast_ns+0x4e/0xa0 [ 3076.256196] kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu] [ 3076.256328] kfd_process_notifier_release+0x187/0x2b0 [amdgpu] [ 3076.256451] ? mn_itree_inv_end+0xdc/0x110 [ 3076.256463] __mmu_notifier_release+0x74/0x1f0 [ 3076.256474] exit_mmap+0x170/0x200 [ 3076.256484] ? __handle_mm_fault+0x677/0x920 [ 3076.256496] ? _cond_resched+0x19/0x30 [ 3076.256507] mmput+0x5d/0x130 [ 3076.256518] do_exit+0x332/0xaf0 [ 3076.256526] ? handle_mm_fault+0xd7/0x2b0 [ 3076.256537] do_group_exit+0x43/0xa0 [ 3076.256548] __x64_sys_exit_group+0x18/0x20 [ 3076.256559] do_syscall_64+0x38/0x90 [ 3076.256569] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03mm: remove the extra ZONE_DEVICE struct page refcountChristoph Hellwig1-1/+0
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: don't include <linux/memremap.h> in <linux/mm.h>Christoph Hellwig1-0/+1
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03mm: remove pointless includes from <linux/hmm.h>Christoph Hellwig1-0/+1
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Link: https://lkml.kernel.org/r/20220210072828.2930359-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03drm/amdkfd: Add SMI add event helperPhilip Yang1-46/+22
To remove duplicate code, unify event message format and simplify new event add in the following patches. Use KFD_SMI_EVENT_MSG_SIZE to define msg size, the same size will be used in user space to alloc the msg receive buffer. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03drm/amdkfd: Correct SMI event read sizePhilip Yang1-2/+3
sizeof(buf) is 8 bytes because it is defined as unsigned char *buf, each SMI event read only copy max 8 bytes to user buffer. Correct this by using the buf allocate size. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-03Revert "drm/amdkfd: process_info lock not needed for svm"Philip Yang1-0/+9
This reverts commit 3abfe30d803e62cc75dec254eefab3b04d69219b. To fix deadlock in kFDSVMEvictTest when xnack off. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23drm/amdkfd: Print bdf in peer map failure messageHarish Kasiviswanathan1-2/+9
Print alloc node, peer node and memory domain when peer map fails. This is more useful v2: use dev_err instead of pr_err use bdf for identify peer gpu Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23drm/amdkfd: Use real device for messagesFelix Kuehling3-10/+4
kfd_chardev() doesn't provide much useful information in dev_... messages on multi-GPU systems because there is only one KFD device, which doesn't correspond to any particular GPU. Use the actual GPU device to indicate the GPU that caused a message. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23drm/amdkfd: Fix for possible integer overflowDavid Yat Sin1-1/+1
Fix for possible integer overflow when doing addition. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22drm/amdkfd: make CRAT table missing message informational onlyAlex Deucher1-1/+1
The driver has a fallback so make the message informational rather than a warning. The driver has a fallback if the Component Resource Association Table (CRAT) is missing, so make this informational now. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1906 Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22drm/amdkfd: Fix criu_restore_bo error handlingFelix Kuehling1-2/+2
Clang static analysis reports this problem kfd_chardev.c:2327:2: warning: 1st function call argument is an uninitialized value kvfree(bo_privs); ^~~~~~~~~~~~~~~~ Make sure bo_buckets and bo_privs are initialized so freeing them in the error handling code path will never result in undefined behaviour. Fixes: 73fa13b6a511 ("drm/amdkfd: CRIU Implement KFD restore ioctl") Reported-by: Tom Rix <trix@redhat.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-22drm/amdkfd: Drop IH ring overflow message to dbgKent Russell1-1/+1
When this was first implemented, overflows weren't expected in regular operations, and tests weren't in place to cause said overflow. Now there are cases where overflows occur with real workloads, but we know that the kernel can handle this robustly, so move the message to a debug message. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17drm/amdkfd: Use proper enum in pm_unmap_queues_v9()Nathan Chancellor1-1/+1
Clang warns: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager_v9.c:267:3: error: implicit conversion from enumeration type 'enum mes_map_queues_extended_engine_sel_enum' to different enumeration type 'enum mes_unmap_queues_extended_engine_sel_enum' [-Werror,-Wenum-conversion] extended_engine_sel__mes_map_queues__sdma0_to_7_sel : ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Use 'extended_engine_sel__mes_unmap_queues__sdma0_to_7_sel' to eliminate the warning, which is the same numeric value of the proper type. Fixes: 009e9a158505 ("drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues") Link: https://github.com/ClangBuiltLinux/linux/issues/1596 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17drm/amdkfd: add return value check for queue evictionTao Zhou1-1/+1
Otherwise gpu reset will be triggered unconditionally in poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>