summaryrefslogtreecommitdiff
path: root/drivers/accel
AgeCommit message (Collapse)AuthorFilesLines
3 daysaccel/qaic: Handle DBC deactivation if the owner went awayYoussef Samir1-2/+45
[ Upstream commit 2feec5ae5df785658924ab6bd91280dc3926507c ] When a DBC is released, the device sends a QAIC_TRANS_DEACTIVATE_FROM_DEV transaction to the host over the QAIC_CONTROL MHI channel. QAIC handles this by calling decode_deactivate() to release the resources allocated for that DBC. Since that handling is done in the qaic_manage_ioctl() context, if the user goes away before receiving and handling the deactivation, the host will be out-of-sync with the DBCs available for use, and the DBC resources will not be freed unless the device is removed. If another user loads and requests to activate a network, then the device assigns the same DBC to that network, QAIC will "indefinitely" wait for dbc->in_use = false, leading the user process to hang. As a solution to this, handle QAIC_TRANS_DEACTIVATE_FROM_DEV transactions that are received after the user has gone away. Fixes: 129776ac2e38 ("accel/qaic: Add control path") Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20260205123415.3870898-1-youssef.abdulrahman@oss.qualcomm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/rocket: fix unwinding in error path in rocket_probeQuentin Schulz1-1/+14
[ Upstream commit 34f4495a7f72895776b81969639f527c99eb12b9 ] When rocket_core_init() fails (as could be the case with EPROBE_DEFER), we need to properly unwind by decrementing the counter we just incremented and if this is the first core we failed to probe, remove the rocket DRM device with rocket_device_fini() as well. This matches the logic in rocket_remove(). Failing to properly unwind results in out-of-bounds accesses. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-2-eec3bf29dc3b@cherry.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/rocket: fix unwinding in error path in rocket_core_initQuentin Schulz1-2/+5
[ Upstream commit f509a081f6a289f7c66856333b3becce7a33c97e ] When rocket_job_init() is called, iommu_group_get() has already been called, therefore we should call iommu_group_put() and make the iommu_group pointer NULL. This aligns with what's done in rocket_core_fini(). If pm_runtime_resume_and_get() somehow fails, not only should rocket_job_fini() be called but we should also unwind everything done before that, that is, disable PM, put the iommu_group, NULLify it and then call rocket_job_fini(). This is exactly what's done in rocket_core_fini() so let's call that function instead of duplicating the code. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-1-eec3bf29dc3b@cherry.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12net: qrtr: Drop the MHI auto_queue feature for IPCR DL channelsManivannan Sadhasivam1-44/+0
[ Upstream commit 51731792a25cb312ca94cdccfa139eb46de1b2ef ] MHI stack offers the 'auto_queue' feature, which allows the MHI stack to auto queue the buffers for the RX path (DL channel). Though this feature simplifies the client driver design, it introduces race between the client drivers and the MHI stack. For instance, with auto_queue, the 'dl_callback' for the DL channel may get called before the client driver is fully probed. This means, by the time the dl_callback gets called, the client driver's structures might not be initialized, leading to NULL ptr dereference. Currently, the drivers have to workaround this issue by initializing the internal structures before calling mhi_prepare_for_transfer_autoqueue(). But even so, there is a chance that the client driver's internal code path may call the MHI queue APIs before mhi_prepare_for_transfer_autoqueue() is called, leading to similar NULL ptr dereference. This issue has been reported on the Qcom X1E80100 CRD machines affecting boot. So to properly fix all these races, drop the MHI 'auto_queue' feature altogether and let the client driver (QRTR) manage the RX buffers manually. In the QRTR driver, queue the RX buffers based on the ring length during probe and recycle the buffers in 'dl_callback' once they are consumed. This also warrants removing the setting of 'auto_queue' flag from controller drivers. Currently, this 'auto_queue' feature is only enabled for IPCR DL channel. So only the QRTR client driver requires the modification. Fixes: 227fee5fc99e ("bus: mhi: core: Add an API for auto queueing buffers for DL channel") Fixes: 68a838b84eff ("net: qrtr: start MHI channel after endpoit creation") Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/linux-arm-msm/ZyTtVdkCCES0lkl4@hovoldconsulting.com Suggested-by: Chris Lew <quic_clew@quicinc.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Acked-by: Jeff Johnson <jjohnson@kernel.org> # drivers/net/wireless/ath/... Acked-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251218-qrtr-fix-v2-1-c7499bfcfbe0@oss.qualcomm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Validate command buffer payload countLizhi Hou1-1/+4
[ Upstream commit 901ec3470994006bc8dd02399e16b675566c3416 ] The count field in the command header is used to determine the valid payload size. Verify that the valid payload does not exceed the remaining buffer space. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260219211946.1920485-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Prevent ubuf size overflowLizhi Hou1-1/+5
[ Upstream commit 03808abb1d868aed7478a11a82e5bb4b3f1ca6d6 ] The ubuf size calculation may overflow, resulting in an undersized allocation and possible memory corruption. Use check_add_overflow() helpers to validate the size calculation before allocation. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260217192815.1784689-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Remove buffer size check when creating command BOLizhi Hou1-19/+19
[ Upstream commit 08fe1b5166fdc81b010d7bf39cd6440620e7931e ] Large command buffers may be used, and they do not always need to be mapped or accessed by the driver. Performing a size check at command BO creation time unnecessarily rejects valid use cases. Remove the buffer size check from command BO creation, and defer vmap and size validation to the paths where the driver actually needs to map and access the command buffer. Fixes: ac49797c1815 ("accel/amdxdna: Add GEM buffer object management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260206060237.4050492-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04accel/amdxdna: Fix tail-pointer polling in mailbox_get_msg()Lizhi Hou1-18/+1
[ Upstream commit cd77d5a4aaf8c5c1d819f47cf814bf7d4920b0a2 ] In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a non-zero tail pointer. This assumed that a zero value indicates an error. However, certain corner cases legitimately produce a zero tail pointer. To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail pointer will be treated as a valid rewind event. Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251204181603.793824-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix incorrect error code returned for failed chain commandLizhi Hou1-1/+1
[ Upstream commit 750817a7c41de083ca5d73052e97bb7b67d7c394 ] The driver currently returns an incorrect error code when a chain command fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for failed chain commands. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix memory leak in amdxdna_ubuf_mapZishun Yi1-2/+8
[ Upstream commit 84dd57fb0359500092f1101409ca32091731490d ] The amdxdna_ubuf_map() function allocates memory for sg and internal sg table structures, but it fails to free them if subsequent operations (sg_alloc_table_from_pages or dma_map_sgtable) fail. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Min Ma <mamin506@gmail.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Stop job scheduling across aie2_release_resource()Lizhi Hou1-0/+6
[ Upstream commit f1370241fe8045702bc9d0812b996791f0500f1b ] Running jobs on a hardware context while it is in the process of releasing resources can lead to use-after-free and crashes. Fix this by stopping job scheduling before calling aie2_release_resource() and restarting it after the release completes. Additionally, aie2_sched_job_run() now checks whether the hardware context is still active. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Hold mm structure across iommu_sva_unbind_device()Lizhi Hou2-0/+4
[ Upstream commit a9162439ad792afcddc04718408ec1380b7a5f63 ] Some tests trigger a crash in iommu_sva_unbind_device() due to accessing iommu_mm after the associated mm structure has been freed. Fix this by taking an explicit reference to the mm structure after successfully binding the device, and releasing it only after the device is unbound. This ensures the mm remains valid for the entire SVA bind/unbind lifetime. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix notifier_wq flushing warningLizhi Hou1-1/+1
[ Upstream commit b36178488d479e9a53bbef2b01280378b5586e60 ] Create notifier_wq with WQ_MEM_RECLAIM flag to fix the possible warning. workqueue: WQ_MEM_RECLAIM amdxdna_js:drm_sched_free_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM notifier_wq:0x0 Fixes: e486147c912f ("accel/amdxdna: Add BO import and export") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260113173624.256053-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix race where send ring appears full due to delayed head updateLizhi Hou1-11/+16
[ Upstream commit 343f5683cfa443000904c88ce2e23656375fc51c ] The firmware sends a response and interrupts the driver before advancing the mailbox send ring head pointer. As a result, the driver may observe the response and attempt to send a new request before the firmware has updated the head pointer. In this window, the send ring still appears full, causing the driver to incorrectly fail the send operation. This race can be triggered more easily in a multithreaded environment, leading to unexpected and spurious "send ring full" failures. To address this, poll the send ring head pointer for up to 100us before returning a full-ring condition. This allows the firmware time to update the head pointer. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251211045125.1724604-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-17accel/amdxdna: Block running under a hypervisorMario Limonciello (AMD)1-0/+6
[ Upstream commit 7bbf6d15e935abbb3d604c1fa157350e84a26f98 ] SVA support is required, which isn't configured by hypervisor solutions. Closes: https://github.com/QubesOS/qubes-issues/issues/10275 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4656 Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251213054513.87925-1-superm1@kernel.org Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Fix deadlock between context destroy and job timeoutLizhi Hou1-2/+4
[ Upstream commit ca2583412306ceda9304a7c4302fd9efbf43e963 ] Hardware context destroy function holds dev_lock while waiting for all jobs to complete. The timeout job also needs to acquire dev_lock, this leads to a deadlock. Fix the issue by temporarily releasing dev_lock before waiting for all jobs to finish, and reacquiring it afterward. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181050.1293125-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Clear mailbox interrupt register during channel creationLizhi Hou1-0/+1
[ Upstream commit 6ff9385c07aa311f01f87307e6256231be7d8675 ] The mailbox interrupt register is not always cleared when a mailbox channel is created. This can leave stale interrupt states from previous operations. Fix this by explicitly clearing the interrupt register in the mailbox channel creation function. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181115.1293158-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Fix dma_fence leak when job is canceledLizhi Hou2-1/+1
[ Upstream commit dea9f84776b96a703f504631ebe9fea07bd2c181 ] Currently, dma_fence_put(job->fence) is called in job notification callback. However, if a job is canceled, the notification callback is never invoked, leading to a memory leak. Move dma_fence_put(job->fence) to the job cleanup function to ensure the fence is always released. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Fix incorrect command state for timed out jobLizhi Hou2-2/+14
[ Upstream commit 6fb7f298883246e21f60f971065adcb789ae6eba ] When a command times out, mark it as ERT_CMD_STATE_TIMEOUT. Any other commands that are canceled due to this timeout should be marked as ERT_CMD_STATE_ABORT. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251029193423.2430463-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Fix race condition when unbinding BOsTomasz Rusinowicz1-1/+2
[ Upstream commit 00812636df370bedf4e44a0c81b86ea96bca8628 ] Fix 'Memory manager not clean during takedown' warning that occurs when ivpu_gem_bo_free() removes the BO from the BOs list before it gets unmapped. Then file_priv_unbind() triggers a warning in drm_mm_takedown() during context teardown. Protect the unmapping sequence with bo_list_lock to ensure the BO is always fully unmapped when removed from the list. This ensures the BO is either fully unmapped at context teardown time or present on the list and unmapped by file_priv_unbind(). Fixes: 48aea7f2a2ef ("accel/ivpu: Fix locking in ivpu_bo_remove_all_bos_from_context()") Signed-off-by: Tomasz Rusinowicz <tomasz.rusinowicz@intel.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251029071451.184243-1-karol.wachowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Remove skip of dma unmap for imported buffersMaciej Falkowski1-3/+0
[ Upstream commit c063c1bbee67391f12956d2ffdd5da00eb87ff79 ] Rework of imported buffers introduced in the commit e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers") switched the logic of imported buffers by dma mapping/unmapping them just as the regular buffers. The commit didn't include removal of skipping dma unmap of imported buffers which results in them being mapped without unmapping. Fixes: e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers") Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Link: https://patch.msgid.link/20251027150933.2384538-1-maciej.falkowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Fix uninitialized return valueLizhi Hou1-2/+2
[ Upstream commit 81233d5419cf20e4f5ef505882951616888a2ef9 ] In aie2_get_hwctx_status() and aie2_query_ctx_status_array(), the functions could return an uninitialized value in some cases. Update them to always return 0. The amount of valid results is indicated by the returned buffer_size, element_size, and num_element fields. Fixes: 2f509fe6a42c ("accel/amdxdna: Add ioctl DRM_IOCTL_AMDXDNA_GET_ARRAY") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251024165503.1548131-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Fix race condition when mapping dmabufWludzik, Jozef1-1/+2
[ Upstream commit 63c7870fab67b2ab2bfe75e8b46f3c37b88c47a8 ] Fix a race that can occur when multiple jobs submit the same dmabuf. This could cause the sg_table to be mapped twice, leading to undefined behavior. Fixes: e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers") Signed-off-by: Wludzik, Jozef <jozef.wludzik@intel.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://lore.kernel.org/r/20251014071725.3047287-1-karol.wachowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Fix DCT active percent formatKarol Wachowski3-4/+9
[ Upstream commit aa1c2b073ad23847dd2e7bdc7d30009f34ed7f59 ] The pcode MAILBOX STATUS register PARAM2 field expects DCT active percent in U1.7 value format. Convert percentage value to this format before writing to the register. Fixes: a19bffb10c46 ("accel/ivpu: Implement DCT handling") Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://lore.kernel.org/r/20251001104322.1249896-1-karol.wachowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Fix page fault in ivpu_bo_unbind_all_bos_from_context()Jacek Lawrynowicz1-6/+16
[ Upstream commit 8b694b405a84696f1d964f6da7cf9721e68c4714 ] Don't add BO to the vdev->bo_list in ivpu_gem_create_object(). When failure happens inside drm_gem_shmem_create(), the BO is not fully created and ivpu_gem_bo_free() callback will not be called causing a deleted BO to be left on the list. Fixes: 8d88e4cdce4f ("accel/ivpu: Use GEM shmem helper for all buffers") Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://lore.kernel.org/r/20250925145114.1446283-1-maciej.falkowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Rework bind/unbind of imported buffersJacek Lawrynowicz3-34/+60
[ Upstream commit e0c0891cd63bf8338c25c423e28a5a93aed3d74c ] Ensure that imported buffers are properly mapped and unmapped in the same way as regular buffers to properly handle buffers during device's bind and unbind operations to prevent resource leaks and inconsistent buffer states. Imported buffers are now dma_mapped before submission and dma_unmapped in ivpu_bo_unbind(), guaranteeing they are unmapped when the device is unbound. Add also imported buffers to vdev->bo_list for consistent unmapping on device unbind. The bo->ctx_id is set in open() so imported buffers have a valid context ID. Debug logs have been updated to match the new code structure. The function ivpu_bo_pin() has been renamed to ivpu_bo_bind() to better reflect its purpose, and unbind tests have been refactored for improved coverage and clarity. Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://lore.kernel.org/r/20250925145059.1446243-1-maciej.falkowski@linux.intel.com Stable-dep-of: 8b694b405a84 ("accel/ivpu: Fix page fault in ivpu_bo_unbind_all_bos_from_context()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/ivpu: Ensure rpm_runtime_put in case of engine reset/resume failKarol Wachowski1-2/+2
[ Upstream commit 9f6c63285737b141ca25a619add80a96111b8b96 ] Previously, aborting work could return early after engine reset or resume failure, skipping the necessary runtime_put cleanup leaving the device with incorrect reference count breaking runtime power management state. Replace early returns with goto statements to ensure runtime_put is always executed. Fixes: a47e36dc5d90 ("accel/ivpu: Trigger device recovery on engine reset/resume failure") Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://lore.kernel.org/r/20250916084809.850073-1-karol.wachowski@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Call dma_buf_vmap_unlocked() for imported objectLizhi Hou1-27/+20
[ Upstream commit 457f4393d02fdb612a93912fb09cef70e6e545c9 ] In amdxdna_gem_obj_vmap(), calling dma_buf_vmap() triggers a kernel warning if LOCKDEP is enabled. So for imported object, use dma_buf_vmap_unlocked(). Then, use drm_gem_vmap() for other objects. The similar change applies to vunmap code. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://lore.kernel.org/r/20250916174842.234709-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18accel/amdxdna: Fix an integer overflow in aie2_query_ctx_status_array()Lizhi Hou1-0/+6
[ Upstream commit 9e16c8bf9aebf629344cfd4cd5e3dc7d8c3f7d82 ] The unpublished smatch static checker reported a warning. drivers/accel/amdxdna/aie2_pci.c:904 aie2_query_ctx_status_array() warn: potential user controlled sizeof overflow 'args->num_element * args->element_size' '1-u32max(user) * 1-u32max(user)' Even this will not cause a real issue, it is better to put a reasonable limitation for element_size and num_element. Add condition to make sure the input element_size <= 4K and num_element <= 1K. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/dri-devel/aL56ZCLyl3tLQM1e@stanley.mountain/ Fixes: 2f509fe6a42c ("accel/amdxdna: Add ioctl DRM_IOCTL_AMDXDNA_GET_ARRAY") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://lore.kernel.org/r/20250909154531.3469979-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-14accel/qaic: Synchronize access to DBC request queue head & tail pointerPranjal Ramajor Asha Kanojiya3-2/+15
Two threads of the same process can potential read and write parallelly to head and tail pointers of the same DBC request queue. This could lead to a race condition and corrupt the DBC request queue. Fixes: ff13be830333 ("accel/qaic: Add datapath") Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com> [jhugo: Add fixes tag] Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://lore.kernel.org/r/20251007061837.206132-1-youssef.abdulrahman@oss.qualcomm.com
2025-10-14accel/qaic: Treat remaining == 0 as error in find_and_map_user_pages()Youssef Samir1-1/+1
Currently, if find_and_map_user_pages() takes a DMA xfer request from the user with a length field set to 0, or in a rare case, the host receives QAIC_TRANS_DMA_XFER_CONT from the device where resources->xferred_dma_size is equal to the requested transaction size, the function will return 0 before allocating an sgt or setting the fields of the dma_xfer struct. In that case, encode_addr_size_pairs() will try to access the sgt which will lead to a general protection fault. Return an EINVAL in case the user provides a zero-sized ALP, or the device requests continuation after all of the bytes have been transferred. Fixes: 96d3c1cadedb ("accel/qaic: Clean up integer overflow checking in map_user_pages()") Signed-off-by: Youssef Samir <quic_yabdulra@quicinc.com> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://lore.kernel.org/r/20251007122320.339654-1-youssef.abdulrahman@oss.qualcomm.com
2025-10-14accel/qaic: Fix bootlog initialization orderingJeffrey Hugo1-2/+3
As soon as we queue MHI buffers to receive the bootlog from the device, we could be receiving data. Therefore all the resources needed to process that data need to be setup prior to queuing the buffers. We currently initialize some of the resources after queuing the buffers which creates a race between the probe() and any data that comes back from the device. If the uninitialized resources are accessed, we could see page faults. Fix the init ordering to close the race. Fixes: 5f8df5c6def6 ("accel/qaic: Add bootlog debugfs") Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://lore.kernel.org/r/20251007115750.332169-1-youssef.abdulrahman@oss.qualcomm.com
2025-09-25accel/habanalabs: add Infineon version checkPavan S1-2/+9
On HL338 ASICs, the Infineon first‑stage firmware is not present and the reported version is 0. In this case printing a version number is misleading, as it suggests valid firmware when it does not exist. Fix this by printing the first‑stage Infineon firmware version only if the reported value is non‑zero. This avoids confusing or incorrect log messages on devices where the first stage is not applicable. Signed-off-by: Pavan S <pavan.sreenivas@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs/gaudi2: read preboot status after recovering from dirty stateKonstantin Sinyuk1-1/+7
Dirty state can occur when the host VM undergoes a reset while the device does not. In such a case, the driver must reset the device before it can be used again. As part of this reset, the device capabilities are zeroed. Therefore, the driver must read the Preboot status again to learn the Preboot state, capabilities, and security configuration. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: add HL_GET_P_STATE passthrough typeAriel Aviad1-0/+3
Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl to allow userspace to read the device performance state via firmware. Signed-off-by: Ariel Aviad <ariel.aviad@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: add debugfs interface for HLDIO testingKonstantin Sinyuk2-0/+214
Add debugfs files for NVMe Direct I/O (HLDIO) functionality. This interface allows userspace access to direct SSD ↔ device transfers through debugfs nodes. Four debugfs files are created under /sys/kernel/debug/habanalabs/hlN/: - dio_ssd2hl : trigger SSD-to-device transfers - dio_hl2ssd : trigger device-to-SSD transfers (placeholder, not yet implemented) - dio_stats : show transfer statistics - dio_reset : reset statistics counters Usage examples: # Perform SSD → device transfer echo "fd=3 va=0x10000 off=0 len=4096" > \ /sys/kernel/debug/habanalabs/hl0/dio_ssd2hl # View statistics cat /sys/kernel/debug/habanalabs/hl0/dio_stats # Reset counters echo 1 > /sys/kernel/debug/habanalabs/hl0/dio_reset This interface provides access to HLDIO functionality for validation and diagnostics. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: add NVMe Direct I/O (HLDIO) infrastructureKonstantin Sinyuk6-2/+624
Introduce NVMe Direct I/O (HLDIO) infrastructure to support peer‑to‑peer DMA in the habanalabs driver. This adds internal helpers and data structures to enable direct transfers between NVMe storage and device memory. The feature is built only when CONFIG_HL_HLDIO is enabled. A debugfs interface is also provided for functional validation. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Farah Kassabri <farah.kassabri@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: support mapping cb with vmalloc-backed coherent memoryMoti Haimovski2-0/+26
When IOMMU is enabled, dma_alloc_coherent() with GFP_USER may return addresses from the vmalloc range. If such an address is mapped without VM_MIXEDMAP, vm_insert_page() will trigger a BUG_ON due to the VM_PFNMAP restriction. Fix this by checking for vmalloc addresses and setting VM_MIXEDMAP in the VMA before mapping. This ensures safe mapping and avoids kernel crashes. The memory is still driver-allocated and cannot be accessed directly by userspace. Signed-off-by: Moti Haimovski <moti.haimovski@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: remove old interface variation of 'access_ok()'Ilia Levi1-5/+0
The access_ok() API no longer requires the VERIFY_WRITE argument, and the use of the old interface with VERIFY_WRITE is deprecated. Clean up the habanalabs memory manager to use the modern access_ok() interface consistently. This removes old #ifdef guards and aligns the driver with current upstream kernel APIs. Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs/gaudi2: use the CPLD_SHUTDOWN event handlerKonstantin Sinyuk2-4/+2
After CPLD shutdown event the device is not usable anymore. The common CPLD_SHUTDOWN event handler disables any subsequent device access. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: disable device access after CPLD_SHUTDOWNKonstantin Sinyuk2-0/+28
After a CPLD shutdown event the device becomes unusable. Prevent further device access once this event is received. Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: clarify ctx use after hl_ctx_put() in dmabuf releaseTomer Tayar1-1/+6
In hl_release_dmabuf(), ctx is dereferenced after calling hl_ctx_put() to obtain the compute device file. This is safe because the dma-buf object holds a file reference taken in export_dmabuf(), and the file release (which drops another ctx reference) can only happen after we drop that file reference via fput(). Thus, this hl_ctx_put() call cannot be the last one at this point. Add a comment explaining this to avoid confusion. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs/gaudi2: add support for logging register accesses from debugfsSharley Calzolari3-1/+148
Add infrastructure for logging the last configuration register accesses that occur via debugfs read/write operations. At interrupt time, these log entries can be dumped to dmesg, which helps in diagnosing the cause of RAZWI and ADDR_DEC interrupts. The logging is implemented as a ring buffer of access entries, with each entry recording timestamp and access details. To ensure correctness under concurrent access, operations are now protected using spinlocks. Entries are copied under lock and then printed after releasing it, which minimizes time spent in the critical section. Signed-off-by: Sharley Calzolari <sharley.calzolari@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs/gaudi2: stringify engine/queue idsAriel Suller2-8/+367
Print engine/queue names instead of numerical engine/queue IDs to make logs and debug output more readable. Signed-off-by: Ariel Suller <ariel.suller@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: add generic message type to get error countersVitaly Margolin1-0/+3
Add a new CPUCP generic message type to retrieve HBM, SRAM and critical error counters from the device. Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs/gaudi2: fix BMON disable configurationVered Yavniely1-1/+1
Change the BMON_CR register value back to its original state before enabling, so that BMON does not continue to collect information after being disabled. Signed-off-by: Vered Yavniely <vered.yavniely@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-25accel/habanalabs: return ENOMEM if less than requested pages were pinnedTomer Tayar1-1/+1
EFAULT is currently returned if less than requested user pages are pinned. This value means a "bad address" which might be confusing to the user, as the address of the given user memory is not necessarily "bad". Modify the return value to ENOMEM, as "out of memory" is more suitable in this case. Signed-off-by: Tomer Tayar <tomer.tayar@intel.com> Reviewed-by: Koby Elbaz <koby.elbaz@intel.com> Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
2025-09-15Merge tag 'v6.17-rc6' into drm-nextDave Airlie4-5/+5
This is a backmerge of Linux 6.17-rc6, needed for msm, also requested by misc. Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-09-04accel/amdxdna: Add ioctl DRM_IOCTL_AMDXDNA_GET_ARRAYLizhi Hou3-26/+118
Add interface for applications to get information array. The application provides a buffer pointer along with information type, maximum number of entries and maximum size of each entry. The buffer may also contain match conditions based on the information type. After the ioctl completes, the actual number of entries and entry size are returned. (see [1], used by driver runtime library) [1] https://github.com/amd/xdna-driver/blob/main/src/shim/host/platform_host.cpp#L337 Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://lore.kernel.org/r/20250903053402.2103196-1-lizhi.hou@amd.com
2025-09-01accel/ivpu: Prevent recovery work from being queued during device removalKarol Wachowski3-4/+4
Use disable_work_sync() instead of cancel_work_sync() in ivpu_dev_fini() to ensure that no new recovery work items can be queued after device removal has started. Previously, recovery work could be scheduled even after canceling existing work, potentially leading to use-after-free bugs if recovery accessed freed resources. Rename ivpu_pm_cancel_recovery() to ivpu_pm_disable_recovery() to better reflect its new behavior. Fixes: 58cde80f45a2 ("accel/ivpu: Use dedicated work for job timeout detection") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Karol Wachowski <karol.wachowski@intel.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://lore.kernel.org/r/20250808110939.328366-1-jacek.lawrynowicz@linux.intel.com