summaryrefslogtreecommitdiff
path: root/drivers/accel
AgeCommit message (Collapse)AuthorFilesLines
3 daysaccel/qaic: Handle DBC deactivation if the owner went awayYoussef Samir1-2/+45
[ Upstream commit 2feec5ae5df785658924ab6bd91280dc3926507c ] When a DBC is released, the device sends a QAIC_TRANS_DEACTIVATE_FROM_DEV transaction to the host over the QAIC_CONTROL MHI channel. QAIC handles this by calling decode_deactivate() to release the resources allocated for that DBC. Since that handling is done in the qaic_manage_ioctl() context, if the user goes away before receiving and handling the deactivation, the host will be out-of-sync with the DBCs available for use, and the DBC resources will not be freed unless the device is removed. If another user loads and requests to activate a network, then the device assigns the same DBC to that network, QAIC will "indefinitely" wait for dbc->in_use = false, leading the user process to hang. As a solution to this, handle QAIC_TRANS_DEACTIVATE_FROM_DEV transactions that are received after the user has gone away. Fixes: 129776ac2e38 ("accel/qaic: Add control path") Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20260205123415.3870898-1-youssef.abdulrahman@oss.qualcomm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
12 daysaccel/ivpu: Add disable clock relinquish workaround for NVL-A0Karol Wachowski2-2/+5
commit e8ab57b56402697a9bef50b71aecc613f0d61846 upstream. Turn on disable clock relinquish workaround for Nova Lake A0. Without this workaround NPU may not power off correctly after inference, leading to unexpected system behavior. Fixes: 550f4dd2cedd ("accel/ivpu: Add support for Nova Lake's NPU") Cc: <stable@vger.kernel.org> # v6.19+ Reviewed-by: Lizhi.hou <lizhi.hou@amd.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20260323095029.64613-1-karol.wachowski@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19accel/amdxdna: Fix runtime suspend deadlock when there is pending jobLizhi Hou2-12/+12
[ Upstream commit 6b13cb8f48a42ddf6dd98865b673a82e37ff238b ] The runtime suspend callback drains the running job workqueue before suspending the device. If a job is still executing and calls pm_runtime_resume_and_get(), it can deadlock with the runtime suspend path. Fix this by moving pm_runtime_resume_and_get() from the job execution routine to the job submission routine, ensuring the device is resumed before the job is queued and avoiding the deadlock during runtime suspend. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260310180058.336348-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel: ethosu: Fix NPU_OP_ELEMENTWISE validation with scalarRob Herring (Arm)1-1/+4
[ Upstream commit 838ae99f9a77a5724ee6d4e7b7b1eb079147f888 ] The NPU_OP_ELEMENTWISE instruction uses a scalar value for IFM2 if the IFM2_BROADCAST "scalar" mode is set. It is a bit (7) on the u65 and part of a field (bits 3:0) on the u85. The driver was hardcoded to the u85. Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver") Reviewed-and-Tested-by: Anders Roxell <anders.roxell@linaro.org> Link: https://patch.msgid.link/20260218-ethos-fixes-v1-2-be3fa3ea9a30@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel: ethosu: Fix job submit error clean-up refcount underflowsRob Herring (Arm)1-8/+18
[ Upstream commit 150bceb3e0a4a30950279d91ea0e8cc69a736742 ] If the job submit fails before adding the job to the scheduler queue such as when the GEM buffer bounds checks fail, then doing a ethosu_job_put() results in a pm_runtime_put_autosuspend() without the corresponding pm_runtime_resume_and_get(). The dma_fence_put()'s are also unnecessary, but seem to be harmless. Split the ethosu_job_cleanup() function into 2 parts for the before and after the job is queued. Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver") Reviewed-and-Tested-by: Anders Roxell <anders.roxell@linaro.org> Link: https://patch.msgid.link/20260218-ethos-fixes-v1-1-be3fa3ea9a30@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix NULL pointer dereference of mgmt_channLizhi Hou3-10/+19
[ Upstream commit 6270ee26e1edd862ea17e3eba148ca8fb2c99dc9 ] mgmt_chann may be set to NULL if the firmware returns an unexpected error in aie2_send_mgmt_msg_wait(). This can later lead to a NULL pointer dereference in aie2_hw_stop(). Fix this by introducing a dedicated helper to destroy mgmt_chann and by adding proper NULL checks before accessing it. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260226213857.3068474-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fill invalid payload for failed commandLizhi Hou3-15/+38
[ Upstream commit 89ff45359abbf9d8d3c4aa3f5a57ed0be82b5a12 ] Newer userspace applications may read the payload of a failed command to obtain detailed error information. However, the driver and old firmware versions may not support returning advanced error information. In this case, initialize the command payload with an invalid value so userspace can detect that no detailed error information is available. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260227004841.3080241-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/rocket: fix unwinding in error path in rocket_probeQuentin Schulz1-1/+14
[ Upstream commit 34f4495a7f72895776b81969639f527c99eb12b9 ] When rocket_core_init() fails (as could be the case with EPROBE_DEFER), we need to properly unwind by decrementing the counter we just incremented and if this is the first core we failed to probe, remove the rocket DRM device with rocket_device_fini() as well. This matches the logic in rocket_remove(). Failing to properly unwind results in out-of-bounds accesses. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-2-eec3bf29dc3b@cherry.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/rocket: fix unwinding in error path in rocket_core_initQuentin Schulz1-2/+5
[ Upstream commit f509a081f6a289f7c66856333b3becce7a33c97e ] When rocket_job_init() is called, iommu_group_get() has already been called, therefore we should call iommu_group_put() and make the iommu_group pointer NULL. This aligns with what's done in rocket_core_fini(). If pm_runtime_resume_and_get() somehow fails, not only should rocket_job_fini() be called but we should also unwind everything done before that, that is, disable PM, put the iommu_group, NULLify it and then call rocket_job_fini(). This is exactly what's done in rocket_core_fini() so let's call that function instead of duplicating the code. Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net> Link: https://patch.msgid.link/20251215-rocket-error-path-v1-1-eec3bf29dc3b@cherry.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Validate command buffer payload countLizhi Hou1-1/+4
[ Upstream commit 901ec3470994006bc8dd02399e16b675566c3416 ] The count field in the command header is used to determine the valid payload size. Verify that the valid payload does not exceed the remaining buffer space. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260219211946.1920485-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Prevent ubuf size overflowLizhi Hou1-1/+5
[ Upstream commit 03808abb1d868aed7478a11a82e5bb4b3f1ca6d6 ] The ubuf size calculation may overflow, resulting in an undersized allocation and possible memory corruption. Use check_add_overflow() helpers to validate the size calculation before allocation. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260217192815.1784689-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix out-of-bounds memset in command slot handlingLizhi Hou1-4/+4
[ Upstream commit 1110a949675ebd56b3f0286e664ea543f745801c ] The remaining space in a command slot may be smaller than the size of the command header. Clearing the command header with memset() before verifying the available slot space can result in an out-of-bounds write and memory corruption. Fix this by moving the memset() call after the size validation. Fixes: 3d32eb7a5ecf ("accel/amdxdna: Fix cu_idx being cleared by memset() during command setup") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260217185415.1781908-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix command hang on suspended hardware contextLizhi Hou1-7/+11
[ Upstream commit 07efce5a6611af6714ea3ef65694e0c8dd7e44f5 ] When a hardware context is suspended, the job scheduler is stopped. If a command is submitted while the context is suspended, the job is queued in the scheduler but aie2_sched_job_run() is never invoked to restart the hardware context. As a result, the command hangs. Fix this by modifying the hardware context suspend routine to keep the job scheduler running so that queued jobs can trigger context restart properly. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260211205341.722982-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix suspend failure after enabling turbo modeLizhi Hou1-4/+5
[ Upstream commit fdb65acfe655f844ae1e88696b9656d3ef5bb8fb ] Enabling turbo mode disables hardware clock gating. Suspend requires hardware clock gating to be re-enabled, otherwise suspend will fail. Fix this by calling aie2_runtime_cfg() from aie2_hw_stop() to re-enable clock gating during suspend. Also ensure that firmware is initialized in aie2_hw_start() before modifying clock-gating settings during resume. Fixes: f4d7b8a6bc8c ("accel/amdxdna: Enhance power management settings") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260211204716.722788-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix dead lock for suspend and resumeLizhi Hou6-19/+26
[ Upstream commit 1aa82181a3c285c7351523d587f7981ae4c015c8 ] When an application issues a query IOCTL while auto suspend is running, a deadlock can occur. The query path holds dev_lock and then calls pm_runtime_resume_and_get(), which waits for the ongoing suspend to complete. Meanwhile, the suspend callback attempts to acquire dev_lock and blocks, resulting in a deadlock. Fix this by releasing dev_lock before calling pm_runtime_resume_and_get() and reacquiring it after the call completes. Also acquire dev_lock in the resume callback to keep the locking consistent. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260211204644.722758-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Reduce log noise during process terminationMario Limonciello2-3/+7
[ Upstream commit 57aa3917a3b3bd805a3679371f97a1ceda3c5510 ] During process termination, several error messages are logged that are not actual errors but expected conditions when a process is killed or interrupted. This creates unnecessary noise in the kernel log. The specific scenarios are: 1. HMM invalidation returns -ERESTARTSYS when the wait is interrupted by a signal during process cleanup. This is expected when a process is being terminated and should not be logged as an error. 2. Context destruction returns -ENODEV when the firmware or device has already stopped, which commonly occurs during cleanup if the device was already torn down. This is also an expected condition during orderly shutdown. Downgrade these expected error conditions from error level to debug level to reduce log noise while still keeping genuine errors visible. Fixes: 97f27573837e ("accel/amdxdna: Fix potential NULL pointer dereference in context cleanup") Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260210164521.1094274-3-mario.limonciello@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Fix crash when destroying a suspended hardware contextLizhi Hou1-0/+3
[ Upstream commit 8363c02863332992a1822688da41f881d88d1631 ] If userspace issues an ioctl to destroy a hardware context that has already been automatically suspended, the driver may crash because the mailbox channel pointer is NULL for the suspended context. Fix this by checking the mailbox channel pointer in aie2_destroy_context() before accessing it. Fixes: 97f27573837e ("accel/amdxdna: Fix potential NULL pointer dereference in context cleanup") Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260206060306.4050531-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Switch to always use chained commandLizhi Hou1-2/+2
[ Upstream commit c68a6af400ca80596e8c37de0a1cb564aa9da8a4 ] Preempt commands are only supported when submitted as chained commands. To ensure preempt support works consistently, always submit commands in chained command format. Set force_cmdlist to true so that single commands are filled using the chained command layout, enabling correct handling of preempt commands. Fixes: 3a0ff7b98af4 ("accel/amdxdna: Support preemption requests") Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260206060251.4050512-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel/amdxdna: Remove buffer size check when creating command BOLizhi Hou1-19/+19
[ Upstream commit 08fe1b5166fdc81b010d7bf39cd6440620e7931e ] Large command buffers may be used, and they do not always need to be mapped or accessed by the driver. Performing a size check at command BO creation time unnecessarily rejects valid use cases. Remove the buffer size check from command BO creation, and defer vmap and size validation to the paths where the driver actually needs to map and access the command buffer. Fixes: ac49797c1815 ("accel/amdxdna: Add GEM buffer object management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260206060237.4050492-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12accel: ethosu: Fix shift overflow in cmd_to_addr()Dan Carpenter1-1/+1
[ Upstream commit 7be41fb00e2c2a823f271a8318b453ca11812f1e ] The "((cmd[0] & 0xff0000) << 16)" shift is zero. This was intended to be (((u64)cmd[0] & 0xff0000) << 16). Move the cast to the correct location. Fixes: 5a5e9c0228e6 ("accel: Add Arm Ethos-U NPU driver") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/aQGmY64tWcwOGFP4@stanley.mountain Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: qrtr: Drop the MHI auto_queue feature for IPCR DL channelsManivannan Sadhasivam1-44/+0
[ Upstream commit 51731792a25cb312ca94cdccfa139eb46de1b2ef ] MHI stack offers the 'auto_queue' feature, which allows the MHI stack to auto queue the buffers for the RX path (DL channel). Though this feature simplifies the client driver design, it introduces race between the client drivers and the MHI stack. For instance, with auto_queue, the 'dl_callback' for the DL channel may get called before the client driver is fully probed. This means, by the time the dl_callback gets called, the client driver's structures might not be initialized, leading to NULL ptr dereference. Currently, the drivers have to workaround this issue by initializing the internal structures before calling mhi_prepare_for_transfer_autoqueue(). But even so, there is a chance that the client driver's internal code path may call the MHI queue APIs before mhi_prepare_for_transfer_autoqueue() is called, leading to similar NULL ptr dereference. This issue has been reported on the Qcom X1E80100 CRD machines affecting boot. So to properly fix all these races, drop the MHI 'auto_queue' feature altogether and let the client driver (QRTR) manage the RX buffers manually. In the QRTR driver, queue the RX buffers based on the ring length during probe and recycle the buffers in 'dl_callback' once they are consumed. This also warrants removing the setting of 'auto_queue' flag from controller drivers. Currently, this 'auto_queue' feature is only enabled for IPCR DL channel. So only the QRTR client driver requires the modification. Fixes: 227fee5fc99e ("bus: mhi: core: Add an API for auto queueing buffers for DL channel") Fixes: 68a838b84eff ("net: qrtr: start MHI channel after endpoit creation") Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/linux-arm-msm/ZyTtVdkCCES0lkl4@hovoldconsulting.com Suggested-by: Chris Lew <quic_clew@quicinc.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Acked-by: Jeff Johnson <jjohnson@kernel.org> # drivers/net/wireless/ath/... Acked-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251218-qrtr-fix-v2-1-c7499bfcfbe0@oss.qualcomm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04accel/amdxdna: Fix tail-pointer polling in mailbox_get_msg()Lizhi Hou1-18/+1
[ Upstream commit cd77d5a4aaf8c5c1d819f47cf814bf7d4920b0a2 ] In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a non-zero tail pointer. This assumed that a zero value indicates an error. However, certain corner cases legitimately produce a zero tail pointer. To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail pointer will be treated as a valid rewind event. Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251204181603.793824-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Move RPM resume into job run functionLizhi Hou1-10/+9
[ Upstream commit 69674c1c704c0199ca7a3947f3cdcd575973175d ] Currently, amdxdna_pm_resume_get() is called during job creation, and amdxdna_pm_suspend_put() is called when the hardware notifies job completion. If a job is canceled before it is run, no hardware completion notification is generated, resulting in an unbalanced runtime PM resume/suspend pair. Fix this by moving amdxdna_pm_resume_get() to the job run path, ensuring runtime PM is only resumed for jobs that are actually executed. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260204171118.3165607-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix incorrect DPM level after suspend/resumeLizhi Hou2-2/+3
[ Upstream commit d19d963d2a4acb5bbf03e25733ba565a7f6e1422 ] The suspend routine sets the DPM level to 0, which unintentionally overwrites the previously saved DPM level. As a result, the device always resumes with DPM level 0 instead of restoring the original value. Fix this by ensuring the suspend path does not overwrite the saved DPM level, allowing the correct DPM level to be restored during resume. Fixes: f4d7b8a6bc8c ("accel/amdxdna: Enhance power management settings") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260204171048.3165580-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix incorrect error code returned for failed chain commandLizhi Hou1-1/+1
[ Upstream commit 750817a7c41de083ca5d73052e97bb7b67d7c394 ] The driver currently returns an incorrect error code when a chain command fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for failed chain commands. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Remove hardware context statusLizhi Hou3-28/+5
[ Upstream commit b853007fdcdd64b49601a993c2b30c28279ae15d ] One newly supported command does not require hardware context configuration to be performed upfront. As a result, checking hardware context status causes this command to fail incorrectly. Remove hardware context status handling entirely. For other commands, if userspace submits a request without configuring the hardware context first, the firmware will report an error or time out as appropriate. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260202212450.2681273-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Enable temporal sharing only modeLizhi Hou6-4/+21
[ Upstream commit 7818618a09a06320f409571bf28801ccfe7e0a30 ] Newer firmware versions prefer temporal sharing only mode. In this mode, the driver no longer needs to manage AIE array column allocation. Instead, a new field, num_unused_col, is added to the hardware context creation request to specify how many columns will not be used by this hardware context. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251217191150.2145937-1-lizhi.hou@amd.com Stable-dep-of: b853007fdcdd ("accel/amdxdna: Remove hardware context status") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix memory leak in amdxdna_ubuf_mapZishun Yi1-2/+8
[ Upstream commit 84dd57fb0359500092f1101409ca32091731490d ] The amdxdna_ubuf_map() function allocates memory for sg and internal sg table structures, but it fails to free them if subsequent operations (sg_alloc_table_from_pages or dma_map_sgtable) fail. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Min Ma <mamin506@gmail.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Stop job scheduling across aie2_release_resource()Lizhi Hou1-0/+6
[ Upstream commit f1370241fe8045702bc9d0812b996791f0500f1b ] Running jobs on a hardware context while it is in the process of releasing resources can lead to use-after-free and crashes. Fix this by stopping job scheduling before calling aie2_release_resource() and restarting it after the release completes. Additionally, aie2_sched_job_run() now checks whether the hardware context is still active. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Hold mm structure across iommu_sva_unbind_device()Lizhi Hou2-0/+4
[ Upstream commit a9162439ad792afcddc04718408ec1380b7a5f63 ] Some tests trigger a crash in iommu_sva_unbind_device() due to accessing iommu_mm after the associated mm structure has been freed. Fix this by taking an explicit reference to the mm structure after successfully binding the device, and releasing it only after the device is unbound. This ensures the mm remains valid for the entire SVA bind/unbind lifetime. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix notifier_wq flushing warningLizhi Hou1-1/+1
[ Upstream commit b36178488d479e9a53bbef2b01280378b5586e60 ] Create notifier_wq with WQ_MEM_RECLAIM flag to fix the possible warning. workqueue: WQ_MEM_RECLAIM amdxdna_js:drm_sched_free_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM notifier_wq:0x0 Fixes: e486147c912f ("accel/amdxdna: Add BO import and export") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260113173624.256053-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix potential NULL pointer dereference in context cleanupLizhi Hou1-24/+26
[ Upstream commit 97f27573837ef96b4ba42af463cc800cab615c0e ] aie_destroy_context() is invoked during error handling in aie2_create_context(). However, aie_destroy_context() assumes that the context's mailbox channel pointer is non-NULL. If mailbox channel creation fails, the pointer remains NULL and calling aie_destroy_context() can lead to a NULL pointer dereference. In aie2_create_context(), replace aie_destroy_context() with a function which request firmware to remove the context created previously. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251212183244.1826318-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix race where send ring appears full due to delayed head updateLizhi Hou1-11/+16
[ Upstream commit 343f5683cfa443000904c88ce2e23656375fc51c ] The firmware sends a response and interrupts the driver before advancing the mailbox send ring head pointer. As a result, the driver may observe the response and attempt to send a new request before the firmware has updated the head pointer. In this window, the send ring still appears full, causing the driver to incorrectly fail the send operation. This race can be triggered more easily in a multithreaded environment, leading to unexpected and spurious "send ring full" failures. To address this, poll the send ring head pointer for up to 100us before returning a full-ring condition. This allows the firmware time to update the head pointer. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251211045125.1724604-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix cu_idx being cleared by memset() during command setupLizhi Hou1-4/+4
[ Upstream commit 3d32eb7a5ecff92d83a5fd34c45c171c17d3d5d0 ] For one command type, cu_idx is assigned before calling memset() on the command structure. This results in cu_idx being overwritten, causing the firmware to receive an incomplete or invalid command and leading to unexpected command failures. Fix this by moving the memset() call before initializing cu_idx so that all fields are populated in the correct order. Fixes: 71829d7f2f70 ("accel/amdxdna: Use MSG_OP_CHAIN_EXEC_NPU when supported") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251209211639.1636888-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27accel/amdxdna: Fix race condition when checking rpm_onLizhi Hou7-47/+24
[ Upstream commit 00ffe45ece80160aef446d74ded906352f21dd72 ] When autosuspend is triggered, driver rpm_on flag is set to indicate that a suspend/resume is already in progress. However, when a userspace application submits a command during this narrow window, amdxdna_pm_resume_get() may incorrectly skip the resume operation because the rpm_on flag is still set. This results in commands being submitted while the device has not actually resumed, causing unexpected behavior. The set_dpm() is called by suspend/resume, it relied on rpm_on flag to avoid calling into rpm suspend/resume recursivly. So to fix this, remove the use of the rpm_on flag entirely. Instead, introduce aie2_pm_set_dpm() which explicitly resumes the device before invoking set_dpm(). With this change, set_dpm() is called directly inside the suspend or resume execution path. Otherwise, aie2_pm_set_dpm() is called. Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251208165356.1549237-1-lizhi.hou@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-15accel/amdxdna: Block running under a hypervisorMario Limonciello (AMD)1-0/+6
SVA support is required, which isn't configured by hypervisor solutions. Closes: https://github.com/QubesOS/qubes-issues/issues/10275 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4656 Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251213054513.87925-1-superm1@kernel.org Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-11-13accel/amdxdna: Fix deadlock between context destroy and job timeoutLizhi Hou1-2/+4
Hardware context destroy function holds dev_lock while waiting for all jobs to complete. The timeout job also needs to acquire dev_lock, this leads to a deadlock. Fix the issue by temporarily releasing dev_lock before waiting for all jobs to finish, and reacquiring it afterward. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181050.1293125-1-lizhi.hou@amd.com
2025-11-13accel/amdxdna: Clear mailbox interrupt register during channel creationLizhi Hou1-0/+1
The mailbox interrupt register is not always cleared when a mailbox channel is created. This can leave stale interrupt states from previous operations. Fix this by explicitly clearing the interrupt register in the mailbox channel creation function. Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251107181115.1293158-1-lizhi.hou@amd.com
2025-11-12accel/ivpu: Fix warning due to undefined CONFIG_PROC_FSKarol Wachowski1-2/+2
Change #if to #ifdef CONFIG_PROC_FS to fix warning reported by test robot: drivers/accel/ivpu/ivpu_drv.c:458:5: warning: "CONFIG_PROC_FS" is not defined, evaluates to 0 [-Wundef] Fixes: 63cc028484ab ("accel/ivpu: Add fdinfo support for memory statistics") Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com> Reviewed-by: Andrzej.Kacprowski@linux.intel.com Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251112071911.1136934-1-karol.wachowski@linux.intel.com
2025-11-12accel/ivpu: Count only resident buffers in memory utilizationKarol Wachowski1-1/+2
Do not count buffer objects that have no backing pages, including imported buffers where pages are set by VM faults triggered by userspace or pinned by other drivers. Instead, return information about actual memory used by the NPU. Counting imported buffers results in incorrect calculations when the same pages are counted multiple times, giving overly high results. Fixes: 7bfc9fa99580 ("accel/ivpu: Expose NPU memory utilization info in sysfs") Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251106101052.1050348-3-karol.wachowski@linux.intel.com
2025-11-12accel/ivpu: Add fdinfo support for memory statisticsKarol Wachowski3-0/+23
Implement DRM fdinfo interface to expose memory usage statistics for NPU device file descriptors. Exclude unpinned and imported buffers from resident memory calculations to provide accurate memory usage reporting. Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251106101052.1050348-2-karol.wachowski@linux.intel.com
2025-11-07accel/qaic: Add qaic_ prefix to irq_polling_workZack McKevitt3-3/+3
Rename irq_polling_work to qaic_irq_polling_work to reduce ambiguity and avoid potential naming conflicts in the future. Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031192511.3179130-1-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Collect crashdump from SSR channelPranjal Ramajor Asha Kanojiya5-18/+578
After subsystem of the device has crashed it sends a message with command DEBUG_TRANSFER_INFO to kernel(host). Send ACK for that message and then prepare to collect the ramdump of the subsystem Steps of crashdump collection is as follows, 1) Device sends DEBUG_TRANSFER_INFO message indicating that device wants to send crashdump. 2) Send an acknowledgment to that message either ACK or NACK. a) NACK will inform the device that host will not download the crashdump b) ACK will inform the device that host will download the crashdump 3) Along with the DEBUG_TRANSFER_INFO we receive a table base address and its length, use that to download that table from device. a) This table is meta data of the crashdump and not the actual crashdump. 4) After we respond as ACK for message received on step 1) we start downloading the table. Use series of MEMORY_READ/MEMORY_READ_RSP SSR commands to download the entire table. 5) Each entry in the table represents a segment of crashdump. Once the table downloading is complete, iterate through each entry of table and download each crashdump segment(same as table itself). Table entry contains the memory base address and length along with other info. 6) After the entire crashdump is downloaded send DEBUG_TRANSFER_DONE which marks that host is terminating the crashdump transfer. This message can be send in both success or error case. 7) After receiving DEBUG_TRANSFER_DONE_RSP hand over the crashdump to dev_coredumpv() and free all the necessary memory. Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Co-developed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <pkanojiy@codeaurora.org> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-4-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Implement basic SSR handlingJeffrey Hugo6-7/+359
Subsystem restart (SSR) for a qaic device means that a NSP has crashed, and will be restarted. However the restart process will lose any state associated with activation, so the user will need to do some recovery. While SSR has the provision to collect a crash dump, this patch does not implement support for it. Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Co-developed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Co-developed-by: Troy Hanson <quic_thanson@quicinc.com> Signed-off-by: Troy Hanson <quic_thanson@quicinc.com> Co-developed-by: Aswin Venkatesan <aswivenk@qti.qualcomm.com> Signed-off-by: Aswin Venkatesan <aswivenk@qti.qualcomm.com> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> [jhugo: Fix minor checkpatch whitespace issues] Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-3-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/qaic: Add DMA Bridge Channel(DBC) sysfs and ueventsPranjal Ramajor Asha Kanojiya5-0/+145
Expose sysfs files for each DBC representing the current state of that DBC. For example, sysfs for DBC ID 0 and accel minor number 0 looks like this, /sys/class/accel/accel0/dbc0_state Following are the states and their corresponding values, DBC_STATE_IDLE (0) DBC_STATE_ASSIGNED (1) DBC_STATE_BEFORE_SHUTDOWN (2) DBC_STATE_AFTER_SHUTDOWN (3) DBC_STATE_BEFORE_POWER_UP (4) DBC_STATE_AFTER_POWER_UP (5) Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251031174059.2814445-2-zachary.mckevitt@oss.qualcomm.com
2025-11-07accel/amdxdna: Treat power-off failure as unrecoverable errorLizhi Hou1-0/+10
Failing to set power off indicates an unrecoverable hardware or firmware error. Update the driver to treat such a failure as a fatal condition and stop further operations that depend on successful power state transition. This prevents undefined behavior when the hardware remains in an unexpected state after a failed power-off attempt. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251106180521.1095218-1-lizhi.hou@amd.com
2025-11-06accel/amdxdna: Fix dma_fence leak when job is canceledLizhi Hou2-1/+1
Currently, dma_fence_put(job->fence) is called in job notification callback. However, if a job is canceled, the notification callback is never invoked, leading to a memory leak. Move dma_fence_put(job->fence) to the job cleanup function to ensure the fence is always released. Fixes: aac243092b70 ("accel/amdxdna: Add command execution") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com
2025-11-06accel/qaic: Add support for PM callbacksYoussef Samir4-0/+103
Add initial support for suspend and hibernation PM callbacks to QAIC. The device can be suspended any time in which the data path is not busy as queued I/O operations are lost on suspension and cannot be resumed after suspend. Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com> Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20251029181808.1216466-1-zachary.mckevitt@oss.qualcomm.com
2025-11-05accel/amdxdna: Support preemption requestsLizhi Hou7-1/+192
The driver checks the firmware version during initialization.If preemption is supported, the driver configures preemption accordingly and handles userspace preemption requests. Otherwise, the driver returns an error for userspace preemption requests. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20251104185340.897560-1-lizhi.hou@amd.com
2025-11-05accel/ivpu: Improve debug and warning messagesKarol Wachowski6-58/+120
Add IOCTL debug bit for logging user provided parameter validation errors. Refactor several warning and error messages to better reflect fault reason. User generated faults should not flood kernel messages with warnings or errors, so change those to ivpu_dbg(). Add additional debug logs for parameter validation in IOCTLs. Check size provided by in metric streamer start and return -EINVAL together with a debug message print. Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251104132418.970784-1-karol.wachowski@linux.intel.com