| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 2feec5ae5df785658924ab6bd91280dc3926507c ]
When a DBC is released, the device sends a QAIC_TRANS_DEACTIVATE_FROM_DEV
transaction to the host over the QAIC_CONTROL MHI channel. QAIC handles
this by calling decode_deactivate() to release the resources allocated for
that DBC. Since that handling is done in the qaic_manage_ioctl() context,
if the user goes away before receiving and handling the deactivation, the
host will be out-of-sync with the DBCs available for use, and the DBC
resources will not be freed unless the device is removed. If another user
loads and requests to activate a network, then the device assigns the same
DBC to that network, QAIC will "indefinitely" wait for dbc->in_use = false,
leading the user process to hang.
As a solution to this, handle QAIC_TRANS_DEACTIVATE_FROM_DEV transactions
that are received after the user has gone away.
Fixes: 129776ac2e38 ("accel/qaic: Add control path")
Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Link: https://patch.msgid.link/20260205123415.3870898-1-youssef.abdulrahman@oss.qualcomm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 34f4495a7f72895776b81969639f527c99eb12b9 ]
When rocket_core_init() fails (as could be the case with EPROBE_DEFER),
we need to properly unwind by decrementing the counter we just
incremented and if this is the first core we failed to probe, remove the
rocket DRM device with rocket_device_fini() as well. This matches the
logic in rocket_remove(). Failing to properly unwind results in
out-of-bounds accesses.
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Cc: stable@vger.kernel.org
Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de>
Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Link: https://patch.msgid.link/20251215-rocket-error-path-v1-2-eec3bf29dc3b@cherry.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f509a081f6a289f7c66856333b3becce7a33c97e ]
When rocket_job_init() is called, iommu_group_get() has already been
called, therefore we should call iommu_group_put() and make the
iommu_group pointer NULL. This aligns with what's done in
rocket_core_fini().
If pm_runtime_resume_and_get() somehow fails, not only should
rocket_job_fini() be called but we should also unwind everything done
before that, that is, disable PM, put the iommu_group, NULLify it and
then call rocket_job_fini(). This is exactly what's done in
rocket_core_fini() so let's call that function instead of duplicating
the code.
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Cc: stable@vger.kernel.org
Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de>
Reviewed-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Link: https://patch.msgid.link/20251215-rocket-error-path-v1-1-eec3bf29dc3b@cherry.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 51731792a25cb312ca94cdccfa139eb46de1b2ef ]
MHI stack offers the 'auto_queue' feature, which allows the MHI stack to
auto queue the buffers for the RX path (DL channel). Though this feature
simplifies the client driver design, it introduces race between the client
drivers and the MHI stack. For instance, with auto_queue, the 'dl_callback'
for the DL channel may get called before the client driver is fully probed.
This means, by the time the dl_callback gets called, the client driver's
structures might not be initialized, leading to NULL ptr dereference.
Currently, the drivers have to workaround this issue by initializing the
internal structures before calling mhi_prepare_for_transfer_autoqueue().
But even so, there is a chance that the client driver's internal code path
may call the MHI queue APIs before mhi_prepare_for_transfer_autoqueue() is
called, leading to similar NULL ptr dereference. This issue has been
reported on the Qcom X1E80100 CRD machines affecting boot.
So to properly fix all these races, drop the MHI 'auto_queue' feature
altogether and let the client driver (QRTR) manage the RX buffers manually.
In the QRTR driver, queue the RX buffers based on the ring length during
probe and recycle the buffers in 'dl_callback' once they are consumed. This
also warrants removing the setting of 'auto_queue' flag from controller
drivers.
Currently, this 'auto_queue' feature is only enabled for IPCR DL channel.
So only the QRTR client driver requires the modification.
Fixes: 227fee5fc99e ("bus: mhi: core: Add an API for auto queueing buffers for DL channel")
Fixes: 68a838b84eff ("net: qrtr: start MHI channel after endpoit creation")
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/linux-arm-msm/ZyTtVdkCCES0lkl4@hovoldconsulting.com
Suggested-by: Chris Lew <quic_clew@quicinc.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Acked-by: Jeff Johnson <jjohnson@kernel.org> # drivers/net/wireless/ath/...
Acked-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251218-qrtr-fix-v2-1-c7499bfcfbe0@oss.qualcomm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 901ec3470994006bc8dd02399e16b675566c3416 ]
The count field in the command header is used to determine the valid
payload size. Verify that the valid payload does not exceed the remaining
buffer space.
Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260219211946.1920485-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 03808abb1d868aed7478a11a82e5bb4b3f1ca6d6 ]
The ubuf size calculation may overflow, resulting in an undersized
allocation and possible memory corruption.
Use check_add_overflow() helpers to validate the size calculation before
allocation.
Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260217192815.1784689-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 08fe1b5166fdc81b010d7bf39cd6440620e7931e ]
Large command buffers may be used, and they do not always need to be
mapped or accessed by the driver. Performing a size check at command BO
creation time unnecessarily rejects valid use cases.
Remove the buffer size check from command BO creation, and defer vmap
and size validation to the paths where the driver actually needs to map
and access the command buffer.
Fixes: ac49797c1815 ("accel/amdxdna: Add GEM buffer object management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260206060237.4050492-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cd77d5a4aaf8c5c1d819f47cf814bf7d4920b0a2 ]
In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a
non-zero tail pointer. This assumed that a zero value indicates an error.
However, certain corner cases legitimately produce a zero tail pointer.
To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail
pointer will be treated as a valid rewind event.
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251204181603.793824-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 750817a7c41de083ca5d73052e97bb7b67d7c394 ]
The driver currently returns an incorrect error code when a chain command
fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for
failed chain commands.
Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 84dd57fb0359500092f1101409ca32091731490d ]
The amdxdna_ubuf_map() function allocates memory for sg and
internal sg table structures, but it fails to free them if subsequent
operations (sg_alloc_table_from_pages or dma_map_sgtable) fail.
Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer")
Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Min Ma <mamin506@gmail.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f1370241fe8045702bc9d0812b996791f0500f1b ]
Running jobs on a hardware context while it is in the process of
releasing resources can lead to use-after-free and crashes.
Fix this by stopping job scheduling before calling
aie2_release_resource() and restarting it after the release completes.
Additionally, aie2_sched_job_run() now checks whether the hardware
context is still active.
Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a9162439ad792afcddc04718408ec1380b7a5f63 ]
Some tests trigger a crash in iommu_sva_unbind_device() due to
accessing iommu_mm after the associated mm structure has been
freed.
Fix this by taking an explicit reference to the mm structure
after successfully binding the device, and releasing it only
after the device is unbound. This ensures the mm remains valid
for the entire SVA bind/unbind lifetime.
Fixes: be462c97b7df ("accel/amdxdna: Add hardware context")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b36178488d479e9a53bbef2b01280378b5586e60 ]
Create notifier_wq with WQ_MEM_RECLAIM flag to fix the possible warning.
workqueue: WQ_MEM_RECLAIM amdxdna_js:drm_sched_free_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM notifier_wq:0x0
Fixes: e486147c912f ("accel/amdxdna: Add BO import and export")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260113173624.256053-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 343f5683cfa443000904c88ce2e23656375fc51c ]
The firmware sends a response and interrupts the driver before advancing
the mailbox send ring head pointer. As a result, the driver may observe
the response and attempt to send a new request before the firmware has
updated the head pointer. In this window, the send ring still appears
full, causing the driver to incorrectly fail the send operation.
This race can be triggered more easily in a multithreaded environment,
leading to unexpected and spurious "send ring full" failures.
To address this, poll the send ring head pointer for up to 100us before
returning a full-ring condition. This allows the firmware time to update
the head pointer.
Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251211045125.1724604-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7bbf6d15e935abbb3d604c1fa157350e84a26f98 ]
SVA support is required, which isn't configured by hypervisor
solutions.
Closes: https://github.com/QubesOS/qubes-issues/issues/10275
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4656
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251213054513.87925-1-superm1@kernel.org
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ca2583412306ceda9304a7c4302fd9efbf43e963 ]
Hardware context destroy function holds dev_lock while waiting for all jobs
to complete. The timeout job also needs to acquire dev_lock, this leads to
a deadlock.
Fix the issue by temporarily releasing dev_lock before waiting for all
jobs to finish, and reacquiring it afterward.
Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251107181050.1293125-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6ff9385c07aa311f01f87307e6256231be7d8675 ]
The mailbox interrupt register is not always cleared when a mailbox channel
is created. This can leave stale interrupt states from previous operations.
Fix this by explicitly clearing the interrupt register in the mailbox
channel creation function.
Fixes: b87f920b9344 ("accel/amdxdna: Support hardware mailbox")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251107181115.1293158-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dea9f84776b96a703f504631ebe9fea07bd2c181 ]
Currently, dma_fence_put(job->fence) is called in job notification
callback. However, if a job is canceled, the notification callback is never
invoked, leading to a memory leak. Move dma_fence_put(job->fence)
to the job cleanup function to ensure the fence is always released.
Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6fb7f298883246e21f60f971065adcb789ae6eba ]
When a command times out, mark it as ERT_CMD_STATE_TIMEOUT. Any other
commands that are canceled due to this timeout should be marked as
ERT_CMD_STATE_ABORT.
Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251029193423.2430463-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 00812636df370bedf4e44a0c81b86ea96bca8628 ]
Fix 'Memory manager not clean during takedown' warning that occurs
when ivpu_gem_bo_free() removes the BO from the BOs list before it
gets unmapped. Then file_priv_unbind() triggers a warning in
drm_mm_takedown() during context teardown.
Protect the unmapping sequence with bo_list_lock to ensure the BO is
always fully unmapped when removed from the list. This ensures the BO
is either fully unmapped at context teardown time or present on the
list and unmapped by file_priv_unbind().
Fixes: 48aea7f2a2ef ("accel/ivpu: Fix locking in ivpu_bo_remove_all_bos_from_context()")
Signed-off-by: Tomasz Rusinowicz <tomasz.rusinowicz@intel.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20251029071451.184243-1-karol.wachowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c063c1bbee67391f12956d2ffdd5da00eb87ff79 ]
Rework of imported buffers introduced in the commit
e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers")
switched the logic of imported buffers by dma mapping/unmapping
them just as the regular buffers.
The commit didn't include removal of skipping dma unmap of imported
buffers which results in them being mapped without unmapping.
Fixes: e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers")
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Link: https://patch.msgid.link/20251027150933.2384538-1-maciej.falkowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 81233d5419cf20e4f5ef505882951616888a2ef9 ]
In aie2_get_hwctx_status() and aie2_query_ctx_status_array(), the
functions could return an uninitialized value in some cases. Update them
to always return 0. The amount of valid results is indicated by the
returned buffer_size, element_size, and num_element fields.
Fixes: 2f509fe6a42c ("accel/amdxdna: Add ioctl DRM_IOCTL_AMDXDNA_GET_ARRAY")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20251024165503.1548131-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 63c7870fab67b2ab2bfe75e8b46f3c37b88c47a8 ]
Fix a race that can occur when multiple jobs submit the same dmabuf.
This could cause the sg_table to be mapped twice, leading to undefined
behavior.
Fixes: e0c0891cd63b ("accel/ivpu: Rework bind/unbind of imported buffers")
Signed-off-by: Wludzik, Jozef <jozef.wludzik@intel.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://lore.kernel.org/r/20251014071725.3047287-1-karol.wachowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit aa1c2b073ad23847dd2e7bdc7d30009f34ed7f59 ]
The pcode MAILBOX STATUS register PARAM2 field expects DCT active
percent in U1.7 value format. Convert percentage value to this
format before writing to the register.
Fixes: a19bffb10c46 ("accel/ivpu: Implement DCT handling")
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://lore.kernel.org/r/20251001104322.1249896-1-karol.wachowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8b694b405a84696f1d964f6da7cf9721e68c4714 ]
Don't add BO to the vdev->bo_list in ivpu_gem_create_object().
When failure happens inside drm_gem_shmem_create(), the BO is not
fully created and ivpu_gem_bo_free() callback will not be called
causing a deleted BO to be left on the list.
Fixes: 8d88e4cdce4f ("accel/ivpu: Use GEM shmem helper for all buffers")
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://lore.kernel.org/r/20250925145114.1446283-1-maciej.falkowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e0c0891cd63bf8338c25c423e28a5a93aed3d74c ]
Ensure that imported buffers are properly mapped and unmapped in
the same way as regular buffers to properly handle buffers during
device's bind and unbind operations to prevent resource leaks and
inconsistent buffer states.
Imported buffers are now dma_mapped before submission and
dma_unmapped in ivpu_bo_unbind(), guaranteeing they are unmapped
when the device is unbound.
Add also imported buffers to vdev->bo_list for consistent unmapping
on device unbind. The bo->ctx_id is set in open() so imported
buffers have a valid context ID.
Debug logs have been updated to match the new code structure.
The function ivpu_bo_pin() has been renamed to ivpu_bo_bind()
to better reflect its purpose, and unbind tests have been refactored
for improved coverage and clarity.
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Signed-off-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://lore.kernel.org/r/20250925145059.1446243-1-maciej.falkowski@linux.intel.com
Stable-dep-of: 8b694b405a84 ("accel/ivpu: Fix page fault in ivpu_bo_unbind_all_bos_from_context()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9f6c63285737b141ca25a619add80a96111b8b96 ]
Previously, aborting work could return early after engine reset or resume
failure, skipping the necessary runtime_put cleanup leaving the device
with incorrect reference count breaking runtime power management state.
Replace early returns with goto statements to ensure runtime_put is always
executed.
Fixes: a47e36dc5d90 ("accel/ivpu: Trigger device recovery on engine reset/resume failure")
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://lore.kernel.org/r/20250916084809.850073-1-karol.wachowski@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 457f4393d02fdb612a93912fb09cef70e6e545c9 ]
In amdxdna_gem_obj_vmap(), calling dma_buf_vmap() triggers a kernel
warning if LOCKDEP is enabled. So for imported object, use
dma_buf_vmap_unlocked(). Then, use drm_gem_vmap() for other objects.
The similar change applies to vunmap code.
Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://lore.kernel.org/r/20250916174842.234709-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9e16c8bf9aebf629344cfd4cd5e3dc7d8c3f7d82 ]
The unpublished smatch static checker reported a warning.
drivers/accel/amdxdna/aie2_pci.c:904 aie2_query_ctx_status_array()
warn: potential user controlled sizeof overflow
'args->num_element * args->element_size' '1-u32max(user) * 1-u32max(user)'
Even this will not cause a real issue, it is better to put a reasonable
limitation for element_size and num_element. Add condition to make sure
the input element_size <= 4K and num_element <= 1K.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/dri-devel/aL56ZCLyl3tLQM1e@stanley.mountain/
Fixes: 2f509fe6a42c ("accel/amdxdna: Add ioctl DRM_IOCTL_AMDXDNA_GET_ARRAY")
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://lore.kernel.org/r/20250909154531.3469979-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Two threads of the same process can potential read and write parallelly to
head and tail pointers of the same DBC request queue. This could lead to a
race condition and corrupt the DBC request queue.
Fixes: ff13be830333 ("accel/qaic: Add datapath")
Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com>
Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com>
[jhugo: Add fixes tag]
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251007061837.206132-1-youssef.abdulrahman@oss.qualcomm.com
|
|
Currently, if find_and_map_user_pages() takes a DMA xfer request from the
user with a length field set to 0, or in a rare case, the host receives
QAIC_TRANS_DMA_XFER_CONT from the device where resources->xferred_dma_size
is equal to the requested transaction size, the function will return 0
before allocating an sgt or setting the fields of the dma_xfer struct.
In that case, encode_addr_size_pairs() will try to access the sgt which
will lead to a general protection fault.
Return an EINVAL in case the user provides a zero-sized ALP, or the device
requests continuation after all of the bytes have been transferred.
Fixes: 96d3c1cadedb ("accel/qaic: Clean up integer overflow checking in map_user_pages()")
Signed-off-by: Youssef Samir <quic_yabdulra@quicinc.com>
Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251007122320.339654-1-youssef.abdulrahman@oss.qualcomm.com
|
|
As soon as we queue MHI buffers to receive the bootlog from the device,
we could be receiving data. Therefore all the resources needed to
process that data need to be setup prior to queuing the buffers.
We currently initialize some of the resources after queuing the buffers
which creates a race between the probe() and any data that comes back
from the device. If the uninitialized resources are accessed, we could
see page faults.
Fix the init ordering to close the race.
Fixes: 5f8df5c6def6 ("accel/qaic: Add bootlog debugfs")
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Youssef Samir <youssef.abdulrahman@oss.qualcomm.com>
Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Reviewed-by: Carl Vanderlip <carl.vanderlip@oss.qualcomm.com>
Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251007115750.332169-1-youssef.abdulrahman@oss.qualcomm.com
|
|
On HL338 ASICs, the Infineon first‑stage firmware is not present and
the reported version is 0. In this case printing a version number is
misleading, as it suggests valid firmware when it does not exist.
Fix this by printing the first‑stage Infineon firmware version only
if the reported value is non‑zero. This avoids confusing or incorrect
log messages on devices where the first stage is not applicable.
Signed-off-by: Pavan S <pavan.sreenivas@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Dirty state can occur when the host VM undergoes a reset while the
device does not. In such a case, the driver must reset the device before
it can be used again. As part of this reset, the device capabilities
are zeroed. Therefore, the driver must read the Preboot status again to
learn the Preboot state, capabilities, and security configuration.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add a new passthrough type HL_GET_P_STATE to the cpucp generic ioctl
to allow userspace to read the device performance state via firmware.
Signed-off-by: Ariel Aviad <ariel.aviad@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add debugfs files for NVMe Direct I/O (HLDIO) functionality.
This interface allows userspace access to direct SSD ↔ device transfers
through debugfs nodes.
Four debugfs files are created under /sys/kernel/debug/habanalabs/hlN/:
- dio_ssd2hl : trigger SSD-to-device transfers
- dio_hl2ssd : trigger device-to-SSD transfers
(placeholder, not yet implemented)
- dio_stats : show transfer statistics
- dio_reset : reset statistics counters
Usage examples:
# Perform SSD → device transfer
echo "fd=3 va=0x10000 off=0 len=4096" > \
/sys/kernel/debug/habanalabs/hl0/dio_ssd2hl
# View statistics
cat /sys/kernel/debug/habanalabs/hl0/dio_stats
# Reset counters
echo 1 > /sys/kernel/debug/habanalabs/hl0/dio_reset
This interface provides access to HLDIO functionality for validation
and diagnostics.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Farah Kassabri <farah.kassabri@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Introduce NVMe Direct I/O (HLDIO) infrastructure to support
peer‑to‑peer DMA in the habanalabs driver. This adds internal helpers
and data structures to enable direct transfers between NVMe storage
and device memory.
The feature is built only when CONFIG_HL_HLDIO is enabled. A debugfs
interface is also provided for functional validation.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Farah Kassabri <farah.kassabri@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
When IOMMU is enabled, dma_alloc_coherent() with GFP_USER may return
addresses from the vmalloc range. If such an address is mapped without
VM_MIXEDMAP, vm_insert_page() will trigger a BUG_ON due to the
VM_PFNMAP restriction.
Fix this by checking for vmalloc addresses and setting VM_MIXEDMAP
in the VMA before mapping. This ensures safe mapping and avoids kernel
crashes. The memory is still driver-allocated and cannot be accessed
directly by userspace.
Signed-off-by: Moti Haimovski <moti.haimovski@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
The access_ok() API no longer requires the VERIFY_WRITE argument,
and the use of the old interface with VERIFY_WRITE is deprecated.
Clean up the habanalabs memory manager to use the modern access_ok()
interface consistently. This removes old #ifdef guards and aligns the
driver with current upstream kernel APIs.
Signed-off-by: Ilia Levi <ilia.levi@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
After CPLD shutdown event the device is not usable anymore. The common
CPLD_SHUTDOWN event handler disables any subsequent device access.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
After a CPLD shutdown event the device becomes unusable. Prevent further
device access once this event is received.
Signed-off-by: Konstantin Sinyuk <konstantin.sinyuk@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
In hl_release_dmabuf(), ctx is dereferenced after calling hl_ctx_put()
to obtain the compute device file.
This is safe because the dma-buf object holds a file reference taken in
export_dmabuf(), and the file release (which drops another ctx reference)
can only happen after we drop that file reference via fput(). Thus, this
hl_ctx_put() call cannot be the last one at this point.
Add a comment explaining this to avoid confusion.
Signed-off-by: Tomer Tayar <tomer.tayar@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add infrastructure for logging the last configuration register accesses
that occur via debugfs read/write operations. At interrupt time, these
log entries can be dumped to dmesg, which helps in diagnosing the cause
of RAZWI and ADDR_DEC interrupts.
The logging is implemented as a ring buffer of access entries, with each
entry recording timestamp and access details. To ensure correctness
under concurrent access, operations are now protected using spinlocks.
Entries are copied under lock and then printed after releasing it, which
minimizes time spent in the critical section.
Signed-off-by: Sharley Calzolari <sharley.calzolari@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Print engine/queue names instead of numerical engine/queue IDs to make
logs and debug output more readable.
Signed-off-by: Ariel Suller <ariel.suller@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Add a new CPUCP generic message type to retrieve HBM, SRAM and critical
error counters from the device.
Signed-off-by: Vitaly Margolin <vitaly.margolin@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
Change the BMON_CR register value back to its original state before
enabling, so that BMON does not continue to collect information
after being disabled.
Signed-off-by: Vered Yavniely <vered.yavniely@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
EFAULT is currently returned if less than requested user pages are
pinned. This value means a "bad address" which might be confusing to
the user, as the address of the given user memory is not necessarily
"bad".
Modify the return value to ENOMEM, as "out of memory" is more suitable
in this case.
Signed-off-by: Tomer Tayar <tomer.tayar@intel.com>
Reviewed-by: Koby Elbaz <koby.elbaz@intel.com>
Signed-off-by: Koby Elbaz <koby.elbaz@intel.com>
|
|
This is a backmerge of Linux 6.17-rc6, needed for msm,
also requested by misc.
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
Add interface for applications to get information array. The application
provides a buffer pointer along with information type, maximum number of
entries and maximum size of each entry. The buffer may also contain match
conditions based on the information type. After the ioctl completes, the
actual number of entries and entry size are returned. (see [1], used by
driver runtime library)
[1] https://github.com/amd/xdna-driver/blob/main/src/shim/host/platform_host.cpp#L337
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Maciej Falkowski <maciej.falkowski@linux.intel.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://lore.kernel.org/r/20250903053402.2103196-1-lizhi.hou@amd.com
|
|
Use disable_work_sync() instead of cancel_work_sync() in ivpu_dev_fini()
to ensure that no new recovery work items can be queued after device
removal has started. Previously, recovery work could be scheduled even
after canceling existing work, potentially leading to use-after-free
bugs if recovery accessed freed resources.
Rename ivpu_pm_cancel_recovery() to ivpu_pm_disable_recovery() to better
reflect its new behavior.
Fixes: 58cde80f45a2 ("accel/ivpu: Use dedicated work for job timeout detection")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Karol Wachowski <karol.wachowski@intel.com>
Reviewed-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Link: https://lore.kernel.org/r/20250808110939.328366-1-jacek.lawrynowicz@linux.intel.com
|