kernel/linux.git/include/drm/gpu_scheduler.h, branch v6.19.11

drm/sched: backend_ops doc fix

2025-09-17T12:58:33+00:00

Function drm_sched_entity_do_release() has been renamed in commit 180fc134d712 ("drm/scheduler: Rename cleanup functions v2."). Refer to the correct function in the documentation. Signed-off-by: Luc Ma [phasta: commit message] Signed-off-by: Philipp Stanner Link: https://lore.kernel.org/r/20250915132327.6293-1-onion0709@gmail.com

drm/sched: Allow drivers to skip the reset and keep on running

2025-07-15T11:27:07+00:00

When the DRM scheduler times out, it's possible that the GPU isn't hung; instead, a job just took unusually long (longer than the timeout) but is still running, and there is, thus, no reason to reset the hardware. This can occur in two scenarios:

1. The job is taking longer than the timeout, but the driver determined through a GPU-specific mechanism that the hardware is still making progress. Hence, the driver would like the scheduler to skip the timeout and treat the job as still pending from then onward. This happens in v3d, Etnaviv, and Xe.
2. The timeout fired before the free-job worker ran. Consequently, the scheduler calls `sched->ops->timedout_job()` for a job that isn't timed out.

These two scenarios are problematic because the job was removed from the `sched->pending_list` before calling `sched->ops->timedout_job()`, which means that when the job finishes, it won't be freed by the scheduler through `sched->ops->free_job()`, leading to a memory leak.

To solve these problems, create a new `drm_gpu_sched_stat`, called DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The new status will indicate that the job must be reinserted into `sched->pending_list`, and the hardware / driver will still complete that job.

Reviewed-by: Philipp Stanner
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-2-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal
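The reinsertion behaviour described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `model_*` and `MODEL_*` name is hypothetical and stands in for the corresponding kernel concept, the progress counter is an invented stand-in for a GPU-specific progress check, and the recovery path after a real reset is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the drm_gpu_sched_stat values named above. */
enum model_sched_stat {
	MODEL_STAT_RESET,   /* GPU hung and was reset */
	MODEL_STAT_ENODEV,  /* device is gone */
	MODEL_STAT_NO_HANG, /* job is still running, skip the reset */
};

struct model_job {
	struct model_job *next;
	unsigned long progress;  /* invented hardware progress counter */
	unsigned long last_seen; /* counter value at the previous timeout */
};

struct model_sched {
	struct model_job *pending; /* stand-in for sched->pending_list */
};

/* Model of a driver's timedout_job(): no reset while progress is visible. */
static enum model_sched_stat model_timedout_job(struct model_job *job)
{
	if (job->progress != job->last_seen) {
		job->last_seen = job->progress;
		return MODEL_STAT_NO_HANG;
	}
	/* A real driver would reset the hardware here. */
	return MODEL_STAT_RESET;
}

/* Model of the scheduler's timeout handling for the oldest pending job. */
static enum model_sched_stat model_handle_timeout(struct model_sched *sched)
{
	struct model_job *job;
	enum model_sched_stat stat;

	assert(sched != NULL);
	job = sched->pending;
	if (!job)
		return MODEL_STAT_NO_HANG; /* nothing pending, nothing to reset */

	/* The job leaves the pending list before the callback runs. */
	sched->pending = job->next;
	job->next = NULL;

	stat = model_timedout_job(job);
	if (stat == MODEL_STAT_NO_HANG) {
		/* Put the job back so that it can still be freed on completion. */
		job->next = sched->pending;
		sched->pending = job;
	}
	return stat;
}
```

In this model, a job whose counter advanced since the last timeout ends up back at the head of the pending list, which is the property that avoids the leak described above.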

drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET

2025-07-15T11:27:00+00:00

Among the scheduler's statuses, the only one that indicates an error is DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV signifies that the operation succeeded and the GPU is in a nominal state. However, describing the GPU's status properly requires conveying more than just "OK". Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has hung, but it has been successfully reset and is now in a nominal state again. Reviewed-by: Philipp Stanner Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com Signed-off-by: Maíra Canal

drm/sched: Avoid memory leaks with cancel_job() callback

2025-07-10T15:07:08+00:00

Since its inception, the GPU scheduler has been able to leak memory if the driver calls drm_sched_fini() while there are still jobs in flight. The simplest way to solve this in a backwards compatible manner is by adding a new callback, drm_sched_backend_ops.cancel_job(), which instructs the driver to signal the hardware fence associated with the job. Afterwards, the scheduler can safely use the established free_job() callback for freeing the job. Implement the new backend_ops callback cancel_job(). Suggested-by: Tvrtko Ursulin Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/ Reviewed-by: Maíra Canal Acked-by: Tvrtko Ursulin Signed-off-by: Philipp Stanner Link: https://lore.kernel.org/r/20250710125412.128476-4-phasta@kernel.org
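The teardown order described here (signal the hardware fence through cancel_job(), then release the job through free_job()) can be sketched with a userspace model. All `model_*` names are hypothetical, the fence is reduced to two fields, and the choice of -ECANCELED as the fence error is an assumption made for illustration; the commit message itself does not name an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Heavily reduced, hypothetical stand-in for a hardware fence. */
struct model_fence {
	bool signaled;
	int error;
};

struct model_job {
	struct model_job *next;
	struct model_fence hw_fence;
	bool freed; /* the model marks jobs instead of releasing memory */
};

struct model_ops {
	void (*cancel_job)(struct model_job *job);
	void (*free_job)(struct model_job *job);
};

struct model_sched {
	struct model_job *pending; /* jobs still in flight at teardown */
	const struct model_ops *ops;
};

/* Driver side: signal the fence of a job that will never complete. */
static void model_cancel_job(struct model_job *job)
{
	if (job->hw_fence.signaled)
		return;
	job->hw_fence.error = -ECANCELED; /* assumed error code */
	job->hw_fence.signaled = true;
}

static void model_free_job(struct model_job *job)
{
	job->freed = true;
}

static const struct model_ops model_driver_ops = {
	.cancel_job = model_cancel_job,
	.free_job = model_free_job,
};

/* Scheduler side: cancel, then free, every job that is still pending. */
static int model_sched_fini(struct model_sched *sched)
{
	int freed = 0;

	assert(sched && sched->ops && sched->ops->free_job);
	while (sched->pending) {
		struct model_job *job = sched->pending;

		sched->pending = job->next; /* read next before freeing */
		if (!sched->ops->cancel_job)
			continue; /* without the callback the job is simply lost */
		sched->ops->cancel_job(job);
		sched->ops->free_job(job);
		freed++;
	}
	return freed;
}
```

The optional callback mirrors the backwards-compatible design: a driver that does not provide cancel_job() in this model behaves as before, including the leak.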

drm: Get rid of drm_sched_job.id

2025-05-28T14:16:15+00:00

Its only purpose was for trace events, but jobs can already be uniquely identified using their fence. The downside of using the fence is that it's only available after 'drm_sched_job_arm' has been called. That holds for all trace events that used job.id, so they can safely switch to using the fence. Suggested-by: Tvrtko Ursulin Signed-off-by: Pierre-Eric Pelloux-Prayer Reviewed-by: Christian König Reviewed-by: Arvind Yadav Signed-off-by: Philipp Stanner Link: https://lore.kernel.org/r/20250526125505.2360-9-pierre-eric.pelloux-prayer@amd.com

drm/sched: Store the drm client_id in drm_sched_fence

2025-05-28T14:15:58+00:00

This will be used in a later commit to trace the drm client_id in some of the gpu_scheduler trace events. This requires changing all the users of drm_sched_job_init to add an extra parameter. The newly added drm_client_id field in the drm_sched_fence is a bit of a duplicate of the owner one. One suggestion I received was to merge those 2 fields - this can't be done right now as amdgpu uses some special values (AMDGPU_FENCE_OWNER_*) that can't really be translated into a client id. Christian is working on getting rid of those; when it's done we should be able to squash owner/drm_client_id together. Reviewed-by: Christian König Signed-off-by: Pierre-Eric Pelloux-Prayer Signed-off-by: Philipp Stanner Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com

drm/sched: Fix outdated comments referencing thread

2025-05-13T13:39:48+00:00

The GPU scheduler's comments refer to a "thread" at various places. Those are leftovers from commit a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread"). Replace all references to kthreads. Reviewed-by: Tvrtko Ursulin Signed-off-by: Philipp Stanner Link: https://lore.kernel.org/r/20250314101023.111248-2-phasta@kernel.org

drm/sched: Update timedout_job()'s documentation

2025-03-06T15:36:32+00:00

drm_sched_backend_ops.timedout_job()'s documentation is outdated. It mentions the deprecated function drm_sched_resubmit_jobs(). Furthermore, it does not point out the important distinction between hardware and firmware schedulers. Since firmware schedulers typically only use one entity per scheduler, timeout handling is significantly simpler because the entity the faulted job came from can just be killed without affecting innocent processes. Update the documentation with that distinction and other details. Reformat the docstring to follow the same style as the other handles. Acked-by: Danilo Krummrich Signed-off-by: Philipp Stanner Link: https://patchwork.freedesktop.org/patch/msgid/20250305130551.136682-5-phasta@kernel.org

drm/sched: Adjust outdated docu for run_job()

2025-03-06T15:35:48+00:00

The documentation for drm_sched_backend_ops.run_job() mentions a certain function called drm_sched_job_recovery(). This function does not exist. What's actually meant is drm_sched_resubmit_jobs(), which is by now also deprecated. Furthermore, the scheduler expects to "inherit" a reference on the fence from the run_job() callback. This, so far, is also not documented. Remove the mention of the removed function. Discourage the behavior of drm_sched_backend_ops.run_job() being called multiple times for the same job. Document the necessity of incrementing the refcount in run_job(). Acked-by: Danilo Krummrich Signed-off-by: Philipp Stanner Link: https://patchwork.freedesktop.org/patch/msgid/20250305130551.136682-3-phasta@kernel.org
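The reference that the scheduler "inherits" from run_job() can be illustrated with a reduced userspace model. The `model_*` names and the plain integer refcount are hypothetical simplifications; real drivers work with refcounted fence objects rather than this structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence reduced to a bare reference count. */
struct model_fence {
	int refcount;
};

struct model_job {
	struct model_fence hw_fence;
	bool submitted;
};

static struct model_fence *model_fence_get(struct model_fence *fence)
{
	fence->refcount++;
	return fence;
}

static void model_fence_put(struct model_fence *fence)
{
	assert(fence->refcount > 0);
	fence->refcount--;
}

/*
 * Model of a run_job() callback: the driver keeps one reference for
 * itself and takes an additional one that is handed to the scheduler
 * together with the returned pointer.
 */
static struct model_fence *model_run_job(struct model_job *job)
{
	job->hw_fence.refcount = 1; /* the driver's own reference */
	job->submitted = true;      /* stand-in for pushing work to hardware */

	return model_fence_get(&job->hw_fence); /* the scheduler's reference */
}
```

When the scheduler side later drops its reference with `model_fence_put()`, the driver's reference is still held, so the fence stays valid for the driver.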

drm/sched: Group exported prototypes by object type

2025-02-24T09:17:42+00:00

Do a bit of housekeeping in gpu_scheduler.h by grouping the API by the type of object it operates on. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matthew Brost Cc: Philipp Stanner Signed-off-by: Philipp Stanner Link: https://patchwork.freedesktop.org/patch/msgid/20250221105038.79665-7-tvrtko.ursulin@igalia.com