kernel/linux.git/drivers/gpu/drm/lima, branch v6.9.12

drm/lima: fix shared irq handling on driver remove

2024-07-11T10:50:56+00:00

[ Upstream commit a6683c690bbfd1f371510cb051e8fa49507f3f5e ] lima uses a shared interrupt, so the interrupt handlers must be prepared to be called at any time. At driver removal time, the clocks are disabled early and the interrupts stay registered until the very end of the remove process due to the devm usage. This is potentially a bug as the interrupts access device registers which assumes clocks are enabled. A crash can be triggered by removing the driver in a kernel with CONFIG_DEBUG_SHIRQ enabled. This patch frees the interrupts at each lima device finishing callback so that the handlers are already unregistered by the time we fully disable clocks. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240401224329.1228468-2-nunes.erico@gmail.com Signed-off-by: Sasha Levin

drm/lima: mask irqs in timeout path before hard reset

2024-06-27T11:52:16+00:00

[ Upstream commit a421cc7a6a001b70415aa4f66024fa6178885a14 ] There is a race condition in which a rendering job might take just long enough to trigger the drm sched job timeout handler but also still complete before the hard reset is done by the timeout handler. This runs into race conditions not expected by the timeout handler. In some very specific cases it currently may result in a refcount imbalance on lima_pm_idle, with a stack dump such as: [10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0 ... [10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0 ... [10136.669628] Call trace: [10136.669634] lima_devfreq_record_idle+0xa0/0xb0 [10136.669646] lima_sched_pipe_task_done+0x5c/0xb0 [10136.669656] lima_gp_irq_handler+0xa8/0x120 [10136.669666] __handle_irq_event_percpu+0x48/0x160 [10136.669679] handle_irq_event+0x4c/0xc0 We can prevent that race condition entirely by masking the irqs at the beginning of the timeout handler, at which point we give up on waiting for that job entirely. The irqs will be enabled again at the next hard reset which is already done as a recovery by the timeout handler. Signed-off-by: Erico Nunes Reviewed-by: Qiang Yu Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-4-nunes.erico@gmail.com Signed-off-by: Sasha Levin

drm/lima: include pp bcast irq in timeout handler check

2024-06-27T11:52:16+00:00

[ Upstream commit d8100caf40a35904d27ce446fb2088b54277997a ] In commit 53cb55b20208 ("drm/lima: handle spurious timeouts due to high irq latency") a check was added to detect an unexpectedly high interrupt latency timeout. With further investigation it was noted that on Mali-450 the pp bcast irq may also be a trigger of race conditions against the timeout handler, so add it to this check too. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-3-nunes.erico@gmail.com Signed-off-by: Sasha Levin

drm/lima: add mask irq callback to gp and pp

2024-06-27T11:52:15+00:00

[ Upstream commit 49c13b4d2dd4a831225746e758893673f6ae961c ] This is needed because we want to reset those devices in device-agnostic code such as lima_sched. In particular, masking irqs will be useful before a hard reset to prevent race conditions. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-2-nunes.erico@gmail.com Signed-off-by: Sasha Levin

drm/lima: standardize debug messages by ip name

2024-02-12T08:27:48+00:00

Some debug messages carried the ip name, or included "lima", or included both the ip name and then the numbered ip name again. Make the messages more consistent by always looking up and showing the ip name first. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-9-nunes.erico@gmail.com

drm/lima: increase default job timeout to 10s

2024-02-12T08:27:39+00:00

The previous 500ms default timeout was fairly optimistic and could be hit by real world applications. Many distributions targeting devices with a Mali-4xx already bumped this timeout to a higher limit. We can be generous here with a high value as 10s since this should mostly catch buggy jobs like infinite loop shaders, and these don't seem to happen very often in real applications. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-8-nunes.erico@gmail.com

drm/lima: remove guilty drm_sched context handling

2024-02-12T08:27:28+00:00

Marking the context as guilty currently only makes the application which hits a single timeout problem to stop its rendering context entirely. All jobs submitted later are dropped from the guilty context. Lima runs on fairly underpowered hardware for modern standards and it is not entirely unreasonable that a rendering job may time out occasionally due to high system load or too demanding application stack. In this case it would be generally preferred to report the error but try to keep the application going. Other similar embedded GPU drivers don't make use of the guilty context flag. Now that there are reliability improvements to the lima timeout recovery handling, drop the guilty contexts to let the application keep running in this case. Signed-off-by: Erico Nunes Acked-by: Christian König Reviewed-by: Vasily Khoruzhick Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-7-nunes.erico@gmail.com

drm/lima: handle spurious timeouts due to high irq latency

2024-02-12T08:27:17+00:00

There are several unexplained and unreproduced cases of rendering timeouts with lima, for which one theory is high IRQ latency coming from somewhere else in the system. This kind of occurrence may cause applications to trigger unnecessary resets of the GPU or even applications to hang if it hits an issue in the recovery path. Panfrost already does some special handling to account for such "spurious timeouts", it makes sense to have this in lima too to reduce the chance that it hit users. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-6-nunes.erico@gmail.com

drm/lima: set gp bus_stop bit before hard reset

2024-02-12T08:27:07+00:00

This is required for reliable hard resets. Otherwise, doing a hard reset while a task is still running (such as a task which is being stopped by the drm_sched timeout handler) may result in random mmu write timeouts or lockups which cause the entire gpu to hang. Signed-off-by: Erico Nunes Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-5-nunes.erico@gmail.com

drm/lima: set pp bus_stop bit before hard reset

2024-02-12T08:26:57+00:00

This is required for reliable hard resets. Otherwise, doing a hard reset while a task is still running (such as a task which is being stopped by the drm_sched timeout handler) may result in random mmu write timeouts or lockups which cause the entire gpu to hang. Signed-off-by: Erico Nunes Reviewed-by: Vasily Khoruzhick Signed-off-by: Qiang Yu Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-4-nunes.erico@gmail.com