<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/gpu/drm/lima, branch v6.9.12</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.9.12</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.9.12'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-07-11T10:50:56+00:00</updated>
<entry>
<title>drm/lima: fix shared irq handling on driver remove</title>
<updated>2024-07-11T10:50:56+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-04-01T22:43:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b5daf9217a50636a969bc1965f827878aeb09ffe'/>
<id>urn:sha1:b5daf9217a50636a969bc1965f827878aeb09ffe</id>
<content type='text'>
[ Upstream commit a6683c690bbfd1f371510cb051e8fa49507f3f5e ]

lima uses a shared interrupt, so the interrupt handlers must be prepared
to be called at any time. At driver removal time, the clocks are
disabled early and the interrupts stay registered until the very end of
the remove process due to the devm usage.
This is potentially a bug as the interrupts access device registers
which assumes clocks are enabled. A crash can be triggered by removing
the driver in a kernel with CONFIG_DEBUG_SHIRQ enabled.
This patch frees the interrupts at each lima device finishing callback
so that the handlers are already unregistered by the time we fully
disable clocks.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240401224329.1228468-2-nunes.erico@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>drm/lima: mask irqs in timeout path before hard reset</title>
<updated>2024-06-27T11:52:16+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-04-05T15:29:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=58bfd311c93d66d8282bf21ebbf35cc3bb8ad9db'/>
<id>urn:sha1:58bfd311c93d66d8282bf21ebbf35cc3bb8ad9db</id>
<content type='text'>
[ Upstream commit a421cc7a6a001b70415aa4f66024fa6178885a14 ]

There is a race condition in which a rendering job might take just long
enough to trigger the drm sched job timeout handler but also still
complete before the hard reset is done by the timeout handler.
This runs into race conditions not expected by the timeout handler.
In some very specific cases it currently may result in a refcount
imbalance on lima_pm_idle, with a stack dump such as:

[10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669628] Call trace:
[10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
[10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
[10136.669656]  lima_gp_irq_handler+0xa8/0x120
[10136.669666]  __handle_irq_event_percpu+0x48/0x160
[10136.669679]  handle_irq_event+0x4c/0xc0

We can prevent that race condition entirely by masking the irqs at the
beginning of the timeout handler, at which point we give up on waiting
for that job entirely.
The irqs will be enabled again at the next hard reset which is already
done as a recovery by the timeout handler.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Reviewed-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-4-nunes.erico@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>drm/lima: include pp bcast irq in timeout handler check</title>
<updated>2024-06-27T11:52:16+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-04-05T15:29:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=78137658a8fa03993e11e2925819de27304a0b00'/>
<id>urn:sha1:78137658a8fa03993e11e2925819de27304a0b00</id>
<content type='text'>
[ Upstream commit d8100caf40a35904d27ce446fb2088b54277997a ]

In commit 53cb55b20208 ("drm/lima: handle spurious timeouts due to high
irq latency") a check was added to detect an unexpectedly high interrupt
latency timeout.
With further investigation it was noted that on Mali-450 the pp bcast
irq may also be a trigger of race conditions against the timeout
handler, so add it to this check too.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-3-nunes.erico@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>drm/lima: add mask irq callback to gp and pp</title>
<updated>2024-06-27T11:52:15+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-04-05T15:29:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4180af73c269f7d59a96cce83c7abf12fdbefd5e'/>
<id>urn:sha1:4180af73c269f7d59a96cce83c7abf12fdbefd5e</id>
<content type='text'>
[ Upstream commit 49c13b4d2dd4a831225746e758893673f6ae961c ]

This is needed because we want to reset those devices in device-agnostic
code such as lima_sched.
In particular, masking irqs will be useful before a hard reset to
prevent race conditions.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240405152951.1531555-2-nunes.erico@gmail.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>drm/lima: standardize debug messages by ip name</title>
<updated>2024-02-12T08:27:48+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=423af970da74db7eed1b14f2b7f115714a67aeb8'/>
<id>urn:sha1:423af970da74db7eed1b14f2b7f115714a67aeb8</id>
<content type='text'>
Some debug messages carried the ip name, or included "lima", or
included both the ip name and then the numbered ip name again.
Make the messages more consistent by always looking up and showing
the ip name first.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-9-nunes.erico@gmail.com
</content>
</entry>
<entry>
<title>drm/lima: increase default job timeout to 10s</title>
<updated>2024-02-12T08:27:39+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9e5690a16fc26b17eeb8d7e07865b69a78e81e57'/>
<id>urn:sha1:9e5690a16fc26b17eeb8d7e07865b69a78e81e57</id>
<content type='text'>
The previous 500ms default timeout was fairly optimistic and could be
hit by real world applications. Many distributions targeting devices
with a Mali-4xx already bumped this timeout to a higher limit.
We can be generous here with a high value as 10s since this should
mostly catch buggy jobs like infinite loop shaders, and these don't
seem to happen very often in real applications.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-8-nunes.erico@gmail.com
</content>
</entry>
<entry>
<title>drm/lima: remove guilty drm_sched context handling</title>
<updated>2024-02-12T08:27:28+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e721d1cc8101a26fba22dc9a62309522e15bd0c7'/>
<id>urn:sha1:e721d1cc8101a26fba22dc9a62309522e15bd0c7</id>
<content type='text'>
Marking the context as guilty currently only makes the application which
hits a single timeout problem to stop its rendering context entirely.
All jobs submitted later are dropped from the guilty context.

Lima runs on fairly underpowered hardware for modern standards and it is
not entirely unreasonable that a rendering job may time out occasionally
due to high system load or too demanding application stack. In this case
it would be generally preferred to report the error but try to keep the
application going.

Other similar embedded GPU drivers don't make use of the guilty context
flag. Now that there are reliability improvements to the lima timeout
recovery handling, drop the guilty contexts to let the application keep
running in this case.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Reviewed-by: Vasily Khoruzhick &lt;anarsoul@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-7-nunes.erico@gmail.com
</content>
</entry>
<entry>
<title>drm/lima: handle spurious timeouts due to high irq latency</title>
<updated>2024-02-12T08:27:17+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=53cb55b20208eaa79de63c82795c8d51caa9cfcc'/>
<id>urn:sha1:53cb55b20208eaa79de63c82795c8d51caa9cfcc</id>
<content type='text'>
There are several unexplained and unreproduced cases of rendering
timeouts with lima, for which one theory is high IRQ latency coming from
somewhere else in the system.
This kind of occurrence may cause applications to trigger unnecessary
resets of the GPU or even applications to hang if it hits an issue in
the recovery path.
Panfrost already does some special handling to account for such
"spurious timeouts", it makes sense to have this in lima too to reduce
the chance that it hit users.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-6-nunes.erico@gmail.com
</content>
</entry>
<entry>
<title>drm/lima: set gp bus_stop bit before hard reset</title>
<updated>2024-02-12T08:27:07+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=27aa58ec85f973d98d336df7b7941149308db80f'/>
<id>urn:sha1:27aa58ec85f973d98d336df7b7941149308db80f</id>
<content type='text'>
This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-5-nunes.erico@gmail.com
</content>
</entry>
<entry>
<title>drm/lima: set pp bus_stop bit before hard reset</title>
<updated>2024-02-12T08:26:57+00:00</updated>
<author>
<name>Erico Nunes</name>
<email>nunes.erico@gmail.com</email>
</author>
<published>2024-01-24T02:59:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a9da58c86e3bfb0a068a1834e48bdc536e7201e4'/>
<id>urn:sha1:a9da58c86e3bfb0a068a1834e48bdc536e7201e4</id>
<content type='text'>
This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes &lt;nunes.erico@gmail.com&gt;
Reviewed-by: Vasily Khoruzhick &lt;anarsoul@gmail.com&gt;
Signed-off-by: Qiang Yu &lt;yuq825@gmail.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20240124025947.2110659-4-nunes.erico@gmail.com
</content>
</entry>
</feed>
