<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch v7.0-rc7</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.0-rc7</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.0-rc7'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-23T18:47:47+00:00</updated>
<entry>
<title>drm/amdgpu: fix strsep() corrupting lockup_timeout on multi-GPU (v3)</title>
<updated>2026-03-23T18:47:47+00:00</updated>
<author>
<name>Ruijing Dong</name>
<email>ruijing.dong@amd.com</email>
</author>
<published>2026-03-17T17:54:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2d300ebfc411205fa31ba7741c5821d381912381'/>
<id>urn:sha1:2d300ebfc411205fa31ba7741c5821d381912381</id>
<content type='text'>
amdgpu_device_get_job_timeout_settings() passes a pointer directly
to the global amdgpu_lockup_timeout[] buffer into strsep().
strsep() destructively replaces delimiter characters with '\0'
in-place.

On multi-GPU systems, this function is called once per device.
When a multi-value setting like "0,0,0,-1" is used, the first
GPU's call transforms the global buffer into "0\00\00\0-1". The
second GPU then sees only "0" (terminated at the first '\0'),
parses a single value, hits the single-value fallthrough
(index == 1), and applies timeout=0 to all rings — causing
immediate false job timeouts.

Fix this by copying into a stack-local array before calling
strsep(), so the global module parameter buffer remains intact
across calls. The buffer is AMDGPU_MAX_TIMEOUT_PARAM_LENGTH
(256) bytes, which is safe for the stack.

v2: wrap commit message to 72 columns, add Assisted-by tag.
v3: use stack array with strscpy() instead of kstrdup()/kfree()
    to avoid unnecessary heap allocation (Christian).

This patch was developed with assistance from Claude (claude-opus-4-6).

Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Ruijing Dong &lt;ruijing.dong@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 94d79f51efecb74be1d88dde66bdc8bfcca17935)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amd: Set num IP blocks to 0 if discovery fails</title>
<updated>2026-03-11T18:04:08+00:00</updated>
<author>
<name>Mario Limonciello</name>
<email>mario.limonciello@amd.com</email>
</author>
<published>2026-03-10T16:58:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3646ff28780b4c52c5b5081443199e7a430110e5'/>
<id>urn:sha1:3646ff28780b4c52c5b5081443199e7a430110e5</id>
<content type='text'>
If discovery has failed for any reason (such as no support for a block)
then there is no need to unwind all the IP blocks in fini. In this
condition there can actually be failures during the unwind too.

Reset num_ip_blocks to zero during failure path and skip the unnecessary
cleanup path.

Suggested-by: Lijo Lazar &lt;lijo.lazar@amd.com&gt;
Reviewed-by: Lijo Lazar &lt;lijo.lazar@amd.com&gt;
Signed-off-by: Mario Limonciello &lt;mario.limonciello@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit fae5984296b981c8cc3acca35b701c1f332a6cd8)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amd: Fix a few more NULL pointer dereference in device cleanup</title>
<updated>2026-03-06T22:19:14+00:00</updated>
<author>
<name>Mario Limonciello</name>
<email>mario.limonciello@amd.com</email>
</author>
<published>2026-03-05T15:06:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=72ecb1dae72775fa9fea0159d8445d620a0a2295'/>
<id>urn:sha1:72ecb1dae72775fa9fea0159d8445d620a0a2295</id>
<content type='text'>
I found a few more paths that cleanup fails due to a NULL version pointer
on unsupported hardware.

Add NULL checks as applicable.

Fixes: 39fc2bc4da00 ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Mario Limonciello &lt;mario.limonciello@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit f5a05f8414fc10f307eb965f303580c7778f8dd2)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amd: Fix NULL pointer dereference in device cleanup</title>
<updated>2026-03-06T22:10:40+00:00</updated>
<author>
<name>Mario Limonciello</name>
<email>mario.limonciello@amd.com</email>
</author>
<published>2026-03-04T20:07:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=062ea905fff7756b2e87143ffccaece5cdb44267'/>
<id>urn:sha1:062ea905fff7756b2e87143ffccaece5cdb44267</id>
<content type='text'>
When GPU initialization fails due to an unsupported HW block
IP blocks may have a NULL version pointer. During cleanup in
amdgpu_device_fini_hw, the code calls amdgpu_device_set_pg_state and
amdgpu_device_set_cg_state which iterate over all IP blocks and access
adev-&gt;ip_blocks[i].version without NULL checks, leading to a kernel
NULL pointer dereference.

Add NULL checks for adev-&gt;ip_blocks[i].version in both
amdgpu_device_set_cg_state and amdgpu_device_set_pg_state to prevent
dereferencing NULL pointers during GPU teardown when initialization has
failed.

Fixes: 39fc2bc4da00 ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Mario Limonciello &lt;mario.limonciello@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit b7ac77468cda92eecae560b05f62f997a12fe2f2)
Cc: stable@vger.kernel.org
</content>
</entry>
<entry>
<title>drm/amdgpu: Fix error handling in slot reset</title>
<updated>2026-02-25T22:57:55+00:00</updated>
<author>
<name>Lijo Lazar</name>
<email>lijo.lazar@amd.com</email>
</author>
<published>2026-02-24T04:48:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b57c4ec98c17789136a4db948aec6daadceb5024'/>
<id>urn:sha1:b57c4ec98c17789136a4db948aec6daadceb5024</id>
<content type='text'>
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.

Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.

Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar &lt;lijo.lazar@amd.com&gt;
Reviewed-by: Ce Sun &lt;cesun102@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit bb71362182e59caa227e4192da5a612b09349696)
</content>
</entry>
<entry>
<title>Convert 'alloc_obj' family to use the new default GFP_KERNEL argument</title>
<updated>2026-02-22T01:09:51+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-02-22T00:37:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bf4afc53b77aeaa48b5409da5c8da6bb4eff7f43'/>
<id>urn:sha1:bf4afc53b77aeaa48b5409da5c8da6bb4eff7f43</id>
<content type='text'>
This was done entirely with mindless brute force, using

    git grep -l '\&lt;k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>treewide: Replace kmalloc with kmalloc_obj for non-scalar types</title>
<updated>2026-02-21T09:02:28+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2026-02-21T07:49:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=69050f8d6d075dc01af7a5f2f550a8067510366f'/>
<id>urn:sha1:69050f8d6d075dc01af7a5f2f550a8067510366f</id>
<content type='text'>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
</entry>
<entry>
<title>drm/amdgpu: Set atomics to true for xgmi</title>
<updated>2026-02-19T17:16:11+00:00</updated>
<author>
<name>Harish Kasiviswanathan</name>
<email>Harish.Kasiviswanathan@amd.com</email>
</author>
<published>2026-02-12T17:13:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=23c098b5fc89c17cec6b25ebe071f47a54b7d613'/>
<id>urn:sha1:23c098b5fc89c17cec6b25ebe071f47a54b7d613</id>
<content type='text'>
xgmi support atomics between links. Set them to true. This only set for
GFX12 onwards to avoid regression on older generations

v2: Use correct xgmi flag that indicates CPU connection

Signed-off-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amd/pm: send unload command to smu during modprobe -r amdgpu</title>
<updated>2026-02-12T20:20:07+00:00</updated>
<author>
<name>Kenneth Feng</name>
<email>kenneth.feng@amd.com</email>
</author>
<published>2026-01-22T10:04:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d76c66c6ad83994181967d7427385aab4c009ae0'/>
<id>urn:sha1:d76c66c6ad83994181967d7427385aab4c009ae0</id>
<content type='text'>
Send unload command to smu during modprobe -r amdgpu for smu 13/14.
1. This can fix the high voltage/temperatue issue after driver is unloaded.
2. Reloading driver could fail but with the debug port based mode1 reset
during driver is reloaded, it is good and safe.

Signed-off-by: Kenneth Feng &lt;kenneth.feng@amd.com&gt;
Reviewed-by: Yang Wang &lt;kevinyang.wang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
<entry>
<title>drm/amd: Fix hang on amdgpu unload by using pci_dev_is_disconnected()</title>
<updated>2026-02-05T22:25:57+00:00</updated>
<author>
<name>Mario Limonciello</name>
<email>mario.limonciello@amd.com</email>
</author>
<published>2026-02-05T16:42:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f7afda7fcd169a9168695247d07ad94cf7b9798f'/>
<id>urn:sha1:f7afda7fcd169a9168695247d07ad94cf7b9798f</id>
<content type='text'>
The commit 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise
disconnect") introduced early KFD cleanup when drm_dev_is_unplugged()
returns true. However, this causes hangs during normal module unload
(rmmod amdgpu).

The issue occurs because drm_dev_unplug() is called in amdgpu_pci_remove()
for all removal scenarios, not just surprise disconnects. This was done
intentionally in commit 39934d3ed572 ("Revert "drm/amdgpu: TA unload
messages are not actually sent to psp when amdgpu is uninstalled"") to
fix IGT PCI software unplug test failures. As a result,
drm_dev_is_unplugged() returns true even during normal module unload,
triggering the early KFD cleanup inappropriately.

The correct check should distinguish between:
- Actual surprise disconnect (eGPU unplugged): pci_dev_is_disconnected()
  returns true
- Normal module unload (rmmod): pci_dev_is_disconnected() returns false

Replace drm_dev_is_unplugged() with pci_dev_is_disconnected() to ensure
the early cleanup only happens during true hardware disconnect events.

Cc: stable@vger.kernel.org
Reported-by: Cal Peake &lt;cp@absolutedigital.net&gt;
Closes: https://lore.kernel.org/all/b0c22deb-c0fa-3343-33cf-fd9a77d7db99@absolutedigital.net/
Fixes: 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise disconnect")
Acked-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Mario Limonciello &lt;mario.limonciello@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
</entry>
</feed>
