| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit b57c4ec98c17789136a4db948aec6daadceb5024 ]
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.
Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.
Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bb71362182e59caa227e4192da5a612b09349696)
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 480ad5f6ead4a47b969aab6618573cd6822bb6a4 ]
Do not unlock psp->ras_context.mutex if it has not been locked. This has
been detected by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6fa01b4335978051d2cd80841728fd63cc597970)
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e0bcc7b88bcd081aaae6f481b10d9ab294fcb69 ]
Mutexes must be unlocked before these are destroyed. This has been detected
by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Yang Wang <kevinyang.wang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 270258ba320beb99648dceffb67e86ac76786e55)
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b7d7693a55d606d700beb9549c9f7f0e5d9c24f ]
Userspace can either deliberately pass in the too small num_fences, or the
required number can legitimately grow between the two calls to the userq
wait ioctl. In both cases we do not want the emit the kernel warning
backtrace since nothing is wrong with the kernel and userspace will simply
get an errno reported back. So lets simply drop the WARN_ONs.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2c333ea579de6cc20ea7bc50e9595ef72863e65c)
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Timeline Management v7
[ Upstream commit efdc66fe12b07e7b7d28650bd8d4f7e3bb92c5d4 ]
When GPU memory mappings are updated, the driver returns a fence so
userspace knows when the update is finished.
The previous refactor could pick the wrong fence or rely on checks that
are not safe for GPU mappings that stay valid even when memory is
missing. In some cases this could return an invalid fence or cause fence
reference counting problems.
Fix this by (v5,v6, per Christian):
- Starting from the VM’s existing last update fence, so a valid and
meaningful fence is always returned even when no new work is required.
- Selecting the VM-level fence only for always-valid / PRT mappings using
the required combined bo_va + bo guard.
- Using the per-BO page table update fence for normal MAP and REPLACE
operations.
- For UNMAP and CLEAR, returning the fence provided by
amdgpu_vm_clear_freed(), which may remain unchanged when nothing needs
clearing.
- Keeping fence reference counting balanced.
v7: Drop the extra bo_va/bo NULL guard since
amdgpu_vm_is_bo_always_valid() handles NULL BOs correctly (including
PRT). (Christian)
This makes VM timeline fences correct and prevents crashes caused by
incorrect fence handling.
Fixes: bd8150a1b337 ("drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling Last Fence Update and Timeline Management v4")
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 096bb75e13cc508d3915b7604e356bcb12b17766 ]
On Intel MacBookPros with switchable graphics, when the iGPU
is enabled, the address of VRAM gets put at 0 in the dGPU's
virtual address space. This is non-standard and seems to cause
issues with the cursor if it ends up at 0. We have the framework
to reserve memory at 0 in the address space, so enable it here if
the vram start address is 0.
Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302
Cc: stable@vger.kernel.org
Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b18fc0ab837381c1a6ef28386602cd888f2d9edf ]
Invalidating a dmabuf will impact other users of the shared BO.
In the scenario where process A moves the BO, it needs to inform
process B about the move and process B will need to update its
page table.
The commit fixes a synchronisation bug caused by the use of the
ticket: it made amdgpu_vm_handle_moved behave as if updating
the page table immediately was correct but in this case it's not.
An example is the following scenario, with 2 GPUs and glxgears
running on GPU0 and Xorg running on GPU1, on a system where P2P
PCI isn't supported:
glxgears:
export linear buffer from GPU0 and import using GPU1
submit frame rendering to GPU0
submit tiled->linear blit
Xorg:
copy of linear buffer
The sequence of jobs would be:
drm_sched_job_run # GPU0, frame rendering
drm_sched_job_queue # GPU0, blit
drm_sched_job_done # GPU0, frame rendering
drm_sched_job_run # GPU0, blit
move linear buffer for GPU1 access #
amdgpu_dma_buf_move_notify -> update pt # GPU0
It this point the blit job on GPU0 is still running and would
likely produce a page fault.
Cc: stable@vger.kernel.org
Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3b948dd0366a0b64c02e4ed1aefdf7825942e803 ]
Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if
gmc init with 57-bit address space support, because
ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging
support 47-bit VA, require 5-level paging on GPU to support ARM64.
NPA address space 52-bit mapping on NPA GPU VM require 5-level paging.
Debugger trap get device snapshot expect LDS and Scratch base, limit
above 57-bit, which is set only for 5-level paging.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.19.x
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f6b1c1f5fd7237f77fc3880603ea54dcf0371a20 ]
If GPU supports 5-level page table, but CPU disable 5-level page table
by using boot option no5lvl or CPU feature not available, the virtual
address will be 48bit, not needed to enable 5-level page table on GPU
vm.
If adev->vm_manager.num_level, number of pde levels, set to 4, then
gfxhub and mmhub register VM_CONTEXTx_CNTL/PAGE_TABLE_DEPTH will set
to 4 to enable 5-level page table in page table walker.
Set vm_manager.root_level to AMDGPU_VM_PDE3, then update GPU mapping
will allocate and update PDE3/PDE2/PDE1/PDE0/PTB 5-level page tables.
If max_level is not 4, no change for the logic to support features
needed by old ASICs.
v2: squash in CONFIG fix
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Stable-dep-of: 3b948dd0366a ("drm/amdgpu: Use 5-level paging if gmc support 57-bit VA")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 39fc2bc4da0082c226cbee331f0a5d44db3997da ]
Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU
register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all()
and amdgpu_fence_driver_hw_fini().
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5cc7bbd9f1b74d9fe2f7ac08d6ba0477e8d2d65f ]
sdma ring reset is not supported in SRIOV. kfd driver does not check
reset mask, and could queue sdma ring reset during unmap_queues_cpsch.
Avoid the ring reset for sriov.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3ee1c72606bd2842f0f377fd4b118362af0323ae ]
Tune the sleep interval in the PSP fence wait loop from 10-100us to
60-100us.This adjustment results in an overall wait window of 1.2s
(60us * 20000 iterations) to 2 seconds (100us * 20000 iterations),
which guarantees that we can retrieve the correct fence value
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 044f8d3b1fac6ac89c560f61415000e6bdab3a03 ]
end the function flow when ras table checksum is error
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8980be03b3f9a4b58197ef95d3b37efa41a25331 ]
VF doesn't enable VCN poison irq in VCNv2.5. Skip releasing it and avoid
call trace during deinitialization.
[ 71.913601] [drm] clean up the vf2pf work item
[ 71.915088] ------------[ cut here ]------------
[ 71.915092] WARNING: CPU: 3 PID: 1079 at /tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[ 71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp mii ttm aesni_intel crypto_simd cryptd
[ 71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G OE 6.8.0-87-generic #88~22.04.1-Ubuntu
[ 71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[ 71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[ 71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00
[ 71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246
[ 71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000
[ 71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000
[ 71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000
[ 71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 71.915792] FS: 000070cff87c4c40(0000) GS:ffff893abfb80000(0000) knlGS:0000000000000000
[ 71.915795] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0
[ 71.915800] PKRU: 55555554
[ 71.915802] Call Trace:
[ 71.915805] <TASK>
[ 71.915809] vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu]
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8079b87c02e531cc91601f72ea8336dd2262fdf1 ]
Add validation to ensure user queue sizes meet hardware requirements:
- Size must be a power of two for efficient ring buffer wrapping
- Size must be at least AMDGPU_GPU_PAGE_SIZE to prevent undersized allocations
This prevents invalid configurations that could lead to GPU faults or
unexpected behavior.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0028b86b52f7609e36af635ef6cb908925306233 ]
set retired_page of invalid ras records to U64_MAX, and skip
them when reading ras records
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 39c21b81112321cbe1267b02c77ecd2161ce19aa ]
VFs use the PF SDMA ucode and are unable to load SDMA_RS64.
Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Signed-off-by: Victor Skvortsov <Victor.Skvortsov@amd.com>
Reviewed-by: Gavin Wan <gavin.wan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Timeline Management v4
[ Upstream commit bd8150a1b3370a9f7761c5814202a3fe5a79f44f ]
This commit simplifies the amdgpu_gem_va_ioctl function, key updates
include:
- Moved the logic for managing the last update fence directly into
amdgpu_gem_va_update_vm.
- Introduced checks for the timeline point to enable conditional
replacement or addition of fences.
v2: Addressed review comments from Christian.
v3: Updated comments (Christian).
v4: The previous version selected the fence too early and did not manage its
reference correctly, which could lead to stale or freed fences being used.
This resulted in refcount underflows and could crash when updating GPU
timelines.
The fence is now chosen only after the VA mapping work is completed, and its
reference is taken safely. After exporting it to the VM timeline syncobj, the
driver always drops its local fence reference, ensuring balanced refcounting
and avoiding use-after-free on dma_fence.
Crash signature:
[ 205.828135] refcount_t: underflow; use-after-free.
[ 205.832963] WARNING: CPU: 30 PID: 7274 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
...
[ 206.074014] Call Trace:
[ 206.076488] <TASK>
[ 206.078608] amdgpu_gem_va_ioctl+0x6ea/0x740 [amdgpu]
[ 206.084040] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 206.089994] drm_ioctl_kernel+0x86/0xe0 [drm]
[ 206.094415] drm_ioctl+0x26e/0x520 [drm]
[ 206.098424] ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 206.104402] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[ 206.109387] __x64_sys_ioctl+0x96/0xe0
[ 206.113156] do_syscall_64+0x66/0x2d0
...
[ 206.553351] BUG: unable to handle page fault for address: ffffffffc0dfde90
...
[ 206.553378] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
...
[ 206.553405] Call Trace:
[ 206.553409] <IRQ>
[ 206.553415] ? __pfx_drm_sched_fence_free_rcu+0x10/0x10 [gpu_sched]
[ 206.553424] dma_fence_signal+0x30/0x60
[ 206.553427] drm_sched_job_done.isra.0+0x123/0x150 [gpu_sched]
[ 206.553434] dma_fence_signal_timestamp_locked+0x6e/0xe0
[ 206.553437] dma_fence_signal+0x30/0x60
[ 206.553441] amdgpu_fence_process+0xd8/0x150 [amdgpu]
[ 206.553854] sdma_v4_0_process_trap_irq+0x97/0xb0 [amdgpu]
[ 206.554353] edac_mce_amd(E) ee1004(E)
[ 206.554270] amdgpu_irq_dispatch+0x150/0x230 [amdgpu]
[ 206.554702] amdgpu_ih_process+0x6a/0x180 [amdgpu]
[ 206.555101] amdgpu_irq_handler+0x23/0x60 [amdgpu]
[ 206.555500] __handle_irq_event_percpu+0x4a/0x1c0
[ 206.555506] handle_irq_event+0x38/0x80
[ 206.555509] handle_edge_irq+0x92/0x1e0
[ 206.555513] __common_interrupt+0x3e/0xb0
[ 206.555519] common_interrupt+0x80/0xa0
[ 206.555525] </IRQ>
[ 206.555527] <TASK>
...
[ 206.555650] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
...
[ 206.555667] Kernel panic - not syncing: Fatal exception in interrupt
Link: https://patchwork.freedesktop.org/patch/654669/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c8cf9ddc549fb93cb5a35f3fe23487b1e6707e74 ]
Only set an error on the fence if the fence is not
signalled. We can end up with a warning if the
per queue reset path signals the fence and sets an error
as part of the reset, but fails to recover.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e2fd14f579b841f54a9b7162fef15234d8c0627a ]
This initializes HDP IP version 6.1.1.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bd68a1404b6fa2e7e9957b38ba22616faba43e75 ]
In the rare event if eeprom has only invalid address entries,
allocation is skipped, this causes following NULL pointer issue
[ 547.103445] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 547.118897] #PF: supervisor read access in kernel mode
[ 547.130292] #PF: error_code(0x0000) - not-present page
[ 547.141689] PGD 124757067 P4D 0
[ 547.148842] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 547.158504] CPU: 49 PID: 8167 Comm: cat Tainted: G OE 6.8.0-38-generic #38-Ubuntu
[ 547.177998] Hardware name: Supermicro AS -8126GS-TNMR/H14DSG-OD, BIOS 1.7 09/12/2025
[ 547.195178] RIP: 0010:amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu]
[ 547.210375] Code: e8 63 78 82 c0 45 31 d2 45 3b 75 08 48 8b 45 a0 73 44 44 89 f1 48 8b 7d 88 48 89 ca 48 c1 e2 05 48 29 ca 49 8b 4d 00 48 01 d1 <48> 83 79 10 00 74 17 49 63 f2 48 8b 49 08 41 83 c2 01 48 8d 34 76
[ 547.252045] RSP: 0018:ffa0000067287ac0 EFLAGS: 00010246
[ 547.263636] RAX: ff11000167c28130 RBX: ff11000127600000 RCX: 0000000000000000
[ 547.279467] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000125b1c800
[ 547.295298] RBP: ffa0000067287b50 R08: 0000000000000000 R09: 0000000000000000
[ 547.311129] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 547.326959] R13: ff11000217b1de00 R14: 0000000000000000 R15: 0000000000000092
[ 547.342790] FS: 0000746e59d14740(0000) GS:ff11017dfda80000(0000) knlGS:0000000000000000
[ 547.360744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 547.373489] CR2: 0000000000000010 CR3: 000000019585e001 CR4: 0000000000f71ef0
[ 547.389321] PKRU: 55555554
[ 547.395316] Call Trace:
[ 547.400737] <TASK>
[ 547.405386] ? show_regs+0x6d/0x80
[ 547.412929] ? __die+0x24/0x80
[ 547.419697] ? page_fault_oops+0x99/0x1b0
[ 547.428588] ? do_user_addr_fault+0x2ee/0x6b0
[ 547.438249] ? exc_page_fault+0x83/0x1b0
[ 547.446949] ? asm_exc_page_fault+0x27/0x30
[ 547.456225] ? amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu]
[ 547.470040] ? mas_wr_modify+0xcd/0x140
[ 547.478548] sysfs_kf_bin_read+0x63/0xb0
[ 547.487248] kernfs_file_read_iter+0xa1/0x190
[ 547.496909] kernfs_fop_read_iter+0x25/0x40
[ 547.506182] vfs_read+0x255/0x390
This also result in space left assigned to negative values.
Moving data alloc call before bad page check resolves both the issue.
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f752e79d38857011f1293fcb6c810409c3b669ee ]
__amdgpu_ras_restore_bad_pages is responsible for the maintenance of bad
page number, drop the unnecessary bad page number update in the error
handling path of add_bad_pages.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9877a865d62c9c3e0f4cc369dc9ca9f7f24f5ee9 ]
If SDMA block not enabled, buffer_funcs will not initialize,
fix the null pointer issue if buffer_funcs not initialized.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ba038065655c45728be346d0b174a6da08d8a5c5 ]
amdgpu_ib_schedule() returns early after calling amdgpu_ring_undo().
This skips the common free_fence cleanup path. Other error paths were
already changed to use goto free_fence, but this one was missed.
Change the early return to goto free_fence so all error paths clean up
the same way.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c:232 amdgpu_ib_schedule()
warn: missing unwind goto?
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
124 int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
125 struct amdgpu_ib *ibs, struct amdgpu_job *job,
126 struct dma_fence **f)
127 {
...
224
225 if (ring->funcs->insert_start)
226 ring->funcs->insert_start(ring);
227
228 if (job) {
229 r = amdgpu_vm_flush(ring, job, need_pipe_sync);
230 if (r) {
231 amdgpu_ring_undo(ring);
--> 232 return r;
The patch changed the other error paths to goto free_fence but
this one was accidentally skipped.
233 }
234 }
235
236 amdgpu_ring_ib_begin(ring);
...
338
339 free_fence:
340 if (!job)
341 kfree(af);
342 return r;
343 }
Fixes: f903b85ed0f1 ("drm/amdgpu: fix possible fence leaks from job structure")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f025a2b8d93358467b8e8f4b3a617e88c5f02fab ]
In low memory conditions, kmalloc can fail. In such conditions
unlock the mutex for a clean exit.
We do not need to amdgpu_bo_list_put as it's been handled in the
amdgpu_cs_parser_fini.
Fixes: 737da5363cc0 ("drm/amdgpu: update the functions to use amdgpu version of hmm")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602030017.7E0xShmH-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 56423871e9eef1dd069bddef895207fa5ce275fe ]
There is no firmware version dependency. This also
enables sdma queue resets on all SDMA 6.x based
chips.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 314d30ad50622fc0d70da71509f9dff21545be14 ]
There is no firmware version dependency. This also
enables sdma queue resets on all SDMA 5.2.x based
chips.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 46a2cb7d24f21132e970cab52359210c3f5ea3c6 ]
There is no firmware version dependency.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ee41e5b63c8210525c936ee637a2c8d185ce873c ]
When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function
returns directly without freeing the allocated con structure, leading
to a memory leak.
Fix this by jumping to the release_con label to properly clean up the
allocated memory before returning the error code.
Compile tested only. Issue found using a prototype static analysis tool
and code review.
Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init")
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0c44d61945c4a80775292d96460aa2f22e62f86c ]
amdgpu_discovery_get_nps_info() internally allocates memory for ranges
using kvcalloc(), which may use vmalloc() for large allocation. Using
kfree() to release vmalloc memory will lead to a memory corruption.
Use kvfree() to safely handle both kmalloc and vmalloc allocations.
Compile tested only. Issue found using a prototype static analysis tool
and code review.
Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table")
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c9be63d565789b56ca7b0197e2cb78a3671f95a8 ]
In amdgpu_acpi_enumerate_xcc(), if amdgpu_acpi_dev_init() returns -ENOMEM,
the function returns directly without releasing the allocated xcc_info,
resulting in a memory leak.
Fix this by ensuring that xcc_info is properly freed in the error paths.
Compile tested only. Issue found using a prototype static analysis tool
and code review.
Fixes: 4d5275ab0b18 ("drm/amdgpu: Add parsing of acpi xcc objects")
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 96e97a562d067a6d867862db79864cc66aae99c2 ]
"AMDGPU_GEM_DOMAIN_MMIO_REMAP" - Never activated as UAPI and it turned
out that this was to inflexible.
Allocate the MMIO_REMAP buffer object as a regular GEM BO and explicitly
move it into the fixed AMDGPU_PL_MMIO_REMAP placement at the TTM level.
This avoids relying on GEM domain bits for MMIO_REMAP, keeps the
placement purely internal, and makes the lifetime and pinning of the
global MMIO_REMAP BO explicit. The BO is pinned in TTM so it cannot be
migrated or evicted.
The corresponding free path relies on normal DRM teardown ordering,
where no further user ioctls can access the global BO once TTM teardown
begins.
v2 (Srini):
- Updated patch title.
- Drop use of AMDGPU_GEM_DOMAIN_MMIO_REMAP in amdgpu_ttm.c. The
MMIO_REMAP domain bit is removed from UAPI, so keep the MMIO_REMAP BO
allocation domain-less (bp.domain = 0) and rely on the TTM placement
(AMDGPU_PL_MMIO_REMAP) for backing/pinning.
- Keep fdinfo/mem-stats visibility for MMIO_REMAP by classifying BOs
based on bo->tbo.resource->mem_type == AMDGPU_PL_MMIO_REMAP, since the
domain bit is removed.
v3: Squash patches #1 & #3
Fixes: 056132483724 ("drm/amdgpu/uapi: Introduce AMDGPU_GEM_DOMAIN_MMIO_REMAP")
Fixes: 2a7a794eb82c ("drm/amdgpu/ttm: Allocate/Free 4K MMIO_REMAP Singleton")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Leo Liu <leo.liu@amd.com>
Cc: Ruijing Dong <ruijing.dong@amd.com>
Cc: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit de8955508b72f8a88afc3f2cbb62334d5f79ccc3 ]
MMIO_REMAP (HDP flush page) is a hardware I/O window exposed via a PCI
BAR. It must not migrate or be evicted.
Allocate a single 4 KB GEM BO in AMDGPU_GEM_DOMAIN_MMIO_REMAP during TTM
initialization when the hardware exposes a remap bus address and the
host page size is <= 4 KiB. Reserve the BO and pin it at the TTM level
so it remains fixed for its lifetime. No CPU mapping is established
here.
On teardown, reserve, unpin, and free the BO if present.
This prepares the object to be shared (e.g., via dma-buf) without
triggering placement changes or no CPU-access migration
v2: Added extra NULL checks
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Stable-dep-of: 96e97a562d06 ("drm/amdgpu: Drop MMIO_REMAP domain bit and keep it Internal")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit af26fa751c2eef66916acbf0d3c3e9159da56186 ]
vcn_v2_0_start_sriov() declares a local variable "i" initialized to zero
and uses it only as the instance index in SOC15_REG_OFFSET(UVD, i, ...).
The value is never changed and all other fields are taken from
adev->vcn.inst[0], so this path only ever programs VCN instance 0.
This triggered a Smatch:
warn: iterator 'i' not incremented
Replace the dummy iterator with an explicit instance index of 0 in
SOC15_REG_OFFSET() calls.
Fixes: dd26858a9cd8 ("drm/amdgpu: implement initialization part on VCN2.0 for SRIOV")
Reported by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: darlington Opara <darlington.opara@amd.com>
Cc: Jinage Zhao <jiange.zhao@amd.com>
Cc: Monk Liu <Monk.Liu@amd.com>
Cc: Emily Deng <Emily.Deng@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8644084a74a4573278d6f454c6638ccd5965f4e2 ]
The behavior for amdgpu was changed with commit e00e5c223878
("drm/amdgpu: adjust drm_firmware_drivers_only() handling") to
potentially allow loading even if nomodeset was set, so the
message is no longer accurate.
Just drop it to avoid confusion.
Fixes: e00e5c223878 ("drm/amdgpu: adjust drm_firmware_drivers_only() handling")
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is
removed from validate_list, the mem handle still lingers in the KFD idr.
This means when process is terminated,
kfd_process_free_outstanding_kfd_bos() will call
amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again resulting in double
deletion.
To avoid this -
(a) Check if list is empty before deleting it
(b) Rearragne amdgpu_amdkfd_gpuvm_free_memory_of_gpu() such that it can
be safely called again if it returns failure the first time.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6ba60345f45eaf7cb4f89105d26083a4b9fd1cba)
|
|
This reverts commit 7294863a6f01248d72b61d38478978d638641bee.
This commit was erroneously applied again after commit 0ab5d711ec74
("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
removed it, leading to very hard to debug crashes, when used with a system with two
AMD GPUs of which only one supports ASPM.
Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/
Link: https://github.com/acpica/acpica/issues/1060
Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device")
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 97a9689300eb2b393ba5efc17c8e5db835917080)
Cc: stable@vger.kernel.org
|
|
commit f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") caused
a dependency on new enough MES firmware to use amdgpu. This was fixed
on most gfx11 and gfx12 hardware with commit 0180e0a5dd5c
("drm/amdgpu/mes: add compatibility checks for set_hw_resource_1"), but
this left out that GC 11.0.4 had breakage at MES 0x51.
Bump the requirement to 0x52 instead.
Reported-by: danijel@nausys.com
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4576
Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c2d2ccc85faf8cc6934d50c18e43097eb453ade2)
Cc: stable@vger.kernel.org
|
|
Kernel gfx queues do not need to be reinitialized or
remapped after a reset. Align with gfx11.
v2: preserve init and remap for MMIO case.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a6d6ed694d72b66b0ed7a483d5effa01acd3951)
Cc: stable@vger.kernel.org
|
|
Kernel gfx queues do not need to be reinitialized or
remapped after a reset. This fixes queue reset failures
on APUs.
v2: preserve init and remap for MMIO case.
Fixes: b3e9bfd86658 ("drm/amdgpu/gfx11: add ring reset callbacks")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b340ff216fdabfe71ba0cdd47e9835a141d08e10)
Cc: stable@vger.kernel.org
|
|
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a2918f958d3f677ea93c0ac257cb6ba69b7abb7c)
Cc: stable@vger.kernel.org
|
|
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1f16866bdb1daed7a80ca79ae2837a9832a74fbc)
Cc: stable@vger.kernel.org
|
|
wptr is a 64 bit value and we need to update the
full value, not just 32 bits. Align with what we
already do for KCQs.
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e80b1d1aa1073230b6c25a1a72e88f37e425ccda)
Cc: stable@vger.kernel.org
|
|
The EXEC_COUNT field must be > 0. In the gfx shadow
handling we always emit a cond_exec packet after the gfx_shadow
packet, but the EXEC_COUNT never gets patched. This leads
to a hang when we try and reset queues on gfx11 APUs.
Fixes: c68cbbfd54c6 ("drm/amdgpu: cleanup conditional execution")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ba205ac3d6e83f56c4f824f23f1b4522cb844ff3)
Cc: stable@vger.kernel.org
|
|
The reference clock is supposed to be 100Mhz, but it
appears to actually be slightly lower (99.81Mhz).
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 637fee3954d4bd509ea9d95ad1780fc174489860)
Cc: stable@vger.kernel.org
|
|
On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
ih2 interrupt ring buffers are not initialized. This is by design, as
these secondary IH rings are only available on discrete GPUs. See
vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
AMD_IS_APU is set.
However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
get the timestamp of the last interrupt entry. When retry faults are
enabled on APUs (noretry=0), this function is called from the SVM page
fault recovery path, resulting in a NULL pointer dereference when
amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
The crash manifests as:
BUG: kernel NULL pointer dereference, address: 0000000000000004
RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
Call Trace:
amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
amdgpu_ih_process+0x84/0x100 [amdgpu]
This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
noretry=1 to noretry=0, enabling retry fault handling and thus
exercising the buggy code path.
Fix this by adding a check for ih1.ring_size before attempting to use
it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
Rework retry fault removal"). This is needed if the hardware doesn't
support secondary HW IH rings.
v2: additional updates (Alex)
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Jon Doron <jond@wiz.io>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6ce8d536c80aa1f059e82184f0d1994436b1d526)
Cc: stable@vger.kernel.org
|
|
Needs to be a u64.
Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 56fff1941abd3ca3b6f394979614ca7972552f7f)
|
|
When a function holds a lock and we return without unlocking it,
it deadlocks the kernel. We should always unlock before returning.
This commit fixes suspend/resume on SI.
Tested on two Tahiti GPUs: FirePro W9000 and R9 280X.
Fixes: f4db9913e4d3 ("drm/amdgpu: validate the flush_gpu_tlb_pasid()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601190121.z9C0uml5-lkp@intel.com/
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e3a6eff92bbd960b471966d9afccb4d584546d17)
|
|
If fence emit fails, free the fence if necessary.
Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5eb680a06007f2f6ea333d11a4e29039da90614b)
|
|
If drm_sched_job_init fails, hw_vm_fence is not freed currently,
then cause memory leak.
Fixes: db36632ea51e ("drm/amdgpu: clean up and unify hw fence handling")
Link: https://lore.kernel.org/amd-gfx/a5a828cb-0e4a-41f0-94c3-df31e5ddad52@amd.com/T/#t
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Reviewed-by: Amos Kong <kongjianjun@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5d42ee457ccd1fb5da4c7f817825b2806ec36956)
|