summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
AgeCommit message (Collapse)AuthorFilesLines
2025-10-02Merge tag 'drm-next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds1-13/+26
Pull drm updates from Dave Airlie: "cross-subsystem: - i2c-hid: Make elan touch controllers power on after panel is enabled - dt bindings for STM32MP25 SoC - pci vgaarb: use screen_info helpers - rust pin-init updates - add MEI driver for late binding firmware update/load uapi: - add ioctl for reassigning GEM handles - provide boot_display attribute on boot-up devices core: - document DRM_MODE_PAGE_FLIP_EVENT - add vendor specific recovery method to drm device wedged uevent gem: - Simplify gpuvm locking ttm: - add interface to populate buffers sched: - Fix race condition in trace code atomic: - Reallow no-op async page flips display: - dp: Fix command length video: - Improve pixel-format handling for struct screen_info rust: - drop Opaque<> from ioctl args - Alloc: - BorrowedPage type and AsPageIter traits - Implement Vmalloc::to_page() and VmallocPageIter - DMA/Scatterlist: - Add dma::DataDirection and type alias for dma_addr_t - Abstraction for struct scatterlist and sg_table - DRM: - simplify use of generics - add DriverFile type alias - drop Object::SIZE - Rust: - pin-init tree merge - Various methods for AsBytes and FromBytes traits gpuvm: - Support madvice in Xe driver gpusvm: - fix hmm_pfn_to_map_order usage in gpusvm bridge: - Improve and fix ref counting on bridge management - cdns-dsi: Various improvements to mode setting - Support Solomon SSD2825 plus DT bindings - Support Waveshare DSI2DPI plus DT bindings - Support Content Protection property - display-connector: Improve DP display detection - Add support for Radxa Ra620 plus DT bindings - adv7511: Provide SPD and HDMI infoframes - it6505: Replace crypto_shash with sha() - synopsys: Add support for DW DPTX Controller plus DT bindings - adv7511: Write full Audio infoframe - ite6263: Support vendor-specific infoframes - simple: Add support for Realtek RTD2171 DP-to-HDMI plus DT bindings panel: - panel-edp: Support mt8189 Chromebooks; Support BOE NV140WUM-N64; Support SHP LQ134Z1; Fixes - panel-simple: Support Olimex LCD-OLinuXino-5CTS plus DT bindings - Support Samsung AMS561RA01 - Support Hydis HV101HD1 plus DT bindings - ilitek-ili9881c: Refactor mode setting; Add support for Bestar BSD1218-A101KL68 LCD plus DT bindings - lvds: Add support for Ampire AMP19201200B5TZQW-T03 to DT bindings - edp: Add support for additonal mt8189 Chromebook panels - lvds: Add DT bindings for EDT ETML0700Z8DHA amdgpu: - add CRIU support for gem objects - RAS updates - VCN SRAM load fixes - EDID read fixes - eDP ALPM support - Documentation updates - Rework PTE flag generation - DCE6 fixes - VCN devcoredump cleanup - MMHUB client id fixes - VCN 5.0.1 RAS support - SMU 13.0.x updates - Expanded PCIe DPC support - Expanded VCN reset support - VPE per queue reset support - give kernel jobs unique id for tracing - pre-populate exported buffers - cyan skillfish updates - make vbios build number available in sysfs - userq updates - HDCP updates - support MMIO remap page as ttm pool - JPEG parser updates - DCE6 DC updates - use devm for i2c buses - GPUVM locking updates - Drop non-DC DCE11 code - improve fallback handling for pixel encoding amdkfd: - SVM/page migration fixes - debugfs fixes - add CRIO support for gem objects - SVM updates radeon: - use dev_warn_once in CS parsers xe: - add madvise interface - add DRM_IOCTL_XE_VM_QUERY_MEMORY_RANGE_ATTRS to query VMA count and memory attributes - drop L# bank mask reporting from media GT3 on Xe3+. - add SLPC power_profile sysfs interface - add configs attribs to add post/mid context-switch commands - handle firmware reported hardware errors notifying userspace with device wedged uevent - use same dir structure across sysfs/debugfs - cleanup and future proof vram region init - add G-states and PCI link states to debugfs - Add SRIOV support for CCS surfaces on Xe2+ - Enable SRIOV PF mode by default on supported platforms - move flush to common code - extended core workarounds for Xe2/3 - use DRM scheduler for delayed GT TLB invalidations - configs improvements and allow VF device enablement - prep work to expose mmio regions to userspace - VF migration support added - prepare GPU SVM for THP migration - start fixing XE_PAGE_SIZE vs PAGE_SIZE - add PSMI support for hw validation - resize VF bars to max possible size according to number of VFs - Ensure GT is in C0 during resume - pre-populate exported buffers - replace xe_hmm with gpusvm - add more SVM GT stats to debugfs - improve fake pci and WA kunnit handle for new platform testing - Test GuC to GuC comms to add debugging - use attribute groups to simplify sysfs registration - add Late Binding firmware code to interact with MEI i915: - apply multiple JSL/EHL/Gen7/Gen6 workarounds properly - protect against overflow in active_engine() - Use try_cmpxchg64() in __active_lookup() - include GuC registers in error state - get rid of dev->struct_mutex - iopoll: generalize read_poll_timout - lots more display refactoring - Reject HBR3 in any eDP Panel - Prune modes for YUV420 - Display Wa fix, additions, and updates - DP: Fix 2.7 Gbps link training on g4x - DP: Adjust the idle pattern handling - DP: Shuffle the link training code a bit - Don't set/read the DSI C clock divider on GLK - Enable_psr kernel parameter changes - Type-C enabled/disconnected dp-alt sink - Wildcat Lake enabling - DP HDR updates - DRAM detection - wait PSR idle on dsb commit - Remove FBC modulo 4 restriction for ADL-P+ - panic: refactor framebuffer allocation habanalabs: - debug/visibility improvements - vmalloc-backed coherent mmap support - HLDIO infrastructure nova-core: - various register!() macro improvements - minor vbios/firmware fixes/refactoring - advance firmware boot stages; process Booter and patch signatures - process GSP and GSP bootloader - Add r570.144 firmware bindings and update to it - Move GSP boot code to own module - Use new pin-init features to store driver's private data in a single allocation - Update ARef import from sync::aref nova-drm: - Update ARef import from sync::aref tyr: - initial driver skeleton for a rust driver for ARM Mali GPUs - capable of powering up, query metadata and provide it to userspace. msm: - GPU and Core: - in DT bindings describe clocks per GPU type - GMU bandwidth voting for x1-85 - a623/a663 speedbins - cleanup some remaining no-iommu leftovers after VM_BIND conversion - fix GEM obj 32b size truncation - add missing VM_BIND param validation - IFPC for x1-85 and a750 - register xml and gen_header.py sync from mesa - Display: - add missing bindings for display on SC8180X - added DisplayPort MST bindings - conversion from round_rate() to determine_rate() amdxdna: - add IOCTL_AMDXDNA_GET_ARRAY - support user space allocated buffers - streamline PM interfaces - Refactoring wrt. hardware contexts - improve error reporting nouveau: - use GSP firmware by default - improve error reporting - Pre-populate exported buffers ast: - Clean up detection of DRAM config exynos: - add DSIM bridge driver support for Exynos7870 - Document Exynos7870 DSIM compatible in dt-binding panthor: - Print task/pid on errors - Add support for Mali G710, G510, G310, Gx15, Gx20, Gx25 - Improve cache flushing - Fail VM bind if BO has offset renesas: - convert to RUNTIME_PM_OPS rcar-du: - Make number of lanes configurable - Use RUNTIME_PM_OPS - Add support for DSI commands rocket: - Add driver for Rockchip NPU plus DT bindings - Use kfree() and sizeof() correctly - Test DMA status rockchip: - dsi2: Add support for RK3576 plus DT bindings - Add support for RK3588 DPTX output tidss: - Use crtc_ fields for programming display mode - Remove other drivers from aperture pixpaper: - Add support for Mayqueen Pixpaper plus DT bindings v3d: - Support querying nubmer of GPU resets for KHR_robustness stm: - Clean up logging - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings sitronix: - st7571-i2c: Add support for inverted displays and 2-bit grayscale tidss: - Convert to kernel's FIELD_ macros vesadrm: - Support 8-bit palette mode imagination: - Improve power management - Add support for TH1520 GPU - Support Risc-V architectures v3d: - Improve job management and locking vkms: - Support variants of ARGB8888, ARGB16161616, RGB565, RGB888 and P01x - Spport YUV with 16-bit components" * tag 'drm-next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernel: (1455 commits) drm/amd: Add name to modes from amdgpu_connector_add_common_modes() drm/amd: Drop some common modes from amdgpu_connector_add_common_modes() drm/amdgpu: update MODULE_PARM_DESC for freesync_video drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes() drm/amd/display: Share dce100_validate_global with DCE6-8 drm/amd/display: Share dce100_validate_bandwidth with DCE6-8 drm/amdgpu: Fix fence signaling race condition in userqueue amd/amdkfd: enhance kfd process check in switch partition amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw drm/amd/display: Reject modes with too high pixel clock on DCE6-10 drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes() drm/amd/display: Only enable common modes for eDP and LVDS drm/amdgpu: remove the redeclaration of variable i drm/amdgpu/userq: assign an error code for invalid userq va drm/amdgpu: revert "rework reserved VMID handling" v2 drm/amdgpu: remove leftover from enforcing isolation by VMID drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails accel/habanalabs: add Infineon version check accel/habanalabs/gaudi2: read preboot status after recovering from dirty state accel/habanalabs: add HL_GET_P_STATE passthrough type ...
2025-09-25drm/amdgpu: update MODULE_PARM_DESC for freesync_videoAlex Deucher1-1/+1
To better describe what it does. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25drm/amd: Fix hybrid sleepMario Limonciello (AMD)1-1/+1
[Why] commit 530694f54dd5e ("drm/amdgpu: do not resume device in thaw for normal hibernation") optimized the flow for systems that are going into S4 where the power would be turned off. Basically the thaw() callback wouldn't resume the device if the hibernation image was successfully created since the system would be powered off. This however isn't the correct flow for a system entering into s0i3 after the hibernation image is created. Some of the amdgpu callbacks have different behavior depending upon the intended state of the suspend. [How] Use pm_hibernation_mode_is_suspend() as an input to decide whether to run resume during thaw() callback. Reported-by: Ionut Nechita <ionut_n2001@yahoo.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4573 Tested-by: Ionut Nechita <ionut_n2001@yahoo.com> Fixes: 530694f54dd5e ("drm/amdgpu: do not resume device in thaw for normal hibernation") Acked-by: Alex Deucher <alexander.deucher@amd.com> Tested-by: Kenneth Crudup <kenny@panix.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.17+ <stable@vger.kernel.org> # 6.17+: 495c8d35035e: PM: hibernate: Add pm_hibernation_mode_is_suspend() Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-05drm/amd: add more cyan skillfish PCI idsAlex Deucher1-0/+5
Add additional PCI IDs to the cyan skillfish family. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-02drm/amdgpu: Add ioctl to get all gem handles for a processDavid Francis1-0/+1
Add new ioctl DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES. This ioctl returns a list of bos with their handles, sizes, and flags and domains. This ioctl is meant to be used during CRIU checkpoint and provide information needed to reconstruct the bos in CRIU restore. Userspace for this and the next change can be found at https://github.com/checkpoint-restore/criu/pull/2613 Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exitVitaly Prosyak1-1/+4
User queues are disabled before GEM objects are released (protecting against user app crashes). No races with PCI hot-unplug (because drm_dev_enter prevents cleanup if iewdevice is being removed). Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04drm/amdgpu: fix link error for !PM_SLEEPArnd Bergmann1-10/+10
When power management is not enabled in the kernel build, the newly added hibernation changes cause a link failure: arm-linux-gnueabi-ld: drivers/gpu/drm/amd/amdgpu/amdgpu_drv.o: in function `amdgpu_pmops_thaw': amdgpu_drv.c:(.text+0x1514): undefined reference to `pm_hibernate_is_recovering' Make the power management code in this driver conditional on CONFIG_PM and CONFIG_PM_SLEEP Fixes: 530694f54dd5 ("drm/amdgpu: do not resume device in thaw for normal hibernation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20250714081635.4071570-1-arnd@kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04drm/amdgpu: Fix build error when CONFIG_SUSPEND is disabledPerry Yuan1-0/+4
The variable `pm_suspend_target_state` is conditionally defined only when `CONFIG_SUSPEND` is enabled (see `include/linux/suspend.h`). Directly referencing it without guarding by `#ifdef CONFIG_SUSPEND` causes build failures when suspend functionality is disabled (e.g., `CONFIG_SUSPEND=n`). Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Perry Yuan <perry.yuan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amdgpu: fix module parameter descriptionYann Dirson1-1/+1
Fix dcdebugmask description. Signed-off-by: Yann Dirson <ydirson@free.fr> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-21Merge tag 'amd-drm-next-6.17-2025-07-17' of ↵Dave Airlie1-5/+6
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-17: amdgpu: - Partition fixes - Reset fixes - RAS fixes - i2c fix - MPC updates - DSC cleanup - EDID fixes - Display idle D3 update - IPS updates - DMUB updates - Retimer fix - Replay fixes - Fix DC memory leak - Initial support for smartmux - DCN 4.0.1 degamma LUT fix - Per queue reset cleanups - Track ring state associated with a fence - SR-IOV fixes - SMU fixes - Per queue reset improvements for GC 9+ compute - Per queue reset improvements for GC 10+ gfx - Per queue reset improvements for SDMA 5+ - Per queue reset improvements for JPEG 2+ - Per queue reset improvements for VCN 2+ - GC 8 fix - ISP updates amdkfd: - Enable KFD on LoongArch radeon: - Drop console lock during suspend/resume UAPI: - Add userq slot info to INFO IOCTL Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html) From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250717213827.2061581-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-21Merge tag 'drm-misc-next-2025-07-17' of ↵Dave Airlie1-0/+17
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.17: UAPI Changes: Cross-subsystem Changes: Core Changes: - mode_config: Change fb_create prototype to pass the drm_format_info and avoid redundant lookups in drivers - sched: kunit improvements, memory leak fixes, reset handling improvements - tests: kunit EDID update Driver Changes: - amdgpu: Hibernation fixes, structure lifetime fixes - nouveau: sched improvements - sitronix: Add Sitronix ST7567 Support - bridge: - Make connector available to bridge detect hook - panel: - More refcounting changes - New panels: BOE NE14QDM Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://lore.kernel.org/r/20250717-efficient-kudu-of-fantasy-ff95e0@houat
2025-07-16drm/amdgpu: make compute timeouts consistentAlex Deucher1-5/+5
For kernel compute queues, align the timeout with other kernel queues (10 sec). This had previously been set higher for OpenCL when it used kernel queues, but now OpenCL uses KFD user queues which don't have a timeout limitation. This also aligns with SR-IOV which already used a shorter timeout. Additionally the longer timeout negatively impacts the user experience with kernel queues for interactive applications. Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15drm/amdgpu: refine eeprom data checkganglxie1-0/+1
add eeprom data checksum check before driver unload. reset eeprom and save correct data to eeprom when check failed Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-10drm/amdgpu: do not resume device in thaw for normal hibernationSamuel Zhang1-0/+17
For normal hibernation, GPU do not need to be resumed in thaw since it is not involved in writing the hibernation image. Skip resume in this case can reduce the hibernation time. On VM with 8 * 192GB VRAM dGPUs, 98% VRAM usage and 1.7TB system memory, this can save 50 minutes. Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com> Tested-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Link: https://lore.kernel.org/r/20250710062313.3226149-6-guoqing.zhang@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-07Revert "drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini"Vitaly Prosyak1-1/+15
This reverts commit 5fb90421fa0fbe0a968274912101fe917bf1c47b. The original patch moved `amdgpu_userq_mgr_fini()` to the driver's `postclose` callback, which is called after `drm_gem_release()` in the DRM file cleanup sequence.If a user application crashes or aborts without cleaning up its user queues, 'drm_gem_release()` may free GEM objects that are still referenced by active user queues, leading to use-after-free. By reverting, we ensure that user queues are disabled and cleaned up before any GEM objects are released, preventing this class of bug. However, this reintroduces a race during PCI hot-unplug, where device removal can race with per-file cleanup, leading to use-after-free in suspend/unplug paths. This will be fixed in the next patch. Fixes: 5fb90421fa0f ("drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c") Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07drm/amdgpu: Pass adev pointer to functionsLijo Lazar1-8/+7
Pass amdgpu device context instead of drm device context to some amdgpu_device_* functions. DRM device context is not required in those functions. No functional change. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: Fix error with dev_info_once usageLijo Lazar1-1/+1
Fixes error with dev_info_once usage in amdgpu_device_asic_has_dc_support. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506281140.mXfWT3EN-lkp@intel.com/ Fixes: a3e510fd69c3 ("drm/amdgpu: Convert from DRM_* to dev_*") Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24drm/amd: Fix spelling mistake "correctalbe" -> "correctable"Colin Ian King1-1/+1
There is a spelling mistake in a pr_info message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70cVitaly Prosyak1-15/+1
The issue was reproduced on NV10 using IGT pci_unplug test. It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`. However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`. As a result, KASAN detected a use-after-free condition, as shown in the log below. The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and `amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before `amdgpu_fpriv` is freed. This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`, where the following components are initialized: - `amdgpu_userq_mgr_init()` - `amdgpu_eviction_fence_init()` - `amdgpu_ctx_mgr_init()` Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using: - `amdgpu_userq_mgr_fini()` - `amdgpu_eviction_fence_destroy()` - `amdgpu_ctx_mgr_fini()` This change eliminates the use-after-free and improves consistency in resource management between open and close paths. [ +0.094367] ================================================================== [ +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737 [ +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ #2 [ +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000003] dump_stack_lvl+0x76/0xa0 [ +0.000010] print_report+0xce/0x600 [ +0.000009] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000790] ? srso_return_thunk+0x5/0x5f [ +0.000007] ? kasan_complete_mode_report_info+0x76/0x200 [ +0.000008] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000684] kasan_report+0xbe/0x110 [ +0.000007] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000601] __asan_report_store8_noabort+0x17/0x30 [ +0.000007] amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000801] ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu] [ +0.000819] ? srso_return_thunk+0x5/0x5f [ +0.000008] amdgpu_drm_release+0xa3/0xe0 [amdgpu] [ +0.000604] __fput+0x354/0xa90 [ +0.000010] __fput_sync+0x59/0x80 [ +0.000005] __x64_sys_close+0x7d/0xe0 [ +0.000006] x64_sys_call+0x2505/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000004] ? kasan_record_aux_stack+0xae/0xd0 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? kmem_cache_free+0x398/0x580 [ +0.000006] ? __fput+0x543/0xa90 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __fput+0x543/0xa90 [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000007] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? fpregs_assert_state_consistent+0x21/0xb0 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? syscall_exit_to_user_mode+0x4e/0x240 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? irqentry_exit+0x43/0x50 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? exc_page_fault+0x7c/0x110 [ +0.000006] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000005] RIP: 0033:0x7ffff7b14f67 [ +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff [ +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67 [ +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003 [ +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000 [ +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8 [ +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040 [ +0.000007] </TASK> [ +0.000286] Allocated by task 425 on cpu 11 at 29.751192s: [ +0.000013] kasan_save_stack+0x28/0x60 [ +0.000008] kasan_save_track+0x18/0x70 [ +0.000006] kasan_save_alloc_info+0x38/0x60 [ +0.000006] __kasan_kmalloc+0xc1/0xd0 [ +0.000005] __kmalloc_cache_noprof+0x1bd/0x430 [ +0.000006] amdgpu_driver_open_kms+0x172/0x760 [amdgpu] [ +0.000521] drm_file_alloc+0x569/0x9a0 [ +0.000008] drm_client_init+0x1b7/0x410 [ +0.000007] drm_fbdev_client_setup+0x174/0x470 [ +0.000007] drm_client_setup+0x8a/0xf0 [ +0.000006] amdgpu_pci_probe+0x50b/0x10d0 [amdgpu] [ +0.000482] local_pci_probe+0xe7/0x1b0 [ +0.000008] pci_device_probe+0x5bf/0x890 [ +0.000005] really_probe+0x1fd/0x950 [ +0.000007] __driver_probe_device+0x307/0x410 [ +0.000005] driver_probe_device+0x4e/0x150 [ +0.000006] __driver_attach+0x223/0x510 [ +0.000005] bus_for_each_dev+0x102/0x1a0 [ +0.000006] driver_attach+0x3d/0x60 [ +0.000005] bus_add_driver+0x309/0x650 [ +0.000005] driver_register+0x13d/0x490 [ +0.000006] __pci_register_driver+0x1ee/0x2b0 [ +0.000006] xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo] [ +0.000008] do_one_initcall+0x9c/0x3e0 [ +0.000007] do_init_module+0x29e/0x7f0 [ +0.000006] load_module+0x5c75/0x7c80 [ +0.000006] init_module_from_file+0x106/0x180 [ +0.000007] idempotent_init_module+0x377/0x740 [ +0.000006] __x64_sys_finit_module+0xd7/0x180 [ +0.000006] x64_sys_call+0x1f0b/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] Freed by task 1737 on cpu 9 at 76.455063s: [ +0.000010] kasan_save_stack+0x28/0x60 [ +0.000006] kasan_save_track+0x18/0x70 [ +0.000005] kasan_save_free_info+0x3b/0x60 [ +0.000006] __kasan_slab_free+0x54/0x80 [ +0.000005] kfree+0x127/0x470 [ +0.000006] amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu] [ +0.000485] drm_file_free.part.0+0x5b1/0xba0 [ +0.000007] drm_file_free+0x13/0x30 [ +0.000006] drm_client_release+0x1c4/0x2b0 [ +0.000006] drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper] [ +0.000007] put_fb_info+0x97/0xe0 [ +0.000006] unregister_framebuffer+0x197/0x380 [ +0.000005] drm_fb_helper_unregister_info+0x94/0x100 [ +0.000005] drm_fbdev_client_unregister+0x3c/0x80 [ +0.000007] drm_client_dev_unregister+0x144/0x330 [ +0.000006] drm_dev_unregister+0x49/0x1b0 [ +0.000006] drm_dev_unplug+0x4c/0xd0 [ +0.000006] amdgpu_pci_remove+0x58/0x130 [amdgpu] [ +0.000482] pci_device_remove+0xae/0x1e0 [ +0.000006] device_remove+0xc7/0x180 [ +0.000006] device_release_driver_internal+0x3d4/0x5a0 [ +0.000007] device_release_driver+0x12/0x20 [ +0.000006] pci_stop_bus_device+0x104/0x150 [ +0.000006] pci_stop_and_remove_bus_device_locked+0x1b/0x40 [ +0.000005] remove_store+0xd7/0xf0 [ +0.000007] dev_attr_store+0x3f/0x80 [ +0.000006] sysfs_kf_write+0x125/0x1d0 [ +0.000005] kernfs_fop_write_iter+0x2ea/0x490 [ +0.000007] vfs_write+0x90d/0xe70 [ +0.000006] ksys_write+0x119/0x220 [ +0.000006] __x64_sys_write+0x72/0xc0 [ +0.000006] x64_sys_call+0x18ab/0x26f0 [ +0.000005] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] The buggy address belongs to the object at ffff88811c068000 which belongs to the cache kmalloc-rnd-01-4k of size 4096 [ +0.000016] The buggy address is located 3168 bytes inside of freed 4096-byte region [ffff88811c068000, ffff88811c069000) [ +0.000022] The buggy address belongs to the physical page: [ +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068 [ +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) [ +0.000007] page_type: f5(slab) [ +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000 [ +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ +0.000004] page dumped because: kasan: bad access detected [ +0.000011] Memory state around the buggy address: [ +0.000009] ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000012] ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ^ [ +0.000010] ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ================================================================== Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Lijo Lazar <lijo.lazar@amd.com> Cc: Jesse Zhang <Jesse.Zhang@amd.com> Cc: Arvind Yadav <arvind.yadav@amd.com> v2: drop amdgpu_drm_release() and assign drm_release() as the callback directly.(Alex) Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24drm/amdgpu: remove fence slabAlex Deucher1-5/+0
Just use kmalloc for the fences in the rare case we need an independent fence. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18drm/amd: Add support for a complete pmops actionMario Limonciello1-1/+1
complete() callbacks are supposed to handle reversing anything that occurred during prepare() callbacks. They'll be called on every power state transition, and will also be called if the sequence is failed (such as an aborted suspend). Add support for IP blocks to support this action. Reviewed-by: Alex Hung <alex.hung@amd.com> Link: https://lore.kernel.org/r/20250602014432.3538345-2-superm1@kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18drm/amdgpu: Add debug mask to disable CE logsXiang Liu1-0/+6
Add debug mask to disable kernel logs of RAS correctable errors, including both ACA and CE error counter kernel messages. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22drm/amdgpu: Fix eviction fence worker race during fd closeJesse.Zhang1-1/+1
The current cleanup order during file descriptor close can lead to a race condition where the eviction fence worker attempts to access a destroyed mutex from the user queue manager: [ 517.294055] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [ 517.294060] WARNING: CPU: 8 PID: 2030 at kernel/locking/mutex.c:564 [ 517.294094] Workqueue: events amdgpu_eviction_fence_suspend_worker [amdgpu] The issue occurs because: 1. We destroy the user queue manager (including its mutex) first 2. Then try to destroy eviction fences which may have pending work 3. The eviction fence worker may try to access the already-destroyed mutex Fix this by reordering the cleanup to: 1. First mark the fd as closing and destroy eviction fences, which flushes any pending work 2. Then safely destroy the user queue manager after we're certain no more fence work will be executed The copy in amdgpu_driver_postclose_kms() needs to be removed (Christian) Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08drm/amdgpu: Add debug bit for userptr usageShane Xiao1-0/+5
In VM debug mode, it is desirable to notify the application to correct the freeing sequence by unmapping the memory before destroying the userptr in the old userptr path. Add a bitmask to decide whether to send gpu vm fault to the applition. Signed-off-by: Shane Xiao <shane.xiao@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08drm/amdgpu: fix pm notifier handlingAlex Deucher1-9/+1
Set the s3/s0ix and s4 flags in the pm notifier so that we can skip the resource evictions properly in pm prepare based on whether we are suspending or hibernating. Drop the eviction as processes are not frozen at this time, we we can end up getting stuck trying to evict VRAM while applications continue to submit work which causes the buffers to get pulled back into VRAM. v2: Move suspend flags out of pm notifier (Mario) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4178 Fixes: 2965e6355dcd ("drm/amd: Add Suspend/Hibernate notification callback support") Cc: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-05drm/amdgpu: Add Support for enforcing isolation without Cleaner ShaderSrinivasan Shanmugam1-2/+2
Adjusted the enforce isolation setting handling to include the ability to disable the cleaner shader without affecting isolation between tasks. v2: Updated enforce isolation documentation and parameters. (Alex) Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-01drm/amdgpu/userq: fix user_queue parameters listBagas Sanjaya1-5/+6
Sphinx reports htmldocs warning: Documentation/gpu/amdgpu/module-parameters:7: drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1119: ERROR: Unexpected indentation. [docutils] Fix the warning by using reST bullet list syntax for user_queue parameter options, separated from preceding paragraph by a blank line. Fixes: fb20954c9717 ("drm/amdgpu/userq: rework driver parameter") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-next/20250422202956.176fb590@canb.auug.org.au/ Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-22drm/amdgpu/userq: use consistent function namingAlex Deucher1-1/+1
s/userqueue/userq/ 1. remove the mix of amdgpu_userqueue and amdgpu_userq 2. to be consistent with other amdgpu_userq_fence.c 3. it's shorter Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-21drm/amdgpu/userq: rework driver parameterAlex Deucher1-6/+9
Replace disable_kq parameter with user_queue parameter. The parameter has the following logic: -1 = auto (ASIC specific default) 0 = user queues disabled 1 = user queues enabled and kernel queues enabled (if supported) 2 = user queues enabled and kernel queues disabled The default behavior (-1) is currently the same as 0 for current ASICs. To enable user queues (in addition to kernel queues) set user_queue=1. To enable user queues and disable kernel queues (to make all resources available to user queues), set user_queue=2. Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11drm/amdgpu: adjust enforce_isolation handlingAlex Deucher1-5/+7
Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation: - Disabled (0) - Enabled (serialization and cleaner shader) (1) - Enabled in legacy mode (no serialization or cleaner shader) (2) This provides better flexibility for more use cases. Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11drm/amd: Forbid suspending into non-default suspend statesMario Limonciello1-1/+13
On systems that default to 'deep' some userspace software likes to try to suspend in 'deep' first. If there is a failure for any reason (such as -ENOMEM) the failure is ignored and then it will try to use 's2idle' as a fallback. This fails, but more importantly it leads to graphical problems. Forbid this behavior and only allow suspending in the last state supported by the system. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4093 Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250408180957.4027643-1-superm1@kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: add parameter to disable kernel queuesAlex Deucher1-0/+9
On chips that support user queues, setting this option will disable kernel queues to be used to validate user queues without kernel queues. Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu/userq: prevent runtime pm when userqs are activeAlex Deucher1-0/+30
Similar to KFD, prevent runtime pm while user queues are active. Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: bump version for user queue IP support queryAlex Deucher1-1/+2
Add the user queue IP support query to the drm_amdgpu_info_device query. Cc: marek.olsak@amd.com Cc: prike.liang@amd.com Cc: sunil.khatri@amd.com Cc: yogesh.mohanmarimuthu@amd.com Reviewed-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: Fix Illegal opcode in command stream ErrorArvind Yadav1-1/+15
When applications closes, it triggers the drm_file_free function which subsequently releases all allocated buffer objects. Concurrently, the resume_worker thread will attempt to map the usermode queue. However, since the wptr buffer object has already been deallocated, this will result in an Illegal opcode error being raised in the command stream. Now replacing drm_release() with a new function amdgpu_drm_release(). This function will set the flag to prevent the scheduling of any new queue resume/map, stop all queues and then call drm_release(). V2: - Replace drm_release with amdgpu_drm_release(Christian). Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Arvind Yadav <arvind.yadav@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: Implement userqueue signal/wait IOCTLArunpravin Paneer Selvam1-0/+2
This patch introduces new IOCTL for userqueue secure semaphore. The signal IOCTL called from userspace application creates a drm syncobj and array of bo GEM handles and passed in as parameter to the driver to install the fence into it. The wait IOCTL gets an array of drm syncobjs, finds the fences attached to the drm syncobjs and obtain the array of memory_address/fence_value combintion which are returned to userspace. v2: (Christian) - Install fence into GEM BO object. - Lock all BO's using the dma resv subsystem - Reorder the sequence in signal IOCTL function. - Get write pointer from the shadow wptr - use userq_fence to fetch the va/value in wait IOCTL. v3: (Christian) - Use drm_exec helper for the proper BO drm reserve and avoid BO lock/unlock issues. - fence/fence driver reference count logic for signal/wait IOCTLs. v4: (Christian) - Fixed the drm_exec calling sequence - use dma_resv_for_each_fence_unlock if BO's are not locked - Modified the fence_info array storing logic. v5: (Christian) - Keep fence_drv until wait queue execution. - Add dma_fence_wait for other fences. - Lock BO's using drm_exec as the number of fences in them could change. - Install signaled fences as well into BO/Syncobj. - Move Syncobj fence installation code after the drm_exec_prepare_array. - Directly add dma_resv_usage_rw(args->bo_flags.... - remove unnecessary dma_fence_put. v6: (Christian) - Add xarray stuff to store the fence_drv - Implement a function to iterate over the xarray and drop the fence_drv references. - Add drm_exec_until_all_locked() wrapper - Add a check that if we haven't exceeded the user allocated num_fences before adding dma_fence to the fences array. v7: (Christian) - Use memdup_user() for kmalloc_array + copy_from_user - Move the fence_drv references from the xarray into the newly created fence and drop the fence_drv references when we signal this fence. - Move this locking of BOs before the "if (!wait_info->num_fences)", this way you need this code block only once. - Merge the error handling code and the cleanup + return 0 code. - Initializing the xa should probably be done in the userq code. - Remove the userq back pointer stored in fence_drv. - Pass xarray as parameter in amdgpu_userq_walk_and_drop_fence_drv() v8: (Christian) - Move fence_drv references must come before adding the fence to the list. - Use xa_lock_irqsave_nested for nested spinlock operations. - userq_mgr should be per fpriv and not one per device. - Restructure the interrupt process code for the early exit of the loop. - The reference acquired in the syncobj fence replace code needs to be kept around. - Modify the dma_fence acquire placement in wait IOCTL. - Move USERQ_BO_WRITE flag to UAPI header file. - drop the fence drv reference after telling the hw to stop accessing it. - Add multi sync object support to userq signal IOCTL. V9: (Christian) - Store all the fence_drv ref to other drivers and not ourself. - Remove the userq fence xa implementation and replace with kvmalloc_array. v10: (Christian) - Add a comment for the userq_xa xarray - drop the if check of userq_fence->fence_drv_array - use the i variable to initialize userq_fence->fence_drv_array_count - drop the fence reference before you free the array in the error handling, otherwise it could be that some references leaked. Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: Implement a new userqueue fence driverArunpravin Paneer Selvam1-0/+6
Developed a userqueue fence driver for the userqueue process shared BO synchronization. Create a dma fence having write pointer as the seqno and allocate a seq64 memory for each user queue process and feed this memory address into the firmware/hardware, thus the firmware writes the read pointer into the given address when the process completes it execution. Compare wptr and rptr, if rptr >= wptr, signal the fences for the waiting process to consume the buffers. v2: Worked on review comments from Christian for the following modifications - Add wptr as sequence number into the fence - Add a reference count for the fence driver - Add dma_fence_put below the list_del as it might frees the userq fence. - Trim unnecessary code in interrupt handler. - Check dma fence signaled state in dma fence creation function for a potential problem of hardware completing the job processing beforehand. - Add necessary locks. - Create a list and process all the unsignaled fences. - clean up fences in destroy function. - implement .signaled callback function v3: Worked on review comments from Christian - Modify naming convention for reference counted objects - Fix fence driver reference drop issue - Drop amdgpu_userq_fence_driver_process() function return value v4: Worked on review comments from Christian - Moved fence driver allocation into amdgpu_userq_fence_driver_alloc() - Added detail doc mentioning the differences b/w two spinlocks declared. v5: Worked on review comments from Christian - Check before upcast and remove local variable - Add error handling in fence_drv alloc function. - Move rptr read fn outside of the loop and remove WARN_ON in destroy function. v6: - clear the seq64 memory in user fence driver(Christian) - fix for the wptr va bo mapping(Christian) - move the fence_drv xa entry erase code from the interrupt handler into user fence destroy function Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: add new IOCTL for usermode queueShashank Sharma1-0/+1
This patch adds: - A new IOCTL function to create and destroy - A new structure to keep all the user queue data in one place. - A function to generate unique index for the queue. V1: Worked on review comments from RFC patch series: - Alex: Keep a list of queues, instead of single queue per process. - Christian: Use the queue manager instead of global ptrs, Don't keep the queue structure in amdgpu_ctx V2: Worked on review comments: - Christian: - Formatting of text - There is no need for queuing of userqueues, with idr in place - Alex: - Remove use_doorbell, its unnecessary - Reuse amdgpu_mqd_props for saving mqd fields - Code formatting and re-arrangement V3: - Integration with doorbell manager V4: - Accommodate MQD union related changes in UAPI (Alex) - Do not set the queue size twice (Bas) V5: - Remove wrapper functions for queue indexing (Christian) - Do not save the queue id/idr in queue itself (Christian) - Move the idr allocation in the IP independent generic space (Christian) V6: - Check the validity of input IP type (Christian) V7: - Move uq_func from uq_mgr to adev (Alex) - Add missing free(queue) for error cases (Yifan) V9: - Rebase V10: Addressed review comments from Christian, and added R-B: - Do not initialize the local variable - Convert DRM_ERROR to DEBUG. V11: - check the input flags to be zero (Alex) Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08drm/amdgpu: add usermode queue base codeShashank Sharma1-0/+1
This patch adds IP independent skeleton code for amdgpu usermode queue. It contains: - A new files with init functions of usermode queues. - A queue context manager in driver private data. V1: Worked on design review comments from RFC patch series: (https://patchwork.freedesktop.org/series/112214/) - Alex: Keep a list of queues, instead of single queue per process. - Christian: Use the queue manager instead of global ptrs, Don't keep the queue structure in amdgpu_ctx V2: - Reformatted code, split the big patch into two V3: - Integration with doorbell manager V4: - Align the structure member names to the largest member's column (Luben) - Added SPDX license (Luben) V5: - Do not add amdgpu.h in amdgpu_userqueue.h (Christian). - Move struct amdgpu_userq_mgr into amdgpu_userqueue.h (Christian). V6: Rebase V9: Rebase V10: Rebase + Alex's R-B Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Koenig <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-07drm/amdgpu: add rebar parameterAlex Deucher1-0/+11
Add a new parameter to disable BAR resizing. Note that this only disables the driver from attempting to resize the BAR, The BIOS may have resized the BAR at boot. Some teams have found this useful in debugging P2P DMA issues on systems where the available MMIO space did not allow for all of the GPUs present to resize their BARs. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-27drm/amd: Handle being compiled without SI or CIK support betterMario Limonciello1-20/+24
If compiled without SI or CIK support but amdgpu tries to load it will run into failures with uninitialized callbacks. Show a nicer message in this case and fail probe instead. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4050 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-03-19Documentation/amdgpu: Add debug_mask documentationLijo Lazar1-0/+5
Add description for debug_mask bit options. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-19drm/amd/pm: Add debug bit for smu pool allocationLijo Lazar1-0/+5
In certain cases, it's desirable to avoid PMFW log transactions to system memory. Add a mask bit to decide whether to allocate smu pool in device memory or system memory. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-18drm/amdgpu: adjust drm_firmware_drivers_only() handlingAlex Deucher1-0/+14
Move to probe so we can check the PCI device type and only apply the drm_firmware_drivers_only() check for PCI DISPLAY classes. Also add a module parameter to override the nomodeset kernel parameter as a workaround for platforms that have this hardcoded on their kernel command lines. Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-18drm/amdgpu: drop drm_firmware_drivers_only()Alex Deucher1-3/+0
There are a number of systems and cloud providers out there that have nomodeset hardcoded in their kernel parameters to block nouveau for the nvidia driver. This prevents the amdgpu driver from loading. Unfortunately the end user cannot easily change this. The preferred way to block modules from loading is to use modprobe.blacklist=<driver>. That is what providers should be using to block specific drivers. Drop the check to allow the driver to load even when nomodeset is specified on the kernel command line. Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-10drm/amd/display: allow 256B DCC max compressed block sizes on gfx12Marek Olšák1-1/+2
The hw supports it. Signed-off-by: Marek Olšák <marek.olsak@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-07drm/amd: Keep display off while going into S4Mario Limonciello1-2/+9
When userspace invokes S4 the flow is: 1) amdgpu_pmops_prepare() 2) amdgpu_pmops_freeze() 3) Create hibernation image 4) amdgpu_pmops_thaw() 5) Write out image to disk 6) Turn off system Then on resume amdgpu_pmops_restore() is called. This flow has a problem that because amdgpu_pmops_thaw() is called it will call amdgpu_device_resume() which will resume all of the GPU. This includes turning the display hardware back on and discovering connectors again. This is an unexpected experience for the display to turn back on. Adjust the flow so that during the S4 sequence display hardware is not turned back on. Reported-by: Xaver Hugl <xaver.hugl@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2038 Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Harry Wentland <harry.wentland@amd.com> Link: https://lore.kernel.org/r/20250306185124.44780-1-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-28drm/amdgpu: Create a debug option to disable ring resetAndré Almeida1-0/+6
Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we have ring reset, create a debug option to disable them in amdgpu, forcing the driver to go with the full device reset path again when both options are combined. This option is useful for testing and debugging purposes when one wants to test the full reset from userspace. Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdgpu: Add flags to distinguish vf/pf/pt modeAsad Kamal1-1/+2
Add extra flag definition for ids_flag field to distinguish between vf/pf/pt modes v2: Updated kms driver minor version & removed pf check as default is 0 v3: Fix up version (Alex) v4: rebase (Alex) Proposed userspace: https://github.com/ROCm/amdsmi/commit/e663bed7d6b3df79f5959e73981749b1f22ec698 Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-13drm/amdgpu: Update usage for bad page thresholdHawking Zhang1-1/+1
The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>