summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm
AgeCommit message (Collapse)AuthorFilesLines
7 daysMerge tag 'drm-misc-fixes-2025-07-10' of ↵Simona Vetter11-51/+100
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes drm-misc-fixes for v6.16-rc6 or final: - Fix nouveau fail on debugfs errors. - Magic 50 ms to fix nouveau suspend. - Call rust destructor on drm device release. - Fix DMA api error handling in tegra/nvdec. - Fix PVR device reset. - Habanalabs maintainer update. - Small memory leak fix when nouveau acpi init fails. - Do not attempt to bind to any PCI device with AGP capability. - Make FB's acquire handles on backing object, same as i915/xe already does. - Fix race in drm_gem_handle_create_tail. Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/e522cdc7-1787-48f2-97e5-0f94783970ab@linux.intel.com
8 daysMerge tag 'drm-xe-fixes-2025-07-11' of ↵Simona Vetter9-23/+53
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Clear LMTT page to avoid leaking data from one VF to another - Align PF queue size to power of 2 - Disable Indirect Ring State to avoid intermittent issues on context switch: feature is not currently needed, so can be disabled for now. - Fix compression handling when the BO pages are very fragmented - Restore display pm on error path - Fix runtime pm handling in xe devcoredump - Fix xe_pm_set_vram_threshold() doc - Recommend new minor versions of GuC firmware - Drop some workarounds on VF - Do not use verbose GuC logging by default: it should be only for debugging Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/s6jyd24mimbzb4vxtgc5vupvbyqplfep2c6eupue7znnlbhuxy@lmvzexfzhrnn
8 daysdrm/xe/guc: Default log level to non-verboseLucas De Marchi1-1/+1
Currently xe sets the guc log level to a verbose level since it's useful to debug hangs and general development. However the verbose level may already be too much and affect performance. Michal Mrozek did some tests with the L0 compute stack for submission latency with ULLS disabled. Below are the normalized numbers with log level 3 (the current default) as baseline for each test: Test \ Log Level 3 0 1 2 ----------------------------------------------------------- ------ ------ ------ ------ BestWalkerNthCommandListSubmission(CmdListCount=2) 1.00 0.63 0.63 0.96 BestWalkerNthSubmission(KernelCount=2) 1.00 0.62 0.63 0.96 BestWalkerNthSubmissionImmediate(KernelCount=2) 1.00 0.58 0.58 0.85 BestWalkerSubmission 1.00 0.62 0.62 0.96 BestWalkerSubmissionImmediate 1.00 0.63 0.62 0.96 BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=2) 1.00 0.58 0.58 0.86 BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=4) 1.00 0.70 0.70 0.83 BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=8) 1.00 0.53 0.52 0.78 Log level 2 is the first "verbose level" for GuC, where the biggest difference happens. Keep log level 3 for CONFIG_DRM_XE_DEBUG, but switch to 1, i.e. GUC_LOG_LEVEL_NON_VERBOSE, for "normal" builds. Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250613-guc-log-level-v2-1-cb84a63e49fe@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit a37128ba613ad6a5f81f382fa3cfe5c4a6527310) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VFMichal Wajdeczko1-2/+2
These workarounds are not applicable for use by the VFs. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Tested-by: Jakub Kolakowski <jakub1.kolakowski@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Jakub Kolakowski <jakub1.kolakowski@intel.com> Link: https://lore.kernel.org/r/20250710103040.375610-2-jakub1.kolakowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 1d2e2503e506ddc499cbb7afdc8b70bcf6fe241f) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2Julia Filipchuk1-3/+3
UAPI compatibility version 1.22.2 Resolves various bugs. Recommend newer version. Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250626182805.1701096-13-daniele.ceraolospurio@intel.com (cherry picked from commit 0b64addcae7f04745bc5f62d41e27268052f812e) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/xe/pm: Correct comment of xe_pm_set_vram_threshold()Shuicheng Lin1-3/+5
The parameter threshold is with size in MiB, not in bits. Correct it to avoid any confusion. v2: s/mb/MiB, s/vram/VRAM, fix return section. (Michal) Fixes: 30c399529f4c ("drm/xe: Document Xe PM component") Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250708021450.3602087-2-shuicheng.lin@intel.com Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 0efec0500117947f924e5ac83be40f96378af85a) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/xe: Release runtime pm for error path of xe_devcoredump_read()Shuicheng Lin1-10/+28
xe_pm_runtime_put() is missed to be called for the error path in xe_devcoredump_read(). Add function description comments for xe_devcoredump_read() to help understand it. v2: more detail function comments and refine goto logic (Matt) Fixes: c4a2e5f865b7 ("drm/xe: Add devcoredump chunking") Cc: stable@vger.kernel.org Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250707004911.3502904-6-shuicheng.lin@intel.com (cherry picked from commit 017ef1228d735965419ff118fe1b89089e772c42) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/xe/pm: Restore display pm if there is error after display suspendShuicheng Lin1-2/+1
xe_bo_evict_all() is called after xe_display_pm_suspend(). So if there is error with xe_bo_evict_all(), display pm should be restored. Fixes: 51462211f4a9 ("drm/xe/pxp: add PXP PM support") Fixes: cb8f81c17531 ("drm/xe/display: Make display suspend/resume work on discrete") Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://lore.kernel.org/r/20250708035424.3608190-2-shuicheng.lin@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 83dcee17855c4e5af037ae3262809036de127903) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
8 daysdrm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences tooHans de Goede1-4/+4
It turns out that the fixup from vlv_fixup_mipi_sequences() is necessary for some DSI panel's with version 2 mipi-sequences too. Specifically the Acer Iconia One 8 A1-840 (not to be confused with the A1-840FHD which is different) has the following sequences: BDB block 53 (1284 bytes) - MIPI sequence block: Sequence block version v2 Panel 0 * Sequence 2 - MIPI_SEQ_INIT_OTP GPIO index 9, source 0, set 0 (0x00) Delay: 50000 us GPIO index 9, source 0, set 1 (0x01) Delay: 6000 us GPIO index 9, source 0, set 0 (0x00) Delay: 6000 us GPIO index 9, source 0, set 1 (0x01) Delay: 25000 us Send DCS: Port A, VC 0, LP, Type 39, Length 5, Data ff aa 55 a5 80 Send DCS: Port A, VC 0, LP, Type 39, Length 3, Data 6f 11 00 ... Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 29 Delay: 120000 us Sequence 4 - MIPI_SEQ_DISPLAY_OFF Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 28 Delay: 105000 us Send DCS: Port A, VC 0, LP, Type 05, Length 2, Data 10 00 Delay: 10000 us Sequence 5 - MIPI_SEQ_ASSERT_RESET Delay: 10000 us GPIO index 9, source 0, set 0 (0x00) Notice how there is no MIPI_SEQ_DEASSERT_RESET, instead the deassert is done at the beginning of MIPI_SEQ_INIT_OTP, which is exactly what the fixup from vlv_fixup_mipi_sequences() fixes up. Extend it to also apply to v2 sequences, this fixes the panel not working on the Acer Iconia One 8 A1-840. Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14605 Signed-off-by: Hans de Goede <hansg@kernel.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Link: https://lore.kernel.org/r/20250703143824.7121-1-hansg@kernel.org Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 11895f375939d60efe7ed5dddc1cffe2e79f976c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
9 daysdrm/gem: Fix race in drm_gem_handle_create_tail()Simona Vetter1-1/+9
Object creation is a careful dance where we must guarantee that the object is fully constructed before it is visible to other threads, and GEM buffer objects are no difference. Final publishing happens by calling drm_gem_handle_create(). After that the only allowed thing to do is call drm_gem_object_put() because a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id (which is trivial since we have a linear allocator) can already tear down the object again. Luckily most drivers get this right, the very few exceptions I've pinged the relevant maintainers for. Unfortunately we also need drm_gem_handle_create() when creating additional handles for an already existing object (e.g. GETFB ioctl or the various bo import ioctl), and hence we cannot have a drm_gem_handle_create_and_put() as the only exported function to stop these issues from happening. Now unfortunately the implementation of drm_gem_handle_create() isn't living up to standards: It does correctly finishe object initialization at the global level, and hence is safe against a concurrent tear down. But it also sets up the file-private aspects of the handle, and that part goes wrong: We fully register the object in the drm_file.object_idr before calling drm_vma_node_allow() or obj->funcs->open, which opens up races against concurrent removal of that handle in drm_gem_handle_delete(). Fix this with the usual two-stage approach of first reserving the handle id, and then only registering the object after we've completed the file-private setup. Jacek reported this with a testcase of concurrently calling GEM_CLOSE on a freshly-created object (which also destroys the object), but it should be possible to hit this with just additional handles created through import or GETFB without completed destroying the underlying object with the concurrent GEM_CLOSE ioctl calls. Note that the close-side of this race was fixed in f6cd7daecff5 ("drm: Release driver references to handle before making it available again"), which means a cool 9 years have passed until someone noticed that we need to make this symmetry or there's still gaps left :-/ Without the 2-stage close approach we'd still have a race, therefore that's an integral part of this bugfix. More importantly, this means we can have NULL pointers behind allocated id in our drm_file.object_idr. We need to check for that now: - drm_gem_handle_delete() checks for ERR_OR_NULL already - drm_gem.c:object_lookup() also chekcs for NULL - drm_gem_release() should never be called if there's another thread still existing that could call into an IOCTL that creates a new handle, so cannot race. For paranoia I added a NULL check to drm_gem_object_release_handle() though. - most drivers (etnaviv, i915, msm) are find because they use idr_find(), which maps both ENOENT and NULL to NULL. - drivers using idr_for_each_entry() should also be fine, because idr_get_next does filter out NULL entries and continues the iteration. - The same holds for drm_show_memory_stats(). v2: Use drm_WARN_ON (Thomas) Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Tested-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Cc: stable@vger.kernel.org Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Signed-off-by: Simona Vetter <simona.vetter@intel.com> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20250707151814.603897-1-simona.vetter@ffwll.ch
9 daysdrm/framebuffer: Acquire internal references on GEM handlesThomas Zimmermann4-26/+61
Acquire GEM handles in drm_framebuffer_init() and release them in the corresponding drm_framebuffer_cleanup(). Ties the handle's lifetime to the framebuffer. Not all GEM buffer objects have GEM handles. If not set, no refcounting takes place. This is the case for some fbdev emulation. This is not a problem as these GEM objects do not use dma-bufs and drivers will not release them while fbdev emulation is running. Framebuffer flags keep a bit per color plane of which the framebuffer holds a GEM handle reference. As all drivers use drm_framebuffer_init(), they will now all hold dma-buf references as fixed in commit 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers"). In the GEM framebuffer helpers, restore the original ref counting on buffer objects. As the helpers for handle refcounting are now no longer called from outside the DRM core, unexport the symbols. v3: - don't mix internal flags with mode flags (Christian) v2: - track framebuffer handle refs by flag - drop gma500 cleanup (Christian) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers") Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/dri-devel/20250703115915.3096-1-spasswolf@web.de/ Tested-by: Bert Karwatzki <spasswolf@web.de> Tested-by: Mario Limonciello <superm1@kernel.org> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Anusha Srivatsa <asrivats@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: <stable@vger.kernel.org> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250707131224.249496-1-tzimmermann@suse.de
11 daysdrm/xe/bmg: fix compressed VRAM handlingMatthew Auld1-1/+1
There looks to be an issue in our compression handling when the BO pages are very fragmented, where we choose to skip the identity map and instead fall back to emitting the PTEs by hand when migrating memory, such that we can hopefully do more work per blit operation. However in such a case we need to ensure the src PTEs are correctly tagged with a compression enabled PAT index on dgpu xe2+, otherwise the copy will simply treat the src memory as uncompressed, leading to corruption if the memory was compressed by the user. To fix this pass along use_comp_pat into emit_pte() on the src side, to indicate that compression should be considered. v2 (Jonathan): tweak the commit message Fixes: 523f191cc0c7 ("drm/xe/xe_migrate: Handle migration logic for xe2+ dgfx") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Akshata Jahagirdar <akshata.jahagirdar@intel.com> Cc: <stable@vger.kernel.org> # v6.12+ Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250701103949.83116-2-matthew.auld@intel.com (cherry picked from commit f7a2fd776e57bd6468644bdecd91ab3aba57ba58) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
11 daysRevert "drm/xe/xe2: Enable Indirect Ring State support for Xe2"Matthew Brost1-1/+0
This reverts commit fe0154cf8222d9e38c60ccc124adb2f9b5272371. Seeing some unexplained random failures during LRC context switches with indirect ring state enabled. The failures were always there, but the repro rate increased with the addition of WA BB as a separate BO. Commit 3a1edef8f4b5 ("drm/xe: Make WA BB part of LRC BO") helped to reduce the issues in the context switches, but didn't eliminate them completely. Indirect ring state is not required for any current features, so disable for now until failures can be root caused. Cc: stable@vger.kernel.org Fixes: fe0154cf8222 ("drm/xe/xe2: Enable Indirect Ring State support for Xe2") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250702035846.3178344-1-matthew.brost@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 03d85ab36bcbcbe9dc962fccd3f8e54d7bb93b35) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
11 daysdrm/xe: Allocate PF queue size on pow2 boundaryMatthew Brost1-0/+1
CIRC_SPACE does not work unless the size argument is a power of 2, allocate PF queue size on power of 2 boundary. Cc: stable@vger.kernel.org Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size") Fixes: 29582e0ea75c ("drm/xe: Add page queue multiplier") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250702213511.3226167-1-matthew.brost@intel.com (cherry picked from commit 491b9783126303755717c0cbde0b08ee59b6abab) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
11 daysdrm/xe/pf: Clear all LMTT pages on allocMichal Wajdeczko1-0/+11
Our LMEM buffer objects are not cleared by default on alloc and during VF provisioning we only setup LMTT PTEs for the actually provisioned LMEM range. But beyond that valid range we might leave some stale data that could either point to some other VFs allocations or even to the PF pages. Explicitly clear all new LMTT page to avoid the risk that a malicious VF would try to exploit that gap. While around add asserts to catch any undesired PTE overwrites and low-level debug traces to track LMTT PT life-cycle. Fixes: b1d204058218 ("drm/xe/pf: Introduce Local Memory Translation Table") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20250701220052.1612-1-michal.wajdeczko@intel.com (cherry picked from commit 3fae6918a3e27cce20ded2551f863fb05d4bef8d) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
11 daysdrm/nouveau/gsp: fix potential leak of memory used during acpi initBen Skeggs1-6/+14
If any of the ACPI calls fail, memory allocated for the input buffer would be leaked. Fix failure paths to free allocated memory. Also add checks to ensure the allocations succeeded in the first place. Reported-by: Danilo Krummrich <dakr@kernel.org> Fixes: 176fdcbddfd2 ("drm/nouveau/gsp/r535: add support for booting GSP-RM") Signed-off-by: Ben Skeggs <bskeggs@nvidia.com> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20250617040036.2932-1-bskeggs@nvidia.com
14 daysrust: drm: remove unnecessary importsTamir Duberstein1-1/+1
`kernel::str::CStr` is included in the prelude. Signed-off-by: Tamir Duberstein <tamird@gmail.com> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20250704-cstr-include-drm-v1-1-a279dfc4d753@gmail.com
2025-07-04drm/imagination: Fix kernel crash when hard resetting the GPUAlessio Belle1-2/+2
The GPU hard reset sequence calls pm_runtime_force_suspend() and pm_runtime_force_resume(), which according to their documentation should only be used during system-wide PM transitions to sleep states. The main issue though is that depending on some internal runtime PM state as seen by pm_runtime_force_suspend() (whether the usage count is <= 1), pm_runtime_force_resume() might not resume the device unless needed. If that happens, the runtime PM resume callback pvr_power_device_resume() is not called, the GPU clocks are not re-enabled, and the kernel crashes on the next attempt to access GPU registers as part of the power-on sequence. Replace calls to pm_runtime_force_suspend() and pm_runtime_force_resume() with direct calls to the driver's runtime PM callbacks, pvr_power_device_suspend() and pvr_power_device_resume(), to ensure clocks are re-enabled and avoid the kernel crash. Fixes: cc1aeedb98ad ("drm/imagination: Implement firmware infrastructure and META FW support") Signed-off-by: Alessio Belle <alessio.belle@imgtec.com> Reviewed-by: Matt Coster <matt.coster@imgtec.com> Link: https://lore.kernel.org/r/20250624-fix-kernel-crash-gpu-hard-reset-v1-1-6d24810d72a6@imgtec.com Cc: stable@vger.kernel.org Signed-off-by: Matt Coster <matt.coster@imgtec.com>
2025-07-04drm/tegra: nvdec: Fix dma_alloc_coherent error checkMikko Perttunen1-4/+2
Check for NULL return value with dma_alloc_coherent, in line with Robin's fix for vic.c in 'drm/tegra: vic: Fix DMA API misuse'. Fixes: 46f226c93d35 ("drm/tegra: Add NVDEC driver") Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Link: https://lore.kernel.org/r/20250702-nvdec-dma-error-check-v1-1-c388b402c53a@nvidia.com
2025-07-04Merge tag 'drm-xe-fixes-2025-07-03' of ↵Dave Airlie11-135/+310
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix chunking the PTE updates and overflowing the maximum number of dwords with with MI_STORE_DATA_IMM (Jia Yao) - Move WA BB to the LRC BO to mitigate hangs on context switch (Matthew Brost) - Fix frequency/flush WAs for BMG (Vinay / Lucas) - Fix kconfig prompt title and description (Lucas) - Do not require kunit (Harry Austen / Lucas) - Extend 14018094691 WA to BMG (Daniele) - Fix wedging the device on signal (Matthew Brost) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/o5662wz6nrlf6xt5sjgxq5oe6qoujefzywuwblm3m626hreifv@foqayqydd6ig
2025-07-04Merge tag 'samsung-dsim-fixes-for-v6.16-rc4' of ↵Dave Airlie1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes - Fixed raw pointer leakage and unsafe behavior in printk() . Switch from %pK to %p for pointer formatting, as %p is now safer and prevents issues like raw pointer leakage and acquiring sleeping locks in atomic contexts. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Inki Dae <inki.dae@samsung.com> Link: https://lore.kernel.org/r/20250629091742.29956-1-inki.dae@samsung.com
2025-07-04Merge tag 'exynos-drm-fixes-for-v6.16-rc4' of ↵Dave Airlie4-17/+33
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes Fixups - Fixed raw pointer leakage and unsafe behavior in printk() . Switch from %pK to %p for pointer formatting, as %p is now safer and prevents issues like raw pointer leakage and acquiring sleeping locks in atomic contexts. - Fixed kernel panic during boot . A NULL pointer dereference issue occasionally occurred when the vblank interrupt handler was called before the DRM driver was fully initialized during boot. So this patch fixes the issue by adding a check in the interrupt handler to ensure the DRM driver is properly initialized. - Fixed a lockup issue on Samsung Peach-Pit/Pi Chromebooks . The issue occurred after commit c9b1150a68d9 changed the call order of CRTC enable/disable and bridge pre_enable/post_disable methods, causing fimd_dp_clock_enable() to be called before the FIMD device was activated. To fix this, runtime PM guards were added to fimd_dp_clock_enable() to ensure proper operation even when CRTC is not enabled. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Inki Dae <inki.dae@samsung.com> Link: https://lore.kernel.org/r/20250629083554.28628-1-inki.dae@samsung.com
2025-07-04Merge tag 'drm-intel-fixes-2025-07-03' of ↵Dave Airlie5-14/+15
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes - Make mei interrupt top half irq disabled to fix RT builds - Fix timeline left held on VMA alloc error - Fix NULL pointer deref in vlv_dphy_param_init() - Fix selftest mock_request() to avoid NULL deref Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://lore.kernel.org/r/aGYVPAA4KvsZqDFx@jlahtine-mobl
2025-07-04Merge tag 'drm-misc-fixes-2025-07-03' of ↵Dave Airlie13-88/+192
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes drm-misc-fixes for v6.16-rc5: - Replace simple panel lookup hack with proper fix. - nullpointer deref in vesadrm fix. - fix dma_resv_wait_timeout. - fix error handling in ttm_buffer_object_transfer. - bridge fixes. - Fix vmwgfx accidentally allocating encrypted memory. - Fix race in spsc_queue_push() - Add refcount on backing GEM objects during fb creation. - Fix v3d irq's being enabled during gpu reset. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://lore.kernel.org/r/a7461418-08dc-4b7c-b2fa-264155f66d5e@linux.intel.com
2025-07-04nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcsDave Airlie1-0/+7
This fixes a bunch of command hangs after runtime suspend/resume. This fixes a regression caused by code movement in the commit below, the commit seems to just change timings enough to cause this to happen now, and adding the sleep seems to avoid it. I've spent some time trying to root cause it to no great avail, it seems like a bug on the firmware side, but it could be a bug in our rpc handling that I can't find. Either way, we should land the workaround to fix the problem, while we continue to work out the root cause. Signed-off-by: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@nvidia.com> Cc: Danilo Krummrich <dakr@kernel.org> Fixes: c21b039715ce ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()") Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20250702232707.175679-1-airlied@gmail.com
2025-07-04drm/nouveau: Do not fail module init on debugfs errorsAaron Thompson3-11/+4
If CONFIG_DEBUG_FS is enabled, nouveau_drm_init() returns an error if it fails to create the "nouveau" directory in debugfs. One case where that will happen is when debugfs access is restricted by CONFIG_DEBUG_FS_ALLOW_NONE or by the boot parameter debugfs=off, which cause the debugfs APIs to return -EPERM. So just ignore errors from debugfs. Note that nouveau_debugfs_root may be an error now, but that is a standard pattern for debugfs. From include/linux/debugfs.h: "NOTE: it's expected that most callers should _ignore_ the errors returned by this function. Other debugfs functions handle the fact that the "dentry" passed to them could be an error and they don't crash in that case. Drivers should generally work fine even if debugfs fails to init anyway." Fixes: 97118a1816d2 ("drm/nouveau: create module debugfs root") Cc: stable@vger.kernel.org Signed-off-by: Aaron Thompson <dev@aaront.org> Acked-by: Timur Tabi <ttabi@nvidia.com> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20250703211949.9916-1-dev@aaront.org
2025-07-03drm/xe: Do not wedge device on killed exec queuesMatthew Brost1-4/+6
When a user closes an exec queue or interrupts an app with Ctrl-C, this does not warrant wedging the device in mode 2. Avoid this by skipping the wedge check for killed exec queues in the TDR and LR exec queue cleanup worker. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250624174103.2707941-1-matthew.brost@intel.com (cherry picked from commit 5a2f117a80c207372513ca8964eeb178874f4990) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-03drm/xe: Extend WA 14018094691 to BMGDaniele Ceraolo Spurio1-1/+2
This WA is applicable to BMG as well. Note that this is a GSC WA and we don't load the GSC on BMG, so extending the WA to BMG won't do anything right now. However, it helps future-proof the driver so that if we ever turn the GSC on we won't have to remember to extend this WA. v2: don't use VERSION_RANGE from 2001 to 2004 (Matt) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250613231128.1261815-2-daniele.ceraolospurio@intel.com (cherry picked from commit 1a5ce0c5b95b0624ebd44f574b98003a466973be) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-03drm/v3d: Disable interrupts before resetting the GPUMaíra Canal3-10/+37
Currently, an interrupt can be triggered during a GPU reset, which can lead to GPU hangs and NULL pointer dereference in an interrupt context as shown in the following trace: [ 314.035040] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0 [ 314.043822] Mem abort info: [ 314.046606] ESR = 0x0000000096000005 [ 314.050347] EC = 0x25: DABT (current EL), IL = 32 bits [ 314.055651] SET = 0, FnV = 0 [ 314.058695] EA = 0, S1PTW = 0 [ 314.061826] FSC = 0x05: level 1 translation fault [ 314.066694] Data abort info: [ 314.069564] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 [ 314.075039] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 314.080080] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 314.085382] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000102728000 [ 314.091814] [00000000000000c0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 [ 314.100511] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP [ 314.106770] Modules linked in: v3d i2c_brcmstb vc4 snd_soc_hdmi_codec gpu_sched drm_shmem_helper drm_display_helper cec drm_dma_helper drm_kms_helper drm drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight [ 314.129654] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.25+rpt-rpi-v8 #1 Debian 1:6.12.25-1+rpt1 [ 314.139388] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT) [ 314.145211] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 314.152165] pc : v3d_irq+0xec/0x2e0 [v3d] [ 314.156187] lr : v3d_irq+0xe0/0x2e0 [v3d] [ 314.160198] sp : ffffffc080003ea0 [ 314.163502] x29: ffffffc080003ea0 x28: ffffffec1f184980 x27: 021202b000000000 [ 314.170633] x26: ffffffec1f17f630 x25: ffffff8101372000 x24: ffffffec1f17d9f0 [ 314.177764] x23: 000000000000002a x22: 000000000000002a x21: ffffff8103252000 [ 314.184895] x20: 0000000000000001 x19: 00000000deadbeef x18: 0000000000000000 [ 314.192026] x17: ffffff94e51d2000 x16: ffffffec1dac3cb0 x15: c306000000000000 [ 314.199156] x14: 0000000000000000 x13: b2fc982e03cc5168 x12: 0000000000000001 [ 314.206286] x11: ffffff8103f8bcc0 x10: ffffffec1f196868 x9 : ffffffec1dac3874 [ 314.213416] x8 : 0000000000000000 x7 : 0000000000042a3a x6 : ffffff810017a180 [ 314.220547] x5 : ffffffec1ebad400 x4 : ffffffec1ebad320 x3 : 00000000000bebeb [ 314.227677] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 314.234807] Call trace: [ 314.237243] v3d_irq+0xec/0x2e0 [v3d] [ 314.240906] __handle_irq_event_percpu+0x58/0x218 [ 314.245609] handle_irq_event+0x54/0xb8 [ 314.249439] handle_fasteoi_irq+0xac/0x240 [ 314.253527] handle_irq_desc+0x48/0x68 [ 314.257269] generic_handle_domain_irq+0x24/0x38 [ 314.261879] gic_handle_irq+0x48/0xd8 [ 314.265533] call_on_irq_stack+0x24/0x58 [ 314.269448] do_interrupt_handler+0x88/0x98 [ 314.273624] el1_interrupt+0x34/0x68 [ 314.277193] el1h_64_irq_handler+0x18/0x28 [ 314.281281] el1h_64_irq+0x64/0x68 [ 314.284673] default_idle_call+0x3c/0x168 [ 314.288675] do_idle+0x1fc/0x230 [ 314.291895] cpu_startup_entry+0x3c/0x50 [ 314.295810] rest_init+0xe4/0xf0 [ 314.299030] start_kernel+0x5e8/0x790 [ 314.302684] __primary_switched+0x80/0x90 [ 314.306691] Code: 940029eb 360ffc13 f9442ea0 52800001 (f9406017) [ 314.312775] ---[ end trace 0000000000000000 ]--- [ 314.317384] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 314.324249] SMP: stopping secondary CPUs [ 314.328167] Kernel Offset: 0x2b9da00000 from 0xffffffc080000000 [ 314.334076] PHYS_OFFSET: 0x0 [ 314.336946] CPU features: 0x08,00002013,c0200000,0200421b [ 314.342337] Memory Limit: none [ 314.345382] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]--- Before resetting the GPU, it's necessary to disable all interrupts and deal with any interrupt handler still in-flight. Otherwise, the GPU might reset with jobs still running, or yet, an interrupt could be handled during the reset. Cc: stable@vger.kernel.org Fixes: 57692c94dcbe ("drm/v3d: Introduce a new DRM driver for Broadcom V3D V3.x+") Reviewed-by: Juan A. Suarez <jasuarez@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://lore.kernel.org/r/20250628224243.47599-1-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-02drm/gem: Acquire references on GEM handles for framebuffersThomas Zimmermann3-11/+51
A GEM handle can be released while the GEM buffer object is attached to a DRM framebuffer. This leads to the release of the dma-buf backing the buffer object, if any. [1] Trying to use the framebuffer in further mode-setting operations leads to a segmentation fault. Most easily happens with driver that use shadow planes for vmap-ing the dma-buf during a page flip. An example is shown below. [ 156.791968] ------------[ cut here ]------------ [ 156.796830] WARNING: CPU: 2 PID: 2255 at drivers/dma-buf/dma-buf.c:1527 dma_buf_vmap+0x224/0x430 [...] [ 156.942028] RIP: 0010:dma_buf_vmap+0x224/0x430 [ 157.043420] Call Trace: [ 157.045898] <TASK> [ 157.048030] ? show_trace_log_lvl+0x1af/0x2c0 [ 157.052436] ? show_trace_log_lvl+0x1af/0x2c0 [ 157.056836] ? show_trace_log_lvl+0x1af/0x2c0 [ 157.061253] ? drm_gem_shmem_vmap+0x74/0x710 [ 157.065567] ? dma_buf_vmap+0x224/0x430 [ 157.069446] ? __warn.cold+0x58/0xe4 [ 157.073061] ? dma_buf_vmap+0x224/0x430 [ 157.077111] ? report_bug+0x1dd/0x390 [ 157.080842] ? handle_bug+0x5e/0xa0 [ 157.084389] ? exc_invalid_op+0x14/0x50 [ 157.088291] ? asm_exc_invalid_op+0x16/0x20 [ 157.092548] ? dma_buf_vmap+0x224/0x430 [ 157.096663] ? dma_resv_get_singleton+0x6d/0x230 [ 157.101341] ? __pfx_dma_buf_vmap+0x10/0x10 [ 157.105588] ? __pfx_dma_resv_get_singleton+0x10/0x10 [ 157.110697] drm_gem_shmem_vmap+0x74/0x710 [ 157.114866] drm_gem_vmap+0xa9/0x1b0 [ 157.118763] drm_gem_vmap_unlocked+0x46/0xa0 [ 157.123086] drm_gem_fb_vmap+0xab/0x300 [ 157.126979] drm_atomic_helper_prepare_planes.part.0+0x487/0xb10 [ 157.133032] ? lockdep_init_map_type+0x19d/0x880 [ 157.137701] drm_atomic_helper_commit+0x13d/0x2e0 [ 157.142671] ? drm_atomic_nonblocking_commit+0xa0/0x180 [ 157.147988] drm_mode_atomic_ioctl+0x766/0xe40 [...] [ 157.346424] ---[ end trace 0000000000000000 ]--- Acquiring GEM handles for the framebuffer's GEM buffer objects prevents this from happening. The framebuffer's cleanup later puts the handle references. Commit 1a148af06000 ("drm/gem-shmem: Use dma_buf from GEM object instance") triggers the segmentation fault easily by using the dma-buf field more widely. The underlying issue with reference counting has been present before. v2: - acquire the handle instead of the BO (Christian) - fix comment style (Christian) - drop the Fixes tag (Christian) - rename err_ gotos - add missing Link tag Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://elixir.bootlin.com/linux/v6.15/source/drivers/gpu/drm/drm_gem.c#L241 # [1] Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Anusha Srivatsa <asrivats@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Cc: <stable@vger.kernel.org> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250630084001.293053-1-tzimmermann@suse.de
2025-07-01drm/xe: Allow dropping kunit dependency as built-inHarry Austen1-1/+2
Fix Kconfig symbol dependency on KUNIT, which isn't actually required for XE to be built-in. However, if KUNIT is enabled, it must be built-in too. Fixes: 08987a8b6820 ("drm/xe: Fix build with KUNIT=m") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Harry Austen <hpausten@protonmail.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20250627-xe-kunit-v2-2-756fe5cd56cf@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit a559434880b320b83733d739733250815aecf1b0) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe: Fix kconfig promptLucas De Marchi2-3/+4
The xe driver is the official driver for Intel Xe2 and later, while maintaining experimental support for earlier GPUs. Reword the help message accordingly. Reviewed-by: Maarten Lankhorst <dev@lankhorst.se> Link: https://lore.kernel.org/r/20250611-xe-kconfig-help-v1-1-8bcc6b47d11a@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 1488a3089de3d0bcdc9532da7ce04cf0af9d7dd0) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe/bmg: Update Wa_22019338487Vinay Belgaumkar4-2/+135
Limit GT max frequency to 2600MHz and wait for frequency to reduce before proceeding with a transient flush. This is really only needed for the transient flush: if L2 flush is needed due to 16023588340 then there's no need to do this additional wait since we are already using the bigger hammer. v2: Use generic names, ensure user set max frequency requests wait for flush to complete (Rodrigo) v3: - User requests wait via wait_var_event_timeout (Lucas) - Close races on flush + user requests (Lucas) - Fix xe_guc_pc_remove_flush_freq_limit() being called on last gt rather than root gt (Lucas) v4: - Only apply the freq reducing part if a TDF is needed: L2 flush trumps the need for waiting a lower frequency Fixes: aaa08078e725 ("drm/xe/bmg: Apply Wa_22019338487") Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-4-b888388477f2@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit deea6a7d6d803d6bb874a3e6f1b312e560e6c6df) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe/bmg: Update Wa_14022085890Vinay Belgaumkar2-0/+9
Set GT min frequency to 1200Mhz once driver load is complete. v2: Review comments (Rodrigo) v3: Apply Wa earlier so user_req_min is not clobbered. v4: Apply to all GTs (Lucas) Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250612-wa-14022085890-v4-3-94ba5dcc1e30@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit bdde16c9ac5cb56ad2ee19792222fa1853577af7) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe: Split xe_device_td_flush()Lucas De Marchi1-28/+40
xe_device_td_flush() has 2 possible implementations: an entire L2 flush or a transient flush, depending on WA 16023588340. Make this clear by splitting the function so it calls each of them. Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-3-b888388477f2@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 5e300ed8a545bdffc26b579c526b5fef7b2d5365) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe/xe_guc_pc: Lock once to update stashed frequenciesLucas De Marchi1-19/+20
pc_set_mert_freq_cap() currently lock()/unlock() the mutex multiple times to stash the current frequencies. It's not a problem since xe_guc_pc_restore_stashed_freq() is guaranteed to be called only later in the init sequence. However, now that we have _locked() variants for this functions, use them and avoid potential issues when called from other places or using the same pattern. While at it, prefer and early return for the WA check to reduce indentation. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-2-b888388477f2@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit d878c97daa603573e5af01fd8beec2fffdb42ad1) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe/guc_pc: Add _locked variant for min/max freqLucas De Marchi1-54/+69
There are places in which the getters/setters are called one after the other causing a multiple lock()/unlock(). These are not currently a problem since they are all happening from the same thread, but there's a race possibility as calls are added outside of the early init when the max/min and stashed values need to be correlated. Add the _locked() variants to prepare for that. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-1-b888388477f2@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 1beae9aa2b88d3a02eb666e7b777eb2d7bc645f4) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe: Make WA BB part of LRC BOMatthew Brost2-21/+19
No idea why, but without this GuC context switches randomly fail when running IGTs in a loop. Need to follow up why this fixes the aforementioned issue but can live with a stable driver for now. Fixes: 617d824c5323 ("drm/xe: Add WA BB to capture active context utilization") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Tested-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250612031925.4009701-1-matthew.brost@intel.com (cherry picked from commit 3a1edef8f4b58b0ba826bc68bf4bce4bdf59ecf3) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/xe: Fix out-of-bounds field write in MI_STORE_DATA_IMMJia Yao1-8/+10
According to Bspec, bits 0~9 of MI_STORE_DATA_IMM must not exceed 0x3FE. The macro MI_SDI_NUM_QW(x) evaluates to 2 * x + 1, which means the condition 2 * x + 1 <= 0x3FE must be satisfied. Therefore, the maximum valid value for x is 0x1FE, not 0x1FF. v2 - Replace 0x1fe with macro MAX_PTE_PER_SDI (Auld, Matthew & Patelczyk, Maciej) v3 - Change macro MAX_PTE_PER_SDI from 0x1fe to 0x1feU (De Marchi, Lucas) Bspec: 60246 Fixes: 9c44fd5f6e8a ("drm/xe: Add migrate layer functions for SVM support") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Brian3 Nguyen <brian3.nguyen@intel.com> Cc: Alex Zuo <alex.zuo@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Suggested-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Jia Yao <jia.yao@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Maciej Patelczyk <maciej.patelczyk@intel.com> Link: https://lore.kernel.org/r/20250612224620.161105-1-jia.yao@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit c038bdba98c9f6a36378044a9d4385531a194d3e) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-01drm/i915/gsc: mei interrupt top half should be in irq disabled contextJunxiao Chang1-1/+1
MEI GSC interrupt comes from i915. It has top half and bottom half. Top half is called from i915 interrupt handler. It should be in irq disabled context. With RT kernel, by default i915 IRQ handler is in threaded IRQ. MEI GSC top half might be in threaded IRQ context. generic_handle_irq_safe API could be called from either IRQ or process context, it disables local IRQ then calls MEI GSC interrupt top half. This change fixes A380/A770 GPU boot hang issue with RT kernel. Fixes: 1e3dc1d8622b ("drm/i915/gsc: add gsc as a mei auxiliary device") Tested-by: Furong Zhou <furong.zhou@intel.com> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Junxiao Chang <junxiao.chang@intel.com> Link: https://lore.kernel.org/r/20250425151108.643649-1-junxiao.chang@intel.com Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit dccf655f69002d496a527ba441b4f008aa5bebbf) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-07-01drm/i915/gt: Fix timeline left held on VMA alloc errorJanusz Krzysztofik1-1/+2
The following error has been reported sporadically by CI when a test unbinds the i915 driver on a ring submission platform: <4> [239.330153] ------------[ cut here ]------------ <4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count) <4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915] ... <4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915] ... <4> [239.330942] Call Trace: <4> [239.330944] <TASK> <4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915] <4> [239.331202] i915_driver_release+0x86/0xa0 [i915] <4> [239.331482] devm_drm_dev_init_release+0x61/0x90 <4> [239.331494] devm_action_release+0x15/0x30 <4> [239.331504] release_nodes+0x3d/0x120 <4> [239.331517] devres_release_all+0x96/0xd0 <4> [239.331533] device_unbind_cleanup+0x12/0x80 <4> [239.331543] device_release_driver_internal+0x23a/0x280 <4> [239.331550] ? bus_find_device+0xa5/0xe0 <4> [239.331563] device_driver_detach+0x14/0x20 ... <4> [357.719679] ---[ end trace 0000000000000000 ]--- If the test also unloads the i915 module then that's followed with: <3> [357.787478] ============================================================================= <3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown() <3> [357.788031] ----------------------------------------------------------------------------- <3> [357.788204] Object 0xffff888109e7f480 @offset=29824 <3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244 <4> [357.788994] i915_vma_instance+0xee/0xc10 [i915] <4> [357.789290] init_status_page+0x7b/0x420 [i915] <4> [357.789532] intel_engines_init+0x1d8/0x980 [i915] <4> [357.789772] intel_gt_init+0x175/0x450 [i915] <4> [357.790014] i915_gem_init+0x113/0x340 [i915] <4> [357.790281] i915_driver_probe+0x847/0xed0 [i915] <4> [357.790504] i915_pci_probe+0xe6/0x220 [i915] ... Closer analysis of CI results history has revealed a dependency of the error on a few IGT tests, namely: - igt@api_intel_allocator@fork-simple-stress-signal, - igt@api_intel_allocator@two-level-inception-interruptible, - igt@gem_linear_blits@interruptible, - igt@prime_mmap_coherency@ioctl-errors, which invisibly trigger the issue, then exhibited with first driver unbind attempt. All of the above tests perform actions which are actively interrupted with signals. Further debugging has allowed to narrow that scope down to DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring submission, in particular. If successful then that function, or its execlists or GuC submission equivalent, is supposed to be called only once per GEM context engine, followed by raise of a flag that prevents the function from being called again. The function is expected to unwind its internal errors itself, so it may be safely called once more after it returns an error. In case of ring submission, the function first gets a reference to the engine's legacy timeline and then allocates a VMA. If the VMA allocation fails, e.g. when i915_vma_instance() called from inside is interrupted with a signal, then ring_context_alloc() fails, leaving the timeline held referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the timeline is got, and only that last one is put on successful completion. As a consequence, the legacy timeline, with its underlying engine status page's VMA object, is still held and not released on driver unbind. Get the legacy timeline only after successful allocation of the context engine's VMA. v2: Add a note on other submission methods (Krzysztof Karas): Both execlists and GuC submission use lrc_alloc() which seems free from a similar issue. Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061 Cc: Chris Wilson <chris.p.wilson@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Krzysztof Karas <krzysztof.karas@intel.com> Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20250611104352.1014011-2-janusz.krzysztofik@linux.intel.com (cherry picked from commit cc43422b3cc79eacff4c5a8ba0d224688ca9dd4f) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2025-07-01drm/vmwgfx: Fix guests running with TDX/SEVMarko Kiiskila1-1/+1
Commit 81256a50aa0f ("x86/mm: Make memremap(MEMREMAP_WB) map memory as encrypted by default") changed the default behavior of memremap(MEMREMAP_WB) and started mapping memory as encrypted. The driver requires the fifo memory to be decrypted to communicate with the host but was relaying on the old default behavior of memremap(MEMREMAP_WB) and thus broke. Fix it by explicitly specifying the desired behavior and passing MEMREMAP_DEC to memremap. Fixes: 81256a50aa0f ("x86/mm: Make memremap(MEMREMAP_WB) map memory as encrypted by default") Signed-off-by: Marko Kiiskila <marko.kiiskila@broadcom.com> Signed-off-by: Zack Rusin <zack.rusin@broadcom.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20250618192926.1092450-1-zack.rusin@broadcom.com
2025-06-30drm/amd/display: Don't allow OLED to go down to fully offMario Limonciello1-5/+7
[Why] OLED panels can be fully off, but this behavior is unexpected. [How] Ensure that minimum luminance is at least 1. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4338 Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ray Wu <ray.wu@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 51496c7737d06a74b599d0aa7974c3d5a4b1162e)
2025-06-30drm/amd/display: Added case for when RR equals panel's max RR using freesyncHarold Sun2-0/+9
[WHY] Rounding error sometimes occurs when the refresh rate is equal to a panel's max refresh rate, causing HDMI compliance failures. [HOW] Added a case so that we round up to avoid v_total_min to be below a panel's minimum bound. Reviewed-by: Jun Lei <jun.lei@amd.com> Signed-off-by: Harold Sun <Harold.Sun@amd.com> Signed-off-by: Ray Wu <ray.wu@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fe7645d22bc0f7c1558296538ec49987bf268ef6)
2025-06-30drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8Alex Deucher2-0/+16
These were missed when support was added for other generations. The callbacks are called unconditionally so we need to make sure all generations have them. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4304 Link: https://github.com/ROCm/ROCm/issues/4965 Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+") Cc: Jonathan Kim <jonathan.kim@amd.com> Reported-by: Johl Brown <johlbrown@gmail.com> Reviewed-by: Jonathan Kim <jonathan.kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1e9d17a5dcf1242e9518e461d8e63ad35240e49e) Cc: stable@vger.kernel.org
2025-06-30drm/amdgpu: Fix memory leak in amdgpu_ctx_mgr_entity_finiLin.Cao1-0/+1
patch dd64956685fa ("drm/amdgpu: Remove duplicated "context still alive" check") removed ctx put, which will cause amdgpu_ctx_fini() cannot be called and then cause some finished fence that added by amdgpu_ctx_add_fence() cannot be released and cause memleak. Fixes: dd64956685fa ("drm/amdgpu: Remove duplicated "context still alive" check") Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8cf66089e28108dedd47e6156a48489303cf525c) Cc: stable@vger.kernel.org
2025-06-30drm/amdkfd: Don't call mmput from MMU notifier callbackPhilip Yang1-23/+20
If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers deadlock with below backtrace. The deadlock will leak kfd process as mmu notifier release is not called and cause VRAM leaking. The fix is to take mm reference mmget_non_zero when adding prange to the deferred list to pair with mmput in deferred list work. If prange split and add into pchild list, the pchild work_item.mm is not used, so remove the mm parameter from svm_range_unmap_split and svm_range_add_child. The backtrace of hung task: INFO: task python:348105 blocked for more than 64512 seconds. Call Trace: __schedule+0x1c3/0x550 schedule+0x46/0xb0 rwsem_down_write_slowpath+0x24b/0x4c0 unlink_anon_vmas+0xb1/0x1c0 free_pgtables+0xa9/0x130 exit_mmap+0xbc/0x1a0 mmput+0x5a/0x140 svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu] mn_itree_invalidate+0x72/0xc0 __mmu_notifier_invalidate_range_start+0x48/0x60 try_to_unmap_one+0x10fa/0x1400 rmap_walk_anon+0x196/0x460 try_to_unmap+0xbb/0x210 migrate_page_unmap+0x54d/0x7e0 migrate_pages_batch+0x1c3/0xae0 migrate_pages_sync+0x98/0x240 migrate_pages+0x25c/0x520 compact_zone+0x29d/0x590 compact_zone_order+0xb6/0xf0 try_to_compact_pages+0xbe/0x220 __alloc_pages_direct_compact+0x96/0x1a0 __alloc_pages_slowpath+0x410/0x930 __alloc_pages_nodemask+0x3a9/0x3e0 do_huge_pmd_anonymous_page+0xd7/0x3e0 __handle_mm_fault+0x5e3/0x5f0 handle_mm_fault+0xf7/0x2e0 hmm_vma_fault.isra.0+0x4d/0xa0 walk_pmd_range.isra.0+0xa8/0x310 walk_pud_range+0x167/0x240 walk_pgd_range+0x55/0x100 __walk_page_range+0x87/0x90 walk_page_range+0xf6/0x160 hmm_range_fault+0x4f/0x90 amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu] amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu] init_user_pages+0xb1/0x2a0 [amdgpu] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu] kfd_ioctl+0x29d/0x500 [amdgpu] Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier") Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a29e067bd38946f752b0ef855f3dfff87e77bec7) Cc: stable@vger.kernel.org
2025-06-30drm/amdgpu: Include sdma_4_4_4.binKent Russell1-0/+1
This got missed during SDMA 4.4.4 support. Fixes: 968e3811c3e8 ("drm/amdgpu: add initial support for sdma444") Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 51526efe02714339ed6139f7bc348377d363200a) Cc: stable@vger.kernel.org
2025-06-30amdkfd: MTYPE_UC for ext-coherent system memoryDavid Yat Sin1-1/+1
Set memory mtype to UC host memory when ext-coherent flag is set and memory is registered as a SVM allocation. Reviewed-by: Amber Lin <Amber.Lin@amd.com> Signed-off-by: David Yat Sin <David.YatSin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5d14fdab4778c29cfd39e62c3ce84d232b4a7d8c)
2025-06-30drm/amdgpu/sdma5.x: suspend KFD queues in ring resetAlex Deucher2-2/+12
SDMA 5.x only supports engine soft reset which resets all queues on the engine. As such, we need to suspend KFD queues around resets like we do for SDMA 4.x. Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 61feed0baa1a0d094af0e07e968b1e6e875f07d0)