Age | Commit message (Collapse) | Author | Files | Lines |
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc6 or final:
- Fix nouveau fail on debugfs errors.
- Magic 50 ms to fix nouveau suspend.
- Call rust destructor on drm device release.
- Fix DMA api error handling in tegra/nvdec.
- Fix PVR device reset.
- Habanalabs maintainer update.
- Small memory leak fix when nouveau acpi init fails.
- Do not attempt to bind to any PCI device with AGP capability.
- Make FB's acquire handles on backing object, same as i915/xe already does.
- Fix race in drm_gem_handle_create_tail.
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/e522cdc7-1787-48f2-97e5-0f94783970ab@linux.intel.com
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Clear LMTT page to avoid leaking data from one VF to another
- Align PF queue size to power of 2
- Disable Indirect Ring State to avoid intermittent issues on context
switch: feature is not currently needed, so can be disabled for now.
- Fix compression handling when the BO pages are very fragmented
- Restore display pm on error path
- Fix runtime pm handling in xe devcoredump
- Fix xe_pm_set_vram_threshold() doc
- Recommend new minor versions of GuC firmware
- Drop some workarounds on VF
- Do not use verbose GuC logging by default: it should be only for
debugging
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/s6jyd24mimbzb4vxtgc5vupvbyqplfep2c6eupue7znnlbhuxy@lmvzexfzhrnn
|
|
Currently xe sets the guc log level to a verbose level since it's useful
to debug hangs and general development. However the verbose level may
already be too much and affect performance.
Michal Mrozek did some tests with the L0 compute stack for submission
latency with ULLS disabled. Below are the normalized numbers with log
level 3 (the current default) as baseline for each test:
Test \ Log Level 3 0 1 2
----------------------------------------------------------- ------ ------ ------ ------
BestWalkerNthCommandListSubmission(CmdListCount=2) 1.00 0.63 0.63 0.96
BestWalkerNthSubmission(KernelCount=2) 1.00 0.62 0.63 0.96
BestWalkerNthSubmissionImmediate(KernelCount=2) 1.00 0.58 0.58 0.85
BestWalkerSubmission 1.00 0.62 0.62 0.96
BestWalkerSubmissionImmediate 1.00 0.63 0.62 0.96
BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=2) 1.00 0.58 0.58 0.86
BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=4) 1.00 0.70 0.70 0.83
BestWalkerSubmissionImmediateMultiCmdlists(cmdlistCount=8) 1.00 0.53 0.52 0.78
Log level 2 is the first "verbose level" for GuC, where the biggest
difference happens. Keep log level 3 for CONFIG_DRM_XE_DEBUG, but switch
to 1, i.e. GUC_LOG_LEVEL_NON_VERBOSE, for "normal" builds.
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250613-guc-log-level-v2-1-cb84a63e49fe@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit a37128ba613ad6a5f81f382fa3cfe5c4a6527310)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
These workarounds are not applicable for use by the VFs.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Tested-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
Signed-off-by: Jakub Kolakowski <jakub1.kolakowski@intel.com>
Link: https://lore.kernel.org/r/20250710103040.375610-2-jakub1.kolakowski@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 1d2e2503e506ddc499cbb7afdc8b70bcf6fe241f)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
UAPI compatibility version 1.22.2
Resolves various bugs. Recommend newer version.
Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250626182805.1701096-13-daniele.ceraolospurio@intel.com
(cherry picked from commit 0b64addcae7f04745bc5f62d41e27268052f812e)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
The parameter threshold is with size in MiB, not in bits.
Correct it to avoid any confusion.
v2: s/mb/MiB, s/vram/VRAM, fix return section. (Michal)
Fixes: 30c399529f4c ("drm/xe: Document Xe PM component")
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250708021450.3602087-2-shuicheng.lin@intel.com
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 0efec0500117947f924e5ac83be40f96378af85a)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
xe_pm_runtime_put() is missed to be called for the error path in
xe_devcoredump_read().
Add function description comments for xe_devcoredump_read() to help
understand it.
v2: more detail function comments and refine goto logic (Matt)
Fixes: c4a2e5f865b7 ("drm/xe: Add devcoredump chunking")
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250707004911.3502904-6-shuicheng.lin@intel.com
(cherry picked from commit 017ef1228d735965419ff118fe1b89089e772c42)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
xe_bo_evict_all() is called after xe_display_pm_suspend(). So if there
is error with xe_bo_evict_all(), display pm should be restored.
Fixes: 51462211f4a9 ("drm/xe/pxp: add PXP PM support")
Fixes: cb8f81c17531 ("drm/xe/display: Make display suspend/resume work on discrete")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://lore.kernel.org/r/20250708035424.3608190-2-shuicheng.lin@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 83dcee17855c4e5af037ae3262809036de127903)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
It turns out that the fixup from vlv_fixup_mipi_sequences() is necessary
for some DSI panel's with version 2 mipi-sequences too.
Specifically the Acer Iconia One 8 A1-840 (not to be confused with the
A1-840FHD which is different) has the following sequences:
BDB block 53 (1284 bytes) - MIPI sequence block:
Sequence block version v2
Panel 0 *
Sequence 2 - MIPI_SEQ_INIT_OTP
GPIO index 9, source 0, set 0 (0x00)
Delay: 50000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 6000 us
GPIO index 9, source 0, set 0 (0x00)
Delay: 6000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 25000 us
Send DCS: Port A, VC 0, LP, Type 39, Length 5, Data ff aa 55 a5 80
Send DCS: Port A, VC 0, LP, Type 39, Length 3, Data 6f 11 00
...
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 29
Delay: 120000 us
Sequence 4 - MIPI_SEQ_DISPLAY_OFF
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 28
Delay: 105000 us
Send DCS: Port A, VC 0, LP, Type 05, Length 2, Data 10 00
Delay: 10000 us
Sequence 5 - MIPI_SEQ_ASSERT_RESET
Delay: 10000 us
GPIO index 9, source 0, set 0 (0x00)
Notice how there is no MIPI_SEQ_DEASSERT_RESET, instead the deassert
is done at the beginning of MIPI_SEQ_INIT_OTP, which is exactly what
the fixup from vlv_fixup_mipi_sequences() fixes up.
Extend it to also apply to v2 sequences, this fixes the panel not working
on the Acer Iconia One 8 A1-840.
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14605
Signed-off-by: Hans de Goede <hansg@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Link: https://lore.kernel.org/r/20250703143824.7121-1-hansg@kernel.org
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 11895f375939d60efe7ed5dddc1cffe2e79f976c)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
|
Object creation is a careful dance where we must guarantee that the
object is fully constructed before it is visible to other threads, and
GEM buffer objects are no difference.
Final publishing happens by calling drm_gem_handle_create(). After
that the only allowed thing to do is call drm_gem_object_put() because
a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
(which is trivial since we have a linear allocator) can already tear
down the object again.
Luckily most drivers get this right, the very few exceptions I've
pinged the relevant maintainers for. Unfortunately we also need
drm_gem_handle_create() when creating additional handles for an
already existing object (e.g. GETFB ioctl or the various bo import
ioctl), and hence we cannot have a drm_gem_handle_create_and_put() as
the only exported function to stop these issues from happening.
Now unfortunately the implementation of drm_gem_handle_create() isn't
living up to standards: It does correctly finishe object
initialization at the global level, and hence is safe against a
concurrent tear down. But it also sets up the file-private aspects of
the handle, and that part goes wrong: We fully register the object in
the drm_file.object_idr before calling drm_vma_node_allow() or
obj->funcs->open, which opens up races against concurrent removal of
that handle in drm_gem_handle_delete().
Fix this with the usual two-stage approach of first reserving the
handle id, and then only registering the object after we've completed
the file-private setup.
Jacek reported this with a testcase of concurrently calling GEM_CLOSE
on a freshly-created object (which also destroys the object), but it
should be possible to hit this with just additional handles created
through import or GETFB without completed destroying the underlying
object with the concurrent GEM_CLOSE ioctl calls.
Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
Release driver references to handle before making it available
again"), which means a cool 9 years have passed until someone noticed
that we need to make this symmetry or there's still gaps left :-/
Without the 2-stage close approach we'd still have a race, therefore
that's an integral part of this bugfix.
More importantly, this means we can have NULL pointers behind
allocated id in our drm_file.object_idr. We need to check for that
now:
- drm_gem_handle_delete() checks for ERR_OR_NULL already
- drm_gem.c:object_lookup() also chekcs for NULL
- drm_gem_release() should never be called if there's another thread
still existing that could call into an IOCTL that creates a new
handle, so cannot race. For paranoia I added a NULL check to
drm_gem_object_release_handle() though.
- most drivers (etnaviv, i915, msm) are find because they use
idr_find(), which maps both ENOENT and NULL to NULL.
- drivers using idr_for_each_entry() should also be fine, because
idr_get_next does filter out NULL entries and continues the
iteration.
- The same holds for drm_show_memory_stats().
v2: Use drm_WARN_ON (Thomas)
Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Tested-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Cc: stable@vger.kernel.org
Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20250707151814.603897-1-simona.vetter@ffwll.ch
|
|
Acquire GEM handles in drm_framebuffer_init() and release them in
the corresponding drm_framebuffer_cleanup(). Ties the handle's
lifetime to the framebuffer. Not all GEM buffer objects have GEM
handles. If not set, no refcounting takes place. This is the case
for some fbdev emulation. This is not a problem as these GEM objects
do not use dma-bufs and drivers will not release them while fbdev
emulation is running. Framebuffer flags keep a bit per color plane
of which the framebuffer holds a GEM handle reference.
As all drivers use drm_framebuffer_init(), they will now all hold
dma-buf references as fixed in commit 5307dce878d4 ("drm/gem: Acquire
references on GEM handles for framebuffers").
In the GEM framebuffer helpers, restore the original ref counting
on buffer objects. As the helpers for handle refcounting are now
no longer called from outside the DRM core, unexport the symbols.
v3:
- don't mix internal flags with mode flags (Christian)
v2:
- track framebuffer handle refs by flag
- drop gma500 cleanup (Christian)
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 5307dce878d4 ("drm/gem: Acquire references on GEM handles for framebuffers")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/dri-devel/20250703115915.3096-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Tested-by: Mario Limonciello <superm1@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Anusha Srivatsa <asrivats@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: <stable@vger.kernel.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250707131224.249496-1-tzimmermann@suse.de
|
|
There looks to be an issue in our compression handling when the BO pages
are very fragmented, where we choose to skip the identity map and
instead fall back to emitting the PTEs by hand when migrating memory,
such that we can hopefully do more work per blit operation. However in
such a case we need to ensure the src PTEs are correctly tagged with a
compression enabled PAT index on dgpu xe2+, otherwise the copy will
simply treat the src memory as uncompressed, leading to corruption if
the memory was compressed by the user.
To fix this pass along use_comp_pat into emit_pte() on the src side, to
indicate that compression should be considered.
v2 (Jonathan): tweak the commit message
Fixes: 523f191cc0c7 ("drm/xe/xe_migrate: Handle migration logic for xe2+ dgfx")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Akshata Jahagirdar <akshata.jahagirdar@intel.com>
Cc: <stable@vger.kernel.org> # v6.12+
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://lore.kernel.org/r/20250701103949.83116-2-matthew.auld@intel.com
(cherry picked from commit f7a2fd776e57bd6468644bdecd91ab3aba57ba58)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
This reverts commit fe0154cf8222d9e38c60ccc124adb2f9b5272371.
Seeing some unexplained random failures during LRC context switches with
indirect ring state enabled. The failures were always there, but the
repro rate increased with the addition of WA BB as a separate BO.
Commit 3a1edef8f4b5 ("drm/xe: Make WA BB part of LRC BO") helped to
reduce the issues in the context switches, but didn't eliminate them
completely.
Indirect ring state is not required for any current features, so disable
for now until failures can be root caused.
Cc: stable@vger.kernel.org
Fixes: fe0154cf8222 ("drm/xe/xe2: Enable Indirect Ring State support for Xe2")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250702035846.3178344-1-matthew.brost@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 03d85ab36bcbcbe9dc962fccd3f8e54d7bb93b35)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
CIRC_SPACE does not work unless the size argument is a power of 2,
allocate PF queue size on power of 2 boundary.
Cc: stable@vger.kernel.org
Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size")
Fixes: 29582e0ea75c ("drm/xe: Add page queue multiplier")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250702213511.3226167-1-matthew.brost@intel.com
(cherry picked from commit 491b9783126303755717c0cbde0b08ee59b6abab)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Our LMEM buffer objects are not cleared by default on alloc
and during VF provisioning we only setup LMTT PTEs for the
actually provisioned LMEM range. But beyond that valid range
we might leave some stale data that could either point to some
other VFs allocations or even to the PF pages.
Explicitly clear all new LMTT page to avoid the risk that a
malicious VF would try to exploit that gap.
While around add asserts to catch any undesired PTE overwrites
and low-level debug traces to track LMTT PT life-cycle.
Fixes: b1d204058218 ("drm/xe/pf: Introduce Local Memory Translation Table")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Lukasz Laguna <lukasz.laguna@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://lore.kernel.org/r/20250701220052.1612-1-michal.wajdeczko@intel.com
(cherry picked from commit 3fae6918a3e27cce20ded2551f863fb05d4bef8d)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
If any of the ACPI calls fail, memory allocated for the input buffer
would be leaked. Fix failure paths to free allocated memory.
Also add checks to ensure the allocations succeeded in the first place.
Reported-by: Danilo Krummrich <dakr@kernel.org>
Fixes: 176fdcbddfd2 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20250617040036.2932-1-bskeggs@nvidia.com
|
|
`kernel::str::CStr` is included in the prelude.
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20250704-cstr-include-drm-v1-1-a279dfc4d753@gmail.com
|
|
The GPU hard reset sequence calls pm_runtime_force_suspend() and
pm_runtime_force_resume(), which according to their documentation should
only be used during system-wide PM transitions to sleep states.
The main issue though is that depending on some internal runtime PM
state as seen by pm_runtime_force_suspend() (whether the usage count is
<= 1), pm_runtime_force_resume() might not resume the device unless
needed. If that happens, the runtime PM resume callback
pvr_power_device_resume() is not called, the GPU clocks are not
re-enabled, and the kernel crashes on the next attempt to access GPU
registers as part of the power-on sequence.
Replace calls to pm_runtime_force_suspend() and
pm_runtime_force_resume() with direct calls to the driver's runtime PM
callbacks, pvr_power_device_suspend() and pvr_power_device_resume(),
to ensure clocks are re-enabled and avoid the kernel crash.
Fixes: cc1aeedb98ad ("drm/imagination: Implement firmware infrastructure and META FW support")
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
Reviewed-by: Matt Coster <matt.coster@imgtec.com>
Link: https://lore.kernel.org/r/20250624-fix-kernel-crash-gpu-hard-reset-v1-1-6d24810d72a6@imgtec.com
Cc: stable@vger.kernel.org
Signed-off-by: Matt Coster <matt.coster@imgtec.com>
|
|
Check for NULL return value with dma_alloc_coherent, in line with
Robin's fix for vic.c in 'drm/tegra: vic: Fix DMA API misuse'.
Fixes: 46f226c93d35 ("drm/tegra: Add NVDEC driver")
Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20250702-nvdec-dma-error-check-v1-1-c388b402c53a@nvidia.com
|
|
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix chunking the PTE updates and overflowing the maximum number of
dwords with with MI_STORE_DATA_IMM (Jia Yao)
- Move WA BB to the LRC BO to mitigate hangs on context switch (Matthew
Brost)
- Fix frequency/flush WAs for BMG (Vinay / Lucas)
- Fix kconfig prompt title and description (Lucas)
- Do not require kunit (Harry Austen / Lucas)
- Extend 14018094691 WA to BMG (Daniele)
- Fix wedging the device on signal (Matthew Brost)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/o5662wz6nrlf6xt5sjgxq5oe6qoujefzywuwblm3m626hreifv@foqayqydd6ig
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
- Fixed raw pointer leakage and unsafe behavior in printk()
. Switch from %pK to %p for pointer formatting, as %p is now safer
and prevents issues like raw pointer leakage and acquiring sleeping
locks in atomic contexts.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Inki Dae <inki.dae@samsung.com>
Link: https://lore.kernel.org/r/20250629091742.29956-1-inki.dae@samsung.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
Fixups
- Fixed raw pointer leakage and unsafe behavior in printk()
. Switch from %pK to %p for pointer formatting, as %p is now safer
and prevents issues like raw pointer leakage and acquiring sleeping
locks in atomic contexts.
- Fixed kernel panic during boot
. A NULL pointer dereference issue occasionally occurred
when the vblank interrupt handler was called before
the DRM driver was fully initialized during boot.
So this patch fixes the issue by adding a check in the interrupt handler
to ensure the DRM driver is properly initialized.
- Fixed a lockup issue on Samsung Peach-Pit/Pi Chromebooks
. The issue occurred after commit c9b1150a68d9 changed
the call order of CRTC enable/disable and bridge pre_enable/post_disable
methods, causing fimd_dp_clock_enable() to be called
before the FIMD device was activated. To fix this,
runtime PM guards were added to fimd_dp_clock_enable()
to ensure proper operation even when CRTC is not enabled.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Inki Dae <inki.dae@samsung.com>
Link: https://lore.kernel.org/r/20250629083554.28628-1-inki.dae@samsung.com
|
|
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Make mei interrupt top half irq disabled to fix RT builds
- Fix timeline left held on VMA alloc error
- Fix NULL pointer deref in vlv_dphy_param_init()
- Fix selftest mock_request() to avoid NULL deref
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://lore.kernel.org/r/aGYVPAA4KvsZqDFx@jlahtine-mobl
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc5:
- Replace simple panel lookup hack with proper fix.
- nullpointer deref in vesadrm fix.
- fix dma_resv_wait_timeout.
- fix error handling in ttm_buffer_object_transfer.
- bridge fixes.
- Fix vmwgfx accidentally allocating encrypted memory.
- Fix race in spsc_queue_push()
- Add refcount on backing GEM objects during fb creation.
- Fix v3d irq's being enabled during gpu reset.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/a7461418-08dc-4b7c-b2fa-264155f66d5e@linux.intel.com
|
|
This fixes a bunch of command hangs after runtime suspend/resume.
This fixes a regression caused by code movement in the commit below,
the commit seems to just change timings enough to cause this to happen
now, and adding the sleep seems to avoid it.
I've spent some time trying to root cause it to no great avail,
it seems like a bug on the firmware side, but it could be a bug
in our rpc handling that I can't find.
Either way, we should land the workaround to fix the problem,
while we continue to work out the root cause.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@nvidia.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Fixes: c21b039715ce ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20250702232707.175679-1-airlied@gmail.com
|
|
If CONFIG_DEBUG_FS is enabled, nouveau_drm_init() returns an error if it
fails to create the "nouveau" directory in debugfs. One case where that
will happen is when debugfs access is restricted by
CONFIG_DEBUG_FS_ALLOW_NONE or by the boot parameter debugfs=off, which
cause the debugfs APIs to return -EPERM.
So just ignore errors from debugfs. Note that nouveau_debugfs_root may
be an error now, but that is a standard pattern for debugfs. From
include/linux/debugfs.h:
"NOTE: it's expected that most callers should _ignore_ the errors
returned by this function. Other debugfs functions handle the fact that
the "dentry" passed to them could be an error and they don't crash in
that case. Drivers should generally work fine even if debugfs fails to
init anyway."
Fixes: 97118a1816d2 ("drm/nouveau: create module debugfs root")
Cc: stable@vger.kernel.org
Signed-off-by: Aaron Thompson <dev@aaront.org>
Acked-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/20250703211949.9916-1-dev@aaront.org
|
|
When a user closes an exec queue or interrupts an app with Ctrl-C,
this does not warrant wedging the device in mode 2.
Avoid this by skipping the wedge check for killed exec queues in
the TDR and LR exec queue cleanup worker.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250624174103.2707941-1-matthew.brost@intel.com
(cherry picked from commit 5a2f117a80c207372513ca8964eeb178874f4990)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
This WA is applicable to BMG as well.
Note that this is a GSC WA and we don't load the GSC on BMG, so
extending the WA to BMG won't do anything right now. However, it helps
future-proof the driver so that if we ever turn the GSC on we won't have
to remember to extend this WA.
v2: don't use VERSION_RANGE from 2001 to 2004 (Matt)
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20250613231128.1261815-2-daniele.ceraolospurio@intel.com
(cherry picked from commit 1a5ce0c5b95b0624ebd44f574b98003a466973be)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Currently, an interrupt can be triggered during a GPU reset, which can
lead to GPU hangs and NULL pointer dereference in an interrupt context
as shown in the following trace:
[ 314.035040] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0
[ 314.043822] Mem abort info:
[ 314.046606] ESR = 0x0000000096000005
[ 314.050347] EC = 0x25: DABT (current EL), IL = 32 bits
[ 314.055651] SET = 0, FnV = 0
[ 314.058695] EA = 0, S1PTW = 0
[ 314.061826] FSC = 0x05: level 1 translation fault
[ 314.066694] Data abort info:
[ 314.069564] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[ 314.075039] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 314.080080] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 314.085382] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000102728000
[ 314.091814] [00000000000000c0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 314.100511] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[ 314.106770] Modules linked in: v3d i2c_brcmstb vc4 snd_soc_hdmi_codec gpu_sched drm_shmem_helper drm_display_helper cec drm_dma_helper drm_kms_helper drm drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight
[ 314.129654] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.25+rpt-rpi-v8 #1 Debian 1:6.12.25-1+rpt1
[ 314.139388] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[ 314.145211] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 314.152165] pc : v3d_irq+0xec/0x2e0 [v3d]
[ 314.156187] lr : v3d_irq+0xe0/0x2e0 [v3d]
[ 314.160198] sp : ffffffc080003ea0
[ 314.163502] x29: ffffffc080003ea0 x28: ffffffec1f184980 x27: 021202b000000000
[ 314.170633] x26: ffffffec1f17f630 x25: ffffff8101372000 x24: ffffffec1f17d9f0
[ 314.177764] x23: 000000000000002a x22: 000000000000002a x21: ffffff8103252000
[ 314.184895] x20: 0000000000000001 x19: 00000000deadbeef x18: 0000000000000000
[ 314.192026] x17: ffffff94e51d2000 x16: ffffffec1dac3cb0 x15: c306000000000000
[ 314.199156] x14: 0000000000000000 x13: b2fc982e03cc5168 x12: 0000000000000001
[ 314.206286] x11: ffffff8103f8bcc0 x10: ffffffec1f196868 x9 : ffffffec1dac3874
[ 314.213416] x8 : 0000000000000000 x7 : 0000000000042a3a x6 : ffffff810017a180
[ 314.220547] x5 : ffffffec1ebad400 x4 : ffffffec1ebad320 x3 : 00000000000bebeb
[ 314.227677] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 314.234807] Call trace:
[ 314.237243] v3d_irq+0xec/0x2e0 [v3d]
[ 314.240906] __handle_irq_event_percpu+0x58/0x218
[ 314.245609] handle_irq_event+0x54/0xb8
[ 314.249439] handle_fasteoi_irq+0xac/0x240
[ 314.253527] handle_irq_desc+0x48/0x68
[ 314.257269] generic_handle_domain_irq+0x24/0x38
[ 314.261879] gic_handle_irq+0x48/0xd8
[ 314.265533] call_on_irq_stack+0x24/0x58
[ 314.269448] do_interrupt_handler+0x88/0x98
[ 314.273624] el1_interrupt+0x34/0x68
[ 314.277193] el1h_64_irq_handler+0x18/0x28
[ 314.281281] el1h_64_irq+0x64/0x68
[ 314.284673] default_idle_call+0x3c/0x168
[ 314.288675] do_idle+0x1fc/0x230
[ 314.291895] cpu_startup_entry+0x3c/0x50
[ 314.295810] rest_init+0xe4/0xf0
[ 314.299030] start_kernel+0x5e8/0x790
[ 314.302684] __primary_switched+0x80/0x90
[ 314.306691] Code: 940029eb 360ffc13 f9442ea0 52800001 (f9406017)
[ 314.312775] ---[ end trace 0000000000000000 ]---
[ 314.317384] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 314.324249] SMP: stopping secondary CPUs
[ 314.328167] Kernel Offset: 0x2b9da00000 from 0xffffffc080000000
[ 314.334076] PHYS_OFFSET: 0x0
[ 314.336946] CPU features: 0x08,00002013,c0200000,0200421b
[ 314.342337] Memory Limit: none
[ 314.345382] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
Before resetting the GPU, it's necessary to disable all interrupts and
deal with any interrupt handler still in-flight. Otherwise, the GPU might
reset with jobs still running, or yet, an interrupt could be handled
during the reset.
Cc: stable@vger.kernel.org
Fixes: 57692c94dcbe ("drm/v3d: Introduce a new DRM driver for Broadcom V3D V3.x+")
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Link: https://lore.kernel.org/r/20250628224243.47599-1-mcanal@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
|
|
A GEM handle can be released while the GEM buffer object is attached
to a DRM framebuffer. This leads to the release of the dma-buf backing
the buffer object, if any. [1] Trying to use the framebuffer in further
mode-setting operations leads to a segmentation fault. Most easily
happens with driver that use shadow planes for vmap-ing the dma-buf
during a page flip. An example is shown below.
[ 156.791968] ------------[ cut here ]------------
[ 156.796830] WARNING: CPU: 2 PID: 2255 at drivers/dma-buf/dma-buf.c:1527 dma_buf_vmap+0x224/0x430
[...]
[ 156.942028] RIP: 0010:dma_buf_vmap+0x224/0x430
[ 157.043420] Call Trace:
[ 157.045898] <TASK>
[ 157.048030] ? show_trace_log_lvl+0x1af/0x2c0
[ 157.052436] ? show_trace_log_lvl+0x1af/0x2c0
[ 157.056836] ? show_trace_log_lvl+0x1af/0x2c0
[ 157.061253] ? drm_gem_shmem_vmap+0x74/0x710
[ 157.065567] ? dma_buf_vmap+0x224/0x430
[ 157.069446] ? __warn.cold+0x58/0xe4
[ 157.073061] ? dma_buf_vmap+0x224/0x430
[ 157.077111] ? report_bug+0x1dd/0x390
[ 157.080842] ? handle_bug+0x5e/0xa0
[ 157.084389] ? exc_invalid_op+0x14/0x50
[ 157.088291] ? asm_exc_invalid_op+0x16/0x20
[ 157.092548] ? dma_buf_vmap+0x224/0x430
[ 157.096663] ? dma_resv_get_singleton+0x6d/0x230
[ 157.101341] ? __pfx_dma_buf_vmap+0x10/0x10
[ 157.105588] ? __pfx_dma_resv_get_singleton+0x10/0x10
[ 157.110697] drm_gem_shmem_vmap+0x74/0x710
[ 157.114866] drm_gem_vmap+0xa9/0x1b0
[ 157.118763] drm_gem_vmap_unlocked+0x46/0xa0
[ 157.123086] drm_gem_fb_vmap+0xab/0x300
[ 157.126979] drm_atomic_helper_prepare_planes.part.0+0x487/0xb10
[ 157.133032] ? lockdep_init_map_type+0x19d/0x880
[ 157.137701] drm_atomic_helper_commit+0x13d/0x2e0
[ 157.142671] ? drm_atomic_nonblocking_commit+0xa0/0x180
[ 157.147988] drm_mode_atomic_ioctl+0x766/0xe40
[...]
[ 157.346424] ---[ end trace 0000000000000000 ]---
Acquiring GEM handles for the framebuffer's GEM buffer objects prevents
this from happening. The framebuffer's cleanup later puts the handle
references.
Commit 1a148af06000 ("drm/gem-shmem: Use dma_buf from GEM object
instance") triggers the segmentation fault easily by using the dma-buf
field more widely. The underlying issue with reference counting has
been present before.
v2:
- acquire the handle instead of the BO (Christian)
- fix comment style (Christian)
- drop the Fixes tag (Christian)
- rename err_ gotos
- add missing Link tag
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://elixir.bootlin.com/linux/v6.15/source/drivers/gpu/drm/drm_gem.c#L241 # [1]
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Anusha Srivatsa <asrivats@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: <stable@vger.kernel.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250630084001.293053-1-tzimmermann@suse.de
|
|
Fix Kconfig symbol dependency on KUNIT, which isn't actually required
for XE to be built-in. However, if KUNIT is enabled, it must be built-in
too.
Fixes: 08987a8b6820 ("drm/xe: Fix build with KUNIT=m")
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Harry Austen <hpausten@protonmail.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20250627-xe-kunit-v2-2-756fe5cd56cf@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit a559434880b320b83733d739733250815aecf1b0)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
The xe driver is the official driver for Intel Xe2 and later, while
maintaining experimental support for earlier GPUs. Reword the help
message accordingly.
Reviewed-by: Maarten Lankhorst <dev@lankhorst.se>
Link: https://lore.kernel.org/r/20250611-xe-kconfig-help-v1-1-8bcc6b47d11a@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 1488a3089de3d0bcdc9532da7ce04cf0af9d7dd0)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Limit GT max frequency to 2600MHz and wait for frequency to reduce
before proceeding with a transient flush. This is really only needed for
the transient flush: if L2 flush is needed due to 16023588340 then
there's no need to do this additional wait since we are already using
the bigger hammer.
v2: Use generic names, ensure user set max frequency requests wait
for flush to complete (Rodrigo)
v3:
- User requests wait via wait_var_event_timeout (Lucas)
- Close races on flush + user requests (Lucas)
- Fix xe_guc_pc_remove_flush_freq_limit() being called on last gt
rather than root gt (Lucas)
v4:
- Only apply the freq reducing part if a TDF is needed: L2 flush trumps
the need for waiting a lower frequency
Fixes: aaa08078e725 ("drm/xe/bmg: Apply Wa_22019338487")
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-4-b888388477f2@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit deea6a7d6d803d6bb874a3e6f1b312e560e6c6df)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
Set GT min frequency to 1200Mhz once driver load is complete.
v2: Review comments (Rodrigo)
v3: Apply Wa earlier so user_req_min is not clobbered.
v4: Apply to all GTs (Lucas)
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250612-wa-14022085890-v4-3-94ba5dcc1e30@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit bdde16c9ac5cb56ad2ee19792222fa1853577af7)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
xe_device_td_flush() has 2 possible implementations: an entire L2 flush
or a transient flush, depending on WA 16023588340. Make this clear by
splitting the function so it calls each of them.
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-3-b888388477f2@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 5e300ed8a545bdffc26b579c526b5fef7b2d5365)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
pc_set_mert_freq_cap() currently lock()/unlock() the mutex multiple times
to stash the current frequencies. It's not a problem since
xe_guc_pc_restore_stashed_freq() is guaranteed to be called only later
in the init sequence. However, now that we have _locked() variants for
this functions, use them and avoid potential issues when called from
other places or using the same pattern.
While at it, prefer and early return for the WA check to reduce
indentation.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-2-b888388477f2@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit d878c97daa603573e5af01fd8beec2fffdb42ad1)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
There are places in which the getters/setters are called one after the
other causing a multiple lock()/unlock(). These are not currently a
problem since they are all happening from the same thread, but there's a
race possibility as calls are added outside of the early init when the
max/min and stashed values need to be correlated.
Add the _locked() variants to prepare for that.
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250618-wa-22019338487-v5-1-b888388477f2@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 1beae9aa2b88d3a02eb666e7b777eb2d7bc645f4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
No idea why, but without this GuC context switches randomly fail when
running IGTs in a loop. Need to follow up why this fixes the
aforementioned issue but can live with a stable driver for now.
Fixes: 617d824c5323 ("drm/xe: Add WA BB to capture active context utilization")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Tested-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250612031925.4009701-1-matthew.brost@intel.com
(cherry picked from commit 3a1edef8f4b58b0ba826bc68bf4bce4bdf59ecf3)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
According to Bspec, bits 0~9 of MI_STORE_DATA_IMM must not exceed 0x3FE.
The macro MI_SDI_NUM_QW(x) evaluates to 2 * x + 1, which means the
condition 2 * x + 1 <= 0x3FE must be satisfied. Therefore, the maximum
valid value for x is 0x1FE, not 0x1FF.
v2
- Replace 0x1fe with macro MAX_PTE_PER_SDI (Auld, Matthew & Patelczyk, Maciej)
v3
- Change macro MAX_PTE_PER_SDI from 0x1fe to 0x1feU (De Marchi, Lucas)
Bspec: 60246
Fixes: 9c44fd5f6e8a ("drm/xe: Add migrate layer functions for SVM support")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Brian3 Nguyen <brian3.nguyen@intel.com>
Cc: Alex Zuo <alex.zuo@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Maciej Patelczyk <maciej.patelczyk@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Suggested-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Jia Yao <jia.yao@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
Link: https://lore.kernel.org/r/20250612224620.161105-1-jia.yao@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit c038bdba98c9f6a36378044a9d4385531a194d3e)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
|
|
MEI GSC interrupt comes from i915. It has top half and bottom half.
Top half is called from i915 interrupt handler. It should be in
irq disabled context.
With RT kernel, by default i915 IRQ handler is in threaded IRQ. MEI GSC
top half might be in threaded IRQ context. generic_handle_irq_safe API
could be called from either IRQ or process context, it disables local
IRQ then calls MEI GSC interrupt top half.
This change fixes A380/A770 GPU boot hang issue with RT kernel.
Fixes: 1e3dc1d8622b ("drm/i915/gsc: add gsc as a mei auxiliary device")
Tested-by: Furong Zhou <furong.zhou@intel.com>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
Link: https://lore.kernel.org/r/20250425151108.643649-1-junxiao.chang@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit dccf655f69002d496a527ba441b4f008aa5bebbf)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
|
The following error has been reported sporadically by CI when a test
unbinds the i915 driver on a ring submission platform:
<4> [239.330153] ------------[ cut here ]------------
<4> [239.330166] i915 0000:00:02.0: [drm] drm_WARN_ON(dev_priv->mm.shrink_count)
<4> [239.330196] WARNING: CPU: 1 PID: 18570 at drivers/gpu/drm/i915/i915_gem.c:1309 i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330640] RIP: 0010:i915_gem_cleanup_early+0x13e/0x150 [i915]
...
<4> [239.330942] Call Trace:
<4> [239.330944] <TASK>
<4> [239.330949] i915_driver_late_release+0x2b/0xa0 [i915]
<4> [239.331202] i915_driver_release+0x86/0xa0 [i915]
<4> [239.331482] devm_drm_dev_init_release+0x61/0x90
<4> [239.331494] devm_action_release+0x15/0x30
<4> [239.331504] release_nodes+0x3d/0x120
<4> [239.331517] devres_release_all+0x96/0xd0
<4> [239.331533] device_unbind_cleanup+0x12/0x80
<4> [239.331543] device_release_driver_internal+0x23a/0x280
<4> [239.331550] ? bus_find_device+0xa5/0xe0
<4> [239.331563] device_driver_detach+0x14/0x20
...
<4> [357.719679] ---[ end trace 0000000000000000 ]---
If the test also unloads the i915 module then that's followed with:
<3> [357.787478] =============================================================================
<3> [357.788006] BUG i915_vma (Tainted: G U W N ): Objects remaining on __kmem_cache_shutdown()
<3> [357.788031] -----------------------------------------------------------------------------
<3> [357.788204] Object 0xffff888109e7f480 @offset=29824
<3> [357.788670] Allocated in i915_vma_instance+0xee/0xc10 [i915] age=292729 cpu=4 pid=2244
<4> [357.788994] i915_vma_instance+0xee/0xc10 [i915]
<4> [357.789290] init_status_page+0x7b/0x420 [i915]
<4> [357.789532] intel_engines_init+0x1d8/0x980 [i915]
<4> [357.789772] intel_gt_init+0x175/0x450 [i915]
<4> [357.790014] i915_gem_init+0x113/0x340 [i915]
<4> [357.790281] i915_driver_probe+0x847/0xed0 [i915]
<4> [357.790504] i915_pci_probe+0xe6/0x220 [i915]
...
Closer analysis of CI results history has revealed a dependency of the
error on a few IGT tests, namely:
- igt@api_intel_allocator@fork-simple-stress-signal,
- igt@api_intel_allocator@two-level-inception-interruptible,
- igt@gem_linear_blits@interruptible,
- igt@prime_mmap_coherency@ioctl-errors,
which invisibly trigger the issue, then exhibited with first driver unbind
attempt.
All of the above tests perform actions which are actively interrupted with
signals. Further debugging has allowed to narrow that scope down to
DRM_IOCTL_I915_GEM_EXECBUFFER2, and ring_context_alloc(), specific to ring
submission, in particular.
If successful then that function, or its execlists or GuC submission
equivalent, is supposed to be called only once per GEM context engine,
followed by raise of a flag that prevents the function from being called
again. The function is expected to unwind its internal errors itself, so
it may be safely called once more after it returns an error.
In case of ring submission, the function first gets a reference to the
engine's legacy timeline and then allocates a VMA. If the VMA allocation
fails, e.g. when i915_vma_instance() called from inside is interrupted
with a signal, then ring_context_alloc() fails, leaving the timeline held
referenced. On next I915_GEM_EXECBUFFER2 IOCTL, another reference to the
timeline is got, and only that last one is put on successful completion.
As a consequence, the legacy timeline, with its underlying engine status
page's VMA object, is still held and not released on driver unbind.
Get the legacy timeline only after successful allocation of the context
engine's VMA.
v2: Add a note on other submission methods (Krzysztof Karas):
Both execlists and GuC submission use lrc_alloc() which seems free
from a similar issue.
Fixes: 75d0a7f31eec ("drm/i915: Lift timeline into intel_context")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12061
Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
Reviewed-by: Krzysztof Niemiec <krzysztof.niemiec@intel.com>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://lore.kernel.org/r/20250611104352.1014011-2-janusz.krzysztofik@linux.intel.com
(cherry picked from commit cc43422b3cc79eacff4c5a8ba0d224688ca9dd4f)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
|
Commit 81256a50aa0f ("x86/mm: Make memremap(MEMREMAP_WB) map memory as
encrypted by default") changed the default behavior of
memremap(MEMREMAP_WB) and started mapping memory as encrypted.
The driver requires the fifo memory to be decrypted to communicate with
the host but was relaying on the old default behavior of
memremap(MEMREMAP_WB) and thus broke.
Fix it by explicitly specifying the desired behavior and passing
MEMREMAP_DEC to memremap.
Fixes: 81256a50aa0f ("x86/mm: Make memremap(MEMREMAP_WB) map memory as encrypted by default")
Signed-off-by: Marko Kiiskila <marko.kiiskila@broadcom.com>
Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20250618192926.1092450-1-zack.rusin@broadcom.com
|
|
[Why]
OLED panels can be fully off, but this behavior is unexpected.
[How]
Ensure that minimum luminance is at least 1.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4338
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 51496c7737d06a74b599d0aa7974c3d5a4b1162e)
|
|
[WHY]
Rounding error sometimes occurs when the refresh rate is equal to a panel's
max refresh rate, causing HDMI compliance failures.
[HOW]
Added a case so that we round up to avoid v_total_min to be below a panel's
minimum bound.
Reviewed-by: Jun Lei <jun.lei@amd.com>
Signed-off-by: Harold Sun <Harold.Sun@amd.com>
Signed-off-by: Ray Wu <ray.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fe7645d22bc0f7c1558296538ec49987bf268ef6)
|
|
These were missed when support was added for other generations.
The callbacks are called unconditionally so we need to make
sure all generations have them.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4304
Link: https://github.com/ROCm/ROCm/issues/4965
Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+")
Cc: Jonathan Kim <jonathan.kim@amd.com>
Reported-by: Johl Brown <johlbrown@gmail.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1e9d17a5dcf1242e9518e461d8e63ad35240e49e)
Cc: stable@vger.kernel.org
|
|
patch dd64956685fa ("drm/amdgpu: Remove duplicated "context still
alive" check") removed ctx put, which will cause amdgpu_ctx_fini()
cannot be called and then cause some finished fence that added by
amdgpu_ctx_add_fence() cannot be released and cause memleak.
Fixes: dd64956685fa ("drm/amdgpu: Remove duplicated "context still alive" check")
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8cf66089e28108dedd47e6156a48489303cf525c)
Cc: stable@vger.kernel.org
|
|
If the process is exiting, the mmput inside mmu notifier callback from
compactd or fork or numa balancing could release the last reference
of mm struct to call exit_mmap and free_pgtable, this triggers deadlock
with below backtrace.
The deadlock will leak kfd process as mmu notifier release is not called
and cause VRAM leaking.
The fix is to take mm reference mmget_non_zero when adding prange to the
deferred list to pair with mmput in deferred list work.
If prange split and add into pchild list, the pchild work_item.mm is not
used, so remove the mm parameter from svm_range_unmap_split and
svm_range_add_child.
The backtrace of hung task:
INFO: task python:348105 blocked for more than 64512 seconds.
Call Trace:
__schedule+0x1c3/0x550
schedule+0x46/0xb0
rwsem_down_write_slowpath+0x24b/0x4c0
unlink_anon_vmas+0xb1/0x1c0
free_pgtables+0xa9/0x130
exit_mmap+0xbc/0x1a0
mmput+0x5a/0x140
svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
mn_itree_invalidate+0x72/0xc0
__mmu_notifier_invalidate_range_start+0x48/0x60
try_to_unmap_one+0x10fa/0x1400
rmap_walk_anon+0x196/0x460
try_to_unmap+0xbb/0x210
migrate_page_unmap+0x54d/0x7e0
migrate_pages_batch+0x1c3/0xae0
migrate_pages_sync+0x98/0x240
migrate_pages+0x25c/0x520
compact_zone+0x29d/0x590
compact_zone_order+0xb6/0xf0
try_to_compact_pages+0xbe/0x220
__alloc_pages_direct_compact+0x96/0x1a0
__alloc_pages_slowpath+0x410/0x930
__alloc_pages_nodemask+0x3a9/0x3e0
do_huge_pmd_anonymous_page+0xd7/0x3e0
__handle_mm_fault+0x5e3/0x5f0
handle_mm_fault+0xf7/0x2e0
hmm_vma_fault.isra.0+0x4d/0xa0
walk_pmd_range.isra.0+0xa8/0x310
walk_pud_range+0x167/0x240
walk_pgd_range+0x55/0x100
__walk_page_range+0x87/0x90
walk_page_range+0xf6/0x160
hmm_range_fault+0x4f/0x90
amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
init_user_pages+0xb1/0x2a0 [amdgpu]
amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
kfd_ioctl+0x29d/0x500 [amdgpu]
Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a29e067bd38946f752b0ef855f3dfff87e77bec7)
Cc: stable@vger.kernel.org
|
|
This got missed during SDMA 4.4.4 support.
Fixes: 968e3811c3e8 ("drm/amdgpu: add initial support for sdma444")
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 51526efe02714339ed6139f7bc348377d363200a)
Cc: stable@vger.kernel.org
|
|
Set memory mtype to UC host memory when ext-coherent
flag is set and memory is registered as a SVM allocation.
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5d14fdab4778c29cfd39e62c3ce84d232b4a7d8c)
|
|
SDMA 5.x only supports engine soft reset which resets
all queues on the engine. As such, we need to suspend
KFD queues around resets like we do for SDMA 4.x.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 61feed0baa1a0d094af0e07e968b1e6e875f07d0)
|