summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/xe
AgeCommit message (Collapse)AuthorFilesLines
3 daysdrm/xe/oa: Fix exec_queue leak on width check in stream openShuicheng Lin1-2/+4
[ Upstream commit 4d25342543c01310fc4e0cba7cb17c775e2421e2 ] In xe_oa_stream_open_ioctl(), when param.exec_q->width > 1 the function returns -EOPNOTSUPP directly, skipping the existing err_exec_q cleanup path. The exec_queue reference obtained by xe_exec_queue_lookup() is leaked. The exec queue holds a reference on the xe_file, which is only dropped during queue teardown. The leaked lookup ref is not on the file's exec_queue xarray, so file close cannot release it. This keeps both the exec queue and the file private state pinned indefinitely. Jump to err_exec_q instead of returning directly so the reference is released. Fixes: f0ed39830e60 ("xe/oa: Fix query mode of operation for OAR/OAC") Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20260514203210.593488-1-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 339fa0be9e4a5d69fa47e91f4a36574224fb478f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN4Gustavo Sousa3-3/+4
[ Upstream commit 6df5678b6a94ac80e31e847074c4b30c21025b1f ] The register COMMON_SLICE_CHICKEN4 is a MCR register on both Xe2 and Xe3. Let's make sure to define a MCR version of it and use it for the relevant IP versions. Use XEHP_ as prefix for the register name, since it is MCR as of Xe_HP. v2: - Also change for one entry in lrc_tunnings, which was caught by manual testing and add corresponging Fixes tag in commit message. (Gustavo) Fixes: 8d6f16f1f082 ("drm/xe: Extend Wa_22021007897 to Xe3 platforms") Fixes: e5c13e2c505b ("drm/xe/xe2hpg: Add Wa_22021007897") Fixes: 8ccf5f6b2295 ("drm/xe/tuning: Apply windower hardware filtering setting on Xe3 and Xe3p") Bspec: 66534, 71185, 74417 Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-3-30dd47855fee@intel.com Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> (cherry picked from commit 75f65f1a4c06da1d87f28570a9d4cdad28f13360) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe/tuning: Apply windower hardware filtering setting on Xe3 and Xe3pMatt Roper2-0/+6
[ Upstream commit 8ccf5f6b2295164962bbee5b0770f4366fd9bee2 ] A recent bspec tuning guide update asks us to program COMMON_SLICE_CHICKEN4[5] on Xe3 and Xe3p platforms. Add this setting to our LRC tuning RTP table so that the setting will become part of each context's LRC. Bspec: 72161, 55902 Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260224235055.3038710-2-matthew.d.roper@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Stable-dep-of: 6df5678b6a94 ("drm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN4") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN1Gustavo Sousa2-1/+2
[ Upstream commit a4660bd949733fd6ea621fdb50fabac2608155e9 ] The register COMMON_SLICE_CHICKEN1 is a MCR register on Xe2. Let's make sure to define a MCR version of it and use it for the relevant IP versions. Use XEHP_ as prefix for the register name, since it is MCR as of Xe_HP. Fixes: a5d221924e13 ("drm/xe/xe2_hpg: Add set of workarounds") Fixes: 9f18b55b6d3f ("drm/xe/xe2: Add workaround 18033852989") Bspec: 66534, 71185 Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-2-30dd47855fee@intel.com Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> (cherry picked from commit a672725fdbfc3ea430130039d677c7dc98d59df8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe: Consolidate workaround entries for Wa_18033852989Matt Roper1-8/+4
[ Upstream commit fe681e7b44d78fd77d79de21eca58c3b6bdcda0e ] Wa_18033852989 applies to all graphics versions from 20.01 through 20.04 (inclusive). Consolidate the RTP entries into a single range-based entry. Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260220-forupstream-wa_cleanup-v2-19-b12005a05af6@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Stable-dep-of: a4660bd94973 ("drm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN1") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe: Consolidate workaround entries for Wa_14019988906Matt Roper1-8/+4
[ Upstream commit c2142a1a841525d897ef69b3e6a5ab48183e1fcf ] Wa_14019988906 applies to all graphics versions from 20.01 through 20.04 (inclusive). Consolidate the RTP entries into a single range-based entry. Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260220-forupstream-wa_cleanup-v2-18-b12005a05af6@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Stable-dep-of: a4660bd94973 ("drm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN1") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe/pf: Fix CFI failure in debugfs accessMohanram Meenakshisundaram2-2/+6
[ Upstream commit 96bf49b526e2d03a2b7f6e861925a08f46ed0d28 ] Reading debugfs file (/sys/kernel/debug/dri/0/gt*/pf/adverse_events) with CFI (Control Flow Integrity) enabled, the kernel panics at xe_gt_debugfs_simple_show+0x82/0xc0. xe_gt_debugfs_simple_show() declare a function pointer expecting int return type, but xe_gt_sriov_pf_monitor_print_events() is void return type, leading to CFI failure and kernel panic. [507620.973657] CFI failure at xe_gt_debugfs_simple_show+0x82/0xc0 [xe] (target: xe_gt_sriov_pf_monitor_print_events+0x0/0x130 [xe]; expected type: 0xd72c7139) Fix xe_gt_sriov_pf_monitor_print_events() function by updating to return an int type. Fixes: 1c99d3d3edab ("drm/xe/pf: Expose PF monitor details via debugfs") Signed-off-by: Mohanram Meenakshisundaram <mohanram.meenakshisundaram@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260514174918.1556357-2-mohanram.meenakshisundaram@intel.com (cherry picked from commit ff1d386a8359746d9699ac30336e3b0684c68958) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe/vf: Fix signature of print functionsMichal Wajdeczko2-9/+21
[ Upstream commit 9bb2f1d7e6e58b8e434ddc2048c661bf87ccdf2a ] We have plugged-in existing VF print functions into our GT debugfs show helper as-is, but we missed that the helper expects functions to return int, while they were defined as void. This can lead to errors being reported when CFI is enabled. Fixes: 63d8cb8fe3dd ("drm/xe/vf: Expose SR-IOV VF attributes to GT debugfs") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Mohanram Meenakshisundaram <mohanram.meenakshisundaram@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260514155726.7165-1-michal.wajdeczko@intel.com (cherry picked from commit 314e31c9a8a1c421ee4f7f755b9348aefbbca090) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe/gsc: Fix double-free of managed BO in error pathShuicheng Lin1-4/+1
[ Upstream commit d3ded53fab90996e7d94a39049e11962dd066725 ] The error path in xe_gsc_init_post_hwconfig() explicitly frees a BO allocated with xe_managed_bo_create_pin_map() via xe_bo_unpin_map_no_vm(). Since the managed BO already has a devm cleanup action registered, this causes a double-free when devm unwinds during probe failure. Remove the explicit free and let devm handle it, consistent with all other xe_managed_bo_create_pin_map() callers. Fixes: 2e5d47fe7839 ("drm/xe/uc: Use managed bo for HuC and GSC objects") Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Assisted-by: Claude:claude-opus-4.6 Link: https://patch.msgid.link/20260511154134.223696-1-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 71d61e3e299a17139e47f980a4d6f425b2c59bf7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysdrm/xe/multi_queue: Fix secondary queue error caseNiranjana Vishwanathapura1-8/+8
commit 00907da2126ed785451b2a2f0fef282246dad104 upstream. If xe_lrc_create() fails, the secondary queue added to the multi-queue group list is not removed before freeing the queue. Fix error path handling for secondary queues by removing it from the multi-queue group list at the right place. Reported-by: Sebastian Österlund <sebastian.osterlund@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7979 Fixes: d716a5088c88 ("drm/xe/multi_queue: Handle tearing down of a multi queue") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260518191639.320890-2-niranjana.vishwanathapura@intel.com (cherry picked from commit d2d23c12789cf69eddc35b8d38cd8eaabd0168f1) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 daysdrm/xe/dma-buf: fix UAF with retry loopMatthew Auld1-37/+12
commit 155a372a1cc50fa93387c5d3cdfd614a61e1afd1 upstream. Retry doesn't work here, since bo will be freed on error, leading to UAF. However, now that we do the alloc & init before the attach, we can now combine this as one unit and have the init do the alloc for us. This should make the retry safe. Reported by Sashiko. v2: Fix up the error unwind (CI) Closes: https://sashiko.dev/#/patchset/20260506184332.86743-2-matthew.auld%40intel.com Fixes: eb289a5f6cc6 ("drm/xe: Convert xe_dma_buf.c for exhaustive eviction") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.18+ Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260508102635.149172-4-matthew.auld@intel.com (cherry picked from commit 479669418253e0f27f8cf5db01a731352ea592e7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 daysdrm/xe/dma-buf: handle empty bo and UAF racesMatthew Auld1-15/+16
commit 981bedbbe61364fcc3a3b87ebaf648a66cd07108 upstream. There look to be some nasty races here when triggering the invalidate_mappings hook: 1) We do xe_bo_alloc() followed by the attach, before the actual full bo init step in xe_dma_buf_init_obj(). However the bo is visible on the attachments list after the attach. This is bad since exporter driver, say amdgpu, can at any time call back into our invalidate_mappings hook, with an empty/bogus bo, leading to potential bugs/crashes. 2) Similar to 1) but here we get a UAF, when the invalidate_mappings hook is triggered. For example, we get as far as xe_bo_init_locked() but this fails in some way. But here the bo will be freed on error, but we still have it attached from dma-buf pov, so if the invalidate_mappings is now triggered then the bo we access is gone and we trigger UAF and more bugs/crashes. To fix this, move the attach step until after we actually have a fully set up buffer object. Note that the bo is not published to userspace until later, so not sure what the comment "Don't publish the bo until we have a valid attachment", is referring to. We have at least two different customers reporting hitting a NULL ptr deref in evict_flags when importing something from amdgpu, followed by triggering the evict flow. Hit rate is also pretty low, which would hint at some kind of race, so something like 1) or 2) might explain this. v2: - Shuffle the order of the ops slightly (no functional change) - Improve the comment to better explain the ordering (Matt B) Assisted-by: Gemini:gemini-3 #debug Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7903 Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/4055 Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260508102635.149172-3-matthew.auld@intel.com (cherry picked from commit af1f2ad0c59fe4e2f924c526f66e968289d77971) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 daysdrm/xe/xelp: Fix Wa_18022495364Tvrtko Ursulin1-1/+1
[ Upstream commit 7fe6cae2f7fad2b5166b0fc096618629f9e2ebcb ] It looks I mistyped CS_DEBUG_MODE2 as CS_DEBUG_MODE1 when adding the workaround. Fix it. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: ca33cd271ef9 ("drm/xe/xelp: Add Wa_18022495364") Cc: Matt Roper <matthew.d.roper@intel.com> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.18+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260116095040.49335-1-tvrtko.ursulin@igalia.com Stable-dep-of: 0df99689eb79 ("drm/xe/xelp: Fix Wa_18022495364") Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe/gsc: Fix BO leak on error in query_compatibility_version()Shuicheng Lin1-1/+1
[ Upstream commit 3762d6c36549accea7068c4a175483fafdd03657 ] When xe_gsc_read_out_header() fails, query_compatibility_version() returns directly instead of jumping to the out_bo label. This skips the xe_bo_unpin_map_no_vm() call, leaving the BO pinned and mapped with no remaining reference to free it. Fix by using goto out_bo so the error path properly cleans up the BO, consistent with the other error handling in the same function. Fixes: 0881cbe04077 ("drm/xe/gsc: Query GSC compatibility version") Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260417163308.3416147-1-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 8de86d0a843c32ca9d36864bdb92f0376a830bce) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe/eustall: Fix drm_dev_put called before stream disable in closeShuicheng Lin1-2/+2
[ Upstream commit dc2d9842c67d883d3200ae33b9c3859dd9492408 ] In xe_eu_stall_stream_close(), drm_dev_put() is called before the stream is disabled and its resources are freed. If this drops the last reference, the device structures could be freed while the subsequent cleanup code still accesses them, leading to a use-after-free. Fix this by moving drm_dev_put() after all device accesses are complete. This matches the ordering in xe_oa_release(). Fixes: 9a0b11d4cf3b ("drm/xe/eustall: Add support to init, enable and disable EU stall sampling") Cc: Harish Chegondi <harish.chegondi@intel.com> Assisted-by: Claude:claude-opus-4.6 Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Harish Chegondi <harish.chegondi@intel.com> Link: https://patch.msgid.link/20260415225428.3399934-1-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 35aff528f7297e949e5e19c9cd7fd748cf1cf21c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Fix error cleanup in xe_exec_queue_create_ioctl()Shuicheng Lin1-2/+5
[ Upstream commit f3cc22d4df3ed58439ea7e21daa54c3608e03b78 ] Two error handling issues exist in xe_exec_queue_create_ioctl(): 1. When xe_hw_engine_group_add_exec_queue() fails, the error path jumps to put_exec_queue which skips xe_exec_queue_kill(). If the VM is in preempt fence mode, xe_vm_add_compute_exec_queue() has already added the queue to the VM's compute exec queue list. Skipping the kill leaves the queue on that list, leading to a dangling pointer after the queue is freed. 2. When xa_alloc() fails after xe_hw_engine_group_add_exec_queue() has succeeded, the error path does not call xe_hw_engine_group_del_exec_queue() to remove the queue from the hw engine group list. The queue is then freed while still linked into the hw engine group, causing a use-after-free. Fix both by: - Changing the xe_hw_engine_group_add_exec_queue() failure path to jump to kill_exec_queue so that xe_exec_queue_kill() properly removes the queue from the VM's compute list. - Adding a del_hw_engine_group label before kill_exec_queue for the xa_alloc() failure path, which removes the queue from the hw engine group before proceeding with the rest of the cleanup. Fixes: 7970cb36966c ("'drm/xe/hw_engine_group: Register hw engine group's exec queues") Cc: Francois Dugast <francois.dugast@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408020647.3397933-1-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 37c831f401746a45d510b312b0ed7a77b1e06ec8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Fix potential NULL deref in ↵Shuicheng Lin1-1/+1
xe_exec_queue_tlb_inval_last_fence_put_unlocked [ Upstream commit f8c4151d50b12923b67819ebf03c1c6782c984c1 ] xe_exec_queue_tlb_inval_last_fence_put_unlocked() uses q->vm->xe as the first argument to xe_assert(). This function is called unconditionally from xe_exec_queue_destroy() for all queues, including kernel queues that have q->vm == NULL (e.g., queues created during GT init in xe_gt_record_default_lrcs() with vm=NULL). While current compilers optimize away the q->vm->xe dereference (even in CONFIG_DRM_XE_DEBUG=y builds, the compiler pushes the dereference into the WARN branch that is only taken when the assert condition is false), the code is semantically incorrect and constitutes undefined behavior in the C abstract machine for the NULL pointer case. Use gt_to_xe(q->gt) instead, which is always valid for any exec queue. This is consistent with how xe_exec_queue_destroy() itself obtains the xe_device pointer in its own xe_assert at the top of the function. Fixes: b2d7ec41f2a3 ("drm/xe: Attach last fence to TLB invalidation job queues") Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260409003449.3405767-1-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 96078a1c68bf97f17fd1d08c3f58f5c5cc9ccd65) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe/debugfs: Correct printing of register whitelist rangesMatt Roper1-1/+1
[ Upstream commit 03f2499c51dffce611b065b2894406beb9f2ebe0 ] The register-save-restore debugfs prints whitelist entries as offset ranges. E.g., REG[0x39319c-0x39319f]: allow read access for a single dword-sized register. However the GENMASK value used to set the lower bits to '1' for the upper bound of the whitelist range incorrectly included one more bit than it should have, causing the whitelist ranges to sometimes appear twice as large as they really were. For example, REG[0x6210-0x6217]: allow rw access was also intended to be a single dword-sized register whitelist (with a range 0x6210-0x6213) but was printed incorrectly as a qword-sized range because one too many bits was flipped on. Similar 'off by one' logic was applied when printing 4-dword register ranges and 64-dword register ranges as well. Correct the GENMASK logic to print these ranges in debugfs correctly. No impact outside of correcting the misleading debugfs output. Fixes: d855d2246ea6 ("drm/xe: Print whitelist while applying") Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260408-regsr_wl_range-v1-1-e9a28c8b4264@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 1a2a722ff96749734a5585dfe7f0bea7719caa8b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()Matthew Brost1-24/+9
[ Upstream commit a0fc362f095330f7b3f68ac0c55ef8da18290c87 ] xe_guc_submit_wedge() runs in the DMA-fence signaling path, where GFP_KERNEL memory allocations are not permitted. However, registering guc_submit_wedged_fini via drmm_add_action_or_reset() triggers such an allocation. Avoid this by moving the logic from guc_submit_wedged_fini() into guc_submit_fini(), where wedged exec queue references are dropped during normal teardown. Fixes: 8ed9aaae39f3 ("drm/xe: Force wedged state and block GT reset upon any GPU hang") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260326210116.202585-3-matthew.brost@intel.com (cherry picked from commit 4a706bd93c4fb156a13477e26ffdf2e633edeb10) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Use XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET enum instead of magic numberZhanjun Dong1-3/+4
[ Upstream commit a7f607610da721f77db358b09be8091e60bd8e89 ] Replace the magic number 2 with the proper enum value XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET for better code readability and maintainability. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-5-zhanjun.dong@intel.com Stable-dep-of: a0fc362f0953 ("drm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()") Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe/xe2_hpg: Drop invalid workaround Wa_15010599737Matt Roper1-4/+1
[ Upstream commit 1046bc7b416814833a43af8e66c52b0ea71c2021 ] Wa_15010599737 was a workaround originally proposed (and ultimately rejected) for DG2-G10. There's no record of it ever being relevant or even considered for any other platforms. The specific bit this workaround was setting is documented as "This bit should be set to 1 for the DX9 API and 0 for all other APIs" which means that it should almost always be left at the default value of 0 on Linux. The register itself is directly accessible from userspace, so in the special cases where it might be relevant (e.g., Wine/Proton running Windows DX9 apps), the userspace drivers already have the ability to change the setting without involvement of the kernel. Fixes: 7f3ee7d88058 ("drm/xe/xe2hpg: Add initial GT workarounds") Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://patch.msgid.link/20260223-forupstream-wa_cleanup-v3-2-7f201eb2f172@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Consolidate workaround entries for Wa_14019386621Matt Roper1-8/+4
[ Upstream commit f0d6d356f8ac427d1f3eb8fb783a64ac3efd6fc7 ] Wa_14019386621 applies to all graphics versions from 20.01 through 20.04 (inclusive). Consolidate the RTP entries into a single range-based entry. Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260220-forupstream-wa_cleanup-v2-17-b12005a05af6@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Stable-dep-of: 1046bc7b4168 ("drm/xe/xe2_hpg: Drop invalid workaround Wa_15010599737") Signed-off-by: Sasha Levin <sashal@kernel.org>
13 daysdrm/xe: Consolidate workaround entries for Wa_14019877138Matt Roper1-16/+4
[ Upstream commit 55b19abb6c44db40fe1ebd01e9c16aa02c4cf663 ] Wa_14019877138 applies to all graphics versions from 12.55 through 20.04 (inclusive) that have a render engine. Consolidate the RTP entries into a single range-based entry. Note that the DG2 entry for this workaround was missing an ENGINE_CLASS(RENDER) rule; that mistake is fixed by this consolidation. Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260220-forupstream-wa_cleanup-v2-16-b12005a05af6@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Stable-dep-of: 1046bc7b4168 ("drm/xe/xe2_hpg: Drop invalid workaround Wa_15010599737") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-17drm/xe/uapi: Reject coh_none PAT index for CPU cached memory in madviseJia Yao1-0/+47
commit 4e5591c2fc1b30f4ea5e2eab4c3a695acc404e39 upstream. Add validation in xe_vm_madvise_ioctl() to reject PAT indices with XE_COH_NONE coherency mode when applied to CPU cached memory. Using coh_none with CPU cached buffers is a security issue. When the kernel clears pages before reallocation, the clear operation stays in CPU cache (dirty). GPU with coh_none can bypass CPU caches and read stale sensitive data directly from DRAM, potentially leaking data from previously freed pages of other processes. This aligns with the existing validation in vm_bind path (xe_vm_bind_ioctl_validate_bo). v2(Matthew brost) - Add fixes - Move one debug print to better place v3(Matthew Auld) - Should be drm/xe/uapi - More Cc v4(Shuicheng Lin) - Fix kmem leak issues by the way v5 - Remove kmem leak because it has been merged by another patch v6 - Remove the fix which is not related to current fix v7 - No change v8 - Rebase v9 - Limit the restrictions to iGPU v10 - No change Fixes: ada7486c5668 ("drm/xe: Implement madvise ioctl for xe") Cc: <stable@vger.kernel.org> # v6.18+ Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Mathew Alwin <alwin.mathew@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Jia Yao <jia.yao@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260417055917.2027459-2-jia.yao@intel.com (cherry picked from commit 016ccdb674b8c899940b3944952c96a6a490d10a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-17drm/xe/bo: Fix bo leak on unaligned size validation in xe_bo_init_locked()Shuicheng Lin1-1/+3
commit 09a8f3c1c11977a6e10c167f26dd298790b31c32 upstream. When type is ttm_bo_type_device and aligned_size != size, the function returns an error without freeing a caller-provided bo, violating the documented contract that bo is freed on failure. Add xe_bo_free(bo) before returning the error. Fixes: 4e03b584143e ("drm/xe/uapi: Reject bo creation of unaligned size") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408175255.3402838-2-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 601c2aa087b6f21014300a3f107a08ee4dde7bdf) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-17drm/xe: Fix dma-buf attachment leak in xe_gem_prime_import()Shuicheng Lin1-4/+7
commit 111ab678471bf1f90d078d5513bb086b70596c3c upstream. When xe_dma_buf_init_obj() fails, the attachment from dma_buf_dynamic_attach() is not detached. Add dma_buf_detach() before returning the error. Note: we cannot use goto out_err here because xe_dma_buf_init_obj() already frees bo on failure, and out_err would double-free it. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Mattheq Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408175255.3402838-5-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit a828eb185aac41800df8eae4b60501ccc0dbbe51) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-17drm/xe/bo: Fix bo leak on GGTT flag validation in xe_bo_init_locked()Shuicheng Lin1-1/+3
commit 1d0adf2fd94fb0c0037c643fadd8f2cf3cffc009 upstream. When XE_BO_FLAG_GGTT_ALL is set without XE_BO_FLAG_GGTT, the function returns an error without freeing a caller-provided bo, violating the documented contract that bo is freed on failure. Add xe_bo_free(bo) before returning the error. Fixes: 5a3b0df25d6a ("drm/xe: Allow bo mapping on multiple ggtts") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408175255.3402838-3-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 3fbd6cf43cac7b60757f3ce3d95195d3843a902c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-17drm/xe: Fix bo leak in xe_dma_buf_init_obj() on allocation failureShuicheng Lin1-1/+11
commit 93a528f67ce5095bcab46a69839eca97f43dd352 upstream. When drm_gpuvm_resv_object_alloc() fails, the pre-allocated storage bo is not freed. Add xe_bo_free(storage) before returning the error. xe_dma_buf_init_obj() calls xe_bo_init_locked(), which frees the bo on error. Therefore, xe_dma_buf_init_obj() must also free the bo on its own error paths. Otherwise, since xe_gem_prime_import() cannot distinguish whether the failure originated from xe_dma_buf_init_obj() or from xe_bo_init_locked(), it cannot safely decide whether the bo should be freed. Add comments documenting the ownership semantics: on success, ownership of storage is transferred to the returned drm_gem_object; on failure, storage is freed before returning. v2: Add comments to explain the free logic. Fixes: eb289a5f6cc6 ("drm/xe: Convert xe_dma_buf.c for exhaustive eviction") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4.6 Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408175255.3402838-4-shuicheng.lin@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> (cherry picked from commit 78a6c5f899f22338bbf48b44fb8950409c5a69b9) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-17drm/xe/hdcp: Add NULL check for media_gt in intel_hdcp_gsc_check_status()Gustavo Sousa1-2/+10
commit 60a1e131a811b68703da58fd805ab359b704ab03 upstream. When media GT is disabled via configfs, there is no allocation for media_gt, which is kept as NULL. In such scenario, intel_hdcp_gsc_check_status() results in a kernel pagefault error due to &gt->uc.gsc being evaluated as an invalid memory address. Fix that by introducing a NULL check on media_gt and bailing out early if so. While at it, also drop the NULL check for gsc, since it can't be NULL if media_gt is not NULL. v2: - Get address for gsc only after checking that gt is not NULL. (Shuicheng) - Drop the NULL check for gsc. (Shuicheng) v3: - Add "Fixes" and "Cc: <stable...>" tags. (Matt) Fixes: 4af50beb4e0f ("drm/xe: Use gsc_proxy_init_done to check proxy status") Cc: <stable@vger.kernel.org> # v6.10+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260416-check-for-null-media_gt-in-intel_hdcp_gsc_check_status-v2-1-9adb9fd3b621@intel.com Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> (cherry picked from commit bfaf87e84ca3ca3f6e275f9ae56da47a8b55ffd1) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-07drm/xe: Fix bug in idledly unit conversionVinay Belgaumkar1-2/+1
We only need to convert to picosecond units before writing to RING_IDLEDLY. Fixes: 7c53ff050ba8 ("drm/xe: Apply Wa_16023105232") Cc: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Acked-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patch.msgid.link/20260401012710.4165547-1-vinay.belgaumkar@intel.com (cherry picked from commit 13743bd628bc9d9a0e2fe53488b2891aedf7cc74) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe: Avoid memory allocations in xe_device_declare_wedged()Matthew Brost1-14/+13
xe_device_declare_wedged() runs in the DMA-fence signaling path, where GFP_KERNEL memory allocations are not allowed. However, registering xe_device_wedged_fini via drmm_add_action_or_reset() triggers a GFP_KERNEL allocation. Fix this by deferring the registration of xe_device_wedged_fini until late in the driver load sequence. Additionally, drop the wedged PM reference only if the device is actually wedged in xe_device_wedged_fini. Fixes: 452bca0edbd0 ("drm/xe: Don't suspend device upon wedge") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260326210116.202585-2-matthew.brost@intel.com (cherry picked from commit b08ceb443866808b881b12d4183008d214d816c1) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe: Disable garbage collector work item on SVM closeMatthew Brost1-1/+1
When an SVM is closed, the garbage collector work item must be stopped synchronously and any future queuing must be prevented. Replace flush_work() with disable_work_sync() to ensure both conditions are met. Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260227015225.3081787-1-matthew.brost@intel.com (cherry picked from commit 2247feb9badca5a4774df9a437bfc44fba4f22de) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Don't allow PXP on older PTL GSC FWsDaniele Ceraolo Spurio1-0/+12
On PTL, older GSC FWs have a bug that can cause them to crash during PXP invalidation events, which leads to a complete loss of power management on the media GT. Therefore, we can't use PXP on FWs that have this bug, which was fixed in PTL GSC build 1396. Fixes: b1dcec9bd8a1 ("drm/xe/ptl: Enable PXP for PTL") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-10-daniele.ceraolospurio@intel.com (cherry picked from commit 6eb04caaa972934c9b6cea0e0c29e466bf9a346f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Clear restart flag in pxp_start after jumping backDaniele Ceraolo Spurio1-1/+3
If we don't clear the flag we'll keep jumping back at the beginning of the function once we reach the end. Fixes: ccd3c6820a90 ("drm/xe/pxp: Decouple queue addition from PXP start") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-9-daniele.ceraolospurio@intel.com (cherry picked from commit 0850ec7bb2459602351639dccf7a68a03c9d1ee0) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Remove incorrect handling of impossible state during suspendDaniele Ceraolo Spurio1-6/+0
The default case of the PXP suspend switch is incorrectly exiting without releasing the lock. However, this case is impossible to hit because we're switching on an enum and all the valid enum values have their own cases. Therefore, we can just get rid of the default case and rely on the compiler to warn us if a new enum value is added and we forget to add it to the switch. Fixes: 51462211f4a9 ("drm/xe/pxp: add PXP PM support") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-8-daniele.ceraolospurio@intel.com (cherry picked from commit f1b5a77fc9b6a90cd9a5e3db9d4c73ae1edfcfac) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Clean up termination status on failureDaniele Ceraolo Spurio1-0/+1
If the PXP HW termination fails during PXP start, the normal completion code won't be called, so the termination will remain uncomplete. To avoid unnecessary waits, mark the termination as completed from the error path. Note that we already do this if the termination fails when handling a termination irq from the HW. Fixes: f8caa80154c4 ("drm/xe/pxp: Add PXP queue tracking and session start") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-7-daniele.ceraolospurio@intel.com (cherry picked from commit 5d9e708d2a69ab1f64a17aec810cd7c70c5b9fab) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/madvise: Accept canonical GPU addresses in xe_vm_madvise_ioctlArvind Yadav1-4/+12
Userspace passes canonical (sign-extended) GPU addresses where bits 63:48 mirror bit 47. The internal GPUVM uses non-canonical form (upper bits zeroed), so passing raw canonical addresses into GPUVM lookups causes mismatches for addresses above 128TiB. Strip the sign extension with xe_device_uncanonicalize_addr() at the top of xe_vm_madvise_ioctl(). Non-canonical addresses are unaffected. Fixes: ada7486c5668 ("drm/xe: Implement madvise ioctl for xe") Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Arvind Yadav <arvind.yadav@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260326130843.3545241-13-arvind.yadav@intel.com (cherry picked from commit 05c8b1cdc54036465ea457a0501a8c2f9409fce7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/xe_pagefault: Disallow writes to read-only VMAsJonathan Cavitt1-0/+6
The page fault handler should reject write/atomic access to read only VMAs. Add code to handle this in xe_pagefault_service after the VMA lookup. v2: - Apply max line length (Matthew) Fixes: fb544b844508 ("drm/xe: Implement xe_pagefault_queue_work") Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260324152935.72444-7-jonathan.cavitt@intel.com (cherry picked from commit 714ee6754ac5fa3dc078856a196a6b124cd797a0) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-25drm/xe: always keep track of remap prev/nextMatthew Auld3-10/+28
During 3D workload, user is reporting hitting: [ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925 [ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy) [ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe] [ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282 [ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000 [ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000 [ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380 [ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380 [ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000 [ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0 [ 413.362088] PKRU: 55555554 [ 413.362089] Call Trace: [ 413.362092] <TASK> [ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe] Which seems to hint that the vma we are re-inserting for the ops unwind is either invalid or overlapping with something already inserted in the vm. It shouldn't be invalid since this is a re-insertion, so must have worked before. Leaving the likely culprit as something already placed where we want to insert the vma. Following from that, for the case where we do something like a rebind in the middle of a vma, and one or both mapped ends are already compatible, we skip doing the rebind of those vma and set next/prev to NULL. As well as then adjust the original unmap va range, to avoid unmapping the ends. However, if we trigger the unwind path, we end up with three va, with the two ends never being removed and the original va range in the middle still being the shrunken size. If this occurs, one failure mode is when another unwind op needs to interact with that range, which can happen with a vector of binds. For example, if we need to re-insert something in place of the original va. In this case the va is still the shrunken version, so when removing it and then doing a re-insert it can overlap with the ends, which were never removed, triggering a warning like above, plus leaving the vm in a bad state. With that, we need two things here: 1) Stop nuking the prev/next tracking for the skip cases. Instead relying on checking for skip prev/next, where needed. That way on the unwind path, we now correctly remove both ends. 2) Undo the unmap va shrinkage, on the unwind path. With the two ends now removed the unmap va should expand back to the original size again, before re-insertion. v2: - Update the explanation in the commit message, based on an actual IGT of triggering this issue, rather than conjecture. - Also undo the unmap shrinkage, for the skip case. With the two ends now removed, the original unmap va range should expand back to the original range. v3: - Track the old start/range separately. vma_size/start() uses the va info directly. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602 Fixes: 8f33b4f054fc ("drm/xe: Avoid doing rebinds") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260318100208.78097-2-matthew.auld@intel.com (cherry picked from commit aec6969f75afbf4e01fd5fb5850ed3e9c27043ac) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-24drm/xe: Implement recent spec updates to Wa_16025250150Matt Roper2-1/+3
The hardware teams noticed that the originally documented workaround steps for Wa_16025250150 may not be sufficient to fully avoid a hardware issue. The workaround documentation has been augmented to suggest programming one additional register; make the corresponding change in the driver. Fixes: 7654d51f1fd8 ("drm/xe/xe2hpg: Add Wa_16025250150") Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Link: https://patch.msgid.link/20260319-wa_16025250150_part2-v1-1-46b1de1a31b2@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit a31566762d4075646a8a2214586158b681e94305) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-23drm/xe/pf: Fix use-after-free in migration restoreMichał Winiarski1-0/+2
When an error is returned from xe_sriov_pf_migration_restore_produce(), the data pointer is not set to NULL, which can trigger use-after-free in subsequent .write() calls. Set the pointer to NULL upon error to fix the problem. Fixes: 1ed30397c0b92 ("drm/xe/pf: Add support for encap/decap of bitstream to/from packet") Reported-by: Sebastian Österlund <sebastian.osterlund@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7230 Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260217154118.176902-1-michal.winiarski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> (cherry picked from commit 4f53d8c6d23527d734fe3531d08e15cb170a0819) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-19drm/xe: Fix missing runtime PM reference in ccs_mode_storeSanjay Yadav1-0/+2
ccs_mode_store() calls xe_gt_reset() which internally invokes xe_pm_runtime_get_noresume(). That function requires the caller to already hold an outer runtime PM reference and warns if none is held: [46.891177] xe 0000:03:00.0: [drm] Missing outer runtime PM protection [46.891178] WARNING: drivers/gpu/drm/xe/xe_pm.c:885 at xe_pm_runtime_get_noresume+0x8b/0xc0 Fix this by protecting xe_gt_reset() with the scope-based guard(xe_pm_runtime)(xe), which is the preferred form when the reference lifetime matches a single scope. v2: - Use scope-based guard(xe_pm_runtime)(xe) (Shuicheng) - Update commit message accordingly Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7593 Fixes: 480b358e7d8e ("drm/xe: Do not wake device during a GT reset") Cc: <stable@vger.kernel.org> # v6.19+ Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260313071608.3459480-2-sanjay.kumar.yadav@intel.com (cherry picked from commit 7937ea733f79b3f25e802a0c8360bf7423856f36) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe: Open-code GGTT MMIO access protectionMatthew Brost2-7/+8
GGTT MMIO access is currently protected by hotplug (drm_dev_enter), which works correctly when the driver loads successfully and is later unbound or unloaded. However, if driver load fails, this protection is insufficient because drm_dev_unplug() is never called. Additionally, devm release functions cannot guarantee that all BOs with GGTT mappings are destroyed before the GGTT MMIO region is removed, as some BOs may be freed asynchronously by worker threads. To address this, introduce an open-coded flag, protected by the GGTT lock, that guards GGTT MMIO access. The flag is cleared during the dev_fini_ggtt devm release function to ensure MMIO access is disabled once teardown begins. Cc: stable@vger.kernel.org Fixes: 919bb54e989c ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node") Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-8-zhanjun.dong@intel.com (cherry picked from commit 4f3a998a173b4325c2efd90bdadc6ccd3ad9a431) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe/lrc: Fix uninitialized new_ts when capturing context timestampUmesh Nerlige Ramappa1-2/+2
Getting engine specific CTX TIMESTAMP register can fail. In that case, if the context is active, new_ts is uninitialized. Fix that case by initializing new_ts to the last value that was sampled in SW - lrc->ctx_timestamp. Flagged by static analysis. v2: Fix new_ts initialization (Ashutosh) Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR") Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20260312125308.3126607-2-umesh.nerlige.ramappa@intel.com (cherry picked from commit 466e75d48038af252187855058a7a9312db9d2f8) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe/oa: Allow reading after disabling OA streamAshutosh Dixit1-2/+5
Some OA data might be present in the OA buffer when OA stream is disabled. Allow UMD's to retrieve this data, so that all data till the point when OA stream is disabled can be retrieved. v2: Update tail pointer after disable (Umesh) Fixes: efb315d0a013 ("drm/xe/oa/uapi: Read file_operation") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Umesh Nerlige Ramappa<umesh.nerlige.ramappa@intel.com> Link: https://patch.msgid.link/20260313053630.3176100-1-ashutosh.dixit@intel.com (cherry picked from commit 4ff57c5e8dbba23b5457be12f9709d5c016da16e) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe: Skip over non leaf pte for PRL generationBrian Nguyen1-9/+29
The check using xe_child->base.children was insufficient in determining if a pte was a leaf node. So explicitly skip over every non-leaf pt and conditionally abort if there is a scenario where a non-leaf pt is interleaved between leaf pt, which results in the page walker skipping over some leaf pt. Note that the behavior being targeted for abort is PD[0] = 2M PTE PD[1] = PT -> 512 4K PTEs PD[2] = 2M PTE results in abort, page walker won't descend PD[1]. With new abort, ensuring valid PRL before handling a second abort. v2: - Revert to previous assert. - Revised non-leaf handling for interleaf child pt and leaf pte. - Update comments to specifications. (Stuart) - Remove unnecessary XE_PTE_PS64. (Matthew B) v3: - Modify secondary abort to only check non-leaf PTEs. (Matthew B) Fixes: b912138df299 ("drm/xe: Create page reclaim list on unbind") Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260305171546.67691-6-brian3.nguyen@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 1d123587525db86cc8f0d2beb35d9e33ca3ade83) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe/guc: Ensure CT state transitions via STOP before DISABLEDZhanjun Dong1-0/+1
The GuC CT state transition requires moving to the STOP state before entering the DISABLED state. Update the driver teardown sequence to make the proper state machine transitions. Fixes: ee4b32220a6b ("drm/xe/guc: Add devm release action to safely tear down CT") Cc: stable@vger.kernel.org Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-6-zhanjun.dong@intel.com (cherry picked from commit dace8cb0032f57ea67c87b3b92ad73c89dd2db44) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe: Trigger queue cleanup if not in wedged mode 2Matthew Brost1-13/+22
The intent of wedging a device is to allow queues to continue running only in wedged mode 2. In other modes, queues should initiate cleanup and signal all remaining fences. Fix xe_guc_submit_wedge to correctly clean up queues when wedge mode != 2. Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device") Cc: stable@vger.kernel.org Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-4-zhanjun.dong@intel.com (cherry picked from commit e25ba41c8227c5393c16e4aab398076014bd345f) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe: Forcefully tear down exec queues in GuC submit finiMatthew Brost3-12/+63
In GuC submit fini, forcefully tear down any exec queues by disabling CTs, stopping the scheduler (which cleans up lost G2H), killing all remaining queues, and resuming scheduling to allow any remaining cleanup actions to complete and signal any remaining fences. Split guc_submit_fini into device related and software only part. Using device-managed and drm-managed action guarantees the correct ordering of cleanup. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-3-zhanjun.dong@intel.com (cherry picked from commit a6ab444a111a59924bd9d0c1e0613a75a0a40b89) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-03-19drm/xe: Always kill exec queues in xe_guc_submit_pause_abortMatthew Brost1-2/+1
xe_guc_submit_pause_abort is intended to be called after something disastrous occurs (e.g., VF migration fails, device wedging, or driver unload) and should immediately trigger the teardown of remaining submission state. With that, kill any remaining queues in this function. Fixes: 7c4b7e34c83b ("drm/xe/vf: Abort VF post migration recovery on failure") Cc: stable@vger.kernel.org Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260310225039.1320161-2-zhanjun.dong@intel.com (cherry picked from commit 78f3bf00be4f15daead02ba32d4737129419c902) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>