| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit bfe9e314d7574d1c5c851972e7aee342733819d2 ]
During 3D workload, user is reporting hitting:
[ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925
[ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy)
[ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe]
[ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282
[ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000
[ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000
[ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380
[ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380
[ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000
[ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0
[ 413.362088] PKRU: 55555554
[ 413.362089] Call Trace:
[ 413.362092] <TASK>
[ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe]
Which seems to hint that the vma we are re-inserting for the ops unwind
is either invalid or overlapping with something already inserted in the
vm. It shouldn't be invalid since this is a re-insertion, so must have
worked before. Leaving the likely culprit as something already placed
where we want to insert the vma.
Following from that, for the case where we do something like a rebind in
the middle of a vma, and one or both mapped ends are already compatible,
we skip doing the rebind of those vma and set next/prev to NULL. As well
as then adjust the original unmap va range, to avoid unmapping the ends.
However, if we trigger the unwind path, we end up with three va, with
the two ends never being removed and the original va range in the middle
still being the shrunken size.
If this occurs, one failure mode is when another unwind op needs to
interact with that range, which can happen with a vector of binds. For
example, if we need to re-insert something in place of the original va.
In this case the va is still the shrunken version, so when removing it
and then doing a re-insert it can overlap with the ends, which were
never removed, triggering a warning like above, plus leaving the vm in a
bad state.
With that, we need two things here:
1) Stop nuking the prev/next tracking for the skip cases. Instead
relying on checking for skip prev/next, where needed. That way on the
unwind path, we now correctly remove both ends.
2) Undo the unmap va shrinkage, on the unwind path. With the two ends
now removed the unmap va should expand back to the original size again,
before re-insertion.
v2:
- Update the explanation in the commit message, based on an actual IGT of
triggering this issue, rather than conjecture.
- Also undo the unmap shrinkage, for the skip case. With the two ends
now removed, the original unmap va range should expand back to the
original range.
v3:
- Track the old start/range separately. vma_size/start() uses the va
info directly.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602
Fixes: 8f33b4f054fc ("drm/xe: Avoid doing rebinds")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260318100208.78097-2-matthew.auld@intel.com
(cherry picked from commit aec6969f75afbf4e01fd5fb5850ed3e9c27043ac)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[ adapted function signatures ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 01f2557aa684e514005541e71a3d01f4cd45c170 upstream.
GGTT MMIO access is currently protected by hotplug (drm_dev_enter),
which works correctly when the driver loads successfully and is later
unbound or unloaded. However, if driver load fails, this protection is
insufficient because drm_dev_unplug() is never called.
Additionally, devm release functions cannot guarantee that all BOs with
GGTT mappings are destroyed before the GGTT MMIO region is removed, as
some BOs may be freed asynchronously by worker threads.
To address this, introduce an open-coded flag, protected by the GGTT
lock, that guards GGTT MMIO access. The flag is cleared during the
dev_fini_ggtt devm release function to ensure MMIO access is disabled
once teardown begins.
Cc: stable@vger.kernel.org
Fixes: 919bb54e989c ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260310225039.1320161-8-zhanjun.dong@intel.com
(cherry picked from commit 4f3a998a173b4325c2efd90bdadc6ccd3ad9a431)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9be6fd9fbd2032b683e51374497768af9aaa228a upstream.
Some OA data might be present in the OA buffer when OA stream is
disabled. Allow UMD's to retrieve this data, so that all data till the
point when OA stream is disabled can be retrieved.
v2: Update tail pointer after disable (Umesh)
Fixes: efb315d0a013 ("drm/xe/oa/uapi: Read file_operation")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Umesh Nerlige Ramappa<umesh.nerlige.ramappa@intel.com>
Link: https://patch.msgid.link/20260313053630.3176100-1-ashutosh.dixit@intel.com
(cherry picked from commit 4ff57c5e8dbba23b5457be12f9709d5c016da16e)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1bfd7575092420ba5a0b944953c95b74a5646ff8 upstream.
xe_sync_entry_parse() can allocate references (syncobj, fence, chain fence,
or user fence) before hitting a later failure path. Several of those paths
returned directly, leaving partially initialized state and leaking refs.
Route these error paths through a common free_sync label and call
xe_sync_entry_cleanup(sync) before returning the error.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260219233516.2938172-5-shuicheng.lin@intel.com
(cherry picked from commit f939bdd9207a5d1fc55cced5459858480686ce22)
Cc: stable@vger.kernel.org
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3091723785def05ebfe6a50866f87a044ae314ba ]
Free the newly allocated entry when xa_store() fails to avoid a memory
leak on the error path.
v2: use goto fail_free. (Bala)
Fixes: e5283bd4dfec ("drm/xe/reg_sr: Remove register pool")
Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260204172810.1486719-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 6bc6fec71ac45f52db609af4e62bdb96b9f5fadb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cdc8a1e11f4d5b480ec750e28010c357185b95a6 ]
If a batch buffer is complete, it makes little sense to preempt the
fence signaling instructions in the ring, as the largest portion of the
work (the batch buffer) is already done and fence signaling consists of
only a few instructions. If these instructions are preempted, the GuC
would need to perform a context switch just to signal the fence, which
is costly and delays fence signaling. Avoid this scenario by disabling
preemption immediately after the BB start instruction and re-enabling it
after executing the fence signaling instructions.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patch.msgid.link/20260115004546.58060-1-matthew.brost@intel.com
(cherry picked from commit 2bcbf2dcde0c839a73af664a3c77d4e77d58a3eb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dd1ef5e2456558876244795bb22a4d90cb24f160 ]
If the firmware is not running during TDR (e.g., when the driver is
unloading), there's no need to toggle scheduling in the GuC. In such
cases, skip this step.
v4:
- Bail on wait UC not running (Niranjana)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Link: https://patch.msgid.link/20260110012739.2888434-4-matthew.brost@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bc6387a2e0c1562faa56ce2a98cef50cab809e08 ]
The PSS_CHICKEN register has been part of the RCS engine's LRC since it
was first introduced in Xe_LP. That means that any workarounds that
adjust its value (such as Wa_14019988906 and Wa_14019877138) need to be
implemented in the lrc_was[] table so that they become part of the
default LRC from which all subsequent LRCs are copied. Although these
workarounds were implemented correctly on most platforms, they were
incorrectly placed on the engine_was[] table for Xe2_HPG.
Move the workarounds to the proper lrc_was[] table and switch the
'xe_rtp_match_first_render_or_compute' rule to specifically match the
RCS since that's the engine whose LRC manages the register.
Bspec: 65182
Fixes: 7f3ee7d88058 ("drm/xe/xe2hpg: Add initial GT workarounds")
Reviewed-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Link: https://patch.msgid.link/20260205220508.51905-2-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit e04c609eedf4d6748ac0bcada4de1275b034fed6)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a5d221924e13a22c83b682410dcf72422d1c68db ]
Add set of workarounds for xe2_hpg.
-v2: Fix xe2_hpg GMD version for some workarounds.
-v3: Removed extra Workaround (Matt Roper)
Signed-off-by: Shekhar Chauhan <shekhar.chauhan@intel.com>
Signed-off-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://lore.kernel.org/r/20250605190804.1287289-3-dnyaneshwar.bhadane@intel.com
Stable-dep-of: bc6387a2e0c1 ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dddc53806dd2a10e210d5ea08caec6d3f92440b2 ]
Extend Wa_13011645652 to PTL.
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250116184659.384874-1-vinay.belgaumkar@intel.com
Stable-dep-of: bc6387a2e0c1 ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906 & Wa_14019877138")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4a9b4e1fa52a6aaa1adbb7f759048df14afed54c ]
xe_mmio_read64_2x32() was adjusting register addresses and then
calling xe_mmio_read32(), which applies the adjustment again.
This may shift accesses twice if adj_offset < adj_limit. There is
no issue currently, as for media gt, adj_offset > adj_limit, so
the 2nd adjust will be a no-op. But it may not work in future.
To fix it, replace the adjusted-address comparison with a direct
sanity check that ensures the MMIO address adjustment cutoff never
falls within the 8-byte range of a 64-bit register. And let
xe_mmio_read32() handle address translation.
v2: rewrite the sanity check in a more natural way. (Matt)
v3: Add Fixes tag. (Jani)
Fixes: 07431945d8ae ("drm/xe: Avoid 64-bit register reads")
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://patch.msgid.link/20260130165621.471408-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit a30f999681126b128a43137793ac84b6a5b7443f)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a84590c5ceb354d2e9f7f6812cfb3a9709e14afa ]
Since much of the MMIO register access done by the driver is to non-GT
registers, use of 'xe_gt' in these interfaces has been a long-standing
design flaw that's been hard to disentangle.
To avoid a flag day across the whole driver, munge the function names
and add temporary compatibility macros with the original function names
that can accept either the new xe_mmio or the old xe_gt structure as a
parameter. This will allow us to slowly convert parts of the driver
over to the new interface independently.
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-54-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6fb5d1a1d376910700d054d13cefbf0812b444a9 ]
Although we want to break the GT-centric nature of the MMIO code in the
general driver, the SRIOV handling still relies on data in a VF
substructure of the GT. So add a GT backpointer, but name it
sriov_vf_gt to make it clear that it's only for this one specific
special case and will not be set or usable for anything else.
v2:
- Store backpointer to the GT itself rather than the SRIOV-specific
substructure. (Michal)
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> # v1
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-53-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1877c88fa9b9bdbce7a65d7cbd2aa4e29bb514af ]
Once MMIO operations stop being (incorrectly) tied to a GT, we'll still
need a backpointer for feature checks, message logging, and tracepoints.
Use a tile backpointer since that may allow the most useful debugging
output, while also providing access to the xe_device.
v2:
- Make backpointer an xe_tile instead of xe_device. (Michal)
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> # v1
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-52-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 960a83799f5bb8634755f0593c591c53ff4acee8 ]
The mmio_ext stuff is completely unused right now, but it isn't
providing any functionality that couldn't be treated as a regular mmio
space.
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-51-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fa599b8c95a7070430703f4908a50141f2c7088c ]
Each GT should share the same register iomap as its parent tile. Future
patches will switch to access the iomap through the GT's mmio substruct
rather than through the tile.
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-50-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9d383916a552784ec35e6d25469fc2da9bcd9948 ]
By moving the GSI adjustment fields into 'struct xe_mmio' we can replace
the GT's MMIO substructure with another instance of xe_mmio. At the
moment this means MMIO operations wind up pulling information from two
different places (the tile's xe_mmio for the iomap and the GT's xe_mmio
for the adjustment), but we'll address that in future patches.
The type headers change a bit with this change, meaning that various
files should be including xe_device_types.h instead of (or in addition
to) xe_gt_types.h.
v2:
- Fix pre-existing kerneldoc typo while moving the fields (Lucas)
v3:
- Add missing '@' in kerneldoc. (Rodrigo)
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-49-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d4aff99aefa2a3c8999a98f0d52a977b284b9ec9 ]
xe_mmio currently has a size parameter that is assigned but never used
anywhere. The current values assigned appear to be the size of the BAR
region assigned for the tile (both for registers and other purposes such
as the GGTT). Since the current field isn't being used for anything,
change the assignments to 4MB (the size of the register region on all
current platform) and rename the field to 'regs_size' to more clearly
describe what it represents. We can use this value in later patches to
help ensure no register accesses accidentally go past the end of the
desired register space (which might not be caught easily if they still
fall within the iomap).
v2:
- s/regs_length/regs_size/ (Lucas)
- Clarify kerneldoc description (Lucas)
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-48-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 34953ee349dde9d1733d4af75e929f7fd5fab539 ]
Pull the 'mmio' substructure from xe_tile out into a dedicated type.
Future patches will expand this structure and then eventually move MMIO
read/write operations over to using this type.
v2:
- Fix kerneldoc of 'size' field. The rename/refocusing of this field
got moved to the next patch of the series. (Lucas)
- Correct commit message; it's the tile, not the device, mmio that's
been pulled out to a separate type. (Michal)
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-47-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 998fde0647671c82f637e299026d951f9b155b37 ]
Forcewake is a general GT power management concept that isn't specific
to MMIO register access. Move the forcewake information for a GT out of
the 'mmio' substruct and into a 'pm' substruct. Also use the gt_to_fw()
helper in a few more places where it was being open-coded.
v2:
- Kerneldoc tweaks. (Lucas)
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-46-matthew.d.roper@intel.com
Stable-dep-of: 4a9b4e1fa52a ("drm/xe/mmio: Avoid double-adjust in 64-bit reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 96c2c72b817d70e8d110e78b0162e044a0c41f9f ]
Call drm_dev_unregister() when xe_device_probe() fails after successful
drm_dev_register(). This ensures the DRM device is promptly unregistered
before returning an error, avoiding leaving it registered on the failure
path.
Otherwise, there is warn message if xe_device_probe() is called again:
"
[ 207.322365] [drm:drm_minor_register]
[ 207.322381] debugfs: '128' already exists in 'dri'
[ 207.322432] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/renderD128'
[ 207.322435] CPU: 5 UID: 0 PID: 10261 Comm: modprobe Tainted: G B W 6.19.0-rc2-lgci-xe-kernel+ #223 PREEMPT(voluntary)
[ 207.322439] Tainted: [B]=BAD_PAGE, [W]=WARN
[ 207.322440] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
[ 207.322441] Call Trace:
[ 207.322442] <TASK>
[ 207.322443] dump_stack_lvl+0xa0/0xc0
[ 207.322446] dump_stack+0x10/0x20
[ 207.322448] sysfs_warn_dup+0xd5/0x110
[ 207.322451] sysfs_create_dir_ns+0x1f6/0x280
[ 207.322453] ? __pfx_sysfs_create_dir_ns+0x10/0x10
[ 207.322455] ? lock_acquire+0x1a4/0x2e0
[ 207.322458] ? __kasan_check_read+0x11/0x20
[ 207.322461] kobject_add_internal+0x28d/0x8e0
[ 207.322464] kobject_add+0x11f/0x1f0
[ 207.322465] ? lock_acquire+0x1a4/0x2e0
[ 207.322467] ? __pfx_kobject_add+0x10/0x10
[ 207.322469] ? __kasan_check_write+0x14/0x20
[ 207.322471] ? kobject_put+0x62/0x4a0
[ 207.322473] ? get_device_parent.isra.0+0x1bb/0x4c0
[ 207.322475] ? kobject_put+0x62/0x4a0
[ 207.322477] device_add+0x2d7/0x1500
[ 207.322479] ? __pfx_device_add+0x10/0x10
[ 207.322481] ? drm_debugfs_add_file+0xfa/0x170
[ 207.322483] ? drm_debugfs_add_files+0x82/0xd0
[ 207.322485] ? drm_debugfs_add_files+0x82/0xd0
[ 207.322487] drm_minor_register+0x10a/0x2d0
[ 207.322489] drm_dev_register+0x143/0x860
[ 207.322491] ? xe_configfs_get_psmi_enabled+0x12/0x90 [xe]
[ 207.322667] xe_device_probe+0x185b/0x2c40 [xe]
[ 207.322812] ? __pfx___drm_dev_dbg+0x10/0x10
[ 207.322815] ? add_dr+0x180/0x220
[ 207.322818] ? __pfx___drmm_mutex_release+0x10/0x10
[ 207.322821] ? __pfx_xe_device_probe+0x10/0x10 [xe]
[ 207.322966] ? xe_pm_init_early+0x33a/0x410 [xe]
[ 207.323136] xe_pci_probe+0x936/0x1250 [xe]
[ 207.323298] ? lock_acquire+0x1a4/0x2e0
[ 207.323302] ? __pfx_xe_pci_probe+0x10/0x10 [xe]
[ 207.323464] local_pci_probe+0xe6/0x1a0
[ 207.323468] pci_device_probe+0x523/0x840
[ 207.323470] ? __pfx_pci_device_probe+0x10/0x10
[ 207.323473] ? sysfs_do_create_link_sd.isra.0+0x8c/0x110
[ 207.323476] ? sysfs_create_link+0x48/0xc0
[ 207.323479] really_probe+0x1fd/0x8a0
...
"
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Link: https://patch.msgid.link/20260109211041.2446012-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 60bfb8baf8f0d5b0d521744dfd01c880ce1a23f3)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bb36170d959fad7f663f91eb9c32a84dd86bef2b ]
Restrict D3Cold disablement for BMG to unsupported NUC platforms,
instead of disabling it on all platforms.
Signed-off-by: Karthik Poosa <karthik.poosa@intel.com>
Fixes: 3e331a6715ee ("drm/xe/pm: Temporarily disable D3Cold on BMG")
Link: https://patch.msgid.link/20260123173238.1642383-1-karthik.poosa@intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 39125eaf8863ab09d70c4b493f58639b08d5a897)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f2eedadf19979109415928f5ea9ba9a73262aa8f ]
Fix the false-positive "Missing outer runtime PM protection" warning
triggered by
release_async_domains() -> intel_runtime_pm_get_noresume() ->
xe_pm_runtime_get_noresume()
during system suspend.
xe_pm_runtime_get_noresume() is supposed to warn if the device is not in
the runtime resumed state, using xe_pm_runtime_get_if_in_use() for this.
However the latter function will fail if called during runtime or system
suspend/resume, regardless of whether the device is runtime resumed or
not.
Based on the above suppress the warning during system suspend/resume,
similarly to how this is done during runtime suspend/resume.
Suggested-by: Imre Deak <imre.deak@intel.com>
Reviewed-by: Imre Deak <imre.deak@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241217230547.1667561-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Stable-dep-of: bb36170d959f ("drm/xe/pm: Disable D3Cold for BMG only on specific platforms")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7ee9b3e091c63da71e15c72003f1f07e467f5158 ]
The topology query helper advanced the user pointer by the size
of the pointer, not the size of the structure. This can misalign
the output blob and corrupt the following mask. Fix the increment
to use sizeof(*topo).
There is no issue currently, as sizeof(*topo) happens to be equal
to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes).
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit c2a6859138e7f73ad904be17dd7d1da6cc7f06b3)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 95d0883ac8105717f59c2dcdc0d8b9150f13aa12 upstream.
This patch ensures the gt will be awake for the entire duration
of the resume sequences until GuCRC takes over and GT-C6 gets
re-enabled.
Before suspending GT-C6 is kept enabled, but upon resume, GuCRC
is not yet alive to properly control the exits and some cases of
instability and corruption related to GT-C6 can be observed.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Link: https://lore.kernel.org/r/20250827000633.1369890-3-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1313351e71181a4818afeb8dfe202e4162091ef6 upstream.
Move forcewake_get() into xe_gt_idle_enable_c6() to streamline the
code and make it easier to use.
Suggested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250827000633.1369890-2-x.wang@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fe3ccd24138fd391ae8e32289d492c85f67770fc upstream.
When imported dma-bufs are destroyed, TTM is not fully
individualizing the dma-resv, but it *is* copying the fences that
need to be waited for before declaring idle. So in the case where
the bo->resv != bo->_resv we can still drop the preempt-fences, but
make sure we do that on bo->_resv which contains the fence-pointer
copy.
In the case where the copying fails, bo->_resv will typically not
contain any fences pointers at all, so there will be nothing to
drop. In that case, TTM would have ensured all fences that would
have been copied are signaled, including any remaining preempt
fences.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Fixes: fa0af721bd1f ("drm/ttm: test private resv obj on release/destroy")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Tested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251217093441.5073-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 425fe550fb513b567bd6d01f397d274092a9c274)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 80f9c601d9c4d26f00356c0a9c461650e7089273 upstream.
msleep is not very accurate in terms of how long it actually sleeps,
whereas usleep_range is precise. Replace the timeslice sleep for
long-running workloads with the more accurate usleep_range to avoid
jitter if the sleep period is less than 20ms.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251212182847.1683222-3-matthew.brost@intel.com
(cherry picked from commit ca415c4d4c17ad676a2c8981e1fcc432221dce79)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f0f404bd289d79a260b634c5b3f4d330b13472c upstream.
A 10ms timeslice for long-running workloads is far too long and causes
significant jitter in benchmarks when the system is shared. Adjust the
value to 5ms for preempt-fencing VMs, as the resume step there is quite
costly as memory is moved around, and set it to zero for pagefault VMs,
since switching back to pagefault mode after dma-fence mode is
relatively fast.
Also change min_run_period_ms to 'unsiged int' type rather than 's64' as
only positive values make sense.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251212182847.1683222-2-matthew.brost@intel.com
(cherry picked from commit 33a5abd9a68394aa67f9618b20eee65ee8702ff4)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3595114bc31d1eb5e1996164c901485c1ffac6f7 upstream.
An OA property value of 0 is invalid and will cause a NPD.
Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6452
Fixes: cc4e6994d5a2 ("drm/xe/oa: Move functions up so they can be reused for config ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Link: https://patch.msgid.link/20251212061850.1565459-3-ashutosh.dixit@intel.com
(cherry picked from commit 7a100e6ddcc47c1f6ba7a19402de86ce24790621)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 449bcd5d45eb4ce26740f11f8601082fe734bed2 upstream.
Some Xe bos are allocated with extra backing-store for the CCS
metadata. It's never been the intention to share the CCS metadata
when exporting such bos as dma-buf. Don't include it in the
dma-buf sg-table.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20251209204920.224374-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit a4ebfb9d95d78a12512b435a698ee6886d712571)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dcb171931954c51a1a7250d558f02b8f36570783 upstream.
In xe_oa_add_config_ioctl(), we accessed oa_config->id after dropping
metrics_lock. Since this lock protects the lifetime of oa_config, an
attacker could guess the id and call xe_oa_remove_config_ioctl() with
perfect timing, freeing oa_config before we dereference it, leading to
a potential use-after-free.
Fix this by caching the id in a local variable while holding the lock.
v2: (Matt A)
- Dropped mutex_unlock(&oa->metrics_lock) ordering change from
xe_oa_remove_config_ioctl()
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6614
Fixes: cdf02fe1a94a7 ("drm/xe/oa/uapi: Add/remove OA config perf ops")
Cc: <stable@vger.kernel.org> # v6.11+
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20251118114859.3379952-2-sanjay.kumar.yadav@intel.com
(cherry picked from commit 28aeaed130e8e587fd1b73b6d66ca41ccc5a1a31)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f8dd66bfb4e184c71bd26418a00546ebe7f5c17a ]
The OA open parameters did not validate num_syncs, allowing
userspace to pass arbitrarily large values, potentially
leading to excessive allocations.
Add check to ensure that num_syncs does not exceed DRM_XE_MAX_SYNCS,
returning -EINVAL when the limit is violated.
v2: use XE_IOCTL_DBG() and drop duplicated check. (Ashutosh)
Fixes: c8507a25cebd ("drm/xe/oa/uapi: Define and parse OA sync properties")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251205234715.2476561-6-shuicheng.lin@intel.com
(cherry picked from commit e057b2d2b8d815df3858a87dffafa2af37e5945b)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8e461304009135270e9ccf2d7e2dfe29daec9b60 ]
The exec and vm_bind ioctl allow userspace to specify an arbitrary
num_syncs value. Without bounds checking, a very large num_syncs
can force an excessively large allocation, leading to kernel warnings
from the page allocator as below.
Introduce DRM_XE_MAX_SYNCS (set to 1024) and reject any request
exceeding this limit.
"
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1217 at mm/page_alloc.c:5124 __alloc_frozen_pages_noprof+0x2f8/0x2180 mm/page_alloc.c:5124
...
Call Trace:
<TASK>
alloc_pages_mpol+0xe4/0x330 mm/mempolicy.c:2416
___kmalloc_large_node+0xd8/0x110 mm/slub.c:4317
__kmalloc_large_node_noprof+0x18/0xe0 mm/slub.c:4348
__do_kmalloc_node mm/slub.c:4364 [inline]
__kmalloc_noprof+0x3d4/0x4b0 mm/slub.c:4388
kmalloc_noprof include/linux/slab.h:909 [inline]
kmalloc_array_noprof include/linux/slab.h:948 [inline]
xe_exec_ioctl+0xa47/0x1e70 drivers/gpu/drm/xe/xe_exec.c:158
drm_ioctl_kernel+0x1f1/0x3e0 drivers/gpu/drm/drm_ioctl.c:797
drm_ioctl+0x5e7/0xc50 drivers/gpu/drm/drm_ioctl.c:894
xe_drm_ioctl+0x10b/0x170 drivers/gpu/drm/xe/xe_device.c:224
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:598 [inline]
__se_sys_ioctl fs/ioctl.c:584 [inline]
__x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:584
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xbb/0x380 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
"
v2: Add "Reported-by" and Cc stable kernels.
v3: Change XE_MAX_SYNCS from 64 to 1024. (Matt & Ashutosh)
v4: s/XE_MAX_SYNCS/DRM_XE_MAX_SYNCS/ (Matt)
v5: Do the check at the top of the exec func. (Matt)
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reported-by: Koen Koning <koen.koning@intel.com>
Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6450
Cc: <stable@vger.kernel.org> # v6.12+
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Carl Zhang <carl.zhang@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Ivan Briano <ivan.briano@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251205234715.2476561-5-shuicheng.lin@intel.com
(cherry picked from commit b07bac9bd708ec468cd1b8a5fe70ae2ac9b0a11c)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Stable-dep-of: f8dd66bfb4e1 ("drm/xe/oa: Limit num_syncs to prevent oversized allocations")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit eed5b815fa49c17d513202f54e980eb91955d3ed ]
During GT reset recovery in do_gt_restart(), xe_uc_start() was called
before xe_reg_sr_apply_mmio() restored engine-specific registers. This
created a race window where the scheduler could run jobs before hardware
state was fully restored.
This caused failures in eudebug tests (xe_exec_sip_eudebug@breakpoint-
waitsip-*) where TD_CTL register (containing TD_CTL_GLOBAL_DEBUG_ENABLE)
wasn't restored before jobs started executing. Breakpoints would fail to
trigger SIP entry because the debug enable bit wasn't set yet.
Fix by moving xe_uc_start() after all MMIO register restoration,
including engine registers and CCS mode configuration, ensuring all
hardware state is fully restored before any jobs can be scheduled.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Jan Maslak <jan.maslak@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251210145618.169625-2-jan.maslak@intel.com
(cherry picked from commit 825aed0328588b2837636c1c5a0c48795d724617)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 17445af7dcc7d645b6fb8951fd10c8b72cc7f23f ]
MEI GSC interrupt comes from i915 or xe driver. It has top half and
bottom half. Top half is called from i915/xe interrupt handler. It
should be in irq disabled context.
With RT kernel(PREEMPT_RT enabled), by default IRQ handler is in
threaded IRQ. MEI GSC top half might be in threaded IRQ context.
generic_handle_irq_safe API could be called from either IRQ or
process context, it disables local IRQ then calls MEI GSC interrupt
top half.
This change fixes B580 GPU boot issue with RT enabled.
Fixes: e02cea83d32d ("drm/xe/gsc: add Battlemage support")
Tested-by: Baoli Zhang <baoli.zhang@intel.com>
Signed-off-by: Junxiao Chang <junxiao.chang@intel.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251107033152.834960-1-junxiao.chang@intel.com
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
(cherry picked from commit 3efadf028783a49ab2941294187c8b6dd86bf7da)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7276878b069c57d9a9cca5db01d2f7a427b73456 ]
When tick counts are large and multiplication by MSEC_PER_SEC is larger
than 64 bits, the conversion from clock ticks to milliseconds can go bad.
Use mul_u64_u32_div() instead.
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Suggested-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Fixes: 49cc215aad7f ("drm/xe: Add xe_gt_clock_interval_to_ms helper")
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/1562f1b62d5be3fbaee100f09107f3cc49e40dd1.1763408584.git.harish.chegondi@intel.com
(cherry picked from commit 96b93ac214f9dd66294d975d86c5dee256faef91)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d52dea485cd3c98cfeeb474cf66cf95df2ab142f ]
If user provides a large value (such as 0x80) for parameter
prefetch_mem_region_instance in vm_bind ioctl, it will cause
BIT(prefetch_region) overflow as below:
"
------------[ cut here ]------------
UBSAN: shift-out-of-bounds in drivers/gpu/drm/xe/xe_vm.c:3414:7
shift exponent 128 is too large for 64-bit type 'long unsigned int'
CPU: 8 UID: 0 PID: 53120 Comm: xe_exec_system_ Tainted: G W 6.18.0-rc1-lgci-xe-kernel+ #200 PREEMPT(voluntary)
Tainted: [W]=WARN
Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0xa0/0xc0
dump_stack+0x10/0x20
ubsan_epilogue+0x9/0x40
__ubsan_handle_shift_out_of_bounds+0x10e/0x170
? mutex_unlock+0x12/0x20
xe_vm_bind_ioctl.cold+0x20/0x3c [xe]
...
"
Fix it by validating prefetch_region before the BIT() usage.
v2: Add Closes and Cc stable kernels. (Matt)
Reported-by: Koen Koning <koen.koning@intel.com>
Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com>
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6478
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20251112181005.2120521-2-shuicheng.lin@intel.com
(cherry picked from commit 8f565bdd14eec5611cc041dba4650e42ccdf71d9)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit d52dea485cd3c98cfeeb474cf66cf95df2ab142f)
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b11a020d914c3b7628f56a9ea476a5b03679489b ]
Currently Xe driver is triggering flr without any clean-up on
shutdown. This is causing random warnings from pending related works as the
underlying hardware is reset in the middle of their execution.
Fix this by performing clean shutdown also when using flr.
Fixes: 501d799a47e2 ("drm/xe: Wire up device shutdown handler")
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Maarten Lankhorst <dev@lankhorst.se>
Link: https://patch.msgid.link/20251031122312.1836534-1-jouni.hogander@intel.com
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
(cherry picked from commit a4ff26b7c8ef38e4dd34f77cbcd73576fdde6dd4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9cd27eec872f0b95dcdd811edc39d2d32e4158c8 ]
The xe_device_shutdown() function was needing a few declarations
that were only required under a specific condition. This change
moves those declarations to be within that conditional branch
to avoid unnecessary declarations.
Reviewed-by: Nitin Gote <nitin.r.gote@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20251007100208.1407021-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit 15b3036045188f4da4ca62b2ed01b0f160252e9b)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Stable-dep-of: b11a020d914c ("drm/xe: Do clean shutdown also when using flr")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 95af8f4fdce8349a5fe75264007f1af2aa1082ea ]
Cancel and wait for any Dead CT worker to complete before continuing
with device unbinding. Else the worker will end up using resources freed
by the undind operation.
Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Fixes: d2c5a5a926f4 ("drm/xe/guc: Dead CT helper")
Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20251103123144.3231829-6-balasubramani.vivekanandan@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 492671339114e376aaa38626d637a2751cdef263)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3b09b11805bfee32d5a0000f5ede42c07237a6c4 ]
Due to multiple explosion issues in the early days of the Xe driver,
the GuC load was hacked to never return a failure. That prevented
kernel panics and such initially, but now all it achieves is creating
more confusing errors when the driver tries to submit commands to a
GuC it already knows is not there. So fix that up.
As a stop-gap and to help with debug of load failures due to invalid
GuC init params, a wedge call had been added to the inner GuC load
function. The reason being that it leaves the GuC log accessible via
debugfs. However, for an end user, simply aborting the module load is
much cleaner than wedging and trying to continue. The wedge blocks
user submissions but it seems that various bits of the driver itself
still try to submit to a dead GuC and lots of subsequent errors occur.
And with regards to developers debugging why their particular code
change is being rejected by the GuC, it is trivial to either add the
wedge back in and hack the return code to zero again or to just do a
GuC log dump to dmesg.
v2: Add support for error injection testing and drop the now redundant
wedge call.
CC: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://lore.kernel.org/r/20250909224132.536320-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2506af5f8109a387a5e8e9e3d7c498480b8033db ]
The GuC communication protocol allows GuC to send NO_RESPONSE_RETRY
reply message to indicate that due to some interim condition it can
not handle incoming H2G request and the host shall resend it.
But in some cases, due to errors, this unsatisfied condition might
be final and this could lead to endless retries as it was recently
seen on the CI:
[drm] GT0: PF: VF1 FLR didn't finish in 5000 ms (-ETIMEDOUT)
[drm] GT0: PF: VF1 resource sanitizing failed (-ETIMEDOUT)
[drm] GT0: PF: VF1 FLR failed!
[drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
[drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
[drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
[drm:guc_ct_send_recv [xe]] GT0: H2G action 0x5503 retrying: reason 0x0
To avoid such dangerous loops allow only limited number of retries
(for now 50) and add some delays (n * 5ms) to slow down the rate of
resending this repeated request.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com>
Link: https://lore.kernel.org/r/20250903223330.6408-1-michal.wajdeczko@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ad83b1da5b786ee2d245e41ce55cb1c71fed7c22 ]
There are platforms already have a maximum dump size of 12KB, to avoid
data truncating, increase GuC crash dump buffer size to 16KB.
Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250829160427.1245732-1-zhanjun.dong@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1cda3c755bb7770be07d75949bb0f45fb88651f6 ]
I saw an oops in xe_gem_fault when running the xe-fast-feedback
testlist against the realtime kernel without debug options enabled.
The panic happens after core_hotunplug unbind-rebind finishes.
Presumably what happens is that a process mmaps, unlocks because
of the FAULT_FLAG_RETRY_NOWAIT logic, has no process memory left,
causing ttm_bo_vm_dummy_page() to return VM_FAULT_NOPAGE, since
there was nothing left to populate, and then oopses in
"mem_type_is_vram(tbo->resource->mem_type)" because tbo->resource
is NULL.
It's convoluted, but fits the data and explains the oops after
the test exits.
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20250715152057.23254-2-dev@lankhorst.se
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 45fbb51050e72723c2bdcedc1ce32305256c70ed ]
The GuC load process will abort if certain status codes (which are
indicative of a fatal error) are reported. Otherwise, it keeps waiting
until the 'success' code is returned. New error codes have been added
in recent GuC releases, so add support for aborting on those as well.
v2: Shuffle HWCONFIG_START to the front of the switch to keep the
ordering as per the enum define for clarity (review feedback by
Jonathan). Also add a description for the basic 'invalid init data'
code which was missing.
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b3fbda1a630a9439c885b2a5dc5230cc49a87e9e upstream.
Waking the device during a GT reset can lead to unintended memory
allocation, which is not allowed since GT resets occur in the reclaim
path. Prevent this by holding a PM reference while a reset is in flight.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://lore.kernel.org/r/20251022005538.828980-3-matthew.brost@intel.com
(cherry picked from commit 480b358e7d8ef69fd8f1b0cad6e07c7d70a36ee4)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9f64b3cd051b825de0a2a9f145c8e003200cedd5 upstream.
In normal operation, a registered exec queue is disabled and
deregistered through the GuC, and freed only after the GuC confirms
completion. However, if the driver is forced to unbind while the exec
queue is still running, the user may call exec_destroy() after the GuC
has already been stopped and CT communication disabled.
In this case, the driver cannot receive a response from the GuC,
preventing proper cleanup of exec queue resources. Fix this by directly
releasing the resources when GuC is not running.
Here is the failure dmesg log:
"
[ 468.089581] ---[ end trace 0000000000000000 ]---
[ 468.089608] pci 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535)
[ 468.090558] pci 0000:03:00.0: [drm] GT0: total 65535
[ 468.090562] pci 0000:03:00.0: [drm] GT0: used 1
[ 468.090564] pci 0000:03:00.0: [drm] GT0: range 1..1 (1)
[ 468.092716] ------------[ cut here ]------------
[ 468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe]
"
v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled().
As CT may go down and come back during VF migration.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20251010172529.2967639-2-shuicheng.lin@intel.com
(cherry picked from commit 9b42321a02c50a12b2beb6ae9469606257fbecea)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2d1684a077d62fddfac074052c162ec6573a34e1 upstream.
Currently this is hidden behind perfmon_capable() since this is
technically an info leak, given that this is a system wide metric.
However the granularity reported here is always PAGE_SIZE aligned, which
matches what the core kernel is already willing to expose to userspace
if querying how many free RAM pages there are on the system, and that
doesn't need any special privileges. In addition other drm drivers seem
happy to expose this.
The motivation here if with oneAPI where they want to use the system
wide 'used' reporting here, so not the per-client fdinfo stats. This has
also come up with some perf overlay applications wanting this
information.
Fixes: 1105ac15d2a1 ("drm/xe/uapi: restrict system wide accounting")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com
(cherry picked from commit 4d0b035fd6dae8ee48e9c928b10f14877e595356)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 08fdfd260e641da203f80aff8d3ed19c5ecceb7d ]
In xe_hw_engine_group_get_mode(), a write lock is acquired before
calling switch_mode(), which in turn invokes
xe_hw_engine_group_suspend_faulting_lr_jobs().
On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(),
the write lock is released there, and then again in
xe_hw_engine_group_get_mode(), leading to a double release.
Fix this by keeping both acquire and release operation in
xe_hw_engine_group_get_mode().
Fixes: 770bd1d34113 ("drm/xe/hw_engine_group: Ensure safe transition between execution modes")
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Francois Dugast <francois.dugast@intel.com>
Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 662d98b8b373007fa1b08ba93fee11f6fd3e387c)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|