| Age | Commit message (Collapse) | Author | Files | Lines |
|
The cited commit introduced MLX5_PRIV_FLAGS_SWITCH_LEGACY to identify
when a transition to legacy mode is requested via devlink. However, the
logic failed to clear this flag if the mode was subsequently changed
back to MLX5_ESWITCH_OFFLOADS (switchdev). Consequently, if a user
toggled from legacy to switchdev, the flag remained set, leaving the
driver with wrong state indicating
Fix this by explicitly clearing the MLX5_PRIV_FLAGS_SWITCH_LEGACY bit
when the requested mode is MLX5_ESWITCH_OFFLOADS.
Fixes: 2a4f56fbcc47 ("net/mlx5e: Keep netdev when leave switchdev for devlink set legacy only")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlx5_lag_disable_change() unconditionally called mlx5_disable_lag() when
LAG was active, which is incorrect for MLX5_LAG_MODE_MPESW.
Hnece, call mlx5_disable_mpesw() when running in MPESW mode.
Fixes: a32327a3a02c ("net/mlx5: Lag, Control MultiPort E-Switch single FDB mode")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix a circular locking dependency between dbg_mutex and the domain
rx/tx mutexes that could lead to a deadlock.
The dump path in dr_dump_domain_all() was acquiring locks in the order:
dbg_mutex -> rx.mutex -> tx.mutex
While the table/matcher creation paths acquire locks in the order:
rx.mutex -> tx.mutex -> dbg_mutex
This inverted lock ordering creates a circular dependency. Fix this by
changing dr_dump_domain_all() to acquire the domain lock before
dbg_mutex, matching the order used in mlx5dr_table_create() and
mlx5dr_matcher_create().
Lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.19.0-rc6net_next_e817c4e #1 Not tainted
------------------------------------------------------
sos/30721 is trying to acquire lock:
ffff888102df5900 (&dmn->info.rx.mutex){+.+.}-{4:4}, at:
dr_dump_start+0x131/0x450 [mlx5_core]
but task is already holding lock:
ffff888102df5bc0 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}, at:
dr_dump_start+0x10b/0x450 [mlx5_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}:
__mutex_lock+0x91/0x1060
mlx5dr_matcher_create+0x377/0x5e0 [mlx5_core]
mlx5_cmd_dr_create_flow_group+0x62/0xd0 [mlx5_core]
mlx5_create_flow_group+0x113/0x1c0 [mlx5_core]
mlx5_chains_create_prio+0x453/0x2290 [mlx5_core]
mlx5_chains_get_table+0x2e2/0x980 [mlx5_core]
esw_chains_create+0x1e6/0x3b0 [mlx5_core]
esw_create_offloads_fdb_tables.cold+0x62/0x63f [mlx5_core]
esw_offloads_enable+0x76f/0xd20 [mlx5_core]
mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core]
mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core]
devlink_nl_eswitch_set_doit+0x67/0xe0
genl_family_rcv_msg_doit+0xe0/0x130
genl_rcv_msg+0x188/0x290
netlink_rcv_skb+0x4b/0xf0
genl_rcv+0x24/0x40
netlink_unicast+0x1ed/0x2c0
netlink_sendmsg+0x210/0x450
__sock_sendmsg+0x38/0x60
__sys_sendto+0x119/0x180
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x70/0xd00
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #1 (&dmn->info.tx.mutex){+.+.}-{4:4}:
__mutex_lock+0x91/0x1060
mlx5dr_table_create+0x11d/0x530 [mlx5_core]
mlx5_cmd_dr_create_flow_table+0x62/0x140 [mlx5_core]
__mlx5_create_flow_table+0x46f/0x960 [mlx5_core]
mlx5_create_flow_table+0x16/0x20 [mlx5_core]
esw_create_offloads_fdb_tables+0x136/0x240 [mlx5_core]
esw_offloads_enable+0x76f/0xd20 [mlx5_core]
mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core]
mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core]
devlink_nl_eswitch_set_doit+0x67/0xe0
genl_family_rcv_msg_doit+0xe0/0x130
genl_rcv_msg+0x188/0x290
netlink_rcv_skb+0x4b/0xf0
genl_rcv+0x24/0x40
netlink_unicast+0x1ed/0x2c0
netlink_sendmsg+0x210/0x450
__sock_sendmsg+0x38/0x60
__sys_sendto+0x119/0x180
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x70/0xd00
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #0 (&dmn->info.rx.mutex){+.+.}-{4:4}:
__lock_acquire+0x18b6/0x2eb0
lock_acquire+0xd3/0x2c0
__mutex_lock+0x91/0x1060
dr_dump_start+0x131/0x450 [mlx5_core]
seq_read_iter+0xe3/0x410
seq_read+0xfb/0x130
full_proxy_read+0x53/0x80
vfs_read+0xba/0x330
ksys_read+0x65/0xe0
do_syscall_64+0x70/0xd00
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&dmn->dump_info.dbg_mutex);
lock(&dmn->info.tx.mutex);
lock(&dmn->dump_info.dbg_mutex);
lock(&dmn->info.rx.mutex);
*** DEADLOCK ***
Fixes: 9222f0b27da2 ("net/mlx5: DR, Add support for dumping steering info")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
A good number of fixes:
- cfg80211:
- cancel rfkill work appropriately
- fix radiotap parsing to correctly reject field 18
- fix wext (yes...) off-by-one for IGTK key ID
- mac80211:
- fix for mesh NULL pointer dereference
- fix for stack out-of-bounds (2 bytes) write on
specific multi-link action frames
- set default WMM parameters for all links
- mwifiex: check dev_alloc_name() return value correctly
- libertas: fix potential timer use-after-free
- brcmfmac: fix crash on probe failure
* tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame()
wifi: mac80211: bounds-check link_id in ieee80211_ml_reconfiguration
wifi: mac80211: set default WMM parameters on all links
wifi: libertas: fix use-after-free in lbs_free_adapter()
wifi: mwifiex: Fix dev_alloc_name() return value check
wifi: brcmfmac: Fix potential kernel oops when probe fails
wifi: radiotap: reject radiotap with unknown bits
wifi: cfg80211: cancel rfkill_block work in wiphy_unregister()
wifi: cfg80211: wext: fix IGTK key ID off-by-one
====================
Link: https://patch.msgid.link/20260225113159.360574-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Kicinski reported following issue in upcoming patches:
W=1 C=1 GCC build gives us:
net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through
../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h,
../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h:
153:29: warning: array of flexible structures
sparse doesn't like that hdr has a zero-length array which overlaps
proto. The kernel code doesn't currently need those arrays.
PPPoE connection is functional after applying this patch.
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Correct spelling mistakes in comments:
- Fix misspelling of gro_receive
- Fix misspelling of Partition
Signed-off-by: Abhilekh Deka <abhindeka@gmail.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Link: https://patch.msgid.link/20260224153601.17534-1-abhindeka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Wire phy_ethtool_nway_reset() as the .nway_reset ethtool operation,
allowing userspace to restart PHY autonegotiation via 'ethtool -r'.
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260224145723.49450-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot is reporting
unregister_netdevice: waiting for netdevsim0 to become free. Usage count = 3
ref_tracker: netdev@ffff88807dcf8618 has 1/2 users at
__netdev_tracker_alloc include/linux/netdevice.h:4400 [inline]
netdev_hold include/linux/netdevice.h:4429 [inline]
inetdev_init+0x201/0x4e0 net/ipv4/devinet.c:286
inetdev_event+0x251/0x1610 net/ipv4/devinet.c:1600
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
call_netdevice_notifiers_mtu net/core/dev.c:2318 [inline]
netif_set_mtu_ext+0x5aa/0x800 net/core/dev.c:9886
netif_set_mtu+0xd7/0x1b0 net/core/dev.c:9907
dev_set_mtu+0x126/0x260 net/core/dev_api.c:248
team_port_del+0xb07/0xcb0 drivers/net/team/team_core.c:1333
team_del_slave drivers/net/team/team_core.c:1936 [inline]
team_device_event+0x207/0x5b0 drivers/net/team/team_core.c:2929
notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2281 [inline]
call_netdevice_notifiers net/core/dev.c:2295 [inline]
__dev_change_net_namespace+0xcb7/0x2050 net/core/dev.c:12592
do_setlink+0x2ce/0x4590 net/core/rtnetlink.c:3060
rtnl_changelink net/core/rtnetlink.c:3776 [inline]
__rtnl_newlink net/core/rtnetlink.c:3935 [inline]
rtnl_newlink+0x15a9/0x1be0 net/core/rtnetlink.c:4072
rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:6958
netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
problem. Ido Schimmel found steps to reproduce
ip link add name team1 type team
ip link add name dummy1 mtu 1499 master team1 type dummy
ip netns add ns1
ip link set dev dummy1 netns ns1
ip -n ns1 link del dev dummy1
and also found that the same issue was fixed in the bond driver in
commit f51048c3e07b ("bonding: avoid NETDEV_CHANGEMTU event when
unregistering slave").
Let's do similar thing for the team driver, with commit ad7c7b2172c3 ("net:
hold netdev instance lock during sysfs operations") and commit 303a8487a657
("net: s/__dev_set_mtu/__netif_set_mtu/") also applied.
Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
Suggested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260224125709.317574-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While testing corner cases in the driver, a use-after-free crash
was found on the service rescan PCI path.
When mana_serv_reset() calls mana_gd_suspend(), mana_gd_cleanup()
destroys gc->service_wq. If the subsequent mana_gd_resume() fails
with -ETIMEDOUT or -EPROTO, the code falls through to
mana_serv_rescan() which triggers pci_stop_and_remove_bus_device().
This invokes the PCI .remove callback (mana_gd_remove), which calls
mana_gd_cleanup() a second time, attempting to destroy the already-
freed workqueue. Fix this by NULL-checking gc->service_wq in
mana_gd_cleanup() and setting it to NULL after destruction.
Call stack of issue for reference:
[Sat Feb 21 18:53:48 2026] Call Trace:
[Sat Feb 21 18:53:48 2026] <TASK>
[Sat Feb 21 18:53:48 2026] mana_gd_cleanup+0x33/0x70 [mana]
[Sat Feb 21 18:53:48 2026] mana_gd_remove+0x3a/0xc0 [mana]
[Sat Feb 21 18:53:48 2026] pci_device_remove+0x41/0xb0
[Sat Feb 21 18:53:48 2026] device_remove+0x46/0x70
[Sat Feb 21 18:53:48 2026] device_release_driver_internal+0x1e3/0x250
[Sat Feb 21 18:53:48 2026] device_release_driver+0x12/0x20
[Sat Feb 21 18:53:48 2026] pci_stop_bus_device+0x6a/0x90
[Sat Feb 21 18:53:48 2026] pci_stop_and_remove_bus_device+0x13/0x30
[Sat Feb 21 18:53:48 2026] mana_do_service+0x180/0x290 [mana]
[Sat Feb 21 18:53:48 2026] mana_serv_func+0x24/0x50 [mana]
[Sat Feb 21 18:53:48 2026] process_one_work+0x190/0x3d0
[Sat Feb 21 18:53:48 2026] worker_thread+0x16e/0x2e0
[Sat Feb 21 18:53:48 2026] kthread+0xf7/0x130
[Sat Feb 21 18:53:48 2026] ? __pfx_worker_thread+0x10/0x10
[Sat Feb 21 18:53:48 2026] ? __pfx_kthread+0x10/0x10
[Sat Feb 21 18:53:48 2026] ret_from_fork+0x269/0x350
[Sat Feb 21 18:53:48 2026] ? __pfx_kthread+0x10/0x10
[Sat Feb 21 18:53:48 2026] ret_from_fork_asm+0x1a/0x30
[Sat Feb 21 18:53:48 2026] </TASK>
Fixes: 505cc26bcae0 ("net: mana: Add support for auxiliary device servicing events")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aZ2bzL64NagfyHpg@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The devm_add_action_or_reset() function already executes the cleanup
action on failure before returning an error, so the explicit goto error
and subsequent zl3073x_dev_dpll_fini() call causes double cleanup.
Fixes: ebb1031c5137 ("dpll: zl3073x: Refactor DPLL initialization")
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260224-dpll-v2-1-d7786414a830@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The accounting for transmit frames does not count the descriptors
correctly. It uses:
tx_packets = (tx_q->cur_tx + 1) - first_tx;
however, these are indexes into a circular buffer, so cur_tx can be
less than first_tx, and when that happens, tx_packets becomes a very
large unsigned integer. When this is added to tx_q->tx_count_frames,
it has the effect of reducing the count of frames, possibly causing
it to also wrap to a very large unsigned integer.
Fix this by using CIRC_CNT() to calculate the number of descriptors
used.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIl-0000000Aouz-0ttb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The stmmac descriptor queues are circular buffers, operated as far as
the hardware is concerned as either a ring, or a chain that loops back
on itself. From the software perspective, it forms a circular buffer.
We have a few places which calculate the number of in-use and free
entries in these circular buffers, for which we have macros for.
Use CIRC_CNT() and CIRC_SPACE() as appropriate to calculate these
values.
Validating, for stmmac_tx_avail(), which uses CIRC_SPACE():
dirty_tx = 1, cur_tx = 0 -> 0
dirty_tx = 0, cur_tx = 0 -> dma_tx_size - 1
dirty_tx = 0, cur_tx = 1 -> dma_tx_size - 2
dirty_tx passed as end, reduced by one. cur_tx passed as start.
Output on sane computers is identical.
For stmmac_rx_dirty(), which uses CIRC_CNT():
dirty_rx = 1, cur_rx = 0 -> dma_rx_size - 1
dirty_rx = 0, cur_rx = 0 -> 0
dirty_rx = 0, cur_rx = 1 -> 1
dirty_rx passed as start, cur_rx passed as end. Output is identical.
Same validation performed on the is_last_segment calculation, which
also gets converted to CIRC_CNT().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1vuoIg-0000000Aout-0LyS@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
NVMEM typically loads after the ethernet driver and
of_get_ethdev_address returns -EPROBE_DEFER. return in such a case to
allow NVMEM to work.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260224014607.353378-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The kaweth driver should validate that the device it is probing has the
proper number and types of USB endpoints it is expecting before it binds
to it. If a malicious device were to not have the same urbs the driver
will crash later on when it blindly accesses these endpoints.
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://patch.msgid.link/2026022305-substance-virtual-c728@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The kalmia driver should validate that the device it is probing has the
proper number and types of USB endpoints it is expecting before it binds
to it. If a malicious device were to not have the same urbs the driver
will crash later on when it blindly accesses these endpoints.
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: d40261236e8e ("net/usb: Add Samsung Kalmia driver for Samsung GT-B3730")
Link: https://patch.msgid.link/2026022326-shack-headstone-ef6f@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The pegasus driver should validate that the device it is probing has the
proper number and types of USB endpoints it is expecting before it binds
to it. If a malicious device were to not have the same urbs the driver
will crash later on when it blindly accesses these endpoints.
Cc: Petko Manolov <petkan@nucleusys.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026022347-legibly-attest-cc5c@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the device is disconnected from the driver, there is a "dangling"
reference count on the usb interface that was grabbed in the probe
callback. Fix this up by properly dropping the reference after we are
done with it.
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: c46ee38620a2 ("NFC: pn533: add NXP pn533 nfc device driver")
Link: https://patch.msgid.link/2026022329-flashing-ought-7573@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
According to the dwmac v3.74a databook, only MII, GMII and RGMII dwmac
interface modes are supported for EEE. Restrict EEE to these modes, or
the modules supported by a PCS other than the GMAC's integrated PCS.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vuUsD-0000000Afci-0XxO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A workaround was introduced in commit 1fb710793ce2 ("drm/amdgpu: Enable
MES lr_compute_wa by default") to help with some hangs observed in gfx1151.
This WA didn't fully fix the issue. It was actually fixed by adjusting
the VGPR size to the correct value that matched the hardware in commit
b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151").
There are reports of instability on other products with newer GC microcode
versions, and I believe they're caused by this workaround. As we don't
need the workaround any more, remove it.
Fixes: b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9973e64bd6ee7642860a6f3b6958cbf14e89cabd)
Cc: stable@vger.kernel.org
|
|
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.
Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.
Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bb71362182e59caa227e4192da5a612b09349696)
|
|
This will set AMDGPU_VCN_SMU_DPM_INTERFACE_* smu_type
based on soc type and fixing ring timeout issue seen
for DPM enabled case.
Signed-off-by: sguttula <suresh.guttula@amd.com>
Reviewed-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f0f23c315b38c55e8ce9484cf59b65811f350630)
|
|
Do not unlock psp->ras_context.mutex if it has not been locked. This has
been detected by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6fa01b4335978051d2cd80841728fd63cc597970)
|
|
Mutexes must be unlocked before these are destroyed. This has been detected
by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Yang Wang <kevinyang.wang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 270258ba320beb99648dceffb67e86ac76786e55)
|
|
This can be called while preemption is disabled, for example by
dcn32_internal_validate_bw which is called with the FPU active.
Fixes "BUG: scheduling while atomic" messages I encounter on my Navi31
machine.
Signed-off-by: Natalie Vock <natalie.vock@gmx.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b42dae2ebc5c84a68de63ec4ffdfec49362d53f1)
Cc: stable@vger.kernel.org
|
|
Huge input values in amdgpu_userq_wait_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
v2: squash in Srini's fix
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcec012c664247531aed3e662f4280ff804d1476)
Cc: stable@vger.kernel.org
|
|
Huge input values in amdgpu_userq_signal_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit be267e15f99bc97cbe202cd556717797cdcf79a5)
Cc: stable@vger.kernel.org
|
|
Userspace can either deliberately pass in the too small num_fences, or the
required number can legitimately grow between the two calls to the userq
wait ioctl. In both cases we do not want the emit the kernel warning
backtrace since nothing is wrong with the kernel and userspace will simply
get an errno reported back. So lets simply drop the WARN_ONs.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2c333ea579de6cc20ea7bc50e9595ef72863e65c)
|
|
Drop reference to syncobj and timeline fence when aborting the ioctl due
output array being too small.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 68951e9c3e6bb22396bc42ef2359751c8315dd27)
Cc: <stable@vger.kernel.org> # v6.16+
|
|
A workaround was introduced in commit 1fb710793ce2 ("drm/amdgpu: Enable
MES lr_compute_wa by default") to help with some hangs observed in gfx1151.
This WA didn't fully fix the issue. It was actually fixed by adjusting
the VGPR size to the correct value that matched the hardware in commit
b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151").
There are reports of instability on other products with newer GC microcode
versions, and I believe they're caused by this workaround. As we don't
need the workaround any more, remove it.
Fixes: b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.
Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.
Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Handle check address validity command in SR-IOV
guest.
Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This will set AMDGPU_VCN_SMU_DPM_INTERFACE_* smu_type
based on soc type and fixing ring timeout issue seen
for DPM enabled case.
Signed-off-by: sguttula <suresh.guttula@amd.com>
Reviewed-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add function to convert retired address in SR-IOV
guest.
Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Handle address check validity command in SR-IOV guest
Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add convert retired address command and structure
for uniras.
Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add address check command and data structure
for uniras.
Signed-off-by: Jinzhou Su <jinzhou.su@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Do not unlock psp->ras_context.mutex if it has not been locked. This has
been detected by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Mutexes must be unlocked before these are destroyed. This has been detected
by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Yang Wang <kevinyang.wang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This can be called while preemption is disabled, for example by
dcn32_internal_validate_bw which is called with the FPU active.
Fixes "BUG: scheduling while atomic" messages I encounter on my Navi31
machine.
Signed-off-by: Natalie Vock <natalie.vock@gmx.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Using legacy driver with latest firmware causes a power off issue.
Fix this by assigning a different filename (npu_7.sbin) to the latest
firmware. The driver attempts to load the latest firmware first and falls
back to the previous firmware version if loading fails.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/5009
Fixes: f1eac46fe5f7 ("accel/amdxdna: Update firmware version check for latest firmware")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260225204752.2711734-1-lizhi.hou@amd.com
|
|
Huge input values in amdgpu_userq_wait_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
v2: squash in Srini's fix
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Endpoint drivers use dw_pcie_ep_raise_msix_irq() to raise an MSI-X
interrupt to the host using a writel(), which generates a PCI posted write
transaction. There's no completion for posted writes, so the writel() may
return before the PCI write completes. dw_pcie_ep_raise_msix_irq() also
unmaps the outbound ATU entry used for the PCI write, so the write races
with the unmap.
If the PCI write loses the race with the ATU unmap, the write may corrupt
host memory or cause IOMMU errors, e.g., these when running fio with a
larger queue depth against nvmet-pci-epf:
arm-smmu-v3 fc900000.iommu: 0x0000010000000010
arm-smmu-v3 fc900000.iommu: 0x0000020000000000
arm-smmu-v3 fc900000.iommu: 0x000000090000f040
arm-smmu-v3 fc900000.iommu: 0x0000000000000000
arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
Flush the write by performing a readl() of the same address to ensure that
the write has reached the destination before the ATU entry is unmapped.
The same problem was solved for dw_pcie_ep_raise_msi_irq() in commit
8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping"), but there
it was solved by dedicating an outbound iATU only for MSI. We can't do the
same for MSI-X because each vector can have a different msg_addr and the
msg_addr may be changed while the vector is masked.
Fixes: beb4641a787d ("PCI: dwc: Add MSI-X callbacks handler")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260211175540.105677-2-cassel@kernel.org
|
|
Endpoint drivers use dw_pcie_ep_raise_msi_irq() to raise MSI interrupts to
the host. After 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU
mapping"), dw_pcie_ep_raise_msi_irq() caches the Message Address from the
MSI Capability in ep->msi_msg_addr. But that Message Address is controlled
by the host, and it may change. For example, if:
- firmware on the host configures the Message Address and triggers an
MSI,
- a driver on the Endpoint raises the MSI via dw_pcie_ep_raise_msi_irq(),
which caches the Message Address,
- a kernel on the host reconfigures the Message Address and the host
kernel driver triggers another MSI,
dw_pcie_ep_raise_msi_irq() notices that the Message Address no longer
matches the cached ep->msi_msg_addr, warns about it, and returns error
instead of raising the MSI. The host kernel may hang because it never
receives the MSI.
This was seen with the nvmet_pci_epf_driver: the host UEFI performs NVMe
commands, e.g. Identify Controller to get the name of the controller,
nvmet-pci-epf posts the completion queue entry and raises an IRQ using
dw_pcie_ep_raise_msi_irq(). When the host boots Linux, we see a
WARN_ON_ONCE() from dw_pcie_ep_raise_msi_irq(), and the host kernel hangs
because the nvme driver never gets an IRQ.
Remove the warning when dw_pcie_ep_raise_msi_irq() notices that Message
Address has changed, remap using the new address, and update the
ep->msi_msg_addr cache.
Fixes: 8719c64e76bf ("PCI: dwc: ep: Cache MSI outbound iATU mapping")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Koichiro Den <den@valinux.co.jp>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260210181225.3926165-2-cassel@kernel.org
|
|
Huge input values in amdgpu_userq_signal_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Use the existing helper instead of open coding it
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Sunil Khatri <sunil.khatrti@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Use the existing helper instead of open coding it.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Sunil Khatri <sunil.khatrti@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Use dedicated memory as vf ras command buffer.
V2:
Add lock to ensure serialization of sending vf ras commands.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Jinzhou Su <jinzhou.su@amd.com>
Tested-by: Jinzhou Su <jinzhou.su@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Missed deleting the commented line in the original patch.
Fixes: 73463e26f7e2 ("drm/amdkfd: Disable MQD queue priority")
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If we gate the fence destruction with a check telling us whether there are
valid pointers in there we can eliminate the need for dual, basically
identical, exit paths.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Userspace can either deliberately pass in the too small num_fences, or the
required number can legitimately grow between the two calls to the userq
wait ioctl. In both cases we do not want the emit the kernel warning
backtrace since nothing is wrong with the kernel and userspace will simply
get an errno reported back. So lets simply drop the WARN_ONs.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|