summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel/ice
AgeCommit message (Collapse)AuthorFilesLines
2025-06-17ice: fix eswitch code memory leak in reset scenarioGrzegorz Nitka1-1/+5
Add simple eswitch mode checker in attaching VF procedure and allocate required port representor memory structures only in switchdev mode. The reset flows triggers VF (if present) detach/attach procedure. It might involve VF port representor(s) re-creation if the device is configured is switchdev mode (not legacy one). The memory was blindly allocated in current implementation, regardless of the mode and not freed if in legacy mode. Kmemeleak trace: unreferenced object (percpu) 0x7e3bce5b888458 (size 40): comm "bash", pid 1784, jiffies 4295743894 hex dump (first 32 bytes on cpu 45): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): pcpu_alloc_noprof+0x4c4/0x7c0 ice_repr_create+0x66/0x130 [ice] ice_repr_create_vf+0x22/0x70 [ice] ice_eswitch_attach_vf+0x1b/0xa0 [ice] ice_reset_all_vfs+0x1dd/0x2f0 [ice] ice_pci_err_resume+0x3b/0xb0 [ice] pci_reset_function+0x8f/0x120 reset_store+0x56/0xa0 kernfs_fop_write_iter+0x120/0x1b0 vfs_write+0x31c/0x430 ksys_write+0x61/0xd0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e Testing hints (ethX is PF netdev): - create at least one VF echo 1 > /sys/class/net/ethX/device/sriov_numvfs - trigger the reset echo 1 > /sys/class/net/ethX/device/reset Fixes: 415db8399d06 ("ice: make representor code generic") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-17net: ice: Perform accurate aRFS flow matchKrishna Kumar1-0/+48
This patch fixes an issue seen in a large-scale deployment under heavy incoming pkts where the aRFS flow wrongly matches a flow and reprograms the NIC with wrong settings. That mis-steering causes RX-path latency spikes and noisy neighbor effects when many connections collide on the same hash (some of our production servers have 20-30K connections). set_rps_cpu() calls ndo_rx_flow_steer() with flow_id that is calculated by hashing the skb sized by the per rx-queue table size. This results in multiple connections (even across different rx-queues) getting the same hash value. The driver steer function modifies the wrong flow to use this rx-queue, e.g.: Flow#1 is first added: Flow#1: <ip1, port1, ip2, port2>, Hash 'h', q#10 Later when a new flow needs to be added: Flow#2: <ip3, port3, ip4, port4>, Hash 'h', q#20 The driver finds the hash 'h' from Flow#1 and updates it to use q#20. This results in both flows getting un-optimized - packets for Flow#1 goes to q#20, and then reprogrammed back to q#10 later and so on; and Flow #2 programming is never done as Flow#1 is matched first for all misses. Many flows may wrongly share the same hash and reprogram rules of the original flow each with their own q#. Tested on two 144-core servers with 16K netperf sessions for 180s. Netperf clients are pinned to cores 0-71 sequentially (so that wrong packets on q#s 72-143 can be measured). IRQs are set 1:1 for queues -> CPUs, enable XPS, enable aRFS (global value is 144 * rps_flow_cnt). Test notes about results from ice_rx_flow_steer(): --------------------------------------------------- 1. "Skip:" counter increments here: if (fltr_info->q_index == rxq_idx || arfs_entry->fltr_state != ICE_ARFS_ACTIVE) goto out; 2. "Add:" counter increments here: ret = arfs_entry->fltr_info.fltr_id; INIT_HLIST_NODE(&arfs_entry->list_entry); 3. "Update:" counter increments here: /* update the queue to forward to on an already existing flow */ Runtime comparison: original code vs with the patch for different rps_flow_cnt values. +-------------------------------+--------------+--------------+ | rps_flow_cnt | 512 | 2048 | +-------------------------------+--------------+--------------+ | Ratio of Pkts on Good:Bad q's | 214 vs 822K | 1.1M vs 980K | | Avoid wrong aRFS programming | 0 vs 310K | 0 vs 30K | | CPU User | 216 vs 183 | 216 vs 206 | | CPU System | 1441 vs 1171 | 1447 vs 1320 | | CPU Softirq | 1245 vs 920 | 1238 vs 961 | | CPU Total | 29 vs 22.7 | 29 vs 24.9 | | aRFS Update | 533K vs 59 | 521K vs 32 | | aRFS Skip | 82M vs 77M | 7.2M vs 4.5M | +-------------------------------+--------------+--------------+ A separate TCP_STREAM and TCP_RR with 1,4,8,16,64,128,256,512 connections showed no performance degradation. Some points on the patch/aRFS behavior: 1. Enabling full tuple matching ensures flows are always correctly matched, even with smaller hash sizes. 2. 5-6% drop in CPU utilization as the packets arrive at the correct CPUs and fewer calls to driver for programming on misses. 3. Larger hash tables reduces mis-steering due to more unique flow hashes, but still has clashes. However, with larger per-device rps_flow_cnt, old flows take more time to expire and new aRFS flows cannot be added if h/w limits are reached (rps_may_expire_flow() succeeds when 10*rps_flow_cnt pkts have been processed by this cpu that are not part of the flow). Fixes: 28bf26724fdb0 ("ice: Implement aRFS") Signed-off-by: Krishna Kumar <krikku@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-12Merge tag 'net-6.16-rc2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth and wireless. Current release - regressions: - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD Current release - new code bugs: - eth: airoha: correct enable mask for RX queues 16-31 - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer disappears under traffic - ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid routes Previous releases - regressions: - phy: phy_caps: don't skip better duplex match on non-exact match - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0 - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient packet loss, exact reason not fully understood, yet Previous releases - always broken: - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6) - sched: sfq: fix a potential crash on gso_skb handling - Bluetooth: intel: improve rx buffer posting to avoid causing issues in the firmware - eth: intel: i40e: make reset handling robust against multiple requests - eth: mlx5: ensure FW pages are always allocated on the local NUMA node, even when device is configure to 'serve' another node - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850, prevent kernel crashes - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request() for 3 sec if fw_stats_done is not set" * tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context net: ethtool: Don't check if RSS context exists in case of context 0 af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD. ipv6: Move fib6_config_validate() to ip6_route_add(). net: drv: netdevsim: don't napi_complete() from netpoll net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get() veth: prevent NULL pointer dereference in veth_xdp_rcv net_sched: remove qdisc_tree_flush_backlog() net_sched: ets: fix a race in ets_qdisc_change() net_sched: tbf: fix a race in tbf_change() net_sched: red: fix a race in __red_change() net_sched: prio: fix a race in prio_tune() net_sched: sch_sfq: reject invalid perturb period net: phy: phy_caps: Don't skip better duplex macth on non-exact match MAINTAINERS: Update Kuniyuki Iwashima's email address. selftests: net: add test case for NAT46 looping back dst net: clear the dst when changing skb protocol net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper net/mlx5e: Fix leak of Geneve TLV option object net/mlx5: HWS, make sure the uplink is the last destination ...
2025-06-10ice/ptp: fix crosstimestamp reportingAnton Nadezhdin1-0/+1
Set use_nsecs=true as timestamp is reported in ns. Lack of this result in smaller timestamp error window which cause error during phc2sys execution on E825 NICs: phc2sys[1768.256]: ioctl PTP_SYS_OFFSET_PRECISE: Invalid argument This problem was introduced in the cited commit which omitted setting use_nsecs to true when converting the ice driver to use convert_base_to_cs(). Testing hints (ethX is PF netdev): phc2sys -s ethX -c CLOCK_REALTIME -O 37 -m phc2sys[1769.256]: CLOCK_REALTIME phc offset -5 s0 freq -0 delay 0 Fixes: d4bea547ebb57 ("ice/ptp: Remove convert_art_to_tsc()") Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar2-2/+3
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-30ice: fix rebuilding the Tx scheduler tree for large queue countsMichal Kubiak1-28/+142
The current implementation of the Tx scheduler allows the tree to be rebuilt as the user adds more Tx queues to the VSI. In such a case, additional child nodes are added to the tree to support the new number of queues. Unfortunately, this algorithm does not take into account that the limit of the VSI support node may be exceeded, so an additional node in the VSI layer may be required to handle all the requested queues. Such a scenario occurs when adding XDP Tx queues on machines with many CPUs. Although the driver still respects the queue limit returned by the FW, the Tx scheduler was unable to add those queues to its tree and returned one of the errors below. Such a scenario occurs when adding XDP Tx queues on machines with many CPUs (e.g. at least 321 CPUs, if there is already 128 Tx/Rx queue pairs). Although the driver still respects the queue limit returned by the FW, the Tx scheduler was unable to add those queues to its tree and returned the following errors: Failed VSI LAN queue config for XDP, error: -5 or: Failed to set LAN Tx queue context, error: -22 Fix this problem by extending the tree rebuild algorithm to check if the current VSI node can support the requested number of queues. If it cannot, create as many additional VSI support nodes as necessary to handle all the required Tx queues. Symmetrically, adjust the VSI node removal algorithm to remove all nodes associated with the given VSI. Also, make the search for the next free VSI node more restrictive. That is, add queue group nodes only to the VSI support nodes that have a matching VSI handle. Finally, fix the comment describing the tree update algorithm to better reflect the current scenario. Fixes: b0153fdd7e8a ("ice: update VSI config dynamically") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-30ice: create new Tx scheduler nodes for new queues onlyMichal Kubiak1-5/+6
The current implementation of the Tx scheduler tree attempts to create nodes for all Tx queues, ignoring the fact that some queues may already exist in the tree. For example, if the VSI already has 128 Tx queues and the user requests for 16 new queues, the Tx scheduler will compute the tree for 272 queues (128 existing queues + 144 new queues), instead of 144 queues (128 existing queues and 16 new queues). Fix that by modifying the node count calculation algorithm to skip the queues that already exist in the tree. Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-30ice: fix Tx scheduler error handling in XDP callbackMichal Kubiak1-14/+33
When the XDP program is loaded, the XDP callback adds new Tx queues. This means that the callback must update the Tx scheduler with the new queue number. In the event of a Tx scheduler failure, the XDP callback should also fail and roll back any changes previously made for XDP preparation. The previous implementation had a bug that not all changes made by the XDP callback were rolled back. This caused the crash with the following call trace: [ +9.549584] ice 0000:ca:00.0: Failed VSI LAN queue config for XDP, error: -5 [ +0.382335] Oops: general protection fault, probably for non-canonical address 0x50a2250a90495525: 0000 [#1] SMP NOPTI [ +0.010710] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.14.0-net-next-mar-31+ #14 PREEMPT(voluntary) [ +0.010175] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022 [ +0.010946] RIP: 0010:__ice_update_sample+0x39/0xe0 [ice] [...] [ +0.002715] Call Trace: [ +0.002452] <IRQ> [ +0.002021] ? __die_body.cold+0x19/0x29 [ +0.003922] ? die_addr+0x3c/0x60 [ +0.003319] ? exc_general_protection+0x17c/0x400 [ +0.004707] ? asm_exc_general_protection+0x26/0x30 [ +0.004879] ? __ice_update_sample+0x39/0xe0 [ice] [ +0.004835] ice_napi_poll+0x665/0x680 [ice] [ +0.004320] __napi_poll+0x28/0x190 [ +0.003500] net_rx_action+0x198/0x360 [ +0.003752] ? update_rq_clock+0x39/0x220 [ +0.004013] handle_softirqs+0xf1/0x340 [ +0.003840] ? sched_clock_cpu+0xf/0x1f0 [ +0.003925] __irq_exit_rcu+0xc2/0xe0 [ +0.003665] common_interrupt+0x85/0xa0 [ +0.003839] </IRQ> [ +0.002098] <TASK> [ +0.002106] asm_common_interrupt+0x26/0x40 [ +0.004184] RIP: 0010:cpuidle_enter_state+0xd3/0x690 Fix this by performing the missing unmapping of XDP queues from q_vectors and setting the XDP rings pointer back to NULL after all those queues are released. Also, add an immediate exit from the XDP callback in case of ring preparation failure. Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-1/+6
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19ice: Fix LACP bonds without SRIOV environmentDave Ertman1-0/+6
If an aggregate has the following conditions: - The SRIOV LAG DDP package has been enabled - The bond is in 802.3ad LACP mode - The bond is disqualified from supporting SRIOV VF LAG - Both interfaces were added simultaneously to the bond (same command) Then there is a chance that the two interfaces will be assigned different LACP Aggregator ID's. This will cause a failure of the LACP control over the bond. To fix this, we can detect if the primary interface for the bond (as defined by the driver) is not in switchdev mode, and exit the setup flow if so. Reproduction steps: %> ip link add bond0 type bond mode 802.3ad miimon 100 %> ip link set bond0 up %> ifenslave bond0 eth0 eth1 %> cat /proc/net/bonding/bond0 | grep Agg Check for Aggregator IDs that differ. Fixes: ec5a6c5f79ed ("ice: process events created by lag netdev event handler") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-19ice: fix vf->num_mac count with port representorsJacob Keller1-1/+0
The ice_vc_repr_add_mac() function indicates that it does not store the MAC address filters in the firmware. However, it still increments vf->num_mac. This is incorrect, as vf->num_mac should represent the number of MAC filters currently programmed to firmware. Indeed, we only perform this increment if the requested filter is a unicast address that doesn't match the existing vf->hw_lan_addr. In addition, ice_vc_repr_del_mac() does not decrement the vf->num_mac counter. This results in the counter becoming out of sync with the actual count. As it turns out, vf->num_mac is currently only used in legacy made without port representors. The single place where the value is checked is for enforcing a filter limit on untrusted VFs. Upcoming patches to support VF Live Migration will use this value when determining the size of the TLV for MAC address filters. Fix the representor mode function to stop incrementing the counter incorrectly. Fixes: ac19e03ef780 ("ice: allow process VF opcodes in different ways") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-13Merge branch 'for-next' of ↵Jakub Kicinski11-115/+242
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com IWL reviews: [v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com [v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com [v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com [v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com [v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09iidc/ice/irdma: Update IDC to support multiple consumersDave Ertman7-93/+216
In preparation of supporting more than a single core PCI driver for RDMA, move ice specific structs like qset_params, qos_info and qos_params from iidc_rdma.h to iidc_rdma_ice.h. Previously, the ice driver was just exporting its entire PF struct to the auxiliary driver, but since each core driver will have its own different PF struct, implement a universal struct that all core drivers can provide to the auxiliary driver through the probe call. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Co-developed-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Co-developed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Co-developed-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-31/+22
Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07ice: use DSN instead of PCI BDF for ice_adapter indexPrzemek Kitszel2-31/+22
Use Device Serial Number instead of PCI bus/device/function for the index of struct ice_adapter. Functions on the same physical device should point to the very same ice_adapter instance, but with two PFs, when at least one of them is PCI-e passed-through to a VM, it is no longer the case - PFs will get seemingly random PCI BDF values, and thus indices, what finally leds to each of them being on their own instance of ice_adapter. That causes them to don't attempt any synchronization of the PTP HW clock usage, or any other future resources. DSN works nicely in place of the index, as it is "immutable" in terms of virtualization. Fixes: 0e2bddf9e5f9 ("ice: add ice_adapter for shared data across PFs on the same NIC") Suggested-by: Jacob Keller <jacob.e.keller@intel.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250505161939.2083581-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-5/+10
Cross-merge networking fixes after downstream PR (net-6.15-rc5). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-30ice: Replace ice specific DSCP mapping num with a kernel defineTatyana Nikolova4-7/+7
Replace ice driver specific DSCP mapping number defines ICE_DSCP_NUM_VAL and IIDC_MAX_DSCP_MAPPING with an equivalent kernel define DSCP_MAX. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30iidc/ice/irdma: Break iidc.h into two headersDave Ertman1-0/+1
In preparation of supporting more than a single core PCI driver for RDMA, break the iidc_rdma.h header file into two more focused headers. Only the elements universal to all Intel drivers will remain in the generic iidc_rdma.h header. Move the ice specific information to an ice specific header file named iidc_rdma_ice.h. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30iidc/ice/irdma: Rename to iidc_* conventionDave Ertman4-19/+22
In preparation of supporting more than a single core PCI driver for RDMA, homogenize naming to iidc_rdma_* and IIDC_RDMA_* form. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30iidc/ice/irdma: Rename IDC header fileDave Ertman1-1/+1
To prepare for the IDC upgrade to support different CORE PCI drivers, rename header file from iidc.h to iidc_rdma.h since this files functionality is specifically for RDMA support. Use net/dscp.h include in irdma osdep.h and DSCP_MAX type.h, instead of iidc header and define. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-29ice: Check VF VSI Pointer Value in ice_vc_add_fdir_fltr()Xuanqiang Luo1-0/+5
As mentioned in the commit baeb705fd6a7 ("ice: always check VF VSI pointer values"), we need to perform a null pointer check on the return value of ice_get_vf_vsi() before using it. Fixes: 6ebbe97a4881 ("ice: Add a per-VF limit on number of FDIR filters") Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250425222636.3188441-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29ice: fix Get Tx Topology AQ command error on E830Paul Greenwalt1-5/+5
The Get Tx Topology AQ command (opcode 0x0418) has different read flag requirements depending on the hardware/firmware. For E810, E822, and E823 firmware the read flag must be set, and for newer hardware (E825 and E830) it must not be set. This results in failure to configure Tx topology and the following warning message during probe: DDP package does not support Tx scheduling layers switching feature - please update to the latest DDP package and try again The current implementation only handles E825-C but not E830. It is confusing as we first check ice_is_e825c() and then set the flag in the set case. Finally, we check ice_is_e825c() again and set the flag for all other hardware in both the set and get case. Instead, notice that we always need the read flag for set, but only need the read flag for get on E810, E822, and E823 firmware. Fix the logic to check the MAC type and set the read flag in get only on the older devices which require it. Fixes: ba1124f58afd ("ice: Add E830 device IDs, MAC type and registers") Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250425222636.3188441-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: ptp: introduce .supported_perout_flags to ptp_clock_infoJacob Keller1-3/+1
The PTP_PEROUT_REQUEST2 ioctl has gained support for flags specifying specific output behavior including PTP_PEROUT_ONE_SHOT, PTP_PEROUT_DUTY_CYCLE, PTP_PEROUT_PHASE. Driver authors are notorious for not checking the flags of the request. This results in misinterpreting the request, generating an output signal that does not match the requested value. It is anticipated that even more flags will be added in the future, resulting in even more broken requests. Expecting these issues to be caught during review or playing whack-a-mole after the fact is not a great solution. Instead, introduce the supported_perout_flags field in the ptp_clock_info structure. Update the core character device logic to explicitly reject any request which has a flag not on this list. This ensures that drivers must 'opt in' to the flags they support. Drivers which don't set the .supported_perout_flags field will not need to check that unsupported flags aren't passed, as the core takes care of this. Update the drivers which do support flags to set this new field. Note the following driver files set n_per_out to a non-zero value but did not check the flags at all: • drivers/ptp/ptp_clockmatrix.c • drivers/ptp/ptp_idt82p33.c • drivers/ptp/ptp_fc3.c • drivers/net/ethernet/ti/am65-cpts.c • drivers/net/ethernet/aquantia/atlantic/aq_ptp.c • drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c • drivers/net/dsa/sja1105/sja1105_ptp.c • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c • drivers/net/ethernet/mscc/ocelot_vsc7514.c • drivers/net/ethernet/intel/i40e/i40e_ptp.c Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-2-f6b17d15475c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: ptp: introduce .supported_extts_flags to ptp_clock_infoJacob Keller1-8/+4
The PTP_EXTTS_REQUEST(2) ioctl has a flags field which specifies how the external timestamp request should behave. This includes which edge of the signal to timestamp, as well as a specialized "offset" mode. It is expected that more flags will be added in the future. Driver authors routinely do not check the flags, often accepting requests with flags which they do not support. Even drivers which do check flags may not be future-proofed to reject flags not yet defined. Thus, any future flag additions often require manually updating drivers to reject these flags. This approach of hoping we catch flag checks during review, or playing whack-a-mole after the fact is the wrong approach. Introduce the "supported_extts_flags" field to the ptp_clock_info structure. This field defines the set of flags the device actually supports. Update the core character device logic to check this field and reject unsupported requests. Getting this right is somewhat tricky. First, to avoid unnecessary repetition and make basic functionality work when .supported_extts_flags is 0, the core always accepts the PTP_ENABLE_FEATURE flag. This flag is used to set the 'on' parameter to the .enable function and is thus always 'supported' by all drivers. For backwards compatibility, the PTP_RISING_EDGE and PTP_FALLING_EDGE flags are merely "hints" when using the old PTP_EXTTS_REQUEST ioctl, and are not expected to be enforced. If the user issues PTP_EXTTS_REQUEST2, the PTP_STRICT_FLAGS flag is added which is supposed to inform the driver to strictly validate the flags and reject unsupported requests. To handle this, first check if the driver reports PTP_STRICT_FLAGS support. If it does not, then always allow the PTP_RISING_EDGE and PTP_FALLING_EDGE flags. This keeps backwards compatibility with the original PTP_EXTTS_REQUEST ioctl where these flags are not guaranteed to be honored. This way, drivers which do not set the supported_extts_flags will continue to accept requests for the original PTP_EXTTS_REQUEST ioctl. The core will automatically reject requests with new flags, and correctly reject requests with PTP_STRICT_FLAGS, where the driver is supposed to strictly validate the flags. Update the various drivers, refactoring their validation logic into the .supported_extts_flags field. For consistency and readability, PTP_ENABLE_FEATURE is not set in the supported flags list, and PTP_EXTTS_EDGES is expanded to PTP_RISING_EDGE | PTP_FALLING_EDGE in all cases. Note the following driver files set n_ext_ts to a non-zero value but did not check flags at all: • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c • drivers/net/ethernet/freescale/enetc/enetc_ptp.c • drivers/net/ethernet/intel/i40e/i40e_ptp.c • drivers/net/ethernet/marvell/octeontx2/nic/otx2_ptp.c • drivers/net/ethernet/renesas/ravb_ptp.c • drivers/net/ethernet/renesas/rtsn.c • drivers/net/ethernet/renesas/rtsn.h • drivers/net/ethernet/ti/am65-cpts.c • drivers/net/ethernet/ti/cpts.h • drivers/net/ethernet/ti/icssg/icss_iep.c • drivers/net/ethernet/xscale/ptp_ixp46x.c • drivers/net/phy/bcm-phy-ptp.c • drivers/ptp/ptp_ocp.c • drivers/ptp/ptp_pch.c • drivers/ptp/ptp_qoriq.c These drivers behavior does change slightly: they will now reject the PTP_EXTTS_REQUEST2 ioctl, because they do not strictly validate their flags. This also makes them no longer incorrectly accept PTP_EXT_OFFSET. Also note that the renesas ravb driver does not support PTP_STRICT_FLAGS. We could leave the .supported_extts_flags as 0, but I added the PTP_RISING_EDGE | PTP_FALLING_EDGE since the driver previously manually validated these flags. This is equivalent to 0 because the core will allow these flags regardless unless PTP_STRICT_FLAGS is also set. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-1-f6b17d15475c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-11ice: make const read-only array dflt_rules staticColin Ian King1-1/+1
Don't populate the const read-only array dflt_rules on the stack at run time, instead make it static. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: improve error message for insufficient filter spaceMartyna Szapar-Mudlaw1-0/+7
When adding a rule to switch through tc, if the operation fails due to not enough free recipes (-ENOSPC), provide a clearer error message: "Unable to add filter: insufficient space available." This improves user feedback by distinguishing space limitations from other generic failures. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: enable timesync operation on 2xNAC E825 devicesKarol Kolacinski6-26/+134
According to the E825C specification, SBQ address for ports on a single complex is device 2 for PHY 0 and device 13 for PHY1. For accessing ports on a dual complex E825C (so called 2xNAC mode), the driver should use destination device 2 (referred as phy_0) for the current complex PHY and device 13 (referred as phy_0_peer) for peer complex PHY. Differentiate SBQ destination device by checking if current PF port number is on the same PHY as target port number. Adjust 'ice_get_lane_number' function to provide unique port number for ports from PHY1 in 'dual' mode config (by adding fixed offset for PHY1 ports). Cache this value in ice_hw struct. Introduce ice_get_primary_hw wrapper to get access to timesync register not available from second NAC. Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: refactor ice_sbq_msg_dev enumKarol Kolacinski3-18/+15
Rename ice_sbq_msg_dev to ice_sbq_dev_id to reflect the meaning of this type more precisely. This enum type describes RDA (Remote Device Access) client ids, accessible over SB (Side Band) interface. Rename enum elements to make a driver namespace more cleaner and consistent with other definitions within SB Remove unused 'rmn_x' entries, specific to unsupported E824 device. Adjust clients '2' and '13' names (phy_0 and phy_0_peer respectively) to be compliant with EAS doc. According to the specification, regardless of the complex entity (single or dual), when accessing its own ports, they're accessed always as 'phy_0' client. And referred as 'phy_0_peer' when handling ports connected to the other complex. Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: remove SW side band access workaround for E825Karol Kolacinski1-23/+0
Due to the bug in FW/NVM autoload mechanism (wrong default SB_REM_DEV_CTL register settings), the access to peer PHY and CGU clients was disabled by default. As the workaround solution, the register value was overwritten by the driver at the probe or reset handling. Remove workaround as it's not needed anymore. The fix in autoload procedure has been provided with NVM 3.80 version. NOTE: at the time the fix was provided in NVM, the E825C product was not officially available on the market, so it's not expected this change will cause regression when running with older driver/kernel versions. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: enable LLDP TX for VFs through tcLarysa Zaremba5-0/+175
Only a single VSI can be in charge of sending LLDP frames, sometimes it is beneficial to assign this function to a VF, that is possible to do with tc capabilities in the switchdev mode. It requires first blocking the PF from sending the LLDP frames with a following command: tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop Then it becomes possible to configure a forward rule from a VF port representor to uplink instead. tc filter add dev <vf_ifname> ingress protocol lldp flower skip_sw action mirred egress redirect dev <ifname> How LLDP exclusivity was done previously is LLDP traffic was blocked for a whole port by a single rule and PF was bypassing that. Now at least in the switchdev mode, every separate VSI has to have its own drop rule. Another complication is the fact that tc does not respect when the driver refuses to delete a rule, so returning an error results in a HW rule still present with no way to reference it through tc. This is addressed by allowing the PF rule to be deleted at any time, but making the VF forward rule "dormant" in such case, this means it is deleted from HW but stays in tc and driver's bookkeeping to be restored when drop rule is added back to the PF. Implement tc configuration handling which enables the user to transmit LLDP packets from VF instead of PF. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: support egress drop rules on PFLarysa Zaremba6-44/+138
tc clsact qdisc allows us to add offloaded egress rules with commands such as the following one: tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop Support the egress rule drop action when added to PF, with a few caveats: * in switchdev mode, all PF traffic has to go uplink with an exception for LLDP that can be delegated to a single VSI at a time * in legacy mode, we cannot delegate LLDP functionality to another VSI, so such packets from PF should not be blocked. Also, simplify the rule direction logic, it was previously derived from actions, but actually can be inherited from the tc block (and flipped in case of port representors). Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: remove headers argument from ice_tc_count_lkupsLarysa Zaremba1-9/+4
Remove the headers argument from the ice_tc_count_lkups() function, because it is not used anywhere. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: receive LLDP on trusted VFsMateusz Pacuszka8-12/+134
When a trusted VF tries to configure an LLDP multicast address, configure a rule that would mirror the traffic to this VF, untrusted VFs are not allowed to receive LLDP at all, so the request to add LLDP MAC address will always fail for them. Add a forwarding LLDP filter to a trusted VF when it tries to add an LLDP multicast MAC address. The MAC address has to be added after enabling trust (through restarting the LLDP service). Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com> Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: do not add LLDP-specific filter if not necessaryLarysa Zaremba4-13/+28
Commit 34295a3696fb ("ice: implement new LLDP filter command") introduced the ability to use LLDP-specific filter that directs all LLDP traffic to a single VSI. However, current goal is for all trusted VFs to be able to see LLDP neighbors, which is impossible to do with the special filter. Make using the generic filter the default choice and fall back to special one only if a generic filter cannot be added. That way setups with "NVMs where an already existent LLDP filter is blocking the creation of a filter to allow LLDP packets" will still be able to configure software Rx LLDP on PF only, while all other setups would be able to forward them to VFs too. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11ice: fix check for existing switch ruleMateusz Pacuszka1-2/+2
In case the rule already exists and another VSI wants to subscribe to it new VSI list is being created and both VSIs are moved to it. Currently, the check for already existing VSI with the same rule is done based on fdw_id.hw_vsi_id, which applies only to LOOKUP_RX flag. Change it to vsi_handle. This is software VSI ID, but it can be applied here, because vsi_map itself is also based on it. Additionally change return status in case the VSI already exists in the VSI map to "Already exists". Such case should be handled by the caller. Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner2-3/+3
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-23/+55
Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-18ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw()Mateusz Polchlopek1-9/+15
Fix using the untrusted value of proto->raw.pkt_len in function ice_vc_fdir_parse_raw() by verifying if it does not exceed the VIRTCHNL_MAX_SIZE_RAW_PACKET value. Fixes: 99f419df8a5c ("ice: enable FDIR filters from raw binary patterns for VFs") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: fix input validation for virtchnl BWLukasz Czapnik1-3/+21
Add missing validation of tc and queue id values sent by a VF in ice_vc_cfg_q_bw(). Additionally fixed logged value in the warning message, where max_tx_rate was incorrectly referenced instead of min_tx_rate. Also correct error handling in this function by properly exiting when invalid configuration is detected. Fixes: 015307754a19 ("ice: Support VF queue rate limit and quanta size configuration") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: validate queue quanta parameters to prevent OOB accessJan Glaza1-4/+9
Add queue wraparound prevention in quanta configuration. Ensure end_qid does not overflow by validating start_qid and num_queues. Fixes: 015307754a19 ("ice: Support VF queue rate limit and quanta size configuration") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jan Glaza <jan.glaza@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: stop truncating queue ids when checkingJan Glaza1-1/+1
Queue IDs can be up to 4096, fix invalid check to stop truncating IDs to 8 bits. Fixes: bf93bf791cec8 ("ice: introduce ice_virtchnl.c and ice_virtchnl.h") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jan Glaza <jan.glaza@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: fix reservation of resources for RDMA when disabledJesse Brandeburg1-1/+2
If the CONFIG_INFINIBAND_IRDMA symbol is not enabled as a module or a built-in, then don't let the driver reserve resources for RDMA. The result of this change is a large savings in resources for older kernels, and a cleaner driver configuration for the IRDMA=n case for old and new kernels. Implement this by avoiding enabling the RDMA capability when scanning hardware capabilities. Note: Loading the out-of-tree irdma driver in connection to the in-kernel ice driver, is not supported, and should not be attempted, especially when disabling IRDMA in the kernel config. Fixes: d25a0fc41c1f ("ice: Initialize RDMA support") Signed-off-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Acked-by: Dave Ertman <david.m.ertman@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: ensure periodic output start time is in the futureKarol Kolacinski1-2/+4
On E800 series hardware, if the start time for a periodic output signal is programmed into GLTSYN_TGT_H and GLTSYN_TGT_L registers, the hardware logic locks up and the periodic output signal never starts. Any future attempt to reprogram the clock function is futile as the hardware will not reset until a power on. The ice_ptp_cfg_perout function has logic to prevent this, as it checks if the requested start time is in the past. If so, a new start time is calculated by rounding up. Since commit d755a7e129a5 ("ice: Cache perout/extts requests and check flags"), the rounding is done to the nearest multiple of the clock period, rather than to a full second. This is more accurate, since it ensures the signal matches the user request precisely. Unfortunately, there is a race condition with this rounding logic. If the current time is close to the multiple of the period, we could calculate a target time that is extremely soon. It takes time for the software to program the registers, during which time this requested start time could become a start time in the past. If that happens, the periodic output signal will lock up. For large enough periods, or for the logic prior to the mentioned commit, this is unlikely. However, with the new logic rounding to the period and with a small enough period, this becomes inevitable. For example, attempting to enable a 10MHz signal requires a period of 100 nanoseconds. This means in the *best* case, we have 99 nanoseconds to program the clock output. This is essentially impossible, and thus such a small period practically guarantees that the clock output function will lock up. To fix this, add some slop to the clock time used to check if the start time is in the past. Because it is not critical that output signals start immediately, but it *is* critical that we do not brick the function, 0.5 seconds is selected. This does mean that any requested output will be delayed by at least 0.5 seconds. This slop is applied before rounding, so that we always round up to the nearest multiple of the period that is at least 0.5 seconds in the future, ensuring a minimum of 0.5 seconds to program the clock output registers. Finally, to ensure that the hardware registers programming the clock output complete in a timely manner, add a write flush to the end of ice_ptp_write_perout. This ensures we don't risk any issue with PCIe transaction batching. Strictly speaking, this fixes a race condition all the way back at the initial implementation of periodic output programming, as it is theoretically possible to trigger this bug even on the old logic when always rounding to a full second. However, the window is narrow, and the code has been refactored heavily since then, making a direct backport not apply cleanly. Fixes: d755a7e129a5 ("ice: Cache perout/extts requests and check flags") Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: health.c: fix compilation on gcc 7.5Przemek Kitszel1-3/+3
GCC 7 is not as good as GCC 8+ in telling what is a compile-time const, and thus could be used for static storage. Fortunately keeping strings as const arrays is enough to make old gcc happy. Excerpt from the report: My GCC is: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0. CC [M] drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.o drivers/net/ethernet/intel/ice/devlink/health.c:35:3: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:35:3: note: (near initialization for 'ice_health_status_lookup[0].solution') drivers/net/ethernet/intel/ice/devlink/health.c:35:31: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:35:31: note: (near initialization for 'ice_health_status_lookup[0].data_label[0]') drivers/net/ethernet/intel/ice/devlink/health.c:37:46: error: initializer element is not constant "Change or replace the module or cable.", {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:37:46: note: (near initialization for 'ice_health_status_lookup[1].data_label[0]') drivers/net/ethernet/intel/ice/devlink/health.c:39:3: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 85d6164ec56d ("ice: add fw and port health reporters") Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Closes: https://lore.kernel.org/netdev/CY8PR11MB7134BF7A46D71E50D25FA7A989F72@CY8PR11MB7134.namprd11.prod.outlook.com Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18ice: E825C PHY register cleanupKarol Kolacinski1-17/+14
Minor PTP register refactor, including logical grouping E825C 1-step timestamping registers. Remove unused register definitions (PHY_REG_GPCS_BITSLIP, PHY_REG_REVISION). Also, apply preferred GENMASK macro (instead of ICE_M) for register fields definition affected by this patch. Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-5-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18ice: Refactor E825C PHY registers info structKarol Kolacinski3-65/+20
Simplify ice_phy_reg_info_eth56g struct definition to include base address for the very first quad. Use base address info and 'step' value to determine address for specific PHY quad. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-4-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18ice: rename ice_ptp_init_phc_eth56g functionKarol Kolacinski1-7/+6
Refactor the code by changing ice_ptp_init_phc_eth56g function name to ice_ptp_init_phc_e825, to be consistent with the naming pattern for other devices. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-3-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18ice: Add E830 checksum offload supportPaul Greenwalt7-4/+87
E830 supports raw receive and generic transmit checksum offloads. Raw receive checksum support is provided by hardware calculating the checksum over the whole packet, regardless of type. The calculated checksum is provided to driver in the Rx flex descriptor. Then the driver assigns the checksum to skb->csum and sets skb->ip_summed to CHECKSUM_COMPLETE. Generic transmit checksum support is provided by hardware calculating the checksum given two offsets: the start offset to begin checksum calculation, and the offset to insert the calculated checksum in the packet. Support is advertised to the stack using NETIF_F_HW_CSUM feature. E830 has the following limitations when both generic transmit checksum offload and TCP Segmentation Offload (TSO) are enabled: 1. Inner packet header modification is not supported. This restriction includes the inability to alter TCP flags, such as the push flag. As a result, this limitation can impact the receiver's ability to coalesce packets, potentially degrading network throughput. 2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents support of Maximum Transmission Unit (MTU) greater than 1063 bytes. Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features(). When NETIF_F_HW_CSUM is requested the provided skb->csum_start and skb->csum_offset are passed to hardware in the Tx context descriptor generic checksum (GCS) parameters. Hardware calculates the 1's complement from skb->csum_start to the end of the packet, and inserts the result in the packet at skb->csum_offset. Co-developed-by: Alice Michael <alice.michael@intel.com> Signed-off-by: Alice Michael <alice.michael@intel.com> Co-developed-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-2-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni7-32/+33
Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile 6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.") 2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c 661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-05ice: register devlink prior to creating health reportersPrzemek Kitszel1-2/+2
ice_health_init() was introduced in the commit 2a82874a3b7b ("ice: add Tx hang devlink health reporter"). The call to it should have been put after ice_devlink_register(). It went unnoticed until next reporter by Konrad, which receives events from FW. FW is reporting all events, also from prior driver load, and thus it is not unlikely to have something at the very beginning. And that results in a splat: [ 24.455950] ? devlink_recover_notify.constprop.0+0x198/0x1b0 [ 24.455973] devlink_health_report+0x5d/0x2a0 [ 24.455976] ? __pfx_ice_health_status_lookup_compare+0x10/0x10 [ice] [ 24.456044] ice_process_health_status_event+0x1b7/0x200 [ice] Do the analogous thing for deinit patch. Fixes: 85d6164ec56d ("ice: add fw and port health reporters") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Konrad Knitter <konrad.knitter@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>