summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel
AgeCommit message (Collapse)AuthorFilesLines
2023-07-24ice: implement static version of ageingMichal Swiatkowski2-0/+58
Remove fdb entries always when ageing time expired. Allow user to set ageing time using port object attribute. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: implement bridge port vlanMichal Swiatkowski8-90/+287
Port VLAN in this case means push and pop VLAN action on specific vid. There are a few limitation in hardware: - push and pop can't be used separately - if port VLAN is used there can't be any trunk VLANs, because pop action is done on all traffic received by VSI in port VLAN mode - port VLAN mode on uplink port isn't supported Reflect these limitations in code using dev_info to inform the user about unsupported configuration. In bridge mode there is a need to configure port vlan without resetting VFs. To do that implement ice_port_vlan_on/off() functions. They are only configuring correct vlan_ops to allow setting port vlan. We also need to clear port vlan without resetting the VF which is not supported right now. Change it by implementing clear_port_vlan ops. As previous VLAN configuration isn't always the same, store current config while creating port vlan and restore it in clear function. Configuration steps: - configure switchdev with bridge - #bridge vlan add dev eth0 vid 120 pvid untagged - #bridge vlan add dev eth1 vid 120 pvid untagged - ping from VF0 to VF1 Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Add VLAN FDB support in switchdev modeMarcin Szycik2-8/+317
Add support for matching on VLAN tag in bridge offloads. Currently only trunk mode is supported. To enable VLAN filtering (existing FDB entries will be deleted): ip link set $BR type bridge vlan_filtering 1 To add VLANs to bridge in trunk mode: bridge vlan add dev $PF1 vid 110-111 bridge vlan add dev $VF1_PR vid 110-111 Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Add guard rule when creating FDB in switchdevMarcin Szycik5-36/+132
Introduce new "guard" rule upon FDB entry creation. It matches on src_mac, has valid bit unset, allow_pass_l2 set and has a nop action. Previously introduced "forward" rule matches on dst_mac, has valid bit set, need_pass_l2 set and has a forward action. With these rules, a packet will be offloaded only if FDB exists in both directions (RX and TX). Let's assume link partner sends a packet to VF1: src_mac = LP_MAC, dst_mac = is VF1_MAC. Bridge adds FDB, two rules are created: 1. Guard rule matching on src_mac == LP_MAC 2. Forward rule matching on dst_mac == LP_MAC Now VF1 responds with src_mac = VF1_MAC, dst_mac = LP_MAC. Before this change, only one rule with dst_mac == LP_MAC would have existed, and the packet would have been offloaded, meaning the bridge wouldn't add FDB in the opposite direction. Now, the forward rule matches (dst_mac == LP_MAC), but it has need_pass_l2 set an there is no guard rule with src_mac == VF1_MAC, so the packet goes through slow-path and the bridge adds FDB. Two rules are created: 1. Guard rule matching on src_mac == VF1_MAC 2. Forward rule matching on dst_mac == VF1_MAC Further packets in both directions will be offloaded. The same example is true in opposite direction (i.e. VF1 is the first to send a packet out). Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Switchdev FDB events supportWojciech Drewek2-0/+478
Listen for SWITCHDEV_FDB_{ADD|DEL}_TO_DEVICE events while in switchdev mode. Accept these events on both uplink and VF PR ports. Add HW rules in newly created workqueue. FDB entries are stored in rhashtable for lookup when removing the entry and in the list for cleanup purpose. Direction of the HW rule depends on the type of the ports on which the FDB event was received: ICE_ESWITCH_BR_UPLINK_PORT: TX rule that forwards the packet to the LAN (egress). ICE_ESWITCH_BR_VF_REPR_PORT: RX rule that forwards the packet to the VF associated with the port representor. In both cases the rule matches on the dst mac address. All the FDB entries are stored in the bridge structure. When the port is removed all the FDB entries associated with this port are removed as well. This is achieved thanks to the reference to the port that FDB entry holds. In the fwd rule we use only one lookup type (MAC address) but lkups_cnt variable is already introduced because we will have more lookups in the subsequent patches. Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Implement basic eswitch bridge setupWojciech Drewek8-9/+456
With this patch, ice driver is able to track if the port representors or uplink port were added to the linux bridge in switchdev mode. Listen for NETDEV_CHANGEUPPER events in order to detect this. ice_esw_br data structure reflects the linux bridge and stores all the ports of the bridge (ice_esw_br_port) in xarray, it's created when the first port is added to the bridge and freed once the last port is removed. Note that only one bridge is supported per eswitch. Bridge port (ice_esw_br_port) can be either a VF port representor port or uplink port (ice_esw_br_port_type). In both cases bridge port holds a reference to the VSI, VF's VSI in case of the PR and uplink VSI in case of the uplink. VSI's index is used as an index to the xarray in which ports are stored. Add a check which prevents configuring switchdev mode if uplink is already added to any bridge. This is needed because we need to listen for NETDEV_CHANGEUPPER events to record if the uplink was added to the bridge. Netdevice notifier is registered after eswitch mode is changed to switchdev. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Unset src prune on uplink VSIWojciech Drewek3-0/+32
In switchdev mode uplink VSI is supposed to receive all packets that were not matched by existing filters. If ICE_AQ_VSI_SW_FLAG_LOCAL_LB bit is unset and we have a filter associated with uplink VSI which matches on dst mac equal to MAC1, then packets with src mac equal to MAC1 will be pruned from reaching uplink VSI. Fix this by updating uplink VSI with ICE_AQ_VSI_SW_FLAG_LOCAL_LB bit set when configuring switchdev mode. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Disable vlan pruning for uplink VSIWojciech Drewek1-0/+10
In switchdev mode, uplink VSI is configured to be default VSI which means it will receive all unmatched packets. In order to receive vlan packets we need to disable vlan pruning as well. This is done by dis_rx_filtering vlan op. Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Don't tx before switchdev is fully configuredWojciech Drewek1-0/+3
There is possibility that ice_eswitch_port_start_xmit might be called while some resources are still not allocated which might cause NULL pointer dereference. Fix this by checking if switchdev configuration was finished. Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Prohibit rx mode change in switchdev modeWojciech Drewek1-1/+1
Don't allow to change promisc mode in switchdev mode. When switchdev is configured, PF netdev is set to be a default VSI. This is needed for the slow-path to work correctly. All the unmatched packets will be directed to PF netdev. It is possible that this setting might be overwritten by ndo_set_rx_mode. Prevent this by checking if switchdev is enabled in ice_set_rx_mode. Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24ice: Skip adv rules removal upon switchdev releaseWojciech Drewek3-55/+0
Advanced rules for ctrl VSI will be removed anyway when the VSI will cleaned up, no need to do it explicitly. Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-24vxlan: calculate correct header length for GPEJiri Benc1-1/+1
VXLAN-GPE does not add an extra inner Ethernet header. Take that into account when calculating header length. This causes problems in skb_tunnel_check_pmtu, where incorrect PMTU is cached. In the collect_md mode (which is the only mode that VXLAN-GPE supports), there's no magic auto-setting of the tunnel interface MTU. It can't be, since the destination and thus the underlying interface may be different for each packet. So, the administrator is responsible for setting the correct tunnel interface MTU. Apparently, the administrators are capable enough to calculate that the maximum MTU for VXLAN-GPE is (their_lower_MTU - 36). They set the tunnel interface MTU to 1464. If you run a TCP stream over such interface, it's then segmented according to the MTU 1464, i.e. producing 1514 bytes frames. Which is okay, this still fits the lower MTU. However, skb_tunnel_check_pmtu (called from vxlan_xmit_one) uses 50 as the header size and thus incorrectly calculates the frame size to be 1528. This leads to ICMP too big message being generated (locally), PMTU of 1450 to be cached and the TCP stream to be resegmented. The fix is to use the correct actual header size, especially for skb_tunnel_check_pmtu calculation. Fixes: e1e5314de08ba ("vxlan: implement GPE") Signed-off-by: Jiri Benc <jbenc@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-21iavf: check for removal state before IAVF_FLAG_PF_COMMS_FAILEDJacob Keller1-3/+3
In iavf_adminq_task(), if the function can't acquire the adapter->crit_lock, it checks if the driver is removing. If so, it simply exits without re-enabling the interrupt. This is done to ensure that the task stops processing as soon as possible once the driver is being removed. However, if the IAVF_FLAG_PF_COMMS_FAILED is set, the function checks this before attempting to acquire the lock. In this case, the function exits early and re-enables the interrupt. This will happen even if the driver is already removing. Avoid this, by moving the check to after the adapter->crit_lock is acquired. This way, if the driver is removing, we will not re-enable the interrupt. Fixes: fc2e6b3b132a ("iavf: Rework mutexes for better synchronisation") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-21iavf: fix potential deadlock on allocation failureJacob Keller1-2/+3
In iavf_adminq_task(), if kzalloc() fails to allocate the event.msg_buf, the function will exit without releasing the adapter->crit_lock. This is unlikely, but if it happens, the next access to that mutex will deadlock. Fix this by moving the unlock to the end of the function, and adding a new label to allow jumping to the unlock portion of the function exit flow. Fixes: fc2e6b3b132a ("iavf: Rework mutexes for better synchronisation") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-21i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir()Wang Ming1-1/+1
The debugfs_create_dir() function returns error pointers. It never returns NULL. Most incorrect error checks were fixed, but the one in i40e_dbg_init() was forgotten. Fix the remaining error check. Fixes: 02e9c290814c ("i40e: debugfs interface") Signed-off-by: Wang Ming <machel@vivo.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-144/+199
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20Merge tag 'for-netdev' of ↵Jakub Kicinski7-90/+254
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-19 We've added 45 non-merge commits during the last 3 day(s) which contain a total of 71 files changed, 7808 insertions(+), 592 deletions(-). The main changes are: 1) multi-buffer support in AF_XDP, from Maciej Fijalkowski, Magnus Karlsson, Tirthendu Sarkar. 2) BPF link support for tc BPF programs, from Daniel Borkmann. 3) Enable bpf_map_sum_elem_count kfunc for all program types, from Anton Protopopov. 4) Add 'owner' field to bpf_rb_node to fix races in shared ownership, Dave Marchevsky. 5) Prevent potential skb_header_pointer() misuse, from Alexei Starovoitov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) bpf, net: Introduce skb_pointer_if_linear(). bpf: sync tools/ uapi header with selftests/bpf: Add mprog API tests for BPF tcx links selftests/bpf: Add mprog API tests for BPF tcx opts bpftool: Extend net dump with tcx progs libbpf: Add helper macro to clear opts structs libbpf: Add link-based API for tcx libbpf: Add opts-based attach/detach/query API for tcx bpf: Add fd-based tcx multi-prog infra with link support bpf: Add generic attach/detach/query API for multi-progs selftests/xsk: reset NIC settings to default after running test suite selftests/xsk: add test for too many frags selftests/xsk: add metadata copy test for multi-buff selftests/xsk: add invalid descriptor test for multi-buffer selftests/xsk: add unaligned mode test for multi-buffer selftests/xsk: add basic multi-buffer test selftests/xsk: transmit and receive multi-buffer packets xsk: add multi-buffer documentation i40e: xsk: add TX multi-buffer support ice: xsk: Tx multi-buffer support ... ==================== Link: https://lore.kernel.org/r/20230719175424.75717-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19i40e: xsk: add TX multi-buffer supportTirthendu Sarkar2-5/+6
Set eop bit in TX desc command only for the last descriptor of the packet and do not set for all preceding descriptors. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-17-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19ice: xsk: Tx multi-buffer supportMaciej Fijalkowski2-23/+61
Most of this patch is about actually supporting XDP_TX action. Pure Tx ZC support is only about looking at XDP_PKT_CONTD presence at options field and based on that generating EOP bit on Tx HW descriptor. This is that simple due to the implementation on xsk_tx_peek_release_desc_batch() where we are making sure that last produced descriptor is an EOP one. Overwrite xdp_zc_max_segs with a value that defines max scatter-gatter count on Tx side that HW can handle. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-16-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19i40e: xsk: add RX multi-buffer supportTirthendu Sarkar4-18/+84
This patch is inspired from the multi-buffer support in non-zc path for i40e as well as from the patch to support zc on ice. Each subsequent frag is added to skb_shared_info of the first frag for possible xdp_prog use as well to xsk buffer list for accessing the buffers in af_xdp. For XDP_PASS, new pages are allocated for frags and contents are copied from memory backed by xsk_buff_pool. Replace next_to_clean with next_to_process as done in non-zc path and advance it for every buffer and change the semantics of next_to_clean to point to the first buffer of a packet. Driver will use next_to_process in the same way next_to_clean was used previously. For the non multi-buffer case, next_to_process and next_to_clean will always be the same since each packet consists of a single buffer. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-14-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19ice: xsk: add RX multi-buffer supportMaciej Fijalkowski2-43/+102
This support is strongly inspired by work that introduced multi-buffer support to regular Rx data path in ice. There are some differences, though. When adding a frag, besides adding it to skb_shared_info, use also fresh xsk_buff_add_frag() helper. Reason for doing both things is that we can not rule out the fact that AF_XDP pipeline could use XDP program that needs to access frame fragments. Without them being in skb_shared_info it will not be possible. Another difference is that XDP_PASS has to allocate a new pages for each frags and copy contents from memory backed by xsk_buff_pool. chain_len that is used for programming HW Rx descriptors no longer has to be limited to 1 when xsk_pool is present - remove this restriction. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-13-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19Merge branch '40GbE' of ↵David S. Miller1-27/+38
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-07-14 (i40e) This series contains updates to i40e driver only. Ivan Vecera adds waiting for VF to complete initialization on VF related configuration callbacks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-19Merge branch '40GbE' of ↵Jakub Kicinski4-111/+176
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-07-17 (iavf) This series contains updates to iavf driver only. Ding Hui fixes use-after-free issue by calling netif_napi_del() for all allocated q_vectors. He also resolves out-of-bounds issue by not updating to new values when timeout is encountered. Marcin and Ahmed change the way resets are handled so that the callback operating under the RTNL lock will wait for the reset to finish, the rtnl_lock sensitive functions in reset flow will schedule the netdev update for later in order to remove circular dependency with the critical lock. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: iavf: fix reset task race with iavf_remove() iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies Revert "iavf: Do not restart Tx queues after reset task failure" Revert "iavf: Detach device during reset task" iavf: Wait for reset in callbacks which trigger it iavf: use internal state to free traffic IRQs iavf: Fix out-of-bounds when setting channels on remove iavf: Fix use-after-free in free_netdev ==================== Link: https://lore.kernel.org/r/20230717175205.3217774-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19ice: remove unnecessary (void*) conversionsWu Yunchuan1-2/+2
No need cast (void*) to (struct ice_ring_container *). Signed-off-by: Wu Yunchuan <yunchuan@nfschina.com> Link: https://lore.kernel.org/r/20230717031154.54740-1-yunchuan@nfschina.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19igc: Prevent garbled TX queue with XDP ZEROCOPYFlorian Kauer1-2/+2
In normal operation, each populated queue item has next_to_watch pointing to the last TX desc of the packet, while each cleaned item has it set to 0. In particular, next_to_use that points to the next (necessarily clean) item to use has next_to_watch set to 0. When the TX queue is used both by an application using AF_XDP with ZEROCOPY as well as a second non-XDP application generating high traffic, the queue pointers can get in an invalid state where next_to_use points to an item where next_to_watch is NOT set to 0. However, the implementation assumes at several places that this is never the case, so if it does hold, bad things happen. In particular, within the loop inside of igc_clean_tx_irq(), next_to_clean can overtake next_to_use. Finally, this prevents any further transmission via this queue and it never gets unblocked or signaled. Secondly, if the queue is in this garbled state, the inner loop of igc_clean_tx_ring() will never terminate, completely hogging a CPU core. The reason is that igc_xdp_xmit_zc() reads next_to_use before acquiring the lock, and writing it back (potentially unmodified) later. If it got modified before locking, the outdated next_to_use is written pointing to an item that was already used elsewhere (and thus next_to_watch got written). Fixes: 9acf59a752d4 ("igc: Enable TX via AF_XDP zero-copy") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230717175444.3217831-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18igc: Add TransmissionOverrun counterMuhammad Husaini Zulkifli1-3/+32
Add TransmissionOverrun as per defined by IEEE 802.1Q Bridges. TransmissionOverrun counter shall be incremented if the implementation detects that a frame from a given queue is still being transmitted by the MAC when that gate-close event for that queue occurs. This counter is utilised by the Certification conformance test to inform the user application whether any packets are currently being transmitted on a particular queue during a gate-close event. Intel Discrete I225/I226 have a mechanism to not transmit a packets if the gate open time is insufficient for the packet transmission by setting the Strict_End bit. Thus, it is expected for this counter to be always zero at this moment. Inspired from enetc_taprio_stats() and enetc_taprio_queue_stats(), now driver also report the tx_overruns counter per traffic class. User can get this counter by using below command: 1) tc -s qdisc show dev <interface> root 2) tc -s class show dev <interface> Test Result (Before): class mq :1 root Sent 1289 bytes 20 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq :2 root Sent 124 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq :3 root Sent 46028 bytes 86 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq :4 root Sent 2596 bytes 14 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Test Result (After): class taprio 100:1 root Sent 8491 bytes 38 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Transmit overruns: 0 class taprio 100:2 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Transmit overruns: 0 class taprio 100:3 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Transmit overruns: 0 class taprio 100:4 root Sent 994 bytes 11 pkt (dropped 0, overlimits 0 requeues 1) backlog 0b 0p requeues 1 Transmit overruns: 0 Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230714201428.1718097-1-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-17iavf: fix reset task race with iavf_remove()Ahmed Zaki4-29/+16
The reset task is currently scheduled from the watchdog or adminq tasks. First, all direct calls to schedule the reset task are replaced with the iavf_schedule_reset(), which is modified to accept the flag showing the type of reset. To prevent the reset task from starting once iavf_remove() starts, we need to check the __IAVF_IN_REMOVE_TASK bit before we schedule it. This is now easily added to iavf_schedule_reset(). Finally, remove the check for IAVF_FLAG_RESET_NEEDED in the watchdog task. It is redundant since all callers who set the flag immediately schedules the reset task. Fixes: 3ccd54ef44eb ("iavf: Fix init state closure on remove") Fixes: 14756b2ae265 ("iavf: Fix __IAVF_RESETTING state usage") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: fix a deadlock caused by rtnl and driver's lock circular dependenciesAhmed Zaki3-32/+85
A driver's lock (crit_lock) is used to serialize all the driver's tasks. Lockdep, however, shows a circular dependency between rtnl and crit_lock. This happens when an ndo that already holds the rtnl requests the driver to reset, since the reset task (in some paths) tries to grab rtnl to either change real number of queues of update netdev features. [566.241851] ====================================================== [566.241893] WARNING: possible circular locking dependency detected [566.241936] 6.2.14-100.fc36.x86_64+debug #1 Tainted: G OE [566.241984] ------------------------------------------------------ [566.242025] repro.sh/2604 is trying to acquire lock: [566.242061] ffff9280fc5ceee8 (&adapter->crit_lock){+.+.}-{3:3}, at: iavf_close+0x3c/0x240 [iavf] [566.242167] but task is already holding lock: [566.242209] ffffffff9976d350 (rtnl_mutex){+.+.}-{3:3}, at: iavf_remove+0x6b5/0x730 [iavf] [566.242300] which lock already depends on the new lock. [566.242353] the existing dependency chain (in reverse order) is: [566.242401] -> #1 (rtnl_mutex){+.+.}-{3:3}: [566.242451] __mutex_lock+0xc1/0xbb0 [566.242489] iavf_init_interrupt_scheme+0x179/0x440 [iavf] [566.242560] iavf_watchdog_task+0x80b/0x1400 [iavf] [566.242627] process_one_work+0x2b3/0x560 [566.242663] worker_thread+0x4f/0x3a0 [566.242696] kthread+0xf2/0x120 [566.242730] ret_from_fork+0x29/0x50 [566.242763] -> #0 (&adapter->crit_lock){+.+.}-{3:3}: [566.242815] __lock_acquire+0x15ff/0x22b0 [566.242869] lock_acquire+0xd2/0x2c0 [566.242901] __mutex_lock+0xc1/0xbb0 [566.242934] iavf_close+0x3c/0x240 [iavf] [566.242997] __dev_close_many+0xac/0x120 [566.243036] dev_close_many+0x8b/0x140 [566.243071] unregister_netdevice_many_notify+0x165/0x7c0 [566.243116] unregister_netdevice_queue+0xd3/0x110 [566.243157] iavf_remove+0x6c1/0x730 [iavf] [566.243217] pci_device_remove+0x33/0xa0 [566.243257] device_release_driver_internal+0x1bc/0x240 [566.243299] pci_stop_bus_device+0x6c/0x90 [566.243338] pci_stop_and_remove_bus_device+0xe/0x20 [566.243380] pci_iov_remove_virtfn+0xd1/0x130 [566.243417] sriov_disable+0x34/0xe0 [566.243448] ice_free_vfs+0x2da/0x330 [ice] [566.244383] ice_sriov_configure+0x88/0xad0 [ice] [566.245353] sriov_numvfs_store+0xde/0x1d0 [566.246156] kernfs_fop_write_iter+0x15e/0x210 [566.246921] vfs_write+0x288/0x530 [566.247671] ksys_write+0x74/0xf0 [566.248408] do_syscall_64+0x58/0x80 [566.249145] entry_SYSCALL_64_after_hwframe+0x72/0xdc [566.249886] other info that might help us debug this: [566.252014] Possible unsafe locking scenario: [566.253432] CPU0 CPU1 [566.254118] ---- ---- [566.254800] lock(rtnl_mutex); [566.255514] lock(&adapter->crit_lock); [566.256233] lock(rtnl_mutex); [566.256897] lock(&adapter->crit_lock); [566.257388] *** DEADLOCK *** The deadlock can be triggered by a script that is continuously resetting the VF adapter while doing other operations requiring RTNL, e.g: while :; do ip link set $VF up ethtool --set-channels $VF combined 2 ip link set $VF down ip link set $VF up ethtool --set-channels $VF combined 4 ip link set $VF down done Any operation that triggers a reset can substitute "ethtool --set-channles" As a fix, add a new task "finish_config" that do all the work which needs rtnl lock. With the exception of iavf_remove(), all work that require rtnl should be called from this task. As for iavf_remove(), at the point where we need to call unregister_netdevice() (and grab rtnl_lock), we make sure the finish_config task is not running (cancel_work_sync()) to safely grab rtnl. Subsequent finish_config work cannot restart after that since the task is guarded by the __IAVF_IN_REMOVE_TASK bit in iavf_schedule_finish_config(). Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17Revert "iavf: Do not restart Tx queues after reset task failure"Marcin Szycik1-15/+0
This reverts commit 08f1c147b7265245d67321585c68a27e990e0c4b. Netdev is no longer being detached during reset, so this fix can be reverted. We leave the removal of "hacky" IFF_UP flag update. Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17Revert "iavf: Detach device during reset task"Marcin Szycik1-11/+2
This reverts commit aa626da947e9cd30c4cf727493903e1adbb2c0a0. Detaching device during reset was not fully fixing the rtnl locking issue, as there could be a situation where callback was already in progress before detaching netdev. Furthermore, detaching netdevice causes TX timeouts if traffic is running. To reproduce: ip netns exec ns1 iperf3 -c $PEER_IP -t 600 --logfile /dev/null & while :; do for i in 200 7000 400 5000 300 3000 ; do ip netns exec ns1 ip link set $VF1 mtu $i sleep 2 done sleep 10 done Currently, callbacks such as iavf_change_mtu() wait for the reset. If the reset fails to acquire the rtnl_lock, they schedule the netdev update for later while continuing the reset flow. Operations like MTU changes are performed under the rtnl_lock. Therefore, when the operation finishes, another callback that uses rtnl_lock can start. Signed-off-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Wait for reset in callbacks which trigger itMarcin Szycik4-17/+68
There was a fail when trying to add the interface to bonding right after changing the MTU on the interface. It was caused by bonding interface unable to open the interface due to interface being in __RESETTING state because of MTU change. Add new reset_waitqueue to indicate that reset has finished. Add waiting for reset to finish in callbacks which trigger hw reset: iavf_set_priv_flags(), iavf_change_mtu() and iavf_set_ringparam(). We use a 5000ms timeout period because on Hyper-V based systems, this operation takes around 3000-4000ms. In normal circumstances, it doesn't take more than 500ms to complete. Add a function iavf_wait_for_reset() to reuse waiting for reset code and use it also in iavf_set_channels(), which already waits for reset. We don't use error handling in iavf_set_channels() as this could cause the device to be in incorrect state if the reset was scheduled but hit timeout or the waitng function was interrupted by a signal. Fixes: 4e5e6b5d9d13 ("iavf: Fix return of set the new channel count") Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Co-developed-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: use internal state to free traffic IRQsAhmed Zaki1-3/+4
If the system tries to close the netdev while iavf_reset_task() is running, __LINK_STATE_START will be cleared and netif_running() will return false in iavf_reinit_interrupt_scheme(). This will result in iavf_free_traffic_irqs() not being called and a leak as follows: [7632.489326] remove_proc_entry: removing non-empty directory 'irq/999', leaking at least 'iavf-enp24s0f0v0-TxRx-0' [7632.490214] WARNING: CPU: 0 PID: 10 at fs/proc/generic.c:718 remove_proc_entry+0x19b/0x1b0 is shown when pci_disable_msix() is later called. Fix by using the internal adapter state. The traffic IRQs will always exist if state == __IAVF_RUNNING. Fixes: 5b36e8d04b44 ("i40evf: Enable VF to request an alternate queue allocation") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Fix out-of-bounds when setting channels on removeDing Hui1-1/+1
If we set channels greater during iavf_remove(), and waiting reset done would be timeout, then returned with error but changed num_active_queues directly, that will lead to OOB like the following logs. Because the num_active_queues is greater than tx/rx_rings[] allocated actually. Reproducer: [root@host ~]# cat repro.sh #!/bin/bash pf_dbsf="0000:41:00.0" vf0_dbsf="0000:41:02.0" g_pids=() function do_set_numvf() { echo 2 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) echo 0 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) } function do_set_channel() { local nic=$(ls -1 --indicator-style=none /sys/bus/pci/devices/${vf0_dbsf}/net/) [ -z "$nic" ] && { sleep $((RANDOM%3)) ; return 1; } ifconfig $nic 192.168.18.5 netmask 255.255.255.0 ifconfig $nic up ethtool -L $nic combined 1 ethtool -L $nic combined 4 sleep $((RANDOM%3)) } function on_exit() { local pid for pid in "${g_pids[@]}"; do kill -0 "$pid" &>/dev/null && kill "$pid" &>/dev/null done g_pids=() } trap "on_exit; exit" EXIT while :; do do_set_numvf ; done & g_pids+=($!) while :; do do_set_channel ; done & g_pids+=($!) wait Result: [ 3506.152887] iavf 0000:41:02.0: Removing device [ 3510.400799] ================================================================== [ 3510.400820] BUG: KASAN: slab-out-of-bounds in iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400823] Read of size 8 at addr ffff88b6f9311008 by task repro.sh/55536 [ 3510.400823] [ 3510.400830] CPU: 101 PID: 55536 Comm: repro.sh Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1 [ 3510.400832] Hardware name: Powerleader PR2008AL/H12DSi-N6, BIOS 2.0 04/09/2021 [ 3510.400835] Call Trace: [ 3510.400851] dump_stack+0x71/0xab [ 3510.400860] print_address_description+0x6b/0x290 [ 3510.400865] ? iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400868] kasan_report+0x14a/0x2b0 [ 3510.400873] iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400880] iavf_remove+0x2b6/0xc70 [iavf] [ 3510.400884] ? iavf_free_all_rx_resources+0x160/0x160 [iavf] [ 3510.400891] ? wait_woken+0x1d0/0x1d0 [ 3510.400895] ? notifier_call_chain+0xc1/0x130 [ 3510.400903] pci_device_remove+0xa8/0x1f0 [ 3510.400910] device_release_driver_internal+0x1c6/0x460 [ 3510.400916] pci_stop_bus_device+0x101/0x150 [ 3510.400919] pci_stop_and_remove_bus_device+0xe/0x20 [ 3510.400924] pci_iov_remove_virtfn+0x187/0x420 [ 3510.400927] ? pci_iov_add_virtfn+0xe10/0xe10 [ 3510.400929] ? pci_get_subsys+0x90/0x90 [ 3510.400932] sriov_disable+0xed/0x3e0 [ 3510.400936] ? bus_find_device+0x12d/0x1a0 [ 3510.400953] i40e_free_vfs+0x754/0x1210 [i40e] [ 3510.400966] ? i40e_reset_all_vfs+0x880/0x880 [i40e] [ 3510.400968] ? pci_get_device+0x7c/0x90 [ 3510.400970] ? pci_get_subsys+0x90/0x90 [ 3510.400982] ? pci_vfs_assigned.part.7+0x144/0x210 [ 3510.400987] ? __mutex_lock_slowpath+0x10/0x10 [ 3510.400996] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 3510.401001] sriov_numvfs_store+0x214/0x290 [ 3510.401005] ? sriov_totalvfs_show+0x30/0x30 [ 3510.401007] ? __mutex_lock_slowpath+0x10/0x10 [ 3510.401011] ? __check_object_size+0x15a/0x350 [ 3510.401018] kernfs_fop_write+0x280/0x3f0 [ 3510.401022] vfs_write+0x145/0x440 [ 3510.401025] ksys_write+0xab/0x160 [ 3510.401028] ? __ia32_sys_read+0xb0/0xb0 [ 3510.401031] ? fput_many+0x1a/0x120 [ 3510.401032] ? filp_close+0xf0/0x130 [ 3510.401038] do_syscall_64+0xa0/0x370 [ 3510.401041] ? page_fault+0x8/0x30 [ 3510.401043] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 3510.401073] RIP: 0033:0x7f3a9bb842c0 [ 3510.401079] Code: 73 01 c3 48 8b 0d d8 cb 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 89 24 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24 [ 3510.401080] RSP: 002b:00007ffc05f1fe18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 3510.401083] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3a9bb842c0 [ 3510.401085] RDX: 0000000000000002 RSI: 0000000002327408 RDI: 0000000000000001 [ 3510.401086] RBP: 0000000002327408 R08: 00007f3a9be53780 R09: 00007f3a9c8a4700 [ 3510.401086] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ 3510.401087] R13: 0000000000000001 R14: 00007f3a9be52620 R15: 0000000000000001 [ 3510.401090] [ 3510.401093] Allocated by task 76795: [ 3510.401098] kasan_kmalloc+0xa6/0xd0 [ 3510.401099] __kmalloc+0xfb/0x200 [ 3510.401104] iavf_init_interrupt_scheme+0x26f/0x1310 [iavf] [ 3510.401108] iavf_watchdog_task+0x1d58/0x4050 [iavf] [ 3510.401114] process_one_work+0x56a/0x11f0 [ 3510.401115] worker_thread+0x8f/0xf40 [ 3510.401117] kthread+0x2a0/0x390 [ 3510.401119] ret_from_fork+0x1f/0x40 [ 3510.401122] 0xffffffffffffffff [ 3510.401123] In timeout handling, we should keep the original num_active_queues and reset num_req_queues to 0. Fixes: 4e5e6b5d9d13 ("iavf: Fix return of set the new channel count") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> Cc: Huang Cun <huangcun@sangfor.com.cn> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Fix use-after-free in free_netdevDing Hui1-4/+1
We do netif_napi_add() for all allocated q_vectors[], but potentially do netif_napi_del() for part of them, then kfree q_vectors and leave invalid pointers at dev->napi_list. Reproducer: [root@host ~]# cat repro.sh #!/bin/bash pf_dbsf="0000:41:00.0" vf0_dbsf="0000:41:02.0" g_pids=() function do_set_numvf() { echo 2 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) echo 0 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) } function do_set_channel() { local nic=$(ls -1 --indicator-style=none /sys/bus/pci/devices/${vf0_dbsf}/net/) [ -z "$nic" ] && { sleep $((RANDOM%3)) ; return 1; } ifconfig $nic 192.168.18.5 netmask 255.255.255.0 ifconfig $nic up ethtool -L $nic combined 1 ethtool -L $nic combined 4 sleep $((RANDOM%3)) } function on_exit() { local pid for pid in "${g_pids[@]}"; do kill -0 "$pid" &>/dev/null && kill "$pid" &>/dev/null done g_pids=() } trap "on_exit; exit" EXIT while :; do do_set_numvf ; done & g_pids+=($!) while :; do do_set_channel ; done & g_pids+=($!) wait Result: [ 4093.900222] ================================================================== [ 4093.900230] BUG: KASAN: use-after-free in free_netdev+0x308/0x390 [ 4093.900232] Read of size 8 at addr ffff88b4dc145640 by task repro.sh/6699 [ 4093.900233] [ 4093.900236] CPU: 10 PID: 6699 Comm: repro.sh Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1 [ 4093.900238] Hardware name: Powerleader PR2008AL/H12DSi-N6, BIOS 2.0 04/09/2021 [ 4093.900239] Call Trace: [ 4093.900244] dump_stack+0x71/0xab [ 4093.900249] print_address_description+0x6b/0x290 [ 4093.900251] ? free_netdev+0x308/0x390 [ 4093.900252] kasan_report+0x14a/0x2b0 [ 4093.900254] free_netdev+0x308/0x390 [ 4093.900261] iavf_remove+0x825/0xd20 [iavf] [ 4093.900265] pci_device_remove+0xa8/0x1f0 [ 4093.900268] device_release_driver_internal+0x1c6/0x460 [ 4093.900271] pci_stop_bus_device+0x101/0x150 [ 4093.900273] pci_stop_and_remove_bus_device+0xe/0x20 [ 4093.900275] pci_iov_remove_virtfn+0x187/0x420 [ 4093.900277] ? pci_iov_add_virtfn+0xe10/0xe10 [ 4093.900278] ? pci_get_subsys+0x90/0x90 [ 4093.900280] sriov_disable+0xed/0x3e0 [ 4093.900282] ? bus_find_device+0x12d/0x1a0 [ 4093.900290] i40e_free_vfs+0x754/0x1210 [i40e] [ 4093.900298] ? i40e_reset_all_vfs+0x880/0x880 [i40e] [ 4093.900299] ? pci_get_device+0x7c/0x90 [ 4093.900300] ? pci_get_subsys+0x90/0x90 [ 4093.900306] ? pci_vfs_assigned.part.7+0x144/0x210 [ 4093.900309] ? __mutex_lock_slowpath+0x10/0x10 [ 4093.900315] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 4093.900318] sriov_numvfs_store+0x214/0x290 [ 4093.900320] ? sriov_totalvfs_show+0x30/0x30 [ 4093.900321] ? __mutex_lock_slowpath+0x10/0x10 [ 4093.900323] ? __check_object_size+0x15a/0x350 [ 4093.900326] kernfs_fop_write+0x280/0x3f0 [ 4093.900329] vfs_write+0x145/0x440 [ 4093.900330] ksys_write+0xab/0x160 [ 4093.900332] ? __ia32_sys_read+0xb0/0xb0 [ 4093.900334] ? fput_many+0x1a/0x120 [ 4093.900335] ? filp_close+0xf0/0x130 [ 4093.900338] do_syscall_64+0xa0/0x370 [ 4093.900339] ? page_fault+0x8/0x30 [ 4093.900341] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 4093.900357] RIP: 0033:0x7f16ad4d22c0 [ 4093.900359] Code: 73 01 c3 48 8b 0d d8 cb 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 89 24 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24 [ 4093.900360] RSP: 002b:00007ffd6491b7f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 4093.900362] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f16ad4d22c0 [ 4093.900363] RDX: 0000000000000002 RSI: 0000000001a41408 RDI: 0000000000000001 [ 4093.900364] RBP: 0000000001a41408 R08: 00007f16ad7a1780 R09: 00007f16ae1f2700 [ 4093.900364] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ 4093.900365] R13: 0000000000000001 R14: 00007f16ad7a0620 R15: 0000000000000001 [ 4093.900367] [ 4093.900368] Allocated by task 820: [ 4093.900371] kasan_kmalloc+0xa6/0xd0 [ 4093.900373] __kmalloc+0xfb/0x200 [ 4093.900376] iavf_init_interrupt_scheme+0x63b/0x1320 [iavf] [ 4093.900380] iavf_watchdog_task+0x3d51/0x52c0 [iavf] [ 4093.900382] process_one_work+0x56a/0x11f0 [ 4093.900383] worker_thread+0x8f/0xf40 [ 4093.900384] kthread+0x2a0/0x390 [ 4093.900385] ret_from_fork+0x1f/0x40 [ 4093.900387] 0xffffffffffffffff [ 4093.900387] [ 4093.900388] Freed by task 6699: [ 4093.900390] __kasan_slab_free+0x137/0x190 [ 4093.900391] kfree+0x8b/0x1b0 [ 4093.900394] iavf_free_q_vectors+0x11d/0x1a0 [iavf] [ 4093.900397] iavf_remove+0x35a/0xd20 [iavf] [ 4093.900399] pci_device_remove+0xa8/0x1f0 [ 4093.900400] device_release_driver_internal+0x1c6/0x460 [ 4093.900401] pci_stop_bus_device+0x101/0x150 [ 4093.900402] pci_stop_and_remove_bus_device+0xe/0x20 [ 4093.900403] pci_iov_remove_virtfn+0x187/0x420 [ 4093.900404] sriov_disable+0xed/0x3e0 [ 4093.900409] i40e_free_vfs+0x754/0x1210 [i40e] [ 4093.900415] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 4093.900416] sriov_numvfs_store+0x214/0x290 [ 4093.900417] kernfs_fop_write+0x280/0x3f0 [ 4093.900418] vfs_write+0x145/0x440 [ 4093.900419] ksys_write+0xab/0x160 [ 4093.900420] do_syscall_64+0xa0/0x370 [ 4093.900421] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 4093.900422] 0xffffffffffffffff [ 4093.900422] [ 4093.900424] The buggy address belongs to the object at ffff88b4dc144200 which belongs to the cache kmalloc-8k of size 8192 [ 4093.900425] The buggy address is located 5184 bytes inside of 8192-byte region [ffff88b4dc144200, ffff88b4dc146200) [ 4093.900425] The buggy address belongs to the page: [ 4093.900427] page:ffffea00d3705000 refcount:1 mapcount:0 mapping:ffff88bf04415c80 index:0x0 compound_mapcount: 0 [ 4093.900430] flags: 0x10000000008100(slab|head) [ 4093.900433] raw: 0010000000008100 dead000000000100 dead000000000200 ffff88bf04415c80 [ 4093.900434] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000 [ 4093.900434] page dumped because: kasan: bad access detected [ 4093.900435] [ 4093.900435] Memory state around the buggy address: [ 4093.900436] ffff88b4dc145500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900437] ffff88b4dc145580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900438] >ffff88b4dc145600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900438] ^ [ 4093.900439] ffff88b4dc145680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900440] ffff88b4dc145700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900440] ================================================================== Although the patch #2 (of 2) can avoid the issue triggered by this repro.sh, there still are other potential risks that if num_active_queues is changed to less than allocated q_vectors[] by unexpected, the mismatched netif_napi_add/del() can also cause UAF. Since we actually call netif_napi_add() for all allocated q_vectors unconditionally in iavf_alloc_q_vectors(), so we should fix it by letting netif_napi_del() match to netif_napi_add(). Fixes: 5eae00c57f5e ("i40evf: main driver core") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> Cc: Huang Cun <huangcun@sangfor.com.cn> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-14i40e: Wait for pending VF reset in VF set callbacksIvan Vecera1-11/+5
Commit 028daf80117376 ("i40e: Fix attach VF to VM issue") fixed a race between i40e_ndo_set_vf_mac() and i40e_reset_vf() during an attachment of VF device to VM. This issue is not related to setting MAC address only but also VLAN assignment to particular VF because the newer libvirt sets configured MAC address as well as an optional VLAN. The same behavior is also for i40e's .ndo_set_vf_rate and .ndo_set_vf_spoofchk where the callbacks just check if the VF was initialized but not wait for the finish of pending reset. Reproducer: [root@host ~]# virsh attach-interface guest hostdev --managed 0000:02:02.0 --mac 52:54:00:b4:aa:bb error: Failed to attach interface error: Cannot set interface MAC/vlanid to 52:54:00:b4:aa:bb/0 for ifname enp2s0f0 vf 0: Resource temporarily unavailable Fix this issue by using i40e_check_vf_init_timeout() helper to check whether a reset of particular VF was finished in i40e's .ndo_set_vf_vlan, .ndo_set_vf_rate and .ndo_set_vf_spoofchk callbacks. Tested-by: Ma Yuying <yuma@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-14i40e: Add helper for VF inited state check with timeoutIvan Vecera1-16/+33
Move the check for VF inited state (with optional up-to 300ms timeout to separate helper i40e_check_vf_init_timeout() that will be used in the following commit. Tested-by: Ma Yuying <yuma@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-14ice: prevent NULL pointer deref during reloadMichal Swiatkowski3-4/+21
Calling ethtool during reload can lead to call trace, because VSI isn't configured for some time, but netdev is alive. To fix it add rtnl lock for VSI deconfig and config. Set ::num_q_vectors to 0 after freeing and add a check for ::tx/rx_rings in ring related ethtool ops. Add proper unroll of filters in ice_start_eth(). Reproduction: $watch -n 0.1 -d 'ethtool -g enp24s0f0np0' $devlink dev reload pci/0000:18:00.0 action driver_reinit Call trace before fix: [66303.926205] BUG: kernel NULL pointer dereference, address: 0000000000000000 [66303.926259] #PF: supervisor read access in kernel mode [66303.926286] #PF: error_code(0x0000) - not-present page [66303.926311] PGD 0 P4D 0 [66303.926332] Oops: 0000 [#1] PREEMPT SMP PTI [66303.926358] CPU: 4 PID: 933821 Comm: ethtool Kdump: loaded Tainted: G OE 6.4.0-rc5+ #1 [66303.926400] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0014.070920180847 07/09/2018 [66303.926446] RIP: 0010:ice_get_ringparam+0x22/0x50 [ice] [66303.926649] Code: 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 87 c0 09 00 00 c7 46 04 e0 1f 00 00 c7 46 10 e0 1f 00 00 48 8b 50 20 <48> 8b 12 0f b7 52 3a 89 56 14 48 8b 40 28 48 8b 00 0f b7 40 58 48 [66303.926722] RSP: 0018:ffffad40472f39c8 EFLAGS: 00010246 [66303.926749] RAX: ffff98a8ada05828 RBX: ffff98a8c46dd060 RCX: ffffad40472f3b48 [66303.926781] RDX: 0000000000000000 RSI: ffff98a8c46dd068 RDI: ffff98a8b23c4000 [66303.926811] RBP: ffffad40472f3b48 R08: 00000000000337b0 R09: 0000000000000000 [66303.926843] R10: 0000000000000001 R11: 0000000000000100 R12: ffff98a8b23c4000 [66303.926874] R13: ffff98a8c46dd060 R14: 000000000000000f R15: ffffad40472f3a50 [66303.926906] FS: 00007f6397966740(0000) GS:ffff98b390900000(0000) knlGS:0000000000000000 [66303.926941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66303.926967] CR2: 0000000000000000 CR3: 000000011ac20002 CR4: 00000000007706e0 [66303.926999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [66303.927029] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [66303.927060] PKRU: 55555554 [66303.927075] Call Trace: [66303.927094] <TASK> [66303.927111] ? __die+0x23/0x70 [66303.927140] ? page_fault_oops+0x171/0x4e0 [66303.927176] ? exc_page_fault+0x7f/0x180 [66303.927209] ? asm_exc_page_fault+0x26/0x30 [66303.927244] ? ice_get_ringparam+0x22/0x50 [ice] [66303.927433] rings_prepare_data+0x62/0x80 [66303.927469] ethnl_default_doit+0xe2/0x350 [66303.927501] genl_family_rcv_msg_doit.isra.0+0xe3/0x140 [66303.927538] genl_rcv_msg+0x1b1/0x2c0 [66303.927561] ? __pfx_ethnl_default_doit+0x10/0x10 [66303.927590] ? __pfx_genl_rcv_msg+0x10/0x10 [66303.927615] netlink_rcv_skb+0x58/0x110 [66303.927644] genl_rcv+0x28/0x40 [66303.927665] netlink_unicast+0x19e/0x290 [66303.927691] netlink_sendmsg+0x254/0x4d0 [66303.927717] sock_sendmsg+0x93/0xa0 [66303.927743] __sys_sendto+0x126/0x170 [66303.927780] __x64_sys_sendto+0x24/0x30 [66303.928593] do_syscall_64+0x5d/0x90 [66303.929370] ? __count_memcg_events+0x60/0xa0 [66303.930146] ? count_memcg_events.constprop.0+0x1a/0x30 [66303.930920] ? handle_mm_fault+0x9e/0x350 [66303.931688] ? do_user_addr_fault+0x258/0x740 [66303.932452] ? exc_page_fault+0x7f/0x180 [66303.933193] entry_SYSCALL_64_after_hwframe+0x72/0xdc Fixes: 5b246e533d01 ("ice: split probe into smaller functions") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-14ice: Unregister netdev and devlink_port only oncePetr Oros1-27/+0
Since commit 6624e780a577fc ("ice: split ice_vsi_setup into smaller functions") ice_vsi_release does things twice. There is unregister netdev which is unregistered in ice_deinit_eth also. It also unregisters the devlink_port twice which is also unregistered in ice_deinit_eth(). This double deregistration is hidden because devl_port_unregister ignores the return value of xa_erase. [ 68.642167] Call Trace: [ 68.650385] ice_devlink_destroy_pf_port+0xe/0x20 [ice] [ 68.655656] ice_vsi_release+0x445/0x690 [ice] [ 68.660147] ice_deinit+0x99/0x280 [ice] [ 68.664117] ice_remove+0x1b6/0x5c0 [ice] [ 171.103841] Call Trace: [ 171.109607] ice_devlink_destroy_pf_port+0xf/0x20 [ice] [ 171.114841] ice_remove+0x158/0x270 [ice] [ 171.118854] pci_device_remove+0x3b/0xc0 [ 171.122779] device_release_driver_internal+0xc7/0x170 [ 171.127912] driver_detach+0x54/0x8c [ 171.131491] bus_remove_driver+0x77/0xd1 [ 171.135406] pci_unregister_driver+0x2d/0xb0 [ 171.139670] ice_module_exit+0xc/0x55f [ice] Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: Fix inserting of empty frame for launchtimeFlorian Kauer1-1/+1
The insertion of an empty frame was introduced with commit db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") in order to ensure that the current cycle has at least one packet if there is some packet to be scheduled for the next cycle. However, the current implementation does not properly check if a packet is already scheduled for the current cycle. Currently, an empty packet is always inserted if and only if txtime >= end_of_cycle && txtime > last_tx_cycle but since last_tx_cycle is always either the end of the current cycle (end_of_cycle) or the end of a previous cycle, the second part (txtime > last_tx_cycle) is always true unless txtime == last_tx_cycle. What actually needs to be checked here is if the last_tx_cycle was already written within the current cycle, so an empty frame should only be inserted if and only if txtime >= end_of_cycle && end_of_cycle > last_tx_cycle. This patch does not only avoid an unnecessary insertion, but it can actually be harmful to insert an empty packet if packets are already scheduled in the current cycle, because it can lead to a situation where the empty packet is actually processed as the first packet in the upcoming cycle shifting the packet with the first_flag even one cycle into the future, finally leading to a TX hang. The TX hang can be reproduced on a i225 with: sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio \ num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 \ base-time 0 \ sched-entry S 01 300000 \ flags 0x1 \ txtime-delay 500000 \ clockid CLOCK_TAI sudo tc qdisc replace dev enp1s0 parent 100:1 etf \ clockid CLOCK_TAI \ delta 500000 \ offload \ skip_sock_check and traffic generator sudo trafgen -i traffic.cfg -o enp1s0 --cpp -n0 -q -t1400ns with traffic.cfg #define ETH_P_IP 0x0800 { /* Ethernet Header */ 0x30, 0x1f, 0x9a, 0xd0, 0xf0, 0x0e, # MAC Dest - adapt as needed 0x24, 0x5e, 0xbe, 0x57, 0x2e, 0x36, # MAC Src - adapt as needed const16(ETH_P_IP), /* IPv4 Header */ 0b01000101, 0, # IPv4 version, IHL, TOS const16(1028), # IPv4 total length (UDP length + 20 bytes (IP header)) const16(2), # IPv4 ident 0b01000000, 0, # IPv4 flags, fragmentation off 64, # IPv4 TTL 17, # Protocol UDP csumip(14, 33), # IPv4 checksum /* UDP Header */ 10, 0, 48, 1, # IP Src - adapt as needed 10, 0, 48, 10, # IP Dest - adapt as needed const16(5555), # UDP Src Port const16(6666), # UDP Dest Port const16(1008), # UDP length (UDP header 8 bytes + payload length) csumudp(14, 34), # UDP checksum /* Payload */ fill('W', 1000), } and the observed message with that is for example igc 0000:01:00.0 enp1s0: Detected Tx Unit Hang Tx Queue <0> TDH <32> TDT <3c> next_to_use <3c> next_to_clean <32> buffer_info[next_to_clean] time_stamp <ffff26a8> next_to_watch <00000000632a1828> jiffies <ffff27f8> desc.status <1048000> Fixes: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: Fix launchtime before start of cycleFlorian Kauer1-1/+1
It is possible (verified on a running system) that frames are processed by igc_tx_launchtime with a txtime before the start of the cycle (baset_est). However, the result of txtime - baset_est is written into a u32, leading to a wrap around to a positive number. The following launchtime > 0 check will only branch to executing launchtime = 0 if launchtime is already 0. Fix it by using a s32 before checking launchtime > 0. Fixes: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: No strict mode in pure launchtime/CBS offloadFlorian Kauer1-2/+22
The flags IGC_TXQCTL_STRICT_CYCLE and IGC_TXQCTL_STRICT_END prevent the packet transmission over slot and cycle boundaries. This is important for taprio offload where the slots and cycles correspond to the slots and cycles configured for the network. However, the Qbv offload feature of the i225 is also used for enabling TX launchtime / ETF offload. In that case, however, the cycle has no meaning for the network and is only used internally to adapt the base time register after a second has passed. Enabling strict mode in this case would unnecessarily prevent the transmission of certain packets (i.e. at the boundary of a second) and thus interferes with the ETF qdisc that promises transmission at a certain point in time. Similar to ETF, this also applies to CBS offload that also should not be influenced by strict mode unless taprio offload would be enabled at the same time. This fully reverts commit d8f45be01dd9 ("igc: Use strict cycles for Qbv scheduling") but its commit message only describes what was already implemented before that commit. The difference to a plain revert of that commit is that it now copes with the base_time = 0 case that was fixed with commit e17090eb2494 ("igc: allow BaseTime 0 enrollment for Qbv") In particular, enabling strict mode leads to TX hang situations under high traffic if taprio is applied WITHOUT taprio offload but WITH ETF offload, e.g. as in sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio \ num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 \ base-time 0 \ sched-entry S 01 300000 \ flags 0x1 \ txtime-delay 500000 \ clockid CLOCK_TAI sudo tc qdisc replace dev enp1s0 parent 100:1 etf \ clockid CLOCK_TAI \ delta 500000 \ offload \ skip_sock_check and traffic generator sudo trafgen -i traffic.cfg -o enp1s0 --cpp -n0 -q -t1400ns with traffic.cfg #define ETH_P_IP 0x0800 { /* Ethernet Header */ 0x30, 0x1f, 0x9a, 0xd0, 0xf0, 0x0e, # MAC Dest - adapt as needed 0x24, 0x5e, 0xbe, 0x57, 0x2e, 0x36, # MAC Src - adapt as needed const16(ETH_P_IP), /* IPv4 Header */ 0b01000101, 0, # IPv4 version, IHL, TOS const16(1028), # IPv4 total length (UDP length + 20 bytes (IP header)) const16(2), # IPv4 ident 0b01000000, 0, # IPv4 flags, fragmentation off 64, # IPv4 TTL 17, # Protocol UDP csumip(14, 33), # IPv4 checksum /* UDP Header */ 10, 0, 48, 1, # IP Src - adapt as needed 10, 0, 48, 10, # IP Dest - adapt as needed const16(5555), # UDP Src Port const16(6666), # UDP Dest Port const16(1008), # UDP length (UDP header 8 bytes + payload length) csumudp(14, 34), # UDP checksum /* Payload */ fill('W', 1000), } and the observed message with that is for example igc 0000:01:00.0 enp1s0: Detected Tx Unit Hang Tx Queue <0> TDH <d0> TDT <f0> next_to_use <f0> next_to_clean <d0> buffer_info[next_to_clean] time_stamp <ffff661f> next_to_watch <00000000245a4efb> jiffies <ffff6e48> desc.status <1048000> Fixes: d8f45be01dd9 ("igc: Use strict cycles for Qbv scheduling") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: Handle already enabled taprio offload for basetime 0Florian Kauer1-1/+1
Since commit e17090eb2494 ("igc: allow BaseTime 0 enrollment for Qbv") it is possible to enable taprio offload with a basetime of 0. However, the check if taprio offload is already enabled (and thus -EALREADY should be returned for igc_save_qbv_schedule) still relied on adapter->base_time > 0. This can be reproduced as follows: # TAPRIO offload (flags == 0x2) and base-time = 0 sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \ num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 \ base-time 0 \ sched-entry S 01 300000 \ flags 0x2 # The second call should fail with "Error: Device failed to setup taprio offload." # But that only happens if base-time was != 0 sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \ num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 \ base-time 0 \ sched-entry S 01 300000 \ flags 0x2 Fixes: e17090eb2494 ("igc: allow BaseTime 0 enrollment for Qbv") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: Do not enable taprio offload for invalid argumentsFlorian Kauer1-12/+6
Only set adapter->taprio_offload_enable after validating the arguments. Otherwise, it stays set even if the offload was not enabled. Since the subsequent code does not get executed in case of invalid arguments, it will not be read at first. However, by activating and then deactivating another offload (e.g. ETF/TX launchtime offload), taprio_offload_enable is read and erroneously keeps the offload feature of the NIC enabled. This can be reproduced as follows: # TAPRIO offload (flags == 0x2) and negative base-time leading to expected -ERANGE sudo tc qdisc replace dev enp1s0 parent root handle 100 stab overhead 24 taprio \ num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 \ base-time -1000 \ sched-entry S 01 300000 \ flags 0x2 # IGC_TQAVCTRL is 0x0 as expected (iomem=relaxed for reading register) sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1 # Activate ETF offload sudo tc qdisc replace dev enp1s0 parent root handle 6666 mqprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ hw 0 sudo tc qdisc add dev enp1s0 parent 6666:1 etf \ clockid CLOCK_TAI \ delta 500000 \ offload # IGC_TQAVCTRL is 0x9 as expected sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1 # Deactivate ETF offload again sudo tc qdisc delete dev enp1s0 parent 6666:1 # IGC_TQAVCTRL should now be 0x0 again, but is observed as 0x9 sudo pcimem /sys/bus/pci/devices/0000:01:00.0/resource0 0x3570 w*1 Fixes: e17090eb2494 ("igc: allow BaseTime 0 enrollment for Qbv") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-10igc: Rename qbv_enable to taprio_offload_enableFlorian Kauer3-5/+5
In the current implementation the flags adapter->qbv_enable and IGC_FLAG_TSN_QBV_ENABLED have a similar name, but do not have the same meaning. The first one is used only to indicate taprio offload (i.e. when igc_save_qbv_schedule was called), while the second one corresponds to the Qbv mode of the hardware. However, the second one is also used to support the TX launchtime feature, i.e. ETF qdisc offload. This leads to situations where adapter->qbv_enable is false, but the flag IGC_FLAG_TSN_QBV_ENABLED is set. This is prone to confusion. The rename should reduce this confusion. Since it is a pure rename, it has no impact on functionality. Fixes: e17090eb2494 ("igc: allow BaseTime 0 enrollment for Qbv") Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-07Merge branch '1GbE' of ↵David S. Miller5-32/+118
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-07-05 (igc) This series contains updates to igc driver only. Husaini adds check to increment Qbv change error counter only on taprio Qbvs. He also removes delay during Tx ring configuration and resolves Tx hang that could occur when transmitting on a gate to be closed. Prasad Koya reports ethtool link mode as TP (twisted pair). Tee Min corrects value for max SDU. Aravindhan ensures that registers for PPS are always programmed to occur in future. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-05igc: Handle PPS start time programming for past time valuesAravindhan Gunasekaran1-3/+22
I225/6 hardware can be programmed to start PPS output once the time in Target Time registers is reached. The time programmed in these registers should always be into future. Only then PPS output is triggered when SYSTIM register reaches the programmed value. There are two modes in i225/6 hardware to program PPS, pulse and clock mode. There were issues reported where PPS is not generated when start time is in past. Example 1, "echo 0 0 0 2 0 > /sys/class/ptp/ptp0/period" In the current implementation, a value of '0' is programmed into Target time registers and PPS output is in pulse mode. Eventually an interrupt which is triggered upon SYSTIM register reaching Target time is not fired. Thus no PPS output is generated. Example 2, "echo 0 0 0 1 0 > /sys/class/ptp/ptp0/period" Above case, a value of '0' is programmed into Target time registers and PPS output is in clock mode. Here, HW tries to catch-up the current time by incrementing Target Time register. This catch-up time seem to vary according to programmed PPS period time as per the HW design. In my experiments, the delay ranged between few tens of seconds to few minutes. The PPS output is only generated after the Target time register reaches current time. In my experiments, I also observed PPS stopped working with below test and could not recover until module is removed and loaded again. 1) echo 0 <future time> 0 1 0 > /sys/class/ptp/ptp1/period 2) echo 0 0 0 1 0 > /sys/class/ptp/ptp1/period 3) echo 0 0 0 1 0 > /sys/class/ptp/ptp1/period After this PPS did not work even if i re-program with proper values. I could only get this back working by reloading the driver. This patch takes care of calculating and programming appropriate future time value into Target Time registers. Fixes: 5e91c72e560c ("igc: Fix PPS delta between two synchronized end-points") Signed-off-by: Aravindhan Gunasekaran <aravindhan.gunasekaran@intel.com> Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-05igc: Include the length/type field and VLAN tag in queueMaxSDUTan Tee Min1-11/+4
IEEE 802.1Q does not have clear definitions of what constitutes an SDU (Service Data Unit), but IEEE Std 802.3 clause 3.1.2 does define the MAC service primitives and clause 3.2.7 does define the MAC Client Data for Q-tagged frames. It shows that the mac_service_data_unit (MSDU) does NOT contain the preamble, destination and source address, or FCS. The MSDU does contain the length/type field, MAC client data, VLAN tag and any padding data (prior to the FCS). Thus, the maximum 802.3 frame size that is allowed to be transmitted should be QueueMaxSDU (MSDU) + 16 (6 byte SA + 6 byte DA + 4 byte FCS). Fixes: 92a0dcb8427d ("igc: offload queue max SDU from tc-taprio") Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com> Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-05igc: set TP bit in 'supported' and 'advertising' fields of ↵Prasad Koya1-0/+2
ethtool_link_ksettings set TP bit in the 'supported' and 'advertising' fields. i225/226 parts only support twisted pair copper. Fixes: 8c5ad0dae93c ("igc: Add ethtool support") Signed-off-by: Prasad Koya <prasad@arista.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-05igc: Fix TX Hang issue when QBV Gate is closedMuhammad Husaini Zulkifli3-18/+87
If a user schedules a Gate Control List (GCL) to close one of the QBV gates while also transmitting a packet to that closed gate, TX Hang will be happen. HW would not drop any packet when the gate is closed and keep queuing up in HW TX FIFO until the gate is re-opened. This patch implements the solution to drop the packet for the closed gate. This patch will also reset the adapter to perform SW initialization for each 1st Gate Control List (GCL) to avoid hang. This is due to the HW design, where changing to TSN transmit mode requires SW initialization. Intel Discrete I225/6 transmit mode cannot be changed when in dynamic mode according to Software User Manual Section 7.5.2.1. Subsequent Gate Control List (GCL) operations will proceed without a reset, as they already are in TSN Mode. Step to reproduce: DUT: 1) Configure GCL List with certain gate close. BASE=$(date +%s%N) tc qdisc replace dev $IFACE parent root handle 100 taprio \ num_tc 4 \ map 0 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 \ queues 1@0 1@1 1@2 1@3 \ base-time $BASE \ sched-entry S 0x8 500000 \ sched-entry S 0x4 500000 \ flags 0x2 2) Transmit the packet to closed gate. You may use udp_tai application to transmit UDP packet to any of the closed gate. ./udp_tai -i <interface> -P 100000 -p 90 -c 1 -t <0/1> -u 30004 Fixes: ec50a9d437f0 ("igc: Add support for taprio offloading") Co-developed-by: Tan Tee Min <tee.min.tan@linux.intel.com> Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com> Tested-by: Chwee Lin Choong <chwee.lin.choong@intel.com> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-05igc: Remove delay during TX ring configurationMuhammad Husaini Zulkifli1-1/+0
Remove unnecessary delay during the TX ring configuration. This will cause delay, especially during link down and link up activity. Furthermore, old SKUs like as I225 will call the reset_adapter to reset the controller during TSN mode Gate Control List (GCL) setting. This will add more time to the configuration of the real-time use case. It doesn't mentioned about this delay in the Software User Manual. It might have been ported from legacy code I210 in the past. Fixes: 13b5b7fd6a4a ("igc: Add support for Tx/Rx rings") Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>