summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel/i40e/i40e_txrx.c
AgeCommit message (Collapse)AuthorFilesLines
2018-03-27i40e: add support for XDP_REDIRECTBjörn Töpel1-10/+64
The driver now acts upon the XDP_REDIRECT return action. Two new ndos are implemented, ndo_xdp_xmit and ndo_xdp_flush. XDP_REDIRECT action enables XDP program to redirect frames to other netdevs. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-27i40e: tweak page counting for XDP_REDIRECTBjörn Töpel1-5/+4
This commit tweaks the page counting for XDP_REDIRECT to function properly. XDP_REDIRECT support will be added in a future commit. The current page counting scheme assumes that the reference count cannot decrease until the received frame is sent to the upper layers of the networking stack. This assumption does not hold for the XDP_REDIRECT action, since a page (pointed out by xdp_buff) can have its reference count decreased via the xdp_do_redirect call. To work around that, we now start off by a large page count and then don't allow a refcount less than two. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-26i40e: move AUTO_DISABLED flags into the state fieldJacob Keller1-8/+13
The two Flow Directory auto disable flags are used at run time to mark when the flow director features needed to be disabled. Thus the flags could change even when the RTNL lock is not held. They also have some code constructions which really should be test_and_set or test_and_clear using atomic bit operations. Create new state fields to mark this, and stop including them in pf->flags. This is part of a larger effort to remove the need for cmpxchg64 in i40e_set_priv_flags(). Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-23intel: add SPDX identifiers to all the Intel driversJeff Kirsher1-0/+1
Add the SPDX identifiers to all the Intel wired LAN driver files, as outlined in Documentation/process/license-rules.rst. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26i40e/i40evf: use SW variables for hang detectionAlan Brady1-5/+11
The i40e_detect_recover_hung function uses the i40e_get_tx_pending function to determine if there are packets stalled on the ring. i40e_get_tx_pending calculates the pending packets using the head writeback value and HW tail. If the queue is stopped and we lose the interrupt to update our next_to_clean then we a) won't get another interrupt to clean because queue is stopped b) we won't catch the problem with i40e_detect_recover_hung because the HW values look like there's no packets waiting to be transmitted. Using the SW values we can catch the issue because next_to_clean will be out of sync with head writeback. This has the added benefit being less CPU intensive because we don't need to reach into the hardware to get the values. Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Add support for new mechanism of updating adaptive ITRAlexander Duyck1-111/+251
This patch replaces the existing mechanism for determining the correct value to program for adaptive ITR with yet another new and more complicated approach. The basic idea from a 30K foot view is that this new approach will push the Rx interrupt moderation up so that by default it starts in low latency and is gradually pushed up into a higher latency setup as long as doing so increases the number of packets processed, if the number of packets drops to 4 to 1 per packet we will reset and just base our ITR on the size of the packets being received. For Tx we leave it floating at a high interrupt delay and do not pull it down unless we start processing more than 112 packets per interrupt. If we start exceeding that we will cut our interrupt rates in half until we are back below 112. The side effect of these patches are that we will be processing more packets per interrupt. This is both a good and a bad thing as it means we will not be blocking processing in the case of things like pktgen and XDP, but we will also be consuming a bit more CPU in the cases of things such as network throughput tests using netperf. One delta from this versus the ixgbe version of the changes is that I have made the interrupt moderation a bit more aggressive when we are in bulk mode by moving our "goldilocks zone" up from 48 to 96 to 56 to 112. The main motivation behind moving this is to address the fact that we need to update less frequently, and have more fine grained control due to the separate Tx and Rx ITR times. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Split container ITR into current_itr and target_itrAlexander Duyck1-29/+39
This patch is mostly prep-work for replacing the current approach to programming the dynamic aka adaptive ITR. Specifically here what we are doing is splitting the Tx and Rx ITR each into two separate values. The first value current_itr represents the current value of the register. The second value target_itr represents the desired value of the register. The general plan by doing this is to allow for deferring the update of the ITR value under certain circumstances. For now we will work with what we have, but in the future I hope to change the behavior so that we always only update one ITR at a time using some simple logic to determine which ITR requires an update. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Use usec value instead of reg value for ITR definesAlexander Duyck1-2/+9
Instead of using the register value for the defines when setting up the ring ITR we can just use the actual values and avoid the use of shifts and macros to translate between the values we have and the values we want. This helps to make the code more readable as we can quickly translate from one value to the other. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Don't bother setting the CLEARPBA bitAlexander Duyck1-1/+10
The CLEARPBA bit in the dynamic interrupt control register actually has no effect either way on the hardware. As per errata 28 in the XL710 specification update the interrupt is actually cleared any time the register is written with the INTENA_MSK bit set to 0. As such the act of toggling the enable bit actually will trigger the interrupt being cleared and could lead to potential lost events if auto-masking is not enabled. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Clean up logic for adaptive ITRAlexander Duyck1-41/+14
The logic for dynamic ITR update is confusing at best as there were odd paths chosen for how to find the rings associated with a given queue based on the vector index and other inconsistencies throughout the code. This patch is an attempt to clean up the logic so that we can more easily understand what is going on. Specifically if there is a Rx or Tx ring that is enabled in dynamic mode on the q_vector it is allowed to override the other side of the interrupt moderation. While it isn't correct all this patch is doing is cleaning up the logic for now so that when we come through and fix it we can more easily identify that this is wrong. The other big change made here is that we replace references to: vsi->rx_rings[q_vector->v_idx]->itr_setting with: q_vector->rx.ring->itr_setting The general idea is we can avoid the long pointer chase since just accessing q_vector->rx.ring is a single pointer access versus having to chase down vsi->rx_rings, and then finding the pointer in the array, and finally chasing down the itr_setting from there. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e/i40evf: Only track one ITR setting per ring instead of Tx/RxAlexander Duyck1-3/+3
The rings are already split out into Tx and Rx rings so it doesn't make sense to have any single ring store both a Tx and Rx itr_setting value. Since that is the case drop the pair in favor of storing just a single ITR value. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-02-12i40e: fix typo in function descriptionAlan Brady1-1/+1
'bufer' should be 'buffer' Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-27i40e/i40evf: Record ITR register location in the q_vectorAlexander Duyck1-8/+4
The drivers for i40e and i40evf had a reg_idx value stored in the q_vector that was going completely unused. I can only assume this was copied over from ixgbe and nobody knew how to use it. I'm going to make use of the value to avoid having to compute the vector and thus the register index for multiple paths throughout the drivers. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-23i40e/i40evf: Detect and recover hung queue scenarioSudheer Mogilappagari1-0/+54
In VFs, there is a known issue which can cause writebacks to not occur when interrupts are disabled and there are less than 4 descriptors resulting in TX timeout. Timeout can also occur due to lost interrupt. The current implementation for detecting and recovering from hung queues in the PF is problematic because it actually actively encourages lost interrupts. By triggering a SW interrupt, interrupts are forced on. If we are already in napi_poll and an interrupt fires, napi_poll will not be rescheduled and the interrupt is effectively lost; thereby potentially *causing* hung queues. This patch checks whether packets are being processed between every watchdog cycle and determine potential hung queue and fires triggers SW interrupt only for that particular queue. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+23
2018-01-06i40e: setup xdp_rxq_infoJesper Dangaard Brouer1-2/+16
The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is a sideband channel for configuring/updating the flow director tables. This (i40e_vsi_)type does not invoke XDP-ebpf code. As suggested by Björn (V2): Instead of marking this I40E_VSI_FDIR RX-ring a special case, reverse the logic and only select RX-rings of type I40E_VSI_MAIN to register xdp_rxq_info's for. Driver hook points for xdp_rxq_info: * reg : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources) * unreg: i40e_free_rx_resources (via i40e_vsi_free_rx_resources) Tested on actual hardware with samples/bpf program. V2: Fixed bug in i40e_set_ringparam (memset zero) + match on I40E_VSI_MAIN. V4: Update patch desc that got out-of-sync with code. Cc: intel-wired-lan@lists.osuosl.org Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-03i40e/i40evf: Account for frags split over multiple descriptors in check ↵Alexander Duyck1-3/+23
linearize The original code for __i40e_chk_linearize didn't take into account the fact that if a fragment is 16K in size or larger it has to be split over 2 descriptors and the smaller of those 2 descriptors will be on the trailing edge of the transmit. As a result we can get into situations where we didn't catch requests that could result in a Tx hang. This patch takes care of that by subtracting the length of all but the trailing edge of the stale fragment before we test for sum. By doing this we can guarantee that we have all cases covered, including the case of a fragment that spans multiple descriptors. We don't need to worry about checking the inner portions of this since 12K is the maximum aligned DMA size and that is larger than any MSS will ever be since the MTU limit for jumbos is something on the order of 9K. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-11-22i40e: Use smp_rmb rather than read_barrier_dependsBrian King1-1/+1
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with i40e as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes. Cc: stable <stable@vger.kernel.org> Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-31i40e/i40evf: Revert "i40e/i40evf: bump tail only in multiples of 8"Alexander Duyck1-9/+0
This reverts commit 11f29003d6376fb123b7c3779dba49bb56fb0815. I am reverting this as I am fairly certain this can result in a memory leak when combined with the current page recycling scheme. Specifically we end up attempting to allocate fewer buffers than we recycled and this results in us rewinding the next to alloc pointer which leads to leaks when we overwrite the rx_buffer_info when processing the next frame. Fixes: 11f29003d637 ("i40e/i40evf: bump tail only in multiples of 8") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+2
Several conflicts here. NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to nfp_fl_output() needed some adjustments because the code block is in an else block now. Parallel additions to net/pkt_cls.h and net/sch_generic.h A bug fix in __tcp_retransmit_skb() conflicted with some of the rbtree changes in net-next. The tc action RCU callback fixes in 'net' had some overlap with some of the recent tcf_block reworking. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26i40e: Add programming descriptors to cleaned_countAlexander Duyck1-0/+1
This patch updates the i40e driver to include programming descriptors in the cleaned_count. Without this change it becomes possible for us to leak memory as we don't trigger a large enough allocation when the time comes to allocate new buffers and we end up overwriting a number of rx_buffers equal to the number of programming descriptors we encountered. Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Anders K. Pedersen <akp@cohaesio.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-26i40e: Fix incorrect use of tx_itr_setting when checking for Rx ITR setupAlexander Duyck1-1/+1
It looks like there was either a copy/paste error or just a typo that resulted in the Tx ITR setting being used to determine if we were using adaptive Rx interrupt moderation or not. This patch fixes the typo. Fixes: 65e87c0398f5 ("i40evf: support queue-specific settings for interrupt moderation") Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-27/+36
There were quite a few overlapping sets of changes here. Daniel's bug fix for off-by-ones in the new BPF branch instructions, along with the added allowances for "data_end > ptr + x" forms collided with the metadata additions. Along with those three changes came veritifer test cases, which in their final form I tried to group together properly. If I had just trimmed GIT's conflict tags as-is, this would have split up the meta tests unnecessarily. In the socketmap code, a set of preemption disabling changes overlapped with the rename of bpf_compute_data_end() to bpf_compute_data_pointers(). Changes were made to the mv88e6060.c driver set addr method which got removed in net-next. The hyperv transport socket layer had a locking change in 'net' which overlapped with a change of socket state macro usage in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10i40e: Fix memory leak related filter programming statusAlexander Duyck1-27/+36
It looks like we weren't correctly placing the pages from buffers that had been used to return a filter programming status back on the ring. As a result they were being overwritten and tracking of the pages was lost. This change works to correct that by incorporating part of i40e_put_rx_buffer into the programming status handler code. As a result we should now be correctly placing the pages for those buffers on the re-allocation list instead of letting them stay in place. Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status") Reported-by: Anders K. Pedersen <akp@cohaesio.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Anders K Pedersen <akp@cohaesio.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10i40e/i40evf: bump tail only in multiples of 8Jacob Keller1-0/+9
Hardware only fetches descriptors on cachelines of 8, essentially ignoring the lower 3 bits of the tail register. Thus, it is pointless to bump tail by an unaligned access as the hardware will ignore some of the new descriptors we allocated. Thus, it's ideal if we can ensure tail writes are always aligned to 8. At first, it seems like we'd already do this, since we allocate descriptors in batches which are a multiple of 8. Since we'd always increment by a multiple of 8, it seems like the value should always be aligned. However, this ignores allocation failures. If we fail to allocate a buffer, our tail register will become unaligned. Once it has become unaligned it will essentially be stuck unaligned until a buffer allocation happens to fail at the exact amount necessary to re-align it. We can do better, by simply rounding down the number of buffers we're about to allocate (cleaned_count) such that "next_to_clean + cleaned_count" is rounded to the nearest multiple of 8. We do this by calculating how far off that value is and subtracting it from the cleaned_count. This essentially defers allocation of buffers if they're going to be ignored by hardware anyways, and re-aligns our next_to_use and tail values after a failure to allocate a descriptor. This calculation ensures that we always align the tail writes in a way the hardware expects and don't unnecessarily allocate buffers which won't be fetched immediately. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10i40e/i40evf: always set the CLEARPBA flag when re-enabling interruptsJacob Keller1-4/+2
In the past we changed driver behavior to not clear the PBA when re-enabling interrupts. This change was motivated by the flawed belief that clearing the PBA would cause a lost interrupt if a receive interrupt occurred while interrupts were disabled. According to empirical testing this isn't the case. Additionally, the data sheet specifically says that we should set the CLEARPBA bit when re-enabling interrupts in a polling setup. This reverts commit 40d72a509862 ("i40e/i40evf: don't lose interrupts") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-06i40e: ignore skb->xmit_more when deciding to set RS bitJacob Keller1-30/+4
Since commit 6a7fded776a7 ("i40e: Fix RS bit update in Tx path and disable force WB workaround") we've tried to "optimize" setting the RS bit based around skb->xmit_more. This same logic was refactored in commit 1dc8b538795f ("i40e: Reorder logic for coalescing RS bits"), but ultimately was not functionally changed. Using skb->xmit_more in this way is incorrect, because in certain circumstances we may see a large number of skbs in sequence with xmit_more set. This leads to a performance loss as the hardware does not writeback anything for those packets, which delays the time it takes for us to respond to the stack transmit requests. This significantly impacts UDP performance, especially when layered with multiple devices, such as bonding, VLANs, and vnet setups. This was not noticed until now because it is difficult to create a setup which reproduces the issue. It was discovered in a UDP_STREAM test in a VM, connected using a vnet device to a bridge, which is connected to a bonded pair of X710 ports in active-backup mode with a VLAN. These layered devices seem to compound the number of skbs transmitted at once by the qdisc. Additionally, the problem can be masked by reducing the ITR value. Since the original commit does not provide strong justification for this RS bit "optimization", revert to the previous behavior of setting the RS bit every 4th packet. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-09-29i40e/i40evf: rename bytes_per_int to bytes_per_usecJacob Keller1-6/+6
This value is not calculating bytes_per_int, which would actually just be bytes/ITR_COUNTDOWN_START, but rather it's calculating bytes/usecs. Rename the variable for clarity so that future developers understand what the value is actually calculating. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-09-26bpf: add meta pointer for direct accessDaniel Borkmann1-0/+1
This work enables generic transfer of metadata from XDP into skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper bpf_xdp_adjust_meta() for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for meta data, followed by the actual packet data. struct xdp_buff is therefore laid out that we first point to data_hard_start, then data_meta directly prepended to data followed by data_end marking the end of packet. bpf_xdp_adjust_head() takes into account whether we have meta data already prepended and if so, memmove()s this along with the given offset provided there's enough room. xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as DoS filter), we can push further meta data along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with skb there, we can also set skb->mark, skb->priority or other skb meta data out of BPF, thus having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into skb (it's also more efficient as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation. The scratch space at the head of the packet can be multiple of 4 byte up to 32 byte large. Drivers not yet supporting xdp->data_meta can simply be set up with xdp->data_meta as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail. The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from it's original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons though. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28i40e/i40evf: avoid dynamic ITR updates when polling or low packet rateJacob Keller1-5/+17
The dynamic ITR algorithm depends on a calculation of usecs which assumes that the interrupts have been firing constantly at the interrupt throttle rate. This is not guaranteed because we could have a low packet rate, or have been polling in software. We'll estimate whether this is the case by using jiffies to determine if we've been too long. If the time difference of jiffies is larger we are guaranteed to have an incorrect calculation. If the time difference of jiffies is smaller we might have been polling some but the difference shouldn't affect the calculation too much. This ensures that we don't get stuck in BULK latency during certain rare situations where we receive bursts of packets that force us into NAPI polling. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-28i40e/i40evf: remove ULTRA latency modeJacob Keller1-17/+0
Since commit c56625d59726 ("i40e/i40evf: change dynamic interrupt thresholds") a new higher latency ITR setting called I40E_ULTRA_LATENCY was added with a cryptic comment about how it was meant for adjusting Rx more aggressively when streaming small packets. This mode was attempting to calculate packets per second and then kick in when we have a huge number of small packets. Unfortunately, the ULTRA setting was kicking in for workloads it wasn't intended for including single-thread UDP_STREAM workloads. This wasn't caught for a variety of reasons. First, the ip_defrag routines were improved somewhat which makes the UDP_STREAM test still reasonable at 10GbE, even when dropped down to 8k interrupts a second. Additionally, some other obvious workloads appear to work fine, such as TCP_STREAM. The number 40k doesn't make sense for a number of reasons. First, we absolutely can do more than 40k packets per second. Second, we calculate the value inline in an integer, which sometimes can overflow resulting in using incorrect values. If we fix this overflow it makes it even more likely that we'll enter ULTRA mode which is the opposite of what we want. The ULTRA mode was added originally as a way to reduce CPU utilization during a small packet workload where we weren't keeping up anyways. It should never have been kicking in during these other workloads. Given the issues outlined above, let's remove the ULTRA latency mode. If necessary, a better solution to the CPU utilization issue for small packet workloads will be added in a future patch. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-28i40e: invert logic for checking incorrect cpu vs irq affinityJacob Keller1-16/+15
In commit 96db776a3682 ("i40e/vf: fix interrupt affinity bug") we added some code to force exit of polling in case we did not have the correct CPU. This is important since it was possible for the IRQ affinity to be changed while the CPU is pegged at 100%. This can result in the polling routine being stuck on the wrong CPU until traffic finally stops. Unfortunately, the implementation, "if the CPU is correct, exit as normal, otherwise, fall-through to the end-polling exit" is incredibly confusing to reason about. In this case, the normal flow looks like the exception, while the exception actually occurs far away from the if statement and comment. We recently discovered and fixed a bug in this code because we were incorrectly initializing the affinity mask. Re-write the code so that the exceptional case is handled at the check, rather than having the logic be spread through the regular exit flow. This does end up with minor code duplication, but the resulting code is much easier to reason about. The new logic is identical, but inverted. If we are running on a CPU not in our affinity mask, we'll exit polling. However, the code flow is much easier to understand. Note that we don't actually have to check for MSI-X, because in the MSI case we'll only have one q_vector, but its default affinity mask should be correct as it includes all CPUs when it's initialized. Further, we could at some point add code to setup the notifier for the non-MSI-X case and enable this workaround for that case too, if desired, though there isn't much gain since its unlikely to be the common case. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-28i40e: move enabling icr0 into i40e_update_enable_itrJacob Keller1-2/+6
If we don't have MSI-X enabled, we handle interrupts on all icr0. This is a special case, so let's move the conditional into i40e_update_enable_itr() in order to make i40e_napi_poll easier to read about. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-08-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+2
The UDP offload conflict is dealt with by simply taking what is in net-next where we have removed all of the UFO handling code entirely. The TCP conflict was a case of local variables in a function being removed from both net and net-next. In netvsc we had an assignment right next to where a missing set of u64 stats sync object inits were added. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02i40e: Initialize 64-bit statistics TX ring seqcountFlorian Fainelli1-0/+2
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a lockdep splat indicating this seqcount is not correctly initialized, fix that. Fixes: 980e9b118642 ("i40e: Add support for 64 bit netstats") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-26i40e/i40evf: remove mismatched type warningsJesse Brandeburg1-3/+3
Compiler reported several places where driver compared signed and unsigned types. Cast or change the types to remove the warnings. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-07-26i40e/i40evf: make IPv6 ATR code clearerJesse Brandeburg1-3/+9
This just reorders some local vars and makes the code flow clearer. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-06-21i40e: add support for XDP_TX actionBjörn Töpel1-5/+113
This patch adds proper XDP_TX action support. For each Tx ring, an additional XDP Tx ring is allocated and setup. This version does the DMA mapping in the fast-path, which will penalize performance for IOMMU enabled systems. Further, debugfs support is not wired up for the XDP Tx rings. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-06-21i40e: add XDP support for pass and drop actionsBjörn Töpel1-31/+99
This commit adds basic XDP support for i40e derived NICs. All XDP actions will end up in XDP_DROP. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+2
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-13i40e: fix handling of HW ATR evictionJacob Keller1-2/+2
A recent commit to refactor the driver and remove the hw_disabled_flags field accidentally introduced two regressions. First, we overwrote pf->flags which removed various key flags including the MSI-X settings. Additionally, it was intended that we have now two flags, HW_ATR_EVICT_CAPABLE and HW_ATR_EVICT_ENABLED, but this was not done, and we accidentally were mis-using HW_ATR_EVICT_CAPABLE everywhere. This patch adds the missing piece, HW_ATR_EVICT_ENABLED, and safely updates pf->flags instead of overwriting it. Without this patch we will have many problems including disabling MSI-X support, and we'll attempt to use HW ATR eviction on devices which do not support it. Fixes: 47994c119a36 ("i40e: remove hw_disabled_flags in favor of using separate flag bits", 2017-04-19) Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+2
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06i40e/i40evf: proper update of the page_offset fieldBjörn Töpel1-1/+2
In f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames") i40e_build_skb updates the page_offset field with an incorrect offset, which can lead to data corruption. This patch updates page_offset correctly, by properly setting truesize. Note that the bug only appears on architectures where PAGE_SIZE is 8192 or larger. Fixes: f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-05-31i40e: check for Tx timestamp timeouts during watchdogJacob Keller1-0/+1
The i40e driver has logic to handle only one Tx timestamp at a time, using a state bit lock to avoid multiple requests at once. It may be possible, if incredibly unlikely, that a Tx timestamp event is requested but never completes. Since we use an interrupt scheme to determine when the Tx timestamp occurred we would never clear the state bit in this case. Add an i40e_ptp_tx_hang() function similar to the already existing i40e_ptp_rx_hang() function. This function runs in the watchdog routine and makes sure we eventually recover from this case instead of permanently disabling Tx timestamps. Note: there is no currently known way to cause this without hacking the driver code to force it. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-05-31i40e: add statistic indicating number of skipped Tx timestampsJacob Keller1-0/+1
The i40e driver can only handle one Tx timestamp request at a time. This means it is possible for an application timestamp request to be ignored. There is no easy way for an administrator to determine if this occurred. Add a new statistic which tracks this, tx_hwtstamp_skipped. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-05-31i40e: avoid permanent lock of *_PTP_TX_IN_PROGRESSJacob Keller1-6/+20
The i40e driver uses a bit lock to indicate when a Tx timestamp is in progress to avoid attempting to timestamp multiple packets at once. This is required because hardware only has registers to handle one request at a time. There is a corner case where we failed to cleanup the bit lock after a failed transmit. This can potentially result in a state bit being locked forever. Add some cleanup code to i40e_xmit_frame_ring to check and make sure we cleanup incase of these failures. We also modify i40e_tx_map to return an error code indication DMA failure. Reported-by: Reported-by: David Mirabito <davidm@metamako.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30i40e: remove hw_disabled_flags in favor of using separate flag bitsJacob Keller1-15/+7
The hw_disabled_flags field was added as a way of signifying that a feature was automatically or temporarily disabled. However, we actually only use this for FDir features. Replace its use with new _AUTO_DISABLED flags instead. This is more readable, because you aren't setting an *_ENABLED flag to *disable* the feature. Additionally, clean up a few areas where we used these bits. First, we don't really need to set the auto-disable flag for ATR if we're fully disabling the feature via ethtool. Second, we should always clear the auto-disable bits in case they somehow got set when the feature was disabled. However, avoid displaying a message that we've re-enabled the feature. Third, we shouldn't be re-enabling ATR in the SB ntuple add flow, because it might have been disabled due to space constraints. Instead, we should just wait for the fdir_check_and_reenable to be called by the watchdog. Overall, this change allows us to simplify some code by removing an extra field we didn't need, and the result should make it more clear as to what we're actually doing with these flags. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30i40e: use DECLARE_BITMAP for state fieldsJacob Keller1-7/+7
Instead of assuming our flags fit within an unsigned long, use DECLARE_BITMAP which will ensure that we always allocate enough space. Additionally, use __I40E_STATE_SIZE__ markers as the last element of the enumeration so that the size of the BITMAP is compile-time assigned rather than programmer-time assigned. This ensures that potential future flag additions do not actually overrun the array. This is especially important as 32bit systems would only have 32bit longs instead of 64bit longs as we generally have assumed in the prior code. This change also removes a dereference of the state fields throughout the code, so it does have a bit of code churn. The conversions were automated using sed replacements with an alternation s/&(vsi->back|vsi|pf)->state/\1->state/ s/&adapter->vsi.state/adapter->vsi.state/ For debugfs, we modify the printing so that we can display chunks of the state value on new lines. This ensures that we can print the entire set of state values. Additionally, we now print them as 08lx to ensure that they display nicely. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30i40e: separate PF and VSI state flagsJacob Keller1-4/+4
Avoid using the same named flags for both vsi->state and pf->state. This makes code review easier, as it is more likely that future authors will use the correct state field when checking bits. Previous commits already found issues with at least one check, and possibly others may be incorrect. This reduces confusion as it is more clear what each flag represents, and which flags are valid for which state field. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-20i40e/i40evf: Add tracepointsScott Peterson1-0/+9
This patch adds tracepoints to the i40e and i40evf drivers to which BPF programs can be attached for feature testing and verification. It's expected that an attached BPF program will identify and count or log some interesting subset of traffic. The bcc-tools package is helpful there for containing all the BPF arcana in a handy Python wrapper. Though you can make these tracepoints log trace messages, the messages themselves probably won't be very useful (other to verify the tracepoint is being called while you're debugging your BPF program). The idea here is that tracepoints have such low performance cost when disabled that we can leave these in the upstream drivers. This may eventually enable the instrumentation of unmodified customer systems should the need arise to verify a NIC feature is working as expected. In general this enables one set of feature verification tools to be used on these drivers whether they're built with the kernel or separately. Users are advised against using these tracepoints for anything other than a diagnostic tool. They have a performance impact when enabled, and their exact placement and form may change as we see how well they work in practice for the purposes above. Change-ID: Id6014a7322c0e6d08068114dd20bd156f2f6435e Signed-off-by: Scott Peterson <scott.d.peterson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>