summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/freescale/enetc
AgeCommit message (Collapse)AuthorFilesLines
2023-09-19net: enetc: use EXPORT_SYMBOL_GPL for enetc_phc_indexChristoph Hellwig1-1/+1
commit 569820befb16ffc755ab7af71f4f08cc5f68f0fe upstream. enetc_phc_index is only used via symbol_get, which was only ever intended for very internal symbols like this one. Use EXPORT_SYMBOL_GPL for it so that symbol_get can enforce only being used on EXPORT_SYMBOL_GPL symbols. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21net: enetc: correct the indexes of highest and 2nd highest TCsWei Fang1-2/+2
[ Upstream commit 21225873be1472b7c59ed3650396af0e40578112 ] For ENETC hardware, the TCs are numbered from 0 to N-1, where N is the number of TCs. Numerically higher TC has higher priority. It's obvious that the highest priority TC index should be N-1 and the 2nd highest priority TC index should be N-2. However, the previous logic uses netdev_get_prio_tc_map() to get the indexes of highest priority and 2nd highest priority TCs, it does not make sense and is incorrect to give a "tc" argument to netdev_get_prio_tc_map(). So the driver may get the wrong indexes of the two highest priotiry TCs which would lead to failed to set the CBS for the two highest priotiry TCs. e.g. $ tc qdisc add dev eno0 parent root handle 100: mqprio num_tc 6 \ map 0 0 1 1 2 3 4 5 queues 1@0 1@1 1@2 1@3 2@4 2@6 hw 1 $ tc qdisc replace dev eno0 parent 100:6 cbs idleslope 100000 \ sendslope -900000 hicredit 12 locredit -113 offload 1 $ Error: Specified device failed to setup cbs hardware offload. ^^^^^ In this example, the previous logic deems the indexes of the two highest priotiry TCs should be 3 and 2. Actually, the indexes are 5 and 4, because the number of TCs is 6. So it would be failed to configure the CBS for the two highest priority TCs. Fixes: c431047c4efe ("enetc: add support Credit Based Shaper(CBS) for hardware offload") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17net: enetc: check the index of the SFI rather than the handleWei Fang1-1/+1
[ Upstream commit 299efdc2380aac588557f4d0b2ce7bee05bd0cf2 ] We should check whether the current SFI (Stream Filter Instance) table is full before creating a new SFI entry. However, the previous logic checks the handle by mistake and might lead to unpredictable behavior. Fixes: 888ae5a3952b ("net: enetc: add tc flower psfp offload driver") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-14net: add a helper to avoid issues with HW TX timestamping and SO_TXTIMEVladimir Oltean1-6/+2
[ Upstream commit 847cbfc014adafeac401e19e349b0fd524f201c3 ] As explained in commit 29d98f54a4fe ("net: enetc: allow hardware timestamping on TX queues with tc-etf enabled"), hardware TX timestamping requires an skb with skb->tstamp = 0. When a packet is sent with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp, so the drivers need to explicitly reset skb->tstamp to zero after consuming the TX time. Create a helper named skb_txtime_consumed() which does just that. All drivers which offload TC_SETUP_QDISC_ETF should implement it, and it would make it easier to assess during review whether they do the right thing in order to be compatible with hardware timestamping or not. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02net: enetc: preserve TX ring priority across reconfigurationVladimir Oltean3-11/+19
[ Upstream commit 290b5fe096e7dd0aad730d1af4f7f2d9fea43e11 ] In the blamed commit, a rudimentary reallocation procedure for RX buffer descriptors was implemented, for the situation when their format changes between normal (no PTP) and extended (PTP). enetc_hwtstamp_set() calls enetc_close() and enetc_open() in a sequence, and this sequence loses information which was previously configured in the TX BDR Mode Register, specifically via the enetc_set_bdr_prio() call. The TX ring priority is configured by tc-mqprio and tc-taprio, and affects important things for TSN such as the TX time of packets. The issue manifests itself most visibly by the fact that isochron --txtime reports premature packet transmissions when PTP is first enabled on an enetc interface. Save the TX ring priority in a new field in struct enetc_bdr (occupies a 2 byte hole on arm64) in order to make this survive a ring reconfiguration. Fixes: 434cebabd3a2 ("enetc: Add dynamic allocation of extended Rx BD rings") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/r/20221122130936.1704151-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02net: enetc: cache accesses to &priv->si->hwVladimir Oltean3-48/+49
[ Upstream commit 715bf2610f1d1adf3d4f9b7b3dd729984ec4270a ] The &priv->si->hw construct dereferences 2 pointers and makes lines longer than they need to be, in turn making the code harder to read. Replace &priv->si->hw accesses with a "hw" variable when there are 2 or more accesses within a function that dereference this. This includes loops, since &priv->si->hw is a loop invariant. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 290b5fe096e7 ("net: enetc: preserve TX ring priority across reconfiguration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02net: enetc: manage ENETC_F_QBV in priv->active_offloads only when enabledVladimir Oltean2-4/+8
[ Upstream commit 32bf8e1f6fb9f6dc334b2b98dffc2e5dcd51e513 ] Future work in this driver would like to look at priv->active_offloads & ENETC_F_QBV to determine whether a tc-taprio qdisc offload was installed, but this does not produce the intended effect. All the other flags in priv->active_offloads are managed dynamically, except ENETC_F_QBV which is set statically based on the probed SI capability. This change makes priv->active_offloads & ENETC_F_QBV really track the presence of a tc-taprio schedule on the port. Some existing users, like the enetc_sched_speed_set() call from phylink_mac_link_up(), are best kept using the old logic: the tc-taprio offload does not re-trigger another link mode resolve, so the scheduler needs to be functional from the get go, as long as Qbv is supported at all on the port. So to preserve functionality there, look at the static station interface capability from pf->si->hw_features instead. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 290b5fe096e7 ("net: enetc: preserve TX ring priority across reconfiguration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-03net: enetc: survive memory pressure without crashingVladimir Oltean1-0/+5
[ Upstream commit 84ce1ca3fe9e1249bf21176ff162200f1c4e5ed1 ] Under memory pressure, enetc_refill_rx_ring() may fail, and when called during the enetc_open() -> enetc_setup_rxbdr() procedure, this is not checked for. An extreme case of memory pressure will result in exactly zero buffers being allocated for the RX ring, and in such a case it is expected that hardware drops all RX packets due to lack of buffers. This does not happen, because the reset-default value of the consumer and produces index is 0, and this makes the ENETC think that all buffers have been initialized and that it owns them (when in reality none were). The hardware guide explains this best: | Configure the receive ring producer index register RBaPIR with a value | of 0. The producer index is initially configured by software but owned | by hardware after the ring has been enabled. Hardware increments the | index when a frame is received which may consume one or more BDs. | Hardware is not allowed to increment the producer index to match the | consumer index since it is used to indicate an empty condition. The ring | can hold at most RBLENR[LENGTH]-1 received BDs. | | Configure the receive ring consumer index register RBaCIR. The | consumer index is owned by software and updated during operation of the | of the BD ring by software, to indicate that any receive data occupied | in the BD has been processed and it has been prepared for new data. | - If consumer index and producer index are initialized to the same | value, it indicates that all BDs in the ring have been prepared and | hardware owns all of the entries. | - If consumer index is initialized to producer index plus N, it would | indicate N BDs have been prepared. Note that hardware cannot start if | only a single buffer is prepared due to the restrictions described in | (2). | - Software may write consumer index to match producer index anytime | while the ring is operational to indicate all received BDs prior have | been processed and new BDs prepared for hardware. Normally, the value of rx_ring->rcir (consumer index) is brought in sync with the rx_ring->next_to_use software index, but this only happens if page allocation ever succeeded. When PI==CI==0, the hardware appears to receive frames and write them to DMA address 0x0 (?!), then set the READY bit in the BD. The enetc_clean_rx_ring() function (and its XDP derivative) is naturally not prepared to handle such a condition. It will attempt to process those frames using the rx_swbd structure associated with index i of the RX ring, but that structure is not fully initialized (enetc_new_page() does all of that). So what happens next is undefined behavior. To operate using no buffer, we must initialize the CI to PI + 1, which will block the hardware from advancing the CI any further, and drop everything. The issue was seen while adding support for zero-copy AF_XDP sockets, where buffer memory comes from user space, which can even decide to supply no buffers at all (example: "xdpsock --txonly"). However, the bug is present also with the network stack code, even though it would take a very determined person to trigger a page allocation failure at the perfect time (a series of ifup/ifdown under memory pressure should eventually reproduce it given enough retries). Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20221027182925.3256653-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28net: enetc: move enetc_set_psfp() out of the common enetc_set_features()Vladimir Oltean5-35/+44
[ Upstream commit fed38e64d9b99d65a36c0dbadc3d3f8ddd9ea030 ] The VF netdev driver shouldn't respond to changes in the NETIF_F_HW_TC flag; only PFs should. Moreover, TSN-specific code should go to enetc_qos.c, which should not be included in the VF driver. Fixes: 79e499829f3f ("net: enetc: add hw tc hw offload features for PSPF capability") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220916133209.3351399-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08net: enetc: report software timestamping via SO_TIMESTAMPINGVladimir Oltean1-1/+4
[ Upstream commit feb13dcb1818b775fbd9191f797be67cd605f03e ] Let user space properly determine that the enetc driver provides software timestamps. Fixes: 4caefbce06d1 ("enetc: add software timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20220324161210.4122281-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18net: enetc: unmap DMA in enetc_send_cmd()Tim Gardner1-7/+11
[ Upstream commit cd4bc63de774eee95e9bac26a565cd80e0fca421 ] Coverity complains of a possible dereference of a null return value. 5. returned_null: kzalloc returns NULL. [show details] 6. var_assigned: Assigning: si_data = NULL return value from kzalloc. 488 si_data = kzalloc(data_size, __GFP_DMA | GFP_KERNEL); 489 cbd.length = cpu_to_le16(data_size); 490 491 dma = dma_map_single(&priv->si->pdev->dev, si_data, 492 data_size, DMA_FROM_DEVICE); While this kzalloc() is unlikely to fail, I did notice that the function returned without unmapping si_data. Fix this by refactoring the error paths and checking for kzalloc() failure. Fixes: 888ae5a3952ba ("net: enetc: add tc flower psfp offload driver") Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org (open list) Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27net: enetc: fix ethtool counter name for PM0_TERRVladimir Oltean1-1/+1
[ Upstream commit fb8dc5fc8cbdfd62ecd16493848aee2f42ed84d9 ] There are two counters named "MAC tx frames", one of them is actually incorrect. The correct name for that counter should be "MAC tx error frames", which is symmetric to the existing "MAC rx error frames". Fixes: 16eb4c85c964 ("enetc: Add ethtool statistics") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: <Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20211020165206.1069889-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-06net: enetc: fix the incorrect clearing of IF_MODE bitsVladimir Oltean1-2/+1
[ Upstream commit 325fd36ae76a6d089983b2d2eccb41237d35b221 ] The enetc phylink .mac_config handler intends to clear the IFMODE field (bits 1:0) of the PM0_IF_MODE register, but incorrectly clears all the other fields instead. For normal operation, the bug was inconsequential, due to the fact that we write the PM0_IF_MODE register in two stages, first in phylink .mac_config (which incorrectly cleared out a bunch of stuff), then we update the speed and duplex to the correct values in phylink .mac_link_up. Judging by the code (not tested), it looks like maybe loopback mode was broken, since this is one of the settings in PM0_IF_MODE which is incorrectly cleared. Fixes: c76a97218dcb ("net: enetc: force the RGMII speed and duplex instead of operating in inband mode") Reported-by: Pavel Machek (CIP) <pavel@denx.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-30enetc: Fix uninitialized struct dim_sample field usageClaudiu Manoil1-1/+1
[ Upstream commit 9f7afa05c9522b086327929ae622facab0f0f72b ] The only struct dim_sample member that does not get initialized by dim_update_sample() is comp_ctr. (There is special API to initialize comp_ctr: dim_update_sample_with_comps(), and it is currently used only for RDMA.) comp_ctr is used to compute curr_stats->cmps and curr_stats->cpe_ratio (see dim_calc_stats()) which in turn are consumed by the rdma_dim_*() API. Therefore, functionally, the net_dim*() API consumers are not affected. Nevertheless, fix the computation of statistics based on an uninitialized variable, even if the mentioned statistics are not used at the moment. Fixes: ae0e6a5d1627 ("enetc: Add adaptive interrupt coalescing") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-30enetc: Fix illegal access when reading affinity_hintClaudiu Manoil1-4/+1
[ Upstream commit 7237a494decfa17d0b9d0076e6cee3235719de90 ] irq_set_affinity_hit() stores a reference to the cpumask_t parameter in the irq descriptor, and that reference can be accessed later from irq_affinity_hint_proc_show(). Since the cpu_mask parameter passed to irq_set_affinity_hit() has only temporary storage (it's on the stack memory), later accesses to it are illegal. Thus reads from the corresponding procfs affinity_hint file can result in paging request oops. The issue is fixed by the get_cpu_mask() helper, which provides a permanent storage for the cpumask_t parameter. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30net: enetc: set MAC RX FIFO to recommended valueAlex Marginean2-0/+8
[ Upstream commit 1b2395dfff5bb40228a187f21f577cd90673d344 ] On LS1028A, the MAC RX FIFO defaults to the value 2, which is too high and may lead to RX lock-up under traffic at a rate higher than 6 Gbps. Set it to 1 instead, as recommended by the hardware design team and by later versions of the ENETC block guide. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Jason Liu <jason.hui.liu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-17net: enetc: initialize RFS/RSS memories for unused ports tooVladimir Oltean3-9/+36
[ Upstream commit 3222b5b613db558e9a494bbf53f3c984d90f71ea ] Michael reports that since linux-next-20210211, the AER messages for ECC errors have started reappearing, and this time they can be reliably reproduced with the first ping on one of his LS1028A boards. $ ping 1[ 33.258069] pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0 72.16.0.1 PING [ 33.267050] pcieport 0000:00:1f.0: AER: can't find device of ID0000 172.16.0.1 (172.16.0.1): 56 data bytes 64 bytes from 172.16.0.1: seq=0 ttl=64 time=17.124 ms 64 bytes from 172.16.0.1: seq=1 ttl=64 time=0.273 ms $ devmem 0x1f8010e10 32 0xC0000006 It isn't clear why this is necessary, but it seems that for the errors to go away, we must clear the entire RFS and RSS memory, not just for the ports in use. Sadly the code is structured in such a way that we can't have unified logic for the used and unused ports. For the minimal initialization of an unused port, we need just to enable and ioremap the PF memory space, and a control buffer descriptor ring. Unused ports must then free the CBDR because the driver will exit, but used ports can not pick up from where that code path left, since the CBDR API does not reinitialize a ring when setting it up, so its producer and consumer indices are out of sync between the software and hardware state. So a separate enetc_init_unused_port function was created, and it gets called right after the PF memory space is enabled. Fixes: 07bf34a50e32 ("net: enetc: initialize the RFS and RSS memories") Reported-by: Michael Walle <michael@walle.cc> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-17enetc: Fix unused var build warning for CONFIG_OFArnd Bergmann1-11/+10
[ Upstream commit 4560b2a3ecdd5d587c4c6eea4339899f173a559a ] When CONFIG_OF is disabled, there is a harmless warning about an unused variable: enetc_pf.c: In function 'enetc_phylink_create': enetc_pf.c:981:17: error: unused variable 'dev' [-Werror=unused-variable] Slightly rearrange the code to pass around the of_node as a function argument, which avoids the problem without hurting readability. Fixes: 71b77a7a27a3 ("enetc: Migrate to PHYLINK and PCS_LYNX") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201204120800.17193-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-17net: enetc: allow hardware timestamping on TX queues with tc-etf enabledVladimir Oltean1-0/+6
commit 29d98f54a4fe1b6a9089bec8715a1b89ff9ad59c upstream. The txtime is passed to the driver in skb->skb_mstamp_ns, which is actually in a union with skb->tstamp (the place where software timestamps are kept). Since commit b50a5c70ffa4 ("net: allow simultaneous SW and HW transmit timestamping"), __sock_recv_timestamp has some logic for making sure that the two calls to skb_tstamp_tx: skb_tx_timestamp(skb) # Software timestamp in the driver -> skb_tstamp_tx(skb, NULL) and skb_tstamp_tx(skb, &shhwtstamps) # Hardware timestamp in the driver will both do the right thing and in a race-free manner, meaning that skb_tx_timestamp will deliver a cmsg with the software timestamp only, and skb_tstamp_tx with a non-NULL hwtstamps argument will deliver a cmsg with the hardware timestamp only. Why are races even possible? Well, because although the software timestamp skb->tstamp is private per skb, the hardware timestamp skb_hwtstamps(skb) lives in skb_shinfo(skb), an area which is shared between skbs and their clones. And skb_tstamp_tx works by cloning the packets when timestamping them, therefore attempting to perform hardware timestamping on an skb's clone will also change the hardware timestamp of the original skb. And the original skb might have been yet again cloned for software timestamping, at an earlier stage. So the logic in __sock_recv_timestamp can't be as simple as saying "does this skb have a hardware timestamp? if yes I'll send the hardware timestamp to the socket, otherwise I'll send the software timestamp", precisely because the hardware timestamp is shared. Instead, it's quite the other way around: __sock_recv_timestamp says "does this skb have a software timestamp? if yes, I'll send the software timestamp, otherwise the hardware one". This works because the software timestamp is not shared with clones. But that means we have a problem when we attempt hardware timestamping with skbs that don't have the skb->tstamp == 0. __sock_recv_timestamp will say "oh, yeah, this must be some sort of odd clone" and will not deliver the hardware timestamp to the socket. And this is exactly what is happening when we have txtime enabled on the socket: as mentioned, that is put in a union with skb->tstamp, so it is quite easy to mistake it. Do what other drivers do (intel igb/igc) and write zero to skb->tstamp before taking the hardware timestamp. It's of no use to us now (we're already on the TX confirmation path). Fixes: 0d08c9ec7d6e ("enetc: add support time specific departure base on the qos etf") Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: keep RX ring consumer index in sync with hardwareVladimir Oltean1-0/+2
commit 3a5d12c9be6f30080600c8bacaf310194e37d029 upstream. The RX rings have a producer index owned by hardware, where newly received frame buffers are placed, and a consumer index owned by software, where newly allocated buffers are placed, in expectation of hardware being able to place frame data in them. Hardware increments the producer index when a frame is received, however it is not allowed to increment the producer index to match the consumer index (RBCIR) since the ring can hold at most RBLENR[LENGTH]-1 received BDs. Whenever the producer index matches the value of the consumer index, the ring has no unprocessed received frames and all BDs in the ring have been initialized/prepared by software, i.e. hardware owns all BDs in the ring. The code uses the next_to_clean variable to keep track of the producer index, and the next_to_use variable to keep track of the consumer index. The RX rings are seeded from enetc_refill_rx_ring, which is called from two places: 1. initially the ring is seeded until full with enetc_bd_unused(rx_ring), i.e. with 511 buffers. This will make next_to_clean=0 and next_to_use=511: .ndo_open -> enetc_open -> enetc_setup_bdrs -> enetc_setup_rxbdr -> enetc_refill_rx_ring 2. then during the data path processing, it is refilled with 16 buffers at a time: enetc_msix -> napi_schedule -> enetc_poll -> enetc_clean_rx_ring -> enetc_refill_rx_ring There is just one problem: the initial seeding done during .ndo_open updates just the producer index (ENETC_RBPIR) with 0, and the software next_to_clean and next_to_use variables. Notably, it will not update the consumer index to make the hardware aware of the newly added buffers. Wait, what? So how does it work? Well, the reset values of the producer index and of the consumer index of a ring are both zero. As per the description in the second paragraph, it means that the ring is full of buffers waiting for hardware to put frames in them, which by coincidence is almost true, because we have in fact seeded 511 buffers into the ring. But will the hardware attempt to access the 512th entry of the ring, which has an invalid BD in it? Well, no, because in order to do that, it would have to first populate the first 511 entries, and the NAPI enetc_poll will kick in by then. Eventually, after 16 processed slots have become available in the RX ring, enetc_clean_rx_ring will call enetc_refill_rx_ring and then will [ finally ] update the consumer index with the new software next_to_use variable. From now on, the next_to_clean and next_to_use variables are in sync with the producer and consumer ring indices. So the day is saved, right? Well, not quite. Freeing the memory allocated for the rings is done in: enetc_close -> enetc_clear_bdrs -> enetc_clear_rxbdr -> this just disables the ring -> enetc_free_rxtx_rings -> enetc_free_rx_ring -> sets next_to_clean and next_to_use to 0 but again, nothing is committed to the hardware producer and consumer indices (yay!). The assumption is that the ring is disabled, so the indices don't matter anyway, and it's the responsibility of the "open" code path to set those up. .. Except that the "open" code path does not set those up properly. While initially, things almost work, during subsequent enetc_close -> enetc_open sequences, we have problems. To be precise, the enetc_open that is subsequent to enetc_close will again refill the ring with 511 entries, but it will leave the consumer index untouched. Untouched means, of course, equal to the value it had before disabling the ring and draining the old buffers in enetc_close. But as mentioned, enetc_setup_rxbdr will at least update the producer index though, through this line of code: enetc_rxbdr_wr(hw, idx, ENETC_RBPIR, 0); so at this stage we'll have: next_to_clean=0 (in hardware 0) next_to_use=511 (in hardware we'll have the refill index prior to enetc_close) Again, the next_to_clean and producer index are in sync and set to correct values, so the driver manages to limp on. Eventually, 16 ring entries will be consumed by enetc_poll, and the savior enetc_clean_rx_ring will come and call enetc_refill_rx_ring, and then update the hardware consumer ring based upon the new next_to_use. So.. it works? Well, by coincidence, it almost does, but there's a circumstance where enetc_clean_rx_ring won't be there to save us. If the previous value of the consumer index was 15, there's a problem, because the NAPI poll sequence will only issue a refill when 16 or more buffers have been consumed. It's easiest to illustrate this with an example: ip link set eno0 up ip addr add 192.168.100.1/24 dev eno0 ping 192.168.100.1 -c 20 # ping this port from another board ip link set eno0 down ip link set eno0 up ping 192.168.100.1 -c 20 # ping it again from the same other board One by one: 1. ip link set eno0 up -> calls enetc_setup_rxbdr: -> calls enetc_refill_rx_ring(511 buffers) -> next_to_clean=0 (in hw 0) -> next_to_use=511 (in hw 0) 2. ping 192.168.100.1 -c 20 # ping this port from another board enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 0) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=15 next_to_clean 14 (in hw 15) next_to_use 511 (in hw 0) enetc_clean_rx_ring: enetc_refill_rx_ring(16) increments next_to_use by 16 (mod 512) and writes it to hw enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=0 next_to_clean 15 (in hw 16) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 16 (in hw 17) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 17 (in hw 18) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 18 (in hw 19) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 19 (in hw 20) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 20 (in hw 21) next_to_use 15 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 21 (in hw 22) next_to_use 15 (in hw 15) 20 packets transmitted, 20 packets received, 0% packet loss 3. ip link set eno0 down enetc_free_rx_ring: next_to_clean 0 (in hw 22), next_to_use 0 (in hw 15) 4. ip link set eno0 up -> calls enetc_setup_rxbdr: -> calls enetc_refill_rx_ring(511 buffers) -> next_to_clean=0 (in hw 0) -> next_to_use=511 (in hw 15) 5. ping 192.168.100.1 -c 20 # ping it again from the same other board enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=1 next_to_clean 0 (in hw 1) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=2 next_to_clean 1 (in hw 2) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=3 next_to_clean 2 (in hw 3) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=4 next_to_clean 3 (in hw 4) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=5 next_to_clean 4 (in hw 5) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=6 next_to_clean 5 (in hw 6) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=7 next_to_clean 6 (in hw 7) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=8 next_to_clean 7 (in hw 8) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=9 next_to_clean 8 (in hw 9) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=10 next_to_clean 9 (in hw 10) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=11 next_to_clean 10 (in hw 11) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=12 next_to_clean 11 (in hw 12) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=13 next_to_clean 12 (in hw 13) next_to_use 511 (in hw 15) enetc_clean_rx_ring: rx_frm_cnt=1 cleaned_cnt=14 next_to_clean 13 (in hw 14) next_to_use 511 (in hw 15) 20 packets transmitted, 12 packets received, 40% packet loss And there it dies. No enetc_refill_rx_ring (because cleaned_cnt must be equal to 15 for that to happen), no nothing. The hardware enters the condition where the producer (14) + 1 is equal to the consumer (15) index, which makes it believe it has no more free buffers to put packets in, so it starts discarding them: ip netns exec ns0 ethtool -S eno0 | grep -v ': 0' NIC statistics: Rx ring 0 discarded frames: 8 Summarized, if the interface receives between 16 and 32 (mod 512) frames and then there is a link flap, then the port will eventually die with no way to recover. If it receives less than 16 (mod 512) frames, then the initial NAPI poll [ before the link flap ] will not update the consumer index in hardware (it will remain zero) which will be ok when the buffers are later reinitialized. If more than 32 (mod 512) frames are received, the initial NAPI poll has the chance to refill the ring twice, updating the consumer index to at least 32. So after the link flap, the consumer index is still wrong, but the post-flap NAPI poll gets a chance to refill the ring once (because it passes through cleaned_cnt=15) and makes the consumer index be again back in sync with next_to_use. The solution to this problem is actually simple, we just need to write next_to_use into the hardware consumer index at enetc_open time, which always brings it back in sync after an initial buffer seeding process. The simpler thing would be to put the write to the consumer index into enetc_refill_rx_ring directly, but there are issues with the MDIO locking: in the NAPI poll code we have the enetc_lock_mdio() taken from top-level and we use the unlocked enetc_wr_reg_hot, whereas in enetc_open, the enetc_lock_mdio() is not taken at the top level, but instead by each individual enetc_wr_reg, so we are forced to put an additional enetc_wr_reg in enetc_setup_rxbdr. Better organization of the code is left as a refactoring exercise. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdrVladimir Oltean1-1/+0
commit 96a5223b918c8b79270fc0fec235a7ebad459098 upstream. The Station Interface Receive Interrupt Detect Register (SIRXIDR) contains a 16-bit wide mask of 'interrupt detected' events for each ring associated with a port. Bit i is write-1-to-clean for RX ring i. I have no explanation whatsoever how this line of code came to be inserted in the blamed commit. I checked the downstream versions of that patch and none of them have it. The somewhat comical aspect of it is that we're writing a binary number to the SIRXIDR register, which is derived from enetc_bd_unused(rx_ring). Since the RX rings have 512 buffer descriptors, we end up writing 511 to this register, which is 0x1ff, so we are effectively clearing the 'interrupt detected' event for rings 0-8. This register is not what is used for interrupt handling though - it only provides a summary for the entire SI. The hardware provides one separate Interrupt Detect Register per RX ring, which auto-clears upon read. So there doesn't seem to be any adverse effect caused by this bogus write. There is, however, one reason why this should be handled as a bugfix: next_to_clean _should_ be committed to hardware, just not to that register, and this was obscuring the fact that it wasn't. This is fixed in the next patch, and removing the bogus line now allows the fix patch to be backported beyond that point. Fixes: fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: force the RGMII speed and duplex instead of operating in inband modeVladimir Oltean2-10/+56
commit c76a97218dcbb2cb7cec1404ace43ef96c87d874 upstream. The ENETC port 0 MAC supports in-band status signaling coming from a PHY when operating in RGMII mode, and this feature is enabled by default. It has been reported that RGMII is broken in fixed-link, and that is not surprising considering the fact that no PHY is attached to the MAC in that case, but a switch. This brings us to the topic of the patch: the enetc driver should have not enabled the optional in-band status signaling for RGMII unconditionally, but should have forced the speed and duplex to what was resolved by phylink. Note that phylink does not accept the RGMII modes as valid for in-band signaling, and these operate a bit differently than 1000base-x and SGMII (notably there is no clause 37 state machine so no ACK required from the MAC, instead the PHY sends extra code words on RXD[3:0] whenever it is not transmitting something else, so it should be safe to leave a PHY with this option unconditionally enabled even if we ignore it). The spec talks about this here: https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/138/RGMIIv1_5F00_3.pdf Fixes: 71b77a7a27a3 ("enetc: Migrate to PHYLINK and PCS_LYNX") Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: don't disable VLAN filtering in IFF_PROMISC modeVladimir Oltean1-5/+0
commit a74dbce9d4541888fe0d39afe69a3a95004669b4 upstream. Quoting from the blamed commit: In promiscuous mode, it is more intuitive that all traffic is received, including VLAN tagged traffic. It appears that it is necessary to set the flag in PSIPVMR for that to be the case, so VLAN promiscuous mode is also temporarily enabled. On exit from promiscuous mode, the setting made by ethtool is restored. Intuitive or not, there isn't any definition issued by a standards body which says that promiscuity has anything to do with VLAN filtering - it only has to do with accepting packets regardless of destination MAC address. In fact people are already trying to use this misunderstanding/bug of the enetc driver as a justification to transform promiscuity into something it never was about: accepting every packet (maybe that would be the "rx-all" netdev feature?): https://lore.kernel.org/netdev/20201110153958.ci5ekor3o2ekg3ky@ipetronik.com/ This is relevant because there are use cases in the kernel (such as tc-flower rules with the protocol 802.1Q and a vlan_id key) which do not (yet) use the vlan_vid_add API to be compatible with VLAN-filtering NICs such as enetc, so for those, disabling rx-vlan-filter is currently the only right solution to make these setups work: https://lore.kernel.org/netdev/CA+h21hoxwRdhq4y+w8Kwgm74d4cA0xLeiHTrmT-VpSaM7obhkg@mail.gmail.com/ The blamed patch has unintentionally introduced one more way for this to work, which is to enable IFF_PROMISC, however this is non-portable because port promiscuity is not meant to disable VLAN filtering. Therefore, it could invite people to write broken scripts for enetc, and then wonder why they are broken when migrating to other drivers that don't handle promiscuity in the same way. Fixes: 7070eea5e95a ("enetc: permit configuration of rx-vlan-filter with ethtool") Cc: Markus Blöchl <Markus.Bloechl@ipetronik.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: fix incorrect TPID when receiving 802.1ad tagged packetsVladimir Oltean2-8/+29
commit 827b6fd046516af605e190c872949f22208b5d41 upstream. When the enetc ports have rx-vlan-offload enabled, they report a TPID of ETH_P_8021Q regardless of what was actually in the packet. When rx-vlan-offload is disabled, packets have the proper TPID. Fix this inconsistency by finishing the TODO left in the code. Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: take the MDIO lock only once per NAPI poll cycleVladimir Oltean2-22/+11
commit 6d36ecdbc4410e61a0e02adc5d3abeee22a8ffd3 upstream. The workaround for the ENETC MDIO erratum caused a performance degradation of 82 Kpps (seen with IP forwarding of two 1Gbps streams of 64B packets). This is due to excessive locking and unlocking in the fast path, which can be avoided. By taking the MDIO read-side lock only once per NAPI poll cycle, we are able to regain 54 Kpps (65%) of the performance hit. The rest of the performance degradation comes from the TX data path, but unfortunately it doesn't look like we can optimize that away easily, even with netdev_xmit_more(), there just isn't any skb batching done, to help with taking the MDIO lock less often than once per packet. We need to change the register accessor type for enetc_get_tx_tstamp, because it now runs under the enetc_lock_mdio as per the new call path detailed below: enetc_msix -> napi_schedule -> enetc_poll -> enetc_lock_mdio -> enetc_clean_tx_ring -> enetc_get_tx_tstamp -> enetc_clean_rx_ring -> enetc_unlock_mdio Fixes: fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17net: enetc: don't overwrite the RSS indirection table when initializingVladimir Oltean4-8/+18
commit c646d10dda2dcde82c6ce5a474522621ab2b8b19 upstream. After the blamed patch, all RX traffic gets hashed to CPU 0 because the hashing indirection table set up in: enetc_pf_probe -> enetc_alloc_si_resources -> enetc_configure_si -> enetc_setup_default_rss_table is overwritten later in: enetc_pf_probe -> enetc_init_port_rss_memory which zero-initializes the entire port RSS table in order to avoid ECC errors. The trouble really is that enetc_init_port_rss_memory really neads enetc_alloc_si_resources to be called, because it depends upon enetc_alloc_cbdr and enetc_setup_cbdr. But that whole enetc_configure_si thing could have been better thought out, it has nothing to do in a function called "alloc_si_resources", especially since its counterpart, "free_si_resources", does nothing to unwind the configuration of the SI. The point is, we need to pull out enetc_configure_si out of enetc_alloc_resources, and move it after enetc_init_port_rss_memory. This allows us to set up the default RSS indirection table after initializing the memory. Fixes: 07bf34a50e32 ("net: enetc: initialize the RFS and RSS memories") Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04net: enetc: fix destroyed phylink dereference during unbindVladimir Oltean1-2/+3
[ Upstream commit 3af409ca278d4a8d50e91f9f7c4c33b175645cf3 ] The following call path suggests that calling unregister_netdev on an interface that is up will first bring it down. enetc_pf_remove -> unregister_netdev -> unregister_netdevice_queue -> unregister_netdevice_many -> dev_close_many -> __dev_close_many -> enetc_close -> enetc_stop -> phylink_stop However, enetc first destroys the phylink instance, then calls unregister_netdev. This is already dissimilar to the setup (and error path teardown path) from enetc_pf_probe, but more than that, it is buggy because it is invalid to call phylink_stop after phylink_destroy. So let's first unregister the netdev (and let the .ndo_stop events consume themselves), then destroy the phylink instance, then free the netdev. Fixes: 71b77a7a27a3 ("enetc: Migrate to PHYLINK and PCS_LYNX") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-17net: enetc: initialize the RFS and RSS memoriesVladimir Oltean2-0/+61
[ Upstream commit 07bf34a50e327975b21a9dee64d220c3dcb72ee9 ] Michael tried to enable Advanced Error Reporting through the ENETC's Root Complex Event Collector, and the system started spitting out single bit correctable ECC errors coming from the ENETC interfaces: pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0 fsl_enetc 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID) fsl_enetc 0000:00:00.0: device [1957:e100] error status/mask=00004000/00000000 fsl_enetc 0000:00:00.0: [14] CorrIntErr fsl_enetc 0000:00:00.1: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID) fsl_enetc 0000:00:00.1: device [1957:e100] error status/mask=00004000/00000000 fsl_enetc 0000:00:00.1: [14] CorrIntErr Further investigating the port correctable memory error detect register (PCMEDR) shows that these AER errors have an associated SOURCE_ID of 6 (RFS/RSS): $ devmem 0x1f8010e10 32 0xC0000006 $ devmem 0x1f8050e10 32 0xC0000006 Discussion with the hardware design engineers reveals that on LS1028A, the hardware does not do initialization of that RFS/RSS memory, and that software should clear/initialize the entire table before starting to operate. That comes as a bit of a surprise, since the driver does not do initialization of the RFS memory. Also, the initialization of the Receive Side Scaling is done only partially. Even though the entire ENETC IP has a single shared flow steering memory, the flow steering service should returns matches only for TCAM entries that are within the range of the Station Interface that is doing the search. Therefore, it should be sufficient for a Station Interface to initialize all of its own entries in order to avoid any ECC errors, and only the Station Interfaces in use should need initialization. There are Physical Station Interfaces associated with PCIe PFs and Virtual Station Interfaces associated with PCIe VFs. We let the PF driver initialize the entire port's memory, which includes the RFS entries which are going to be used by the VF. Reported-by: Michael Walle <michael@walle.cc> Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Link: https://lore.kernel.org/r/20210204134511.2640309-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-08enetc: Fix reporting of h/w packet countersClaudiu Manoil2-6/+14
Noticed some inconsistencies in packet statistics reporting. This patch adds the missing Tx packet counter registers to ethtool reporting and fixes the information strings for a few of them. Fixes: 16eb4c85c964 ("enetc: Add ethtool statistics") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201204171505.21389-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25enetc: Let the hardware auto-advance the taprio base-time of 0Vladimir Oltean1-12/+2
The tc-taprio base time indicates the beginning of the tc-taprio schedule, which is cyclic by definition (where the length of the cycle in nanoseconds is called the cycle time). The base time is a 64-bit PTP time in the TAI domain. Logically, the base-time should be a future time. But that imposes some restrictions to user space, which has to retrieve the current PTP time from the NIC first, then calculate a base time that will still be larger than the base time by the time the kernel driver programs this value into the hardware. Actually ensuring that the programmed base time is in the future is still a problem even if the kernel alone deals with this. Luckily, the enetc hardware already advances a base-time that is in the past into a congruent time in the immediate future, according to the same formula that can be found in the software implementation of taprio (in taprio_get_start_time): /* Schedule the start time for the beginning of the next * cycle. */ n = div64_s64(ktime_sub_ns(now, base), cycle); *start = ktime_add_ns(base, (n + 1) * cycle); There's only one problem: the driver doesn't let the hardware do that. It interferes with the base-time passed from user space, by special-casing the situation when the base-time is zero, and replaces that with the current PTP time. This changes the intended effective base-time of the schedule, which will in the end have a different phase offset than if the base-time of 0.000000000 was to be advanced by an integer multiple of the cycle-time. Fixes: 34c6adf1977b ("enetc: Configure the Time-Aware Scheduler via tc-taprio offload") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201124220259.3027991-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17enetc: Workaround for MDIO register access issueAlex Marginean4-25/+161
Due to a hardware issue, an access to MDIO registers that is concurrent with other ENETC register accesses may lead to the MDIO access being dropped or corrupted. The workaround introduces locking for all register accesses to the ENETC register space. To reduce performance impact, a readers-writers locking scheme has been implemented. The writer in this case is the MDIO access code (irrelevant whether that MDIO access is a register read or write), and the reader is any access code to non-MDIO ENETC registers. Also, the datapath functions acquire the read lock fewer times and use _hot accessors. All the rest of the code uses the _wa accessors which lock every register access. The commit introducing MDIO support is - commit ebfcb23d62ab ("enetc: Add ENETC PF level external MDIO support") but due to subsequent refactoring this patch is applicable on top of a later commit. Fixes: 6517798dd343 ("enetc: Make MDIO accessors more generic and export to include/linux/fsl") Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201112182608.26177-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30enetc: Replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-11enetc: Migrate to PHYLINK and PCS_LYNXClaudiu Manoil7-166/+191
This is a methodical transition of the driver from phylib to phylink, following the guidelines from sfp-phylink.rst. The MAC register configurations based on interface mode were moved from the probing path to the mac_config() hook. MAC enable and disable commands (enabling Rx and Tx paths at MAC level) were also extracted and assigned to their corresponding phylink hooks. As part of the migration to phylink, the serdes configuration from the driver was offloaded to the PCS_LYNX module, introduced in commit 0da4c3d393e4 ("net: phy: add Lynx PCS module"), the PCS_LYNX module being a mandatory component required to make the enetc driver work with phylink. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Ioana Ciornei <ioana.cionei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-11enetc: Clean up serdes configurationClaudiu Manoil1-53/+48
Decouple internal mdio bus creation from serdes configuration, as a prerequisite to offloading serdes configuration to a different module. Group together mdio bus creation routines, cleanup. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-11enetc: Clean up MAC and link configurationClaudiu Manoil1-48/+67
Decouple level MAC configuration based on phy interface type from general port configuration. Group together MAC and link configuration code. Decouple external mdio bus creation from interface type parsing. No longer return an (unhandled) error code when phy_node not found, use phy_node to indicate whether the port has a phy or not. No longer fall-through when serdes configuration fails for the link modes that require internal link configuration. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+1
Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-21enetc: simplify the return expression of enetc_vf_set_mac_addr()Qinglang Miao1-6/+1
Simplify the return expression. Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-12enetc: Fix mdio bus removal on PF probe bailoutClaudiu Manoil1-1/+1
This is the correct resolution for the conflict from merging the "net" tree fix: commit 26cb7085c898 ("enetc: Remove the mdio bus on PF probe bailout") with the "net-next" new work: commit 07095c025ac2 ("net: enetc: Use DT protocol information to set up the ports") that moved mdio bus allocation to an ealier stage of the PF probing routine. Fixes: a57066b1a019 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04enetc: use napi_schedule to be compatible with PREEMPT_RTJiafei Pan1-1/+1
The driver calls napi_schedule_irqoff() from a context where, in RT, hardirqs are not disabled, since the IRQ handler is force-threaded. In the call path of this function, __raise_softirq_irqoff() is modifying its per-CPU mask of pending softirqs that must be processed, using or_softirq_pending(). The or_softirq_pending() function is not atomic, but since interrupts are supposed to be disabled, nobody should be preempting it, and the operation should be safe. Nonetheless, when running with hardirqs on, as in the PREEMPT_RT case, it isn't safe, and the pending softirqs mask can get corrupted, resulting in softirqs being lost and never processed. To have common code that works with PREEMPT_RT and with mainline Linux, we can use plain napi_schedule() instead. The difference is that napi_schedule() (via __napi_schedule) also calls local_irq_save, which disables hardirqs if they aren't already. But, since they already are disabled in non-RT, this means that in practice we don't see any measurable difference in throughput or latency with this patch. Signed-off-by: Jiafei Pan <Jiafei.Pan@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04net/enetc: Fix wrong return value in enetc_psfp_parse_clsflower()Tianjia Zhang1-4/+4
In the case of invalid rule, a positive value EINVAL is returned here. I think this is a typo error. It is necessary to return an error value. Cc: Po Liu <Po.Liu@nxp.com> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-0/+1
The UDP reuseport conflict was a little bit tricky. The net-next code, via bpf-next, extracted the reuseport handling into a helper so that the BPF sk lookup code could invoke it. At the same time, the logic for reuseport handling of unconnected sockets changed via commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace which changed the logic to carry on the reuseport result into the rest of the lookup loop if we do not return immediately. This requires moving the reuseport_has_conns() logic into the callers. While we are here, get rid of inline directives as they do not belong in foo.c files. The other changes were cases of more straightforward overlapping modifications. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23enetc: Remove the mdio bus on PF probe bailoutClaudiu Manoil1-0/+1
For ENETC ports that register an external MDIO bus, the bus doesn't get removed on the error bailout path of enetc_pf_probe(). This issue became much more visible after recent: commit 07095c025ac2 ("net: enetc: Use DT protocol information to set up the ports") Before this commit, one could make probing fail on the error path only by having register_netdev() fail, which is unlikely. But after this commit, because it moved the enetc_of_phy_get() call up in the probing sequence, now we can trigger an mdiobus_free() bug just by forcing enetc_alloc_msix() to return error, i.e. with the 'pci=nomsi' kernel bootarg (since ENETC relies on MSI support to work), as the calltrace below shows: kernel BUG at /home/eiz/work/enetc/net/drivers/net/phy/mdio_bus.c:648! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [...] Hardware name: LS1028A RDB Board (DT) pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--) pc : mdiobus_free+0x50/0x58 lr : devm_mdiobus_free+0x14/0x20 [...] Call trace: mdiobus_free+0x50/0x58 devm_mdiobus_free+0x14/0x20 release_nodes+0x138/0x228 devres_release_all+0x38/0x60 really_probe+0x1c8/0x368 driver_probe_device+0x5c/0xc0 device_driver_attach+0x74/0x80 __driver_attach+0x8c/0xd8 bus_for_each_dev+0x7c/0xd8 driver_attach+0x24/0x30 bus_add_driver+0x154/0x200 driver_register+0x64/0x120 __pci_register_driver+0x44/0x50 enetc_pf_driver_init+0x24/0x30 do_one_initcall+0x60/0x1c0 kernel_init_freeable+0x1fc/0x274 kernel_init+0x14/0x110 ret_from_fork+0x10/0x34 Fixes: ebfcb23d62ab ("enetc: Add ENETC PF level external MDIO support") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23enetc: Remove the imdio bus on PF probe bailoutClaudiu Manoil1-1/+9
enetc_imdio_remove() is missing from the enetc_pf_probe() bailout path. Not surprisingly because enetc_setup_serdes() is registering the imdio bus for internal purposes, and it's not obvious that enetc_imdio_remove() currently performs the teardown of enetc_setup_serdes(). To fix this, define enetc_teardown_serdes() to wrap enetc_imdio_remove() (improve code maintenance) and call it on bailout and remove paths. Fixes: 975d183ef0ca ("net: enetc: Initialize SerDes for SGMII and USXGMII protocols") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Add adaptive interrupt coalescingClaudiu Manoil4-7/+73
Use the generic dynamic interrupt moderation (dim) framework to implement adaptive interrupt coalescing on Rx. With the per-packet interrupt scheme, a high interrupt rate has been noted for moderate traffic flows leading to high CPU utilization. The 'dim' scheme implemented by the current patch addresses this issue improving CPU utilization while using minimal coalescing time thresholds in order to preserve a good latency. On the Tx side use an optimal time threshold value by default. This value has been optimized for Tx TCP streams at a rate of around 85kpps on a 1G link, at which rate half of the Tx ring size (128) gets filled in 1500 usecs. Scaling this down to 2.5G links yields the current value of 600 usecs, which is conservative and gives good enough results for 1G links too (see next). Below are some measurement results for before and after this patch (and related dependencies) basically, for a 2 ARM Cortex-A72 @1.3Ghz CPUs system (32 KB L1 data cache), using 60secs log netperf TCP stream tests @ 1Gbit link (maximum throughput): 1) 1 Rx TCP flow, both Rx and Tx processed by the same NAPI thread on the same CPU: CPU utilization int rate (ints/sec) Before: 50%-60% (over 50%) 92k After: 13%-22% 3.5k-12k Comment: Major CPU utilization improvement for a single flow Rx TCP flow (i.e. netperf -t TCP_MAERTS) on a single CPU. Usually settles under 16% for longer tests. 2) 4 Rx TCP flows + 4 Tx TCP flows (+ pings to check the latency): Total CPU utilization Total int rate (ints/sec) Before: ~80% (spikes to 90%) ~100k After: 60% (more steady) ~4k Comment: Important improvement for this load test, while the ping test outcome does not show any notable difference compared to before. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Add interrupt coalescing supportClaudiu Manoil4-10/+132
Enable programming of the interrupt coalescing registers and allow manual configuration of the coalescing time thresholds via ethtool. Packet thresholds have been fixed to predetermined values as there's no point in making them run-time configurable, also anticipating the dynamic interrupt moderation (DIM) algorithm which uses fixed packet thresholds as well. If the interface is up when the operation mode of traffic interrupt events is changed by the user (i.e. switching from default per-packet interrupts to coalesced interrupts), the traffic needs to be paused in the process. This patch also prepares the ground for introducing DIM on Rx. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Drop redundant ____cacheline_aligned_in_smpClaudiu Manoil1-1/+1
'struct enetc_bdr' is already '____cacheline_aligned_in_smp'. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Fix interrupt coalescing register namingClaudiu Manoil3-7/+7
Interrupt coalescing registers naming in the current revision of the Ref Man (RM) is ICR, deprecating the ICIR name used in earlier (draft) versions of the RM. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Factor out the traffic start/stop proceduresClaudiu Manoil1-25/+49
A reliable traffic pause (and reconfiguration) procedure is needed to be able to safely make h/w configuration changes during run-time, like changing the mode in which the interrupts are operating (i.e. with or without coalescing), as opposed to making on-the-fly register updates that may be subject to h/w or s/w concurrency issues. To this end, the code responsible of the run-time device configurations that basically starts resp. stops the traffic flow through the device has been extracted from the the enetc_open/_close procedures, to the separate standalone enetc_start/_stop procedures. Traffic stop should be as graceful as possible, it lets the executing napi threads to to finish while the interrupts stay disabled. But since the napi thread will try to re-enable interrupts by clearing the device's unmask register, the enable_irq/ disable_irq API has been used to avoid this potential concurrency issue and make the traffic pause procedure more reliable. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22enetc: Refine buffer descriptor ring sizesClaudiu Manoil2-4/+5
It's time to differentiate between Rx and Tx ring sizes. Not only Tx rings are processed differently than Rx rings, but their default number also differs - i.e. up to 8 Tx rings per device (8 traffic classes) vs. 2 Rx rings (one per CPU). So let's set Tx rings sizes to half the size of the Rx rings for now, to be conservative. The default ring sizes were decreased as well (to the next lower power of 2), to reduce the memory footprint, buffering etc., since the measurements I've made so far show that the rings are very unlikely to get full. This change also anticipates the introduction of the dynamic interrupt moderation (dim) algorithm which operates on maximum packet thresholds of 256 packets for Rx and 128 packets for Tx. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-20net: enetc: Use DT protocol information to set up the portsAlex Marginean2-26/+34
Use DT information rather than in-band information from bootloader to set up MAC for XGMII. For RGMII use the DT indication in addition to RGMII defaults in hardware. However, this implies that PHY connection information needs to be extracted before netdevice creation, when the ENETC Port MAC is being configured. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Michael Walle <michael@walle.cc> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>