summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-02-25KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3Mark Brown3-2/+13
The arch_timer and vgic_irq kselftests assume that they can create a vgic-v3, using the library function vgic_v3_setup() which aborts with a test failure if it is not possible to do so. Since vgic-v3 can only be instantiated on systems where the host has GICv3 this leads to false positives on older systems where that is not the case. Fix this by changing vgic_v3_setup() to return an error if the vgic can't be instantiated and have the callers skip if this happens. We could also exit flagging a skip in vgic_v3_setup() but this would prevent future test cases conditionally deciding which GIC to use or generally doing more complex output. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Tested-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220223131624.1830351-1-broonie@kernel.org
2022-02-25net: sparx5: Fix add vlan when invalid operationCasper Andersson1-10/+10
Check if operation is valid before changing any settings in hardware. Otherwise it results in changes being made despite it not being a valid operation. Fixes: 78eab33bb68b ("net: sparx5: add vlan support") Signed-off-by: Casper Andersson <casper.casan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: chelsio: cxgb3: check the return value of pci_find_capability()Jia-Ju Bai1-0/+2
The function pci_find_capability() in t3_prep_adapter() can fail, so its return value should be checked. Fixes: 4d22de3e6cc4 ("Add support for the latest 1G/10G Chelsio adapter, T3") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25Merge branch 'sja1105-phylink-updates'David S. Miller1-58/+42
Russell King says: ==================== net: dsa: sja1105: phylink updates This series updates the phylink implementation in sja1105 to use the supported_interfaces bitmap, convert to the mac_select_pcs() interface, mark as non-legacy, and get rid of the validation method. As a final step, enable switching between SGMII and 2500BASE-X as it is a feature that Vladimir desires. Specifically, the patches in this series: 1. Populates the supported_interfaces bitmap. 2. As a result of the supported_interfaces bitmap being populated, sja1105 no longer needs to check the interface mode as phylink will do this. 3. Switch away from using phylink_set_pcs(), using the mac_select_pcs() method instead. 4. Mark the driver as not-legacy 5. Fill in mac_capabilities using _exactly_ the same conditions as is currently used to decide which link modes to support, and convert to use phylink_generic_validate() 6. Add brand new support to permit switching between SGMII and 2500BASE-X modes of operation as per Vladimir's single patch that performs steps 1, 2, 5 and 6 in one go. There are some additional changes in Vladimir's single patch that I have not included: * validation of priv->phy_mode[] in sja1105_phylink_get_caps(). The driver has already validated the phy_mode for each port in sja1105_init_mii_settings(), and a failure here will prevent the driver reaching sja1105_phylink_get_caps(). * Changing the decisions on which mac_capabilities to set. Vladimir's patch always sets MAC_10FD | MAC_100FD | MAC_1000FD despite the current code clearly making the 1G speed conditional on the xmii_mode for the port. The change in decision making may be visible when in PHY_INTERFACE_MODE_INTERNAL mode, for which the phylink_generic_validate() will pass through all the MAC capabilities as ethtool link modes. Hence, if we have PHY_INTERFACE_MODE_INTERNAL but supports_rgmii[] or supports_sgmii[] is non-zero, currently we do not get 1G speeds. With Vladimir's additional change, we will get 1G speeds. While it is not clear whether that can happen, I feel changing the decision making should be a separate patch. * The decision for MAC_2500FD is made differently - sja1105_init_mii_settings() allows PHY_INTERFACE_MODE_2500BASEX when supports_2500basex[] is non-zero, and is not based on any other condition such as supports_sgmii[] or supports_rgmii[]. Vladimir's patch makes it additionally conditional on those supports_.gmii[] settings, which is a functional change that should be made in a separate patch - and if desired, then sja1105_init_mii_settings() should also be updated at the same time. Consequently, I believe that my previous objections to Vladimir's single patch approach are well founded and justified, even through Vladimir is the maintainer of this driver. I have no objection to the additional changes, I just don't think they should all be wrapped up into a single patch that converts the way validation is done _and_ also makes a bunch of other functional changes. RFC->non-RFC: added Vladimir's Reviewed-by's, fixed the typo in the commit message of patch 6, and removed the phrase at the end of a comment as requested. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: support switching between SGMII and 2500BASE-XRussell King (Oracle)1-5/+22
Vladimir Oltean suggests that sja1105 can support switching between SGMII and 2500BASE-X modes. Augment sja1105_phylink_get_caps() to fill in both interface modes if they can be supported. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: convert to phylink_generic_validate()Russell King (Oracle)1-28/+7
Populate the MAC capabilities for the SJA1105 DSA switch using the same decision making which sja1105_phylink_validate() uses. Remove the now obsolete sja1105_phylink_validate() implementation to allow DSA to use phylink_generic_validate() for this switch driver. As noted by Vladimir, this fixes an inconsequential bug which allowed gigabit and lower interface modes to be indicated when operating in 2500base-X mode. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: mark as non-legacyRussell King (Oracle)1-0/+6
The sja1105 DSA driver does not have a phylink_mac_config() method implementation, it is safe to mark this as a non-legacy driver. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: use .mac_select_pcs() interfaceRussell King (Oracle)1-9/+7
Convert the PCS selection to use mac_select_pcs, which allows the PCS to perform any validation it needs, and removes the need to set the PCS in the mac_config() callback, delving into the higher DSA levels to do so. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: remove interface checksRussell King (Oracle)1-29/+0
When the supported interfaces bitmap is populated, phylink will itself check that the interface mode is present in this bitmap. Drivers no longer need to perform this check themselves. Remove these checks. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dsa: sja1105: populate supported_interfacesRussell King (Oracle)1-0/+13
Populate the supported interfaces bitmap for the SJA1105 DSA switch. This switch only supports a static model of configuration, so we restrict the interface modes to the configured setting. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Vladimir Oltean <vladimir. │ Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25Merge branch 'ibmvnic-fixes'David S. Miller2-28/+156
Sukadev Bhattiprolu says: ==================== ibmvnic: Fix a race in ibmvnic_probe() If we get a transport (reset) event right after a successful CRQ_INIT during ibmvnic_probe() but before we set the adapter state to VNIC_PROBED, we will throw away the reset assuming that the adapter is still in the probing state. But since the adapter has completed the CRQ_INIT any subsequent CRQs the we send will be ignored by the vnicserver until we release/init the CRQ again. This can leave the adapter unconfigured. While here fix a couple of other bugs that were observed (Patches 1,2,4). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: Allow queueing resets during probeSukadev Bhattiprolu2-10/+98
We currently don't allow queuing resets when adapter is in VNIC_PROBING state - instead we throw away the reset and return EBUSY. The reasoning is probably that during ibmvnic_probe() the ibmvnic_adapter itself is being initialized so performing a reset during this time can lead us to accessing fields in the ibmvnic_adapter that are not fully initialized. A review of the code shows that all the adapter state neede to process a reset is initialized before registering the CRQ so that should no longer be a concern. Further the expectation is that if we do get a reset (transport event) during probe, the do..while() loop in ibmvnic_probe() will handle this by reinitializing the CRQ. While that is true to some extent, it is possible that the reset might occur _after_ the CRQ is registered and CRQ_INIT message was exchanged but _before_ the adapter state is set to VNIC_PROBED. As mentioned above, such a reset will be thrown away. While the client assumes that the adapter is functional, the vnic server will wait for the client to reinit the adapter. This disconnect between the two leaves the adapter down needing manual intervention. Because ibmvnic_probe() has other work to do after initializing the CRQ (such as registering the netdev at a minimum) and because the reset event can occur at any instant after the CRQ is initialized, there will always be a window between initializing the CRQ and considering the adapter ready for resets (ie state == PROBED). So rather than discarding resets during this window, allow queueing them - but only process them after the adapter is fully initialized. To do this, introduce a new completion state ->probe_done and have the reset worker thread wait on this before processing resets. This change brings up two new situations in or just after ibmvnic_probe(). First after one or more resets were queued, we encounter an error and decide to retry the initialization. At that point the queued resets are no longer relevant since we could be talking to a new vnic server. So we must purge/flush the queued resets before restarting the initialization. As a side note, since we are still in the probing stage and we have not registered the netdev, it will not be CHANGE_PARAM reset. Second this change opens up a potential race between the worker thread in __ibmvnic_reset(), the tasklet and the ibmvnic_open() due to the following sequence of events: 1. Register CRQ 2. Get transport event before CRQ_INIT completes. 3. Tasklet schedules reset: a) add rwi to list b) schedule_work() to start worker thread which runs and waits for ->probe_done. 4. ibmvnic_probe() decides to retry, purges rwi_list 5. Re-register crq and this time rest of probe succeeds - register netdev and complete(->probe_done). 6. Worker thread resumes in __ibmvnic_reset() from 3b. 7. Worker thread sets ->resetting bit 8. ibmvnic_open() comes in, notices ->resetting bit, sets state to IBMVNIC_OPEN and returns early expecting worker thread to finish the open. 9. Worker thread finds rwi_list empty and returns without opening the interface. If this happens, the ->ndo_open() call is effectively lost and the interface remains down. To address this, ensure that ->rwi_list is not empty before setting the ->resetting bit. See also comments in __ibmvnic_reset(). Fixes: 6a2fb0e99f9c ("ibmvnic: driver initialization for kdump/kexec") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: clear fop when retrying probeSukadev Bhattiprolu1-0/+5
Clear ->failover_pending flag that may have been set in the previous pass of registering CRQ. If we don't clear, a subsequent ibmvnic_open() call would be misled into thinking a failover is pending and assuming that the reset worker thread would open the adapter. If this pass of registering the CRQ succeeds (i.e there is no transport event), there wouldn't be a reset worker thread. This would leave the adapter unconfigured and require manual intervention to bring it up during boot. Fixes: 5a18e1e0c193 ("ibmvnic: Fix failover case for non-redundant configuration") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: init init_done_rc earlierSukadev Bhattiprolu1-5/+21
We currently initialize the ->init_done completion/return code fields before issuing a CRQ_INIT command. But if we get a transport event soon after registering the CRQ the taskslet may already have recorded the completion and error code. If we initialize here, we might overwrite/ lose that and end up issuing the CRQ_INIT only to timeout later. If that timeout happens during probe, we will leave the adapter in the DOWN state rather than retrying to register/init the CRQ. Initialize the completion before registering the CRQ so we don't lose the notification. Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: register netdev after init of adapterSukadev Bhattiprolu1-6/+8
Finish initializing the adapter before registering netdev so state is consistent. Fixes: c26eba03e407 ("ibmvnic: Update reset infrastructure to support tunable parameters") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: complete init_done on transport eventsSukadev Bhattiprolu1-0/+7
If we get a transport event, set the error and mark the init as complete so the attempt to send crq-init or login fail sooner rather than wait for the timeout. Fixes: bbd669a868bb ("ibmvnic: Fix completion structure initialization") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: define flush_reset_queue helperSukadev Bhattiprolu1-8/+16
Define and use a helper to flush the reset queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: initialize rc before completing waitSukadev Bhattiprolu1-1/+1
We should initialize ->init_done_rc before calling complete(). Otherwise the waiting thread may see ->init_done_rc as 0 before we have updated it and may assume that the CRQ was successful. Fixes: 6b278c0cb378 ("ibmvnic delay complete()") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25ibmvnic: free reset-work-item when flushingSukadev Bhattiprolu1-1/+3
Fix a tiny memory leak when flushing the reset work queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25Merge branch 'master' of ↵David S. Miller11-35/+51
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix PMTU for IPv6 if the reported MTU minus the ESP overhead is smaller than 1280. From Jiri Bohac. 2) Fix xfrm interface ID and inter address family tunneling when migrating xfrm states. From Yan Yan. 3) Add missing xfrm intrerface ID initialization on xfrmi_changelink. From Antony Antony. 4) Enforce validity of xfrm offload input flags so that userspace can't send undefined flags to the offload driver. From Leon Romanovsky. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: dcb: flush lingering app table entries for unregistered devicesVladimir Oltean1-0/+44
If I'm not mistaken (and I don't think I am), the way in which the dcbnl_ops work is that drivers call dcb_ieee_setapp() and this populates the application table with dynamically allocated struct dcb_app_type entries that are kept in the module-global dcb_app_list. However, nobody keeps exact track of these entries, and although dcb_ieee_delapp() is supposed to remove them, nobody does so when the interface goes away (example: driver unbinds from device). So the dcb_app_list will contain lingering entries with an ifindex that no longer matches any device in dcb_app_lookup(). Reclaim the lost memory by listening for the NETDEV_UNREGISTER event and flushing the app table entries of interfaces that are now gone. In fact something like this used to be done as part of the initial commit (blamed below), but it was done in dcbnl_exit() -> dcb_flushapp(), essentially at module_exit time. That became dead code after commit 7a6b6f515f77 ("DCB: fix kconfig option") which essentially merged "tristate config DCB" and "bool config DCBNL" into a single "bool config DCB", so net/dcb/dcbnl.c could not be built as a module anymore. Commit 36b9ad8084bd ("net/dcb: make dcbnl.c explicitly non-modular") recognized this and deleted dcbnl_exit() and dcb_flushapp() altogether, leaving us with the version we have today. Since flushing application table entries can and should be done as soon as the netdevice disappears, fundamentally the commit that is to blame is the one that introduced the design of this API. Fixes: 9ab933ab2cc8 ("dcbnl: add appliction tlv handlers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net/smc: fix connection leakD. Wythe1-2/+8
There's a potential leak issue under following execution sequence : smc_release smc_connect_work if (sk->sk_state == SMC_INIT) send_clc_confirim tcp_abort(); ... sk.sk_state = SMC_ACTIVE smc_close_active switch(sk->sk_state) { ... case SMC_ACTIVE: smc_close_final() // then wait peer closed Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are still in the tcp send buffer, in which case our connection token cannot be delivered to the server side, which means that we cannot get a passive close message at all. Therefore, it is impossible for the to be disconnected at all. This patch tries a very simple way to avoid this issue, once the state has changed to SMC_ACTIVE after tcp_abort(), we can actively abort the smc connection, considering that the state is SMC_INIT before tcp_abort(), abandoning the complete disconnection process should not cause too much problem. In fact, this problem may exist as long as the CLC CONFIRM message is not received by the server. Whether a timer should be added after smc_close_final() needs to be discussed in the future. But even so, this patch provides a faster release for connection in above case, it should also be valuable. Fixes: 39f41f367b08 ("net/smc: common release code for non-accepted sockets") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: stmmac: only enable DMA interrupts when readyVincent Whitchurch1-2/+26
In this driver's ->ndo_open() callback, it enables DMA interrupts, starts the DMA channels, then requests interrupts with request_irq(), and then finally enables napi. If RX DMA interrupts are received before napi is enabled, no processing is done because napi_schedule_prep() will return false. If the network has a lot of broadcast/multicast traffic, then the RX ring could fill up completely before napi is enabled. When this happens, no further RX interrupts will be delivered, and the driver will fail to receive any packets. Fix this by only enabling DMA interrupts after all other initialization is complete. Fixes: 523f11b5d4fd72efb ("net: stmmac: move hardware setup for stmmac_open to new function") Reported-by: Lars Persson <larper@axis.com> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25net: openvswitch: IPv6: Add IPv6 extension header supportToms Atteka4-2/+184
This change adds a new OpenFlow field OFPXMT_OFB_IPV6_EXTHDR and packets can be filtered using ipv6_ext flag. Signed-off-by: Toms Atteka <cpp.code.lv@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25xen/netfront: destroy queues before real_num_tx_queues is zeroedMarek Marczykowski-Górecki1-16/+23
xennet_destroy_queues() relies on info->netdev->real_num_tx_queues to delete queues. Since d7dac083414eb5bb99a6d2ed53dc2c1b405224e5 ("net-sysfs: update the queue counts in the unregistration path"), unregister_netdev() indirectly sets real_num_tx_queues to 0. Those two facts together means, that xennet_destroy_queues() called from xennet_remove() cannot do its job, because it's called after unregister_netdev(). This results in kfree-ing queues that are still linked in napi, which ultimately crashes: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 52 Comm: xenwatch Tainted: G W 5.16.10-1.32.fc32.qubes.x86_64+ #226 RIP: 0010:free_netdev+0xa3/0x1a0 Code: ff 48 89 df e8 2e e9 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 ed c1 66 ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00 RSP: 0000:ffffc90000bcfd00 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff88800edad000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffc90000bcfc30 RDI: 00000000ffffffff RBP: fffffffffffffea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff88800edad050 R13: ffff8880065f8f88 R14: 0000000000000000 R15: ffff8880066c6680 FS: 0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000e998c006 CR4: 00000000003706e0 Call Trace: <TASK> xennet_remove+0x13d/0x300 [xen_netfront] xenbus_dev_remove+0x6d/0xf0 __device_release_driver+0x17a/0x240 device_release_driver+0x24/0x30 bus_remove_device+0xd8/0x140 device_del+0x18b/0x410 ? _raw_spin_unlock+0x16/0x30 ? klist_iter_exit+0x14/0x20 ? xenbus_dev_request_and_reply+0x80/0x80 device_unregister+0x13/0x60 xenbus_dev_changed+0x18e/0x1f0 xenwatch_thread+0xc0/0x1a0 ? do_wait_intr_irq+0xa0/0xa0 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Fix this by calling xennet_destroy_queues() from xennet_uninit(), when real_num_tx_queues is still available. This ensures that queues are destroyed when real_num_tx_queues is set to 0, regardless of how unregister_netdev() was called. Originally reported at https://github.com/QubesOS/qubes-issues/issues/7257 Fixes: d7dac083414eb5bb9 ("net-sysfs: update the queue counts in the unregistration path") Cc: stable@vger.kernel.org Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25Merge tag 'omap-for-v5.17/fixes-signed' of ↵Arnd Bergmann3-35/+19
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes Fixes for omaps Fixes for devkit8000 timer regression. Similar to the earlier beagleboard fixes, we must not configure the clocksource drivers to use an alternative timer configuration. It causes unnecessary issues with power management. Only some old designs based on early beagleboard revisions with a miswired timer need to use the alternative timer. * tag 'omap-for-v5.17/fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: dts: Use 32KiHz oscillator on devkit8000 ARM: dts: switch timer config to common devkit8000 devicetree Link: https://lore.kernel.org/r/pull-1645606483-876944@atomide.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-02-25Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"Sean Christopherson2-11/+21
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter. The PCID (ASID) part of CR3 can be bumped without KVM being scheduled out, as the kernel will switch CR3 during __text_poke(), e.g. in response to a static key toggling. If switch_mm_irqs_off() chooses a new ASID for the mm associate with KVM, KVM will do VM-Enter => VM-Exit with a stale vmcs.HOST_CR3. Add a comment to explain why KVM must wait until VM-Enter is imminent to refresh vmcs.HOST_CR3. The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being loaded for the KVM-associated mm while KVM has a "running" vCPU: static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush) { struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); ... WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) && (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK), "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3); } ------------[ cut here ]------------ KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003 WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel CPU: 4 PID: 20717 Comm: stable Tainted: G W 5.17.0-rc3+ #747 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:load_new_mm_cr3+0x82/0xe0 RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082 RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788 RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33 FS: 00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0 Call Trace: <TASK> switch_mm_irqs_off+0x1cb/0x460 __text_poke+0x308/0x3e0 text_poke_bp_batch+0x168/0x220 text_poke_finish+0x1b/0x30 arch_jump_label_transform_apply+0x18/0x30 static_key_slow_inc_cpuslocked+0x7c/0x90 static_key_slow_inc+0x16/0x20 kvm_lapic_set_base+0x116/0x190 kvm_set_apic_base+0xa5/0xe0 kvm_set_msr_common+0x2f4/0xf60 vmx_set_msr+0x355/0xe70 [kvm_intel] kvm_set_msr_ignored_check+0x91/0x230 kvm_emulate_wrmsr+0x36/0x120 vmx_handle_exit+0x609/0x6c0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x146f/0x1b80 kvm_vcpu_ioctl+0x279/0x690 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- This reverts commit 15ad9762d69fd8e40a4a51828c1d6b0c1b8fbea0. Fixes: 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") Reported-by: Wanpeng Li <kernellwp@gmail.com> Cc: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Lai Jiangshan <jiangshanlai@gmail.com> Message-Id: <20220224191917.3508476-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"Sean Christopherson3-14/+14
Undo a nested VMX fix as a step toward reverting the commit it fixed, 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"), as the underlying premise that "host CR3 in the vcpu thread can only be changed when scheduling" is wrong. This reverts commit a9f2705ec84449e3b8d70c804766f8e97e23080d. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220224191917.3508476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25can: gs_usb: change active_channels's type from atomic_t to u8Vincent Mailhol1-5/+5
The driver uses an atomic_t variable: gs_usb:active_channels to keep track of the number of opened channels in order to only allocate memory for the URBs when this count changes from zero to one. However, the driver does not decrement the counter when an error occurs in gs_can_open(). This issue is fixed by changing the type from atomic_t to u8 and by simplifying the logic accordingly. It is safe to use an u8 here because the network stack big kernel lock (a.k.a. rtnl_mutex) is being hold. For details, please refer to [1]. [1] https://lore.kernel.org/linux-can/CAMZ6Rq+sHpiw34ijPsmp7vbUpDtJwvVtdV7CvRZJsLixjAFfrg@mail.gmail.com/T/#t Fixes: d08e973a77d1 ("can: gs_usb: Added support for the GS_USB CAN devices") Link: https://lore.kernel.org/all/20220214234814.1321599-1-mailhol.vincent@wanadoo.fr Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-02-25can: etas_es58x: change opened_channel_cnt's type from atomic_t to u8Vincent Mailhol2-7/+10
The driver uses an atomic_t variable: struct es58x_device::opened_channel_cnt to keep track of the number of opened channels in order to only allocate memory for the URBs when this count changes from zero to one. While the intent was to prevent race conditions, the choice of an atomic_t turns out to be a bad idea for several reasons: - implementation is incorrect and fails to decrement opened_channel_cnt when the URB allocation fails as reported in [1]. - even if opened_channel_cnt were to be correctly decremented, atomic_t is insufficient to cover edge cases: there can be a race condition in which 1/ a first process fails to allocate URBs memory 2/ a second process enters es58x_open() before the first process does its cleanup and decrements opened_channed_cnt. In which case, the second process would successfully return despite the URBs memory not being allocated. - actually, any kind of locking mechanism was useless here because it is redundant with the network stack big kernel lock (a.k.a. rtnl_lock) which is being hold by all the callers of net_device_ops:ndo_open() and net_device_ops:ndo_close(). c.f. the ASSERST_RTNL() calls in __dev_open() [2] and __dev_close_many() [3]. The atmomic_t is thus replaced by a simple u8 type and the logic to increment and decrement es58x_device:opened_channel_cnt is simplified accordingly fixing the bug reported in [1]. We do not check again for ASSERST_RTNL() as this is already done by the callers. [1] https://lore.kernel.org/linux-can/20220201140351.GA2548@kili/T/#u [2] https://elixir.bootlin.com/linux/v5.16/source/net/core/dev.c#L1463 [3] https://elixir.bootlin.com/linux/v5.16/source/net/core/dev.c#L1541 Fixes: 8537257874e9 ("can: etas_es58x: add core support for ETAS ES58X CAN USB interfaces") Link: https://lore.kernel.org/all/20220212112713.577957-1-mailhol.vincent@wanadoo.fr Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-02-25Merge branch 'mptcp-fixes-for-5-17'Jakub Kicinski2-4/+18
Mat Martineau says: ==================== mptcp: Fixes for 5.17 Patch 1 fixes an issue with the SIOCOUTQ ioctl in MPTCP sockets that have performed a fallback to TCP. Patch 2 is a selftest fix to correctly remove temp files. Patch 3 fixes a shift-out-of-bounds issue found by syzkaller. ==================== Link: https://lore.kernel.org/r/20220225005259.318898-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25mptcp: Correctly set DATA_FIN timeout when number of retransmits is largeMat Martineau1-2/+5
Syzkaller with UBSAN uncovered a scenario where a large number of DATA_FIN retransmits caused a shift-out-of-bounds in the DATA_FIN timeout calculation: ================================================================================ UBSAN: shift-out-of-bounds in net/mptcp/protocol.c:470:29 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 1 PID: 13059 Comm: kworker/1:0 Not tainted 5.17.0-rc2-00630-g5fbf21c90c60 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e lib/ubsan.c:330 mptcp_set_datafin_timeout net/mptcp/protocol.c:470 [inline] __mptcp_retrans.cold+0x72/0x77 net/mptcp/protocol.c:2445 mptcp_worker+0x58a/0xa70 net/mptcp/protocol.c:2528 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2307 worker_thread+0x95/0xe10 kernel/workqueue.c:2454 kthread+0x2f4/0x3b0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> ================================================================================ This change limits the maximum timeout by limiting the size of the shift, which keeps all intermediate values in-bounds. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/259 Fixes: 6477dd39e62c ("mptcp: Retransmit DATA_FIN") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25selftests: mptcp: do complete cleanup at exitPaolo Abeni1-2/+2
After commit 05be5e273c84 ("selftests: mptcp: add disconnect tests") the mptcp selftests leave behind a couple of tmp files after each run. run_tests_disconnect() misnames a few variables used to track them. Address the issue setting the appropriate global variables Fixes: 05be5e273c84 ("selftests: mptcp: add disconnect tests") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25mptcp: accurate SIOCOUTQ for fallback socketPaolo Abeni1-0/+11
The MPTCP SIOCOUTQ implementation is not very accurate in case of fallback: it only measures the data in the MPTCP-level write queue, but it does not take in account the subflow write queue utilization. In case of fallback the first can be empty, while the latter is not. The above produces sporadic self-tests issues and can foul legit user-space application. Fix the issue additionally querying the subflow in case of fallback. Fixes: 644807e3e462 ("mptcp: add SIOCINQ, OUTQ and OUTQNSD ioctls") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/260 Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25Merge branch 'nfp-flow-independent-tc-action-hardware-offload'Jakub Kicinski5-26/+534
Simon Horman says: ==================== nfp: flow-independent tc action hardware offload Baowen Zheng says: Allow nfp NIC to offload tc actions independent of flows. The motivation for this work is to offload tc actions independent of flows for nfp NIC. We allow nfp driver to provide hardware offload of OVS metering feature - which calls for policers that may be used by multiple flows and whose lifecycle is independent of any flows that use them. When nfp driver tries to offload a flow table using the independent action, the driver will search if the action is already offloaded to the hardware. If not, the flow table offload will fail. When the nfp NIC successes to offload an action, the user can check in_hw_count when dumping the tc action. Tc cli command to offload and dump an action: # tc actions add action police rate 100mbit burst 10000k index 200 skip_sw # tc -s -d actions list action police total acts 1 action order 0: police 0xc8 rate 100Mbit burst 10000Kb mtu 2Kb action reclassify overhead 0b linklayer ethernet ref 1 bind 0 installed 142 sec used 0 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 skip_sw in_hw in_hw_count 1 used_hw_stats delayed ==================== Link: https://lore.kernel.org/r/20220223162302.97609-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: add NFP_FL_FEATS_QOS_METER to host features to enable meter offloadBaowen Zheng1-1/+2
Add NFP_FL_FEATS_QOS_METER to host features to enable meter offload in driver. Before adding this feature, we will not offload any police action since we will check the host features before offloading any police action. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: add support to offload police action from flower tableBaowen Zheng4-1/+68
Offload flow table if the action is already offloaded to hardware when flow table uses this action. Change meter id to type of u32 to support all the action index. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: add process to get action stats from hardwareBaowen Zheng2-5/+114
Add a process to update action stats from hardware. This stats data will be updated to tc action when dumping actions or filters. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: add hash table to store meter tableBaowen Zheng2-1/+176
Add a hash table to store meter table. This meter table will also be used by flower action. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: add support to offload tc action to hardwareBaowen Zheng3-1/+121
Add process to offload tc action to hardware. Currently we only support to offload police action. Add meter capability to check if firmware supports meter offload. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25nfp: refactor policer config to support ingress/egress meterBaowen Zheng2-20/+56
Add an policer API to support ingress/egress meter. Change ingress police to compatible with the new API. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net/tcp: Merge TCP-MD5 inbound callbacksDmitry Safonov4-131/+92
The functions do essentially the same work to verify TCP-MD5 sign. Code can be merged into one family-independent function in order to reduce copy'n'paste and generated code. Later with TCP-AO option added, this will allow to create one function that's responsible for segment verification, that will have all the different checks for MD5/AO/non-signed packets, which in turn will help to see checks for all corner-cases in one function, rather than spread around different families and functions. Cc: Eric Dumazet <edumazet@google.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25Merge branch 'fdb-entries-on-dsa-lag-interfaces'Jakub Kicinski15-195/+560
Vladimir Oltean says: ==================== FDB entries on DSA LAG interfaces This work permits having static and local FDB entries on LAG interfaces that are offloaded by DSA ports. New API needs to be introduced in drivers. To maintain consistency with the bridging offload code, I've taken the liberty to reorganize the data structures added by Tobias in the DSA core a little bit. Tested on NXP LS1028A (felix switch). Would appreciate feedback/testing on other platforms too. Testing procedure was the one described here: https://patchwork.kernel.org/project/netdevbpf/cover/20210205130240.4072854-1-vladimir.oltean@nxp.com/ with this script: ip link del bond0 ip link add bond0 type bond mode 802.3ad ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link del br0 ip link add br0 type bridge && ip link set br0 up ip link set br0 arp off ip link set bond0 master br0 && ip link set bond0 up ip link set swp0 master br0 && ip link set swp0 up ip link set dev bond0 type bridge_slave flood off learning off bridge fdb add dev bond0 <mac address of other eno0> master static I'm noticing a problem in 'bridge fdb dump' with the 'self' entries, and I didn't solve this. On Ocelot, an entry learned on a LAG is reported as being on the first member port of it (so instead of saying 'self bond0', it says 'self swp1'). This is better than not seeing the entry at all, but when DSA queries for the FDBs on a port via ds->ops->port_fdb_dump, it never queries for FDBs on a LAG. Not clear what we should do there, we aren't in control of the ->ndo_fdb_dump of the bonding/team drivers. Alternatively, we could just consider the 'self' entries reported via ndo_fdb_dump as "better than nothing", and concentrate on the 'master' entries that are in sync with the bridge when packets are flooded to software. ==================== Link: https://lore.kernel.org/r/20220223140054.3379617-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: felix: support FDB entries on offloaded LAG interfacesVladimir Oltean3-1/+157
This adds the logic in the Felix DSA driver and Ocelot switch library. For Ocelot switches, the DEST_IDX that is the output of the MAC table lookup is a logical port (equal to physical port, if no LAG is used, or a dynamically allocated number otherwise). The allocation we have in place for LAG IDs is different from DSA's, so we can't use that: - DSA allocates a continuous range of LAG IDs starting from 1 - Ocelot appears to require that physical ports and LAG IDs are in the same space of [0, num_phys_ports), and additionally, ports that aren't in a LAG must have physical port id == logical port id The implication is that an FDB entry towards a LAG might need to be deleted and reinstalled when the LAG ID changes. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: support FDB events on offloaded LAG interfacesVladimir Oltean5-15/+183
This change introduces support for installing static FDB entries towards a bridge port that is a LAG of multiple DSA switch ports, as well as support for filtering towards the CPU local FDB entries emitted for LAG interfaces that are bridge ports. Conceptually, host addresses on LAG ports are identical to what we do for plain bridge ports. Whereas FDB entries _towards_ a LAG can't simply be replicated towards all member ports like we do for multicast, or VLAN. Instead we need new driver API. Hardware usually considers a LAG to be a "logical port", and sets the entire LAG as the forwarding destination. The physical egress port selection within the LAG is made by hashing policy, as usual. To represent the logical port corresponding to the LAG, we pass by value a copy of the dsa_lag structure to all switches in the tree that have at least one port in that LAG. To illustrate why a refcounted list of FDB entries is needed in struct dsa_lag, it is enough to say that: - a LAG may be a bridge port and may therefore receive FDB events even while it isn't yet offloaded by any DSA interface - DSA interfaces may be removed from a LAG while that is a bridge port; we don't want FDB entries lingering around, but we don't want to remove entries that are still in use, either For all the cases below to work, the idea is to always keep an FDB entry on a LAG with a reference count equal to the DSA member ports. So: - if a port joins a LAG, it requests the bridge to replay the FDB, and the FDB entries get created, or their refcount gets bumped by one - if a port leaves a LAG, the FDB replay deletes or decrements refcount by one - if an FDB is installed towards a LAG with ports already present, that entry is created (if it doesn't exist) and its refcount is bumped by the amount of ports already present in the LAG echo "Adding FDB entry to bond with existing ports" ip link del bond0 ip link add bond0 type bond mode 802.3ad ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link del br0 ip link add br0 type bridge ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link del br0 ip link del bond0 echo "Adding FDB entry to empty bond" ip link del bond0 ip link add bond0 type bond mode 802.3ad ip link del br0 ip link add br0 type bridge ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link del br0 ip link del bond0 echo "Adding FDB entry to empty bond, then removing ports one by one" ip link del bond0 ip link add bond0 type bond mode 802.3ad ip link del br0 ip link add br0 type bridge ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link set swp1 nomaster ip link set swp2 nomaster ip link del br0 ip link del bond0 Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: call SWITCHDEV_FDB_OFFLOADED for the orig_devVladimir Oltean2-1/+3
When switchdev_handle_fdb_event_to_device() replicates a FDB event emitted for the bridge or for a LAG port and DSA offloads that, we should notify back to switchdev that the FDB entry on the original device is what was offloaded, not on the DSA slave devices that the event is replicated on. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: remove "ds" and "port" from struct dsa_switchdev_event_workVladimir Oltean2-13/+5
By construction, the struct net_device *dev passed to dsa_slave_switchdev_event_work() via struct dsa_switchdev_event_work is always a DSA slave device. Therefore, it is redundant to pass struct dsa_switch and int port information in the deferred work structure. This can be retrieved at all times from the provided struct net_device via dsa_slave_to_port(). For the same reason, we can drop the dsa_is_user_port() check in dsa_fdb_offload_notify(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_deviceVladimir Oltean4-66/+42
When the switchdev_handle_fdb_event_to_device() event replication helper was created, my original thought was that FDB events on LAG interfaces should most likely be special-cased, not just replicated towards all switchdev ports beneath that LAG. So this replication helper currently does not recurse through switchdev lower interfaces of LAG bridge ports, but rather calls the lag_mod_cb() if that was provided. No switchdev driver uses this helper for FDB events on LAG interfaces yet, so that was an assumption which was yet to be tested. It is certainly usable for that purpose, as my RFC series shows: https://patchwork.kernel.org/project/netdevbpf/cover/20220210125201.2859463-1-vladimir.oltean@nxp.com/ however this approach is slightly convoluted because: - the switchdev driver gets a "dev" that isn't its own net device, but rather the LAG net device. It must call switchdev_lower_dev_find(dev) in order to get a handle of any of its own net devices (the ones that pass check_cb). - in order for FDB entries on LAG ports to be correctly refcounted per the number of switchdev ports beneath that LAG, we haven't escaped the need to iterate through the LAG's lower interfaces. Except that is now the responsibility of the switchdev driver, because the replication helper just stopped half-way. So, even though yes, FDB events on LAG bridge ports must be special-cased, in the end it's simpler to let switchdev_handle_fdb_* just iterate through the LAG port's switchdev lowers, and let the switchdev driver figure out that those physical ports are under a LAG. The switchdev_handle_fdb_event_to_device() helper takes a "foreign_dev_check" callback so it can figure out whether @dev can autonomously forward to @foreign_dev. DSA fills this method properly: if the LAG is offloaded by another port in the same tree as @dev, then it isn't foreign. If it is a software LAG, it is foreign - forwarding happens in software. Whether an interface is foreign or not decides whether the replication helper will go through the LAG's switchdev lowers or not. Since the lan966x doesn't properly fill this out, FDB events on software LAG uppers will get called. By changing lan966x_foreign_dev_check(), we can suppress them. Whereas DSA will now start receiving FDB events for its offloaded LAG uppers, so we need to return -EOPNOTSUPP, since we currently don't do the right thing for them. Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: create a dsa_lag structureVladimir Oltean10-109/+173
The main purpose of this change is to create a data structure for a LAG as seen by DSA. This is similar to what we have for bridging - we pass a copy of this structure by value to ->port_lag_join and ->port_lag_leave. For now we keep the lag_dev, id and a reference count in it. Future patches will add a list of FDB entries for the LAG (these also need to be refcounted to work properly). The LAG structure is created using dsa_port_lag_create() and destroyed using dsa_port_lag_destroy(), just like we have for bridging. Because now, the dsa_lag itself is refcounted, we can simplify dsa_lag_map() and dsa_lag_unmap(). These functions need to keep a LAG in the dst->lags array only as long as at least one port uses it. The refcounting logic inside those functions can be removed now - they are called only when we should perform the operation. dsa_lag_dev() is renamed to dsa_lag_by_id() and now returns the dsa_lag structure instead of the lag_dev net_device. dsa_lag_foreach_port() now takes the dsa_lag structure as argument. dst->lags holds an array of dsa_lag structures. dsa_lag_map() now also saves the dsa_lag->id value, so that linear walking of dst->lags in drivers using dsa_lag_id() is no longer necessary. They can just look at lag.id. dsa_port_lag_id_get() is a helper, similar to dsa_port_bridge_num_get(), which can be used by drivers to get the LAG ID assigned by DSA to a given port. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25net: dsa: mv88e6xxx: use dsa_switch_for_each_port in mv88e6xxx_lag_sync_masksVladimir Oltean1-2/+2
Make the intent of the code more clear by using the dedicated helper for iterating over the ports of a switch. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>