summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-10-20Merge branch 'dev_addr-conversions-part-three'David S. Miller6-17/+31
Jakub Kicinski says: ==================== ethernet: manual netdev->dev_addr conversions (part 3) Manual conversions of Ethernet drivers writing directly to netdev->dev_addr (part 3 out of 3). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: via-velocity: use eth_hw_addr_set()Jakub Kicinski1-1/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: via-rhine: use eth_hw_addr_set()Jakub Kicinski1-1/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: tlan: use eth_hw_addr_set()Jakub Kicinski1-4/+6
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, do the swapping, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: tehuti: use eth_hw_addr_set()Jakub Kicinski1-2/+4
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Break the address up into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: stmmac: use eth_hw_addr_set()Jakub Kicinski1-2/+6
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20ethernet: netsec: use eth_hw_addr_set()Jakub Kicinski1-7/+9
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20Merge branch 'hns3-fixes'David S. Miller10-23/+68
Guangbin Huang says: ==================== net: hns3: add some fixes for -net This series adds some fixes for the HNS3 ethernet driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: disable sriov before unload hclge layerPeng Li3-0/+23
HNS3 driver includes hns3.ko, hnae3.ko and hclge.ko. hns3.ko includes network stack and pci_driver, hclge.ko includes HW device action, algo_ops and timer task, hnae3.ko includes some register function. When SRIOV is enable and hclge.ko is removed, HW device is unloaded but VF still exists, PF will not reply VF mbx messages, and cause errors. This patch fix it by disable SRIOV before remove hclge.ko. Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support") Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: fix vf reset workqueue cannot exitYufeng Mo1-3/+3
The task of VF reset is performed through the workqueue. It checks the value of hdev->reset_pending to determine whether to exit the loop. However, the value of hdev->reset_pending may also be assigned by the interrupt function hclgevf_misc_irq_handle(), which may cause the loop fail to exit and keep occupying the workqueue. This loop is not necessary, so remove it and the workqueue will be rescheduled if the reset needs to be retried or a new reset occurs. Fixes: 1cc9bc6e5867 ("net: hns3: split hclgevf_reset() into preparing and rebuilding part") Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: schedule the polling again when allocation failsYunsheng Lin1-10/+12
Currently when there is a rx page allocation failure, it is possible that polling may be stopped if there is no more packet to be reveiced, which may cause queue stall problem under memory pressure. This patch makes sure polling is scheduled again when there is any rx page allocation failure, and polling will try to allocate receive buffers until it succeeds. Now the allocation retry is added, it is unnecessary to do the rx page allocation at the end of rx cleaning, so remove it. And reset the unused_count to zero after calling hns3_nic_alloc_rx_buffers() to avoid calling hns3_nic_alloc_rx_buffers() repeatedly under memory pressure. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: fix for miscalculation of rx unused descYunsheng Lin2-0/+9
rx unused desc is the desc that need attatching new buffer before refilling to hw to receive new packet, the number of desc need attatching new buffer is calculated using next_to_use and next_to_clean. when next_to_use == next_to_clean, currently hns3 driver assumes that all the desc has the buffer attatched, but 'next_to_use == next_to_clean' also means all the desc need attatching new buffer if hw has comsumed all the desc and the driver has not attatched any buffer to the desc yet. This patch adds 'refill' in desc_cb to indicate whether a new buffer has been refilled to a desc. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: fix the max tx size according to user manualYunsheng Lin2-9/+4
Currently the max tx size supported by the hw is calculated by using the max BD num supported by the hw. According to the hw user manual, the max tx size is fixed value for both non-TSO and TSO skb. This patch updates the max tx size according to the manual. Fixes: 8ae10cfb5089("net: hns3: support tx-scatter-gather-fraglist feature") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: add limit ets dwrr bandwidth cannot be 0Guangbin Huang1-0/+9
If ets dwrr bandwidth of tc is set to 0, the hardware will switch to SP mode. In this case, this tc may occupy all the tx bandwidth if it has huge traffic, so it violates the purpose of the user setting. To fix this problem, limit the ets dwrr bandwidth must greater than 0. Fixes: cacde272dd00 ("net: hns3: Add hclge_dcb module for the support of DCB feature") Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: reset DWRR of unused tc to zeroGuangbin Huang1-0/+2
Currently, DWRR of tc will be initialized to a fixed value when this tc is enabled, but it is not been reset to 0 when this tc is disabled. It cause a problem that the DWRR of unused tc is not 0 after using tc tool to add and delete multi-tc parameters. For examples, after enabling 4 TCs and restoring to 1 TC by follow tc commands: $ tc qdisc add dev eth0 root mqprio num_tc 4 map 0 1 2 3 0 1 2 3 queues \ 8@0 8@8 8@16 8@24 hw 1 mode channel $ tc qdisc del dev eth0 root Now there is just one TC is enabled for eth0, but the tc info querying by debugfs is shown as follow: $ cat /mnt/hns3/0000:7d:00.0/tm/tc_sch_info enabled tc number: 1 weight_offset: 14 TC MODE WEIGHT 0 dwrr 100 1 dwrr 100 2 dwrr 100 3 dwrr 100 4 dwrr 0 5 dwrr 0 6 dwrr 0 7 dwrr 0 This patch fixes it by resetting DWRR of tc to 0 when tc is disabled. Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver") Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: hns3: Add configuration of TM QCN error eventJiaran Zhang2-1/+6
Add configuration of interrupt type and fifo interrupt enable of TM QCN error event if enabled, otherwise this event will not be reported when there is error. Fixes: d914971df022 ("net: hns3: remove redundant query in hclge_config_tm_hw_err_int()") Signed-off-by: Jiaran Zhang <zhangjiaran@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20powerpc/smp: do not decrement idle task preempt count in CPU offlineNathan Lynch1-2/+0
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
2021-10-20powerpc/idle: Don't corrupt back chain when going idleMichael Ellerman1-4/+6
In isa206_idle_insn_mayloss() we store various registers into the stack red zone, which is allowed. However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again, to 0(r1), which corrupts the stack back chain. We used to do the same in isa206_idle_insn_mayloss() itself, but we fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving GPRs"), however we missed that the macro also corrupts the back chain. Corrupting the back chain is bad for debuggability but doesn't necessarily cause a bug. However we recently changed the stack handling in some KVM code, and it now relies on the stack back chain being valid when it returns. The corruption causes that code to return with r1 pointing somewhere in kernel data, at some point LR is restored from the stack and we branch to NULL or somewhere else invalid. Only affects Power8 hosts running KVM guests, with dynamic_mt_modes enabled (which it is by default). The fixes tag below points to the commit that changed the KVM stack handling, exposing this bug. The actual corruption of the back chain has always existed since 948cf67c4726 ("powerpc: Add NAP mode support on Power7 in HV mode"). Fixes: 9b4416c5095c ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
2021-10-20Merge branch 'sja1105-next'David S. Miller5-47/+157
Vladimir Oltean says: ==================== New RGMII delay DT bindings for the SJA1105 DSA driver During recent reviews I've been telling people that new MAC drivers should adopt a certain DT binding format for RGMII delays in order to avoid conflicting interpretations. Some suggestions were better received than others, and it appears we are still far from a consensus. Part of the problem seems to be that there are still drivers that apply RGMII delays based on an incorrect interpretation of the device tree, and these serve as a bad example for others. I happen to maintain one of those drivers and I am able to test it, so I figure that one of the ways in which I can make a change is to stop providing a bad example. Therefore, this series adds support for the "rx-internal-delay-ps" and "tx-internal-delay-ps" properties inside sja1105 switch port DT nodes, and if these are present, they will decide what RGMII delays will the driver apply. The in-tree device trees are also updated to follow the new format, as well as the schema validator. I assume it's okay to get all changes merged in through the same tree (net-next). Although the DTS changes could be split, if needed - the driver works with or without them. There is one more DTS which should be changed, which is in Shawn's tree but not in net-next: https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux.git/tree/arch/arm64/boot/dts/freescale/fsl-lx2160a-bluebox3.dts?h=for-next For that, I'd have to send a separate patch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: dsa: sja1105: parse {rx, tx}-internal-delay-ps properties for RGMII delaysVladimir Oltean3-47/+107
This change does not fix any functional issue or address any real life use case that wasn't possible before. It is just a small step in the process of standardizing the way in which Ethernet MAC drivers may apply RGMII delays (traditionally these have been applied by PHYs, with no clear definition of what to do in the case of a fixed-link). The sja1105 driver used to apply MAC-level RGMII delays on the RX data lines when in fixed-link mode and using a phy-mode of "rgmii-rxid" or "rgmii-id" and on the TX data lines when using "rgmii-txid" or "rgmii-id". But the standard definitions don't say anything about behaving differently when the port is in fixed-link vs when it isn't, and the new device tree bindings are about having a way of applying the delays in a way that is independent of the phy-mode and of the fixed-link property. When the {rx,tx}-internal-delay-ps properties are present, use them, otherwise fall back to the old behavior and warn. One other thing to note is that the SJA1105 hardware applies a delay value in degrees rather than in picoseconds (the delay in ps changes depending on the frequency of the RGMII clock - 125 MHz at 1G, 25 MHz at 100M, 2.5MHz at 10M). I assume that is fine, we calculate the phase shift of the internal delay lines assuming that the device tree meant gigabit, and we let the hardware scale those according to the link speed. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20210723173108.459770-6-prasanna.vengateshan@microchip.com/ Link: https://patchwork.ozlabs.org/project/netdev/patch/20200616074955.GA9092@laureti-dev/#2461123 Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20dt-bindings: net: dsa: sja1105: add {rx,tx}-internal-delay-psVladimir Oltean2-0/+46
Add a schema validator to nxp,sja1105.yaml and to dsa.yaml for explicit MAC-level RGMII delays. These properties must be per port and must be present only for a phy-mode that represents RGMII. We tell dsa.yaml that these port properties might be present, we also define their valid values for SJA1105. We create a common definition for the RX and TX valid range, since it's quite a mouthful. We also modify the example to include the explicit RGMII delay properties. On the fixed-link ports (in the example, port 4), having these explicit delays is actually mandatory, since with the new behavior, the driver shouts that it is interpreting what delays to apply based on phy-mode. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20dt-bindings: net: dsa: inherit the ethernet-controller DT schemaVladimir Oltean1-0/+3
Since a switch is basically a bunch of Ethernet controllers, just inherit the common schema for one to get stronger type validation of the properties of a port. For example, before this change it was valid to have a phy-mode = "xfi" even if "xfi" is not part of ethernet-controller.yaml, now it is not. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20dt-bindings: net: dsa: sja1105: fix example so all ports have a phy-handle ↵Vladimir Oltean1-0/+1
of fixed-link All ports require either a phy-handle or a fixed-link, and port 3 in the example didn't have one. Add it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20vrf: Revert "Reset skb conntrack connection..."Eugene Crosser1-4/+0
This reverts commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1. When an interface is enslaved in a VRF, prerouting conntrack hook is called twice: once in the context of the original input interface, and once in the context of the VRF interface. If no special precausions are taken, this leads to creation of two conntrack entries instead of one, and breaks SNAT. Commit above was intended to avoid creation of extra conntrack entries when input interface is enslaved in a VRF. It did so by resetting conntrack related data associated with the skb when it enters VRF context. However it breaks netfilter operation. Imagine a use case when conntrack zone must be assigned based on the original input interface, rather than VRF interface (that would make original interfaces indistinguishable). One could create netfilter rules similar to these: chain rawprerouting { type filter hook prerouting priority raw; iif realiface1 ct zone set 1 return iif realiface2 ct zone set 2 return } This works before the mentioned commit, but not after: zone assignment is "forgotten", and any subsequent NAT or filtering that is dependent on the conntrack zone does not work. Here is a reproducer script that demonstrates the difference in behaviour. ========== #!/bin/sh # This script demonstrates unexpected change of nftables behaviour # caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack # connection on VRF rcv" # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1 # # Before the commit, it was possible to assign conntrack zone to a # packet (or mark it for `notracking`) in the prerouting chanin, raw # priority, based on the `iif` (interface from which the packet # arrived). # After the change, # if the interface is enslaved in a VRF, such # assignment is lost. Instead, assignment based on the `iif` matching # the VRF master interface is honored. Thus it is impossible to # distinguish packets based on the original interface. # # This script demonstrates this change of behaviour: conntrack zone 1 # or 2 is assigned depending on the match with the original interface # or the vrf master interface. It can be observed that conntrack entry # appears in different zone in the kernel versions before and after # the commit. IPIN=172.30.30.1 IPOUT=172.30.30.2 PFXL=30 ip li sh vein >/dev/null 2>&1 && ip li del vein ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf nft list table testct >/dev/null 2>&1 && nft delete table testct ip li add vein type veth peer veout ip li add tvrf type vrf table 9876 ip li set veout master tvrf ip li set vein up ip li set veout up ip li set tvrf up /sbin/sysctl -w net.ipv4.conf.veout.accept_local=1 /sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0 ip addr add $IPIN/$PFXL dev vein ip addr add $IPOUT/$PFXL dev veout nft -f - <<__END__ table testct { chain rawpre { type filter hook prerouting priority raw; iif { veout, tvrf } meta nftrace set 1 iif veout ct zone set 1 return iif tvrf ct zone set 2 return notrack } chain rawout { type filter hook output priority raw; notrack } } __END__ uname -rv conntrack -F ping -W 1 -c 1 -I vein $IPOUT conntrack -L Signed-off-by: Eugene Crosser <crosser@average.org> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20net: dsa: Fix an error handling path in 'dsa_switch_parse_ports_of()'Christophe JAILLET1-2/+7
If we return before the end of the 'for_each_child_of_node()' iterator, the reference taken on 'port' must be released. Add the missing 'of_node_put()' calls. Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/15d5310d1d55ad51c1af80775865306d92432e03.1634587046.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-20Merge branch 'net-sched-fixes-after-recent-qdisc-running-changes'Jakub Kicinski1-4/+8
Eric Dumazet says: ==================== net: sched: fixes after recent qdisc->running changes First patch fixes a plain bug in qdisc_run_begin(). Second patch removes a pair of atomic operations, increasing performance. ==================== Link: https://lore.kernel.org/r/20211019003402.2110017-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-20net: sched: remove one pair of atomic operationsEric Dumazet1-4/+8
__QDISC_STATE_RUNNING is only set/cleared from contexts owning qdisc lock. Thus we can use less expensive bit operations, as we were doing before commit f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: 29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ahmed S. Darwish <a.darwish@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-20net: sched: fix logic error in qdisc_run_begin()Eric Dumazet1-1/+1
For non TCQ_F_NOLOCK qdisc, qdisc_run_begin() tries to set __QDISC_STATE_RUNNING and should return true if the bit was not set. test_and_set_bit() returns old bit value, therefore we need to invert. Fixes: 29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ahmed S. Darwish <a.darwish@linutronix.de> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-19ice: fix an error code in ice_ena_vfs()Dan Carpenter1-1/+2
Return the error code if ice_eswitch_configure() fails. Don't return success. Fixes: 1c54c839935b ("ice: enable/disable switchdev when managing VFs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: use devm_kcalloc() instead of devm_kzalloc()Gustavo A. R. Silva2-4/+4
Use 2-factor multiplication argument form devm_kcalloc() instead of devm_kzalloc(). Link: https://github.com/KSPP/linux/issues/162 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: Make use of the helper function devm_add_action_or_reset()Cai Huoqing1-3/+1
The helper function devm_add_action_or_reset() will internally call devm_add_action(), and if devm_add_action() fails then it will execute the action mentioned and return the error code. So use devm_add_action_or_reset() instead of devm_add_action() to simplify the error handling, reduce the code. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: Refactor PR ethtool opsWojciech Drewek1-25/+74
This patch improves a few things: - it fixes issue where ethtool -i reports that PR supports priv-flags and tests when in fact it does not support them - instead of using the same functions for both PF and PR ethtool ops, this patch introduces separate ops for both cases and internal functions with core logic. - prevent accessing VF VSI while VF is not ready by calling ice_check_vf_ready_for_cfg - all PR specific functions in ethtool.c were moved to one place in file - instead overwriting n_priv_flags in ice_repr_get_drvinfo, priv-flags code was moved from __ice_get_drvinfo to ice_get_drvinfo Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: Manage act flags for switchdev offloadsWojciech Drewek5-138/+21
Currently it is not possible to set/unset lb_en and lan_en flags for advanced rules during their creation. Both flags are enabled by default. In case of switchdev offloads for egress traffic we need lb_en to be disabled. Because of that, we work around it by updating the rule immediately after its creation. This change allows us to set/unset those flags right away and it gets rid of old workaround as well. Using ice_adv_rule_flags_info structure we can pass info about flags we want to be set for a given advanced rule. Flags are stored in flags_info.act. Values from act would be used only if act_valid was set to true, otherwise default values would be used. Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: Forbid trusted VFs in switchdev modeWojciech Drewek1-5/+5
Merge issues caused the check for switchdev mode has been inserted in wrong place. It should be in ice_set_vf_trust not in ice_set_vf_mac. Trusted VFs are forbidden in switchdev mode because they should be configured only from the host side. Fixes: 1c54c839935b ("ice: enable/disable switchdev when managing VFs") Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: fix software generating extra interruptsJesse Brandeburg2-11/+16
The driver tried to work around missing completion events that occurred while interrupts are disabled, by triggering a software interrupt whenever we exit polling (but we had to have polled at least once). This was causing a *lot* of extra interrupts for some workloads like NVMe over TCP, which resulted in regressions in performance. It was also visible when polling didn't prevent interrupts when busy_poll was enabled. Fix the extra interrupts by utilizing our previously unused 3rd ITR (interrupt throttle) index and set it to 20K interrupts per second, and then trigger a software interrupt within that rate limit. While here, slightly refactor the code to avoid an overwrite of a local variable in the case of wb_en = true. Fixes: b7306b42beaf ("ice: manage interrupts during poll exit") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: fix rate limit update after coalesce changeJesse Brandeburg2-8/+13
If the adaptive settings are changed with ethtool -C ethx adaptive-rx off adaptive-tx off then the interrupt rate limit should be maintained as a user set value, but only if BOTH adaptive settings are off. Fix a bug where the rate limit that was being used in adaptive mode was staying set in the register but was not reported correctly by ethtool -c ethx. Due to long lines include a small refactor of q_vector variable. Fixes: b8b4772377dd ("ice: refactor interrupt moderation writes") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: update dim usage and moderationJesse Brandeburg4-87/+142
The driver was having trouble with unreliable latency when doing single threaded ping-pong tests. This was root caused to the DIM algorithm landing on a too slow interrupt value, which caused high latency, and it was especially present when queues were being switched frequently by the scheduler as happens on default setups today. In attempting to improve this, we allow the upper rate limit for interrupts to move to rate limit of 4 microseconds as a max, which means that no vector can generate more than 250,000 interrupts per second. The old config was up to 100,000. The driver previously tried to program the rate limit too frequently and if the receive and transmit side were both active on the same vector, the INTRL would be set incorrectly, and this change fixes that issue as a side effect of the redesign. This driver will operate from now on with a slightly changed DIM table with more emphasis towards latency sensitivity by having more table entries with lower latency than with high latency (high being >= 64 microseconds). The driver also resets the DIM algorithm state with a new stats set when there is no work done and the data becomes stale (older than 1 second), for the respective receive or transmit portion of the interrupt. Add a new helper for setting rate limit, which will be used more in a followup patch. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: Add support for VF rate limitingBrett Creeley7-4/+486
Implement ndo_set_vf_rate to support setting of min_tx_rate and max_tx_rate; set the appropriate bandwidth in the scheduler for the node representing the specified VF VSI. Co-developed-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ucounts: Proper error handling in set_cred_ucountsEric W. Biederman1-2/+3
Instead of leaking the ucounts in new if alloc_ucounts fails, store the result of alloc_ucounts into a temporary variable, which is later assigned to new->ucounts. Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_credsEric W. Biederman1-1/+1
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds is to change which rlimit counter is used to track a process when the credentials changes. Use the same test for both to guarantee the tracking is correct. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds17-99/+138
Merge misc fixes from Andrew Morton: "19 patches. Subsystems affected by this patch series: mm (userfaultfd, migration, memblock, mempolicy, slub, secretmem, and thp), ocfs2, binfmt, vfs, and misc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mailmap: add Andrej Shadura mm/thp: decrease nr_thps in file's mapping on THP split mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() vfs: check fd has read access in kernel_read_file_from_fd() elfcore: correct reference to CONFIG_UML mm, slub: fix incorrect memcg slab count for bulk free mm, slub: fix potential use-after-free in slab_debugfs_fops mm, slub: fix potential memoryleak in kmem_cache_open() mm, slub: fix mismatch between reconstructed freelist depth and cnt mm, slub: fix two bugs in slab_debug_trace_open() mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() memblock: check memory total_size ocfs2: mount fails with buffer overflow in strlen ocfs2: fix data corruption after conversion from inline format mm/migrate: fix CPUHP state to update node demotion order mm/migrate: add CPU hotplug to demotion #ifdef mm/migrate: optimize hotplug-time demotion order updates userfaultfd: fix a race between writeprotect and exit_mmap() mm/userfaultfd: selftests: fix memory corruption with thp enabled
2021-10-19net: ethernet: ixp4xx: Make use of dma_pool_zalloc() instead of ↵Cai Huoqing1-3/+2
dma_pool_alloc/memset() Replacing dma_pool_alloc/memset() with dma_pool_zalloc() to simplify the code. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19ieee802154: Remove redundant 'flush_workqueue()' callsChristophe JAILLET1-2/+0
'destroy_workqueue()' already drains the queue before destroying it, so there is no need to flush it explicitly. Remove the redundant 'flush_workqueue()' calls. This was generated with coccinelle: @@ expression E; @@ - flush_workqueue(E); destroy_workqueue(E); Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19devlink: Remove extra device_lock assert checksLeon Romanovsky1-2/+0
PCI core code in the pci_call_probe() has a path that doesn't hold device_lock. It happens because the ->probe() is called through the workqueue mechanism. 349 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, 350 const struct pci_device_id *id) 351 { 352 .... 377 if (cpu < nr_cpu_ids) 378 error = work_on_cpu(cpu, local_pci_probe, &ddi); Luckily enough, the core still ensures that only single flow is executed, so it safe to remove the assert checks that anyway were added for annotations purposes. Fixes: b88f7b1203bf ("devlink: Annotate devlink API calls") Reported-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19Merge tag 'linux-can-fixes-for-5.15-20211019' of ↵David S. Miller1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2021-10-19 this is a pull request of a single patch for net/master. The patch is by me and fixes the error handling in case of a FC timeout in the TX path of the ISOTOP CAN protocol. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19ethernet: Remove redundant statementluo penghao1-1/+0
The variable will be assigned again later in the if condition, there is no meaning there. drivers/net/ethernet/broadcom/tg3.c:5750:2 warning: Value stored to 'current_link_up' is never read. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19cavium: Fix return values of the probe functionZheyu Ma1-2/+2
During the process of driver probing, the probe function should return < 0 for failure, otherwise, the kernel will treat value > 0 as success. Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19mISDN: Fix return values of the probe functionZheyu Ma1-4/+4
During the process of driver probing, the probe function should return < 0 for failure, otherwise, the kernel will treat value > 0 as success. Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19net: phylink: Support disabling autonegotiation for PCSRobert Hancock1-5/+19
The auto-negotiation state in the PCS as set by phylink_mii_c22_pcs_config was previously always enabled when the driver is configured for in-band autonegotiation, even if autonegotiation was disabled on the interface with ethtool. Update the code to set the BMCR_ANENABLE bit based on the interface's autonegotiation enabled state. Update phylink_mii_c22_pcs_get_state to not check autonegotiation-related fields when autonegotiation is disabled. Update phylink_mac_pcs_get_state to initialize the state based on the interface's configured speed, duplex and pause parameters rather than to unknown when autonegotiation is disabled, before calling the driver's pcs_get_state functions, as they are not likely to provide meaningful data for these fields when autonegotiation is disabled. In this case the driver is really just filling in the link state field. Note that in cases where there is a downstream PHY connected, such as with SGMII and a copper PHY, the configuration set by ethtool is handled by phy_ethtool_ksettings_set and not propagated to the PCS. This is correct since SGMII or 1000Base-X autonegotiation with the PCS should normally still be used even if the copper side has disabled it. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-19net: sched: Allow statistics reads from softirq.Sebastian Andrzej Siewior1-1/+1
Eric reported that the rate estimator reads statics from the softirq which in turn triggers a warning introduced in the statistics rework. The warning is too cautious. The updates happen in the softirq context so reads from softirq are fine since the writes can not be preempted. The updates/writes happen during qdisc_run() which ensures one writer and the softirq context. The remaining bad context for reading statistics remains in hard-IRQ because it may preempt a writer. Fixes: 29cbcd8582837 ("net: sched: Remove Qdisc::running sequence counter") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>