summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-06-29net: sched: em_ipt: match only on ip/ipv6 trafficNikolay Aleksandrov1-0/+13
Restrict matching only to ip/ipv6 traffic and make sure we can use the headers, otherwise matches will be attempted on any protocol which can be unexpected by the xt matches. Currently policy supports only ipv4/6. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29hinic: add vlan offload supportXue Chaojing7-2/+97
This patch adds vlan offload support for the HINIC driver. Signed-off-by: Xue Chaojing <xuechaojing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29Merge branch 'bpf-lookup-devmap'Daniel Borkmann9-159/+157
Toke Høiland-Jørgensen says: ==================== When using the bpf_redirect_map() helper to redirect packets from XDP, the eBPF program cannot currently know whether the redirect will succeed, which makes it impossible to gracefully handle errors. To properly fix this will probably require deeper changes to the way TX resources are allocated, but one thing that is fairly straight forward to fix is to allow lookups into devmaps, so programs can at least know when a redirect is *guaranteed* to fail because there is no entry in the map. Currently, programs work around this by keeping a shadow map of another type which indicates whether a map index is valid. This series contains two changes that are complementary ways to fix this issue: - Moving the map lookup into the bpf_redirect_map() helper (and caching the result), so the helper can return an error if no value is found in the map. This includes a refactoring of the devmap and cpumap code to not care about the index on enqueue. - Allowing regular lookups into devmaps from eBPF programs, using the read-only flag to make sure they don't change the values. The performance impact of the series is negligible, in the sense that I cannot measure it because the variance between test runs is higher than the difference pre/post series. Changelog: v6: - Factor out list handling in maps to a helper in list.h (new patch 1) - Rename variables in struct bpf_redirect_info (new patch 3 + patch 4) - Explain why we are clearing out the map in the info struct on lookup failure - Remove unneeded check for forwarding target in tracepoint macro v5: - Rebase on latest bpf-next. - Update documentation for bpf_redirect_map() with the new meaning of flags. v4: - Fix a few nits from Andrii - Lose the #defines in bpf.h and just compare the flags argument directly to XDP_TX in bpf_xdp_redirect_map(). v3: - Adopt Jonathan's idea of using the lower two bits of the flag value as the return code. - Always do the lookup, and cache the result for use in xdp_do_redirect(); to achieve this, refactor the devmap and cpumap code to get rid the bitmap for selecting which devices to flush. v2: - For patch 1, make it clear that the change works for any map type. - For patch 2, just use the new BPF_F_RDONLY_PROG flag to make the return value read-only. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29devmap: Allow map lookups from eBPFToke Høiland-Jørgensen2-5/+7
We don't currently allow lookups into a devmap from eBPF, because the map lookup returns a pointer directly to the dev->ifindex, which shouldn't be modifiable from eBPF. However, being able to do lookups in devmaps is useful to know (e.g.) whether forwarding to a specific interface is enabled. Currently, programs work around this by keeping a shadow map of another type which indicates whether a map index is valid. Since we now have a flag to make maps read-only from the eBPF side, we can simply lift the lookup restriction if we make sure this flag is always set. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29bpf_xdp_redirect_map: Perform map lookup in eBPF helperToke Høiland-Jørgensen4-19/+26
The bpf_redirect_map() helper used by XDP programs doesn't return any indication of whether it can successfully redirect to the map index it was given. Instead, BPF programs have to track this themselves, leading to programs using duplicate maps to track which entries are populated in the devmap. This patch fixes this by moving the map lookup into the bpf_redirect_map() helper, which makes it possible to return failure to the eBPF program. The lower bits of the flags argument is used as the return code, which means that existing users who pass a '0' flag argument will get XDP_ABORTED. With this, a BPF program can check the return code from the helper call and react by, for instance, substituting a different redirect. This works for any type of map used for redirect. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29devmap: Rename ifindex member in bpf_redirect_infoToke Høiland-Jørgensen2-14/+14
The bpf_redirect_info struct has an 'ifindex' member which was named back when the redirects could only target egress interfaces. Now that we can also redirect to sockets and CPUs, this is a bit misleading, so rename the member to tgt_index. Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next to each other in a subsequent patch. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29devmap/cpumap: Use flush list instead of bitmapToke Høiland-Jørgensen3-119/+95
The socket map uses a linked list instead of a bitmap to keep track of which entries to flush. Do the same for devmap and cpumap, as this means we don't have to care about the map index when enqueueing things into the map (and so we can cache the map lookup). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29xskmap: Move non-standard list manipulation to helperToke Høiland-Jørgensen2-2/+15
Add a helper in list.h for the non-standard way of clearing a list that is used in xskmap. This makes it easier to reuse it in the other map types, and also makes sure this usage is not forgotten in any list refactorings in the future. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29selftests/bpf: fix -Wstrict-aliasing in test_sockopt_sk.cStanislav Fomichev1-27/+24
Let's use union with u8[4] and u32 members for sockopt buffer, that should fix any possible aliasing issues. test_sockopt_sk.c: In function ‘getsetsockopt’: test_sockopt_sk.c:115:2: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] if (*(__u32 *)buf != 0x55AA*2) { ^~ test_sockopt_sk.c:116:3: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] log_err("Unexpected getsockopt(SO_SNDBUF) 0x%x != 0x55AA*2", ^~~~~~~ Fixes: 8a027dc0d8f5 ("selftests/bpf: add sockopt test that exercises sk helpers") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29net/mlx5e: Disallow tc redirect offload cases we don't supportPaul Blakey3-5/+24
After changing the parent_id to be the same for both NICs of same the hardware device, netdev_port_same_parent_id now returns true for more cases (all the lower devices in the hierarchy are on the same hardware device). If merged eswitch isn't enabled, these cases aren't supported, so disallow them. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Expose same physical switch_id for all representorsPaul Blakey1-20/+9
Report system_image_guid as the E-Switch switch_id, this ensures that when a NIC contains multiple PCI functions and which has merged eswitch capability, all representors from multiple PFs publish same switch_id. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Don't refresh TIRs when updating representor SQsGavi Teitz5-4/+21
Refreshing TIRs is done in order to update the TIRs with the current state of SQs in the transport domain, so that the TIRs can filter out undesired self-loopback packets based on the source SQ of the packet. Representor TIRs will only receive packets that originate from their associated vport, due to dedicated steering, and therefore will never receive self-loopback packets, whose source vport will be the vport of the E-Switch manager, and therefore not the vport associated with the representor. As such, it is not necessary to refresh the representors' TIRs, since self-loopback packets can't reach them. Since representors only exist in switchdev mode, and there is no scenario in which a representor will exist in the transport domain alongside a non-representor, it is not necessary to refresh the transport domain's TIRs upon changing the state of a representor's queues. Therefore, do not refresh TIRs upon such a change. Achieve this by adding an update_rx callback to the mlx5e_profile, which refreshes TIRs for non-representors and does nothing for representors, and replace instances of mlx5e_refresh_tirs() upon changing the state of the queues with update_rx(). Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: reduce stack usage in mlx5_eswitch_termtbl_createArnd Bergmann3-12/+12
Putting an empty 'mlx5_flow_spec' structure on the stack is a bit wasteful and causes a warning on 32-bit architectures when building with clang -fsanitize-coverage: drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads_termtbl.c: In function 'mlx5_eswitch_termtbl_create': drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads_termtbl.c:90:1: error: the frame size of 1032 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Since the structure is never written to, we can statically allocate it to avoid the stack usage. To be on the safe side, mark all subsequent function arguments that we pass it into as 'const' as well. Fixes: 10caabdaad5a ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Set drvinfo in generic mannerParav Pandit1-1/+1
Consider PCI and non PCI device types while setting device name in get_drvinfo() callback using existing generic device. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Correct phys_port_name for PF portParav Pandit1-0/+2
Currently PF phys_port_name is named as pfNvf-1 as vport number for PF vport is 65535. Correct PF's phys_port name as agreed upon name as pfN. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Report netdevice MPLS featuresAriel Levkovich1-0/+5
Set supported device features in the netdevice MPLS features mask. This will enable HW checksumming and TSO for MPLS tagged traffic. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5e: Move to HW checksumming advertisingAriel Levkovich1-4/+2
This patch changes the way the driver advertises its checksum offload capabilities within the net device features bit mask. Instead of advertising protocol specific checksumming capabilities which are limited today to IPv4 and IPv6, we move to reporing generic HW checksumming capabilities. This will allow the network stack to let mlx5 device offload checksum for cases where the IP header is encapsulated within another protocol and the skb->protocol doesn't indicate one of the IP versions protocol, specifically in the case of MPLS label encapsulating the IP header and the skb->protocol indiciates MPLS ethertype rather than IP. Moving the HW_CSUM reporting is required in the basic net device hw features mask and also in the extensions (vlan and encpasulation features) since the extensions are always multiplied by the basic features set during the packet's traversal through the stack's tx flow. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5: MPFS, Allow adding the same MAC more than onceGavi Teitz1-1/+6
Remove the limitation preventing adding a vport's MAC address to the Multi-Physical Function Switch (MPFS) more than once per E-switch, as there is no difference in the MPFS if an address is being used by an E-switch more than once. This allows the E-switch to have multiple vports with the same MAC address, allowing vports to be classified by VLAN id instead of by MAC if desired. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29net/mlx5: MPFS, Cleanup add MAC flowGavi Teitz1-11/+15
Unify and isolate the error handling flow in mlx5_mpfs_add_mac(), removing code duplication. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29Merge branch 'mlx5-next' of ↵Saeed Mahameed33-630/+1367
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Misc updates from mlx5-next branch: 1) E-Switch vport metadata support for source vport matching 2) Convert mkey_table to XArray 3) Shared IRQs and to use single IRQ for all async EQs Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-29e1000e: PCIm function state supportVitaly Lifshits2-1/+20
Due to commit: 5d8682588605 ("[misc] mei: me: allow runtime pm for platform with D0i3") When disconnecting the cable and reconnecting it the NIC enters DMoff state. This caused wrong link indication and duplex mismatch. This bug is described in: https://bugzilla.redhat.com/show_bug.cgi?id=1689436 Checking PCIm function state and performing PHY reset after a timeout in watchdog task solves this issue. Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29e1000e: Make watchdog use delayed workDetlev Casanova2-27/+32
Use delayed work instead of timers to run the watchdog of the e1000e driver. Simplify the code with one less middle function. Signed-off-by: Detlev Casanova <detlev.casanova@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29i40e: Add macvlan support on i40eHarshitha Ramamurthy2-2/+522
This patch enables macvlan offloads for i40e. The idea is to use channels as macvlan interfaces. The channels are VSIs of type VMDQ. When the first macvlan is created, the maximum number of channels possible are created. From then on, as a macvlan interface is created, a macvlan filter is added to these already created channels (VSIs). This patch utilizes subordinate device traffic classes to make queue groups(channels) available for an upper device like a macvlan. Steps to configure macvlan offloads: 1. ethtool -K ethx l2-fwd-offload on 2. ip link add link ethx name macvlan1 type macvlan 3. ip addr add <address> dev macvlan1 4. ip link set macvlan1 up Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29ixgbevf: Use cached link state instead of re-reading the value for ethtoolAlexander Duyck1-8/+2
Change the ethtool link settings call to just read the cached state out of the adapter structure instead of trying to recheck the value from the PF. Doing this should prevent excessive reading of the mailbox. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: "Guilherme G. Piccoli" <gpiccoli@canonical.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29iavf: fix dereference of null rx_buffer pointerColin Ian King1-2/+4
A recent commit efa14c3985828d ("iavf: allow null RX descriptors") added a null pointer sanity check on rx_buffer, however, rx_buffer is being dereferenced before that check, which implies a null pointer dereference bug can potentially occur. Fix this by only dereferencing rx_buffer until after the null pointer check. Addresses-Coverity: ("Dereference before null check") Signed-off-by: Colin Ian King <colin.king@canonical.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29igb: add RR2DCDELAY to ethtool registers dumpArtem Bityutskiy2-1/+6
This patch adds the RR2DCDELAY register to the ethtool registers dump. RR2DCDELAY exists on I210 and I211 Intel Gigabit Ethernet chips and it stands for "Read Request To Data Completion Delay". Here is how this register is described in the I210 datasheet: "This field captures the maximum PCIe split time in 16 ns units, which is the maximum delay between the read request to the first data completion. This is giving an estimation of the PCIe round trip time." In other words, whenever I210 reads from the host memory (e.g., fetches a descriptor from the ring), the chip measures every PCI DMA read transaction and captures the maximum value. So it ends up containing the longest DMA transaction time. This register is very useful for troubleshooting and research purposes. If you are dealing with time-sensitive networks, this register can help you get an idea of your "I210-to-ring" latency. This helps answering questions like "should I have PCIe ASPM enabled?" or "should I enable deep C-states?" on my system. It is safe to read this register at any point, reading it has no effect on the I210 chip functionality. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29igb: minor ethool regdump amendmentArtem Bityutskiy1-35/+35
This patch has no functional impact and it is just a preparation for the following patch. It removes an early return from the 'igb_get_regs()' function by moving the 82576-only registers dump into an "if" block. With this preparation, we can dump more non-82576 registers at the end of this function. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29iavf: Fix up debug print macroJeff Kirsher1-3/+7
This aligns the iavf_debug() macro with the other Intel drivers. Add the bus number, bus_id field to i40e_bus_info so output shows each physical port(i.e func) in following format: [[[[<domain>]:]<bus>]:][<slot>][.[<func>]] domains are numbered from 0 to ffff), bus (0-ff), slot (0-1f) and function (0-7). Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
2019-06-29e1000e: Reduce boot time by tightening sleep rangesArjan van de Ven7-28/+28
The e1000e driver is a great user of the usleep_range() API, and has nice ranges that in principle help power management. However the ranges that are used only during system startup are very long (and can add easily 100 msec to the boot time) while the power savings of such long ranges is irrelevant due to the one-off, boot only, nature of these functions. This patch shrinks some of the longest ranges to be shorter (while still using a power friendly 1 msec range); this saves 100msec+ of boot time on my BDW NUCs Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29iavf: use struct_size() helperGustavo A. R. Silva1-21/+16
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes, in particular in the context in which this code is being used. So, replace code of the following form: sizeof(struct virtchnl_ether_addr_list) + (count * sizeof(struct virtchnl_ether_addr)) with: struct_size(veal, list, count) and so on... This code was detected with the help of Coccinelle. Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29e1000: Use dma_wmb() instead of wmb() before doorbell writesVenkatesh Srinivas1-3/+3
e1000 writes to doorbells to post transmit descriptors and fill the receive ring. After writing descriptors to memory but before writing to doorbells, use dma_wmb() rather than wmb(). wmb() is more heavyweight than necessary for a device to see descriptor writes. On x86, this avoids SFENCEs before doorbell writes in both the Tx and Rx paths. On ARM, this converts DSB ST -> DMB OSHST. Tested: 82576EB / x86; QEMU (qemu emulates an 8257x) Signed-off-by: Venkatesh Srinivas <venkateshs@google.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29ixgbe: fix potential u32 overflow on shiftColin Ian King1-10/+4
The u32 variable rem is being shifted using u32 arithmetic however it is being passed to div_u64 that expects the expression to be a u64. The 32 bit shift may potentially overflow, so cast rem to a u64 before shifting to avoid this. Also remove comment about overflow. Addresses-Coverity: ("Unintentional integer overflow") Fixes: cd4583206990 ("ixgbe: implement support for SDP/PPS output on X550 hardware") Fixes: 68d9676fc04e ("ixgbe: fix PTP SDP pin setup on X540 hardware") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29ixgbe: Avoid NULL pointer dereference with VF on non-IPsec hwDann Frazier1-0/+3
An ipsec structure will not be allocated if the hardware does not support offload. Fixes the following Oops: [ 191.045452] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 191.054232] Mem abort info: [ 191.057014] ESR = 0x96000004 [ 191.060057] Exception class = DABT (current EL), IL = 32 bits [ 191.065963] SET = 0, FnV = 0 [ 191.069004] EA = 0, S1PTW = 0 [ 191.072132] Data abort info: [ 191.074999] ISV = 0, ISS = 0x00000004 [ 191.078822] CM = 0, WnR = 0 [ 191.081780] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000043d9e467 [ 191.088382] [0000000000000000] pgd=0000000000000000 [ 191.093252] Internal error: Oops: 96000004 [#1] SMP [ 191.098119] Modules linked in: vhost_net vhost tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter devlink ebtables ip6table_filter ip6_tables iptable_filter bpfilter ipmi_ssif nls_iso8859_1 input_leds joydev ipmi_si hns_roce_hw_v2 ipmi_devintf hns_roce ipmi_msghandler cppc_cpufreq sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 ses enclosure btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear ixgbevf hibmc_drm ttm [ 191.168607] drm_kms_helper aes_ce_blk aes_ce_cipher syscopyarea crct10dif_ce sysfillrect ghash_ce qla2xxx sysimgblt sha2_ce sha256_arm64 hisi_sas_v3_hw fb_sys_fops sha1_ce uas nvme_fc mpt3sas ixgbe drm hisi_sas_main nvme_fabrics usb_storage hclge scsi_transport_fc ahci libsas hnae3 raid_class libahci xfrm_algo scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64 [ 191.202952] CPU: 94 PID: 0 Comm: swapper/94 Not tainted 4.19.0-rc1+ #11 [ 191.209553] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.20.01 04/26/2019 [ 191.218064] pstate: 20400089 (nzCv daIf +PAN -UAO) [ 191.222873] pc : ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe] [ 191.228093] lr : ixgbe_msg_task+0x2d0/0x1088 [ixgbe] [ 191.233044] sp : ffff000009b3bcd0 [ 191.236346] x29: ffff000009b3bcd0 x28: 0000000000000000 [ 191.241647] x27: ffff000009628000 x26: 0000000000000000 [ 191.246946] x25: ffff803f652d7600 x24: 0000000000000004 [ 191.252246] x23: ffff803f6a718900 x22: 0000000000000000 [ 191.257546] x21: 0000000000000000 x20: 0000000000000000 [ 191.262845] x19: 0000000000000000 x18: 0000000000000000 [ 191.268144] x17: 0000000000000000 x16: 0000000000000000 [ 191.273443] x15: 0000000000000000 x14: 0000000100000026 [ 191.278742] x13: 0000000100000025 x12: ffff8a5f7fbe0df0 [ 191.284042] x11: 000000010000000b x10: 0000000000000040 [ 191.289341] x9 : 0000000000001100 x8 : ffff803f6a824fd8 [ 191.294640] x7 : ffff803f6a825098 x6 : 0000000000000001 [ 191.299939] x5 : ffff000000f0ffc0 x4 : 0000000000000000 [ 191.305238] x3 : ffff000028c00000 x2 : ffff803f652d7600 [ 191.310538] x1 : 0000000000000000 x0 : ffff000000f205f0 [ 191.315838] Process swapper/94 (pid: 0, stack limit = 0x00000000addfed5a) [ 191.322613] Call trace: [ 191.325055] ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe] [ 191.329927] ixgbe_msg_task+0x2d0/0x1088 [ixgbe] [ 191.334536] ixgbe_msix_other+0x274/0x330 [ixgbe] [ 191.339233] __handle_irq_event_percpu+0x78/0x270 [ 191.343924] handle_irq_event_percpu+0x40/0x98 [ 191.348355] handle_irq_event+0x50/0xa8 [ 191.352180] handle_fasteoi_irq+0xbc/0x148 [ 191.356263] generic_handle_irq+0x34/0x50 [ 191.360259] __handle_domain_irq+0x68/0xc0 [ 191.364343] gic_handle_irq+0x84/0x180 [ 191.368079] el1_irq+0xe8/0x180 [ 191.371208] arch_cpu_idle+0x30/0x1a8 [ 191.374860] do_idle+0x1dc/0x2a0 [ 191.378077] cpu_startup_entry+0x2c/0x30 [ 191.381988] secondary_start_kernel+0x150/0x1e0 [ 191.386506] Code: 6b15003f 54000320 f1404a9f 54000060 (79400260) Fixes: eda0333ac2930 ("ixgbe: add VF IPsec management") Signed-off-by: Dann Frazier <dann.frazier@canonical.com> Acked-by: Shannon Nelson <snelson@pensando.io> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29e1000e: Increase pause and refresh timeMiguel Bernal Marin1-2/+2
Suggested-by: Tim Pepper <timothy.c.pepper@linux.intel.com> Signed-off-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29ice: Use struct_size() helperGustavo A. R. Silva1-2/+2
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = alloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); This code was detected with the help of Coccinelle. Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-29Merge branch 'net-sched-Add-txtime-assist-support-for-taprio'David S. Miller4-36/+405
Vedang Patel says: ==================== net/sched: Add txtime-assist support for taprio. Changes in v6: - Use _BITUL() instead of BIT() in UAPI for etf. (patch #1) - Fix a bug reported by kbuild test bot in length_to_duration(). (patch #6) - Remove an unused function (get_cycle_start()). (Patch #6) Changes in v5: - Commit message improved for the igb patch (patch #1). - Fixed typo in commit message for etf patch (patch #2). Changes in v4: - Remove inline directive from functions in foo.c. - Fix spacing in pkt_sched.h (for etf patch). Changes in v3: - Simplify implementation for taprio flags. - txtime_delay can only be set if txtime-assist mode is enabled. - txtime_delay and flags will only be visible in tc output if set by user. - Minor changes in error reporting. Changes in v2: - Txtime-offload has now been renamed to txtime-assist mode. - Renamed the offload parameter to flags. - Removed the code which introduced the hardware offloading functionality. Original Cover letter (with above changes included) -------------------------------------------------- Currently, we are seeing packets being transmitted outside their timeslices. We can confirm that the packets are being dequeued at the right time. So, the delay is induced after the packet is dequeued, because taprio, without any offloading, has no control of when a packet is actually transmitted. In order to solve this, we are making use of the txtime feature provided by ETF qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to take advantage of this feature. The taprio qdisc will assign txtime (in skb->tstamp) for all the packets which do not have the txtime allocated via the SO_TXTIME socket option. For the packets which already have SO_TXTIME set, taprio will validate whether the packet will be transmitted in the correct interval. In order to support this, the following parameters have been added: - flags (taprio): This is added in order to support different offloading modes which will be added in the future. - txtime-delay (taprio): This indicates the minimum time it will take for the packet to hit the wire after it reaches taprio_enqueue(). This is useful in determining whether we can transmit the packet in the remaining time if the gate corresponding to the packet is currently open. - skip_skb_check (ETF): ETF currently drops any packet which does not have the SO_TXTIME socket option set. This check can be skipped by specifying this option. Following is an example configuration: tc qdisc replace dev $IFACE parent root handle 100 taprio \\ num_tc 3 \\ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\ queues 1@0 1@0 1@0 \\ base-time $BASE_TIME \\ sched-entry S 01 300000 \\ sched-entry S 02 300000 \\ sched-entry S 04 400000 \\ flags 0x1 \\ txtime-delay 200000 \\ clockid CLOCK_TAI tc qdisc replace dev $IFACE parent 100:1 etf \\ offload delta 200000 clockid CLOCK_TAI skip_skb_check Here, the "flags" parameter is indicating that the txtime-assist mode is enabled. Also, all the traffic classes have been assigned the same queue. This is to prevent the traffic classes in the lower priority queues from getting starved. Note that this configuration is specific to the i210 ethernet card. Other network cards where the hardware queues are given the same priority, might be able to utilize more than one queue. Following are some of the other highlights of the series: - Fix a bug where hardware timestamping and SO_TXTIME options cannot be used together. (Patch 1) - Introduces the skip_skb_check option. (Patch 2) - Make TxTime assist mode work with TCP packets (Patch 7). The following changes are recommended to be done in order to get the best performance from taprio in this mode: ip link set dev enp1s0 mtu 1514 ethtool -K eth0 gso off ethtool -K eth0 tso off ethtool --set-eee eth0 eee off ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29taprio: Adjust timestamps for TCP packetsVedang Patel1-1/+40
When the taprio qdisc is running in "txtime offload" mode, it will set the launchtime value (in skb->tstamp) for all the packets which do not have the SO_TXTIME socket option. But, the TCP packets already have this value set and it indicates the earliest departure time represented in CLOCK_MONOTONIC clock. We need to respect the timestamp set by the TCP subsystem. So, convert this time to the clock which taprio is using and ensure that the packet is not transmitted before the deadline set by TCP. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29taprio: make clock reference conversions easierVedang Patel1-8/+22
Later in this series we will need to transform from CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29taprio: Add support for txtime-assist modeVedang Patel2-17/+328
Currently, we are seeing non-critical packets being transmitted outside of their timeslice. We can confirm that the packets are being dequeued at the right time. So, the delay is induced in the hardware side. The most likely reason is the hardware queues are starving the lower priority queues. In order to improve the performance of taprio, we will be making use of the txtime feature provided by the ETF qdisc. For all the packets which do not have the SO_TXTIME option set, taprio will set the transmit timestamp (set in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate corresponding to skb's traffic class is open. Following two parameters added to support this mode: - flags: used to enable txtime-assist mode. Will also be used to enable other modes (like hardware offloading) later. - txtime-delay: This indicates the minimum time it will take for the packet to hit the wire. This is useful in determining whether we can transmit the packet in the remaining time if the gate corresponding to the packet is currently open. An example configuration for enabling txtime-assist: tc qdisc replace dev eth0 parent root handle 100 taprio \\ num_tc 3 \\ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\ queues 1@0 1@0 1@0 \\ base-time 1558653424279842568 \\ sched-entry S 01 300000 \\ sched-entry S 02 300000 \\ sched-entry S 04 400000 \\ flags 0x1 \\ txtime-delay 40000 \\ clockid CLOCK_TAI tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\ offload delta 200000 clockid CLOCK_TAI Note that all the traffic classes are mapped to the same queue. This is only possible in taprio when txtime-assist is enabled. Also, note that the ETF Qdisc is enabled with offload mode set. In this mode, if the packet's traffic class is open and the complete packet can be transmitted, taprio will try to transmit the packet immediately. This will be done by setting skb->tstamp to current_time + the time delta indicated in the txtime-delay parameter. This parameter indicates the time taken (in software) for packet to reach the network adapter. If the packet cannot be transmitted in the current interval or if the packet's traffic is not currently transmitting, the skb->tstamp is set to the next available timestamp value. This is tracked in the next_launchtime parameter in the struct sched_entry. The behaviour w.r.t admin and oper schedules is not changed from what is present in software mode. The transmit time is already known in advance. So, we do not need the HR timers to advance the schedule and wakeup the dequeue side of taprio. So, HR timer won't be run when this mode is enabled. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29taprio: Remove inline directiveVedang Patel1-1/+1
Remove inline directive from length_to_duration(). We will let the compiler make the decisions. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29taprio: calculate cycle_time when schedule is installedVedang Patel1-18/+11
cycle time for a particular schedule is calculated only when it is first installed. So, it makes sense to just calculate it once right after the 'cycle_time' parameter has been parsed and store it in cycle_time. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29etf: Add skip_sock_checkVedang Patel2-0/+11
Currently, etf expects a socket with SO_TXTIME option set for each packet it encounters. So, it will drop all other packets. But, in the future commits we are planning to add functionality where tstamp value will be set by another qdisc. Also, some packets which are generated from within the kernel (e.g. ICMP packets) do not have any socket associated with them. So, this commit adds support for skip_sock_check. When this option is set, etf will skip checking for a socket and other associated options for all skbs. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29etf: Don't use BIT() in UAPI headers.Vedang Patel1-2/+2
The BIT() macro isn't exported as part of the UAPI interface. So, the compile-test to ensure they are self contained fails. So, use _BITUL() instead. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29igb: clear out skb->tstamp after reading the txtimeVedang Patel1-0/+1
If a packet which is utilizing the launchtime feature (via SO_TXTIME socket option) also requests the hardware transmit timestamp, the hardware timestamp is not delivered to the userspace. This is because the value in skb->tstamp is mistaken as the software timestamp. Applications, like ptp4l, request a hardware timestamp by setting the SOF_TIMESTAMPING_TX_HARDWARE socket option. Whenever a new timestamp is detected by the driver (this work is done in igb_ptp_tx_work() which calls igb_ptp_tx_hwtstamps() in igb_ptp.c[1]), it will queue the timestamp in the ERR_QUEUE for the userspace to read. When the userspace is ready, it will issue a recvmsg() call to collect this timestamp. The problem is in this recvmsg() call. If the skb->tstamp is not cleared out, it will be interpreted as a software timestamp and the hardware tx timestamp will not be successfully sent to the userspace. Look at skb_is_swtx_tstamp() and the callee function __sock_recv_timestamp() in net/socket.c for more details. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29Merge branch 'mirred-recurse'David S. Miller4-6/+19
John Hurley says: ==================== Track recursive calls in TC act_mirred These patches aim to prevent act_mirred causing stack overflow events from recursively calling packet xmit or receive functions. Such events can occur with poor TC configuration that causes packets to travel in loops within the system. Florian Westphal advises that a recursion crash and packets looping are separate issues and should be treated as such. David Miller futher points out that pcpu counters cannot track the precise skb context required to detect loops. Hence these patches are not aimed at detecting packet loops, rather, preventing stack flows arising from such loops. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29net: sched: protect against stack overflow in TC act_mirredJohn Hurley1-0/+14
TC hooks allow the application of filters and actions to packets at both ingress and egress of the network stack. It is possible, with poor configuration, that this can produce loops whereby an ingress hook calls a mirred egress action that has an egress hook that redirects back to the first ingress etc. The TC core classifier protects against loops when doing reclassifies but there is no protection against a packet looping between multiple hooks and recursively calling act_mirred. This can lead to stack overflow panics. Add a per CPU counter to act_mirred that is incremented for each recursive call of the action function when processing a packet. If a limit is passed then the packet is dropped and CPU counter reset. Note that this patch does not protect against loops in TC datapaths. Its aim is to prevent stack overflow kernel panics that can be a consequence of such loops. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29net: sched: refactor reinsert actionJohn Hurley4-6/+5
The TC_ACT_REINSERT return type was added as an in-kernel only option to allow a packet ingress or egress redirect. This is used to avoid unnecessary skb clones in situations where they are not required. If a TC hook returns this code then the packet is 'reinserted' and no skb consume is carried out as no clone took place. This return type is only used in act_mirred. Rather than have the reinsert called from the main datapath, call it directly in act_mirred. Instead of returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED which tells the caller that the packet has been stolen by another process and that no consume call is required. Moving all redirect calls to the act_mirred code is in preparation for tracking recursion created by act_mirred. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29ipv4: enable route flushing in network namespacesChristian Brauner1-4/+8
Tools such as vpnc try to flush routes when run inside network namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This currently does not work because flush is not enabled in non-initial network namespaces. Since routes are per network namespace it is safe to enable /proc/sys/net/ipv4/route/flush in there. Link: https://github.com/lxc/lxd/issues/4257 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28Merge tag 'batadv-next-for-davem-20190627v2' of ↵David S. Miller40-457/+1035
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - fix includes for _MAX constants, atomic functions and fwdecls, by Sven Eckelmann (3 patches) - shorten multicast tt/tvlv worker spinlock section, by Linus Luessing - routeable multicast preparations: implement MAC multicast filtering, by Linus Luessing (2 patches, David Millers comments integrated) - remove return value checks for debugfs_create, by Greg Kroah-Hartman - add routable multicast optimizations, by Linus Luessing (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28Merge branch 'hns3-next'David S. Miller13-91/+184
Huazhong Tan says: ==================== net: hns3: some code optimizations & cleanups & bugfixes [patch 01/12] fixes a TX timeout issue. [patch 02/12 - 04/12] adds some patch related to TM module. [patch 05/12] fixes a compile warning. [patch 06/12] adds Asym Pause support for autoneg [patch 07/12] optimizes the error handler for VF reset. [patch 08/12] deals with the empty interrupt case. [patch 09/12 - 12/12] adds some cleanups & optimizations. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>