summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2020-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-0/+4
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-07-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 36 non-merge commits during the last 7 day(s) which contain a total of 62 files changed, 2242 insertions(+), 468 deletions(-). The main changes are: 1) Avoid trace_printk warning banner by switching bpf_trace_printk to use its own tracing event, from Alan. 2) Better libbpf support on older kernels, from Andrii. 3) Additional AF_XDP stats, from Ciara. 4) build time resolution of BTF IDs, from Jiri. 5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-14net: sched: Pass qdisc reference in struct flow_block_offloadPetr Machata1-3/+6
Previously, shared blocks were only relevant for the pseudo-qdiscs ingress and clsact. Recently, a qevent facility was introduced, which allows to bind blocks to well-defined slots of a qdisc instance. RED in particular got two qevents: early_drop and mark. Drivers that wish to offload these blocks will be sent the usual notification, and need to know which qdisc it is related to. To that end, extend flow_block_offload with a "sch" pointer, and initialize as appropriate. This prompts changes in the indirect block facility, which now tracks the scheduler in addition to the netdevice. Update signatures of several functions similarly. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-14xsk: Add new statisticsCiara Loftus1-0/+4
It can be useful for the user to know the reason behind a dropped packet. Introduce new counters which track drops on the receive path caused by: 1. rx ring being full 2. fill ring being empty Also, on the tx path introduce a counter which tracks the number of times we attempt pull from the tx ring when it is empty. Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200708072835.4427-2-ciara.loftus@intel.com
2020-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller9-37/+41
All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11devlink: Add devlink health port reporters APIVladyslav Tarasiuk1-0/+9
In order to use new devlink port health reporters infrastructure, add corresponding constructor and destructor functions. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11devlink: Implement devlink health reporters on per-port basisVladyslav Tarasiuk1-0/+2
Add devlink-health reporter support on per-port basis. The main difference existing devlink-health is that port reporters are stored in per-devlink_port lists. Upon creation of such health reporter the reference to a port it belongs to is stored in reporter struct. Fill the port index attribute in devlink-health response to allow devlink userspace utility to distinguish between device and port reporters. Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10ethtool: add tunnel info interfaceJakub Kicinski1-0/+21
Add an interface to report offloaded UDP ports via ethtool netlink. Now that core takes care of tracking which UDP tunnel ports the NICs are aware of we can quite easily export this information out to user space. The responsibility of writing the netlink dumps is split between ethtool code and udp_tunnel_nic.c - since udp_tunnel module may not always be loaded, yet we should always report the capabilities of the NIC. $ ethtool --show-tunnels eth0 Tunnel information for eth0: UDP port table 0: Size: 4 Types: vxlan No entries UDP port table 1: Size: 4 Types: geneve, vxlan-gpe Entries (1): port 1230, vxlan-gpe v4: - back to v2, build fix is now directly in udp_tunnel.h v3: - don't compile ETHTOOL_MSG_TUNNEL_INFO_GET in if CONFIG_INET not set. v2: - fix string set count, - reorder enums in the uAPI, - fix type of ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES to bitset in docs and comments. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10udp_tunnel: add central NIC RX port offload infrastructureJakub Kicinski1-0/+137
Cater to devices which: (a) may want to sleep in the callbacks; (b) only have IPv4 support; (c) need all the programming to happen while the netdev is up. Drivers attach UDP tunnel offload info struct to their netdevs, where they declare how many UDP ports of various tunnel types they support. Core takes care of tracking which ports to offload. Use a fixed-size array since this matches what almost all drivers do, and avoids a complexity and uncertainty around memory allocations in an atomic context. Make sure that tunnel drivers don't try to replay the ports when new NIC netdev is registered. Automatic replays would mess up reference counting, and will be removed completely once all drivers are converted. v4: - use a #define NULL to avoid build issues with CONFIG_INET=n. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-10udp_tunnel: re-number the offload tunnel typesJakub Kicinski1-3/+3
Make it possible to use tunnel types as flags more easily. There doesn't appear to be any user using the type as an array index, so this should make no difference. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09devlink: Add a new devlink port split ability attribute and pass to netlinkDanielle Ratson1-1/+3
Add a new attribute that indicates the split ability of devlink port. Drivers are expected to set it via devlink_port_attrs_set(), before registering the port. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09devlink: Add a new devlink port lanes attribute and pass to netlinkDanielle Ratson1-0/+2
Add a new devlink port attribute that indicates the port's number of lanes. Drivers are expected to set it via devlink_port_attrs_set(), before registering the port. The attribute is not passed to user space in case the number of lanes is invalid (0). Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09devlink: Replace devlink_port_attrs_set parameters with a structDanielle Ratson1-11/+9
Currently, devlink_port_attrs_set accepts a long list of parameters, that most of them are devlink port's attributes. Use the devlink_port_attrs struct to replace the relevant parameters. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09devlink: Move switch_port attribute of devlink_port_attrs to devlink_portDanielle Ratson1-3/+3
The struct devlink_port_attrs holds the attributes of devlink_port. Similarly to the previous patch, 'switch_port' attribute is another exception. Move 'switch_port' to be devlink_port's field. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09devlink: Move set attribute of devlink_port_attrs to devlink_portDanielle Ratson1-2/+2
The struct devlink_port_attrs holds the attributes of devlink_port. The 'set' field is not devlink_port's attribute as opposed to most of the others. Move 'set' to be devlink_port's field called 'attrs_set'. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09bpf: net: Avoid incorrect bpf_sk_reuseport_detach callMartin KaFai Lau1-1/+2
bpf_sk_reuseport_detach is currently called when sk->sk_user_data is not NULL. It is incorrect because sk->sk_user_data may not be managed by the bpf's reuseport_array. It has been reported in [1] that, the bpf_sk_reuseport_detach() which is called from udp_lib_unhash() has corrupted the sk_user_data managed by l2tp. This patch solves it by using another bit (defined as SK_USER_DATA_BPF) of the sk_user_data pointer value. It marks that a sk_user_data is managed/owned by BPF. The patch depends on a PTRMASK introduced in commit f1ff5ce2cd5e ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). [ Note: sk->sk_user_data is used by bpf's reuseport_array only when a sk is added to the bpf's reuseport_array. i.e. doing setsockopt(SO_REUSEPORT) and having "sk->sk_reuseport == 1" alone will not stop sk->sk_user_data being used by other means. ] [1]: https://lore.kernel.org/netdev/20200706121259.GA20199@katalix.com/ Fixes: 5dc4c4b7d4e8 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY") Reported-by: James Chapman <jchapman@katalix.com> Reported-by: syzbot+9f092552ba9a5efca5df@syzkaller.appspotmail.com Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: James Chapman <jchapman@katalix.com> Acked-by: James Chapman <jchapman@katalix.com> Link: https://lore.kernel.org/bpf/20200709061110.4019316-1-kafai@fb.com
2020-07-09net: dsa: tag_rtl4_a: Implement Realtek 4 byte A tagLinus Walleij1-0/+2
This implements the known parts of the Realtek 4 byte tag protocol version 0xA, as found in the RTL8366RB DSA switch. It is designated as protocol version 0xA as a different Realtek 4 byte tag format with protocol version 0x9 is known to exist in the Realtek RTL8306 chips. The tag and switch chip lacks public documentation, so the tag format has been reverse-engineered from packet dumps. As only ingress traffic has been available for analysis an egress tag has not been possible to develop (even using educated guesses about bit fields) so this is as far as it gets. It is not known if the switch even supports egress tagging. Excessive attempts to figure out the egress tag format was made. When nothing else worked, I just tried all bit combinations with 0xannp where a is protocol and p is port. I looped through all values several times trying to get a response from ping, without any positive result. Using just these ingress tags however, the switch functionality is vastly improved and the packets find their way into the destination port without any tricky VLAN configuration. On the D-Link DIR-685 the LAN ports now come up and respond to ping without any command line configuration so this is a real improvement for users. Egress packets need to be restricted to the proper target ports using VLAN, which the RTL8366RB DSA switch driver already sets up. Cc: DENG Qingfang <dqfext@gmail.com> Cc: Mauri Sandberg <sandberg@mailfence.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-13/+25
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Support for rejecting packets from the prerouting chain, from Laura Garcia Liebana. 2) Remove useless assignment in pipapo, from Stefano Brivio. 3) On demand hook registration in IPVS, from Julian Anastasov. 4) Expire IPVS connection from process context to not overload timers, also from Julian. 5) Fallback to conntrack TCP tracker to handle connection reuse in IPVS, from Julian Anastasov. 6) Several patches to support for chain bindings. 7) Expose enum nft_chain_flags through UAPI. 8) Reject unsupported chain flags from the netlink control plane. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skbMartin Varghese1-1/+9
The packets from tunnel devices (eg bareudp) may have only metadata in the dst pointer of skb. Hence a pointer check of neigh_lookup is needed in dst_neigh_lookup_skb Kernel crashes when packets from bareudp device is processed in the kernel neighbour subsytem. [ 133.384484] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 133.385240] #PF: supervisor instruction fetch in kernel mode [ 133.385828] #PF: error_code(0x0010) - not-present page [ 133.386603] PGD 0 P4D 0 [ 133.386875] Oops: 0010 [#1] SMP PTI [ 133.387275] CPU: 0 PID: 5045 Comm: ping Tainted: G W 5.8.0-rc2+ #15 [ 133.388052] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 133.391076] RIP: 0010:0x0 [ 133.392401] Code: Bad RIP value. [ 133.394029] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246 [ 133.396656] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00 [ 133.399018] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400 [ 133.399685] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000 [ 133.400350] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400 [ 133.401010] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003 [ 133.401667] FS: 00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000 [ 133.402412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 133.402948] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0 [ 133.403611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 133.404270] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 133.404933] Call Trace: [ 133.405169] <IRQ> [ 133.405367] __neigh_update+0x5a4/0x8f0 [ 133.405734] arp_process+0x294/0x820 [ 133.406076] ? __netif_receive_skb_core+0x866/0xe70 [ 133.406557] arp_rcv+0x129/0x1c0 [ 133.406882] __netif_receive_skb_one_core+0x95/0xb0 [ 133.407340] process_backlog+0xa7/0x150 [ 133.407705] net_rx_action+0x2af/0x420 [ 133.408457] __do_softirq+0xda/0x2a8 [ 133.408813] asm_call_on_stack+0x12/0x20 [ 133.409290] </IRQ> [ 133.409519] do_softirq_own_stack+0x39/0x50 [ 133.410036] do_softirq+0x50/0x60 [ 133.410401] __local_bh_enable_ip+0x50/0x60 [ 133.410871] ip_finish_output2+0x195/0x530 [ 133.411288] ip_output+0x72/0xf0 [ 133.411673] ? __ip_finish_output+0x1f0/0x1f0 [ 133.412122] ip_send_skb+0x15/0x40 [ 133.412471] raw_sendmsg+0x853/0xab0 [ 133.412855] ? insert_pfn+0xfe/0x270 [ 133.413827] ? vvar_fault+0xec/0x190 [ 133.414772] sock_sendmsg+0x57/0x80 [ 133.415685] __sys_sendto+0xdc/0x160 [ 133.416605] ? syscall_trace_enter+0x1d4/0x2b0 [ 133.417679] ? __audit_syscall_exit+0x1d9/0x280 [ 133.418753] ? __prepare_exit_to_usermode+0x5d/0x1a0 [ 133.419819] __x64_sys_sendto+0x24/0x30 [ 133.420848] do_syscall_64+0x4d/0x90 [ 133.421768] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 133.422833] RIP: 0033:0x7fe013689c03 [ 133.423749] Code: Bad RIP value. [ 133.424624] RSP: 002b:00007ffc7288f418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 133.425940] RAX: ffffffffffffffda RBX: 000056151fc63720 RCX: 00007fe013689c03 [ 133.427225] RDX: 0000000000000040 RSI: 000056151fc63720 RDI: 0000000000000003 [ 133.428481] RBP: 00007ffc72890b30 R08: 000056151fc60500 R09: 0000000000000010 [ 133.429757] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 [ 133.431041] R13: 000056151fc636e0 R14: 000056151fc616bc R15: 0000000000000080 [ 133.432481] Modules linked in: mpls_iptunnel act_mirred act_tunnel_key cls_flower sch_ingress veth mpls_router ip_tunnel bareudp ip6_udp_tunnel udp_tunnel macsec udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc ebtable_filter ebtables overlay ip6table_filter ip6_tables iptable_filter sunrpc ext4 mbcache jbd2 pcspkr i2c_piix4 virtio_balloon joydev ip_tables xfs libcrc32c ata_generic qxl pata_acpi drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata virtio_net net_failover virtio_console failover virtio_blk i2c_core virtio_pci virtio_ring serio_raw floppy virtio dm_mirror dm_region_hash dm_log dm_mod [ 133.444045] CR2: 0000000000000000 [ 133.445082] ---[ end trace f4aeee1958fd1638 ]--- [ 133.446236] RIP: 0010:0x0 [ 133.447180] Code: Bad RIP value. [ 133.448152] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246 [ 133.449363] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00 [ 133.450835] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400 [ 133.452237] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000 [ 133.453722] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400 [ 133.455149] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003 [ 133.456520] FS: 00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000 [ 133.458046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 133.459342] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0 [ 133.460782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 133.462240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 133.463697] Kernel panic - not syncing: Fatal exception in interrupt [ 133.465226] Kernel Offset: 0xfa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 133.467025] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: aaa0c23cb901 ("Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug") Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-0/+11
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-07-04 The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 17 day(s) which contain a total of 106 files changed, 5233 insertions(+), 1283 deletions(-). The main changes are: 1) bpftool ability to show PIDs of processes having open file descriptors for BPF map/program/link/BTF objects, relying on BPF iterator progs to extract this info efficiently, from Andrii Nakryiko. 2) Addition of BPF iterator progs for dumping TCP and UDP sockets to seq_files, from Yonghong Song. 3) Support access to BPF map fields in struct bpf_map from programs through BTF struct access, from Andrey Ignatov. 4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack via seq_file from BPF iterator progs, from Song Liu. 5) Make SO_KEEPALIVE and related options available to bpf_setsockopt() helper, from Dmitry Yakunin. 6) Optimize BPF sk_storage selection of its caching index, from Martin KaFai Lau. 7) Removal of redundant synchronize_rcu()s from BPF map destruction which has been a historic leftover, from Alexei Starovoitov. 8) Several improvements to test_progs to make it easier to create a shell loop that invokes each test individually which is useful for some CIs, from Jesper Dangaard Brouer. 9) Fix bpftool prog dump segfault when compiled without skeleton code on older clang versions, from John Fastabend. 10) Bunch of cleanups and minor improvements, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04netfilter: nf_tables: add NFT_CHAIN_BINDINGPablo Neira Ayuso1-1/+12
This new chain flag specifies that: * the kernel dynamically allocates the chain name, if no chain name is specified. * If the immediate expression that refers to this chain is removed, then this bound chain (and its content) is destroyed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: expose enum nft_chain_flags through UAPIPablo Neira Ayuso1-6/+1
This enum definition was never exposed through UAPI. Rename NFT_BASE_CHAIN to NFT_CHAIN_BASE for consistency. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04netfilter: nf_tables: add NFTA_CHAIN_ID attributePablo Neira Ayuso1-0/+3
This netlink attribute allows you to refer to chains inside a transaction as an alternative to the name and the handle. The chain binding support requires this new chain ID approach. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04ipvs: allow connection reuse for unconfirmed conntrackJulian Anastasov1-6/+4
YangYuxi is reporting that connection reuse is causing one-second delay when SYN hits existing connection in TIME_WAIT state. Such delay was added to give time to expire both the IPVS connection and the corresponding conntrack. This was considered a rare case at that time but it is causing problem for some environments such as Kubernetes. As nf_conntrack_tcp_packet() can decide to release the conntrack in TIME_WAIT state and to replace it with a fresh NEW conntrack, we can use this to allow rescheduling just by tuning our check: if the conntrack is confirmed we can not schedule it to different real server and the one-second delay still applies but if new conntrack was created, we are free to select new real server without any delays. YangYuxi lists some of the problem reports: - One second connection delay in masquerading mode: https://marc.info/?t=151683118100004&r=1&w=2 - IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 - Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 - Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack") Co-developed-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04sched: consistently handle layer3 header accesses in the presence of VLANsToke Høiland-Jørgensen2-19/+17
There are a couple of places in net/sched/ that check skb->protocol and act on the value there. However, in the presence of VLAN tags, the value stored in skb->protocol can be inconsistent based on whether VLAN acceleration is enabled. The commit quoted in the Fixes tag below fixed the users of skb->protocol to use a helper that will always see the VLAN ethertype. However, most of the callers don't actually handle the VLAN ethertype, but expect to find the IP header type in the protocol field. This means that things like changing the ECN field, or parsing diffserv values, stops working if there's a VLAN tag, or if there are multiple nested VLAN tags (QinQ). To fix this, change the helper to take an argument that indicates whether the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we make sure to skip all of them, so behaviour is consistent even in QinQ mode. To make the helper usable from the ECN code, move it to if_vlan.h instead of pkt_sched.h. v3: - Remove empty lines - Move vlan variable definitions inside loop in skb_protocol() - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and bpf_skb_ecn_set_ce() v2: - Use eth_type_vlan() helper in skb_protocol() - Also fix code that reads skb->protocol directly - Change a couple of 'if/else if' statements to switch constructs to avoid calling the helper twice Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02bonding: allow xfrm offload setup post-module-loadJarod Wilson1-0/+5
At the moment, bonding xfrm crypto offload can only be set up if the bonding module is loaded with active-backup mode already set. We need to be able to make this work with bonds set to AB after the bonding driver has already been loaded. So what's done here is: 1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used by both bond_main.c and bond_options.c 2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than only when loading in AB mode 3) wire up xfrmdev_ops universally too 4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB 5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a performance hit from traversing into the underlying drivers 5) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call netdev_change_features() from bond_option_mode_set() In my local testing, I can change bonding modes back and forth on the fly, have hardware offload work when I'm in AB, and see no performance penalty to non-AB software encryption, despite having xfrm bits all wired up for all modes now. Fixes: 18cb261afd7b ("bonding: support hardware encryption offload to slaves") Reported-by: Huy Nguyen <huyn@mellanox.com> CC: Saeed Mahameed <saeedm@mellanox.com> CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Jakub Kicinski <kuba@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: netdev@vger.kernel.org CC: intel-wired-lan@lists.osuosl.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02genetlink: remove genl_bindSean Tranchetti1-8/+0
A potential deadlock can occur during registering or unregistering a new generic netlink family between the main nl_table_lock and the cb_lock where each thread wants the lock held by the other, as demonstrated below. 1) Thread 1 is performing a netlink_bind() operation on a socket. As part of this call, it will call netlink_lock_table(), incrementing the nl_table_users count to 1. 2) Thread 2 is registering (or unregistering) a genl_family via the genl_(un)register_family() API. The cb_lock semaphore will be taken for writing. 3) Thread 1 will call genl_bind() as part of the bind operation to handle subscribing to GENL multicast groups at the request of the user. It will attempt to take the cb_lock semaphore for reading, but it will fail and be scheduled away, waiting for Thread 2 to finish the write. 4) Thread 2 will call netlink_table_grab() during the (un)registration call. However, as Thread 1 has incremented nl_table_users, it will not be able to proceed, and both threads will be stuck waiting for the other. genl_bind() is a noop, unless a genl_family implements the mcast_bind() function to handle setting up family-specific multicast operations. Since no one in-tree uses this functionality as Cong pointed out, simply removing the genl_bind() function will remove the possibility for deadlock, as there is no attempt by Thread 1 above to take the cb_lock semaphore. Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Johannes Berg <johannes.berg@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-6/+10
Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net/tls: fix sign extension issue when left shifting u16 valueColin Ian King1-1/+1
Left shifting the u16 value promotes it to a int and then it gets sign extended to a u64. If len << 16 is greater than 0x7fffffff then the upper bits get set to 1 because of the implicit sign extension. Fix this by casting len to u64 before shifting it. Addresses-Coverity: ("integer handling issues") Fixes: ed9b7646b06a ("net/tls: Add asynchronous resync") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ip_tunnel: add header_ops for layer 3 devicesJason A. Donenfeld1-0/+3
Some devices that take straight up layer 3 packets benefit from having a shared header_ops so that AF_PACKET sockets can inject packets that are recognized. This shared infrastructure will be used by other drivers that currently can't inject packets using AF_PACKET. It also exposes the parser function, as it is useful in standalone form too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30bpf, netns: Keep a list of attached bpf_link'sJakub Sitnicki1-1/+1
To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki1-1/+4
Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki1-1/+2
Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30ipvs: register hooks only with servicesJulian Anastasov1-0/+5
Keep the IPVS hooks registered in Netfilter only while there are configured virtual services. This saves CPU cycles while IPVS is loaded but not used. Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30xsk: Replace the cheap_dma flag with a dma_need_sync flagChristoph Hellwig1-3/+3
Invert the polarity and better name the flag so that the use case is properly documented. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
2020-06-30net:qos: police action offloading parameter 'burst' change to the original valuePo Liu3-4/+32
Since 'tcfp_burst' with TICK factor, driver side always need to recover it to the original value, this patch moves the generic calculation and recover to the 'burst' original value before offloading to device driver. Signed-off-by: Po Liu <po.liu@nxp.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30Merge tag 'mlx5-tls-2020-06-26' of ↵David S. Miller1-4/+30
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-tls-2020-06-26 1) Improve hardware layouts and structure for kTLS support 2) Generalize ICOSQ (Internal Channel Operations Send Queue) Due to the asynchronous nature of adding new kTLS flows and handling HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to support generic async operations, such as kTLS add flow and resync, in addition to the existing XSK usages. 3) kTLS hardware flow steering and classification: The driver already has the means to classify TCP ipv4/6 flows to send them to the corresponding RSS HW engine, as reflected in patches 3 through 5, the series will add a steering layer that will hook to the driver's TCP classifiers and will match on well known kTLS connection, in case of a match traffic will be redirected to the kTLS decryption engine, otherwise traffic will continue flowing normally to the TCP RSS engine. 3) kTLS add flow RX HW offload support New offload contexts post their static/progress params WQEs (Work Queue Element) to communicate the newly added kTLS contexts over the per-channel async ICOSQ. The Channel/RQ is selected according to the socket's rxq index. A new TLS-RX workqueue is used to allow asynchronous addition of steering rules, out of the NAPI context. It will be also used in a downstream patch in the resync procedure. Feature is OFF by default. Can be turned on by: $ ethtool -K <if> tls-hw-rx-offload on 4) Added mlx5 kTLS sw stats and new counters are documented in Documentation/networking/tls-offload.rst rx_tls_ctx - number of TLS RX HW offload contexts added to device for decryption. rx_tls_ooo - number of RX packets which were part of a TLS stream but did not arrive in the expected order and triggered the resync procedure. rx_tls_del - number of TLS RX HW offload contexts deleted from device (connection has finished). rx_tls_err - number of RX packets which were part of a TLS stream but were not decrypted due to unexpected error in the state machine. 5) Asynchronous RX resync a. The NIC driver indicates that it would like to resync on some TLS record within the received packet (P), but the driver does not know (yet) which of the TLS records within the packet. At this stage, the NIC driver will query the device to find the exact TCP sequence for resync (tcpsn), however, the driver does not wait for the device to provide the response. b. Eventually, the device responds, and the driver provides the tcpsn within the resync packet to KTLS. Now, KTLS can check the tcpsn against any processed TLS records within packet P, and also against any record that is processed in the future within packet P. The asynchronous resync path simplifies the device driver, as it can save bits on the packet completion (32-bit TCP sequence), and pass this information on an asynchronous command instead. Performance: CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off NIC: ConnectX-6 Dx 100GbE dual port Goodput (app-layer throughput) comparison: +---------------+-------+-------+---------+ | # connections | 1 | 4 | 8 | +---------------+-------+-------+---------+ | SW (Gbps) | 7.26 | 24.70 | 50.30 | +---------------+-------+-------+---------+ | HW (Gbps) | 18.50 | 64.30 | 92.90 | +---------------+-------+-------+---------+ | Speedup | 2.55x | 2.56x | 1.85x * | +---------------+-------+-------+---------+ * After linerate is reached, diff is observed in CPU util ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30genetlink: get rid of family->attrbufCong Wang1-2/+0
genl_family_rcv_msg_attrs_parse() reuses the global family->attrbuf when family->parallel_ops is false. However, family->attrbuf is not protected by any lock on the genl_family_rcv_msg_doit() code path. This leads to several different consequences, one of them is UAF, like the following: genl_family_rcv_msg_doit(): genl_start(): genl_family_rcv_msg_attrs_parse() attrbuf = family->attrbuf __nlmsg_parse(attrbuf); genl_family_rcv_msg_attrs_parse() attrbuf = family->attrbuf __nlmsg_parse(attrbuf); info->attrs = attrs; cb->data = info; netlink_unicast_kernel(): consume_skb() genl_lock_dumpit(): genl_dumpit_info(cb)->attrs Note family->attrbuf is an array of pointers to the skb data, once the skb is freed, any dereference of family->attrbuf will be a UAF. Maybe we could serialize the family->attrbuf with genl_mutex too, but that would make the locking more complicated. Instead, we can just get rid of family->attrbuf and always allocate attrbuf from heap like the family->parallel_ops==true code path. This may add some performance overhead but comparing with taking the global genl_mutex, it still looks better. Fixes: 75cdbdd08900 ("net: ieee802154: have genetlink code to parse the attrs during dumpit") Fixes: 057af7071344 ("net: tipc: have genetlink code to parse the attrs during dumpit") Reported-and-tested-by: syzbot+3039ddf6d7b13daf3787@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+80cad1e3cb4c41cde6ff@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+736bcbcb11b60d0c0792@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+520f8704db2b68091d44@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c96e4dfb32f8987fdeed@syzkaller.appspotmail.com Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: sched: sch_red: Add qevents "early_drop" and "mark"Petr Machata1-0/+2
In order to allow acting on dropped and/or ECN-marked packets, add two new qevents to the RED qdisc: "early_drop" and "mark". Filters attached at "early_drop" block are executed as packets are early-dropped, those attached at the "mark" block are executed as packets are ECN-marked. Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark" qevent. Absence of these attributes signifies "don't care": no block is allocated in that case, or the existing blocks are left intact in case of the change callback. For purposes of offloading, blocks attached to these qevents appear with newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and FLOW_BLOCK_BINDER_TYPE_RED_MARK. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: sched: Introduce helpers for qevent blocksPetr Machata1-0/+49
Qevents are attach points for TC blocks, where filters can be put that are executed when "interesting events" take place in a qdisc. The data to keep and the functions to invoke to maintain a qevent will be largely the same between qevents. Therefore introduce sched-wide helpers for qevent management. Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc, blocks attachment cannot be changed after the qdisc is created. To that end, add a helper tcf_qevent_validate_change(), which verifies whether block index attribute is not attached, or if it is, whether its value matches the current one (i.e. there is no material change). The function tcf_qevent_handle() should be invoked when qdisc hits the "interesting event" corresponding to a block. This function releases root lock for the duration of executing the attached filters, to allow packets generated through user actions (notably mirred) to be reinserted to the same qdisc tree. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: sched: Pass root lock to Qdisc_ops.enqueuePetr Machata1-2/+4
A following patch introduces qevents, points in qdisc algorithm where packet can be processed by user-defined filters. Should this processing lead to a situation where a new packet is to be enqueued on the same port, holding the root lock would lead to deadlocks. To solve the issue, qevent handler needs to unlock and relock the root lock when necessary. To that end, add the root lock argument to the qdisc op enqueue, and propagate throughout. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29sctp: use list_is_singular in sctp_list_single_entryGeliang Tang1-1/+1
Use list_is_singular() instead of open-coding. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29bareudp: Added attribute to enable & disable rx metadata collectionMartin1-0/+1
Metadata need not be collected in receive if the packet from bareudp device is not targeted to openvswitch. Signed-off-by: Martin <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-28net/tls: Add asynchronous resyncBoris Pismenny1-1/+37
This patch adds support for asynchronous resynchronization in tls_device. Async resync follows two distinct stages: 1. The NIC driver indicates that it would like to resync on some TLS record within the received packet (P), but the driver does not know (yet) which of the TLS records within the packet. At this stage, the NIC driver will query the device to find the exact TCP sequence for resync (tcpsn), however, the driver does not wait for the device to provide the response. 2. Eventually, the device responds, and the driver provides the tcpsn within the resync packet to KTLS. Now, KTLS can check the tcpsn against any processed TLS records within packet P, and also against any record that is processed in the future within packet P. The asynchronous resync path simplifies the device driver, as it can save bits on the packet completion (32-bit TCP sequence), and pass this information on an asynchronous command instead. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-28Revert "net/tls: Add force_resync for driver resync"Boris Pismenny1-11/+1
This reverts commit b3ae2459f89773adcbf16fef4b68deaaa3be1929. Revert the force resync API. Not in use. To be replaced by a better async resync API downstream. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller5-7/+26
Minor overlapping changes in xfrm_device.c, between the double ESP trailing bug fix setting the XFRM_INIT flag and the changes in net-next preparing for bonding encryption support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26sctp: Don't advertise IPv4 addresses if ipv6only is set on the socketMarcelo Ricardo Leitner1-3/+5
If a socket is set ipv6only, it will still send IPv4 addresses in the INIT and INIT_ACK packets. This potentially misleads the peer into using them, which then would cause association termination. The fix is to not add IPv4 addresses to ipv6only sockets. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25net: qos: police action add index for tc flower offloadingPo Liu1-0/+1
Hardware device may include more than one police entry. Specifying the action's index make it possible for several tc filters to share the same police action when installing the filters. Propagate this index to device drivers through the flow offload intermediate representation, so that drivers could share a single hardware policer between multiple filters. v1->v2 changes: - Update the commit message suggest by Ido Schimmel <idosch@idosch.org> Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25net: qos: add tc police offloading action with max frame size limitPo Liu2-0/+11
Current police offloading support the 'burst'' and 'rate_bytes_ps'. Some hardware own the capability to limit the frame size. If the frame size larger than the setting, the frame would be dropped. For the police action itself already accept the 'mtu' parameter in tc command. But not extend to tc flower offloading. So extend 'mtu' to tc flower offloading. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25net: bpf: Add bpf_seq_afinfo in udp_iter_stateYonghong Song1-0/+1
Similar to tcp_iter_state, a new field bpf_seq_afinfo is added to udp_iter_state to provide bpf udp iterator afinfo. This does not change /proc/net/{udp, udp6} behavior. But it enables bpf iterator to avoid get afinfo from PDE_DATA and iterate through all udp and udp6 sockets in one pass. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230812.3988347-1-yhs@fb.com
2020-06-25net: bpf: Add bpf_seq_afinfo in tcp_iter_stateYonghong Song1-0/+1
A new field bpf_seq_afinfo is added to tcp_iter_state to provide bpf tcp iterator afinfo. There are two reasons on why we did this. First, the current way to get afinfo from PDE_DATA does not work for bpf iterator as its seq_file inode does not conform to /proc/net/{tcp,tcp6} inode structures. More specifically, anonymous bpf iterator will use an anonymous inode which is shared in the system and we cannot change inode private data structure at all. Second, bpf iterator for tcp/tcp6 wants to traverse all tcp and tcp6 sockets in one pass and bpf program can control whether they want to skip one sk_family or not. Having a different afinfo with family AF_UNSPEC make it easier to understand in the code. This patch does not change /proc/net/{tcp,tcp6} behavior as the bpf_seq_afinfo will be NULL for these two proc files. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230804.3987829-1-yhs@fb.com