summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2015-06-16bpf: disallow bpf tc programs access current->pid,uidAlexei Starovoitov1-6/+0
Accessing current->pid/uid from cls_bpf may lead to misleading results and should not be used when TC classifiers need accurate information about pid/uid. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-16sock_diag: define destruction multicast groupsCraig Gallek2-1/+95
These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-16net/core: Add reading VF statistics through the PF netdeviceEran Ben Elisha1-2/+49
Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic statistics. We encode the VF stats in a nested manner to allow for future extensions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-16bpf: allow networking programs to use bpf_trace_printk() for debuggingAlexei Starovoitov1-0/+2
bpf_trace_printk() is a helper function used to debug eBPF programs. Let socket and TC programs use it as well. Note, it's DEBUG ONLY helper. If it's used in the program, the kernel will print warning banner to make sure users don't use it in production. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-16bpf: introduce current->pid, tgid, uid, gid, comm accessorsAlexei Starovoitov1-0/+6
eBPF programs attached to kprobes need to filter based on current->pid, uid and other fields, so introduce helper functions: u64 bpf_get_current_pid_tgid(void) Return: current->tgid << 32 | current->pid u64 bpf_get_current_uid_gid(void) Return: current_gid << 32 | current_uid bpf_get_current_comm(char *buf, int size_of_buf) stores current->comm into buf They can be used from the programs attached to TC as well to classify packets based on current task fields. Update tracex2 example to print histogram of write syscalls for each process instead of aggregated for all. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-10/+7
2015-06-13flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrsEric Dumazet1-2/+4
__skb_header_pointer() returns a pointer that must be checked. Fixes infinite loop reported by Alexei, and add __must_check to catch these errors earlier. Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-13flow_dissector: add support for dst, hop-by-hop and routing ext hdrsTom Herbert1-0/+17
If dst, hop-by-hop or routing extension headers are present determine length of the options and skip over them in flow dissection. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-13flow_dissector: Fix MPLS entropy label handling in flow dissectorTom Herbert1-2/+2
Need to shift after masking to get label value for comparison. Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12net: don't wait for order-3 page allocationShaohua Li2-2/+2
We saw excessive direct memory compaction triggered by skb_page_frag_refill. This causes performance issues and add latency. Commit 5640f7685831e0 introduces the order-3 allocation. According to the changelog, the order-3 allocation isn't a must-have but to improve performance. But direct memory compaction has high overhead. The benefit of order-3 allocation can't compensate the overhead of direct memory compaction. This patch makes the order-3 page allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice the order-3 benefit here. If the atomic allocation fails, direct memory compaction will not be triggered, skb_page_frag_refill will fallback to order-0 immediately, hence the direct memory compaction overhead is avoided. In the allocation failure case, kswapd is waken up and doing compaction, so chances are allocation could success next time. alloc_skb_with_frags is the same. The mellanox driver does similar thing, if this is accepted, we must fix the driver too. V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric V2: make the changelog clearer Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11net/ethtool: Add current supported tunable optionsHadar Hen Zion1-0/+12
Add strings array of the current supported tunable options. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11net, swap: Remove a warning and clarify why sk_mem_reclaim is required when ↵Mel Gorman1-8/+5
deactivating swap Jeff Layton reported the following; [ 74.232485] ------------[ cut here ]------------ [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80() [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5 [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8 [ 74.249483] Call Trace: [ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57 [ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0 [ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20 [ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80 [ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc] [ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc] [ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs] [ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80 [ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0 [ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140 [ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71 [ 74.260158] ---[ end trace 2530722966429f10 ]--- The warning in question was unnecessary but with Jeff's series the rules are also clearer. This patch removes the warning and updates the comment to explain why sk_mem_reclaim() may still be called. [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon points out that it's not needed.] Cc: Leon Romanovsky <leon@leon.nu> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-9/+2
2015-06-08net: replace last open coded skb_orphan_frags with function callWillem de Bruijn1-9/+2
Commit 70008aa50e92 ("skbuff: convert to skb_orphan_frags") replaced open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls to helper function skb_orphan_frags. Apply that to the last remaining open coded site. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07bpf: allow programs to write to certain skb fieldsAlexei Starovoitov1-12/+82
allow programs read/write skb->mark, tc_index fields and ((struct qdisc_skb_cb *)cb)->data. mark and tc_index are generically useful in TC. cb[0]-cb[4] are primarily used to pass arguments from one program to another called via bpf_tail_call() which can be seen in sockex3_kern.c example. All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs. mark, tc_index are writeable from tc_cls_act only. cb[0]-cb[4] are writeable by both sockets and tc_cls_act. Add verifier tests and improve sample code. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-07bpf: make programs see skb->data == L2 for ingress and egressAlexei Starovoitov1-23/+3
eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data. For ingress L2 header is already pulled, whereas for egress it's present. This is known to program writers which are currently forced to use BPF_LL_OFF workaround. Since programs don't change skb internal pointers it is safe to do pull/push right around invocation of the program and earlier taps and later pt->func() will not be affected. Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick around run_filter/BPF_PROG_RUN even if skb_shared. This fix finally allows programs to use optimized LD_ABS/IND instructions without BPF_LL_OFF for higher performance. tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o w/o JIT w/JIT before 20.5 23.6 Mpps after 21.8 26.6 Mpps Old programs with BPF_LL_OFF will still work as-is. We can now undo most of the earlier workaround commit: a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05mpls: Add MPLS entropy label in flow_keysTom Herbert1-0/+35
In flow dissector if an MPLS header contains an entropy label this is saved in the new keyid field of flow_keys. The entropy label is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add GRE keyid in flow_keysTom Herbert1-1/+23
In flow dissector if a GRE header contains a keyid this is saved in the new keyid field of flow_keys. The GRE keyid is then represented in the flow hash function input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add IPv6 flow label to flow_keysTom Herbert1-19/+10
In flow_dissector set the flow label in flow_keys for IPv6. This also removes the shortcircuiting of flow dissection when a non-zero label is present, the flow label can be considered to provide additional entropy for a hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add VLAN ID to flow_keysTom Herbert1-0/+14
In flow_dissector set vlan_id in flow_keys when VLAN is found. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Get rid of IPv6 hash addresses flow keysTom Herbert1-17/+0
We don't need to return the IPv6 address hash as part of flow keys. In general, using the IPv6 address hash is risky in a hash value since the underlying use of xor provides no entropy. If someone really needs the hash value they can get it from the full IPv6 addresses in flow keys (e.g. from flow_get_u32_src). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add keys for TIPC addressTom Herbert1-5/+13
Add a new flow key for TIPC addresses. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Add full IPv6 addresses to flow_keysTom Herbert1-19/+97
This patch adds full IPv6 addresses into flow_keys and uses them as input to the flow hash function. The implementation supports either IPv4 or IPv6 addresses in a union, and selector is used to determine how may words to input to jhash2. We also add flow_get_u32_dst and flow_get_u32_src functions which are used to get a u32 representation of the source and destination addresses. For IPv6, ipv6_addr_hash is called. These functions retain getting the legacy values of src and dst in flow_keys. With this patch, Ethertype and IP protocol are now included in the flow hash input. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Get skb hash over flow_keys structureTom Herbert1-13/+41
This patch changes flow hashing to use jhash2 over the flow_keys structure instead just doing jhash_3words over src, dst, and ports. This method will allow us take more input into the hashing function so that we can include full IPv6 addresses, VLAN, flow labels etc. without needing to resort to xor'ing which makes for a poor hash. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Remove superfluous setting of key_basicTom Herbert1-6/+0
key_basic is set twice in __skb_flow_dissect which seems unnecessary. Remove second one. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-05net: Simplify GRE case in flow_dissectorTom Herbert1-22/+22
Do break when we see routing flag or a non-zero version number in GRE header. Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04bpf: fix build due to missing tc_verdAlexei Starovoitov1-3/+1
fix build error: net/core/filter.c: In function 'bpf_clone_redirect': net/core/filter.c:1429:18: error: 'struct sk_buff' has no member named 'tc_verd' if (G_TC_AT(skb2->tc_verd) & AT_INGRESS) Fixes: 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") Reported-by: Or Gerlitz <gerlitz.or@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04bpf: introduce bpf_clone_redirect() helperAlexei Starovoitov1-0/+40
Allow eBPF programs attached to classifier/actions to call bpf_clone_redirect(skb, ifindex, flags) helper which will mirror or redirect the packet by dynamic ifindex selection from within the program to a target device either at ingress or at egress. Can be used for various scenarios, for example, to load balance skbs into veths, split parts of the traffic to local taps, etc. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-9/+1
Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02net: Add priority to packet_offload objects.David S. Miller1-2/+6
When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-02Revert "net: core: 'ethtool' issue with querying phy settings"David S. Miller1-9/+1
This reverts commit f96dee13b8e10f00840124255bed1d8b4c6afd6f. It isn't right, ethtool is meant to manage one PHY instance per netdevice at a time, and this is selected by the SET command. Therefore by definition the GET command must only return the settings for the configured and selected PHY. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01ebpf: allow bpf_ktime_get_ns_proto also for networkingDaniel Borkmann1-0/+2
As this is already exported from tracing side via commit d9847d310ab4 ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might as well want to move it to the core, so also networking users can make use of it, e.g. to measure diffs for certain flows from ingress/egress. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31netevent: remove automatic variable in register_netevent_notifier()Wang Long1-4/+1
Remove automatic variable 'err' in register_netevent_notifier() and return the result of atomic_notifier_chain_register() directly. Signed-off-by: Wang Long <long.wanglong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31bpf: allow BPF programs access skb->skb_iif and skb->dev->ifindex fieldsAlexei Starovoitov1-0/+18
classic BPF already exposes skb->dev->ifindex via SKF_AD_IFINDEX extension. Allow eBPF program to access it as well. Note that classic aborts execution of the program if 'skb->dev == NULL' (which is inconvenient for program writers), whereas eBPF returns zero in such case. Also expose the 'skb_iif' field, since programs triggered by redirected packet need to known the original interface index. Summary: __skb->ifindex -> skb->dev->ifindex __skb->ingress_ifindex -> skb->skb_iif Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27tcp: tcp_tso_autosize() minimum is one packetEric Dumazet1-1/+4
By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert 843925f33fcc293d80acf2c5c8a78adf3344d49b ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26net: fix inet_proto_csum_replace4() sparse errorsEric Dumazet1-5/+7
make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o ... net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:307:72: expected restricted __wsum [usertype] addend net/core/utils.c:307:72: got restricted __be32 [usertype] from net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types) net/core/utils.c:308:34: expected restricted __wsum [usertype] addend net/core/utils.c:308:34: got restricted __be32 [usertype] to net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:70: expected restricted __wsum [usertype] addend net/core/utils.c:310:70: got restricted __be32 [usertype] from net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types) net/core/utils.c:310:77: expected restricted __wsum [usertype] addend net/core/utils.c:310:77: got restricted __be32 [usertype] to net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types) net/core/utils.c:312:72: expected restricted __wsum [usertype] addend net/core/utils.c:312:72: got restricted __be32 [usertype] from net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types) net/core/utils.c:313:35: expected restricted __wsum [usertype] addend net/core/utils.c:313:35: got restricted __be32 [usertype] to Note we can use csum_replace4() helper Fixes: 58e3cac5613aa ("net: optimise inet_proto_csum_replace4()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26net: remove a sparse error in secure_dccpv6_sequence_number()Eric Dumazet1-1/+1
make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to integer Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26pktgen: remove one sparse errorEric Dumazet1-5/+5
net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types) net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident> net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol Let's use proper struct ethhdr instead of hard coding everything. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25net: af_unix: implement splice for stream af_unix socketsHannes Frederic Sowa1-0/+1
unix_stream_recvmsg is refactored to unix_stream_read_generic in this patch and enhanced to deal with pipe splicing. The refactoring is inneglible, we mostly have to deal with a non-existing struct msghdr argument. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25net: make skb_splice_bits more configureableHannes Frederic Sowa1-17/+28
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets. AF_UNIX sockets don't use lock_sock/release_sock and thus we have to use a callback to make the locking and unlocking configureable. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25net: skbuff: add skb_append_pagefrags and use itHannes Frederic Sowa1-0/+18
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-1/+12
Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23pktgen: make /proc/net/pktgen/pgctrl report fail on invalid inputJesper Dangaard Brouer1-2/+2
Giving /proc/net/pktgen/pgctrl an invalid command just returns shell success and prints a warning in dmesg. This is not very useful for shell scripting, as it can only detect the error by parsing dmesg. Instead return -EINVAL when the command is unknown, as this provides userspace shell scripting a way of detecting this. Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl output this version number which would allow to detect this small semantic change, and (2) because the pktgen version tag have not been updated since 2010. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23pktgen: adjust spacing in proc file interface outputJesper Dangaard Brouer1-1/+1
Too many spaces were introduced in commit 63adc6fb8ac0 ("pktgen: cleanup checkpatch warnings"), thus misaligning "src_min:" to other columns. Fixes: 63adc6fb8ac0 ("pktgen: cleanup checkpatch warnings") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22net: core: 'ethtool' issue with querying phy settingsArun Parameswaran1-1/+9
When trying to configure the settings for PHY1, using commands like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to modify other settings apart from the speed of the PHY1, in the above case. The ethtool seems to query the settings for PHY0, and use this as the base to apply the new settings to the PHY1. This is causing the other settings of the PHY 1 to be wrongly configured. The issue is caused by the '_ethtool_get_settings()' API, which gets called because of the 'ETHTOOL_GSET' command, is clearing the 'cmd' pointer (of type 'struct ethtool_cmd') by calling memset. This clears all the parameters (if any) passed for the 'ETHTOOL_GSET' cmd. So the driver's callback is always invoked with 'cmd->phy_address' as '0'. The '_ethtool_get_settings()' is called from other files in the 'net/core'. So the fix is applied to the 'ethtool_get_settings()' which is only called in the context of the 'ethtool'. Signed-off-by: Arun Parameswaran <aparames@broadcom.com> Reviewed-by: Ray Jui <rjui@broadcom.com> Reviewed-by: Scott Branden <sbranden@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22flow_dissector: do not break if ports are not needed in flowlabelJiri Pirko1-7/+7
This restored previous behaviour. If caller does not want ports to be filled, we should not break. Fixes: 06635a35d13d ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-22bpf: allow bpf programs to tail-call other bpf programsAlexei Starovoitov1-0/+2
introduce bpf_tail_call(ctx, &jmp_table, index) helper function which can be used from BPF programs like: int bpf_prog(struct pt_regs *ctx) { ... bpf_tail_call(ctx, &jmp_table, index); ... } that is roughly equivalent to: int bpf_prog(struct pt_regs *ctx) { ... if (jmp_table[index]) return (*jmp_table[index])(ctx); ... } The important detail that it's not a normal call, but a tail call. The kernel stack is precious, so this helper reuses the current stack frame and jumps into another BPF program without adding extra call frame. It's trivially done in interpreter and a bit trickier in JITs. In case of x64 JIT the bigger part of generated assembler prologue is common for all programs, so it is simply skipped while jumping. Other JITs can do similar prologue-skipping optimization or do stack unwind before jumping into the next program. bpf_tail_call() arguments: ctx - context pointer jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table index - index in the jump table Since all BPF programs are idenitified by file descriptor, user space need to populate the jmp_table with FDs of other BPF programs. If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere and program execution continues as normal. New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can populate this jmp_table array with FDs of other bpf programs. Programs can share the same jmp_table array or use multiple jmp_tables. The chain of tail calls can form unpredictable dynamic loops therefore tail_call_cnt is used to limit the number of calls and currently is set to 32. Use cases: Acked-by: Daniel Borkmann <daniel@iogearbox.net> ========== - simplify complex programs by splitting them into a sequence of small programs - dispatch routine For tracing and future seccomp the program may be triggered on all system calls, but processing of syscall arguments will be different. It's more efficient to implement them as: int syscall_entry(struct seccomp_data *ctx) { bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */); ... default: process unknown syscall ... } int sys_write_event(struct seccomp_data *ctx) {...} int sys_read_event(struct seccomp_data *ctx) {...} syscall_jmp_table[__NR_write] = sys_write_event; syscall_jmp_table[__NR_read] = sys_read_event; For networking the program may call into different parsers depending on packet format, like: int packet_parser(struct __sk_buff *skb) { ... parse L2, L3 here ... __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol)); bpf_tail_call(skb, &ipproto_jmp_table, ipproto); ... default: process unknown protocol ... } int parse_tcp(struct __sk_buff *skb) {...} int parse_udp(struct __sk_buff *skb) {...} ipproto_jmp_table[IPPROTO_TCP] = parse_tcp; ipproto_jmp_table[IPPROTO_UDP] = parse_udp; - for TC use case, bpf_tail_call() allows to implement reclassify-like logic - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table are atomic, so user space can build chains of BPF programs on the fly Implementation details: ======================= - high performance of bpf_tail_call() is the goal. It could have been implemented without JIT changes as a wrapper on top of BPF_PROG_RUN() macro, but with two downsides: . all programs would have to pay performance penalty for this feature and tail call itself would be slower, since mandatory stack unwind, return, stack allocate would be done for every tailcall. . tailcall would be limited to programs running preempt_disabled, since generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would need to be either global per_cpu variable accessed by helper and by wrapper or global variable protected by locks. In this implementation x64 JIT bypasses stack unwind and jumps into the callee program after prologue. - bpf_prog_array_compatible() ensures that prog_type of callee and caller are the same and JITed/non-JITed flag is the same, since calling JITed program from non-JITed is invalid, since stack frames are different. Similarly calling kprobe type program from socket type program is invalid. - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map' abstraction, its user space API and all of verifier logic. It's in the existing arraymap.c file, since several functions are shared with regular array map. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21net: dev: reduce both ingress hook ifdefsDaniel Borkmann1-18/+4
Reduce ifdef pollution slightly, no functional change. We can simply remove the extra alternative definition of handle_ing() and nf_ingress(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21neigh: Better handling of transition to NUD_PROBE stateErik Kline1-0/+3
[1] When entering NUD_PROBE state via neigh_update(), perhaps received from userspace, correctly (re)initialize the probes count to zero. This is useful for forcing revalidation of a neighbor (for example if the host is attempting to do DNA [IPv4 4436, IPv6 6059]). [2] Notify listeners when a neighbor goes into NUD_PROBE state. By sending notifications on entry to NUD_PROBE state listeners get more timely warnings of imminent connectivity issues. The current notifications on entry to NUD_STALE have somewhat limited usefulness: NUD_STALE is a perfectly normal state, as is NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after a neighbor reachability problem has been confirmed (typically after three probes). Signed-off-by: Erik Kline <ek@google.com> Acked-By: Lorenzo Colitti <lorenzo@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18netns: make nsid_lock per netWANG Cong1-16/+16
The spinlock is used to protect netns_ids which is per net, so there is no need to use a global spinlock. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>