summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2014-08-06Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller6-27/+37
Antonio Quartulli says: ==================== pull request: batman-adv 2014-08-05 this is a pull request intended for net-next/linux-3.17 (yeah..it's really late). Patches 1, 2 and 4 are really minor changes: - kmalloc_array is substituted to kmalloc when possible (as suggested by checkpatch); - net_ratelimited() is now used properly and the "suppressed" message is not printed anymore if not needed; - the internal version number has been increased to reflect our current version. Patch 3 instead is introducing a change in the metric computation function by changing the penalty applied at each mesh hop from 15/255 (~6%) to 30/255 (~11%). This change is introduced by Simon Wunderlich after having observed a performance improvement in several networks when using the new value. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06bridge: Update outdated comment on promiscuous modeToshiaki Makita1-4/+2
Now bridge ports can be non-promiscuous, vlan_vid_add() is no longer an unnecessary operation. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: ACK timestamp for bytestreamsWillem de Bruijn2-0/+8
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte in the send() call is acknowledged. It implements the feature for TCP. The timestamp is generated when the TCP socket cumulative ACK is moved beyond the tracked seqno for the first time. The feature ignores SACK and FACK, because those acknowledge the specific byte, but not necessarily the entire contents of the buffer up to that byte. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: TCP timestampingWillem de Bruijn4-6/+52
TCP timestamping extends SO_TIMESTAMPING to bytestreams. Bytestreams do not have a 1:1 relationship between send() buffers and network packets. The feature interprets a send call on a bytestream as a request for a timestamp for the last byte in that send() buffer. The choice corresponds to a request for a timestamp when all bytes in the buffer have been sent. That assumption depends on in-order kernel transmission. This is the common case. That said, it is possible to construct a traffic shaping tree that would result in reordering. The guarantee is strong, then, but not ironclad. This implementation supports send and sendpages (splice). GSO replaces one large packet with multiple smaller packets. This patch also copies the option into the correct smaller packet. This patch does not yet support timestamping on data in an initial TCP Fast Open SYN, because that takes a very different data path. If ID generation in ee_data is enabled, bytestream timestamps return a byte offset, instead of the packet counter for datagrams. The implementation supports a single timestamp per packet. It silenty replaces requests for previous timestamps. To avoid missing tstamps, flush the tcp queue by disabling Nagle, cork and autocork. Missing tstamps can be detected by offset when the ee_data ID is enabled. Implementation details: - On GSO, the timestamping code can be included in the main loop. I moved it into its own loop to reduce the impact on the common case to a single branch. - To avoid leaking the absolute seqno to userspace, the offset returned in ee_data must always be relative. It is an offset between an skb and sk field. The first is always set (also for GSO & ACK). The second must also never be uninitialized. Only allow the ID option on sockets in the ESTABLISHED state, for which the seqno is available. Never reset it to zero (instead, move it to the current seqno when reenabling the option). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: SCHED timestamp on entering packet schedulerWillem de Bruijn3-4/+19
Kernel transmit latency is often incurred in the packet scheduler. Introduce a new timestamp on transmission just before entering the scheduler. When data travels through multiple devices (bonding, tunneling, ...) each device will export an individual timestamp. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: add key to disambiguate concurrent datagramsWillem de Bruijn4-1/+19
Datagrams timestamped on transmission can coexist in the kernel stack and be reordered in packet scheduling. When reading looped datagrams from the socket error queue it is not always possible to unique correlate looped data with original send() call (for application level retransmits). Even if possible, it may be expensive and complex, requiring packet inspection. Introduce a data-independent ID mechanism to associate timestamps with send calls. Pass an ID alongside the timestamp in field ee_data of sock_extended_err. The ID is a simple 32 bit unsigned int that is associated with the socket and incremented on each send() call for which software tx timestamp generation is enabled. The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to avoid changing ee_data for existing applications that expect it 0. The counter is reset each time the flag is reenabled. Reenabling does not change the ID of already submitted data. It is possible to receive out of order IDs if the timestamp stream is not quiesced first. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: move timestamp flags out of sk_flagsWillem de Bruijn2-27/+6
sk_flags is reaching its limit. New timestamping options will not fit. Move all of them into a new field sk->sk_tsflags. Added benefit is that this removes boilerplate code to convert between SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt. SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive timestamp logic (netstamp_needed). That can be simplified and this last key removed, but will leave that for a separate patch. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs, though that scatters tstamp fields throughout the struct. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06net-timestamp: extend SCM_TIMESTAMPING ancillary data structWillem de Bruijn2-9/+12
Applications that request kernel tx timestamps with SO_TIMESTAMPING read timestamps as recvmsg() ancillary data. The response is defined implicitly as timespec[3]. 1) define struct scm_timestamping explicitly and 2) add support for new tstamp types. On tx, scm_timestamping always accompanies a sock_extended_err. Define previously unused field ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to define the existing behavior. The reception path is not modified. On rx, no struct similar to sock_extended_err is passed along with SCM_TIMESTAMPING. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06tcp: reduce spurious retransmits due to transient SACK renegingNeal Cardwell2-13/+20
This commit reduces spurious retransmits due to apparent SACK reneging by only reacting to SACK reneging that persists for a short delay. When a sequence space hole at snd_una is filled, some TCP receivers send a series of ACKs as they apparently scan their out-of-order queue and cumulatively ACK all the packets that have now been consecutiveyly received. This is essentially misbehavior B in "Misbehaviors in TCP SACK generation" ACM SIGCOMM Computer Communication Review, April 2011, so we suspect that this is from several common OSes (Windows 2000, Windows Server 2003, Windows XP). However, this issue has also been seen in other cases, e.g. the netdev thread "TCP being hoodwinked into spurious retransmissions by lack of timestamps?" from March 2014, where the receiver was thought to be a BSD box. Since snd_una would temporarily be adjacent to a previously SACKed range in these scenarios, this receiver behavior triggered the Linux SACK reneging code path in the sender. This led the sender to clear the SACK scoreboard, enter CA_Loss, and spuriously retransmit (potentially) every packet from the entire write queue at line rate just a few milliseconds before the ACK for each packet arrives at the sender. To avoid such situations, now when a sender sees apparent reneging it does not yet retransmit, but rather adjusts the RTO timer to give the receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs that will restore sanity to the SACK scoreboard. If the reneging persists until this RTO then, as before, we clear the SACK scoreboard and enter CA_Loss. A 10ms delay tolerates a receiver sending such a stream of ACKs at 56Kbit/sec. And to allow for receivers with slower or more congested paths, we wait for at least RTT/2. We validated the resulting max(RTT/2, 10ms) delay formula with a mix of North American and South American Google web server traffic, and found that for ACKs displaying transient reneging: (1) 90% of inter-ACK delays were less than 10ms (2) 99% of inter-ACK delays were less than RTT/2 In tests on Google web servers this commit reduced reneging events by 75%-90% (as measured by the TcpExtTCPSACKReneging counter), without any measurable impact on latency for user HTTP and SPDY requests. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06Merge tag 'master-2014-07-31' of ↵David S. Miller13-174/+520
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next Conflicts: net/6lowpan/iphc.c Minor conflicts in iphc.c were changes overlapping with some style cleanups. John W. Linville says: ==================== Please pull this last(?) batch of wireless change intended for the 3.17 stream... For the NFC bits, Samuel says: "This is a rather quiet one, we have: - A new driver from ST Microelectronics for their NCI ST21NFCB, including device tree support. - p2p support for the ST21NFCA driver - A few fixes an enhancements for the NFC digital laye" For the Atheros bits, Kalle says: "Michal and Janusz did some important RX aggregation fixes, basically we were missing RX reordering altogether. The 10.1 firmware doesn't support Ad-Hoc mode and Michal fixed ath10k so that it doesn't advertise Ad-Hoc support with that firmware. Also he implemented a workaround for a KVM issue." For the Bluetooth bits, Gustavo and Johan say: "To quote Gustavo from his previous request: 'Some last minute fixes for -next. We have a fix for a use after free in RFCOMM, another fix to an issue with ADV_DIRECT_IND and one for ADV_IND with auto-connection handling. Last, we added support for reading the codec and MWS setting for controllers that support these features.' Additionally there are fixes to LE scanning, an update to conform to the 4.1 core specification as well as fixes for tracking the page scan state. All of these fixes are important for 3.17." And, "We've got: - 6lowpan fixes/cleanups - A couple crash fixes, one for the Marvell HCI driver and another in LE SMP. - Fix for an incorrect connected state check - Fix for the bondable requirement during pairing (an issue which had crept in because of using "pairable" when in fact the actual meaning was "bondable" (these have different meanings in Bluetooth)" Along with those are some late-breaking hardware support patches in brcmfmac and b43 as well as a stray ath9k patch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05batman-adv: Start new development cycleSimon Wunderlich1-1/+1
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-08-05batman-adv: increase default hop penaltySimon Wunderlich1-1/+1
The default hop penalty is currently set to 15, which is applied like that for multi interface devices (e.g. dual band APs). Single band devices will still use an effective penalty of 30 (hop penalty + wifi penalty). After receiving reports of too long paths in mesh networks with dual band APs which were fixed by increasing the hop penalty, we'd like to suggest to increase that default value in the default setting as well. We've evaluated that increase in a handful of medium sized mesh networks (5-20 nodes) with single and dual band devices, with changes for the better (shorter routes, higher throughput) or no change at all. This patch changes the hop penalty to 30, which will give an effective penalty of 60 on single band devices (hop penalty + wifi penalty). Signed-off-by: Simon Wunderlich <simon@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-08-05batman-adv: remove unnecessary logspamAndré Gaul2-15/+23
This patch removes unnecessary logspam which resulted from superfluous calls to net_ratelimit(). With the supplied patch, net_ratelimit() is called after the loglevel has been checked. Signed-off-by: André Gaul <gaul@web-yard.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-08-05netlink: fix lockdep splatsEric Dumazet1-2/+2
With netlink_lookup() conversion to RCU, we need to use appropriate rcu dereference in netlink_seq_socket_idx() & netlink_seq_next() Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: e341694e3eb57fc ("netlink: Convert netlink_lookup() to use RCU protected hash table") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05tcp: md5: remove unneeded check in tcp_v4_parse_md5_keysDmitry Popov1-1/+1
tcpm_key is an array inside struct tcp_md5sig, there is no need to check it against NULL. Signed-off-by: Dmitry Popov <ixaphire@qrator.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04bridge: remove a useless commentMichael S. Tsirkin1-1/+0
commit 6cbdceeb1cb12c7d620161925a8c3e81daadb2e4 bridge: Dump vlan information from a bridge port introduced a comment in an attempt to explain the code logic. The comment is unfinished so it confuses more than it explains, remove it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04batman-adv: prefer kmalloc_array to kmalloc when possibleAntonio Quartulli3-10/+12
Reported by checkpatch with the following warning: WARNING: Prefer kmalloc_array over kmalloc with multiply Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-08-03nftables: Convert nft_hash to use generic rhashtableThomas Graf1-236/+55
The sizing of the hash table and the practice of requiring a lookup to retrieve the pprev to be stored in the element cookie before the deletion of an entry is left intact. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Patrick McHardy <kaber@trash.net> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03netlink: Convert netlink_lookup() to use RCU protected hash tableThomas Graf3-195/+119
Heavy Netlink users such as Open vSwitch spend a considerable amount of time in netlink_lookup() due to the read-lock on nl_table_lock. Use of RCU relieves the lock contention. Makes use of the new resizable hash table to avoid locking on the lookup. The hash table will grow if entries exceeds 75% of table size up to a total table size of 64K. It will automatically shrink if usage falls below 30%. Also splits nl_table_lock into a separate mutex to protect hash table mutations and allow synchronize_rcu() to sleep while waiting for readers during expansion and shrinking. Before: 9.16% kpktgend_0 [openvswitch] [k] masked_flow_lookup 6.42% kpktgend_0 [pktgen] [k] mod_cur_headers 6.26% kpktgend_0 [pktgen] [k] pktgen_thread_worker 6.23% kpktgend_0 [kernel.kallsyms] [k] memset 4.79% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup 4.37% kpktgend_0 [kernel.kallsyms] [k] memcpy 3.60% kpktgend_0 [openvswitch] [k] ovs_flow_extract 2.69% kpktgend_0 [kernel.kallsyms] [k] jhash2 After: 15.26% kpktgend_0 [openvswitch] [k] masked_flow_lookup 8.12% kpktgend_0 [pktgen] [k] pktgen_thread_worker 7.92% kpktgend_0 [pktgen] [k] mod_cur_headers 5.11% kpktgend_0 [kernel.kallsyms] [k] memset 4.11% kpktgend_0 [openvswitch] [k] ovs_flow_extract 4.06% kpktgend_0 [kernel.kallsyms] [k] _raw_spin_lock 3.90% kpktgend_0 [kernel.kallsyms] [k] jhash2 [...] 0.67% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03inet: frags: use kmem_cache for inet_frag_queueNikolay Aleksandrov5-8/+32
Use kmem_cache to allocate/free inet_frag_queue objects since they're all the same size per inet_frags user and are alloced/freed in high volumes thus making it a perfect case for kmem_cache. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03inet: frags: use INET_FRAG_EVICTED to prevent icmp messagesNikolay Aleksandrov3-16/+15
Now that we have INET_FRAG_EVICTED we might as well use it to stop sending icmp messages in the "frag_expire" functions instead of stripping INET_FRAG_FIRST_IN from their flags when evicting. Also fix the comment style in ip6_expire_frag_queue(). Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03inet: frags: fix function declaration alignments in inet_fragmentNikolay Aleksandrov1-5/+9
Fix a couple of functions' declaration alignments. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03inet: frags: rename last_in to flagsNikolay Aleksandrov5-39/+39
The last_in field has been used to store various flags different from first/last frag in so give it a more descriptive name: flags. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03inet: frags: use INC_STATS_BH in the ipv6 reassembly codeNikolay Aleksandrov1-3/+4
Softirqs are already disabled so no need to do it again, thus let's be consistent and use the IP6_INC_STATS_BH variant. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03ipv4: remove nested rcu_read_lock/unlockDuan Jiong1-2/+0
ip_local_deliver_finish() already have a rcu_read_lock/unlock, so the rcu_read_lock/unlock is unnecessary. See the stack below: ip_local_deliver_finish | | ->icmp_rcv | | ->icmp_socket_deliver Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03net: filter: split 'struct sk_filter' into socket and bpf partsAlexei Starovoitov5-53/+65
clean up names related to socket filtering and bpf in the following way: - everything that deals with sockets keeps 'sk_*' prefix - everything that is pure BPF is changed to 'bpf_*' prefix split 'struct sk_filter' into struct sk_filter { atomic_t refcnt; struct rcu_head rcu; struct bpf_prog *prog; }; and struct bpf_prog { u32 jited:1, len:31; struct sock_fprog_kern *orig_prog; unsigned int (*bpf_func)(const struct sk_buff *skb, const struct bpf_insn *filter); union { struct sock_filter insns[0]; struct bpf_insn insnsi[0]; struct work_struct work; }; }; so that 'struct bpf_prog' can be used independent of sockets and cleans up 'unattached' bpf use cases split SK_RUN_FILTER macro into: SK_RUN_FILTER to be used with 'struct sk_filter *' and BPF_PROG_RUN to be used with 'struct bpf_prog *' __sk_filter_release(struct sk_filter *) gains __bpf_prog_release(struct bpf_prog *) helper function also perform related renames for the functions that work with 'struct bpf_prog *', since they're on the same lines: sk_filter_size -> bpf_prog_size sk_filter_select_runtime -> bpf_prog_select_runtime sk_filter_free -> bpf_prog_free sk_unattached_filter_create -> bpf_prog_create sk_unattached_filter_destroy -> bpf_prog_destroy sk_store_orig_filter -> bpf_prog_store_orig_filter sk_release_orig_filter -> bpf_release_orig_filter __sk_migrate_filter -> bpf_migrate_filter __sk_prepare_filter -> bpf_prepare_filter API for attaching classic BPF to a socket stays the same: sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *) and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program which is used by sockets, tun, af_packet API for 'unattached' BPF programs becomes: bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *) and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03net: filter: rename sk_convert_filter() -> bpf_convert_filter()Alexei Starovoitov1-8/+8
to indicate that this function is converting classic BPF into eBPF and not related to sockets Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03net: filter: rename sk_chk_filter() -> bpf_check_classic()Alexei Starovoitov1-5/+5
trivial rename to indicate that this functions performs classic BPF checking Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03net: filter: rename sk_filter_proglen -> bpf_classic_proglenAlexei Starovoitov2-5/+5
trivial rename to better match semantics of macro Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-03net: filter: simplify socket chargingAlexei Starovoitov2-52/+44
attaching bpf program to a socket involves multiple socket memory arithmetic, since size of 'sk_filter' is changing when classic BPF is converted to eBPF. Also common path of program creation has to deal with two ways of freeing the memory. Simplify the code by delaying socket charging until program is ready and its size is known Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01net: use inet6_iif instead of IP6CB()->iifDuan Jiong4-7/+7
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01netlink: Use PAGE_ALIGNED macroTobias Klauser1-1/+1
Use PAGE_ALIGNED(...) instead of IS_ALIGNED(..., PAGE_SIZE). Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORSDuan Jiong5-18/+17
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver() and icmpv6_notify() do some valid checks on packet's length, but then some protocols check packet's length redaudantly. So remove those duplicated statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in function icmp_socket_deliver() and icmpv6_notify() respectively. In addition, add missed counter in udp6/udplite6 when socket is NULL. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01sctp: Fixup v4mapped behaviour to comply with Sock APIJason Gunthorpe5-100/+107
The SCTP socket extensions API document describes the v4mapping option as follows: 8.1.15. Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR) This socket option is a Boolean flag which turns on or off the mapping of IPv4 addresses. If this option is turned on, then IPv4 addresses will be mapped to V6 representation. If this option is turned off, then no mapping will be done of V4 addresses and a user will receive both PF_INET6 and PF_INET type addresses on the socket. See [RFC3542] for more details on mapped V6 addresses. This description isn't really in line with what the code does though. Introduce addr_to_user (renamed addr_v4map), which should be called before any sockaddr is passed back to user space. The new function places the sockaddr into the correct format depending on the SCTP_I_WANT_MAPPED_V4_ADDR option. Audit all places that touched v4mapped and either sanely construct a v4 or v6 address then call addr_to_user, or drop the unnecessary v4mapped check entirely. Audit all places that call addr_to_user and verify they are on a sycall return path. Add a custom getname that formats the address properly. Several bugs are addressed: - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for addresses to user space - The addr_len returned from recvmsg was not correct when returning AF_INET on a v6 socket - flowlabel and scope_id were not zerod when promoting a v4 to v6 - Some syscalls like bind and connect behaved differently depending on v4mapped Tested bind, getpeername, getsockname, connect, and recvmsg for proper behaviour in v4mapped = 1 and 0 cases. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller9-129/+133
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains netfilter updates for net-next, they are: 1) Add the reject expression for the nf_tables bridge family, this allows us to send explicit reject (TCP RST / ICMP dest unrech) to the packets matching a rule. 2) Simplify and consolidate the nf_tables set dumping logic. This uses netlink control->data to filter out depending on the request. 3) Perform garbage collection in xt_hashlimit using a workqueue instead of a timer, which is problematic when many entries are in place in the tables, from Eric Dumazet. 4) Remove leftover code from the removed ulog target support, from Paul Bolle. 5) Dump unmodified flags in the netfilter packet accounting when resetting counters, so userspace knows that a counter was in overquota situation, from Alexey Perevalov. 6) Fix wrong usage of the bitwise functions in nfnetlink_acct, also from Alexey. 7) Fix a crash when adding new set element with an empty NFTA_SET_ELEM_LIST attribute. This patchset also includes a couple of cleanups for xt_LED from Duan Jiong and for nf_conntrack_ipv4 (using coccinelle) from Himangi Saraogi. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-01tcp: don't require root to read tcp_metricsBanerjee, Debabrata1-1/+0
commit d23ff7016 (tcp: add generic netlink support for tcp_metrics) introduced netlink support for the new tcp_metrics, however it restricted getting of tcp_metrics to root user only. This is a change from how these values could have been fetched when in the old route cache. Unless there's a legitimate reason to restrict the reading of these values it would be better if normal users could fetch them. Cc: Julian Anastasov <ja@ssi.bg> Cc: linux-kernel@vger.kernel.org Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31netfilter: nf_tables: check for unset NFTA_SET_ELEM_LIST_ELEMENTS attributePablo Neira Ayuso1-0/+6
Otherwise, the kernel oopses in nla_for_each_nested when iterating over the unset attribute NFTA_SET_ELEM_LIST_ELEMENTS in the nf_tables_{new,del}setelem() path. netlink: 65524 bytes leftover after parsing attributes in process `nft'. [...] Oops: 0000 [#1] SMP [...] CPU: 2 PID: 6287 Comm: nft Not tainted 3.16.0-rc2+ #169 RIP: 0010:[<ffffffffa0526e61>] [<ffffffffa0526e61>] nf_tables_newsetelem+0x82/0xec [nf_tables] [...] Call Trace: [<ffffffffa05178c4>] nfnetlink_rcv+0x2e7/0x3d7 [nfnetlink] [<ffffffffa0517939>] ? nfnetlink_rcv+0x35c/0x3d7 [nfnetlink] [<ffffffff8137d300>] netlink_unicast+0xf8/0x17a [<ffffffff8137d6a5>] netlink_sendmsg+0x323/0x351 [...] Fix this by returning -EINVAL if this attribute is not set, which doesn't make sense at all since those commands are there to add and to delete elements from the set. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-07-31netfilter: nfnetlink_acct: avoid using NFACCT_F_OVERQUOTA with bit helper ↵Alexey Perevalov1-3/+5
functions Bit helper functions were used for manipulation with NFACCT_F_OVERQUOTA, but they are accepting pit position, but not a bit mask. As a result not a third bit for NFACCT_F_OVERQUOTA was set, but forth. Such behaviour was dangarous and could lead to unexpected overquota report result. Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-07-31Merge branch 'master' of ↵David S. Miller3-65/+42
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2014-07-30 This is the last pull request for ipsec-next before I'll be off for two weeks starting on friday. David, can you please take urgent ipsec patches directly into net/net-next during this time? 1) Error handling simplifications for vti and vti6. From Mathias Krause. 2) Remove a duplicate semicolon after a return statement. From Christoph Paasch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31net: filter: don't release unattached filter through call_rcu()Pablo Neira1-3/+8
sk_unattached_filter_destroy() does not always need to release the filter object via rcu. Since this filter is never attached to the socket, the caller should be responsible for releasing the filter in a safe way, which may not necessarily imply rcu. This is a short summary of clients of this function: 1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules, these rules are removed from the packet path before the filter is released. Thus, the framework makes sure the filter is safely removed. 2) In the ppp driver, the ppp_lock ensures serialization between the xmit and filter attachment/detachment path. This doesn't use rcu so deferred release via rcu makes no sense. 3) In the isdn/ppp driver, it is called from isdn_ppp_release() the isdn_ppp_ioctl(). This driver uses mutex and spinlocks, no rcu. Thus, deferred rcu makes no sense to me either, the deferred releases may be just masking the effects of wrong locking strategy, which should be fixed in the driver itself. 4) In the team driver, this is the only place where the rcu synchronization with unattached filter is used. Therefore, this patch introduces synchronize_rcu() which is called from the genetlink path to make sure the filter doesn't go away while packets are still walking over it. I think we can revisit this once struct bpf_prog (that only wraps specific bpf code bits) is in place, then add some specific struct rcu_head in the scope of the team driver if Jiri thinks this is needed. Deferred rcu release for unattached filters was originally introduced in 302d663 ("filter: Allow to create sk-unattached filters"). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31net: Remove unlikely() for WARN_ON() conditionsThomas Graf2-2/+2
No need for the unlikely(), WARN_ON() and BUG_ON() internally use unlikely() on the condition. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31dcbnl : Fix misleading dcb_app->priority explanationAnish Bhatt1-2/+3
Current explanation of dcb_app->priority is wrong. It says priority is expected to be a 3-bit unsigned integer which is only true when working with DCBx-IEEE. Use of dcb_app->priority by DCBx-CEE expects it to be 802.1p user priority bitmap. Updated accordingly This affects the cxgb4 driver, but I will post those changes as part of a larger changeset shortly. Fixes: 3e29027af4372 ("dcbnl: add support for ieee8021Qaz attributes") Signed-off-by: Anish Bhatt <anish@chelsio.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-16/+46
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30Bluetooth: Always use non-bonding requirement when not bondableJohan Hedberg1-3/+8
When we're not bondable we should never send any other SSP authentication requirement besides one of the non-bonding ones. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-07-30Bluetooth: Rename pairable mgmt setting to bondableJohan Hedberg1-7/+7
This setting maps to the HCI_BONDABLE flag which tracks whether we're bondable or not. Therefore, rename the mgmt setting and respective command accordingly. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-07-30Bluetooth: Rename HCI_PAIRABLE to HCI_BONDABLEJohan Hedberg4-11/+11
The HCI_PAIRABLE flag isn't actually controlling whether we're pairable but whether we're bondable. Therefore, rename it accordingly. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-07-30Bluetooth: Fix sparse warning from HID new leds handlingMarcel Holtmann1-1/+1
The new leds bit handling produces this spares warning. CHECK net/bluetooth/hidp/core.c net/bluetooth/hidp/core.c:156:60: warning: dubious: x | !y Just fix it by doing an explicit x << 0 shift operation. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2014-07-30Bluetooth: Fix check for connected state when pairingJohan Hedberg1-1/+1
Both BT_CONNECTED and BT_CONFIG state mean that we have a baseband link available. We should therefore check for either of these when pairing and deciding whether to call hci_conn_security() directly. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-07-306lowpan: iphc: Fix parenthesis alignments which off-by-oneMarcel Holtmann1-3/+3
CHECK: Alignment should match open parenthesis + if (((hdr->flow_lbl[0] & 0x0F) == 0) && + (hdr->flow_lbl[1] == 0) && (hdr->flow_lbl[2] == 0)) { CHECK: Alignment should match open parenthesis + if ((hdr->priority == 0) && + ((hdr->flow_lbl[0] & 0xF0) == 0)) { CHECK: Alignment should match open parenthesis + if ((hdr->priority == 0) && + ((hdr->flow_lbl[0] & 0xF0) == 0)) { Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2014-07-306lowpan: iphc: Fix missing braces for if statementMarcel Holtmann1-2/+2
CHECK: braces {} should be used on all arms of this statement + if ((iphc0 & 0x03) != LOWPAN_IPHC_TTL_I) [...] + else { [...] Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>