summaryrefslogtreecommitdiff
path: root/drivers/net/tun.c
AgeCommit message (Collapse)AuthorFilesLines
2017-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-4/+6
The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07tun: read vnet_hdr_sz onceWillem de Bruijn1-4/+6
When IFF_VNET_HDR is enabled, a virtio_net header must precede data. Data length is verified to be greater than or equal to expected header length tun->vnet_hdr_sz before copying. Read this value once and cache locally, as it can be updated between the test and use (TOCTOU). Signed-off-by: Willem de Bruijn <willemb@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> CC: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receivingJason Wang1-1/+1
Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too, fixing this by adding a hint (has_data_valid) and set it only on the receiving path. Cc: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19tun: rx batchingJason Wang1-6/+70
We can only process 1 packet at one time during sendmsg(). This often lead bad cache utilization under heavy load. So this patch tries to do some batching during rx before submitting them to host network stack. This is done through accepting MSG_MORE as a hint from sendmsg() caller, if it was set, batch the packet temporarily in a linked list and submit them all once MSG_MORE were cleared. Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host: Mpps -+% rx-frames = 0 0.91 +0% rx-frames = 4 1.00 +9.8% rx-frames = 8 1.00 +9.8% rx-frames = 16 1.01 +10.9% rx-frames = 32 1.07 +17.5% rx-frames = 48 1.07 +17.5% rx-frames = 64 1.08 +18.6% rx-frames = 64 (no MSG_MORE) 0.91 +0% User were allowed to change per device batched packets through ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation to prevent bh from being disabled too long. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09net: make ndo_get_stats64 a void functionstephen hemminger1-2/+1
The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-16Merge branch 'for-linus' of ↵Linus Torvalds1-5/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: - more ->d_init() stuff (work.dcache) - pathname resolution cleanups (work.namei) - a few missing iov_iter primitives - copy_from_iter_full() and friends. Either copy the full requested amount, advance the iterator and return true, or fail, return false and do _not_ advance the iterator. Quite a few open-coded callers converted (and became more readable and harder to fuck up that way) (work.iov_iter) - several assorted patches, the big one being logfs removal * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: logfs: remove from tree vfs: fix put_compat_statfs64() does not handle errors namei: fold should_follow_link() with the step into not-followed link namei: pass both WALK_GET and WALK_MORE to should_follow_link() namei: invert WALK_PUT logics namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link() namei: saner calling conventions for mountpoint_last() namei.c: get rid of user_path_parent() switch getfrag callbacks to ..._full() primitives make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success [iov_iter] new primitives - copy_from_iter_full() and friends don't open-code file_inode() ceph: switch to use of ->d_init() ceph: unify dentry_operations instances lustre: switch to use of ->d_init()
2016-12-07tun: Use netif_receive_skb instead of netif_rxAndrey Konovalov1-0/+6
This patch changes tun.c to call netif_receive_skb instead of netif_rx when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid stack exhaustion). The difference between the two is that netif_rx queues the packet into the backlog, and netif_receive_skb proccesses the packet in the current context. This patch is required for syzkaller [1] to collect coverage from packet receive paths, when a packet being received through tun (syzkaller collects coverage per process in the process context). As mentioned by Eric this change also speeds up tun/tap. As measured by Peter it speeds up his closed-loop single-stream tap/OVS benchmark by about 23%, from 700k packets/second to 867k packets/second. A similar patch was introduced back in 2010 [2, 3], but the author found out that the patch doesn't help with the task he had in mind (for cgroups to shape network traffic based on the original process) and decided not to go further with it. The main concern back then was about possible stack exhaustion with 4K stacks. [1] https://github.com/google/syzkaller [2] https://www.spinics.net/lists/netdev/thrd440.html#130570 [3] https://www.spinics.net/lists/netdev/msg130570.html Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05[iov_iter] new primitives - copy_from_iter_full() and friendsAl Viro1-5/+2
copy_from_iter_full(), copy_from_iter_full_nocache() and csum_and_copy_from_iter_full() - counterparts of copy_from_iter() et.al., advancing iterator only in case of successful full copy and returning whether it had been successful or not. Convert some obvious users. *NOTE* - do not blindly assume that something is a good candidate for those unless you are sure that not advancing iov_iter in failure case is the right thing in this case. Anything that does short read/short write kind of stuff (or is in a loop, etc.) is unlikely to be a good one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-6/+4
Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30tun: handle ubuf refcount correctly when meet errorsJason Wang1-6/+4
We trigger uarg->callback() immediately after we decide do datacopy even if caller want to do zerocopy. This will cause the callback (vhost_net_zerocopy_callback) decrease the refcount. But when we meet an error afterwards, the error handling in vhost handle_tx() will try to decrease it again. This is wrong and fix this by delay the uarg->callback() until we're sure there's no errors. Reported-by: wangyunjian <wangyunjian@huawei.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25tuntap: remove unnecessary sk_receive_queue length check during xmitJason Wang1-7/+0
After commit 1576d9860599 ("tun: switch to use skb array for tx"), sk_receive_queue was not used any more. So remove the uncessary sk_receive_queue length check during xmit. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19virtio_net: Do not clear memory for struct virtio_net_hdr twice.Jarno Rajahalme1-2/+1
virtio_net_hdr_from_skb() clears the memory for the header, so there is no point for the callers to do the same. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb().Jarno Rajahalme1-5/+3
No point storing the return value of virtio_net_hdr_to_skb() or virtio_net_hdr_from_skb() to a variable when the value is used only once as a boolean in an immediately following if statement. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31driver: tun: Use new macro SOCK_IOC_TYPE instead of literal number 0x89Gao Feng1-1/+1
The current codes use _IOC_TYPE(cmd) == 0x89 to check if the cmd is one socket ioctl command like SIOCGIFHWADDR. But the literal number 0x89 may confuse readers. So create one macro SOCK_IOC_TYPE to enhance the readability. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29driver: tun: Move tun check into the block of TUNSETIFF condition checkGao Feng1-1/+5
When cmd is TUNSETIFF and tun is not null, the original codes go ahead, then reach the default case of switch(cmd) and set the ret is -EINVAL. It is not clear for readers. Now move the tun check into the block of TUNSETIFF condition check, and return -EEXIST instead of -EINVAL when the tfile already owns one tun. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20net: use core MTU range checking in core net infraJarod Wilson1-14/+6
geneve: - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu - This one isn't quite as straight-forward as others, could use some closer inspection and testing macvlan: - set min/max_mtu tun: - set min/max_mtu, remove tun_net_change_mtu vxlan: - Merge __vxlan_change_mtu back into vxlan_change_mtu - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in change_mtu function - This one is also not as straight-forward and could use closer inspection and testing from vxlan folks bridge: - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in change_mtu function openvswitch: - set min/max_mtu, remove internal_dev_change_mtu - note: max_mtu wasn't checked previously, it's been set to 65535, which is the largest possible size supported sch_teql: - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535) macsec: - min_mtu = 0, max_mtu = 65535 macvlan: - min_mtu = 0, max_mtu = 65535 ntb_netdev: - min_mtu = 0, max_mtu = 65535 veth: - min_mtu = 68, max_mtu = 65535 8021q: - min_mtu = 0, max_mtu = 65535 CC: netdev@vger.kernel.org CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> CC: Tom Herbert <tom@herbertland.com> CC: Daniel Borkmann <daniel@iogearbox.net> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Paolo Abeni <pabeni@redhat.com> CC: Jiri Benc <jbenc@redhat.com> CC: WANG Cong <xiyou.wangcong@gmail.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Pravin B Shelar <pshelar@ovn.org> CC: Sabrina Dubroca <sd@queasysnail.net> CC: Patrick McHardy <kaber@trash.net> CC: Stephen Hemminger <stephen@networkplumber.org> CC: Pravin Shelar <pshelar@nicira.com> CC: Maxim Krasnyansky <maxk@qti.qualcomm.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-5/+1
All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-24tun: fix transmit timestamp supportSoheil Hassas Yeganeh1-5/+1
Instead of using sock_tx_timestamp, use skb_tx_timestamp to record software transmit timestamp of a packet. sock_tx_timestamp resets and overrides the tx_flags of the skb. The function is intended to be called from within the protocol layer when creating the skb, not from a device driver. This is inconsistent with other drivers and will cause issues for TCP. In TCP, we intend to sample the timestamps for the last byte for each sendmsg/sendpage. For that reason, tcp_sendmsg calls tcp_tx_timestamp only with the last skb that it generates. For example, if a 128KB message is split into two 64KB packets we want to sample the SND timestamp of the last packet. The current code in the tun driver, however, will result in sampling the SND timestamp for both packets. Also, when the last packet is split into smaller packets for retranmission (see tcp_fragment), the tun driver will record timestamps for all of the retransmitted packets and not only the last packet. Fixes: eda297729171 (tun: Support software transmit time stamping.) Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Francis Yan <francisyyan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-21tun: Rename a jump label in update_filter()Markus Elfring1-3/+2
Adjust a jump target according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-21tun: Use memdup_user() rather than duplicating its implementationMarkus Elfring1-8/+3
Reuse existing functionality from memdup_user() instead of keeping duplicate source code. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09tun: Don't assume type tun in tun_device_eventCraig Gallek1-0/+3
The referenced change added a netlink notifier for processing device queue size events. These events are fired for all devices but the registered callback assumed they only occurred for tun devices. This fix adds a check (borrowed from macvtap.c) to discard non-tun device events. For reference, this fixes the following splat: [ 71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 71.513870] IP: [<ffffffff8153c1a0>] tun_device_event+0x110/0x340 [ 71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0 [ 71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [ 71.529374] gsmi: Log Shutdown Reason 0x03 [ 71.533417] Modules linked in:[ 71.533826] mlx4_en: eth1: Link Up [ 71.539616] bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core [ 71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8 [ 71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016 [ 71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000 [ 71.571894] RIP: 0010:[<ffffffff8153c1a0>] [<ffffffff8153c1a0>] tun_device_event+0x110/0x340 [ 71.580327] RSP: 0018:ffff887f0079bbd8 EFLAGS: 00010202 [ 71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000 [ 71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000 [ 71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001 [ 71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010 [ 71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00 [ 71.620832] FS: 00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000 [ 71.628826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0 [ 71.641549] Stack: [ 71.643533] ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000 [ 71.650871] ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000 [ 71.658210] ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0 [ 71.665549] Call Trace: [ 71.667975] [<ffffffff810eeb0d>] notifier_call_chain+0x5d/0x80 [ 71.673823] [<ffffffff816365d0>] ? show_tx_maxrate+0x30/0x30 [ 71.679502] [<ffffffff810eeb3e>] __raw_notifier_call_chain+0xe/0x10 [ 71.685778] [<ffffffff810eeb56>] raw_notifier_call_chain+0x16/0x20 [ 71.691976] [<ffffffff8160eb30>] call_netdevice_notifiers_info+0x40/0x70 [ 71.698681] [<ffffffff8160ec36>] call_netdevice_notifiers+0x16/0x20 [ 71.704956] [<ffffffff81636636>] change_tx_queue_len+0x66/0x90 [ 71.710807] [<ffffffff816381ef>] netdev_store.isra.5+0xbf/0xd0 [ 71.716658] [<ffffffff81638350>] tx_queue_len_store+0x50/0x60 [ 71.722431] [<ffffffff814a6798>] dev_attr_store+0x18/0x30 [ 71.727857] [<ffffffff812ea3ff>] sysfs_kf_write+0x4f/0x70 [ 71.733274] [<ffffffff812e9507>] kernfs_fop_write+0x147/0x1d0 [ 71.739045] [<ffffffff81134a4f>] ? rcu_read_lock_sched_held+0x8f/0xa0 [ 71.745499] [<ffffffff8125a108>] __vfs_write+0x28/0x120 [ 71.750748] [<ffffffff8111b137>] ? percpu_down_read+0x57/0x90 [ 71.756516] [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0 [ 71.762278] [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0 [ 71.768038] [<ffffffff8125bd5e>] vfs_write+0xbe/0x1b0 [ 71.773113] [<ffffffff8125c092>] SyS_write+0x52/0xa0 [ 71.778110] [<ffffffff817528e5>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 <49> 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff [ 71.803655] RIP [<ffffffff8153c1a0>] tun_device_event+0x110/0x340 [ 71.809769] RSP <ffff887f0079bbd8> [ 71.813213] CR2: 0000000000000010 [ 71.816512] ---[ end trace 4db6449606319f73 ]--- Fixes: 1576d9860599 ("tun: switch to use skb array for tx") Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05tun: fix build warningsJason Wang1-3/+5
Stephen Rothwell reports a build warnings(powerpc ppc64_defconfig) drivers/net/tun.c: In function 'tun_do_read.part.5': /home/sfr/next/next/drivers/net/tun.c:1491:6: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] int err; This is because tun_ring_recv() may return an uninitialized err, fix this. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01tun: switch to use skb array for txJason Wang1-8/+130
We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by: - switch from sk_receive_queue to a skb_array, and resize it when tx_queue_len was changed. - introduce a new proto_ops peek_len which was used for peeking the skb length. - implement a tun version of peek_len for vhost_net to use and convert vhost_net to use peek_len if possible. Pktgen test shows about 15.3% improvement on guest receiving pps for small buffers: Before: ~1300000pps After : ~1500000pps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16tun: fix csum generation for tap devicesPaolo Abeni1-7/+7
The commit 34166093639b ("tuntap: use common code for virtio_net_hdr and skb GSO conversion") replaced the tun code for header manipulation with the generic helpers. While doing so, it implictly moved the skb_partial_csum_set() invocation after eth_type_trans(), which invalidate the current gso start/offset values. Fix it by moving the helper invocation before the mac pulling. Fixes: 34166093639 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11tuntap: use common code for virtio_net_hdr and skb GSO conversionMike Rapoport1-76/+21
Replace open coded conversion between virtio_net_hdr to skb GSO info with virtio_net_hdr_{from,to}_skb Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-21tuntap: correctly wake up process during uninitJason Wang1-3/+3
We used to check dev->reg_state against NETREG_REGISTERED after each time we are woke up. But after commit 9e641bdcfa4e ("net-tun: restructure tun_do_read for better sleep/wakeup efficiency"), it uses skb_recv_datagram() which does not check dev->reg_state. This will result if we delete a tun/tap device after a process is blocked in the reading. The device will wait for the reference count which was held by that process for ever. Fixes this by using RCV_SHUTDOWN which will be checked during sk_recv_datagram() before trying to wake up the process during uninit. Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better sleep/wakeup efficiency") Cc: Eric Dumazet <edumazet@google.com> Cc: Xi Wang <xii@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28tuntap: calculate rps hash only when neededJason Wang1-1/+3
There's no need to calculate rps hash if it was not enabled. So this patch export rps_needed and check it before trying to get rps hash. Tests (using pktgen to inject packets to guest) shows this can improve pps about 13% (when rps is disabled). Before: ~1150000 pps After: ~1300000 pps Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> ---- Changes from V1: - Fix build when CONFIG_RPS is not set Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18tun: don't require serialization lock on txPaolo Abeni1-1/+1
The current tun_net_xmit() implementation don't need any external lock since it relies on rcu protection for the tun data structure and on socket queue lock for skb queuing. This patch set the NETIF_F_LLTX feature bit in the tun device, so that on xmit, in absence of qdisc, no serialization lock is acquired by the caller. The user space can remove the default tun qdisc with: tc qdisc replace dev <tun device name> root noqueue Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tun: use per cpu variables for stats accountingPaolo Abeni1-12/+83
Currently the tun device accounting uses dev->stats without applying any kind of protection, regardless that accounting happens in preemptible process context. This patch move the tun stats to a per cpu data structure, and protect the updates with u64_stats_update_begin()/u64_stats_update_end() or this_cpu_inc according to the stat type. The per cpu stats are aggregated by the newly added ndo_get_stats64 ops. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+2
2016-04-08tuntap: restore default qdiscJason Wang1-2/+2
After commit f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev"), default qdisc was changed to noqueue because tuntap does not set tx_queue_len during .setup(). This patch restores default qdisc by setting tx_queue_len in tun_setup(). Fixes: f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev") Cc: Phil Sutter <phil@nwl.cc> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07tun: use socket locks for sk_{attach,detatch}_filterHannes Frederic Sowa1-5/+9
This reverts commit 5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter") and replaces it to use lock_sock around sk_{attach,detach}_filter. The checks inside filter.c are updated with lockdep_sock_is_held to check for proper socket locks. It keeps the code cleaner by ensuring that only one lock governs the socket filter instead of two independent locks. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: enable timestamping using control messagesSoheil Hassas Yeganeh1-1/+2
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-01tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filterDaniel Borkmann1-3/+5
Sasha Levin reported a suspicious rcu_dereference_protected() warning found while fuzzing with trinity that is similar to this one: [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage! [ 52.765688] other info that might help us debug this: [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1 [ 52.765701] 1 lock held by a.out/1525: [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20 [ 52.765721] stack backtrace: [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264 [...] [ 52.765768] Call Trace: [ 52.765775] [<ffffffff813e488d>] dump_stack+0x85/0xc8 [ 52.765784] [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110 [ 52.765792] [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90 [ 52.765801] [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun] [ 52.765810] [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun] [ 52.765818] [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210 [ 52.765827] [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun] [ 52.765834] [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690 [ 52.765843] [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60 [ 52.765850] [<ffffffff81261519>] SyS_ioctl+0x79/0x90 [ 52.765858] [<ffffffff81003ba2>] do_syscall_64+0x62/0x140 [ 52.765866] [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25 Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH, DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices. Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the filter with rcu_dereference_protected(), checking whether socket lock is held in control path. Since its introduction in 994051625981 ("tun: socket filter support"), tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this specific case and therefore triggers the false positive. Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair that is used by tap filters and pass in lockdep_rtnl_is_held() for the rcu_dereference_protected() checks instead. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01net/tun: implement ndo_set_rx_headroomPaolo Abeni1-1/+16
ndo_set_rx_headroom controls the align value used by tun devices to allocate skbs on frame reception. When the xmit device adds a large encapsulation, this avoids an skb head reallocation on forwarding. The measured improvement when forwarding towards a vxlan dev with frame size below the egress device MTU is as follow: vxlan over ipv6, bridged: +6% vxlan over ipv6, ovs: +7% In case of ipv4 tunnels there is no improvement, since the tun device default alignment provides enough headroom to avoid the skb head reallocation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17tun: honor IFF_UP in tun_get_user()Eric Dumazet1-0/+3
If a tun interface is turned down, we should not allow packet injection into the kernel. Kernel does not send packets to the tun already. TUNATTACHFILTER can not be used as only tun_net_xmit() is taking care of it. Reported-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATAEric Dumazet1-2/+2
This patch is a cleanup to make following patch easier to review. Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA from (struct socket)->flags to a (struct socket_wq)->flags to benefit from RCU protection in sock_wake_async() To ease backports, we rename both constants. Two new helpers, sk_set_bit(int nr, struct sock *sk) and sk_clear_bit(int net, struct sock *sk) are added so that following patch can change their implementation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13tun: use sk_fullsock() before reading sk->sk_tsflagsEric Dumazet1-1/+1
timewait or request sockets are small and do not contain sk->sk_tsflags Without this fix, we might read garbage, and crash later in __skb_complete_tx_timestamp() -> sock_queue_err_skb() (These pseudo sockets do not have an error queue either) Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-04tuntap: Don't segment multiple tagged packets on tap deviceToshiaki Makita1-0/+1
Tap devices don't need to segment multiple tagged packets. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-04Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-2/+65
Pull virtio/vhost cross endian support from Michael Tsirkin: "I have just queued some more bugfix patches today but none fix regressions and none are related to these ones, so it looks like a good time for a merge for -rc1. The motivation for this is support for legacy BE guests on the new LE hosts. There are two redeeming properties that made me merge this: - It's a trivial amount of code: since we wrap host/guest accesses anyway, almost all of it is well hidden from drivers. - Sane platforms would never set flags like VHOST_CROSS_ENDIAN_LEGACY, and when it's clear, there's zero overhead (as some point it was tested by compiling with and without the patches, got the same stripped binary). Maybe we could create a Kconfig symbol to enforce the second point: prevent people from enabling it eg on x86. I will look into this" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio-pci: alloc only resources actually used. macvtap/tun: cross-endian support for little-endian hosts vhost: cross-endian support for legacy devices virtio: add explicit big-endian support to memory accessors vhost: introduce vhost_is_little_endian() helper vringh: introduce vringh_is_little_endian() helper macvtap: introduce macvtap_is_little_endian() helper tun: add tun_is_little_endian() helper virtio: introduce virtio_is_little_endian() helper
2015-06-01macvtap/tun: cross-endian support for little-endian hostsGreg Kurz1-1/+58
The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers that are always little-endian. It can also be used to handle the special case of a legacy little-endian device implemented by a big-endian host. Let's add a flag and ioctls for big-endian devices as well. If both flags are set, little-endian wins. Since this is isn't a common usecase, the feature is controlled by a kernel config option (not set by default). Both macvtap and tun are covered by this patch since they share the same API with userland. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2015-06-01virtio: add explicit big-endian support to memory accessorsGreg Kurz1-1/+2
The current memory accessors logic is: - little endian if little_endian - native endian (i.e. no byteswap) if !little_endian If we want to fully support cross-endian vhost, we also need to be able to convert to big endian. Instead of changing the little_endian argument to some 3-value enum, this patch changes the logic to: - little endian if little_endian - big endian if !little_endian The native endian case is handled by all users with a trivial helper. This patch doesn't change any functionality, nor it does add overhead. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2015-06-01tun: add tun_is_little_endian() helperGreg Kurz1-2/+7
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2015-05-11net: Pass kern from net_proto_family.create to sk_allocEric W. Biederman1-1/+1
In preparation for changing how struct net is refcounted on kernel sockets pass the knowledge that we are creating a kernel socket from sock_create_kern through to sk_alloc. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11tun: Utilize the normal socket network namespace refcounting.Eric W. Biederman1-20/+4
There is no need for tun to do the weird network namespace refcounting. The existing network namespace refcounting in tfile has almost exactly the same lifetime. So rewrite the code to use the struct sock network namespace refcounting and remove the unnecessary hand rolled network namespace refcounting and the unncesary tfile->net. This change allows the tun code to directly call sock_put bypassing sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary. Remove the now unncessary tun_release so that if anything tries to use the sock_release code path the kernel will oops, and let us know about the bug. The macvtap code already uses it's internal socket this way. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12make new_sync_{read,write}() staticAl Viro1-2/+0
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-02net: Remove iocb argument from sendmsg and recvmsgYing Xue1-4/+2
After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09net: rfs: add hash collision detectionEric Dumazet1-4/+1
Receive Flow Steering is a nice solution but suffers from hash collisions when a mix of connected and unconnected traffic is received on the host, when flow hash table is populated. Also, clearing flow in inet_release() makes RFS not very good for short lived flows, as many packets can follow close(). (FIN , ACK packets, ...) This patch extends the information stored into global hash table to not only include cpu number, but upper part of the hash value. I use a 32bit value, and dynamically split it in two parts. For host with less than 64 possible cpus, this gives 6 bits for the cpu number, and 26 (32-6) bits for the upper part of the hash. Since hash bucket selection use low order bits of the hash, we have a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big enough. If the hash found in flow table does not match, we fallback to RPS (if it is enabled for the rxqueue). This means that a packet for an non connected flow can avoid the IPI through a unrelated/victim CPU. This also means we no longer have to clear the table at socket close time, and this helps short lived flows performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>