<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/sock.h, branch v3.12.13</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v3.12.13</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v3.12.13'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2013-10-07T19:16:45+00:00</updated>
<entry>
<title>net: fix unsafe set_memory_rw from softirq</title>
<updated>2013-10-07T19:16:45+00:00</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@plumgrid.com</email>
</author>
<published>2013-10-04T07:14:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d45ed4a4e33ae103053c0a53d280014e7101bb5c'/>
<id>urn:sha1:d45ed4a4e33ae103053c0a53d280014e7101bb5c</id>
<content type='text'>
on x86 system with net.core.bpf_jit_enable = 1

sudo tcpdump -i eth1 'tcp port 22'

causes the warning:
[   56.766097]  Possible unsafe locking scenario:
[   56.766097]
[   56.780146]        CPU0
[   56.786807]        ----
[   56.793188]   lock(&amp;(&amp;vb-&gt;lock)-&gt;rlock);
[   56.799593]   &lt;Interrupt&gt;
[   56.805889]     lock(&amp;(&amp;vb-&gt;lock)-&gt;rlock);
[   56.812266]
[   56.812266]  *** DEADLOCK ***
[   56.812266]
[   56.830670] 1 lock held by ksoftirqd/1/13:
[   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [&lt;ffffffff8118f44c&gt;] vm_unmap_aliases+0x8c/0x380
[   56.849757]
[   56.849757] stack backtrace:
[   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
[   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
[   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
[   56.923006] Call Trace:
[   56.929532]  [&lt;ffffffff8175a145&gt;] dump_stack+0x55/0x76
[   56.936067]  [&lt;ffffffff81755b14&gt;] print_usage_bug+0x1f7/0x208
[   56.942445]  [&lt;ffffffff8101178f&gt;] ? save_stack_trace+0x2f/0x50
[   56.948932]  [&lt;ffffffff810cc0a0&gt;] ? check_usage_backwards+0x150/0x150
[   56.955470]  [&lt;ffffffff810ccb52&gt;] mark_lock+0x282/0x2c0
[   56.961945]  [&lt;ffffffff810ccfed&gt;] __lock_acquire+0x45d/0x1d50
[   56.968474]  [&lt;ffffffff810cce6e&gt;] ? __lock_acquire+0x2de/0x1d50
[   56.975140]  [&lt;ffffffff81393bf5&gt;] ? cpumask_next_and+0x55/0x90
[   56.981942]  [&lt;ffffffff810cef72&gt;] lock_acquire+0x92/0x1d0
[   56.988745]  [&lt;ffffffff8118f52a&gt;] ? vm_unmap_aliases+0x16a/0x380
[   56.995619]  [&lt;ffffffff817628f1&gt;] _raw_spin_lock+0x41/0x50
[   57.002493]  [&lt;ffffffff8118f52a&gt;] ? vm_unmap_aliases+0x16a/0x380
[   57.009447]  [&lt;ffffffff8118f52a&gt;] vm_unmap_aliases+0x16a/0x380
[   57.016477]  [&lt;ffffffff8118f44c&gt;] ? vm_unmap_aliases+0x8c/0x380
[   57.023607]  [&lt;ffffffff810436b0&gt;] change_page_attr_set_clr+0xc0/0x460
[   57.030818]  [&lt;ffffffff810cfb8d&gt;] ? trace_hardirqs_on+0xd/0x10
[   57.037896]  [&lt;ffffffff811a8330&gt;] ? kmem_cache_free+0xb0/0x2b0
[   57.044789]  [&lt;ffffffff811b59c3&gt;] ? free_object_rcu+0x93/0xa0
[   57.051720]  [&lt;ffffffff81043d9f&gt;] set_memory_rw+0x2f/0x40
[   57.058727]  [&lt;ffffffff8104e17c&gt;] bpf_jit_free+0x2c/0x40
[   57.065577]  [&lt;ffffffff81642cba&gt;] sk_filter_release_rcu+0x1a/0x30
[   57.072338]  [&lt;ffffffff811108e2&gt;] rcu_process_callbacks+0x202/0x7c0
[   57.078962]  [&lt;ffffffff81057f17&gt;] __do_softirq+0xf7/0x3f0
[   57.085373]  [&lt;ffffffff81058245&gt;] run_ksoftirqd+0x35/0x70

cannot reuse jited filter memory, since it's readonly,
so use original bpf insns memory to hold work_struct

defer kfree of sk_filter until jit completed freeing

tested on x86_64 and i386

Signed-off-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>vxlan: Use RCU apis to access sk_user_data.</title>
<updated>2013-09-30T18:22:59+00:00</updated>
<author>
<name>Pravin B Shelar</name>
<email>pshelar@nicira.com</email>
</author>
<published>2013-09-24T17:25:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=559835ea7292e2f09304d81eda16f4209433245e'/>
<id>urn:sha1:559835ea7292e2f09304d81eda16f4209433245e</id>
<content type='text'>
Use of RCU api makes vxlan code easier to understand.  It also
fixes bug due to missing ACCESS_ONCE() on sk_user_data dereference.
In rare case without ACCESS_ONCE() compiler might omit vs on
sk_user_data dereference.
Compiler can use vs as alias for sk-&gt;sk_user_data, resulting in
multiple sk_user_data dereference in rcu read context which
could change.

CC: Jesse Gross &lt;jesse@nicira.com&gt;
Signed-off-by: Pravin B Shelar &lt;pshelar@nicira.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: TSO packets automatic sizing</title>
<updated>2013-08-29T19:50:06+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-08-27T12:46:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=95bd09eb27507691520d39ee1044d6ad831c1168'/>
<id>urn:sha1:95bd09eb27507691520d39ee1044d6ad831c1168</id>
<content type='text'>
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt

v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp-&gt;xmit_size_goal_segs

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Cc: Van Jacobson &lt;vanj@google.com&gt;
Cc: Tom Herbert &lt;therbert@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: attempt high order allocations in sock_alloc_send_pskb()</title>
<updated>2013-08-10T08:16:44+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-08-08T21:38:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=28d6427109d13b0f447cba5761f88d3548e83605'/>
<id>urn:sha1:28d6427109d13b0f447cba5761f88d3548e83605</id>
<content type='text'>
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    57981.11

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2013-08-04T04:36:46+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2013-08-04T04:36:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0e76a3a587fc7abda2badf249053b427baad255e'/>
<id>urn:sha1:0e76a3a587fc7abda2badf249053b427baad255e</id>
<content type='text'>
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL</title>
<updated>2013-08-01T22:11:17+00:00</updated>
<author>
<name>Cong Wang</name>
<email>amwang@redhat.com</email>
</author>
<published>2013-08-01T03:10:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e0d1095ae3405404d247afb00233ef837d58da83'/>
<id>urn:sha1:e0d1095ae3405404d247afb00233ef837d58da83</id>
<content type='text'>
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir &lt;eliezer.tamir@linux.intel.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Cong Wang &lt;amwang@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netem: Introduce skb_orphan_partial() helper</title>
<updated>2013-07-31T21:59:49+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-07-31T00:55:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f2f872f9272a79a1048877ea14c15576f46c225e'/>
<id>urn:sha1:f2f872f9272a79a1048877ea14c15576f46c225e</id>
<content type='text'>
Commit 547669d483e578 ("tcp: xps: fix reordering issues") added
unexpected reorders in case netem is used in a MQ setup for high
performance test bed.

ETH=eth0
tc qd del dev $ETH root 2&gt;/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 32`
do
 tc qd add dev $ETH parent 1:$i netem delay 100ms
done

As all tcp packets are orphaned by netem, TCP stack believes it can
set skb-&gt;ooo_okay on all packets.

In order to allow producers to send more packets, we want to
keep sk_wmem_alloc from reaching sk_sndbuf limit.

We can do that by accounting one byte per skb in netem queues,
so that TCP stack is not fooled too much.

Tested:

With above MQ/netem setup, scaling number of concurrent flows gives
linear results and no reorders/retransmits

lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
 do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
n:1 198.46
n:10 2002.69
n:20 4000.98
n:30 6006.35
n:40 8020.93
n:50 10032.3
n:60 12081.9
n:70 13971.3
n:80 16009.7
n:90 17117.3
n:100 17425.5

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: TCP_NOTSENT_LOWAT socket option</title>
<updated>2013-07-25T00:54:48+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-07-23T03:27:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36'/>
<id>urn:sha1:c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36</id>
<content type='text'>
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp-&gt;nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 &gt;/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &amp;); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 &gt;/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &amp;); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 &gt;/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 &gt;/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-By: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: add sk_stream_is_writeable() helper</title>
<updated>2013-07-25T00:54:48+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2013-07-23T03:26:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=64dc61306ce7da370833289739e2f52dfc6b37ba'/>
<id>urn:sha1:64dc61306ce7da370833289739e2f52dfc6b37ba</id>
<content type='text'>
Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) &gt;= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Neal Cardwell &lt;ncardwell@google.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: Provide a generic socket error queue delivery method for Tx time stamps.</title>
<updated>2013-07-22T21:58:19+00:00</updated>
<author>
<name>Richard Cochran</name>
<email>richardcochran@gmail.com</email>
</author>
<published>2013-07-19T17:40:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cb820f8e4b7f73d1a32175e6591735b25bb5398d'/>
<id>urn:sha1:cb820f8e4b7f73d1a32175e6591735b25bb5398d</id>
<content type='text'>
This patch moves the private error queue delivery function from the
af_packet code to the core socket method. In this way, network layers
only needing the error queue for transmit time stamping can share common
code.

Signed-off-by: Richard Cochran &lt;richardcochran@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
