<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/sctp, branch v3.14.43</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v3.14.43</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v3.14.43'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2014-11-21T17:23:06+00:00</updated>
<entry>
<title>net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks</title>
<updated>2014-11-21T17:23:06+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>dborkman@redhat.com</email>
</author>
<published>2014-10-09T20:55:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e36b6ac9e011205eb7ad3af329dbd27a21bacd50'/>
<id>urn:sha1:e36b6ac9e011205eb7ad3af329dbd27a21bacd50</id>
<content type='text'>
commit 9de7922bc709eee2f609cd01d98aaedc4cf5ea74 upstream.

Commit 6f4c618ddb0 ("SCTP : Add paramters validity check for
ASCONF chunk") added basic verification of ASCONF chunks, however,
it is still possible to remotely crash a server by sending a
specially crafted ASCONF chunk, even on pre-2.6.12 kernels:

skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
 head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
 end:0x440 dev:&lt;NULL&gt;
 ------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:129!
[...]
Call Trace:
 &lt;IRQ&gt;
 [&lt;ffffffff8144fb1c&gt;] skb_put+0x5c/0x70
 [&lt;ffffffffa01ea1c3&gt;] sctp_addto_chunk+0x63/0xd0 [sctp]
 [&lt;ffffffffa01eadaf&gt;] sctp_process_asconf+0x1af/0x540 [sctp]
 [&lt;ffffffff8152d025&gt;] ? _read_unlock_bh+0x15/0x20
 [&lt;ffffffffa01e0038&gt;] sctp_sf_do_asconf+0x168/0x240 [sctp]
 [&lt;ffffffffa01e3751&gt;] sctp_do_sm+0x71/0x1210 [sctp]
 [&lt;ffffffff8147645d&gt;] ? fib_rules_lookup+0xad/0xf0
 [&lt;ffffffffa01e6b22&gt;] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
 [&lt;ffffffffa01e8393&gt;] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
 [&lt;ffffffffa01ee986&gt;] sctp_inq_push+0x56/0x80 [sctp]
 [&lt;ffffffffa01fcc42&gt;] sctp_rcv+0x982/0xa10 [sctp]
 [&lt;ffffffffa01d5123&gt;] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
 [&lt;ffffffff8148bdc9&gt;] ? nf_iterate+0x69/0xb0
 [&lt;ffffffff81496d10&gt;] ? ip_local_deliver_finish+0x0/0x2d0
 [&lt;ffffffff8148bf86&gt;] ? nf_hook_slow+0x76/0x120
 [&lt;ffffffff81496d10&gt;] ? ip_local_deliver_finish+0x0/0x2d0
 [&lt;ffffffff81496ded&gt;] ip_local_deliver_finish+0xdd/0x2d0
 [&lt;ffffffff81497078&gt;] ip_local_deliver+0x98/0xa0
 [&lt;ffffffff8149653d&gt;] ip_rcv_finish+0x12d/0x440
 [&lt;ffffffff81496ac5&gt;] ip_rcv+0x275/0x350
 [&lt;ffffffff8145c88b&gt;] __netif_receive_skb+0x4ab/0x750
 [&lt;ffffffff81460588&gt;] netif_receive_skb+0x58/0x60

This can be triggered e.g., through a simple scripted nmap
connection scan injecting the chunk after the handshake, for
example, ...

  -------------- INIT[ASCONF; ASCONF_ACK] -------------&gt;
  &lt;----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
  -------------------- COOKIE-ECHO --------------------&gt;
  &lt;-------------------- COOKIE-ACK ---------------------
  ------------------ ASCONF; UNKNOWN ------------------&gt;

... where ASCONF chunk of length 280 contains 2 parameters ...

  1) Add IP address parameter (param length: 16)
  2) Add/del IP address parameter (param length: 255)

... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
Address Parameter in the ASCONF chunk is even missing, too.
This is just an example and similarly-crafted ASCONF chunks
could be used just as well.

The ASCONF chunk passes through sctp_verify_asconf() as all
parameters passed sanity checks, and after walking, we ended
up successfully at the chunk end boundary, and thus may invoke
sctp_process_asconf(). Parameter walking is done with
WORD_ROUND() to take padding into account.

In sctp_process_asconf()'s TLV processing, we may fail in
sctp_process_asconf_param() e.g., due to removal of the IP
address that is also the source address of the packet containing
the ASCONF chunk, and thus we need to add all TLVs after the
failure to our ASCONF response to remote via helper function
sctp_add_asconf_response(), which basically invokes a
sctp_addto_chunk() adding the error parameters to the given
skb.

When walking to the next parameter this time, we proceed
with ...

  length = ntohs(asconf_param-&gt;param_hdr.length);
  asconf_param = (void *)asconf_param + length;

... instead of the WORD_ROUND()'ed length, thus resulting here
in an off-by-one that leads to reading the follow-up garbage
parameter length of 12336, and thus throwing an skb_over_panic
for the reply when trying to sctp_addto_chunk() next time,
which implicitly calls the skb_put() with that length.
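
As an illustration only (this is not the kernel code), the following
Python sketch walks the two example parameters with and without
4-byte padding. The helper names are hypothetical; the lengths 16
and 255 are the ones from the example chunk above.

```python
# Hypothetical sketch: offsets reached when walking TLV parameters.
def word_round(length):
    # Round up to the next multiple of 4, as WORD_ROUND() does.
    return (length + 3) // 4 * 4

def walk(lengths, padded):
    # Return the start offset of each parameter, plus the final offset.
    offsets = [0]
    for length in lengths:
        step = word_round(length) if padded else length
        offsets.append(offsets[-1] + step)
    return offsets

param_lengths = [16, 255]                    # from the example ASCONF chunk
padded_walk = walk(param_lengths, padded=True)
raw_walk = walk(param_lengths, padded=False)
print(padded_walk)   # [0, 16, 272]
print(raw_walk)      # [0, 16, 271]
```

The unpadded walk ends one byte short of the padded one, so the next
"parameter header" would be read from a misaligned position.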

Fix it by using sctp_walk_params() [ which is also used in
INIT parameter processing ] macro in the verification *and*
in ASCONF processing: it will make sure we don't spill over,
that we walk parameters WORD_ROUND()'ed. Moreover, we're being
more defensive and guard against unknown parameter types and
missized addresses.

Joint work with Vlad Yasevich.

Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
Signed-off-by: Daniel Borkmann &lt;dborkman@redhat.com&gt;
Signed-off-by: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Acked-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Josh Boyer &lt;jwboyer@fedoraproject.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>net: sctp: fix panic on duplicate ASCONF chunks</title>
<updated>2014-11-21T17:23:05+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>dborkman@redhat.com</email>
</author>
<published>2014-10-09T20:55:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=59ea8663e3a7fc3a0c2841e310b83f7aaec1c017'/>
<id>urn:sha1:59ea8663e3a7fc3a0c2841e310b83f7aaec1c017</id>
<content type='text'>
commit b69040d8e39f20d5215a03502a8e8b4c6ab78395 upstream.

When receiving e.g. a semi-well-formed connection scan in the
form of ...

  -------------- INIT[ASCONF; ASCONF_ACK] -------------&gt;
  &lt;----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
  -------------------- COOKIE-ECHO --------------------&gt;
  &lt;-------------------- COOKIE-ACK ---------------------
  ---------------- ASCONF_a; ASCONF_b -----------------&gt;

... where ASCONF_a equals ASCONF_b chunk (at least both serials
need to be equal), we panic an SCTP server!

The problem is that well-formed ASCONF chunks that we reply with
ASCONF_ACK chunks are cached per serial. Thus, when we receive the
same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
not need to process them again on the server side (that was the
idea, also proposed in the RFC). Instead, we know it was cached
and we just resend the cached chunk instead. So far, so good.

Where things get nasty is in SCTP's side effect interpreter, that
is, sctp_cmd_interpreter():

While incoming ASCONF_a (chunk = event_arg) is being marked
!end_of_packet and !singleton, and we have an association context,
we do not flush the outqueue the first time after processing the
ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
queued up, although we set local_cork to 1. Commit 2e3216cd54b1
changed the precedence, so that as long as we get bundled incoming
chunks, we try possible bundling on the outgoing queue as well. Before
this commit, we would just flush the output queue.

Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
continue to process the same ASCONF_b chunk from the packet. As
we have cached the previous ASCONF_ACK, we find it, grab it and
do another SCTP_CMD_REPLY command on it. So, effectively, we rip
the chunk-&gt;list pointers and requeue the same ASCONF_ACK chunk
another time. Since we process ASCONF_b, it's correctly marked
with end_of_packet and we enforce an uncork, and thus flush, thus
crashing the kernel.

Fix it by testing if the ASCONF_ACK is currently pending and if
that is the case, do not requeue it. When flushing the output
queue we may relink the chunk for preparing an outgoing packet,
but eventually unlink it when it's copied into the skb right
before transmission.
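
The idea of the fix can be modelled roughly as follows. This is a
simplified Python sketch, not the kernel implementation; the class
names, the "pending" flag and the list-based queue are assumptions
made purely for illustration.

```python
# Hypothetical model of a cached ASCONF_ACK and a corked output queue.
class AckChunk:
    def __init__(self, serial):
        self.serial = serial
        self.pending = False      # True while the chunk sits in the outqueue

class Assoc:
    def __init__(self):
        self.ack_cache = {}       # serial -> cached ASCONF_ACK
        self.outq = []            # corked output queue

    def handle_asconf(self, serial):
        ack = self.ack_cache.get(serial)
        if ack is None:
            ack = AckChunk(serial)
            self.ack_cache[serial] = ack
        if ack.pending:
            return False          # already queued: do not requeue it
        ack.pending = True
        self.outq.append(ack)
        return True

    def flush(self):
        sent = [ack.serial for ack in self.outq]
        for ack in self.outq:
            ack.pending = False   # unlinked once handed off for transmission
        self.outq.clear()
        return sent

assoc = Assoc()
first = assoc.handle_asconf(7)    # ASCONF_a
second = assoc.handle_asconf(7)   # ASCONF_b, same serial, queue still corked
sent = assoc.flush()
```

In this model the second ASCONF with the same serial is ignored while
the first reply is still queued, so the reply is linked into the queue
only once.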

Joint work with Vlad Yasevich.

Fixes: 2e3216cd54b1 ("sctp: Follow security requirement of responding with 1 packet")
Signed-off-by: Daniel Borkmann &lt;dborkman@redhat.com&gt;
Signed-off-by: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Josh Boyer &lt;jwboyer@fedoraproject.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>sctp: handle association restarts when the socket is closed.</title>
<updated>2014-10-15T06:36:42+00:00</updated>
<author>
<name>Vlad Yasevich</name>
<email>vyasevich@gmail.com</email>
</author>
<published>2014-10-03T22:16:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ce8c50393d4065f869cbb96ead8cad2c2586f20e'/>
<id>urn:sha1:ce8c50393d4065f869cbb96ead8cad2c2586f20e</id>
<content type='text'>
[ Upstream commit bdf6fa52f01b941d4a80372d56de465bdbbd1d23 ]

Currently association restarts do not take into consideration the
state of the socket.  When a restart happens, the current association
simply transitions into established state.  This creates a condition
where a remote system, through the restart procedure, may create a
local association that is in no way reachable by the user.  The conditions
to trigger this are as follows:
  1) Remote does not acknowledge some data causing data to remain
     outstanding.
  2) Local application calls close() on the socket.  Since data
     is still outstanding, the association is placed in SHUTDOWN_PENDING
     state.  However, the socket is closed.
  3) The remote tries to create a new association, triggering a restart
     on the local system.  The association moves from SHUTDOWN_PENDING
     to ESTABLISHED.  At this point, it is no longer reachable by
     any socket on the local system.

This patch addresses the above situation by moving the newly ESTABLISHED
association into SHUTDOWN-SENT state and bundling a SHUTDOWN after
the COOKIE-ACK chunk.  This way, the restarted association immediately
enters the shutdown procedure and forces the termination of the
unreachable association.

Reported-by: David Laight &lt;David.Laight@aculab.com&gt;
Signed-off-by: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>net: sctp: cache auth_enable per endpoint</title>
<updated>2014-05-31T20:20:34+00:00</updated>
<author>
<name>Vlad Yasevich</name>
<email>vyasevic@redhat.com</email>
</author>
<published>2014-04-17T15:26:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3938b0336a93fa5faa242dc9e5823ac69df9e066'/>
<id>urn:sha1:3938b0336a93fa5faa242dc9e5823ac69df9e066</id>
<content type='text'>
[ Upstream commit b14878ccb7fac0242db82720b784ab62c467c0dc ]

Currently, it is possible to create an SCTP socket, then switch
auth_enable via sysctl setting to 1 and crash the system on connect:

Oops[#1]:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
[...]
Call Trace:
[&lt;ffffffff8043c4e8&gt;] sctp_auth_asoc_set_default_hmac+0x68/0x80
[&lt;ffffffff8042b300&gt;] sctp_process_init+0x5e0/0x8a4
[&lt;ffffffff8042188c&gt;] sctp_sf_do_5_1B_init+0x234/0x34c
[&lt;ffffffff804228c8&gt;] sctp_do_sm+0xb4/0x1e8
[&lt;ffffffff80425a08&gt;] sctp_endpoint_bh_rcv+0x1c4/0x214
[&lt;ffffffff8043af68&gt;] sctp_rcv+0x588/0x630
[&lt;ffffffff8043e8e8&gt;] sctp6_rcv+0x10/0x24
[&lt;ffffffff803acb50&gt;] ip6_input+0x2c0/0x440
[&lt;ffffffff8030fc00&gt;] __netif_receive_skb_core+0x4a8/0x564
[&lt;ffffffff80310650&gt;] process_backlog+0xb4/0x18c
[&lt;ffffffff80313cbc&gt;] net_rx_action+0x12c/0x210
[&lt;ffffffff80034254&gt;] __do_softirq+0x17c/0x2ac
[&lt;ffffffff800345e0&gt;] irq_exit+0x54/0xb0
[&lt;ffffffff800075a4&gt;] ret_from_irq+0x0/0x4
[&lt;ffffffff800090ec&gt;] rm7k_wait_irqoff+0x24/0x48
[&lt;ffffffff8005e388&gt;] cpu_startup_entry+0xc0/0x148
[&lt;ffffffff805a88b0&gt;] start_kernel+0x37c/0x398
Code: dd0900b8  000330f8  0126302d &lt;dcc60000&gt; 50c0fff1  0047182a  a48306a0
03e00008  00000000
---[ end trace b530b0551467f2fd ]---
Kernel panic - not syncing: Fatal exception in interrupt

What happens while auth_enable=0 in that case is, that
ep-&gt;auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
when endpoint is being created.

After that point, if an admin switches over to auth_enable=1,
the machine can crash due to NULL pointer dereference during
reception of an INIT chunk. When we enter sctp_process_init()
via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
the INIT verification succeeds and while we walk and process
all INIT params via sctp_process_param() we find that
net-&gt;sctp.auth_enable is set, therefore do not fall through,
but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
dereference what we have set to NULL during endpoint
initialization phase.

The fix is to make auth_enable immutable by caching its value
during endpoint initialization, so that its original value is
being carried along until destruction. The bug seems to originate
from the very first days.
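
The caching approach can be sketched as below. This is a simplified
Python model rather than the actual patch; the Net and Endpoint
classes and the string results are hypothetical stand-ins used only
to show why a snapshot taken at creation time avoids the NULL
dereference.

```python
# Hypothetical model: the endpoint snapshots auth_enable at creation.
class Net:
    def __init__(self, auth_enable):
        self.auth_enable = auth_enable        # stands in for the sysctl

class Endpoint:
    def __init__(self, net):
        self.auth_enable = net.auth_enable    # cached once, never re-read
        self.auth_hmacs = ["hmac"] if self.auth_enable else None

    def process_init(self):
        # Consult the cached flag, not the live sysctl value.
        if not self.auth_enable:
            return "auth skipped"
        return "default hmac set from " + self.auth_hmacs[0]

net = Net(auth_enable=False)
ep = Endpoint(net)
net.auth_enable = True          # admin flips the sysctl afterwards
result = ep.process_init()
```

Because the endpoint keeps the value it was created with, the later
sysctl change cannot steer it into code that expects initialized HMAC
state.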

Fix in joint work with Daniel Borkmann.

Reported-by: Joshua Kinard &lt;kumba@gentoo.org&gt;
Signed-off-by: Vlad Yasevich &lt;vyasevic@redhat.com&gt;
Signed-off-by: Daniel Borkmann &lt;dborkman@redhat.com&gt;
Acked-by: Neil Horman &lt;nhorman@tuxdriver.com&gt;
Tested-by: Joshua Kinard &lt;kumba@gentoo.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"</title>
<updated>2014-05-31T20:20:33+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>dborkman@redhat.com</email>
</author>
<published>2014-04-14T19:45:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bde6d78b4a05212e6611c7c2fe104fa72512b6eb'/>
<id>urn:sha1:bde6d78b4a05212e6611c7c2fe104fa72512b6eb</id>
<content type='text'>
[ Upstream commit 362d52040c71f6e8d8158be48c812d7729cb8df1 ]

This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
to reflect real state of the receiver's buffer") as it introduced a
serious performance regression on SCTP over IPv4 and IPv6, though not
as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
      Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
      Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
      Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the revert applied, SCTP performance is back to normal on latest
upstream for IPv4 and IPv6 and has the same throughput as the 3.4.2
test kernel; it is steady and interval reports are smooth again.

Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler &lt;pbutler@sonusnet.com&gt;
Reported-by: Dongsheng Song &lt;dongsheng.song@gmail.com&gt;
Reported-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Tested-by: Peter Butler &lt;pbutler@sonusnet.com&gt;
Signed-off-by: Daniel Borkmann &lt;dborkman@redhat.com&gt;
Cc: Matija Glavinic Pecotic &lt;matija.glavinic-pecotic.ext@nsn.com&gt;
Cc: Alexander Sverdlin &lt;alexander.sverdlin@nsn.com&gt;
Cc: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Acked-by: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer</title>
<updated>2014-02-17T05:16:56+00:00</updated>
<author>
<name>Matija Glavinic Pecotic</name>
<email>matija.glavinic-pecotic.ext@nsn.com</email>
</author>
<published>2014-02-14T13:51:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ef2820a735f74ea60335f8ba3801b844f0cb184d'/>
<id>urn:sha1:ef2820a735f74ea60335f8ba3801b844f0cb184d</id>
<content type='text'>
Implementation of (a)rwnd calculation might lead to severe performance issues
and associations completely stalling. These problems are described and solution
is proposed which improves lksctp's robustness in congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
but size of sk_buff, which is blamed against receiver buffer, is not accounted
in rwnd. Theoretically, this should not be the problem as actual size of buffer
is double the amount requested on the socket (SO_RCVBUF). Problem here is
that this will have bad scaling for data which is less than sizeof sk_buff.
E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
of traffic of this size (less than 100B).

An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.

Let's examine when we will hit pressure state and declare rwnd to be 0 for
scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
chunk is sent in separate sctp packet)

Logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is maximum size which can be held in socket buffer
(sk_rcvbuf). currently_alloced is amount of data currently allocated (rx_count)

A simple expression is given for which it will be examined after how many
packets for above stated parameters we enter pressure state:

We start by condition which has to be met in order to enter pressure state:

	socket_buffer &lt; currently_alloced;

currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):

	socket_buffer &lt; x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:

	2*rwnd &lt; x*(payload+sizeof(struct sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43

	20000 &lt; x(43+190);

	x &gt; 20000/233;

	x ~&gt; 84;
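
The inequality can be checked numerically. The sketch below is only an
illustration with hypothetical variable names; the figures (rwnd 10000,
payload 43, sizeof(struct sk_buff) 190) are the ones stated above. With
them the bound evaluates to roughly 86 packets, the same order as the
~84 quoted here; the exact count in practice presumably depends on the
real per-packet accounting.

```python
import math

rwnd = 10000                # advertised by node B
payload = 43                # bytes of sctp data per chunk
skb_overhead = 190          # sizeof(struct sk_buff) as stated above

socket_buffer = 2 * rwnd                    # twice the requested size
per_packet = payload + skb_overhead         # charged per received packet
threshold = socket_buffer / per_packet      # pressure once x exceeds this
first_over = math.floor(threshold) + 1      # smallest integer x above it
data_at_84 = 84 * payload                   # sctp data held after 84 messages

print(per_packet, round(threshold, 1), first_over, data_at_84)
```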

After ~84 messages, pressure state is entered and 0 rwnd is advertised while
received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
drop from 6474 to 0, as it will be now shown in example:

IP A.34340 &gt; B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 &gt; A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 &gt; B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 &gt; A.34340: sctp (1) [COOKIE ACK]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
&lt;...&gt;
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--&gt; Sudden drop

IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores current rwnd value so it can be later restored
in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
adding amount which is read from userspace. Let us observe values in above
example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
amount of actual sctp data currently waiting to be delivered to userspace
is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
only for sctp data, which is ~3500. Condition is never met, and when userspace
reads all data, rwnd stays on 3569.

IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--&gt; At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 &gt; B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 &gt; A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straightforward: it is enough for the sender to send packets of
size less than sizeof(struct sk_buff) and for the receiver to keep them in its buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).

Situation will be typically observed as one association suddenly having rwnd
dropped to size of last packet received and never recovering beyond that point.
Different scenarios will lead to it, but all have in common that one of the
associations (let it be association from 1)) nearly depleted socket buffer, and
the other association blames socket buffer just for the amount enough to start
the pressure. This association will enter pressure state, set rwnd_press and
announce 0 rwnd.
When data is read by userspace, similar situation as in 1) will occur, rwnd will
increase just for the size read by userspace but rwnd_press will be high enough
so that association doesn't have enough credit to reach rwnd_press and restore
to previous state. This case is special case of 1), being worse as there is, in
the worst case, only one packet in buffer for which size rwnd will be increased.
Consequence is association which has very low maximum rwnd ('minute size', in
our case down to 43B - size of packet which caused pressure) and as such
unusable.

Scenario happened in the field and labs frequently after congestion state (link
breaks, different probabilities of packet drop, packet reordering) and with
scenario 1) preceding. Here is given a deterministic scenario for reproduction:

From node A establish two associations on the same socket, with rcvbuf_policy
being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
bring it down close to 0, as in just one more packet would close it. This has as
a consequence that association number 2 is able to receive (at least) one more
packet which will bring it in pressure state. E.g. if association 2 had rwnd of
10000, packet received was 43, and we enter at this point into pressure,
rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
increase for 43, but conditions to restore rwnd to original state, just as in
1), will never be satisfied.

--&gt; Association 1, between A.y and B.12345

IP A.55915 &gt; B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 &gt; A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 &gt; B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 &gt; A.55915: sctp (1) [COOKIE ACK]

--&gt; Association 2, between A.z and B.12346

IP A.55915 &gt; B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 &gt; A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 &gt; B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 &gt; A.55915: sctp (1) [COOKIE ACK]

--&gt; Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

&lt;...&gt;

IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--&gt; Sudden drop on 1

IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--&gt; Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--&gt; Socket buffer is almost depleted, but there is space for one more packet,
    send them over association 2, size 43B

IP B.12346 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 &gt; B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--&gt; Immediate drop

IP A.60995 &gt; B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--&gt; Read everything from the socket, both association recover up to maximum rwnd
    they are capable of reaching, note that association 1 recovered up to 3698,
    and association 2 recovered only to 43

IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 &gt; B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 &gt; A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 &gt; B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) before
reproducing 2). It is simply easier to observe when to send the packet over
association 2 that will push the association into the pressure state.

Proposed solution:

Both problems share the same root cause: improper scaling of the socket buffer
with rwnd. A solution that takes sizeof(sk_buff) into account while calculating
rwnd is not possible, because there is no linear relationship between the
amount of data accounted in an increase/decrease and the IP packet in which the
payload arrived. Even if such a solution were pursued, the complexity of the
code would increase. Given the nature of the current rwnd handling, the slow
increase of rwnd (in sctp_assoc_rwnd_increase) after the pressure state is
entered is reasonable, but it gives the sender a false picture of the current
buffer space. Furthermore, it implements an additional congestion control
mechanism that is implementation-defined rather than defined by the standard.

The proposed solution simplifies the whole algorithm, keeping in mind the
definition from the RFC:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

The core of the proposed solution is these lines:

sctp_assoc_rwnd_update:
	if ((asoc-&gt;base.sk-&gt;sk_rcvbuf - rx_count) &gt; 0)
		asoc-&gt;rwnd = (asoc-&gt;base.sk-&gt;sk_rcvbuf - rx_count) &gt;&gt; 1;
	else
		asoc-&gt;rwnd = 0;

We advertise to the sender (half of) the actual space we have. "Half" is in
parentheses because it depends on whether you regard the size of the socket
buffer as the SO_RCVBUF value or as twice that amount, i.e. the size as seen
from userspace or, respectively, from kernelspace.
In this way the sender is given a good approximation of our buffer space,
regardless of the buffer policy: we always advertise what we have. The proposed
solution fixes the described problems and removes the need for the rwnd
restoration algorithm. Finally, since the proposed solution is a simplification,
some lines of code, along with some bytes in struct sctp_association, are saved.
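
As a rough illustration of the calculation above, the following user-space C
sketch models it with plain integers. The function name model_rwnd, its
parameters, and the byte counts mentioned below are hypothetical and not part
of the patch; the kernel code reads sk_rcvbuf from the association's socket.

```c
/* User-space model of the rwnd calculation; not the kernel function.
 * model_rwnd() and its plain-int parameters are hypothetical stand-ins
 * for the socket's sk_rcvbuf and the rx_count value used in the patch.
 */
static int model_rwnd(int sk_rcvbuf, int rx_count)
{
	int free_space = sk_rcvbuf - rx_count;

	/* Advertise half of the free space, or a closed window if none is left. */
	if (free_space > 0)
		return free_space >> 1;
	return 0;
}
```

With a hypothetical 8192-byte buffer of which 4096 bytes are in use, the sketch
yields 2048; once usage reaches or exceeds the buffer size, it yields 0.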

Version 2 of the patch addressed comments from Vlad. The function name was made
more descriptive, and two parts of the code were changed: one removes a
superfluous call to sctp_assoc_rwnd_update, since that call would not result in
an update of rwnd, and the other reorders the code so that the call to
sctp_assoc_rwnd_update does update rwnd. Version 3 corrected the change
introduced in v2 so that the existing function is not reordered/copied inline
but is called correctly. Thanks to Vlad for the suggestions.

Signed-off-by: Matija Glavinic Pecotic &lt;matija.glavinic-pecotic.ext@nsn.com&gt;
Reviewed-by: Alexander Sverdlin &lt;alexander.sverdlin@nsn.com&gt;
Acked-by: Vlad Yasevich &lt;vyasevich@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>sctp: remove macros sctp_bh_[un]lock_sock</title>
<updated>2014-01-22T02:41:36+00:00</updated>
<author>
<name>wangweidong</name>
<email>wangweidong1@huawei.com</email>
</author>
<published>2014-01-21T07:44:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5bc1d1b4a261a865cbde65b1561748df5b9c724b'/>
<id>urn:sha1:5bc1d1b4a261a865cbde65b1561748df5b9c724b</id>
<content type='text'>
bh_[un]lock_sock was redefined as sctp_bh_[un]lock_sock for user-space-friendly
code, which we haven't used in years, so remove the macros.

Signed-off-by: Wang Weidong &lt;wangweidong1@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>sctp: remove macros sctp_{lock|release}_sock</title>
<updated>2014-01-22T02:41:36+00:00</updated>
<author>
<name>wangweidong</name>
<email>wangweidong1@huawei.com</email>
</author>
<published>2014-01-21T07:44:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=048ed4b6266144fdee55089c9eef55b0c1d42ba1'/>
<id>urn:sha1:048ed4b6266144fdee55089c9eef55b0c1d42ba1</id>
<content type='text'>
{lock|release}_sock was redefined as sctp_{lock|release}_sock for
user-space-friendly code, which we haven't used in years, so remove the macros.

Signed-off-by: Wang Weidong &lt;wangweidong1@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>sctp: remove macros sctp_read_[un]lock</title>
<updated>2014-01-22T02:40:41+00:00</updated>
<author>
<name>wangweidong</name>
<email>wangweidong1@huawei.com</email>
</author>
<published>2014-01-21T07:44:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1b0de194f11450d1c1e164bf2826fe80a38e1f62'/>
<id>urn:sha1:1b0de194f11450d1c1e164bf2826fe80a38e1f62</id>
<content type='text'>
read_[un]lock was redefined as sctp_read_[un]lock for user-space-friendly
code, which we haven't used in years, and the macros themselves were
never used, so remove them.

Signed-off-by: Wang Weidong &lt;wangweidong1@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>sctp: remove macros sctp_write_[un]_lock</title>
<updated>2014-01-22T02:40:41+00:00</updated>
<author>
<name>wangweidong</name>
<email>wangweidong1@huawei.com</email>
</author>
<published>2014-01-21T07:44:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=387602dfdc3332f786f6195c74d53f9e8f7af402'/>
<id>urn:sha1:387602dfdc3332f786f6195c74d53f9e8f7af402</id>
<content type='text'>
write_[un]lock was redefined as sctp_write_[un]lock for user-space-friendly
code, which we haven't used in years, so remove the macros.

Signed-off-by: Wang Weidong &lt;wangweidong1@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
