Age | Commit message (Collapse) | Author | Files | Lines |
|
net/ipv4/netfilter/ip_nat_snmp_basic.c: In function 'asn1_header_decode':
net/ipv4/netfilter/ip_nat_snmp_basic.c:248: warning: 'len' may be used unini
net/ipv4/netfilter/ip_nat_snmp_basic.c:248: warning: 'def' may be used unini
net/ipv4/netfilter/ip_nat_snmp_basic.c: In function 'snmp_translate':
net/ipv4/netfilter/ip_nat_snmp_basic.c:672: warning: 'l' may be used uniniti
net/ipv4/netfilter/ip_nat_snmp_basic.c:668: warning: 'type' may be used unin
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
- Don't trust a length which is greater than the working buffer.
An invalid length could cause overflow when calculating buffer size
for decoding oid.
- An oid length of zero is invalid and allows for an off-by-one error when
decoding oid because the first subid actually encodes first 2 subids.
- A primitive encoding may not have an indefinite length.
Thanks to Wei Wang from McAfee for report.
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
Upstream commit: 607bfbf2d55dd1cfe5368b41c2a81a8c9ccf4723
When selecting a new window, tcp_select_window() tries not to shrink
the offered window by using the maximum of the remaining offered window
size and the newly calculated window size. The newly calculated window
size is always a multiple of the window scaling factor, the remaining
window size however might not be since it depends on rcv_wup/rcv_nxt.
This means we're effectively shrinking the window when scaling it down.
The dump below shows the problem (scaling factor 2^7):
- Window size of 557 (71296) is advertised, up to 3111907257:
IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...>
- New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes
below the last end:
IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...>
The number 40 results from downscaling the remaining window:
3111907257 - 3111841425 = 65832
65832 / 2^7 = 514
65832 % 2^7 = 40
If the sender uses up the entire window before it is shrunk, this can have
chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq()
will notice that the window has been shrunk since tcp_wnd_end() is before
tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number.
This will fail the receivers checks in tcp_sequence() however since it
is before it's tp->rcv_wup, making it respond with a dupack.
If both sides are in this condition, this leads to a constant flood of
ACKs until the connection times out.
Make sure the window is never shrunk by aligning the remaining window to
the window scaling factor.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
[ Upstream commit: d8c9283089287341c85a0a69de32c2287a990e71 ]
I noticed "ip route list cache x.y.z.t" can be *very* slow.
While strace-ing -T it I also noticed that first part of route cache
is fetched quite fast :
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) =
+3772 <0.000047>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0)
+= 3736 <0.000042>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0)
+= 3740 <0.000055>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0)
+= 3712 <0.000043>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0)
+= 3732 <0.000053>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) =
+3708 <0.000052>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) =
+3680 <0.000041>
while the part at the end of the table is more expensive:
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
+msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"...,
+16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848>
The following patch corrects this performance/latency problem,
removing quadratic behavior.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
[ Upstream commit: 44344b2a85f03326c7047a8c861b0c625c674839 ]
When re-naming an interface, the previous secondary address
labels get lost e.g.
$> brctl addbr foo
$> ip addr add 192.168.0.1 dev foo
$> ip addr add 192.168.0.2 dev foo label foo:00
$> ip addr show dev foo | grep inet
inet 192.168.0.1/32 scope global foo
inet 192.168.0.2/32 scope global foo:00
$> ip link set foo name bar
$> ip addr show dev bar | grep inet
inet 192.168.0.1/32 scope global bar
inet 192.168.0.2/32 scope global bar:2
Turns out to be a simple thinko in inetdev_changename() - clearly we
want to look at the address label, rather than the device name, for
a suffix to retain.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
[ Upstream commit: f844c74fe07321953e2dd227fe35280075f18f60 ]
We currently check that iph->ihl is bounded by the real length and that
the real length is greater than the minimum IP header length. However,
we did not check the caes where iph->ihl is less than the minimum IP
header length.
This breaks because some ip_fast_csum implementations assume that which
is quite reasonable.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
The #ifdef's in arp_process() were not only a mess, they were also wrong
in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or
CONFIG_NETDEV_10000=y) cases.
Since they are not required this patch removes them.
Also removed are some #ifdef's around #include's that caused compile
errors after this change.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
netlink_run_queue() doesn't handle multiple processes processing the
queue concurrently. Serialize queue processing in inet_diag to fix
a oops in netlink_rcv_skb caused by netlink_run_queue passing a
NULL for the skb.
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054
[349587.500454] printing eip:
[349587.500457] c03318ae
[349587.500459] *pde = 00000000
[349587.500464] Oops: 0000 [#1]
[349587.500466] PREEMPT SMP
[349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
[349587.500483] CPU: 0
[349587.500485] EIP: 0060:[<c03318ae>] Not tainted VLI
[349587.500487] EFLAGS: 00010246 (2.6.22.3 #1)
[349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
[349587.500506] eax: 00000000 ebx: 00000000 ecx: c148d2a0 edx: c0398819
[349587.500510] esi: 00000000 edi: c0398819 ebp: c7a21c8c esp: c7a21c80
[349587.500517] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
[349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000)
[349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c
[349587.500542] f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c
[349587.500558] f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff
[349587.500578] Call Trace:
[349587.500581] [<c010361a>] show_trace_log_lvl+0x1c/0x33
[349587.500591] [<c01036d4>] show_stack_log_lvl+0x8d/0xaa
[349587.500595] [<c010390e>] show_registers+0x1cb/0x321
[349587.500604] [<c0103bff>] die+0x112/0x1e1
[349587.500607] [<c01132d2>] do_page_fault+0x229/0x565
[349587.500618] [<c03c8d3a>] error_code+0x72/0x78
[349587.500625] [<c0331962>] netlink_run_queue+0x40/0x76
[349587.500632] [<c03988e3>] inet_diag_rcv+0x1f/0x2c
[349587.500639] [<c03312a5>] netlink_data_ready+0x57/0x59
[349587.500643] [<c0330681>] netlink_sendskb+0x24/0x45
[349587.500651] [<c03307c6>] netlink_unicast+0x100/0x116
[349587.500656] [<c0330f83>] netlink_sendmsg+0x1c2/0x280
[349587.500664] [<c02fcce9>] sock_sendmsg+0xba/0xd5
[349587.500671] [<c02fe4d1>] sys_sendmsg+0x17b/0x1e8
[349587.500676] [<c02fe92d>] sys_socketcall+0x230/0x24d
[349587.500684] [<c01028d2>] syscall_call+0x7/0xb
[349587.500691] =======================
[349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f
Reported by Athanasius <link@miggy.org>
Adrian Bunk:
Backported to 2.6.16 based on a suggestion by David S. Miller.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
Currently when icmp_errors_use_inbound_ifaddr is set and an ICMP error is
sent after the packet passed through ip_output(), an address from the
outgoing interface is chosen as ICMP source address since skb->dev doesn't
point to the incoming interface anymore.
Fix this by doing an interface lookup on rt->dst.iif and using that device.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
When only GSO skb was partially ACKed, no hints are reset,
therefore fastpath_cnt_hint must be tweaked too or else it can
corrupt fackets_out. The corruption to occur, one must have
non-trivial ACK/SACK sequence, so this bug is not very often
that harmful. There's a fackets_out state reset in TCP because
fackets_out is known to be inaccurate and that fixes the issue
eventually anyway.
In case there was also at least one skb that got fully ACKed,
the fastpath_skb_hint is set to NULL which causes a recount for
fastpath_cnt_hint (the old value won't be accessed anymore),
thus it can safely be decremented without additional checking.
Reported by Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
This reverts commit 3198d0f16dec0c87071cf26f3f11af9c8f0a009b.
|
|
Static functions mustn't be exported.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
In testing our ESP/AH offload hardware, I discovered an issue with how
AH handles mutable fields in IPv4. RFC 4302 (AH) states the following
on the subject:
For IPv4, the entire option is viewed as a unit; so even
though the type and length fields within most options are immutable
in transit, if an option is classified as mutable, the entire option
is zeroed for ICV computation purposes.
The current implementation does not zero the type and length fields,
resulting in authentication failures when communicating with hosts
that do (i.e. FreeBSD).
I have tested record route and timestamp options (ping -R and ping -T)
on a small network involving Windows XP, FreeBSD 6.2, and Linux hosts,
with one router. In the presence of these options, the FreeBSD and
Linux hosts (with the patch or with the hardware) can communicate.
The Windows XP host simply fails to accept these packets with or
without the patch.
I have also been trying to test source routing options (using
traceroute -g), but haven't had much luck getting this option to work
*without* AH, let alone with.
Signed-off-by: Nick Bowler <nbowler@ellipticsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
It's possible that new SACK blocks that should trigger new LOST
markings arrive with new data (which previously made is_dupack
false). In addition, I think this fixes a case where we get
a cumulative ACK with enough SACK blocks to trigger the fast
recovery (is_dupack would be false there too).
I'm not completely pleased with this solution because readability
of the code is somewhat questionable as 'is_dupack' in SACK case
is no longer about dupacks only but would mean something like
'lost_marker_work_todo' too... But because of Eifel stuff done
in CA_Recovery, the FLAG_DATA_SACKED check cannot be placed to
the if statement which seems attractive solution. Nevertheless,
I didn't like adding another variable just for that either... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
|
|
The conntrack assigned to locally generated ICMP error is usually the one
assigned to the original packet which has caused the error. But if
the original packet is handled as invalid by nf_conntrack, no conntrack
is assigned to the original packet. Then nf_ct_attach() cannot assign
any conntrack to the ICMP error packet. In that case the current
nf_conntrack_icmp assigns appropriate conntrack to it. But the current
code mistakes the direction of the packet. As a result, NAT code mistakes
the address to be mangled.
To fix the bug, this changes nf_conntrack_icmp not to assign conntrack
to such ICMP error. Actually no address is necessary to be mangled
in this case.
Spotted by Jordan Russell.
Upstream commit ID: 130e7a83d7ec8c5c673225e0fa8ea37b1ed507a5
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
This diff changes the default port range used for outgoing connections,
from "use 32768-61000 in most cases, but use N-4999 on small boxes
(where N is a multiple of 1024, depending on just *how* small the box
is)" to just "use 32768-61000 in all cases".
I don't believe there are any drawbacks to this change, and it keeps
outgoing connection ports farther away from the mess of
IANA-registered ports.
Signed-off-by: Mark Glines <mark@glines.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
As mentioned in http://bugzilla.kernel.org/show_bug.cgi?id=5015
The helptext implies that this is on by default.
This may be true on some distros (Fedora/RHEL have it enabled
in /etc/sysctl.conf), but the kernel defaults to it off.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
dereference (CVE-2007-2876)
When creating a new connection by sending an unknown chunk type, we don't
transition to a valid state, causing a NULL pointer dereference in
sctp_packet when accessing sctp_timeouts[SCTP_CONNTRACK_NONE].
Fix by don't creating new conntrack entry if initial state is invalid.
Noticed by Vilmos Nebehaj <vilmos.nebehaj@ramsys.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
When the server drops its connection, NFS client reconnects using the
same socket after disconnecting. If the new connection's SYN,ACK
doesn't contain the TCP timestamp option and the old connection's did,
tp->tcp_header_len is recomputed assuming no timestamp header but
tp->rx_opt.tstamp_ok remains set. Then tcp_build_and_update_options()
adds in a timestamp option past the end of the allocated TCP header,
overwriting TCP data, or when the data is in skb_shinfo(skb)->frags[],
overwriting skb_shinfo(skb) causing a crash soon after. (The issue was
debugged from such a crash.)
Similarly, wscale_ok and sack_ok also get set based on the SYN,ACK
packet but not reset on disconnect, since they are zeroed out at
initialization. The patch zeroes out the entire tp->rx_opt struct in
tcp_disconnect() to avoid this sort of problem.
Signed-off-by: Srinivas Aji <Aji_Srinivas@emc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The clusterip_config_find_get() already increases entries reference
counter, so there is no reason to do it twice in checkentry() callback.
This causes the config to be freed before it is removed from the list,
resulting in a crash when adding the next rule.
Signed-off-by: Jaroslav Kysela <perex@suse.cz>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
While porting some changes of the 2.6.21-rc7 pptp/proto_gre conntrack
and nat modules to a 2.4.32 kernel I noticed that the gre_key function
returns a wrong pointer to the GRE key of a version 0 packet thus
corrupting the packet payload.
The intended behaviour for GREv0 packets is to act like
ip_conntrack_proto_generic/ip_nat_proto_unknown so I have ripped the
offending functions (not used anymore) and modified the
ip_nat_proto_gre modules to not touch version 0 (non PPTP) packets.
Signed-off-by: Jorge Boncompte <jorge@dti2.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Reply to NETLINK_FIB_LOOKUP messages were misrouted back to kernel,
which resulted in infinite recursion and stack overflow.
The bug is present in all kernel versions since the feature appeared.
The patch also makes some minimal cleanup:
1. Return something consistent (-ENOENT) when fib table is missing
2. Do not crash when queue is empty (does not happen, but yet)
3. Put result of lookup
Sergey Vlasov:
Oops fix
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Doug Leith observed a discrepancy between the version of CUBIC described
in the papers and the version in 2.6.18. A math error related to scaling
causes Cubic to grow too slowly.
Patch is from "Sangtae Ha" <sha2@ncsu.edu>. I validated that
it does fix the problems.
See the following to show behavior over 500ms 100 Mbit link.
Sender (2.6.19-rc3) --- Bridge (2.6.18-rt7) ------- Receiver (2.6.19-rc3)
1G [netem] 100M
http://developer.osdl.org/shemminger/tcp/2.6.19-rc3/cubic-orig.png
http://developer.osdl.org/shemminger/tcp/2.6.19-rc3/cubic-fix.png
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Some stacks apparently send packets with SYN|URG set. Linux accepts
these packets, so TCP conntrack should to.
Pointed out by Martijn Posthuma <posthuma@sangine.com>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Check that status flags are available in the netlink message received
to create a new conntrack.
Fixes a crash in ctnetlink_create_conntrack when the CTA_STATUS attribute
is not present.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Hello, Just discussed this Patrick...
We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete
both are holding RTNL. So there shouldn't be need for this preempt stuff.
This is assumed to a leftover from an older RCU-take.
> Mhh .. I think I just remembered something - me incorrectly suggesting
> to add it there while we were talking about this at OLS :) IIRC the
> idea was to make sure tnode_free (which at that time didn't use
> call_rcu) wouldn't free memory while still in use in a rcu read-side
> critical section. It should have been synchronize_rcu of course,
> but with tnode_free using call_rcu it seems to be completely
> unnecessary. So I guess we can simply remove it.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
On 2/28/07, KOVACS Krisztian <hidden@balabit.hu> wrote:
>
> Hi,
>
> While reading TCP minisock code I've found this suspiciously looking
> code fragment:
>
> - 8< -
> struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
> {
> struct sock *newsk = inet_csk_clone(sk, req, GFP_ATOMIC);
>
> if (newsk != NULL) {
> const struct inet_request_sock *ireq = inet_rsk(req);
> struct tcp_request_sock *treq = tcp_rsk(req);
> struct inet_connection_sock *newicsk = inet_csk(sk);
> struct tcp_sock *newtp;
> - 8< -
>
> The above code initializes newicsk to inet_csk(sk), isn't that supposed
> to be inet_csk(newsk)? As far as I can tell this might leave
> icsk_ack.last_seg_size zero even if we do have received data.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The header may have moved when trimming.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
TCP may advertize up to 16-bits window in SYN packets (no window
scaling allowed). At the same time, TCP may have rcv_wnd
(32-bits) that does not fit to 16-bits without window scaling
resulting in pseudo garbage into advertized window from the
low-order bits of rcv_wnd. This can happen at least when
mss <= (1<<wscale) (see tcp_select_initial_window). This patch
fixes the handling of SYN advertized windows (compile tested
only).
In worst case (which is unlikely to occur though), the receiver
advertized window could be just couple of bytes. I'm not sure
that such situation would be handled very well at all by the
receiver!? Fortunately, the situation normalizes after the
first non-SYN ACK is received because it has the correct,
scaled window.
Alternatively, tcp_select_initial_window could be changed to
prevent too large rcv_wnd in the first place.
[ tcp_make_synack() has the same bug, and I've added a fix for
that to this patch -DaveM ]
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
add_grhead() allocates memory with GFP_ATOMIC and in at least two places skb
from it passed to skb_put() without checking.
Adrian Bunk:
backported to 2.6.16
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Some of the instances of tcp_sack_block are host-endian, some - net-endian.
Define struct tcp_sack_block_wire identical to struct tcp_sack_block
with u32 replaced with __be32; annotate uses of tcp_sack_block replacing
net-endian ones with tcp_sack_block_wire. Change is obviously safe since
for cc(1) __be32 is typedefed to u32.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The TCP reset packet is copied from the original. This
includes all the GSO bits which do not apply to the new
packet. So we should clear those bits.
Spotted by Patrick McHardy.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
I encountered a kernel panic with my test program, which is a very
simple IPv6 client-server program.
The server side sets IPV6_RECVPKTINFO on a listening socket, and the
client side just sends a message to the server. Then the kernel panic
occurs on the server. (If you need the test program, please let me
know. I can provide it.)
This problem happens because a skb is forcibly freed in
tcp_rcv_state_process().
When a socket in listening state(TCP_LISTEN) receives a syn packet,
then tcp_v6_conn_request() will be called from
tcp_rcv_state_process(). If the tcp_v6_conn_request() successfully
returns, the skb would be discarded by __kfree_skb().
However, in case of a listening socket which was already set
IPV6_RECVPKTINFO, an address of the skb will be stored in
treq->pktopts and a ref count of the skb will be incremented in
tcp_v6_conn_request(). But, even if the skb is still in use, the skb
will be freed. Then someone still using the freed skb will cause the
kernel panic.
I suggest to use kfree_skb() instead of __kfree_skb().
Signed-off-by: Masayuki Nakagawa <nakagawa.msy@ncos.nec.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The sorting of SACK blocks actually munges them rather than sort,
causing the TCP stack to ignore some SACK information and breaking the
assumption of ordered SACK blocks after sorting.
The sort takes the data from a second buffer which isn't moved causing
subsequent data moves to occur from the wrong location. The fix is to
use a temporary buffer as a normal sort does.
Signed-off-By: Baruch Even <baruch@ev-en.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
The whole idea with the NOTRACK netfilter target is that
you can force the netfilter code to avoid connection
tracking, and all costs assosciated with it, by making
traffic match a NOTRACK rule.
But this is totally broken by the fact that we do a checksum
calculation over the packet before we do the NOTRACK bypass
check, which is very expensive. People setup NOTRACK rules
explicitly to avoid all of these kinds of costs.
This patch from Patrick, already in Linus's tree, fixes the
bug.
Move the check for ip_conntrack_untracked before the call to
skb_checksum_help to fix NOTRACK excemptions from NAT. Pre-2.6.19
NAT code breaks TSO by invalidating hardware checksums for every
packet, even if explicitly excluded from NAT through NOTRACK.
2.6.19 includes a fix that makes NAT and TSO live in harmony,
but the performance degradation caused by this deserves making
at least the workaround work properly in -stable.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
This changes the microsecond RTT sampling so that samples are taken in
the same way that RTT samples are taken for the RTO calculator: on the
last segment acknowledged, and only when the segment hasn't been
retransmitted.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
It is important that we only assign dev->ip{,6}_ptr
only after all portions of the inet{,6} are setup.
Otherwise we can receive packets before the multicast
spinlocks et al. are initialized.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Otherwise we could compute an inaccurate hash due to the
random seed changing.
Noticed by Zach Brown and patch is based upon some feedback
from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Found in 2.4 by Yixin Pan <yxpan@hotmail.com>.
> When I read fib_semantics.c of Linux-2.4.32, write_lock(&fib_info_lock) =
> is used in fib_release_info() instead of write_lock_bh(&fib_info_lock). =
> Is the following case possible: a BH interrupts fib_release_info() while =
> holding the write lock, and calls ip_check_fib_default() which calls =
> read_lock(&fib_info_lock), and spin forever.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
We grab a reference to the route's inetpeer entry but
forget to release it in xfrm4_dst_destroy().
Bug discovered by Kazunori MIYAZAWA <kazunori@miyazawa.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Currently the behaviour of disable_xfrm is inconsistent between
locally generated and forwarded packets. For locally generated
packets disable_xfrm disables the policy lookup if it is set on
the output device, for forwarded traffic however it looks at the
input device. This makes it impossible to disable xfrm on all
devices but a dummy device and use normal routing to direct
traffic to that device.
Always use the output device when checking disable_xfrm.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Spotted by Thomas Voegtle.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
sbi->s_group_desc is an array of pointers to buffer_head. memcpy() of
buffer size from address of buffer_head is a bad idea - it will generate
junk in any case, may oops if buffer_head is close to the end of slab
page and next page is not mapped and isn't what was intended there.
IOW, ->b_data is missing in that call. Fortunately, result doesn't go
into the primary on-disk data structures, so only backup ones get crap
written to them; that had allowed this bug to remain unnoticed until
now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
Make udp_encap_rcv use pskb_may_pull
IPsec with NAT-T breaks on some notebooks using the latest e1000 chipset,
when header split is enabled. When receiving sufficiently large packets, the
driver puts everything up to and including the UDP header into the header
portion of the skb, and the rest goes into the paged part. udp_encap_rcv
forgets to use pskb_may_pull, and fails to decapsulate it. Instead, it
passes it up it to the IKE daemon.
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
This patch removes consideration of high memory when determining TCP
hash table sizes. Taking into account high memory results in tcp_mem
values that are too large.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
During OpenVZ stress testing we found that UDP traffic with random src
can generate too much excessive rt hash growing leading finally to OOM
and kernel panics.
It was found that for 4GB i686 system (having 1048576 total pages and
225280 normal zone pages) kernel allocates the following route hash:
syslog: IP route cache hash table entries: 262144 (order: 8, 1048576
bytes) => ip_rt_max_size = 4194304 entries, i.e. max rt size is
4194304 * 256b = 1Gb of RAM > normal_zone
Attached the patch which removes HASH_HIGHMEM flag from
alloc_large_system_hash() call.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
memcpy 4 bytes to address of auto unsigned long variable followed
by comparison with u32 is a bloody bad idea.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|