kernel/linux.git/net, branch v3.0.97

tipc: fix lockdep warning during bearer initialization

2013-09-14T12:50:48+00:00

[ Upstream commit 4225a398c1352a7a5c14dc07277cb5cc4473983b ] When the lockdep validator is enabled, it will report the below warning when we enable a TIPC bearer: [ INFO: possible irq lock inversion dependency detected ] --------------------------------------------------------- Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(ptype_lock); local_irq_disable(); lock(tipc_net_lock); lock(ptype_lock); lock(tipc_net_lock); *** DEADLOCK *** the shortest dependencies between 2nd lock and 1st lock: -> (ptype_lock){+.+...} ops: 10 { [...] SOFTIRQ-ON-W at: [] __lock_acquire+0x528/0x13e0 [] lock_acquire+0x90/0x100 [] _raw_spin_lock+0x38/0x50 [] dev_add_pack+0x3a/0x60 [] arp_init+0x1a/0x48 [] inet_init+0x181/0x27e [] do_one_initcall+0x34/0x170 [] kernel_init+0x110/0x1b2 [] kernel_thread_helper+0x6/0x10 [...] ... key at: [] ptype_lock+0x10/0x20 ... acquired at: [] lock_acquire+0x90/0x100 [] _raw_spin_lock+0x38/0x50 [] dev_add_pack+0x3a/0x60 [] enable_bearer+0xf2/0x140 [tipc] [] tipc_enable_bearer+0x1ba/0x450 [tipc] [] tipc_cfg_do_cmd+0x5c4/0x830 [tipc] [] handle_cmd+0x42/0xd0 [tipc] [] genl_rcv_msg+0x232/0x280 [] netlink_rcv_skb+0x86/0xb0 [] genl_rcv+0x1c/0x30 [] netlink_unicast+0x174/0x1f0 [] netlink_sendmsg+0x1eb/0x2d0 [] sock_aio_write+0x161/0x170 [] do_sync_write+0xac/0xf0 [] vfs_write+0x156/0x170 [] sys_write+0x42/0x70 [] sysenter_do_call+0x12/0x38 [...] } -> (tipc_net_lock){+..-..} ops: 4 { [...] IN-SOFTIRQ-R at: [] __lock_acquire+0x64a/0x13e0 [] lock_acquire+0x90/0x100 [] _raw_read_lock_bh+0x3d/0x50 [] tipc_recv_msg+0x1d/0x830 [tipc] [] recv_msg+0x3f/0x50 [tipc] [] __netif_receive_skb+0x22a/0x590 [] netif_receive_skb+0x2b/0xf0 [] pcnet32_poll+0x292/0x780 [] net_rx_action+0xfa/0x1e0 [] __do_softirq+0xae/0x1e0 [...] } >From the log, we can see three different call chains between CPU0 and CPU1: Time 0 on CPU0: kernel_init()->inet_init()->dev_add_pack() At time 0, the ptype_lock is held by CPU0 in dev_add_pack(); Time 1 on CPU1: tipc_enable_bearer()->enable_bearer()->dev_add_pack() At time 1, tipc_enable_bearer() first holds tipc_net_lock, and then wants to take ptype_lock to register TIPC protocol handler into the networking stack. But the ptype_lock has been taken by dev_add_pack() on CPU0, so at this time the dev_add_pack() running on CPU1 has to be busy looping. Time 2 on CPU0: netif_receive_skb()->recv_msg()->tipc_recv_msg() At time 2, an incoming TIPC packet arrives at CPU0, hence tipc_recv_msg() will be invoked. In tipc_recv_msg(), it first wants to hold tipc_net_lock. At the moment, below scenario happens: On CPU0, below is our sequence of taking locks: lock(ptype_lock)->lock(tipc_net_lock) On CPU1, our sequence of taking locks looks like: lock(tipc_net_lock)->lock(ptype_lock) Obviously deadlock may happen in this case. But please note the deadlock possibly doesn't occur at all when the first TIPC bearer is enabled. Before enable_bearer() -- running on CPU1 does not hold ptype_lock, so the TIPC receive handler (i.e. recv_msg()) is not registered successfully via dev_add_pack(), so the tipc_recv_msg() cannot be called by recv_msg() even if a TIPC message comes to CPU0. But when the second TIPC bearer is registered, the deadlock can perhaps really happen. To fix it, we will push the work of registering TIPC protocol handler into workqueue context. After the change, both paths taking ptype_lock are always in process contexts, thus, the deadlock should never occur. Signed-off-by: Ying Xue Signed-off-by: Jon Maloy Signed-off-by: Paul Gortmaker Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

ICMPv6: treat dest unreachable codes 5 and 6 as EACCES, not EPROTO

2013-09-14T12:50:48+00:00

[ Upstream commit 61e76b178dbe7145e8d6afa84bb4ccea71918994 ] RFC 4443 has defined two additional codes for ICMPv6 type 1 (destination unreachable) messages: 5 - Source address failed ingress/egress policy 6 - Reject route to destination Now they are treated as protocol error and icmpv6_err_convert() converts them to EPROTO. RFC 4443 says: "Codes 5 and 6 are more informative subsets of code 1." Treat codes 5 and 6 as code 1 (EACCES) Btw, connect() returning -EPROTO confuses firefox, so that fallback to other/IPv4 addresses does not work: https://bugzilla.mozilla.org/show_bug.cgi?id=910773 Signed-off-by: Jiri Bohac Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net: bridge: convert MLDv2 Query MRC into msecs_to_jiffies for max_delay

2013-09-14T12:50:48+00:00

[ Upstream commit 2d98c29b6fb3de44d9eaa73c09f9cf7209346383 ] While looking into MLDv1/v2 code, I noticed that bridging code does not convert it's max delay into jiffies for MLDv2 messages as we do in core IPv6' multicast code. RFC3810, 5.1.3. Maximum Response Code says: The Maximum Response Code field specifies the maximum time allowed before sending a responding Report. The actual time allowed, called the Maximum Response Delay, is represented in units of milliseconds, and is derived from the Maximum Response Code as follows: [...] As we update timers that work with jiffies, we need to convert it. Signed-off-by: Daniel Borkmann Cc: Linus Lüssing Cc: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

ipv6: Don't depend on per socket memory for neighbour discovery messages

2013-09-14T12:50:48+00:00

[ Upstream commit 25a6e6b84fba601eff7c28d30da8ad7cfbef0d43 ] Allocating skbs when sending out neighbour discovery messages currently uses sock_alloc_send_skb() based on a per net namespace socket and thus share a socket wmem buffer space. If a netdevice is temporarily unable to transmit due to carrier loss or for other reasons, the queued up ndisc messages will cosnume all of the wmem space and will thus prevent from any more skbs to be allocated even for netdevices that are able to transmit packets. The number of neighbour discovery messages sent is very limited, use of alloc_skb() bypasses the socket wmem buffer size enforcement while the manual call to skb_set_owner_w() maintains the socket reference needed for the IPv6 output path. This patch has orginally been posted by Eric Dumazet in a modified form. Signed-off-by: Thomas Graf Cc: Eric Dumazet Cc: Hannes Frederic Sowa Cc: Stephen Warren Cc: Fabio Estevam Tested-by: Fabio Estevam Tested-by: Stephen Warren Acked-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

ipv6: drop packets with multiple fragmentation headers

2013-09-14T12:50:48+00:00

[ Upstream commit f46078cfcd77fa5165bf849f5e568a7ac5fa569c ] It is not allowed for an ipv6 packet to contain multiple fragmentation headers. So discard packets which were already reassembled by fragmentation logic and send back a parameter problem icmp. The updates for RFC 6980 will come in later, I have to do a bit more research here. Cc: YOSHIFUJI Hideaki Signed-off-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

ipv6: remove max_addresses check from ipv6_create_tempaddr

2013-09-14T12:50:48+00:00

[ Upstream commit 4b08a8f1bd8cb4541c93ec170027b4d0782dab52 ] Because of the max_addresses check attackers were able to disable privacy extensions on an interface by creating enough autoconfigured addresses: But the check is not actually needed: max_addresses protects the kernel to install too many ipv6 addresses on an interface and guards addrconf_prefix_rcv to install further addresses as soon as this limit is reached. We only generate temporary addresses in direct response of a new address showing up. As soon as we filled up the maximum number of addresses of an interface, we stop installing more addresses and thus also stop generating more temp addresses. Even if the attacker tries to generate a lot of temporary addresses by announcing a prefix and removing it again (lifetime == 0) we won't install more temp addresses, because the temporary addresses do count to the maximum number of addresses, thus we would stop installing new autoconfigured addresses when the limit is reached. This patch fixes CVE-2013-0343 (but other layer-2 attacks are still possible). Thanks to Ding Tianhong to bring this topic up again. Signed-off-by: Hannes Frederic Sowa Cc: Ding Tianhong Cc: George Kargiotakis Cc: P J P Cc: YOSHIFUJI Hideaki Acked-by: Ding Tianhong Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

ipv6: don't stop backtracking in fib6_lookup_1 if subtree does not match

2013-09-14T12:50:48+00:00

[ Upstream commit 3e3be275851bc6fc90bfdcd732cd95563acd982b ] In case a subtree did not match we currently stop backtracking and return NULL (root table from fib_lookup). This could yield in invalid routing table lookups when using subtrees. Instead continue to backtrack until a valid subtree or node is found and return this match. Also remove unneeded NULL check. Reported-by: Teco Boot Cc: YOSHIFUJI Hideaki Cc: David Lamparter Cc: Signed-off-by: Hannes Frederic Sowa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: cubic: fix bug in bictcp_acked()

2013-09-14T12:50:47+00:00

[ Upstream commit cd6b423afd3c08b27e1fed52db828ade0addbc6b ] While investigating about strange increase of retransmit rates on hosts ~24 days after boot, Van found hystart was disabled if ca->epoch_start was 0, as following condition is true when tcp_time_stamp high order bit is set. (s32)(tcp_time_stamp - ca->epoch_start) < HZ Quoting Van : At initialization & after every loss ca->epoch_start is set to zero so I believe that the above line will turn off hystart as soon as the 2^31 bit is set in tcp_time_stamp & hystart will stay off for 24 days. I think we've observed that cubic's restart is too aggressive without hystart so this might account for the higher drop rate we observe. Diagnosed-by: Van Jacobson Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Yuchung Cheng Acked-by: Neal Cardwell Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: cubic: fix overflow error in bictcp_update()

2013-09-14T12:50:47+00:00

[ Upstream commit 2ed0edf9090bf4afa2c6fc4f38575a85a80d4b20 ] commit 17a6e9f1aa9 ("tcp_cubic: fix clock dependency") added an overflow error in bictcp_update() in following code : /* change the unit from HZ to bictcp_HZ */ t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) - ca->epoch_start) << BICTCP_HZ) / HZ; Because msecs_to_jiffies() being unsigned long, compiler does implicit type promotion. We really want to constrain (tcp_time_stamp - ca->epoch_start) to a signed 32bit value, or else 't' has unexpected high values. This bugs triggers an increase of retransmit rates ~24 days after boot [1], as the high order bit of tcp_time_stamp flips. [1] for hosts with HZ=1000 Big thanks to Van Jacobson for spotting this problem. Diagnosed-by: Van Jacobson Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Yuchung Cheng Cc: Stephen Hemminger Acked-by: Neal Cardwell Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

fib_trie: remove potential out of bound access

2013-09-14T12:50:47+00:00

[ Upstream commit aab515d7c32a34300312416c50314e755ea6f765 ] AddressSanitizer [1] dynamic checker pointed a potential out of bound access in leaf_walk_rcu() We could allocate one more slot in tnode_new() to leave the prefetch() in-place but it looks not worth the pain. Bug added in commit 82cfbb008572b ("[IPV4] fib_trie: iterator recode") [1] : https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel Reported-by: Andrey Konovalov Signed-off-by: Eric Dumazet Cc: Dmitry Vyukov Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman