summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2019-11-06net/tls: fix sk_msg trim on fallback to copy modeJakub Kicinski1-5/+15
sk_msg_trim() tries to only update curr pointer if it falls into the trimmed region. The logic, however, does not take into the account pointer wrapping that sk_msg_iter_var_prev() does nor (as John points out) the fact that msg->sg is a ring buffer. This means that when the message was trimmed completely, the new curr pointer would have the value of MAX_MSG_FRAGS - 1, which is neither smaller than any other value, nor would it actually be correct. Special case the trimming to 0 length a little bit and rework the comparison between curr and end to take into account wrapping. This bug caused the TLS code to not copy all of the message, if zero copy filled in fewer sg entries than memcopy would need. Big thanks to Alexander Potapenko for the non-KMSAN reproducer. v2: - take into account that msg->sg is a ring buffer (John). Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1) Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface") Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com Co-developed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06net: sched: prevent duplicate flower rules from tcf_proto destroy raceJohn Hurley1-4/+79
When a new filter is added to cls_api, the function tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to determine if the tcf_proto is duplicated in the chain's hashtable. It then creates a new entry or continues with an existing one. In cls_flower, this allows the function fl_ht_insert_unque to determine if a filter is a duplicate and reject appropriately, meaning that the duplicate will not be passed to drivers via the offload hooks. However, when a tcf_proto is destroyed it is removed from its chain before a hardware remove hook is hit. This can lead to a race whereby the driver has not received the remove message but duplicate flows can be accepted. This, in turn, can lead to the offload driver receiving incorrect duplicate flows and out of order add/delete messages. Prevent duplicates by utilising an approach suggested by Vlad Buslov. A hash table per block stores each unique chain/protocol/prio being destroyed. This entry is only removed when the full destroy (and hardware offload) has completed. If a new flow is being added with the same identiers as a tc_proto being detroyed, then the add request is replayed until the destroy is complete. Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution") Signed-off-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reported-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06taprio: fix panic while hw offload sched list swapIvan Khoronzhuk1-2/+3
Don't swap oper and admin schedules too early, it's not correct and causes crash. Steps to reproduce: 1) tc qdisc replace dev eth0 parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 1@2 \ base-time $SOME_BASE_TIME \ sched-entry S 01 80000 \ sched-entry S 02 15000 \ sched-entry S 04 40000 \ flags 2 2) tc qdisc replace dev eth0 parent root handle 100 taprio \ base-time $SOME_BASE_TIME \ sched-entry S 01 90000 \ sched-entry S 02 20000 \ sched-entry S 04 40000 \ flags 2 3) tc qdisc replace dev eth0 parent root handle 100 taprio \ base-time $SOME_BASE_TIME \ sched-entry S 01 150000 \ sched-entry S 02 200000 \ sched-entry S 04 40000 \ flags 2 Do 2 3 2 .. steps more times if not happens and observe: [ 305.832319] Unable to handle kernel write to read-only memory at virtual address ffff0000087ce7f0 [ 305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted [ 305.919306] Hardware name: Texas Instruments AM654 Base Board (DT) [...] [ 306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80 [ 306.022422] Call trace: [ 306.024866] taprio_free_sched_cb+0x4c/0x98 [ 306.029040] rcu_process_callbacks+0x25c/0x410 [ 306.033476] __do_softirq+0x10c/0x208 [ 306.037132] irq_exit+0xb8/0xc8 [ 306.040267] __handle_domain_irq+0x64/0xb8 [ 306.044352] gic_handle_irq+0x7c/0x178 [ 306.048092] el1_irq+0xb0/0x128 [ 306.051227] arch_cpu_idle+0x10/0x18 [ 306.054795] do_idle+0x120/0x138 [ 306.058015] cpu_startup_entry+0x20/0x28 [ 306.061931] rest_init+0xcc/0xd8 [ 306.065154] start_kernel+0x3bc/0x3e4 [ 306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662) [ 306.074900] ---[ end trace 96c8e2284a9d9d6e ]--- [ 306.079507] Kernel panic - not syncing: Fatal exception in interrupt [ 306.085847] SMP: stopping secondary CPUs [ 306.089765] Kernel Offset: disabled Try to explain one of the possible crash cases: The "real" admin list is assigned when admin_sched is set to new_admin, it happens after "swap", that assigns to oper_sched NULL. Thus if call qdisc show it can crash. Farther, next second time, when sched list is updated, the admin_sched is not NULL and becomes the oper_sched, previous oper_sched was NULL so just skipped. But then admin_sched is assigned new_admin, but schedules to free previous assigned admin_sched (that already became oper_sched). Farther, next third time, when sched list is updated, while one more swap, oper_sched is not null, but it was happy to be freed already (while prev. admin update), so while try to free oper_sched the kernel panic happens at taprio_free_sched_cb(). So, move the "swap emulation" where it should be according to function comment from code. Fixes: 9c66d15646760e ("taprio: Add support for hardware offloading") Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04can: j1939: transport: j1939_xtp_rx_eoma_one(): Add sanity check for correct ↵Oleksij Rempel1-0/+18
total message size We were sending malformed EOMA with total message size set to 0. This issue has been fixed in the previous patch. In this patch a sanity check is added to the RX path and a error message is displayed. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: transport: j1939_session_fresh_new(): make sure EOMA is send ↵Oleksij Rempel1-1/+1
with the total message size set We were sending malformed EOMA messageswith total message size set to 0. This patch fixes the bug. Reported-by: https://github.com/linux-can/can-utils/issues/159 Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: fix memory leak if filters was setOleksij Rempel1-0/+1
Filters array is coped from user space and linked to the j1939 socket. On socket release this memory was not freed. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04can: j1939: fix resource leak of skb on error return pathsColin Ian King1-2/+6
Currently the error return paths do not free skb and this results in a memory leak. Fix this by freeing them before the return. Addresses-Coverity: ("Resource leak") Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04Merge branch 'master' of git://blackhole.kfki.hu/nfPablo Neira Ayuso4-17/+36
Jozsef Kadlecsik says: ==================== ipset patches for nf - Fix the error code in ip_set_sockfn_get() when copy_to_user() is used, from Dan Carpenter. - The IPv6 part was missed when fixing copying the right MAC address in the patch "netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets", it is completed now by Stefano Brivio. - ipset nla_policies are fixed to fully support NL_VALIDATE_STRICT and the code is converted from deprecated parsings to verified ones. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables_offload: skip EBUSY on chain updatePablo Neira Ayuso1-1/+2
Do not try to bind a chain again if it exists, otherwise the driver returns EBUSY. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables: bogus EOPNOTSUPP on basechain updatePablo Neira Ayuso1-0/+1
Userspace never includes the NFT_BASE_CHAIN flag, this flag is inferred from the NFTA_CHAIN_HOOK atribute. The chain update path does not allow to update flags at this stage, the existing sanity check bogusly hits EOPNOTSUPP in the basechain case if the offload flag is set on. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04bridge: ebtables: don't crash when using dnat target in output chainsFlorian Westphal1-4/+15
xt_in() returns NULL in the output hook, skip the pkt_type change for that case, redirection only makes sense in broute/prerouting hooks. Reported-by: Tom Yan <tom.ty89@gmail.com> Cc: Linus Lüssing <linus.luessing@c0d3.blue> Fixes: cf3cb246e277d ("bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: nf_tables: fix unexpected EOPNOTSUPP errorFernando Fernandez Mancera1-4/+2
If the object type doesn't implement an update operation and the user tries to update it will silently ignore the update operation. Fixes: aa4095a156b5 ("netfilter: nf_tables: fix possible null-pointer dereference in object update") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04netfilter: ipset: Fix nla_policies to fully support NL_VALIDATE_STRICTJozsef Kadlecsik3-13/+30
Since v5.2 (commit "netlink: re-add parse/validate functions in strict mode") NL_VALIDATE_STRICT is enabled. Fix the ipset nla_policies which did not support strict mode and convert from deprecated parsings to verified ones. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 setsStefano Brivio1-1/+1
Same as commit 1b4a75108d5b ("netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets"), another copy and paste went wrong in commit 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets"). When I fixed this for IPv4 in 1b4a75108d5b, I didn't realise that hash:ip,mac sets also support IPv6 as family, and this is covered by a separate function, hash_ipmac6_kadt(). In hash:ip,mac sets, the first dimension is the IP address, and the second dimension is the MAC address: check the IPSET_DIM_TWO_SRC flag in flags while deciding which MAC address to copy, destination or source. This way, mixing source and destination matches for the two dimensions of ip,mac hash type works as expected, also for IPv6. With this setup: ip netns add A ip link add veth1 type veth peer name veth2 netns A ip addr add 2001:db8::1/64 dev veth1 ip -net A addr add 2001:db8::2/64 dev veth2 ip link set veth1 up ip -net A link set veth2 up dst=$(ip netns exec A cat /sys/class/net/veth2/address) ip netns exec A ipset create test_hash hash:ip,mac family inet6 ip netns exec A ipset add test_hash 2001:db8::1,${dst} ip netns exec A ip6tables -A INPUT -p icmpv6 --icmpv6-type 135 -j ACCEPT ip netns exec A ip6tables -A INPUT -m set ! --match-set test_hash src,dst -j DROP ipset now correctly matches a test packet: # ping -c1 2001:db8::2 >/dev/null # echo $? 0 Reported-by: Chen, Yi <yiche@redhat.com> Fixes: 8cc4ccf58379 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04netfilter: ipset: Fix an error code in ip_set_sockfn_get()Dan Carpenter1-3/+5
The copy_to_user() function returns the number of bytes remaining to be copied. In this code, that positive return is checked at the end of the function and we return zero/success. What we should do instead is return -EFAULT. Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04dccp: do not leak jiffies on the wireEric Dumazet1-1/+1
For some reason I missed the case of DCCP passive flows in my previous patch. Fixes: a904a0693c18 ("inet: stop leaking jiffies on the wire") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thiemo Nagel <tnagel@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04netfilter: nf_tables_offload: check for register data length mismatchesPablo Neira Ayuso2-3/+4
Make sure register data length does not mismatch immediate data length, otherwise hit EOPNOTSUPP. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds73-514/+845
Pull networking fixes from David Miller: 1) Fix free/alloc races in batmanadv, from Sven Eckelmann. 2) Several leaks and other fixes in kTLS support of mlx5 driver, from Tariq Toukan. 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke Høiland-Jørgensen. 4) Add an r8152 device ID, from Kazutoshi Noguchi. 5) Missing include in ipv6's addrconf.c, from Ben Dooks. 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can easily infer the 32-bit secret otherwise etc. 7) Several netdevice nesting depth fixes from Taehee Yoo. 8) Fix several KCSAN reported errors, from Eric Dumazet. For example, when doing lockless skb_queue_empty() checks, and accessing sk_napi_id/sk_incoming_cpu lockless as well. 9) Fix jumbo packet handling in RXRPC, from David Howells. 10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet. 11) Fix DMA synchronization in gve driver, from Yangchun Fu. 12) Several bpf offload fixes, from Jakub Kicinski. 13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo. 14) Fix ping latency during high traffic rates in hisilicon driver, from Jiangfent Xiao. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) net: fix installing orphaned programs net: cls_bpf: fix NULL deref on offload filter removal selftests: bpf: Skip write only files in debugfs selftests: net: reuseport_dualstack: fix uninitalized parameter r8169: fix wrong PHY ID issue with RTL8168dp net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: phylink: Fix phylink_dbg() macro gve: Fixes DMA synchronization. inet: stop leaking jiffies on the wire ixgbe: Remove duplicate clear_bit() call Documentation: networking: device drivers: Remove stray asterisks e1000: fix memory leaks i40e: Fix receive buffer starvation for AF_XDP igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare igb: Enable media autosense for the i350. igb/igc: Don't warn on fatal read failures when the device is removed tcp: increase tcp_max_syn_backlog max value net: increase SOMAXCONN to 4096 netdevsim: Fix use-after-free during device dismantle ...
2019-11-02net: fix installing orphaned programsJakub Kicinski1-1/+2
When netdevice with offloaded BPF programs is destroyed the programs are orphaned and removed from the program IDA - their IDs get released (the programs may remain accessible via existing open file descriptors and pinned files). After IDs are released they are set to 0. This confuses dev_change_xdp_fd() because it compares the __dev_xdp_query() result where 0 means no program with prog->aux->id where 0 means orphaned. dev_change_xdp_fd() would have incorrectly returned success even though it had not installed the program. Since drivers already catch this case via bpf_offload_dev_match() let them handle this case. The error message drivers produce in this case ("program loaded for a different device") is in fact correct as the orphaned program must had to be loaded for a different device. Fixes: c14a9f633d9e ("net: Don't call XDP_SETUP_PROG when nothing is changed") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02net: cls_bpf: fix NULL deref on offload filter removalJakub Kicinski1-2/+6
Commit 401192113730 ("net: sched: refactor block offloads counter usage") missed the fact that either new prog or old prog may be NULL. Fixes: 401192113730 ("net: sched: refactor block offloads counter usage") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02inet: stop leaking jiffies on the wireEric Dumazet4-5/+5
Historically linux tried to stick to RFC 791, 1122, 2003 for IPv4 ID field generation. RFC 6864 made clear that no matter how hard we try, we can not ensure unicity of IP ID within maximum lifetime for all datagrams with a given source address/destination address/protocol tuple. Linux uses a per socket inet generator (inet_id), initialized at connection startup with a XOR of 'jiffies' and other fields that appear clear on the wire. Thiemo Nagel pointed that this strategy is a privacy concern as this provides 16 bits of entropy to fingerprint devices. Let's switch to a random starting point, this is just as good as far as RFC 6864 is concerned and does not leak anything critical. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thiemo Nagel <tnagel@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01tcp: increase tcp_max_syn_backlog max valueEric Dumazet1-1/+1
tcp_max_syn_backlog default value depends on memory size and TCP ehash size. Before this patch, the max value was 2048 [1], which is considered too small nowadays. Increase it to 4096 to match the recent SOMAXCONN change. [1] This is with TCP ehash size being capped to 524288 buckets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31rxrpc: Fix handling of last subpacket of jumbo packetDavid Howells2-5/+14
When rxrpc_recvmsg_data() sets the return value to 1 because it's drained all the data for the last packet, it checks the last-packet flag on the whole packet - but this is wrong, since the last-packet flag is only set on the final subpacket of the last jumbo packet. This means that a call that receives its last packet in a jumbo packet won't complete properly. Fix this by having rxrpc_locate_data() determine the last-packet state of the subpacket it's looking at and passing that back to the caller rather than having the caller look in the packet header. The caller then needs to cache this in the rxrpc_call struct as rxrpc_locate_data() isn't then called again for this packet. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Fixes: e2de6c404898 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31Merge tag 'mac80211-for-net-2019-10-31' of ↵David S. Miller3-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just two fixes: * HT operation is not allowed on channel 14 (Japan only) * netlink policy for nexthop attribute was wrong ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30net: annotate accesses to sk->sk_incoming_cpuEric Dumazet5-6/+6
This socket field can be read and written by concurrent cpus. Use READ_ONCE() and WRITE_ONCE() annotations to document this, and avoid some compiler 'optimizations'. KCSAN reported : BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0: sk_incoming_cpu_update include/net/sock.h:953 [inline] tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1: sk_incoming_cpu_update include/net/sock.h:952 [inline] tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 napi_poll net/core/dev.c:6392 [inline] net_rx_action+0x3ae/0xa90 net/core/dev.c:6460 __do_softirq+0x115/0x33f kernel/softirq.c:292 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30SUNRPC: Destroy the back channel when we destroy the host transportTrond Myklebust2-1/+6
When we're destroying the host transport mechanism, we should ensure that we do not leak memory by failing to release any back channel slots that might still exist. Reported-by: Neil Brown <neilb@suse.de> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30SUNRPC: The RDMA back channel mustn't disappear while requests are outstandingTrond Myklebust1-0/+2
If there are RDMA back channel requests being processed by the server threads, then we should hold a reference to the transport to ensure it doesn't get freed from underneath us. Reported-by: Neil Brown <neilb@suse.de> Fixes: 63cae47005af ("xprtrdma: Handle incoming backward direction RPC calls") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30SUNRPC: The TCP back channel mustn't disappear while requests are outstandingTrond Myklebust1-2/+3
If there are TCP back channel requests being processed by the server threads, then we should hold a reference to the transport to ensure it doesn't get freed from underneath us. Reported-by: Neil Brown <neilb@suse.de> Fixes: 2ea24497a1b3 ("SUNRPC: RPC callbacks may be split across several..") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30nl80211: fix validation of mesh path nexthopMarkus Theil1-1/+1
Mesh path nexthop should be a ethernet address, but current validation checks against 4 byte integers. Cc: stable@vger.kernel.org Fixes: 2ec600d672e74 ("nl80211/cfg80211: support for mesh, sta dumping") Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de> Link: https://lore.kernel.org/r/20191029093003.10355-1-markus.theil@tu-ilmenau.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-30nl80211: Disallow setting of HT for channel 14Masashi Honma2-1/+7
This patch disables setting of HT20 and more for channel 14 because the channel is only for IEEE 802.11b. The patch for net/wireless/util.c was unit-tested. The patch for net/wireless/chan.c was tested with iw command. Before this patch. $ sudo iw dev <ifname> set channel 14 HT20 $ After this patch. $ sudo iw dev <ifname> set channel 14 HT20 kernel reports: invalid channel definition command failed: Invalid argument (-22) $ Signed-off-by: Masashi Honma <masashi.honma@gmail.com> Link: https://lore.kernel.org/r/20191021075045.2719-1-masashi.honma@gmail.com [clean up the code, use != instead of equivalent >] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-30net: rtnetlink: fix a typo fbd -> fdbNikolay Aleksandrov1-1/+1
A simple typo fix in the nl error message (fbd -> fdb). CC: David Ahern <dsahern@gmail.com> Fixes: 8c6e137fbc7f ("rtnetlink: Update rtnl_fdb_dump for strict data checking") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30net/smc: fix refcounting for non-blocking connect()Ursula Braun1-2/+2
If a nonblocking socket is immediately closed after connect(), the connect worker may not have started. This results in a refcount problem, since sock_hold() is called from the connect worker. This patch moves the sock_hold in front of the connect worker scheduling. Reported-by: syzbot+4c063e6dea39e4b79f29@syzkaller.appspotmail.com Fixes: 50717a37db03 ("net/smc: nonblocking connect rework") Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30erspan: fix the tun_info options_len check for erspanXin Long2-4/+4
The check for !md doens't really work for ip_tunnel_info_opts(info) which only does info + 1. Also to avoid out-of-bounds access on info, it should ensure options_len is not less than erspan_metadata in both erspan_xmit() and ip6erspan_tunnel_xmit(). Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29Merge tag 'batadv-net-for-davem-20191025' of git://git.open-mesh.org/linux-mergeDavid S. Miller4-17/+94
Simon Wunderlich says: ==================== Here are two batman-adv bugfixes: * Fix free/alloc race for OGM and OGMv2, by Sven Eckelmann (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28udp: fix data-race in udp_set_dev_scratch()Eric Dumazet1-4/+15
KCSAN reported a data-race in udp_set_dev_scratch() [1] The issue here is that we must not write over skb fields if skb is shared. A similar issue has been fixed in commit 89c22d8c3b27 ("net: Fix skb csum races when peeking") While we are at it, use a helper only dealing with udp_skb_scratch(skb)->csum_unnecessary, as this allows udp_set_dev_scratch() to be called once and thus inlined. [1] BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1: udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308 __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556 first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579 udp_poll+0xea/0x110 net/ipv4/udp.c:2720 sock_poll+0xed/0x250 net/socket.c:1256 vfs_poll include/linux/poll.h:90 [inline] do_select+0x7d0/0x1020 fs/select.c:534 core_sys_select+0x381/0x550 fs/select.c:677 do_pselect.constprop.0+0x11d/0x160 fs/select.c:759 __do_sys_pselect6 fs/select.c:784 [inline] __se_sys_pselect6 fs/select.c:769 [inline] __x64_sys_pselect6+0x12e/0x170 fs/select.c:769 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0: udp_skb_csum_unnecessary include/net/udp.h:358 [inline] udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310 inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: add READ_ONCE() annotation in __skb_wait_for_more_packets()Eric Dumazet1-1/+1
__skb_wait_for_more_packets() can be called while other cpus can feed packets to the socket receive queue. KCSAN reported : BUG: KCSAN: data-race in __skb_wait_for_more_packets / __udp_enqueue_schedule_skb write to 0xffff888102e40b58 of 8 bytes by interrupt on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] __udp_enqueue_schedule_skb+0x2d7/0x410 net/ipv4/udp.c:1470 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline] udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 read to 0xffff888102e40b58 of 8 bytes by task 13035 on cpu 1: __skb_wait_for_more_packets+0xfa/0x320 net/core/datagram.c:100 __skb_recv_udp+0x374/0x500 net/ipv4/udp.c:1683 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 13035 Comm: syz-executor.3 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: use skb_queue_empty_lockless() in busy poll contextsEric Dumazet4-4/+4
Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: use skb_queue_empty_lockless() in poll() handlersEric Dumazet13-21/+21
Many poll() handlers are lockless. Using skb_queue_empty_lockless() instead of skb_queue_empty() is more appropriate. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28udp: use skb_queue_empty_lockless()Eric Dumazet1-3/+3
syzbot reported a data-race [1]. We should use skb_queue_empty_lockless() to document that we are not ensuring a mutual exclusion and silence KCSAN. [1] BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470 __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline] udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057 udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074 udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233 __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300 udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124 process_backlog+0x1d3/0x420 net/core/dev.c:5955 read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1: skb_queue_empty include/linux/skbuff.h:1494 [inline] __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653 udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838 sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller8-27/+76
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix crash on flowtable due to race between garbage collection and insertion. 2) Restore callback unbinding in netfilter offloads. 3) Fix races on IPVS module removal, from Davide Caratti. 4) Make old_secure_tcp per-netns to fix sysbot report, from Eric Dumazet. 5) Validate matching length in netfilter offloads, from wenxu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2-1/+12
Daniel Borkmann says: ==================== pull-request: bpf 2019-10-27 The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 11 day(s) which contain a total of 7 files changed, 66 insertions(+), 16 deletions(-). The main changes are: 1) Fix two use-after-free bugs in relation to RCU in jited symbol exposure to kallsyms, from Daniel Borkmann. 2) Fix NULL pointer dereference in AF_XDP rx-only sockets, from Magnus Karlsson. 3) Fix hang in netdev unregister for hash based devmap as well as another overflow bug on 32 bit archs in memlock cost calculation, from Toke Høiland-Jørgensen. 4) Fix wrong memory access in LWT BPF programs on reroute due to invalid dst. Also fix BPF selftests to use more compatible nc options, from Jiri Benc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26ipv4: fix route update on metric change.Paolo Abeni1-1/+1
Since commit af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes"), when updating an IP address with a different metric, the associated connected route is updated, too. Still, the mentioned commit doesn't handle properly some corner cases: $ ip addr add dev eth0 192.168.1.0/24 $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2 $ ip addr add dev eth0 192.168.3.1/24 $ ip addr change dev eth0 192.168.1.0/24 metric 10 $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10 $ ip addr change dev eth0 192.168.3.1/24 metric 10 $ ip -4 route 192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0 192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1 192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10 Only the last route is correctly updated. The problem is the current test in fib_modify_prefix_metric(): if (!(dev->flags & IFF_UP) || ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) || ipv4_is_zeronet(prefix) || prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32) Which should be the logical 'not' of the pre-existing test in fib_add_ifaddr(): if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) && (prefix != addr || ifa->ifa_prefixlen < 32)) To properly negate the original expression, we need to change the last logical 'or' to a logical 'and'. Fixes: af4d768ad28c ("net/ipv4: Add support for specifying metric of connected routes") Reported-and-suggested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26net: Zeroing the structure ethtool_wolinfo in ethtool_get_wol()zhanglin1-1/+3
memset() the structure ethtool_wolinfo that has padded bytes but the padded bytes have not been zeroed out. Signed-off-by: zhanglin <zhang.lin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26Merge tag 'ipvs-fixes-for-v5.4' of ↵Pablo Neira Ayuso5-25/+35
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS fixes for v5.4 * Eric Dumazet resolves a race condition in switching the defense level * Davide Caratti resolves a race condition in module removal ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-26netns: fix GFP flags in rtnl_net_notifyid()Guillaume Nault4-25/+28
In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances, but there are a few paths calling rtnl_net_notifyid() from atomic context or from RCU critical sections. The later also precludes the use of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new() call is wrong too, as it uses GFP_KERNEL unconditionally. Therefore, we need to pass the GFP flags as parameter and propagate it through function calls until the proper flags can be determined. In most cases, GFP_KERNEL is fine. The exceptions are: * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump() indirectly call rtnl_net_notifyid() from RCU critical section, * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as parameter. Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used by nlmsg_new(). The function is allowed to sleep, so better make the flags consistent with the ones used in the following ovs_vport_cmd_fill_info() call. Found by code inspection. Fixes: 9a9634545c70 ("netns: notify netns id events") Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26net/smc: keep vlan_id for SMC-R in smc_listen_work()Ursula Braun1-1/+1
Creating of an SMC-R connection with vlan-id fails, because smc_listen_work() determines the vlan_id of the connection, saves it in struct smc_init_info ini, but clears the ini area again if SMC-D is not applicable. This patch just resets the ISM device before investigating SMC-R availability. Fixes: bc36d2fc93eb ("net/smc: consolidate function parameters") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26net/smc: fix closing of fallback SMC socketsUrsula Braun1-0/+7
For SMC sockets forced to fallback to TCP, the file is propagated from the outer SMC to the internal TCP socket. When closing the SMC socket, the internal TCP socket file pointer must be restored to the original NULL value, otherwise memory leaks may show up (found with CONFIG_DEBUG_KMEMLEAK). The internal TCP socket is released in smc_clcsock_release(), which calls __sock_release() function in net/socket.c. This calls the needed iput(SOCK_INODE(sock)) only, if the file pointer has been reset to the original NULL-value. Fixes: 07603b230895 ("net/smc: propagate file from SMC to TCP socket") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26net: sch_generic: Use pfifo_fast as fallback scheduler for CAN hardwareVincent Prince1-0/+2
There is networking hardware that isn't based on Ethernet for layers 1 and 2. For example CAN. CAN is a multi-master serial bus standard for connecting Electronic Control Units [ECUs] also known as nodes. A frame on the CAN bus carries up to 8 bytes of payload. Frame corruption is detected by a CRC. However frame loss due to corruption is possible, but a quite unusual phenomenon. While fq_codel works great for TCP/IP, it doesn't for CAN. There are a lot of legacy protocols on top of CAN, which are not build with flow control or high CAN frame drop rates in mind. When using fq_codel, as soon as the queue reaches a certain delay based length, skbs from the head of the queue are silently dropped. Silently meaning that the user space using a send() or similar syscall doesn't get an error. However TCP's flow control algorithm will detect dropped packages and adjust the bandwidth accordingly. When using fq_codel and sending raw frames over CAN, which is the common use case, the user space thinks the package has been sent without problems, because send() returned without an error. pfifo_fast will drop skbs, if the queue length exceeds the maximum. But with this scheduler the skbs at the tail are dropped, an error (-ENOBUFS) is propagated to user space. So that the user space can slow down the package generation. On distributions, where fq_codel is made default via CONFIG_DEFAULT_NET_SCH during compile time, or set default during runtime with sysctl net.core.default_qdisc (see [1]), we get a bad user experience. In my test case with pfifo_fast, I can transfer thousands of million CAN frames without a frame drop. On the other hand with fq_codel there is more then one lost CAN frame per thousand frames. As pointed out fq_codel is not suited for CAN hardware, so this patch changes attach_one_default_qdisc() to use pfifo_fast for "ARPHRD_CAN" network devices. During transition of a netdev from down to up state the default queuing discipline is attached by attach_default_qdiscs() with the help of attach_one_default_qdisc(). This patch modifies attach_one_default_qdisc() to attach the pfifo_fast (pfifo_fast_ops) if the network device type is "ARPHRD_CAN". [1] https://github.com/systemd/systemd/issues/9194 Signed-off-by: Vincent Prince <vincent.prince.fr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25net: remove unnecessary variables and callbackTaehee Yoo6-34/+8
This patch removes variables and callback these are related to the nested device structure. devices that can be nested have their own nest_level variable that represents the depth of nested devices. In the previous patch, new {lower/upper}_level variables are added and they replace old private nest_level variable. So, this patch removes all 'nest_level' variables. In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added to get lockdep subclass value, which is actually lower nested depth value. But now, they use the dynamic lockdep key to avoid lockdep warning instead of the subclass. So, this patch removes ->ndo_get_lock_subclass() callback. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25net: core: add ignore flag to netdev_adjacent structureTaehee Yoo1-21/+209
In order to link an adjacent node, netdev_upper_dev_link() is used and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used. unlink operation does not fail, but link operation can fail. In order to exchange adjacent nodes, we should unlink an old adjacent node first. then, link a new adjacent node. If link operation is failed, we should link an old adjacent node again. But this link operation can fail too. It eventually breaks the adjacent link relationship. This patch adds an ignore flag into the netdev_adjacent structure. If this flag is set, netdev_upper_dev_link() ignores an old adjacent node for a moment. This patch also adds new functions for other modules. netdev_adjacent_change_prepare() netdev_adjacent_change_commit() netdev_adjacent_change_abort() netdev_adjacent_change_prepare() inserts new device into adjacent list but new device is not allowed to use immediately. If netdev_adjacent_change_prepare() fails, it internally rollbacks adjacent list so that we don't need any other action. netdev_adjacent_change_commit() deletes old device in the adjacent list and allows new device to use. netdev_adjacent_change_abort() rollbacks adjacent list. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>