path: root/net
Age | Commit message | Author | Files | Lines
2021-01-17 | skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too | Alexander Lobakin | 1 | -1/+5
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") ensured that skbs with data size lower than 1025 bytes will be kmalloc'ed to avoid excessive page cache fragmentation and memory consumption. However, the fix addressed only __napi_alloc_skb() (primarily for virtio_net and napi_get_frags()); the issue can still be triggered through __netdev_alloc_skb(), which is still used by several drivers. Drivers often allocate a tiny skb for headers and place the rest of the frame in frags (so-called copybreak). Mirror the condition to __netdev_alloc_skb() to handle this case too. Since v1 [0]: - fix "Fixes:" tag; - refine commit message (mention copybreak usecase). [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
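The size-based choice described above can be modelled in user space. This is only a sketch: `pick_head_backing()`, `enum head_backing` and `TINY_SKB_MAX_LEN` are hypothetical names, the 1024-byte threshold is taken from the commit text, and the real kernel condition also accounts for skb overhead and allocation flags, which are left out here.

```c
#include <stddef.h>

/* Assumption: threshold from the commit text ("lower than 1025 bytes").
 * The kernel condition also considers skb overhead and GFP flags. */
#define TINY_SKB_MAX_LEN 1024u

/* Hypothetical model of the two ways an skb head can be backed. */
enum head_backing { HEAD_KMALLOC, HEAD_PAGE_FRAG };

/* Hypothetical helper: tiny heads come from the slab allocator, so an
 * skb that lingers in a socket queue does not pin a large page. */
static enum head_backing pick_head_backing(size_t len)
{
    if (len <= TINY_SKB_MAX_LEN)
        return HEAD_KMALLOC;
    return HEAD_PAGE_FRAG;
}
```

With this model, a copybreak-style header allocation of 128 bytes would take the kmalloc path, while a 1500-byte frame would still use a page fragment.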
2021-01-16 | netfilter: nft_dynset: dump expressions when set definition contains no expressions | Pablo Neira Ayuso | 1 | -14/+17
If the set definition provides no stateful expressions, then include the stateful expression in the ruleset listing. Without this fix, the dynset rule listing shows the stateful expressions provided by the set definition. Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 | netfilter: nft_dynset: add timeout extension to template | Pablo Neira Ayuso | 1 | -1/+3
Otherwise, the newly created element shows no timeout when listing the ruleset. If the set definition does not specify a default timeout, then the set element only shows the expiration time, but not the timeout. This is a problem when restoring a stateful ruleset listing since it skips the timeout policy entirely. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 | netfilter: nft_dynset: honor stateful expressions in set definition | Pablo Neira Ayuso | 2 | -3/+8
If the set definition contains stateful expressions, allocate them for the newly added entries from the packet path. Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 | net_sched: avoid shift-out-of-bounds in tcindex_set_parms() | Eric Dumazet | 1 | -2/+6
tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT attribute is not silly. UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29 shift exponent 255 is too large for 32-bit type 'int' CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline] tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
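The kind of sanity check the message calls for can be sketched as follows. This is a user-space illustration only: `check_tcindex_shift()` and `TC_INDEX_BITS` are hypothetical names, and the exact bound is an assumption derived from the statement that tc_index is 16 bits wide.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: tc_index is 16 bits wide (per the commit text), so a
 * shift larger than 16 cannot select any meaningful bits. */
#define TC_INDEX_BITS 16u

/* Hypothetical validator for a user-supplied shift attribute.
 * Returns 0 if the shift is usable, -EINVAL otherwise. */
static int check_tcindex_shift(uint32_t shift)
{
    if (shift > TC_INDEX_BITS)
        return -EINVAL;
    return 0;
}
```

Rejecting the value at configuration time keeps the later shift expression within the width of its operand, which is what UBSAN flagged for the exponent 255 in the report above.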
2021-01-16 | net_sched: gen_estimator: support large ewma log | Eric Dumazet | 1 | -4/+7
syzbot report reminded us that very big ewma_log were supported in the past, even if they made little sense. tc qdisc replace dev xxx root est 1sec 131072sec ... While fixing the bug, also add boundary checks for ewma_log, in line with range supported by iproute2. UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38 shift exponent -1 is negative CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83 call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417 expire_timers kernel/time/timer.c:1462 [inline] __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731 __run_timers kernel/time/timer.c:1712 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744 __do_softirq+0x2bc/0xa77 kernel/softirq.c:343 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline] do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:226 [inline] __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420 irq_exit_rcu+0x5/0x20 kernel/softirq.c:432 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516 Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 | net_sched: reject silly cell_log in qdisc_get_rtab() | Eric Dumazet | 1 | -1/+2
iproute2 probably never goes beyond 8 for the cell exponent, but stick to the max shift exponent for signed 32bit. UBSAN reported: UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22 shift exponent 130 is too large for 32-bit type 'int' CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x183/0x22e lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg net/socket.c:672 [inline] ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345 ___sys_sendmsg net/socket.c:2399 [inline] __sys_sendmsg+0x319/0x400 net/socket.c:2432 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | Jakub Kicinski | 1 | -1/+2
Daniel Borkmann says: ==================== pull-request: bpf 2021-01-16 1) Fix a double bpf_prog_put() for BPF_PROG_{TYPE_EXT,TYPE_TRACING} types in link creation's error path causing a refcount underflow, from Jiri Olsa. 2) Fix BTF validation errors for the case where kernel modules don't declare any new types and end up with an empty BTF, from Andrii Nakryiko. 3) Fix BPF local storage helpers to first check their {task,inode} owners for being NULL before access, from KP Singh. 4) Fix a memory leak in BPF setsockopt handling for the case where optlen is zero and thus temporary optval buffer should be freed, from Stanislav Fomichev. 5) Fix a syzbot memory allocation splat in BPF_PROG_TEST_RUN infra for raw_tracepoint caused by too big ctx_size_in, from Song Liu. 6) Fix LLVM code generation issues with verifier where PTR_TO_MEM{,_OR_NULL} registers were spilled to stack but not recognized, from Gilad Reti. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: MAINTAINERS: Update my email address selftests/bpf: Add verifier test for PTR_TO_MEM spill bpf: Support PTR_TO_MEM{,_OR_NULL} register spilling bpf: Reject too big ctx_size_in for raw_tp test run libbpf: Allow loading empty BTFs bpf: Allow empty module BTFs bpf: Don't leak memory in bpf getsockopt when optlen == 0 bpf: Update local storage test to check handling of null ptrs bpf: Fix typo in bpf_inode_storage.c bpf: Local storage helpers should check nullness of owner ptr passed bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach ==================== Link: https://lore.kernel.org/r/20210116002025.15706-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 | cls_flower: call nla_ok() before nla_next() | Cong Wang | 1 | -9/+13
fl_set_enc_opt() simply checks if there are still bytes left to parse, but this is not sufficient as syzbot seems to be able to generate malformed netlink messages. nla_ok() is more strict so should be used to validate the next nlattr here. nla_validate_nested_deprecated() also has a less strict check; it is probably too late to switch to the strict version, but we can just call nla_ok() after it as well. Reported-and-tested-by: syzbot+2624e3778b18fc497c92@syzkaller.appspotmail.com Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options") Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20210115185024.72298-1-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
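The difference between "bytes remain" and "the next attribute is well formed" can be illustrated with a user-space analogue of a length-prefixed attribute walk. This is a sketch, not kernel code: `struct attr_hdr`, `attr_ok()` and `count_attrs()` are hypothetical names that only mimic the shape of the nla_ok()/nla_next() pattern.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute header: total length (header included) and type. */
struct attr_hdr {
    uint16_t len;
    uint16_t type;
};

/* Strict check in the spirit of nla_ok(): the header must fit in the
 * remaining bytes, and the claimed length must cover at least the
 * header and must not run past the end of the buffer. */
static bool attr_ok(const uint8_t *p, size_t remaining)
{
    struct attr_hdr h;

    if (remaining < sizeof(h))
        return false;
    memcpy(&h, p, sizeof(h));
    return h.len >= sizeof(h) && h.len <= remaining;
}

/* Walk the buffer, validating each attribute before stepping over it.
 * Returns the number of attributes, or -1 if one is malformed. */
static int count_attrs(const uint8_t *buf, size_t len)
{
    int n = 0;

    while (len > 0) {
        struct attr_hdr h;
        size_t step;

        if (!attr_ok(buf, len))
            return -1;
        memcpy(&h, buf, sizeof(h));
        step = (h.len + 3u) & ~3u;      /* attributes are 4-byte aligned */
        if (step > len)
            step = len;                 /* last attribute may lack padding */
        buf += step;
        len -= step;
        n++;
    }
    return n;
}
```

A loop that only tested `len > 0` would happily step over a header claiming 64 bytes inside a 4-byte buffer; validating first is what stops the walk from reading past the message.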
2021-01-15 | Merge tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 23 | -121/+202
Pull networking fixes from Jakub Kicinski: "We have a few fixes for long standing issues, in particular Eric's fix to not underestimate the skb sizes, and my fix for brokenness of register_netdevice() error path. They may uncover other bugs so we will keep an eye on them. Also included are Willem's fixes for kmap(_atomic). Looking at the "current release" fixes, it seems we are about one rc behind a normal cycle. We've previously seen an uptick of "people had run their test suites" / "humans actually tried to use new features" fixes between rc2 and rc3. Summary: Current release - regressions: - fix feature enforcement to allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM - dcb: accept RTM_GETDCB messages carrying set-like DCB commands if user is admin for backward-compatibility - selftests/tls: fix selftests build after adding ChaCha20-Poly1305 Current release - always broken: - ppp: fix refcount underflow on channel unbridge - bnxt_en: clear DEFRAG flag in firmware message when retry flashing - smc: fix out of bound access in the new netlink interface Previous releases - regressions: - fix use-after-free with UDP GRO by frags - mptcp: better msk-level shutdown - rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM request - i40e: xsk: fix potential NULL pointer dereferencing Previous releases - always broken: - skb frag: kmap_atomic fixes - avoid 32 x truesize under-estimation for tiny skbs - fix issues around register_netdevice() failures - udp: prevent reuseport_select_sock from reading uninitialized socks - dsa: unbind all switches from tree when DSA master unbinds - dsa: clear devlink port type before unregistering slave netdevs - can: isotp: isotp_getname(): fix kernel information leak - mlxsw: core: Thermal control fixes - ipv6: validate GSO SKB against MTU before finish IPv6 processing - stmmac: use __napi_schedule() for PREEMPT_RT - net: mvpp2: remove Pause and Asym_Pause support Misc: - remove from MAINTAINERS folks who had been inactive for >5yrs" * tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) mptcp: fix locking in mptcp_disconnect() net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM MAINTAINERS: dccp: move Gerrit Renker to CREDITS MAINTAINERS: ipvs: move Wensong Zhang to CREDITS MAINTAINERS: tls: move Aviad to CREDITS MAINTAINERS: ena: remove Zorik Machulsky from reviewers MAINTAINERS: vrf: move Shrijeet to CREDITS MAINTAINERS: net: move Alexey Kuznetsov to CREDITS MAINTAINERS: altx: move Jay Cliburn to CREDITS net: avoid 32 x truesize under-estimation for tiny skbs nt: usb: USB_RTL8153_ECM should not default to y net: stmmac: fix taprio configuration when base_time is in the past net: stmmac: fix taprio schedule configuration net: tip: fix a couple kernel-doc markups net: sit: unregister_netdevice on newlink's error path net: stmmac: Fixed mtu channged by cache aligned cxgb4/chtls: Fix tid stuck due to wrong update of qid i40e: fix potential NULL pointer dereferencing net: stmmac: use __napi_schedule() for PREEMPT_RT can: mcp251xfd: mcp251xfd_handle_rxif_one(): fix wrong NULL pointer check ...
2021-01-15 | mac80211: check if atf has been disabled in __ieee80211_schedule_txq | Lorenzo Bianconi | 1 | -1/+1
Check if atf has been disabled in __ieee80211_schedule_txq() to prevent a given sta from always being put at the beginning of the active_txqs list and never being moved to the end, since the deficit is not decremented in ieee80211_sta_register_airtime(). Fixes: b4809e9484da1 ("mac80211: Add airtime accounting and scheduling to TXQs") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://lore.kernel.org/r/93889406c50f1416214c079ca0b8c9faecc5143e.1608975195.git.lorenzo@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-15 | mac80211: do not drop tx nulldata packets on encrypted links | Felix Fietkau | 1 | -1/+1
ieee80211_tx_h_select_key drops any non-mgmt packets without a key when encryption is used. This is wrong for nulldata packets that can't be encrypted and are sent out for probing clients and indicating 4-address mode. Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Fixes: a0761a301746 ("mac80211: drop data frames without key on encrypted links") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218191525.1168-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-15 | mac80211: fix encryption key selection for 802.3 xmit | Felix Fietkau | 1 | -12/+15
When using WEP, the default unicast key needs to be selected, instead of the STA PTK. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218184718.93650-4-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-15 | mac80211: fix fast-rx encryption check | Felix Fietkau | 1 | -0/+2
When using WEP, the default unicast key needs to be selected, instead of the STA PTK. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201218184718.93650-5-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-15 | mac80211: fix incorrect strlen of .write in debugfs | Shayne Chen | 1 | -24/+20
This fixes strlen mismatch problems in some .write callbacks of debugfs. When trying to configure airtime_flags in debugfs, an error appeared ("ash: write error: Invalid argument"). The error is returned from kstrtou16(), since a wrong length makes it miss the real end of the input string. To fix this, use count as the string length and terminate the char buffer properly. A debug print shows airtime_flags_write: count = 2, len = 8; the actual length is 2, but "len = strlen(buf)" yields 8. Also clean up the other similar cases for the sake of consistency. Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: Shayne Chen <shayne.chen@mediatek.com> Link: https://lore.kernel.org/r/20210112032028.7482-1-shayne.chen@mediatek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
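The pattern of trusting `count` rather than `strlen()` can be sketched in user space. This is an illustration under assumptions: `parse_u16_write()` is a hypothetical stand-in for a debugfs .write handler, the 16-byte buffer size is arbitrary, and strtoul() replaces kstrtou16().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical write handler: 'src' holds 'count' bytes that are NOT
 * guaranteed to be NUL-terminated, as with data written by user space. */
static int parse_u16_write(const char *src, size_t count, unsigned short *out)
{
    char buf[16];
    char *end;
    unsigned long val;

    if (count == 0 || count >= sizeof(buf))
        return -EINVAL;

    memcpy(buf, src, count);
    buf[count] = '\0';              /* terminate at count, not at strlen() */
    if (buf[count - 1] == '\n')
        buf[count - 1] = '\0';      /* tolerate the newline added by echo */

    errno = 0;
    val = strtoul(buf, &end, 0);
    if (errno || end == buf || *end != '\0' || val > 0xffff)
        return -EINVAL;

    *out = (unsigned short)val;
    return 0;
}
```

Passing `"3\nXXXXXX"` with a count of 2 parses as 3 here, because the bytes beyond `count` are never looked at; that matches the "count = 2, len = 8" situation described above.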
2021-01-14 | mptcp: fix locking in mptcp_disconnect() | Paolo Abeni | 1 | -2/+7
tcp_disconnect() expects the caller to hold the sock lock, but mptcp_disconnect() does not acquire it. Add the missing required lock. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 76e2a55d1625 ("mptcp: better msk-level shutdown.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 | net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM | Tariq Toukan | 1 | -3/+9
Cited patch below blocked the TLS TX device offload unless HW_CSUM is set. This broke devices that use IP_CSUM && IPV6_CSUM. Here we fix it. Note that the single HW_TLS_TX feature flag indicates support for both IPv4 and IPv6, hence it should still be disabled in case only one of (IP_CSUM | IPV6_CSUM) is set. Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reported-by: Rohit Maheshwari <rohitm@chelsio.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
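The rule stated above (keep TLS TX offload if HW_CSUM is set, or if both per-family checksum flags are set) can be written as a small predicate. This is a sketch with made-up names: the `F_*` bit values and `fix_tls_tx()` are hypothetical and do not correspond to the kernel's NETIF_F_* bit layout.

```c
#include <stdint.h>

/* Hypothetical feature bits; positions are arbitrary for this sketch. */
#define F_IP_CSUM    (1u << 0)
#define F_HW_CSUM    (1u << 1)
#define F_IPV6_CSUM  (1u << 2)
#define F_HW_TLS_TX  (1u << 3)

/* Drop the TLS TX flag unless checksum offload covers both IPv4 and
 * IPv6, either through the generic flag or through both per-family flags. */
static uint32_t fix_tls_tx(uint32_t features)
{
    int both_families = (features & F_IP_CSUM) && (features & F_IPV6_CSUM);

    if ((features & F_HW_TLS_TX) &&
        !(features & F_HW_CSUM) && !both_families)
        features &= ~F_HW_TLS_TX;

    return features;
}
```

A device advertising only one of the per-family flags loses the TLS flag, since the single flag is meant to cover both address families.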
2021-01-14 | net: avoid 32 x truesize under-estimation for tiny skbs | Eric Dumazet | 1 | -2/+7
Both virtio net and napi_get_frags() allocate skbs with a very small skb->head. While using page fragments instead of a kmalloc backed skb->head might give a small performance improvement in some cases, there is a huge risk of underestimating memory usage. For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations per page (order-3 page in x86), or even 64 on PowerPC. We have been tracking OOM issues on GKE hosts hitting tcp_mem limits but consuming far more memory for TCP buffers than instructed in tcp_mem[2]. Even if we force napi_alloc_skb() to only use order-0 pages, the issue would still be there on arches with PAGE_SIZE >= 32768. This patch makes sure that small skb heads are kmalloc backed, so that other objects in the slab page can be reused instead of being held as long as skbs are sitting in socket queues. Note that we might in the future use the sk_buff napi cache, instead of going through a more expensive __alloc_skb(). Another idea would be to use separate page sizes depending on the allocated length (to never have more than 4 frags per page). I would like to thank Greg Thelen for his precious help on this matter; analysing crash dumps is always a time consuming task. Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Greg Thelen <gthelen@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 | net: tip: fix a couple kernel-doc markups | Mauro Carvalho Chehab | 2 | -2/+2
A function has a different name between its prototype and its kernel-doc markup: ../net/tipc/link.c:2551: warning: expecting prototype for link_reset_stats(). Prototype was for tipc_link_reset_stats() instead ../net/tipc/node.c:1678: warning: expecting prototype for is the general link level function for message sending(). Prototype was for tipc_node_xmit() instead Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 | net: sit: unregister_netdevice on newlink's error path | Jakub Kicinski | 1 | -1/+4
We need to unregister the netdevice if config failed. .ndo_uninit takes care of most of the heavy lifting. This was uncovered by recent commit c269a24ce057 ("net: make free_netdev() more lenient with unregistering devices"). Previously the partially-initialized device would be left in the system. Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com Fixes: e2f1f072db8d ("sit: allow to configure 6rd tunnels via netlink") Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 | bpf: Reject too big ctx_size_in for raw_tp test run | Song Liu | 1 | -1/+2
syzbot reported a WARNING for allocating too big memory: WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011 Modules linked in: CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976 Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8 RSP: 0018:ffffc900012efb10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 1ffff9200025df66 RCX: 0000000000000000 RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000140dc0 RBP: 0000000000140dc0 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff81b1f7e1 R11: 0000000000000000 R12: 0000000000000014 R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000 FS: 000000000190c880(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f08b7f316c0 CR3: 0000000012073000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267 alloc_pages include/linux/gfp.h:547 [inline] kmalloc_order+0x2e/0xb0 mm/slab_common.c:837 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853 kmalloc include/linux/slab.h:557 [inline] kzalloc include/linux/slab.h:682 [inline] bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282 bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline] __do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x440499 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 
08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffe1f3bfb18 EFLAGS: 00000246 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440499 RDX: 0000000000000048 RSI: 0000000020000600 RDI: 000000000000000a RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401ca0 R13: 0000000000401d30 R14: 0000000000000000 R15: 0000000000000000 This is because we didn't filter out too big ctx_size_in. Fix it by rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64 numbers. Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Reported-by: syzbot+4f98876664c7337a4ae6@syzkaller.appspotmail.com Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210112234254.1906829-1-songliubraving@fb.com
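The bound described in the message (at most MAX_BPF_FUNC_ARGS u64 values) amounts to a one-line size check before allocating. A user-space sketch follows; `check_ctx_size_in()` is a hypothetical helper, and the value 12 for MAX_BPF_FUNC_ARGS is the one quoted in the commit text.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the commit message. */
#define MAX_BPF_FUNC_ARGS 12

/* Hypothetical validator: reject a user-supplied context size that
 * exceeds MAX_BPF_FUNC_ARGS 64-bit arguments, before any allocation. */
static int check_ctx_size_in(uint32_t ctx_size_in)
{
    if (ctx_size_in > MAX_BPF_FUNC_ARGS * sizeof(uint64_t))
        return -EINVAL;
    return 0;
}
```

With 8-byte u64 values the limit works out to 96 bytes, so the oversized request that triggered the page allocator warning would be refused up front.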
2021-01-14 | can: isotp: isotp_getname(): fix kernel information leak | Oliver Hartkopp | 1 | -0/+1
Initialize the sockaddr_can structure to prevent a data leak to user space. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: syzbot+057884e2f453e8afebc8@syzkaller.appspotmail.com Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20210112091643.11789-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
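Why the initialization matters can be shown with a simplified structure. This is a sketch only: `struct demo_addr` and `fill_addr()` are hypothetical and merely resemble an address structure with padding and unused fields; they are not the real sockaddr_can layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical address structure. The compiler may insert padding after
 * 'family', and 'reserved' is never written by the fill routine. */
struct demo_addr {
    uint16_t family;
    uint32_t ifindex;
    uint8_t  reserved[8];
};

/* Zero the whole object first, so padding and unused fields do not carry
 * stale stack or heap contents when the struct is copied to the caller. */
static void fill_addr(struct demo_addr *addr, uint16_t family, uint32_t ifindex)
{
    memset(addr, 0, sizeof(*addr));
    addr->family = family;
    addr->ifindex = ifindex;
}
```

Without the memset(), only the two assigned fields would be defined, and whatever bytes previously occupied the rest of the object would be exposed to the reader.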
2021-01-13 | rxrpc: Call state should be read with READ_ONCE() under some circumstances | Baptiste Lepers | 1 | -1/+1
The call state may be changed at any time by the data-ready routine in response to received packets, so if the call state is to be read and acted upon several times in a function, READ_ONCE() must be used unless the call state lock is held. As it happens, we used READ_ONCE() to read the state a few lines above the unmarked read in rxrpc_input_data(), so use that value rather than re-reading it. Fixes: a158bdd3247b ("rxrpc: Fix call timeouts") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/161046715522.2450566.488819910256264150.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 | rxrpc: Fix handling of an unsupported token type in rxrpc_read() | David Howells | 1 | -2/+4
Clang static analysis reports the following: net/rxrpc/key.c:657:11: warning: Assigned value is garbage or undefined toksize = toksizes[tok++]; ^ ~~~~~~~~~~~~~~~ rxrpc_read() contains two consecutive loops. The first loop calculates the token sizes and stores the results in toksizes[] and the second one uses the array. When there is an error in identifying the token in the first loop, the token is skipped, no change is made to the toksizes[] array. When the same error happens in the second loop, the token is not skipped. This will cause the toksizes[] array to be out of step and will overrun past the calculated sizes. Fix this by making both loops log a message and return an error in this case. This should only happen if a new token type is incompletely implemented, so it should normally be impossible to trigger this. Fixes: 9a059cd5ca7d ("rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()") Reported-by: Tom Rix <trix@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/161046503122.2445787.16714129930607546635.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
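The two-pass structure and the fix (fail in both passes on an unknown type, so the size array never gets out of step) can be sketched as below. Everything here is hypothetical: `TOK_A`, `TOK_B`, the sizes, the 8-entry limit and the -EINVAL return value are illustration choices, not the rxrpc implementation.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical token types and per-type sizes. */
enum { TOK_A = 1, TOK_B = 2 };
#define MAX_TOKENS 8

/* Pass 1 records one size per token; pass 2 consumes sizes[i] for the
 * same index i. Both passes reject unknown types, so an entry of
 * sizes[] is never read without having been written. */
static int total_token_bytes(const int *types, size_t n, size_t *total)
{
    size_t sizes[MAX_TOKENS];
    size_t i, sum = 0;

    if (n > MAX_TOKENS)
        return -EINVAL;

    for (i = 0; i < n; i++) {
        switch (types[i]) {
        case TOK_A: sizes[i] = 16; break;
        case TOK_B: sizes[i] = 32; break;
        default:    return -EINVAL;     /* unsupported type: stop here */
        }
    }

    for (i = 0; i < n; i++) {
        switch (types[i]) {
        case TOK_A:
        case TOK_B:
            sum += sizes[i];
            break;
        default:
            return -EINVAL;             /* same handling as in pass 1 */
        }
    }

    *total = sum;
    return 0;
}
```

If the first loop skipped an unknown token while the second did not, the second loop would index one entry further than the first had filled, which is the "garbage or undefined" read that the analyzer reported.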
2021-01-13 | SUNRPC: Move the svc_xdr_recvfrom tracepoint again | Chuck Lever | 1 | -2/+2
Commit 156708adf2d9 ("SUNRPC: Move the svc_xdr_recvfrom() tracepoint") tried to capture the correct XID in the trace record, but this line in svc_recv: rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]); alters the size of rq_arg.head[0].iov_len. The tracepoint records the correct XID but an incorrect value for the length of the xdr_buf's head. To keep the trace callsites simple, I've created two trace classes. One assumes the xdr_buf contains a full RPC message, and the XID can be extracted from it. The other assumes the contents of the xdr_buf are arbitrary, and the xid will be provided by the caller. Currently there is only one user of each class, but I expect we will need a few more tracepoints using each class as time goes on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-13 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf | Jakub Kicinski | 2 | -0/+4
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Pass conntrack -f to specify family in netfilter conntrack helper selftests, from Chen Yi. 2) Honor hashsize modparam from nf_conntrack_buckets sysctl, from Jesper D. Brouer. 3) Fix memleak in nf_nat_init() error path, from Dinghao Liu. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nf_nat: Fix memleak in nf_nat_init netfilter: conntrack: fix reading nf_conntrack_buckets selftests: netfilter: Pass family parameter "-f" to conntrack tool ==================== Link: https://lore.kernel.org/r/20210112222033.9732-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 | net/smc: use memcpy instead of snprintf to avoid out of bounds read | Guvenc Gulce | 3 | -10/+16
Using snprintf() to convert non-null-terminated strings to null-terminated strings may cause an out-of-bounds read in the source string. Therefore use memcpy() and terminate the target string with a null afterwards. Fixes: a3db10efcc4c ("net/smc: Add support for obtaining SMCR device list") Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
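The memcpy-then-terminate approach can be sketched in user space. `FIELD_LEN` and `field_to_cstr()` are hypothetical names for a fixed-width field that is not NUL-terminated and its conversion helper.

```c
#include <string.h>

/* Hypothetical width of a fixed-size, non-terminated identifier field. */
#define FIELD_LEN 8

/* Copy exactly FIELD_LEN bytes and add the terminator ourselves. The
 * source is never treated as a C string, so nothing is read past its end. */
static void field_to_cstr(char dst[FIELD_LEN + 1], const char src[FIELD_LEN])
{
    memcpy(dst, src, FIELD_LEN);
    dst[FIELD_LEN] = '\0';
}
```

A "%s" conversion, by contrast, has to look for a terminating NUL in the source; when the field is completely filled, that search can continue into whatever memory follows it.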
2021-01-13 | smc: fix out of bound access in smc_nl_get_sys_info() | Jakub Kicinski | 1 | -1/+2
smc_clc_get_hostname() sets the host pointer to a buffer which is not NULL-terminated (see smc_clc_init()). Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com Fixes: 099b990bd11a ("net/smc: Add support for obtaining system information") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 | mptcp: better msk-level shutdown. | Paolo Abeni | 1 | -45/+17
Instead of re-implementing most of inet_shutdown, reuse that helper and implement the MPTCP-specific bits at the 'proto' level. The msk-level disconnect() can now be invoked, so let's provide a suitable implementation. As a side effect, this fixes bad state management for listener sockets. The latter could lead to a division by 0 oops since commit ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling"). Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine") Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 | mptcp: more strict state checking for acks | Paolo Abeni | 1 | -1/+1
Syzkaller found a way to trigger division by zero in mptcp_subflow_cleanup_rbuf(). The current checks implemented in tcp_can_send_ack() are too weak; let's be more accurate. Reported-by: Christoph Paasch <cpaasch@apple.com> Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling") Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 | net: dsa: clear devlink port type before unregistering slave netdevs | Vladimir Oltean | 1 | -0/+4
Florian reported a use-after-free bug in devlink_nl_port_fill found with KASAN: (devlink_nl_port_fill) (devlink_port_notify) (devlink_port_unregister) (dsa_switch_teardown.part.3) (dsa_tree_teardown_switches) (dsa_unregister_switch) (bcm_sf2_sw_remove) (platform_remove) (device_release_driver_internal) (device_links_unbind_consumers) (device_release_driver_internal) (device_driver_detach) (unbind_store) Allocated by task 31: alloc_netdev_mqs+0x5c/0x50c dsa_slave_create+0x110/0x9c8 dsa_register_switch+0xdb0/0x13a4 b53_switch_register+0x47c/0x6dc bcm_sf2_sw_probe+0xaa4/0xc98 platform_probe+0x90/0xf4 really_probe+0x184/0x728 driver_probe_device+0xa4/0x278 __device_attach_driver+0xe8/0x148 bus_for_each_drv+0x108/0x158 Freed by task 249: free_netdev+0x170/0x194 dsa_slave_destroy+0xac/0xb0 dsa_port_teardown.part.2+0xa0/0xb4 dsa_tree_teardown_switches+0x50/0xc4 dsa_unregister_switch+0x124/0x250 bcm_sf2_sw_remove+0x98/0x13c platform_remove+0x44/0x5c device_release_driver_internal+0x150/0x254 device_links_unbind_consumers+0xf8/0x12c device_release_driver_internal+0x84/0x254 device_driver_detach+0x30/0x34 unbind_store+0x90/0x134 What happens is that devlink_port_unregister emits a netlink DEVLINK_CMD_PORT_DEL message which associates the devlink port that is getting unregistered with the ifindex of its corresponding net_device. Only trouble is, the net_device has already been unregistered. It looks like we can stub out the search for a corresponding net_device if we clear the devlink_port's type. This looks like a bit of a hack, but also seems to be the reason why the devlink_port_type_clear function exists in the first place. 
Fixes: 3122433eb533 ("net: dsa: Register devlink ports before calling DSA driver setup()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13net: dsa: unbind all switches from tree when DSA master unbindsVladimir Oltean1-0/+10
Currently the following happens when a DSA master driver unbinds while there are DSA switches attached to it: $ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind ------------[ cut here ]------------ WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507 Call trace: rollback_registered_many+0x5fc/0x688 unregister_netdevice_queue+0x98/0x120 dsa_slave_destroy+0x4c/0x88 dsa_port_teardown.part.16+0x78/0xb0 dsa_tree_teardown_switches+0x58/0xc0 dsa_unregister_switch+0x104/0x1b8 felix_pci_remove+0x24/0x48 pci_device_remove+0x48/0xf0 device_release_driver_internal+0x118/0x1e8 device_driver_detach+0x28/0x38 unbind_store+0xd0/0x100 Located at the above location is this WARN_ON: /* Notifier chain MUST detach us all upper devices. */ WARN_ON(netdev_has_any_upper_dev(dev)); Other stacked interfaces, like VLAN, do indeed listen for NETDEV_UNREGISTER on the real_dev and also unregister themselves at that time, which is clearly the behavior that rollback_registered_many expects. But DSA interfaces are not VLAN. They have backing hardware (platform devices, PCI devices, MDIO, SPI etc) which have a life cycle of their own and we can't just trigger an unregister from the DSA framework when we receive a netdev notifier that the master unregisters. Luckily, there is something we can do, and that is to inform the driver core that we have a runtime dependency to the DSA master interface's device, and create a device link where that is the supplier and we are the consumer. Having this device link will make the DSA switch unbind before the DSA master unbinds, which is enough to avoid the WARN_ON from rollback_registered_many. Note that even before the blamed commit, DSA did nothing intelligent when the master interface got unregistered either. See the discussion here: https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/ But this time, at least the WARN_ON is loud enough that the upper_dev_link commit can be blamed. 
The advantage with this approach vs dev_hold(master) in the attached link is that the latter is not meant for long term reference counting. With dev_hold, the only thing that will happen is that when the user attempts an unbind of the DSA master, netdev_wait_allrefs will keep waiting and waiting, due to DSA keeping the refcount forever. DSA would not access freed memory corresponding to the master interface, but the unbind would still result in a freeze. Whereas with device links, graceful teardown is ensured. It even works with cascaded DSA trees. $ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind [ 1818.797546] device swp0 left promiscuous mode [ 1819.301112] sja1105 spi2.0: Link is Down [ 1819.307981] DSA: tree 1 torn down [ 1819.312408] device eno2 left promiscuous mode [ 1819.656803] mscc_felix 0000:00:00.5: Link is Down [ 1819.667194] DSA: tree 0 torn down [ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down This approach allows us to keep the DSA framework absolutely unchanged, and the driver core will just know to unbind us first when the master goes away - as opposed to the large (and probably impossible) rework required if attempting to listen for NETDEV_UNREGISTER. As per the documentation at Documentation/driver-api/device_link.rst, specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link to be automatically purged when the consumer fails to probe or later unbinds. So we don't need to keep the consumer_link variable in struct dsa_switch. Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commandsPetr Machata1-1/+1
In commit 826f328e2b7e ("net: dcb: Validate netlink message in DCB handler"), Linux started rejecting RTM_GETDCB netlink messages if they contained a set-like DCB_CMD_ command. The reason was that privileges were only verified for RTM_SETDCB messages, but the value that determined the action to be taken is the command, not the message type. And validation of message type against the DCB command was the obvious missing piece. Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool for configuration of DCB, accesses the DCB set-like APIs through RTM_GETDCB. Therefore do not bounce the discrepancy between message type and command. Instead, in addition to validating privileges based on the actual message type, validate them also based on the expected message type. This closes the loophole of allowing DCB configuration on non-admin accounts, while maintaining backward compatibility. Fixes: 2f90b8657ec9 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver") Fixes: 826f328e2b7e ("net: dcb: Validate netlink message in DCB handler") Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
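The privilege rule described above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel code: `MSG_GET` and `MSG_SET` stand in for RTM_GETDCB and RTM_SETDCB, and `check_dcb_perm()` with its `is_admin` flag is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for RTM_GETDCB and RTM_SETDCB (illustrative names). */
enum msg_type { MSG_GET, MSG_SET };

/*
 * actual:   message type the request arrived with
 * expected: message type the DCB command is supposed to use
 * is_admin: hypothetical flag for "caller has the admin capability"
 */
static int check_dcb_perm(enum msg_type actual, enum msg_type expected,
                          bool is_admin)
{
    assert(actual == MSG_GET || actual == MSG_SET);

    /* A mismatch between actual and expected type is tolerated ... */
    if (is_admin)
        return 0;

    /* ... but if either one is "set", privileges are required. */
    if (actual == MSG_SET || expected == MSG_SET)
        return -EPERM;

    return 0;
}
```

With this shape, a set-like command sent as a get message still works for an admin caller, which is the backward-compatible case the message describes, and is refused for everyone else.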
2021-01-12Merge tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+1
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
 - Fix parsing of link-local IPv6 addresses
 - Fix confusing logging of mount errors that was introduced by the fsopen() patchset.
 - Fix a tracing use after free in _nfs4_do_setlk()
 - Layout return-on-close fixes when called from nfs4_evict_inode()
 - Layout segments were being leaked in pnfs_generic_clear_request_commit()
 - Don't leak DS commits in pnfs_generic_retry_commit()
 - Fix an Oopsable use-after-free when nfs_delegation_find_inode_server() calls iput() on an inode after the super block has gone away"

* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: nfs_igrab_and_active must first reference the superblock
  NFS: nfs_delegation_find_inode_server must first reference the superblock
  NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
  NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
  NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
  pNFS: Stricter ordering of layoutget and layoutreturn
  pNFS: Clean up pnfs_layoutreturn_free_lsegs()
  pNFS: We want return-on-close to complete when evicting the inode
  pNFS: Mark layout for return if return-on-close was not sent
  net: sunrpc: interpret the return value of kstrtou32 correctly
  NFS: Adjust fs_context error logging
  NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
2021-01-12esp: avoid unneeded kmap_atomic callWillem de Bruijn2-12/+2
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for the esp trailer. It accesses the page with kmap_atomic to handle highmem. But skb_page_frag_refill can return compound pages, of which kmap_atomic only maps the first underlying page. skb_page_frag_refill does not return highmem, because flag __GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP. That also does not call kmap_atomic, but directly uses page_address, in skb_copy_to_page_nocache. Do the same for ESP. This issue has become easier to trigger with recent kmap local debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12net: compound page support in skb_seq_readWillem de Bruijn1-5/+23
skb_seq_read iterates over an skb, returning pointer and length of the next data range with each call. It relies on kmap_atomic to access highmem pages when needed. An skb frag may be backed by a compound page, but kmap_atomic maps only a single page. There are not enough kmap slots to always map all pages concurrently. Instead, if kmap_atomic is needed, iterate over each page. As this increases the number of calls, avoid this unless needed. The necessary condition is captured in skb_frag_must_loop. I tried to make the change as obvious as possible. It should be easy to verify that nothing changes if skb_frag_must_loop returns false.
Tested: On an x86 platform with
  CONFIG_HIGHMEM=y
  CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
  CONFIG_NETFILTER_XT_MATCH_STRING=y
Run
  ip link set dev lo mtu 1500
  iptables -A OUTPUT -m string --string 'badstring' --algo bm -j ACCEPT
  dd if=/dev/urandom of=in bs=1M count=20
  nc -l -p 8000 > /dev/null &
  nc -w 1 -q 0 localhost 8000 < in
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
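The per-page iteration described above can be illustrated with a small userspace sketch. It only shows how a byte range is cut into chunks that never cross a page boundary, so that each chunk could be mapped on its own. `SKETCH_PAGE_SIZE` and `split_by_page()` are hypothetical names, and the real skb_frag_must_loop condition is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the illustration */

/*
 * Split the byte range [off, off + len) into chunks that each stay within
 * a single page. Chunk lengths are stored in chunk_len[]; the return value
 * is the number of chunks written (at most max_chunks).
 */
static unsigned int split_by_page(unsigned int off, unsigned int len,
                                  unsigned int *chunk_len,
                                  unsigned int max_chunks)
{
    unsigned int n = 0;

    assert(chunk_len != NULL);

    while (len > 0 && n < max_chunks) {
        /* Bytes left in the page that contains 'off'. */
        unsigned int take = SKETCH_PAGE_SIZE - (off % SKETCH_PAGE_SIZE);

        if (take > len)
            take = len;
        chunk_len[n++] = take;
        off += take;
        len -= take;
    }
    return n;
}
```

A range that fits in one page yields a single chunk, which matches the goal of not adding iterations when they are not needed.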
2021-01-11Merge tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds1-1/+85
Pull nfsd fixes from Chuck Lever:
 - Fix major TCP performance regression
 - Get NFSv4.2 READ_PLUS regression tests to pass
 - Improve NFSv4 COMPOUND memory allocation
 - Fix sparse warning

* tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  NFSD: Restore NFSv4 decoding's SAVEMEM functionality
  SUNRPC: Handle TCP socket sends with kernel_sendpage() again
  NFSD: Fix sparse warning in nfssvc.c
  nfsd: Don't set eof on a truncated READ_PLUS
  nfsd: Fixes for nfsd4_encode_read_plus_data()
2021-01-11netfilter: nf_nat: Fix memleak in nf_nat_initDinghao Liu1-0/+1
When register_pernet_subsys() fails, nf_nat_bysource should be freed just like when nf_ct_extend_register() fails. Fixes: 1cd472bf036ca ("netfilter: nf_nat: add nat hook register functions to nf_nat") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
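The unwinding pattern behind this kind of fix can be sketched in userspace C. Everything here is hypothetical: `table` stands in for nf_nat_bysource, the two register functions are stubs, and `fail_step` is a fault-injection knob that exists only for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

static void *table;   /* stands in for the hash table allocated at init */
static int fail_step; /* hypothetical: 1 or 2 selects the step that fails */

static int register_extension(void) { return fail_step == 1 ? -1 : 0; }
static void unregister_extension(void) { }
static int register_pernet(void) { return fail_step == 2 ? -1 : 0; }

static int nat_init_sketch(void)
{
    int ret;

    assert(table == NULL); /* init is expected to run once */

    table = calloc(16, sizeof(void *));
    if (!table)
        return -1;

    ret = register_extension();
    if (ret < 0)
        goto err_free;

    ret = register_pernet();
    if (ret < 0)
        goto err_unregister;

    return 0;

err_unregister:
    unregister_extension();
err_free:
    /* Every failure after the allocation must release the table. */
    free(table);
    table = NULL;
    return ret;
}
```

Chaining the labels means a later failure falls through the earlier cleanup steps, so no exit path can skip the free.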
2021-01-10net: sunrpc: interpret the return value of kstrtou32 correctlyj.nixdorf@avm.de1-1/+1
A return value of 0 means success. This is documented in lib/kstrtox.c. This was found by trying to mount an NFS share from a link-local IPv6 address with the interface specified by its index: mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1") Before this commit this failed with EINVAL and also caused the following message in dmesg: [...] NFS: bad IP address specified: addr=fe80::1%1 The syscall using the same address based on the interface name instead of its index succeeds. Credits for this patch go to my colleague Christian Speich, who traced the origin of this bug to this line of code. Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de> Fixes: 00cfaa943ec3 ("replace strict_strto calls") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
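The return-value convention can be shown with a userspace stand-in. `parse_u32_dec()` is a hypothetical helper that follows the same convention as kstrtou32() (0 on success, negative errno on failure) but is not the kernel function. `parse_scope_id()` is likewise only an illustration of the caller-side check.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in: 0 on success, negative errno on failure. */
static int parse_u32_dec(const char *s, uint32_t *res)
{
    char *end;
    unsigned long long v;

    assert(s != NULL && res != NULL);

    if (!isdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (*end != '\0')
        return -EINVAL;
    if (errno == ERANGE || v > UINT32_MAX)
        return -ERANGE;

    *res = (uint32_t)v;
    return 0;
}

/* Returns true when 'scope' (the part after '%') is a numeric index. */
static bool parse_scope_id(const char *scope, uint32_t *scope_id)
{
    /* Success is a return value of 0, so compare against 0 explicitly. */
    return parse_u32_dec(scope, scope_id) == 0;
}
```

Treating a nonzero return as success would invert the check, so a numeric index would be rejected while a failed parse would be accepted.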
2021-01-10netfilter: conntrack: fix reading nf_conntrack_bucketsJesper Dangaard Brouer1-0/+3
The old way of changing the conntrack hash size at runtime was through the module param, via the file /sys/module/nf_conntrack/parameters/hashsize. This was extended to a sysctl in commit 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too"). That commit introduced a second "user" variable, nf_conntrack_htable_size_user, which shadows the actual variable nf_conntrack_htable_size. When hashsize is changed via the module param, this "user" variable isn't updated. As a result, sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users update the size the old way. This patch fixes the issue by always updating the "user" variable when reading the proc file. This takes care of changes to the actual variable without the sysctl needing to be aware of them. Fixes: 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too") Reported-by: Yoel Caspersen <yoel@kviknet.dk> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-10tipc: fix NULL deref in tipc_link_xmit()Hoang Le1-2/+7
The buffer list can have zero skb as following path: tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so we need to check the list before casting an &sk_buff. Fault report: [] tipc: Bulk publication failure [] general protection fault, probably for non-canonical [#1] PREEMPT [...] [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf] [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...] [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018 [] FS: 0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...] [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0 Fixes: af9b028e270fd ("tipc: make media xmit call outside node spinlock context") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
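The guard described above, checking the list before touching its first element, can be sketched in userspace C. `struct pkt`, `struct pkt_list` and `first_pkt_type()` are hypothetical and much simpler than the real skb queue, and the -1 return value is a choice made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
    int type;
    struct pkt *next;
};

struct pkt_list {
    struct pkt *head;
    unsigned int qlen;
};

/* Returns the type of the first packet, or -1 when there is nothing queued. */
static int first_pkt_type(const struct pkt_list *list)
{
    assert(list != NULL);

    /* Bail out before dereferencing the head of an empty list. */
    if (list->qlen == 0 || list->head == NULL)
        return -1;

    return list->head->type;
}
```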
2021-01-10net: ipv6: Validate GSO SKB before finish IPv6 processingAya Levin1-1/+40
There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface (virtio-net/vhost/tap) with TSO/GSO size set by other network stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is bridged to a NETIF_F_TSO tunnel stacked over an interface with an insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish. This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e50849054a4 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09net: make sure devices go through netdev_wait_all_refsJakub Kicinski1-10/+4
If register_netdevice() fails at the very last stage - the notifier call - some subsystems may have already seen it and grabbed a reference. struct net_device can't be freed right away without calling netdev_wait_all_refs(). Now that we have a clean interface in form of dev->needs_free_netdev and lenient free_netdev() we can undo what commit 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.") has done and complete the unregistration path by bringing the net_set_todo() call back. After registration fails user is still expected to explicitly free the net_device, so make sure ->needs_free_netdev is cleared, otherwise rolling back the registration will cause the old double free for callers who release rtnl_lock before the free. This also solves the problem of priv_destructor not being called on notifier error. net_set_todo() will be moved back into unregister_netdevice_queue() in a follow up. Reported-by: Hulk Robot <hulkci@huawei.com> Reported-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09net: make free_netdev() more lenient with unregistering devicesJakub Kicinski3-20/+18
There are two flavors of handling netdev registration: - ones called without holding rtnl_lock: register_netdev() and unregister_netdev(); and - those called with rtnl_lock held: register_netdevice() and unregister_netdevice(). While the semantics of the former are pretty clear, the same can't be said about the latter. The netdev_todo mechanism is utilized to perform some of the device unregistering tasks and it hooks into rtnl_unlock() so the locked variants can't actually finish the work. In general free_netdev() does not mix well with locked calls. Most drivers operating under rtnl_lock set dev->needs_free_netdev to true and expect core to make the free_netdev() call some time later. The part where this becomes most problematic is error paths. There is no way to unwind the state cleanly after a call to register_netdevice(), since unreg can't be performed fully without dropping locks. Make free_netdev() more lenient, and defer the freeing if device is being unregistered. This allows error paths to simply call free_netdev() both after register_netdevice() failed, and after a call to unregister_netdevice() but before dropping rtnl_lock. Simplify the error paths which are currently doing gymnastics around free_netdev() handling. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
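The deferral described above can be modelled in userspace C. This is a sketch under assumptions: `struct netdev_sketch`, the `reg_state` values and both functions are hypothetical, and `needs_free` only mimics the role of dev->needs_free_netdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum reg_state { REG_UNINITIALIZED, REG_REGISTERED, REG_UNREGISTERING };

struct netdev_sketch {
    enum reg_state state;
    bool needs_free; /* models dev->needs_free_netdev */
};

/* Returns true if the device was freed now, false if freeing was deferred. */
static bool free_netdev_sketch(struct netdev_sketch *dev)
{
    assert(dev != NULL);

    /* Unregistration still in flight: only mark the device for freeing. */
    if (dev->state == REG_UNREGISTERING) {
        dev->needs_free = true;
        return false;
    }

    free(dev);
    return true;
}

/*
 * Models the deferred work that finishes unregistration once the lock has
 * been dropped. Returns true if it freed the device.
 */
static bool run_todo_sketch(struct netdev_sketch *dev)
{
    assert(dev != NULL);

    dev->state = REG_UNINITIALIZED;
    if (dev->needs_free) {
        free(dev);
        return true;
    }
    return false;
}
```

An error path can then call the free function unconditionally. When unregistration is still pending, the actual release happens later in the deferred step.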
2021-01-09docs: net: explain struct net_device lifetimeJakub Kicinski1-1/+1
Explain the two basic flows of struct net_device's operation. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09udp: Prevent reuseport_select_sock from reading uninitialized socksBaptiste Lepers1-1/+1
reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, only read the array up until the last known safe index instead of incorrectly re-reading the last index of the array. Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
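The idea of bounding the scan by a count that was read once can be sketched as follows. This is a single-threaded userspace illustration with hypothetical names (`struct group`, `select_slot()`). Kernel code would additionally need the appropriate memory-ordering primitives, which are not modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_MAX 8

enum { ST_OPEN, ST_ESTABLISHED };

struct group {
    unsigned int num;     /* may grow concurrently in the real scenario */
    int state[GROUP_MAX]; /* per-slot state */
};

/*
 * Returns the index of the first non-established slot at or after 'start'
 * (wrapping around), or -1 if every slot is established or the group is
 * empty.
 */
static int select_slot(const struct group *g, unsigned int start)
{
    /* Snapshot the count once; every bound check below uses this value. */
    unsigned int n = g->num;
    unsigned int i, first;

    if (n == 0)
        return -1;
    assert(n <= GROUP_MAX);

    i = first = start % n;
    do {
        if (g->state[i] != ST_ESTABLISHED)
            return (int)i;
        if (++i >= n) /* wrap at the snapshot, not at a fresh re-read */
            i = 0;
    } while (i != first);

    return -1;
}
```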
2021-01-09net: fix use-after-free when UDP GRO with shared fraglistDongseok Yi1-1/+19
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC writes, it will call skb_ensure_writable -> pskb_expand_head to create a private linear section for the head_skb. And then call skb_clone_fraglist -> skb_get on each skb in the fraglist. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in PF_PACKET. Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have a link for the same frag_skbs chain. If a new skb (not frags) is queued to one of the sk_receive_queue, multiple ptypes can see and release this. It causes use-after-free. [ 4443.426215] ------------[ cut here ]------------ [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) Signed-off-by: Dongseok Yi <dseok.yi@samsung.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08cfg80211: Save the regulatory domain with a lockIlan Peer1-1/+10
Saving the regulatory domain while setting a custom regulatory domain was done while accessing an RCU-protected pointer but without any protection. Fix this by using RTNL while accessing the pointer. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Reported-by: syzbot+27771d4abcd9b7a1f5d3@syzkaller.appspotmail.com Reported-by: syzbot+db4035751c56c0079282@syzkaller.appspotmail.com Reported-by: Hans de Goede <hdegoede@redhat.com> Fixes: beee24695157 ("cfg80211: Save the regulatory domain when setting custom regulatory") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210105165657.613e9a876829.Ia38d27dbebea28bf9c56d70691d243186ede70e7@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-08nexthop: Bounce NHA_GATEWAY in FDB nexthop groupsPetr Machata1-1/+1
The function nh_check_attr_group() is called to validate nexthop groups. The intention of that code seems to have been to bounce all attributes above NHA_GROUP_TYPE except for NHA_FDB. However, instead it bounces all these attributes except when the NHA_FDB attribute is present, in which case it accepts them. NHA_FDB validation that takes place before, in rtm_to_nh_config(), already bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further back, NHA_GROUPS and NHA_MASTER are bounced unconditionally. But that still leaves NHA_GATEWAY as an attribute that would be accepted in FDB nexthop groups (with no meaning), so long as it keeps the address family as unspecified: # ip nexthop add id 1 fdb via 127.0.0.1 # ip nexthop add id 10 fdb via default group 1 The nexthop code is still relatively new and likely not used very broadly, and the FDB bits are newer still. Even though there is a reproducer out there, it relies on the improbable gateway arguments "via default", "via all" or "via any". Given all this, I believe it is OK to reformulate the condition to do the right thing and bounce NHA_GATEWAY. Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops") Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
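The difference between the intended and the previous condition can be sketched in userspace C. The attribute enum below is hypothetical and deliberately small. It does not reproduce the real NHA_* numbering, and `check_group_attrs()` is an illustrative helper, not the kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute ids; only their relative order matters here. */
enum {
    ATTR_GROUP_TYPE,
    ATTR_OIF,
    ATTR_GATEWAY,
    ATTR_FDB,
    ATTR_MAX
};

/* present[i] is true when attribute i was supplied with the request. */
static int check_group_attrs(const bool present[ATTR_MAX])
{
    int i;

    assert(present != NULL);

    for (i = ATTR_GROUP_TYPE + 1; i < ATTR_MAX; i++) {
        if (!present[i])
            continue;
        /*
         * Intended rule: skip only the FDB attribute itself. The previous
         * logic skipped every attribute whenever FDB was present.
         */
        if (i == ATTR_FDB)
            continue;
        return -EINVAL;
    }
    return 0;
}
```

Keying the exception on the loop index instead of on the presence of the FDB attribute is what makes a stray gateway attribute bounce even in an FDB group.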
2021-01-08nexthop: Unlink nexthop group entry in error pathIdo Schimmel1-1/+3
In case of error, remove the nexthop group entry from the list to which it was previously added. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>