Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
tiny skbs") ensured that skbs with data size lower than 1025 bytes
will be kmalloc'ed to avoid excessive page cache fragmentation and
memory consumption.
However, the fix adressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be achieved
through __netdev_alloc_skb(), which is still used by several drivers.
Drivers often allocate a tiny skb for headers and place the rest of
the frame to frags (so-called copybreak).
Mirror the condition to __netdev_alloc_skb() to handle this case too.
Since v1 [0]:
- fix "Fixes:" tag;
- refine commit message (mention copybreak usecase).
[0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me
Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
expressions
If the set definition provides no stateful expressions, then include the
stateful expression in the ruleset listing. Without this fix, the dynset
rule listing shows the stateful expressions provided by the set
definition.
Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Otherwise, the newly create element shows no timeout when listing the
ruleset. If the set definition does not specify a default timeout, then
the set element only shows the expiration time, but not the timeout.
This is a problem when restoring a stateful ruleset listing since it
skips the timeout policy entirely.
Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
If the set definition contains stateful expressions, allocate them for
the newly added entries from the packet path.
Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT
attribute is not silly.
UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29
shift exponent 255 is too large for 32-bit type 'int'
CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
valid_perfect_hash net/sched/cls_tcindex.c:260 [inline]
tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425
tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546
tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127
rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:672
____sys_sendmsg+0x6e8/0x810 net/socket.c:2336
___sys_sendmsg+0xf3/0x170 net/socket.c:2390
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2423
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot report reminded us that very big ewma_log were supported in the past,
even if they made litle sense.
tc qdisc replace dev xxx root est 1sec 131072sec ...
While fixing the bug, also add boundary checks for ewma_log, in line
with range supported by iproute2.
UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38
shift exponent -1 is negative
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83
call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417
expire_timers kernel/time/timer.c:1462 [inline]
__run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731
__run_timers kernel/time/timer.c:1712 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744
__do_softirq+0x2bc/0xa77 kernel/softirq.c:343
asm_call_irq_on_stack+0xf/0x20
</IRQ>
__run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77
invoke_softirq kernel/softirq.c:226 [inline]
__irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420
irq_exit_rcu+0x5/0x20 kernel/softirq.c:432
sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline]
RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline]
RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline]
RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline]
RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516
Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.
UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x183/0x22e lib/dump_stack.c:120
ubsan_epilogue lib/ubsan.c:148 [inline]
__ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
__detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg net/socket.c:672 [inline]
____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
___sys_sendmsg net/socket.c:2399 [inline]
__sys_sendmsg+0x319/0x400 net/socket.c:2432
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2021-01-16
1) Fix a double bpf_prog_put() for BPF_PROG_{TYPE_EXT,TYPE_TRACING} types in
link creation's error path causing a refcount underflow, from Jiri Olsa.
2) Fix BTF validation errors for the case where kernel modules don't declare
any new types and end up with an empty BTF, from Andrii Nakryiko.
3) Fix BPF local storage helpers to first check their {task,inode} owners for
being NULL before access, from KP Singh.
4) Fix a memory leak in BPF setsockopt handling for the case where optlen is
zero and thus temporary optval buffer should be freed, from Stanislav Fomichev.
5) Fix a syzbot memory allocation splat in BPF_PROG_TEST_RUN infra for
raw_tracepoint caused by too big ctx_size_in, from Song Liu.
6) Fix LLVM code generation issues with verifier where PTR_TO_MEM{,_OR_NULL}
registers were spilled to stack but not recognized, from Gilad Reti.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
MAINTAINERS: Update my email address
selftests/bpf: Add verifier test for PTR_TO_MEM spill
bpf: Support PTR_TO_MEM{,_OR_NULL} register spilling
bpf: Reject too big ctx_size_in for raw_tp test run
libbpf: Allow loading empty BTFs
bpf: Allow empty module BTFs
bpf: Don't leak memory in bpf getsockopt when optlen == 0
bpf: Update local storage test to check handling of null ptrs
bpf: Fix typo in bpf_inode_storage.c
bpf: Local storage helpers should check nullness of owner ptr passed
bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach
====================
Link: https://lore.kernel.org/r/20210116002025.15706-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.
And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.
Reported-and-tested-by: syzbot+2624e3778b18fc497c92@syzkaller.appspotmail.com
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20210115185024.72298-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"We have a few fixes for long standing issues, in particular Eric's fix
to not underestimate the skb sizes, and my fix for brokenness of
register_netdevice() error path. They may uncover other bugs so we
will keep an eye on them. Also included are Willem's fixes for
kmap(_atomic).
Looking at the "current release" fixes, it seems we are about one rc
behind a normal cycle. We've previously seen an uptick of "people had
run their test suites" / "humans actually tried to use new features"
fixes between rc2 and rc3.
Summary:
Current release - regressions:
- fix feature enforcement to allow NETIF_F_HW_TLS_TX if IP_CSUM &&
IPV6_CSUM
- dcb: accept RTM_GETDCB messages carrying set-like DCB commands if
user is admin for backward-compatibility
- selftests/tls: fix selftests build after adding ChaCha20-Poly1305
Current release - always broken:
- ppp: fix refcount underflow on channel unbridge
- bnxt_en: clear DEFRAG flag in firmware message when retry flashing
- smc: fix out of bound access in the new netlink interface
Previous releases - regressions:
- fix use-after-free with UDP GRO by frags
- mptcp: better msk-level shutdown
- rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM
request
- i40e: xsk: fix potential NULL pointer dereferencing
Previous releases - always broken:
- skb frag: kmap_atomic fixes
- avoid 32 x truesize under-estimation for tiny skbs
- fix issues around register_netdevice() failures
- udp: prevent reuseport_select_sock from reading uninitialized socks
- dsa: unbind all switches from tree when DSA master unbinds
- dsa: clear devlink port type before unregistering slave netdevs
- can: isotp: isotp_getname(): fix kernel information leak
- mlxsw: core: Thermal control fixes
- ipv6: validate GSO SKB against MTU before finish IPv6 processing
- stmmac: use __napi_schedule() for PREEMPT_RT
- net: mvpp2: remove Pause and Asym_Pause support
Misc:
- remove from MAINTAINERS folks who had been inactive for >5yrs"
* tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
mptcp: fix locking in mptcp_disconnect()
net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
MAINTAINERS: dccp: move Gerrit Renker to CREDITS
MAINTAINERS: ipvs: move Wensong Zhang to CREDITS
MAINTAINERS: tls: move Aviad to CREDITS
MAINTAINERS: ena: remove Zorik Machulsky from reviewers
MAINTAINERS: vrf: move Shrijeet to CREDITS
MAINTAINERS: net: move Alexey Kuznetsov to CREDITS
MAINTAINERS: altx: move Jay Cliburn to CREDITS
net: avoid 32 x truesize under-estimation for tiny skbs
nt: usb: USB_RTL8153_ECM should not default to y
net: stmmac: fix taprio configuration when base_time is in the past
net: stmmac: fix taprio schedule configuration
net: tip: fix a couple kernel-doc markups
net: sit: unregister_netdevice on newlink's error path
net: stmmac: Fixed mtu channged by cache aligned
cxgb4/chtls: Fix tid stuck due to wrong update of qid
i40e: fix potential NULL pointer dereferencing
net: stmmac: use __napi_schedule() for PREEMPT_RT
can: mcp251xfd: mcp251xfd_handle_rxif_one(): fix wrong NULL pointer check
...
|
|
Check if atf has been disabled in __ieee80211_schedule_txq() in order to
avoid a given sta is always put to the beginning of the active_txqs list
and never moved to the end since deficit is not decremented in
ieee80211_sta_register_airtime()
Fixes: b4809e9484da1 ("mac80211: Add airtime accounting and scheduling to TXQs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/93889406c50f1416214c079ca0b8c9faecc5143e.1608975195.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
ieee80211_tx_h_select_key drops any non-mgmt packets without a key when
encryption is used. This is wrong for nulldata packets that can't be
encrypted and are sent out for probing clients and indicating 4-address
mode.
Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: a0761a301746 ("mac80211: drop data frames without key on encrypted links")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218191525.1168-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When using WEP, the default unicast key needs to be selected, instead of
the STA PTK.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When using WEP, the default unicast key needs to be selected, instead of
the STA PTK.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This fixes strlen mismatch problems happening in some .write callbacks
of debugfs.
When trying to configure airtime_flags in debugfs, an error appeared:
ash: write error: Invalid argument
The error is returned from kstrtou16() since a wrong length makes it
miss the real end of input string. To fix this, use count as the string
length, and set proper end of string for a char buffer.
The debug print is shown - airtime_flags_write: count = 2, len = 8,
where the actual length is 2, but "len = strlen(buf)" gets 8.
Also cleanup the other similar cases for the sake of consistency.
Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Shayne Chen <shayne.chen@mediatek.com>
Link: https://lore.kernel.org/r/20210112032028.7482-1-shayne.chen@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
tcp_disconnect() expects the caller acquires the sock lock,
but mptcp_disconnect() is not doing that. Add the missing
required lock.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 76e2a55d1625 ("mptcp: better msk-level shutdown.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
Here we fix it.
Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.
Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head
While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under estimating memory usage.
For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC
We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768
This patch makes sure that small skb head are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.
Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()
Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)
I would like to thank Greg Thelen for his precious help on this matter,
analysing crash dumps is always a time consuming task.
Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A function has a different name between their prototype
and its kernel-doc markup:
../net/tipc/link.c:2551: warning: expecting prototype for link_reset_stats(). Prototype was for tipc_link_reset_stats() instead
../net/tipc/node.c:1678: warning: expecting prototype for is the general link level function for message sending(). Prototype was for tipc_node_xmit() instead
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We need to unregister the netdevice if config failed.
.ndo_uninit takes care of most of the heavy lifting.
This was uncovered by recent commit c269a24ce057 ("net: make
free_netdev() more lenient with unregistering devices").
Previously the partially-initialized device would be left
in the system.
Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com
Fixes: e2f1f072db8d ("sit: allow to configure 6rd tunnels via netlink")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported a WARNING for allocating too big memory:
WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011
Modules linked in:
CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976
Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8
RSP: 0018:ffffc900012efb10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff9200025df66 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000140dc0
RBP: 0000000000140dc0 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81b1f7e1 R11: 0000000000000000 R12: 0000000000000014
R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
FS: 000000000190c880(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f08b7f316c0 CR3: 0000000012073000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc include/linux/slab.h:557 [inline]
kzalloc include/linux/slab.h:682 [inline]
bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282
bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline]
__do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440499
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffe1f3bfb18 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440499
RDX: 0000000000000048 RSI: 0000000020000600 RDI: 000000000000000a
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401ca0
R13: 0000000000401d30 R14: 0000000000000000 R15: 0000000000000000
This is because we didn't filter out too big ctx_size_in. Fix it by
rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64
numbers.
Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
Reported-by: syzbot+4f98876664c7337a4ae6@syzkaller.appspotmail.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112234254.1906829-1-songliubraving@fb.com
|
|
Initialize the sockaddr_can structure to prevent a data leak to user space.
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: syzbot+057884e2f453e8afebc8@syzkaller.appspotmail.com
Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20210112091643.11789-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.
As it happens, we used READ_ONCE() to read the state a few lines above the
unmarked read in rxrpc_input_data(), so use that value rather than
re-reading it.
Fixes: a158bdd3247b ("rxrpc: Fix call timeouts")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/161046715522.2450566.488819910256264150.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Clang static analysis reports the following:
net/rxrpc/key.c:657:11: warning: Assigned value is garbage or undefined
toksize = toksizes[tok++];
^ ~~~~~~~~~~~~~~~
rxrpc_read() contains two consecutive loops. The first loop calculates the
token sizes and stores the results in toksizes[] and the second one uses
the array. When there is an error in identifying the token in the first
loop, the token is skipped, no change is made to the toksizes[] array.
When the same error happens in the second loop, the token is not skipped.
This will cause the toksizes[] array to be out of step and will overrun
past the calculated sizes.
Fix this by making both loops log a message and return an error in this
case. This should only happen if a new token type is incompletely
implemented, so it should normally be impossible to trigger this.
Fixes: 9a059cd5ca7d ("rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()")
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/161046503122.2445787.16714129930607546635.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 156708adf2d9 ("SUNRPC: Move the svc_xdr_recvfrom()
tracepoint") tried to capture the correct XID in the trace record,
but this line in svc_recv:
rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);
alters the size of rq_arg.head[0].iov_len. The tracepoint records
the correct XID but an incorrect value for the length of the
xdr_buf's head.
To keep the trace callsites simple, I've created two trace classes.
One assumes the xdr_buf contains a full RPC message, and the XID
can be extracted from it. The other assumes the contents of the
xdr_buf are arbitrary, and the xid will be provided by the caller.
Currently there is only one user of each class, but I expect we will
need a few more tracepoints using each class as time goes on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Pass conntrack -f to specify family in netfilter conntrack helper
selftests, from Chen Yi.
2) Honor hashsize modparam from nf_conntrack_buckets sysctl,
from Jesper D. Brouer.
3) Fix memleak in nf_nat_init() error path, from Dinghao Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nf_nat: Fix memleak in nf_nat_init
netfilter: conntrack: fix reading nf_conntrack_buckets
selftests: netfilter: Pass family parameter "-f" to conntrack tool
====================
Link: https://lore.kernel.org/r/20210112222033.9732-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Using snprintf() to convert not null-terminated strings to null
terminated strings may cause out of bounds read in the source string.
Therefore use memcpy() and terminate the target string with a null
afterwards.
Fixes: a3db10efcc4c ("net/smc: Add support for obtaining SMCR device list")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
smc_clc_get_hostname() sets the host pointer to a buffer
which is not NULL-terminated (see smc_clc_init()).
Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com
Fixes: 099b990bd11a ("net/smc: Add support for obtaining system information")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of re-implementing most of inet_shutdown, re-use
such helper, and implement the MPTCP-specific bits at the
'proto' level.
The msk-level disconnect() can now be invoked, lets provide a
suitable implementation.
As a side effect, this fixes bad state management for listener
sockets. The latter could lead to division by 0 oops since
commit ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling").
Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine")
Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Syzkaller found a way to trigger division by zero
in mptcp_subflow_cleanup_rbuf().
The current checks implemented into tcp_can_send_ack()
are too week, let's be more accurate.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")
Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Florian reported a use-after-free bug in devlink_nl_port_fill found with
KASAN:
(devlink_nl_port_fill)
(devlink_port_notify)
(devlink_port_unregister)
(dsa_switch_teardown.part.3)
(dsa_tree_teardown_switches)
(dsa_unregister_switch)
(bcm_sf2_sw_remove)
(platform_remove)
(device_release_driver_internal)
(device_links_unbind_consumers)
(device_release_driver_internal)
(device_driver_detach)
(unbind_store)
Allocated by task 31:
alloc_netdev_mqs+0x5c/0x50c
dsa_slave_create+0x110/0x9c8
dsa_register_switch+0xdb0/0x13a4
b53_switch_register+0x47c/0x6dc
bcm_sf2_sw_probe+0xaa4/0xc98
platform_probe+0x90/0xf4
really_probe+0x184/0x728
driver_probe_device+0xa4/0x278
__device_attach_driver+0xe8/0x148
bus_for_each_drv+0x108/0x158
Freed by task 249:
free_netdev+0x170/0x194
dsa_slave_destroy+0xac/0xb0
dsa_port_teardown.part.2+0xa0/0xb4
dsa_tree_teardown_switches+0x50/0xc4
dsa_unregister_switch+0x124/0x250
bcm_sf2_sw_remove+0x98/0x13c
platform_remove+0x44/0x5c
device_release_driver_internal+0x150/0x254
device_links_unbind_consumers+0xf8/0x12c
device_release_driver_internal+0x84/0x254
device_driver_detach+0x30/0x34
unbind_store+0x90/0x134
What happens is that devlink_port_unregister emits a netlink
DEVLINK_CMD_PORT_DEL message which associates the devlink port that is
getting unregistered with the ifindex of its corresponding net_device.
Only trouble is, the net_device has already been unregistered.
It looks like we can stub out the search for a corresponding net_device
if we clear the devlink_port's type. This looks like a bit of a hack,
but also seems to be the reason why the devlink_port_type_clear function
exists in the first place.
Fixes: 3122433eb533 ("net: dsa: Register devlink ports before calling DSA driver setup()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian fainelli <f.fainelli@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently the following happens when a DSA master driver unbinds while
there are DSA switches attached to it:
$ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind
------------[ cut here ]------------
WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507
Call trace:
rollback_registered_many+0x5fc/0x688
unregister_netdevice_queue+0x98/0x120
dsa_slave_destroy+0x4c/0x88
dsa_port_teardown.part.16+0x78/0xb0
dsa_tree_teardown_switches+0x58/0xc0
dsa_unregister_switch+0x104/0x1b8
felix_pci_remove+0x24/0x48
pci_device_remove+0x48/0xf0
device_release_driver_internal+0x118/0x1e8
device_driver_detach+0x28/0x38
unbind_store+0xd0/0x100
Located at the above location is this WARN_ON:
/* Notifier chain MUST detach us all upper devices. */
WARN_ON(netdev_has_any_upper_dev(dev));
Other stacked interfaces, like VLAN, do indeed listen for
NETDEV_UNREGISTER on the real_dev and also unregister themselves at that
time, which is clearly the behavior that rollback_registered_many
expects. But DSA interfaces are not VLAN. They have backing hardware
(platform devices, PCI devices, MDIO, SPI etc) which have a life cycle
of their own and we can't just trigger an unregister from the DSA
framework when we receive a netdev notifier that the master unregisters.
Luckily, there is something we can do, and that is to inform the driver
core that we have a runtime dependency to the DSA master interface's
device, and create a device link where that is the supplier and we are
the consumer. Having this device link will make the DSA switch unbind
before the DSA master unbinds, which is enough to avoid the WARN_ON from
rollback_registered_many.
Note that even before the blamed commit, DSA did nothing intelligent
when the master interface got unregistered either. See the discussion
here:
https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/
But this time, at least the WARN_ON is loud enough that the
upper_dev_link commit can be blamed.
The advantage with this approach vs dev_hold(master) in the attached
link is that the latter is not meant for long term reference counting.
With dev_hold, the only thing that will happen is that when the user
attempts an unbind of the DSA master, netdev_wait_allrefs will keep
waiting and waiting, due to DSA keeping the refcount forever. DSA would
not access freed memory corresponding to the master interface, but the
unbind would still result in a freeze. Whereas with device links,
graceful teardown is ensured. It even works with cascaded DSA trees.
$ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
[ 1818.797546] device swp0 left promiscuous mode
[ 1819.301112] sja1105 spi2.0: Link is Down
[ 1819.307981] DSA: tree 1 torn down
[ 1819.312408] device eno2 left promiscuous mode
[ 1819.656803] mscc_felix 0000:00:00.5: Link is Down
[ 1819.667194] DSA: tree 0 torn down
[ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down
This approach allows us to keep the DSA framework absolutely unchanged,
and the driver core will just know to unbind us first when the master
goes away - as opposed to the large (and probably impossible) rework
required if attempting to listen for NETDEV_UNREGISTER.
As per the documentation at Documentation/driver-api/device_link.rst,
specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link
to be automatically purged when the consumer fails to probe or later
unbinds. So we don't need to keep the consumer_link variable in struct
dsa_switch.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In commit 826f328e2b7e ("net: dcb: Validate netlink message in DCB
handler"), Linux started rejecting RTM_GETDCB netlink messages if they
contained a set-like DCB_CMD_ command.
The reason was that privileges were only verified for RTM_SETDCB messages,
but the value that determined the action to be taken is the command, not
the message type. And validation of message type against the DCB command
was the obvious missing piece.
Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
for configuration of DCB, accesses the DCB set-like APIs through
RTM_GETDCB.
Therefore do not bounce the discrepancy between message type and command.
Instead, in addition to validating privileges based on the actual message
type, validate them also based on the expected message type. This closes
the loophole of allowing DCB configuration on non-admin accounts, while
maintaining backward compatibility.
Fixes: 2f90b8657ec9 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Fixes: 826f328e2b7e ("net: dcb: Validate netlink message in DCB handler")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
- Fix parsing of link-local IPv6 addresses
- Fix confusing logging of mount errors that was introduced by the
fsopen() patchset.
- Fix a tracing use after free in _nfs4_do_setlk()
- Layout return-on-close fixes when called from nfs4_evict_inode()
- Layout segments were being leaked in
pnfs_generic_clear_request_commit()
- Don't leak DS commits in pnfs_generic_retry_commit()
- Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
calls iput() on an inode after the super block has gone away"
* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: nfs_igrab_and_active must first reference the superblock
NFS: nfs_delegation_find_inode_server must first reference the superblock
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
pNFS: Stricter ordering of layoutget and layoutreturn
pNFS: Clean up pnfs_layoutreturn_free_lsegs()
pNFS: We want return-on-close to complete when evicting the inode
pNFS: Mark layout for return if return-on-close was not sent
net: sunrpc: interpret the return value of kstrtou32 correctly
NFS: Adjust fs_context error logging
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
|
|
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.
It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.
skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.
This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_seq_read iterates over an skb, returning pointer and length of
the next data range with each call.
It relies on kmap_atomic to access highmem pages when needed.
An skb frag may be backed by a compound page, but kmap_atomic maps
only a single page. There are not enough kmap slots to always map all
pages concurrently.
Instead, if kmap_atomic is needed, iterate over each page.
As this increases the number of calls, avoid this unless needed.
The necessary condition is captured in skb_frag_must_loop.
I tried to make the change as obvious as possible. It should be easy
to verify that nothing changes if skb_frag_must_loop returns false.
Tested:
On an x86 platform with
CONFIG_HIGHMEM=y
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
CONFIG_NETFILTER_XT_MATCH_STRING=y
Run
ip link set dev lo mtu 1500
iptables -A OUTPUT -m string --string 'badstring' -algo bm -j ACCEPT
dd if=/dev/urandom of=in bs=1M count=20
nc -l -p 8000 > /dev/null &
nc -w 1 -q 0 localhost 8000 < in
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull nfsd fixes from Chuck Lever:
- Fix major TCP performance regression
- Get NFSv4.2 READ_PLUS regression tests to pass
- Improve NFSv4 COMPOUND memory allocation
- Fix sparse warning
* tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
NFSD: Restore NFSv4 decoding's SAVEMEM functionality
SUNRPC: Handle TCP socket sends with kernel_sendpage() again
NFSD: Fix sparse warning in nfssvc.c
nfsd: Don't set eof on a truncated READ_PLUS
nfsd: Fixes for nfsd4_encode_read_plus_data()
|
|
When register_pernet_subsys() fails, nf_nat_bysource
should be freed just like when nf_ct_extend_register()
fails.
Fixes: 1cd472bf036ca ("netfilter: nf_nat: add nat hook register functions to nf_nat")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A return value of 0 means success. This is documented in lib/kstrtox.c.
This was found by trying to mount an NFS share from a link-local IPv6
address with the interface specified by its index:
mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1")
Before this commit this failed with EINVAL and also caused the following
message in dmesg:
[...] NFS: bad IP address specified: addr=fe80::1%1
The syscall using the same address based on the interface name instead
of its index succeeds.
Credits for this patch go to my colleague Christian Speich, who traced
the origin of this bug to this line of code.
Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Fixes: 00cfaa943ec3 ("replace strict_strto calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The old way of changing the conntrack hashsize runtime was through changing
the module param via file /sys/module/nf_conntrack/parameters/hashsize. This
was extended to sysctl change in commit 3183ab8997a4 ("netfilter: conntrack:
allow increasing bucket size via sysctl too").
The commit introduced second "user" variable nf_conntrack_htable_size_user
which shadow actual variable nf_conntrack_htable_size. When hashsize is
changed via module param this "user" variable isn't updated. This results in
sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users
update via the old way.
This patch fix the issue by always updating "user" variable when reading the
proc file. This will take care of changes to the actual variable without
sysctl need to be aware.
Fixes: 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too")
Reported-by: Yoel Caspersen <yoel@kviknet.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The buffer list can have zero skb as following path:
tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so
we need to check the list before casting an &sk_buff.
Fault report:
[] tipc: Bulk publication failure
[] general protection fault, probably for non-canonical [#1] PREEMPT [...]
[] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
[] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2
[] Hardware name: Bochs ..., BIOS Bochs 01/01/2011
[] RIP: 0010:tipc_link_xmit+0xc1/0x2180
[] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...]
[] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202
[] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d
[] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8
[] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded
[] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148
[] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018
[] FS: 0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...]
[] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0
Fixes: af9b028e270fd ("tipc: make media xmit call outside node spinlock context")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are cases where GSO segment's length exceeds the egress MTU:
- Forwarding of a TCP GRO skb, when DF flag is not set.
- Forwarding of an skb that arrived on a virtualisation interface
(virtio-net/vhost/tap) with TSO/GSO size set by other network
stack.
- Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
interface with a smaller MTU.
- Arriving GRO skb (or GSO skb in a virtualised environment) that is
bridged to a NETIF_F_TSO tunnel stacked over an interface with an
insufficient MTU.
If so:
- Consume the SKB and its segments.
- Issue an ICMP packet with 'Packet Too Big' message containing the
MTU, allowing the source host to reduce its Path MTU appropriately.
Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.
Fixes: 9e50849054a4 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().
Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f6f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.
After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.
This also solves the problem of priv_destructor not being called
on notifier error.
net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.
Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are two flavors of handling netdev registration:
- ones called without holding rtnl_lock: register_netdev() and
unregister_netdev(); and
- those called with rtnl_lock held: register_netdevice() and
unregister_netdevice().
While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.
The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.
Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.
Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Explain the two basic flows of struct net_device's operation.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.
Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.
skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.
Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.
[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823] skb_release_data+0x144/0x264
[ 4443.426828] kfree_skb+0x58/0xc4
[ 4443.426832] skb_queue_purge+0x64/0x9c
[ 4443.426844] packet_set_ring+0x5f0/0x820
[ 4443.426849] packet_setsockopt+0x5a4/0xcd0
[ 4443.426853] __sys_setsockopt+0x188/0x278
[ 4443.426858] __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869] el0_svc_common+0xf0/0x1d0
[ 4443.426873] el0_svc_handler+0x74/0x98
[ 4443.426880] el0_svc+0x8/0xc
Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Saving the regulatory domain while setting custom regulatory domain
was done while accessing a RCU protected pointer but without any
protection.
Fix this by using RTNL while accessing the pointer.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reported-by: syzbot+27771d4abcd9b7a1f5d3@syzkaller.appspotmail.com
Reported-by: syzbot+db4035751c56c0079282@syzkaller.appspotmail.com
Reported-by: Hans de Goede <hdegoede@redhat.com>
Fixes: beee24695157 ("cfg80211: Save the regulatory domain when setting custom regulatory")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210105165657.613e9a876829.Ia38d27dbebea28bf9c56d70691d243186ede70e7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.
NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.
But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:
# ip nexthop add id 1 fdb via 127.0.0.1
# ip nexthop add id 10 fdb via default group 1
The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.
Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case of error, remove the nexthop group entry from the list to which
it was previously added.
Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|