summaryrefslogtreecommitdiff
path: root/net/bridge
AgeCommit message (Collapse)AuthorFilesLines
5 daysbridge: fix C-VLAN preservation in 802.1ad vlan_tunnel egressAlexandre Knecht1-4/+7
[ Upstream commit 3128df6be147768fe536986fbb85db1d37806a9f ] When using an 802.1ad bridge with vlan_tunnel, the C-VLAN tag is incorrectly stripped from frames during egress processing. br_handle_egress_vlan_tunnel() uses skb_vlan_pop() to remove the S-VLAN from hwaccel before VXLAN encapsulation. However, skb_vlan_pop() also moves any "next" VLAN from the payload into hwaccel: /* move next vlan tag to hw accel tag */ __skb_vlan_pop(skb, &vlan_tci); __vlan_hwaccel_put_tag(skb, vlan_proto, vlan_tci); For QinQ frames where the C-VLAN sits in the payload, this moves it to hwaccel where it gets lost during VXLAN encapsulation. Fix by calling __vlan_hwaccel_clear_tag() directly, which clears only the hwaccel S-VLAN and leaves the payload untouched. This path is only taken when vlan_tunnel is enabled and tunnel_info is configured, so 802.1Q bridges are unaffected. Tested with 802.1ad bridge + VXLAN vlan_tunnel, verified C-VLAN preserved in VXLAN payload via tcpdump. Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Alexandre Knecht <knecht.alexandre@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20251228020057.2788865-1-knecht.alexandre@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet: bridge: Describe @tunnel_hash member in net_bridge_vlan_group structBagas Sanjaya1-0/+1
[ Upstream commit f79f9b7ace1713e4b83888c385f5f55519dfb687 ] Sphinx reports kernel-doc warning: WARNING: ./net/bridge/br_private.h:267 struct member 'tunnel_hash' not described in 'net_bridge_vlan_group' Fix it by describing @tunnel_hash member. Fixes: efa5356b0d9753 ("bridge: per vlan dst_metadata netlink support") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20251218042936.24175-2-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-07bridge: Redirect to backup port when port is administratively downIdo Schimmel1-1/+2
[ Upstream commit 3d05b24429e1de7a17c8fdccb04a04dbc8ad297b ] If a backup port is configured for a bridge port, the bridge will redirect known unicast traffic towards the backup port when the primary port is administratively up but without a carrier. This is useful, for example, in MLAG configurations where a system is connected to two switches and there is a peer link between both switches. The peer link serves as the backup port in case one of the switches loses its connection to the multi-homed system. In order to avoid flooding when the primary port loses its carrier, the bridge does not flush dynamic FDB entries pointing to the port upon STP disablement, if the port has a backup port. The above means that known unicast traffic destined to the primary port will be blackholed when the port is put administratively down, until the FDB entries pointing to it are aged-out. Given that the current behavior is quite weird and unlikely to be depended on by anyone, amend the bridge to redirect to the backup port also when the primary port is administratively down and not only when it does not have a carrier. The change is motivated by a report from a user who expected traffic to be redirected to the backup port when the primary port was put administratively down while debugging a network issue. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250812080213.325298-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28netfilter: nft_reject: unify reject init and dump into nft_rejectJose M. Guisado Gomez1-58/+2
[ Upstream commit 312ca575a50543a886a5dfa2af1e72aa6a5b601e ] Bridge family is using the same static init and dump function as inet. This patch removes duplicate code unifying these functions body into nft_reject.c so they can be reused in the rest of families supporting reject verdict. Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 91a79b792204 ("netfilter: nf_reject: don't leak dst refcount for loopback packets") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27netfilter: bridge: Move specific fragmented packet to slow_path instead of ↵Huajian Yang1-6/+6
dropping it [ Upstream commit aa04c6f45b9224b949aa35d4fa5f8d0ba07b23d4 ] The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for fragmented packets. The original bridge does not know that it is a fragmented packet and forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function nf_br_ip_fragment and br_ip6_fragment will check the headroom. In original br_forward, insufficient headroom of skb may indeed exist, but there's still a way to save the skb in the device driver after dev_queue_xmit.So droping the skb will change the original bridge forwarding in some cases. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Huajian Yang <huajianyang@asrmicro.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04bridge: netfilter: Fix forwarding of fragmented packetsIdo Schimmel2-5/+3
[ Upstream commit 91b6dbced0ef1d680afdd69b14fc83d50ebafaf3 ] When netfilter defrag hooks are loaded (due to the presence of conntrack rules, for example), fragmented packets entering the bridge will be defragged by the bridge's pre-routing hook (br_nf_pre_routing() -> ipv4_conntrack_defrag()). Later on, in the bridge's post-routing hook, the defragged packet will be fragmented again. If the size of the largest fragment is larger than what the kernel has determined as the destination MTU (using ip_skb_dst_mtu()), the defragged packet will be dropped. Before commit ac6627a28dbf ("net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward"), ip_skb_dst_mtu() would return dst_mtu() as the destination MTU. Assuming the dst entry attached to the packet is the bridge's fake rtable one, this would simply be the bridge's MTU (see fake_mtu()). However, after above mentioned commit, ip_skb_dst_mtu() ends up returning the route's MTU stored in the dst entry's metrics. Ideally, in case the dst entry is the bridge's fake rtable one, this should be the bridge's MTU as the bridge takes care of updating this metric when its MTU changes (see br_change_mtu()). Unfortunately, the last operation is a no-op given the metrics attached to the fake rtable entry are marked as read-only. Therefore, ip_skb_dst_mtu() ends up returning 1500 (the initial MTU value) and defragged packets are dropped during fragmentation when dealing with large fragments and high MTU (e.g., 9k). Fix by moving the fake rtable entry's metrics to be per-bridge (in a similar fashion to the fake rtable entry itself) and marking them as writable, thereby allowing MTU changes to be reflected. Fixes: 62fa8a846d7d ("net: Implement read-only protection and COW'ing of metrics.") Fixes: 33eb9873a283 ("bridge: initialize fake_rtable metrics") Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Closes: https://lore.kernel.org/netdev/PH0PR10MB4504888284FF4CBA648197D0ACB82@PH0PR10MB4504.namprd10.prod.outlook.com/ Tested-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250515084848.727706-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-17net: bridge: xmit: make sure we have at least eth header len bytesNikolay Aleksandrov1-0/+5
commit 8bd67ebb50c0145fd2ca8681ab65eb7e8cde1afc upstream. syzbot triggered an uninit value[1] error in bridge device's xmit path by sending a short (less than ETH_HLEN bytes) skb. To fix it check if we can actually pull that amount instead of assuming. Tested with dropwatch: drop at: br_dev_xmit+0xb93/0x12d0 [bridge] (0xffffffffc06739b3) origin: software timestamp: Mon May 13 11:31:53 2024 778214037 nsec protocol: 0x88a8 length: 2 original length: 2 drop reason: PKT_TOO_SMALL [1] BUG: KMSAN: uninit-value in br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65 br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65 __netdev_start_xmit include/linux/netdevice.h:4903 [inline] netdev_start_xmit include/linux/netdevice.h:4917 [inline] xmit_one net/core/dev.c:3531 [inline] dev_hard_start_xmit+0x247/0xa20 net/core/dev.c:3547 __dev_queue_xmit+0x34db/0x5350 net/core/dev.c:4341 dev_queue_xmit include/linux/netdevice.h:3091 [inline] __bpf_tx_skb net/core/filter.c:2136 [inline] __bpf_redirect_common net/core/filter.c:2180 [inline] __bpf_redirect+0x14a6/0x1620 net/core/filter.c:2187 ____bpf_clone_redirect net/core/filter.c:2460 [inline] bpf_clone_redirect+0x328/0x470 net/core/filter.c:2432 ___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997 __bpf_prog_run512+0xb5/0xe0 kernel/bpf/core.c:2238 bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline] __bpf_prog_run include/linux/filter.h:657 [inline] bpf_prog_run include/linux/filter.h:664 [inline] bpf_test_run+0x499/0xc30 net/bpf/test_run.c:425 bpf_prog_test_run_skb+0x14ea/0x1f20 net/bpf/test_run.c:1058 bpf_prog_test_run+0x6b7/0xad0 kernel/bpf/syscall.c:4269 __sys_bpf+0x6aa/0xd90 kernel/bpf/syscall.c:5678 __do_sys_bpf kernel/bpf/syscall.c:5767 [inline] __se_sys_bpf kernel/bpf/syscall.c:5765 [inline] __x64_sys_bpf+0xa0/0xe0 kernel/bpf/syscall.c:5765 x64_sys_call+0x96b/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:322 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+a63a1f6a062033cf0f40@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a63a1f6a062033cf0f40 Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Randy MacLeod <Randy.MacLeod@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17netfilter: br_netfilter: fix panic with metadata_dst skbAndy Roulin1-0/+5
[ Upstream commit f9ff7665cd128012868098bbd07e28993e314fdb ] Fix a kernel panic in the br_netfilter module when sending untagged traffic via a VxLAN device. This happens during the check for fragmentation in br_nf_dev_queue_xmit. It is dependent on: 1) the br_netfilter module being loaded; 2) net.bridge.bridge-nf-call-iptables set to 1; 3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port; 4) untagged frames with size higher than the VxLAN MTU forwarded/flooded When forwarding the untagged packet to the VxLAN bridge port, before the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL. Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check for frames that needs to be fragmented: frames with higher MTU than the VxLAN device end up calling br_nf_ip_fragment, which in turns call ip_skb_dst_mtu. The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst with valid dst->dev, thus the crash. This case was never supported in the first place, so drop the packet instead. PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data. [ 176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110 [ 176.292101] Mem abort info: [ 176.292184] ESR = 0x0000000096000004 [ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits [ 176.292530] SET = 0, FnV = 0 [ 176.292709] EA = 0, S1PTW = 0 [ 176.292862] FSC = 0x04: level 0 translation fault [ 176.293013] Data abort info: [ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000 [ 176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000 [ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce [ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2 [ 176.296314] Hardware name: linux,dummy-virt (DT) [ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter] [ 176.297636] sp : ffff800080003630 [ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8 [ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8 [ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28 [ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014 [ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632 [ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70 [ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000 [ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca [ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000 [ 176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0 [ 176.300889] Call trace: [ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter] [ 176.301703] nf_hook_slow+0x48/0x124 [ 176.302060] br_forward_finish+0xc8/0xe8 [bridge] [ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter] [ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter] [ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter] [ 176.303359] nf_hook_slow+0x48/0x124 [ 176.303803] __br_forward+0xc4/0x194 [bridge] [ 176.304013] br_flood+0xd4/0x168 [bridge] [ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge] [ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter] [ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter] [ 176.305428] br_handle_frame+0x244/0x3cc [bridge] [ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc [ 176.306080] __netif_receive_skb_one_core+0x40/0x8c [ 176.306197] __netif_receive_skb+0x18/0x64 [ 176.306369] process_backlog+0x80/0x124 [ 176.306540] __napi_poll+0x38/0x17c [ 176.306636] net_rx_action+0x124/0x26c [ 176.306758] __do_softirq+0x100/0x26c [ 176.307051] ____do_softirq+0x10/0x1c [ 176.307162] call_on_irq_stack+0x24/0x4c [ 176.307289] do_softirq_own_stack+0x1c/0x2c [ 176.307396] do_softirq+0x54/0x6c [ 176.307485] __local_bh_enable_ip+0x8c/0x98 [ 176.307637] __dev_queue_xmit+0x22c/0xd28 [ 176.307775] neigh_resolve_output+0xf4/0x1a0 [ 176.308018] ip_finish_output2+0x1c8/0x628 [ 176.308137] ip_do_fragment+0x5b4/0x658 [ 176.308279] ip_fragment.constprop.0+0x48/0xec [ 176.308420] __ip_finish_output+0xa4/0x254 [ 176.308593] ip_finish_output+0x34/0x130 [ 176.308814] ip_output+0x6c/0x108 [ 176.308929] ip_send_skb+0x50/0xf0 [ 176.309095] ip_push_pending_frames+0x30/0x54 [ 176.309254] raw_sendmsg+0x758/0xaec [ 176.309568] inet_sendmsg+0x44/0x70 [ 176.309667] __sys_sendto+0x110/0x178 [ 176.309758] __arm64_sys_sendto+0x28/0x38 [ 176.309918] invoke_syscall+0x48/0x110 [ 176.310211] el0_svc_common.constprop.0+0x40/0xe0 [ 176.310353] do_el0_svc+0x1c/0x28 [ 176.310434] el0_svc+0x34/0xb4 [ 176.310551] el0t_64_sync_handler+0x120/0x12c [ 176.310690] el0t_64_sync+0x190/0x194 [ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860) [ 176.315743] ---[ end trace 0000000000000000 ]--- [ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000 [ 176.316564] PHYS_OFFSET: 0xffff97d780000000 [ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b [ 176.317210] Memory Limit: none [ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---\ Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12net: bridge: br_fdb_external_learn_add(): always set EXT_LEARNJonas Gorski1-4/+2
[ Upstream commit bee2ef946d3184e99077be526567d791c473036f ] When userspace wants to take over a fdb entry by setting it as EXTERN_LEARNED, we set both flags BR_FDB_ADDED_BY_EXT_LEARN and BR_FDB_ADDED_BY_USER in br_fdb_external_learn_add(). If the bridge updates the entry later because its port changed, we clear the BR_FDB_ADDED_BY_EXT_LEARN flag, but leave the BR_FDB_ADDED_BY_USER flag set. If userspace then wants to take over the entry again, br_fdb_external_learn_add() sees that BR_FDB_ADDED_BY_USER and skips setting the BR_FDB_ADDED_BY_EXT_LEARN flags, thus silently ignores the update. Fix this by always allowing to set BR_FDB_ADDED_BY_EXT_LEARN regardless if this was a user fdb entry or not. Fixes: 710ae7287737 ("net: bridge: Mark FDB entries that were added by user as such") Signed-off-by: Jonas Gorski <jonas.gorski@bisdn.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20240903081958.29951-1-jonas.gorski@bisdn.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17net: bridge: fix corrupted ethernet header on multicast-to-unicastFelix Fietkau1-2/+7
[ Upstream commit 86b29d830ad69eecff25b22dc96c14c6573718e6 ] The change from skb_copy to pskb_copy unfortunately changed the data copying to omit the ethernet header, since it was pulled before reaching this point. Fix this by calling __skb_push/pull around pskb_copy. Fixes: 59c878cbcdd8 ("net: bridge: fix multicast-to-unicast with fraglist GSO") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17net: bridge: fix multicast-to-unicast with fraglist GSOFelix Fietkau1-1/+1
[ Upstream commit 59c878cbcdd80ed39315573b3511d0acfd3501b5 ] Calling skb_copy on a SKB_GSO_FRAGLIST skb is not valid, since it returns an invalid linearized skb. This code only needs to change the ethernet header, so pskb_copy is the right function to call here. Fixes: 6db6f0eae605 ("bridge: multicast to unicast") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13netfilter: validate user input for expected lengthEric Dumazet1-0/+6
commit 0c83842df40f86e529db6842231154772c20edcc upstream. I got multiple syzbot reports showing old bugs exposed by BPF after commit 20f2505fb436 ("bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt") setsockopt() @optlen argument should be taken into account before copying data. BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline] BUG: KASAN: slab-out-of-bounds in do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline] BUG: KASAN: slab-out-of-bounds in do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627 Read of size 96 at addr ffff88802cd73da0 by task syz-executor.4/7238 CPU: 1 PID: 7238 Comm: syz-executor.4 Not tainted 6.9.0-rc2-next-20240403-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 __asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105 copy_from_sockptr_offset include/linux/sockptr.h:49 [inline] copy_from_sockptr include/linux/sockptr.h:55 [inline] do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline] do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627 nf_setsockopt+0x295/0x2c0 net/netfilter/nf_sockopt.c:101 do_sock_setsockopt+0x3af/0x720 net/socket.c:2311 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x72/0x7a RIP: 0033:0x7fd22067dde9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fd21f9ff0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fd2207abf80 RCX: 00007fd22067dde9 RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007fd2206ca47a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000020000880 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fd2207abf80 R15: 00007ffd2d0170d8 </TASK> Allocated by task 7238: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] __do_kmalloc_node mm/slub.c:4069 [inline] __kmalloc_noprof+0x200/0x410 mm/slub.c:4082 kmalloc_noprof include/linux/slab.h:664 [inline] __cgroup_bpf_run_filter_setsockopt+0xd47/0x1050 kernel/bpf/cgroup.c:1869 do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293 __sys_setsockopt+0x1ae/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x72/0x7a The buggy address belongs to the object at ffff88802cd73da0 which belongs to the cache kmalloc-8 of size 8 The buggy address is located 0 bytes inside of allocated 1-byte region [ffff88802cd73da0, ffff88802cd73da1) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802cd73020 pfn:0x2cd73 flags: 0xfff80000000000(node=0|zone=1|lastcpupid=0xfff) page_type: 0xffffefff(slab) raw: 00fff80000000000 ffff888015041280 dead000000000100 dead000000000122 raw: ffff88802cd73020 000000008080007f 00000001ffffefff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 5103, tgid 2119833701 (syz-executor.4), ts 5103, free_ts 70804600828 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1490 prep_new_page mm/page_alloc.c:1498 [inline] get_page_from_freelist+0x2e7e/0x2f40 mm/page_alloc.c:3454 __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4712 __alloc_pages_node_noprof include/linux/gfp.h:244 [inline] alloc_pages_node_noprof include/linux/gfp.h:271 [inline] alloc_slab_page+0x5f/0x120 mm/slub.c:2249 allocate_slab+0x5a/0x2e0 mm/slub.c:2412 new_slab mm/slub.c:2465 [inline] ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3615 __slab_alloc+0x58/0xa0 mm/slub.c:3705 __slab_alloc_node mm/slub.c:3758 [inline] slab_alloc_node mm/slub.c:3936 [inline] __do_kmalloc_node mm/slub.c:4068 [inline] kmalloc_node_track_caller_noprof+0x286/0x450 mm/slub.c:4089 kstrdup+0x3a/0x80 mm/util.c:62 device_rename+0xb5/0x1b0 drivers/base/core.c:4558 dev_change_name+0x275/0x860 net/core/dev.c:1232 do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2864 __rtnl_newlink net/core/rtnetlink.c:3680 [inline] rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3727 rtnetlink_rcv_msg+0x89b/0x10d0 net/core/rtnetlink.c:6594 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2559 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361 page last free pid 5146 tgid 5146 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1110 [inline] free_unref_page+0xd3c/0xec0 mm/page_alloc.c:2617 discard_slab mm/slub.c:2511 [inline] __put_partials+0xeb/0x130 mm/slub.c:2980 put_cpu_partial+0x17c/0x250 mm/slub.c:3055 __slab_free+0x2ea/0x3d0 mm/slub.c:4254 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x9e/0x140 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:322 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook mm/slub.c:3888 [inline] slab_alloc_node mm/slub.c:3948 [inline] __do_kmalloc_node mm/slub.c:4068 [inline] __kmalloc_node_noprof+0x1d7/0x450 mm/slub.c:4076 kmalloc_node_noprof include/linux/slab.h:681 [inline] kvmalloc_node_noprof+0x72/0x190 mm/util.c:634 bucket_table_alloc lib/rhashtable.c:186 [inline] rhashtable_rehash_alloc+0x9e/0x290 lib/rhashtable.c:367 rht_deferred_worker+0x4e1/0x2440 lib/rhashtable.c:427 process_one_work kernel/workqueue.c:3218 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299 worker_thread+0x86d/0xd70 kernel/workqueue.c:3380 kthread+0x2f0/0x390 kernel/kthread.c:388 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243 Memory state around the buggy address: ffff88802cd73c80: 07 fc fc fc 05 fc fc fc 05 fc fc fc fa fc fc fc ffff88802cd73d00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc >ffff88802cd73d80: fa fc fc fc 01 fc fc fc fa fc fc fc fa fc fc fc ^ ffff88802cd73e00: fa fc fc fc fa fc fc fc 05 fc fc fc 07 fc fc fc ffff88802cd73e80: 07 fc fc fc 07 fc fc fc 07 fc fc fc 07 fc fc fc Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://lore.kernel.org/r/20240404122051.2303764-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28netfilter: nf_conntrack_bridge: initialize err to 0Linkui Xiao1-1/+1
[ Upstream commit a44af08e3d4d7566eeea98d7a29fe06e7b9de944 ] K2CI reported a problem: consume_skb(skb); return err; [nf_br_ip_fragment() error] uninitialized symbol 'err'. err is not initialized, because returning 0 is expected, initialize err to 0. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Reported-by: k2ci <kernel-bot@kylinos.cn> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10net: bridge: use DEV_STATS_INC()Eric Dumazet2-4/+4
[ Upstream commit 44bdb313da57322c9b3c108eb66981c6ec6509f4 ] syzbot/KCSAN reported data-races in br_handle_frame_finish() [1] This function can run from multiple cpus without mutual exclusion. Adopt SMP safe DEV_STATS_INC() to update dev->stats fields. Handles updates to dev->stats.tx_dropped while we are at it. [1] BUG: KCSAN: data-race in br_handle_frame_finish / br_handle_frame_finish read-write to 0xffff8881374b2178 of 8 bytes by interrupt on cpu 1: br_handle_frame_finish+0xd4f/0xef0 net/bridge/br_input.c:189 br_nf_hook_thresh+0x1ed/0x220 br_nf_pre_routing_finish_ipv6+0x50f/0x540 NF_HOOK include/linux/netfilter.h:304 [inline] br_nf_pre_routing_ipv6+0x1e3/0x2a0 net/bridge/br_netfilter_ipv6.c:178 br_nf_pre_routing+0x526/0xba0 net/bridge/br_netfilter_hooks.c:508 nf_hook_entry_hookfn include/linux/netfilter.h:144 [inline] nf_hook_bridge_pre net/bridge/br_input.c:272 [inline] br_handle_frame+0x4c9/0x940 net/bridge/br_input.c:417 __netif_receive_skb_core+0xa8a/0x21e0 net/core/dev.c:5417 __netif_receive_skb_one_core net/core/dev.c:5521 [inline] __netif_receive_skb+0x57/0x1b0 net/core/dev.c:5637 process_backlog+0x21f/0x380 net/core/dev.c:5965 __napi_poll+0x60/0x3b0 net/core/dev.c:6527 napi_poll net/core/dev.c:6594 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6727 __do_softirq+0xc1/0x265 kernel/softirq.c:553 run_ksoftirqd+0x17/0x20 kernel/softirq.c:921 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read-write to 0xffff8881374b2178 of 8 bytes by interrupt on cpu 0: br_handle_frame_finish+0xd4f/0xef0 net/bridge/br_input.c:189 br_nf_hook_thresh+0x1ed/0x220 br_nf_pre_routing_finish_ipv6+0x50f/0x540 NF_HOOK include/linux/netfilter.h:304 [inline] br_nf_pre_routing_ipv6+0x1e3/0x2a0 net/bridge/br_netfilter_ipv6.c:178 br_nf_pre_routing+0x526/0xba0 net/bridge/br_netfilter_hooks.c:508 nf_hook_entry_hookfn include/linux/netfilter.h:144 [inline] nf_hook_bridge_pre net/bridge/br_input.c:272 [inline] br_handle_frame+0x4c9/0x940 net/bridge/br_input.c:417 __netif_receive_skb_core+0xa8a/0x21e0 net/core/dev.c:5417 __netif_receive_skb_one_core net/core/dev.c:5521 [inline] __netif_receive_skb+0x57/0x1b0 net/core/dev.c:5637 process_backlog+0x21f/0x380 net/core/dev.c:5965 __napi_poll+0x60/0x3b0 net/core/dev.c:6527 napi_poll net/core/dev.c:6594 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6727 __do_softirq+0xc1/0x265 kernel/softirq.c:553 do_softirq+0x5e/0x90 kernel/softirq.c:454 __local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381 __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline] _raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210 spin_unlock_bh include/linux/spinlock.h:396 [inline] batadv_tt_local_purge+0x1a8/0x1f0 net/batman-adv/translation-table.c:1356 batadv_tt_purge+0x2b/0x630 net/batman-adv/translation-table.c:3560 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x00000000000d7190 -> 0x00000000000d7191 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 14848 Comm: kworker/u4:11 Not tainted 6.6.0-rc1-syzkaller-00236-gad8a69f361b9 #0 Fixes: 1c29fc4989bc ("[BRIDGE]: keep track of received multicast packets") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: bridge@lists.linux-foundation.org Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230918091351.1356153-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23netfilter: ebtables: fix fortify warnings in size_entry_mwt()GONG, Ruiqi1-2/+1
[ Upstream commit a7ed3465daa240bdf01a5420f64336fee879c09d ] When compiling with gcc 13 and CONFIG_FORTIFY_SOURCE=y, the following warning appears: In function ‘fortify_memcpy_chk’, inlined from ‘size_entry_mwt’ at net/bridge/netfilter/ebtables.c:2118:2: ./include/linux/fortify-string.h:592:25: error: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning] 592 | __read_overflow2_field(q_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The compiler is complaining: memcpy(&offsets[1], &entry->watchers_offset, sizeof(offsets) - sizeof(offsets[0])); where memcpy reads beyong &entry->watchers_offset to copy {watchers,target,next}_offset altogether into offsets[]. Silence the warning by wrapping these three up via struct_group(). Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27bridge: Add extack warning when enabling STP in netns.Kuniyuki Iwashima1-0/+3
[ Upstream commit 56a16035bb6effb37177867cea94c13a8382f745 ] When we create an L2 loop on a bridge in netns, we will see packets storm even if STP is enabled. # unshare -n # ip link add br0 type bridge # ip link add veth0 type veth peer name veth1 # ip link set veth0 master br0 up # ip link set veth1 master br0 up # ip link set br0 type bridge stp_state 1 # ip link set br0 up # sleep 30 # ip -s link show br0 2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether b6:61:98:1c:1c:b5 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 956553768 12861249 0 0 0 12861249 <-. Keep TX: bytes packets errors dropped carrier collsns | increasing 1027834 11951 0 0 0 0 <-' rapidly This is because llc_rcv() drops all packets in non-root netns and BPDU is dropped. Let's add extack warning when enabling STP in netns. # unshare -n # ip link add br0 type bridge # ip link set br0 type bridge stp_state 1 Warning: bridge: STP does not work in non-root netns. Note this commit will be reverted later when we namespacify the whole LLC infra. Fixes: e730c15519d0 ("[NET]: Make packet reception network namespace safe") Suggested-by: Harry Coin <hcoin@quietfountain.com> Link: https://lore.kernel.org/netdev/0f531295-e289-022d-5add-5ceffa0df9bc@quietfountain.com/ Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC modeVladimir Oltean1-2/+3
[ Upstream commit 6ca3c005d0604e8d2b439366e3923ea58db99641 ] According to the synchronization rules for .ndo_get_stats() as seen in Documentation/networking/netdevices.rst, acquiring a plain spin_lock() should not be illegal, but the bridge driver implementation makes it so. After running these commands, I am being faced with the following lockdep splat: $ ip link add link swp0 name macsec0 type macsec encrypt on && ip link set swp0 up $ ip link add dev br0 type bridge vlan_filtering 1 && ip link set br0 up $ ip link set macsec0 master br0 && ip link set macsec0 up ======================================================== WARNING: possible irq lock inversion dependency detected 6.4.0-04295-g31b577b4bd4a #603 Not tainted -------------------------------------------------------- swapper/1/0 just changed the state of lock: ffff6bd348724cd8 (&br->lock){+.-.}-{3:3}, at: br_forward_delay_timer_expired+0x34/0x198 but this lock took another, SOFTIRQ-unsafe lock in the past: (&ocelot->stats_lock){+.+.}-{3:3} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Chain exists of: &br->lock --> &br->hash_lock --> &ocelot->stats_lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ocelot->stats_lock); local_irq_disable(); lock(&br->lock); lock(&br->hash_lock); <Interrupt> lock(&br->lock); *** DEADLOCK *** (details about the 3 locks skipped) swp0 is instantiated by drivers/net/dsa/ocelot/felix.c, and this only matters to the extent that its .ndo_get_stats64() method calls spin_lock(&ocelot->stats_lock). Documentation/locking/lockdep-design.rst says: | A lock is irq-safe means it was ever used in an irq context, while a lock | is irq-unsafe means it was ever acquired with irq enabled. (...) | Furthermore, the following usage based lock dependencies are not allowed | between any two lock-classes:: | | <hardirq-safe> -> <hardirq-unsafe> | <softirq-safe> -> <softirq-unsafe> Lockdep marks br->hash_lock as softirq-safe, because it is sometimes taken in softirq context (for example br_fdb_update() which runs in NET_RX softirq), and when it's not in softirq context it blocks softirqs by using spin_lock_bh(). Lockdep marks ocelot->stats_lock as softirq-unsafe, because it never blocks softirqs from running, and it is never taken from softirq context. So it can always be interrupted by softirqs. There is a call path through which a function that holds br->hash_lock: fdb_add_hw_addr() will call a function that acquires ocelot->stats_lock: ocelot_port_get_stats64(). This can be seen below: ocelot_port_get_stats64+0x3c/0x1e0 felix_get_stats64+0x20/0x38 dsa_slave_get_stats64+0x3c/0x60 dev_get_stats+0x74/0x2c8 rtnl_fill_stats+0x4c/0x150 rtnl_fill_ifinfo+0x5cc/0x7b8 rtmsg_ifinfo_build_skb+0xe4/0x150 rtmsg_ifinfo+0x5c/0xb0 __dev_notify_flags+0x58/0x200 __dev_set_promiscuity+0xa0/0x1f8 dev_set_promiscuity+0x30/0x70 macsec_dev_change_rx_flags+0x68/0x88 __dev_set_promiscuity+0x1a8/0x1f8 __dev_set_rx_mode+0x74/0xa8 dev_uc_add+0x74/0xa0 fdb_add_hw_addr+0x68/0xd8 fdb_add_local+0xc4/0x110 br_fdb_add_local+0x54/0x88 br_add_if+0x338/0x4a0 br_add_slave+0x20/0x38 do_setlink+0x3a4/0xcb8 rtnl_newlink+0x758/0x9d0 rtnetlink_rcv_msg+0x2f0/0x550 netlink_rcv_skb+0x128/0x148 rtnetlink_rcv+0x24/0x38 the plain English explanation for it is: The macsec0 bridge port is created without p->flags & BR_PROMISC, because it is what br_manage_promisc() decides for a VLAN filtering bridge with a single auto port. As part of the br_add_if() procedure, br_fdb_add_local() is called for the MAC address of the device, and this results in a call to dev_uc_add() for macsec0 while the softirq-safe br->hash_lock is taken. Because macsec0 does not have IFF_UNICAST_FLT, dev_uc_add() ends up calling __dev_set_promiscuity() for macsec0, which is propagated by its implementation, macsec_dev_change_rx_flags(), to the lower device: swp0. This triggers the call path: dev_set_promiscuity(swp0) -> rtmsg_ifinfo() -> dev_get_stats() -> ocelot_port_get_stats64() with a calling context that lockdep doesn't like (br->hash_lock held). Normally we don't see this, because even though many drivers that can be bridge ports don't support IFF_UNICAST_FLT, we need a driver that (a) doesn't support IFF_UNICAST_FLT, *and* (b) it forwards the IFF_PROMISC flag to another driver, and (c) *that* driver implements ndo_get_stats64() using a softirq-unsafe spinlock. Condition (b) is necessary because the first __dev_set_rx_mode() calls __dev_set_promiscuity() with "bool notify=false", and thus, the rtmsg_ifinfo() code path won't be entered. The same criteria also hold true for DSA switches which don't report IFF_UNICAST_FLT. When the DSA master uses a spin_lock() in its ndo_get_stats64() method, the same lockdep splat can be seen. I think the deadlock possibility is real, even though I didn't reproduce it, and I'm thinking of the following situation to support that claim: fdb_add_hw_addr() runs on a CPU A, in a context with softirqs locally disabled and br->hash_lock held, and may end up attempting to acquire ocelot->stats_lock. In parallel, ocelot->stats_lock is currently held by a thread B (say, ocelot_check_stats_work()), which is interrupted while holding it by a softirq which attempts to lock br->hash_lock. Thread B cannot make progress because br->hash_lock is held by A. Whereas thread A cannot make progress because ocelot->stats_lock is held by B. When taking the issue at face value, the bridge can avoid that problem by simply making the ports promiscuous from a code path with a saner calling context (br->hash_lock not held). A bridge port without IFF_UNICAST_FLT is going to become promiscuous as soon as we call dev_uc_add() on it (which we do unconditionally), so why not be preemptive and make it promiscuous right from the beginning, so as to not be taken by surprise. With this, we've broken the links between code that holds br->hash_lock or br->lock and code that calls into the ndo_change_rx_flags() or ndo_get_stats64() ops of the bridge port. Fixes: 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30bridge: always declare tunnel functionsArnd Bergmann1-4/+4
[ Upstream commit 89dcd87ce534a3a7f267cfd58505803006f51301 ] When CONFIG_BRIDGE_VLAN_FILTERING is disabled, two functions are still defined but have no prototype or caller. This causes a W=1 warning for the missing prototypes: net/bridge/br_netlink_tunnel.c:29:6: error: no previous prototype for 'vlan_tunid_inrange' [-Werror=missing-prototypes] net/bridge/br_netlink_tunnel.c:199:5: error: no previous prototype for 'br_vlan_tunnel_info' [-Werror=missing-prototypes] The functions are already contitional on CONFIG_BRIDGE_VLAN_FILTERING, and I coulnd't easily figure out the right set of #ifdefs, so just move the declarations out of the #ifdef to avoid the warning, at a small cost in code size over a more elaborate fix. Fixes: 188c67dd1906 ("net: bridge: vlan options: add support for tunnel id dumping") Fixes: 569da0822808 ("net: bridge: vlan options: add support for tunnel mapping set/del") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230516194625.549249-3-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30net: add vlan_get_protocol_and_depth() helperEric Dumazet1-1/+1
[ Upstream commit 4063384ef762cc5946fc7a3f89879e76c6ec51e2 ] Before blamed commit, pskb_may_pull() was used instead of skb_header_pointer() in __vlan_get_protocol() and friends. Few callers depended on skb->head being populated with MAC header, syzbot caught one of them (skb_mac_gso_segment()) Add vlan_get_protocol_and_depth() to make the intent clearer and use it where sensible. This is a more generic fix than commit e9d3f80935b6 ("net/af_packet: make sure to pull mac header") which was dealing with a similar issue. kernel BUG at include/linux/skbuff.h:2655 ! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023 RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline] RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136 Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286 RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9 R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012 R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7 FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> [<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419 [<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline] [<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725 [<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313 [<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029 [<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline] [<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142 [<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline] [<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline] [<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139 [<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline] [<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline] [<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147 [<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80 [<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 469aceddfa3e ("vlan: consolidate VLAN parsing code and limit max parsing depth") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26netfilter: br_netfilter: fix recent physdev match breakageFlorian Westphal1-6/+11
[ Upstream commit 94623f579ce338b5fa61b5acaa5beb8aa657fb9e ] Recent attempt to ensure PREROUTING hook is executed again when a decrypted ipsec packet received on a bridge passes through the network stack a second time broke the physdev match in INPUT hook. We can't discard the nf_bridge info strct from sabotage_in hook, as this is needed by the physdev match. Keep the struct around and handle this with another conditional instead. Fixes: 2b272bb558f1 ("netfilter: br_netfilter: disable sabotage_in hook after first suppression") Reported-and-tested-by: Farid BENAMROUCHE <fariouche@yahoo.fr> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11netfilter: ebtables: fix table blob use-after-freeFlorian Westphal1-1/+1
[ Upstream commit e58a171d35e32e6e8c37cfe0e8a94406732a331f ] We are not allowed to return an error at this point. Looking at the code it looks like ret is always 0 at this point, but its not. t = find_table_lock(net, repl->name, &ret, &ebt_mutex); ... this can return a valid table, with ret != 0. This bug causes update of table->private with the new blob, but then frees the blob right away in the caller. Syzbot report: BUG: KASAN: vmalloc-out-of-bounds in __ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168 Read of size 4 at addr ffffc90005425000 by task kworker/u4:4/74 Workqueue: netns cleanup_net Call Trace: kasan_report+0xbf/0x1f0 mm/kasan/report.c:517 __ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168 ebt_unregister_table+0x35/0x40 net/bridge/netfilter/ebtables.c:1372 ops_exit_list+0xb0/0x170 net/core/net_namespace.c:169 cleanup_net+0x4ee/0xb10 net/core/net_namespace.c:613 ... ip(6)tables appears to be ok (ret should be 0 at this point) but make this more obvious. Fixes: c58dd2dd443c ("netfilter: Can't fail and free after table replacement") Reported-by: syzbot+f61594de72d6705aea03@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-15netfilter: br_netfilter: disable sabotage_in hook after first suppressionFlorian Westphal1-0/+1
[ Upstream commit 2b272bb558f1d3a5aa95ed8a82253786fd1a48ba ] When using a xfrm interface in a bridged setup (the outgoing device is bridged), the incoming packets in the xfrm interface are only tracked in the outgoing direction. $ brctl show bridge name interfaces br_eth1 eth1 $ conntrack -L tcp 115 SYN_SENT src=192... dst=192... [UNREPLIED] ... If br_netfilter is enabled, the first (encrypted) packet is received onR eth1, conntrack hooks are called from br_netfilter emulation which allocates nf_bridge info for this skb. If the packet is for local machine, skb gets passed up the ip stack. The skb passes through ip prerouting a second time. br_netfilter ip_sabotage_in supresses the re-invocation of the hooks. After this, skb gets decrypted in xfrm layer and appears in network stack a second time (after decryption). Then, ip_sabotage_in is called again and suppresses netfilter hook invocation, even though the bridge layer never called them for the plaintext incarnation of the packet. Free the bridge info after the first suppression to avoid this. I was unable to figure out where the regression comes from, as far as i can see br_netfilter always had this problem; i did not expect that skb is looped again with different headers. Fixes: c4b0e771f906 ("netfilter: avoid using skb->nf_bridge directly") Reported-and-tested-by: Wolfgang Nothdurft <wolfgang@linogate.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02bridge: switchdev: Fix memory leaks when changing VLAN protocolIdo Schimmel1-3/+14
[ Upstream commit 9d45921ee4cb364910097e7d1b7558559c2f9fd2 ] The bridge driver can offload VLANs to the underlying hardware either via switchdev or the 8021q driver. When the former is used, the VLAN is marked in the bridge driver with the 'BR_VLFLAG_ADDED_BY_SWITCHDEV' private flag. To avoid the memory leaks mentioned in the cited commit, the bridge driver will try to delete a VLAN via the 8021q driver if the VLAN is not marked with the previously mentioned flag. When the VLAN protocol of the bridge changes, switchdev drivers are notified via the 'SWITCHDEV_ATTR_ID_BRIDGE_VLAN_PROTOCOL' attribute, but the 8021q driver is also called to add the existing VLANs with the new protocol and delete them with the old protocol. In case the VLANs were offloaded via switchdev, the above behavior is both redundant and buggy. Redundant because the VLANs are already programmed in hardware and drivers that support VLAN protocol change (currently only mlx5) change the protocol upon the switchdev attribute notification. Buggy because the 8021q driver is called despite these VLANs being marked with 'BR_VLFLAG_ADDED_BY_SWITCHDEV'. This leads to memory leaks [1] when the VLANs are deleted. Fix by not calling the 8021q driver for VLANs that were already programmed via switchdev. [1] unreferenced object 0xffff8881f6771200 (size 256): comm "ip", pid 446855, jiffies 4298238841 (age 55.240s) hex dump (first 32 bytes): 00 00 7f 0e 83 88 ff ff 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000012819ac>] vlan_vid_add+0x437/0x750 [<00000000f2281fad>] __br_vlan_set_proto+0x289/0x920 [<000000000632b56f>] br_changelink+0x3d6/0x13f0 [<0000000089d25f04>] __rtnl_newlink+0x8ae/0x14c0 [<00000000f6276baf>] rtnl_newlink+0x5f/0x90 [<00000000746dc902>] rtnetlink_rcv_msg+0x336/0xa00 [<000000001c2241c0>] netlink_rcv_skb+0x11d/0x340 [<0000000010588814>] netlink_unicast+0x438/0x710 [<00000000e1a4cd5c>] netlink_sendmsg+0x788/0xc40 [<00000000e8992d4e>] sock_sendmsg+0xb0/0xe0 [<00000000621b8f91>] ____sys_sendmsg+0x4ff/0x6d0 [<000000000ea26996>] ___sys_sendmsg+0x12e/0x1b0 [<00000000684f7e25>] __sys_sendmsg+0xab/0x130 [<000000004538b104>] do_syscall_64+0x3d/0x90 [<0000000091ed9678>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 279737939a81 ("net: bridge: Fix VLANs memory leak") Reported-by: Vlad Buslov <vladbu@nvidia.com> Tested-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20221114084509.860831-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02bridge: switchdev: Notify about VLAN protocol changesDanielle Ratson1-2/+14
[ Upstream commit 22ec19f3aee327806c37c9fa1188741574bc6445 ] Drivers that support bridge offload need to be notified about changes to the bridge's VLAN protocol so that they could react accordingly and potentially veto the change. Add a new switchdev attribute to communicate the change to drivers. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 9d45921ee4cb ("bridge: switchdev: Fix memory leaks when changing VLAN protocol") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-28netfilter: ebtables: fix memory leak when blob is malformedFlorian Westphal1-1/+3
[ Upstream commit 62ce44c4fff947eebdf10bb582267e686e6835c9 ] The bug fix was incomplete, it "replaced" crash with a memory leak. The old code had an assignment to "ret" embedded into the conditional, restore this. Fixes: 7997eff82828 ("netfilter: ebtables: reject blobs that don't provide all entry points") Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15netfilter: br_netfilter: Drop dst references before setting.Harsh Modi2-0/+3
[ Upstream commit d047283a7034140ea5da759a494fd2274affdd46 ] The IPv6 path already drops dst in the daddr changed case, but the IPv4 path does not. This change makes the two code paths consistent. Further, it is possible that there is already a metadata_dst allocated from ingress that might already be attached to skbuff->dst while following the bridge path. If it is not released before setting a new metadata_dst, it will be leaked. This is similar to what is done in bpf_set_tunnel_key() or ip6_route_input(). It is important to note that the memory being leaked is not the dst being set in the bridge code, but rather memory allocated from some other code path that is not being freed correctly before the skb dst is overwritten. An example of the leakage fixed by this commit found using kmemleak: unreferenced object 0xffff888010112b00 (size 256): comm "softirq", pid 0, jiffies 4294762496 (age 32.012s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 80 16 f1 83 ff ff ff ff ................ e1 4e f6 82 ff ff ff ff 00 00 00 00 00 00 00 00 .N.............. backtrace: [<00000000d79567ea>] metadata_dst_alloc+0x1b/0xe0 [<00000000be113e13>] udp_tun_rx_dst+0x174/0x1f0 [<00000000a36848f4>] geneve_udp_encap_recv+0x350/0x7b0 [<00000000d4afb476>] udp_queue_rcv_one_skb+0x380/0x560 [<00000000ac064aea>] udp_unicast_rcv_skb+0x75/0x90 [<000000009a8ee8c5>] ip_protocol_deliver_rcu+0xd8/0x230 [<00000000ef4980bb>] ip_local_deliver_finish+0x7a/0xa0 [<00000000d7533c8c>] __netif_receive_skb_one_core+0x89/0xa0 [<00000000a879497d>] process_backlog+0x93/0x190 [<00000000e41ade9f>] __napi_poll+0x28/0x170 [<00000000b4c0906b>] net_rx_action+0x14f/0x2a0 [<00000000b20dd5d4>] __do_softirq+0xf4/0x305 [<000000003a7d7e15>] __irq_exit_rcu+0xc3/0x140 [<00000000968d39a2>] sysvec_apic_timer_interrupt+0x9e/0xc0 [<000000009e920794>] asm_sysvec_apic_timer_interrupt+0x16/0x20 [<000000008942add0>] native_safe_halt+0x13/0x20 Florian Westphal says: "Original code was likely fine because nothing ever did set a skb->dst entry earlier than bridge in those days." Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Harsh Modi <harshmodi@google.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31netfilter: ebtables: reject blobs that don't provide all entry pointsFlorian Westphal4-31/+1
[ Upstream commit 7997eff82828304b780dc0a39707e1946d6f1ebf ] Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [..] RIP: 0010:ebt_do_table+0x1dc/0x1ce0 Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88 [..] Call Trace: nf_hook_slow+0xb1/0x170 __br_forward+0x289/0x730 maybe_deliver+0x24b/0x380 br_flood+0xc6/0x390 br_dev_xmit+0xa2e/0x12c0 For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table. t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not support is harmless (never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location its receiving packets for. Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21netfilter: br_netfilter: do not skip all hooks with 0 priorityFlorian Westphal1-3/+18
[ Upstream commit c2577862eeb0be94f151f2f1fff662b028061b00 ] When br_netfilter module is loaded, skbs may be diverted to the ipv4/ipv6 hooks, just like as if we were routing. Unfortunately, bridge filter hooks with priority 0 may be skipped in this case. Example: 1. an nftables bridge ruleset is loaded, with a prerouting hook that has priority 0. 2. interface is added to the bridge. 3. no tcp packet is ever seen by the bridge prerouting hook. 4. flush the ruleset 5. load the bridge ruleset again. 6. tcp packets are processed as expected. After 1) the only registered hook is the bridge prerouting hook, but its not called yet because the bridge hasn't been brought up yet. After 2), hook order is: 0 br_nf_pre_routing // br_netfilter internal hook 0 chain bridge f prerouting // nftables bridge ruleset The packet is diverted to br_nf_pre_routing. If call-iptables is off, the nftables bridge ruleset is called as expected. But if its enabled, br_nf_hook_thresh() will skip it because it assumes that all 0-priority hooks had been called previously in bridge context. To avoid this, check for the br_nf_pre_routing hook itself, we need to resume directly after it, even if this hook has a priority of 0. Unfortunately, this still results in different packet flow. With this fix, the eval order after in 3) is: 1. br_nf_pre_routing 2. ip(6)tables (if enabled) 3. nftables bridge but after 5 its the much saner: 1. nftables bridge 2. br_nf_pre_routing 3. ip(6)tables (if enabled) Unfortunately I don't see a solution here: It would be possible to move br_nf_pre_routing to a higher priority so that it will be called later in the pipeline, but this also impacts ebtables evaluation order, and would still result in this very ordering problem for all nftables-bridge hooks with the same priority as the br_nf_pre_routing one. Searching back through the git history I don't think this has ever behaved in any other way, hence, no fixes-tag. Reported-by: Radim Hrazdil <rhrazdil@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-29netfilter: nftables: add nft_parse_register_store() and use itPablo Neira Ayuso1-3/+2
[ Upstream commit 345023b0db315648ccc3c1a36aee88304a8b4d91 ] This new function combines the netlink register attribute parser and the store validation function. This update requires to replace: enum nft_registers dreg:8; in many of the expression private areas otherwise compiler complains with: error: cannot take address of bit-field ‘dreg’ when passing the register field as reference. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-25net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.Andrew Lunn1-0/+7
[ Upstream commit fbb3abdf2223cd0dfc07de85fe5a43ba7f435bdf ] It is possible to stack bridges on top of each other. Consider the following which makes use of an Ethernet switch: br1 / \ / \ / \ br0.11 wlan0 | br0 / | \ p1 p2 p3 br0 is offloaded to the switch. Above br0 is a vlan interface, for vlan 11. This vlan interface is then a slave of br1. br1 also has a wireless interface as a slave. This setup trunks wireless lan traffic over the copper network inside a VLAN. A frame received on p1 which is passed up to the bridge has the skb->offload_fwd_mark flag set to true, indicating that the switch has dealt with forwarding the frame out ports p2 and p3 as needed. This flag instructs the software bridge it does not need to pass the frame back down again. However, the flag is not getting reset when the frame is passed upwards. As a result br1 sees the flag, wrongly interprets it, and fails to forward the frame to wlan0. When passing a frame upwards, clear the flag. This is the Rx equivalent of br_switchdev_frame_unmark() in br_dev_xmit(). Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-01net: bridge: vlan: fix memory leak in __allowed_ingressTim Yi1-3/+3
[ Upstream commit fd20d9738395cf8e27d0a17eba34169699fccdff ] When using per-vlan state, if vlan snooping and stats are disabled, untagged or priority-tagged ingress frame will go to check pvid state. If the port state is forwarding and the pvid state is not learning/forwarding, untagged or priority-tagged frame will be dropped but skb memory is not freed. Should free skb when __allowed_ingress returns false. Fixes: a580c76d534c ("net: bridge: vlan: add per-vlan state") Signed-off-by: Tim Yi <tim.yi@pica8.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20220127074953.12632-1-tim.yi@pica8.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-01net: bridge: vlan: fix single net device option dumpingNikolay Aleksandrov1-1/+2
[ Upstream commit dcb2c5c6ca9b9177f04abaf76e5a983d177c9414 ] When dumping vlan options for a single net device we send the same entries infinitely because user-space expects a 0 return at the end but we keep returning skb->len and restarting the dump on retry. Fix it by returning the value from br_vlan_dump_dev() if it completed or there was an error. The only case that must return skb->len is when the dump was incomplete and needs to continue (-EMSGSIZE). Reported-by: Benjamin Poirier <bpoirier@nvidia.com> Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27netfilter: bridge: add support for pppoe filteringFlorian Westphal1-4/+3
[ Upstream commit 28b78ecffea8078d81466b2e01bb5a154509f1ba ] This makes 'bridge-nf-filter-pppoe-tagged' sysctl work for bridged traffic. Looking at the original commit it doesn't appear this ever worked: static unsigned int br_nf_post_routing(unsigned int hook, struct sk_buff **pskb, [..] if (skb->protocol == htons(ETH_P_8021Q)) { skb_pull(skb, VLAN_HLEN); skb->network_header += VLAN_HLEN; + } else if (skb->protocol == htons(ETH_P_PPP_SES)) { + skb_pull(skb, PPPOE_SES_HLEN); + skb->network_header += PPPOE_SES_HLEN; } [..] NF_HOOK(... POST_ROUTING, ...) ... but the adjusted offsets are never restored. The alternative would be to rip this code out for good, but otoh we'd have to keep this anyway for the vlan handling (which works because vlan tag info is in the skb, not the packet payload). Reported-and-tested-by: Amish Chana <amish@3g.co.za> Fixes: 516299d2f5b6f97 ("[NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-27net: bridge: mcast: use multicast_membership_interval for IGMPv3Nikolay Aleksandrov1-3/+1
commit fac3cb82a54a4b7c49c932f96ef196cf5774344c upstream. When I added IGMPv3 support I decided to follow the RFC for computing the GMI dynamically: " 8.4. Group Membership Interval The Group Membership Interval is the amount of time that must pass before a multicast router decides there are no more members of a group or a particular source on a network. This value MUST be ((the Robustness Variable) times (the Query Interval)) plus (one Query Response Interval)." But that actually is inconsistent with how the bridge used to compute it for IGMPv2, where it was user-configurable that has a correct default value but it is up to user-space to maintain it. This would make it consistent with the other timer values which are also maintained correct by the user instead of being dynamically computed. It also changes back to the previous user-expected GMI behaviour for IGMPv3 queries which were supported before IGMPv3 was added. Note that to properly compute it dynamically we would need to add support for "Robustness Variable" which is currently missing. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: 0436862e417e ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-13net: bridge: fix under estimation in br_get_linkxstats_size()Eric Dumazet1-0/+1
[ Upstream commit 0854a0513321cf70bea5fa483ebcaa983cc7c62e ] Commit de1799667b00 ("net: bridge: add STP xstats") added an additional nla_reserve_64bit() in br_fill_linkxstats(), but forgot to update br_get_linkxstats_size() accordingly. This can trigger the following in rtnl_stats_get() WARN_ON(err == -EMSGSIZE); Fixes: de1799667b00 ("net: bridge: add STP xstats") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-10-13net: bridge: use nla_total_size_64bit() in br_get_linkxstats_size()Eric Dumazet1-1/+1
[ Upstream commit dbe0b88064494b7bb6a9b2aa7e085b14a3112d44 ] bridge_fill_linkxstats() is using nla_reserve_64bit(). We must use nla_total_size_64bit() instead of nla_total_size() for corresponding data structure. Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Vivien Didelot <vivien.didelot@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18net: bridge: fix memleak in br_add_if()Yang Yingliang1-0/+2
[ Upstream commit 519133debcc19f5c834e7e28480b60bdc234fe02 ] I got a memleak report: BUG: memory leak unreferenced object 0x607ee521a658 (size 240): comm "syz-executor.0", pid 955, jiffies 4294780569 (age 16.449s) hex dump (first 32 bytes, cpu 1): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000d830ea5a>] br_multicast_add_port+0x1c2/0x300 net/bridge/br_multicast.c:1693 [<00000000274d9a71>] new_nbp net/bridge/br_if.c:435 [inline] [<00000000274d9a71>] br_add_if+0x670/0x1740 net/bridge/br_if.c:611 [<0000000012ce888e>] do_set_master net/core/rtnetlink.c:2513 [inline] [<0000000012ce888e>] do_set_master+0x1aa/0x210 net/core/rtnetlink.c:2487 [<0000000099d1cafc>] __rtnl_newlink+0x1095/0x13e0 net/core/rtnetlink.c:3457 [<00000000a01facc0>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488 [<00000000acc9186c>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5550 [<00000000d4aabb9c>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504 [<00000000bc2e12a3>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<00000000bc2e12a3>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340 [<00000000e4dc2d0e>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929 [<000000000d22c8b3>] sock_sendmsg_nosec net/socket.c:654 [inline] [<000000000d22c8b3>] sock_sendmsg+0x139/0x170 net/socket.c:674 [<00000000e281417a>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350 [<00000000237aa2ab>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404 [<000000004f2dc381>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433 [<0000000005feca6c>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47 [<000000007304477d>] entry_SYSCALL_64_after_hwframe+0x44/0xae On error path of br_add_if(), p->mcast_stats allocated in new_nbp() need be freed, or it will be leaked. Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210809132023.978546-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18net: bridge: fix flags interpretation for extern learn fdb entriesNikolay Aleksandrov3-10/+6
[ Upstream commit 45a687879b31caae4032abd1c2402e289d2b8083 ] Ignore fdb flags when adding port extern learn entries and always set BR_FDB_LOCAL flag when adding bridge extern learn entries. This is closest to the behaviour we had before and avoids breaking any use cases which were allowed. This patch fixes iproute2 calls which assume NUD_PERMANENT and were allowed before, example: $ bridge fdb add 00:11:22:33:44:55 dev swp1 extern_learn Extern learn entries are allowed to roam, but do not expire, so static or dynamic flags make no sense for them. Also add a comment for future reference. Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space") Fixes: 0541a6293298 ("net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210810110010.43859-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB ↵Vladimir Oltean3-8/+27
entry [ Upstream commit 0541a6293298fb52789de389dfb27ef54df81f73 ] Currently it is possible to add broken extern_learn FDB entries to the bridge in two ways: 1. Entries pointing towards the bridge device that are not local/permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static 2. Entries pointing towards the bridge device or towards a port that are marked as local/permanent, however the bridge does not process the 'permanent' bit in any way, therefore they are recorded as though they aren't permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent Since commit 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge"), these incorrect FDB entries can even trigger NULL pointer dereferences inside the kernel. This is because that commit made the assumption that all FDB entries that are not local/permanent have a valid destination port. For context, local / permanent FDB entries either have fdb->dst == NULL, and these point towards the bridge device and are therefore local and not to be used for forwarding, or have fdb->dst == a net_bridge_port structure (but are to be treated in the same way, i.e. not for forwarding). That assumption _is_ correct as long as things are working correctly in the bridge driver, i.e. we cannot logically have fdb->dst == NULL under any circumstance for FDB entries that are not local. However, the extern_learn code path where FDB entries are managed by a user space controller show that it is possible for the bridge kernel driver to misinterpret the NUD flags of an entry transmitted by user space, and end up having fdb->dst == NULL while not being a local entry. This is invalid and should be rejected. Before, the two commands listed above both crashed the kernel in this check from br_switchdev_fdb_notify: struct net_device *dev = info.is_local ? br->dev : dst->dev; info.is_local == false, dst == NULL. After this patch, the invalid entry added by the first command is rejected: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0 Error: bridge: FDB entry towards bridge must be permanent. and the valid entry added by the second command is properly treated as a local address and does not crash br_switchdev_fdb_notify anymore: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0 Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space") Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18netfilter: nf_conntrack_bridge: Fix memory leak when errorYajun Deng1-0/+6
[ Upstream commit 38ea9def5b62f9193f6bad96c5d108e2830ecbde ] It should be added kfree_skb_list() when err is not equal to zero in nf_br_ip_fragment(). v2: keep this aligned with IPv6. v3: modify iter.frag_list to iter.frag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-25net: bridge: sync fdb to new unicast-filtering portsWolfgang Bumiller1-1/+16
commit a019abd8022061b917da767cd1a66ed823724eab upstream. Since commit 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") bridges with `vlan_filtering 1` and only 1 auto-port don't set IFF_PROMISC for unicast-filtering-capable ports. Normally on port changes `br_manage_promisc` is called to update the promisc flags and unicast filters if necessary, but it cannot distinguish between *new* ports and ones losing their promisc flag, and new ports end up not receiving the MAC address list. Fix this by calling `br_fdb_sync_static` in `br_add_if` after the port promisc flags are updated and the unicast filter was supposed to have been filled. Fixes: 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20net: bridge: multicast: fix MRD advertisement router port marking raceNikolay Aleksandrov1-0/+4
commit 000b7287b67555fee39d39fff75229dedde0dcbf upstream. When an MRD advertisement is received on a bridge port with multicast snooping enabled, we mark it as a router port automatically, that includes adding that port to the router port list. The multicast lock protects that list, but it is not acquired in the MRD advertisement case leading to a race condition, we need to take it to fix the race. Cc: stable@vger.kernel.org Cc: linus.luessing@c0d3.blue Fixes: 4b3087c7e37f ("bridge: Snoop Multicast Router Advertisements") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20net: bridge: multicast: fix PIM hello router port marking raceNikolay Aleksandrov1-0/+2
commit 04bef83a3358946bfc98a5ecebd1b0003d83d882 upstream. When a PIM hello packet is received on a bridge port with multicast snooping enabled, we mark it as a router port automatically, that includes adding that port the router port list. The multicast lock protects that list, but it is not acquired in the PIM message case leading to a race condition, we need to take it to fix the race. Cc: stable@vger.kernel.org Fixes: 91b02d3d133b ("bridge: mcast: add router port on PIM hello message") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19net: bridge: mrp: Update ring transitions.Horatiu Vultur1-4/+2
[ Upstream commit fcb34635854a5a5814227628867ea914a9805384 ] According to the standard IEC 62439-2, the number of transitions needs to be counted for each transition 'between' ring state open and ring state closed and not from open state to closed state. Therefore fix this for both ring and interconnect ring. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-23net: bridge: fix vlan tunnel dst refcnt when egressingNikolay Aleksandrov1-2/+2
commit cfc579f9d89af4ada58c69b03bcaa4887840f3b3 upstream. The egress tunnel code uses dst_clone() and directly sets the result which is wrong because the entry might have 0 refcnt or be already deleted, causing number of problems. It also triggers the WARN_ON() in dst_hold()[1] when a refcnt couldn't be taken. Fix it by using dst_hold_safe() and checking if a reference was actually taken before setting the dst. [1] dmesg WARN_ON log and following refcnt errors WARNING: CPU: 5 PID: 38 at include/net/dst.h:230 br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Modules linked in: 8021q garp mrp bridge stp llc bonding ipv6 virtio_net CPU: 5 PID: 38 Comm: ksoftirqd/5 Kdump: loaded Tainted: G W 5.13.0-rc3+ #360 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 RIP: 0010:br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Code: e8 85 bc 01 e1 45 84 f6 74 90 45 31 f6 85 db 48 c7 c7 a0 02 19 a0 41 0f 94 c6 31 c9 31 d2 44 89 f6 e8 64 bc 01 e1 85 db 75 02 <0f> 0b 31 c9 31 d2 44 89 f6 48 c7 c7 70 02 19 a0 e8 4b bc 01 e1 49 RSP: 0018:ffff8881003d39e8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa01902a0 RBP: ffff8881040c6700 R08: 0000000000000000 R09: 0000000000000001 R10: 2ce93d0054fe0d00 R11: 54fe0d00000e0000 R12: ffff888109515000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000401 FS: 0000000000000000(0000) GS:ffff88822bf40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f42ba70f030 CR3: 0000000109926000 CR4: 00000000000006e0 Call Trace: br_handle_vlan+0xbc/0xca [bridge] __br_forward+0x23/0x164 [bridge] deliver_clone+0x41/0x48 [bridge] br_handle_frame_finish+0x36f/0x3aa [bridge] ? skb_dst+0x2e/0x38 [bridge] ? br_handle_ingress_vlan_tunnel+0x3e/0x1c8 [bridge] ? br_handle_frame_finish+0x3aa/0x3aa [bridge] br_handle_frame+0x2c3/0x377 [bridge] ? __skb_pull+0x33/0x51 ? vlan_do_receive+0x4f/0x36a ? br_handle_frame_finish+0x3aa/0x3aa [bridge] __netif_receive_skb_core+0x539/0x7c6 ? __list_del_entry_valid+0x16e/0x1c2 __netif_receive_skb_list_core+0x6d/0xd6 netif_receive_skb_list_internal+0x1d9/0x1fa gro_normal_list+0x22/0x3e dev_gro_receive+0x55b/0x600 ? detach_buf_split+0x58/0x140 napi_gro_receive+0x94/0x12e virtnet_poll+0x15d/0x315 [virtio_net] __napi_poll+0x2c/0x1c9 net_rx_action+0xe6/0x1fb __do_softirq+0x115/0x2d8 run_ksoftirqd+0x18/0x20 smpboot_thread_fn+0x183/0x19c ? smpboot_unregister_percpu_thread+0x66/0x66 kthread+0x10a/0x10f ? kthread_mod_delayed_work+0xb6/0xb6 ret_from_fork+0x22/0x30 ---[ end trace 49f61b07f775fd2b ]--- dst_release: dst:00000000c02d677a refcnt:-1 dst_release underflow Cc: stable@vger.kernel.org Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23net: bridge: fix vlan tunnel dst null pointer dereferenceNikolay Aleksandrov2-16/+26
commit 58e2071742e38f29f051b709a5cca014ba51166f upstream. This patch fixes a tunnel_dst null pointer dereference due to lockless access in the tunnel egress path. When deleting a vlan tunnel the tunnel_dst pointer is set to NULL without waiting a grace period (i.e. while it's still usable) and packets egressing are dereferencing it without checking. Use READ/WRITE_ONCE to annotate the lockless use of tunnel_id, use RCU for accessing tunnel_dst and make sure it is read only once and checked in the egress path. The dst is already properly RCU protected so we don't need to do anything fancy than to make sure tunnel_id and tunnel_dst are read only once and checked in the egress path. Cc: stable@vger.kernel.org Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22bridge: Fix possible races between assigning rx_handler_data and setting ↵Zhang Zhengming1-2/+3
IFF_BRIDGE_PORT bit [ Upstream commit 59259ff7a81b9eb6213891c6451221e567f8f22f ] There is a crash in the function br_get_link_af_size_filtered, as the port_exists(dev) is true and the rx_handler_data of dev is NULL. But the rx_handler_data of dev is correct saved in vmcore. The oops looks something like: ... pc : br_get_link_af_size_filtered+0x28/0x1c8 [bridge] ... Call trace: br_get_link_af_size_filtered+0x28/0x1c8 [bridge] if_nlmsg_size+0x180/0x1b0 rtnl_calcit.isra.12+0xf8/0x148 rtnetlink_rcv_msg+0x334/0x370 netlink_rcv_skb+0x64/0x130 rtnetlink_rcv+0x28/0x38 netlink_unicast+0x1f0/0x250 netlink_sendmsg+0x310/0x378 sock_sendmsg+0x4c/0x70 __sys_sendto+0x120/0x150 __arm64_sys_sendto+0x30/0x40 el0_svc_common+0x78/0x130 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc In br_add_if(), we found there is no guarantee that assigning rx_handler_data to dev->rx_handler_data will before setting the IFF_BRIDGE_PORT bit of priv_flags. So there is a possible data competition: CPU 0: CPU 1: (RCU read lock) (RTNL lock) rtnl_calcit() br_add_slave() if_nlmsg_size() br_add_if() br_get_link_af_size_filtered() -> netdev_rx_handler_register ... // The order is not guaranteed ... -> dev->priv_flags |= IFF_BRIDGE_PORT; // The IFF_BRIDGE_PORT bit of priv_flags has been set -> if (br_port_exists(dev)) { // The dev->rx_handler_data has NOT been assigned -> p = br_port_get_rcu(dev); .... -> rcu_assign_pointer(dev->rx_handler_data, rx_handler_data); ... Fix it in br_get_link_af_size_filtered, using br_port_get_check_rcu() and checking the return value. Signed-off-by: Zhang Zhengming <zhangzhengming@huawei.com> Reviewed-by: Zhao Lei <zhaolei69@huawei.com> Reviewed-by: Wang Xiaogang <wangxiaogang3@huawei.com> Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-19net: bridge: when suppression is enabled exclude RARP packetsNikolay Aleksandrov1-1/+3
[ Upstream commit 0353b4a96b7a9f60fe20d1b3ebd4931a4085f91c ] Recently we had an interop issue where RARP packets got suppressed with bridge neigh suppression enabled, but the check in the code was meant to suppress GARP. Exclude RARP packets from it which would allow some VMWare setups to work, to quote the report: "Those RARP packets usually get generated by vMware to notify physical switches when vMotion occurs. vMware may use random sip/tip or just use sip=tip=0. So the RARP packet sometimes get properly flooded by the vtep and other times get dropped by the logic" Reported-by: Amer Abdalamer <amer@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-14net: bridge: mcast: fix broken length + header check for MRDv6 Adv.Linus Lüssing1-25/+8
[ Upstream commit 99014088156cd78867d19514a0bc771c4b86b93b ] The IPv6 Multicast Router Advertisements parsing has the following two issues: For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes, assuming MLDv2 Reports with at least one multicast address entry). When ipv6_mc_check_mld_msg() tries to parse an Multicast Router Advertisement its MLD length check will fail - and it will wrongly return -EINVAL, even if we have a valid MRD Advertisement. With the returned -EINVAL the bridge code will assume a broken packet and will wrongly discard it, potentially leading to multicast packet loss towards multicast routers. The second issue is the MRD header parsing in br_ip6_multicast_mrd_rcv(): It wrongly checks for an ICMPv6 header immediately after the IPv6 header (IPv6 next header type). However according to RFC4286, section 2 all MRD messages contain a Router Alert option (just like MLD). So instead there is an IPv6 Hop-by-Hop option for the Router Alert between the IPv6 and ICMPv6 header, again leading to the bridge wrongly discarding Multicast Router Advertisements. To fix these two issues, introduce a new return value -ENODATA to ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop option which is not an MLD but potentially an MRD packet. This also simplifies further parsing in the bridge code, as ipv6_mc_check_mld() already fully checks the ICMPv6 header and hop-by-hop option. These issues were found and fixed with the help of the mrdisc tool (https://github.com/troglobit/mrdisc). Fixes: 4b3087c7e37f ("bridge: Snoop Multicast Router Advertisements") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-21netfilter: bridge: add pre_exit hooks for ebtable unregistrationFlorian Westphal4-6/+48
commit 7ee3c61dcd28bf6e290e06ad382f13511dc790e9 upstream. Just like ip/ip6/arptables, the hooks have to be removed, then synchronize_rcu() has to be called to make sure no more packets are being processed before the ruleset data is released. Place the hook unregistration in the pre_exit hook, then call the new ebtables pre_exit function from there. Years ago, when first netns support got added for netfilter+ebtables, this used an older (now removed) netfilter hook unregister API, that did a unconditional synchronize_rcu(). Now that all is done with call_rcu, ebtable_{filter,nat,broute} pernet exit handlers may free the ebtable ruleset while packets are still in flight. This can only happens on module removal, not during netns exit. The new function expects the table name, not the table struct. This is because upcoming patch set (targeting -next) will remove all net->xt.{nat,filter,broute}_table instances, this makes it necessary to avoid external references to those member variables. The existing APIs will be converted, so follow the upcoming scheme of passing name + hook type instead. Fixes: aee12a0a3727e ("ebtables: remove nf_hook_register usage") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>