summaryrefslogtreecommitdiff
path: root/net/openvswitch
AgeCommit message (Collapse)AuthorFilesLines
4 daysnet: openvswitch: Avoid needlessly taking the RTNL on vport destroyToke Høiland-Jørgensen1-4/+13
[ Upstream commit 5498227676303e3ffa9a3a46214af96bc3e81314 ] The openvswitch teardown code will immediately call ovs_netdev_detach_dev() in response to a NETDEV_UNREGISTER notification. It will then start the dp_notify_work workqueue, which will later end up calling the vport destroy() callback. This callback takes the RTNL to do another ovs_netdev_detach_port(), which in this case is unnecessary. This causes extra pressure on the RTNL, in some cases leading to "unregister_netdevice: waiting for XX to become free" warnings on teardown. We can straight-forwardly avoid the extra RTNL lock acquisition by checking the device flags before taking the lock, and skip the locking altogether if the IFF_OVS_DATAPATH flag has already been unset. Fixes: b07c26511e94 ("openvswitch: fix vport-netdev unregister") Tested-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20251211115006.228876-1-toke@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 daysnet: openvswitch: fix middle attribute validation in push_nsh() actionIlya Maximets1-3/+10
[ Upstream commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e ] The push_nsh() action structure looks like this: OVS_ACTION_ATTR_PUSH_NSH(OVS_KEY_ATTR_NSH(OVS_NSH_KEY_ATTR_BASE,...)) The outermost OVS_ACTION_ATTR_PUSH_NSH attribute is OK'ed by the nla_for_each_nested() inside __ovs_nla_copy_actions(). The innermost OVS_NSH_KEY_ATTR_BASE/MD1/MD2 are OK'ed by the nla_for_each_nested() inside nsh_key_put_from_nlattr(). But nothing checks if the attribute in the middle is OK. We don't even check that this attribute is the OVS_KEY_ATTR_NSH. We just do a double unwrap with a pair of nla_data() calls - first time directly while calling validate_push_nsh() and the second time as part of the nla_for_each_nested() macro, which isn't safe, potentially causing invalid memory access if the size of this attribute is incorrect. The failure may not be noticed during validation due to larger netlink buffer, but cause trouble later during action execution where the buffer is allocated exactly to the size: BUG: KASAN: slab-out-of-bounds in nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch] Read of size 184 at addr ffff88816459a634 by task a.out/22624 CPU: 8 UID: 0 PID: 22624 6.18.0-rc7+ #115 PREEMPT(voluntary) Call Trace: <TASK> dump_stack_lvl+0x51/0x70 print_address_description.constprop.0+0x2c/0x390 kasan_report+0xdd/0x110 kasan_check_range+0x35/0x1b0 __asan_memcpy+0x20/0x60 nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch] push_nsh+0x82/0x120 [openvswitch] do_execute_actions+0x1405/0x2840 [openvswitch] ovs_execute_actions+0xd5/0x3b0 [openvswitch] ovs_packet_cmd_execute+0x949/0xdb0 [openvswitch] genl_family_rcv_msg_doit+0x1d6/0x2b0 genl_family_rcv_msg+0x336/0x580 genl_rcv_msg+0x9f/0x130 netlink_rcv_skb+0x11f/0x370 genl_rcv+0x24/0x40 netlink_unicast+0x73e/0xaa0 netlink_sendmsg+0x744/0xbf0 __sys_sendto+0x3d6/0x450 do_syscall_64+0x79/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> Let's add some checks that the attribute is properly sized and it's the only one attribute inside the action. Technically, there is no real reason for OVS_KEY_ATTR_NSH to be there, as we know that we're pushing an NSH header already, it just creates extra nesting, but that's how uAPI works today. So, keeping as it is. Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support") Reported-by: Junvy Yang <zhuque@tencent.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron echaudro@redhat.com Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20251204105334.900379-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 daysnetfilter: nf_conncount: rework API to use sk_buff directlyFernando Fernandez Mancera1-8/+8
[ Upstream commit be102eb6a0e7c03db00e50540622f4e43b2d2844 ] When using nf_conncount infrastructure for non-confirmed connections a duplicated track is possible due to an optimization introduced since commit d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC"). In order to fix this introduce a new conncount API that receives directly an sk_buff struct. It fetches the tuple and zone and the corresponding ct from it. It comes with both existing conncount variants nf_conncount_count_skb() and nf_conncount_add_skb(). In addition remove the old API and adjust all the users to use the new one. This way, for each sk_buff struct it is possible to check if there is a ct present and already confirmed. If so, skip the add operation. Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-07net: openvswitch: remove never-working support for setting nsh fieldsIlya Maximets3-125/+9
[ Upstream commit dfe28c4167a9259fc0c372d9f9473e1ac95cff67 ] The validation of the set(nsh(...)) action is completely wrong. It runs through the nsh_key_put_from_nlattr() function that is the same function that validates NSH keys for the flow match and the push_nsh() action. However, the set(nsh(...)) has a very different memory layout. Nested attributes in there are doubled in size in case of the masked set(). That makes proper validation impossible. There is also confusion in the code between the 'masked' flag, that says that the nested attributes are doubled in size containing both the value and the mask, and the 'is_mask' that says that the value we're parsing is the mask. This is causing kernel crash on trying to write into mask part of the match with SW_FLOW_KEY_PUT() during validation, while validate_nsh() doesn't allocate any memory for it: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1c2383067 P4D 1c2383067 PUD 20b703067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 8 UID: 0 Kdump: loaded Not tainted 6.17.0-rc4+ #107 PREEMPT(voluntary) RIP: 0010:nsh_key_put_from_nlattr+0x19d/0x610 [openvswitch] Call Trace: <TASK> validate_nsh+0x60/0x90 [openvswitch] validate_set.constprop.0+0x270/0x3c0 [openvswitch] __ovs_nla_copy_actions+0x477/0x860 [openvswitch] ovs_nla_copy_actions+0x8d/0x100 [openvswitch] ovs_packet_cmd_execute+0x1cc/0x310 [openvswitch] genl_family_rcv_msg_doit+0xdb/0x130 genl_family_rcv_msg+0x14b/0x220 genl_rcv_msg+0x47/0xa0 netlink_rcv_skb+0x53/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x280/0x3b0 netlink_sendmsg+0x1f7/0x430 ____sys_sendmsg+0x36b/0x3a0 ___sys_sendmsg+0x87/0xd0 __sys_sendmsg+0x6d/0xd0 do_syscall_64+0x7b/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The third issue with this process is that while trying to convert the non-masked set into masked one, validate_set() copies and doubles the size of the OVS_KEY_ATTR_NSH as if it didn't have any nested attributes. It should be copying each nested attribute and doubling them in size independently. And the process must be properly reversed during the conversion back from masked to a non-masked variant during the flow dump. In the end, the only two outcomes of trying to use this action are either validation failure or a kernel crash. And if somehow someone manages to install a flow with such an action, it will most definitely not do what it is supposed to, since all the keys and the masks are mixed up. Fixing all the issues is a complex task as it requires re-writing most of the validation code. Given that and the fact that this functionality never worked since introduction, let's just remove it altogether. It's better to re-introduce it later with a proper implementation instead of trying to fix it in stable releases. Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support") Reported-by: Junvy Yang <zhuque@tencent.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20251112112246.95064-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27net: openvswitch: Fix the dead loop of MPLS parseFaicker Mo1-1/+1
[ Upstream commit 0bdc924bfb319fb10d1113cbf091fc26fb7b1f99 ] The unexpected MPLS packet may not end with the bottom label stack. When there are many stacks, The label count value has wrapped around. A dead loop occurs, soft lockup/CPU stuck finally. stack backtrace: UBSAN: array-index-out-of-bounds in /build/linux-0Pa0xK/linux-5.15.0/net/openvswitch/flow.c:662:26 index -1 is out of range for type '__be32 [3]' CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G OE 5.15.0-121-generic #131-Ubuntu Hardware name: Dell Inc. PowerEdge C6420/0JP9TF, BIOS 2.12.2 07/14/2021 Call Trace: <IRQ> show_stack+0x52/0x5c dump_stack_lvl+0x4a/0x63 dump_stack+0x10/0x16 ubsan_epilogue+0x9/0x36 __ubsan_handle_out_of_bounds.cold+0x44/0x49 key_extract_l3l4+0x82a/0x840 [openvswitch] ? kfree_skbmem+0x52/0xa0 key_extract+0x9c/0x2b0 [openvswitch] ovs_flow_key_extract+0x124/0x350 [openvswitch] ovs_vport_receive+0x61/0xd0 [openvswitch] ? kernel_init_free_pages.part.0+0x4a/0x70 ? get_page_from_freelist+0x353/0x540 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23a/0xf00 __netif_receive_skb_list_core+0xfa/0x240 netif_receive_skb_list_internal+0x18e/0x2a0 napi_complete_done+0x7a/0x1c0 bnxt_poll+0x155/0x1c0 [bnxt_en] __napi_poll+0x30/0x180 net_rx_action+0x126/0x280 ? bnxt_msix+0x67/0x80 [bnxt_en] handle_softirqs+0xda/0x2d0 irq_exit_rcu+0x96/0xc0 common_interrupt+0x8e/0xa0 </IRQ> Fixes: fbdcdd78da7c ("Change in Openvswitch to support MPLS label depth of 3 in ingress direction") Signed-off-by: Faicker Mo <faicker.mo@zenlayer.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/259D3404-575D-4A6D-B263-1DF59A67CF89@zenlayer.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04openvswitch: Fix unsafe attribute parsing in output_userspace()Eelco Chaudron1-2/+1
commit 6beb6835c1fbb3f676aebb51a5fee6b77fed9308 upstream. This patch replaces the manual Netlink attribute iteration in output_userspace() with nla_for_each_nested(), which ensures that only well-formed attributes are processed. Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/0bd65949df61591d9171c0dc13e42cea8941da10.1746541734.git.echaudro@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02openvswitch: fix lockup on tx to unregistering netdev with carrierIlya Maximets1-1/+3
commit 47e55e4b410f7d552e43011baa5be1aab4093990 upstream. Commit in a fixes tag attempted to fix the issue in the following sequence of calls: do_output -> ovs_vport_send -> dev_queue_xmit -> __dev_queue_xmit -> netdev_core_pick_tx -> skb_tx_hash When device is unregistering, the 'dev->real_num_tx_queues' goes to zero and the 'while (unlikely(hash >= qcount))' loop inside the 'skb_tx_hash' becomes infinite, locking up the core forever. But unfortunately, checking just the carrier status is not enough to fix the issue, because some devices may still be in unregistering state while reporting carrier status OK. One example of such device is a net/dummy. It sets carrier ON on start, but it doesn't implement .ndo_stop to set the carrier off. And it makes sense, because dummy doesn't really have a carrier. Therefore, while this device is unregistering, it's still easy to hit the infinite loop in the skb_tx_hash() from the OVS datapath. There might be other drivers that do the same, but dummy by itself is important for the OVS ecosystem, because it is frequently used as a packet sink for tcpdump while debugging OVS deployments. And when the issue is hit, the only way to recover is to reboot. Fix that by also checking if the device is running. The running state is handled by the net core during unregistering, so it covers unregistering case better, and we don't really need to send packets to devices that are not running anyway. While only checking the running state might be enough, the carrier check is preserved. The running and the carrier states seem disjoined throughout the code and different drivers. And other core functions like __dev_direct_xmit() check both before attempting to transmit a packet. So, it seems safer to check both flags in OVS as well. Fixes: 066b86787fa3 ("net: openvswitch: fix race on port output") Reported-by: Friedrich Weber <f.weber@proxmox.com> Closes: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-January/053423.html Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Tested-by: Friedrich Weber <f.weber@proxmox.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20250109122225.4034688-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Carlos Soto <carlos.soto@broadcom.com> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02net: openvswitch: fix race on port outputFelix Huettner1-1/+1
commit 066b86787fa3d97b7aefb5ac0a99a22dad2d15f8 upstream. assume the following setup on a single machine: 1. An openvswitch instance with one bridge and default flows 2. two network namespaces "server" and "client" 3. two ovs interfaces "server" and "client" on the bridge 4. for each ovs interface a veth pair with a matching name and 32 rx and tx queues 5. move the ends of the veth pairs to the respective network namespaces 6. assign ip addresses to each of the veth ends in the namespaces (needs to be the same subnet) 7. start some http server on the server network namespace 8. test if a client in the client namespace can reach the http server when following the actions below the host has a chance of getting a cpu stuck in a infinite loop: 1. send a large amount of parallel requests to the http server (around 3000 curls should work) 2. in parallel delete the network namespace (do not delete interfaces or stop the server, just kill the namespace) there is a low chance that this will cause the below kernel cpu stuck message. If this does not happen just retry. Below there is also the output of bpftrace for the functions mentioned in the output. The series of events happening here is: 1. the network namespace is deleted calling `unregister_netdevice_many_notify` somewhere in the process 2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and then runs `synchronize_net` 3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER` 4. this is then handled by `dp_device_event` which calls `ovs_netdev_detach_dev` (if a vport is found, which is the case for the veth interface attached to ovs) 5. this removes the rx_handlers of the device but does not prevent packages to be sent to the device 6. `dp_device_event` then queues the vport deletion to work in background as a ovs_lock is needed that we do not hold in the unregistration path 7. `unregister_netdevice_many_notify` continues to call `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0 8. port deletion continues (but details are not relevant for this issue) 9. at some future point the background task deletes the vport If after 7. but before 9. a packet is send to the ovs vport (which is not deleted at this point in time) which forwards it to the `dev_queue_xmit` flow even though the device is unregistering. In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is a while loop (if the packet has a rx_queue recorded) that is infinite if `dev->real_num_tx_queues` is zero. To prevent this from happening we update `do_output` to handle devices without carrier the same as if the device is not found (which would be the code path after 9. is done). Additionally we now produce a warning in `skb_tx_hash` if we will hit the infinite loop. bpftrace (first word is function name): __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2 ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2 netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2 netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024 ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604 stuck message: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279] Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:netdev_pick_tx+0xf1/0x320 Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01 RSP: 0018:ffffb78b40298820 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000 RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000 R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000 FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0 Call Trace: <IRQ> netdev_core_pick_tx+0xa4/0xb0 __dev_queue_xmit+0xf8/0x510 ? __bpf_prog_exit+0x1e/0x30 dev_queue_xmit+0x10/0x20 ovs_vport_send+0xad/0x170 [openvswitch] do_output+0x59/0x180 [openvswitch] do_execute_actions+0xa80/0xaa0 [openvswitch] ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch] ovs_execute_actions+0x4c/0x120 [openvswitch] ovs_dp_process_packet+0xa1/0x200 [openvswitch] ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch] ? ovs_ct_fill_key+0x1d/0x30 [openvswitch] ? ovs_flow_key_extract+0x2db/0x350 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] ? __htab_map_lookup_elem+0x4e/0x60 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714 ? trace_call_bpf+0xc8/0x150 ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? kprobe_perf_func+0x4f/0x2b0 ? __mod_memcg_lruvec_state+0x63/0xe0 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23d/0xf00 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x15/0x60 process_backlog+0x9e/0x170 __napi_poll+0x33/0x180 net_rx_action+0x126/0x280 ? ttwu_do_activate+0x72/0xf0 __do_softirq+0xd9/0x2e7 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0 do_softirq+0x7d/0xb0 </IRQ> <TASK> __local_bh_enable_ip+0x54/0x60 ip_finish_output2+0x191/0x460 __ip_finish_output+0xb7/0x180 ip_finish_output+0x2e/0xc0 ip_output+0x78/0x100 ? __ip_finish_output+0x180/0x180 ip_local_out+0x5e/0x70 __ip_queue_xmit+0x184/0x440 ? tcp_syn_options+0x1f9/0x300 ip_queue_xmit+0x15/0x20 __tcp_transmit_skb+0x910/0x9c0 ? __mod_memcg_state+0x44/0xa0 tcp_connect+0x437/0x4e0 ? ktime_get_with_offset+0x60/0xf0 tcp_v4_connect+0x436/0x530 __inet_stream_connect+0xd4/0x3a0 ? kprobe_perf_func+0x4f/0x2b0 ? aa_sk_perm+0x43/0x1c0 inet_stream_connect+0x3b/0x60 __sys_connect_file+0x63/0x70 __sys_connect+0xa6/0xd0 ? setfl+0x108/0x170 ? do_fcntl+0xe8/0x5a0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x5c/0xc0 ? __x64_sys_fcntl+0xa9/0xd0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? do_syscall_64+0x69/0xc0 ? __sys_setsockopt+0xea/0x1e0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? __x64_sys_setsockopt+0x1f/0x30 ? do_syscall_64+0x69/0xc0 ? irqentry_exit+0x1d/0x30 ? exc_page_fault+0x89/0x170 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f7b8101c6a7 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7 RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005 RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410 R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000 </TASK> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carlos Soto <carlos.soto@broadcom.com> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02net: openvswitch: fix nested key length validation in the set() actionIlya Maximets1-1/+2
[ Upstream commit 65d91192aa66f05710cfddf6a14b5a25ee554dba ] It's not safe to access nla_len(ovs_key) if the data is smaller than the netlink header. Check that the attribute is OK first. Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.") Reported-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b07a9da40df1576b8048 Tested-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20250412104052.2073688-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu().Guillaume Nault1-6/+0
[ Upstream commit 8930424777e43257f5bf6f0f0f53defd0d30415c ] Because skb_tunnel_check_pmtu() doesn't handle PACKET_HOST packets, commit 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper pmtud support.") forced skb->pkt_type to PACKET_OUTGOING for openvswitch packets that are sent using the OVS_ACTION_ATTR_OUTPUT action. This allowed such packets to invoke the iptunnel_pmtud_check_icmp() or iptunnel_pmtud_check_icmpv6() helpers and thus trigger PMTU update on the input device. However, this also broke other parts of PMTU discovery. Since these packets don't have the PACKET_HOST type anymore, they won't trigger the sending of ICMP Fragmentation Needed or Packet Too Big messages to remote hosts when oversized (see the skb_in->pkt_type condition in __icmp_send() for example). These two skb->pkt_type checks are therefore incompatible as one requires skb->pkt_type to be PACKET_HOST, while the other requires it to be anything but PACKET_HOST. It makes sense to not trigger ICMP messages for non-PACKET_HOST packets as these messages should be generated only for incoming l2-unicast packets. However there doesn't seem to be any reason for skb_tunnel_check_pmtu() to ignore PACKET_HOST packets. Allow both cases to work by allowing skb_tunnel_check_pmtu() to work on PACKET_HOST packets and not overriding skb->pkt_type in openvswitch anymore. Fixes: 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper pmtud support.") Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Tested-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/eac941652b86fddf8909df9b3bf0d97bc9444793.1743208264.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10netfilter: conntrack: convert to refcount_t apiFlorian Westphal1-1/+0
[ Upstream commit 719774377622bc4025d2a74f551b5dc2158c6c30 ] Convert nf_conn reference counting from atomic_t to refcount_t based api. refcount_t api provides more runtime sanity checks and will warn on certain constructs, e.g. refcount_inc() on a zero reference count, which usually indicates use-after-free. For this reason template allocation is changed to init the refcount to 1, the subsequenct add operations are removed. Likewise, init_conntrack() is changed to set the initial refcount to 1 instead refcount_inc(). This is safe because the new entry is not (yet) visible to other cpus. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 5cfe5612ca95 ("netfilter: nft_ct: Use __refcount_inc() for per-CPU nft_ct_pcpu_template.") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13openvswitch: use RCU protection in ovs_vport_cmd_fill_info()Eric Dumazet1-3/+9
[ Upstream commit 90b2f49a502fa71090d9f4fe29a2f51fe5dff76d ] ovs_vport_cmd_fill_info() can be called without RTNL or RCU. Use RCU protection and dev_net_rcu() to avoid potential UAF. Fixes: 9354d4520342 ("openvswitch: reliable interface indentification in port dumps") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-16openvswitch: Set the skbuff pkt_type for proper pmtud support.Aaron Conole1-0/+6
[ Upstream commit 30a92c9e3d6b073932762bef2ac66f4ee784c657 ] Open vSwitch is originally intended to switch at layer 2, only dealing with Ethernet frames. With the introduction of l3 tunnels support, it crossed into the realm of needing to care a bit about some routing details when making forwarding decisions. If an oversized packet would need to be fragmented during this forwarding decision, there is a chance for pmtu to get involved and generate a routing exception. This is gated by the skbuff->pkt_type field. When a flow is already loaded into the openvswitch module this field is set up and transitioned properly as a packet moves from one port to another. In the case that a packet execute is invoked after a flow is newly installed this field is not properly initialized. This causes the pmtud mechanism to omit sending the required exception messages across the tunnel boundary and a second attempt needs to be made to make sure that the routing exception is properly setup. To fix this, we set the outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get to the openvswitch module via a port device or packet command. Even for bridge ports as users, the pkt_type needs to be reset when doing the transmit as the packet is truly outgoing and routing needs to get involved post packet transformations, in the case of VXLAN/GENEVE/udp-tunnel packets. In general, the pkt_type on output gets ignored, since we go straight to the driver, but in the case of tunnel ports they go through IP routing layer. This issue is periodically encountered in complex setups, such as large openshift deployments, where multiple sets of tunnel traversal occurs. A way to recreate this is with the ovn-heater project that can setup a networking environment which mimics such large deployments. We need larger environments for this because we need to ensure that flow misses occur. In these environment, without this patch, we can see: ./ovn_cluster.sh start podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200 podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache podman exec ovn-chassis-1 ip netns exec sw01p1 \ ping 21.0.0.3 -M do -s 1300 -c2 PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data. From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142) --- 21.0.0.3 ping statistics --- ... Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not sent into the server. With this patch, setting the pkt_type, we see the following: podman exec ovn-chassis-1 ip netns exec sw01p1 \ ping 21.0.0.3 -M do -s 1300 -c2 PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data. From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222) ping: local error: message too long, mtu=1222 --- 21.0.0.3 ping statistics --- ... In this case, the first ping request receives the FRAG_NEEDED message and a local routing exception is created. Tested-by: Jaime Caamano <jcaamano@redhat.com> Reported-at: https://issues.redhat.com/browse/FDP-164 Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.") Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-16net: openvswitch: fix overwriting ct original tuple for ICMPv6Ilya Maximets1-1/+2
[ Upstream commit 7c988176b6c16c516474f6fceebe0f055af5eb56 ] OVS_PACKET_CMD_EXECUTE has 3 main attributes: - OVS_PACKET_ATTR_KEY - Packet metadata in a netlink format. - OVS_PACKET_ATTR_PACKET - Binary packet content. - OVS_PACKET_ATTR_ACTIONS - Actions to execute on the packet. OVS_PACKET_ATTR_KEY is parsed first to populate sw_flow_key structure with the metadata like conntrack state, input port, recirculation id, etc. Then the packet itself gets parsed to populate the rest of the keys from the packet headers. Whenever the packet parsing code starts parsing the ICMPv6 header, it first zeroes out fields in the key corresponding to Neighbor Discovery information even if it is not an ND packet. It is an 'ipv6.nd' field. However, the 'ipv6' is a union that shares the space between 'nd' and 'ct_orig' that holds the original tuple conntrack metadata parsed from the OVS_PACKET_ATTR_KEY. ND packets should not normally have conntrack state, so it's fine to share the space, but normal ICMPv6 Echo packets or maybe other types of ICMPv6 can have the state attached and it should not be overwritten. The issue results in all but the last 4 bytes of the destination address being wiped from the original conntrack tuple leading to incorrect packet matching and potentially executing wrong actions in case this packet recirculates within the datapath or goes back to userspace. ND fields should not be accessed in non-ND packets, so not clearing them should be fine. Executing memset() only for actual ND packets to avoid the issue. Initializing the whole thing before parsing is needed because ND packet may not contain all the options. The issue only affects the OVS_PACKET_CMD_EXECUTE path and doesn't affect packets entering OVS datapath from network interfaces, because in this case CT metadata is populated from skb after the packet is already parsed. Fixes: 9dd7f8907c37 ("openvswitch: Add original direction conntrack tuple to sw_flow_key.") Reported-by: Antonin Bas <antonin.bas@broadcom.com> Closes: https://github.com/openvswitch/ovs-issues/issues/327 Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240509094228.1035477-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-02net: openvswitch: Fix Use-After-Free in ovs_ct_exitHyunwoo Kim1-2/+2
[ Upstream commit 5ea7b72d4fac2fdbc0425cd8f2ea33abe95235b2 ] Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal of ovs_ct_limit_exit, is not part of the RCU read critical section, it is possible that the RCU grace period will pass during the traversal and the key will be free. To prevent this, it should be changed to hlist_for_each_entry_safe. Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit") Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-02net: openvswitch: fix unwanted error log on timeout policy probingIlya Maximets1-2/+3
[ Upstream commit 4539f91f2a801c0c028c252bffae56030cfb2cae ] On startup, ovs-vswitchd probes different datapath features including support for timeout policies. While probing, it tries to execute certain operations with OVS_PACKET_ATTR_PROBE or OVS_FLOW_ATTR_PROBE attributes set. These attributes tell the openvswitch module to not log any errors when they occur as it is expected that some of the probes will fail. For some reason, setting the timeout policy ignores the PROBE attribute and logs a failure anyway. This is causing the following kernel log on each re-start of ovs-vswitchd: kernel: Failed to associated timeout policy `ovs_test_tp' Fix that by using the same logging macro that all other messages are using. The message will still be printed at info level when needed and will be rate limited, but with a net rate limiter instead of generic printk one. The nf_ct_set_timeout() itself will still print some info messages, but at least this change makes logging in openvswitch module more consistent. Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240403203803.2137962-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23net: openvswitch: limit the number of recursions from action setsAaron Conole1-16/+33
[ Upstream commit 6e2f90d31fe09f2b852de25125ca875aabd81367 ] The ovs module allows for some actions to recursively contain an action list for complex scenarios, such as sampling, checking lengths, etc. When these actions are copied into the internal flow table, they are evaluated to validate that such actions make sense, and these calls happen recursively. The ovs-vswitchd userspace won't emit more than 16 recursion levels deep. However, the module has no such limit and will happily accept limits larger than 16 levels nested. Prevent this by tracking the number of recursions happening and manually limiting it to 16 levels nested. The initial implementation of the sample action would track this depth and prevent more than 3 levels of recursion, but this was removed to support the clone use case, rather than limited at the current userspace limit. Fixes: 798c166173ff ("openvswitch: Optimize sample action for the clone use cases") Signed-off-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240207132416.1488485-2-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22net: openvswitch: fix possible memory leak in ovs_meter_cmd_set()Hangyu Hua1-1/+3
commit 2fa28f5c6fcbfc794340684f36d2581b4f2d20b5 upstream. old_meter needs to be free after it is detached regardless of whether the new meter is successfully attached. Fixes: c7c4c44c9a95 ("net: openvswitch: expand the meters supported number") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15net: openvswitch: fix flow memory leak in ovs_flow_cmd_newFedor Pchelkin1-6/+6
[ Upstream commit 0c598aed445eb45b0ee7ba405f7ece99ee349c30 ] Syzkaller reports a memory leak of new_flow in ovs_flow_cmd_new() as it is not freed when an allocation of a key fails. BUG: memory leak unreferenced object 0xffff888116668000 (size 632): comm "syz-executor231", pid 1090, jiffies 4294844701 (age 18.871s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000defa3494>] kmem_cache_zalloc include/linux/slab.h:654 [inline] [<00000000defa3494>] ovs_flow_alloc+0x19/0x180 net/openvswitch/flow_table.c:77 [<00000000c67d8873>] ovs_flow_cmd_new+0x1de/0xd40 net/openvswitch/datapath.c:957 [<0000000010a539a8>] genl_family_rcv_msg_doit+0x22d/0x330 net/netlink/genetlink.c:739 [<00000000dff3302d>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] [<00000000dff3302d>] genl_rcv_msg+0x328/0x590 net/netlink/genetlink.c:800 [<000000000286dd87>] netlink_rcv_skb+0x153/0x430 net/netlink/af_netlink.c:2515 [<0000000061fed410>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 [<000000009dc0f111>] netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] [<000000009dc0f111>] netlink_unicast+0x545/0x7f0 net/netlink/af_netlink.c:1339 [<000000004a5ee816>] netlink_sendmsg+0x8e7/0xde0 net/netlink/af_netlink.c:1934 [<00000000482b476f>] sock_sendmsg_nosec net/socket.c:651 [inline] [<00000000482b476f>] sock_sendmsg+0x152/0x190 net/socket.c:671 [<00000000698574ba>] ____sys_sendmsg+0x70a/0x870 net/socket.c:2356 [<00000000d28d9e11>] ___sys_sendmsg+0xf3/0x170 net/socket.c:2410 [<0000000083ba9120>] __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439 [<00000000c00628f8>] do_syscall_64+0x30/0x40 arch/x86/entry/common.c:46 [<000000004abfdcf4>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 To fix this the patch rearranges the goto labels to reflect the order of object allocations and adds appropriate goto statements on the error paths. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 68bb10101e6b ("openvswitch: Fix flow lookup to use unmasked key") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230201210218.361970-1-pchelkin@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-14openvswitch: Fix flow lookup to use unmasked keyEelco Chaudron1-9/+16
[ Upstream commit 68bb10101e6b0a6bb44e9c908ef795fc4af99eae ] The commit mentioned below causes the ovs_flow_tbl_lookup() function to be called with the masked key. However, it's supposed to be called with the unmasked key. This due to the fact that the datapath supports installing wider flows, and OVS relies on this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/ 128.0.0.0) is allowed to be added. However, if we try to add a wildcard rule, the installation fails: $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2 $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2 ovs-vswitchd: updating flow table (File exists) The reason is that the key used to determine if the flow is already present in the system uses the original key ANDed with the mask. This results in the IP address not being part of the (miniflow) key, i.e., being substituted with an all-zero value. When doing the actual lookup, this results in the key wrongfully matching the first flow, and therefore the flow does not get installed. This change reverses the commit below, but rather than having the key on the stack, it's allocated. Fixes: 190aa3e77880 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02netfilter: conntrack: Fix data-races around ct markDaniel Xu1-4/+4
[ Upstream commit 52d1aa8b8249ff477aaa38b6f74a8ced780d079c ] nf_conn:mark can be read from and written to in parallel. Use READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted compiler optimizations. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-03openvswitch: switch from WARN to pr_warnAaron Conole1-1/+2
[ Upstream commit fd954cc1919e35cb92f78671cab6e42d661945a3 ] As noted by Paolo Abeni, pr_warn doesn't generate any splat and can still preserve the warning to the user that feature downgrade occurred. We likely cannot introduce other kinds of checks / enforcement here because syzbot can generate different genl versions to the datapath. Reported-by: syzbot+31cde0bef4bbf8ba2d86@syzkaller.appspotmail.com Fixes: 44da5ae5fbea ("openvswitch: Drop user features if old user space attempted to create datapath") Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26openvswitch: Fix overreporting of drops in dropwatchMike Pattrick1-2/+3
[ Upstream commit c21ab2afa2c64896a7f0e3cbc6845ec63dcfad2e ] Currently queue_userspace_packet will call kfree_skb for all frames, whether or not an error occurred. This can result in a single dropped frame being reported as multiple drops in dropwatch. This functions caller may also call kfree_skb in case of an error. This patch will consume the skbs instead and allow caller's to use kfree_skb. Signed-off-by: Mike Pattrick <mkp@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109957 Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26openvswitch: Fix double reporting of drops in dropwatchMike Pattrick1-3/+10
[ Upstream commit 1100248a5c5ccd57059eb8d02ec077e839a23826 ] Frames sent to userspace can be reported as dropped in ovs_dp_process_packet, however, if they are dropped in the netlink code then netlink_attachskb will report the same frame as dropped. This patch checks for error codes which indicate that the frame has already been freed. Signed-off-by: Mike Pattrick <mkp@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109946 Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-29net: openvswitch: fix parsing of nw_proto for IPv6 fragmentsRosemarie O'Riorden1-1/+1
commit 12378a5a75e33f34f8586706eb61cca9e6d4690c upstream. When a packet enters the OVS datapath and does not match any existing flows installed in the kernel flow cache, the packet will be sent to userspace to be parsed, and a new flow will be created. The kernel and OVS rely on each other to parse packet fields in the same way so that packets will be handled properly. As per the design document linked below, OVS expects all later IPv6 fragments to have nw_proto=44 in the flow key, so they can be correctly matched on OpenFlow rules. OpenFlow controllers create pipelines based on this design. This behavior was changed by the commit in the Fixes tag so that nw_proto equals the next_header field of the last extension header. However, there is no counterpart for this change in OVS userspace, meaning that this field is parsed differently between OVS and the kernel. This is a problem because OVS creates actions based on what is parsed in userspace, but the kernel-provided flow key is used as a match criteria, as described in Documentation/networking/openvswitch.rst. This leads to issues such as packets incorrectly matching on a flow and thus the wrong list of actions being applied to the packet. Such changes in packet parsing cannot be implemented without breaking the userspace. The offending commit is partially reverted to restore the expected behavior. The change technically made sense and there is a good reason that it was implemented, but it does not comply with the original design of OVS. If in the future someone wants to implement such a change, then it must be user-configurable and disabled by default to preserve backwards compatibility with existing OVS versions. Cc: stable@vger.kernel.org Fixes: fa642f08839b ("openvswitch: Derive IP protocol number for IPv6 later frags") Link: https://docs.openvswitch.org/en/latest/topics/design/#fragments Signed-off-by: Rosemarie O'Riorden <roriorden@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20220621204845.9721-1-roriorden@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-22net: openvswitch: fix misuse of the cached connection on tuple changesIlya Maximets2-1/+8
commit 2061ecfdf2350994e5b61c43e50e98a7a70e95ee upstream. If packet headers changed, the cached nfct is no longer relevant for the packet and attempt to re-use it leads to the incorrect packet classification. This issue is causing broken connectivity in OpenStack deployments with OVS/OVN due to hairpin traffic being unexpectedly dropped. The setup has datapath flows with several conntrack actions and tuple changes between them: actions:ct(commit,zone=8,mark=0/0x1,nat(src)), set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)), set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)), ct(zone=8),recirc(0x4) After the first ct() action the packet headers are almost fully re-written. The next ct() tries to re-use the existing nfct entry and marks the packet as invalid, so it gets dropped later in the pipeline. Clearing the cached conntrack entry whenever packet tuple is changed to avoid the issue. The flow key should not be cleared though, because we should still be able to match on the ct_state if the recirculation happens after the tuple change but before the next ct() action. Cc: stable@vger.kernel.org Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Reported-by: Frode Nordahl <frode.nordahl@canonical.com> Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856 Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [Backport to 5.10: minor rebase in ovs_ct_clear function. This version also applicable to and tested on 5.4 and 4.19.] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-27openvswitch: fix OOB access in reserve_sfa_size()Paolo Valerio1-1/+1
commit cefa91b2332d7009bc0be5d951d6cbbf349f90f8 upstream. Given a sufficiently large number of actions, while copying and reserving memory for a new action of a new flow, if next_offset is greater than MAX_ACTIONS_BUFSIZE, the function reserve_sfa_size() does not return -EMSGSIZE as expected, but it allocates MAX_ACTIONS_BUFSIZE bytes increasing actions_len by req_size. This can then lead to an OOB write access, especially when further actions need to be copied. Fix it by rearranging the flow action size check. KASAN splat below: ================================================================== BUG: KASAN: slab-out-of-bounds in reserve_sfa_size+0x1ba/0x380 [openvswitch] Write of size 65360 at addr ffff888147e4001c by task handler15/836 CPU: 1 PID: 836 Comm: handler15 Not tainted 5.18.0-rc1+ #27 ... Call Trace: <TASK> dump_stack_lvl+0x45/0x5a print_report.cold+0x5e/0x5db ? __lock_text_start+0x8/0x8 ? reserve_sfa_size+0x1ba/0x380 [openvswitch] kasan_report+0xb5/0x130 ? reserve_sfa_size+0x1ba/0x380 [openvswitch] kasan_check_range+0xf5/0x1d0 memcpy+0x39/0x60 reserve_sfa_size+0x1ba/0x380 [openvswitch] __add_action+0x24/0x120 [openvswitch] ovs_nla_add_action+0xe/0x20 [openvswitch] ovs_ct_copy_action+0x29d/0x1130 [openvswitch] ? __kernel_text_address+0xe/0x30 ? unwind_get_return_address+0x56/0xa0 ? create_prof_cpu_mask+0x20/0x20 ? ovs_ct_verify+0xf0/0xf0 [openvswitch] ? prep_compound_page+0x198/0x2a0 ? __kasan_check_byte+0x10/0x40 ? kasan_unpoison+0x40/0x70 ? ksize+0x44/0x60 ? reserve_sfa_size+0x75/0x380 [openvswitch] __ovs_nla_copy_actions+0xc26/0x2070 [openvswitch] ? __zone_watermark_ok+0x420/0x420 ? validate_set.constprop.0+0xc90/0xc90 [openvswitch] ? __alloc_pages+0x1a9/0x3e0 ? __alloc_pages_slowpath.constprop.0+0x1da0/0x1da0 ? unwind_next_frame+0x991/0x1e40 ? __mod_node_page_state+0x99/0x120 ? __mod_lruvec_page_state+0x2e3/0x470 ? __kasan_kmalloc_large+0x90/0xe0 ovs_nla_copy_actions+0x1b4/0x2c0 [openvswitch] ovs_flow_cmd_new+0x3cd/0xb10 [openvswitch] ... Cc: stable@vger.kernel.org Fixes: f28cd2af22a0 ("openvswitch: fix flow actions reallocation") Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-13net: openvswitch: fix leak of nested actionsIlya Maximets1-5/+90
[ Upstream commit 1f30fb9166d4f15a1aa19449b9da871fe0ed4796 ] While parsing user-provided actions, openvswitch module may dynamically allocate memory and store pointers in the internal copy of the actions. So this memory has to be freed while destroying the actions. Currently there are only two such actions: ct() and set(). However, there are many actions that can hold nested lists of actions and ovs_nla_free_flow_actions() just jumps over them leaking the memory. For example, removal of the flow with the following actions will lead to a leak of the memory allocated by nf_ct_tmpl_alloc(): actions:clone(ct(commit),0) Non-freed set() action may also leak the 'dst' structure for the tunnel info including device references. Under certain conditions with a high rate of flow rotation that may cause significant memory leak problem (2MB per second in reporter's case). The problem is also hard to mitigate, because the user doesn't have direct control over the datapath flows generated by OVS. Fix that by iterating over all the nested actions and freeing everything that needs to be freed recursively. New build time assertion should protect us from this problem if new actions will be added in the future. Unfortunately, openvswitch module doesn't use NLA_F_NESTED, so all attributes has to be explicitly checked. sample() and clone() actions are mixing extra attributes into the user-provided action list. That prevents some code generalization too. Fixes: 34ae932a4036 ("openvswitch: Make tunnel set action attach a metadata dst") Link: https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392922.html Reported-by: Stéphane Graber <stgraber@ubuntu.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13net: openvswitch: don't send internal clone attribute to the userspace.Ilya Maximets2-2/+4
[ Upstream commit 3f2a3050b4a3e7f32fc0ea3c9b0183090ae00522 ] 'OVS_CLONE_ATTR_EXEC' is an internal attribute that is used for performance optimization inside the kernel. It's added by the kernel while parsing user-provided actions and should not be sent during the flow dump as it's not part of the uAPI. The issue doesn't cause any significant problems to the ovs-vswitchd process, because reported actions are not really used in the application lifecycle and only supposed to be shown to a human via ovs-dpctl flow dump. However, the action list is still incorrect and causes the following error if the user wants to look at the datapath flows: # ovs-dpctl add-dp system@ovs-system # ovs-dpctl add-flow "<flow match>" "clone(ct(commit),0)" # ovs-dpctl dump-flows <flow match>, packets:0, bytes:0, used:never, actions:clone(bad length 4, expected -1 for: action0(01 00 00 00), ct(commit),0) With the fix: # ovs-dpctl dump-flows <flow match>, packets:0, bytes:0, used:never, actions:clone(ct(commit),0) Additionally fixed an incorrect attribute name in the comment. Fixes: b233504033db ("openvswitch: kernel datapath clone action") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20220404104150.2865736-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08openvswitch: Fixed nd target mask field in the flow dump.Martin Varghese1-2/+2
commit f19c44452b58a84d95e209b847f5495d91c9983a upstream. IPv6 nd target mask was not getting populated in flow dump. In the function __ovs_nla_put_key the icmp code mask field was checked instead of icmp code key field to classify the flow as neighbour discovery. ufid:bdfbe3e5-60c2-43b0-a5ff-dfcac1c37328, recirc_id(0),dp_hash(0/0), skb_priority(0/0),in_port(ovs-nm1),skb_mark(0/0),ct_state(0/0), ct_zone(0/0),ct_mark(0/0),ct_label(0/0), eth(src=00:00:00:00:00:00/00:00:00:00:00:00, dst=00:00:00:00:00:00/00:00:00:00:00:00), eth_type(0x86dd), ipv6(src=::/::,dst=::/::,label=0/0,proto=58,tclass=0/0,hlimit=0/0,frag=no), icmpv6(type=135,code=0), nd(target=2001::2/::, sll=00:00:00:00:00:00/00:00:00:00:00:00, tll=00:00:00:00:00:00/00:00:00:00:00:00), packets:10, bytes:860, used:0.504s, dp:ovs, actions:ovs-nm2 Fixes: e64457191a25 (openvswitch: Restructure datapath.c and flow.c) Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Link: https://lore.kernel.org/r/20220328054148.3057-1-martinvarghesenokia@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08openvswitch: always update flow key after natAaron Conole1-59/+59
[ Upstream commit 60b44ca6bd7518dd38fa2719bc9240378b6172c3 ] During NAT, a tuple collision may occur. When this happens, openvswitch will make a second pass through NAT which will perform additional packet modification. This will update the skb data, but not the flow key that OVS uses. This means that future flow lookups, and packet matches will have incorrect data. This has been supported since 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack"). That commit failed to properly update the sw_flow_key attributes, since it only called the ovs_ct_nat_update_key once, rather than each time ovs_ct_nat_execute was called. As these two operations are linked, the ovs_ct_nat_execute() function should always make sure that the sw_flow_key is updated after a successful call through NAT infrastructure. Fixes: 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack") Cc: Dumitru Ceara <dceara@redhat.com> Cc: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-02openvswitch: Fix setting ipv6 fields causing hw csum failurePaul Blakey1-8/+38
commit d9b5ae5c1b241b91480aa30408be12fe91af834a upstream. Ipv6 ttl, label and tos fields are modified without first pulling/pushing the ipv6 header, which would have updated the hw csum (if available). This might cause csum validation when sending the packet to the stack, as can be seen in the trace below. Fix this by updating skb->csum if available. Trace resulted by ipv6 ttl dec and then sending packet to conntrack [actions: set(ipv6(hlimit=63)),ct(zone=99)]: [295241.900063] s_pf0vf2: hw csum failure [295241.923191] Call Trace: [295241.925728] <IRQ> [295241.927836] dump_stack+0x5c/0x80 [295241.931240] __skb_checksum_complete+0xac/0xc0 [295241.935778] nf_conntrack_tcp_packet+0x398/0xba0 [nf_conntrack] [295241.953030] nf_conntrack_in+0x498/0x5e0 [nf_conntrack] [295241.958344] __ovs_ct_lookup+0xac/0x860 [openvswitch] [295241.968532] ovs_ct_execute+0x4a7/0x7c0 [openvswitch] [295241.979167] do_execute_actions+0x54a/0xaa0 [openvswitch] [295242.001482] ovs_execute_actions+0x48/0x100 [openvswitch] [295242.006966] ovs_dp_process_packet+0x96/0x1d0 [openvswitch] [295242.012626] ovs_vport_receive+0x6c/0xc0 [openvswitch] [295242.028763] netdev_frame_hook+0xc0/0x180 [openvswitch] [295242.034074] __netif_receive_skb_core+0x2ca/0xcb0 [295242.047498] netif_receive_skb_internal+0x3e/0xc0 [295242.052291] napi_gro_receive+0xba/0xe0 [295242.056231] mlx5e_handle_rx_cqe_mpwrq_rep+0x12b/0x250 [mlx5_core] [295242.062513] mlx5e_poll_rx_cq+0xa0f/0xa30 [mlx5_core] [295242.067669] mlx5e_napi_poll+0xe1/0x6b0 [mlx5_core] [295242.077958] net_rx_action+0x149/0x3b0 [295242.086762] __do_softirq+0xd7/0x2d6 [295242.090427] irq_exit+0xf7/0x100 [295242.093748] do_IRQ+0x7f/0xd0 [295242.096806] common_interrupt+0xf/0xf [295242.100559] </IRQ> [295242.102750] RIP: 0033:0x7f9022e88cbd [295242.125246] RSP: 002b:00007f9022282b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda [295242.132900] RAX: 0000000000000005 RBX: 0000000000000010 RCX: 0000000000000000 [295242.140120] RDX: 00007f9022282ba8 RSI: 00007f9022282a30 RDI: 00007f9014005c30 [295242.147337] RBP: 00007f9014014d60 R08: 0000000000000020 R09: 00007f90254a8340 [295242.154557] R10: 00007f9022282a28 R11: 0000000000000246 R12: 0000000000000000 [295242.161775] R13: 00007f902308c000 R14: 000000000000002b R15: 00007f9022b71f40 Fixes: 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action") Signed-off-by: Paul Blakey <paulb@nvidia.com> Link: https://lore.kernel.org/r/20220223163416.24096-1-paulb@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-26ovs: clear skb->tstamp in forwarding pathkaixi.fan1-0/+1
[ Upstream commit 01634047bf0d5c2d9b7d8095bb4de1663dbeedeb ] fq qdisc requires tstamp to be cleared in the forwarding path. Now ovs doesn't clear skb->tstamp. We encountered a problem with linux version 5.4.56 and ovs version 2.14.1, and packets failed to dequeue from qdisc when fq qdisc was attached to ovs port. Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC") Signed-off-by: kaixi.fan <fankaixi.li@bytedance.com> Signed-off-by: xiexiaohui <xiexiaohui.xxh@bytedance.com> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-06-03openvswitch: meter: fix race when getting now_ms.Tao Liu1-0/+8
[ Upstream commit e4df1b0c24350a0f00229ff895a91f1072bd850d ] We have observed meters working unexpected if traffic is 3+Gbit/s with multiple connections. now_ms is not pretected by meter->lock, we may get a negative long_delta_ms when another cpu updated meter->used, then: delta_ms = (u32)long_delta_ms; which will be a large value. band->bucket += delta_ms * band->rate; then we get a wrong band->bucket. OpenVswitch userspace datapath has fixed the same issue[1] some time ago, and we port the implementation to kernel datapath. [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/ Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure") Signed-off-by: Tao Liu <thomas.liu@ucloud.cn> Suggested-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-11openvswitch: fix stack OOB read while fragmenting IPv4 packetsDavide Caratti1-4/+4
commit 7c0ea5930c1c211931819d83cfb157bff1539a4c upstream. running openvswitch on kernels built with KASAN, it's possible to see the following splat while testing fragmentation of IPv4 packets: BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60 Read of size 1 at addr ffff888112fc713c by task handler2/1367 CPU: 0 PID: 1367 Comm: handler2 Not tainted 5.12.0-rc6+ #418 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 Call Trace: dump_stack+0x92/0xc1 print_address_description.constprop.7+0x1a/0x150 kasan_report.cold.13+0x7f/0x111 ip_do_fragment+0x1b03/0x1f60 ovs_fragment+0x5bf/0x840 [openvswitch] do_execute_actions+0x1bd5/0x2400 [openvswitch] ovs_execute_actions+0xc8/0x3d0 [openvswitch] ovs_packet_cmd_execute+0xa39/0x1150 [openvswitch] genl_family_rcv_msg_doit.isra.15+0x227/0x2d0 genl_rcv_msg+0x287/0x490 netlink_rcv_skb+0x120/0x380 genl_rcv+0x24/0x40 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f957079db07 Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 eb ec ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 24 ed ff ff 48 RSP: 002b:00007f956ce35a50 EFLAGS: 00000293 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 00007f957079db07 RDX: 0000000000000000 RSI: 00007f956ce35ae0 RDI: 0000000000000019 RBP: 00007f956ce35ae0 R08: 0000000000000000 R09: 00007f9558006730 R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 R13: 00007f956ce37308 R14: 00007f956ce35f80 R15: 00007f956ce35ae0 The buggy address belongs to the page: page:00000000af2a1d93 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112fc7 flags: 0x17ffffc0000000() raw: 0017ffffc0000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected addr ffff888112fc713c is located in stack of task handler2/1367 at offset 180 in frame: ovs_fragment+0x0/0x840 [openvswitch] this frame has 2 objects: [32, 144) 'ovs_dst' [192, 424) 'ovs_rt' Memory state around the buggy address: ffff888112fc7000: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888112fc7080: 00 f1 f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00 >ffff888112fc7100: 00 00 00 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 ^ ffff888112fc7180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888112fc7200: 00 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 for IPv4 packets, ovs_fragment() uses a temporary struct dst_entry. Then, in the following call graph: ip_do_fragment() ip_skb_dst_mtu() ip_dst_mtu_maybe_forward() ip_mtu_locked() the pointer to struct dst_entry is used as pointer to struct rtable: this turns the access to struct members like rt_mtu_locked into an OOB read in the stack. Fix this changing the temporary variable used for IPv4 packets in ovs_fragment(), similarly to what is done for IPv6 few lines below. Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmt") Cc: <stable@vger.kernel.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-14openvswitch: fix send of uninitialized stack memory in ct limit replyIlya Maximets1-4/+4
[ Upstream commit 4d51419d49930be2701c2633ae271b350397c3ca ] 'struct ovs_zone_limit' has more members than initialized in ovs_ct_limit_get_default_limit(). The rest of the memory is a random kernel stack content that ends up being sent to userspace. Fix that by using designated initializer that will clear all non-specified fields. Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-14net: openvswitch: conntrack: simplify the return expression of ↵Zheng Yongjun1-5/+1
ovs_ct_limit_get_default_limit() [ Upstream commit 5e359044c107ecbdc2e9b3fd5ce296006e6de4bc ] Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-23net: openvswitch: fix TTL decrement exception action executionEelco Chaudron1-9/+6
[ Upstream commit 09d6217254c004f6237cc2c2bfe604af58e9a8c5 ] Currently, the exception actions are not processed correctly as the wrong dataset is passed. This change fixes this, including the misleading comment. In addition, a check was added to make sure we work on an IPv4 packet, and not just assume if it's not IPv6 it's IPv4. This was all tested using OVS with patch, https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639, applied and sending packets with a TTL of 1 (and 0), both with IPv4 and IPv6. Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-05openvswitch: fix error return code in validate_and_copy_dec_ttl()Wang Hai1-1/+1
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Changing 'return start' to 'return action_start' can fix this bug. Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: openvswitch: ensure LSE is pullable before reading itDavide Caratti1-0/+3
when openvswitch is configured to mangle the LSE, the current value is read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is contained in the skb "linear" area. Found by code inspection. Fixes: d27cf5c59a12 ("net: core: add MPLS update core helper and use in OvS") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net: openvswitch: fix TTL decrement action netlink message formatEelco Chaudron2-23/+58
Currently, the openvswitch module is not accepting the correctly formated netlink message for the TTL decrement action. For both setting and getting the dec_ttl action, the actions should be nested in the OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi. When the original patch was sent, it was tested with a private OVS userspace implementation. This implementation was unfortunately not upstreamed and reviewed, hence an erroneous version of this patch was sent out. Leaving the patch as-is would cause problems as the kernel module could interpret additional attributes as actions and vice-versa, due to the actions not being encapsulated/nested within the actual attribute, but being concatinated after it. Fixes: 744676e77720 ("openvswitch: add TTL decrement action") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04net: openvswitch: silence suspicious RCU usage warningEelco Chaudron2-8/+8
Silence suspicious RCU usage warning in ovs_flow_tbl_masks_cache_resize() by replacing rcu_dereference() with rcu_dereference_ovsl(). In addition, when creating a new datapath, make sure it's configured under the ovs_lock. Fixes: 9bf24f594c6a ("net: openvswitch: make masks cache size configurable") Reported-by: syzbot+9a8f8bfcc56e8578016c@syzkaller.appspotmail.com Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160439190002.56943.1418882726496275961.stgit@ebuild Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-18net: openvswitch: fix to make sure flow_lookup() is not preemptedEelco Chaudron2-25/+41
The flow_lookup() function uses per CPU variables, which must be called with BH disabled. However, this is fine in the general NAPI use case where the local BH is disabled. But, it's also called from the netlink context. The below patch makes sure that even in the netlink path, the BH is disabled. In addition, u64_stats_update_begin() requires a lock to ensure one writer which is not ensured here. Making it per-CPU and disabling NAPI (softirq) ensures that there is always only one writer. Fixes: eac87c413bf9 ("net: openvswitch: reorder masks array based on usage") Reported-by: Juri Lelli <jlelli@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160295903253.7789.826736662555102345.stgit@ebuild Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-14net: openvswitch: use new function dev_fetch_sw_netstatsHeiner Kallweit1-19/+1
Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/5e52dc91-97b1-82b0-214b-65d404e4a2ec@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-9/+13
Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08openvswitch: handle DNAT tuple collisionDumitru Ceara1-9/+13
With multiple DNAT rules it's possible that after destination translation the resulting tuples collide. For example, two openvswitch flows: nw_dst=10.0.0.10,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20)) nw_dst=10.0.0.20,tp_dst=10, actions=ct(commit,table=2,nat(dst=20.0.0.1:20)) Assuming two TCP clients initiating the following connections: 10.0.0.10:5000->10.0.0.10:10 10.0.0.10:5000->10.0.0.20:10 Both tuples would translate to 10.0.0.10:5000->20.0.0.1:20 causing nf_conntrack_confirm() to fail because of tuple collision. Netfilter handles this case by allocating a null binding for SNAT at egress by default. Perform the same operation in openvswitch for DNAT if no explicit SNAT is requested by the user and allocate a null binding for SNAT for packets in the "original" direction. Reported-at: https://bugzilla.redhat.com/1877128 Suggested-by: Florian Westphal <fw@strlen.de> Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-06net: openvswitch: use dev_sw_netstats_rx_add()Fabian Frederick1-7/+1
use new helper for netstats settings Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05net: openvswitch: Constify static struct genl_small_opsRikard Falkeborn2-2/+2
The only usage of these is to assign their address to the small_ops field in the genl_family struct, which is a const pointer, and applying ARRAY_SIZE() on them. Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-04net/sched: act_vlan: Add {POP,PUSH}_ETH actionsGuillaume Nault1-18/+10
Implement TCA_VLAN_ACT_POP_ETH and TCA_VLAN_ACT_PUSH_ETH, to respectively pop and push a base Ethernet header at the beginning of a frame. POP_ETH is just a matter of pulling ETH_HLEN bytes. VLAN tags, if any, must be stripped before calling POP_ETH. PUSH_ETH is restricted to skbs with no mac_header, and only the MAC addresses can be configured. The Ethertype is automatically set from skb->protocol. These restrictions ensure that all skb's fields remain consistent, so that this action can't confuse other part of the networking stack (like GSO). Since openvswitch already had these actions, consolidate the code in skbuff.c (like for vlan and mpls push/pop). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-03genetlink: move to smaller ops wherever possibleJakub Kicinski3-18/+18
Bulk of the genetlink users can use smaller ops, move them. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>