summaryrefslogtreecommitdiff
path: root/net/sched/cls_u32.c
AgeCommit message (Collapse)AuthorFilesLines
2023-08-11net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-freevalis1-1/+0
[ Upstream commit 3044b16e7c6fe5d24b1cdbcf1bd0a9d92d1ebd81 ] When u32_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers") Reported-by: valis <sec@valis.email> Reported-by: M A Ramdhan <ramdhan@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11net/sched: cls_u32: Fix reference counter leak leading to overflowLee Jones1-7/+14
commit 04c55383fa5689357bcdd2c8036725a55ed632bc upstream. In the event of a failure in tcf_change_indev(), u32_set_parms() will immediately return without decrementing the recently incremented reference counter. If this happens enough times, the counter will rollover and the reference freed, leading to a double free which can be used to do 'bad things'. In order to prevent this, move the point of possible failure above the point where the reference counter is incremented. Also save any meaningful return values to be applied to the return data at the appropriate point in time. This issue was caught with KASAN. Fixes: 705c7091262d ("net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Lee Jones <lee@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-12net/sched: cls_u32: fix netns refcount changes in u32_change()Eric Dumazet1-7/+11
commit 3db09e762dc79584a69c10d74a6b98f89a9979f8 upstream. We are now able to detect extra put_net() at the moment they happen, instead of much later in correct code paths. u32_init_knode() / tcf_exts_init() populates the ->exts.net pointer, but as mentioned in tcf_exts_init(), the refcount on netns has not been elevated yet. The refcount is taken only once tcf_exts_get_net() is called. So the two u32_destroy_key() calls from u32_change() are attempting to release an invalid reference on the netns. syzbot report: refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 21708 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 21708 Comm: syz-executor.5 Not tainted 5.18.0-rc2-next-20220412-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31 Code: 1d 14 b6 b2 09 31 ff 89 de e8 6d e9 89 fd 84 db 75 e0 e8 84 e5 89 fd 48 c7 c7 40 aa 26 8a c6 05 f4 b5 b2 09 01 e8 e5 81 2e 05 <0f> 0b eb c4 e8 68 e5 89 fd 0f b6 1d e3 b5 b2 09 31 ff 89 de e8 38 RSP: 0018:ffffc900051af1b0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000040000 RSI: ffffffff8160a0c8 RDI: fffff52000a35e28 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff81604a9e R11: 0000000000000000 R12: 1ffff92000a35e3b R13: 00000000ffffffef R14: ffff8880211a0194 R15: ffff8880577d0a00 FS: 00007f25d183e700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f19c859c028 CR3: 0000000051009000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] ref_tracker_free+0x535/0x6b0 lib/ref_tracker.c:118 netns_tracker_free include/net/net_namespace.h:327 [inline] put_net_track include/net/net_namespace.h:341 [inline] tcf_exts_put_net include/net/pkt_cls.h:255 [inline] u32_destroy_key.isra.0+0xa7/0x2b0 net/sched/cls_u32.c:394 u32_change+0xe01/0x3140 net/sched/cls_u32.c:909 tc_new_tfilter+0x98d/0x2200 net/sched/cls_api.c:2148 rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:6016 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2495 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e2/0x800 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f25d0689049 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f25d183e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f25d079c030 RCX: 00007f25d0689049 RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000005 RBP: 00007f25d06e308d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd0b752e3f R14: 00007f25d183e300 R15: 0000000000022000 </TASK> Fixes: 35c55fc156d8 ("cls_u32: use tcf_exts_get_net() before call_rcu()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> [rkolchmeyer: Backported to 4.14: adjusted u32_destroy_key() signature] Signed-off-by: Robert Kolchmeyer <rkolchmeyer@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15net: sched: Fix memory exposure from short TCA_U32_SELKees Cook1-2/+6
[ Upstream commit 98c8f125fd8a6240ea343c1aa50a1be9047791b8 ] Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink policy, so max length isn't enforced, only minimum. This means nkeys (from userspace) was being trusted without checking the actual size of nla_len(), which could lead to a memory over-read, and ultimately an exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within a namespace. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-09cls_u32: fix use after free in u32_destroy_key()Paolo Abeni1-10/+11
[ Upstream commit d7cdee5ea8d28ae1b6922deb0c1badaa3aa0ef8c ] Li Shuang reported an Oops with cls_u32 due to an use-after-free in u32_destroy_key(). The use-after-free can be triggered with: dev=lo tc qdisc add dev $dev root handle 1: htb default 10 tc filter add dev $dev parent 1: prio 5 handle 1: protocol ip u32 divisor 256 tc filter add dev $dev protocol ip parent 1: prio 5 u32 ht 800:: match ip dst\ 10.0.0.0/8 hashkey mask 0x0000ff00 at 16 link 1: tc qdisc del dev $dev root Which causes the following kasan splat: ================================================================== BUG: KASAN: use-after-free in u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] Read of size 4 at addr ffff881b83dae618 by task kworker/u48:5/571 CPU: 17 PID: 571 Comm: kworker/u48:5 Not tainted 4.15.0+ #87 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016 Workqueue: tc_filter_workqueue u32_delete_key_freepf_work [cls_u32] Call Trace: dump_stack+0xd6/0x182 ? dma_virt_map_sg+0x22e/0x22e print_address_description+0x73/0x290 kasan_report+0x277/0x360 ? u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] u32_destroy_key.constprop.21+0x117/0x140 [cls_u32] u32_delete_key_freepf_work+0x1c/0x30 [cls_u32] process_one_work+0xae0/0x1c80 ? sched_clock+0x5/0x10 ? pwq_dec_nr_in_flight+0x3c0/0x3c0 ? _raw_spin_unlock_irq+0x29/0x40 ? trace_hardirqs_on_caller+0x381/0x570 ? _raw_spin_unlock_irq+0x29/0x40 ? finish_task_switch+0x1e5/0x760 ? finish_task_switch+0x208/0x760 ? preempt_notifier_dec+0x20/0x20 ? __schedule+0x839/0x1ee0 ? check_noncircular+0x20/0x20 ? firmware_map_remove+0x73/0x73 ? find_held_lock+0x39/0x1c0 ? worker_thread+0x434/0x1820 ? lock_contended+0xee0/0xee0 ? lock_release+0x1100/0x1100 ? init_rescuer.part.16+0x150/0x150 ? retint_kernel+0x10/0x10 worker_thread+0x216/0x1820 ? process_one_work+0x1c80/0x1c80 ? lock_acquire+0x1a5/0x540 ? lock_downgrade+0x6b0/0x6b0 ? sched_clock+0x5/0x10 ? lock_release+0x1100/0x1100 ? compat_start_thread+0x80/0x80 ? do_raw_spin_trylock+0x190/0x190 ? _raw_spin_unlock_irq+0x29/0x40 ? trace_hardirqs_on_caller+0x381/0x570 ? _raw_spin_unlock_irq+0x29/0x40 ? finish_task_switch+0x1e5/0x760 ? finish_task_switch+0x208/0x760 ? preempt_notifier_dec+0x20/0x20 ? __schedule+0x839/0x1ee0 ? kmem_cache_alloc_trace+0x143/0x320 ? firmware_map_remove+0x73/0x73 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0x18/0x170 ? find_held_lock+0x39/0x1c0 ? schedule+0xf3/0x3b0 ? lock_downgrade+0x6b0/0x6b0 ? __schedule+0x1ee0/0x1ee0 ? do_wait_intr_irq+0x340/0x340 ? do_raw_spin_trylock+0x190/0x190 ? _raw_spin_unlock_irqrestore+0x32/0x60 ? process_one_work+0x1c80/0x1c80 ? process_one_work+0x1c80/0x1c80 kthread+0x312/0x3d0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 Allocated by task 1688: kasan_kmalloc+0xa0/0xd0 __kmalloc+0x162/0x380 u32_change+0x1220/0x3c9e [cls_u32] tc_ctl_tfilter+0x1ba6/0x2f80 rtnetlink_rcv_msg+0x4f0/0x9d0 netlink_rcv_skb+0x124/0x320 netlink_unicast+0x430/0x600 netlink_sendmsg+0x8fa/0xd60 sock_sendmsg+0xb1/0xe0 ___sys_sendmsg+0x678/0x980 __sys_sendmsg+0xc4/0x210 do_syscall_64+0x232/0x7f0 return_from_SYSCALL_64+0x0/0x75 Freed by task 112: kasan_slab_free+0x71/0xc0 kfree+0x114/0x320 rcu_process_callbacks+0xc3f/0x1600 __do_softirq+0x2bf/0xc06 The buggy address belongs to the object at ffff881b83dae600 which belongs to the cache kmalloc-4096 of size 4096 The buggy address is located 24 bytes inside of 4096-byte region [ffff881b83dae600, ffff881b83daf600) The buggy address belongs to the page: page:ffffea006e0f6a00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 flags: 0x17ffffc0008100(slab|head) raw: 0017ffffc0008100 0000000000000000 0000000000000000 0000000100070007 raw: dead000000000100 dead000000000200 ffff880187c0e600 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff881b83dae500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff881b83dae580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff881b83dae600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff881b83dae680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff881b83dae700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The problem is that the htnode is freed before the linked knodes and the latter will try to access the first at u32_destroy_key() time. This change addresses the issue using the htnode refcnt to guarantee the correct free order. While at it also add a RCU annotation, to keep sparse happy. v1 -> v2: use rtnl_derefence() instead of RCU read locks v2 -> v3: - don't check refcnt in u32_destroy_hnode() - cleaned-up u32_destroy() implementation - cleaned-up code comment v3 -> v4: - dropped unneeded comment Reported-by: Li Shuang <shuali@redhat.com> Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-09net/sched: cls_u32: fix cls_u32 on filter replaceIvan Vecera1-1/+2
[ Upstream commit eb53f7af6f15285e2f6ada97285395343ce9f433 ] The following sequence is currently broken: # tc qdisc add dev foo ingress # tc filter replace dev foo protocol all ingress \ u32 match u8 0 0 action mirred egress mirror dev bar1 # tc filter replace dev foo protocol all ingress \ handle 800::800 pref 49152 \ u32 match u8 0 0 action mirred egress mirror dev bar2 Error: cls_u32: Key node flags do not match passed flags. We have an error talking to the kernel, -1 The error comes from u32_change() when comparing new and existing flags. The existing ones always contains one of TCA_CLS_FLAGS_{,NOT}_IN_HW flag depending on offloading state. These flags cannot be passed from userspace so the condition (n->flags != flags) in u32_change() always fails. Fix the condition so the flags TCA_CLS_FLAGS_NOT_IN_HW and TCA_CLS_FLAGS_IN_HW are not taken into account. Fixes: 24d3dc6d27ea ("net/sched: cls_u32: Reflect HW offload status") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-09cls_u32: use tcf_exts_get_net() before call_rcu()Cong Wang1-1/+7
Hold netns refcnt before call_rcu() and release it after the tcf_exts_destroy() is done. Note, on ->destroy() path we have to respect the return value of tcf_exts_get_net(), on other paths it should always return true, so we don't need to care. Cc: Lucas Bates <lucasb@mojatatu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29net_sched: use tcf_queue_work() in u32 filterCong Wang1-3/+26
Defer the tcf_exts_destroy() in RCU callback to tc filter workqueue and get RTNL lock. Reported-by: Chris Mi <chrism@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31net_sched: add reverse binding for tc classCong Wang1-0/+9
TC filters when used as classifiers are bound to TC classes. However, there is a hidden difference when adding them in different orders: 1. If we add tc classes before its filters, everything is fine. Logically, the classes exist before we specify their ID's in filters, it is easy to bind them together, just as in the current code base. 2. If we add tc filters before the tc classes they bind, we have to do dynamic lookup in fast path. What's worse, this happens all the time not just once, because on fast path tcf_result is passed on stack, there is no way to propagate back to the one in tc filters. This hidden difference hurts performance silently if we have many tc classes in hierarchy. This patch intends to close this gap by doing the reverse binding when we create a new class, in this case we can actually search all the filters in its parent, match and fixup by classid. And because tcf_result is specific to each type of tc filter, we have to introduce a new ops for each filter to tell how to bind the class. Note, we still can NOT totally get rid of those class lookup in ->enqueue() because cgroup and flow filters have no way to determine the classid at setup time, they still have to go through dynamic lookup. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-26net_sched: kill u32_node pointer in QdiscWANG Cong1-4/+53
It is ugly to hide a u32-filter-specific pointer inside Qdisc, this breaks the TC layers: 1. Qdisc is a generic representation, should not have any specific data of any type 2. Qdisc layer is above filter layer, should only save filters in the list of struct tcf_proto. This pointer is used as the head of the chain of u32 hash tables, that is struct tc_u_hnode, because u32 filter is very special, it allows to create multiple hash tables within one qdisc and across multiple u32 filters. Instead of using this ugly pointer, we can just save it in a global hash table key'ed by (dev ifindex, qdisc handle), therefore we can still treat it as a per qdisc basis data structure conceptually. Of course, because of network namespaces, this key is not unique at all, but it is fine as we already have a pointer to Qdisc in struct tc_u_common, we can just compare the pointers when collision. And this only affects slow paths, has no impact to fast path, thanks to the pointer ->tp_c. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net: sched: remove cops->tcf_cl_offloadJiri Pirko1-4/+4
cops->tcf_cl_offload is no longer needed, as the drivers check what they can and cannot offload using the classid identify helpers. So remove this. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08net_sched: use void pointer for filter handleWANG Cong1-15/+15
Now we use 'unsigned long fh' as a pointer in every place, it is safe to convert it to a void pointer now. This gets rid of many casts to pointer. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07net: sched: get rid of struct tc_to_netdevJiri Pirko1-46/+37
Get rid of struct tc_to_netdev which is now just unnecessary container and rather pass per-type structures down to drivers directly. Along with that, consolidate the naming of per-type structure variables in cls_*. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07net: sched: push cls related args into cls_common structureJiri Pirko1-12/+8
As ndo_setup_tc is generic offload op for whole tc subsystem, does not really make sense to have cls-specific args. So move them under cls_common structurure which is embedded in all cls structs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07net: sched: make type an argument for ndo_setup_tcJiri Pirko1-12/+8
Since the type is always present, push it to be a separate argument to ndo_setup_tc. On the way, name the type enum and use it for arg type. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04net: sched: cls_u32: no need to call tcf_exts_change for newly allocated structJiri Pirko1-14/+4
As the n struct was allocated right before u32_set_parms call, no need to use tcf_exts_change to do atomic change, and we can just fill-up the unused exts struct directly by tcf_exts_validate. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08net: propagate tc filter chain index down the ndo_setup_tc callJiri Pirko1-4/+8
We need to push the chain index down to the drivers, so they have the information to which chain the rule belongs. For now, no driver supports multichain offload, so only chain 0 is supported. This is needed to prevent chain squashes during offload for now. Later this will be used to implement multichain offload. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21net_sched: move the empty tp check from ->destroy() to ->delete()WANG Cong1-30/+41
We could have a race condition where in ->classify() path we dereference tp->root and meanwhile a parallel ->destroy() makes it a NULL. Daniel cured this bug in commit d936377414fa ("net, sched: respect rcu grace period on cls destruction"). This happens when ->destroy() is called for deleting a filter to check if we are the last one in tp, this tp is still linked and visible at that time. The root cause of this problem is the semantic of ->destroy(), it does two things (for non-force case): 1) check if tp is empty 2) if tp is empty we could really destroy it and its caller, if cares, needs to check its return value to see if it is really destroyed. Therefore we can't unlink tp unless we know it is empty. As suggested by Daniel, we could actually move the test logic to ->delete() so that we can safely unlink tp after ->delete() tells us the last one is just deleted and before ->destroy(). Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") Cc: Roi Dayan <roid@mellanox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13netlink: pass extended ACK struct to parsing functionsJohannes Berg1-1/+1
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17net/sched: cls_u32: Reflect HW offload statusOr Gerlitz1-0/+10
U32 support for the "in hw" offloading flags. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09cls_u32: don't bother explicitly initializing ->divisor to zeroAlexandru Moise1-1/+0
This struct member is already initialized to zero upon root_ht's allocation via kzalloc(). Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20net sched: stylistic cleanupsJamal Hadi Salim1-18/+12
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23net_sched: properly handle failure case of tcf_exts_init()WANG Cong1-6/+15
After commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array") we do dynamic allocation in tcf_exts_init(), therefore we need to handle the ENOMEM case properly. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09net: cls_u32: be more strict about skip-sw flag for knodesJakub Kicinski1-18/+19
Return an error if user requested skip-sw and the underlaying hardware cannot handle tc offloads (or offloads are disabled). This patch fixes the knode handling. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09net: cls_u32: catch all hardware offload errorsJakub Kicinski1-1/+7
Errors reported by u32_replace_hw_hnode() were not propagated. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: sched: fix tc_should_offload for specific clsact classesDaniel Borkmann1-4/+4
When offloading classifiers such as u32 or flower to hardware, and the qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes, since not all of them handle ingress, therefore we must leave those in software path. Add a .tcf_cl_offload() callback, so we can generically handle them, tested on ixgbe. Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support") Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support") Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: cls_u32: be more strict about skip-sw flagJakub Kicinski1-10/+11
Return an error if user requested skip-sw and the underlaying hardware cannot handle tc offloads (or offloads are disabled). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: cls_u32: fix error code for invalid flagsJakub Kicinski1-1/+1
'err' variable is not set in this test, we would return whatever previous test set 'err' to. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16net: cls_u32: Add support for skip-sw flag to tc u32 classifier.Samudrala, Sridhar1-9/+36
On devices that support TC U32 offloads, this flag enables a filter to be added only to HW. skip-sw and skip-hw are mutually exclusive flags. By default without any flags, the filter is added to both HW and SW, but no error checks are done in case of failure to add to HW. With skip-sw, failure to add to HW is treated as an error. Here is a sample script that adds 2 filters, one with skip-sw and the other with skip-hw flag. # add ingress qdisc tc qdisc add dev p4p1 ingress # enable hw tc offload. ethtool -K p4p1 hw-tc-offload on # add u32 filter with skip-sw flag. tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:1 u32 ht 800: flowid 800:1 \ skip-sw \ match ip src 192.168.1.0/24 \ action drop # add u32 filter with skip-hw flag. tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:2 u32 ht 800: flowid 800:2 \ skip-hw \ match ip src 192.168.2.0/24 \ action drop Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26sched: align nlattr properly when neededNicolas Dichtel1-3/+4
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02net: sched: cls_u32 add bit to specify software only rulesJohn Fastabend1-10/+27
In the initial implementation the only way to stop a rule from being inserted into the hardware table was via the device feature flag. However this doesn't work well when working on an end host system where packets are expect to hit both the hardware and software datapaths. For example we can imagine a rule that will match an IP address and increment a field. If we install this rule in both hardware and software we may increment the field twice. To date we have only added support for the drop action so we have been able to ignore these cases. But as we extend the action support we will hit this example plus more such cases. Arguably these are not even corner cases in many working systems these cases will be common. To avoid forcing the driver to always abort (i.e. the above example) this patch adds a flag to add a rule in software only. A careful user can use this flag to build software and hardware datapaths that work together. One example we have found particularly useful is to use hardware resources to set the skb->mark on the skb when the match may be expensive to run in software but a mark lookup in a hash table is cheap. The idea here is hardware can do in one lookup what the u32 classifier may need to traverse multiple lists and hash tables to compute. The flag is only passed down on inserts. On deletion to avoid stale references in hardware we always try to remove a rule if it exists. The flags field is part of the classifier specific options. Although it is tempting to lift this into the generic structure doing this proves difficult do to how the tc netlink attributes are implemented along with how the dump/change routines are called. There is also precedence for putting seemingly generic pieces in the specific classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So although not ideal I've left FLAGS in the u32 options as well as it simplifies the code greatly and user space has already learned how to manage these bits ala 'tc' tool. Another thing if trying to update a rule we require the flags to be unchanged. This is to force user space, software u32 and the hardware u32 to keep in sync. Thanks to Simon Horman for catching this case. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02net: sched: consolidate offload decision in cls_u32John Fastabend1-4/+4
The offload decision was originally very basic and tied to if the dev implemented the appropriate ndo op hook. The next step is to allow the user to more flexibly define if any paticular rule should be offloaded or not. In order to have this logic in one function lift the current check into a helper routine tc_should_offload(). Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17net: sched: add cls_u32 offload hooks for netdevsJohn Fastabend1-2/+97
This patch allows netdev drivers to consume cls_u32 offloads via the ndo_setup_tc ndo op. This works aligns with how network drivers have been doing qdisc offloads for mqprio. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-26cls_u32: complete the check for non-forced case in u32_destroy()WANG Cong1-0/+13
In commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") I added a check in u32_destroy() to see if all real filters are gone for each tp, however, that is only done for root_ht, same is needed for others. This can be reproduced by the following tc commands: tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256 tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32 ht 15:2: match ip src 10.0.0.2 flowid 1:10 tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32 ht 15:2: match ip src 10.0.0.3 flowid 1:10 Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") Reported-by: Akshat Kakkar <akshat.1984@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+4
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easily to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10net_sched: fix struct tc_u_hnode layout in u32WANG Cong1-1/+4
We dynamically allocate divisor+1 entries for ->ht[] in tc_u_hnode: ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL); So ->ht is supposed to be the last field of this struct, however this is broken, since an rcu head is appended after it. Fixes: 1ce87720d456 ("net: sched: make cls_u32 lockless") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09net_sched: destroy proto tp when all filters are goneCong Wang1-1/+24
Kernel automatically creates a tp for each (kind, protocol, priority) tuple, which has handle 0, when we add a new filter, but it still is left there after we remove our own, unless we don't specify the handle (literally means all the filters under the tuple). For example this one is left: # tc filter show dev eth0 filter parent 8001: protocol arp pref 49152 basic The user-space is hard to clean up these for kernel because filters like u32 are organized in a complex way. So kernel is responsible to remove it after all filters are gone. Each type of filter has its own way to store the filters, so each type has to provide its way to check if all filters are gone. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09net: sched: cls: remove unused op put from tcf_proto_opsJiri Pirko1-5/+0
It is never called and implementations are void. So just remove it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02net_sched: avoid calling tcf_unbind_filter() in call_rcu callbackWANG Cong1-4/+6
This fixes the following crash: [ 63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648 [ 63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000 [ 63.980094] RIP: 0010:[<ffffffff817e6d07>] [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d [ 63.980094] RSP: 0018:ffff880117dffcc0 EFLAGS: 00010202 [ 63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000 [ 63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b [ 63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000 [ 63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001 [ 63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30 [ 63.980094] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000 [ 63.980094] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0 [ 63.980094] Stack: [ 63.980094] ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68 [ 63.980094] ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8 [ 63.980094] ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a [ 63.980094] Call Trace: [ 63.980094] [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d [ 63.980094] [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691 [ 63.980094] [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691 [ 63.980094] [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d [ 63.980094] [<ffffffff810780a4>] __do_softirq+0x142/0x323 [ 63.980094] [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53 [ 63.980094] [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221 [ 63.980094] [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33 [ 63.980094] [<ffffffff8108e44d>] kthread+0xc9/0xd1 [ 63.980094] [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125 [ 63.980094] [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61 [ 63.980094] [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0 [ 63.980094] [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61 tp could be freed in call_rcu callback too, the order is not guaranteed. John Fastabend says: ==================== Its worth noting why this is safe. Any running schedulers will either read the valid class field or it will be zeroed. All schedulers today when the class is 0 do a lookup using the same call used by the tcf_exts_bind(). So even if we have a running classifier hit the null class pointer it will do a lookup and get to the same result. This is particularly fragile at the moment because the only way to verify this is to audit the schedulers call sites. ==================== Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net_sched: remove the first parameter from tcf_exts_destroy()WANG Cong1-2/+2
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <hadi@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23net: sched: fix compile warning in cls_u32Eric Dumazet1-1/+1
$ grep CONFIG_CLS_U32_MARK .config # CONFIG_CLS_U32_MARK is not set net/sched/cls_u32.c: In function 'u32_change': net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used [-Wunused-label] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22net: sched: cls_u32 changes to knode must appear atomic to readersJohn Fastabend1-9/+126
Changes to the cls_u32 classifier must appear atomic to the readers. Before this patch if a change is requested for both the exts and ifindex, first the ifindex is updated then the exts with tcf_exts_change(). This opens a small window where a reader can have a exts chain with an incorrect ifindex. This violates the the RCU semantics. Here we resolve this by always passing u32_set_parms() a copy of the tc_u_knode to work on and then inserting it into the hash table after the updates have been successfully applied. Tested with the following short script: #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \ u32 divisor 256 #tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \ u32 link 1: hashkey mask ffffff00 at 12 \ match ip src 192.168.8.0/2 #tc filter add dev p3p2 parent 8001:0 protocol ip prio 102 \ handle 1::10 u32 classid 1:2 ht 1: \ match ip src 192.168.8.0/8 match ip tos 0x0a 1e #tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \ handle 1::10 u32 classid 1:2 ht 1: \ match ip src 1.1.0.0/8 match ip tos 0x0b 1e CC: Eric Dumazet <edumazet@google.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22net: cls_u32: fix missed pcpu_success free_percpuJohn Fastabend1-0/+13
This fixes a missed free_percpu in the unwind code path and when keys are destroyed. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-20net: sched: cls_u32: rcu can not be last nodeJohn Fastabend1-1/+4
tc_u32_sel 'sel' in tc_u_knode expects to be the last element in the structure and pads the structure with tc_u32_key fields for each key. kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL) CC: Eric Dumazet <edumazet@google.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-16net: sched: cls_u32 add missing rcu_assign_pointer and annotationJohn Fastabend1-3/+4
Add missing rcu_assign_pointer and missing annotation for ht_up in cls_u32.c Caught by kbuild bot, >> net/sched/cls_u32.c:378:36: sparse: incorrect type in initializer (different address spaces) net/sched/cls_u32.c:378:36: expected struct tc_u_hnode *ht net/sched/cls_u32.c:378:36: got struct tc_u_hnode [noderef] <asn:4>*ht_up >> net/sched/cls_u32.c:610:54: sparse: incorrect type in argument 4 (different address spaces) net/sched/cls_u32.c:610:54: expected struct tc_u_hnode *ht net/sched/cls_u32.c:610:54: got struct tc_u_hnode [noderef] <asn:4>*ht_up >> net/sched/cls_u32.c:684:18: sparse: incorrect type in assignment (different address spaces) net/sched/cls_u32.c:684:18: expected struct tc_u_hnode [noderef] <asn:4>*ht_up net/sched/cls_u32.c:684:18: got struct tc_u_hnode *[assigned] ht >> net/sched/cls_u32.c:359:18: sparse: dereference of noderef expression Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-16net: sched: fix unsued cpu variableJohn Fastabend1-3/+4
kbuild test robot reported an unused variable cpu in cls_u32.c after the patch below. This happens when PERF and MARK config variables are disabled Fix this is to use separate variables for perf and mark and define the cpu variable inside the ifdef logic. Fixes: 459d5f626da7 ("net: sched: make cls_u32 per cpu")' Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13net: sched: make cls_u32 locklessJohn Fastabend1-73/+110
Make cls_u32 classifier safe to run without holding lock. This patch converts statistics that are kept in read section u32_classify into per cpu counters. This patch was tested with a tight u32 filter add/delete loop while generating traffic with pktgen. By running pktgen on vlan devices created on top of a physical device we can hit the qdisc layer correctly. For ingress qdisc's a loopback cable was used. for i in {1..100}; do q=`echo $i%8|bc`; echo -n "u32 tos: iteration $i on queue $q"; tc filter add dev p3p2 parent $p prio $i u32 match ip tos 0x10 0xff \ action skbedit queue_mapping $q; sleep 1; tc filter del dev p3p2 prio $i; echo -n "u32 tos hash table: iteration $i on queue $q"; tc filter add dev p3p2 parent $p protocol ip prio $i handle 628: u32 divisor 1 tc filter add dev p3p2 parent $p protocol ip prio $i u32 \ match ip protocol 17 0xff link 628: offset at 0 mask 0xf00 shift 6 plus 0 tc filter add dev p3p2 parent $p protocol ip prio $i u32 \ ht 628:0 match ip tos 0x10 0xff action skbedit queue_mapping $q sleep 2; tc filter del dev p3p2 prio $i sleep 1; done Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13net: sched: make cls_u32 per cpuJohn Fastabend1-16/+59
This uses per cpu counters in cls_u32 in preparation to convert over to rcu. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-21net_sched: avoid generating same handle for u32 filtersCong Wang1-5/+14
When kernel generates a handle for a u32 filter, it tries to start from the max in the bucket. So when we have a filter with the max (fff) handle, it will cause kernel always generates the same handle for new filters. This can be shown by the following command: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip pref 770 handle 800::fff u32 match ip protocol 1 0xff tc filter add dev eth0 parent ffff: protocol ip pref 770 u32 match ip protocol 1 0xff ... we will get some u32 filters with same handle: # tc filter show dev eth0 parent ffff: filter protocol ip pref 770 u32 filter protocol ip pref 770 u32 fh 800: ht divisor 1 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 handles should be unique. This patch fixes it by looking up a bitmap, so that can guarantee the handle is as unique as possible. For compatibility, we still start from 0x800. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-28sched, cls: check if we could overwrite actions when changing a filterCong Wang1-5/+5
When actions are attached to a filter, they are a part of the filter itself, so when changing a filter we should allow to overwrite the actions inside as well. In my specific case, when I tried to _append_ a new action to an existing filter which already has an action, I got EEXIST since kernel refused to overwrite the existing one in kernel. This patch checks if we are changing the filter checking NLM_F_CREATE flag (Sigh, filters don't use NLM_F_REPLACE...) and then passes the boolean down to actions. This fixes the problem above. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>