summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)AuthorFilesLines
2026-03-04net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash()Ruitong Liu1-1/+5
[ Upstream commit be054cc66f739a9ba615dba9012a07fab8e7dd6f ] Commit 38a6f0865796 ("net: sched: support hash selecting tx queue") added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is computed as: mapping_mod = queue_mapping_max - queue_mapping + 1; The range size can be 65536 when the requested range covers all possible u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX). That value cannot be represented in a u16 and previously wrapped to 0, so tcf_skbedit_hash() could trigger a divide-by-zero: queue_mapping += skb_get_hash(skb) % params->mapping_mod; Compute mapping_mod in a wider type and reject ranges larger than U16_MAX to prevent params->mapping_mod from becoming 0 and avoid the crash. Fixes: 38a6f0865796 ("net: sched: support hash selecting tx queue") Cc: stable@vger.kernel.org # 6.12+ Signed-off-by: Ruitong Liu <cnitlrt@gmail.com> Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04bpf: net_sched: Use the correct destructor kfunc typeSami Tolvanen1-1/+7
[ Upstream commit c99d97b46631c4bea0c14b7581b7a59214601e63 ] With CONFIG_CFI enabled, the kernel strictly enforces that indirect function calls use a function pointer type that matches the target function. As bpf_kfree_skb() signature differs from the btf_dtor_kfunc_t pointer type used for the destructor calls in bpf_obj_free_fields(), add a stub function with the correct type to fix the type mismatch. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260110082548.113748-8-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11net/sched: cls_u32: use skb_header_pointer_careful()Eric Dumazet1-7/+6
[ Upstream commit cabd1a976375780dabab888784e356f574bbaed8 ] skb_header_pointer() does not fully validate negative @offset values. Use skb_header_pointer_careful() instead. GangMin Kim provided a report and a repro fooling u32_classify(): BUG: KASAN: slab-out-of-bounds in u32_classify+0x1180/0x11b0 net/sched/cls_u32.c:221 Fixes: fbc2e7d9cf49 ("cls_u32: use skb_header_pointer() to dereference data safely") Reported-by: GangMin Kim <km.kim1503@gmail.com> Closes: https://lore.kernel.org/netdev/CANn89iJkyUZ=mAzLzC4GdcAgLuPnUoivdLaOs6B9rq5_erj76w@mail.gmail.com/T/ Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260128141539.3404400-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-06net/sched: act_ife: convert comma to semicolonChen Ni1-3/+3
commit 205305c028ad986d0649b8b100bab6032dcd1bb5 upstream. Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251112072709.73755-1-nichen@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30net/sched: act_ife: avoid possible NULL derefEric Dumazet1-2/+4
[ Upstream commit 27880b0b0d35ad1c98863d09788254e36f874968 ] tcf_ife_encode() must make sure ife_encode() does not return NULL. syzbot reported: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:ife_tlv_meta_encode+0x41/0xa0 net/ife/ife.c:166 CPU: 3 UID: 0 PID: 8990 Comm: syz.0.696 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> ife_encode_meta_u32+0x153/0x180 net/sched/act_ife.c:101 tcf_ife_encode net/sched/act_ife.c:841 [inline] tcf_ife_act+0x1022/0x1de0 net/sched/act_ife.c:877 tc_act include/net/tc_wrapper.h:130 [inline] tcf_action_exec+0x1c0/0xa20 net/sched/act_api.c:1152 tcf_exts_exec include/net/pkt_cls.h:349 [inline] mall_classify+0x1a0/0x2a0 net/sched/cls_matchall.c:42 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x7f2/0x1380 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xe0/0x510 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x45/0x250 net/core/dev.c:4147 __dev_xmit_skb net/core/dev.c:4262 [inline] __dev_queue_xmit+0x2998/0x46c0 net/core/dev.c:4798 Fixes: 295a6e06d21e ("net/sched: act_ife: Change to use ife module") Reported-by: syzbot+5cf914f193dffde3bd3c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6970d61d.050a0220.706b.0010.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yotam Gigi <yotam.gi@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260121133724.3400020-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30net/sched: qfq: Use cl_is_active to determine whether class is active in ↵Jamal Hadi Salim1-1/+1
qfq_rm_from_ag [ Upstream commit d837fbee92453fbb829f950c8e7cf76207d73f33 ] This is more of a preventive patch to make the code more consistent and to prevent possible exploits that employ child qlen manipulations on qfq. use cl_is_active instead of relying on the child qdisc's qlen to determine class activation. Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30net/sched: Enforce that teql can only be used as root qdiscJamal Hadi Salim1-0/+5
[ Upstream commit 50da4b9d07a7a463e2cfb738f3ad4cff6b2c9c3b ] Design intent of teql is that it is only supposed to be used as root qdisc. We need to check for that constraint. Although not important, I will describe the scenario that unearthed this issue for the curious. GangMin Kim <km.kim1503@gmail.com> managed to concot a scenario as follows: ROOT qdisc 1:0 (QFQ) ├── class 1:1 (weight=15, lmax=16384) netem with delay 6.4s └── class 1:2 (weight=1, lmax=1514) teql GangMin sends a packet which is enqueued to 1:1 (netem). Any invocation of dequeue by QFQ from this class will not return a packet until after 6.4s. In the meantime, a second packet is sent and it lands on 1:2. teql's enqueue will return success and this will activate class 1:2. Main issue is that teql only updates the parent visible qlen (sch->q.qlen) at dequeue. Since QFQ will only call dequeue if peek succeeds (and teql's peek always returns NULL), dequeue will never be called and thus the qlen will remain as 0. With that in mind, when GangMin updates 1:2's lmax value, the qfq_change_class calls qfq_deact_rm_from_agg. Since the child qdisc's qlen was not incremented, qfq fails to deactivate the class, but still frees its pointers from the aggregate. So when the first packet is rescheduled after 6.4 seconds (netem's delay), a dangling pointer is accessed causing GangMin's causing a UAF. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: GangMin Kim <km.kim1503@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-23net/sched: sch_qfq: do not free existing class in qfq_change_class()Eric Dumazet1-2/+4
[ Upstream commit 3879cffd9d07aa0377c4b8835c4f64b4fb24ac78 ] Fixes qfq_change_class() error case. cl->qdisc and cl should only be freed if a new class and qdisc were allocated, or we risk various UAF. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: syzbot+07f3f38f723c335f106d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6965351d.050a0220.eaf7.00c5.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260112175656.17605-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-17net/sched: sch_qfq: Fix NULL deref when deactivating inactive aggregate in ↵Xiang Mei1-1/+1
qfq_reset [ Upstream commit c1d73b1480235731e35c81df70b08f4714a7d095 ] `qfq_class->leaf_qdisc->q.qlen > 0` does not imply that the class itself is active. Two qfq_class objects may point to the same leaf_qdisc. This happens when: 1. one QFQ qdisc is attached to the dev as the root qdisc, and 2. another QFQ qdisc is temporarily referenced (e.g., via qdisc_get() / qdisc_put()) and is pending to be destroyed, as in function tc_new_tfilter. When packets are enqueued through the root QFQ qdisc, the shared leaf_qdisc->q.qlen increases. At the same time, the second QFQ qdisc triggers qdisc_put and qdisc_destroy: the qdisc enters qfq_reset() with its own q->q.qlen == 0, but its class's leaf qdisc->q.qlen > 0. Therefore, the qfq_reset would wrongly deactivate an inactive aggregate and trigger a null-deref in qfq_deactivate_agg: [ 0.903172] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 0.903571] #PF: supervisor write access in kernel mode [ 0.903860] #PF: error_code(0x0002) - not-present page [ 0.904177] PGD 10299b067 P4D 10299b067 PUD 10299c067 PMD 0 [ 0.904502] Oops: Oops: 0002 [#1] SMP NOPTI [ 0.904737] CPU: 0 UID: 0 PID: 135 Comm: exploit Not tainted 6.19.0-rc3+ #2 NONE [ 0.905157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 0.905754] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:992 (discriminator 2) include/linux/list.h:1006 (discriminator 2) net/sched/sch_qfq.c:1367 (discriminator 2) net/sched/sch_qfq.c:1393 (discriminator 2)) [ 0.906046] Code: 0f 84 4d 01 00 00 48 89 70 18 8b 4b 10 48 c7 c2 ff ff ff ff 48 8b 78 08 48 d3 e2 48 21 f2 48 2b 13 48 8b 30 48 d3 ea 8b 4b 18 0 Code starting with the faulting instruction =========================================== 0: 0f 84 4d 01 00 00 je 0x153 6: 48 89 70 18 mov %rsi,0x18(%rax) a: 8b 4b 10 mov 0x10(%rbx),%ecx d: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx 14: 48 8b 78 08 mov 0x8(%rax),%rdi 18: 48 d3 e2 shl %cl,%rdx 1b: 48 21 f2 and %rsi,%rdx 1e: 48 2b 13 sub (%rbx),%rdx 21: 48 8b 30 mov (%rax),%rsi 24: 48 d3 ea shr %cl,%rdx 27: 8b 4b 18 mov 0x18(%rbx),%ecx ... [ 0.907095] RSP: 0018:ffffc900004a39a0 EFLAGS: 00010246 [ 0.907368] RAX: ffff8881043a0880 RBX: ffff888102953340 RCX: 0000000000000000 [ 0.907723] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 0.908100] RBP: ffff888102952180 R08: 0000000000000000 R09: 0000000000000000 [ 0.908451] R10: ffff8881043a0000 R11: 0000000000000000 R12: ffff888102952000 [ 0.908804] R13: ffff888102952180 R14: ffff8881043a0ad8 R15: ffff8881043a0880 [ 0.909179] FS: 000000002a1a0380(0000) GS:ffff888196d8d000(0000) knlGS:0000000000000000 [ 0.909572] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.909857] CR2: 0000000000000000 CR3: 0000000102993002 CR4: 0000000000772ef0 [ 0.910247] PKRU: 55555554 [ 0.910391] Call Trace: [ 0.910527] <TASK> [ 0.910638] qfq_reset_qdisc (net/sched/sch_qfq.c:357 net/sched/sch_qfq.c:1485) [ 0.910826] qdisc_reset (include/linux/skbuff.h:2195 include/linux/skbuff.h:2501 include/linux/skbuff.h:3424 include/linux/skbuff.h:3430 net/sched/sch_generic.c:1036) [ 0.911040] __qdisc_destroy (net/sched/sch_generic.c:1076) [ 0.911236] tc_new_tfilter (net/sched/cls_api.c:2447) [ 0.911447] rtnetlink_rcv_msg (net/core/rtnetlink.c:6958) [ 0.911663] ? __pfx_rtnetlink_rcv_msg (net/core/rtnetlink.c:6861) [ 0.911894] netlink_rcv_skb (net/netlink/af_netlink.c:2550) [ 0.912100] netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) [ 0.912296] ? __alloc_skb (net/core/skbuff.c:706) [ 0.912484] netlink_sendmsg (net/netlink/af_netlink.c:1894) [ 0.912682] sock_write_iter (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1) net/socket.c:1195 (discriminator 1)) [ 0.912880] vfs_write (fs/read_write.c:593 fs/read_write.c:686) [ 0.913077] ksys_write (fs/read_write.c:738) [ 0.913252] do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) [ 0.913438] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) [ 0.913687] RIP: 0033:0x424c34 [ 0.913844] Code: 89 02 48 c7 c0 ff ff ff ff eb bd 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d 2d 44 09 00 00 74 13 b8 01 00 00 00 0f 05 9 Code starting with the faulting instruction =========================================== 0: 89 02 mov %eax,(%rdx) 2: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax 9: eb bd jmp 0xffffffffffffffc8 b: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) 12: 00 00 00 15: 90 nop 16: f3 0f 1e fa endbr64 1a: 80 3d 2d 44 09 00 00 cmpb $0x0,0x9442d(%rip) # 0x9444e 21: 74 13 je 0x36 23: b8 01 00 00 00 mov $0x1,%eax 28: 0f 05 syscall 2a: 09 .byte 0x9 [ 0.914807] RSP: 002b:00007ffea1938b78 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 0.915197] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000424c34 [ 0.915556] RDX: 000000000000003c RSI: 000000002af378c0 RDI: 0000000000000003 [ 0.915912] RBP: 00007ffea1938bc0 R08: 00000000004b8820 R09: 0000000000000000 [ 0.916297] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffea1938d28 [ 0.916652] R13: 00007ffea1938d38 R14: 00000000004b3828 R15: 0000000000000001 [ 0.917039] </TASK> [ 0.917158] Modules linked in: [ 0.917316] CR2: 0000000000000000 [ 0.917484] ---[ end trace 0000000000000000 ]--- [ 0.917717] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:992 (discriminator 2) include/linux/list.h:1006 (discriminator 2) net/sched/sch_qfq.c:1367 (discriminator 2) net/sched/sch_qfq.c:1393 (discriminator 2)) [ 0.917978] Code: 0f 84 4d 01 00 00 48 89 70 18 8b 4b 10 48 c7 c2 ff ff ff ff 48 8b 78 08 48 d3 e2 48 21 f2 48 2b 13 48 8b 30 48 d3 ea 8b 4b 18 0 Code starting with the faulting instruction =========================================== 0: 0f 84 4d 01 00 00 je 0x153 6: 48 89 70 18 mov %rsi,0x18(%rax) a: 8b 4b 10 mov 0x10(%rbx),%ecx d: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx 14: 48 8b 78 08 mov 0x8(%rax),%rdi 18: 48 d3 e2 shl %cl,%rdx 1b: 48 21 f2 and %rsi,%rdx 1e: 48 2b 13 sub (%rbx),%rdx 21: 48 8b 30 mov (%rax),%rsi 24: 48 d3 ea shr %cl,%rdx 27: 8b 4b 18 mov 0x18(%rbx),%ecx ... [ 0.918902] RSP: 0018:ffffc900004a39a0 EFLAGS: 00010246 [ 0.919198] RAX: ffff8881043a0880 RBX: ffff888102953340 RCX: 0000000000000000 [ 0.919559] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 0.919908] RBP: ffff888102952180 R08: 0000000000000000 R09: 0000000000000000 [ 0.920289] R10: ffff8881043a0000 R11: 0000000000000000 R12: ffff888102952000 [ 0.920648] R13: ffff888102952180 R14: ffff8881043a0ad8 R15: ffff8881043a0880 [ 0.921014] FS: 000000002a1a0380(0000) GS:ffff888196d8d000(0000) knlGS:0000000000000000 [ 0.921424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.921710] CR2: 0000000000000000 CR3: 0000000102993002 CR4: 0000000000772ef0 [ 0.922097] PKRU: 55555554 [ 0.922240] Kernel panic - not syncing: Fatal exception [ 0.922590] Kernel Offset: disabled Fixes: 0545a3037773 ("pkt_sched: QFQ - quick fair queue scheduler") Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260106034100.1780779-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-17net/sched: act_api: avoid dereferencing ERR_PTR in tcf_idrinfo_destroyShivani Gupta1-0/+2
[ Upstream commit adb25a46dc0a43173f5ea5f5f58fc8ba28970c7c ] syzbot reported a crash in tc_act_in_hw() during netns teardown where tcf_idrinfo_destroy() passed an ERR_PTR(-EBUSY) value as a tc_action pointer, leading to an invalid dereference. Guard against ERR_PTR entries when iterating the action IDR so teardown does not call tc_act_in_hw() on an error pointer. Fixes: 84a7d6797e6a ("net/sched: acp_api: no longer acquire RTNL in tc_action_net_exit()") Link: https://syzkaller.appspot.com/bug?extid=8f1c492ffa4644ff3826 Reported-by: syzbot+8f1c492ffa4644ff3826@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8f1c492ffa4644ff3826 Signed-off-by: Shivani Gupta <shivani07g@gmail.com> Link: https://patch.msgid.link/20260105005905.243423-1-shivani07g@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02net/sched: ets: Remove drr class from the active list if it changes to strictVictor Nogueira1-0/+4
[ Upstream commit b1e125ae425aba9b45252e933ca8df52a843ec70 ] Whenever a user issues an ets qdisc change command, transforming a drr class into a strict one, the ets code isn't checking whether that class was in the active list and removing it. This means that, if a user changes a strict class (which was in the active list) back to a drr one, that class will be added twice to the active list [1]. Doing so with the following commands: tc qdisc add dev lo root handle 1: ets bands 2 strict 1 tc qdisc add dev lo parent 1:2 handle 20: \ tbf rate 8bit burst 100b latency 1s tc filter add dev lo parent 1: basic classid 1:2 ping -c1 -W0.01 -s 56 127.0.0.1 tc qdisc change dev lo root handle 1: ets bands 2 strict 2 tc qdisc change dev lo root handle 1: ets bands 2 strict 1 ping -c1 -W0.01 -s 56 127.0.0.1 Will trigger the following splat with list debug turned on: [ 59.279014][ T365] ------------[ cut here ]------------ [ 59.279452][ T365] list_add double add: new=ffff88801d60e350, prev=ffff88801d60e350, next=ffff88801d60e2c0. [ 59.280153][ T365] WARNING: CPU: 3 PID: 365 at lib/list_debug.c:35 __list_add_valid_or_report+0x17f/0x220 [ 59.280860][ T365] Modules linked in: [ 59.281165][ T365] CPU: 3 UID: 0 PID: 365 Comm: tc Not tainted 6.18.0-rc7-00105-g7e9f13163c13-dirty #239 PREEMPT(voluntary) [ 59.281977][ T365] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 59.282391][ T365] RIP: 0010:__list_add_valid_or_report+0x17f/0x220 [ 59.282842][ T365] Code: 89 c6 e8 d4 b7 0d ff 90 0f 0b 90 90 31 c0 e9 31 ff ff ff 90 48 c7 c7 e0 a0 22 9f 48 89 f2 48 89 c1 4c 89 c6 e8 b2 b7 0d ff 90 <0f> 0b 90 90 31 c0 e9 0f ff ff ff 48 89 f7 48 89 44 24 10 4c 89 44 ... [ 59.288812][ T365] Call Trace: [ 59.289056][ T365] <TASK> [ 59.289224][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.289546][ T365] ets_qdisc_change+0xd2b/0x1e80 [ 59.289891][ T365] ? __lock_acquire+0x7e7/0x1be0 [ 59.290223][ T365] ? __pfx_ets_qdisc_change+0x10/0x10 [ 59.290546][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.290898][ T365] ? __mutex_trylock_common+0xda/0x240 [ 59.291228][ T365] ? __pfx___mutex_trylock_common+0x10/0x10 [ 59.291655][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.291993][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.292313][ T365] ? trace_contention_end+0xc8/0x110 [ 59.292656][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.293022][ T365] ? srso_alias_return_thunk+0x5/0xfbef5 [ 59.293351][ T365] tc_modify_qdisc+0x63a/0x1cf0 Fix this by always checking and removing an ets class from the active list when changing it to strict. [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/sched/sch_ets.c?id=ce052b9402e461a9aded599f5b47e76bc727f7de#n663 Fixes: cd9b50adc6bb9 ("net/sched: ets: fix crash when flipping from 'strict' to 'quantum'") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20251208190125.1868423-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02net/sched: ets: Always remove class from active list before deleting in ↵Jamal Hadi Salim1-1/+1
ets_qdisc_change [ Upstream commit ce052b9402e461a9aded599f5b47e76bc727f7de ] zdi-disclosures@trendmicro.com says: The vulnerability is a race condition between `ets_qdisc_dequeue` and `ets_qdisc_change`. It leads to UAF on `struct Qdisc` object. Attacker requires the capability to create new user and network namespace in order to trigger the bug. See my additional commentary at the end of the analysis. Analysis: static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack) { ... // (1) this lock is preventing .change handler (`ets_qdisc_change`) //to race with .dequeue handler (`ets_qdisc_dequeue`) sch_tree_lock(sch); for (i = nbands; i < oldbands; i++) { if (i >= q->nstrict && q->classes[i].qdisc->q.qlen) list_del_init(&q->classes[i].alist); qdisc_purge_queue(q->classes[i].qdisc); } WRITE_ONCE(q->nbands, nbands); for (i = nstrict; i < q->nstrict; i++) { if (q->classes[i].qdisc->q.qlen) { // (2) the class is added to the q->active list_add_tail(&q->classes[i].alist, &q->active); q->classes[i].deficit = quanta[i]; } } WRITE_ONCE(q->nstrict, nstrict); memcpy(q->prio2band, priomap, sizeof(priomap)); for (i = 0; i < q->nbands; i++) WRITE_ONCE(q->classes[i].quantum, quanta[i]); for (i = oldbands; i < q->nbands; i++) { q->classes[i].qdisc = queues[i]; if (q->classes[i].qdisc != &noop_qdisc) qdisc_hash_add(q->classes[i].qdisc, true); } // (3) the qdisc is unlocked, now dequeue can be called in parallel // to the rest of .change handler sch_tree_unlock(sch); ets_offload_change(sch); for (i = q->nbands; i < oldbands; i++) { // (4) we're reducing the refcount for our class's qdisc and // freeing it qdisc_put(q->classes[i].qdisc); // (5) If we call .dequeue between (4) and (5), we will have // a strong UAF and we can control RIP q->classes[i].qdisc = NULL; WRITE_ONCE(q->classes[i].quantum, 0); q->classes[i].deficit = 0; gnet_stats_basic_sync_init(&q->classes[i].bstats); memset(&q->classes[i].qstats, 0, sizeof(q->classes[i].qstats)); } return 0; } Comment: This happens because some of the classes have their qdiscs assigned to NULL, but remain in the active list. This commit fixes this issue by always removing the class from the active list before deleting and freeing its associated qdisc Reproducer Steps (trimmed version of what was sent by zdi-disclosures@trendmicro.com) ``` DEV="${DEV:-lo}" ROOT_HANDLE="${ROOT_HANDLE:-1:}" BAND2_HANDLE="${BAND2_HANDLE:-20:}" # child under 1:2 PING_BYTES="${PING_BYTES:-48}" PING_COUNT="${PING_COUNT:-200000}" PING_DST="${PING_DST:-127.0.0.1}" SLOW_TBF_RATE="${SLOW_TBF_RATE:-8bit}" SLOW_TBF_BURST="${SLOW_TBF_BURST:-100b}" SLOW_TBF_LAT="${SLOW_TBF_LAT:-1s}" cleanup() { tc qdisc del dev "$DEV" root 2>/dev/null } trap cleanup EXIT ip link set "$DEV" up tc qdisc del dev "$DEV" root 2>/dev/null || true tc qdisc add dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc qdisc add dev "$DEV" parent 1:2 handle "$BAND2_HANDLE" \ tbf rate "$SLOW_TBF_RATE" burst "$SLOW_TBF_BURST" latency "$SLOW_TBF_LAT" tc filter add dev "$DEV" parent 1: protocol all prio 1 u32 match u32 0 0 flowid 1:2 tc -s qdisc ls dev $DEV ping -I "$DEV" -f -c "$PING_COUNT" -s "$PING_BYTES" -W 0.001 "$PING_DST" \ >/dev/null 2>&1 & tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 0 tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 2 strict 2 tc -s qdisc ls dev $DEV tc qdisc del dev "$DEV" parent 1:2 || true tc -s qdisc ls dev $DEV tc qdisc change dev "$DEV" root handle "$ROOT_HANDLE" ets bands 1 strict 1 ``` KASAN report ``` ================================================================== BUG: KASAN: slab-use-after-free in ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 Read of size 8 at addr ffff8880502fc018 by task ping/12308 > CPU: 0 UID: 0 PID: 12308 Comm: ping Not tainted 6.18.0-rc4-dirty #1 PREEMPT(full) Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> __dump_stack kernel/lib/dump_stack.c:94 dump_stack_lvl+0x100/0x190 kernel/lib/dump_stack.c:120 print_address_description kernel/mm/kasan/report.c:378 print_report+0x156/0x4c9 kernel/mm/kasan/report.c:482 kasan_report+0xdf/0x110 kernel/mm/kasan/report.c:595 ets_qdisc_dequeue+0x1071/0x11b0 kernel/net/sched/sch_ets.c:481 dequeue_skb kernel/net/sched/sch_generic.c:294 qdisc_restart kernel/net/sched/sch_generic.c:399 __qdisc_run+0x1c9/0x1b00 kernel/net/sched/sch_generic.c:417 __dev_xmit_skb kernel/net/core/dev.c:4221 __dev_queue_xmit+0x2848/0x4410 kernel/net/core/dev.c:4729 dev_queue_xmit kernel/./include/linux/netdevice.h:3365 [...] Allocated by task 17115: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 poison_kmalloc_redzone kernel/mm/kasan/common.c:400 __kasan_kmalloc+0xaa/0xb0 kernel/mm/kasan/common.c:417 kasan_kmalloc kernel/./include/linux/kasan.h:262 __do_kmalloc_node kernel/mm/slub.c:5642 __kmalloc_node_noprof+0x34e/0x990 kernel/mm/slub.c:5648 kmalloc_node_noprof kernel/./include/linux/slab.h:987 qdisc_alloc+0xb8/0xc30 kernel/net/sched/sch_generic.c:950 qdisc_create_dflt+0x93/0x490 kernel/net/sched/sch_generic.c:1012 ets_class_graft+0x4fd/0x800 kernel/net/sched/sch_ets.c:261 qdisc_graft+0x3e4/0x1780 kernel/net/sched/sch_api.c:1196 [...] Freed by task 9905: kasan_save_stack+0x30/0x50 kernel/mm/kasan/common.c:56 kasan_save_track+0x14/0x30 kernel/mm/kasan/common.c:77 __kasan_save_free_info+0x3b/0x70 kernel/mm/kasan/generic.c:587 kasan_save_free_info kernel/mm/kasan/kasan.h:406 poison_slab_object kernel/mm/kasan/common.c:252 __kasan_slab_free+0x5f/0x80 kernel/mm/kasan/common.c:284 kasan_slab_free kernel/./include/linux/kasan.h:234 slab_free_hook kernel/mm/slub.c:2539 slab_free kernel/mm/slub.c:6630 kfree+0x144/0x700 kernel/mm/slub.c:6837 rcu_do_batch kernel/kernel/rcu/tree.c:2605 rcu_core+0x7c0/0x1500 kernel/kernel/rcu/tree.c:2861 handle_softirqs+0x1ea/0x8a0 kernel/kernel/softirq.c:622 __do_softirq kernel/kernel/softirq.c:656 [...] Commentary: 1. Maher Azzouzi working with Trend Micro Zero Day Initiative was reported as the person who found the issue. I requested to get a proper email to add to the reported-by tag but got no response. For this reason i will credit the person i exchanged emails with i.e zdi-disclosures@trendmicro.com 2. Neither i nor Victor who did a much more thorough testing was able to reproduce a UAF with the PoC or other approaches we tried. We were both able to reproduce a null ptr deref. After exchange with zdi-disclosures@trendmicro.com they sent a small change to be made to the code to add an extra delay which was able to simulate the UAF. i.e, this: qdisc_put(q->classes[i].qdisc); mdelay(90); q->classes[i].qdisc = NULL; I was informed by Thomas Gleixner(tglx@linutronix.de) that adding delays was acceptable approach for demonstrating the bug, quote: "Adding such delays is common exploit validation practice" The equivalent delay could happen "by virt scheduling the vCPU out, SMIs, NMIs, PREEMPT_RT enabled kernel" 3. I asked the OP to test and report back but got no response and after a few days gave up and proceeded to submit this fix. Fixes: de6d25924c2a ("net/sched: sch_ets: don't peek at classes beyond 'nbands'") Reported-by: zdi-disclosures@trendmicro.com Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20251128151919.576920-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18net/sched: sch_cake: Fix incorrect qlen reduction in cake_dropXiang Mei1-26/+32
[ Upstream commit 9fefc78f7f02d71810776fdeb119a05a946a27cc ] In cake_drop(), qdisc_tree_reduce_backlog() is used to update the qlen and backlog of the qdisc hierarchy. Its caller, cake_enqueue(), assumes that the parent qdisc will enqueue the current packet. However, this assumption breaks when cake_enqueue() returns NET_XMIT_CN: the parent qdisc stops enqueuing current packet, leaving the tree qlen/backlog accounting inconsistent. This mismatch can lead to a NULL dereference (e.g., when the parent Qdisc is qfq_qdisc). This patch computes the qlen/backlog delta in a more robust way by observing the difference before and after the series of cake_drop() calls, and then compensates the qdisc tree accounting if cake_enqueue() returns NET_XMIT_CN. To ensure correct compensation when ACK thinning is enabled, a new variable is introduced to keep qlen unchanged. Fixes: 15de71d06a40 ("net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit") Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20251128001415.377823-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-27Merge tag 'linux-can-fixes-for-6.18-20251126' of ↵Jakub Kicinski1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-11-26 this is a pull request of 8 patches for net/main. Seungjin Bae provides a patch for the kvaser_usb driver to fix a potential infinite loop in the USB data stream command parser. Thomas Mühlbacher's patch for the sja1000 driver IRQ handler's max loop handling, that might lead to unhandled interrupts. 3 patches by me for the gs_usb driver fix handling of failed transmit URBs and add checking of the actual length of received URBs before accessing the data. The next patch is by me and is a port of Thomas Mühlbacher's patch (fix IRQ handler's max loop handling, that might lead to unhandled interrupts.) to the sun4i_can driver. Biju Das provides a patch for the rcar_canfd driver to fix the CAN-FD mode setting. The last patch is by Shaurya Rane for the em_canid filter to ensure that the complete CAN frame is present in the linear data buffer before accessing it. * tag 'linux-can-fixes-for-6.18-20251126' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: net/sched: em_canid: fix uninit-value in em_canid_match can: rcar_canfd: Fix CAN-FD mode as default can: sun4i_can: sun4i_can_interrupt(): fix max irq loop handling can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing data can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing header can: gs_usb: gs_usb_xmit_callback(): fix handling of failed transmitted URBs can: sja1000: fix max irq loop handling can: kvaser_usb: leaf: Fix potential infinite loop in command parsers ==================== Link: https://patch.msgid.link/20251126155713.217105-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26net/sched: em_canid: fix uninit-value in em_canid_matchShaurya Rane1-0/+3
Use pskb_may_pull() to ensure a complete CAN frame is present in the linear data buffer before reading the CAN ID. A simple skb->len check is insufficient because it only verifies the total data length but does not guarantee the data is present in skb->data (it could be in fragments). pskb_may_pull() both validates the length and pulls fragmented data into the linear buffer if necessary, making it safe to directly access skb->data. Reported-by: syzbot+5d8269a1e099279152bc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5d8269a1e099279152bc Fixes: f057bbb6f9ed ("net: em_canid: Ematch rule to match CAN frames according to their identifiers") Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Link: https://patch.msgid.link/20251126085718.50808-1-ssranevjti@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-25net: sched: fix TCF_LAYER_TRANSPORT handling in tcf_get_base_ptr()Eric Dumazet3-3/+15
syzbot reported that tcf_get_base_ptr() can be called while transport header is not set [1]. Instead of returning a dangling pointer, return NULL. Fix tcf_get_base_ptr() callers to handle this NULL value. [1] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 skb_transport_header include/linux/skbuff.h:3071 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 tcf_get_base_ptr include/net/pkt_cls.h:539 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 em_nbyte_match+0x2d8/0x3f0 net/sched/em_nbyte.c:43 Modules linked in: CPU: 1 UID: 0 PID: 6019 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> tcf_em_match net/sched/ematch.c:494 [inline] __tcf_em_tree_match+0x1ac/0x770 net/sched/ematch.c:520 tcf_em_tree_match include/net/pkt_cls.h:512 [inline] basic_classify+0x115/0x2d0 net/sched/cls_basic.c:50 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x4cf/0x1140 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x4e/0x260 net/core/dev.c:4118 __dev_xmit_skb net/core/dev.c:4214 [inline] __dev_queue_xmit+0xe83/0x3b50 net/core/dev.c:4729 packet_snd net/packet/af_packet.c:3076 [inline] packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x21c/0x270 net/socket.c:742 ____sys_sendmsg+0x505/0x830 net/socket.c:2630 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+f3a497f02c389d86ef16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6920855a.a70a0220.2ea503.0058.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251121154100.1616228-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-15Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds2-8/+4
Pull bpf fixes from Alexei Starovoitov: - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks. - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks. - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman) - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen) - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet) - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test widen_imprecise_scalars() with different stack depth bpf: account for current allocated stack depth in widen_imprecise_scalars() bpf: Add bpf_prog_run_data_pointers() selftests/bpf: Add mptcp test with sockmap mptcp: Fix proto fallback detection with BPF mptcp: Disallow MPTCP subflows from sockmap selftests/bpf: Add stacktrace ips test for raw_tp selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()" bpf: add _impl suffix for bpf_stream_vprintk() kfunc bpf:add _impl suffix for bpf_task_work_schedule* kfuncs selftests/bpf: Add tests for livepatch + bpf trampoline ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct() ftrace: Fix BPF fexit with livepatch
2025-11-14bpf: Add bpf_prog_run_data_pointers()Eric Dumazet2-8/+4
syzbot found that cls_bpf_classify() is able to change tc_skb_cb(skb)->drop_reason triggering a warning in sk_skb_reason_drop(). WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 __sk_skb_reason_drop net/core/skbuff.c:1189 [inline] WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 sk_skb_reason_drop+0x76/0x170 net/core/skbuff.c:1214 struct tc_skb_cb has been added in commit ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block"), which added a wrong interaction with db58ba459202 ("bpf: wire in data and data_end for cls_act_bpf"). drop_reason was added later. Add bpf_prog_run_data_pointers() helper to save/restore the net_sched storage colliding with BPF data_meta/data_end. Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block") Reported-by: syzbot <syzkaller@googlegroups.com> Closes: https://lore.kernel.org/netdev/6913437c.a70a0220.22f260.013b.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251112125516.1563021-1-edumazet@google.com
2025-11-12net_sched: limit try_bulk_dequeue_skb() batchesEric Dumazet1-7/+10
After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption") I started seeing many qdisc requeues on IDPF under high TX workload. $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1: qdisc mq 1: root Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114) backlog 1056Kb 6675p requeues 3532840114 qdisc mq 1: root Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653) backlog 781164b 4822p requeues 3537737653 This is caused by try_bulk_dequeue_skb() being only limited by BQL budget. perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script ... netperf 75332 [146] 2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200 netperf 75332 [146] 2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500 netperf 75330 [144] 2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100 netperf 75333 [147] 2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00 netperf 75337 [151] 2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300 netperf 75337 [151] 2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00 netperf 75330 [144] 2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000 ... This is bad because : 1) Large batches hold one victim cpu for a very long time. 2) Driver often hit their own TX ring limit (all slots are used). 3) We call dev_requeue_skb() 4) Requeues are using a FIFO (q->gso_skb), breaking qdisc ability to implement FQ or priority scheduling. 5) dequeue_skb() gets packets from q->gso_skb one skb at a time with no xmit_more support. This is causing many spinlock games between the qdisc and the device driver. Requeues were supposed to be very rare, lets keep them this way. Limit batch sizes to /proc/sys/net/core/dev_weight (default 64) as __qdisc_run() was designed to use. Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20251109161215.2574081-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-11net: sched: act_ife: initialize struct tc_ife to fix KMSAN kernel-infoleakRanganath V N1-5/+7
Fix a KMSAN kernel-infoleak detected by the syzbot . [net?] KMSAN: kernel-infoleak in __skb_datagram_iter In tcf_ife_dump(), the variable 'opt' was partially initialized using a designatied initializer. While the padding bytes are reamined uninitialized. nla_put() copies the entire structure into a netlink message, these uninitialized bytes leaked to userspace. Initialize the structure with memset before assigning its fields to ensure all members and padding are cleared prior to beign copied. This change silences the KMSAN report and prevents potential information leaks from the kernel memory. This fix has been tested and validated by syzbot. This patch closes the bug reported at the following syzkaller link and ensures no infoleak. Reported-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c85cae3350b7d486aee Tested-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com Fixes: ef6980b6becb ("introduce IFE action") Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251109091336.9277-3-vnranganath.20@gmail.com Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11net: sched: act_connmark: initialize struct tc_ife to fix kernel leakRanganath V N1-5/+7
In tcf_connmark_dump(), the variable 'opt' was partially initialized using a designatied initializer. While the padding bytes are reamined uninitialized. nla_put() copies the entire structure into a netlink message, these uninitialized bytes leaked to userspace. Initialize the structure with memset before assigning its fields to ensure all members and padding are cleared prior to beign copied. Reported-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c85cae3350b7d486aee Tested-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com Fixes: 22a5dc0e5e3e ("net: sched: Introduce connmark action") Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251109091336.9277-2-vnranganath.20@gmail.com Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11net/sched: Abort __tc_modify_qdisc if parent is a clsact/ingress qdiscVictor Nogueira1-0/+5
Wang reported an illegal configuration [1] where the user attempts to add a child qdisc to the ingress qdisc as follows: tc qdisc add dev eth0 handle ffff:0 ingress tc qdisc add dev eth0 handle ffe0:0 parent ffff:a fq To solve this, we reject any configuration attempt to add a child qdisc to ingress or clsact. [1] https://lore.kernel.org/netdev/20251105022213.1981982-1-wangliang74@huawei.com/ Fixes: 5e50da01d0ce ("[NET_SCHED]: Fix endless loops (part 2): "simple" qdiscs") Reported-by: Wang Liang <wangliang74@huawei.com> Closes: https://lore.kernel.org/netdev/20251105022213.1981982-1-wangliang74@huawei.com/ Reviewed-by: Pedro Tammela <pctammela@mojatatu.ai> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Cong Wang <cwang@multikernel.io> Link: https://patch.msgid.link/20251106205621.3307639-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03net_sched: act: remove tcfa_qstatsEric Dumazet1-4/+8
tcfa_qstats is currently only used to hold drops and overlimits counters. tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats() currently acquire a->tcfa_lock to increment these counters. Switch to two atomic_t to get lock-free accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03net_sched: add back BH safety to tcf_lockEric Dumazet11-22/+22
Jamal reported that we had to use BH safety after all, because stats can be updated from BH handler. Fixes: 3133d5c15cb5 ("net_sched: remove BH blocking in eight actions") Fixes: 53df77e78590 ("net_sched: act_skbmod: use RCU in tcf_skbmod_dump()") Fixes: e97ae742972f ("net_sched: act_tunnel_key: use RCU in tunnel_key_dump()") Fixes: 48b5e5dbdb23 ("net_sched: act_vlan: use RCU in tcf_vlan_dump()") Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Closes: https://lore.kernel.org/netdev/CAM0EoMmhq66EtVqDEuNik8MVFZqkgxFbMu=fJtbNoYD7YXg4bA@mail.gmail.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250901092608.2032473-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29net_sched: act_skbmod: use RCU in tcf_skbmod_dump()Eric Dumazet1-14/+12
Also storing tcf_action into struct tcf_skbmod_params makes sure there is no discrepancy in tcf_skbmod_act(). No longer block BH in tcf_skbmod_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29net_sched: act_tunnel_key: use RCU in tunnel_key_dump()Eric Dumazet1-11/+9
Also storing tcf_action into struct tcf_tunnel_key_params makes sure there is no discrepancy in tunnel_key_act(). No longer block BH in tunnel_key_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29net_sched: act_vlan: use RCU in tcf_vlan_dump()Eric Dumazet1-11/+9
Also storing tcf_action into struct tcf_vlan_params makes sure there is no discrepancy in tcf_vlan_act(). No longer block BH in tcf_vlan_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29net_sched: remove BH blocking in eight actionsEric Dumazet8-16/+16
Followup of f45b45cbfae3 ("Merge branch 'net_sched-act-extend-rcu-use-in-dump-methods'") We never grab tcf_lock from BH context in these modules: act_connmark act_csum act_ct act_ctinfo act_mpls act_nat act_pedit act_skbedit No longer block BH when acquiring tcf_lock from init functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-35/+58
Cross-merge networking fixes after downstream PR (net-6.17-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/sched: Remove unnecessary WARNING condition for empty child qdisc in ↵William Liu1-1/+1
htb_activate The WARN_ON trigger based on !cl->leaf.q->q.qlen is unnecessary in htb_activate. htb_dequeue_tree already accounts for that scenario. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250819033632.579854-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limitWilliam Liu1-2/+12
The following setup can trigger a WARNING in htb_activate due to the condition: !cl->leaf.q->q.qlen tc qdisc del dev lo root tc qdisc add dev lo root handle 1: htb default 1 tc class add dev lo parent 1: classid 1:1 \ htb rate 64bit tc qdisc add dev lo parent 1:1 handle f: \ cake memlimit 1b ping -I lo -f -c1 -s64 -W0.001 127.0.0.1 This is because the low memlimit leads to a low buffer_limit, which causes packet dropping. However, cake_enqueue still returns NET_XMIT_SUCCESS, causing htb_enqueue to call htb_activate with an empty child qdisc. We should return NET_XMIT_CN when packets are dropped from the same tin and flow. I do not believe return value of NET_XMIT_CN is necessary for packet drops in the case of ack filtering, as that is meant to optimize performance, not to signal congestion. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250819033601.579821-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20net/sched: sch_dualpi2: Run prob update timer in softirq to avoid deadlockVictor Nogueira1-2/+3
When a user creates a dualpi2 qdisc it automatically sets a timer. This timer will run constantly and update the qdisc's probability field. The issue is that the timer acquires the qdisc root lock and runs in hardirq. The qdisc root lock is also acquired in dev.c whenever a packet arrives for this qdisc. Since the dualpi2 timer callback runs in hardirq, it may interrupt the packet processing running in softirq. If that happens and it runs on the same CPU, it will acquire the same lock and cause a deadlock. The following splat shows up when running a kernel compiled with lock debugging: [ +0.000224] WARNING: inconsistent lock state [ +0.000224] 6.16.0+ #10 Not tainted [ +0.000169] -------------------------------- [ +0.000029] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ +0.000000] ping/156 [HC0[0]:SC0[2]:HE1:SE0] takes: [ +0.000000] ffff897841242110 (&sch->root_lock_key){?.-.}-{3:3}, at: __dev_queue_xmit+0x86d/0x1140 [ +0.000000] {IN-HARDIRQ-W} state was registered at: [ +0.000000] lock_acquire.part.0+0xb6/0x220 [ +0.000000] _raw_spin_lock+0x31/0x80 [ +0.000000] dualpi2_timer+0x6f/0x270 [ +0.000000] __hrtimer_run_queues+0x1c5/0x360 [ +0.000000] hrtimer_interrupt+0x115/0x260 [ +0.000000] __sysvec_apic_timer_interrupt+0x6d/0x1a0 [ +0.000000] sysvec_apic_timer_interrupt+0x6e/0x80 [ +0.000000] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ +0.000000] pv_native_safe_halt+0xf/0x20 [ +0.000000] default_idle+0x9/0x10 [ +0.000000] default_idle_call+0x7e/0x1e0 [ +0.000000] do_idle+0x1e8/0x250 [ +0.000000] cpu_startup_entry+0x29/0x30 [ +0.000000] rest_init+0x151/0x160 [ +0.000000] start_kernel+0x6f3/0x700 [ +0.000000] x86_64_start_reservations+0x24/0x30 [ +0.000000] x86_64_start_kernel+0xc8/0xd0 [ +0.000000] common_startup_64+0x13e/0x148 [ +0.000000] irq event stamp: 6884 [ +0.000000] hardirqs last enabled at (6883): [<ffffffffa75700b3>] neigh_resolve_output+0x223/0x270 [ +0.000000] hardirqs last disabled at (6882): [<ffffffffa7570078>] neigh_resolve_output+0x1e8/0x270 [ +0.000000] softirqs last enabled at (6880): [<ffffffffa757006b>] neigh_resolve_output+0x1db/0x270 [ +0.000000] softirqs last disabled at (6884): [<ffffffffa755b533>] __dev_queue_xmit+0x73/0x1140 [ +0.000000] other info that might help us debug this: [ +0.000000] Possible unsafe locking scenario: [ +0.000000] CPU0 [ +0.000000] ---- [ +0.000000] lock(&sch->root_lock_key); [ +0.000000] <Interrupt> [ +0.000000] lock(&sch->root_lock_key); [ +0.000000] *** DEADLOCK *** [ +0.000000] 4 locks held by ping/156: [ +0.000000] #0: ffff897842332e08 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0x41e/0xf40 [ +0.000000] #1: ffffffffa816f880 (rcu_read_lock){....}-{1:3}, at: ip_output+0x2c/0x190 [ +0.000000] #2: ffffffffa816f880 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0xad/0x950 [ +0.000000] #3: ffffffffa816f840 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x73/0x1140 I am able to reproduce it consistently when running the following: tc qdisc add dev lo handle 1: root dualpi2 ping -f 127.0.0.1 To fix it, make the timer run in softirq. Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc") Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250815135317.664993-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-15net/sched: Fix backlog accounting in qdisc_dequeue_internalWilliam Liu6-30/+42
This issue applies for the following qdiscs: hhf, fq, fq_codel, and fq_pie, and occurs in their change handlers when adjusting to the new limit. The problem is the following in the values passed to the subsequent qdisc_tree_reduce_backlog call given a tbf parent: When the tbf parent runs out of tokens, skbs of these qdiscs will be placed in gso_skb. Their peek handlers are qdisc_peek_dequeued, which accounts for both qlen and backlog. However, in the case of qdisc_dequeue_internal, ONLY qlen is accounted for when pulling from gso_skb. This means that these qdiscs are missing a qdisc_qstats_backlog_dec when dropping packets to satisfy the new limit in their change handlers. One can observe this issue with the following (with tc patched to support a limit of 0): export TARGET=fq tc qdisc del dev lo root tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms tc qdisc replace dev lo handle 3: parent 1:1 $TARGET limit 1000 echo ''; echo 'add child'; tc -s -d qdisc show dev lo ping -I lo -f -c2 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null echo ''; echo 'after ping'; tc -s -d qdisc show dev lo tc qdisc change dev lo handle 3: parent 1:1 $TARGET limit 0 echo ''; echo 'after limit drop'; tc -s -d qdisc show dev lo tc qdisc replace dev lo handle 2: parent 1:1 sfq echo ''; echo 'post graft'; tc -s -d qdisc show dev lo The second to last show command shows 0 packets but a positive number (74) of backlog bytes. The problem becomes clearer in the last show command, where qdisc_purge_queue triggers qdisc_tree_reduce_backlog with the positive backlog and causes an underflow in the tbf parent's backlog (4096 Mb instead of 0). To fix this issue, the codepath for all clients of qdisc_dequeue_internal has been simplified: codel, pie, hhf, fq, fq_pie, and fq_codel. qdisc_dequeue_internal handles the backlog adjustments for all cases that do not directly use the dequeue handler. The old fq_codel_change limit adjustment loop accumulated the arguments to the subsequent qdisc_tree_reduce_backlog call through the cstats field. However, this is confusing and error prone as fq_codel_dequeue could also potentially mutate this field (which qdisc_dequeue_internal calls in the non gso_skb case), so we have unified the code here with other qdiscs. Fixes: 2d3cbfd6d54a ("net_sched: Flush gso_skb list too during ->change()") Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM") Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250812235725.45243-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-15net/sched: Use TC_RTAB_SIZE instead of magic numberYue Haibing1-2/+2
Replace magic number with TC_RTAB_SIZE to make it more informative. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250813125526.853895-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-5/+6
Cross-merge networking fixes after downstream PR (net-6.17-rc2). No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c d7a276a5768f ("net: stmmac: rk: convert to suspend()/resume() methods") de1e963ad064 ("net: stmmac: rk: put the PHY clock on remove") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-14net/sched: ets: use old 'nbands' while purging unused classesDavide Caratti1-5/+6
Shuang reported sch_ets test-case [1] crashing in ets_class_qlen_notify() after recent changes from Lion [2]. The problem is: in ets_qdisc_change() we purge unused DWRR queues; the value of 'q->nbands' is the new one, and the cleanup should be done with the old one. The problem is here since my first attempts to fix ets_qdisc_change(), but it surfaced again after the recent qdisc len accounting fixes. Fix it purging idle DWRR queues before assigning a new value of 'q->nbands', so that all purge operations find a consistent configuration: - old 'q->nbands' because it's needed by ets_class_find() - old 'q->nstrict' because it's needed by ets_class_is_strict() BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 62 UID: 0 PID: 39457 Comm: tc Kdump: loaded Not tainted 6.12.0-116.el10.x86_64 #1 PREEMPT(voluntary) Hardware name: Dell Inc. PowerEdge R640/06DKY5, BIOS 2.12.2 07/09/2021 RIP: 0010:__list_del_entry_valid_or_report+0x4/0x80 Code: ff 4c 39 c7 0f 84 39 19 8e ff b8 01 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 0f 84 56 19 8e ff 48 85 c9 0f 84 ab RSP: 0018:ffffba186009f400 EFLAGS: 00010202 RAX: 00000000000000d6 RBX: 0000000000000000 RCX: 0000000000000004 RDX: ffff9f0fa29b69c0 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffffffc12c2400 R08: 0000000000000008 R09: 0000000000000004 R10: ffffffffffffffff R11: 0000000000000004 R12: 0000000000000000 R13: ffff9f0f8cfe0000 R14: 0000000000100005 R15: 0000000000000000 FS: 00007f2154f37480(0000) GS:ffff9f269c1c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001530be001 CR4: 00000000007726f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ets_class_qlen_notify+0x65/0x90 [sch_ets] qdisc_tree_reduce_backlog+0x74/0x110 ets_qdisc_change+0x630/0xa40 [sch_ets] __tc_modify_qdisc.constprop.0+0x216/0x7f0 tc_modify_qdisc+0x7c/0x120 rtnetlink_rcv_msg+0x145/0x3f0 netlink_rcv_skb+0x53/0x100 netlink_unicast+0x245/0x390 netlink_sendmsg+0x21b/0x470 ____sys_sendmsg+0x39d/0x3d0 ___sys_sendmsg+0x9a/0xe0 __sys_sendmsg+0x7a/0xd0 do_syscall_64+0x7d/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f2155114084 Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d 25 f0 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89 RSP: 002b:00007fff1fd7a988 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000560ec063e5e0 RCX: 00007f2155114084 RDX: 0000000000000000 RSI: 00007fff1fd7a9f0 RDI: 0000000000000003 RBP: 00007fff1fd7aa60 R08: 0000000000000010 R09: 000000000000003f R10: 0000560ee9b3a010 R11: 0000000000000202 R12: 00007fff1fd7aae0 R13: 000000006891ccde R14: 0000560ec063e5e0 R15: 00007fff1fd7aad0 </TASK> [1] https://lore.kernel.org/netdev/e08c7f4a6882f260011909a868311c6e9b54f3e4.1639153474.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com/ Cc: stable@vger.kernel.org Fixes: 103406b38c60 ("net/sched: Always pass notifications when child class becomes empty") Fixes: c062f2a0b04d ("net/sched: sch_ets: don't remove idle classes from the round-robin list") Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc") Reported-by: Li Shuang <shuali@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-108026 Reviewed-by: Petr Machata <petrm@nvidia.com> Co-developed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/7928ff6d17db47a2ae7cc205c44777b1f1950545.1755016081.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-13net/sched: Remove redundant memset(0) call in reset_policy()Thorsten Blum1-1/+0
The call to nla_strscpy() already zero-pads the tail of the destination buffer which makes the additional memset(0) call redundant. Remove it. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250811164039.43250-1-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-05net/sched: mqprio: fix stack out-of-bounds write in tc entry parsingMaher Azzouzi1-1/+1
TCA_MQPRIO_TC_ENTRY_INDEX is validated using NLA_POLICY_MAX(NLA_U32, TC_QOPT_MAX_QUEUE), which allows the value TC_QOPT_MAX_QUEUE (16). This leads to a 4-byte out-of-bounds stack write in the fp[] array, which only has room for 16 elements (0–15). Fix this by changing the policy to allow only up to TC_QOPT_MAX_QUEUE - 1. Fixes: f62af20bed2d ("net/sched: mqprio: allow per-TC user input of FP adminStatus") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Maher Azzouzi <maherazz04@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250802001857.2702497-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-02net/sched: taprio: enforce minimum value for picos_per_byteTakamitsu Iwai1-3/+18
Syzbot reported a WARNING in taprio_get_start_time(). When link speed is 470,589 or greater, q->picos_per_byte becomes too small, causing length_to_duration(q, ETH_ZLEN) to return zero. This zero value leads to validation failures in fill_sched_entry() and parse_taprio_schedule(), allowing arbitrary values to be assigned to entry->interval and cycle_time. As a result, sched->cycle can become zero. Since SPEED_800000 is the largest defined speed in include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios. To ensure length_to_duration() returns a non-zero value for minimum-sized Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17 (60 * 17 > PSEC_PER_NSEC which is 1000). This patch enforces a minimum value of 17 for picos_per_byte when the calculated value would be lower, and adds a warning message to inform users that scheduling accuracy may be affected at very high link speeds. Fixes: fb66df20a720 ("net/sched: taprio: extend minimum interval restriction to entire cycle too") Reported-by: syzbot+398e1ee4ca2cac05fddb@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=398e1ee4ca2cac05fddb Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Link: https://patch.msgid.link/20250728173149.45585-1-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-30Merge tag 'bpf-next-6.17' of ↵Linus Torvalds1-6/+3
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Remove usermode driver (UMD) framework (Thomas Weißschuh) - Introduce Strongly Connected Component (SCC) in the verifier to detect loops and refine register liveness (Eduard Zingerman) - Allow 'void *' cast using bpf_rdonly_cast() and corresponding '__arg_untrusted' for global function parameters (Eduard Zingerman) - Improve precision for BPF_ADD and BPF_SUB operations in the verifier (Harishankar Vishwanathan) - Teach the verifier that constant pointer to a map cannot be NULL (Ihor Solodrai) - Introduce BPF streams for error reporting of various conditions detected by BPF runtime (Kumar Kartikeya Dwivedi) - Teach the verifier to insert runtime speculation barrier (lfence on x86) to mitigate speculative execution instead of rejecting the programs (Luis Gerhorst) - Various improvements for 'veristat' (Mykyta Yatsenko) - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to improve bug detection by syzbot (Paul Chaignon) - Support BPF private stack on arm64 (Puranjay Mohan) - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's node (Song Liu) - Introduce kfuncs for read-only string opreations (Viktor Malik) - Implement show_fdinfo() for bpf_links (Tao Chen) - Reduce verifier's stack consumption (Yonghong Song) - Implement mprog API for cgroup-bpf programs (Yonghong Song) * tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits) selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite selftests/bpf: Add selftest for attaching tracing programs to functions in deny list bpf: Add log for attaching tracing programs to functions in deny list bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions bpf: Fix various typos in verifier.c comments bpf: Add third round of bounds deduction selftests/bpf: Test invariants on JSLT crossing sign selftests/bpf: Test cross-sign 64bits range refinement selftests/bpf: Update reg_bound range refinement logic bpf: Improve bounds when s64 crosses sign boundary bpf: Simplify bounds refinement from s32 selftests/bpf: Enable private stack tests for arm64 bpf, arm64: JIT support for private stack bpf: Move bpf_jit_get_prog_name() to core.c bpf, arm64: Fix fp initialization for exception boundary umd: Remove usermode driver framework bpf/preload: Don't select USERMODE_DRIVER selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure selftests/bpf: Increase xdp data size for arm64 64K page size ...
2025-07-26net/sched: taprio: align entry index attr validation with mqprioSimon Horman1-8/+4
Both taprio and mqprio have code to validate respective entry index attributes. The validation is indented to ensure that the attribute is present, and that it's value is in range, and that each value is only used once. The purpose of this patch is to align the implementation of taprio with that of mqprio as there seems to be no good reason for them to differ. For one thing, this way, bugs will be present in both or neither. As a follow-up some consideration could be given to a common function used by both sch. No functional change intended. Except of tdc run: the results of the taprio tests # ok 81 ba39 - Add taprio Qdisc to multi-queue device (8 queues) # ok 82 9462 - Add taprio Qdisc with multiple sched-entry # ok 83 8d92 - Add taprio Qdisc with txtime-delay # ok 84 d092 - Delete taprio Qdisc with valid handle # ok 85 8471 - Show taprio class # ok 86 0a85 - Add taprio Qdisc to single-queue device # ok 87 6f62 - Add taprio Qdisc with too short interval # ok 88 831f - Add taprio Qdisc with too short cycle-time # ok 89 3e1e - Add taprio Qdisc with an invalid cycle-time # ok 90 39b4 - Reject grafting taprio as child qdisc of software taprio # ok 91 e8a1 - Reject grafting taprio as child qdisc of offloaded taprio # ok 92 a7bf - Graft cbs as child of software taprio # ok 93 6a83 - Graft cbs as child of offloaded taprio Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Cc: Maher Azzouzi <maherazz04@gmail.com> Link: https://lore.kernel.org/netdev/20250723125521.GA2459@horms.kernel.org/ Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://patch.msgid.link/20250725-taprio-idx-parse-v1-1-b582fffcde37@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26net/sched: Add precise drop reason for pfifo_fast queue overflowsFan Yu1-0/+2
Currently, packets dropped by pfifo_fast due to queue overflow are marked with a generic SKB_DROP_REASON_QDISC_DROP in __dev_xmit_skb(). This patch adds explicit drop reason SKB_DROP_REASON_QDISC_OVERLIMIT for queue-full cases, providing better distinction from other qdisc drops. Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250724212837119BP9HOs0ibXDRWgsXMMir7@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-4/+3
Cross-merge networking fixes after downstream PR (net-6.16-rc8). Conflicts: drivers/net/ethernet/microsoft/mana/gdma_main.c 9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion") 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/icssg/icssg_prueth.h 6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG") ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24sched: Add enqueue/dequeue of dualpi2 qdiscKoen De Schepper3-4/+481
DualPI2 provides L4S-type low latency & loss to traffic that uses a scalable congestion controller (e.g. TCP-Prague, DCTCP) without degrading the performance of 'classic' traffic (e.g. Reno, Cubic etc.). It is to be the reference implementation of IETF RFC9332 DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332). Note that creating two independent queues cannot meet the goal of DualPI2 mentioned in RFC9332: "...to preserve fairness between ECN-capable and non-ECN-capable traffic." Further, it could even lead to starvation of Classic traffic, which is also inconsistent with the requirements in RFC9332: "...although priority MUST be bounded in order not to starve Classic traffic." DualPI2 is designed to maintain approximate per-flow fairness on L-queue and C-queue by forming a single qdisc using the coupling factor and scheduler between two queues. The qdisc provides two queues called low latency and classic. It classifies packets based on the ECN field in the IP headers. By default it directs non-ECN and ECT(0) into the classic queue and ECT(1) and CE into the low latency queue, as per the IETF spec. Each queue runs its own AQM: * The classic AQM is called PI2, which is similar to the PIE AQM but more responsive and simpler. Classic traffic requires a decent target queue (default 15ms for Internet deployment) to fully utilize the link and to avoid high drop rates. * The low latency AQM is, by default, a very shallow ECN marking threshold (1ms) similar to that used for DCTCP. The DualQ isolates the low queuing delay of the Low Latency queue from the larger delay of the 'Classic' queue. However, from a bandwidth perspective, flows in either queue will share out the link capacity as if there was just a single queue. This bandwidth pooling effect is achieved by coupling together the drop and ECN-marking probabilities of the two AQMs. The PI2 AQM has two main parameters in addition to its target delay. The integral gain factor alpha is used to slowly correct any persistent standing queue error from the target delay, while the proportional gain factor beta is used to quickly compensate for queue changes (growth or shrinkage). Either alpha and beta are given as a parameter, or they can be calculated by tc from alternative typical and maximum RTT parameters. Internally, the output of a linear Proportional Integral (PI) controller is used for both queues. This output is squared to calculate the drop or ECN-marking probability of the classic queue. This counterbalances the square-root rate equation of Reno/Cubic, which is the trick that balances flow rates across the queues. For the ECN-marking probability of the low latency queue, the output of the base AQM is multiplied by a coupling factor. This determines the balance between the flow rates in each queue. The default setting makes the flow rates roughly equal, which should be generally applicable. If DUALPI2 AQM has detected overload (due to excessive non-responsive traffic in either queue), it will switch to signaling congestion solely using drop, irrespective of the ECN field. Alternatively, it can be configured to limit the drop probability and let the queue grow and eventually overflow (like tail-drop). GSO splitting in DUALPI2 is configurable from userspace while the default behavior is to split gso. When running DUALPI2 at unshaped 10gigE with 4 download streams test, splitting gso apart results in halving the latency with no loss in throughput: Summary of tcp_4down run 'no_split_gso': avg median # data pts Ping (ms) ICMP : 0.53 0.30 ms 350 TCP download avg : 2326.86 N/A Mbits/s 350 TCP download sum : 9307.42 N/A Mbits/s 350 TCP download::1 : 2672.99 2568.73 Mbits/s 350 TCP download::2 : 2586.96 2570.51 Mbits/s 350 TCP download::3 : 1786.26 1798.82 Mbits/s 350 TCP download::4 : 2261.21 2309.49 Mbits/s 350 Summart of tcp_4down run 'split_gso': avg median # data pts Ping (ms) ICMP   : 0.22 0.23 ms 350 TCP download avg : 2335.02 N/A Mbits/s 350 TCP download sum : 9340.09 N/A Mbits/s 350 TCP download::1 : 2335.30 2334.22 Mbits/s 350 TCP download::2 : 2334.72 2334.20 Mbits/s 350 TCP download::3 : 2335.28 2334.58 Mbits/s 350 TCP download::4 : 2334.79 2334.39 Mbits/s 350 A similar result is observed when running DUALPI2 at unshaped 1gigE with 1 download stream test: Summary of tcp_1down run 'no_split_gso': avg median # data pts Ping (ms) ICMP : 1.13 1.25 ms 350 TCP download : 941.41 941.46 Mbits/s 350 Summart of tcp_1down run 'split_gso': avg median # data pts Ping (ms) ICMP : 0.51 0.55 ms 350 TCP download : 941.41 941.45 Mbits/s 350 Additional details can be found in the draft: https://datatracker.ietf.org/doc/html/rfc9332 Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Co-developed-by: Olga Albisser <olga@albisser.org> Signed-off-by: Olga Albisser <olga@albisser.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Henrik Steen <henrist@henrist.net> Signed-off-by: Henrik Steen <henrist@henrist.net> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Bob Briscoe <research@bobbriscoe.net> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Acked-by: Dave Taht <dave.taht@gmail.com> Link: https://patch.msgid.link/20250722095915.24485-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24sched: Dump configuration and statistics of dualpi2 qdiscChia-Yu Chang1-17/+137
The configuration and statistics dump of the DualPI2 Qdisc provides information related to both queues, such as packet numbers and queuing delays in the L-queue and C-queue, as well as general information such as probability value, WRR credits, memory usage, packet marking counters, max queue size, etc. The following patch includes enqueue/dequeue for DualPI2. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20250722095915.24485-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24sched: Struct definition and parsing of dualpi2 qdiscChia-Yu Chang1-0/+591
DualPI2 is the reference implementation of IETF RFC9332 DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332) providing two queues called low latency (L-queue) and classic (C-queue). By default, it enqueues non-ECN and ECT(0) packets into the C-queue and ECT(1) and CE packets into the low latency queue (L-queue), as per IETF RFC9332 spec. This patch defines the dualpi2 Qdisc structure and parsing, and the following two patches include dumping and enqueue/dequeue for the DualPI2. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20250722095915.24485-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in ↵Xiang Mei1-4/+3
qfq_delete_class might_sleep could be trigger in the atomic context in qfq_delete_class. qfq_destroy_class was moved into atomic context locked by sch_tree_lock to avoid a race condition bug on qfq_aggregate. However, might_sleep could be triggered by qfq_destroy_class, which introduced sleeping in atomic context (path: qfq_destroy_class->qdisc_put->__qdisc_destroy->lockdep_unregister_key ->might_sleep). Considering the race is on the qfq_aggregate objects, keeping qfq_rm_from_agg in the lock but moving the left part out can solve this issue. Fixes: 5e28d5a3f774 ("net/sched: sch_qfq: Fix race condition on qfq_aggregate") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/4a04e0cc-a64b-44e7-9213-2880ed641d77@sabinyo.mountain Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/20250717230128.159766-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6Alexei Starovoitov4-58/+46
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-10/+24
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtreeWilliam Liu1-1/+3
htb_lookup_leaf has a BUG_ON that can trigger with the following: tc qdisc del dev lo root tc qdisc add dev lo root handle 1: htb default 1 tc class add dev lo parent 1: classid 1:1 htb rate 64bit tc qdisc add dev lo parent 1:1 handle 2: netem tc qdisc add dev lo parent 2:1 handle 3: blackhole ping -I lo -c1 -W0.001 127.0.0.1 The root cause is the following: 1. htb_dequeue calls htb_dequeue_tree which calls the dequeue handler on the selected leaf qdisc 2. netem_dequeue calls enqueue on the child qdisc 3. blackhole_enqueue drops the packet and returns a value that is not just NET_XMIT_SUCCESS 4. Because of this, netem_dequeue calls qdisc_tree_reduce_backlog, and since qlen is now 0, it calls htb_qlen_notify -> htb_deactivate -> htb_deactiviate_prios -> htb_remove_class_from_row -> htb_safe_rb_erase 5. As this is the only class in the selected hprio rbtree, __rb_change_child in __rb_erase_augmented sets the rb_root pointer to NULL 6. Because blackhole_dequeue returns NULL, netem_dequeue returns NULL, which causes htb_dequeue_tree to call htb_lookup_leaf with the same hprio rbtree, and fail the BUG_ON The function graph for this scenario is shown here: 0) | htb_enqueue() { 0) + 13.635 us | netem_enqueue(); 0) 4.719 us | htb_activate_prios(); 0) # 2249.199 us | } 0) | htb_dequeue() { 0) 2.355 us | htb_lookup_leaf(); 0) | netem_dequeue() { 0) + 11.061 us | blackhole_enqueue(); 0) | qdisc_tree_reduce_backlog() { 0) | qdisc_lookup_rcu() { 0) 1.873 us | qdisc_match_from_root(); 0) 6.292 us | } 0) 1.894 us | htb_search(); 0) | htb_qlen_notify() { 0) 2.655 us | htb_deactivate_prios(); 0) 6.933 us | } 0) + 25.227 us | } 0) 1.983 us | blackhole_dequeue(); 0) + 86.553 us | } 0) # 2932.761 us | qdisc_warn_nonwc(); 0) | htb_lookup_leaf() { 0) | BUG_ON(); ------------------------------------------ The full original bug report can be seen here [1]. We can fix this just by returning NULL instead of the BUG_ON, as htb_dequeue_tree returns NULL when htb_lookup_leaf returns NULL. [1] https://lore.kernel.org/netdev/pF5XOOIim0IuEfhI-SOxTgRvNoDwuux7UHKnE_Y5-zVd4wmGvNk2ceHjKb8ORnzw0cGwfmVu42g9dL7XyJLf1NEzaztboTWcm0Ogxuojoeo=@willsroot.io/ Fixes: 512bb43eb542 ("pkt_sched: sch_htb: Optimize WARN_ONs in htb_dequeue_tree() etc.") Signed-off-by: William Liu <will@willsroot.io> Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250717022816.221364-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>