kernel/linux.git/net/core/skbuff.c, branch v6.19.11

net: Drop the lock in skb_may_tx_timestamp()

2026-03-04T12:20:56+00:00

[ Upstream commit 983512f3a87fd8dc4c94dfa6b596b6e57df5aad7 ] skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must not be taken in IRQ context, only softirq is okay. A few drivers receive the timestamp via a dedicated interrupt and complete the TX timestamp from that handler. This will lead to a deadlock if the lock is already write-locked on the same CPU. Taking the lock can be avoided. The socket (pointed by the skb) will remain valid until the skb is released. The ->sk_socket and ->file member will be set to NULL once the user closes the socket which may happen before the timestamp arrives. If we happen to observe the pointer while the socket is closing but before the pointer is set to NULL then we may use it because both pointer (and the file's cred member) are RCU freed. Drop the lock. Use READ_ONCE() to obtain the individual pointer. Add a matching WRITE_ONCE() where the pointer are cleared. Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl") Signed-off-by: Sebastian Andrzej Siewior Reviewed-by: Willem de Bruijn Reviewed-by: Jason Xing Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de Signed-off-by: Paolo Abeni Signed-off-by: Sasha Levin

net: do not delay zero-copy skbs in skb_attempt_defer_free()

2026-02-26T23:01:32+00:00

[ Upstream commit 0943404b1f3b178e1e54386dadcbf4f2729c7762 ] After the blamed commit, TCP tx zero copy notifications could be arbitrarily delayed and cause regressions in applications waiting for them. Signed-off-by: Eric Dumazet Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs") Reviewed-by: Jason Xing Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com Signed-off-by: Jakub Kicinski Signed-off-by: Sasha Levin

net: add skb->data_len and (skb>end - skb->tail) to skb_dump()

2026-01-16T03:49:47+00:00

While working on a syzbot report, I found that skb_dump() is lacking two important parts : - skb->data_len. - (skb>end - skb->tail) tailroom is zero if skb is not linear. Signed-off-by: Eric Dumazet Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20260112172621.4188700-1-edumazet@google.com Signed-off-by: Jakub Kicinski

net: fix memory leak in skb_segment_list for GRO packets

2026-01-06T01:01:28+00:00

When skb_segment_list() is called during packet forwarding, it handles packets that were aggregated by the GRO engine. Historically, the segmentation logic in skb_segment_list assumes that individual segments are split from a parent SKB and may need to carry their own socket memory accounting. Accordingly, the code transfers truesize from the parent to the newly created segments. Prior to commit ed4cccef64c1 ("gro: fix ownership transfer"), this truesize subtraction in skb_segment_list() was valid because fragments still carry a reference to the original socket. However, commit ed4cccef64c1 ("gro: fix ownership transfer") changed this behavior by ensuring that fraglist entries are explicitly orphaned (skb->sk = NULL) to prevent illegal orphaning later in the stack. This change meant that the entire socket memory charge remained with the head SKB, but the corresponding accounting logic in skb_segment_list() was never updated. As a result, the current code unconditionally adds each fragment's truesize to delta_truesize and subtracts it from the parent SKB. Since the fragments are no longer charged to the socket, this subtraction results in an effective under-count of memory when the head is freed. This causes sk_wmem_alloc to remain non-zero, preventing socket destruction and leading to a persistent memory leak. The leak can be observed via KMEMLEAK when tearing down the networking environment: unreferenced object 0xffff8881e6eb9100 (size 2048): comm "ping", pid 6720, jiffies 4295492526 backtrace: kmem_cache_alloc_noprof+0x5c6/0x800 sk_prot_alloc+0x5b/0x220 sk_alloc+0x35/0xa00 inet6_create.part.0+0x303/0x10d0 __sock_create+0x248/0x640 __sys_socket+0x11b/0x1d0 Since skb_segment_list() is exclusively used for SKB_GSO_FRAGLIST packets constructed by GRO, the truesize adjustment is removed. The call to skb_release_head_state() must be preserved. As documented in commit cf673ed0e057 ("net: fix fraglist segmentation reference count leak"), it is still required to correctly drop references to SKB extensions that may be overwritten during __copy_skb_header(). Fixes: ed4cccef64c1 ("gro: fix ownership transfer") Signed-off-by: Mohammad Heib Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20260104213101.352887-1-mheib@redhat.com Signed-off-by: Jakub Kicinski

net: restore napi_consume_skb()'s NULL-handling

2025-11-28T01:44:22+00:00

Commit e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs") added a skb->cpu check to napi_consume_skb(), before the point where napi_consume_skb() validated skb is not NULL. Add an explicit check to the early exit condition. Reviewed-by: Eric Dumazet Signed-off-by: Jakub Kicinski

net: prefetch the next skb in napi_skb_cache_get()

2025-11-20T04:29:25+00:00

After getting the current skb in napi_skb_cache_get(), the next skb in cache is highly likely to be used soon, so prefetch would be helpful. Suggested-by: Eric Dumazet Signed-off-by: Jason Xing Reviewed-by: Eric Dumazet Reviewed-by: Alexander Lobakin Link: https://patch.msgid.link/20251118070646.61344-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski

net: use NAPI_SKB_CACHE_FREE to keep 32 as default to do bulk free

2025-11-20T04:29:24+00:00

- Replace NAPI_SKB_CACHE_HALF with NAPI_SKB_CACHE_FREE - Only free 32 skbs in napi_skb_cache_put() Since the first patch adjusting NAPI_SKB_CACHE_SIZE to 128, the number of packets to be freed in the softirq was increased from 32 to 64. Considering a subsequent net_rx_action() calling napi_poll() a few times can easily consume the 64 available slots and we can afford keeping a higher value of sk_buffs in per-cpu storage, decrease NAPI_SKB_CACHE_FREE to 32 like before. So now the logic is 1) keeping 96 skbs, 2) freeing 32 skbs at one time. Suggested-by: Eric Dumazet Signed-off-by: Jason Xing Reviewed-by: Eric Dumazet Reviewed-by: Alexander Lobakin Link: https://patch.msgid.link/20251118070646.61344-4-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski

net: increase default NAPI_SKB_CACHE_BULK to 32

2025-11-20T04:29:24+00:00

The previous value 16 is a bit conservative, so adjust it along with NAPI_SKB_CACHE_SIZE, which can minimize triggering memory allocation in napi_skb_cache_get*(). Suggested-by: Eric Dumazet Signed-off-by: Jason Xing Reviewed-by: Eric Dumazet Reviewed-by: Alexander Lobakin Link: https://patch.msgid.link/20251118070646.61344-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski

net: increase default NAPI_SKB_CACHE_SIZE to 128

2025-11-20T04:29:24+00:00

After commit b61785852ed0 ("net: increase skb_defer_max default to 128") changed the value sysctl_skb_defer_max to avoid many calls to kick_defer_list_purge(), the same situation can be applied to NAPI_SKB_CACHE_SIZE that was proposed in 2016. It's a trade-off between using pre-allocated memory in skb_cache and saving more a bit heavy function calls in the softirq context. With this patch applied, we can have more skbs per-cpu to accelerate the sending path that needs to acquire new skbs. Suggested-by: Eric Dumazet Signed-off-by: Jason Xing Reviewed-by: Eric Dumazet Reviewed-by: Alexander Lobakin Link: https://patch.msgid.link/20251118070646.61344-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski

net: use napi_skb_cache even in process context

2025-11-19T02:25:47+00:00

This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs"). Now the per-cpu napi_skb_cache is populated from TX completion path, we can make use of this cache, especially for cpus not used from a driver NAPI poll (primary user of napi_cache). We can use the napi_skb_cache only if current context is not from hard irq. With this patch, I consistently reach 130 Mpps on my UDP tx stress test and reduce SLUB spinlock contention to smaller values. Note there is still some SLUB contention for skb->head allocations. I had to tune /sys/kernel/slab/skbuff_small_head/cpu_partial and /sys/kernel/slab/skbuff_small_head/min_partial depending on the platform taxonomy. Signed-off-by: Eric Dumazet Reviewed-by: Jason Xing Tested-by: Jason Xing Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20251116202717.1542829-4-edumazet@google.com Signed-off-by: Jakub Kicinski