Age | Commit message | Author | Files | Lines
2026-02-07 | Merge branch 'remove-task-and-cgroup-local-storage-percpu-counters' | Martin KaFai Lau | 15 | -561/+354
Amery Hung says:

====================
Remove task and cgroup local storage percpu counters

* Motivation *

The goal of this patchset is to make bpf syscalls and helpers updating task and cgroup local storage more robust by removing percpu counters in them. Task local storage and cgroup storage each employ a percpu counter to prevent deadlock caused by recursion. Since the underlying bpf local storage takes spinlocks in various operations, bpf programs running recursively may try to take a spinlock which is already taken. For example, when a tracing bpf program called recursively during bpf_task_storage_get(..., F_CREATE) tries to call bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock if the percpu variable is not in place.

However, sometimes the percpu counter may cause bpf syscalls or helpers to return errors spuriously, as soon as another thread is also updating the local storage or the local storage map. Ideally, the two threads could have taken turns to take the locks and perform their jobs respectively. However, due to the percpu counter, the syscalls and helpers can return -EBUSY even if one of them does not run recursively in another one. All it takes for this to happen is for the two threads to run on the same CPU. This happened when BPF-CI ran the selftest of task local data. Since CI runs the test on a VM with 2 CPUs, bpf_task_storage_get(..., F_CREATE) can easily fail.

The failure mode is not good for users as they need to add retry logic in user space or bpf programs to avoid it. Even with retry, there is no guaranteed upper bound of the loop for a successful call. Therefore, this patchset seeks to remove the percpu counter and make the related bpf syscalls and helpers more reliable, while still making sure recursion deadlock will not happen, with the help of resilient queued spinlock (rqspinlock).

* Implementation *

To remove the percpu counter without introducing deadlock, bpf_local_storage is refactored by changing the locks from raw_spin_lock to rqspinlock, which prevents deadlock with deadlock detection and a timeout mechanism.

The refactor basically replaces the locks with rqspinlock and propagates errors returned by the locking function to BPF helpers or syscalls. bpf_selem_unlink_nofail() is introduced to handle rqspinlock errors in two lock acquiring functions that cannot fail, bpf_local_storage_destroy() and bpf_local_storage_map_free() (i.e., local storage is being freed by the subsystem or the map is being freed). The high level idea is to use a bitfield and atomic operations to track who is referencing an selem when any locks cannot be acquired. Additional care is needed to make sure special fields are freed and owner memory is uncharged safely and correctly.

For readers not familiar with local storage, the last section briefly describes the locks and structure of local storage. It also shows the abbreviations used in the rest of the letter.

* Test *

Task and cgroup local storage selftests have already covered deadlock caused by recursion. Patch 14 updates the expected result of task local storage selftests as task local storage bpf helpers can now run on the same CPU as they don't cause deadlock.

* Benchmark *

./bench -p 1 local-storage-create --storage-type <socket,task> \
        --batch-size <16,32,64>

The benchmark is a microbenchmark stress-testing how fast local storage can be created. After switching to rqspinlock and bpf_selem_unlink_nofail(), socket local storage creation speed has a ~5% gain. For task local storage, the number remains the same.

Socket local storage
                 batch  creation speed                            creation speed diff
---------------  -----  ----------------------------------------  -------------------
Before              16  134.371 ± 0.884k/s  3.12 kmallocs/create
                    32  133.032 ± 3.405k/s  3.12 kmallocs/create
                    64  133.494 ± 0.862k/s  3.12 kmallocs/create
After               16  140.778 ± 1.306k/s  3.12 kmallocs/create  +4.8%
                    32  140.550 ± 2.058k/s  3.11 kmallocs/create  +5.7%
                    64  139.311 ± 0.911k/s  3.13 kmallocs/create  +4.4%

Task local storage
                 batch  creation speed                            creation speed diff
---------------  -----  ----------------------------------------  -------------------
Before              16   25.301 ± 0.089k/s  2.43 kmallocs/create
                    32   23.797 ± 0.106k/s  2.51 kmallocs/create
                    64   23.251 ± 0.187k/s  2.51 kmallocs/create
After               16   25.307 ± 0.080k/s  2.45 kmallocs/create  +0.0%
                    32   23.889 ± 0.089k/s  2.46 kmallocs/create  +0.0%
                    64   23.230 ± 0.113k/s  2.63 kmallocs/create  -0.1%

* Patchset organization *

Patch 1-4 convert local storage internal helpers to failable.

Patch 5 changes the locks to rqspinlock and propagates the error returned from raw_res_spin_lock_irqsave() to BPF helpers and syscalls.

Patch 6-8 remove percpu counters in task and cgroup local storage.

Patch 9-11 address the unlikely rqspinlock errors by switching to bpf_selem_unlink_nofail() in map_free() and destroy().

Patch 12-17 update selftests.

* Appendix: local storage internal *

There are two locks in bpf_local_storage due to the ownership model as illustrated in the figure below. A map value, which consists of a pointer to the map and the data, is a bpf_local_storage_map_data (sdata) stored in a bpf_local_storage_elem (selem). A selem belongs to a bpf_local_storage and bpf_local_storage_map at the same time. bpf_local_storage::lock (local_storage->lock in short) protects the list in a bpf_local_storage and bpf_local_storage_map_bucket::lock (b->lock) protects the hash bucket in a bpf_local_storage_map.

[Figure: ownership diagram. Each task_struct (task1, task2) points through *bpf_storage to a bpf_local_storage (fields: *owner, *cache[16], *smap, list, lock), whose list links its selems through snode. Each bpf_local_storage_map (smap; mapA, mapB) holds a bpf_map map and *buckets pointing to bpf_local_storage_map_bucket entries (b[0], b[1], ...; fields: list, lock), whose list links the same selems through map_node. Each selem carries sdata = {&mapA,} or sdata = {&mapB,}.]

* Changelog *

v6 -> v7
  - Minor comment and commit msg tweaks
  - Patch 9: Remove unused "owner" (kernel test robot)
  - Patch 13: Update comments in task_ls_recursion.c (AI)
  Link: https://lore.kernel.org/bpf/20260205070208.186382-1-ameryhung@gmail.com/

v5 -> v6
  - Redo benchmark
  - Patch 9: Remove storage->smap as it is not used any more
  - Patch 17: Remove storage->smap check in selftests
  - Patch 10, 11: Pass reuse_now = true to bpf_selem_free() and bpf_local_storage_free() to allow faster memory reclaim (Martin)
  - Patch 10: Use bitfield instead of refcount to track selem state to be more precise, which removes the possibility of map_free missing an selem (Martin)
  - Patch 10: Allow map_free() to free local_storage and drop the change in bpf_local_storage_map_update() (Martin)
  - Patch 11: Simplify destroy() by not deferring work as an owner is unlikely to have too many maps that stalls RCU (Martin)
  Link: https://lore.kernel.org/bpf/20260201175050.468601-1-ameryhung@gmail.com/

v4 -> v5
  - Patch 1: Fix incorrect bucket calculation (AI)
  - Patch 3: Fix memory leak in bpf_sk_storage_clone() (AI)
  - Patch 5: Fix memory leak in bpf_local_storage_update() (AI)
  - Fix typo/comment/commit msg (AI)
  - Patch 10: Replace smp_rmb() with smp_mb(). smp_rmb does not imply acquire semantics
  Link: https://lore.kernel.org/bpf/20260131050920.2574084-1-ameryhung@gmail.com/

v3 -> v4
  - Add performance numbers
  - Avoid stale element when calling bpf_local_storage_map_free() by allowing it to unlink selem from local_storage->list and uncharge memory. Block destroy() from returning when pending map_free() are uncharging
  - Fix an -EAGAIN bug in bpf_local_storage_update() as map_free() now does not free local storage
  - Fix possible double-free of selem by ensuring an selem is only processed once for each caller (Kumar)
  - Fix possible infinite loop in bpf_selem_unlink_nofail() when iterating b->list by replacing while loop with hlist_for_each_entry_rcu
  - Fix unsafe iteration in destroy() by iterating local_storage->list using hlist_for_each_entry_rcu
  - Fix UAF due to clearing storage_owner after destroy(). Flip the order to fix it
  - Misc clean-up suggested by Martin
  Link: https://lore.kernel.org/bpf/20251218175628.1460321-1-ameryhung@gmail.com/

v2 -> v3
  - Rebase to bpf-next where BPF memory allocator is replaced with kmalloc_nolock()
  - Revert to selecting bucket based on selem
  - Introduce bpf_selem_unlink_lockless() to allow unlinking and freeing selem without taking locks
  Link: https://lore.kernel.org/bpf/20251002225356.1505480-1-ameryhung@gmail.com/

v1 -> v2
  - Rebase to bpf-next
  - Select bucket based on local_storage instead of selem (Martin)
  - Simplify bpf_selem_unlink (Martin)
  - Change handling of rqspinlock errors in bpf_local_storage_destroy() and bpf_local_storage_map_free(). Retry instead of WARN_ON.
  Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
====================

Link: https://patch.msgid.link/20260205222916.1788211-1-ameryhung@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2026-02-07 | selftests/bpf: Fix outdated test on storage->smap | Amery Hung | 1 | -17/+2
bpf_local_storage_free() already does not rely on local_storage->smap since switching to kmalloc_nolock(). As local_storage->smap is removed, fix the outdated test by dropping the local_storage->smap check. Keep the second map in task local storage map test to test that multiple elements can be added to the storage similar to sk storage test. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-18-ameryhung@gmail.com
2026-02-07 | selftests/bpf: Choose another percpu variable in bpf for btf_dump test | Amery Hung | 1 | -2/+2
bpf_cgrp_storage_busy has been removed. Use bpf_bprintf_nest_level instead. This percpu variable is also in the bpf subsystem so that if it is removed in the future, BPF-CI will catch this type of CI-breaking change. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-17-ameryhung@gmail.com
2026-02-07 | selftests/bpf: Remove test_task_storage_map_stress_lookup | Amery Hung | 2 | -166/+0
Remove a test in test_maps that checks if the updating of the percpu counter in task local storage map is preemption safe as the percpu counter is now removed. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-16-ameryhung@gmail.com
2026-02-07 | selftests/bpf: Update task_local_storage/task_storage_nodeadlock test | Amery Hung | 1 | -5/+2
Adjust the error code we are checking against as bpf_task_storage_delete() now returns -EDEADLK or -ETIMEDOUT when deadlock happens. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-15-ameryhung@gmail.com
2026-02-07 | selftests/bpf: Update task_local_storage/recursion test | Amery Hung | 2 | -17/+7
Update the expected result of the selftest as the recursion restrictions on task local storage syscalls and helpers have been relaxed. Now that the percpu counter is removed, the task local storage helpers bpf_task_storage_get() and bpf_task_storage_delete() can run on the same CPU at the same time unless they cause deadlock.

Note that since there is no percpu counter preventing recursion in task local storage helpers, bpf_trampoline now catches the recursion of on_update as reported by recursion_misses.

on_enter: tp_btf/sys_enter
on_update: fentry/bpf_local_storage_update

Old behavior
____________
on_enter
  bpf_task_storage_get(&map_a)
    bpf_task_storage_trylock succeed
    bpf_local_storage_update(&map_a)
      on_update
        bpf_task_storage_get(&map_a)
          bpf_task_storage_trylock fail
          return NULL
        bpf_task_storage_get(&map_b)
          bpf_task_storage_trylock fail
          return NULL
    create and return map_a::ptr
  map_a::ptr = 200
  bpf_task_storage_get(&map_b)
    bpf_task_storage_trylock succeed
    bpf_local_storage_update(&map_b)
      on_update
        bpf_task_storage_get(&map_a)
          bpf_task_storage_trylock fail
          lockless lookup succeed
          return map_a::ptr
        map_a::ptr += 1 (201)
        bpf_task_storage_delete(&map_a)
          bpf_task_storage_trylock fail
          return -EBUSY
        nr_del_errs++ (1)
        bpf_task_storage_get(&map_b)
          bpf_task_storage_trylock fail
          return NULL
    create and return ptr
  map_b::ptr = 100

New behavior
____________
on_enter
  bpf_task_storage_get(&map_a)
    bpf_local_storage_update(&map_a)
      on_update
        bpf_task_storage_get(&map_a)
          on_update::misses++ (1)
          create and return map_a::ptr
        map_a::ptr += 1 (1)
        bpf_task_storage_delete(&map_a)
          return 0
        bpf_task_storage_get(&map_b)
          on_update::misses++ (2)
          create and return map_b::ptr
        map_b::ptr += 1 (1)
    create and return map_a::ptr
  map_a::ptr = 200
  bpf_task_storage_get(&map_b)
    lockless lookup succeed
    return map_b::ptr

Expected result:
                               Old    New
map_a::ptr                     201    200
map_b::ptr                     100      1
nr_del_err                       1      0
on_update::recursion_misses      0      2
on_enter::recursion_misses       0      0

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-14-ameryhung@gmail.com
2026-02-07 | selftests/bpf: Update sk_storage_omem_uncharge test | Amery Hung | 1 | -9/+3
Check sk_omem_alloc when the caller of bpf_local_storage_destroy() returns. bpf_local_storage_destroy() now returns the amount of memory to uncharge to the caller instead of uncharging it directly. Therefore, in the sk_storage_omem_uncharge test, check sk_omem_alloc when bpf_sk_storage_free() returns instead of when bpf_local_storage_destroy() returns. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-13-ameryhung@gmail.com
2026-02-07 | bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} | Amery Hung | 6 | -41/+39
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}() properly by switching to bpf_selem_unlink_nofail(). Both functions iterate their own RCU-protected list of selems and call bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when both map_free() and destroy() fail to remove a selem from b->list (extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(), also switch to hlist_for_each_entry_rcu() since we no longer iterate local_storage->list under local_storage->lock. bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths so reuse_now should always be false. Remove it from the argument and hardcode it. Acked-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
2026-02-07 | bpf: Support lockless unlink when freeing map or local storage | Amery Hung | 2 | -7/+118
Introduce bpf_selem_unlink_nofail() to properly handle errors returned from rqspinlock in bpf_local_storage_map_free() and bpf_local_storage_destroy() where the operation must succeed.

The idea of bpf_selem_unlink_nofail() is to allow an selem to be partially linked and use atomic operations on a bit field, selem->state, to determine when and who can free the selem if any unlink under lock fails. An selem initially is fully linked to a map and a local storage. Under normal circumstances, bpf_selem_unlink_nofail() will be able to grab locks and unlink a selem from map and local storage in sequence, just like bpf_selem_unlink(), and then free it after an RCU grace period. However, if any of the lock attempts fails, it will only clear SDATA(selem)->smap or selem->local_storage depending on the caller and set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the caller. Then, after both map_free() and destroy() see the selem and the state becomes SELEM_UNLINKED, one of the two racing callers can succeed in cmpxchg-ing the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no double free or memory leak.

To make sure bpf_obj_free_fields() is done only once and when map is still present, it is called when unlinking an selem from b->list under b->lock.

To make sure uncharging memory is done only when the owner is still present in map_free(), block destroy() from returning until there is no pending map_free().

Since smap may not be valid in destroy(), bpf_selem_unlink_nofail() skips bpf_selem_unlink_storage_nolock_misc() when called from destroy(). This is okay as bpf_local_storage_destroy() will return the remaining amount of memory charge tracked by mem_charge to the owner to uncharge. It is also safe to skip clearing local_storage->owner and owner_storage as the owner is being freed and no users or bpf programs should be able to reference the owner or use local_storage.

Finally, accesses to selem, SDATA(selem)->smap and selem->local_storage are racy. Callers will protect these fields with RCU.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
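
The hand-off described in this commit message can be sketched in user space with C11 atomics. This is only an illustration, not the kernel code: the state names come from the message, but their bit encoding, the struct selem_sketch type and the selem_mark_unlinked() helper are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding: the commit message names these states but this
 * log does not show their values. */
enum {
	SELEM_MAP_UNLINKED     = 1 << 0,
	SELEM_STORAGE_UNLINKED = 1 << 1,
	SELEM_UNLINKED         = SELEM_MAP_UNLINKED | SELEM_STORAGE_UNLINKED,
	SELEM_TOFREE           = 1 << 2,
};

/* Stand-in for an selem; state 0 means fully linked. */
struct selem_sketch {
	atomic_int state;
};

/*
 * Called once per side (map_free or destroy) when its lock could not be
 * taken. Records that this side dropped its link, then tries to move the
 * state from SELEM_UNLINKED to SELEM_TOFREE. Only the caller whose
 * compare-and-exchange succeeds may free the selem.
 */
static bool selem_mark_unlinked(struct selem_sketch *s, int side_bit)
{
	int expected = SELEM_UNLINKED;

	atomic_fetch_or(&s->state, side_bit);
	return atomic_compare_exchange_strong(&s->state, &expected,
					      SELEM_TOFREE);
}
```

Whichever side arrives first sees a state with only its own bit set and loses the exchange; the side that completes SELEM_UNLINKED wins it, so in this sketch the free happens exactly once.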
2026-02-07 | bpf: Prepare for bpf_selem_unlink_nofail() | Amery Hung | 2 | -37/+37
The next patch will introduce bpf_selem_unlink_nofail() to handle rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be partially unlinked from map or local storage. Save memory allocation method in selem so that later an selem can be correctly freed even when SDATA(selem)->smap is init to NULL. In addition, keep track of memory charge to the owner in local storage so that later bpf_selem_unlink_nofail() can return the correct memory charge to the owner. Updating local_storage->mem_charge is protected by local_storage->lock. Finally, extract miscellaneous tasks performed when unlinking an selem from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will be reused by bpf_selem_unlink_nofail(). This patch also takes the chance to remove local_storage->smap, which is no longer used since commit f484f4a3e058 ("bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage"). Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
2026-02-07 | bpf: Remove unused percpu counter from bpf_local_storage_map_free | Amery Hung | 6 | -12/+6
Percpu locks have been removed from cgroup and task local storage. Now that all local storage no longer use percpu variables as locks preventing recursion, there is no need to pass them to bpf_local_storage_map_free(). Remove the argument from the function. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
2026-02-07 | bpf: Remove cgroup local storage percpu counter | Amery Hung | 1 | -51/+8
The percpu counter in cgroup local storage is no longer needed as the underlying bpf_local_storage can now handle deadlock with the help of rqspinlock. Remove the percpu counter and related migrate_{disable, enable}. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-8-ameryhung@gmail.com
2026-02-07 | bpf: Remove task local storage percpu counter | Amery Hung | 2 | -136/+18
The percpu counter in task local storage is no longer needed as the underlying bpf_local_storage can now handle deadlock with the help of rqspinlock. Remove the percpu counter and related migrate_{disable, enable}. Since the percpu counter is removed, merge back bpf_task_storage_get() and bpf_task_storage_get_recur(). This will allow the bpf syscalls and helpers to run concurrently on the same CPU, removing the spurious -EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always succeed with enough free memory unless being called recursively. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-7-ameryhung@gmail.com
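
To see why the removed counter could fail without any recursion, here is a rough user-space model. It assumes the counter behaved like a per-CPU trylock; the array busy_sketch and both function names are made up for the illustration and are not the kernel's.

```c
#include <stdbool.h>

#define NR_CPUS_SKETCH 2	/* e.g. the 2-CPU CI VM from the cover letter */

/* One "busy" counter per CPU, modelling the removed percpu variable. */
static int busy_sketch[NR_CPUS_SKETCH];

/* Succeeds only if nothing else on this CPU is inside the helper. */
static bool task_storage_trylock_sketch(int cpu)
{
	if (++busy_sketch[cpu] != 1) {
		busy_sketch[cpu]--;
		return false;	/* caller would report -EBUSY */
	}
	return true;
}

static void task_storage_unlock_sketch(int cpu)
{
	busy_sketch[cpu]--;
}
```

If thread A takes the counter on CPU 0 and is preempted, an unrelated thread B scheduled on CPU 0 fails the trylock although no recursion is involved; the same call on CPU 1 would have succeeded.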
2026-02-07 | bpf: Change local_storage->lock and b->lock to rqspinlock | Amery Hung | 2 | -22/+47
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock from raw_spin_lock to rqspinlock. Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall return or BPF helper return. In bpf_local_storage_destroy(), ignore the return value of raw_res_spin_lock_irqsave() for now. A later patch will handle errors correctly in bpf_local_storage_destroy() so that it can unlink selems even when failing to acquire locks. For __bpf_local_storage_map_cache(), instead of handling the error, skip updating the cache. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
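
The error-propagation pattern can be modelled with a toy lock that reports failure instead of spinning forever. Everything below is a simplified stand-in: res_lock_sketch is not rqspinlock, and the owner check only imitates its deadlock detection in a single thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that remembers which context holds it. */
struct res_lock_sketch {
	bool held;
	int owner;
};

/* Returns 0 on success or a negative errno, like a failable lock. */
static int res_lock(struct res_lock_sketch *l, int ctx)
{
	if (l->held && l->owner == ctx)
		return -EDEADLK;	/* same context re-entering: AA */
	if (l->held)
		return -ETIMEDOUT;	/* stand-in for giving up on a waiter */
	l->held = true;
	l->owner = ctx;
	return 0;
}

static void res_unlock(struct res_lock_sketch *l)
{
	l->held = false;
}

/*
 * Hypothetical update path. When nested_err is non-NULL, it simulates a
 * tracing program re-entering the same update while the lock is held and
 * stores what the nested call returned.
 */
static int storage_update_sketch(struct res_lock_sketch *l, int ctx,
				 int *nested_err)
{
	int err = res_lock(l, ctx);

	if (err)
		return err;	/* propagate instead of deadlocking */
	if (nested_err)
		*nested_err = storage_update_sketch(l, ctx, NULL);
	res_unlock(l);
	return 0;
}
```

In this model the outer update completes normally while the nested one returns -EDEADLK to its caller, which is the shape of behavior the selftest updates in this series check for.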
2026-02-07 | bpf: Convert bpf_selem_unlink to failable | Amery Hung | 6 | -49/+39
To prepare for changing both bpf_local_storage_map_bucket::lock and bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Open code bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since unlink_map and unlink_storage must be done together after all the necessary locks are acquired. For bpf_local_storage_map_free(), ignore the return from bpf_selem_unlink() for now. A later patch will allow it to unlink selems even when failing to acquire locks. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
2026-02-07 | bpf: Convert bpf_selem_link_map to failable | Amery Hung | 3 | -7/+16
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_link_map() to failable. It still always succeeds and returns 0 until the change happens. No functional change. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
2026-02-07 | bpf: Convert bpf_selem_unlink_map to failable | Amery Hung | 1 | -18/+39
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock, convert bpf_selem_unlink_map() to failable. It still always succeeds and returns 0 for now.

Since some operations updating local storage cannot fail in the middle, open-code bpf_selem_unlink_map() to take the b->lock before the operation. There are two such locations:

- bpf_local_storage_alloc()

  The first selem will be unlinked from smap if cmpxchg owner_storage_ptr fails, and that unlink must not fail. Therefore, hold b->lock when linking until allocation completes. Helpers that assume b->lock is held by callers are introduced: bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock().

- bpf_local_storage_update()

  The three step update process: link_map(new_selem), link_storage(new_selem), and unlink_map(old_selem) should not fail in the middle.

In bpf_selem_unlink(), bpf_selem_unlink_map() and bpf_selem_unlink_storage() should either all succeed or fail as a whole instead of failing in the middle. So, return if unlink_map() failed. Remove the selem_linked_to_map_lockless() check as an selem in the common paths (not bpf_local_storage_map_free() or bpf_local_storage_destroy()) will be unlinked under b->lock and local_storage->lock and therefore no other threads can unlink the selem from map at the same time.

In bpf_local_storage_destroy(), ignore the return of bpf_selem_unlink_map() for now. A later patch will allow bpf_local_storage_destroy() to unlink selems even when failing to acquire locks.

Note that while this patch removes all callers of selem_linked_to_map(), a later patch that introduces bpf_selem_unlink_nofail() will use it again.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com
2026-02-07 | bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage | Amery Hung | 3 | -7/+13
A later bpf_local_storage refactor will acquire all locks before performing any update. To reduce the number of locks that need to be taken in bpf_local_storage_map_update(), determine the bucket based on the local_storage an selem belongs to instead of the selem pointer. Currently, when a new selem needs to be created to replace the old selem in bpf_local_storage_map_update(), locks of both buckets need to be acquired to prevent racing. This can be simplified if the two selems belong to the same bucket so that only one bucket needs to be locked. Therefore, instead of hashing the selem, hash the local_storage pointer the selem belongs to. Performance wise, this is slightly better as update now requires locking one bucket. It should not change the level of contention on one bucket as the pointers to local storages of selems in a map are just as unique as pointers to selems. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
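
A small user-space sketch of the bucket choice, assuming a multiplicative pointer hash in the spirit of the kernel's hash_ptr(). The bucket count, the struct layouts and the function names are illustrative only.

```c
#include <stdint.h>

#define BUCKET_BITS_SKETCH 4	/* hypothetical: 16 buckets */

/* Multiplicative hash of a pointer using the 64-bit golden-ratio constant. */
static unsigned int bucket_of(const void *ptr)
{
	uint64_t v = (uint64_t)(uintptr_t)ptr;

	return (unsigned int)((v * 0x61C8864680B583EBull) >>
			      (64 - BUCKET_BITS_SKETCH));
}

struct local_storage_sketch {
	int unused;
};

struct selem_sketch {
	struct local_storage_sketch *local_storage;
};

/* Old scheme: the bucket depends on the selem's own address, so an old
 * selem and its replacement can land in different buckets. */
static unsigned int bucket_by_selem(const struct selem_sketch *selem)
{
	return bucket_of(selem);
}

/* New scheme: the bucket depends on the owning local_storage, so an old
 * selem and its replacement always share one bucket. */
static unsigned int bucket_by_storage(const struct selem_sketch *selem)
{
	return bucket_of(selem->local_storage);
}
```

With the second function, replacing an selem inside one local_storage never needs a second bucket lock, which is the simplification the message describes.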
2026-02-06 | Merge branch 'fix-some-corner-cases-in-xskxceiver' | Alexei Starovoitov | 1 | -1/+3
Larysa Zaremba says: ==================== Fix some corner cases in xskxceiver While working on XDP and AF_XDP support for ixgbevf driver, I came across two distinct problems that caused tests to fail when they shouldn't have. ==================== Link: https://patch.msgid.link/20260203155103.2305816-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-06 | selftests/xsk: fix number of Tx frags in invalid packet | Larysa Zaremba | 1 | -1/+1
The issue occurs in TOO_MANY_FRAGS test case when xdp_zc_max_segs is set to an odd number. TOO_MANY_FRAGS test case contains an invalid packet consisting of (xdp_zc_max_segs) frags. Every frag, even the last one has XDP_PKT_CONTD flag set. This packet is expected to be dropped. After that, there is a valid linear packet, which is expected to be received back. Once (xdp_zc_max_segs) is an odd number, the last packet cannot be received, if packet forwarding between Rx and Tx interfaces relies on the ethernet header, e.g. checks for ETH_P_LOOPBACK. Packet is malformed, if all traffic is looped. Turns out, sending function processes multiple invalid frags as if they were in 2-frag packets. So once the invalid mbuf packet contains an odd number of those, the valid packet after gets paired with the previous invalid descriptor, and hence does not get an ethernet header generated, so it is either dropped or malformed. Make invalid packets in verbatim mode always have only a single frag. For such packets, number of frags is otherwise meaningless, as descriptor flags are pre-configured in verbatim mode and packet data is not generated for invalid descriptors. Fixes: 697604492b64 ("selftests/xsk: add invalid descriptor test for multi-buffer") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-3-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-06 | selftests/xsk: properly handle batch ending in the middle of a packet | Larysa Zaremba | 1 | -0/+2
Referenced commit reduced the scope of the variable pkt, so now it has to be reinitialized via pkt_stream_get_next_rx_pkt(), which also increments some counters. When the packet is interrupted by the batch ending, pkt stream therefore proceeds to the next packet, while xsk ring still contains the previous one, this results in a pkt_nb mismatch. Decrement the affected counters when packet is interrupted. Fixes: 8913e653e9b8 ("selftests/xsk: Iterate over all the sockets in the receive pkts function") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20260203155103.2305816-2-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05 | bpf: Prevent reentrance into call_rcu_tasks_trace() | Alexei Starovoitov | 1 | -1/+13
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant. To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption defer to irq_work when IRQs are disabled. call_rcu_tasks_generic() protects itself with local_irq_save(). Note when bpf_async_cb->refcnt drops to zero it's safe to reuse bpf_async_cb->worker for a different irq_work callback, since bpf_async_schedule_op() -> irq_work_queue(&cb->worker); is only called when refcnt >= 1. Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com
2026-02-05 | bpf: Require frozen map for calculating map hash | KP Singh | 1 | -0/+3
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the map regardless of the map's frozen state. This leads to a TOCTOU bug where userspace can call BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map contents before freezing. Therefore, a trusted loader can be tricked into verifying the stale hash while loading the modified contents. Fix this by returning -EPERM if the map is not frozen when the hash is requested. This ensures the hash is only generated for the final, immutable state of the map. Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD") Reported-by: Toshi Piazza <toshi.piazza@microsoft.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
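
The check can be pictured with a toy model. The struct, the function names and the FNV-1a digest below are stand-ins chosen for the example; only the rule "refuse with -EPERM unless the map is frozen" comes from the commit message.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a map whose contents can be hashed. */
struct map_sketch {
	bool frozen;
	const uint8_t *data;
	size_t len;
};

/* Placeholder digest (FNV-1a), used here only to have something to hash. */
static uint64_t digest_sketch(const uint8_t *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ull;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ull;
	}
	return h;
}

/* Hash only the final, immutable contents; refuse while still writable. */
static int map_get_hash_sketch(const struct map_sketch *map, uint64_t *out)
{
	if (!map->frozen)
		return -EPERM;
	*out = digest_sketch(map->data, map->len);
	return 0;
}
```

Because the hash is never produced for a writable map, a loader cannot be handed a digest that predates a later modification.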
2026-02-05 | bpf: Limit bpf program signature size | KP Singh | 1 | -0/+7
Practical BPF signatures are significantly smaller than KMALLOC_MAX_CACHE_SIZE. Allowing larger sizes opens the door for abuse by passing excessive size values and forcing the kernel into expensive allocation paths (via kmalloc_large or vmalloc). Fixes: 349271568303 ("bpf: Implement signature verification for BPF programs") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
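
A bound check of this kind might look like the following sketch. The limit value and the -EINVAL return are assumptions for illustration; the commit message only says that sizes above KMALLOC_MAX_CACHE_SIZE should not be accepted.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical limit standing in for KMALLOC_MAX_CACHE_SIZE. */
#define SIG_SIZE_LIMIT_SKETCH 8192u

/* Reject oversized signature lengths before any allocation is attempted.
 * The error code is an assumption, not taken from the patch. */
static int check_signature_size_sketch(size_t size)
{
	if (size > SIG_SIZE_LIMIT_SKETCH)
		return -EINVAL;
	return 0;
}
```

Doing the check before allocating keeps a caller-supplied length from steering the kernel into the large-allocation paths the message mentions.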
2026-02-05Merge branch 'fix-for-bpf_wq-retry-loop-during-free'Alexei Starovoitov1-1/+2
Kumar Kartikeya Dwivedi says: ==================== Fix for bpf_wq retry loop during free Small fix and improvement to ensure cancel_work() can handle the case where wq callback is running, and doesn't lead to call_rcu_tasks_trace() repeatedly after failing cancel_work, if wq callback is not pending. ==================== Link: https://patch.msgid.link/20260205003853.527571-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05bpf: Reset prog callback in bpf_async_cancel_and_free()Kumar Kartikeya Dwivedi1-0/+1
Replace prog and callback in bpf_async_cb after removing visibility of bpf_async_cb in bpf_async_cancel_and_free(), to increase the chances that scheduled async callbacks short-circuit and exit early without starting an RCU tasks trace section. This reduces the overall time spent running the wq selftest. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260205003853.527571-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05bpf: Check for running wq callback when freeing bpf_async_cbKumar Kartikeya Dwivedi1-1/+1
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), in case the wq callback is not scheduled, doing cancel_work() currently returns false and leads to retry of RCU tasks trace grace period. If the callback is never scheduled, we keep retrying indefinitely and don't put the prog reference. Since the only race we care about here is against a potentially running wq callback in the first grace period, it should finish by the second grace period, hence check work_busy() result to detect presence of running wq callback if it's not pending, otherwise free the object immediately without retrying. Reasoning behind the check and its correctness with racing wq callback invocation: cancel_work is supposed to be synchronized, hence calling it first and getting false would mean that work is definitely not pending, at this point, either the work is not scheduled at all or already running, or we race and it already finished by the time we checked for it using work_busy(). In case it is running, we synchronize using pool->lock to check the current work running there, if we match, it means we extend the wait by another grace period using retry = true, otherwise either the work already finished running or was never scheduled, so we can free the bpf_async_cb right away. Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
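The free-or-retry reasoning above can be written down as a small decision function. This is a sketch of the logic as the commit message describes it, not the kernel code: the two booleans stand in for the results of cancel_work() and work_busy(), and the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical outcome of the free path for a bpf_async_cb. */
enum free_action {
	FREE_NOW,       /* safe to free the object immediately */
	RETRY_AFTER_GP, /* wait for one more RCU tasks trace grace period */
};

/*
 * cancelled: models cancel_work() returning true (pending work was removed).
 * running:   models work_busy() reporting that the callback is executing.
 */
static enum free_action wq_free_decision(bool cancelled, bool running)
{
	if (cancelled)
		return FREE_NOW;       /* it was pending and is now cancelled */
	if (running)
		return RETRY_AFTER_GP; /* callback in flight: extend the wait */
	return FREE_NOW;               /* never scheduled, or already finished */
}
```

The last case is the one the fix changes: work that was never scheduled no longer causes an endless retry loop.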
2026-02-05Merge branch 'bpf-improve-linked-register-tracking'Alexei Starovoitov4-15/+346
Puranjay Mohan says: ==================== bpf: Improve linked register tracking V3: https://lore.kernel.org/all/20260203222643.994713-1-puranjay@kernel.org/ Changes in v3->v4: - Add a call to reg_bounds_sync() in sync_linked_regs() to sync bounds after alu op. - Add a __sink(path[0]); in the C selftest so compiler doesn't say "error: variable 'path' set but not used" V2: https://lore.kernel.org/all/20260113152529.3217648-1-puranjay@kernel.org/ Changes in v2->v3: - Added another selftest showing a real usage pattern - Rebased on bpf-next/master v1: https://lore.kernel.org/bpf/20260107203941.1063754-1-puranjay@kernel.org/ Changes in v1->v2: - Add support for alu32 operations in linked register tracking (Alexei) - Squash the selftest fix with the first patch (Eduard) - Add more selftests to detect edge cases This series extends the BPF verifier's linked register tracking to handle negative offsets, BPF_SUB operations, and alu32 operations, enabling better bounds propagation for common arithmetic patterns. The verifier previously only tracked positive constant deltas between linked registers using BPF_ADD. 
This meant patterns using negative offsets or subtraction couldn't benefit from bounds propagation: void alu32_negative_offset(void) { volatile char path[5]; volatile int offset = bpf_get_prandom_u32(); int off = offset; if (off >= 5 && off < 10) path[off - 5] = '.'; } this gets compiled to: 0000000000000478 <alu32_negative_offset>: 143: call 0x7 144: *(u32 *)(r10 - 0xc) = w0 145: w1 = *(u32 *)(r10 - 0xc) 146: w2 = w1 // w2 and w1 share the same id 147: w2 += -0x5 // verifier knows w1 = w2 + 5 148: if w2 > 0x4 goto +0x5 <L0> // in fall-through: verifier knows w2 ∈ [0,4] => w1 ∈ [5, 9] 149: r2 = r10 150: r2 += -0x5 // r2 = fp - 5 151: r2 += r1 // r2 = fp - 5 + r1 (∈ [5, 9]) => r2 ∈ [fp, fp + 4] 152: w1 = 0x2e 153: *(u8 *)(r2 - 0x5) = w1 // r2 ∈ [fp, fp + 4] => r2 - 5 ∈ [fp - 5, fp - 1] <L0>: 154: exit After the changes, the verifier could link 32-bit scalars and also supported -ve offsets for linking: 146: w2 = w1 147: w2 += -0x5 It allowed the verifier to correctly propagate bounds, without the changes in this patchset, verifier would reject this program with: invalid unbounded variable-offset write to stack R2 This program has been added as a selftest in the second patch. 
Veristat comparison on programs from sched_ext, selftests, and some meta internal programs:

Scx Progs
File               Program           Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns (DIFF)
-----------------  ----------------  -----------  -----------  --------------  ---------  ---------  -------------
scx_layered.bpf.o  layered_runnable  success      success      MATCH                5674       6077  +403 (+7.10%)

FB Progs
File          Program           Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns (DIFF)
------------  ----------------  -----------  -----------  --------------  ---------  ---------  -----------------
bpf232.bpf.o  layered_dump      success      success      MATCH                1151       1218  +67 (+5.82%)
bpf257.bpf.o  layered_runnable  success      success      MATCH                5743       6143  +400 (+6.97%)
bpf252.bpf.o  layered_runnable  success      success      MATCH                5677       6075  +398 (+7.01%)
bpf227.bpf.o  layered_dump      success      success      MATCH                 915        982  +67 (+7.32%)
bpf239.bpf.o  layered_runnable  success      success      MATCH                5459       5861  +402 (+7.36%)
bpf246.bpf.o  layered_runnable  success      success      MATCH                5562       6008  +446 (+8.02%)
bpf229.bpf.o  layered_runnable  success      success      MATCH                2559       3011  +452 (+17.66%)
bpf231.bpf.o  layered_runnable  success      success      MATCH                2559       3011  +452 (+17.66%)
bpf234.bpf.o  layered_runnable  success      success      MATCH                2549       3001  +452 (+17.73%)
bpf019.bpf.o  do_sendmsg        success      success      MATCH              124823     153523  +28700 (+22.99%)
bpf019.bpf.o  do_parse          success      success      MATCH              124809     153509  +28700 (+23.00%)
bpf227.bpf.o  layered_runnable  success      success      MATCH                1915       2356  +441 (+23.03%)
bpf228.bpf.o  layered_runnable  success      success      MATCH                1700       2152  +452 (+26.59%)
bpf232.bpf.o  layered_runnable  success      success      MATCH                1499       1951  +452 (+30.15%)
bpf312.bpf.o  mount_exit        success      success      MATCH               19253      62883  +43630 (+226.61%)
bpf312.bpf.o  umount_exit       success      success      MATCH               19253      62883  +43630 (+226.61%)
bpf311.bpf.o  mount_exit        success      success      MATCH               19226      62863  +43637 (+226.97%)
bpf311.bpf.o  umount_exit       success      success      MATCH               19226      62863  +43637 (+226.97%)

The above four programs have specific patterns that make
the verifier explore a lot more states:

    for (; depth < MAX_DIR_DEPTH; depth++) {
        const unsigned char* name = BPF_CORE_READ(dentry, d_name.name);
        if (offset >= MAX_PATH_LEN - MAX_DIR_LEN) {
            return depth;
        }
        int len = bpf_probe_read_kernel_str(&path[offset], MAX_DIR_LEN, name);
        offset += len;
        if (len == MAX_DIR_LEN) {
            if (offset - 2 < MAX_PATH_LEN) { // <---- (a)
                path[offset - 2] = '.';
            }
            if (offset - 3 < MAX_PATH_LEN) { // <---- (b)
                path[offset - 3] = '.';
            }
            if (offset - 4 < MAX_PATH_LEN) { // <---- (c)
                path[offset - 4] = '.';
            }
        }
    }

When at some depth == N the false branches of conditions (a), (b) and (c) are scheduled for verification, the constraints for offset at depth == N+1 are:

  1. offset >= MAX_PATH_LEN + 2
  2. offset >= MAX_PATH_LEN + 3
  3. offset >= MAX_PATH_LEN + 4 (visited before others)

And after offset += len it becomes:

  1. offset >= MAX_PATH_LEN - 4093
  2. offset >= MAX_PATH_LEN - 4092
  3. offset >= MAX_PATH_LEN - 4091 (visited before others)

Because of the DFS state exploration logic, the states above are visited in order 3, 2, 1; state 2 is not a subset of state 3 and state 1 is not a subset of state 2, so the pruning logic does not kick in. Previously this was not a problem, because the range for offset was not propagated through the statements (a), (b), (c). As the root cause of this regression is understood, this is not a blocker for this change.
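The pruning argument can be checked with a small range-containment model. This is a sketch, not the verifier's implementation: `struct range`, `at_least()` and `range_is_subset()` are hypothetical helpers, and `TOY_MAX_PATH_LEN` is an assumed value chosen only to make the numbers concrete. A state reached later can be pruned only when its range for offset lies inside the range of a state that was already verified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the real MAX_PATH_LEN is not given. */
#define TOY_MAX_PATH_LEN 8192

/* Closed interval [min, max] of possible values for `offset`. */
struct range {
	int64_t min;
	int64_t max;
};

/* Range for a constraint of the form "offset >= lo". */
static struct range at_least(int64_t lo)
{
	struct range r = { lo, INT64_MAX };
	return r;
}

/* True if every value allowed by `cur` is also allowed by `old`. */
static bool range_is_subset(struct range cur, struct range old)
{
	return cur.min >= old.min && cur.max <= old.max;
}
```

With s3 = at_least(TOY_MAX_PATH_LEN - 4091), s2 = at_least(TOY_MAX_PATH_LEN - 4092) and s1 = at_least(TOY_MAX_PATH_LEN - 4093), visiting in the order 3, 2, 1 means each new state is wider than the one verified before it, so range_is_subset(s2, s3) and range_is_subset(s1, s2) are both false and neither state is pruned. Visiting in the opposite order would have allowed pruning, since range_is_subset(s3, s2) is true.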
Selftest Progs
File                                Program                   Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns (DIFF)
----------------------------------  ------------------------  -----------  -----------  --------------  ---------  ---------  --------------
linked_list_peek.bpf.o              list_peek                 success      success      MATCH                 152         88  -64 (-42.11%)
verifier_iterating_callbacks.bpf.o  cond_break2               success      success      MATCH                 110         88  -22 (-20.00%)

These are the added selftests that failed earlier but are passing now:

verifier_linked_scalars.bpf.o       alu32_negative_offset     failure      success      MISMATCH               11         13  +2 (+18.18%)
verifier_linked_scalars.bpf.o       scalars_alu32_big_offset  failure      success      MISMATCH                7         10  +3 (+42.86%)
verifier_linked_scalars.bpf.o       scalars_neg_alu32_add     failure      success      MISMATCH                7         10  +3 (+42.86%)
verifier_linked_scalars.bpf.o       scalars_neg_alu32_sub     failure      success      MISMATCH                7         10  +3 (+42.86%)
verifier_linked_scalars.bpf.o       scalars_neg               failure      success      MISMATCH                7         10  +3 (+42.86%)
verifier_linked_scalars.bpf.o       scalars_neg_sub           failure      success      MISMATCH                7         10  +3 (+42.86%)
verifier_linked_scalars.bpf.o       scalars_sub_neg_imm       failure      success      MISMATCH                7         10  +3 (+42.86%)
iters.bpf.o                         iter_obfuscate_counter    success      success      MATCH                  83        119  +36 (+43.37%)
bpf_cubic.bpf.o                     bpf_cubic_acked           success      success      MATCH                 243        430  +187 (+76.95%)

==================== Link: https://patch.msgid.link/20260204151741.2678118-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05selftests/bpf: Add tests for improved linked register trackingPuranjay Mohan1-2/+301
Add tests for linked register tracking with negative offsets, BPF_SUB, and alu32. These test for all edge cases like overflows, etc. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260204151741.2678118-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05bpf: Support negative offsets, BPF_SUB, and alu32 for linked register trackingPuranjay Mohan3-13/+45
Previously, the verifier only tracked positive constant deltas between linked registers using BPF_ADD. This limitation meant patterns like: r1 = r0; r1 += -4; if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4 // verifier couldn't propagate bounds back to r0 if r0 != 0 goto l0_%=; r0 /= 0; // Verifier thinks this is reachable l0_%=: Similar limitation exists for 32-bit registers. With this change, the verifier can now track negative deltas in reg->off enabling bound propagation for the above pattern. For alu32, we make sure the destination register has the upper 32 bits as 0s before creating the link. BPF_ADD_CONST is split into BPF_ADD_CONST64 and BPF_ADD_CONST32, the latter is used in case of alu32 and sync_linked_regs uses this to zext the result if known_reg has this flag. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260204151741.2678118-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
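The bounds propagation described above amounts to shifting an interval by the known delta. The sketch below is a simplified model, not the verifier's sync_linked_regs(): it tracks only signed bounds, `struct bounds` and `propagate_linked()` are hypothetical names, and it uses the GCC/Clang builtin `__builtin_sub_overflow` to refuse propagation when the shift would overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified register state: signed bounds only. */
struct bounds {
	int64_t smin;
	int64_t smax;
};

/*
 * Model of two linked registers with known = linked + delta.
 * Given refined bounds for `known`, derive bounds for `linked`
 * as [smin - delta, smax - delta]. Returns false (and leaves
 * *linked untouched) if the subtraction would overflow.
 */
static bool propagate_linked(struct bounds known, int64_t delta,
			     struct bounds *linked)
{
	int64_t lo, hi;

	if (__builtin_sub_overflow(known.smin, delta, &lo) ||
	    __builtin_sub_overflow(known.smax, delta, &hi))
		return false;

	linked->smin = lo;
	linked->smax = hi;
	return true;
}
```

For the alu32 example from the cover letter, w2 = w1 + (-5) with w2 in [0, 4] gives propagate_linked({0, 4}, -5, &w1), which yields [5, 9] for w1. The real verifier additionally handles unsigned and 32-bit bounds and zero-extension, none of which is modelled here.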
2026-02-05Merge branch 'bpf-add-bitwise-tracking-for-bpf_end'Alexei Starovoitov4-3/+121
Tianci Cao says: ==================== bpf: Add bitwise tracking for BPF_END Add bitwise tracking (tnum analysis) for BPF_END (`bswap(16|32|64)`, `be(16|32|64)`, `le(16|32|64)`) operations. Please see commit log of 1/2 for more details. v3: - Resend to fix a version control error in v2. - The rest of the changes are identical to v2. v2 (incorrect): https://lore.kernel.org/bpf/20260204091146.52447-1-ziye@zju.edu.cn/ - Refactored selftests using BSWAP_RANGE_TEST macro to eliminate code duplication and improve maintainability. (Eduard) - Simplified test names. (Eduard) - Reduced excessive comments in test cases. (Eduard) - Added more comments to explain BPF_END's special handling of zext_32_to_64. v1: https://lore.kernel.org/bpf/20260202133536.66207-1-ziye@zju.edu.cn/ ==================== Link: https://patch.msgid.link/20260204111503.77871-1-ziye@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-05selftests/bpf: Add tests for BPF_END bitwise trackingTianci Cao1-0/+43
Now BPF_END has bitwise tracking support. This patch adds selftests to cover various cases of BPF_END (`bswap(16|32|64)`, `be(16|32|64)`, `le(16|32|64)`) with bitwise propagation. This patch is based on existing `verifier_bswap.c`, and add several types of new tests: 1. Unconditional byte swap operations: - bswap16/bswap32/bswap64 with unknown bytes 2. Endian conversion operations (architecture-aware): - be16/be32/be64: convert to big-endian * on little-endian: do swap * on big-endian: truncation (16/32-bit) or no-op (64-bit) - le16/le32/le64: convert to little-endian * on big-endian: do swap * on little-endian: truncation (16/32-bit) or no-op (64-bit) Each test simulates realistic networking scenarios where a value is masked with unknown bits (e.g., var_off=(0x0; 0x3f00), range=[0,0x3f00]), then byte-swapped, and the verifier must prove the result stays within expected bounds. Specifically, these selftests are based on dead code elimination: If the BPF verifier can precisely track bitwise through byte swap operations, it can prune the trap path (invalid memory access) that should be unreachable, allowing the program to pass verification. If bitwise tracking is incorrect, the verifier cannot prove the trap is unreachable, causing verification failure. The tests use preprocessor conditionals (#ifdef __BYTE_ORDER__) to verify correct behavior on both little-endian and big-endian architectures, and require Clang 18+ for bswap instruction support. Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260204111503.77871-3-ziye@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
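The architecture-dependent behaviour listed above (swap on one endianness, truncation on the other) can be illustrated for the 16-bit case with a userspace model. This is a sketch with hypothetical function names (`model_bswap16`, `model_be16`, `model_le16`); it relies on the `__BYTE_ORDER__` macros provided by GCC and Clang and operates on plain integers rather than BPF registers.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* bswap16: unconditional swap of the low 16 bits, zero-extended. */
static uint64_t model_bswap16(uint64_t x)
{
	return swap16((uint16_t)x);
}

/* be16: swap on little-endian hosts, plain truncation on big-endian. */
static uint64_t model_be16(uint64_t x)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return swap16((uint16_t)x);
#else
	return (uint16_t)x;
#endif
}

/* le16: swap on big-endian hosts, plain truncation on little-endian. */
static uint64_t model_le16(uint64_t x)
{
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	return swap16((uint16_t)x);
#else
	return (uint16_t)x;
#endif
}
```

On any host exactly one of model_be16() and model_le16() swaps while the other only truncates, which is why the selftests need per-endianness expectations.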
2026-02-05bpf: Add bitwise tracking for BPF_ENDTianci Cao3-3/+78
This patch implements bitwise tracking (tnum analysis) for BPF_END (byte swap) operation. Currently, the BPF verifier does not track value for BPF_END operation, treating the result as completely unknown. This limits the verifier's ability to prove safety of programs that perform endianness conversions, which are common in networking code. For example, the following code pattern for port number validation: int test(struct pt_regs *ctx) { __u64 x = bpf_get_prandom_u32(); x &= 0x3f00; // Range: [0, 0x3f00], var_off: (0x0; 0x3f00) x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f) if (x > 0x3f) goto trap; return 0; trap: return *(u64 *)NULL; // Should be unreachable } Currently generates verifier output: 1: (54) w0 &= 16128 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=16128,var_off=(0x0; 0x3f00)) 2: (d7) r0 = bswap16 r0 ; R0=scalar() 3: (25) if r0 > 0x3f goto pc+2 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) Without this patch, even though the verifier knows `x` has certain bits set, after bswap16, it loses all tracking information and treats port as having a completely unknown value [0, 65535]. According to the BPF instruction set[1], there are 3 kinds of BPF_END: 1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE) - do unconditional swap 2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE) - on big-endian: do swap - on little-endian: truncation (16/32-bit) or no-op (64-bit) 3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE) - on little-endian: do swap - on big-endian: truncation (16/32-bit) or no-op (64-bit) Since BPF_END operations are inherently bit-wise permutations, tnum (bitwise tracking) offers the most efficient and precise mechanism for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`, and `tnum_bswap64`, we can derive exact `var_off` values concisely, directly reflecting the bit-level changes. Here is the overview of changes: 1. 
In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c): Call `swab(16|32|64)` function on the value and mask of `var_off`, and do truncation for 16/32-bit cases. 2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c): Call helper function `scalar_byte_swap`. - Only do byte swap when * alu64 (unconditional swap) OR * switching between big-endian and little-endian machines. - If need do byte swap: * Firstly call `tnum_bswap(16|32|64)` to update `var_off`. * Then reset the bound since byte swap scrambles the range. - For 16/32-bit cases, truncate dst register to match the swapped size. This enables better verification of networking code that frequently uses byte swaps for protocol processing, reducing false positive rejections. [1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260204111503.77871-2-ziye@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
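The 16-bit case of the approach described above can be sketched in userspace C. This is an illustration under assumptions rather than the kernel's code: `struct tnum` here is a local two-field model (known bits in `value`, unknown bits in `mask`), and `sketch_tnum_bswap16()` and `tnum_umax()` are hypothetical names. Because a byte swap only permutes bits, applying the same swap to both fields gives the exact result.

```c
#include <stdint.h>

/* Local model of a tristate number: value = known-one bits, mask = unknown bits. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Byte-swap value and mask alike; truncation to 16 bits is implied by the cast. */
static struct tnum sketch_tnum_bswap16(struct tnum a)
{
	struct tnum r = {
		.value = swap16((uint16_t)a.value),
		.mask = swap16((uint16_t)a.mask),
	};
	return r;
}

/* Largest value the tnum can represent: every unknown bit set. */
static uint64_t tnum_umax(struct tnum a)
{
	return a.value | a.mask;
}
```

For the port-validation example, the input var_off (0x0; 0x3f00) becomes (0x0; 0x3f), so the largest possible value after the swap is 0x3f and the `x > 0x3f` branch is provably dead.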
2026-02-05Merge branch 'bpf-fix-conditions-when-timer-wq-can-be-called'Andrii Nakryiko3-6/+128
Alexei Starovoitov says: ==================== bpf: Fix conditions when timer/wq can be called From: Alexei Starovoitov <ast@kernel.org> v2->v3: - Add missing refcount_put - Detect recursion of individual async_cb v2: https://lore.kernel.org/bpf/20260204040834.22263-4-alexei.starovoitov@gmail.com/ v1->v2: - Add a recursion check v1: https://lore.kernel.org/bpf/20260204030927.171-1-alexei.starovoitov@gmail.com/ ==================== Link: https://patch.msgid.link/20260204055147.54960-1-alexei.starovoitov@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2026-02-05selftests/bpf: Strengthen timer_start_deadlock testAlexei Starovoitov1-7/+2
Strengthen the timer_start_deadlock test so that it now also checks for recursion. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260204055147.54960-5-alexei.starovoitov@gmail.com
2026-02-05bpf: Add a recursion check to prevent loops in bpf_timerAlexei Starovoitov1-0/+16
Do not schedule timer/wq operation on a cpu that is in irq_work callback that is processing async_cmds queue. Otherwise the following loop is possible: bpf_timer_start() -> bpf_async_schedule_op() -> irq_work_queue(). irqrestore -> bpf_async_irq_worker() -> tracepoint -> bpf_timer_start(). Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260204055147.54960-4-alexei.starovoitov@gmail.com
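The loop described above, and the guard that breaks it, can be modelled in single-threaded userspace C. Everything here is a hypothetical sketch: the flag would be per-CPU state in the kernel, the function names are invented for the model, and -EDEADLK is an assumed error code, since the commit message does not say what the real check returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Model of "this CPU is inside the irq_work callback draining the queue". */
static bool in_async_worker;
/* Number of queued async operations waiting for the worker. */
static int queued_ops;

/* Model of scheduling a timer/wq operation for deferred execution. */
static int model_schedule_op(void)
{
	if (in_async_worker)
		return -EDEADLK; /* assumed error code: refuse to re-queue */
	queued_ops++;
	return 0;
}

/*
 * Model of the irq_work callback. `hook` stands in for a tracepoint
 * program that fires while the queue is processed and may try to
 * schedule again. Returns the hook's last result.
 */
static int model_irq_worker(int (*hook)(void))
{
	int hook_ret = 0;

	in_async_worker = true;
	while (queued_ops > 0) {
		queued_ops--;
		if (hook)
			hook_ret = hook();
	}
	in_async_worker = false;
	return hook_ret;
}
```

Without the flag, passing model_schedule_op as the hook would re-queue an operation on every iteration and the worker would never drain the queue; with it, the nested attempt fails and the loop terminates.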
2026-02-05selftests/bpf: Add a testcase for deadlock avoidanceAlexei Starovoitov2-0/+108
Add a testcase that checks that deadlock avoidance is working as expected. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260204055147.54960-3-alexei.starovoitov@gmail.com
2026-02-05bpf: Tighten conditions when timer/wq can be called synchronouslyAlexei Starovoitov1-7/+10
Though hrtimer_start/cancel() inlines all of the smaller helpers in hrtimer.c and only calls timerqueue_add/del() from lib/timerqueue.c, where everything is not traceable and not kprobe-able (because all files in lib/ are not traceable), there are tracepoints within hrtimer that are called with locks held. Therefore prevent the deadlock by tightening the conditions under which timer/wq can be called synchronously. hrtimer/wq use raw_spin_lock_irqsave(), so checking irqs_disabled() is enough. Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260204055147.54960-2-alexei.starovoitov@gmail.com
2026-02-04resolve_btfids: Refactor the sort_btf_by_name functionDonglin Peng1-7/+11
Preserve original relative order of anonymous or same-named types to improve the consistency. No functional changes. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260202120114.3707141-1-dolinux.peng@gmail.com
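One common way to keep the original relative order of equal keys with qsort(), which is not guaranteed to be stable, is to record each element's starting index and use it as a tie-breaker. The sketch below shows that technique; it is an assumption about how such ordering can be achieved, not the code in resolve_btfids, and `struct named_type` and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: a type name plus bookkeeping for stable ordering. */
struct named_type {
	const char *name;      /* "" models an anonymous type */
	int tag;               /* payload used only to identify entries */
	unsigned int orig_idx; /* position before sorting */
};

/* Order by name; fall back to the original index when names are equal. */
static int cmp_name_then_index(const void *a, const void *b)
{
	const struct named_type *x = a;
	const struct named_type *y = b;
	int c = strcmp(x->name, y->name);

	if (c)
		return c;
	return (x->orig_idx > y->orig_idx) - (x->orig_idx < y->orig_idx);
}

/* Sort by name while preserving the input order of same-named entries. */
static void sort_by_name_stable(struct named_type *types, size_t n)
{
	for (size_t i = 0; i < n; i++)
		types[i].orig_idx = (unsigned int)i;
	qsort(types, n, sizeof(*types), cmp_name_then_index);
}
```

Because the comparator never reports two distinct elements as equal, the result is the same on every qsort() implementation, which is the consistency property the commit message is after.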
2026-02-04Merge branch 'bpf-misc-changes-around-af_unix'Martin KaFai Lau2-6/+2
Kuniyuki Iwashima says: ==================== bpf: Misc changes around AF_UNIX. Patch 1 adapts sk_is_XXX() helpers in __cgroup_bpf_run_filter_sock_addr(). Patch 2 removes an unnecessary sk_fullsock() in bpf_skc_to_unix_sock(). ==================== Link: https://patch.msgid.link/20260203213442.682838-1-kuniyu@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2026-02-04bpf: Don't check sk_fullsock() in bpf_skc_to_unix_sock().Kuniyuki Iwashima1-1/+1
AF_UNIX does not use TCP_NEW_SYN_RECV nor TCP_TIME_WAIT and checking sk->sk_family is sufficient. Let's remove sk_fullsock() and use sk_is_unix() in bpf_skc_to_unix_sock(). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260203213442.682838-3-kuniyu@google.com
2026-02-04bpf: Use sk_is_inet() and sk_is_unix() in __cgroup_bpf_run_filter_sock_addr().Kuniyuki Iwashima1-5/+1
sk->sk_family should be read with READ_ONCE() in __cgroup_bpf_run_filter_sock_addr() due to IPV6_ADDRFORM. Also, the comment there is a bit stale since commit 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets"), and the kdoc has the same comment. Let's use sk_is_inet() and sk_is_unix() and remove the comment. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260203213442.682838-2-kuniyu@google.com
2026-02-04Merge branch 'bpf-avoid-locks-in-bpf_timer-and-bpf_wq'Andrii Nakryiko7-342/+851
Alexei Starovoitov says: ==================== bpf: Avoid locks in bpf_timer and bpf_wq From: Alexei Starovoitov <ast@kernel.org> This series reworks implementation of BPF timer and workqueue APIs to make them usable from any context. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Changes in v9: - Different approach for patches 1 and 3: - s/EBUSY/ENOENT/ when refcnt==0 to match existing - drop latch, use refcnt and kmalloc_nolock() instead - address race between timer/wq_start and delete_elem, add a test - Link to v8: https://lore.kernel.org/bpf/20260127-timer_nolock-v8-0-5a29a9571059@meta.com/ Changes in v8: - Return -EBUSY in bpf_async_read_op() if last_seq fails to be set - In bpf_async_cancel_and_free() drop bpf_async_cb ref after calling bpf_async_process() - Link to v7: https://lore.kernel.org/r/20260122-timer_nolock-v7-0-04a45c55c2e2@meta.com Changes in v7: - Addressed Andrii's review points from the previous version - nothing very significant. - Added NMI stress tests for bpf_timer - hit few verifier failing checks and removed them. - Address sparse warning in the bpf_async_update_prog_callback() - Link to v6: https://lore.kernel.org/r/20260120-timer_nolock-v6-0-670ffdd787b4@meta.com Changes in v6: - Reworked destruction and refcnt use: - On cancel_and_free() set last_seq to BPF_ASYNC_DESTROY value, drop map's reference - In irq work callback, atomically switch DESTROY to DESTROYED, cancel timer/wq - Free bpf_async_cb on refcnt going to 0. - Link to v5: https://lore.kernel.org/r/20260115-timer_nolock-v5-0-15e3aef2703d@meta.com Changes in v5: - Extracted lock-free algorithm for updating cb->prog and cb->callback_fn into a function bpf_async_update_prog_callback(), added a new commit that introduces this function and uses it in __bpf_async_set_callback(), bpf_timer_cancel() and bpf_async_cancel_and_free(). This allows moving the change into a separate commit without breaking correctness. 
- Handle NULL prog in bpf_async_update_prog_callback(). - Link to v4: https://lore.kernel.org/r/20260114-timer_nolock-v4-0-fa6355f51fa7@meta.com Changes in v4: - Handle irq_work_queue failures in both schedule and cancel_and_free paths: introduced bpf_async_refcnt_dec_cleanup() that decrements refcnt and makes sure if last reference is put, there is at least one irq_work scheduled to execute final cleanup. - Additional refcnt inc/dec in set_callback() + rcu lock to make sure cleanup is not running at the same time as set_callback(). - Added READ_ONCE where it was needed. - Squash 'bpf: Refactor __bpf_async_set_callback()' commit into 'bpf: Add lock-free cell for NMI-safe async operations' - Removed mpmc_cell, use seqcount_latch_t instead. - Link to v3: https://lore.kernel.org/r/20260107-timer_nolock-v3-0-740d3ec3e5f9@meta.com Changes in v3: - Major rework - Introduce mpmc_cell, allowing concurrent writes and reads - Implement irq_work deferring - Adding selftests - Introduces bpf_timer_cancel_async kfunc - Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com Changes in v2: - Move refcnt initialization and put (from cancel_and_free()) from patch 5 into the patch 4, so that patch 4 has more clear and full implementation and use of refcnt - Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com ==================== Link: https://patch.msgid.link/20260201025403.66625-1-alexei.starovoitov@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2026-02-04selftests/bpf: Add a test to stress bpf_timer_start and map_delete raceAlexei Starovoitov2-0/+203
Add a test to stress bpf_timer_start and map_delete race Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-10-alexei.starovoitov@gmail.com
2026-02-04selftests/bpf: Removed obsolete testsMykyta Yatsenko1-111/+0
Now bpf_timer can be used in tracepoints, so these tests are no longer relevant. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-9-alexei.starovoitov@gmail.com
2026-02-04selftests/bpf: Add timer stress test in NMI contextMykyta Yatsenko2-12/+231
Add stress tests for BPF timers that run in NMI context using perf_event programs attached to PERF_COUNT_HW_CPU_CYCLES. The tests cover three scenarios: - nmi_race: Tests concurrent timer start and async cancel operations - nmi_update: Tests updating a map element (effectively deleting and inserting new for array map) from within a timer callback - nmi_cancel: Tests timer self-cancellation attempt. A common test_common() helper is used to share timer setup logic across all test modes. The tests spawn multiple threads in a child process to generate perf events, which trigger the BPF programs in NMI context. Hit counters verify that the NMI code paths were actually exercised. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-8-alexei.starovoitov@gmail.com
2026-02-04selftests/bpf: Verify bpf_timer_cancel_async worksMykyta Yatsenko2-0/+48
Add test that verifies that bpf_timer_cancel_async works: can cancel callback successfully. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-7-alexei.starovoitov@gmail.com
2026-02-04selftests/bpf: Add stress test for timer async cancelMykyta Yatsenko2-4/+28
Extend BPF timer selftest to run stress test for async cancel. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-6-alexei.starovoitov@gmail.com
2026-02-04selftests/bpf: Refactor timer selftestsMykyta Yatsenko1-19/+36
Refactor timer selftests, extracting stress test into a separate test. This makes it easier to debug test failures and allows to extend. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-5-alexei.starovoitov@gmail.com
2026-02-04bpf: Introduce bpf_timer_cancel_async() kfuncAlexei Starovoitov1-0/+48
Introduce bpf_timer_cancel_async() that wraps hrtimer_try_to_cancel() and executes it either synchronously or defers to irq_work. Co-developed-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260201025403.66625-4-alexei.starovoitov@gmail.com