kernel/linux.git/Documentation/bpf, branch v7.2-rc1

Documentation/bpf: Refresh map_lru_hash_update.dot for rqspinlock

2026-06-08T01:46:13+00:00

Reflect the rqspinlock conversion and orphan-recovery paths added in the previous commit: - All LRU locks are rqspinlock_t; any acquire can fail (AA or timeout). A shared "rqspinlock acquire failed" terminal collapses to the existing -ENOMEM exit. Dashed arrows from each acquire site mark the failure paths. - The per-CPU local freelist is now lockless (free_llist). - Post-steal: re-acquiring loc_l->lock to insert the stolen node into the local pending list can fail; on failure the node is published to free_llist instead of being orphaned, and the call returns -ENOMEM. - Steal-loop victim lock failure is silent: skip the victim and try the next CPU. Signed-off-by: Mykyta Yatsenko Link: https://lore.kernel.org/r/20260607-lru_map_spin-v3-2-bcd9332e911b@meta.com Signed-off-by: Alexei Starovoitov

Merge branch 'arena_direct_access'

2026-05-25T15:35:48+00:00

Tejun Heo says: ==================== This makes BPF arena memory directly dereferenceable from kernel code (struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page that an arch fault hook installs into empty PTEs on kernel-side faults, after KFENCE. The faulting instruction retries and the violation is reported through the program's BPF stream. v4: - Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in inline comments on both x86 and arm64. (Andrea) - Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a new include/linux/bpf_defs.h. (lkp) - Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on pool growth. (Andrea) - Picked up Reviewed-by tags from Emil and Andrea. v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org Motivation ---------- sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask *. The kernel translates a kernel cpumask to a cmask, but it had no way to write into the arena, so the cmask lived in kernel memory and was passed as a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so the BPF side had to word-by-word probe-read the kernel cmask into an arena cmask via cmask_copy_from_kernel() before any helper could touch it. It works, but is clumsy. The shape isn't unique to set_cmask. Sub-scheduler support is on the way and more sched_ext callbacks will want to pass structured data to BPF. Anywhere a kfunc or struct_ops callback wants to hand a struct to a BPF program, arena residence is the natural answer. Approach -------- Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as today - PTEs are populated only for allocated pages. A new arch fault hook (bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64 __do_kernel_fault(), after KFENCE. When a kernel-side access faults inside an arena's kern_vm range, the helper walks the stack to find the BPF program responsible, range-checks the fault address against prog->aux->arena, and atomically installs the scratch page into the empty PTE via the new ptep_try_set() wrapper. The kernel instruction retries and reads/writes the scratch page. Free paths and map destruction treat scratch as non-owned. Real allocation refuses to overwrite scratch (apply_range_set_cb returns -EBUSY). A scratched address stays dead until map destroy, since its presence means the BPF program has already malfunctioned. The mechanism is default behavior - no UAPI flag. What this preserves ------------------- All the debugging properties of today's sparse-PTE design are preserved: * BPF programs still fault on unmapped arena accesses. The fault semantics (instruction retry with rdst = 0) and the violation report through bpf_streams are unchanged for prog-side accesses. * The first kernel-side touch of an unmapped address is reported via bpf_streams the same way as a prog-side fault, with the stack walk attributing it to the originating prog. * User-side fault on a never-scratched address still lazy-allocates a real page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a scratched address SIGSEGVs. What changes for the kernel-side caller is just that an unmapped deref no longer oopses - it retries through the scratch page and emits a violation report. The same shape today's BPF instruction faults have. Patches 1-2 (atomic PTE install + arena scratch-page recovery) -------------------------------------------------------------- mm: Add ptep_try_set() for lockless empty-slot installs bpf: Recover arena kernel faults with scratch page Patches 3-5 (helpers used by struct_ops registration) ----------------------------------------------------- bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers bpf: Add bpf_struct_ops_for_each_prog() bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena() ==================== Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/ Signed-off-by: Alexei Starovoitov

bpf: Recover arena kernel faults with scratch page

2026-05-23T08:50:33+00:00

BPF arena usage is becoming more prevalent, but kernel <-> BPF communication over arena memory is awkward today. Data has to be staged through a trusted kernel pointer with extra code and copying on the BPF side. While reads through arena pointers can use a fault-safe helper, writes don't have a good solution. The in-line alternative would need instruction emulation or asm fixup labels. Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any handed-in arena pointer, without bounds checking. A per-arena scratch page is installed by the arch fault path into empty arena kernel PTEs - x86 from page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for translation faults, both after the existing exception-table and KFENCE handling. The faulting instruction retries and the access is also reported through the program's BPF stream, preserving error reporting. bpf_prog_find_from_stack() resolves the current BPF program (and its arena) from the kernel stack - no new bpf_run_ctx state is added. Recovery covers the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower half-guard is excluded because well-behaved kfuncs only access forward from arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst. The install is lock-free via ptep_try_set(). On race-loss the winning installer's PTE is already valid, so the access retry succeeds. The arena clear path uses ptep_get_and_clear() so installer and clearer race through atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped" entries just cause one extra re-fault, cheaper than a global IPI on every install. Scratch exists only to keep the kernel from oopsing on an in-line arena access. Its presence at a PTE means the BPF program has already malfunctioned, and the violation is reported through the program's BPF stream. The only requirement for behavior on a scratched PTE is that the kernel doesn't crash. In particular, any user-side access through such a PTE may segfault. The shared scratch page is freed once during map destruction. BPF instruction faults continue to use the existing JIT exception-table path. This patch changes only the kernel-text fault path. No UAPI flag is added. The new behavior is the default. v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David) v3: Stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL. (lkp) Suggested-by: Alexei Starovoitov Signed-off-by: Kumar Kartikeya Dwivedi Signed-off-by: Tejun Heo Reviewed-by: Emil Tsalapatis Cc: David Hildenbrand Link: https://lore.kernel.org/r/20260522172219.1423324-3-tj@kernel.org Signed-off-by: Alexei Starovoitov

bpf: allow non-owning list-node args via __nonown_allowed

2026-05-21T09:47:45+00:00

KF_ARG_PTR_TO_LIST_NODE normally requires an owning reference (PTR_TO_BTF_ID | MEM_ALLOC with ref_obj_id). Introduce the __nonown_allowed annotation on selected list-node arguments so non-owning references with ref_obj_id==0 are accepted as well. This patch only adds the generic verifier support and documents the annotation. Later patches in the series will apply it to bpf_list_add /del(), and bpf_list_is_first/last(), allowing bpf_list_front/back() results to be used as the insertion point, deletion target, or query target for those kfuncs. Verifier keeps existing owning-ref checks by default; only arguments annotated with __nonown_allowed bypass MEM_ALLOC/ref_obj_id checks and then follow the same list-node validation path. Signed-off-by: Kaitao Cheng Reviewed-by: Eduard Zingerman Link: https://lore.kernel.org/r/20260521032306.97118-4-kaitao.cheng@linux.dev Signed-off-by: Alexei Starovoitov

Documentation/bpf: Update btf doc with updated vlen, kind sizes

2026-04-21T00:52:48+00:00

Sync doc with updated UAPI changes utilizing unused bts for extended vlen, kind values. Signed-off-by: Alan Maguire Link: https://lore.kernel.org/r/20260417143023.1551481-7-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov

docs/bpf: add missing fsession attach type to docs

2026-04-12T19:42:38+00:00

Add the fsession attach type to program_types.rst and drgn.rst. Acked-by: Leon Hwang Signed-off-by: Menglong Dong Link: https://lore.kernel.org/r/20260412060346.142007-3-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov

Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

2026-02-10T19:26:21+00:00

Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...

Documentation: use a source-read extension for the index link boilerplate

2026-01-23T18:59:34+00:00

The root document usually has a special :ref:`genindex` link to the generated index. This is also the case for Documentation/index.rst. The other index.rst files deeper in the directory hierarchy usually don't. For SPHINXDIRS builds, the root document isn't Documentation/index.rst, but some other index.rst in the hierarchy. Currently they have a ".. only::" block to add the index link when doing SPHINXDIRS html builds. This is obviously very tedious and repetitive. The link is also added to all index.rst files in the hierarchy for SPHINXDIRS builds, not just the root document. Put the boilerplate in a sphinx-includes/subproject-index.rst file, and include it at the end of the root document for subproject builds in an ad-hoc source-read extension defined in conf.py. For now, keep having the boilerplate in translations, because this approach currently doesn't cover translated index link headers. Cc: Jonathan Corbet Cc: Mauro Carvalho Chehab Cc: Randy Dunlap Signed-off-by: Jani Nikula Tested-by: Mauro Carvalho Chehab Reviewed-by: Mauro Carvalho Chehab [jc: did s/doctree/kern_doc_dir/ ] Signed-off-by: Jonathan Corbet Message-ID: <20260123143149.2024303-1-jani.nikula@intel.com>

bpf,docs: Document KF_IMPLICIT_ARGS flag

2026-01-21T00:22:38+00:00

Add a section explaining KF_IMPLICIT_ARGS kfunc flag. Remove __prog arg annotation, as it is no longer supported. Signed-off-by: Ihor Solodrai Link: https://lore.kernel.org/r/20260120223027.3981805-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov

bpf: Replace opt annotation with nullable for kfuncs

2026-01-02T23:51:34+00:00

The __opt annotation was originally introduced specifically for buffer/size argument pairs in bpf_dynptr_slice() and bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while still validating the size as a constant. The __nullable annotation serves the same purpose but is more general and is already used throughout the BPF subsystem for raw tracepoints, struct_ops, and other kfuncs. This patch unifies the two annotations by replacing __opt with __nullable. The key change is in the verifier's get_kfunc_ptr_arg_type() function, where mem/size pair detection is now performed before the nullable check. This ensures that buffer/size pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the buffer is nullable, while adding an !arg_mem_size condition to the nullable check prevents interference with mem/size pair handling. When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional() to determine whether to skip size validation for NULL buffers. This is the first documentation added for the __nullable annotation, which has been in use since it was introduced but was previously undocumented. No functional changes to verifier behavior - nullable buffer/size pairs continue to work exactly as before. Acked-by: Eduard Zingerman Signed-off-by: Puranjay Mohan Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov