Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit 980ca8ceeae6 ("bpf: check bpf_dummy_struct_ops program params for
test runs") does bitwise AND between reg_type and PTR_MAYBE_NULL, which
is correct, but due to type difference the compiler complains:
net/bpf/bpf_dummy_struct_ops.c:118:31: warning: bitwise operation between different enumeration types ('const enum bpf_reg_type' and 'enum bpf_type_flag') [-Wenum-enum-conversion]
118 | if (info && (info->reg_type & PTR_MAYBE_NULL))
| ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
Workaround the warning by moving the type_may_be_null() helper from
verifier.c into bpf_verifier.h, and reuse it here to check whether param
is nullable.
Fixes: 980ca8ceeae6 ("bpf: check bpf_dummy_struct_ops program params for test runs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404241956.HEiRYwWq-lkp@intel.com/
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240905055233.70203-1-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch removes the insn_buf array stack usage from the
inline_bpf_loop(). Instead, the env->insn_buf is used. The
usage in inline_bpf_loop() needs more than 16 insn, so the
INSN_BUF_SIZE needs to be increased from 16 to 32.
The compiler stack size warning on the verifier is gone
after this change.
Cc: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240904180847.56947-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar
to the existing .gen_prologue. Instead of allowing a subsystem
to run code at the beginning of a bpf prog, it allows the subsystem
to run code just before the bpf prog exit.
One of the use case is to allow the upcoming bpf qdisc to ensure that
the skb->dev is the same as the qdisc->dev_queue->dev. The bpf qdisc
struct_ops implementation could either fix it up or drop the skb.
Another use case could be in bpf_tcp_ca.c to enforce snd_cwnd
has sane value (e.g. non zero).
The epilogue can do the useful thing (like checking skb->dev) if it
can access the bpf prog's ctx. Unlike prologue, r1 may not hold the
ctx pointer. This patch saves the r1 in the stack if the .gen_epilogue
has returned some instructions in the "epilogue_buf".
The existing .gen_prologue is done in convert_ctx_accesses().
The new .gen_epilogue is done in the convert_ctx_accesses() also.
When it sees the (BPF_JMP | BPF_EXIT) instruction, it will be patched
with the earlier generated "epilogue_buf". The epilogue patching is
only done for the main prog.
Only one epilogue will be patched to the main program. When the
bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used
to goto the earlier patched epilogue. Majority of the archs
support (BPF_JMP32 | BPF_JA): x86, arm, s390, risv64, loongarch,
powerpc and arc. This patch keeps it simple and always
use (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to
generate the (BPF_JMP32 | BPF_JA) insn.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch moves the 'struct bpf_insn insn_buf[16]' stack usage
to the bpf_verifier_env. A '#define INSN_BUF_SIZE 16' is also added
to replace the ARRAY_SIZE(insn_buf) usages.
Both convert_ctx_accesses() and do_misc_fixup() are changed
to use the env->insn_buf.
It is a refactoring work for adding the epilogue_buf[16] in a later patch.
With this patch, the stack size usage decreased.
Before:
./kernel/bpf/verifier.c:22133:5: warning: stack frame size (2584)
After:
./kernel/bpf/verifier.c:22184:5: warning: stack frame size (2264)
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge bpf fixes after downstream PR including
important fixes (from bpf-next point of view):
commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")
No conflicts.
Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile
Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Attribute used by LLVM implementation of the feature had been changed
from no_caller_saved_registers to bpf_fastcall (see [1]).
This commit replaces references to nocsr by references to bpf_fastcall
to keep LLVM and Kernel parts in sync.
[1] https://github.com/llvm/llvm-project/pull/105417
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The commit f7866c358733 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
fixed a NULL pointer dereference panic, but didn't fix the issue that
fails to update attached freplace prog to prog_array map.
Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
freplace prog and its target prog are able to tail call each other.
And the commit 3aac1ead5eb6 ("bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach")
sets prog->aux->dst_prog as NULL after attaching freplace prog to its
target prog.
After loading freplace the prog_array's owner type is BPF_PROG_TYPE_SCHED_CLS.
Then, after attaching freplace its prog->aux->dst_prog is NULL.
Then, while updating freplace in prog_array the bpf_prog_map_compatible()
incorrectly returns false because resolve_prog_type() returns
BPF_PROG_TYPE_EXT instead of BPF_PROG_TYPE_SCHED_CLS.
After this patch the resolve_prog_type() returns BPF_PROG_TYPE_SCHED_CLS
and update to prog_array can succeed.
Fixes: f7866c358733 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20240728114612.48486-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
GCC and LLVM define a no_caller_saved_registers function attribute.
This attribute means that function scratches only some of
the caller saved registers defined by ABI.
For BPF the set of such registers could be defined as follows:
- R0 is scratched only if function is non-void;
- R1-R5 are scratched only if corresponding parameter type is defined
in the function prototype.
This commit introduces flag bpf_func_prot->allow_nocsr.
If this flag is set for some helper function, verifier assumes that
it follows no_caller_saved_registers calling convention.
The contract between kernel and clang allows to simultaneously use
such functions and maintain backwards compatibility with old
kernels that don't understand no_caller_saved_registers calls
(nocsr for short):
- clang generates a simple pattern for nocsr calls, e.g.:
r1 = 1;
r2 = 2;
*(u64 *)(r10 - 8) = r1;
*(u64 *)(r10 - 16) = r2;
call %[to_be_inlined]
r2 = *(u64 *)(r10 - 16);
r1 = *(u64 *)(r10 - 8);
r0 = r1;
r0 += r2;
exit;
- kernel removes unnecessary spills and fills, if called function is
inlined by verifier or current JIT (with assumption that patch
inserted by verifier or JIT honors nocsr contract, e.g. does not
scratch r3-r5 for the example above), e.g. the code above would be
transformed to:
r1 = 1;
r2 = 2;
call %[to_be_inlined]
r0 = r1;
r0 += r2;
exit;
Technically, the transformation is split into the following phases:
- function mark_nocsr_patterns(), called from bpf_check()
searches and marks potential patterns in instruction auxiliary data;
- upon stack read or write access,
function check_nocsr_stack_contract() is used to verify if
stack offsets, presumably reserved for nocsr patterns, are used
only from those patterns;
- function remove_nocsr_spills_fills(), called from bpf_check(),
applies the rewrite for valid patterns.
See comment in mark_nocsr_pattern_for_call() for more details.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240722233844.1406874-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Use bpf_verifier_state->jmp_history to track which registers were
updated by find_equal_scalars() (renamed to collect_linked_regs())
when conditional jump was verified. Use recorded information in
backtrack_insn() to propagate precision.
E.g. for the following program:
while verifying instructions
1: r1 = r0 |
2: if r1 < 8 goto ... | push r0,r1 as linked registers in jmp_history
3: if r0 > 16 goto ... | push r0,r1 as linked registers in jmp_history
4: r2 = r10 |
5: r2 += r0 v mark_chain_precision(r0)
while doing mark_chain_precision(r0)
5: r2 += r0 | mark r0 precise
4: r2 = r10 |
3: if r0 > 16 goto ... | mark r0,r1 as precise
2: if r1 < 8 goto ... | mark r0,r1 as precise
1: r1 = r0 v
Technically, do this as follows:
- Use 10 bits to identify each register that gains range because of
sync_linked_regs():
- 3 bits for frame number;
- 6 bits for register or stack slot number;
- 1 bit to indicate if register is spilled.
- Use u64 as a vector of 6 such records + 4 bits for vector length.
- Augment struct bpf_jmp_history_entry with a field 'linked_regs'
representing such vector.
- When doing check_cond_jmp_op() remember up to 6 registers that
gain range because of sync_linked_regs() in such a vector.
- Don't propagate range information and reset IDs for registers that
don't fit in 6-value vector.
- Push a pair {instruction index, linked registers vector}
to bpf_verifier_state->jmp_history.
- When doing backtrack_insn() check if any of recorded linked
registers is currently marked precise, if so mark all linked
registers as precise.
This also requires fixes for two test_verifier tests:
- precise: test 1
- precise: test 2
Both tests contain the following instruction sequence:
19: (bf) r2 = r9 ; R2=scalar(id=3) R9=scalar(id=3)
20: (a5) if r2 < 0x8 goto pc+1 ; R2=scalar(id=3,umin=8)
21: (95) exit
22: (07) r2 += 1 ; R2_w=scalar(id=3+1,...)
23: (bf) r1 = r10 ; R1_w=fp0 R10=fp0
24: (07) r1 += -8 ; R1_w=fp-8
25: (b7) r3 = 0 ; R3_w=0
26: (85) call bpf_probe_read_kernel#113
The call to bpf_probe_read_kernel() at (26) forces r2 to be precise.
Previously, this forced all registers with same id to become precise
immediately when mark_chain_precision() is called.
After this change, the precision is propagated to registers sharing
same id only when 'if' instruction is backtracked.
Hence verification log for both tests is changed:
regs=r2,r9 -> regs=r2 for instructions 25..20.
Fixes: 904e6ddf4133 ("bpf: Use scalar ids in mark_chain_precision()")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240718202357.1746514-2-eddyz87@gmail.com
Closes: https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
|
|
When loading a EXT program without specifying `attr->attach_prog_fd`,
the `prog->aux->dst_prog` will be null. At this time, calling
resolve_prog_type() anywhere will result in a null pointer dereference.
Example stack trace:
[ 8.107863] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[ 8.108262] Mem abort info:
[ 8.108384] ESR = 0x0000000096000004
[ 8.108547] EC = 0x25: DABT (current EL), IL = 32 bits
[ 8.108722] SET = 0, FnV = 0
[ 8.108827] EA = 0, S1PTW = 0
[ 8.108939] FSC = 0x04: level 0 translation fault
[ 8.109102] Data abort info:
[ 8.109203] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 8.109399] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 8.109614] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 8.109836] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000101354000
[ 8.110011] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[ 8.112624] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 8.112783] Modules linked in:
[ 8.113120] CPU: 0 PID: 99 Comm: may_access_dire Not tainted 6.10.0-rc3-next-20240613-dirty #1
[ 8.113230] Hardware name: linux,dummy-virt (DT)
[ 8.113390] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 8.113429] pc : may_access_direct_pkt_data+0x24/0xa0
[ 8.113746] lr : add_subprog_and_kfunc+0x634/0x8e8
[ 8.113798] sp : ffff80008283b9f0
[ 8.113813] x29: ffff80008283b9f0 x28: ffff800082795048 x27: 0000000000000001
[ 8.113881] x26: ffff0000c0bb2600 x25: 0000000000000000 x24: 0000000000000000
[ 8.113897] x23: ffff0000c1134000 x22: 000000000001864f x21: ffff0000c1138000
[ 8.113912] x20: 0000000000000001 x19: ffff0000c12b8000 x18: ffffffffffffffff
[ 8.113929] x17: 0000000000000000 x16: 0000000000000000 x15: 0720072007200720
[ 8.113944] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
[ 8.113958] x11: 0720072007200720 x10: 0000000000f9fca4 x9 : ffff80008021f4e4
[ 8.113991] x8 : 0101010101010101 x7 : 746f72705f6d656d x6 : 000000001e0e0f5f
[ 8.114006] x5 : 000000000001864f x4 : ffff0000c12b8000 x3 : 000000000000001c
[ 8.114020] x2 : 0000000000000002 x1 : 0000000000000000 x0 : 0000000000000000
[ 8.114126] Call trace:
[ 8.114159] may_access_direct_pkt_data+0x24/0xa0
[ 8.114202] bpf_check+0x3bc/0x28c0
[ 8.114214] bpf_prog_load+0x658/0xa58
[ 8.114227] __sys_bpf+0xc50/0x2250
[ 8.114240] __arm64_sys_bpf+0x28/0x40
[ 8.114254] invoke_syscall.constprop.0+0x54/0xf0
[ 8.114273] do_el0_svc+0x4c/0xd8
[ 8.114289] el0_svc+0x3c/0x140
[ 8.114305] el0t_64_sync_handler+0x134/0x150
[ 8.114331] el0t_64_sync+0x168/0x170
[ 8.114477] Code: 7100707f 54000081 f9401c00 f9403800 (b9400403)
[ 8.118672] ---[ end trace 0000000000000000 ]---
One way to fix it is by forcing `attach_prog_fd` non-empty when
bpf_prog_load(). But this will lead to `libbpf_probe_bpf_prog_type`
API broken which use verifier log to probe prog type and will log
nothing if we reject invalid EXT prog before bpf_check().
Another way is by adding null check in resolve_prog_type().
The issue was introduced by commit 4a9c7bbe2ed4 ("bpf: Resolve to
prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT") which wanted
to correct type resolution for BPF_PROG_TYPE_TRACING programs. Before
that, the type resolution of BPF_PROG_TYPE_EXT prog actually follows
the logic below:
prog->aux->dst_prog ? prog->aux->dst_prog->type : prog->type;
It implies that when EXT program is not yet attached to `dst_prog`,
the prog type should be EXT itself. This code worked fine in the past.
So just keep using it.
Fix this by returning `prog->type` for BPF_PROG_TYPE_EXT if `dst_prog`
is not present in resolve_prog_type().
Fixes: 4a9c7bbe2ed4 ("bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20240711145819.254178-2-wutengda@huaweicloud.com
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-08
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Compilers can generate the code
r1 = r2
r1 += 0x1
if r2 < 1000 goto ...
use knowledge of r2 range in subsequent r1 operations
So remember constant delta between r2 and r1 and update r1 after 'if' condition.
Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)
The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com
|
|
Juan reported that after doing some changes to buzzer [0] and implementing
a new fuzzing strategy guided by coverage, they noticed the following in
one of the probes:
[...]
13: (79) r6 = *(u64 *)(r0 +0) ; R0=map_value(ks=4,vs=8) R6_w=scalar()
14: (b7) r0 = 0 ; R0_w=0
15: (b4) w0 = -1 ; R0_w=0xffffffff
16: (74) w0 >>= 1 ; R0_w=0x7fffffff
17: (5c) w6 &= w0 ; R0_w=0x7fffffff R6_w=scalar(smin=smin32=0,smax=umax=umax32=0x7fffffff,var_off=(0x0; 0x7fffffff))
18: (44) w6 |= 2 ; R6_w=scalar(smin=umin=smin32=umin32=2,smax=umax=umax32=0x7fffffff,var_off=(0x2; 0x7ffffffd))
19: (56) if w6 != 0x7ffffffd goto pc+1
REG INVARIANTS VIOLATION (true_reg2): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
REG INVARIANTS VIOLATION (false_reg2): const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] u32=[0x0, 0xffffffff] s32=[0x80000000, 0x7fffffff] var_off=(0x7fffffff, 0x0)
19: R6_w=0x7fffffff
20: (95) exit
from 19 to 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
21: (14) w6 -= 2147483632 ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=14,var_off=(0x2; 0xfffffffd))
22: (76) if w6 s>= 0xe goto pc+1 ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=13,var_off=(0x2; 0xfffffffd))
23: (95) exit
from 22 to 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
24: (14) w6 -= 14 ; R6_w=0
[...]
What can be seen here is a register invariant violation on line 19. After
the binary-or in line 18, the verifier knows that bit 2 is set but knows
nothing about the rest of the content which was loaded from a map value,
meaning, range is [2,0x7fffffff] with var_off=(0x2; 0x7ffffffd). When in
line 19 the verifier analyzes the branch, it splits the register states
in reg_set_min_max() into the registers of the true branch (true_reg1,
true_reg2) and the registers of the false branch (false_reg1, false_reg2).
Since the test is w6 != 0x7ffffffd, the src_reg is a known constant.
Internally, the verifier creates a "fake" register initialized as scalar
to the value of 0x7ffffffd, and then passes it onto reg_set_min_max(). Now,
for line 19, it is mathematically impossible to take the false branch of
this program, yet the verifier analyzes it. It is impossible because the
second bit of r6 will be set due to the prior or operation and the
constant in the condition has that bit unset (hex(fd) == binary(1111 1101).
When the verifier first analyzes the false / fall-through branch, it will
compute an intersection between the var_off of r6 and of the constant. This
is because the verifier creates a "fake" register initialized to the value
of the constant. The intersection result later refines both registers in
regs_refine_cond_op():
[...]
t = tnum_intersect(tnum_subreg(reg1->var_off), tnum_subreg(reg2->var_off));
reg1->var_off = tnum_with_subreg(reg1->var_off, t);
reg2->var_off = tnum_with_subreg(reg2->var_off, t);
[...]
Since the verifier is analyzing the false branch of the conditional jump,
reg1 is equal to false_reg1 and reg2 is equal to false_reg2, i.e. the reg2
is the "fake" register that was meant to hold a constant value. The resulting
var_off of the intersection says that both registers now hold a known value
of var_off=(0x7fffffff, 0x0) or in other words: this operation manages to
make the verifier think that the "constant" value that was passed in the
jump operation now holds a different value.
Normally this would not be an issue since it should not influence the true
branch, however, false_reg2 and true_reg2 are pointers to the same "fake"
register. Meaning, the false branch can influence the results of the true
branch. In line 24, the verifier assumes R6_w=0, but the actual runtime
value in this case is 1. The fix is simply not passing in the same "fake"
register location as inputs to reg_set_min_max(), but instead making a
copy. Moving the fake_reg into the env also reduces stack consumption by
120 bytes. With this, the verifier successfully rejects invalid accesses
from the test program.
[0] https://github.com/google/buzzer
Fixes: 67420501e868 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons")
Reported-by: Juan José López Jaimez <jjlopezjaimez@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240613115310.25383-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce two new BPF kfuncs, bpf_preempt_disable and
bpf_preempt_enable. These kfuncs allow disabling preemption in BPF
programs. Nesting is allowed, since the intended use cases includes
building native BPF spin locks without kernel helper involvement. Apart
from that, this can be used to per-CPU data structures for cases where
programs (or userspace) may preempt one or the other. Currently, while
per-CPU access is stable, whether it will be consistent is not
guaranteed, as only migration is disabled for BPF programs.
Global functions are disallowed from being called, but support for them
will be added as a follow up not just preempt kfuncs, but rcu_read_lock
kfuncs as well. Static subprog calls are permitted. Sleepable helpers
and kfuncs are disallowed in non-preemptible regions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20240424031315.2757363-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
To support sleepable async callbacks, we need to tell push_async_cb()
whether the cb is sleepable or not.
The verifier now detects that we are in bpf_wq_set_callback_impl and
can allow a sleepable callback to happen.
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Link: https://lore.kernel.org/r/20240420-bpf_wq-v2-13-6c986a5a741f@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, bpf_insn_aux_data->map_ptr_state is used to store either
map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus
BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain
cases, we may need valid map_ptr even in case of poison state.
This will be explained in next patch with bpf_for_each_map_elem()
helper.
This patch changes map_ptr_state into a new struct including both map
pointer and its state (poison/unpriv). It's in the same union with
struct bpf_loop_inline_state, so there is no extra memory overhead.
Besides, macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no
longer needed.
This patch does not change any existing functionality.
Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA.
Any further operations on PTR_TO_ARENA register have to be in 32-bit domain.
The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32.
JIT will generate them as kern_vm_start + 32bit_addr memory accesses.
rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar.
If arena->map_flags has BPF_F_NO_USER_CONV set then convert cast_user to mov32 as well.
Otherwise JIT will convert it to:
rY = (u32)rX;
if (rY)
rY |= arena->user_vm_start & ~(u64)~0U;
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
|
|
Introduce may_goto instruction that from the verifier pov is similar to
open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
doesn't iterate any objects.
In assembly 'may_goto' is a nop most of the time until bpf runtime has to
terminate the program for whatever reason. In the current implementation
may_goto has a hidden counter, but other mechanisms can be used.
For programs written in C the later patch introduces 'cond_break' macro
that combines 'may_goto' with 'break' statement and has similar semantics:
cond_break is a nop until bpf runtime has to break out of this loop.
It can be used in any normal "for" or "while" loop, like
for (i = zero; i < cnt; cond_break, i++) {
The verifier recognizes that may_goto is used in the program, reserves
additional 8 bytes of stack, initializes them in subprog prologue, and
replaces may_goto instruction with:
aux_reg = *(u64 *)(fp - 40)
if aux_reg == 0 goto pc+off
aux_reg -= 1
*(u64 *)(fp - 40) = aux_reg
may_goto instruction can be used by LLVM to implement __builtin_memcpy,
__builtin_strcmp.
may_goto is not a full substitute for bpf_for() macro.
bpf_for() doesn't have induction variable that verifiers sees,
so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.
But when the code is written as:
for (i = 0; i < 100; cond_break, i++)
the verifier see 'i' as precise constant zero,
hence cond_break (aka may_goto) doesn't help to converge the loop.
A static or global variable can be used as a workaround:
static int zero = 0;
for (i = zero; i < 100; cond_break, i++) // works!
may_goto works well with arena pointers that don't need to be bounds
checked on access. Load/store from arena returns imprecise unbounded
scalar and loops with may_goto pass the verifier.
Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
JCOND stands for conditional pseudo jump.
Since goto_or_nop insn was proposed, it may use the same opcode.
may_goto vs goto_or_nop can be distinguished by src_reg:
code = BPF_JMP | BPF_JCOND
src_reg = 0 - may_goto
src_reg = 1 - goto_or_nop
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
|
|
When the width of a fill is smaller than the width of the preceding
spill, the information about scalar boundaries can still be preserved,
as long as it's coerced to the right width (done by coerce_reg_to_size).
Even further, if the actual value fits into the fill width, the ID can
be preserved as well for further tracking of equal scalars.
Implement the above improvements, which makes narrowing fills behave the
same as narrowing spills and MOVs between registers.
Two tests are adjusted to accommodate for endianness differences and to
take into account that it's now allowed to do a narrowing fill from the
least significant bits.
reg_bounds_sync is added to coerce_reg_to_size to correctly adjust
umin/umax boundaries after the var_off truncation, for example, a 64-bit
value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax =
0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be
synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
|
|
Add support for passing PTR_TO_BTF_ID registers to global subprogs.
Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported.
Non-NULL semantics is assumed, so caller will be forced to prove
PTR_TO_BTF_ID can't be NULL.
Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID
arguments, even the trusted one. We achieve that by not setting
ref_obj_id when validating subprog code. This basically enforces (in
Rust terms) borrowing semantics vs move semantics. Borrowing semantics
seems to be a better fit for isolated global subprog validation
approach.
Implementation-wise, we utilize existing logic for matching
user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic
and following same matching rules. We enforce a unique match for types.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
To ensure that a module remains accessible whenever a struct_ops object of
a struct_ops type provided by the module is still in use.
struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a
module will hold a refcnt to it's btf already. But, struct_ops programs are
different. They hold their associated btf, not the module since they need
only btf to assure their types (signatures).
However, verifier holds the refcnt of the associated module of a struct_ops
type temporarily when verify a struct_ops prog. Verifier needs the help
from the verifier operators (struct bpf_verifier_ops) provided by the owner
module to verify data access of a prog, provide information, and generate
code.
This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It
avoids bpf_struct_ops_map_put_progs() from accessing btf after calling
module_put() in bpf_struct_ops_map_free().
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Adjust the check in bpf_get_spilled_reg to take into account spilled
registers narrower than 64 bits. That allows find_equal_scalars to
properly adjust the range of all spilled registers that have the same
ID. Before this change, it was possible for a register and a spilled
register to have the same IDs but different ranges if the spill was
narrower than 64 bits and a range check was performed on the register.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240108205209.838365-5-maxtram95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Subprog call logic in btf_check_subprog_call() currently has both a lot
of BTF parsing logic (which is, presumably, what justified putting it
into btf.c), but also a bunch of register state checks, some of each
utilize deep verifier logic helpers, necessarily exported from
verifier.c: check_ptr_off_reg(), check_func_arg_reg_off(),
and check_mem_reg().
Going forward, btf_check_subprog_call() will have a minimum of
BTF-related logic, but will get more internal verifier logic related to
register state manipulation. So move it into verifier.c to minimize
amount of verifier-specific logic exposed to btf.c.
We do this move before refactoring btf_check_func_arg_match() to
preserve as much history post-refactoring as possible.
No functional changes.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Generalize btf_prepare_func_args() to support both global and static
subprogs. We are going to utilize this property in the next patch,
reusing btf_prepare_func_args() for subprog call logic instead of
reparsing BTF information in a completely separate implementation.
btf_prepare_func_args() now detects whether subprog is global or static
makes slight logic adjustments for static func cases, like not failing
fatally (-EFAULT) for conditions that are allowable for static subprogs.
Somewhat subtle (but major!) difference is the handling of pointer arguments.
Both global and static functions need to handle special context
arguments (which are pointers to predefined type names), but static
subprogs give up on any other pointers, falling back to marking subprog
as "unreliable", disabling the use of BTF type information altogether.
For global functions, though, we are assuming that such pointers to
unrecognized types are just pointers to fixed-sized memory region (or
error out if size cannot be established, like for `void *` pointers).
This patch accommodates these small differences and sets up a stage for
refactoring in the next patch, eliminating a separate BTF-based parsing
logic in btf_check_func_arg_match().
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
btf_prepare_func_args() is used to understand expectations and
restrictions on global subprog arguments. But current implementation is
hard to extend, as it intermixes BTF-based func prototype parsing and
interpretation logic with setting up register state at subprog entry.
Worse still, those registers are not completely set up inside
btf_prepare_func_args(), requiring some more logic later in
do_check_common(). Like calling mark_reg_unknown() and similar
initialization operations.
This intermixing of BTF interpretation and register state setup is
problematic. First, it causes duplication of BTF parsing logic for global
subprog verification (to set up initial state of global subprog) and
global subprog call sites analysis (when we need to check that whatever
is being passed into global subprog matches expectations), performed in
btf_check_subprog_call().
Given we want to extend global func argument with tags later, this
duplication is problematic. So refactor btf_prepare_func_args() to do
only BTF-based func proto and args parsing, returning high-level
argument "expectations" only, with no regard to specifics of register
state. I.e., if it's a context argument, instead of setting register
state to PTR_TO_CTX, we return ARG_PTR_TO_CTX enum for that argument as
"an argument specification" for further processing inside
do_check_common(). Similarly for SCALAR arguments, PTR_TO_MEM, etc.
This allows to reuse btf_prepare_func_args() in following patches at
global subprog call site analysis time. It also keeps register setup
code consistently in one place, do_check_common().
Besides all this, we cache this argument specs information inside
env->subprog_info, eliminating the need to redo these potentially
expensive BTF traversals, especially if BPF program's BTF is big and/or
there are lots of global subprog calls.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231215011334.2307144-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We have a bunch of bool flags for each subprog. Instead of wasting bytes
for them, use bitfields instead.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231204233931.49758-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add comments to the datastructure tracking the stack state, as the
mapping between each stack slot and where its state is stored is not
entirely obvious.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20231208032519.260451-2-andreimatei1@gmail.com
|
|
Use instruction (jump) history to record instructions that performed
register spill/fill to/from stack, regardless if this was done through
read-only r10 register, or any other register after copying r10 into it
*and* potentially adjusting offset.
To make this work reliably, we push extra per-instruction flags into
instruction history, encoding stack slot index (spi) and stack frame
number in extra 10 bit flags we take away from prev_idx in instruction
history. We don't touch idx field for maximum performance, as it's
checked most frequently during backtracking.
This change removes basically the last remaining practical limitation of
precision backtracking logic in BPF verifier. It fixes known
deficiencies, but also opens up new opportunities to reduce number of
verified states, explored in the subsequent patches.
There are only three differences in selftests' BPF object files
according to veristat, all in the positive direction (less states).
File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF)
-------------------------------------- ------------- --------- --------- ------------- ---------- ---------- -------------
test_cls_redirect_dynptr.bpf.linked3.o cls_redirect 2987 2864 -123 (-4.12%) 240 231 -9 (-3.75%)
xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82848 82661 -187 (-0.23%) 5107 5073 -34 (-0.67%)
xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 85116 84964 -152 (-0.18%) 5162 5130 -32 (-0.62%)
Note, I avoided renaming jmp_history to more generic insn_hist to
minimize number of lines changed and potential merge conflicts between
bpf and bpf-next trees.
Notice also cur_hist_entry pointer reset to NULL at the beginning of
instruction verification loop. This pointer avoids the problem of
relying on last jump history entry's insn_idx to determine whether we
already have entry for current instruction or not. It can happen that we
added jump history entry because current instruction is_jmp_point(), but
also we need to add instruction flags for stack access. In this case, we
don't want to entries, so we need to reuse last added entry, if it is
present.
Relying on insn_idx comparison has the same ambiguity problem as the one
that was fixed recently in [0], so we avoid that.
[0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Instead of relying on potentially imprecise tnum representation of
expected return value range for callbacks and subprogs, validate that
smin/smax range satisfy exact expected range of return values.
E.g., if callback would need to return [0, 2] range, tnum can't
represent this precisely and instead will allow [0, 3] range. By
checking smin/smax range, we can make sure that subprog/callback indeed
returns only valid [0, 2] range.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
It's a trivial rearrangement saving 8 bytes. We have 4 bytes of padding
at the end which can be filled with another field without increasing
struct bpf_func_state.
copy_func_state() logic remains correct without any further changes.
BEFORE
======
struct bpf_func_state {
struct bpf_reg_state regs[11]; /* 0 1320 */
/* --- cacheline 20 boundary (1280 bytes) was 40 bytes ago --- */
int callsite; /* 1320 4 */
u32 frameno; /* 1324 4 */
u32 subprogno; /* 1328 4 */
u32 async_entry_cnt; /* 1332 4 */
bool in_callback_fn; /* 1336 1 */
/* XXX 7 bytes hole, try to pack */
/* --- cacheline 21 boundary (1344 bytes) --- */
struct tnum callback_ret_range; /* 1344 16 */
bool in_async_callback_fn; /* 1360 1 */
bool in_exception_callback_fn; /* 1361 1 */
/* XXX 2 bytes hole, try to pack */
int acquired_refs; /* 1364 4 */
struct bpf_reference_state * refs; /* 1368 8 */
int allocated_stack; /* 1376 4 */
/* XXX 4 bytes hole, try to pack */
struct bpf_stack_state * stack; /* 1384 8 */
/* size: 1392, cachelines: 22, members: 13 */
/* sum members: 1379, holes: 3, sum holes: 13 */
/* last cacheline: 48 bytes */
};
AFTER
=====
struct bpf_func_state {
struct bpf_reg_state regs[11]; /* 0 1320 */
/* --- cacheline 20 boundary (1280 bytes) was 40 bytes ago --- */
int callsite; /* 1320 4 */
u32 frameno; /* 1324 4 */
u32 subprogno; /* 1328 4 */
u32 async_entry_cnt; /* 1332 4 */
struct tnum callback_ret_range; /* 1336 16 */
/* --- cacheline 21 boundary (1344 bytes) was 8 bytes ago --- */
bool in_callback_fn; /* 1352 1 */
bool in_async_callback_fn; /* 1353 1 */
bool in_exception_callback_fn; /* 1354 1 */
/* XXX 1 byte hole, try to pack */
int acquired_refs; /* 1356 4 */
struct bpf_reference_state * refs; /* 1360 8 */
struct bpf_stack_state * stack; /* 1368 8 */
int allocated_stack; /* 1376 4 */
/* size: 1384, cachelines: 22, members: 13 */
/* sum members: 1379, holes: 1, sum holes: 1 */
/* padding: 4 */
/* last cacheline: 40 bytes */
};
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/intel/ice/ice_main.c
c9663f79cd82 ("ice: adjust switchdev rebuild path")
7758017911a4 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/
Adjacent changes:
kernel/bpf/verifier.c
bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:
static int cb(__u32 idx, struct num_context* ctx)
{
ctx->i++;
return 0;
}
SEC("?raw_tp")
int prog(void *_)
{
struct num_context ctx = { .i = 0 };
__u8 choice_arr[2] = { 0, 1 };
bpf_loop(2, cb, &ctx, 0);
return choice_arr[ctx.i];
}
Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.
This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.
For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
some identical parent state is found.
Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
calls to avoid previously masked unrelated error reports:
- cb_refs.c:underflow_prog
- exceptions_fail.c:reject_rbtree_add_throw
- exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
log trace:
- verifier_subprog_precision.c:callback_result_precise
(note: r0 precision is no longer propagated inside callback and
I think this is a correct behavior)
- verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
- verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is an important and very much
logging/debugging oriented code. It fits the overlall log.c's focus on
verifier logging, and moving it allows to keep growing it without
unnecessarily adding to verifier.c code that otherwise contains a core
verification logic.
There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.
No functional changes in this patch.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].
A few selftests and veristat need to be adjusted in the same patch as
well.
[0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.
These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.
Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.
By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
It turns out that .branches > 0 in is_state_visited() is not a
sufficient condition to identify if two verifier states form a loop
when iterators convergence is computed. This commit adds logic to
distinguish situations like below:
(I) initial (II) initial
| |
V V
.---------> hdr ..
| | |
| V V
| .------... .------..
| | | | |
| V V V V
| ... ... .-> hdr ..
| | | | | |
| V V | V V
| succ <- cur | succ <- cur
| | | |
| V | V
| ... | ...
| | | |
'----' '----'
For both (I) and (II) successor 'succ' of the current state 'cur' was
previously explored and has branches count at 0. However, loop entry
'hdr' corresponding to 'succ' might be a part of current DFS path.
If that is the case 'succ' and 'cur' are members of the same loop
and have to be compared exactly.
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Convergence for open coded iterators is computed in is_state_visited()
by examining states with branches count > 1 and using states_equal().
states_equal() computes sub-state relation using read and precision marks.
Read and precision marks are propagated from children states,
thus are not guaranteed to be complete inside a loop when branches
count > 1. This could be demonstrated using the following unsafe program:
1. r7 = -16
2. r6 = bpf_get_prandom_u32()
3. while (bpf_iter_num_next(&fp[-8])) {
4. if (r6 != 42) {
5. r7 = -32
6. r6 = bpf_get_prandom_u32()
7. continue
8. }
9. r0 = r10
10. r0 += r7
11. r8 = *(u64 *)(r0 + 0)
12. r6 = bpf_get_prandom_u32()
13. }
Here verifier would first visit path 1-3, create a checkpoint at 3
with r7=-16, continue to 4-7,3 with r7=-32.
Because instructions at 9-12 had not been visitied yet existing
checkpoint at 3 does not have read or precision mark for r7.
Thus states_equal() would return true and verifier would discard
current state, thus unsafe memory access at 11 would not be caught.
This commit fixes this loophole by introducing exact state comparisons
for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or
precision marks;
- stack slots have to have identical type.
Unfortunately, this is too strict even for simple programs like below:
i = 0;
while(iter_next(&it))
i++;
At each iteration step i++ would produce a new distinct state and
eventually instruction processing limit would be reached.
To avoid such behavior speculatively forget (widen) range for
imprecise scalar registers, if those registers were not precise at the
end of the previous iteration and do not match exactly.
This a conservative heuristic that allows to verify wide range of
programs, however it precludes verification of programs that conjure
an imprecise value on the first loop iteration and use it as precise
on the second.
Test case iter_task_vma_for_each() presents one of such cases:
unsigned int seen = 0;
...
bpf_for_each(task_vma, vma, task, 0) {
if (seen >= 1000)
break;
...
seen++;
}
Here clang generates the following code:
<LBB0_4>:
24: r8 = r6 ; stash current value of
... body ... 'seen'
29: r1 = r10
30: r1 += -0x8
31: call bpf_iter_task_vma_next
32: r6 += 0x1 ; seen++;
33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL
34: r7 += 0x10
35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
<LBB0_6>:
... exit ...
Note that counter in r6 is copied to r8 and then incremented,
conditional jump is done using r8. Because of this precision mark for
r6 lags one state behind of precision mark on r8 and widening logic
kicks in.
Adding barrier_var(seen) after conditional is sufficient to force
clang use the same register for both counting and conditional jump.
This issue was discussed in the thread [1] which was started by
Andrew Werner <awerner32@gmail.com> demonstrating a similar bug
in callback functions handling. The callbacks would be addressed
in a followup patch.
[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
css_iter and task_iter should be used in rcu section. Specifically, in
sleepable progs explicit bpf_rcu_read_lock() is needed before use these
iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to
use them directly.
This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and
bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check
whether we are in rcu cs before we want to invoke this kfunc. If the rcu
protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU.
Once user do rcu_unlock during the iteration, state MEM_RCU of regs would
be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED.
It is worth noting that currently, bpf_rcu_read_unlock does not
clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg
only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg
search STACK_ITER.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
By default, the subprog generated by the verifier to handle a thrown
exception hardcodes a return value of 0. To allow user-defined logic
and modification of the return value when an exception is thrown,
introduce the 'exception_callback:' declaration tag, which marks a
callback as the default exception handler for the program.
The format of the declaration tag is 'exception_callback:<value>', where
<value> is the name of the exception callback. Each main program can be
tagged using this BTF declaratiion tag to associate it with an exception
callback. In case the tag is absent, the default callback is used.
As such, the exception callback cannot be modified at runtime, only set
during verification.
Allowing modification of the callback for the current program execution
at runtime leads to issues when the programs begin to nest, as any
per-CPU state maintaing this information will have to be saved and
restored. We don't want it to stay in bpf_prog_aux as this takes a
global effect for all programs. An alternative solution is spilling
the callback pointer at a known location on the program stack on entry,
and then passing this location to bpf_throw as a parameter.
However, since exceptions are geared more towards a use case where they
are ideally never invoked, optimizing for this use case and adding to
the complexity has diminishing returns.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch implements BPF exceptions, and introduces a bpf_throw kfunc
to allow programs to throw exceptions during their execution at runtime.
A bpf_throw invocation is treated as an immediate termination of the
program, returning back to its caller within the kernel, unwinding all
stack frames.
This allows the program to simplify its implementation, by testing for
runtime conditions which the verifier has no visibility into, and assert
that they are true. In case they are not, the program can simply throw
an exception from the other branch.
BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
primitive, and this objective has guided design choices of the
implementation of the them within the kernel (with the bulk of the cost
for unwinding the stack offloaded to the bpf_throw kfunc).
The implementation of this mechanism requires use of add_hidden_subprog
mechanism introduced in the previous patch, which generates a couple of
instructions to move R1 to R0 and exit. The JIT then rewrites the
prologue of this subprog to take the stack pointer and frame pointer as
inputs and reset the stack frame, popping all callee-saved registers
saved by the main subprog. The bpf_throw function then walks the stack
at runtime, and invokes this exception subprog with the stack and frame
pointers as parameters.
Reviewers must take note that currently the main program is made to save
all callee-saved registers on x86_64 during entry into the program. This
is because we must do an equivalent of a lightweight context switch when
unwinding the stack, therefore we need the callee-saved registers of the
caller of the BPF program to be able to return with a sane state.
Note that we have to additionally handle r12, even though it is not used
by the program, because when throwing the exception the program makes an
entry into the kernel which could clobber r12 after saving it on the
stack. To be able to preserve the value we received on program entry, we
push r12 and restore it from the generated subprogram when unwinding the
stack.
For now, bpf_throw invocation fails when lingering resources or locks
exist in that path of the program. In a future followup, bpf_throw will
be extended to perform frame-by-frame unwinding to release lingering
resources for each stack frame, removing this limitation.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce support in the verifier for generating a subprogram and
include it as part of a BPF program dynamically after the do_check phase
is complete. The first user will be the next patch which generates
default exception callbacks if none are set for the program. The phase
of invocation will be do_misc_fixups. Note that this is an internal
verifier function, and should be used with instruction blocks which
uphold the invariants stated in check_subprogs.
Since these subprogs are always appended to the end of the instruction
sequence of the program, it becomes relatively inexpensive to do the
related adjustments to the subprog_info of the program. Only the fake
exit subprogram is shifted forward, making room for our new subprog.
This is useful to insert a new subprogram, get it JITed, and obtain its
function pointer. The next patch will use this functionality to insert a
default exception callback which will be invoked after unwinding the
stack.
Note that these added subprograms are invisible to userspace, and never
reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
subprogram is supported, but more can be easily supported in the future.
To this end, two function counts are introduced now, the existing
func_cnt, and real_func_cnt, the latter including hidden programs. This
allows us to conver the JIT code to use the real_func_cnt for management
of resources while syscall path continues working with existing
func_cnt.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230912233214.1518551-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed
for allocated percpu objects. For an allocated percpu obj,
the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'.
The return type for these two re-purposed helpera is
'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'.
The MEM_ALLOC allows that the per-cpu data can be read and written.
Since the memory allocator bpf_mem_alloc() returns
a ptr to a percpu ptr for percpu data, the first argument
of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched
with a dereference before passing to the helper func.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Recent discussions around default kptr "trustedness" led to changes such
as commit 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the
verifier."). One of the conclusions of those discussions, as expressed
in code and comments in that patch, is that we'd like to move away from
'raw' PTR_TO_BTF_ID without some type flag or other register state
indicating trustedness. Although PTR_TRUSTED and PTR_UNTRUSTED flags mark
this state explicitly, the verifier currently considers trustedness
implied by other register state. For example, owning refs to graph
collection nodes must have a nonzero ref_obj_id, so they pass the
is_trusted_reg check despite having no explicit PTR_{UN}TRUSTED flag.
This patch makes trustedness of non-owning refs to graph collection
nodes explicit as well.
By definition, non-owning refs are currently trusted. Although the ref
has no control over pointee lifetime, due to non-owning ref clobbering
rules (see invalidate_non_owning_refs) dereferencing a non-owning ref is
safe in the critical section controlled by bpf_spin_lock associated with
its owning collection.
Note that the previous statement does not hold true for nodes with shared
ownership due to the use-after-free issue that this series is
addressing. True shared ownership was disabled by commit 7deca5eae833
("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed"),
though, so the statement holds for now. Further patches in the series will change
the trustedness state of non-owning refs before re-enabling
bpf_refcount_acquire.
Let's add NON_OWN_REF type flag to BPF_REG_TRUSTED_MODIFIERS such that a
non-owning ref reg state would pass is_trusted_reg check. Somewhat
surprisingly, this doesn't result in any change to user-visible
functionality elsewhere in the verifier: graph collection nodes are all
marked MEM_ALLOC, which tends to be handled in separate codepaths from
"raw" PTR_TO_BTF_ID. Regardless, let's be explicit here and document the
current state of things before changing it elsewhere in the series.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230821193311.3290257-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Make sure that the following unsafe example is rejected by verifier:
1: r9 = ... some pointer with range X ...
2: r6 = ... unbound scalar ID=a ...
3: r7 = ... unbound scalar ID=b ...
4: if (r6 > r7) goto +1
5: r6 = r7
6: if (r6 > X) goto ...
--- checkpoint ---
7: r9 += r7
8: *(u64 *)r9 = Y
This example is unsafe because not all execution paths verify r7 range.
Because of the jump at (4) the verifier would arrive at (6) in two states:
I. r6{.id=b}, r7{.id=b} via path 1-6;
II. r6{.id=a}, r7{.id=b} via path 1-4, 6.
Currently regsafe() does not call check_ids() for scalar registers,
thus from POV of regsafe() states (I) and (II) are identical. If the
path 1-6 is taken by verifier first, and checkpoint is created at (6)
the path [1-4, 6] would be considered safe.
Changes in this commit:
- check_ids() is modified to disallow mapping multiple old_id to the
same cur_id.
- check_scalar_ids() is added, unlike check_ids() it treats ID zero as
a unique scalar ID.
- check_scalar_ids() needs to generate temporary unique IDs, field
'tmp_id_gen' is added to bpf_verifier_env::idmap_scratch to
facilitate this.
- regsafe() is updated to:
- use check_scalar_ids() for precise scalar registers.
- compare scalar registers using memcmp only for explore_alu_limits
branch. This simplifies control flow for scalar case, and has no
measurable performance impact.
- check_alu_op() is updated to avoid generating bpf_reg_state::id for
constant scalar values when processing BPF_MOV. ID is needed to
propagate range information for identical values, but there is
nothing to propagate for constants.
Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-4-eddyz87@gmail.com
|
|
Change mark_chain_precision() to track precision in situations
like below:
r2 = unknown value
...
--- state #0 ---
...
r1 = r2 // r1 and r2 now share the same ID
...
--- state #1 {r1.id = A, r2.id = A} ---
...
if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1
...
--- state #2 {r1.id = A, r2.id = A} ---
r3 = r10
r3 += r1 // need to mark both r1 and r2
At the beginning of the processing of each state, ensure that if a
register with a scalar ID is marked as precise, all registers sharing
this ID are also marked as precise.
This property would be used by a follow-up change in regsafe().
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230613153824.3324830-2-eddyz87@gmail.com
|
|
Add helper to format register and stack masks in more human-readable
format. Adjust logging a bit during backtrack propagation and especially
during forcing precision fallback logic to make it clearer what's going
on (with log_level=2, of course), and also start reporting affected
frame depth. This is in preparation for having more than one active
frame later when precision propagation between subprog calls is added.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add struct backtrack_state and straightforward API around it to keep
track of register and stack masks used and maintained during precision
backtracking process. Having this logic separately allow to keep
high-level backtracking algorithm cleaner, but also it sets us up to
cleanly keep track of register and stack masks per frame, allowing (with
some further logic adjustments) to perform precision backpropagation
across multiple frames (i.e., subprog calls).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Consider this code snippet:
struct node {
long key;
bpf_list_node l;
bpf_rb_node r;
bpf_refcount ref;
}
int some_bpf_prog(void *ctx)
{
struct node *n = bpf_obj_new(/*...*/), *m;
bpf_spin_lock(&glock);
bpf_rbtree_add(&some_tree, &n->r, /* ... */);
m = bpf_refcount_acquire(n);
bpf_rbtree_add(&other_tree, &m->r, /* ... */);
bpf_spin_unlock(&glock);
/* ... */
}
After bpf_refcount_acquire, n and m point to the same underlying memory,
and that node's bpf_rb_node field is being used by the some_tree insert,
so overwriting it as a result of the second insert is an error. In order
to properly support refcounted nodes, the rbtree and list insert
functions must be allowed to fail. This patch adds such support.
The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
return an int indicating success/failure, with 0 -> success, nonzero ->
failure.
bpf_obj_drop on failure
=======================
Currently the only reason an insert can fail is the example above: the
bpf_{list,rb}_node is already in use. When such a failure occurs, the
insert kfuncs will bpf_obj_drop the input node. This allows the insert
operations to logically fail without changing their verifier owning ref
behavior, namely the unconditional release_reference of the input
owning ref.
With insert that always succeeds, ownership of the node is always passed
to the collection, since the node always ends up in the collection.
With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
is always passed either to the collection (success), or to bpf_obj_drop
(failure). Regardless, it's correct to continue unconditionally
releasing the input owning ref, as something is always taking ownership
from the calling program on insert.
Keeping owning ref behavior unchanged results in a nice default UX for
insert functions that can fail. If the program's reaction to a failed
insert is "fine, just get rid of this owning ref for me and let me go
on with my business", then there's no reason to check for failure since
that's default behavior. e.g.:
long important_failures = 0;
int some_bpf_prog(void *ctx)
{
struct node *n, *m, *o; /* all bpf_obj_new'd */
bpf_spin_lock(&glock);
bpf_rbtree_add(&some_tree, &n->node, /* ... */);
bpf_rbtree_add(&some_tree, &m->node, /* ... */);
if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
important_failures++;
}
bpf_spin_unlock(&glock);
}
If we instead chose to pass ownership back to the program on failed
insert - by returning NULL on success or an owning ref on failure -
programs would always have to do something with the returned ref on
failure. The most likely action is probably "I'll just get rid of this
owning ref and go about my business", which ideally would look like:
if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
bpf_obj_drop(n);
But bpf_obj_drop isn't allowed in a critical section and inserts must
occur within one, so in reality error handling would become a
hard-to-parse mess.
For refcounted nodes, we can replicate the "pass ownership back to
program on failure" logic with this patch's semantics, albeit in an ugly
way:
struct node *n = bpf_obj_new(/* ... */), *m;
bpf_spin_lock(&glock);
m = bpf_refcount_acquire(n);
if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
/* Do something with m */
}
bpf_spin_unlock(&glock);
bpf_obj_drop(m);
bpf_refcount_acquire is used to simulate "return owning ref on failure".
This should be an uncommon occurrence, though.
Addition of two verifier-fixup'd args to collection inserts
===========================================================
The actual bpf_obj_drop kfunc is
bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
macro populating the second arg with 0 and the verifier later filling in
the arg during insn fixup.
Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
passed to bpf_obj_drop_impl.
Similarly, because the 'node' param to those insert functions is the
bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
pointer to the beginning of the node, the insert functions need to be
able to find the beginning of the node struct. A second
verifier-populated param is necessary: the offset of {list,rb}_node within the
node type.
These two new params allow the insert kfuncs to correctly call
__bpf_obj_drop_impl:
beginning_of_node = bpf_rb_node_ptr - offset
if (already_inserted)
__bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);
Similarly to other kfuncs with "hidden" verifier-populated params, the
insert functions are renamed with _impl prefix and a macro is provided
for common usage. For example, bpf_rbtree_add kfunc is now
bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
"hidden" args to 0.
Due to the two new args BPF progs will need to be recompiled to work
with the new _impl kfuncs.
This patch also rewrites the "hidden argument" explanation to more
directly say why the BPF program writer doesn't need to populate the
arguments with anything meaningful.
How does this new logic affect non-owning references?
=====================================================
Currently, non-owning refs are valid until the end of the critical
section in which they're created. We can make this guarantee because, if
a non-owning ref exists, the referent was added to some collection. The
collection will drop() its nodes when it goes away, but it can't go away
while our program is accessing it, so that's not a problem. If the
referent is removed from the collection in the same CS that it was added
in, it can't be bpf_obj_drop'd until after CS end. Those are the only
two ways to free the referent's memory and neither can happen until
after the non-owning ref's lifetime ends.
On first glance, having these collection insert functions potentially
bpf_obj_drop their input seems like it breaks the "can't be
bpf_obj_drop'd until after CS end" line of reasoning. But we care about
the memory not being _freed_ until end of CS end, and a previous patch
in the series modified bpf_obj_drop such that it doesn't free refcounted
nodes until refcount == 0. So the statement can be more accurately
rewritten as "can't be free'd until after CS end".
We can prove that this rewritten statement holds for any non-owning
reference produced by collection insert functions:
* If the input to the insert function is _not_ refcounted
* We have an owning reference to the input, and can conclude it isn't
in any collection
* Inserting a node in a collection turns owning refs into
non-owning, and since our input type isn't refcounted, there's no
way to obtain additional owning refs to the same underlying
memory
* Because our node isn't in any collection, the insert operation
cannot fail, so bpf_obj_drop will not execute
* If bpf_obj_drop is guaranteed not to execute, there's no risk of
memory being free'd
* Otherwise, the input to the insert function is refcounted
* If the insert operation fails due to the node's list_head or rb_root
already being in some collection, there was some previous successful
insert which passed refcount to the collection
* We have an owning reference to the input, it must have been
acquired via bpf_refcount_acquire, which bumped the refcount
* refcount must be >= 2 since there's a valid owning reference and the
node is already in a collection
* Insert triggering bpf_obj_drop will decr refcount to >= 1, never
resulting in a free
So although we may do bpf_obj_drop during the critical section, this
will never result in memory being free'd, and no changes to non-owning
ref logic are needed in this patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|