| author | Alexei Starovoitov <ast@kernel.org> | 2023-04-16 03:36:50 +0300 |
|---|---|---|
| committer | Alexei Starovoitov <ast@kernel.org> | 2023-04-16 03:36:51 +0300 |
| commit | 7a0788fe835f98391b8fcb03e3cd29c1296b3280 (patch) | |
| tree | 94d7c2b79cf60f0aecafab0fe209008a27957d44 /include/linux | |
| parent | 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f (diff) | |
| parent | 6147f15131e2df544a5449815f456da48c0c88e7 (diff) | |
| download | linux-7a0788fe835f98391b8fcb03e3cd29c1296b3280.tar.xz | |
Merge branch 'Shared ownership for local kptrs'
Dave Marchevsky says:
====================
This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:
  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };
bpf_refcount is used to implement shared ownership for local kptrs.
Motivating usecase
==================
If a struct has two collection node fields, e.g.:
  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
  };
It's not currently possible to add a node to both the list and rbtree:
  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n));
    if (!n) { /* ... */ }
    bpf_spin_lock(&lock);
    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
    bpf_spin_unlock(&lock);
  }
The above program will fail verification due to current owning / non-owning ref
logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
passed to bpf_rbtree_add. The only way to get an owning reference for the node
that was added is to bpf_list_pop_{front,back} it.
More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in map) with exclusive ownership
of the node's lifetime. The owner frees the node's underlying memory when it
itself goes away.
Without a shared ownership concept it's impossible to express many real-world
usecases such that they pass verification.
Semantic Changes
================
Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership. In order to support shared ownership it's
necessary to modify the exclusive ownership semantic.
After this series' changes, an owning reference has ownership of the referent's
lifetime, but it's not necessarily exclusive. The referent's underlying memory
is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
used for collection insert.
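The lifetime rule above can be sketched in plain userspace C. This is only a model, not kernel or BPF code: the model_* names are hypothetical, the counter is a plain int rather than an atomic refcount, and the functions merely play the roles of bpf_obj_new, bpf_refcount_acquire and bpf_obj_drop.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for a refcounted local kptr; not kernel code. */
struct model_node {
	long data;
	int refcount;	/* plays the role of struct bpf_refcount */
};

static int model_live_nodes;	/* nodes allocated and not yet freed */

/* Like bpf_obj_new: the caller receives the first owning reference. */
static struct model_node *model_obj_new(long data)
{
	struct model_node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->data = data;
	n->refcount = 1;
	model_live_nodes++;
	return n;
}

/* Like bpf_refcount_acquire: bump the count, hand out another owning ref. */
static struct model_node *model_refcount_acquire(struct model_node *n)
{
	n->refcount++;
	return n;
}

/* Like bpf_obj_drop: give up one owning ref; the last drop frees the memory. */
static void model_obj_drop(struct model_node *n)
{
	assert(n->refcount > 0);
	if (--n->refcount == 0) {
		free(n);
		model_live_nodes--;
	}
}
```

In this model, dropping one of two owning references leaves the node allocated, so the remaining owner can still read it; the memory is released only when the second reference is dropped.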
This change doesn't affect UX of owning or non-owning references much:
* insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
an owning reference arg, as ownership still must be passed to the
collection in a shared-ownership world.
* non-owning references still refer to valid memory without claiming
any ownership.
One important conclusion that followed from the "exclusive ownership" statement
is no longer valid, though. In an exclusive-ownership world, if a BPF prog has
an owning reference to a node, the verifier can conclude that no collection has
ownership of it. This conclusion was used to avoid runtime checking in the
implementations of insert and remove operations ("has the node already been
{inserted, removed}?").
In a shared-ownership world the aforementioned conclusion is no longer valid,
which necessitates doing runtime checking in insert and remove operation
kfuncs, and those functions possibly failing to insert or remove anything.
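A rough userspace sketch of what such a runtime check might look like follows. It assumes a circular doubly linked list in the style of the kernel's list_head, where a node that points at itself is not linked anywhere. The model_* names are hypothetical and the kfuncs in this series may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct model_list {
	struct model_list *next, *prev;
};

static void model_list_init(struct model_list *l)
{
	l->next = l;
	l->prev = l;
}

/* A node that points at itself is not linked into any list. */
static bool model_list_unlinked(const struct model_list *l)
{
	return l->next == l;
}

/* Insert at the tail; fails at runtime if the node is already linked. */
static int model_list_push_back(struct model_list *head, struct model_list *node)
{
	if (!model_list_unlinked(node))
		return -EINVAL;
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
	return 0;
}

/* Remove a node; returns NULL if it was not linked anywhere. */
static struct model_list *model_list_remove(struct model_list *node)
{
	if (model_list_unlinked(node))
		return NULL;
	node->prev->next = node->next;
	node->next->prev = node->prev;
	model_list_init(node);
	return node;
}
```

Because the caller can no longer prove statically that a node is unlinked, both operations report failure through their return value instead of assuming success.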
Luckily the verifier changes necessary to go from exclusive to shared ownership
were fairly minimal. Patches in this series which do change verifier semantics
generally have some summary dedicated to explaining why certain usecases
Just Work for shared ownership without verifier changes.
Implementation
==============
The changes in this series can be categorized as follows:
  * struct bpf_refcount opaque field + plumbing
  * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
  * bpf_refcount_acquire kfunc
    * enables shared ownership by bumping refcount + acquiring owning ref
  * support for possibly-failing collection insertion and removal
    * insertion changes are more complex
If a patch's changes have some nuance to their effect - or lack of effect - on
verifier behavior, the patch summary talks about it at length.
Patch contents:
* Patch 1 removes btf_field_offs struct
* Patch 2 adds struct bpf_refcount and associated plumbing
* Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
refcounted kptrs
* Patch 4 adds bpf_refcount_acquire
* Patches 5-7 add support for possibly-failing collection insert and remove
* Patch 8 centralizes constructor-like functionality for local kptr types
* Patch 9 adds tests for new functionality
base-commit: 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f
Changelog:
v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com
Patch #s used below refer to the patch's position in v1 unless otherwise
specified.
  * General
    * Rebase onto latest bpf-next (base-commit updated above)
  * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
    * Fix typo in summary (Alexei)
  * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
    * Modify a paragraph in patch summary to more clearly state that only
      bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
      patch (Alexei)
    * refcount_off == -1 -> refcount_off < 0 in "node type w/ both list
      and rb_node fields" check, since any negative value means "no
      bpf_refcount field found", and furthermore refcount_off is never
      explicitly set to -1, but rather -EINVAL. (Alexei)
    * Instead of just changing "btf: list_node and rb_node in same struct" test
      expectation to pass instead of fail, do some refactoring to test both
      "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
      _no_ bpf_refcount" (failure) cases. This ensures that logic change in
      previous bullet point is correct.
      * v1's "btf: list_node and rb_node in same struct" test changes didn't
        add bpf_refcount, so the fact that btf load succeeded w/ list and
        rb_nodes but no bpf_refcount field is further proof that this logic
        was incorrect in v1.
  * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
    * Instead of doing __init_field_infer_size in kfuncs when taking
      bpf_list_head type input which might've been 0-initialized in map, go
      back to simple oneliner initialization. Add short comment explaining why
      this is necessary. (Alexei)
  * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
    * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Diffstat (limited to 'include/linux')
| -rw-r--r-- | include/linux/bpf.h | 80 | ||||
| -rw-r--r-- | include/linux/bpf_verifier.h | 7 | ||||
| -rw-r--r-- | include/linux/btf.h | 2 |
3 files changed, 61 insertions, 28 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88845aadc47d..18b592fde896 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -187,6 +187,7 @@ enum btf_field_type {
 	BPF_RB_NODE = (1 << 7),
 	BPF_GRAPH_NODE_OR_ROOT = BPF_LIST_NODE | BPF_LIST_HEAD |
 				 BPF_RB_NODE | BPF_RB_ROOT,
+	BPF_REFCOUNT = (1 << 8),
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -210,6 +211,7 @@ struct btf_field_graph_root {
 
 struct btf_field {
 	u32 offset;
+	u32 size;
 	enum btf_field_type type;
 	union {
 		struct btf_field_kptr kptr;
@@ -222,15 +224,10 @@ struct btf_record {
 	u32 field_mask;
 	int spin_lock_off;
 	int timer_off;
+	int refcount_off;
 	struct btf_field fields[];
 };
 
-struct btf_field_offs {
-	u32 cnt;
-	u32 field_off[BTF_FIELDS_MAX];
-	u8 field_sz[BTF_FIELDS_MAX];
-};
-
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -257,7 +254,6 @@ struct bpf_map {
 	struct obj_cgroup *objcg;
 #endif
 	char name[BPF_OBJ_NAME_LEN];
-	struct btf_field_offs *field_offs;
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
 	 */
@@ -299,6 +295,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_root";
 	case BPF_RB_NODE:
 		return "bpf_rb_node";
+	case BPF_REFCOUNT:
+		return "bpf_refcount";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -323,6 +321,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return sizeof(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return sizeof(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -347,12 +347,42 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return __alignof__(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return __alignof__(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
 
+static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
+{
+	memset(addr, 0, field->size);
+
+	switch (field->type) {
+	case BPF_REFCOUNT:
+		refcount_set((refcount_t *)addr, 1);
+		break;
+	case BPF_RB_NODE:
+		RB_CLEAR_NODE((struct rb_node *)addr);
+		break;
+	case BPF_LIST_HEAD:
+	case BPF_LIST_NODE:
+		INIT_LIST_HEAD((struct list_head *)addr);
+		break;
+	case BPF_RB_ROOT:
+		/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
+	case BPF_SPIN_LOCK:
+	case BPF_TIMER:
+	case BPF_KPTR_UNREF:
+	case BPF_KPTR_REF:
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+
 static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_field_type type)
 {
 	if (IS_ERR_OR_NULL(rec))
@@ -360,14 +390,14 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
 	return rec->field_mask & type;
 }
 
-static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
+static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 {
 	int i;
 
-	if (!foffs)
+	if (IS_ERR_OR_NULL(rec))
 		return;
-	for (i = 0; i < foffs->cnt; i++)
-		memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
+	for (i = 0; i < rec->cnt; i++)
+		bpf_obj_init_field(&rec->fields[i], obj + rec->fields[i].offset);
 }
 
 /* 'dst' must be a temporary buffer and should not point to memory that is being
@@ -379,7 +409,7 @@ static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
  */
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_init(map->field_offs, dst);
+	bpf_obj_init(map->record, dst);
 }
 
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
@@ -399,14 +429,14 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 }
 
 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
-static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
+static inline void bpf_obj_memcpy(struct btf_record *rec,
 				  void *dst, void *src, u32 size,
 				  bool long_memcpy)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		if (long_memcpy)
 			bpf_long_memcpy(dst, src, round_up(size, 8));
 		else
@@ -414,49 +444,49 @@ static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memcpy(dst + curr_off, src + curr_off, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memcpy(dst + curr_off, src + curr_off, size - curr_off);
 }
 
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, false);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
 }
 
 static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, true);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
 }
 
-static inline void bpf_obj_memzero(struct btf_field_offs *foffs, void *dst, u32 size)
+static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		memset(dst, 0, size);
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memset(dst + curr_off, 0, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memset(dst + curr_off, 0, size - curr_off);
 }
 
 static inline void zero_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_memzero(map->field_offs, dst, map->value_size);
+	bpf_obj_memzero(map->record, dst, map->value_size);
 }
 
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index f03852b89d28..3dd29a53b711 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -464,7 +464,12 @@ struct bpf_insn_aux_data {
 		 */
 		struct bpf_loop_inline_state loop_inline_state;
 	};
-	u64 obj_new_size; /* remember the size of type passed to bpf_obj_new to rewrite R1 */
+	union {
+		/* remember the size of type passed to bpf_obj_new to rewrite R1 */
+		u64 obj_new_size;
+		/* remember the offset of node field within type to rewrite */
+		u64 insert_off;
+	};
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 495250162422..813227bff58a 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -113,7 +113,6 @@ struct btf_id_dtor_kfunc {
 struct btf_struct_meta {
 	u32 btf_id;
 	struct btf_record *record;
-	struct btf_field_offs *field_offs;
 };
 
 struct btf_struct_metas {
@@ -207,7 +206,6 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
-struct btf_field_offs *btf_parse_field_offs(struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
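The offset arithmetic in bpf_obj_memcpy after this change can be tried out in userspace with a small model. The sketch below is not the kernel code: struct model_field and struct model_record are hypothetical, trimmed-down stand-ins for struct btf_field and struct btf_record, and the loop assumes the fields are sorted by offset and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for struct btf_field / btf_record. */
struct model_field {
	size_t offset;
	size_t size;
};

struct model_record {
	size_t cnt;
	const struct model_field *fields;	/* assumed sorted by offset */
};

/* Copy size bytes from src to dst, leaving each special field in dst untouched. */
static void model_obj_memcpy(const struct model_record *rec,
			     void *dst, const void *src, size_t size)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t curr_off = 0;
	size_t i;

	if (!rec) {
		memcpy(d, s, size);
		return;
	}

	for (i = 0; i < rec->cnt; i++) {
		size_t next_off = rec->fields[i].offset;
		size_t sz = next_off - curr_off;

		/* Copy the plain bytes in front of this field. */
		memcpy(d + curr_off, s + curr_off, sz);
		/* Step over the plain bytes and the field itself. */
		curr_off += rec->fields[i].size + sz;
	}
	/* Copy whatever follows the last special field. */
	memcpy(d + curr_off, s + curr_off, size - curr_off);
}
```

With fields at offsets 4 (4 bytes) and 12 (2 bytes) in a 16-byte value, the model copies bytes 0-3, 8-11 and 14-15 and leaves the destination's bytes 4-7 and 12-13 as they were.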
