Merge branch 'Shared ownership for local kptrs'

Dave Marchevsky says:

====================

This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:

  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

bpf_refcount is used to implement shared ownership for local kptrs.

Motivating usecase
==================

If a struct has two collection node fields, e.g.:

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
  };

It's not currently possible to add a node to both the list and rbtree:

  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n));
    if (!n) { /* ... */ }

    bpf_spin_lock(&lock);

    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
    bpf_spin_unlock(&lock);
  }

The above program will fail verification due to current owning / non-owning
ref logic: after bpf_list_push_back, n is a non-owning reference and thus
cannot be passed to bpf_rbtree_add. The only way to get an owning reference
for the node that was added is to bpf_list_pop_{front,back} it.

More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in map) with exclusive ownership
of the node's lifetime. The owner frees the node's underlying memory when it
itself goes away.

Without a shared ownership concept it's impossible to express many
real-world usecases such that they pass verification.

Semantic Changes
================

Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership.

In order to support shared ownership it's necessary to modify the exclusive
ownership semantic. After this series' changes, an owning reference has
ownership of the referent's lifetime, but it's not necessarily exclusive.
The referent's underlying memory is guaranteed to be valid (i.e. not freed)
until the reference is dropped or used for collection insert.

This change doesn't affect UX of owning or non-owning references much:

  * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
    an owning reference arg, as ownership still must be passed to the
    collection in a shared-ownership world.

  * non-owning references still refer to valid memory without claiming
    any ownership.

One important conclusion that followed from the "exclusive ownership"
statement is no longer valid, though. In an exclusive-ownership world, if a
BPF prog has an owning reference to a node, the verifier can conclude that no
collection has ownership of it. This conclusion was used to avoid runtime
checking in the implementations of insert and remove operations ("has the
node already been {inserted, removed}?").

In a shared-ownership world the aforementioned conclusion is no longer valid,
which necessitates doing runtime checking in insert and remove operation
kfuncs, and those functions possibly failing to insert or remove anything.

Luckily the verifier changes necessary to go from exclusive to shared
ownership were fairly minimal. Patches in this series which do change
verifier semantics generally have some summary dedicated to explaining why
certain usecases Just Work for shared ownership without verifier changes.

Implementation
==============

The changes in this series can be categorized as follows:

  * struct bpf_refcount opaque field + plumbing
  * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
  * bpf_refcount_acquire kfunc
    * enables shared ownership by bumping refcount + acquiring owning ref
  * support for possibly-failing collection insertion and removal
    * insertion changes are more complex

If a patch's changes have some nuance to their effect - or lack of effect -
on verifier behavior, the patch summary talks about it at length.

Patch contents:
  * Patch 1 removes btf_field_offs struct
  * Patch 2 adds struct bpf_refcount and associated plumbing
  * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
    refcounted kptrs
  * Patch 4 adds bpf_refcount_acquire
  * Patches 5-7 add support for possibly-failing collection insert and remove
  * Patch 8 centralizes constructor-like functionality for local kptr types
  * Patch 9 adds tests for new functionality

base-commit: 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f

Changelog:

v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com

Patch #s used below refer to the patch's position in v1 unless otherwise
specified.

  * General
    * Rebase onto latest bpf-next (base-commit updated above)

  * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
    * Fix typo in summary (Alexei)

  * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
    * Modify a paragraph in patch summary to more clearly state that only
      bpf_rbtree_remove's non-owning ref clobbering behavior is changed by
      the patch (Alexei)
    * refcount_off == -1 -> refcount_off < 0 in "node type w/ both list and
      rb_node fields" check, since any negative value means "no bpf_refcount
      field found", and furthermore refcount_off is never explicitly set to
      -1, but rather -EINVAL. (Alexei)
    * Instead of just changing "btf: list_node and rb_node in same struct"
      test expectation to pass instead of fail, do some refactoring to test
      both "list_node, rb_node, and bpf_refcount" (success) and "list_node,
      rb_node, _no_ bpf_refcount" (failure) cases. This ensures that logic
      change in previous bullet point is correct.
      * v1's "btf: list_node and rb_node in same struct" test changes didn't
        add bpf_refcount, so the fact that btf load succeeded w/ list and
        rb_nodes but no bpf_refcount field is further proof that this logic
        was incorrect in v1.

  * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
    * Instead of doing __init_field_infer_size in kfuncs when taking
      bpf_list_head type input which might've been 0-initialized in map, go
      back to simple oneliner initialization. Add short comment explaining
      why this is necessary. (Alexei)

  * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
    * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
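
The ownership rules in the cover letter can be tried out in a small
user-space model. The sketch below is not kernel or BPF code: every
model_* name is hypothetical, the "collections" are just two flags on the
node, and the choice to drop the reference when an insert fails is an
assumption of the model rather than something the cover letter states. It
only illustrates why a second owning reference lets one node sit in two
collections, and why insert needs a runtime "already inserted?" check once
ownership is shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model of a refcounted node. None of these names
 * are kernel APIs; they only mimic the ownership rules described above.
 */
struct model_node {
	long key;
	int refcount;	/* stands in for struct bpf_refcount */
	bool in_list;	/* true while the model list owns a reference */
	bool in_tree;	/* true while the model tree owns a reference */
};

static int live_nodes;	/* allocations that have not been freed yet */

/* Like bpf_obj_new: the caller receives the first owning reference. */
static struct model_node *model_obj_new(long key)
{
	struct model_node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->key = key;
	n->refcount = 1;
	live_nodes++;
	return n;
}

/* Like bpf_refcount_acquire: bump the count, return another owning ref. */
static struct model_node *model_acquire(struct model_node *n)
{
	n->refcount++;
	return n;
}

/* Like bpf_obj_drop: give up one owning ref, free memory on the last one. */
static void model_drop(struct model_node *n)
{
	if (--n->refcount == 0) {
		free(n);
		live_nodes--;
	}
}

/* Insert consumes the owning reference it is handed. With shared ownership
 * the node may already be in the collection, so this is checked at runtime.
 * Model assumption: a failed insert drops the reference it was given.
 */
static int model_insert(struct model_node *n, bool *in_collection)
{
	if (*in_collection) {
		model_drop(n);
		return -1;
	}
	*in_collection = true;
	return 0;
}

/* Removal hands the collection's reference back, or NULL if not present. */
static struct model_node *model_remove(struct model_node *n, bool *in_collection)
{
	if (!*in_collection)
		return NULL;
	*in_collection = false;
	return n;
}

/* Walks the motivating use case; returns 0 if every step behaves as expected. */
static int run_shared_ownership_demo(void)
{
	struct model_node *n = model_obj_new(42);
	struct model_node *m, *extra, *out;

	if (!n)
		return -1;
	m = model_acquire(n);				/* refcount is now 2 */
	if (model_insert(n, &n->in_list))		/* list owns one ref */
		return -1;
	if (model_insert(m, &m->in_tree))		/* tree owns the other */
		return -1;

	extra = model_acquire(n);			/* refcount is now 3 */
	if (model_insert(extra, &extra->in_list) == 0)	/* already in list */
		return -1;
	if (n->refcount != 2)				/* failed insert dropped it */
		return -1;

	out = model_remove(n, &n->in_list);
	if (!out)
		return -1;
	model_drop(out);				/* tree still keeps node alive */
	if (live_nodes != 1)
		return -1;
	out = model_remove(n, &n->in_tree);
	if (!out)
		return -1;
	model_drop(out);				/* last ref: memory is freed */
	return live_nodes == 0 ? 0 : -1;
}
```

Calling run_shared_ownership_demo() from a test harness should return 0. In
a real BPF program the verifier, not a flag check in the program, enforces
which references are owning; the model only mirrors the runtime side.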
author: Alexei Starovoitov <ast@kernel.org> 2023-04-16 03:36:50 +0300
committer: Alexei Starovoitov <ast@kernel.org> 2023-04-16 03:36:51 +0300
commit: 7a0788fe835f98391b8fcb03e3cd29c1296b3280 (patch)
tree: 94d7c2b79cf60f0aecafab0fe209008a27957d44 /include/linux
parent: 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f (diff)
parent: 6147f15131e2df544a5449815f456da48c0c88e7 (diff)
download: linux-7a0788fe835f98391b8fcb03e3cd29c1296b3280.tar.xz
3 files changed, 61 insertions, 28 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88845aadc47d..18b592fde896 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -187,6 +187,7 @@ enum btf_field_type {
 	BPF_RB_NODE    = (1 << 7),
 	BPF_GRAPH_NODE_OR_ROOT = BPF_LIST_NODE | BPF_LIST_HEAD |
 				 BPF_RB_NODE | BPF_RB_ROOT,
+	BPF_REFCOUNT   = (1 << 8),
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -210,6 +211,7 @@ struct btf_field_graph_root {
 
 struct btf_field {
 	u32 offset;
+	u32 size;
 	enum btf_field_type type;
 	union {
 		struct btf_field_kptr kptr;
@@ -222,15 +224,10 @@ struct btf_record {
 	u32 field_mask;
 	int spin_lock_off;
 	int timer_off;
+	int refcount_off;
 	struct btf_field fields[];
 };
 
-struct btf_field_offs {
-	u32 cnt;
-	u32 field_off[BTF_FIELDS_MAX];
-	u8 field_sz[BTF_FIELDS_MAX];
-};
-
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -257,7 +254,6 @@ struct bpf_map {
 	struct obj_cgroup *objcg;
 #endif
 	char name[BPF_OBJ_NAME_LEN];
-	struct btf_field_offs *field_offs;
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
 	 */
@@ -299,6 +295,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_root";
 	case BPF_RB_NODE:
 		return "bpf_rb_node";
+	case BPF_REFCOUNT:
+		return "bpf_refcount";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -323,6 +321,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return sizeof(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return sizeof(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -347,12 +347,42 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_root);
 	case BPF_RB_NODE:
 		return __alignof__(struct bpf_rb_node);
+	case BPF_REFCOUNT:
+		return __alignof__(struct bpf_refcount);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
 
+static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
+{
+	memset(addr, 0, field->size);
+
+	switch (field->type) {
+	case BPF_REFCOUNT:
+		refcount_set((refcount_t *)addr, 1);
+		break;
+	case BPF_RB_NODE:
+		RB_CLEAR_NODE((struct rb_node *)addr);
+		break;
+	case BPF_LIST_HEAD:
+	case BPF_LIST_NODE:
+		INIT_LIST_HEAD((struct list_head *)addr);
+		break;
+	case BPF_RB_ROOT:
+		/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
+	case BPF_SPIN_LOCK:
+	case BPF_TIMER:
+	case BPF_KPTR_UNREF:
+	case BPF_KPTR_REF:
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+
 static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_field_type type)
 {
 	if (IS_ERR_OR_NULL(rec))
@@ -360,14 +390,14 @@ static inline bool btf_record_has_field(const struct btf_record *rec, enum btf_f
 	return rec->field_mask & type;
 }
 
-static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
+static inline void bpf_obj_init(const struct btf_record *rec, void *obj)
 {
 	int i;
 
-	if (!foffs)
+	if (IS_ERR_OR_NULL(rec))
 		return;
-	for (i = 0; i < foffs->cnt; i++)
-		memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
+	for (i = 0; i < rec->cnt; i++)
+		bpf_obj_init_field(&rec->fields[i], obj + rec->fields[i].offset);
 }
 
 /* 'dst' must be a temporary buffer and should not point to memory that is being
@@ -379,7 +409,7 @@ static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
  */
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_init(map->field_offs, dst);
+	bpf_obj_init(map->record, dst);
 }
 
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
@@ -399,14 +429,14 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 }
 
 /* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
-static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
+static inline void bpf_obj_memcpy(struct btf_record *rec,
 				  void *dst, void *src, u32 size,
 				  bool long_memcpy)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		if (long_memcpy)
 			bpf_long_memcpy(dst, src, round_up(size, 8));
 		else
@@ -414,49 +444,49 @@ static inline void bpf_obj_memcpy(struct btf_field_offs *foffs,
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memcpy(dst + curr_off, src + curr_off, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memcpy(dst + curr_off, src + curr_off, size - curr_off);
 }
 
 static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, false);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, false);
 }
 
 static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
 {
-	bpf_obj_memcpy(map->field_offs, dst, src, map->value_size, true);
+	bpf_obj_memcpy(map->record, dst, src, map->value_size, true);
 }
 
-static inline void bpf_obj_memzero(struct btf_field_offs *foffs, void *dst, u32 size)
+static inline void bpf_obj_memzero(struct btf_record *rec, void *dst, u32 size)
 {
 	u32 curr_off = 0;
 	int i;
 
-	if (likely(!foffs)) {
+	if (IS_ERR_OR_NULL(rec)) {
 		memset(dst, 0, size);
 		return;
 	}
 
-	for (i = 0; i < foffs->cnt; i++) {
-		u32 next_off = foffs->field_off[i];
+	for (i = 0; i < rec->cnt; i++) {
+		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;
 
 		memset(dst + curr_off, 0, sz);
-		curr_off += foffs->field_sz[i] + sz;
+		curr_off += rec->fields[i].size + sz;
 	}
 	memset(dst + curr_off, 0, size - curr_off);
 }
 
 static inline void zero_map_value(struct bpf_map *map, void *dst)
 {
-	bpf_obj_memzero(map->field_offs, dst, map->value_size);
+	bpf_obj_memzero(map->record, dst, map->value_size);
 }
 
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
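
The offset arithmetic in bpf_obj_memcpy above is easier to follow on a tiny
example. The sketch below is a user-space re-creation under stated
assumptions: struct model_field is a hypothetical two-member stand-in for
struct btf_field, the field array is assumed to be sorted by offset and
non-overlapping (the unsigned subtraction relies on that), and the "special"
fields are plain ints rather than real spin locks or timers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct btf_field. */
struct model_field {
	unsigned int offset;
	unsigned int size;
};

/* Copy 'size' bytes from src to dst, leaving every listed field in dst
 * untouched. 'fields' must be sorted by offset and must not overlap.
 */
static void model_obj_memcpy(const struct model_field *fields, int cnt,
			     void *dst, const void *src, unsigned int size)
{
	unsigned int curr_off = 0;
	int i;

	for (i = 0; i < cnt; i++) {
		unsigned int next_off = fields[i].offset;
		unsigned int sz = next_off - curr_off;	/* gap before field i */

		memcpy((char *)dst + curr_off, (const char *)src + curr_off, sz);
		curr_off += fields[i].size + sz;	/* jump past field i */
	}
	/* Tail after the last special field. */
	memcpy((char *)dst + curr_off, (const char *)src + curr_off,
	       size - curr_off);
}

/* All members are ints, so there is no padding between them. */
struct model_value {
	int a;
	int lock;	/* pretend special field, must not be copied */
	int b;
	int timer;	/* pretend special field, must not be copied */
	int c;
};

/* Returns 0 if ordinary members were copied and special ones preserved. */
static int run_copy_demo(void)
{
	const struct model_field fields[] = {
		{ offsetof(struct model_value, lock), sizeof(int) },
		{ offsetof(struct model_value, timer), sizeof(int) },
	};
	struct model_value src = { 1, 111, 2, 222, 3 };
	struct model_value dst = { 0, 7, 0, 9, 0 };

	model_obj_memcpy(fields, 2, &dst, &src, sizeof(src));

	return (dst.a == 1 && dst.b == 2 && dst.c == 3 &&
		dst.lock == 7 && dst.timer == 9) ? 0 : -1;
}
```

Because each entry carries both offset and size, the loop needs nothing
beyond the field array itself, which is the information the diff moves from
btf_field_offs into struct btf_field.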
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index f03852b89d28..3dd29a53b711 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -464,7 +464,12 @@ struct bpf_insn_aux_data {
 		 */
 		struct bpf_loop_inline_state loop_inline_state;
 	};
-	u64 obj_new_size; /* remember the size of type passed to bpf_obj_new to rewrite R1 */
+	union {
+		/* remember the size of type passed to bpf_obj_new to rewrite R1 */
+		u64 obj_new_size;
+		/* remember the offset of node field within type to rewrite */
+		u64 insert_off;
+	};
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 495250162422..813227bff58a 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -113,7 +113,6 @@ struct btf_id_dtor_kfunc {
 struct btf_struct_meta {
 	u32 btf_id;
 	struct btf_record *record;
-	struct btf_field_offs *field_offs;
 };
 
 struct btf_struct_metas {
@@ -207,7 +206,6 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
-struct btf_field_offs *btf_parse_field_offs(struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
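
With btf_field_offs removed, object initialization in this diff is driven by
the record's own field array, and bpf_obj_init_field picks a per-type
initializer instead of relying on a blanket memset. A minimal user-space
sketch of that dispatch follows. All model_* names and the three-value enum
are hypothetical, a plain int stands in for refcount_t, and a self-pointing
pair of pointers stands in for what INIT_LIST_HEAD does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical field kinds; the kernel enum has many more. */
enum model_field_type { MODEL_REFCOUNT, MODEL_LIST_NODE, MODEL_PLAIN };

struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Hypothetical stand-in for struct btf_field with the new 'size' member. */
struct model_field {
	unsigned int offset;
	unsigned int size;
	enum model_field_type type;
};

/* Zero the field, then apply the type-specific non-zero initial state. */
static void model_init_field(const struct model_field *f, void *addr)
{
	memset(addr, 0, f->size);

	switch (f->type) {
	case MODEL_REFCOUNT:
		*(int *)addr = 1;	/* a new object has one owner */
		break;
	case MODEL_LIST_NODE: {
		struct model_list_head *h = addr;

		h->next = h;		/* empty list node points at itself */
		h->prev = h;
		break;
	}
	case MODEL_PLAIN:
		break;			/* zeroing is enough */
	}
}

/* Initialize every described field of obj; other bytes are left alone. */
static void model_obj_init(const struct model_field *fields, int cnt, void *obj)
{
	int i;

	for (i = 0; i < cnt; i++)
		model_init_field(&fields[i], (char *)obj + fields[i].offset);
}

struct model_obj {
	long data;
	struct model_list_head ll;
	int ref;
};

/* Returns 0 if special fields got their initial state and data is untouched. */
static int run_init_demo(void)
{
	const struct model_field fields[] = {
		{ offsetof(struct model_obj, ll),
		  sizeof(struct model_list_head), MODEL_LIST_NODE },
		{ offsetof(struct model_obj, ref), sizeof(int), MODEL_REFCOUNT },
	};
	struct model_obj o;

	memset(&o, 0xff, sizeof(o));	/* start from junk to prove init ran */
	model_obj_init(fields, 2, &o);

	return (o.ref == 1 && o.ll.next == &o.ll && o.ll.prev == &o.ll &&
		o.data == -1L) ? 0 : -1;
}
```

The point of the dispatch is that "all zeroes" is not a valid initial state
for every field type: a zero refcount would describe an object with no
owners, and a zeroed list node would not read as empty.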