summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-06-29Merge branch 'bpf: cgroup_sock lsm flavor'Alexei Starovoitov26-166/+1432
Stanislav Fomichev says: ==================== This series implements new lsm flavor for attaching per-cgroup programs to existing lsm hooks. The cgroup is taken out of 'current', unless the first argument of the hook is 'struct socket'. In this case, the cgroup association is taken out of socket. The attachment looks like a regular per-cgroup attachment: we add new BPF_LSM_CGROUP attach type which, together with attach_btf_id, signals per-cgroup lsm. Behind the scenes, we allocate trampoline shim program and attach to lsm. This program looks up cgroup from current/socket and runs cgroup's effective prog array. The rest of the per-cgroup BPF stays the same: hierarchy, local storage, retval conventions (return 1 == success). Current limitations: * haven't considered sleepable bpf; can be extended later on * not sure the verifier does the right thing with null checks; see latest selftest for details * total of 10 (global) per-cgroup LSM attach points v11: - Martin: address selftest memory & fd leaks - Martin: address moving into root (instead have another temp leaf cgroup) - Martin: move tools/include/uapi/linux/bpf.h change from libbpf patch into 'sync tools' patch v10: - Martin: reword commit message, drop outdated items - Martin: remove rcu_real_lock from __cgroup_bpf_run_lsm_current - Martin: remove CONFIG_BPF_LSM from cgroup_bpf_release - Martin: fix leaking shim reference in bpf_cgroup_link_release - Martin: WARN_ON_ONCE for bpf_trampoline_lookup in bpf_trampoline_unlink_cgroup_shim - Martin: sync tools/include/linux/btf_ids.h - Martin: move progs/flags closer to the places where they are used in __cgroup_bpf_query - Martin: remove sk_clone_security & sctp_bind_connect from bpf_lsm_locked_sockopt_hooks - Martin: try to determine vmlinux btf_id in bpftool - Martin: update tools header in a separate commit - Quentin: do libbpf_find_kernel_btf from the ops that need it - lkp@intel.com: another build failure v9: Major change since last version is the switch to bpf_setsockopt to change the socket state instead of letting the progs poke socket directly. This, in turn, highlights the challenge that we need to care about whether the socket is locked or not when we call bpf_setsockopt. (with my original example selftest, the hooks are running early in the init phase for this not to matter). For now, I've added two btf id lists: * hooks where we know the socket is locked and it's safe to call bpf_setsockopt * hooks where we know the socket is _not_ locked, but the hook works on the socket that's not yet exposed to userspace so it should be safe (for this mode, special new set of bpf_{s,g}etsockopt helpers is added; they don't have sock_owned_by_me check) Going forward, for the rest of the hooks, this might be a good motivation to expand lsm cgroup to support sleeping bpf and allow the callers to lock/unlock sockets or have a new bpf_setsockopt variant that does the locking. - ifdef around cleanup in cgroup_bpf_release - Andrii: a few nits in libbpf patches - Martin: remove unused btf_id_set_index - Martin: bring back refcnt for cgroup_atype - Martin: make __cgroup_bpf_query a bit more readable - Martin: expose dst_prog->aux->attach_btf as attach_btf_obj_id as well - Martin: reorg check_return_code path for BPF_LSM_CGROUP - Martin: return directly from check_helper_call (instead of goto err) - Martin: add note to new warning in check_return_code, print only for void hooks - Martin: remove confusing shim reuse - Martin: use bpf_{s,g}etsockopt instead of poking into socket data - Martin: use CONFIG_CGROUP_BPF in bpf_prog_alloc_no_stats/bpf_prog_free_deferred v8: - CI: fix compile issue - CI: fix broken bpf_cookie - Yonghong: remove __bpf_trampoline_unlink_prog comment - Yonghong: move cgroup_atype around to fill the gap - Yonghong: make bpf_lsm_find_cgroup_shim void - Yonghong: rename regs to args - Yonghong: remove if(current) check - Martin: move refcnt into bpf_link - Martin: move shim management to bpf_link ops - Martin: use cgroup_atype for shim only - Martin: go back to arrays for managing cgroup_atype(s) - Martin: export bpf_obj_id(aux->attach_btf) - Andrii: reorder SEC_DEF("lsm_cgroup+") - Andrii: OPTS_SET instead of OPTS_HAS - Andrii: rename attach_btf_func_id - Andrii: move into 1.0 map v7: - there were a lot of comments last time, hope I didn't forget anything, some of the bigger ones: - Martin: use/extend BTF_SOCK_TYPE_SOCKET - Martin: expose bpf_set_retval - Martin: reject 'return 0' at the verifier for 'void' hooks - Martin: prog_query returns all BPF_LSM_CGROUP, prog_info returns attach_btf_func_id - Andrii: split libbpf changes - Andrii: add field access test to test_progs, not test_verifier (still using asm though) - things that I haven't addressed, stating them here explicitly, let me know if some of these are still problematic: 1. Andrii: exposing only link-based api: seems like the changes to support non-link-based ones are minimal, couple of lines, so seems like it worth having it? 2. Alexei: applying cgroup_atype for all cgroup hooks, not only cgroup lsm: looks a bit harder to apply everywhere that I originally thought; with lsm cgroup, we have a shim_prog pointer where we store cgroup_atype; for non-lsm programs, we don't have a trace program where to store it, so we still need some kind of global table to map from "static" hook to "dynamic" slot. So I'm dropping this "can be easily extended" clause from the description for now. I have converted this whole machinery to an RCU-managed list to remove synchronize_rcu(). - also note that I had to introduce new bpf_shim_tramp_link and moved refcnt there; we need something to manage new bpf_tramp_link v6: - remove active count & stats for shim program (Martin KaFai Lau) - remove NULL/error check for btf_vmlinux (Martin) - don't check cgroup_atype in bpf_cgroup_lsm_shim_release (Martin) - use old_prog (instead of passed one) in __cgroup_bpf_detach (Martin) - make sure attach_btf_id is the same in __cgroup_bpf_replace (Martin) - enable cgroup local storage and test it (Martin) - properly implement prog query and add bpftool & tests (Martin) - prohibit non-shared cgroup storage mode for BPF_LSM_CGROUP (Martin) v5: - __cgroup_bpf_run_lsm_socket remove NULL sock/sk checks (Martin KaFai Lau) - __cgroup_bpf_run_lsm_{socket,current} s/prog/shim_prog/ (Martin) - make sure bpf_lsm_find_cgroup_shim works for hooks without args (Martin) - __cgroup_bpf_attach make sure attach_btf_id is the same when replacing (Martin) - call bpf_cgroup_lsm_shim_release only for LSM_CGROUP (Martin) - drop BPF_LSM_CGROUP from bpf_attach_type_to_tramp (Martin) - drop jited check from cgroup_shim_find (Martin) - new patch to convert cgroup_bpf to hlist_node (Jakub Sitnicki) - new shim flavor for 'struct sock' + list of exceptions (Martin) v4: - fix build when jit is on but syscall is off v3: - add BPF_LSM_CGROUP to bpftool - use simple int instead of refcnt_t (to avoid use-after-free false positive) v2: - addressed build bot failures ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29selftests/bpf: lsm_cgroup functional testStanislav Fomichev3-0/+474
Functional test that exercises the following: 1. apply default sk_priority policy 2. permit TX-only AF_PACKET socket 3. cgroup attach/detach/replace 4. reusing trampoline shim Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220628174314.1216643-12-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpftool: implement cgroup tree for BPF_LSM_CGROUPStanislav Fomichev1-22/+87
$ bpftool --nomount prog loadall $KDIR/tools/testing/selftests/bpf/lsm_cgroup.o /sys/fs/bpf/x $ bpftool cgroup attach /sys/fs/cgroup lsm_cgroup pinned /sys/fs/bpf/x/socket_alloc $ bpftool cgroup attach /sys/fs/cgroup lsm_cgroup pinned /sys/fs/bpf/x/socket_bind $ bpftool cgroup attach /sys/fs/cgroup lsm_cgroup pinned /sys/fs/bpf/x/socket_clone $ bpftool cgroup attach /sys/fs/cgroup lsm_cgroup pinned /sys/fs/bpf/x/socket_post_create $ bpftool cgroup tree CgroupPath ID AttachType AttachFlags Name /sys/fs/cgroup 6 lsm_cgroup socket_post_create bpf_lsm_socket_post_create 8 lsm_cgroup socket_bind bpf_lsm_socket_bind 10 lsm_cgroup socket_alloc bpf_lsm_sk_alloc_security 11 lsm_cgroup socket_clone bpf_lsm_inet_csk_clone $ bpftool cgroup detach /sys/fs/cgroup lsm_cgroup pinned /sys/fs/bpf/x/socket_post_create $ bpftool cgroup tree CgroupPath ID AttachType AttachFlags Name /sys/fs/cgroup 8 lsm_cgroup socket_bind bpf_lsm_socket_bind 10 lsm_cgroup socket_alloc bpf_lsm_sk_alloc_security 11 lsm_cgroup socket_clone bpf_lsm_inet_csk_clone Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-11-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29libbpf: implement bpf_prog_query_optsStanislav Fomichev3-7/+47
Implement bpf_prog_query_opts as a more expendable version of bpf_prog_query. Expose new prog_attach_flags and attach_btf_func_id as well: * prog_attach_flags is a per-program attach_type; relevant only for lsm cgroup program which might have different attach_flags per attach_btf_id * attach_btf_func_id is a new field expose for prog_query which specifies real btf function id for lsm cgroup attachments Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-10-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29libbpf: add lsm_cgoup_sock typeStanislav Fomichev1-0/+3
lsm_cgroup/ is the prefix for BPF_LSM_CGROUP. Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-9-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29tools/bpf: Sync btf_ids.h to toolsStanislav Fomichev3-8/+32
Has been slowly getting out of sync, let's update it. resolve_btfids usage has been updated to match the header changes. Also bring new parts of tools/include/uapi/linux/bpf.h. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-8-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: expose bpf_{g,s}etsockopt to lsm cgroupStanislav Fomichev3-7/+93
I don't see how to make it nice without introducing btf id lists for the hooks where these helpers are allowed. Some LSM hooks work on the locked sockets, some are triggering early and don't grab any locks, so have two lists for now: 1. LSM hooks which trigger under socket lock - minority of the hooks, but ideal case for us, we can expose existing BTF-based helpers 2. LSM hooks which trigger without socket lock, but they trigger early in the socket creation path where it should be safe to do setsockopt without any locks 3. The rest are prohibited. I'm thinking that this use-case might be a good gateway to sleeping lsm cgroup hooks in the future. We can either expose lock/unlock operations (and add tracking to the verifier) or have another set of bpf_setsockopt wrapper that grab the locks and might sleep. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUPStanislav Fomichev3-32/+74
We have two options: 1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id 2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point I was doing (2) in the original patch, but switching to (1) here: * bpf_prog_query returns all attached BPF_LSM_CGROUP programs regardless of attach_btf_id * attach_btf_id is exported via bpf_prog_info Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: minimize number of allocated lsm slots per programStanislav Fomichev8-24/+64
Previous patch adds 1:1 mapping between all 211 LSM hooks and bpf_cgroup program array. Instead of reserving a slot per possible hook, reserve 10 slots per cgroup for lsm programs. Those slots are dynamically allocated on demand and reclaimed. struct cgroup_bpf { struct bpf_prog_array * effective[33]; /* 0 264 */ /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */ struct hlist_head progs[33]; /* 264 264 */ /* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */ u8 flags[33]; /* 528 33 */ /* XXX 7 bytes hole, try to pack */ struct list_head storages; /* 568 16 */ /* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */ struct bpf_prog_array * inactive; /* 584 8 */ struct percpu_ref refcnt; /* 592 16 */ struct work_struct release_work; /* 608 72 */ /* size: 680, cachelines: 11, members: 7 */ /* sum members: 673, holes: 1, sum holes: 7 */ /* last cacheline: 40 bytes */ }; Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: per-cgroup lsm flavorStanislav Fomichev15-20/+498
Allow attaching to lsm hooks in the cgroup context. Attaching to per-cgroup LSM works exactly like attaching to other per-cgroup hooks. New BPF_LSM_CGROUP is added to trigger new mode; the actual lsm hook we attach to is signaled via existing attach_btf_id. For the hooks that have 'struct socket' or 'struct sock' as its first argument, we use the cgroup associated with that socket. For the rest, we use 'current' cgroup (this is all on default hierarchy == v2 only). Note that for some hooks that work on 'struct sock' we still take the cgroup from 'current' because some of them work on the socket that hasn't been properly initialized yet. Behind the scenes, we allocate a shim program that is attached to the trampoline and runs cgroup effective BPF programs array. This shim has some rudimentary ref counting and can be shared between several programs attaching to the same lsm hook from different cgroups. Note that this patch bloats cgroup size because we add 211 cgroup_bpf_attach_type(s) for simplicity sake. This will be addressed in the subsequent patch. Also note that we only add non-sleepable flavor for now. To enable sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu, shim programs have to be freed via trace rcu, cgroup_bpf.effective should be also trace-rcu-managed + maybe some other changes that I'm not aware of. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: convert cgroup_bpf.progs to hlistStanislav Fomichev3-35/+47
This lets us reclaim some space to be used by new cgroup lsm slots. Before: struct cgroup_bpf { struct bpf_prog_array * effective[23]; /* 0 184 */ /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */ struct list_head progs[23]; /* 184 368 */ /* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */ u32 flags[23]; /* 552 92 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */ struct list_head storages; /* 648 16 */ struct bpf_prog_array * inactive; /* 664 8 */ struct percpu_ref refcnt; /* 672 16 */ struct work_struct release_work; /* 688 32 */ /* size: 720, cachelines: 12, members: 7 */ /* sum members: 716, holes: 1, sum holes: 4 */ /* last cacheline: 16 bytes */ }; After: struct cgroup_bpf { struct bpf_prog_array * effective[23]; /* 0 184 */ /* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */ struct hlist_head progs[23]; /* 184 184 */ /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */ u8 flags[23]; /* 368 23 */ /* XXX 1 byte hole, try to pack */ /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */ struct list_head storages; /* 392 16 */ struct bpf_prog_array * inactive; /* 408 8 */ struct percpu_ref refcnt; /* 416 16 */ struct work_struct release_work; /* 432 72 */ /* size: 504, cachelines: 8, members: 7 */ /* sum members: 503, holes: 1, sum holes: 1 */ /* last cacheline: 56 bytes */ }; Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29bpf: add bpf_func_t and trampoline helpersStanislav Fomichev2-36/+38
I'll be adding lsm cgroup specific helpers that grab trampoline mutex. No functional changes. Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28Merge branch 'libbpf: remove deprecated APIs'Alexei Starovoitov23-2747/+258
Andrii Nakryiko says: ==================== This patch set removes all the deprecated APIs in preparation for 1.0 release. It also makes libbpf_set_strict_mode() a no-op (but keeps it to let per-1.0 applications buildable and dynamically linkable against libbpf 1.0 if they were already libbpf-1.0 ready) and starts enforcing all the behaviors that were previously opt-in through libbpf_set_strict_mode(). xsk.{c,h} parts that are now properly provided by libxdp ([0]) are still used by xdpxceiver.c in selftest/bpf, so I've moved xsk.{c,h} with barely any changes to under selftests/bpf. Other than that, apart from removing all the LIBBPF_DEPRECATED-marked APIs, there is a bunch of internal clean ups allowed by that. I've also "restored" libbpf.map inheritance chain while removing all the deprecated APIs. I think that's the right way to do this, as applications using libbpf as shared library but not relying on any deprecated APIs (i.e., "good citizens" that prepared for libbpf 1.0 release ahead of time to minimize disruption) should be able to link both against 0.x and 1.x versions. Either way, it doesn't seem to break anything and preserve a history on when each "surviving" API was added. [0] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp v1->v2: - rebase on latest bpf-next now that Jiri's perf patches landed; - fix xsk.o dependency in Makefile to ensure libbpf headers are installed reliably. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: fix up few libbpf.map problemsAndrii Nakryiko2-3/+4
Seems like we missed to add 2 APIs to libbpf.map and another API was misspelled. Fix it in libbpf.map. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-16-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: enforce strict libbpf 1.0 behaviorsAndrii Nakryiko5-261/+37
Remove support for legacy features and behaviors that previously had to be disabled by calling libbpf_set_strict_mode(): - legacy BPF map definitions are not supported now; - RLIMIT_MEMLOCK auto-setting, if necessary, is always on (but see libbpf_set_memlock_rlim()); - program name is used for program pinning (instead of section name); - cleaned up error returning logic; - entry BPF programs should have SEC() always. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-15-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28selftests/bpf: remove last tests with legacy BPF map definitionsAndrii Nakryiko4-79/+0
Libbpf 1.0 stops support legacy-style BPF map definitions. Selftests has been migrated away from using legacy BPF map definitions except for two selftests, to make sure that legacy functionality still worked in pre-1.0 libbpf. Now it's time to let those tests go as libbpf 1.0 is imminent. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-14-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: clean up SEC() handlingAndrii Nakryiko1-72/+47
Get rid of sloppy prefix logic and remove deprecated xdp_{devmap,cpumap} sections. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-13-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove internal multi-instance prog supportAndrii Nakryiko1-283/+34
Clean up internals that had to deal with the possibility of multi-instance bpf_programs. Libbpf 1.0 doesn't support this, so all this is not necessary now and can be simplified. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: cleanup LIBBPF_DEPRECATED_SINCE supporting macros for v0.xAndrii Nakryiko1-13/+3
Keep the LIBBPF_DEPRECATED_SINCE macro "framework" for future deprecations, but clean up 0.x related helper macros. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-11-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove multi-instance and custom private data APIsAndrii Nakryiko3-212/+10
Remove all the public APIs that are related to creating multi-instance bpf_programs through custom preprocessing callback and generally working with them. Also remove all the bpf_{object,map,program}__[set_]priv() APIs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove most other deprecated high-level APIsAndrii Nakryiko3-421/+26
Remove a bunch of high-level bpf_object/bpf_map/bpf_program related APIs. All the APIs related to private per-object/map/prog state, program preprocessing callback, and generally everything multi-instance related is removed in a separate patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove prog_info_linear APIsAndrii Nakryiko3-317/+0
Remove prog_info_linear-related APIs previously used by perf. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: clean up perfbuf APIsAndrii Nakryiko3-112/+18
Remove deprecated perfbuf APIs and clean up opts structs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove deprecated BTF APIsAndrii Nakryiko5-325/+24
Get rid of deprecated BTF-related APIs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove deprecated probing APIsAndrii Nakryiko3-132/+5
Get rid of deprecated feature-probing APIs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove deprecated XDP APIsAndrii Nakryiko3-78/+8
Get rid of deprecated bpf_set_link*() and bpf_get_link*() APIs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: remove deprecated low-level APIsAndrii Nakryiko5-374/+4
Drop low-level APIs as well as high-level (and very confusingly named) BPF object loading bpf_prog_load_xattr() and bpf_prog_load_deprecated() APIs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28libbpf: move xsk.{c,h} into selftests/bpfAndrii Nakryiko7-76/+49
Remove deprecated xsk APIs from libbpf. But given we have selftests relying on this, move those files (with minimal adjustments to make them compilable) under selftests/bpf. We also remove all the removed APIs from libbpf.map, while overall keeping version inheritance chain, as most APIs are backwards compatible so there is no need to reassign them as LIBBPF_1.0.0 versions. Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220627211527.2245459-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-28bpf: Fix sockmap calling sleepable function in teardown pathJohn Fastabend1-1/+1
syzbot reproduced the bug ... BUG: sleeping function called from invalid context at kernel/workqueue.c:3010 ... with the following stack trace fragment ... start_flush_work kernel/workqueue.c:3010 [inline] __flush_work+0x109/0xb10 kernel/workqueue.c:3074 __cancel_work_timer+0x3f9/0x570 kernel/workqueue.c:3162 sk_psock_stop+0x4cb/0x630 net/core/skmsg.c:802 sock_map_destroy+0x333/0x760 net/core/sock_map.c:1581 inet_csk_destroy_sock+0x196/0x440 net/ipv4/inet_connection_sock.c:1130 __tcp_close+0xd5b/0x12b0 net/ipv4/tcp.c:2897 tcp_close+0x29/0xc0 net/ipv4/tcp.c:2909 ... introduced by d8616ee2affc. Do a quick trace of the code path and the bug is obvious: inet_csk_destroy_sock(sk) sk_prot->destroy(sk); <--- sock_map_destroy sk_psock_stop(, true); <--- true so cancel workqueue cancel_work_sync() <--- splat, because *_bh_disable() We can not call cancel_work_sync() from inside destroy path. So mark the sk_psock_stop call to skip this cancel_work_sync(). This will avoid the BUG, but means we may run sk_psock_backlog after or during the destroy op. We zapped the ingress_skb queue in sk_psock_stop (safe to do with local_bh_disable) so its empty and the sk_psock_backlog work item will not find any pkts to process here. However, because we are not going to wait for it or clear its ->state its possible it kicks off or is already running. This should be 'safe' up until psock drops its refcnt to psock->sk. The sock_put() that drops this reference is only done at psock destroy time from sk_psock_destroy(). This is done through workqueue when sk_psock_drop() is called on psock refnt reaches 0. And importantly sk_psock_destroy() does a cancel_work_sync(). So trivial fix works. I've had hit or miss luck reproducing this caught it once or twice with the provided reproducer when running with many runners. However, syzkaller is very good at reproducing so relying on syzkaller to verify fix. Fixes: d8616ee2affc ("bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues") Reported-by: syzbot+140186ceba0c496183bc@syzkaller.appspotmail.com Suggested-by: Hillf Danton <hdanton@sina.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Wang Yufen <wangyufen@huawei.com> Link: https://lore.kernel.org/bpf/20220628035803.317876-1-john.fastabend@gmail.com
2022-06-25bpf: Merge "types_are_compat" logic into relo_core.cDaniel Müller4-154/+84
BPF type compatibility checks (bpf_core_types_are_compat()) are currently duplicated between kernel and user space. That's a historical artifact more than intentional doing and can lead to subtle bugs where one implementation is adjusted but another is forgotten. That happened with the enum64 work, for example, where the libbpf side was changed (commit 23b2a3a8f63a ("libbpf: Add enum64 relocation support")) to use the btf_kind_core_compat() helper function but the kernel side was not (commit 6089fb325cf7 ("bpf: Add btf enum64 support")). This patch addresses both the duplication issue, by merging both implementations and moving them into relo_core.c, and fixes the alluded to kind check (by giving preference to libbpf's already adjusted logic). For discussion of the topic, please refer to: https://lore.kernel.org/bpf/CAADnVQKbWR7oarBdewgOBZUPzryhRYvEbkhyPJQHHuxq=0K1gw@mail.gmail.com/T/#mcc99f4a33ad9a322afaf1b9276fb1f0b7add9665 Changelog: v1 -> v2: - limited libbpf recursion limit to 32 - changed name to __bpf_core_types_are_compat - included warning previously present in libbpf version - merged kernel and user space changes into a single patch Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220623182934.2582827-1-deso@posteo.net
2022-06-24bpf, docs: Fix the code formatting in instruction-setShahab Vahedi1-1/+1
A minor typo fix to include "| BPF_LD" into its previous code phrase: ``BPF_IND`` | BPF_LD --> ``BPF_IND | BPF_LD`` Signed-off-by: Shahab Vahedi <shahab@synopsys.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/b6120b31-3d1d-bf2d-2f2a-aa768d91257b@synopsys.com
2022-06-24Merge branch 'perf tools: Fix prologue generation'Andrii Nakryiko1-29/+175
Jiri Olsa says: ==================== hi, sending change we discussed some time ago [1] to get rid of some deprecated functions we use in perf prologue code. Despite the gloomy discussion I think the final code does not look that bad ;-) This patchset removes following libbpf functions from perf: bpf_program__set_prep bpf_program__nth_fd struct bpf_prog_prep_result v5 changes: - squashed patches together so we don't break bisection [Arnaldo] v4 changes: - fix typo [Andrii] v3 changes: - removed R0/R1 zero init in libbpf_prog_prepare_load_fn, because it's not needed [Andrii] - rebased/post on top of bpf-next/master which now has all the needed perf/core changes v2 changes: - use fallback section prog handler, so we don't need to use section prefix [Andrii] - realloc prog->insns array in bpf_program__set_insns [Andrii] - squash patch 1 from previous version with bpf_program__set_insns change [Daniel] - patch 3 already merged [Arnaldo] - added more comments thanks, jirka [1] https://lore.kernel.org/bpf/CAEf4BzaiBO3_617kkXZdYJ8hS8YF--ZLgapNbgeeEJ-pY0H88g@mail.gmail.com/ ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-06-24perf tools: Rework prologue generation codeJiri Olsa1-29/+175
Some functions we use for bpf prologue generation are going to be deprecated. This change reworks current code not to use them. We need to replace following functions/struct: bpf_program__set_prep bpf_program__nth_fd struct bpf_prog_prep_result Currently we use bpf_program__set_prep to hook perf callback before program is loaded and provide new instructions with the prologue. We replace this function/ality by taking instructions for specific program, attaching prologue to them and load such new ebpf programs with prologue using separate bpf_prog_load calls (outside libbpf load machinery). Before we can take and use program instructions, we need libbpf to actually load it. This way we get the final shape of its instructions with all relocations and verifier adjustments). There's one glitch though.. perf kprobe program already assumes generated prologue code with proper values in argument registers, so loading such program directly will fail in the verifier. That's where the fallback pre-load handler fits in and prepends the initialization code to the program. Once such program is loaded we take its instructions, cut off the initialization code and prepend the prologue. I know.. sorry ;-) To have access to the program when loading this patch adds support to register 'fallback' section handler to take care of perf kprobe programs. The fallback means that it handles any section definition besides the ones that libbpf handles. The handler serves two purposes: - allows perf programs to have special arguments in section name - allows perf to use pre-load callback where we can attach init code (zeroing all argument registers) to each perf program Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/bpf/20220616202214.70359-2-jolsa@kernel.org
2022-06-24selftest/bpf: Test for use-after-free bug fix in inline_bpf_loopEduard Zingerman2-0/+50
This test verifies that bpf_loop() inlining works as expected when address of `env->prog` is updated. This address is updated upon BPF program reallocation. Reallocation is handled by bpf_prog_realloc(), which reuses old memory if page boundary is not crossed. The value of `len` in the test is chosen to cross this boundary on bpf_loop() patching. Verify that the use-after-free bug in inline_bpf_loop() reported by Dan Carpenter is fixed. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220624020613.548108-3-eddyz87@gmail.com
2022-06-24bpf: Fix for use-after-free bug in inline_bpf_loopEduard Zingerman1-1/+1
As reported by Dan Carpenter, the following statements in inline_bpf_loop() might cause a use-after-free bug: struct bpf_prog *new_prog; // ... new_prog = bpf_patch_insn_data(env, position, insn_buf, *cnt); // ... env->prog->insnsi[call_insn_offset].imm = callback_offset; The bpf_patch_insn_data() might free the memory used by env->prog. Fixes: 1ade23711971 ("bpf: Inline calls to bpf_loop when callback is known") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220624020613.548108-2-eddyz87@gmail.com
2022-06-24bpf: Replace hard-coded 0 with BPF_K in check_alu_opSimon Wang1-1/+1
Enhance readability a bit. Signed-off-by: Simon Wang <wangchuanguo@inspur.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220622031923.65692-1-wangchuanguo@inspur.com
2022-06-23selftests/bpf: Fix rare segfault in sock_fields prog testJörn-Thorben Hinz1-1/+0
test_sock_fields__detach() got called with a null pointer here when one of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load() call resulted in a jump to the "done" label. A skeletons *__detach() is not safe to call with a null pointer, though. This led to a segfault. Go the easy route and only call test_sock_fields__destroy() which is null-pointer safe and includes detaching. Came across this while looking[1] to introduce the usage of bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with vmlinux.h. [1] https://lore.kernel.org/bpf/629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/ Fixes: 8f50f16ff39d ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads") Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
2022-06-23Merge branch 'Align BPF TCP CCs implementing cong_control() with non-BPF CCs'Alexei Starovoitov6-37/+186
Jörn-Thorben Hinz says: ==================== This series corrects some inconveniences for a BPF TCP CC that implements and uses tcp_congestion_ops.cong_control(). Until now, such a CC did not have all necessary write access to struct sock and unnecessarily needed to implement cong_avoid(). v4: - Remove braces around single statements after if - Don’t check pointer passed to bpf_link__destroy() v3: - Add a selftest writing sk_pacing_* - Add a selftest with incomplete tcp_congestion_ops - Add a selftest with unsupported get_info() - Remove an unused variable - Reword a comment about reg() in bpf_struct_ops_map_update_elem() v2: - Drop redundant check for required functions and just rely on tcp_register_congestion_control() (Martin KaFai Lau) ==================== Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23selftests/bpf: Test a BPF CC implementing the unsupported get_info()Jörn-Thorben Hinz2-0/+41
Test whether a TCP CC implemented in BPF providing get_info() is rejected correctly. get_info() is unsupported in a BPF CC. The check for required functions in a BPF CC has moved, this test ensures unsupported functions are still rejected correctly. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23selftests/bpf: Test an incomplete BPF CCJörn-Thorben Hinz2-0/+57
Test whether a TCP CC implemented in BPF providing neither cong_avoid() nor cong_control() is correctly rejected. This check solely depends on tcp_register_congestion_control() now, which is invoked during bpf_map__attach_struct_ops(). Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23selftests/bpf: Test a BPF CC writing sk_pacing_*Jörn-Thorben Hinz2-0/+79
Test whether a TCP CC implemented in BPF is allowed to write sk_pacing_rate and sk_pacing_status in struct sock. This is needed when cong_control() is implemented and used. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23bpf: Require only one of cong_avoid() and cong_control() from a TCP CCJörn-Thorben Hinz2-37/+3
Remove the check for required and optional functions in a struct tcp_congestion_ops from bpf_tcp_ca.c. Rely on tcp_register_congestion_control() to reject a BPF CC that does not implement all required functions, as it will do for a non-BPF CC. When a CC implements tcp_congestion_ops.cong_control(), the alternate cong_avoid() is not in use in the TCP stack. Previously, a BPF CC was still forced to implement cong_avoid() as a no-op since it was non-optional in bpf_tcp_ca.c. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-3-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23bpf: Allow a TCP CC to write sk_pacing_rate and sk_pacing_statusJörn-Thorben Hinz1-0/+6
A CC that implements tcp_congestion_ops.cong_control() should be able to control sk_pacing_rate and set sk_pacing_status, since tcp_update_pacing_rate() is never called in this case. A built-in CC or one from a kernel module is already able to write to both struct sock members. For a BPF program, write access has not been allowed, yet. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-2-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23test_bpf: fix incorrect netdev featuresJian Shen1-2/+2
The prototype of .features is netdev_features_t, it should use NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT and NETIF_F_HW_VLAN_STAG_TX_BIT. Fixes: cf204a718357 ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test") Signed-off-by: Jian Shen <shenjian15@huawei.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-23selftests/bpf: Add benchmark for local_storage getDave Marchevsky7-1/+494
Add a benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond current local_storage implementation's cache size. "sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to normal 'hits' count of all gets. Goal here is to mimic scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often. While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when important map is accessed far more frequently than non-important maps. A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold tid -> data mapping for all tasks. Size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these keys which are actually fetched as part of the benchmark is configurable. Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so effect of future work on both is clear. Regarding the benchmark results. On a powerful system (Skylake, 20 cores, 256gb ram): Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it. For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though. To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results: Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: hashmap_control_1k: ~53.8ns hashmap_control_10k: ~124.2ns hashmap_control_100k: ~206.5ns sequential_get_1: ~12.1ns sequential_get_10: ~16.0ns sequential_get_16: ~13.8ns sequential_get_17: ~16.8ns sequential_get_24: ~40.9ns sequential_get_32: ~65.2ns sequential_get_100: ~148.2ns sequential_get_1000: ~2204ns Clearly demonstrating a cliff. In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses. When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run. [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-22samples/bpf: fixup some tools to be able to support xdp multibufferAndy Gospodarek3-7/+17
This changes the section name for the bpf program embedded in these files to "xdp.frags" to allow the programs to be loaded on drivers that are using an MTU greater than PAGE_SIZE. Rather than directly accessing the buffers, the packet data is now accessed via xdp helper functions to provide an example for those who may need to write more complex programs. v2: remove new unnecessary variable Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/20220621175402.35327-1-gospo@broadcom.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-21bpf, arm64: Keep tail call count across bpf2bpf callsJakub Sitnicki1-1/+8
Today doing a BPF tail call after a BPF to BPF call, that is from a subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features requires support from JIT. Tail call count has to be tracked through BPF to BPF calls, as well as through BPF tail calls to prevent unbounded chains of tail calls. arm64 BPF JIT stores the tail call count (TCC) in a dedicated register (X26). This makes it easier to support bpf2bpf calls mixed with tail calls than on x86 platform. In order to keep the tail call count in tact throughout bpf2bpf calls, all we need to do is tweak the program prologue generator. When emitting prologue for a subprogram, we skip the block that initializes the tail call count and emits a jump pad for the tail call. With this change, a sample execution flow where a bpf2bpf call is followed by a tail call would look like so: int entry(struct __sk_buff *skb): 0xffffffc0090151d4: paciasp 0xffffffc0090151d8: stp x29, x30, [sp, #-16]! 0xffffffc0090151dc: mov x29, sp 0xffffffc0090151e0: stp x19, x20, [sp, #-16]! 0xffffffc0090151e4: stp x21, x22, [sp, #-16]! 0xffffffc0090151e8: stp x25, x26, [sp, #-16]! 0xffffffc0090151ec: stp x27, x28, [sp, #-16]! 0xffffffc0090151f0: mov x25, sp 0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only 0xffffffc0090151f8: bti j // in main prog 0xffffffc0090151fc: sub x27, x25, #0x0 0xffffffc009015200: sub sp, sp, #0x10 0xffffffc009015204: mov w1, #0x0 0xffffffc009015208: mov x10, #0xffffffffffffffff 0xffffffc00901520c: strb w1, [x25, x10] 0xffffffc009015210: mov x10, #0xffffffffffffd25c 0xffffffc009015214: movk x10, #0x902, lsl #16 0xffffffc009015218: movk x10, #0xffc0, lsl #32 0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call 0xffffffc009015220: add x7, x0, #0x0 <-------------. 0xffffffc009015224: add sp, sp, #0x10 | | 0xffffffc009015228: ldp x27, x28, [sp], #16 | | 0xffffffc00901522c: ldp x25, x26, [sp], #16 | | 0xffffffc009015230: ldp x21, x22, [sp], #16 | | 0xffffffc009015234: ldp x19, x20, [sp], #16 | | 0xffffffc009015238: ldp x29, x30, [sp], #16 | | 0xffffffc00901523c: add x0, x7, #0x0 | | 0xffffffc009015240: autiasp | | 0xffffffc009015244: ret | | | | int subprog_tail(struct __sk_buff *skb): | | 0xffffffc00902d25c: paciasp <----------------------' | 0xffffffc00902d260: stp x29, x30, [sp, #-16]! | 0xffffffc00902d264: mov x29, sp | 0xffffffc00902d268: stp x19, x20, [sp, #-16]! | 0xffffffc00902d26c: stp x21, x22, [sp, #-16]! | 0xffffffc00902d270: stp x25, x26, [sp, #-16]! | 0xffffffc00902d274: stp x27, x28, [sp, #-16]! | 0xffffffc00902d278: mov x25, sp | 0xffffffc00902d27c: sub x27, x25, #0x0 | 0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice: 0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and 0xffffffc00902d288: mov w0, #0x1 | // 2) no tail call jump pad 0xffffffc00902d28c: mov x10, #0xfffffffffffffffc | 0xffffffc00902d290: str w0, [x25, x10] | 0xffffffc00902d294: mov x20, #0xffffff80ffffffff | 0xffffffc00902d298: movk x20, #0xc033, lsl #16 | 0xffffffc00902d29c: movk x20, #0x4e00 | 0xffffffc00902d2a0: add x0, x19, #0x0 | 0xffffffc00902d2a4: add x1, x20, #0x0 | 0xffffffc00902d2a8: mov x2, #0x0 | 0xffffffc00902d2ac: mov w10, #0x24 | 0xffffffc00902d2b0: ldr w10, [x1, x10] | 0xffffffc00902d2b4: add w2, w2, #0x0 | 0xffffffc00902d2b8: cmp w2, w10 | 0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2c0: mov w10, #0x21 | 0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT? 0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 | 0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++ 0xffffffc00902d2d0: mov w10, #0x110 | 0xffffffc00902d2d4: add x10, x1, x10 | 0xffffffc00902d2d8: lsl x11, x2, #3 | 0xffffffc00902d2dc: ldr x11, [x10, x11] | 0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 | 0xffffffc00902d2e4: mov w10, #0x30 | 0xffffffc00902d2e8: ldr x10, [x11, x10] | 0xffffffc00902d2ec: add x10, x10, #0x24 | 0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current 0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame 0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call 0xffffffc00902d2fc: ldr w7, [x25, x10] | | 0xffffffc00902d300: add sp, sp, #0x10 | | 0xffffffc00902d304: ldp x27, x28, [sp], #16 | | 0xffffffc00902d308: ldp x25, x26, [sp], #16 | | 0xffffffc00902d30c: ldp x21, x22, [sp], #16 | | 0xffffffc00902d310: ldp x19, x20, [sp], #16 | | 0xffffffc00902d314: ldp x29, x30, [sp], #16 | | 0xffffffc00902d318: add x0, x7, #0x0 | | 0xffffffc00902d31c: autiasp | | 0xffffffc00902d320: ret | | | | int classifier_0(struct __sk_buff *skb): | | 0xffffffc008ff5874: paciasp | | 0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | | 0xffffffc008ff587c: mov x29, sp | | 0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | | 0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | | 0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | | 0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | | 0xffffffc008ff5890: mov x25, sp | | 0xffffffc008ff5894: mov x26, #0x0 | | 0xffffffc008ff5898: bti j <----------------------' | 0xffffffc008ff589c: sub x27, x25, #0x0 | 0xffffffc008ff58a0: sub sp, sp, #0x0 | 0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff | 0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 | 0xffffffc008ff58ac: movk x0, #0x6000 | 0xffffffc008ff58b0: mov w1, #0x1 | 0xffffffc008ff58b4: str w1, [x0] | 0xffffffc008ff58b8: mov w7, #0x0 | 0xffffffc008ff58bc: mov sp, sp | 0xffffffc008ff58c0: ldp x27, x28, [sp], #16 | 0xffffffc008ff58c4: ldp x25, x26, [sp], #16 | 0xffffffc008ff58c8: ldp x21, x22, [sp], #16 | 0xffffffc008ff58cc: ldp x19, x20, [sp], #16 | 0xffffffc008ff58d0: ldp x29, x30, [sp], #16 | 0xffffffc008ff58d4: add x0, x7, #0x0 | 0xffffffc008ff58d8: autiasp | 0xffffffc008ff58dc: ret -------------------------------' Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
2022-06-21bpf, x64: Add predicate for bpf2bpf with tailcalls support in JITTony Ambardar4-1/+15
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail calls for only x86-64. Change the logic to instead rely on a new weak function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable JIT backend can override. Update the x86-64 eBPF JIT to reflect this. Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> [jakub: drop MIPS bits and tweak patch subject] Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
2022-06-21Merge branch 'bpf_loop inlining'Alexei Starovoitov10-27/+936
Eduard Zingerman says: ==================== Hi Everyone, This is the next iteration of the patch. It includes changes suggested by Song, Joanne and Alexei. Please find updated intro message and change log below. This patch implements inlining of calls to bpf_loop helper function when bpf_loop's callback is statically known. E.g. the rewrite does the following transformation during BPF program processing: bpf_loop(10, foo, NULL, 0); -> for (int i = 0; i < 10; ++i) foo(i, NULL); The transformation leads to measurable latency change for simple loops. Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op. The change is split in five parts: * Update to test_verifier.c to specify expected and unexpected instruction sequences. This allows to check BPF program rewrites applied by e.g. do_mix_fixups function. * Update to test_verifier.c to specify BTF function infos and types per test case. This is necessary for tests that load sub-program addresses to a variable because of the checks applied by check_ld_imm function. * The update to verifier.c that tracks state of the parameters for each bpf_loop call in a program and decides whether it could be replaced by a loop. * A set of test cases for `test_verifier` that use capabilities added by the first two patches to verify instructions produced by inlining logic. * Two test cases for `test_prog` to check that possible corner cases behave as expected. Additional details are available in commit messages for each patch. Changes since v7: - Call to `mark_chain_precision` is added in `loop_flag_is_zero` to avoid potential issues with state pruning and precision tracking. - `flags non-zero` test_verifier test case is updated to have two execution paths reaching `bpf_loop` call, one with flags = 0, another with flags = 1. Potentially this test case should be able to show that call to `mark_chain_precision` is necessary in `loop_flag_is_zero` but not at the moment. Please refer to discussion for [PATCH bpf-next v7 3/5] for additional details. - `stack_depth_extra` computation is updated to guarantee that R6, R7 and R8 offsets are always aligned on 8 byte boundary. - `stack locations for loop vars` test_verifier test case updated to show that R6, R7, R8 offsets are indeed aligned when function stack depth is not a multiple of 8. - I removed Song Liu's ACK from commit message for [PATCH bpf-next v8 4/5] because I updated the patch. (Please let me know if I had to keep the ACK tag). Changes since v6: - Return value of the `optimize_bpf_loop` function is no longer ignored. This is necessary to properly propagate -ENOMEM error. Changes since v5: - Added function `loop_flag_is_zero` to skip a few checks in `update_loop_inline_state` when loop instruction is not fit for inline. Changes since v4: - Added missing `static` modifier for `update_loop_inline_state` and `inline_bpf_loop` functions. - `update_loop_inline_state` updated for better readability. - Fields `initialized` and `fit_for_inline` of `struct bpf_loop_inline_state` are changed back from `bool` to bitfields. - Acks from Song Liu added to comments for patches 1/5, 2/5, 4/5, 5/5. Changes since v3: - Function `adjust_stack_depth_for_loop_inlining` is replaced by function `optimize_bpf_loop`. Function `optimize_bpf_loop` is responsible for both stack depth adjustment and call instruction replacement. - Changes in `do_misc_fixups` are reverted. - Changes in `adjust_subprog_starts_after_remove` are reverted and function `adjust_loop_inline_subprogno` is removed. This is possible because call to `optimize_bpf_loop` is placed before the dead code removal in `opt_remove_dead_code` (in contrast to the position of `do_misc_fixups` where inlining was done in v3). - Field `bpf_insn_aux_data.loop_inline_state` is now a part of anonymous union at the start of the `bpf_insn_aux_data`. - Data structure `bpf_loop_inline_state` is simplified to use single flag field `fit_for_inline` instead of separate fields `flags_is_zero` & `callback_is_constant`. - Macro definition `BPF_MAX_LOOPS` is moved from `include/linux/bpf_verifier.h` to `include/linux/bpf.h` to avoid include of `include/linux/bpf_verifier.h` in `bpf_iter.c`. - `inline_bpf_loop` changed back to use array initialization and hard coded offsets as in v2. - Style / formatting updates. Changes since v2: - fix for `stack_check` test case in `test_progs-no_alu32`, all tests are passing now; - v2 3/3 patch is split in three parts: - kernel changes - test_verifier changes - test_prog changes - updated `inline_bpf_loop` in `verifier.c` to calculate each offset used in instructions to avoid "magic" numbers; - removed newline handling logic in `fail_log` branch of `do_single_test` in `test_verifier.c` to simplify the patch set; - styling fixes suggested in review for v2 of this patch set. Changes since v1: - allow to use SKIP_INSNS in instruction pattern specification in test_verifier tests; - fix for a bug in spill offset assignement for loop vars when bpf_loop is located in a non-main function. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-21selftests/bpf: BPF test_prog selftests for bpf_loop inliningEduard Zingerman2-0/+176
Two new test BPF programs for test_prog selftests checking bpf_loop behavior. Both are corner cases for bpf_loop inlinig transformation: - check that bpf_loop behaves correctly when callback function is not a compile time constant - check that local function variables are not affected by allocating additional stack storage for registers spilled by loop inlining Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>