diff options
Diffstat (limited to 'Documentation/bpf')
23 files changed, 2972 insertions, 57 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index a210b8a4df00..cec2371173d7 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -298,3 +298,48 @@ A: NO. The BTF_ID macro does not cause a function to become part of the ABI any more than does the EXPORT_SYMBOL_GPL macro. + +Q: What is the compatibility story for special BPF types in map values? +----------------------------------------------------------------------- +Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map +values (when using BTF support for BPF maps). This allows to use helpers for +such objects on these fields inside map values. Users are also allowed to embed +pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the +kernel preserve backwards compatibility for these features? + +A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else: +NO, but see below. + +For struct types that have been added already, like bpf_spin_lock and bpf_timer, +the kernel will preserve backwards compatibility, as they are part of UAPI. + +For kptrs, they are also part of UAPI, but only with respect to the kptr +mechanism. The types that you can use with a __kptr and __kptr_ref tagged +pointer in your struct are NOT part of the UAPI contract. The supported types can +and will change across kernel releases. However, operations like accessing kptr +fields and bpf_kptr_xchg() helper will continue to be supported across kernel +releases for the supported types. + +For any other supported struct type, unless explicitly stated in this document +and added to bpf.h UAPI header, such types can and will arbitrarily change their +size, type, and alignment, or any other user visible API or ABI detail across +kernel releases. The users must adapt their BPF programs to the new changes and +update them to make sure their programs continue to work correctly. + +NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in +order to introduce more special fields in the future. Hence, user programs must +avoid defining types with 'bpf\_' prefix to not be broken in future releases. +In other words, no backwards compatibility is guaranteed if one using a type +in BTF with 'bpf\_' prefix. + +Q: What is the compatibility story for special BPF types in allocated objects? +------------------------------------------------------------------------------ +Q: Same as above, but for allocated objects (i.e. objects allocated using +bpf_obj_new for user defined types). Will the kernel preserve backwards +compatibility for these features? + +A: NO. + +Unlike map value types, there are no stability guarantees for this case. The +whole API to work with allocated objects and any support for special fields +inside them is unstable (since it is exposed through kfuncs). diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index 761474bd7fe6..03d4993eda6f 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.** Submitting patches ================== +Q: How do I run BPF CI on my changes before sending them out for review? +------------------------------------------------------------------------ +A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf. +While GitHub also provides a CLI that can be used to accomplish the same +results, here we focus on the UI based workflow. + +The following steps lay out how to start a CI run for your patches: + +- Create a fork of the aforementioned repository in your own account (one time + action) + +- Clone the fork locally, check out a new branch tracking either the bpf-next + or bpf branch, and apply your to-be-tested patches on top of it + +- Push the local branch to your fork and create a pull request against + kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively + +Shortly after the pull request has been created, the CI workflow will run. Note +that capacity is shared with patches submitted upstream being checked and so +depending on utilization the run can take a while to finish. + +Note furthermore that both base branches (bpf-next_base and bpf_base) will be +updated as patches are pushed to the respective upstream branches they track. As +such, your patch set will automatically (be attempted to) be rebased as well. +This behavior can result in a CI run being aborted and restarted with the new +base line. + Q: To which mailing list do I need to submit my BPF patches? ------------------------------------------------------------ A: Please submit your BPF patches to the bpf kernel mailing list: diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst new file mode 100644 index 000000000000..6d7770793fab --- /dev/null +++ b/Documentation/bpf/bpf_iterators.rst @@ -0,0 +1,485 @@ +============= +BPF Iterators +============= + + +---------- +Motivation +---------- + +There are a few existing ways to dump kernel data into user space. The most +popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps +all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink +sockets in the system. However, their output format tends to be fixed, and if +users want more information about these sockets, they have to patch the kernel, +which often takes time to publish upstream and release. The same is true for popular +tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any +additional information needs a kernel patch. + +To solve this problem, the `drgn +<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to +dig out the kernel data with no kernel change. However, the main drawback for +drgn is performance, as it cannot do pointer tracing inside the kernel. In +addition, drgn cannot validate a pointer value and may read invalid data if the +pointer becomes invalid inside the kernel. + +The BPF iterator solves the above problem by providing flexibility on what data +(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel +data object. + +---------------------- +How BPF Iterators Work +---------------------- + +A BPF iterator is a type of BPF program that allows users to iterate over +specific types of kernel objects. Unlike traditional BPF tracing programs that +allow users to define callbacks that are invoked at particular points of +execution in the kernel, BPF iterators allow users to define callbacks that +should be executed for every entry in a variety of kernel data structures. + +For example, users can define a BPF iterator that iterates over every task on +the system and dumps the total amount of CPU runtime currently used by each of +them. Another BPF task iterator may instead dump the cgroup information for each +task. Such flexibility is the core value of BPF iterators. + +A BPF program is always loaded into the kernel at the behest of a user space +process. A user space process loads a BPF program by opening and initializing +the program skeleton as required and then invoking a syscall to have the BPF +program verified and loaded by the kernel. + +In traditional tracing programs, a program is activated by having user space +obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once +activated, the program callback will be invoked whenever the tracepoint is +triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the +program is obtained using ``bpf_link_create()``, and the program callback is +invoked by issuing system calls from user space. + +Next, let us see how you can use the iterators to iterate on kernel objects and +read data. + +------------------------ +How to Use BPF iterators +------------------------ + +BPF selftests are a great resource to illustrate how to use the iterators. In +this section, we’ll walk through a BPF selftest which shows how to load and use +a BPF iterator program. To begin, we’ll look at `bpf_iter.c +<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_, +which illustrates how to load and trigger BPF iterators on the user space side. +Later, we’ll look at a BPF program that runs in kernel space. + +Loading a BPF iterator in the kernel from user space typically involves the +following steps: + +* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel + has verified and loaded the program, it returns a file descriptor (fd) to user + space. +* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()`` + specified with the BPF program file descriptor received from the kernel. +* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the + ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2. +* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is + available. +* Close the iterator fd using ``close(bpf_iter_fd)``. +* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again. + +The following are a few examples of selftest BPF iterator programs: + +* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_ +* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_ +* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_ + +Let us look at ``bpf_iter_task_file.c``, which runs in kernel space: + +Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h +<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_. +Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>`` +represents a BPF iterator. The suffix ``<iter_name>`` represents the type of +iterator. + +:: + + struct bpf_iter__task_file { + union { + struct bpf_iter_meta *meta; + }; + union { + struct task_struct *task; + }; + u32 fd; + union { + struct file *file; + }; + }; + +In the above code, the field 'meta' contains the metadata, which is the same for +all BPF iterator programs. The rest of the fields are specific to different +iterators. For example, for task_file iterators, the kernel layer provides the +'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference +counted +<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_, +so they won't go away when the BPF program runs. + +Here is a snippet from the ``bpf_iter_task_file.c`` file: + +:: + + SEC("iter/task_file") + int dump_task_file(struct bpf_iter__task_file *ctx) + { + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + struct file *file = ctx->file; + __u32 fd = ctx->fd; + + if (task == NULL || file == NULL) + return 0; + + if (ctx->meta->seq_num == 0) { + count = 0; + BPF_SEQ_PRINTF(seq, " tgid gid fd file\n"); + } + + if (tgid == task->tgid && task->tgid != task->pid) + count++; + + if (last_tgid != task->tgid) { + last_tgid = task->tgid; + unique_tgid_count++; + } + + BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, + (long)file->f_op); + return 0; + } + +In the above example, the section name ``SEC(iter/task_file)``, indicates that +the program is a BPF iterator program to iterate all files from all tasks. The +context of the program is ``bpf_iter__task_file`` struct. + +The user space program invokes the BPF iterator program running in the kernel +by issuing a ``read()`` syscall. Once invoked, the BPF +program can export data to user space using a variety of BPF helper functions. +You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or +``bpf_seq_write()`` function based on whether you need formatted output or just +binary data, respectively. For binary-encoded data, the user space applications +can process the data from ``bpf_seq_write()`` as needed. For the formatted data, +you can use ``cat <path>`` to print the results similar to ``cat +/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later, +use ``rm -f <path>`` to remove the pinned iterator. + +For example, you can use the following command to create a BPF iterator from the +``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route`` +path: + +:: + + $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route + +And then print out the results using the following command: + +:: + + $ cat /sys/fs/bpf/my_route + + +------------------------------------------------------- +Implement Kernel Support for BPF Iterator Program Types +------------------------------------------------------- + +To implement a BPF iterator in the kernel, the developer must make a one-time +change to the following key data structure defined in the `bpf.h +<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_ +file. + +:: + + struct bpf_iter_reg { + const char *target; + bpf_iter_attach_target_t attach_target; + bpf_iter_detach_target_t detach_target; + bpf_iter_show_fdinfo_t show_fdinfo; + bpf_iter_fill_link_info_t fill_link_info; + bpf_iter_get_func_proto_t get_func_proto; + u32 ctx_arg_info_size; + u32 feature; + struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; + const struct bpf_iter_seq_info *seq_info; + }; + +After filling the data structure fields, call ``bpf_iter_reg_target()`` to +register the iterator to the main BPF iterator subsystem. + +The following is the breakdown for each field in struct ``bpf_iter_reg``. + +.. list-table:: + :widths: 25 50 + :header-rows: 1 + + * - Fields + - Description + * - target + - Specifies the name of the BPF iterator. For example: ``bpf_map``, + ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel. + * - attach_target and detach_target + - Allows for target specific ``link_create`` action since some targets + may need special processing. Called during the user space link_create stage. + * - show_fdinfo and fill_link_info + - Called to fill target specific information when user tries to get link + info associated with the iterator. + * - get_func_proto + - Permits a BPF iterator to access BPF helpers specific to the iterator. + * - ctx_arg_info_size and ctx_arg_info + - Specifies the verifier states for BPF program arguments associated with + the bpf iterator. + * - feature + - Specifies certain action requests in the kernel BPF iterator + infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means + that the kernel function cond_resched() is called to avoid other kernel + subsystem (e.g., rcu) misbehaving. + * - seq_info + - Specifies certain action requests in the kernel BPF iterator + infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means + that the kernel function cond_resched() is called to avoid other kernel + subsystem (e.g., rcu) misbehaving. + + +`Click here +<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_ +to see an implementation of the ``task_vma`` BPF iterator in the kernel. + +--------------------------------- +Parameterizing BPF Task Iterators +--------------------------------- + +By default, BPF iterators walk through all the objects of the specified types +(processes, cgroups, maps, etc.) across the entire system to read relevant +kernel data. But often, there are cases where we only care about a much smaller +subset of iterable kernel objects, such as only iterating tasks within a +specific process. Therefore, BPF iterator programs support filtering out objects +from iteration by allowing user space to configure the iterator program when it +is attached. + +-------------------------- +BPF Task Iterator Program +-------------------------- + +The following code is a BPF iterator program to print files and task information +through the ``seq_file`` of the iterator. It is a standard BPF iterator program +that visits every file of an iterator. We will use this BPF program in our +example later. + +:: + + #include <vmlinux.h> + #include <bpf/bpf_helpers.h> + + char _license[] SEC("license") = "GPL"; + + SEC("iter/task_file") + int dump_task_file(struct bpf_iter__task_file *ctx) + { + struct seq_file *seq = ctx->meta->seq; + struct task_struct *task = ctx->task; + struct file *file = ctx->file; + __u32 fd = ctx->fd; + if (task == NULL || file == NULL) + return 0; + if (ctx->meta->seq_num == 0) { + BPF_SEQ_PRINTF(seq, " tgid pid fd file\n"); + } + BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, + (long)file->f_op); + return 0; + } + +---------------------------------------- +Creating a File Iterator with Parameters +---------------------------------------- + +Now, let us look at how to create an iterator that includes only files of a +process. + +First, fill the ``bpf_iter_attach_opts`` struct as shown below: + +:: + + LIBBPF_OPTS(bpf_iter_attach_opts, opts); + union bpf_iter_link_info linfo; + memset(&linfo, 0, sizeof(linfo)); + linfo.task.pid = getpid(); + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + +``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator +that only includes opened files for the process with the specified ``pid``. In +this example, we will only be iterating files for our process. If +``linfo.task.pid`` is zero, the iterator will visit every opened file of every +process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator +that visits opened files of a specific thread, not a process. In this example, +``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a +separate file descriptor table. In most circumstances, all process threads share +a single file descriptor table. + +Now, in the userspace program, pass the pointer of struct to the +``bpf_program__attach_iter()``. + +:: + + link = bpf_program__attach_iter(prog, &opts); iter_fd = + bpf_iter_create(bpf_link__fd(link)); + +If both *tid* and *pid* are zero, an iterator created from this struct +``bpf_iter_attach_opts`` will include every opened file of every task in the +system (in the namespace, actually.) It is the same as passing a NULL as the +second argument to ``bpf_program__attach_iter()``. + +The whole program looks like the following code: + +:: + + #include <stdio.h> + #include <unistd.h> + #include <bpf/bpf.h> + #include <bpf/libbpf.h> + #include "bpf_iter_task_ex.skel.h" + + static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts) + { + struct bpf_link *link; + char buf[16] = {}; + int iter_fd = -1, len; + int ret = 0; + + link = bpf_program__attach_iter(prog, opts); + if (!link) { + fprintf(stderr, "bpf_program__attach_iter() fails\n"); + return -1; + } + iter_fd = bpf_iter_create(bpf_link__fd(link)); + if (iter_fd < 0) { + fprintf(stderr, "bpf_iter_create() fails\n"); + ret = -1; + goto free_link; + } + /* not check contents, but ensure read() ends without error */ + while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) { + buf[len] = 0; + printf("%s", buf); + } + printf("\n"); + free_link: + if (iter_fd >= 0) + close(iter_fd); + bpf_link__destroy(link); + return 0; + } + + static void test_task_file(void) + { + LIBBPF_OPTS(bpf_iter_attach_opts, opts); + struct bpf_iter_task_ex *skel; + union bpf_iter_link_info linfo; + skel = bpf_iter_task_ex__open_and_load(); + if (skel == NULL) + return; + memset(&linfo, 0, sizeof(linfo)); + linfo.task.pid = getpid(); + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + printf("PID %d\n", getpid()); + do_read_opts(skel->progs.dump_task_file, &opts); + bpf_iter_task_ex__destroy(skel); + } + + int main(int argc, const char * const * argv) + { + test_task_file(); + return 0; + } + +The following lines are the output of the program. +:: + + PID 1859 + + tgid pid fd file + 1859 1859 0 ffffffff82270aa0 + 1859 1859 1 ffffffff82270aa0 + 1859 1859 2 ffffffff82270aa0 + 1859 1859 3 ffffffff82272980 + 1859 1859 4 ffffffff8225e120 + 1859 1859 5 ffffffff82255120 + 1859 1859 6 ffffffff82254f00 + 1859 1859 7 ffffffff82254d80 + 1859 1859 8 ffffffff8225abe0 + +------------------ +Without Parameters +------------------ + +Let us look at how a BPF iterator without parameters skips files of other +processes in the system. In this case, the BPF program has to check the pid or +the tid of tasks, or it will receive every opened file in the system (in the +current *pid* namespace, actually). So, we usually add a global variable in the +BPF program to pass a *pid* to the BPF program. + +The BPF program would look like the following block. + + :: + + ...... + int target_pid = 0; + + SEC("iter/task_file") + int dump_task_file(struct bpf_iter__task_file *ctx) + { + ...... + if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */ + return 0; + BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, + (long)file->f_op); + return 0; + } + +The user space program would look like the following block: + + :: + + ...... + static void test_task_file(void) + { + ...... + skel = bpf_iter_task_ex__open_and_load(); + if (skel == NULL) + return; + skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */ + memset(&linfo, 0, sizeof(linfo)); + linfo.task.pid = getpid(); + opts.link_info = &linfo; + opts.link_info_len = sizeof(linfo); + ...... + } + +``target_pid`` is a global variable in the BPF program. The user space program +should initialize the variable with a process ID to skip opened files of other +processes in the BPF program. When you parametrize a BPF iterator, the iterator +calls the BPF program fewer times which can save significant resources. + +--------------------------- +Parametrizing VMA Iterators +--------------------------- + +By default, a BPF VMA iterator includes every VMA in every process. However, +you can still specify a process or a thread to include only its VMAs. Unlike +files, a thread can not have a separate address space (since Linux 2.6.0-test6). +Here, using *tid* makes no difference from using *pid*. + +---------------------------- +Parametrizing Task Iterators +---------------------------- + +A BPF task iterator with *pid* includes all tasks (threads) of a process. The +BPF program receives these tasks one after another. You can specify a BPF task +iterator with *tid* parameter to include only the tasks that match the given +*tid*. diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index cf8722f96090..7cd7c5415a99 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -1062,4 +1062,9 @@ format.:: 7. Testing ========== -Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests. +The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_ +provides an extensive set of BTF-related tests. + +.. Links +.. _tools/testing/selftests/bpf/prog_tests/btf.c: + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 1b50de1983ee..b81533d8b061 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture. maps bpf_prog_run classic_vs_extended.rst + bpf_iterators bpf_licensing test_debug clang-notes linux-notes other + redirect .. only:: subproject and html diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index 5d798437dad4..e672d5ec6cc7 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below) ``BPF_XOR | BPF_K | BPF_ALU`` means:: - src_reg = (u32) src_reg ^ (u32) imm32 + dst_reg = (u32) dst_reg ^ (u32) imm32 ``BPF_XOR | BPF_K | BPF_ALU64`` means:: - src_reg = src_reg ^ imm32 + dst_reg = dst_reg ^ imm32 Byte swap instructions diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index 0f858156371d..9fd7fb539f85 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -72,6 +72,30 @@ argument as its size. By default, without __sz annotation, the size of the type of the pointer is used. Without __sz annotation, a kfunc cannot accept a void pointer. +2.2.2 __k Annotation +-------------------- + +This annotation is only understood for scalar arguments, where it indicates that +the verifier must check the scalar argument to be a known constant, which does +not indicate a size parameter, and the value of the constant is relevant to the +safety of the program. + +An example is given below:: + + void *bpf_obj_new(u32 local_type_id__k, ...) + { + ... + } + +Here, bpf_obj_new uses local_type_id argument to find out the size of that type +ID in program's BTF and return a sized pointer to it. Each type ID will have a +distinct size, hence it is crucial to treat each such call as distinct when +values don't match during verifier state pruning checks. + +Hence, whenever a constant scalar argument is accepted by a kfunc which is not a +size parameter, and the value of the constant matters for program safety, __k +suffix should be used. + .. _BPF_kfunc_nodef: 2.3 Using an existing kernel function @@ -137,22 +161,20 @@ KF_ACQUIRE and KF_RET_NULL flags. -------------------------- The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It -indicates that the all pointer arguments will always have a guaranteed lifetime, -and pointers to kernel objects are always passed to helpers in their unmodified -form (as obtained from acquire kfuncs). +indicates that the all pointer arguments are valid, and that all pointers to +BTF objects have been passed in their unmodified form (that is, at a zero +offset, and without having been obtained from walking another pointer). -It can be used to enforce that a pointer to a refcounted object acquired from a -kfunc or BPF helper is passed as an argument to this kfunc without any -modifications (e.g. pointer arithmetic) such that it is trusted and points to -the original object. +There are two types of pointers to kernel objects which are considered "valid": -Meanwhile, it is also allowed pass pointers to normal memory to such kfuncs, -but those can have a non-zero offset. +1. Pointers which are passed as tracepoint or struct_ops callback arguments. +2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc. -This flag is often used for kfuncs that operate (change some property, perform -some operation) on an object that was obtained using an acquire kfunc. Such -kfuncs need an unchanged pointer to ensure the integrity of the operation being -performed on the expected object. +Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to +KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset. + +The definition of "valid" pointers is subject to change at any time, and has +absolutely no ABI stability guarantees. 2.4.6 KF_SLEEPABLE flag ----------------------- @@ -169,6 +191,15 @@ rebooting or panicking. Due to this additional restrictions apply to these calls. At the moment they only require CAP_SYS_BOOT capability, but more can be added later. +2.4.8 KF_RCU flag +----------------- + +The KF_RCU flag is used for kfuncs which have a rcu ptr as its argument. +When used together with KF_ACQUIRE, it indicates the kfunc should have a +single argument which must be a trusted argument or a MEM_RCU pointer. +The argument may have reference count of 0 and the kfunc must take this +into consideration. + 2.5 Registering the kfuncs -------------------------- @@ -191,3 +222,201 @@ type. An example is shown below:: return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set); } late_initcall(init_subsystem); + +3. Core kfuncs +============== + +The BPF subsystem provides a number of "core" kfuncs that are potentially +applicable to a wide variety of different possible use cases and programs. +Those kfuncs are documented here. + +3.1 struct task_struct * kfuncs +------------------------------- + +There are a number of kfuncs that allow ``struct task_struct *`` objects to be +used as kptrs: + +.. kernel-doc:: kernel/bpf/helpers.c + :identifiers: bpf_task_acquire bpf_task_release + +These kfuncs are useful when you want to acquire or release a reference to a +``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a +struct_ops callback arg. For example: + +.. code-block:: c + + /** + * A trivial example tracepoint program that shows how to + * acquire and release a struct task_struct * pointer. + */ + SEC("tp_btf/task_newtask") + int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags) + { + struct task_struct *acquired; + + acquired = bpf_task_acquire(task); + + /* + * In a typical program you'd do something like store + * the task in a map, and the map will automatically + * release it later. Here, we release it manually. + */ + bpf_task_release(acquired); + return 0; + } + +---- + +A BPF program can also look up a task from a pid. This can be useful if the +caller doesn't have a trusted pointer to a ``struct task_struct *`` object that +it can acquire a reference on with bpf_task_acquire(). + +.. kernel-doc:: kernel/bpf/helpers.c + :identifiers: bpf_task_from_pid + +Here is an example of it being used: + +.. code-block:: c + + SEC("tp_btf/task_newtask") + int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags) + { + struct task_struct *lookup; + + lookup = bpf_task_from_pid(task->pid); + if (!lookup) + /* A task should always be found, as %task is a tracepoint arg. */ + return -ENOENT; + + if (lookup->pid != task->pid) { + /* bpf_task_from_pid() looks up the task via its + * globally-unique pid from the init_pid_ns. Thus, + * the pid of the lookup task should always be the + * same as the input task. + */ + bpf_task_release(lookup); + return -EINVAL; + } + + /* bpf_task_from_pid() returns an acquired reference, + * so it must be dropped before returning from the + * tracepoint handler. + */ + bpf_task_release(lookup); + return 0; + } + +3.2 struct cgroup * kfuncs +-------------------------- + +``struct cgroup *`` objects also have acquire and release functions: + +.. kernel-doc:: kernel/bpf/helpers.c + :identifiers: bpf_cgroup_acquire bpf_cgroup_release + +These kfuncs are used in exactly the same manner as bpf_task_acquire() and +bpf_task_release() respectively, so we won't provide examples for them. + +---- + +You may also acquire a reference to a ``struct cgroup`` kptr that's already +stored in a map using bpf_cgroup_kptr_get(): + +.. kernel-doc:: kernel/bpf/helpers.c + :identifiers: bpf_cgroup_kptr_get + +Here's an example of how it can be used: + +.. code-block:: c + + /* struct containing the struct task_struct kptr which is actually stored in the map. */ + struct __cgroups_kfunc_map_value { + struct cgroup __kptr_ref * cgroup; + }; + + /* The map containing struct __cgroups_kfunc_map_value entries. */ + struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, int); + __type(value, struct __cgroups_kfunc_map_value); + __uint(max_entries, 1); + } __cgroups_kfunc_map SEC(".maps"); + + /* ... */ + + /** + * A simple example tracepoint program showing how a + * struct cgroup kptr that is stored in a map can + * be acquired using the bpf_cgroup_kptr_get() kfunc. + */ + SEC("tp_btf/cgroup_mkdir") + int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path) + { + struct cgroup *kptr; + struct __cgroups_kfunc_map_value *v; + s32 id = cgrp->self.id; + + /* Assume a cgroup kptr was previously stored in the map. */ + v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id); + if (!v) + return -ENOENT; + + /* Acquire a reference to the cgroup kptr that's already stored in the map. */ + kptr = bpf_cgroup_kptr_get(&v->cgroup); + if (!kptr) + /* If no cgroup was present in the map, it's because + * we're racing with another CPU that removed it with + * bpf_kptr_xchg() between the bpf_map_lookup_elem() + * above, and our call to bpf_cgroup_kptr_get(). + * bpf_cgroup_kptr_get() internally safely handles this + * race, and will return NULL if the task is no longer + * present in the map by the time we invoke the kfunc. + */ + return -EBUSY; + + /* Free the reference we just took above. Note that the + * original struct cgroup kptr is still in the map. It will + * be freed either at a later time if another context deletes + * it from the map, or automatically by the BPF subsystem if + * it's still present when the map is destroyed. + */ + bpf_cgroup_release(kptr); + + return 0; + } + +---- + +Another kfunc available for interacting with ``struct cgroup *`` objects is +bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup, +and return it as a cgroup kptr. + +.. kernel-doc:: kernel/bpf/helpers.c + :identifiers: bpf_cgroup_ancestor + +Eventually, BPF should be updated to allow this to happen with a normal memory +load in the program itself. This is currently not possible without more work in +the verifier. bpf_cgroup_ancestor() can be used as follows: + +.. code-block:: c + + /** + * Simple tracepoint example that illustrates how a cgroup's + * ancestor can be accessed using bpf_cgroup_ancestor(). + */ + SEC("tp_btf/cgroup_mkdir") + int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path) + { + struct cgroup *parent; + + /* The parent cgroup resides at the level before the current cgroup's level. */ + parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1); + if (!parent) + return -ENOENT; + + bpf_printk("Parent id is %d", parent->self.id); + + /* Return the parent cgroup that was acquired above. */ + bpf_cgroup_release(parent); + return 0; + } diff --git a/Documentation/bpf/libbpf/index.rst b/Documentation/bpf/libbpf/index.rst index 3722537d1384..f9b3b252e28f 100644 --- a/Documentation/bpf/libbpf/index.rst +++ b/Documentation/bpf/libbpf/index.rst @@ -1,5 +1,7 @@ .. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +.. _libbpf: + libbpf ====== @@ -7,6 +9,7 @@ libbpf :maxdepth: 1 API Documentation <https://libbpf.readthedocs.io/en/latest/api.html> + program_types libbpf_naming_convention libbpf_build diff --git a/Documentation/bpf/libbpf/program_types.rst b/Documentation/bpf/libbpf/program_types.rst new file mode 100644 index 000000000000..ad4d4d5eecb0 --- /dev/null +++ b/Documentation/bpf/libbpf/program_types.rst @@ -0,0 +1,203 @@ +.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) + +.. _program_types_and_elf: + +Program Types and ELF Sections +============================== + +The table below lists the program types, their attach types where relevant and the ELF section +names supported by libbpf for them. The ELF section names follow these rules: + +- ``type`` is an exact match, e.g. ``SEC("socket")`` +- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")`` + with a '``/``' separator between ``type`` and ``extras``. + +When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The +format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")`` +for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are +described in more detail in the footnotes. + + ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| Program Type | Attach Type | ELF Section Name | Sleepable | ++===========================================+========================================+==================================+===========+ +| ``BPF_PROG_TYPE_CGROUP_DEVICE`` | ``BPF_CGROUP_DEVICE`` | ``cgroup/dev`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_CGROUP_SKB`` | | ``cgroup/skb`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET_EGRESS`` | ``cgroup_skb/egress`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET_INGRESS`` | ``cgroup_skb/ingress`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` | ``BPF_CGROUP_GETSOCKOPT`` | ``cgroup/getsockopt`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_SETSOCKOPT`` | ``cgroup/setsockopt`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR`` | ``BPF_CGROUP_INET4_BIND`` | ``cgroup/bind4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET4_CONNECT`` | ``cgroup/connect4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET4_GETPEERNAME`` | ``cgroup/getpeername4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET4_GETSOCKNAME`` | ``cgroup/getsockname4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET6_BIND`` | ``cgroup/bind6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET6_CONNECT`` | ``cgroup/connect6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET6_GETPEERNAME`` | ``cgroup/getpeername6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET6_GETSOCKNAME`` | ``cgroup/getsockname6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_UDP4_RECVMSG`` | ``cgroup/recvmsg4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_UDP4_SENDMSG`` | ``cgroup/sendmsg4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_UDP6_RECVMSG`` | ``cgroup/recvmsg6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_UDP6_SENDMSG`` | ``cgroup/sendmsg6`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_CGROUP_SOCK`` | ``BPF_CGROUP_INET4_POST_BIND`` | ``cgroup/post_bind4`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET6_POST_BIND`` | ``cgroup/post_bind6`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET_SOCK_CREATE`` | ``cgroup/sock_create`` | | ++ + +----------------------------------+-----------+ +| | | ``cgroup/sock`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_CGROUP_INET_SOCK_RELEASE`` | ``cgroup/sock_release`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_CGROUP_SYSCTL`` | ``BPF_CGROUP_SYSCTL`` | ``cgroup/sysctl`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_EXT`` | | ``freplace+`` [#fentry]_ | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_FLOW_DISSECTOR`` | ``BPF_FLOW_DISSECTOR`` | ``flow_dissector`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_KPROBE`` | | ``kprobe+`` [#kprobe]_ | | ++ + +----------------------------------+-----------+ +| | | ``kretprobe+`` [#kprobe]_ | | ++ + +----------------------------------+-----------+ +| | | ``ksyscall+`` [#ksyscall]_ | | ++ + +----------------------------------+-----------+ +| | | ``kretsyscall+`` [#ksyscall]_ | | ++ + +----------------------------------+-----------+ +| | | ``uprobe+`` [#uprobe]_ | | ++ + +----------------------------------+-----------+ +| | | ``uprobe.s+`` [#uprobe]_ | Yes | ++ + +----------------------------------+-----------+ +| | | ``uretprobe+`` [#uprobe]_ | | ++ + +----------------------------------+-----------+ +| | | ``uretprobe.s+`` [#uprobe]_ | Yes | ++ + +----------------------------------+-----------+ +| | | ``usdt+`` [#usdt]_ | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_TRACE_KPROBE_MULTI`` | ``kprobe.multi+`` [#kpmulti]_ | | ++ + +----------------------------------+-----------+ +| | | ``kretprobe.multi+`` [#kpmulti]_ | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LIRC_MODE2`` | ``BPF_LIRC_MODE2`` | ``lirc_mode2`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LSM`` | ``BPF_LSM_CGROUP`` | ``lsm_cgroup+`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_LSM_MAC`` | ``lsm+`` [#lsm]_ | | ++ + +----------------------------------+-----------+ +| | | ``lsm.s+`` [#lsm]_ | Yes | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LWT_IN`` | | ``lwt_in`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LWT_OUT`` | | ``lwt_out`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` | | ``lwt_seg6local`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_LWT_XMIT`` | | ``lwt_xmit`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_PERF_EVENT`` | | ``perf_event`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` | | ``raw_tp.w+`` [#rawtp]_ | | ++ + +----------------------------------+-----------+ +| | | ``raw_tracepoint.w+`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_RAW_TRACEPOINT`` | | ``raw_tp+`` [#rawtp]_ | | ++ + +----------------------------------+-----------+ +| | | ``raw_tracepoint+`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SCHED_ACT`` | | ``action`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` | | ++ + +----------------------------------+-----------+ +| | | ``tc`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SK_LOOKUP`` | ``BPF_SK_LOOKUP`` | ``sk_lookup`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SK_MSG`` | ``BPF_SK_MSG_VERDICT`` | ``sk_msg`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SK_REUSEPORT`` | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_SK_REUSEPORT_SELECT`` | ``sk_reuseport`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SK_SKB`` | | ``sk_skb`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_SK_SKB_STREAM_PARSER`` | ``sk_skb/stream_parser`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_SK_SKB_STREAM_VERDICT`` | ``sk_skb/stream_verdict`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SOCKET_FILTER`` | | ``socket`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SOCK_OPS`` | ``BPF_CGROUP_SOCK_OPS`` | ``sockops`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_STRUCT_OPS`` | | ``struct_ops+`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_SYSCALL`` | | ``syscall`` | Yes | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_TRACEPOINT`` | | ``tp+`` [#tp]_ | | ++ + +----------------------------------+-----------+ +| | | ``tracepoint+`` [#tp]_ | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_TRACING`` | ``BPF_MODIFY_RETURN`` | ``fmod_ret+`` [#fentry]_ | | ++ + +----------------------------------+-----------+ +| | | ``fmod_ret.s+`` [#fentry]_ | Yes | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_TRACE_FENTRY`` | ``fentry+`` [#fentry]_ | | ++ + +----------------------------------+-----------+ +| | | ``fentry.s+`` [#fentry]_ | Yes | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_TRACE_FEXIT`` | ``fexit+`` [#fentry]_ | | ++ + +----------------------------------+-----------+ +| | | ``fexit.s+`` [#fentry]_ | Yes | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_TRACE_ITER`` | ``iter+`` [#iter]_ | | ++ + +----------------------------------+-----------+ +| | | ``iter.s+`` [#iter]_ | Yes | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ +| ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | | ++ + +----------------------------------+-----------+ +| | | ``xdp/cpumap`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_XDP_DEVMAP`` | ``xdp.frags/devmap`` | | ++ + +----------------------------------+-----------+ +| | | ``xdp/devmap`` | | ++ +----------------------------------------+----------------------------------+-----------+ +| | ``BPF_XDP`` | ``xdp.frags`` | | ++ + +----------------------------------+-----------+ +| | | ``xdp`` | | ++-------------------------------------------+----------------------------------------+----------------------------------+-----------+ + + +.. rubric:: Footnotes + +.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``. +.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid + characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid + non-negative integer. +.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``. +.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``. +.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``. +.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern`` + supports ``*`` and ``?`` wildcards. Valid characters for pattern are + ``a-zA-Z0-9_.*?``. +.. [#lsm] The ``lsm`` attachment format is ``lsm[.s]/<hook>``. +.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``. +.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``. +.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``. diff --git a/Documentation/bpf/map_array.rst b/Documentation/bpf/map_array.rst new file mode 100644 index 000000000000..f2f51a53e8ae --- /dev/null +++ b/Documentation/bpf/map_array.rst @@ -0,0 +1,262 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +================================================ +BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY +================================================ + +.. note:: + - ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19 + - ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6 + +``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array +storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is +of constant size. The size of the array is defined in ``max_entries`` at +creation time. All array elements are pre-allocated and zero initialized when +created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each +CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value +stored can be of any size, however, all array elements are aligned to 8 +bytes. + +Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by +setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and +starts on the first page. Sufficient page-sized and page-aligned blocks of +memory are allocated to store all array values, starting on the second page, +which in some cases will result in over-allocation of memory. The benefit of +using this is increased performance and ease of use since userspace programs +would not be required to use helper functions to access and mutate data. + +Usage +===== + +Kernel BPF +---------- + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper. +This helper returns a pointer into the array element, so to avoid data races +with userspace reading the value, the user must use primitives like +``__sync_fetch_and_add()`` when updating the value in-place. + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) + +Array elements can be updated using the ``bpf_map_update_elem()`` helper. + +``bpf_map_update_elem()`` returns 0 on success, or negative error in case of +failure. + +Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported. +To clear an array element, you may use ``bpf_map_update_elem()`` to insert a +zero value to that index. + +Per CPU Array +------------- + +Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs +across different CPUs. To restrict storage to a single CPU, you may use a +``BPF_MAP_TYPE_PERCPU_ARRAY``. + +When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and +``bpf_map_lookup_elem()`` helpers automatically access the slot for the current +CPU. + +bpf_map_lookup_percpu_elem() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu) + +The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the array +value for a specific CPU. Returns value on success , or ``NULL`` if no entry was +found or ``cpu`` is invalid. + +Concurrency +----------- + +Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock`` +to synchronize access. + +Userspace +--------- + +Access from userspace uses libbpf APIs with the same names as above, with +the map identified by its ``fd``. + +Examples +======== + +Please see the ``tools/testing/selftests/bpf`` directory for functional +examples. The code samples below demonstrate API usage. + +Kernel BPF +---------- + +This snippet shows how to declare an array in a BPF program. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, long); + __uint(max_entries, 256); + } my_map SEC(".maps"); + + +This example BPF program shows how to access an array element. + +.. code-block:: c + + int bpf_prog(struct __sk_buff *skb) + { + struct iphdr ip; + int index; + long *value; + + if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0) + return 0; + + index = ip.protocol; + value = bpf_map_lookup_elem(&my_map, &index); + if (value) + __sync_fetch_and_add(value, skb->len); + + return 0; + } + +Userspace +--------- + +BPF_MAP_TYPE_ARRAY +~~~~~~~~~~~~~~~~~~ + +This snippet shows how to create an array, using ``bpf_map_create_opts`` to +set flags. + +.. code-block:: c + + #include <bpf/libbpf.h> + #include <bpf/bpf.h> + + int create_array() + { + int fd; + LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE); + + fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, + "example_array", /* name */ + sizeof(__u32), /* key size */ + sizeof(long), /* value size */ + 256, /* max entries */ + &opts); /* create opts */ + return fd; + } + +This snippet shows how to initialize the elements of an array. + +.. code-block:: c + + int initialize_array(int fd) + { + __u32 i; + long value; + int ret; + + for (i = 0; i < 256; i++) { + value = i; + ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY); + if (ret < 0) + return ret; + } + + return ret; + } + +This snippet shows how to retrieve an element value from an array. + +.. code-block:: c + + int lookup(int fd) + { + __u32 index = 42; + long value; + int ret; + + ret = bpf_map_lookup_elem(fd, &index, &value); + if (ret < 0) + return ret; + + /* use value here */ + assert(value == 42); + + return ret; + } + +BPF_MAP_TYPE_PERCPU_ARRAY +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This snippet shows how to initialize the elements of a per CPU array. + +.. code-block:: c + + int initialize_array(int fd) + { + int ncpus = libbpf_num_possible_cpus(); + long values[ncpus]; + __u32 i, j; + int ret; + + for (i = 0; i < 256 ; i++) { + for (j = 0; j < ncpus; j++) + values[j] = i; + ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY); + if (ret < 0) + return ret; + } + + return ret; + } + +This snippet shows how to access the per CPU elements of an array value. + +.. code-block:: c + + int lookup(int fd) + { + int ncpus = libbpf_num_possible_cpus(); + __u32 index = 42, j; + long values[ncpus]; + int ret; + + ret = bpf_map_lookup_elem(fd, &index, &values); + if (ret < 0) + return ret; + + for (j = 0; j < ncpus; j++) { + /* Use per CPU value here */ + assert(values[j] == 42); + } + + return ret; + } + +Semantics +========= + +As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY`` +in userspace, each value is an array with ``ncpus`` elements. + +When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` can not be used +for these maps. diff --git a/Documentation/bpf/map_bloom_filter.rst b/Documentation/bpf/map_bloom_filter.rst new file mode 100644 index 000000000000..c82487f2fe0d --- /dev/null +++ b/Documentation/bpf/map_bloom_filter.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +========================= +BPF_MAP_TYPE_BLOOM_FILTER +========================= + +.. note:: + - ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16 + +``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom +filters are a space-efficient probabilistic data structure used to +quickly test whether an element exists in a set. In a bloom filter, +false positives are possible whereas false negatives are not. + +The bloom filter map does not have keys, only values. When the bloom +filter map is created, it must be created with a ``key_size`` of 0. The +bloom filter map supports two operations: + +- push: adding an element to the map +- peek: determining whether an element is present in the map + +BPF programs must use ``bpf_map_push_elem`` to add an element to the +bloom filter map and ``bpf_map_peek_elem`` to query the map. These +operations are exposed to userspace applications using the existing +``bpf`` syscall in the following way: + +- ``BPF_MAP_UPDATE_ELEM`` -> push +- ``BPF_MAP_LOOKUP_ELEM`` -> peek + +The ``max_entries`` size that is specified at map creation time is used +to approximate a reasonable bitmap size for the bloom filter, and is not +otherwise strictly enforced. If the user wishes to insert more entries +into the bloom filter than ``max_entries``, this may lead to a higher +false positive rate. + +The number of hashes to use for the bloom filter is configurable using +the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation +time. If no number is specified, the default used will be 5 hash +functions. In general, using more hashes decreases both the false +positive rate and the speed of a lookup. + +It is not possible to delete elements from a bloom filter map. A bloom +filter map may be used as an inner map. The user is responsible for +synchronising concurrent updates and lookups to ensure no false negative +lookups occur. + +Usage +===== + +Kernel BPF +---------- + +bpf_map_push_elem() +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags) + +A ``value`` can be added to a bloom filter using the +``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to +``BPF_ANY`` when adding an entry to the bloom filter. This helper +returns ``0`` on success, or negative error in case of failure. + +bpf_map_peek_elem() +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_peek_elem(struct bpf_map *map, void *value) + +The ``bpf_map_peek_elem()`` helper is used to determine whether +``value`` is present in the bloom filter map. This helper returns ``0`` +if ``value`` is probably present in the map, or ``-ENOENT`` if ``value`` +is definitely not present in the map. + +Userspace +--------- + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags) + +A userspace program can add a ``value`` to a bloom filter using libbpf's +``bpf_map_update_elem`` function. The ``key`` parameter must be set to +``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on +success, or negative error in case of failure. + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_lookup_elem (int fd, const void *key, void *value) + +A userspace program can determine the presence of ``value`` in a bloom +filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key`` +parameter must be set to ``NULL``. Returns ``0`` if ``value`` is +probably present in the map, or ``-ENOENT`` if ``value`` is definitely +not present in the map. + +Examples +======== + +Kernel BPF +---------- + +This snippet shows how to declare a bloom filter in a BPF program: + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_BLOOM_FILTER); + __type(value, __u32); + __uint(max_entries, 1000); + __uint(map_extra, 3); + } bloom_filter SEC(".maps"); + +This snippet shows how to determine presence of a value in a bloom +filter in a BPF program: + +.. code-block:: c + + void *lookup(__u32 key) + { + if (bpf_map_peek_elem(&bloom_filter, &key) == 0) { + /* Verify not a false positive and fetch an associated + * value using a secondary lookup, e.g. in a hash table + */ + return bpf_map_lookup_elem(&hash_table, &key); + } + return 0; + } + +Userspace +--------- + +This snippet shows how to use libbpf to create a bloom filter map from +userspace: + +.. code-block:: c + + int create_bloom() + { + LIBBPF_OPTS(bpf_map_create_opts, opts, + .map_extra = 3); /* number of hashes */ + + return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER, + "ipv6_bloom", /* name */ + 0, /* key size, must be zero */ + sizeof(ipv6_addr), /* value size */ + 10000, /* max entries */ + &opts); /* create options */ + } + +This snippet shows how to add an element to a bloom filter from +userspace: + +.. code-block:: c + + int add_element(struct bpf_map *bloom_map, __u32 value) + { + int bloom_fd = bpf_map__fd(bloom_map); + return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY); + } + +References +========== + +https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/ diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst new file mode 100644 index 000000000000..5d3f603efffa --- /dev/null +++ b/Documentation/bpf/map_cgrp_storage.rst @@ -0,0 +1,109 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates. + +========================= +BPF_MAP_TYPE_CGRP_STORAGE +========================= + +The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fix-sized +storage for cgroups. It is only available with ``CONFIG_CGROUPS``. +The programs are made available by the same Kconfig. The +data for a particular cgroup can be retrieved by looking up the map +with that cgroup. + +This document describes the usage and semantics of the +``BPF_MAP_TYPE_CGRP_STORAGE`` map type. + +Usage +===== + +The map key must be ``sizeof(int)`` representing a cgroup fd. +To access the storage in a program, use ``bpf_cgrp_storage_get``:: + + void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) + +``flags`` could be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE`` which indicates that +a new local storage will be created if one does not exist. + +The local storage can be removed with ``bpf_cgrp_storage_delete``:: + + long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup) + +The map is available to all program types. + +Examples +======== + +A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE:: + + #include <vmlinux.h> + #include <bpf/bpf_helpers.h> + #include <bpf/bpf_tracing.h> + + struct { + __uint(type, BPF_MAP_TYPE_CGRP_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, long); + } cgrp_storage SEC(".maps"); + + SEC("tp_btf/sys_enter") + int BPF_PROG(on_enter, struct pt_regs *regs, long id) + { + struct task_struct *task = bpf_get_current_task_btf(); + long *ptr; + + ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (ptr) + __sync_fetch_and_add(ptr, 1); + + return 0; + } + +Userspace accessing map declared above:: + + #include <linux/bpf.h> + #include <linux/libbpf.h> + + __u32 map_lookup(struct bpf_map *map, int cgrp_fd) + { + __u32 *value; + value = bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd); + if (value) + return *value; + return 0; + } + +Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE +============================================================================ + +The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as +deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new +``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following +illusates the main difference between ``BPF_MAP_TYPE_CGRP_STORAGE`` and +``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``. + +(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while + ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types + like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc. + +(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one + cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup + which is attached by a BPF program. + +(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so + ``bpf_get_local_storage()`` always returns non-NULL local storage. + ``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so + it is possible that ``bpf_cgrp_storage_get()`` may return null local storage. + To avoid such null local storage issue, user space can do + ``bpf_map_update_elem()`` to pre-allocate local storage before a BPF program + is attached. + +(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a BPF program + while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage during + prog detach time. + +So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` +functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE`` +instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``. diff --git a/Documentation/bpf/map_cpumap.rst b/Documentation/bpf/map_cpumap.rst new file mode 100644 index 000000000000..923cfc8ab51f --- /dev/null +++ b/Documentation/bpf/map_cpumap.rst @@ -0,0 +1,177 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +=================== +BPF_MAP_TYPE_CPUMAP +=================== + +.. note:: + - ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15 + +.. kernel-doc:: kernel/bpf/cpumap.c + :doc: cpu map + +An example use-case for this map type is software based Receive Side Scaling (RSS). + +The CPUMAP represents the CPUs in the system indexed as the map-key, and the +map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated +kernel thread bound to the given CPU to represent the remote CPU execution unit. + +Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program +on the remote CPU. This allows an XDP program to split its processing across +multiple CPUs. For example, a scenario where the initial CPU (that sees/receives +the packets) needs to do minimal packet processing and the remote CPU (to which +the packet is directed) can afford to spend more cycles processing the frame. The +initial CPU is where the XDP redirect program is executed. The remote CPU +receives raw ``xdp_frame`` objects. + +Usage +===== + +Kernel BPF +---------- +bpf_redirect_map() +^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags) + +Redirect the packet to the endpoint referenced by ``map`` at index ``key``. +For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs. + +The lower two bits of ``flags`` are used as the return code if the map lookup +fails. This is so that the return value can be one of the XDP program return +codes up to ``XDP_TX``, as chosen by the caller. + +User space +---------- +.. note:: + CPUMAP entries can only be updated/looked up/deleted from user space and not + from an eBPF program. Trying to call these functions from a kernel eBPF + program will result in the program failing to load and a verifier warning. + +bpf_map_update_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags); + +CPU entries can be added or updated using the ``bpf_map_update_elem()`` +helper. This helper replaces existing elements atomically. The ``value`` parameter +can be ``struct bpf_cpumap_val``. + + .. code-block:: c + + struct bpf_cpumap_val { + __u32 qsize; /* queue size to remote target CPU */ + union { + int fd; /* prog fd on map write */ + __u32 id; /* prog id on map read */ + } bpf_prog; + }; + +The flags argument can be one of the following: + - BPF_ANY: Create a new element or update an existing element. + - BPF_NOEXIST: Create a new element only if it did not exist. + - BPF_EXIST: Update an existing element. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_lookup_elem(int fd, const void *key, void *value); + +CPU entries can be retrieved using the ``bpf_map_lookup_elem()`` +helper. + +bpf_map_delete_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_delete_elem(int fd, const void *key); + +CPU entries can be deleted using the ``bpf_map_delete_elem()`` +helper. This helper will return 0 on success, or negative error in case of +failure. + +Examples +======== +Kernel +------ + +The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called +``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_CPUMAP); + __type(key, __u32); + __type(value, struct bpf_cpumap_val); + __uint(max_entries, 12); + } cpu_map SEC(".maps"); + + struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, __u32); + __uint(max_entries, 12); + } cpus_available SEC(".maps"); + + struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __type(key, __u32); + __type(value, __u32); + __uint(max_entries, 1); + } cpus_iterator SEC(".maps"); + + SEC("xdp") + int xdp_redir_cpu_round_robin(struct xdp_md *ctx) + { + __u32 key = 0; + __u32 cpu_dest = 0; + __u32 *cpu_selected, *cpu_iterator; + __u32 cpu_idx; + + cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key); + if (!cpu_iterator) + return XDP_ABORTED; + cpu_idx = *cpu_iterator; + + *cpu_iterator += 1; + if (*cpu_iterator == bpf_num_possible_cpus()) + *cpu_iterator = 0; + + cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx); + if (!cpu_selected) + return XDP_ABORTED; + cpu_dest = *cpu_selected; + + if (cpu_dest >= bpf_num_possible_cpus()) + return XDP_ABORTED; + + return bpf_redirect_map(&cpu_map, cpu_dest, 0); + } + +User space +---------- + +The following code snippet shows how to dynamically set the max_entries for a +CPUMAP to the max number of cpus available on the system. + +.. code-block:: c + + int set_max_cpu_entries(struct bpf_map *cpu_map) + { + if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) { + fprintf(stderr, "Failed to set max entries for cpu_map map: %s", + strerror(errno)); + return -1; + } + return 0; + } + +References +=========== + +- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap diff --git a/Documentation/bpf/map_devmap.rst b/Documentation/bpf/map_devmap.rst new file mode 100644 index 000000000000..927312c7b8c8 --- /dev/null +++ b/Documentation/bpf/map_devmap.rst @@ -0,0 +1,238 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +================================================= +BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH +================================================= + +.. note:: + - ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14 + - ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4 + +``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily +used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``. +``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as +the index to lookup a reference to a net device. While ``BPF_MAP_TYPE_DEVMAP_HASH`` +is backed by a hash table that uses a key to lookup a reference to a net device. +The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``> +pairs to update the maps with new net devices. + +.. note:: + - The key to a hash map doesn't have to be an ``ifindex``. + - While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices + it comes at the cost of a hash of the key when performing a look up. + +The setup and packet enqueue/send code is shared between the two types of +devmap; only the lookup and insertion is different. + +Usage +===== +Kernel BPF +---------- +bpf_redirect_map() +^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags) + +Redirect the packet to the endpoint referenced by ``map`` at index ``key``. +For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains +references to net devices (for forwarding packets through other ports). + +The lower two bits of *flags* are used as the return code if the map lookup +fails. This is so that the return value can be one of the XDP program return +codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags`` +can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined +below. + +With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces +in the map, with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded +from the broadcast. + +.. note:: + - The key is ignored if BPF_F_BROADCAST is set. + - The broadcast feature can also be used to implement multicast forwarding: + simply create multiple DEVMAPs, each one corresponding to a single multicast group. + +This helper will return ``XDP_REDIRECT`` on success, or the value of the two +lower bits of the ``flags`` argument if the map lookup fails. + +More information about redirection can be found :doc:`redirect` + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +Net device entries can be retrieved using the ``bpf_map_lookup_elem()`` +helper. + +User space +---------- +.. note:: + DEVMAP entries can only be updated/deleted from user space and not + from an eBPF program. Trying to call these functions from a kernel eBPF + program will result in the program failing to load and a verifier warning. + +bpf_map_update_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags); + +Net device entries can be added or updated using the ``bpf_map_update_elem()`` +helper. This helper replaces existing elements atomically. The ``value`` parameter +can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards +compatibility. + + .. code-block:: c + + struct bpf_devmap_val { + __u32 ifindex; /* device index */ + union { + int fd; /* prog fd on map write */ + __u32 id; /* prog id on map read */ + } bpf_prog; + }; + +The ``flags`` argument can be one of the following: + - ``BPF_ANY``: Create a new element or update an existing element. + - ``BPF_NOEXIST``: Create a new element only if it did not exist. + - ``BPF_EXIST``: Update an existing element. + +DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd`` +to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have +access to both Rx device and Tx device. The program associated with the ``fd`` +must have type XDP with expected attach type ``xdp_devmap``. +When a program is associated with a device index, the program is run on an +``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples +of how to attach/use xdp_devmap progs can be found in the kernel selftests: + +- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c`` +- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c`` + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + +.. c:function:: + int bpf_map_lookup_elem(int fd, const void *key, void *value); + +Net device entries can be retrieved using the ``bpf_map_lookup_elem()`` +helper. + +bpf_map_delete_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + +.. c:function:: + int bpf_map_delete_elem(int fd, const void *key); + +Net device entries can be deleted using the ``bpf_map_delete_elem()`` +helper. This helper will return 0 on success, or negative error in case of +failure. + +Examples +======== + +Kernel BPF +---------- + +The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP`` +called tx_port. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_DEVMAP); + __type(key, __u32); + __type(value, __u32); + __uint(max_entries, 256); + } tx_port SEC(".maps"); + +The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH`` +called forward_map. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_DEVMAP_HASH); + __type(key, __u32); + __type(value, struct bpf_devmap_val); + __uint(max_entries, 32); + } forward_map SEC(".maps"); + +.. note:: + + The value type in the DEVMAP above is a ``struct bpf_devmap_val`` + +The following code snippet shows a simple xdp_redirect_map program. This program +would work with a user space program that populates the devmap ``forward_map`` based +on ingress ifindexes. The BPF program (below) is redirecting packets using the +ingress ``ifindex`` as the ``key``. + +.. code-block:: c + + SEC("xdp") + int xdp_redirect_map_func(struct xdp_md *ctx) + { + int index = ctx->ingress_ifindex; + + return bpf_redirect_map(&forward_map, index, 0); + } + +The following code snippet shows a BPF program that is broadcasting packets to +all the interfaces in the ``tx_port`` devmap. + +.. code-block:: c + + SEC("xdp") + int xdp_redirect_map_func(struct xdp_md *ctx) + { + return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS); + } + +User space +---------- + +The following code snippet shows how to update a devmap called ``tx_port``. + +.. code-block:: c + + int update_devmap(int ifindex, int redirect_ifindex) + { + int ret; + + ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0); + if (ret < 0) { + fprintf(stderr, "Failed to update devmap_ value: %s\n", + strerror(errno)); + } + + return ret; + } + +The following code snippet shows how to update a hash_devmap called ``forward_map``. + +.. code-block:: c + + int update_devmap(int ifindex, int redirect_ifindex) + { + struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex }; + int ret; + + ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0); + if (ret < 0) { + fprintf(stderr, "Failed to update devmap_ value: %s\n", + strerror(errno)); + } + return ret; + } + +References +=========== + +- https://lwn.net/Articles/728146/ +- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176 +- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106 diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst index e85120878b27..8669426264c6 100644 --- a/Documentation/bpf/map_hash.rst +++ b/Documentation/bpf/map_hash.rst @@ -34,7 +34,14 @@ the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. Usage ===== -.. c:function:: +Kernel BPF +---------- + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) Hash entries can be added or updated using the ``bpf_map_update_elem()`` @@ -49,14 +56,22 @@ parameter can be used to control the update behaviour: ``bpf_map_update_elem()`` returns 0 on success, or negative error in case of failure. -.. c:function:: +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) Hash entries can be retrieved using the ``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the value associated with ``key``, or ``NULL`` if no entry was found. -.. c:function:: +bpf_map_delete_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + long bpf_map_delete_elem(struct bpf_map *map, const void *key) Hash entries can be deleted using the ``bpf_map_delete_elem()`` @@ -70,7 +85,11 @@ For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers automatically access the hash slot for the current CPU. -.. c:function:: +bpf_map_lookup_percpu_elem() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu) The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the @@ -89,7 +108,11 @@ See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``. Userspace --------- -.. c:function:: +bpf_map_get_next_key() +~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key) In userspace, it is possible to iterate through the keys of a hash using diff --git a/Documentation/bpf/map_lpm_trie.rst b/Documentation/bpf/map_lpm_trie.rst new file mode 100644 index 000000000000..74d64a30f500 --- /dev/null +++ b/Documentation/bpf/map_lpm_trie.rst @@ -0,0 +1,197 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +===================== +BPF_MAP_TYPE_LPM_TRIE +===================== + +.. note:: + - ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11 + +``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that +can be used to match IP addresses to a stored set of prefixes. +Internally, data is stored in an unbalanced trie of nodes that uses +``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in +network byte order, i.e. big endian, so ``data[0]`` stores the most +significant byte. + +LPM tries may be created with a maximum prefix length that is a multiple +of 8, in the range from 8 to 2048. The key used for lookup and update +operations is a ``struct bpf_lpm_trie_key``, extended by +``max_prefixlen/8`` bytes. + +- For IPv4 addresses the data length is 4 bytes +- For IPv6 addresses the data length is 16 bytes + +The value type stored in the LPM trie can be any user defined type. + +.. note:: + When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the + ``BPF_F_NO_PREALLOC`` flag. + +Usage +===== + +Kernel BPF +---------- + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +The longest prefix entry for a given data value can be found using the +``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the +value associated with the longest matching ``key``, or ``NULL`` if no +entry was found. + +The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when +performing longest prefix lookups. For example, when searching for the +longest prefix match for an IPv4 address, ``prefixlen`` should be set to +``32``. + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) + +Prefix entries can be added or updated using the ``bpf_map_update_elem()`` +helper. This helper replaces existing elements atomically. + +``bpf_map_update_elem()`` returns ``0`` on success, or negative error in +case of failure. + + .. note:: + The flags parameter must be one of BPF_ANY, BPF_NOEXIST or BPF_EXIST, + but the value is ignored, giving BPF_ANY semantics. + +bpf_map_delete_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_delete_elem(struct bpf_map *map, const void *key) + +Prefix entries can be deleted using the ``bpf_map_delete_elem()`` +helper. This helper will return 0 on success, or negative error in case +of failure. + +Userspace +--------- + +Access from userspace uses libbpf APIs with the same names as above, with +the map identified by ``fd``. + +bpf_map_get_next_key() +~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_get_next_key (int fd, const void *cur_key, void *next_key) + +A userspace program can iterate through the entries in an LPM trie using +libbpf's ``bpf_map_get_next_key()`` function. The first key can be +fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to +``NULL``. Subsequent calls will fetch the next key that follows the +current key. ``bpf_map_get_next_key()`` returns ``0`` on success, +``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative +error in case of failure. + +``bpf_map_get_next_key()`` will iterate through the LPM trie elements +from leftmost leaf first. This means that iteration will return more +specific keys before less specific ones. + +Examples +======== + +Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples +of LPM trie usage from userspace. The code snippets below demonstrate +API usage. + +Kernel BPF +---------- + +The following BPF code snippet shows how to declare a new LPM trie for IPv4 +address prefixes: + +.. code-block:: c + + #include <linux/bpf.h> + #include <bpf/bpf_helpers.h> + + struct ipv4_lpm_key { + __u32 prefixlen; + __u32 data; + }; + + struct { + __uint(type, BPF_MAP_TYPE_LPM_TRIE); + __type(key, struct ipv4_lpm_key); + __type(value, __u32); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, 255); + } ipv4_lpm_map SEC(".maps"); + +The following BPF code snippet shows how to lookup by IPv4 address: + +.. code-block:: c + + void *lookup(__u32 ipaddr) + { + struct ipv4_lpm_key key = { + .prefixlen = 32, + .data = ipaddr + }; + + return bpf_map_lookup_elem(&ipv4_lpm_map, &key); + } + +Userspace +--------- + +The following snippet shows how to insert an IPv4 prefix entry into an +LPM trie: + +.. code-block:: c + + int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value) + { + struct ipv4_lpm_key ipv4_key = { + .prefixlen = prefixlen, + .data = addr + }; + return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY); + } + +The following snippet shows a userspace program walking through the entries +of an LPM trie: + + +.. code-block:: c + + #include <bpf/libbpf.h> + #include <bpf/bpf.h> + + void iterate_lpm_trie(int map_fd) + { + struct ipv4_lpm_key *cur_key = NULL; + struct ipv4_lpm_key next_key; + struct value value; + int err; + + for (;;) { + err = bpf_map_get_next_key(map_fd, cur_key, &next_key); + if (err) + break; + + bpf_map_lookup_elem(map_fd, &next_key, &value); + + /* Use key and value here */ + + cur_key = &next_key; + } + } diff --git a/Documentation/bpf/map_of_maps.rst b/Documentation/bpf/map_of_maps.rst new file mode 100644 index 000000000000..7b5617c2d017 --- /dev/null +++ b/Documentation/bpf/map_of_maps.rst @@ -0,0 +1,130 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +======================================================== +BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS +======================================================== + +.. note:: + - ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were + introduced in kernel version 4.12 + +``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general +purpose support for map in map storage. One level of nesting is supported, where +an outer map contains instances of a single type of inner map, for example +``array_of_maps->sock_map``. + +When creating an outer map, an inner map instance is used to initialize the +metadata that the outer map holds about its inner maps. This inner map has a +separate lifetime from the outer map and can be deleted after the outer map has +been created. + +The outer map supports element lookup, update and delete from user space using +the syscall API. A BPF program is only allowed to do element lookup in the outer +map. + +.. note:: + - Multi-level nesting is not supported. + - Any BPF map type can be used as an inner map, except for + ``BPF_MAP_TYPE_PROG_ARRAY``. + - A BPF program cannot update or delete outer map entries. + +For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index +into the array. The array is a fixed size with ``max_entries`` elements that are +zero initialized when created. + +For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the +map. The kernel is responsible for allocating and freeing key/value pairs, up to +the max_entries limit that you specify. Hash maps use pre-allocation of hash +table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable +pre-allocation when it is too memory expensive. + +Usage +===== + +Kernel BPF Helper +----------------- + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This +helper returns a pointer to the inner map, or ``NULL`` if no entry was found. + +Examples +======== + +Kernel BPF Example +------------------ + +This snippet shows how to create and initialise an array of devmaps in a BPF +program. Note that the outer array can only be modified from user space using +the syscall API. + +.. code-block:: c + + struct inner_map { + __uint(type, BPF_MAP_TYPE_DEVMAP); + __uint(max_entries, 10); + __type(key, __u32); + __type(value, __u32); + } inner_map1 SEC(".maps"), inner_map2 SEC(".maps"); + + struct { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(max_entries, 2); + __type(key, __u32); + __array(values, struct inner_map); + } outer_map SEC(".maps") = { + .values = { &inner_map1, + &inner_map2 } + }; + +See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more +examples of declarative initialisation of outer maps. + +User Space +---------- + +This snippet shows how to create an array based outer map: + +.. code-block:: c + + int create_outer_array(int inner_fd) { + LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd); + int fd; + + fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS, + "example_array", /* name */ + sizeof(__u32), /* key size */ + sizeof(__u32), /* value size */ + 256, /* max entries */ + &opts); /* create opts */ + return fd; + } + + +This snippet shows how to add an inner map to an outer map: + +.. code-block:: c + + int add_devmap(int outer_fd, int index, const char *name) { + int fd; + + fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name, + sizeof(__u32), sizeof(__u32), 256, NULL); + if (fd < 0) + return fd; + + return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY); + } + +References +========== + +- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/ +- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/ diff --git a/Documentation/bpf/map_queue_stack.rst b/Documentation/bpf/map_queue_stack.rst new file mode 100644 index 000000000000..8d14ed49d6e1 --- /dev/null +++ b/Documentation/bpf/map_queue_stack.rst @@ -0,0 +1,146 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +========================================= +BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK +========================================= + +.. note:: + - ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced + in kernel version 4.20 + +``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK`` +provides LIFO storage for BPF programs. These maps support peek, pop and +push operations that are exposed to BPF programs through the respective +helpers. These operations are exposed to userspace applications using +the existing ``bpf`` syscall in the following way: + +- ``BPF_MAP_LOOKUP_ELEM`` -> peek +- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop +- ``BPF_MAP_UPDATE_ELEM`` -> push + +``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support +``BPF_F_NO_PREALLOC``. + +Usage +===== + +Kernel BPF +---------- + +bpf_map_push_elem() +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags) + +An element ``value`` can be added to a queue or stack using the +``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to +``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then, +when the queue or stack is full, the oldest element will be removed to +make room for ``value`` to be added. Returns ``0`` on success, or +negative error in case of failure. + +bpf_map_peek_elem() +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_peek_elem(struct bpf_map *map, void *value) + +This helper fetches an element ``value`` from a queue or stack without +removing it. Returns ``0`` on success, or negative error in case of +failure. + +bpf_map_pop_elem() +~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_map_pop_elem(struct bpf_map *map, void *value) + +This helper removes an element into ``value`` from a queue or +stack. Returns ``0`` on success, or negative error in case of failure. + + +Userspace +--------- + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags) + +A userspace program can push ``value`` onto a queue or stack using libbpf's +``bpf_map_update_elem`` function. The ``key`` parameter must be set to +``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the +same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on +success, or negative error in case of failure. + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_lookup_elem (int fd, const void *key, void *value) + +A userspace program can peek at the ``value`` at the head of a queue or stack +using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be +set to ``NULL``. Returns ``0`` on success, or negative error in case of +failure. + +bpf_map_lookup_and_delete_elem() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_lookup_and_delete_elem (int fd, const void *key, void *value) + +A userspace program can pop a ``value`` from the head of a queue or stack using +the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter +must be set to ``NULL``. Returns ``0`` on success, or negative error in case of +failure. + +Examples +======== + +Kernel BPF +---------- + +This snippet shows how to declare a queue in a BPF program: + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_QUEUE); + __type(value, __u32); + __uint(max_entries, 10); + } queue SEC(".maps"); + + +Userspace +--------- + +This snippet shows how to use libbpf's low-level API to create a queue from +userspace: + +.. code-block:: c + + int create_queue() + { + return bpf_map_create(BPF_MAP_TYPE_QUEUE, + "sample_queue", /* name */ + 0, /* key size, must be zero */ + sizeof(__u32), /* value size */ + 10, /* max entries */ + NULL); /* create options */ + } + + +References +========== + +https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/ diff --git a/Documentation/bpf/map_sk_storage.rst b/Documentation/bpf/map_sk_storage.rst new file mode 100644 index 000000000000..047e16c8aaa8 --- /dev/null +++ b/Documentation/bpf/map_sk_storage.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +======================= +BPF_MAP_TYPE_SK_STORAGE +======================= + +.. note:: + - ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2 + +``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF +programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage +to be provided and acts as the handle for accessing the socket-local +storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored +locally with each socket instead of with the map. The kernel is responsible for +allocating storage for a socket when requested and for freeing the storage when +either the map or the socket is deleted. + +.. note:: + - The key type must be ``int`` and ``max_entries`` must be set to ``0``. + - The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for + socket-local storage. + +Usage +===== + +Kernel BPF +---------- + +bpf_sk_storage_get() +~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags) + +Socket-local storage can be retrieved using the ``bpf_sk_storage_get()`` +helper. The helper gets the storage from ``sk`` that is associated with ``map``. +If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then +``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not +already exist. ``value`` can be used together with +``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it +will be zero initialized. Returns a pointer to the storage on success, or +``NULL`` in case of failure. + +.. note:: + - ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs. + - ``sk`` is a ``struct bpf_sock`` pointer for other program types. + +bpf_sk_storage_delete() +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + long bpf_sk_storage_delete(struct bpf_map *map, void *sk) + +Socket-local storage can be deleted using the ``bpf_sk_storage_delete()`` +helper. The helper deletes the storage from ``sk`` that is identified by +``map``. Returns ``0`` on success, or negative error in case of failure. + +User space +---------- + +bpf_map_update_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags) + +Socket-local storage for the socket identified by ``key`` belonging to +``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf +function. ``key`` must be a pointer to a valid ``fd`` in the user space +program. The ``flags`` parameter can be used to control the update behaviour: + +- ``BPF_ANY`` will create storage for ``fd`` or update existing storage. +- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already + exist, otherwise the call will fail with ``-EEXIST``. +- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists, + otherwise the call will fail with ``-ENOENT``. + +Returns ``0`` on success, or negative error in case of failure. + +bpf_map_lookup_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_lookup_elem(int map_fd, const void *key, void *value) + +Socket-local storage for the socket identified by ``key`` belonging to +``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf +function. ``key`` must be a pointer to a valid ``fd`` in the user space +program. Returns ``0`` on success, or negative error in case of failure. + +bpf_map_delete_elem() +~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: c + + int bpf_map_delete_elem(int map_fd, const void *key) + +Socket-local storage for the socket identified by ``key`` belonging to +``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf +function. Returns ``0`` on success, or negative error in case of failure. + +Examples +======== + +Kernel BPF +---------- + +This snippet shows how to declare socket-local storage in a BPF program: + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct my_storage); + } socket_storage SEC(".maps"); + +This snippet shows how to retrieve socket-local storage in a BPF program: + +.. code-block:: c + + SEC("sockops") + int _sockops(struct bpf_sock_ops *ctx) + { + struct my_storage *storage; + struct bpf_sock *sk; + + sk = ctx->sk; + if (!sk) + return 1; + + storage = bpf_sk_storage_get(&socket_storage, sk, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!storage) + return 1; + + /* Use 'storage' here */ + + return 1; + } + + +Please see the ``tools/testing/selftests/bpf`` directory for functional +examples. + +References +========== + +https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/ diff --git a/Documentation/bpf/map_xskmap.rst b/Documentation/bpf/map_xskmap.rst new file mode 100644 index 000000000000..7093b8208451 --- /dev/null +++ b/Documentation/bpf/map_xskmap.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +=================== +BPF_MAP_TYPE_XSKMAP +=================== + +.. note:: + - ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18 + +The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for XDP BPF helper +call ``bpf_redirect_map()`` and ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'. +This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of +address family in the kernel that allows redirection of frames from a driver to +user space without having to traverse the full network stack. An AF_XDP socket +binds to a single netdev queue. A mapping of XSKs to queues is shown below: + +.. code-block:: none + + +---------------------------------------------------+ + | xsk A | xsk B | xsk C |<---+ User space + =========================================================|========== + | Queue 0 | Queue 1 | Queue 2 | | Kernel + +---------------------------------------------------+ | + | Netdev eth0 | | + +---------------------------------------------------+ | + | +=============+ | | + | | key | xsk | | | + | +---------+ +=============+ | | + | | | | 0 | xsk A | | | + | | | +-------------+ | | + | | | | 1 | xsk B | | | + | | BPF |-- redirect -->+-------------+-------------+ + | | prog | | 2 | xsk C | | + | | | +-------------+ | + | | | | + | | | | + | +---------+ | + | | + +---------------------------------------------------+ + +.. note:: + An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only* + accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect + from a <netdev/queue_id> other than what the socket is bound to, the frame will + not be received on the socket. + +Typically an XSKMAP is created per netdev. This map contains an array of XSK File +Descriptors (FDs). The number of array elements is typically set or adjusted using +the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number +of queues supported by the netdev. + +.. note:: + Both the map key and map value size must be 4 bytes. + +Usage +===== + +Kernel BPF +---------- +bpf_redirect_map() +^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags) + +Redirect the packet to the endpoint referenced by ``map`` at index ``key``. +For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs +for sockets attached to a netdev's queues. + +.. note:: + If the map is empty at an index, the packet is dropped. This means that it is + necessary to have an XDP program loaded with at least one XSK in the + XSKMAP to be able to get any traffic to user space through the socket. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +XSK entry references of type ``struct xdp_sock *`` can be retrieved using the +``bpf_map_lookup_elem()`` helper. + +User space +---------- +.. note:: + XSK entries can only be updated/deleted from user space and not from + a BPF program. Trying to call these functions from a kernel BPF program will + result in the program failing to load and a verifier warning. + +bpf_map_update_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags) + +XSK entries can be added or updated using the ``bpf_map_update_elem()`` +helper. The ``key`` parameter is equal to the queue_id of the queue the XSK +is attaching to. And the ``value`` parameter is the FD value of that socket. + +Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the +associated ``struct xdp_sock`` instance. + +The flags argument can be one of the following: + +- BPF_ANY: Create a new element or update an existing element. +- BPF_NOEXIST: Create a new element only if it did not exist. +- BPF_EXIST: Update an existing element. + +bpf_map_lookup_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_lookup_elem(int fd, const void *key, void *value) + +Returns ``struct xdp_sock *`` or negative error in case of failure. + +bpf_map_delete_elem() +^^^^^^^^^^^^^^^^^^^^^ +.. code-block:: c + + int bpf_map_delete_elem(int fd, const void *key) + +XSK entries can be deleted using the ``bpf_map_delete_elem()`` +helper. This helper will return 0 on success, or negative error in case of +failure. + +.. note:: + When `libxdp`_ deletes an XSK it also removes the associated socket + entry from the XSKMAP. + +Examples +======== +Kernel +------ + +The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called +``xsks_map`` and how to redirect packets to an XSK. + +.. code-block:: c + + struct { + __uint(type, BPF_MAP_TYPE_XSKMAP); + __type(key, __u32); + __type(value, __u32); + __uint(max_entries, 64); + } xsks_map SEC(".maps"); + + + SEC("xdp") + int xsk_redir_prog(struct xdp_md *ctx) + { + __u32 index = ctx->rx_queue_index; + + if (bpf_map_lookup_elem(&xsks_map, &index)) + return bpf_redirect_map(&xsks_map, index, 0); + return XDP_PASS; + } + +User space +---------- + +The following code snippet shows how to update an XSKMAP with an XSK entry. + +.. code-block:: c + + int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd) + { + int ret; + + ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0); + if (ret < 0) + fprintf(stderr, "Failed to update xsks_map: %s\n", strerror(errno)); + + return ret; + } + +For an example on how create AF_XDP sockets, please see the AF_XDP-example and +AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository. +For a detailed explaination of the AF_XDP interface please see: + +- `libxdp-readme`_. +- `AF_XDP`_ kernel documentation. + +.. note:: + The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_. + +.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp +.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html +.. _bpf-examples: https://github.com/xdp-project/bpf-examples +.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets diff --git a/Documentation/bpf/maps.rst b/Documentation/bpf/maps.rst index f41619e312ac..4906ff0f8382 100644 --- a/Documentation/bpf/maps.rst +++ b/Documentation/bpf/maps.rst @@ -1,52 +1,81 @@ -========= -eBPF maps +======== +BPF maps +======== + +BPF 'maps' provide generic storage of different types for sharing data between +kernel and user space. There are several storage types available, including +hash, array, bloom filter and radix-tree. Several of the map types exist to +support specific BPF helpers that perform actions based on the map contents. The +maps are accessed from BPF programs via BPF helpers which are documented in the +`man-pages`_ for `bpf-helpers(7)`_. + +BPF maps are accessed from user space via the ``bpf`` syscall, which provides +commands to create maps, lookup elements, update elements and delete +elements. More details of the BPF syscall are available in +:doc:`/userspace-api/ebpf/syscall` and in the `man-pages`_ for `bpf(2)`_. + +Map Types ========= -'maps' is a generic storage of different types for sharing data between kernel -and userspace. +.. toctree:: + :maxdepth: 1 + :glob: -The maps are accessed from user space via BPF syscall, which has commands: + map_* -- create a map with given type and attributes - ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` - using attr->map_type, attr->key_size, attr->value_size, attr->max_entries - returns process-local file descriptor or negative error +Usage Notes +=========== -- lookup key in a given map - ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key, attr->value - returns zero and stores found elem into value or negative error +.. c:function:: + int bpf(int command, union bpf_attr *attr, u32 size) -- create or update key/value pair in a given map - ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key, attr->value - returns zero or negative error +Use the ``bpf()`` system call to perform the operation specified by +``command``. The operation takes parameters provided in ``attr``. The ``size`` +argument is the size of the ``union bpf_attr`` in ``attr``. -- find and delete element by key in a given map - ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key +**BPF_MAP_CREATE** -- to delete map: close(fd) - Exiting process will delete maps automatically +Create a map with the desired type and attributes in ``attr``: -userspace programs use this syscall to create/access maps that eBPF programs -are concurrently updating. +.. code-block:: c -maps can have different types: hash, array, bloom filter, radix-tree, etc. + int fd; + union bpf_attr attr = { + .map_type = BPF_MAP_TYPE_ARRAY; /* mandatory */ + .key_size = sizeof(__u32); /* mandatory */ + .value_size = sizeof(__u32); /* mandatory */ + .max_entries = 256; /* mandatory */ + .map_flags = BPF_F_MMAPABLE; + .map_name = "example_array"; + }; -The map is defined by: + fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); - - type - - max number of elements - - key size in bytes - - value size in bytes +Returns a process-local file descriptor on success, or negative error in case of +failure. The map can be deleted by calling ``close(fd)``. Maps held by open +file descriptors will be deleted automatically when a process exits. -Map Types -========= +.. note:: Valid characters for ``map_name`` are ``A-Z``, ``a-z``, ``0-9``, + ``'_'`` and ``'.'``. -.. toctree:: - :maxdepth: 1 - :glob: +**BPF_MAP_LOOKUP_ELEM** + +Lookup key in a given map using ``attr->map_fd``, ``attr->key``, +``attr->value``. Returns zero and stores found elem into ``attr->value`` on +success, or negative error on failure. + +**BPF_MAP_UPDATE_ELEM** + +Create or update key/value pair in a given map using ``attr->map_fd``, ``attr->key``, +``attr->value``. Returns zero on success or negative error on failure. + +**BPF_MAP_DELETE_ELEM** + +Find and delete element by key in a given map using ``attr->map_fd``, +``attr->key``. Returns zero on success or negative error on failure. - map_*
\ No newline at end of file +.. Links: +.. _man-pages: https://www.kernel.org/doc/man-pages/ +.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html +.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html diff --git a/Documentation/bpf/programs.rst b/Documentation/bpf/programs.rst index 620eb667ac7a..c99000ab6d9b 100644 --- a/Documentation/bpf/programs.rst +++ b/Documentation/bpf/programs.rst @@ -7,3 +7,6 @@ Program Types :glob: prog_* + +For a list of all program types, see :ref:`program_types_and_elf` in +the :ref:`libbpf` documentation. diff --git a/Documentation/bpf/redirect.rst b/Documentation/bpf/redirect.rst new file mode 100644 index 000000000000..2fa2b0b05004 --- /dev/null +++ b/Documentation/bpf/redirect.rst @@ -0,0 +1,81 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +======== +Redirect +======== +XDP_REDIRECT +############ +Supported maps +-------------- + +XDP_REDIRECT works with the following map types: + +- ``BPF_MAP_TYPE_DEVMAP`` +- ``BPF_MAP_TYPE_DEVMAP_HASH`` +- ``BPF_MAP_TYPE_CPUMAP`` +- ``BPF_MAP_TYPE_XSKMAP`` + +For more information on these maps, please see the specific map documentation. + +Process +------- + +.. kernel-doc:: net/core/filter.c + :doc: xdp redirect + +.. note:: + Not all drivers support transmitting frames after a redirect, and for + those that do, not all of them support non-linear frames. Non-linear xdp + bufs/frames are bufs/frames that contain more than one fragment. + +Debugging packet drops +---------------------- +Silent packet drops for XDP_REDIRECT can be debugged using: + +- bpf_trace +- perf_record + +bpf_trace +^^^^^^^^^ +The following bpftrace command can be used to capture and count all XDP tracepoints: + +.. code-block:: none + + sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }' + Attaching 12 probes... + ^C + + @cnt[tracepoint:xdp:mem_connect]: 18 + @cnt[tracepoint:xdp:mem_disconnect]: 18 + @cnt[tracepoint:xdp:xdp_exception]: 19605 + @cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604 + @cnt[tracepoint:xdp:xdp_redirect]: 22292200 + +.. note:: + The various xdp tracepoints can be found in ``source/include/trace/events/xdp.h`` + +The following bpftrace command can be used to extract the ``ERRNO`` being returned as +part of the err parameter: + +.. code-block:: none + + sudo bpftrace -e \ + 'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();} + tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}' + +perf record +^^^^^^^^^^^ +The perf tool also supports recording tracepoints: + +.. code-block:: none + + perf record -a -e xdp:xdp_redirect_err \ + -e xdp:xdp_redirect_map_err \ + -e xdp:xdp_exception \ + -e xdp:xdp_devmap_xmit + +References +=========== + +- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor |