author | Jakub Kicinski <kuba@kernel.org> | 2022-12-12 22:27:41 +0300 |
---|---|---|
committer | Jakub Kicinski <kuba@kernel.org> | 2022-12-12 22:27:42 +0300 |
commit | 26f708a28454df2062a63fd869e983c379f50ff0 (patch) | |
tree | e9580092e7d69af3f9d5add0cd331bad2a6bf708 /Documentation | |
parent | b2b509fb5a1e6af1e630a755b32c4658099df70b (diff) | |
parent | 99523094de48df65477cbbb9d8027f4bc4701794 (diff) | |
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2022-12-11
We've added 74 non-merge commits during the last 11 day(s) which contain
a total of 88 files changed, 3362 insertions(+), 789 deletions(-).
The main changes are:
1) Decouple prune and jump points handling in the verifier, from Andrii.
2) Do not rely on ALLOW_ERROR_INJECTION for fmod_ret, from Benjamin.
Merged from hid tree.
3) Do not zero-extend kfunc return values. Necessary fix for 32-bit archs,
from Björn.
4) Don't use rcu_users to refcount in task kfuncs, from David.
5) Three reg_state->id fixes in the verifier, from Eduard.
6) Optimize bpf_mem_alloc by reusing elements from free_by_rcu, from Hou.
7) Refactor dynptr handling in the verifier, from Kumar.
8) Remove the "/sys" mount and umount dance in {open,close}_netns
in bpf selftests, from Martin.
9) Enable sleepable support for cgrp local storage, from Yonghong.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (74 commits)
selftests/bpf: test case for relaxed prunning of active_lock.id
selftests/bpf: Add pruning test case for bpf_spin_lock
bpf: use check_ids() for active_lock comparison
selftests/bpf: verify states_equal() maintains idmap across all frames
bpf: states_equal() must build idmap for all function frames
selftests/bpf: test cases for regsafe() bug skipping check_id()
bpf: regsafe() must not skip check_ids()
docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGE
selftests/bpf: Add test for dynptr reinit in user_ringbuf callback
bpf: Use memmove for bpf_dynptr_{read,write}
bpf: Move PTR_TO_STACK alignment check to process_dynptr_func
bpf: Rework check_func_arg_reg_off
bpf: Rework process_dynptr_func
bpf: Propagate errors from process_* checks in check_func_arg
bpf: Refactor ARG_PTR_TO_DYNPTR checks into process_dynptr_func
bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is true
bpf: Reuse freed element in free_by_rcu during allocation
selftests/bpf: Bring test_offload.py back to life
bpf: Fix comment error in fixup_kfunc_call function
bpf: Do not zero-extend kfunc return values
...
====================
Link: https://lore.kernel.org/r/20221212024701.73809-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/bpf/bpf_iterators.rst | 485 | ||||
-rw-r--r-- | Documentation/bpf/index.rst | 1 | ||||
-rw-r--r-- | Documentation/bpf/instruction-set.rst | 4 | ||||
-rw-r--r-- | Documentation/bpf/kfuncs.rst | 207 | ||||
-rw-r--r-- | Documentation/bpf/map_sk_storage.rst | 155 |
5 files changed, 850 insertions, 2 deletions
diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst
new file mode 100644
index 000000000000..6d7770793fab
--- /dev/null
+++ b/Documentation/bpf/bpf_iterators.rst
@@ -0,0 +1,485 @@
+=============
+BPF Iterators
+=============
+
+
+----------
+Motivation
+----------
+
+There are a few existing ways to dump kernel data into user space. The most
+popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
+all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
+sockets in the system. However, their output format tends to be fixed, and if
+users want more information about these sockets, they have to patch the kernel,
+which often takes time to publish upstream and release. The same is true for popular
+tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
+additional information needs a kernel patch.
+
+To solve this problem, the `drgn
+<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
+dig out the kernel data with no kernel change. However, the main drawback for
+drgn is performance, as it cannot do pointer tracing inside the kernel. In
+addition, drgn cannot validate a pointer value and may read invalid data if the
+pointer becomes invalid inside the kernel.
+
+The BPF iterator solves the above problem by providing flexibility on what data
+(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
+data object.
+
+----------------------
+How BPF Iterators Work
+----------------------
+
+A BPF iterator is a type of BPF program that allows users to iterate over
+specific types of kernel objects. Unlike traditional BPF tracing programs that
+allow users to define callbacks that are invoked at particular points of
+execution in the kernel, BPF iterators allow users to define callbacks that
+should be executed for every entry in a variety of kernel data structures.
+
+For example, users can define a BPF iterator that iterates over every task on
+the system and dumps the total amount of CPU runtime currently used by each of
+them. Another BPF task iterator may instead dump the cgroup information for each
+task. Such flexibility is the core value of BPF iterators.
+
+A BPF program is always loaded into the kernel at the behest of a user space
+process. A user space process loads a BPF program by opening and initializing
+the program skeleton as required and then invoking a syscall to have the BPF
+program verified and loaded by the kernel.
+
+In traditional tracing programs, a program is activated by having user space
+obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
+activated, the program callback will be invoked whenever the tracepoint is
+triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
+program is obtained using ``bpf_link_create()``, and the program callback is
+invoked by issuing system calls from user space.
+
+Next, let us see how you can use the iterators to iterate on kernel objects and
+read data.
+
+------------------------
+How to Use BPF iterators
+------------------------
+
+BPF selftests are a great resource to illustrate how to use the iterators. In
+this section, we’ll walk through a BPF selftest which shows how to load and use
+a BPF iterator program. To begin, we’ll look at `bpf_iter.c
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
+which illustrates how to load and trigger BPF iterators on the user space side.
+Later, we’ll look at a BPF program that runs in kernel space.
+
+Loading a BPF iterator in the kernel from user space typically involves the
+following steps:
+
+* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
+  has verified and loaded the program, it returns a file descriptor (fd) to user
+  space.
+* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
+  specified with the BPF program file descriptor received from the kernel.
+* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
+  ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
+* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
+  available.
+* Close the iterator fd using ``close(bpf_iter_fd)``.
+* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
+
+The following are a few examples of selftest BPF iterator programs:
+
+* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
+* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
+* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
+
+Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
+
+Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
+<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
+Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
+represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
+iterator.
+
+::
+
+    struct bpf_iter__task_file {
+            union {
+                struct bpf_iter_meta *meta;
+            };
+            union {
+                struct task_struct *task;
+            };
+            u32 fd;
+            union {
+                struct file *file;
+            };
+    };
+
+In the above code, the field 'meta' contains the metadata, which is the same for
+all BPF iterator programs. The rest of the fields are specific to different
+iterators. For example, for task_file iterators, the kernel layer provides the
+'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
+counted
+<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
+so they won't go away when the BPF program runs.
+
+Here is a snippet from the ``bpf_iter_task_file.c`` file:
+
+::
+
+  SEC("iter/task_file")
+  int dump_task_file(struct bpf_iter__task_file *ctx)
+  {
+    struct seq_file *seq = ctx->meta->seq;
+    struct task_struct *task = ctx->task;
+    struct file *file = ctx->file;
+    __u32 fd = ctx->fd;
+
+    if (task == NULL || file == NULL)
+      return 0;
+
+    if (ctx->meta->seq_num == 0) {
+      count = 0;
+      BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
+    }
+
+    if (tgid == task->tgid && task->tgid != task->pid)
+      count++;
+
+    if (last_tgid != task->tgid) {
+      last_tgid = task->tgid;
+      unique_tgid_count++;
+    }
+
+    BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+            (long)file->f_op);
+    return 0;
+  }
+
+In the above example, the section name ``SEC(iter/task_file)``, indicates that
+the program is a BPF iterator program to iterate all files from all tasks. The
+context of the program is ``bpf_iter__task_file`` struct.
+
+The user space program invokes the BPF iterator program running in the kernel
+by issuing a ``read()`` syscall. Once invoked, the BPF
+program can export data to user space using a variety of BPF helper functions.
+You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
+``bpf_seq_write()`` function based on whether you need formatted output or just
+binary data, respectively. For binary-encoded data, the user space applications
+can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
+you can use ``cat <path>`` to print the results similar to ``cat
+/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
+use ``rm -f <path>`` to remove the pinned iterator.
+
+For example, you can use the following command to create a BPF iterator from the
+``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
+path:
+
+::
+
+  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
+
+And then print out the results using the following command:
+
+::
+
+  $ cat /sys/fs/bpf/my_route
+
+
+-------------------------------------------------------
+Implement Kernel Support for BPF Iterator Program Types
+-------------------------------------------------------
+
+To implement a BPF iterator in the kernel, the developer must make a one-time
+change to the following key data structure defined in the `bpf.h
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
+file.
+
+::
+
+  struct bpf_iter_reg {
+            const char *target;
+            bpf_iter_attach_target_t attach_target;
+            bpf_iter_detach_target_t detach_target;
+            bpf_iter_show_fdinfo_t show_fdinfo;
+            bpf_iter_fill_link_info_t fill_link_info;
+            bpf_iter_get_func_proto_t get_func_proto;
+            u32 ctx_arg_info_size;
+            u32 feature;
+            struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
+            const struct bpf_iter_seq_info *seq_info;
+  };
+
+After filling the data structure fields, call ``bpf_iter_reg_target()`` to
+register the iterator to the main BPF iterator subsystem.
+
+The following is the breakdown for each field in struct ``bpf_iter_reg``.
+
+.. list-table::
+   :widths: 25 50
+   :header-rows: 1
+
+   * - Fields
+     - Description
+   * - target
+     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
+       ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
+   * - attach_target and detach_target
+     - Allows for target specific ``link_create`` action since some targets
+       may need special processing. Called during the user space link_create stage.
+   * - show_fdinfo and fill_link_info
+     - Called to fill target specific information when user tries to get link
+       info associated with the iterator.
+   * - get_func_proto
+     - Permits a BPF iterator to access BPF helpers specific to the iterator.
+   * - ctx_arg_info_size and ctx_arg_info
+     - Specifies the verifier states for BPF program arguments associated with
+       the bpf iterator.
+   * - feature
+     - Specifies certain action requests in the kernel BPF iterator
+       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
+       that the kernel function cond_resched() is called to avoid other kernel
+       subsystem (e.g., rcu) misbehaving.
+   * - seq_info
+     - Specifies certain action requests in the kernel BPF iterator
+       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
+       that the kernel function cond_resched() is called to avoid other kernel
+       subsystem (e.g., rcu) misbehaving.
+
+
+`Click here
+<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
+to see an implementation of the ``task_vma`` BPF iterator in the kernel.
+
+---------------------------------
+Parameterizing BPF Task Iterators
+---------------------------------
+
+By default, BPF iterators walk through all the objects of the specified types
+(processes, cgroups, maps, etc.) across the entire system to read relevant
+kernel data. But often, there are cases where we only care about a much smaller
+subset of iterable kernel objects, such as only iterating tasks within a
+specific process. Therefore, BPF iterator programs support filtering out objects
+from iteration by allowing user space to configure the iterator program when it
+is attached.
+
+--------------------------
+BPF Task Iterator Program
+--------------------------
+
+The following code is a BPF iterator program to print files and task information
+through the ``seq_file`` of the iterator. It is a standard BPF iterator program
+that visits every file of an iterator. We will use this BPF program in our
+example later.
+
+::
+
+  #include <vmlinux.h>
+  #include <bpf/bpf_helpers.h>
+
+  char _license[] SEC("license") = "GPL";
+
+  SEC("iter/task_file")
+  int dump_task_file(struct bpf_iter__task_file *ctx)
+  {
+        struct seq_file *seq = ctx->meta->seq;
+        struct task_struct *task = ctx->task;
+        struct file *file = ctx->file;
+        __u32 fd = ctx->fd;
+        if (task == NULL || file == NULL)
+                return 0;
+        if (ctx->meta->seq_num == 0) {
+                BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
+        }
+        BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+                        (long)file->f_op);
+        return 0;
+  }
+
+----------------------------------------
+Creating a File Iterator with Parameters
+----------------------------------------
+
+Now, let us look at how to create an iterator that includes only files of a
+process.
+
+First, fill the ``bpf_iter_attach_opts`` struct as shown below:
+
+::
+
+  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+  union bpf_iter_link_info linfo;
+  memset(&linfo, 0, sizeof(linfo));
+  linfo.task.pid = getpid();
+  opts.link_info = &linfo;
+  opts.link_info_len = sizeof(linfo);
+
+``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
+that only includes opened files for the process with the specified ``pid``. In
+this example, we will only be iterating files for our process. If
+``linfo.task.pid`` is zero, the iterator will visit every opened file of every
+process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
+that visits opened files of a specific thread, not a process. In this example,
+``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
+separate file descriptor table. In most circumstances, all process threads share
+a single file descriptor table.
+
+Now, in the userspace program, pass the pointer of struct to the
+``bpf_program__attach_iter()``.
+
+::
+
+  link = bpf_program__attach_iter(prog, &opts); iter_fd =
+  bpf_iter_create(bpf_link__fd(link));
+
+If both *tid* and *pid* are zero, an iterator created from this struct
+``bpf_iter_attach_opts`` will include every opened file of every task in the
+system (in the namespace, actually.) It is the same as passing a NULL as the
+second argument to ``bpf_program__attach_iter()``.
+
+The whole program looks like the following code:
+
+::
+
+  #include <stdio.h>
+  #include <unistd.h>
+  #include <bpf/bpf.h>
+  #include <bpf/libbpf.h>
+  #include "bpf_iter_task_ex.skel.h"
+
+  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
+  {
+        struct bpf_link *link;
+        char buf[16] = {};
+        int iter_fd = -1, len;
+        int ret = 0;
+
+        link = bpf_program__attach_iter(prog, opts);
+        if (!link) {
+                fprintf(stderr, "bpf_program__attach_iter() fails\n");
+                return -1;
+        }
+        iter_fd = bpf_iter_create(bpf_link__fd(link));
+        if (iter_fd < 0) {
+                fprintf(stderr, "bpf_iter_create() fails\n");
+                ret = -1;
+                goto free_link;
+        }
+        /* not check contents, but ensure read() ends without error */
+        while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
+                buf[len] = 0;
+                printf("%s", buf);
+        }
+        printf("\n");
+  free_link:
+        if (iter_fd >= 0)
+                close(iter_fd);
+        bpf_link__destroy(link);
+        return 0;
+  }
+
+  static void test_task_file(void)
+  {
+        LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+        struct bpf_iter_task_ex *skel;
+        union bpf_iter_link_info linfo;
+        skel = bpf_iter_task_ex__open_and_load();
+        if (skel == NULL)
+                return;
+        memset(&linfo, 0, sizeof(linfo));
+        linfo.task.pid = getpid();
+        opts.link_info = &linfo;
+        opts.link_info_len = sizeof(linfo);
+        printf("PID %d\n", getpid());
+        do_read_opts(skel->progs.dump_task_file, &opts);
+        bpf_iter_task_ex__destroy(skel);
+  }
+
+  int main(int argc, const char * const * argv)
+  {
+        test_task_file();
+        return 0;
+  }
+
+The following lines are the output of the program.
+::
+
+  PID 1859
+
+      tgid      pid       fd      file
+      1859     1859        0 ffffffff82270aa0
+      1859     1859        1 ffffffff82270aa0
+      1859     1859        2 ffffffff82270aa0
+      1859     1859        3 ffffffff82272980
+      1859     1859        4 ffffffff8225e120
+      1859     1859        5 ffffffff82255120
+      1859     1859        6 ffffffff82254f00
+      1859     1859        7 ffffffff82254d80
+      1859     1859        8 ffffffff8225abe0
+
+------------------
+Without Parameters
+------------------
+
+Let us look at how a BPF iterator without parameters skips files of other
+processes in the system. In this case, the BPF program has to check the pid or
+the tid of tasks, or it will receive every opened file in the system (in the
+current *pid* namespace, actually). So, we usually add a global variable in the
+BPF program to pass a *pid* to the BPF program.
+
+The BPF program would look like the following block.
+
+  ::
+
+    ......
+    int target_pid = 0;
+
+    SEC("iter/task_file")
+    int dump_task_file(struct bpf_iter__task_file *ctx)
+    {
+          ......
+          if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
+                  return 0;
+          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+                          (long)file->f_op);
+          return 0;
+    }
+
+The user space program would look like the following block:
+
+  ::
+
+    ......
+    static void test_task_file(void)
+    {
+          ......
+          skel = bpf_iter_task_ex__open_and_load();
+          if (skel == NULL)
+                  return;
+          skel->bss->target_pid = getpid(); /* process ID.  For thread id, use gettid() */
+          memset(&linfo, 0, sizeof(linfo));
+          linfo.task.pid = getpid();
+          opts.link_info = &linfo;
+          opts.link_info_len = sizeof(linfo);
+          ......
+    }
+
+``target_pid`` is a global variable in the BPF program. The user space program
+should initialize the variable with a process ID to skip opened files of other
+processes in the BPF program. When you parametrize a BPF iterator, the iterator
+calls the BPF program fewer times which can save significant resources.
+
+---------------------------
+Parametrizing VMA Iterators
+---------------------------
+
+By default, a BPF VMA iterator includes every VMA in every process. However,
+you can still specify a process or a thread to include only its VMAs. Unlike
+files, a thread can not have a separate address space (since Linux 2.6.0-test6).
+Here, using *tid* makes no difference from using *pid*.
+
+----------------------------
+Parametrizing Task Iterators
+----------------------------
+
+A BPF task iterator with *pid* includes all tasks (threads) of a process. The
+BPF program receives these tasks one after another. You can specify a BPF task
+iterator with *tid* parameter to include only the tasks that match the given
+*tid*.
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 1088d44634d6..b81533d8b061 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -24,6 +24,7 @@ that goes into great technical depth about the BPF Architecture.
    maps
    bpf_prog_run
    classic_vs_extended.rst
+   bpf_iterators
    bpf_licensing
    test_debug
    clang-notes
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
index 5d798437dad4..e672d5ec6cc7 100644
--- a/Documentation/bpf/instruction-set.rst
+++ b/Documentation/bpf/instruction-set.rst
@@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
 
 ``BPF_XOR | BPF_K | BPF_ALU`` means::
 
-  src_reg = (u32) src_reg ^ (u32) imm32
+  dst_reg = (u32) dst_reg ^ (u32) imm32
 
 ``BPF_XOR | BPF_K | BPF_ALU64`` means::
 
-  src_reg = src_reg ^ imm32
+  dst_reg = dst_reg ^ imm32
 
 
 Byte swap instructions
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 90774479ab7a..9fd7fb539f85 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -191,6 +191,15 @@ rebooting or panicking. Due to this additional restrictions apply to these
 calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
 added later.
 
+2.4.8 KF_RCU flag
+-----------------
+
+The KF_RCU flag is used for kfuncs which have a rcu ptr as its argument.
+When used together with KF_ACQUIRE, it indicates the kfunc should have a
+single argument which must be a trusted argument or a MEM_RCU pointer.
+The argument may have reference count of 0 and the kfunc must take this
+into consideration.
+
 2.5 Registering the kfuncs
 --------------------------
 
@@ -213,3 +222,201 @@ type. An example is shown below::
 		return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
 	}
 	late_initcall(init_subsystem);
+
+3. Core kfuncs
+==============
+
+The BPF subsystem provides a number of "core" kfuncs that are potentially
+applicable to a wide variety of different possible use cases and programs.
+Those kfuncs are documented here.
+
+3.1 struct task_struct * kfuncs
+-------------------------------
+
+There are a number of kfuncs that allow ``struct task_struct *`` objects to be
+used as kptrs:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+   :identifiers: bpf_task_acquire bpf_task_release
+
+These kfuncs are useful when you want to acquire or release a reference to a
+``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
+struct_ops callback arg. For example:
+
+.. code-block:: c
+
+	/**
+	 * A trivial example tracepoint program that shows how to
+	 * acquire and release a struct task_struct * pointer.
+	 */
+	SEC("tp_btf/task_newtask")
+	int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
+	{
+		struct task_struct *acquired;
+
+		acquired = bpf_task_acquire(task);
+
+		/*
+		 * In a typical program you'd do something like store
+		 * the task in a map, and the map will automatically
+		 * release it later. Here, we release it manually.
+		 */
+		bpf_task_release(acquired);
+		return 0;
+	}
+
+----
+
+A BPF program can also look up a task from a pid. This can be useful if the
+caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
+it can acquire a reference on with bpf_task_acquire().
+
+.. kernel-doc:: kernel/bpf/helpers.c
+   :identifiers: bpf_task_from_pid
+
+Here is an example of it being used:
+
+.. code-block:: c
+
+	SEC("tp_btf/task_newtask")
+	int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
+	{
+		struct task_struct *lookup;
+
+		lookup = bpf_task_from_pid(task->pid);
+		if (!lookup)
+			/* A task should always be found, as %task is a tracepoint arg. */
+			return -ENOENT;
+
+		if (lookup->pid != task->pid) {
+			/* bpf_task_from_pid() looks up the task via its
+			 * globally-unique pid from the init_pid_ns. Thus,
+			 * the pid of the lookup task should always be the
+			 * same as the input task.
+			 */
+			bpf_task_release(lookup);
+			return -EINVAL;
+		}
+
+		/* bpf_task_from_pid() returns an acquired reference,
+		 * so it must be dropped before returning from the
+		 * tracepoint handler.
+		 */
+		bpf_task_release(lookup);
+		return 0;
+	}
+
+3.2 struct cgroup * kfuncs
+--------------------------
+
+``struct cgroup *`` objects also have acquire and release functions:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+   :identifiers: bpf_cgroup_acquire bpf_cgroup_release
+
+These kfuncs are used in exactly the same manner as bpf_task_acquire() and
+bpf_task_release() respectively, so we won't provide examples for them.
+
+----
+
+You may also acquire a reference to a ``struct cgroup`` kptr that's already
+stored in a map using bpf_cgroup_kptr_get():
+
+.. kernel-doc:: kernel/bpf/helpers.c
+   :identifiers: bpf_cgroup_kptr_get
+
+Here's an example of how it can be used:
+
+.. code-block:: c
+
+	/* struct containing the struct task_struct kptr which is actually stored in the map. */
+	struct __cgroups_kfunc_map_value {
+		struct cgroup __kptr_ref * cgroup;
+	};
+
+	/* The map containing struct __cgroups_kfunc_map_value entries. */
+	struct {
+		__uint(type, BPF_MAP_TYPE_HASH);
+		__type(key, int);
+		__type(value, struct __cgroups_kfunc_map_value);
+		__uint(max_entries, 1);
+	} __cgroups_kfunc_map SEC(".maps");
+
+	/* ... */
+
+	/**
+	 * A simple example tracepoint program showing how a
+	 * struct cgroup kptr that is stored in a map can
+	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
+	 */
+	SEC("tp_btf/cgroup_mkdir")
+	int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
+	{
+		struct cgroup *kptr;
+		struct __cgroups_kfunc_map_value *v;
+		s32 id = cgrp->self.id;
+
+		/* Assume a cgroup kptr was previously stored in the map. */
+		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
+		if (!v)
+			return -ENOENT;
+
+		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
+		kptr = bpf_cgroup_kptr_get(&v->cgroup);
+		if (!kptr)
+			/* If no cgroup was present in the map, it's because
+			 * we're racing with another CPU that removed it with
+			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
+			 * above, and our call to bpf_cgroup_kptr_get().
+			 * bpf_cgroup_kptr_get() internally safely handles this
+			 * race, and will return NULL if the task is no longer
+			 * present in the map by the time we invoke the kfunc.
+			 */
+			return -EBUSY;
+
+		/* Free the reference we just took above. Note that the
+		 * original struct cgroup kptr is still in the map. It will
+		 * be freed either at a later time if another context deletes
+		 * it from the map, or automatically by the BPF subsystem if
+		 * it's still present when the map is destroyed.
+		 */
+		bpf_cgroup_release(kptr);
+
+		return 0;
+	}
+
+----
+
+Another kfunc available for interacting with ``struct cgroup *`` objects is
+bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
+and return it as a cgroup kptr.
+
+.. kernel-doc:: kernel/bpf/helpers.c
+   :identifiers: bpf_cgroup_ancestor
+
+Eventually, BPF should be updated to allow this to happen with a normal memory
+load in the program itself. This is currently not possible without more work in
+the verifier. bpf_cgroup_ancestor() can be used as follows:
+
+.. code-block:: c
+
+	/**
+	 * Simple tracepoint example that illustrates how a cgroup's
+	 * ancestor can be accessed using bpf_cgroup_ancestor().
+	 */
+	SEC("tp_btf/cgroup_mkdir")
+	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
+	{
+		struct cgroup *parent;
+
+		/* The parent cgroup resides at the level before the current cgroup's level. */
+		parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
+		if (!parent)
+			return -ENOENT;
+
+		bpf_printk("Parent id is %d", parent->self.id);
+
+		/* Return the parent cgroup that was acquired above. */
+		bpf_cgroup_release(parent);
+		return 0;
+	}
diff --git a/Documentation/bpf/map_sk_storage.rst b/Documentation/bpf/map_sk_storage.rst
new file mode 100644
index 000000000000..047e16c8aaa8
--- /dev/null
+++ b/Documentation/bpf/map_sk_storage.rst
@@ -0,0 +1,155 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=======================
+BPF_MAP_TYPE_SK_STORAGE
+=======================
+
+.. note::
+   - ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2
+
+``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
+programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
+to be provided and acts as the handle for accessing the socket-local
+storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
+locally with each socket instead of with the map. The kernel is responsible for
+allocating storage for a socket when requested and for freeing the storage when
+either the map or the socket is deleted.
+
+.. note::
+  - The key type must be ``int`` and ``max_entries`` must be set to ``0``.
+  - The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
+    socket-local storage.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_sk_storage_get()
+~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)
+
+Socket-local storage can be retrieved using the ``bpf_sk_storage_get()``
+helper. The helper gets the storage from ``sk`` that is associated with ``map``.
+If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then
+``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not
+already exist. ``value`` can be used together with
+``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it
+will be zero initialized. Returns a pointer to the storage on success, or
+``NULL`` in case of failure.
+
+.. note::
+   - ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
+   - ``sk`` is a ``struct bpf_sock`` pointer for other program types.
+
+bpf_sk_storage_delete()
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   long bpf_sk_storage_delete(struct bpf_map *map, void *sk)
+
+Socket-local storage can be deleted using the ``bpf_sk_storage_delete()``
+helper. The helper deletes the storage from ``sk`` that is identified by
+``map``. Returns ``0`` on success, or negative error in case of failure.
+
+User space
+----------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf
+function. ``key`` must be a pointer to a valid ``fd`` in the user space
+program. The ``flags`` parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create storage for ``fd`` or update existing storage.
+- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already
+  exist, otherwise the call will fail with ``-EEXIST``.
+- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists,
+  otherwise the call will fail with ``-ENOENT``.
+
+Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   int bpf_map_lookup_elem(int map_fd, const void *key, void *value)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf
+function. ``key`` must be a pointer to a valid ``fd`` in the user space
+program. Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   int bpf_map_delete_elem(int map_fd, const void *key)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf
+function. Returns ``0`` on success, or negative error in case of failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare socket-local storage in a BPF program:
+
+.. code-block:: c
+
+    struct {
+            __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+            __uint(map_flags, BPF_F_NO_PREALLOC);
+            __type(key, int);
+            __type(value, struct my_storage);
+    } socket_storage SEC(".maps");
+
+This snippet shows how to retrieve socket-local storage in a BPF program:
+
+.. code-block:: c
+
+    SEC("sockops")
+    int _sockops(struct bpf_sock_ops *ctx)
+    {
+            struct my_storage *storage;
+            struct bpf_sock *sk;
+
+            sk = ctx->sk;
+            if (!sk)
+                    return 1;
+
+            storage = bpf_sk_storage_get(&socket_storage, sk, 0,
+                                         BPF_LOCAL_STORAGE_GET_F_CREATE);
+            if (!storage)
+                    return 1;
+
+            /* Use 'storage' here */
+
+            return 1;
+    }
+
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples.
+
+References
+==========
+
+https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/
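
Note on the instruction-set.rst hunk: the diff above changes the documented operand of ``BPF_XOR | BPF_K`` from src_reg to dst_reg for both the BPF_ALU and BPF_ALU64 classes. The following is a minimal C sketch of what the corrected wording describes, not code from this commit or from the kernel. The function names are hypothetical. It assumes the usual eBPF conventions that a 32-bit ALU result is zero-extended into the 64-bit destination register and that imm32 is sign-extended before a 64-bit operation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper modelling BPF_XOR | BPF_K | BPF_ALU:
 *   dst_reg = (u32) dst_reg ^ (u32) imm32
 * Assumption: the 32-bit result is zero-extended into the 64-bit register.
 */
static inline uint64_t bpf_xor_k_alu32(uint64_t dst_reg, int32_t imm32)
{
	return (uint64_t)((uint32_t)dst_reg ^ (uint32_t)imm32);
}

/*
 * Hypothetical helper modelling BPF_XOR | BPF_K | BPF_ALU64:
 *   dst_reg = dst_reg ^ imm32
 * Assumption: imm32 is sign-extended to 64 bits before the XOR.
 */
static inline uint64_t bpf_xor_k_alu64(uint64_t dst_reg, int32_t imm32)
{
	return dst_reg ^ (uint64_t)(int64_t)imm32;
}

/* Small self-check showing how the two forms differ on the same inputs. */
static inline void xor_k_selfcheck(void)
{
	/* ALU32 drops the upper half of dst_reg before the XOR. */
	assert(bpf_xor_k_alu32(0xffffffff00000000ULL, 1) == 1ULL);
	/* ALU64 keeps the upper half of dst_reg. */
	assert(bpf_xor_k_alu64(0xffffffff00000000ULL, 1) == 0xffffffff00000001ULL);
}
```

In both forms only the destination register and the immediate take part, which is why src_reg in the old text was misleading: a BPF_K instruction does not read src_reg at all.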