diff options
191 files changed, 4488 insertions, 980 deletions
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index f49aeef62d0c..cf8722f96090 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -369,7 +369,8 @@ No additional type data follow ``btf_type``. * ``name_off``: offset to a valid C identifier * ``info.kind_flag``: 0 * ``info.kind``: BTF_KIND_FUNC - * ``info.vlen``: 0 + * ``info.vlen``: linkage information (BTF_FUNC_STATIC, BTF_FUNC_GLOBAL + or BTF_FUNC_EXTERN) * ``type``: a BTF_KIND_FUNC_PROTO type No additional type data follow ``btf_type``. @@ -380,6 +381,9 @@ type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the :ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load` (ABI). +Currently, only linkage values of BTF_FUNC_STATIC and BTF_FUNC_GLOBAL are +supported in the kernel. + 2.2.13 BTF_KIND_FUNC_PROTO ~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 96056a7447c7..1bc2c5c58bdb 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -19,6 +19,7 @@ that goes into great technical depth about the BPF Architecture. faq syscall_api helpers + kfuncs programs maps bpf_prog_run diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst new file mode 100644 index 000000000000..c0b7dae6dbf5 --- /dev/null +++ b/Documentation/bpf/kfuncs.rst @@ -0,0 +1,170 @@ +============================= +BPF Kernel Functions (kfuncs) +============================= + +1. Introduction +=============== + +BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux +kernel which are exposed for use by BPF programs. Unlike normal BPF helpers, +kfuncs do not have a stable interface and can change from one kernel release to +another. Hence, BPF programs need to be updated in response to changes in the +kernel. + +2. Defining a kfunc +=================== + +There are two ways to expose a kernel function to BPF programs, either make an +existing function in the kernel visible, or add a new wrapper for BPF. In both +cases, care must be taken that BPF program can only call such function in a +valid context. To enforce this, visibility of a kfunc can be per program type. + +If you are not creating a BPF wrapper for existing kernel function, skip ahead +to :ref:`BPF_kfunc_nodef`. + +2.1 Creating a wrapper kfunc +---------------------------- + +When defining a wrapper kfunc, the wrapper function should have extern linkage. +This prevents the compiler from optimizing away dead code, as this wrapper kfunc +is not invoked anywhere in the kernel itself. It is not necessary to provide a +prototype in a header for the wrapper kfunc. + +An example is given below:: + + /* Disables missing prototype warnings */ + __diag_push(); + __diag_ignore_all("-Wmissing-prototypes", + "Global kfuncs as their definitions will be in BTF"); + + struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) + { + return find_get_task_by_vpid(nr); + } + + __diag_pop(); + +A wrapper kfunc is often needed when we need to annotate parameters of the +kfunc. Otherwise one may directly make the kfunc visible to the BPF program by +registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`. + +2.2 Annotating kfunc parameters +------------------------------- + +Similar to BPF helpers, there is sometime need for additional context required +by the verifier to make the usage of kernel functions safer and more useful. +Hence, we can annotate a parameter by suffixing the name of the argument of the +kfunc with a __tag, where tag may be one of the supported annotations. + +2.2.1 __sz Annotation +--------------------- + +This annotation is used to indicate a memory and size pair in the argument list. +An example is given below:: + + void bpf_memzero(void *mem, int mem__sz) + { + ... + } + +Here, the verifier will treat first argument as a PTR_TO_MEM, and second +argument as its size. By default, without __sz annotation, the size of the type +of the pointer is used. Without __sz annotation, a kfunc cannot accept a void +pointer. + +.. _BPF_kfunc_nodef: + +2.3 Using an existing kernel function +------------------------------------- + +When an existing function in the kernel is fit for consumption by BPF programs, +it can be directly registered with the BPF subsystem. However, care must still +be taken to review the context in which it will be invoked by the BPF program +and whether it is safe to do so. + +2.4 Annotating kfuncs +--------------------- + +In addition to kfuncs' arguments, verifier may need more information about the +type of kfunc(s) being registered with the BPF subsystem. To do so, we define +flags on a set of kfuncs as follows:: + + BTF_SET8_START(bpf_task_set) + BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) + BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) + BTF_SET8_END(bpf_task_set) + +This set encodes the BTF ID of each kfunc listed above, and encodes the flags +along with it. Ofcourse, it is also allowed to specify no flags. + +2.4.1 KF_ACQUIRE flag +--------------------- + +The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a +refcounted object. The verifier will then ensure that the pointer to the object +is eventually released using a release kfunc, or transferred to a map using a +referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the +loading of the BPF program until no lingering references remain in all possible +explored states of the program. + +2.4.2 KF_RET_NULL flag +---------------------- + +The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc +may be NULL. Hence, it forces the user to do a NULL check on the pointer +returned from the kfunc before making use of it (dereferencing or passing to +another helper). This flag is often used in pairing with KF_ACQUIRE flag, but +both are orthogonal to each other. + +2.4.3 KF_RELEASE flag +--------------------- + +The KF_RELEASE flag is used to indicate that the kfunc releases the pointer +passed in to it. There can be only one referenced pointer that can be passed in. +All copies of the pointer being released are invalidated as a result of invoking +kfunc with this flag. + +2.4.4 KF_KPTR_GET flag +---------------------- + +The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument +as a pointer to kptr, safely increments the refcount of the object it points to, +and returns a reference to the user. The rest of the arguments may be normal +arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with +KF_ACQUIRE and KF_RET_NULL flags. + +2.4.5 KF_TRUSTED_ARGS flag +-------------------------- + +The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It +indicates that the all pointer arguments will always be refcounted, and have +their offset set to 0. It can be used to enforce that a pointer to a refcounted +object acquired from a kfunc or BPF helper is passed as an argument to this +kfunc without any modifications (e.g. pointer arithmetic) such that it is +trusted and points to the original object. This flag is often used for kfuncs +that operate (change some property, perform some operation) on an object that +was obtained using an acquire kfunc. Such kfuncs need an unchanged pointer to +ensure the integrity of the operation being performed on the expected object. + +2.5 Registering the kfuncs +-------------------------- + +Once the kfunc is prepared for use, the final step to making it visible is +registering it with the BPF subsystem. Registration is done per BPF program +type. An example is shown below:: + + BTF_SET8_START(bpf_task_set) + BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) + BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) + BTF_SET8_END(bpf_task_set) + + static const struct btf_kfunc_id_set bpf_task_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_task_set, + }; + + static int init_subsystem(void) + { + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set); + } + late_initcall(init_subsystem); diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst new file mode 100644 index 000000000000..e85120878b27 --- /dev/null +++ b/Documentation/bpf/map_hash.rst @@ -0,0 +1,185 @@ +.. SPDX-License-Identifier: GPL-2.0-only +.. Copyright (C) 2022 Red Hat, Inc. + +=============================================== +BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants +=============================================== + +.. note:: + - ``BPF_MAP_TYPE_HASH`` was introduced in kernel version 3.19 + - ``BPF_MAP_TYPE_PERCPU_HASH`` was introduced in version 4.6 + - Both ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` + were introduced in version 4.10 + +``BPF_MAP_TYPE_HASH`` and ``BPF_MAP_TYPE_PERCPU_HASH`` provide general +purpose hash map storage. Both the key and the value can be structs, +allowing for composite keys and values. + +The kernel is responsible for allocating and freeing key/value pairs, up +to the max_entries limit that you specify. Hash maps use pre-allocation +of hash table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be +used to disable pre-allocation when it is too memory expensive. + +``BPF_MAP_TYPE_PERCPU_HASH`` provides a separate value slot per +CPU. The per-cpu values are stored internally in an array. + +The ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` +variants add LRU semantics to their respective hash tables. An LRU hash +will automatically evict the least recently used entries when the hash +table reaches capacity. An LRU hash maintains an internal LRU list that +is used to select elements for eviction. This internal LRU list is +shared across CPUs but it is possible to request a per CPU LRU list with +the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. + +Usage +===== + +.. c:function:: + long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) + +Hash entries can be added or updated using the ``bpf_map_update_elem()`` +helper. This helper replaces existing elements atomically. The ``flags`` +parameter can be used to control the update behaviour: + +- ``BPF_ANY`` will create a new element or update an existing element +- ``BPF_NOEXIST`` will create a new element only if one did not already + exist +- ``BPF_EXIST`` will update an existing element + +``bpf_map_update_elem()`` returns 0 on success, or negative error in +case of failure. + +.. c:function:: + void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + +Hash entries can be retrieved using the ``bpf_map_lookup_elem()`` +helper. This helper returns a pointer to the value associated with +``key``, or ``NULL`` if no entry was found. + +.. c:function:: + long bpf_map_delete_elem(struct bpf_map *map, const void *key) + +Hash entries can be deleted using the ``bpf_map_delete_elem()`` +helper. This helper will return 0 on success, or negative error in case +of failure. + +Per CPU Hashes +-------------- + +For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` +the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers +automatically access the hash slot for the current CPU. + +.. c:function:: + void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu) + +The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the +value in the hash slot for a specific CPU. Returns value associated with +``key`` on ``cpu`` , or ``NULL`` if no entry was found or ``cpu`` is +invalid. + +Concurrency +----------- + +Values stored in ``BPF_MAP_TYPE_HASH`` can be accessed concurrently by +programs running on different CPUs. Since Kernel version 5.1, the BPF +infrastructure provides ``struct bpf_spin_lock`` to synchronise access. +See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``. + +Userspace +--------- + +.. c:function:: + int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key) + +In userspace, it is possible to iterate through the keys of a hash using +libbpf's ``bpf_map_get_next_key()`` function. The first key can be fetched by +calling ``bpf_map_get_next_key()`` with ``cur_key`` set to +``NULL``. Subsequent calls will fetch the next key that follows the +current key. ``bpf_map_get_next_key()`` returns 0 on success, -ENOENT if +cur_key is the last key in the hash, or negative error in case of +failure. + +Note that if ``cur_key`` gets deleted then ``bpf_map_get_next_key()`` +will instead return the *first* key in the hash table which is +undesirable. It is recommended to use batched lookup if there is going +to be key deletion intermixed with ``bpf_map_get_next_key()``. + +Examples +======== + +Please see the ``tools/testing/selftests/bpf`` directory for functional +examples. The code snippets below demonstrates API usage. + +This example shows how to declare an LRU Hash with a struct key and a +struct value. + +.. code-block:: c + + #include <linux/bpf.h> + #include <bpf/bpf_helpers.h> + + struct key { + __u32 srcip; + }; + + struct value { + __u64 packets; + __u64 bytes; + }; + + struct { + __uint(type, BPF_MAP_TYPE_LRU_HASH); + __uint(max_entries, 32); + __type(key, struct key); + __type(value, struct value); + } packet_stats SEC(".maps"); + +This example shows how to create or update hash values using atomic +instructions: + +.. code-block:: c + + static void update_stats(__u32 srcip, int bytes) + { + struct key key = { + .srcip = srcip, + }; + struct value *value = bpf_map_lookup_elem(&packet_stats, &key); + + if (value) { + __sync_fetch_and_add(&value->packets, 1); + __sync_fetch_and_add(&value->bytes, bytes); + } else { + struct value newval = { 1, bytes }; + + bpf_map_update_elem(&packet_stats, &key, &newval, BPF_NOEXIST); + } + } + +Userspace walking the map elements from the map declared above: + +.. code-block:: c + + #include <bpf/libbpf.h> + #include <bpf/bpf.h> + + static void walk_hash_elements(int map_fd) + { + struct key *cur_key = NULL; + struct key next_key; + struct value value; + int err; + + for (;;) { + err = bpf_map_get_next_key(map_fd, cur_key, &next_key); + if (err) + break; + + bpf_map_lookup_elem(map_fd, &next_key, &value); + + // Use key and value here + + cur_key = &next_key; + } + } diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 6aa2dc836db1..834bff720582 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -510,6 +510,9 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg, unsigned int imm, enum aarch64_insn_size_type size, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr, + enum aarch64_insn_register reg, + bool is64bit); u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, enum aarch64_insn_register reg2, enum aarch64_insn_register base, diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c index 695d7368fadc..49e972beeac7 100644 --- a/arch/arm64/lib/insn.c +++ b/arch/arm64/lib/insn.c @@ -323,7 +323,7 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type, return insn; } -static inline long branch_imm_common(unsigned long pc, unsigned long addr, +static inline long label_imm_common(unsigned long pc, unsigned long addr, long range) { long offset; @@ -354,7 +354,7 @@ u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, * ARM64 virtual address arrangement guarantees all kernel and module * texts are within +/-128M. */ - offset = branch_imm_common(pc, addr, SZ_128M); + offset = label_imm_common(pc, addr, SZ_128M); if (offset >= SZ_128M) return AARCH64_BREAK_FAULT; @@ -382,7 +382,7 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, u32 insn; long offset; - offset = branch_imm_common(pc, addr, SZ_1M); + offset = label_imm_common(pc, addr, SZ_1M); if (offset >= SZ_1M) return AARCH64_BREAK_FAULT; @@ -421,7 +421,7 @@ u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr, u32 insn; long offset; - offset = branch_imm_common(pc, addr, SZ_1M); + offset = label_imm_common(pc, addr, SZ_1M); insn = aarch64_insn_get_bcond_value(); @@ -543,6 +543,28 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm); } +u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr, + enum aarch64_insn_register reg, + bool is64bit) +{ + u32 insn; + long offset; + + offset = label_imm_common(pc, addr, SZ_1M); + if (offset >= SZ_1M) + return AARCH64_BREAK_FAULT; + + insn = aarch64_insn_get_ldr_lit_value(); + + if (is64bit) + insn |= BIT(30); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn, + offset >> 2); +} + u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, enum aarch64_insn_register reg2, enum aarch64_insn_register base, diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h index 194c95ccc1cf..a6acb94ea3d6 100644 --- a/arch/arm64/net/bpf_jit.h +++ b/arch/arm64/net/bpf_jit.h @@ -80,6 +80,12 @@ #define A64_STR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, STORE) #define A64_LDR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, LOAD) +/* LDR (literal) */ +#define A64_LDR32LIT(Wt, offset) \ + aarch64_insn_gen_load_literal(0, offset, Wt, false) +#define A64_LDR64LIT(Xt, offset) \ + aarch64_insn_gen_load_literal(0, offset, Xt, true) + /* Load/store register pair */ #define A64_LS_PAIR(Rt, Rt2, Rn, offset, ls, type) \ aarch64_insn_gen_load_store_pair(Rt, Rt2, Rn, offset, \ @@ -270,6 +276,7 @@ #define A64_BTI_C A64_HINT(AARCH64_INSN_HINT_BTIC) #define A64_BTI_J A64_HINT(AARCH64_INSN_HINT_BTIJ) #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC) +#define A64_NOP A64_HINT(AARCH64_INSN_HINT_NOP) /* DMB */ #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index f08a4447d363..7ca8779ae34f 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -10,6 +10,7 @@ #include <linux/bitfield.h> #include <linux/bpf.h> #include <linux/filter.h> +#include <linux/memory.h> #include <linux/printk.h> #include <linux/slab.h> @@ -18,6 +19,7 @@ #include <asm/cacheflush.h> #include <asm/debug-monitors.h> #include <asm/insn.h> +#include <asm/patching.h> #include <asm/set_memory.h> #include "bpf_jit.h" @@ -78,6 +80,15 @@ struct jit_ctx { int fpb_offset; }; +struct bpf_plt { + u32 insn_ldr; /* load target */ + u32 insn_br; /* branch to target */ + u64 target; /* target value */ +}; + +#define PLT_TARGET_SIZE sizeof_field(struct bpf_plt, target) +#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target) + static inline void emit(const u32 insn, struct jit_ctx *ctx) { if (ctx->image != NULL) @@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val, } } +static inline void emit_bti(u32 insn, struct jit_ctx *ctx) +{ + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) + emit(insn, ctx); +} + /* * Kernel addresses in the vmalloc space use at most 48 bits, and the * remaining bits are guaranteed to be 0x1. So we can compose the address @@ -159,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val, } } +static inline void emit_call(u64 target, struct jit_ctx *ctx) +{ + u8 tmp = bpf2a64[TMP_REG_1]; + + emit_addr_mov_i64(tmp, target, ctx); + emit(A64_BLR(tmp), ctx); +} + static inline int bpf2a64_offset(int bpf_insn, int off, const struct jit_ctx *ctx) { @@ -235,13 +260,30 @@ static bool is_lsi_offset(int offset, int scale) return true; } +/* generated prologue: + * bti c // if CONFIG_ARM64_BTI_KERNEL + * mov x9, lr + * nop // POKE_OFFSET + * paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL + * stp x29, lr, [sp, #-16]! + * mov x29, sp + * stp x19, x20, [sp, #-16]! + * stp x21, x22, [sp, #-16]! + * stp x25, x26, [sp, #-16]! + * stp x27, x28, [sp, #-16]! + * mov x25, sp + * mov tcc, #0 + * // PROLOGUE_OFFSET + */ + +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0) +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0) + +/* Offset of nop instruction in bpf prog entry to be poked */ +#define POKE_OFFSET (BTI_INSNS + 1) + /* Tail call offset to jump into */ -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \ - IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) -#define PROLOGUE_OFFSET 9 -#else -#define PROLOGUE_OFFSET 8 -#endif +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8) static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) { @@ -280,12 +322,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) * */ + emit_bti(A64_BTI_C, ctx); + + emit(A64_MOV(1, A64_R(9), A64_LR), ctx); + emit(A64_NOP, ctx); + /* Sign lr */ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)) emit(A64_PACIASP, ctx); - /* BTI landing pad */ - else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) - emit(A64_BTI_C, ctx); /* Save FP and LR registers to stay align with ARM64 AAPCS */ emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); @@ -312,8 +356,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf) } /* BTI landing pad for the tail call, done with a BR */ - if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) - emit(A64_BTI_J, ctx); + emit_bti(A64_BTI_J, ctx); } emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx); @@ -557,6 +600,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx) return 0; } +void dummy_tramp(void); + +asm ( +" .pushsection .text, \"ax\", @progbits\n" +" .global dummy_tramp\n" +" .type dummy_tramp, %function\n" +"dummy_tramp:" +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) +" bti j\n" /* dummy_tramp is called via "br x10" */ +#endif +" mov x10, x30\n" +" mov x30, x9\n" +" ret x10\n" +" .size dummy_tramp, .-dummy_tramp\n" +" .popsection\n" +); + +/* build a plt initialized like this: + * + * plt: + * ldr tmp, target + * br tmp + * target: + * .quad dummy_tramp + * + * when a long jump trampoline is attached, target is filled with the + * trampoline address, and when the trampoline is removed, target is + * restored to dummy_tramp address. + */ +static void build_plt(struct jit_ctx *ctx) +{ + const u8 tmp = bpf2a64[TMP_REG_1]; + struct bpf_plt *plt = NULL; + + /* make sure target is 64-bit aligned */ + if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2) + emit(A64_NOP, ctx); + + plt = (struct bpf_plt *)(ctx->image + ctx->idx); + /* plt is called via bl, no BTI needed here */ + emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx); + emit(A64_BR(tmp), ctx); + + if (ctx->image) + plt->target = (u64)&dummy_tramp; +} + static void build_epilogue(struct jit_ctx *ctx) { const u8 r0 = bpf2a64[BPF_REG_0]; @@ -991,8 +1081,7 @@ emit_cond_jmp: &func_addr, &func_addr_fixed); if (ret < 0) return ret; - emit_addr_mov_i64(tmp, func_addr, ctx); - emit(A64_BLR(tmp), ctx); + emit_call(func_addr, ctx); emit(A64_MOV(1, r0, A64_R(0)), ctx); break; } @@ -1336,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx) if (a64_insn == AARCH64_BREAK_FAULT) return -1; } + return 0; +} + +static int validate_ctx(struct jit_ctx *ctx) +{ + if (validate_code(ctx)) + return -1; if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries)) return -1; @@ -1356,7 +1452,7 @@ struct arm64_jit_data { struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { - int image_size, prog_size, extable_size; + int image_size, prog_size, extable_size, extable_align, extable_offset; struct bpf_prog *tmp, *orig_prog = prog; struct bpf_binary_header *header; struct arm64_jit_data *jit_data; @@ -1426,13 +1522,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) ctx.epilogue_offset = ctx.idx; build_epilogue(&ctx); + build_plt(&ctx); + extable_align = __alignof__(struct exception_table_entry); extable_size = prog->aux->num_exentries * sizeof(struct exception_table_entry); /* Now we know the actual image size. */ prog_size = sizeof(u32) * ctx.idx; - image_size = prog_size + extable_size; + /* also allocate space for plt target */ + extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align); + image_size = extable_offset + extable_size; header = bpf_jit_binary_alloc(image_size, &image_ptr, sizeof(u32), jit_fill_hole); if (header == NULL) { @@ -1444,7 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) ctx.image = (__le32 *)image_ptr; if (extable_size) - prog->aux->extable = (void *)image_ptr + prog_size; + prog->aux->extable = (void *)image_ptr + extable_offset; skip_init_ctx: ctx.idx = 0; ctx.exentry_idx = 0; @@ -1458,9 +1558,10 @@ skip_init_ctx: } build_epilogue(&ctx); + build_plt(&ctx); /* 3. Extra pass to validate JITed code. */ - if (validate_code(&ctx)) { + if (validate_ctx(&ctx)) { bpf_jit_binary_free(header); prog = orig_prog; goto out_off; @@ -1537,3 +1638,583 @@ bool bpf_jit_supports_subprog_tailcalls(void) { return true; } + +static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l, + int args_off, int retval_off, int run_ctx_off, + bool save_ret) +{ + u32 *branch; + u64 enter_prog; + u64 exit_prog; + struct bpf_prog *p = l->link.prog; + int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie); + + if (p->aux->sleepable) { + enter_prog = (u64)__bpf_prog_enter_sleepable; + exit_prog = (u64)__bpf_prog_exit_sleepable; + } else { + enter_prog = (u64)__bpf_prog_enter; + exit_prog = (u64)__bpf_prog_exit; + } + + if (l->cookie == 0) { + /* if cookie is zero, one instruction is enough to store it */ + emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx); + } else { + emit_a64_mov_i64(A64_R(10), l->cookie, ctx); + emit(A64_STR64I(A64_R(10), A64_SP, run_ctx_off + cookie_off), + ctx); + } + + /* save p to callee saved register x19 to avoid loading p with mov_i64 + * each time. + */ + emit_addr_mov_i64(A64_R(19), (const u64)p, ctx); + + /* arg1: prog */ + emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx); + /* arg2: &run_ctx */ + emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx); + + emit_call(enter_prog, ctx); + + /* if (__bpf_prog_enter(prog) == 0) + * goto skip_exec_of_prog; + */ + branch = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + + /* save return value to callee saved register x20 */ + emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx); + + emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx); + if (!p->jited) + emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx); + + emit_call((const u64)p->bpf_func, ctx); + + if (save_ret) + emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx); + + if (ctx->image) { + int offset = &ctx->image[ctx->idx] - branch; + *branch = A64_CBZ(1, A64_R(0), offset); + } + + /* arg1: prog */ + emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx); + /* arg2: start time */ + emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx); + /* arg3: &run_ctx */ + emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx); + + emit_call(exit_prog, ctx); +} + +static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl, + int args_off, int retval_off, int run_ctx_off, + u32 **branches) +{ + int i; + + /* The first fmod_ret program will receive a garbage return value. + * Set this to 0 to avoid confusing the program. + */ + emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx); + for (i = 0; i < tl->nr_links; i++) { + invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off, + run_ctx_off, true); + /* if (*(u64 *)(sp + retval_off) != 0) + * goto do_fexit; + */ + emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx); + /* Save the location of branch, and generate a nop. + * This nop will be replaced with a cbnz later. + */ + branches[i] = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + } +} + +static void save_args(struct jit_ctx *ctx, int args_off, int nargs) +{ + int i; + + for (i = 0; i < nargs; i++) { + emit(A64_STR64I(i, A64_SP, args_off), ctx); + args_off += 8; + } +} + +static void restore_args(struct jit_ctx *ctx, int args_off, int nargs) +{ + int i; + + for (i = 0; i < nargs; i++) { + emit(A64_LDR64I(i, A64_SP, args_off), ctx); + args_off += 8; + } +} + +/* Based on the x86's implementation of arch_prepare_bpf_trampoline(). + * + * bpf prog and function entry before bpf trampoline hooked: + * mov x9, lr + * nop + * + * bpf prog and function entry after bpf trampoline hooked: + * mov x9, lr + * bl <bpf_trampoline or plt> + * + */ +static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, + struct bpf_tramp_links *tlinks, void *orig_call, + int nargs, u32 flags) +{ + int i; + int stack_size; + int retaddr_off; + int regs_off; + int retval_off; + int args_off; + int nargs_off; + int ip_off; + int run_ctx_off; + struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY]; + struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT]; + struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN]; + bool save_ret; + u32 **branches = NULL; + + /* trampoline stack layout: + * [ parent ip ] + * [ FP ] + * SP + retaddr_off [ self ip ] + * [ FP ] + * + * [ padding ] align SP to multiples of 16 + * + * [ x20 ] callee saved reg x20 + * SP + regs_off [ x19 ] callee saved reg x19 + * + * SP + retval_off [ return value ] BPF_TRAMP_F_CALL_ORIG or + * BPF_TRAMP_F_RET_FENTRY_RET + * + * [ argN ] + * [ ... ] + * SP + args_off [ arg1 ] + * + * SP + nargs_off [ args count ] + * + * SP + ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag + * + * SP + run_ctx_off [ bpf_tramp_run_ctx ] + */ + + stack_size = 0; + run_ctx_off = stack_size; + /* room for bpf_tramp_run_ctx */ + stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8); + + ip_off = stack_size; + /* room for IP address argument */ + if (flags & BPF_TRAMP_F_IP_ARG) + stack_size += 8; + + nargs_off = stack_size; + /* room for args count */ + stack_size += 8; + + args_off = stack_size; + /* room for args */ + stack_size += nargs * 8; + + /* room for return value */ + retval_off = stack_size; + save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET); + if (save_ret) + stack_size += 8; + + /* room for callee saved registers, currently x19 and x20 are used */ + regs_off = stack_size; + stack_size += 16; + + /* round up to multiples of 16 to avoid SPAlignmentFault */ + stack_size = round_up(stack_size, 16); + + /* return address locates above FP */ + retaddr_off = stack_size + 8; + + /* bpf trampoline may be invoked by 3 instruction types: + * 1. bl, attached to bpf prog or kernel function via short jump + * 2. br, attached to bpf prog or kernel function via long jump + * 3. blr, working as a function pointer, used by struct_ops. + * So BTI_JC should used here to support both br and blr. + */ + emit_bti(A64_BTI_JC, ctx); + + /* frame for parent function */ + emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx); + emit(A64_MOV(1, A64_FP, A64_SP), ctx); + + /* frame for patched function */ + emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx); + emit(A64_MOV(1, A64_FP, A64_SP), ctx); + + /* allocate stack space */ + emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx); + + if (flags & BPF_TRAMP_F_IP_ARG) { + /* save ip address of the traced function */ + emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx); + emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx); + } + + /* save args count*/ + emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx); + emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx); + + /* save args */ + save_args(ctx, args_off, nargs); + + /* save callee saved registers */ + emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx); + emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx); + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + emit_addr_mov_i64(A64_R(0), (const u64)im, ctx); + emit_call((const u64)__bpf_tramp_enter, ctx); + } + + for (i = 0; i < fentry->nr_links; i++) + invoke_bpf_prog(ctx, fentry->links[i], args_off, + retval_off, run_ctx_off, + flags & BPF_TRAMP_F_RET_FENTRY_RET); + + if (fmod_ret->nr_links) { + branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *), + GFP_KERNEL); + if (!branches) + return -ENOMEM; + + invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off, + run_ctx_off, branches); + } + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + restore_args(ctx, args_off, nargs); + /* call original func */ + emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx); + emit(A64_BLR(A64_R(10)), ctx); + /* store return value */ + emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx); + /* reserve a nop for bpf_tramp_image_put */ + im->ip_after_call = ctx->image + ctx->idx; + emit(A64_NOP, ctx); + } + + /* update the branches saved in invoke_bpf_mod_ret with cbnz */ + for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) { + int offset = &ctx->image[ctx->idx] - branches[i]; + *branches[i] = A64_CBNZ(1, A64_R(10), offset); + } + + for (i = 0; i < fexit->nr_links; i++) + invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off, + run_ctx_off, false); + + if (flags & BPF_TRAMP_F_CALL_ORIG) { + im->ip_epilogue = ctx->image + ctx->idx; + emit_addr_mov_i64(A64_R(0), (const u64)im, ctx); + emit_call((const u64)__bpf_tramp_exit, ctx); + } + + if (flags & BPF_TRAMP_F_RESTORE_REGS) + restore_args(ctx, args_off, nargs); + + /* restore callee saved register x19 and x20 */ + emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx); + emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx); + + if (save_ret) + emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx); + + /* reset SP */ + emit(A64_MOV(1, A64_SP, A64_FP), ctx); + + /* pop frames */ + emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx); + emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx); + + if (flags & BPF_TRAMP_F_SKIP_FRAME) { + /* skip patched function, return to parent */ + emit(A64_MOV(1, A64_LR, A64_R(9)), ctx); + emit(A64_RET(A64_R(9)), ctx); + } else { + /* return to patched function */ + emit(A64_MOV(1, A64_R(10), A64_LR), ctx); + emit(A64_MOV(1, A64_LR, A64_R(9)), ctx); + emit(A64_RET(A64_R(10)), ctx); + } + + if (ctx->image) + bpf_flush_icache(ctx->image, ctx->image + ctx->idx); + + kfree(branches); + + return ctx->idx; +} + +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, + void *image_end, const struct btf_func_model *m, + u32 flags, struct bpf_tramp_links *tlinks, + void *orig_call) +{ + int ret; + int nargs = m->nr_args; + int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE; + struct jit_ctx ctx = { + .image = NULL, + .idx = 0, + }; + + /* the first 8 arguments are passed by registers */ + if (nargs > 8) + return -ENOTSUPP; + + ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags); + if (ret < 0) + return ret; + + if (ret > max_insns) + return -EFBIG; + + ctx.image = image; + ctx.idx = 0; + + jit_fill_hole(image, (unsigned int)(image_end - image)); + ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags); + + if (ret > 0 && validate_code(&ctx) < 0) + ret = -EINVAL; + + if (ret > 0) + ret *= AARCH64_INSN_SIZE; + + return ret; +} + +static bool is_long_jump(void *ip, void *target) +{ + long offset; + + /* NULL target means this is a NOP */ + if (!target) + return false; + + offset = (long)target - (long)ip; + return offset < -SZ_128M || offset >= SZ_128M; +} + +static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip, + void *addr, void *plt, u32 *insn) +{ + void *target; + + if (!addr) { + *insn = aarch64_insn_gen_nop(); + return 0; + } + + if (is_long_jump(ip, addr)) + target = plt; + else + target = addr; + + *insn = aarch64_insn_gen_branch_imm((unsigned long)ip, + (unsigned long)target, + type); + + return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT; +} + +/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf + * trampoline with the branch instruction from @ip to @new_addr. If @old_addr + * or @new_addr is NULL, the old or new instruction is NOP. + * + * When @ip is the bpf prog entry, a bpf trampoline is being attached or + * detached. Since bpf trampoline and bpf prog are allocated separately with + * vmalloc, the address distance may exceed 128MB, the maximum branch range. + * So long jump should be handled. + * + * When a bpf prog is constructed, a plt pointing to empty trampoline + * dummy_tramp is placed at the end: + * + * bpf_prog: + * mov x9, lr + * nop // patchsite + * ... + * ret + * + * plt: + * ldr x10, target + * br x10 + * target: + * .quad dummy_tramp // plt target + * + * This is also the state when no trampoline is attached. + * + * When a short-jump bpf trampoline is attached, the patchsite is patched + * to a bl instruction to the trampoline directly: + * + * bpf_prog: + * mov x9, lr + * bl <short-jump bpf trampoline address> // patchsite + * ... + * ret + * + * plt: + * ldr x10, target + * br x10 + * target: + * .quad dummy_tramp // plt target + * + * When a long-jump bpf trampoline is attached, the plt target is filled with + * the trampoline address and the patchsite is patched to a bl instruction to + * the plt: + * + * bpf_prog: + * mov x9, lr + * bl plt // patchsite + * ... + * ret + * + * plt: + * ldr x10, target + * br x10 + * target: + * .quad <long-jump bpf trampoline address> // plt target + * + * The dummy_tramp is used to prevent another CPU from jumping to unknown + * locations during the patching process, making the patching process easier. + */ +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type, + void *old_addr, void *new_addr) +{ + int ret; + u32 old_insn; + u32 new_insn; + u32 replaced; + struct bpf_plt *plt = NULL; + unsigned long size = 0UL; + unsigned long offset = ~0UL; + enum aarch64_insn_branch_type branch_type; + char namebuf[KSYM_NAME_LEN]; + void *image = NULL; + u64 plt_target = 0ULL; + bool poking_bpf_entry; + + if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf)) + /* Only poking bpf text is supported. Since kernel function + * entry is set up by ftrace, we reply on ftrace to poke kernel + * functions. + */ + return -ENOTSUPP; + + image = ip - offset; + /* zero offset means we're poking bpf prog entry */ + poking_bpf_entry = (offset == 0UL); + + /* bpf prog entry, find plt and the real patchsite */ + if (poking_bpf_entry) { + /* plt locates at the end of bpf prog */ + plt = image + size - PLT_TARGET_OFFSET; + + /* skip to the nop instruction in bpf prog entry: + * bti c // if BTI enabled + * mov x9, x30 + * nop + */ + ip = image + POKE_OFFSET * AARCH64_INSN_SIZE; + } + + /* long jump is only possible at bpf prog entry */ + if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) && + !poking_bpf_entry)) + return -EINVAL; + + if (poke_type == BPF_MOD_CALL) + branch_type = AARCH64_INSN_BRANCH_LINK; + else + branch_type = AARCH64_INSN_BRANCH_NOLINK; + + if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0) + return -EFAULT; + + if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0) + return -EFAULT; + + if (is_long_jump(ip, new_addr)) + plt_target = (u64)new_addr; + else if (is_long_jump(ip, old_addr)) + /* if the old target is a long jump and the new target is not, + * restore the plt target to dummy_tramp, so there is always a + * legal and harmless address stored in plt target, and we'll + * never jump from plt to an unknown place. + */ + plt_target = (u64)&dummy_tramp; + + if (plt_target) { + /* non-zero plt_target indicates we're patching a bpf prog, + * which is read only. + */ + if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1)) + return -EFAULT; + WRITE_ONCE(plt->target, plt_target); + set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1); + /* since plt target points to either the new trampoline + * or dummy_tramp, even if another CPU reads the old plt + * target value before fetching the bl instruction to plt, + * it will be brought back by dummy_tramp, so no barrier is + * required here. + */ + } + + /* if the old target and the new target are both long jumps, no + * patching is required + */ + if (old_insn == new_insn) + return 0; + + mutex_lock(&text_mutex); + if (aarch64_insn_read(ip, &replaced)) { + ret = -EFAULT; + goto out; + } + + if (replaced != old_insn) { + ret = -EFAULT; + goto out; + } + + /* We call aarch64_insn_patch_text_nosync() to replace instruction + * atomically, so no other CPUs will fetch a half-new and half-old + * instruction. But there is chance that another CPU executes the + * old instruction after the patching operation finishes (e.g., + * pipeline not flushed, or icache not synchronized yet). + * + * 1. when a new trampoline is attached, it is not a problem for + * different CPUs to jump to different trampolines temporarily. + * + * 2. when an old trampoline is freed, we should wait for all other + * CPUs to exit the trampoline and make sure the trampoline is no + * longer reachable, since bpf_tramp_image_put() function already + * uses percpu_ref and task-based rcu to do the sync, no need to call + * the sync version here, see bpf_tramp_image_put() for details. + */ + ret = aarch64_insn_patch_text_nosync(ip, new_insn); +out: + mutex_unlock(&text_mutex); + + return ret; +} diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 7e95697a6459..c1f6c1c51d99 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1950,23 +1950,6 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, return 0; } -static bool is_valid_bpf_tramp_flags(unsigned int flags) -{ - if ((flags & BPF_TRAMP_F_RESTORE_REGS) && - (flags & BPF_TRAMP_F_SKIP_FRAME)) - return false; - - /* - * BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, - * and it must be used alone. - */ - if ((flags & BPF_TRAMP_F_RET_FENTRY_RET) && - (flags & ~BPF_TRAMP_F_RET_FENTRY_RET)) - return false; - - return true; -} - /* Example: * __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev); * its 'struct btf_func_model' will be nr_args=2 @@ -2045,9 +2028,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i if (nr_args > 6) return -ENOTSUPP; - if (!is_valid_bpf_tramp_flags(flags)) - return -EINVAL; - /* Generated trampoline stack layout: * * RBP + 8 [ return address ] @@ -2153,10 +2133,15 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i if (flags & BPF_TRAMP_F_CALL_ORIG) { restore_regs(m, &prog, nr_args, regs_off); - /* call original function */ - if (emit_call(&prog, orig_call, prog)) { - ret = -EINVAL; - goto cleanup; + if (flags & BPF_TRAMP_F_ORIG_STACK) { + emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8); + EMIT2(0xff, 0xd0); /* call *rax */ + } else { + /* call original function */ + if (emit_call(&prog, orig_call, prog)) { + ret = -EINVAL; + goto cleanup; + } } /* remember return value in a stack for bpf prog to access */ emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); @@ -2520,3 +2505,28 @@ bool bpf_jit_supports_subprog_tailcalls(void) { return true; } + +void bpf_jit_free(struct bpf_prog *prog) +{ + if (prog->jited) { + struct x64_jit_data *jit_data = prog->aux->jit_data; + struct bpf_binary_header *hdr; + + /* + * If we fail the final pass of JIT (from jit_subprogs), + * the program may not be finalized yet. Call finalize here + * before freeing it. + */ + if (jit_data) { + bpf_jit_binary_pack_finalize(prog, jit_data->header, + jit_data->rw_header); + kvfree(jit_data->addrs); + kfree(jit_data); + } + hdr = bpf_jit_binary_pack_hdr(prog); + bpf_jit_binary_pack_free(hdr, NULL); + WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog)); + } + + bpf_prog_unlock_free(prog); +} diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c index 50f6719b3aa4..32804fed116c 100644 --- a/drivers/net/can/pch_can.c +++ b/drivers/net/can/pch_can.c @@ -489,6 +489,7 @@ static void pch_can_error(struct net_device *ndev, u32 status) if (!skb) return; + errc = ioread32(&priv->regs->errc); if (status & PCH_BUS_OFF) { pch_can_set_tx_all(priv, 0); pch_can_set_rx_all(priv, 0); @@ -502,7 +503,6 @@ static void pch_can_error(struct net_device *ndev, u32 status) cf->data[7] = (errc & PCH_REC) >> 8; } - errc = ioread32(&priv->regs->errc); /* Warning interrupt. */ if (status & PCH_EWARN) { state = CAN_STATE_ERROR_WARNING; diff --git a/drivers/net/ethernet/marvell/prestera/prestera_router.c b/drivers/net/ethernet/marvell/prestera/prestera_router.c index 3c8116f16b4d..58f4e44d5ad7 100644 --- a/drivers/net/ethernet/marvell/prestera/prestera_router.c +++ b/drivers/net/ethernet/marvell/prestera/prestera_router.c @@ -389,8 +389,8 @@ static int __prestera_inetaddr_event(struct prestera_switch *sw, unsigned long event, struct netlink_ext_ack *extack) { - if (!prestera_netdev_check(dev) || netif_is_bridge_port(dev) || - netif_is_lag_port(dev) || netif_is_ovs_port(dev)) + if (!prestera_netdev_check(dev) || netif_is_any_bridge_port(dev) || + netif_is_lag_port(dev)) return 0; return __prestera_inetaddr_port_event(dev, event, extack); diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c index 61eb96b93889..1b61bc8f59a2 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core.c @@ -2296,8 +2296,8 @@ void mlxsw_core_bus_device_unregister(struct mlxsw_core *mlxsw_core, devl_resources_unregister(devlink); mlxsw_core->bus->fini(mlxsw_core->bus_priv); if (!reload) { - devlink_free(devlink); devl_unlock(devlink); + devlink_free(devlink); } return; @@ -2305,8 +2305,8 @@ void mlxsw_core_bus_device_unregister(struct mlxsw_core *mlxsw_core, reload_fail_deinit: mlxsw_core_params_unregister(mlxsw_core); devl_resources_unregister(devlink); - devlink_free(devlink); devl_unlock(devlink); + devlink_free(devlink); } EXPORT_SYMBOL(mlxsw_core_bus_device_unregister); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c index 558dede130ee..2c4443c6b964 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c @@ -8590,9 +8590,7 @@ static int mlxsw_sp_inetaddr_port_event(struct net_device *port_dev, unsigned long event, struct netlink_ext_ack *extack) { - if (netif_is_bridge_port(port_dev) || - netif_is_lag_port(port_dev) || - netif_is_ovs_port(port_dev)) + if (netif_is_any_bridge_port(port_dev) || netif_is_lag_port(port_dev)) return 0; return mlxsw_sp_inetaddr_port_vlan_event(port_dev, port_dev, event, diff --git a/drivers/net/ethernet/sfc/Makefile b/drivers/net/ethernet/sfc/Makefile index b9298031ea51..4c759488fc77 100644 --- a/drivers/net/ethernet/sfc/Makefile +++ b/drivers/net/ethernet/sfc/Makefile @@ -8,7 +8,7 @@ sfc-y += efx.o efx_common.o efx_channels.o nic.o \ ef100.o ef100_nic.o ef100_netdev.o \ ef100_ethtool.o ef100_rx.o ef100_tx.o sfc-$(CONFIG_SFC_MTD) += mtd.o -sfc-$(CONFIG_SFC_SRIOV) += sriov.o ef10_sriov.o ef100_sriov.o +sfc-$(CONFIG_SFC_SRIOV) += sriov.o ef10_sriov.o ef100_sriov.o ef100_rep.o mae.o obj-$(CONFIG_SFC) += sfc.o diff --git a/drivers/net/ethernet/sfc/ef100_netdev.c b/drivers/net/ethernet/sfc/ef100_netdev.c index 060392ddd612..9e65de1ab889 100644 --- a/drivers/net/ethernet/sfc/ef100_netdev.c +++ b/drivers/net/ethernet/sfc/ef100_netdev.c @@ -85,6 +85,7 @@ static int ef100_net_stop(struct net_device *net_dev) netif_dbg(efx, ifdown, efx->net_dev, "closing on CPU %d\n", raw_smp_processor_id()); + efx_detach_reps(efx); netif_stop_queue(net_dev); efx_stop_all(efx); efx_mcdi_mac_fini_stats(efx); @@ -176,6 +177,8 @@ static int ef100_net_open(struct net_device *net_dev) mutex_unlock(&efx->mac_lock); efx->state = STATE_NET_UP; + if (netif_running(efx->net_dev)) + efx_attach_reps(efx); return 0; @@ -195,6 +198,15 @@ static netdev_tx_t ef100_hard_start_xmit(struct sk_buff *skb, struct net_device *net_dev) { struct efx_nic *efx = efx_netdev_priv(net_dev); + + return __ef100_hard_start_xmit(skb, efx, net_dev, NULL); +} + +netdev_tx_t __ef100_hard_start_xmit(struct sk_buff *skb, + struct efx_nic *efx, + struct net_device *net_dev, + struct efx_rep *efv) +{ struct efx_tx_queue *tx_queue; struct efx_channel *channel; int rc; @@ -209,7 +221,7 @@ static netdev_tx_t ef100_hard_start_xmit(struct sk_buff *skb, } tx_queue = &channel->tx_queue[0]; - rc = ef100_enqueue_skb(tx_queue, skb); + rc = __ef100_enqueue_skb(tx_queue, skb, efv); if (rc == 0) return NETDEV_TX_OK; @@ -312,7 +324,7 @@ void ef100_remove_netdev(struct efx_probe_data *probe_data) unregister_netdevice_notifier(&efx->netdev_notifier); #if defined(CONFIG_SFC_SRIOV) if (!efx->type->is_vf) - efx_ef100_pci_sriov_disable(efx); + efx_ef100_pci_sriov_disable(efx, true); #endif ef100_unregister_netdev(efx); diff --git a/drivers/net/ethernet/sfc/ef100_netdev.h b/drivers/net/ethernet/sfc/ef100_netdev.h index 38b032ba0953..86bf985e0951 100644 --- a/drivers/net/ethernet/sfc/ef100_netdev.h +++ b/drivers/net/ethernet/sfc/ef100_netdev.h @@ -10,7 +10,12 @@ */ #include <linux/netdevice.h> +#include "ef100_rep.h" +netdev_tx_t __ef100_hard_start_xmit(struct sk_buff *skb, + struct efx_nic *efx, + struct net_device *net_dev, + struct efx_rep *efv); int ef100_netdev_event(struct notifier_block *this, unsigned long event, void *ptr); int ef100_probe_netdev(struct efx_probe_data *probe_data); diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index f89e695cf8ac..4625d35269e6 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -946,6 +946,7 @@ static int ef100_probe_main(struct efx_nic *efx) unsigned int bar_size = resource_size(&efx->pci_dev->resource[efx->mem_bar]); struct ef100_nic_data *nic_data; char fw_version[32]; + u32 priv_mask = 0; int i, rc; if (WARN_ON(bar_size == 0)) @@ -1027,6 +1028,12 @@ static int ef100_probe_main(struct efx_nic *efx) efx_mcdi_print_fwver(efx, fw_version, sizeof(fw_version)); pci_dbg(efx->pci_dev, "Firmware version %s\n", fw_version); + rc = efx_mcdi_get_privilege_mask(efx, &priv_mask); + if (rc) /* non-fatal, and priv_mask will still be 0 */ + pci_info(efx->pci_dev, + "Failed to get privilege mask from FW, rc %d\n", rc); + nic_data->grp_mae = !!(priv_mask & MC_CMD_PRIVILEGE_MASK_IN_GRP_MAE); + if (compare_versions(fw_version, "1.1.0.1000") < 0) { pci_info(efx->pci_dev, "Firmware uses old event descriptors\n"); rc = -EINVAL; diff --git a/drivers/net/ethernet/sfc/ef100_nic.h b/drivers/net/ethernet/sfc/ef100_nic.h index 744dbbdb4adc..40f84a275057 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.h +++ b/drivers/net/ethernet/sfc/ef100_nic.h @@ -72,6 +72,7 @@ struct ef100_nic_data { u8 port_id[ETH_ALEN]; DECLARE_BITMAP(evq_phases, EFX_MAX_CHANNELS); u64 stats[EF100_STAT_COUNT]; + bool grp_mae; /* MAE Privilege */ u16 tso_max_hdr_len; u16 tso_max_payload_num_segs; u16 tso_max_frames; diff --git a/drivers/net/ethernet/sfc/ef100_regs.h b/drivers/net/ethernet/sfc/ef100_regs.h index 710bbdb19885..982b6ab1eb62 100644 --- a/drivers/net/ethernet/sfc/ef100_regs.h +++ b/drivers/net/ethernet/sfc/ef100_regs.h @@ -2,7 +2,7 @@ /**************************************************************************** * Driver for Solarflare network controllers and boards * Copyright 2018 Solarflare Communications Inc. - * Copyright 2019-2020 Xilinx Inc. + * Copyright 2019-2022 Xilinx Inc. * * This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License version 2 as published @@ -181,12 +181,6 @@ /* RHEAD_BASE_EVENT */ #define ESF_GZ_E_TYPE_LBN 60 #define ESF_GZ_E_TYPE_WIDTH 4 -#define ESE_GZ_EF100_EV_DRIVER 5 -#define ESE_GZ_EF100_EV_MCDI 4 -#define ESE_GZ_EF100_EV_CONTROL 3 -#define ESE_GZ_EF100_EV_TX_TIMESTAMP 2 -#define ESE_GZ_EF100_EV_TX_COMPLETION 1 -#define ESE_GZ_EF100_EV_RX_PKTS 0 #define ESF_GZ_EV_EVQ_PHASE_LBN 59 #define ESF_GZ_EV_EVQ_PHASE_WIDTH 1 #define ESE_GZ_RHEAD_BASE_EVENT_STRUCT_SIZE 64 @@ -369,14 +363,18 @@ #define ESF_GZ_RX_PREFIX_VLAN_STRIP_TCI_WIDTH 16 #define ESF_GZ_RX_PREFIX_CSUM_FRAME_LBN 144 #define ESF_GZ_RX_PREFIX_CSUM_FRAME_WIDTH 16 -#define ESF_GZ_RX_PREFIX_INGRESS_VPORT_LBN 128 -#define ESF_GZ_RX_PREFIX_INGRESS_VPORT_WIDTH 16 +#define ESF_GZ_RX_PREFIX_INGRESS_MPORT_LBN 128 +#define ESF_GZ_RX_PREFIX_INGRESS_MPORT_WIDTH 16 #define ESF_GZ_RX_PREFIX_USER_MARK_LBN 96 #define ESF_GZ_RX_PREFIX_USER_MARK_WIDTH 32 #define ESF_GZ_RX_PREFIX_RSS_HASH_LBN 64 #define ESF_GZ_RX_PREFIX_RSS_HASH_WIDTH 32 -#define ESF_GZ_RX_PREFIX_PARTIAL_TSTAMP_LBN 32 -#define ESF_GZ_RX_PREFIX_PARTIAL_TSTAMP_WIDTH 32 +#define ESF_GZ_RX_PREFIX_PARTIAL_TSTAMP_LBN 34 +#define ESF_GZ_RX_PREFIX_PARTIAL_TSTAMP_WIDTH 30 +#define ESF_GZ_RX_PREFIX_VSWITCH_STATUS_LBN 33 +#define ESF_GZ_RX_PREFIX_VSWITCH_STATUS_WIDTH 1 +#define ESF_GZ_RX_PREFIX_VLAN_STRIPPED_LBN 32 +#define ESF_GZ_RX_PREFIX_VLAN_STRIPPED_WIDTH 1 #define ESF_GZ_RX_PREFIX_CLASS_LBN 16 #define ESF_GZ_RX_PREFIX_CLASS_WIDTH 16 #define ESF_GZ_RX_PREFIX_USER_FLAG_LBN 15 @@ -454,12 +452,8 @@ #define ESF_GZ_M2M_TRANSLATE_ADDR_WIDTH 1 #define ESF_GZ_M2M_RSVD_LBN 120 #define ESF_GZ_M2M_RSVD_WIDTH 2 -#define ESF_GZ_M2M_ADDR_SPC_LBN 108 -#define ESF_GZ_M2M_ADDR_SPC_WIDTH 12 -#define ESF_GZ_M2M_ADDR_SPC_PASID_LBN 86 -#define ESF_GZ_M2M_ADDR_SPC_PASID_WIDTH 22 -#define ESF_GZ_M2M_ADDR_SPC_MODE_LBN 84 -#define ESF_GZ_M2M_ADDR_SPC_MODE_WIDTH 2 +#define ESF_GZ_M2M_ADDR_SPC_ID_LBN 84 +#define ESF_GZ_M2M_ADDR_SPC_ID_WIDTH 36 #define ESF_GZ_M2M_LEN_MINUS_1_LBN 64 #define ESF_GZ_M2M_LEN_MINUS_1_WIDTH 20 #define ESF_GZ_M2M_ADDR_LBN 0 @@ -492,12 +486,8 @@ #define ESF_GZ_TX_SEG_TRANSLATE_ADDR_WIDTH 1 #define ESF_GZ_TX_SEG_RSVD2_LBN 120 #define ESF_GZ_TX_SEG_RSVD2_WIDTH 2 -#define ESF_GZ_TX_SEG_ADDR_SPC_LBN 108 -#define ESF_GZ_TX_SEG_ADDR_SPC_WIDTH 12 -#define ESF_GZ_TX_SEG_ADDR_SPC_PASID_LBN 86 -#define ESF_GZ_TX_SEG_ADDR_SPC_PASID_WIDTH 22 -#define ESF_GZ_TX_SEG_ADDR_SPC_MODE_LBN 84 -#define ESF_GZ_TX_SEG_ADDR_SPC_MODE_WIDTH 2 +#define ESF_GZ_TX_SEG_ADDR_SPC_ID_LBN 84 +#define ESF_GZ_TX_SEG_ADDR_SPC_ID_WIDTH 36 #define ESF_GZ_TX_SEG_RSVD_LBN 80 #define ESF_GZ_TX_SEG_RSVD_WIDTH 4 #define ESF_GZ_TX_SEG_LEN_LBN 64 @@ -583,6 +573,12 @@ #define ESE_GZ_SF_TX_TSO_DSC_FMT_STRUCT_SIZE 124 +/* Enum D2VIO_MSG_OP */ +#define ESE_GZ_QUE_JBDNE 3 +#define ESE_GZ_QUE_EVICT 2 +#define ESE_GZ_QUE_EMPTY 1 +#define ESE_GZ_NOP 0 + /* Enum DESIGN_PARAMS */ #define ESE_EF100_DP_GZ_RX_MAX_RUNT 17 #define ESE_EF100_DP_GZ_VI_STRIDES 16 @@ -630,6 +626,19 @@ #define ESE_GZ_PCI_BASE_CONFIG_SPACE_SIZE 256 #define ESE_GZ_PCI_EXPRESS_XCAP_HDR_SIZE 4 +/* Enum RH_DSC_TYPE */ +#define ESE_GZ_TX_TOMB 0xF +#define ESE_GZ_TX_VIO 0xE +#define ESE_GZ_TX_TSO_OVRRD 0x8 +#define ESE_GZ_TX_D2CMP 0x7 +#define ESE_GZ_TX_DATA 0x6 +#define ESE_GZ_TX_D2M 0x5 +#define ESE_GZ_TX_M2M 0x4 +#define ESE_GZ_TX_SEG 0x3 +#define ESE_GZ_TX_TSO 0x2 +#define ESE_GZ_TX_OVRRD 0x1 +#define ESE_GZ_TX_SEND 0x0 + /* Enum RH_HCLASS_L2_CLASS */ #define ESE_GZ_RH_HCLASS_L2_CLASS_E2_0123VLAN 1 #define ESE_GZ_RH_HCLASS_L2_CLASS_OTHER 0 @@ -666,6 +675,25 @@ #define ESE_GZ_RH_HCLASS_TUNNEL_CLASS_VXLAN 1 #define ESE_GZ_RH_HCLASS_TUNNEL_CLASS_NONE 0 +/* Enum SF_CTL_EVENT_SUBTYPE */ +#define ESE_GZ_EF100_CTL_EV_EVQ_TIMEOUT 0x3 +#define ESE_GZ_EF100_CTL_EV_FLUSH 0x2 +#define ESE_GZ_EF100_CTL_EV_TIME_SYNC 0x1 +#define ESE_GZ_EF100_CTL_EV_UNSOL_OVERFLOW 0x0 + +/* Enum SF_EVENT_TYPE */ +#define ESE_GZ_EF100_EV_DRIVER 0x5 +#define ESE_GZ_EF100_EV_MCDI 0x4 +#define ESE_GZ_EF100_EV_CONTROL 0x3 +#define ESE_GZ_EF100_EV_TX_TIMESTAMP 0x2 +#define ESE_GZ_EF100_EV_TX_COMPLETION 0x1 +#define ESE_GZ_EF100_EV_RX_PKTS 0x0 + +/* Enum SF_EW_EVENT_TYPE */ +#define ESE_GZ_EF100_EWEV_VIRTQ_DESC 0x2 +#define ESE_GZ_EF100_EWEV_TXQ_DESC 0x1 +#define ESE_GZ_EF100_EWEV_64BIT 0x0 + /* Enum TX_DESC_CSO_PARTIAL_EN */ #define ESE_GZ_TX_DESC_CSO_PARTIAL_EN_TCP 2 #define ESE_GZ_TX_DESC_CSO_PARTIAL_EN_UDP 1 @@ -681,6 +709,15 @@ #define ESE_GZ_TX_DESC_IP4_ID_INC_MOD16 2 #define ESE_GZ_TX_DESC_IP4_ID_INC_MOD15 1 #define ESE_GZ_TX_DESC_IP4_ID_NO_OP 0 + +/* Enum VIRTIO_NET_HDR_F */ +#define ESE_GZ_NEEDS_CSUM 0x1 + +/* Enum VIRTIO_NET_HDR_GSO */ +#define ESE_GZ_TCPV6 0x4 +#define ESE_GZ_UDP 0x3 +#define ESE_GZ_TCPV4 0x1 +#define ESE_GZ_NONE 0x0 /**************************************************************************/ #define ESF_GZ_EV_DEBUG_EVENT_GEN_FLAGS_LBN 44 diff --git a/drivers/net/ethernet/sfc/ef100_rep.c b/drivers/net/ethernet/sfc/ef100_rep.c new file mode 100644 index 000000000000..d07539f091b8 --- /dev/null +++ b/drivers/net/ethernet/sfc/ef100_rep.c @@ -0,0 +1,244 @@ +// SPDX-License-Identifier: GPL-2.0-only +/**************************************************************************** + * Driver for Solarflare network controllers and boards + * Copyright 2019 Solarflare Communications Inc. + * Copyright 2020-2022 Xilinx Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation, incorporated herein by reference. + */ + +#include "ef100_rep.h" +#include "ef100_netdev.h" +#include "ef100_nic.h" +#include "mae.h" + +#define EFX_EF100_REP_DRIVER "efx_ef100_rep" + +static int efx_ef100_rep_init_struct(struct efx_nic *efx, struct efx_rep *efv, + unsigned int i) +{ + efv->parent = efx; + efv->idx = i; + INIT_LIST_HEAD(&efv->list); + efv->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | + NETIF_MSG_LINK | NETIF_MSG_IFDOWN | + NETIF_MSG_IFUP | NETIF_MSG_RX_ERR | + NETIF_MSG_TX_ERR | NETIF_MSG_HW; + return 0; +} + +static netdev_tx_t efx_ef100_rep_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + struct efx_rep *efv = netdev_priv(dev); + struct efx_nic *efx = efv->parent; + netdev_tx_t rc; + + /* __ef100_hard_start_xmit() will always return success even in the + * case of TX drops, where it will increment efx's tx_dropped. The + * efv stats really only count attempted TX, not success/failure. + */ + atomic64_inc(&efv->stats.tx_packets); + atomic64_add(skb->len, &efv->stats.tx_bytes); + netif_tx_lock(efx->net_dev); + rc = __ef100_hard_start_xmit(skb, efx, dev, efv); + netif_tx_unlock(efx->net_dev); + return rc; +} + +static int efx_ef100_rep_get_port_parent_id(struct net_device *dev, + struct netdev_phys_item_id *ppid) +{ + struct efx_rep *efv = netdev_priv(dev); + struct efx_nic *efx = efv->parent; + struct ef100_nic_data *nic_data; + + nic_data = efx->nic_data; + /* nic_data->port_id is a u8[] */ + ppid->id_len = sizeof(nic_data->port_id); + memcpy(ppid->id, nic_data->port_id, sizeof(nic_data->port_id)); + return 0; +} + +static int efx_ef100_rep_get_phys_port_name(struct net_device *dev, + char *buf, size_t len) +{ + struct efx_rep *efv = netdev_priv(dev); + struct efx_nic *efx = efv->parent; + struct ef100_nic_data *nic_data; + int ret; + + nic_data = efx->nic_data; + ret = snprintf(buf, len, "p%upf%uvf%u", efx->port_num, + nic_data->pf_index, efv->idx); + if (ret >= len) + return -EOPNOTSUPP; + + return 0; +} + +static const struct net_device_ops efx_ef100_rep_netdev_ops = { + .ndo_start_xmit = efx_ef100_rep_xmit, + .ndo_get_port_parent_id = efx_ef100_rep_get_port_parent_id, + .ndo_get_phys_port_name = efx_ef100_rep_get_phys_port_name, +}; + +static void efx_ef100_rep_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *drvinfo) +{ + strscpy(drvinfo->driver, EFX_EF100_REP_DRIVER, sizeof(drvinfo->driver)); +} + +static u32 efx_ef100_rep_ethtool_get_msglevel(struct net_device *net_dev) +{ + struct efx_rep *efv = netdev_priv(net_dev); + + return efv->msg_enable; +} + +static void efx_ef100_rep_ethtool_set_msglevel(struct net_device *net_dev, + u32 msg_enable) +{ + struct efx_rep *efv = netdev_priv(net_dev); + + efv->msg_enable = msg_enable; +} + +static const struct ethtool_ops efx_ef100_rep_ethtool_ops = { + .get_drvinfo = efx_ef100_rep_get_drvinfo, + .get_msglevel = efx_ef100_rep_ethtool_get_msglevel, + .set_msglevel = efx_ef100_rep_ethtool_set_msglevel, +}; + +static struct efx_rep *efx_ef100_rep_create_netdev(struct efx_nic *efx, + unsigned int i) +{ + struct net_device *net_dev; + struct efx_rep *efv; + int rc; + + net_dev = alloc_etherdev_mq(sizeof(*efv), 1); + if (!net_dev) + return ERR_PTR(-ENOMEM); + + efv = netdev_priv(net_dev); + rc = efx_ef100_rep_init_struct(efx, efv, i); + if (rc) + goto fail1; + efv->net_dev = net_dev; + rtnl_lock(); + spin_lock_bh(&efx->vf_reps_lock); + list_add_tail(&efv->list, &efx->vf_reps); + spin_unlock_bh(&efx->vf_reps_lock); + if (netif_running(efx->net_dev) && efx->state == STATE_NET_UP) { + netif_device_attach(net_dev); + netif_carrier_on(net_dev); + } else { + netif_carrier_off(net_dev); + netif_tx_stop_all_queues(net_dev); + } + rtnl_unlock(); + + net_dev->netdev_ops = &efx_ef100_rep_netdev_ops; + net_dev->ethtool_ops = &efx_ef100_rep_ethtool_ops; + net_dev->min_mtu = EFX_MIN_MTU; + net_dev->max_mtu = EFX_MAX_MTU; + net_dev->features |= NETIF_F_LLTX; + net_dev->hw_features |= NETIF_F_LLTX; + return efv; +fail1: + free_netdev(net_dev); + return ERR_PTR(rc); +} + +static int efx_ef100_configure_rep(struct efx_rep *efv) +{ + struct efx_nic *efx = efv->parent; + u32 selector; + int rc; + + /* Construct mport selector for corresponding VF */ + efx_mae_mport_vf(efx, efv->idx, &selector); + /* Look up actual mport ID */ + rc = efx_mae_lookup_mport(efx, selector, &efv->mport); + if (rc) + return rc; + pci_dbg(efx->pci_dev, "VF %u has mport ID %#x\n", efv->idx, efv->mport); + /* mport label should fit in 16 bits */ + WARN_ON(efv->mport >> 16); + + return 0; +} + +static void efx_ef100_rep_destroy_netdev(struct efx_rep *efv) +{ + struct efx_nic *efx = efv->parent; + + rtnl_lock(); + spin_lock_bh(&efx->vf_reps_lock); + list_del(&efv->list); + spin_unlock_bh(&efx->vf_reps_lock); + rtnl_unlock(); + free_netdev(efv->net_dev); +} + +int efx_ef100_vfrep_create(struct efx_nic *efx, unsigned int i) +{ + struct efx_rep *efv; + int rc; + + efv = efx_ef100_rep_create_netdev(efx, i); + if (IS_ERR(efv)) { + rc = PTR_ERR(efv); + pci_err(efx->pci_dev, + "Failed to create representor for VF %d, rc %d\n", i, + rc); + return rc; + } + rc = efx_ef100_configure_rep(efv); + if (rc) { + pci_err(efx->pci_dev, + "Failed to configure representor for VF %d, rc %d\n", + i, rc); + goto fail; + } + rc = register_netdev(efv->net_dev); + if (rc) { + pci_err(efx->pci_dev, + "Failed to register representor for VF %d, rc %d\n", + i, rc); + goto fail; + } + pci_dbg(efx->pci_dev, "Representor for VF %d is %s\n", i, + efv->net_dev->name); + return 0; +fail: + efx_ef100_rep_destroy_netdev(efv); + return rc; +} + +void efx_ef100_vfrep_destroy(struct efx_nic *efx, struct efx_rep *efv) +{ + struct net_device *rep_dev; + + rep_dev = efv->net_dev; + if (!rep_dev) + return; + netif_dbg(efx, drv, rep_dev, "Removing VF representor\n"); + unregister_netdev(rep_dev); + efx_ef100_rep_destroy_netdev(efv); +} + +void efx_ef100_fini_vfreps(struct efx_nic *efx) +{ + struct ef100_nic_data *nic_data = efx->nic_data; + struct efx_rep *efv, *next; + + if (!nic_data->grp_mae) + return; + + list_for_each_entry_safe(efv, next, &efx->vf_reps, list) + efx_ef100_vfrep_destroy(efx, efv); +} diff --git a/drivers/net/ethernet/sfc/ef100_rep.h b/drivers/net/ethernet/sfc/ef100_rep.h new file mode 100644 index 000000000000..d47fd8ff6220 --- /dev/null +++ b/drivers/net/ethernet/sfc/ef100_rep.h @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/**************************************************************************** + * Driver for Solarflare network controllers and boards + * Copyright 2019 Solarflare Communications Inc. + * Copyright 2020-2022 Xilinx Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation, incorporated herein by reference. + */ + +/* Handling for ef100 representor netdevs */ +#ifndef EF100_REP_H +#define EF100_REP_H + +#include "net_driver.h" + +struct efx_rep_sw_stats { + atomic64_t rx_packets, tx_packets; + atomic64_t rx_bytes, tx_bytes; + atomic64_t rx_dropped, tx_errors; +}; + +/** + * struct efx_rep - Private data for an Efx representor + * + * @parent: the efx PF which manages this representor + * @net_dev: representor netdevice + * @msg_enable: log message enable flags + * @mport: m-port ID of corresponding VF + * @idx: VF index + * @list: entry on efx->vf_reps + * @stats: software traffic counters for netdev stats + */ +struct efx_rep { + struct efx_nic *parent; + struct net_device *net_dev; + u32 msg_enable; + u32 mport; + unsigned int idx; + struct list_head list; + struct efx_rep_sw_stats stats; +}; + +int efx_ef100_vfrep_create(struct efx_nic *efx, unsigned int i); +void efx_ef100_vfrep_destroy(struct efx_nic *efx, struct efx_rep *efv); +void efx_ef100_fini_vfreps(struct efx_nic *efx); + +#endif /* EF100_REP_H */ diff --git a/drivers/net/ethernet/sfc/ef100_sriov.c b/drivers/net/ethernet/sfc/ef100_sriov.c index 664578176bfe..94bdbfcb47e8 100644 --- a/drivers/net/ethernet/sfc/ef100_sriov.c +++ b/drivers/net/ethernet/sfc/ef100_sriov.c @@ -11,46 +11,62 @@ #include "ef100_sriov.h" #include "ef100_nic.h" +#include "ef100_rep.h" static int efx_ef100_pci_sriov_enable(struct efx_nic *efx, int num_vfs) { + struct ef100_nic_data *nic_data = efx->nic_data; struct pci_dev *dev = efx->pci_dev; - int rc; + struct efx_rep *efv, *next; + int rc, i; efx->vf_count = num_vfs; rc = pci_enable_sriov(dev, num_vfs); if (rc) - goto fail; + goto fail1; + if (!nic_data->grp_mae) + return 0; + + for (i = 0; i < num_vfs; i++) { + rc = efx_ef100_vfrep_create(efx, i); + if (rc) + goto fail2; + } return 0; -fail: +fail2: + list_for_each_entry_safe(efv, next, &efx->vf_reps, list) + efx_ef100_vfrep_destroy(efx, efv); + pci_disable_sriov(dev); +fail1: netif_err(efx, probe, efx->net_dev, "Failed to enable SRIOV VFs\n"); efx->vf_count = 0; return rc; } -int efx_ef100_pci_sriov_disable(struct efx_nic *efx) +int efx_ef100_pci_sriov_disable(struct efx_nic *efx, bool force) { struct pci_dev *dev = efx->pci_dev; unsigned int vfs_assigned; vfs_assigned = pci_vfs_assigned(dev); - if (vfs_assigned) { + if (vfs_assigned && !force) { netif_info(efx, drv, efx->net_dev, "VFs are assigned to guests; " "please detach them before disabling SR-IOV\n"); return -EBUSY; } - pci_disable_sriov(dev); - + efx_ef100_fini_vfreps(efx); + if (!vfs_assigned) + pci_disable_sriov(dev); return 0; } int efx_ef100_sriov_configure(struct efx_nic *efx, int num_vfs) { if (num_vfs == 0) - return efx_ef100_pci_sriov_disable(efx); + return efx_ef100_pci_sriov_disable(efx, false); else return efx_ef100_pci_sriov_enable(efx, num_vfs); } diff --git a/drivers/net/ethernet/sfc/ef100_sriov.h b/drivers/net/ethernet/sfc/ef100_sriov.h index c48fccd46c57..8ffdf464dd1d 100644 --- a/drivers/net/ethernet/sfc/ef100_sriov.h +++ b/drivers/net/ethernet/sfc/ef100_sriov.h @@ -11,4 +11,4 @@ #include "net_driver.h" int efx_ef100_sriov_configure(struct efx_nic *efx, int num_vfs); -int efx_ef100_pci_sriov_disable(struct efx_nic *efx); +int efx_ef100_pci_sriov_disable(struct efx_nic *efx, bool force); diff --git a/drivers/net/ethernet/sfc/ef100_tx.c b/drivers/net/ethernet/sfc/ef100_tx.c index 26ef51d6b542..102ddc7e206a 100644 --- a/drivers/net/ethernet/sfc/ef100_tx.c +++ b/drivers/net/ethernet/sfc/ef100_tx.c @@ -254,7 +254,8 @@ static void ef100_make_tso_desc(struct efx_nic *efx, static void ef100_tx_make_descriptors(struct efx_tx_queue *tx_queue, const struct sk_buff *skb, - unsigned int segment_count) + unsigned int segment_count, + struct efx_rep *efv) { unsigned int old_write_count = tx_queue->write_count; unsigned int new_write_count = old_write_count; @@ -272,6 +273,20 @@ static void ef100_tx_make_descriptors(struct efx_tx_queue *tx_queue, else next_desc_type = ESE_GZ_TX_DESC_TYPE_SEND; + if (unlikely(efv)) { + /* Create TX override descriptor */ + write_ptr = new_write_count & tx_queue->ptr_mask; + txd = ef100_tx_desc(tx_queue, write_ptr); + ++new_write_count; + + tx_queue->packet_write_count = new_write_count; + EFX_POPULATE_OWORD_3(*txd, + ESF_GZ_TX_DESC_TYPE, ESE_GZ_TX_DESC_TYPE_PREFIX, + ESF_GZ_TX_PREFIX_EGRESS_MPORT, efv->mport, + ESF_GZ_TX_PREFIX_EGRESS_MPORT_EN, 1); + nr_descs--; + } + /* if it's a raw write (such as XDP) then always SEND single frames */ if (!skb) nr_descs = 1; @@ -306,6 +321,9 @@ static void ef100_tx_make_descriptors(struct efx_tx_queue *tx_queue, /* if it's a raw write (such as XDP) then always SEND */ next_desc_type = skb ? ESE_GZ_TX_DESC_TYPE_SEG : ESE_GZ_TX_DESC_TYPE_SEND; + /* mark as an EFV buffer if applicable */ + if (unlikely(efv)) + buffer->flags |= EFX_TX_BUF_EFV; } while (new_write_count != tx_queue->insert_count); @@ -324,7 +342,7 @@ static void ef100_tx_make_descriptors(struct efx_tx_queue *tx_queue, void ef100_tx_write(struct efx_tx_queue *tx_queue) { - ef100_tx_make_descriptors(tx_queue, NULL, 0); + ef100_tx_make_descriptors(tx_queue, NULL, 0, NULL); ef100_tx_push_buffers(tx_queue); } @@ -351,6 +369,12 @@ void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event) */ int ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb) { + return __ef100_enqueue_skb(tx_queue, skb, NULL); +} + +int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, + struct efx_rep *efv) +{ unsigned int old_insert_count = tx_queue->insert_count; struct efx_nic *efx = tx_queue->efx; bool xmit_more = netdev_xmit_more(); @@ -376,16 +400,64 @@ int ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb) return 0; } + if (unlikely(efv)) { + struct efx_tx_buffer *buffer = __efx_tx_queue_get_insert_buffer(tx_queue); + + /* Drop representor packets if the queue is stopped. + * We currently don't assert backoff to representors so this is + * to make sure representor traffic can't starve the main + * net device. + * And, of course, if there are no TX descriptors left. + */ + if (netif_tx_queue_stopped(tx_queue->core_txq) || + unlikely(efx_tx_buffer_in_use(buffer))) { + atomic64_inc(&efv->stats.tx_errors); + rc = -ENOSPC; + goto err; + } + + /* Also drop representor traffic if it could cause us to + * stop the queue. If we assert backoff and we haven't + * received traffic on the main net device recently then the + * TX watchdog can go off erroneously. + */ + fill_level = efx_channel_tx_old_fill_level(tx_queue->channel); + fill_level += efx_tx_max_skb_descs(efx); + if (fill_level > efx->txq_stop_thresh) { + struct efx_tx_queue *txq2; + + /* Refresh cached fill level and re-check */ + efx_for_each_channel_tx_queue(txq2, tx_queue->channel) + txq2->old_read_count = READ_ONCE(txq2->read_count); + + fill_level = efx_channel_tx_old_fill_level(tx_queue->channel); + fill_level += efx_tx_max_skb_descs(efx); + if (fill_level > efx->txq_stop_thresh) { + atomic64_inc(&efv->stats.tx_errors); + rc = -ENOSPC; + goto err; + } + } + + buffer->flags = EFX_TX_BUF_OPTION | EFX_TX_BUF_EFV; + tx_queue->insert_count++; + } + /* Map for DMA and create descriptors */ rc = efx_tx_map_data(tx_queue, skb, segments); if (rc) goto err; - ef100_tx_make_descriptors(tx_queue, skb, segments); + ef100_tx_make_descriptors(tx_queue, skb, segments, efv); fill_level = efx_channel_tx_old_fill_level(tx_queue->channel); if (fill_level > efx->txq_stop_thresh) { struct efx_tx_queue *txq2; + /* Because of checks above, representor traffic should + * not be able to stop the queue. + */ + WARN_ON(efv); + netif_tx_stop_queue(tx_queue->core_txq); /* Re-read after a memory barrier in case we've raced with * the completion path. Otherwise there's a danger we'll never @@ -404,8 +476,12 @@ int ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb) /* If xmit_more then we don't need to push the doorbell, unless there * are 256 descriptors already queued in which case we have to push to * ensure we never push more than 256 at once. + * + * Always push for representor traffic, and don't account it to parent + * PF netdevice's BQL. */ - if (__netdev_tx_sent_queue(tx_queue->core_txq, skb->len, xmit_more) || + if (unlikely(efv) || + __netdev_tx_sent_queue(tx_queue->core_txq, skb->len, xmit_more) || tx_queue->write_count - tx_queue->notify_count > 255) ef100_tx_push_buffers(tx_queue); diff --git a/drivers/net/ethernet/sfc/ef100_tx.h b/drivers/net/ethernet/sfc/ef100_tx.h index ddc4b98fa6db..e9e11540fcde 100644 --- a/drivers/net/ethernet/sfc/ef100_tx.h +++ b/drivers/net/ethernet/sfc/ef100_tx.h @@ -13,6 +13,7 @@ #define EFX_EF100_TX_H #include "net_driver.h" +#include "ef100_rep.h" int ef100_tx_probe(struct efx_tx_queue *tx_queue); void ef100_tx_init(struct efx_tx_queue *tx_queue); @@ -22,4 +23,6 @@ unsigned int ef100_tx_max_skb_descs(struct efx_nic *efx); void ef100_ev_tx(struct efx_channel *channel, const efx_qword_t *p_event); netdev_tx_t ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb); +int __ef100_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb, + struct efx_rep *efv); #endif diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h index c05a83da9e44..4239c7ece123 100644 --- a/drivers/net/ethernet/sfc/efx.h +++ b/drivers/net/ethernet/sfc/efx.h @@ -12,6 +12,7 @@ #include "net_driver.h" #include "ef100_rx.h" #include "ef100_tx.h" +#include "efx_common.h" #include "filter.h" int efx_net_open(struct net_device *net_dev); @@ -206,6 +207,9 @@ static inline void efx_device_detach_sync(struct efx_nic *efx) { struct net_device *dev = efx->net_dev; + /* We must stop reps (which use our TX) before we stop ourselves. */ + efx_detach_reps(efx); + /* Lock/freeze all TX queues so that we can be sure the * TX scheduler is stopped when we're done and before * netif_device_present() becomes false. @@ -217,8 +221,11 @@ static inline void efx_device_detach_sync(struct efx_nic *efx) static inline void efx_device_attach_if_not_resetting(struct efx_nic *efx) { - if ((efx->state != STATE_DISABLED) && !efx->reset_pending) + if ((efx->state != STATE_DISABLED) && !efx->reset_pending) { netif_device_attach(efx->net_dev); + if (efx->state == STATE_NET_UP) + efx_attach_reps(efx); + } } static inline bool efx_rwsem_assert_write_locked(struct rw_semaphore *sem) diff --git a/drivers/net/ethernet/sfc/efx_common.c b/drivers/net/ethernet/sfc/efx_common.c index 56eb717bb07a..a929a1aaba92 100644 --- a/drivers/net/ethernet/sfc/efx_common.c +++ b/drivers/net/ethernet/sfc/efx_common.c @@ -24,6 +24,7 @@ #include "mcdi_port_common.h" #include "io.h" #include "mcdi_pcol.h" +#include "ef100_rep.h" static unsigned int debug = (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK | NETIF_MSG_IFDOWN | @@ -1021,6 +1022,8 @@ int efx_init_struct(struct efx_nic *efx, struct pci_dev *pci_dev) efx->rps_hash_table = kcalloc(EFX_ARFS_HASH_TABLE_SIZE, sizeof(*efx->rps_hash_table), GFP_KERNEL); #endif + spin_lock_init(&efx->vf_reps_lock); + INIT_LIST_HEAD(&efx->vf_reps); INIT_WORK(&efx->mac_work, efx_mac_work); init_waitqueue_head(&efx->flush_wq); @@ -1389,3 +1392,38 @@ int efx_get_phys_port_name(struct net_device *net_dev, char *name, size_t len) return -EINVAL; return 0; } + +void efx_detach_reps(struct efx_nic *efx) +{ + struct net_device *rep_dev; + struct efx_rep *efv; + + ASSERT_RTNL(); + netif_dbg(efx, drv, efx->net_dev, "Detaching VF representors\n"); + list_for_each_entry(efv, &efx->vf_reps, list) { + rep_dev = efv->net_dev; + if (!rep_dev) + continue; + netif_carrier_off(rep_dev); + /* See efx_device_detach_sync() */ + netif_tx_lock_bh(rep_dev); + netif_tx_stop_all_queues(rep_dev); + netif_tx_unlock_bh(rep_dev); + } +} + +void efx_attach_reps(struct efx_nic *efx) +{ + struct net_device *rep_dev; + struct efx_rep *efv; + + ASSERT_RTNL(); + netif_dbg(efx, drv, efx->net_dev, "Attaching VF representors\n"); + list_for_each_entry(efv, &efx->vf_reps, list) { + rep_dev = efv->net_dev; + if (!rep_dev) + continue; + netif_tx_wake_all_queues(rep_dev); + netif_carrier_on(rep_dev); + } +} diff --git a/drivers/net/ethernet/sfc/efx_common.h b/drivers/net/ethernet/sfc/efx_common.h index 93babc1a2678..2c54dac3e662 100644 --- a/drivers/net/ethernet/sfc/efx_common.h +++ b/drivers/net/ethernet/sfc/efx_common.h @@ -111,4 +111,7 @@ int efx_get_phys_port_id(struct net_device *net_dev, int efx_get_phys_port_name(struct net_device *net_dev, char *name, size_t len); + +void efx_detach_reps(struct efx_nic *efx); +void efx_attach_reps(struct efx_nic *efx); #endif diff --git a/drivers/net/ethernet/sfc/mae.c b/drivers/net/ethernet/sfc/mae.c new file mode 100644 index 000000000000..011ebd46ada5 --- /dev/null +++ b/drivers/net/ethernet/sfc/mae.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0-only +/**************************************************************************** + * Driver for Solarflare network controllers and boards + * Copyright 2019 Solarflare Communications Inc. + * Copyright 2020-2022 Xilinx Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation, incorporated herein by reference. + */ + +#include "mae.h" +#include "mcdi.h" +#include "mcdi_pcol.h" + +void efx_mae_mport_vf(struct efx_nic *efx __always_unused, u32 vf_id, u32 *out) +{ + efx_dword_t mport; + + EFX_POPULATE_DWORD_3(mport, + MAE_MPORT_SELECTOR_TYPE, MAE_MPORT_SELECTOR_TYPE_FUNC, + MAE_MPORT_SELECTOR_FUNC_PF_ID, MAE_MPORT_SELECTOR_FUNC_PF_ID_CALLER, + MAE_MPORT_SELECTOR_FUNC_VF_ID, vf_id); + *out = EFX_DWORD_VAL(mport); +} + +/* id is really only 24 bits wide */ +int efx_mae_lookup_mport(struct efx_nic *efx, u32 selector, u32 *id) +{ + MCDI_DECLARE_BUF(outbuf, MC_CMD_MAE_MPORT_LOOKUP_OUT_LEN); + MCDI_DECLARE_BUF(inbuf, MC_CMD_MAE_MPORT_LOOKUP_IN_LEN); + size_t outlen; + int rc; + + MCDI_SET_DWORD(inbuf, MAE_MPORT_LOOKUP_IN_MPORT_SELECTOR, selector); + rc = efx_mcdi_rpc(efx, MC_CMD_MAE_MPORT_LOOKUP, inbuf, sizeof(inbuf), + outbuf, sizeof(outbuf), &outlen); + if (rc) + return rc; + if (outlen < sizeof(outbuf)) + return -EIO; + *id = MCDI_DWORD(outbuf, MAE_MPORT_LOOKUP_OUT_MPORT_ID); + return 0; +} diff --git a/drivers/net/ethernet/sfc/mae.h b/drivers/net/ethernet/sfc/mae.h new file mode 100644 index 000000000000..27e69e8a54b6 --- /dev/null +++ b/drivers/net/ethernet/sfc/mae.h @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/**************************************************************************** + * Driver for Solarflare network controllers and boards + * Copyright 2019 Solarflare Communications Inc. + * Copyright 2020-2022 Xilinx Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation, incorporated herein by reference. + */ + +#ifndef EF100_MAE_H +#define EF100_MAE_H +/* MCDI interface for the ef100 Match-Action Engine */ + +#include "net_driver.h" + +void efx_mae_mport_vf(struct efx_nic *efx, u32 vf_id, u32 *out); + +int efx_mae_lookup_mport(struct efx_nic *efx, u32 selector, u32 *id); + +#endif /* EF100_MAE_H */ diff --git a/drivers/net/ethernet/sfc/mcdi.c b/drivers/net/ethernet/sfc/mcdi.c index 3225fe64c397..af338208eae9 100644 --- a/drivers/net/ethernet/sfc/mcdi.c +++ b/drivers/net/ethernet/sfc/mcdi.c @@ -2129,6 +2129,52 @@ fail: return rc; } +/* Failure to read a privilege mask is never fatal, because we can always + * carry on as though we didn't have the privilege we were interested in. + * So use efx_mcdi_rpc_quiet(). + */ +int efx_mcdi_get_privilege_mask(struct efx_nic *efx, u32 *mask) +{ + MCDI_DECLARE_BUF(fi_outbuf, MC_CMD_GET_FUNCTION_INFO_OUT_LEN); + MCDI_DECLARE_BUF(pm_inbuf, MC_CMD_PRIVILEGE_MASK_IN_LEN); + MCDI_DECLARE_BUF(pm_outbuf, MC_CMD_PRIVILEGE_MASK_OUT_LEN); + size_t outlen; + u16 pf, vf; + int rc; + + if (!efx || !mask) + return -EINVAL; + + /* Get our function number */ + rc = efx_mcdi_rpc_quiet(efx, MC_CMD_GET_FUNCTION_INFO, NULL, 0, + fi_outbuf, MC_CMD_GET_FUNCTION_INFO_OUT_LEN, + &outlen); + if (rc != 0) + return rc; + if (outlen < MC_CMD_GET_FUNCTION_INFO_OUT_LEN) + return -EIO; + + pf = MCDI_DWORD(fi_outbuf, GET_FUNCTION_INFO_OUT_PF); + vf = MCDI_DWORD(fi_outbuf, GET_FUNCTION_INFO_OUT_VF); + + MCDI_POPULATE_DWORD_2(pm_inbuf, PRIVILEGE_MASK_IN_FUNCTION, + PRIVILEGE_MASK_IN_FUNCTION_PF, pf, + PRIVILEGE_MASK_IN_FUNCTION_VF, vf); + + rc = efx_mcdi_rpc_quiet(efx, MC_CMD_PRIVILEGE_MASK, + pm_inbuf, sizeof(pm_inbuf), + pm_outbuf, sizeof(pm_outbuf), &outlen); + + if (rc != 0) + return rc; + if (outlen < MC_CMD_PRIVILEGE_MASK_OUT_LEN) + return -EIO; + + *mask = MCDI_DWORD(pm_outbuf, PRIVILEGE_MASK_OUT_OLD_MASK); + + return 0; +} + #ifdef CONFIG_SFC_MTD #define EFX_MCDI_NVRAM_LEN_MAX 128 diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h index 69c2924a147c..f74f6ce8b27d 100644 --- a/drivers/net/ethernet/sfc/mcdi.h +++ b/drivers/net/ethernet/sfc/mcdi.h @@ -366,6 +366,7 @@ int efx_mcdi_set_workaround(struct efx_nic *efx, u32 type, bool enabled, unsigned int *flags); int efx_mcdi_get_workarounds(struct efx_nic *efx, unsigned int *impl_out, unsigned int *enabled_out); +int efx_mcdi_get_privilege_mask(struct efx_nic *efx, u32 *mask); #ifdef CONFIG_SFC_MCDI_MON int efx_mcdi_mon_probe(struct efx_nic *efx); diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h index 2228c88a7f31..4cde54cf77b9 100644 --- a/drivers/net/ethernet/sfc/net_driver.h +++ b/drivers/net/ethernet/sfc/net_driver.h @@ -178,6 +178,7 @@ struct efx_tx_buffer { #define EFX_TX_BUF_OPTION 0x10 /* empty buffer for option descriptor */ #define EFX_TX_BUF_XDP 0x20 /* buffer was sent with XDP */ #define EFX_TX_BUF_TSO_V3 0x40 /* empty buffer for a TSO_V3 descriptor */ +#define EFX_TX_BUF_EFV 0x100 /* buffer was sent from representor */ /** * struct efx_tx_queue - An Efx TX queue @@ -966,6 +967,8 @@ enum efx_xdp_tx_queues_mode { * @vf_count: Number of VFs intended to be enabled. * @vf_init_count: Number of VFs that have been fully initialised. * @vi_scale: log2 number of vnics per VF. + * @vf_reps_lock: Protects vf_reps list + * @vf_reps: local VF reps * @ptp_data: PTP state data * @ptp_warned: has this NIC seen and warned about unexpected PTP events? * @vpd_sn: Serial number read from VPD @@ -1145,6 +1148,8 @@ struct efx_nic { unsigned vf_init_count; unsigned vi_scale; #endif + spinlock_t vf_reps_lock; + struct list_head vf_reps; struct efx_ptp_data *ptp_data; bool ptp_warned; diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c index 79cc0bb76321..d12474042c84 100644 --- a/drivers/net/ethernet/sfc/tx.c +++ b/drivers/net/ethernet/sfc/tx.c @@ -559,6 +559,7 @@ netdev_tx_t efx_hard_start_xmit(struct sk_buff *skb, void efx_xmit_done_single(struct efx_tx_queue *tx_queue) { unsigned int pkts_compl = 0, bytes_compl = 0; + unsigned int efv_pkts_compl = 0; unsigned int read_ptr; bool finished = false; @@ -580,7 +581,8 @@ void efx_xmit_done_single(struct efx_tx_queue *tx_queue) /* Need to check the flag before dequeueing. */ if (buffer->flags & EFX_TX_BUF_SKB) finished = true; - efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl); + efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl, + &efv_pkts_compl); ++tx_queue->read_count; read_ptr = tx_queue->read_count & tx_queue->ptr_mask; @@ -589,7 +591,7 @@ void efx_xmit_done_single(struct efx_tx_queue *tx_queue) tx_queue->pkts_compl += pkts_compl; tx_queue->bytes_compl += bytes_compl; - EFX_WARN_ON_PARANOID(pkts_compl != 1); + EFX_WARN_ON_PARANOID(pkts_compl + efv_pkts_compl != 1); efx_xmit_done_check_empty(tx_queue); } diff --git a/drivers/net/ethernet/sfc/tx_common.c b/drivers/net/ethernet/sfc/tx_common.c index 658ea2d34070..67e789b96c43 100644 --- a/drivers/net/ethernet/sfc/tx_common.c +++ b/drivers/net/ethernet/sfc/tx_common.c @@ -109,9 +109,11 @@ void efx_fini_tx_queue(struct efx_tx_queue *tx_queue) /* Free any buffers left in the ring */ while (tx_queue->read_count != tx_queue->write_count) { unsigned int pkts_compl = 0, bytes_compl = 0; + unsigned int efv_pkts_compl = 0; buffer = &tx_queue->buffer[tx_queue->read_count & tx_queue->ptr_mask]; - efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl); + efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl, + &efv_pkts_compl); ++tx_queue->read_count; } @@ -146,7 +148,8 @@ void efx_remove_tx_queue(struct efx_tx_queue *tx_queue) void efx_dequeue_buffer(struct efx_tx_queue *tx_queue, struct efx_tx_buffer *buffer, unsigned int *pkts_compl, - unsigned int *bytes_compl) + unsigned int *bytes_compl, + unsigned int *efv_pkts_compl) { if (buffer->unmap_len) { struct device *dma_dev = &tx_queue->efx->pci_dev->dev; @@ -164,9 +167,15 @@ void efx_dequeue_buffer(struct efx_tx_queue *tx_queue, if (buffer->flags & EFX_TX_BUF_SKB) { struct sk_buff *skb = (struct sk_buff *)buffer->skb; - EFX_WARN_ON_PARANOID(!pkts_compl || !bytes_compl); - (*pkts_compl)++; - (*bytes_compl) += skb->len; + if (unlikely(buffer->flags & EFX_TX_BUF_EFV)) { + EFX_WARN_ON_PARANOID(!efv_pkts_compl); + (*efv_pkts_compl)++; + } else { + EFX_WARN_ON_PARANOID(!pkts_compl || !bytes_compl); + (*pkts_compl)++; + (*bytes_compl) += skb->len; + } + if (tx_queue->timestamping && (tx_queue->completed_timestamp_major || tx_queue->completed_timestamp_minor)) { @@ -199,7 +208,8 @@ void efx_dequeue_buffer(struct efx_tx_queue *tx_queue, static void efx_dequeue_buffers(struct efx_tx_queue *tx_queue, unsigned int index, unsigned int *pkts_compl, - unsigned int *bytes_compl) + unsigned int *bytes_compl, + unsigned int *efv_pkts_compl) { struct efx_nic *efx = tx_queue->efx; unsigned int stop_index, read_ptr; @@ -218,7 +228,8 @@ static void efx_dequeue_buffers(struct efx_tx_queue *tx_queue, return; } - efx_dequeue_buffer(tx_queue, buffer, pkts_compl, bytes_compl); + efx_dequeue_buffer(tx_queue, buffer, pkts_compl, bytes_compl, + efv_pkts_compl); ++tx_queue->read_count; read_ptr = tx_queue->read_count & tx_queue->ptr_mask; @@ -241,15 +252,17 @@ void efx_xmit_done_check_empty(struct efx_tx_queue *tx_queue) void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) { unsigned int fill_level, pkts_compl = 0, bytes_compl = 0; + unsigned int efv_pkts_compl = 0; struct efx_nic *efx = tx_queue->efx; EFX_WARN_ON_ONCE_PARANOID(index > tx_queue->ptr_mask); - efx_dequeue_buffers(tx_queue, index, &pkts_compl, &bytes_compl); + efx_dequeue_buffers(tx_queue, index, &pkts_compl, &bytes_compl, + &efv_pkts_compl); tx_queue->pkts_compl += pkts_compl; tx_queue->bytes_compl += bytes_compl; - if (pkts_compl > 1) + if (pkts_compl + efv_pkts_compl > 1) ++tx_queue->merge_events; /* See if we need to restart the netif queue. This memory @@ -274,6 +287,7 @@ void efx_xmit_done(struct efx_tx_queue *tx_queue, unsigned int index) void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, unsigned int insert_count) { + unsigned int efv_pkts_compl = 0; struct efx_tx_buffer *buffer; unsigned int bytes_compl = 0; unsigned int pkts_compl = 0; @@ -282,7 +296,8 @@ void efx_enqueue_unwind(struct efx_tx_queue *tx_queue, while (tx_queue->insert_count != insert_count) { --tx_queue->insert_count; buffer = __efx_tx_queue_get_insert_buffer(tx_queue); - efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl); + efx_dequeue_buffer(tx_queue, buffer, &pkts_compl, &bytes_compl, + &efv_pkts_compl); } } diff --git a/drivers/net/ethernet/sfc/tx_common.h b/drivers/net/ethernet/sfc/tx_common.h index bbab7f248250..d87aecbc7bf1 100644 --- a/drivers/net/ethernet/sfc/tx_common.h +++ b/drivers/net/ethernet/sfc/tx_common.h @@ -19,7 +19,8 @@ void efx_remove_tx_queue(struct efx_tx_queue *tx_queue); void efx_dequeue_buffer(struct efx_tx_queue *tx_queue, struct efx_tx_buffer *buffer, unsigned int *pkts_compl, - unsigned int *bytes_compl); + unsigned int *bytes_compl, + unsigned int *efv_pkts_compl); static inline bool efx_tx_buffer_in_use(struct efx_tx_buffer *buffer) { diff --git a/drivers/net/ipa/data/ipa_data-v3.1.c b/drivers/net/ipa/data/ipa_data-v3.1.c index 00f4e506e6e5..1c1895aea811 100644 --- a/drivers/net/ipa/data/ipa_data-v3.1.c +++ b/drivers/net/ipa/data/ipa_data-v3.1.c @@ -6,10 +6,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v3.1 */ enum ipa_resource_type { diff --git a/drivers/net/ipa/data/ipa_data-v3.5.1.c b/drivers/net/ipa/data/ipa_data-v3.5.1.c index b7e32e87733e..58b708d2fc75 100644 --- a/drivers/net/ipa/data/ipa_data-v3.5.1.c +++ b/drivers/net/ipa/data/ipa_data-v3.5.1.c @@ -6,10 +6,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v3.5.1 */ enum ipa_resource_type { diff --git a/drivers/net/ipa/data/ipa_data-v4.11.c b/drivers/net/ipa/data/ipa_data-v4.11.c index 1be823e5c5c2..a204e439c23d 100644 --- a/drivers/net/ipa/data/ipa_data-v4.11.c +++ b/drivers/net/ipa/data/ipa_data-v4.11.c @@ -4,10 +4,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v4.11 */ enum ipa_resource_type { diff --git a/drivers/net/ipa/data/ipa_data-v4.2.c b/drivers/net/ipa/data/ipa_data-v4.2.c index 683f1f91042f..04f574fe006f 100644 --- a/drivers/net/ipa/data/ipa_data-v4.2.c +++ b/drivers/net/ipa/data/ipa_data-v4.2.c @@ -4,10 +4,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v4.2 */ enum ipa_resource_type { diff --git a/drivers/net/ipa/data/ipa_data-v4.5.c b/drivers/net/ipa/data/ipa_data-v4.5.c index 79398f286a9c..684239e71f46 100644 --- a/drivers/net/ipa/data/ipa_data-v4.5.c +++ b/drivers/net/ipa/data/ipa_data-v4.5.c @@ -4,10 +4,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v4.5 */ enum ipa_resource_type { diff --git a/drivers/net/ipa/data/ipa_data-v4.9.c b/drivers/net/ipa/data/ipa_data-v4.9.c index 4b96efd05cf2..2333e15f9533 100644 --- a/drivers/net/ipa/data/ipa_data-v4.9.c +++ b/drivers/net/ipa/data/ipa_data-v4.9.c @@ -4,10 +4,10 @@ #include <linux/log2.h> -#include "gsi.h" -#include "ipa_data.h" -#include "ipa_endpoint.h" -#include "ipa_mem.h" +#include "../gsi.h" +#include "../ipa_data.h" +#include "../ipa_endpoint.h" +#include "../ipa_mem.h" /** enum ipa_resource_type - IPA resource types for an SoC having IPA v4.9 */ enum ipa_resource_type { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2b21f2a3452f..20c26aed7896 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -47,6 +47,7 @@ struct kobject; struct mem_cgroup; struct module; struct bpf_func_state; +struct ftrace_ops; extern struct idr btf_idr; extern spinlock_t btf_idr_lock; @@ -221,7 +222,7 @@ struct bpf_map { u32 btf_vmlinux_value_type_id; struct btf *btf; #ifdef CONFIG_MEMCG_KMEM - struct mem_cgroup *memcg; + struct obj_cgroup *objcg; #endif char name[BPF_OBJ_NAME_LEN]; struct bpf_map_off_arr *off_arr; @@ -751,6 +752,16 @@ struct btf_func_model { /* Return the return value of fentry prog. Only used by bpf_struct_ops. */ #define BPF_TRAMP_F_RET_FENTRY_RET BIT(4) +/* Get original function from stack instead of from provided direct address. + * Makes sense for trampolines with fexit or fmod_ret programs. + */ +#define BPF_TRAMP_F_ORIG_STACK BIT(5) + +/* This trampoline is on a function with another ftrace_ops with IPMODIFY, + * e.g., a live patch. This flag is set and cleared by ftrace call backs, + */ +#define BPF_TRAMP_F_SHARE_IPMODIFY BIT(6) + /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 * bytes on x86. */ @@ -833,9 +844,11 @@ struct bpf_tramp_image { struct bpf_trampoline { /* hlist for trampoline_table */ struct hlist_node hlist; + struct ftrace_ops *fops; /* serializes access to fields of this trampoline */ struct mutex mutex; refcount_t refcnt; + u32 flags; u64 key; struct { struct btf_func_model model; @@ -1044,7 +1057,6 @@ struct bpf_prog_aux { bool sleepable; bool tail_call_reachable; bool xdp_has_frags; - bool use_bpf_prog_pack; /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */ const struct btf_type *attach_func_proto; /* function name for valid attach_btf_id */ @@ -1255,9 +1267,6 @@ struct bpf_dummy_ops { int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); #endif -int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog, - int cgroup_atype); -void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog); #else static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) { @@ -1281,6 +1290,13 @@ static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, { return -EINVAL; } +#endif + +#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM) +int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog, + int cgroup_atype); +void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog); +#else static inline int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog, int cgroup_atype) { @@ -1921,7 +1937,8 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog, struct bpf_reg_state *regs); int btf_check_kfunc_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, - struct bpf_reg_state *regs); + struct bpf_reg_state *regs, + u32 kfunc_flags); int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog, struct bpf_reg_state *reg); int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *prog, diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 81b19669efba..2e3bad8640dc 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -345,10 +345,10 @@ struct bpf_verifier_state_list { }; struct bpf_loop_inline_state { - int initialized:1; /* set to true upon first entry */ - int fit_for_inline:1; /* true if callback function is the same - * at each call and flags are always zero - */ + unsigned int initialized:1; /* set to true upon first entry */ + unsigned int fit_for_inline:1; /* true if callback function is the same + * at each call and flags are always zero + */ u32 callback_subprogno; /* valid when fit_for_inline is true */ }; diff --git a/include/linux/btf.h b/include/linux/btf.h index 1bfed7fa0428..cdb376d53238 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -12,14 +12,43 @@ #define BTF_TYPE_EMIT(type) ((void)(type *)0) #define BTF_TYPE_EMIT_ENUM(enum_val) ((void)enum_val) -enum btf_kfunc_type { - BTF_KFUNC_TYPE_CHECK, - BTF_KFUNC_TYPE_ACQUIRE, - BTF_KFUNC_TYPE_RELEASE, - BTF_KFUNC_TYPE_RET_NULL, - BTF_KFUNC_TYPE_KPTR_ACQUIRE, - BTF_KFUNC_TYPE_MAX, -}; +/* These need to be macros, as the expressions are used in assembler input */ +#define KF_ACQUIRE (1 << 0) /* kfunc is an acquire function */ +#define KF_RELEASE (1 << 1) /* kfunc is a release function */ +#define KF_RET_NULL (1 << 2) /* kfunc returns a pointer that may be NULL */ +#define KF_KPTR_GET (1 << 3) /* kfunc returns reference to a kptr */ +/* Trusted arguments are those which are meant to be referenced arguments with + * unchanged offset. It is used to enforce that pointers obtained from acquire + * kfuncs remain unmodified when being passed to helpers taking trusted args. + * + * Consider + * struct foo { + * int data; + * struct foo *next; + * }; + * + * struct bar { + * int data; + * struct foo f; + * }; + * + * struct foo *f = alloc_foo(); // Acquire kfunc + * struct bar *b = alloc_bar(); // Acquire kfunc + * + * If a kfunc set_foo_data() wants to operate only on the allocated object, it + * will set the KF_TRUSTED_ARGS flag, which will prevent unsafe usage like: + * + * set_foo_data(f, 42); // Allowed + * set_foo_data(f->next, 42); // Rejected, non-referenced pointer + * set_foo_data(&f->next, 42);// Rejected, referenced, but wrong type + * set_foo_data(&b->f, 42); // Rejected, referenced, but bad offset + * + * In the final case, usually for the purposes of type matching, it is deduced + * by looking at the type of the member at the offset, but due to the + * requirement of trusted argument, this deduction will be strict and not done + * for this case. + */ +#define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */ struct btf; struct btf_member; @@ -30,16 +59,7 @@ struct btf_id_set; struct btf_kfunc_id_set { struct module *owner; - union { - struct { - struct btf_id_set *check_set; - struct btf_id_set *acquire_set; - struct btf_id_set *release_set; - struct btf_id_set *ret_null_set; - struct btf_id_set *kptr_acquire_set; - }; - struct btf_id_set *sets[BTF_KFUNC_TYPE_MAX]; - }; + struct btf_id_set8 *set; }; struct btf_id_dtor_kfunc { @@ -378,9 +398,9 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); const char *btf_name_by_offset(const struct btf *btf, u32 offset); struct btf *btf_parse_vmlinux(void); struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog); -bool btf_kfunc_id_set_contains(const struct btf *btf, +u32 *btf_kfunc_id_set_contains(const struct btf *btf, enum bpf_prog_type prog_type, - enum btf_kfunc_type type, u32 kfunc_btf_id); + u32 kfunc_btf_id); int register_btf_kfunc_id_set(enum bpf_prog_type prog_type, const struct btf_kfunc_id_set *s); s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id); @@ -397,12 +417,11 @@ static inline const char *btf_name_by_offset(const struct btf *btf, { return NULL; } -static inline bool btf_kfunc_id_set_contains(const struct btf *btf, +static inline u32 *btf_kfunc_id_set_contains(const struct btf *btf, enum bpf_prog_type prog_type, - enum btf_kfunc_type type, u32 kfunc_btf_id) { - return false; + return NULL; } static inline int register_btf_kfunc_id_set(enum bpf_prog_type prog_type, const struct btf_kfunc_id_set *s) diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index 252a4befeab1..2aea877d644f 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -8,6 +8,15 @@ struct btf_id_set { u32 ids[]; }; +struct btf_id_set8 { + u32 cnt; + u32 flags; + struct { + u32 id; + u32 flags; + } pairs[]; +}; + #ifdef CONFIG_DEBUG_INFO_BTF #include <linux/compiler.h> /* for __PASTE */ @@ -25,7 +34,7 @@ struct btf_id_set { #define BTF_IDS_SECTION ".BTF_ids" -#define ____BTF_ID(symbol) \ +#define ____BTF_ID(symbol, word) \ asm( \ ".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ ".local " #symbol " ; \n" \ @@ -33,10 +42,11 @@ asm( \ ".size " #symbol ", 4; \n" \ #symbol ": \n" \ ".zero 4 \n" \ +word \ ".popsection; \n"); -#define __BTF_ID(symbol) \ - ____BTF_ID(symbol) +#define __BTF_ID(symbol, word) \ + ____BTF_ID(symbol, word) #define __ID(prefix) \ __PASTE(prefix, __COUNTER__) @@ -46,7 +56,14 @@ asm( \ * to 4 zero bytes. */ #define BTF_ID(prefix, name) \ - __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__)) + __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), "") + +#define ____BTF_ID_FLAGS(prefix, name, flags) \ + __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), ".long " #flags "\n") +#define __BTF_ID_FLAGS(prefix, name, flags, ...) \ + ____BTF_ID_FLAGS(prefix, name, flags) +#define BTF_ID_FLAGS(prefix, name, ...) \ + __BTF_ID_FLAGS(prefix, name, ##__VA_ARGS__, 0) /* * The BTF_ID_LIST macro defines pure (unsorted) list @@ -145,10 +162,51 @@ asm( \ ".popsection; \n"); \ extern struct btf_id_set name; +/* + * The BTF_SET8_START/END macros pair defines sorted list of + * BTF IDs and their flags plus its members count, with the + * following layout: + * + * BTF_SET8_START(list) + * BTF_ID_FLAGS(type1, name1, flags) + * BTF_ID_FLAGS(type2, name2, flags) + * BTF_SET8_END(list) + * + * __BTF_ID__set8__list: + * .zero 8 + * list: + * __BTF_ID__type1__name1__3: + * .zero 4 + * .word (1 << 0) | (1 << 2) + * __BTF_ID__type2__name2__5: + * .zero 4 + * .word (1 << 3) | (1 << 1) | (1 << 2) + * + */ +#define __BTF_SET8_START(name, scope) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +"." #scope " __BTF_ID__set8__" #name "; \n" \ +"__BTF_ID__set8__" #name ":; \n" \ +".zero 8 \n" \ +".popsection; \n"); + +#define BTF_SET8_START(name) \ +__BTF_ID_LIST(name, local) \ +__BTF_SET8_START(name, local) + +#define BTF_SET8_END(name) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +".size __BTF_ID__set8__" #name ", .-" #name " \n" \ +".popsection; \n"); \ +extern struct btf_id_set8 name; + #else #define BTF_ID_LIST(name) static u32 __maybe_unused name[5]; #define BTF_ID(prefix, name) +#define BTF_ID_FLAGS(prefix, name, ...) #define BTF_ID_UNUSED #define BTF_ID_LIST_GLOBAL(name, n) u32 __maybe_unused name[n]; #define BTF_ID_LIST_SINGLE(name, prefix, typename) static u32 __maybe_unused name[1]; @@ -156,6 +214,8 @@ extern struct btf_id_set name; #define BTF_SET_START(name) static struct btf_id_set __maybe_unused name = { 0 }; #define BTF_SET_START_GLOBAL(name) static struct btf_id_set __maybe_unused name = { 0 }; #define BTF_SET_END(name) +#define BTF_SET8_START(name) static struct btf_id_set8 __maybe_unused name = { 0 }; +#define BTF_SET8_END(name) #endif /* CONFIG_DEBUG_INFO_BTF */ diff --git a/include/linux/filter.h b/include/linux/filter.h index 4c1a8b247545..a5f21dc3c432 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1027,6 +1027,14 @@ u64 bpf_jit_alloc_exec_limit(void); void *bpf_jit_alloc_exec(unsigned long size); void bpf_jit_free_exec(void *addr); void bpf_jit_free(struct bpf_prog *fp); +struct bpf_binary_header * +bpf_jit_binary_pack_hdr(const struct bpf_prog *fp); + +static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) +{ + return list_empty(&fp->aux->ksym.lnode) || + fp->aux->ksym.lnode.prev == LIST_POISON2; +} struct bpf_binary_header * bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **ro_image, diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 979f6bfa2c25..0b61371e287b 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -208,6 +208,43 @@ enum { FTRACE_OPS_FL_DIRECT = BIT(17), }; +/* + * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes + * to a ftrace_ops. Note, the requests may fail. + * + * ENABLE_SHARE_IPMODIFY_SELF - enable a DIRECT ops to work on the same + * function as an ops with IPMODIFY. Called + * when the DIRECT ops is being registered. + * This is called with both direct_mutex and + * ftrace_lock are locked. + * + * ENABLE_SHARE_IPMODIFY_PEER - enable a DIRECT ops to work on the same + * function as an ops with IPMODIFY. Called + * when the other ops (the one with IPMODIFY) + * is being registered. + * This is called with direct_mutex locked. + * + * DISABLE_SHARE_IPMODIFY_PEER - disable a DIRECT ops to work on the same + * function as an ops with IPMODIFY. Called + * when the other ops (the one with IPMODIFY) + * is being unregistered. + * This is called with direct_mutex locked. + */ +enum ftrace_ops_cmd { + FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF, + FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER, + FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER, +}; + +/* + * For most ftrace_ops_cmd, + * Returns: + * 0 - Success. + * Negative on failure. The return value is dependent on the + * callback. + */ +typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd); + #ifdef CONFIG_DYNAMIC_FTRACE /* The hash used to know what functions callbacks trace */ struct ftrace_ops_hash { @@ -250,6 +287,7 @@ struct ftrace_ops { unsigned long trampoline; unsigned long trampoline_size; struct list_head list; + ftrace_ops_func_t ops_func; #endif }; @@ -340,6 +378,7 @@ unsigned long ftrace_find_rec_direct(unsigned long ip); int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr); int unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr); int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr); +int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr); #else struct ftrace_ops; @@ -384,6 +423,10 @@ static inline int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned lo { return -ENODEV; } +static inline int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr) +{ + return -ENODEV; +} #endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */ #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS diff --git a/include/linux/lapb.h b/include/linux/lapb.h index eb56472f23b2..b5333f9413dc 100644 --- a/include/linux/lapb.h +++ b/include/linux/lapb.h @@ -6,6 +6,11 @@ #ifndef LAPB_KERNEL_H #define LAPB_KERNEL_H +#include <linux/skbuff.h> +#include <linux/timer.h> + +struct net_device; + #define LAPB_OK 0 #define LAPB_BADTOKEN 1 #define LAPB_INVALUE 2 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 5d7ff8850eb2..ca8afa382bf2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2487,6 +2487,14 @@ static inline void skb_set_tail_pointer(struct sk_buff *skb, const int offset) #endif /* NET_SKBUFF_DATA_USES_OFFSET */ +static inline void skb_assert_len(struct sk_buff *skb) +{ +#ifdef CONFIG_DEBUG_NET + if (WARN_ONCE(!skb->len, "%s\n", __func__)) + DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false); +#endif /* CONFIG_DEBUG_NET */ +} + /* * Add data to an sk_buff */ diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h index f742e50207fb..1c53c4c4d88f 100644 --- a/include/net/af_vsock.h +++ b/include/net/af_vsock.h @@ -10,6 +10,7 @@ #include <linux/kernel.h> #include <linux/workqueue.h> +#include <net/sock.h> #include <uapi/linux/vm_sockets.h> #include "vsock_addr.h" diff --git a/include/net/amt.h b/include/net/amt.h index 08fc30cf2f34..c881bc8b673b 100644 --- a/include/net/amt.h +++ b/include/net/amt.h @@ -7,6 +7,9 @@ #include <linux/siphash.h> #include <linux/jhash.h> +#include <linux/netdevice.h> +#include <net/gro_cells.h> +#include <net/rtnetlink.h> enum amt_msg_type { AMT_MSG_DISCOVERY = 1, diff --git a/include/net/ax88796.h b/include/net/ax88796.h index 2ed23a368602..b658471f97f0 100644 --- a/include/net/ax88796.h +++ b/include/net/ax88796.h @@ -8,6 +8,8 @@ #ifndef __NET_AX88796_PLAT_H #define __NET_AX88796_PLAT_H +#include <linux/types.h> + struct sk_buff; struct net_device; struct platform_device; diff --git a/include/net/bond_options.h b/include/net/bond_options.h index d2aea5cf1e41..69292ecc0325 100644 --- a/include/net/bond_options.h +++ b/include/net/bond_options.h @@ -7,6 +7,14 @@ #ifndef _NET_BOND_OPTIONS_H #define _NET_BOND_OPTIONS_H +#include <linux/bits.h> +#include <linux/limits.h> +#include <linux/types.h> +#include <linux/string.h> + +struct netlink_ext_ack; +struct nlattr; + #define BOND_OPT_MAX_NAMELEN 32 #define BOND_OPT_VALID(opt) ((opt) < BOND_OPT_LAST) #define BOND_MODE_ALL_EX(x) (~(x)) diff --git a/include/net/codel_qdisc.h b/include/net/codel_qdisc.h index 58b6d0ebea10..7d3d9219f4fe 100644 --- a/include/net/codel_qdisc.h +++ b/include/net/codel_qdisc.h @@ -49,6 +49,7 @@ * Implemented on linux by Dave Taht and Eric Dumazet */ +#include <net/codel.h> #include <net/pkt_sched.h> /* Qdiscs using codel plugin must use codel_skb_cb in their own cb[] */ diff --git a/include/net/datalink.h b/include/net/datalink.h index d9b7faaa539f..c837ffc7ebf8 100644 --- a/include/net/datalink.h +++ b/include/net/datalink.h @@ -2,6 +2,13 @@ #ifndef _NET_INET_DATALINK_H_ #define _NET_INET_DATALINK_H_ +#include <linux/list.h> + +struct llc_sap; +struct net_device; +struct packet_type; +struct sk_buff; + struct datalink_proto { unsigned char type[8]; diff --git a/include/net/dcbevent.h b/include/net/dcbevent.h index 43e34131a53f..02700262f71a 100644 --- a/include/net/dcbevent.h +++ b/include/net/dcbevent.h @@ -8,6 +8,8 @@ #ifndef _DCB_EVENT_H #define _DCB_EVENT_H +struct notifier_block; + enum dcbevent_notif_type { DCB_APP_EVENT = 1, }; diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h index e4ad58c4062c..2b2d86fb3131 100644 --- a/include/net/dcbnl.h +++ b/include/net/dcbnl.h @@ -10,6 +10,8 @@ #include <linux/dcbnl.h> +struct net_device; + struct dcb_app_type { int ifindex; struct dcb_app app; diff --git a/include/net/dn_dev.h b/include/net/dn_dev.h index 595b4f6c1eb1..bec303ea8367 100644 --- a/include/net/dn_dev.h +++ b/include/net/dn_dev.h @@ -2,6 +2,7 @@ #ifndef _NET_DN_DEV_H #define _NET_DN_DEV_H +#include <linux/netdevice.h> struct dn_dev; diff --git a/include/net/dn_fib.h b/include/net/dn_fib.h index ddd6565957b3..1929a3cd5ebe 100644 --- a/include/net/dn_fib.h +++ b/include/net/dn_fib.h @@ -4,6 +4,8 @@ #include <linux/netlink.h> #include <linux/refcount.h> +#include <linux/rtnetlink.h> +#include <net/fib_rules.h> extern const struct nla_policy rtm_dn_policy[]; diff --git a/include/net/dn_neigh.h b/include/net/dn_neigh.h index 2e3e7793973a..1f7df98bfc33 100644 --- a/include/net/dn_neigh.h +++ b/include/net/dn_neigh.h @@ -2,6 +2,8 @@ #ifndef _NET_DN_NEIGH_H #define _NET_DN_NEIGH_H +#include <net/neighbour.h> + /* * The position of the first two fields of * this structure are critical - SJW diff --git a/include/net/dn_nsp.h b/include/net/dn_nsp.h index f83932b864a9..a4a18fee0b7c 100644 --- a/include/net/dn_nsp.h +++ b/include/net/dn_nsp.h @@ -6,6 +6,12 @@ *******************************************************************************/ /* dn_nsp.c functions prototyping */ +#include <linux/atomic.h> +#include <linux/types.h> +#include <net/sock.h> + +struct sk_buff; +struct sk_buff_head; void dn_nsp_send_data_ack(struct sock *sk); void dn_nsp_send_oth_ack(struct sock *sk); diff --git a/include/net/dn_route.h b/include/net/dn_route.h index 6f1e94ac0bdf..88c0300236cc 100644 --- a/include/net/dn_route.h +++ b/include/net/dn_route.h @@ -7,6 +7,9 @@ *******************************************************************************/ +#include <linux/types.h> +#include <net/dst.h> + struct sk_buff *dn_alloc_skb(struct sock *sk, int size, gfp_t pri); int dn_route_output_sock(struct dst_entry __rcu **pprt, struct flowidn *, struct sock *sk, int flags); diff --git a/include/net/erspan.h b/include/net/erspan.h index 0d9e86bd9893..6cb4cbd6a48f 100644 --- a/include/net/erspan.h +++ b/include/net/erspan.h @@ -58,6 +58,9 @@ * GRE proto ERSPAN type I/II = 0x88BE, type III = 0x22EB */ +#include <linux/ip.h> +#include <linux/ipv6.h> +#include <linux/skbuff.h> #include <uapi/linux/erspan.h> #define ERSPAN_VERSION 0x1 /* ERSPAN type II */ diff --git a/include/net/esp.h b/include/net/esp.h index 9c5637d41d95..322950727dd0 100644 --- a/include/net/esp.h +++ b/include/net/esp.h @@ -5,6 +5,7 @@ #include <linux/skbuff.h> struct ip_esp_hdr; +struct xfrm_state; static inline struct ip_esp_hdr *ip_esp_hdr(const struct sk_buff *skb) { diff --git a/include/net/ethoc.h b/include/net/ethoc.h index 78519ed42ab4..73810f3ca492 100644 --- a/include/net/ethoc.h +++ b/include/net/ethoc.h @@ -10,6 +10,9 @@ #ifndef LINUX_NET_ETHOC_H #define LINUX_NET_ETHOC_H 1 +#include <linux/if.h> +#include <linux/types.h> + struct ethoc_platform_data { u8 hwaddr[IFHWADDRLEN]; s8 phy_id; diff --git a/include/net/firewire.h b/include/net/firewire.h index 299e5df38552..2442d645e412 100644 --- a/include/net/firewire.h +++ b/include/net/firewire.h @@ -2,6 +2,8 @@ #ifndef _NET_FIREWIRE_H #define _NET_FIREWIRE_H +#include <linux/types.h> + /* Pseudo L2 address */ #define FWNET_ALEN 16 union fwnet_hwaddr { diff --git a/include/net/fq.h b/include/net/fq.h index 2eccbbd2b559..07b5aff6ec58 100644 --- a/include/net/fq.h +++ b/include/net/fq.h @@ -7,6 +7,10 @@ #ifndef __NET_SCHED_FQ_H #define __NET_SCHED_FQ_H +#include <linux/skbuff.h> +#include <linux/spinlock.h> +#include <linux/types.h> + struct fq_tin; /** diff --git a/include/net/garp.h b/include/net/garp.h index 4d9a0c6a2e5f..59a07b171def 100644 --- a/include/net/garp.h +++ b/include/net/garp.h @@ -2,6 +2,8 @@ #ifndef _NET_GARP_H #define _NET_GARP_H +#include <linux/if_ether.h> +#include <linux/types.h> #include <net/stp.h> #define GARP_PROTOCOL_ID 0x1 diff --git a/include/net/gtp.h b/include/net/gtp.h index c1d6169df331..2a503f035d18 100644 --- a/include/net/gtp.h +++ b/include/net/gtp.h @@ -2,6 +2,10 @@ #ifndef _GTP_H_ #define _GTP_H_ +#include <linux/netdevice.h> +#include <linux/types.h> +#include <net/rtnetlink.h> + /* General GTP protocol related definitions. */ #define GTP0_PORT 3386 diff --git a/include/net/gue.h b/include/net/gue.h index e42402f180b7..dfca298bec9c 100644 --- a/include/net/gue.h +++ b/include/net/gue.h @@ -30,6 +30,9 @@ * may refer to options placed after this field. */ +#include <asm/byteorder.h> +#include <linux/types.h> + struct guehdr { union { struct { diff --git a/include/net/hwbm.h b/include/net/hwbm.h index c81444611a22..aa495decec35 100644 --- a/include/net/hwbm.h +++ b/include/net/hwbm.h @@ -2,6 +2,8 @@ #ifndef _HWBM_H #define _HWBM_H +#include <linux/mutex.h> + struct hwbm_pool { /* Capacity of the pool */ int size; diff --git a/include/net/ila.h b/include/net/ila.h index f98dcd5791b0..73ebe5eab272 100644 --- a/include/net/ila.h +++ b/include/net/ila.h @@ -8,6 +8,8 @@ #ifndef _NET_ILA_H #define _NET_ILA_H +struct sk_buff; + int ila_xlat_outgoing(struct sk_buff *skb); int ila_xlat_incoming(struct sk_buff *skb); diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h index 7392f959a405..025bd8d3c769 100644 --- a/include/net/inet6_connection_sock.h +++ b/include/net/inet6_connection_sock.h @@ -11,6 +11,8 @@ #include <linux/types.h> +struct flowi; +struct flowi6; struct request_sock; struct sk_buff; struct sock; diff --git a/include/net/inet_common.h b/include/net/inet_common.h index cad2a611efde..cec453c18f1d 100644 --- a/include/net/inet_common.h +++ b/include/net/inet_common.h @@ -3,6 +3,10 @@ #define _INET_COMMON_H #include <linux/indirect_call_wrapper.h> +#include <linux/net.h> +#include <linux/netdev_features.h> +#include <linux/types.h> +#include <net/sock.h> extern const struct proto_ops inet_stream_ops; extern const struct proto_ops inet_dgram_ops; @@ -12,6 +16,8 @@ extern const struct proto_ops inet_dgram_ops; */ struct msghdr; +struct net; +struct page; struct sock; struct sockaddr; struct socket; diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 911ad930867d..0b0876610553 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -4,6 +4,9 @@ #include <linux/rhashtable-types.h> #include <linux/completion.h> +#include <linux/in6.h> +#include <linux/rbtree_types.h> +#include <linux/refcount.h> /* Per netns frag queues directory */ struct fqdir { diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index ca2d6b60e1ec..035d61d50a98 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -2,6 +2,16 @@ #ifndef _NET_IP6_ROUTE_H #define _NET_IP6_ROUTE_H +#include <net/addrconf.h> +#include <net/flow.h> +#include <net/ip6_fib.h> +#include <net/sock.h> +#include <net/lwtunnel.h> +#include <linux/ip.h> +#include <linux/ipv6.h> +#include <linux/route.h> +#include <net/nexthop.h> + struct route_info { __u8 type; __u8 length; @@ -19,16 +29,6 @@ struct route_info { __u8 prefix[]; /* 0,8 or 16 */ }; -#include <net/addrconf.h> -#include <net/flow.h> -#include <net/ip6_fib.h> -#include <net/sock.h> -#include <net/lwtunnel.h> -#include <linux/ip.h> -#include <linux/ipv6.h> -#include <linux/route.h> -#include <net/nexthop.h> - #define RT6_LOOKUP_F_IFACE 0x00000001 #define RT6_LOOKUP_F_REACHABLE 0x00000002 #define RT6_LOOKUP_F_HAS_SADDR 0x00000004 diff --git a/include/net/ipcomp.h b/include/net/ipcomp.h index fee6fc451597..c31108295079 100644 --- a/include/net/ipcomp.h +++ b/include/net/ipcomp.h @@ -2,11 +2,13 @@ #ifndef _NET_IPCOMP_H #define _NET_IPCOMP_H +#include <linux/skbuff.h> #include <linux/types.h> #define IPCOMP_SCRATCH_SIZE 65400 struct crypto_comp; +struct ip_comp_hdr; struct ipcomp_data { u16 threshold; diff --git a/include/net/ipconfig.h b/include/net/ipconfig.h index e3534299bd2a..8276897d0c2e 100644 --- a/include/net/ipconfig.h +++ b/include/net/ipconfig.h @@ -7,6 +7,8 @@ /* The following are initdata: */ +#include <linux/types.h> + extern int ic_proto_enabled; /* Protocols enabled (see IC_xxx) */ extern int ic_set_manually; /* IPconfig parameters set manually */ diff --git a/include/net/llc_c_ac.h b/include/net/llc_c_ac.h index e766300b3e99..3e1f76786d7b 100644 --- a/include/net/llc_c_ac.h +++ b/include/net/llc_c_ac.h @@ -16,6 +16,13 @@ * Connection state transition actions * (Fb = F bit; Pb = P bit; Xb = X bit) */ + +#include <linux/types.h> + +struct sk_buff; +struct sock; +struct timer_list; + #define LLC_CONN_AC_CLR_REMOTE_BUSY 1 #define LLC_CONN_AC_CONN_IND 2 #define LLC_CONN_AC_CONN_CONFIRM 3 diff --git a/include/net/llc_c_st.h b/include/net/llc_c_st.h index 48f3f891b2f9..53823d61d8b6 100644 --- a/include/net/llc_c_st.h +++ b/include/net/llc_c_st.h @@ -11,6 +11,10 @@ * * See the GNU General Public License for more details. */ + +#include <net/llc_c_ac.h> +#include <net/llc_c_ev.h> + /* Connection component state management */ /* connection states */ #define LLC_CONN_OUT_OF_SVC 0 /* prior to allocation */ diff --git a/include/net/llc_s_ac.h b/include/net/llc_s_ac.h index a61b98c108ee..f71790305bc9 100644 --- a/include/net/llc_s_ac.h +++ b/include/net/llc_s_ac.h @@ -11,6 +11,10 @@ * * See the GNU General Public License for more details. */ + +struct llc_sap; +struct sk_buff; + /* SAP component actions */ #define SAP_ACT_UNITDATA_IND 1 #define SAP_ACT_SEND_UI 2 diff --git a/include/net/llc_s_ev.h b/include/net/llc_s_ev.h index 84db3a59ed28..fb7df1d70af3 100644 --- a/include/net/llc_s_ev.h +++ b/include/net/llc_s_ev.h @@ -13,6 +13,7 @@ */ #include <linux/skbuff.h> +#include <net/llc.h> /* Defines SAP component events */ /* Types of events (possible values in 'ev->type') */ diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h index 9deb3a3735da..0c71c27979fb 100644 --- a/include/net/mpls_iptunnel.h +++ b/include/net/mpls_iptunnel.h @@ -6,6 +6,9 @@ #ifndef _NET_MPLS_IPTUNNEL_H #define _NET_MPLS_IPTUNNEL_H 1 +#include <linux/types.h> +#include <net/lwtunnel.h> + struct mpls_iptunnel_encap { u8 labels; u8 ttl_propagate; diff --git a/include/net/mrp.h b/include/net/mrp.h index 1c308c034e1a..92cd3fb6cf9d 100644 --- a/include/net/mrp.h +++ b/include/net/mrp.h @@ -2,6 +2,10 @@ #ifndef _NET_MRP_H #define _NET_MRP_H +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/types.h> + #define MRP_END_MARK 0x0 struct mrp_pdu_hdr { diff --git a/include/net/ncsi.h b/include/net/ncsi.h index fbefe80361ee..08a50d9acb0a 100644 --- a/include/net/ncsi.h +++ b/include/net/ncsi.h @@ -2,6 +2,8 @@ #ifndef __NET_NCSI_H #define __NET_NCSI_H +#include <linux/types.h> + /* * The NCSI device states seen from external. More NCSI device states are * only visible internally (in net/ncsi/internal.h). When the NCSI device diff --git a/include/net/netevent.h b/include/net/netevent.h index 4107016c3bb4..1be3757a8b7f 100644 --- a/include/net/netevent.h +++ b/include/net/netevent.h @@ -14,6 +14,7 @@ struct dst_entry; struct neighbour; +struct notifier_block ; struct netevent_redirect { struct dst_entry *old; diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index 37866c8386e2..3cd3a6e631aa 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -84,4 +84,23 @@ void nf_conntrack_lock(spinlock_t *lock); extern spinlock_t nf_conntrack_expect_lock; +/* ctnetlink code shared by both ctnetlink and nf_conntrack_bpf */ + +#if (IS_BUILTIN(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \ + (IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES) || \ + IS_ENABLED(CONFIG_NF_CT_NETLINK)) + +static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout) +{ + if (timeout > INT_MAX) + timeout = INT_MAX; + WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout); +} + +int __nf_ct_change_timeout(struct nf_conn *ct, u64 cta_timeout); +void __nf_ct_change_status(struct nf_conn *ct, unsigned long on, unsigned long off); +int nf_ct_change_status_common(struct nf_conn *ct, unsigned int status); + +#endif + #endif /* _NF_CONNTRACK_CORE_H */ diff --git a/include/net/netns/can.h b/include/net/netns/can.h index 52fbd8291a96..48b79f7e6236 100644 --- a/include/net/netns/can.h +++ b/include/net/netns/can.h @@ -7,6 +7,7 @@ #define __NETNS_CAN_H__ #include <linux/spinlock.h> +#include <linux/timer.h> struct can_dev_rcv_lists; struct can_pkg_stats; diff --git a/include/net/netns/core.h b/include/net/netns/core.h index 388244e315e7..8249060cf5d0 100644 --- a/include/net/netns/core.h +++ b/include/net/netns/core.h @@ -2,6 +2,8 @@ #ifndef __NETNS_CORE_H__ #define __NETNS_CORE_H__ +#include <linux/types.h> + struct ctl_table_header; struct prot_inuse; diff --git a/include/net/netns/generic.h b/include/net/netns/generic.h index 8a1ab47c3fb3..7ce68183f6e1 100644 --- a/include/net/netns/generic.h +++ b/include/net/netns/generic.h @@ -8,6 +8,7 @@ #include <linux/bug.h> #include <linux/rcupdate.h> +#include <net/net_namespace.h> /* * Generic net pointers are to be used by modules to put some private diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index ce0cc4e8d8c7..c7320ef356d9 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -9,6 +9,7 @@ #include <linux/uidgid.h> #include <net/inet_frag.h> #include <linux/rcupdate.h> +#include <linux/seqlock.h> #include <linux/siphash.h> struct ctl_table_header; diff --git a/include/net/netns/mctp.h b/include/net/netns/mctp.h index acedef12a35e..1db8f9aaddb4 100644 --- a/include/net/netns/mctp.h +++ b/include/net/netns/mctp.h @@ -6,6 +6,7 @@ #ifndef __NETNS_MCTP_H__ #define __NETNS_MCTP_H__ +#include <linux/mutex.h> #include <linux/types.h> struct netns_mctp { diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index a7bdcfbb0b28..19ad2574b267 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -6,6 +6,8 @@ #ifndef __NETNS_MPLS_H__ #define __NETNS_MPLS_H__ +#include <linux/types.h> + struct mpls_route; struct ctl_table_header; diff --git a/include/net/netns/nexthop.h b/include/net/netns/nexthop.h index 1849e77eb68a..434239b37014 100644 --- a/include/net/netns/nexthop.h +++ b/include/net/netns/nexthop.h @@ -6,6 +6,7 @@ #ifndef __NETNS_NEXTHOP_H__ #define __NETNS_NEXTHOP_H__ +#include <linux/notifier.h> #include <linux/rbtree.h> struct netns_nexthop { diff --git a/include/net/netns/sctp.h b/include/net/netns/sctp.h index 40240722cdca..a681147aecd8 100644 --- a/include/net/netns/sctp.h +++ b/include/net/netns/sctp.h @@ -2,6 +2,9 @@ #ifndef __NETNS_SCTP_H__ #define __NETNS_SCTP_H__ +#include <linux/timer.h> +#include <net/snmp.h> + struct sock; struct proc_dir_entry; struct sctp_mib; diff --git a/include/net/netns/unix.h b/include/net/netns/unix.h index 6f1a33df061d..9859d134d5a8 100644 --- a/include/net/netns/unix.h +++ b/include/net/netns/unix.h @@ -5,6 +5,8 @@ #ifndef __NETNS_UNIX_H__ #define __NETNS_UNIX_H__ +#include <linux/spinlock.h> + struct unix_table { spinlock_t *locks; struct hlist_head *buckets; diff --git a/include/net/netrom.h b/include/net/netrom.h index 80f15b1c1a48..f0565a5987d1 100644 --- a/include/net/netrom.h +++ b/include/net/netrom.h @@ -14,6 +14,7 @@ #include <net/sock.h> #include <linux/refcount.h> #include <linux/seq_file.h> +#include <net/ax25.h> #define NR_NETWORK_LEN 15 #define NR_TRANSPORT_LEN 5 diff --git a/include/net/p8022.h b/include/net/p8022.h index c2bacc66bfbc..b690ffcad66b 100644 --- a/include/net/p8022.h +++ b/include/net/p8022.h @@ -1,6 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NET_P8022_H #define _NET_P8022_H + +struct net_device; +struct packet_type; +struct sk_buff; + struct datalink_proto * register_8022_client(unsigned char type, int (*func)(struct sk_buff *skb, diff --git a/include/net/phonet/pep.h b/include/net/phonet/pep.h index 27b1ab5e4e6d..645dddf5ce77 100644 --- a/include/net/phonet/pep.h +++ b/include/net/phonet/pep.h @@ -10,6 +10,9 @@ #ifndef NET_PHONET_PEP_H #define NET_PHONET_PEP_H +#include <linux/skbuff.h> +#include <net/phonet/phonet.h> + struct pep_sock { struct pn_sock pn_sk; diff --git a/include/net/phonet/phonet.h b/include/net/phonet/phonet.h index a27bdc6cfeab..862f1719b523 100644 --- a/include/net/phonet/phonet.h +++ b/include/net/phonet/phonet.h @@ -10,6 +10,10 @@ #ifndef AF_PHONET_H #define AF_PHONET_H +#include <linux/phonet.h> +#include <linux/skbuff.h> +#include <net/sock.h> + /* * The lower layers may not require more space, ever. Make sure it's * enough. diff --git a/include/net/phonet/pn_dev.h b/include/net/phonet/pn_dev.h index 05b49d4d2b11..e9dc8dca5817 100644 --- a/include/net/phonet/pn_dev.h +++ b/include/net/phonet/pn_dev.h @@ -10,6 +10,11 @@ #ifndef PN_DEV_H #define PN_DEV_H +#include <linux/list.h> +#include <linux/mutex.h> + +struct net; + struct phonet_device_list { struct list_head list; struct mutex lock; diff --git a/include/net/pptp.h b/include/net/pptp.h index 383e25ca53a7..e63176bdd4c8 100644 --- a/include/net/pptp.h +++ b/include/net/pptp.h @@ -2,6 +2,9 @@ #ifndef _NET_PPTP_H #define _NET_PPTP_H +#include <linux/types.h> +#include <net/gre.h> + #define PPP_LCP_ECHOREQ 0x09 #define PPP_LCP_ECHOREP 0x0A #define SC_RCV_BITS (SC_RCV_B7_1|SC_RCV_B7_0|SC_RCV_ODDP|SC_RCV_EVNP) diff --git a/include/net/psnap.h b/include/net/psnap.h index 7cb0c8ab4171..88802b0754ad 100644 --- a/include/net/psnap.h +++ b/include/net/psnap.h @@ -2,6 +2,11 @@ #ifndef _NET_PSNAP_H #define _NET_PSNAP_H +struct datalink_proto; +struct sk_buff; +struct packet_type; +struct net_device; + struct datalink_proto * register_snap_client(const unsigned char *desc, int (*rcvfunc)(struct sk_buff *, struct net_device *, diff --git a/include/net/regulatory.h b/include/net/regulatory.h index 47f06f6f5a67..896191f420d5 100644 --- a/include/net/regulatory.h +++ b/include/net/regulatory.h @@ -1,3 +1,4 @@ + #ifndef __NET_REGULATORY_H #define __NET_REGULATORY_H /* @@ -19,6 +20,8 @@ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */ +#include <linux/ieee80211.h> +#include <linux/nl80211.h> #include <linux/rcupdate.h> /** diff --git a/include/net/rose.h b/include/net/rose.h index 0f0a4ce0fee7..f192a64ddef2 100644 --- a/include/net/rose.h +++ b/include/net/rose.h @@ -9,6 +9,7 @@ #define _ROSE_H #include <linux/rose.h> +#include <net/ax25.h> #include <net/sock.h> #define ROSE_ADDR_LEN 5 diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h index dac91aa38c5a..21e7fa2a1813 100644 --- a/include/net/secure_seq.h +++ b/include/net/secure_seq.h @@ -4,6 +4,8 @@ #include <linux/types.h> +struct net; + u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport); u64 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport); diff --git a/include/net/smc.h b/include/net/smc.h index e441aa97ad61..37f829d9c6e5 100644 --- a/include/net/smc.h +++ b/include/net/smc.h @@ -11,6 +11,13 @@ #ifndef _SMC_H #define _SMC_H +#include <linux/device.h> +#include <linux/spinlock.h> +#include <linux/types.h> +#include <linux/wait.h> + +struct sock; + #define SMC_MAX_PNETID_LEN 16 /* Max. length of PNET id */ struct smc_hashinfo { diff --git a/include/net/stp.h b/include/net/stp.h index 2914e6d53490..528103fce2c0 100644 --- a/include/net/stp.h +++ b/include/net/stp.h @@ -2,6 +2,8 @@ #ifndef _NET_STP_H #define _NET_STP_H +#include <linux/if_ether.h> + struct stp_proto { unsigned char group_address[ETH_ALEN]; void (*rcv)(const struct stp_proto *, struct sk_buff *, diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h index da06613c9603..b830463e3dff 100644 --- a/include/net/transp_v6.h +++ b/include/net/transp_v6.h @@ -3,6 +3,7 @@ #define _TRANSP_V6_H #include <net/checksum.h> +#include <net/sock.h> /* IPv6 transport protocols */ extern struct proto rawv6_prot; @@ -12,6 +13,7 @@ extern struct proto tcpv6_prot; extern struct proto pingv6_prot; struct flowi6; +struct ipcm6_cookie; /* extension headers */ int ipv6_exthdrs_init(void); diff --git a/include/net/tun_proto.h b/include/net/tun_proto.h index 2ea3deba4c99..7b0de7852908 100644 --- a/include/net/tun_proto.h +++ b/include/net/tun_proto.h @@ -1,7 +1,8 @@ #ifndef __NET_TUN_PROTO_H #define __NET_TUN_PROTO_H -#include <linux/kernel.h> +#include <linux/if_ether.h> +#include <linux/types.h> /* One byte protocol values as defined by VXLAN-GPE and NSH. These will * hopefully get a shared IANA registry. diff --git a/include/net/udplite.h b/include/net/udplite.h index a3c53110d30b..0143b373602e 100644 --- a/include/net/udplite.h +++ b/include/net/udplite.h @@ -6,6 +6,7 @@ #define _UDPLITE_H #include <net/ip6_checksum.h> +#include <net/udp.h> /* UDP-Lite socket options */ #define UDPLITE_SEND_CSCOV 10 /* sender partial coverage (as sent) */ diff --git a/include/net/xdp_priv.h b/include/net/xdp_priv.h index a2d58b1a12e1..c9df68d5f258 100644 --- a/include/net/xdp_priv.h +++ b/include/net/xdp_priv.h @@ -3,6 +3,7 @@ #define __LINUX_NET_XDP_PRIV_H__ #include <linux/rhashtable.h> +#include <net/xdp.h> /* Private to net/core/xdp.c, but used by trace/events/xdp.h */ struct xdp_mem_allocator { diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h index 4aa031849668..4277b0dcee05 100644 --- a/include/net/xdp_sock_drv.h +++ b/include/net/xdp_sock_drv.h @@ -44,6 +44,15 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, xp_set_rxq_info(pool, rxq); } +static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool) +{ +#ifdef CONFIG_NET_RX_BUSY_POLL + return pool->heads[0].xdp.rxq->napi_id; +#else + return 0; +#endif +} + static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { @@ -198,6 +207,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, { } +static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool) +{ + return 0; +} + static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 3dd13fe738b9..59a217ca2dfd 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2361,7 +2361,8 @@ union bpf_attr { * Pull in non-linear data in case the *skb* is non-linear and not * all of *len* are part of the linear section. Make *len* bytes * from *skb* readable and writable. If a zero value is passed for - * *len*, then the whole length of the *skb* is pulled. + * *len*, then all bytes in the linear part of *skb* will be made + * readable and writable. * * This helper is only needed for reading and writing with direct * packet access. diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index fe40d3b9458f..d3e734bf8056 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -70,10 +70,8 @@ int array_map_alloc_check(union bpf_attr *attr) attr->map_flags & BPF_F_PRESERVE_ELEMS) return -EINVAL; - if (attr->value_size > KMALLOC_MAX_SIZE) - /* if value_size is bigger, the user space won't be able to - * access the elements. - */ + /* avoid overflow on round_up(map->value_size) */ + if (attr->value_size > INT_MAX) return -E2BIG; return 0; @@ -156,6 +154,11 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) return &array->map; } +static void *array_map_elem_ptr(struct bpf_array* array, u32 index) +{ + return array->value + (u64)array->elem_size * index; +} + /* Called from syscall or from eBPF program */ static void *array_map_lookup_elem(struct bpf_map *map, void *key) { @@ -165,7 +168,7 @@ static void *array_map_lookup_elem(struct bpf_map *map, void *key) if (unlikely(index >= array->map.max_entries)) return NULL; - return array->value + array->elem_size * (index & array->index_mask); + return array->value + (u64)array->elem_size * (index & array->index_mask); } static int array_map_direct_value_addr(const struct bpf_map *map, u64 *imm, @@ -203,7 +206,7 @@ static int array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_insn *insn = insn_buf; - u32 elem_size = round_up(map->value_size, 8); + u32 elem_size = array->elem_size; const int ret = BPF_REG_0; const int map_ptr = BPF_REG_1; const int index = BPF_REG_2; @@ -272,7 +275,7 @@ int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value) * access 'value_size' of them, so copying rounded areas * will not leak any kernel data */ - size = round_up(map->value_size, 8); + size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { @@ -339,7 +342,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, value, map->value_size); } else { val = array->value + - array->elem_size * (index & array->index_mask); + (u64)array->elem_size * (index & array->index_mask); if (map_flags & BPF_F_LOCK) copy_map_value_locked(map, val, value, false); else @@ -376,7 +379,7 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, * returned or zeros which were zero-filled by percpu_alloc, * so no kernel data leaks possible */ - size = round_up(map->value_size, 8); + size = array->elem_size; rcu_read_lock(); pptr = array->pptrs[index & array->index_mask]; for_each_possible_cpu(cpu) { @@ -408,8 +411,7 @@ static void array_map_free_timers(struct bpf_map *map) return; for (i = 0; i < array->map.max_entries; i++) - bpf_timer_cancel_and_free(array->value + array->elem_size * i + - map->timer_off); + bpf_timer_cancel_and_free(array_map_elem_ptr(array, i) + map->timer_off); } /* Called when map->refcnt goes to zero, either from workqueue or from syscall */ @@ -420,7 +422,7 @@ static void array_map_free(struct bpf_map *map) if (map_value_has_kptrs(map)) { for (i = 0; i < array->map.max_entries; i++) - bpf_map_free_kptrs(map, array->value + array->elem_size * i); + bpf_map_free_kptrs(map, array_map_elem_ptr(array, i)); bpf_map_free_kptr_off_tab(map); } @@ -556,7 +558,7 @@ static void *bpf_array_map_seq_start(struct seq_file *seq, loff_t *pos) index = info->index & array->index_mask; if (info->percpu_value_buf) return array->pptrs[index]; - return array->value + array->elem_size * index; + return array_map_elem_ptr(array, index); } static void *bpf_array_map_seq_next(struct seq_file *seq, void *v, loff_t *pos) @@ -575,7 +577,7 @@ static void *bpf_array_map_seq_next(struct seq_file *seq, void *v, loff_t *pos) index = info->index & array->index_mask; if (info->percpu_value_buf) return array->pptrs[index]; - return array->value + array->elem_size * index; + return array_map_elem_ptr(array, index); } static int __bpf_array_map_seq_show(struct seq_file *seq, void *v) @@ -583,6 +585,7 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v) struct bpf_iter_seq_array_map_info *info = seq->private; struct bpf_iter__bpf_map_elem ctx = {}; struct bpf_map *map = info->map; + struct bpf_array *array = container_of(map, struct bpf_array, map); struct bpf_iter_meta meta; struct bpf_prog *prog; int off = 0, cpu = 0; @@ -603,7 +606,7 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v) ctx.value = v; } else { pptr = v; - size = round_up(map->value_size, 8); + size = array->elem_size; for_each_possible_cpu(cpu) { bpf_long_memcpy(info->percpu_value_buf + off, per_cpu_ptr(pptr, cpu), @@ -633,11 +636,12 @@ static int bpf_iter_init_array_map(void *priv_data, { struct bpf_iter_seq_array_map_info *seq_info = priv_data; struct bpf_map *map = aux->map; + struct bpf_array *array = container_of(map, struct bpf_array, map); void *value_buf; u32 buf_size; if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { - buf_size = round_up(map->value_size, 8) * num_possible_cpus(); + buf_size = array->elem_size * num_possible_cpus(); value_buf = kmalloc(buf_size, GFP_USER | __GFP_NOWARN); if (!value_buf) return -ENOMEM; @@ -690,7 +694,7 @@ static int bpf_for_each_array_elem(struct bpf_map *map, bpf_callback_t callback_ if (is_percpu) val = this_cpu_ptr(array->pptrs[i]); else - val = array->value + array->elem_size * i; + val = array_map_elem_ptr(array, i); num_elems++; key = i; ret = callback_fn((u64)(long)map, (u64)(long)&key, @@ -1322,7 +1326,7 @@ static int array_of_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) { struct bpf_array *array = container_of(map, struct bpf_array, map); - u32 elem_size = round_up(map->value_size, 8); + u32 elem_size = array->elem_size; struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; const int map_ptr = BPF_REG_1; diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index d469b7f3deef..fa71d58b7ded 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -63,10 +63,11 @@ BTF_ID(func, bpf_lsm_socket_post_create) BTF_ID(func, bpf_lsm_socket_socketpair) BTF_SET_END(bpf_lsm_unlocked_sockopt_hooks) +#ifdef CONFIG_CGROUP_BPF void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func) { - const struct btf_param *args; + const struct btf_param *args __maybe_unused; if (btf_type_vlen(prog->aux->attach_func_proto) < 1 || btf_id_set_contains(&bpf_lsm_current_hooks, @@ -75,9 +76,9 @@ void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, return; } +#ifdef CONFIG_NET args = btf_params(prog->aux->attach_func_proto); -#ifdef CONFIG_NET if (args[0].type == btf_sock_ids[BTF_SOCK_TYPE_SOCKET]) *bpf_func = __cgroup_bpf_run_lsm_socket; else if (args[0].type == btf_sock_ids[BTF_SOCK_TYPE_SOCK]) @@ -86,6 +87,7 @@ void bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, #endif *bpf_func = __cgroup_bpf_run_lsm_current; } +#endif int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, const struct bpf_prog *prog) @@ -219,6 +221,7 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_get_retval: return prog->expected_attach_type == BPF_LSM_CGROUP ? &bpf_get_retval_proto : NULL; +#ifdef CONFIG_NET case BPF_FUNC_setsockopt: if (prog->expected_attach_type != BPF_LSM_CGROUP) return NULL; @@ -239,6 +242,7 @@ bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) prog->aux->attach_btf_id)) return &bpf_unlocked_sk_getsockopt_proto; return NULL; +#endif default: return tracing_prog_func_proto(func_id, prog); } diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 7e0068c3399c..84b2d9dba79a 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -341,6 +341,9 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, tlinks[BPF_TRAMP_FENTRY].links[0] = link; tlinks[BPF_TRAMP_FENTRY].nr_links = 1; + /* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, + * and it must be used alone. + */ flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0; return arch_prepare_bpf_trampoline(NULL, image, image_end, model, flags, tlinks, NULL); diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 4423045b8ff3..7ac971ea98d1 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -213,7 +213,7 @@ enum { }; struct btf_kfunc_set_tab { - struct btf_id_set *sets[BTF_KFUNC_HOOK_MAX][BTF_KFUNC_TYPE_MAX]; + struct btf_id_set8 *sets[BTF_KFUNC_HOOK_MAX]; }; struct btf_id_dtor_kfunc_tab { @@ -1116,7 +1116,8 @@ __printf(2, 3) static void btf_show(struct btf_show *show, const char *fmt, ...) */ #define btf_show_type_value(show, fmt, value) \ do { \ - if ((value) != 0 || (show->flags & BTF_SHOW_ZERO) || \ + if ((value) != (__typeof__(value))0 || \ + (show->flags & BTF_SHOW_ZERO) || \ show->state.depth == 0) { \ btf_show(show, "%s%s" fmt "%s%s", \ btf_show_indent(show), \ @@ -1615,7 +1616,7 @@ static void btf_free_id(struct btf *btf) static void btf_free_kfunc_set_tab(struct btf *btf) { struct btf_kfunc_set_tab *tab = btf->kfunc_set_tab; - int hook, type; + int hook; if (!tab) return; @@ -1624,10 +1625,8 @@ static void btf_free_kfunc_set_tab(struct btf *btf) */ if (btf_is_module(btf)) goto free_tab; - for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++) { - for (type = 0; type < ARRAY_SIZE(tab->sets[0]); type++) - kfree(tab->sets[hook][type]); - } + for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++) + kfree(tab->sets[hook]); free_tab: kfree(tab); btf->kfunc_set_tab = NULL; @@ -6171,13 +6170,14 @@ static bool is_kfunc_arg_mem_size(const struct btf *btf, static int btf_check_func_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, struct bpf_reg_state *regs, - bool ptr_to_mem_ok) + bool ptr_to_mem_ok, + u32 kfunc_flags) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); + bool rel = false, kptr_get = false, trusted_arg = false; struct bpf_verifier_log *log = &env->log; u32 i, nargs, ref_id, ref_obj_id = 0; bool is_kfunc = btf_is_kernel(btf); - bool rel = false, kptr_get = false; const char *func_name, *ref_tname; const struct btf_type *t, *ref_t; const struct btf_param *args; @@ -6209,10 +6209,9 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, if (is_kfunc) { /* Only kfunc can be release func */ - rel = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog), - BTF_KFUNC_TYPE_RELEASE, func_id); - kptr_get = btf_kfunc_id_set_contains(btf, resolve_prog_type(env->prog), - BTF_KFUNC_TYPE_KPTR_ACQUIRE, func_id); + rel = kfunc_flags & KF_RELEASE; + kptr_get = kfunc_flags & KF_KPTR_GET; + trusted_arg = kfunc_flags & KF_TRUSTED_ARGS; } /* check that BTF function arguments match actual types that the @@ -6237,10 +6236,19 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, return -EINVAL; } + /* Check if argument must be a referenced pointer, args + i has + * been verified to be a pointer (after skipping modifiers). + */ + if (is_kfunc && trusted_arg && !reg->ref_obj_id) { + bpf_log(log, "R%d must be referenced\n", regno); + return -EINVAL; + } + ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id); ref_tname = btf_name_by_offset(btf, ref_t->name_off); - if (rel && reg->ref_obj_id) + /* Trusted args have the same offset checks as release arguments */ + if (trusted_arg || (rel && reg->ref_obj_id)) arg_type |= OBJ_RELEASE; ret = check_func_arg_reg_off(env, reg, regno, arg_type); if (ret < 0) @@ -6338,7 +6346,8 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env, reg_ref_tname = btf_name_by_offset(reg_btf, reg_ref_t->name_off); if (!btf_struct_ids_match(log, reg_btf, reg_ref_id, - reg->off, btf, ref_id, rel && reg->ref_obj_id)) { + reg->off, btf, ref_id, + trusted_arg || (rel && reg->ref_obj_id))) { bpf_log(log, "kernel function %s args#%d expected pointer to %s %s but R%d has a pointer to %s %s\n", func_name, i, btf_type_str(ref_t), ref_tname, @@ -6441,7 +6450,7 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog, return -EINVAL; is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL; - err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global); + err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0); /* Compiler optimizations can remove arguments from static functions * or mismatched type can be passed into a global function. @@ -6454,9 +6463,10 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog, int btf_check_kfunc_arg_match(struct bpf_verifier_env *env, const struct btf *btf, u32 func_id, - struct bpf_reg_state *regs) + struct bpf_reg_state *regs, + u32 kfunc_flags) { - return btf_check_func_arg_match(env, btf, func_id, regs, true); + return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags); } /* Convert BTF of a function into bpf_reg_state if possible @@ -6853,6 +6863,11 @@ bool btf_id_set_contains(const struct btf_id_set *set, u32 id) return bsearch(&id, set->ids, set->cnt, sizeof(u32), btf_id_cmp_func) != NULL; } +static void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id) +{ + return bsearch(&id, set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func); +} + enum { BTF_MODULE_F_LIVE = (1 << 0), }; @@ -7101,16 +7116,16 @@ BTF_TRACING_TYPE_xxx /* Kernel Function (kfunc) BTF ID set registration API */ -static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, - enum btf_kfunc_type type, - struct btf_id_set *add_set, bool vmlinux_set) +static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, + struct btf_id_set8 *add_set) { + bool vmlinux_set = !btf_is_module(btf); struct btf_kfunc_set_tab *tab; - struct btf_id_set *set; + struct btf_id_set8 *set; u32 set_cnt; int ret; - if (hook >= BTF_KFUNC_HOOK_MAX || type >= BTF_KFUNC_TYPE_MAX) { + if (hook >= BTF_KFUNC_HOOK_MAX) { ret = -EINVAL; goto end; } @@ -7126,7 +7141,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, btf->kfunc_set_tab = tab; } - set = tab->sets[hook][type]; + set = tab->sets[hook]; /* Warn when register_btf_kfunc_id_set is called twice for the same hook * for module sets. */ @@ -7140,7 +7155,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, * pointer and return. */ if (!vmlinux_set) { - tab->sets[hook][type] = add_set; + tab->sets[hook] = add_set; return 0; } @@ -7149,7 +7164,7 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, * and concatenate all individual sets being registered. While each set * is individually sorted, they may become unsorted when concatenated, * hence re-sorting the final set again is required to make binary - * searching the set using btf_id_set_contains function work. + * searching the set using btf_id_set8_contains function work. */ set_cnt = set ? set->cnt : 0; @@ -7164,8 +7179,8 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, } /* Grow set */ - set = krealloc(tab->sets[hook][type], - offsetof(struct btf_id_set, ids[set_cnt + add_set->cnt]), + set = krealloc(tab->sets[hook], + offsetof(struct btf_id_set8, pairs[set_cnt + add_set->cnt]), GFP_KERNEL | __GFP_NOWARN); if (!set) { ret = -ENOMEM; @@ -7173,15 +7188,15 @@ static int __btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, } /* For newly allocated set, initialize set->cnt to 0 */ - if (!tab->sets[hook][type]) + if (!tab->sets[hook]) set->cnt = 0; - tab->sets[hook][type] = set; + tab->sets[hook] = set; /* Concatenate the two sets */ - memcpy(set->ids + set->cnt, add_set->ids, add_set->cnt * sizeof(set->ids[0])); + memcpy(set->pairs + set->cnt, add_set->pairs, add_set->cnt * sizeof(set->pairs[0])); set->cnt += add_set->cnt; - sort(set->ids, set->cnt, sizeof(set->ids[0]), btf_id_cmp_func, NULL); + sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL); return 0; end: @@ -7189,38 +7204,25 @@ end: return ret; } -static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, - const struct btf_kfunc_id_set *kset) -{ - bool vmlinux_set = !btf_is_module(btf); - int type, ret = 0; - - for (type = 0; type < ARRAY_SIZE(kset->sets); type++) { - if (!kset->sets[type]) - continue; - - ret = __btf_populate_kfunc_set(btf, hook, type, kset->sets[type], vmlinux_set); - if (ret) - break; - } - return ret; -} - -static bool __btf_kfunc_id_set_contains(const struct btf *btf, +static u32 *__btf_kfunc_id_set_contains(const struct btf *btf, enum btf_kfunc_hook hook, - enum btf_kfunc_type type, u32 kfunc_btf_id) { - struct btf_id_set *set; + struct btf_id_set8 *set; + u32 *id; - if (hook >= BTF_KFUNC_HOOK_MAX || type >= BTF_KFUNC_TYPE_MAX) - return false; + if (hook >= BTF_KFUNC_HOOK_MAX) + return NULL; if (!btf->kfunc_set_tab) - return false; - set = btf->kfunc_set_tab->sets[hook][type]; + return NULL; + set = btf->kfunc_set_tab->sets[hook]; if (!set) - return false; - return btf_id_set_contains(set, kfunc_btf_id); + return NULL; + id = btf_id_set8_contains(set, kfunc_btf_id); + if (!id) + return NULL; + /* The flags for BTF ID are located next to it */ + return id + 1; } static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type) @@ -7248,14 +7250,14 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type) * keeping the reference for the duration of the call provides the necessary * protection for looking up a well-formed btf->kfunc_set_tab. */ -bool btf_kfunc_id_set_contains(const struct btf *btf, +u32 *btf_kfunc_id_set_contains(const struct btf *btf, enum bpf_prog_type prog_type, - enum btf_kfunc_type type, u32 kfunc_btf_id) + u32 kfunc_btf_id) { enum btf_kfunc_hook hook; hook = bpf_prog_type_to_kfunc_hook(prog_type); - return __btf_kfunc_id_set_contains(btf, hook, type, kfunc_btf_id); + return __btf_kfunc_id_set_contains(btf, hook, kfunc_btf_id); } /* This function must be invoked only from initcalls/module init functions */ @@ -7282,7 +7284,7 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type, return PTR_ERR(btf); hook = bpf_prog_type_to_kfunc_hook(prog_type); - ret = btf_populate_kfunc_set(btf, hook, kset); + ret = btf_populate_kfunc_set(btf, hook, kset->set); btf_put(btf); return ret; } diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index bfeb9b937315..c1e10d088dbb 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -652,12 +652,6 @@ static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp) return fp->jited && !bpf_prog_was_classic(fp); } -static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) -{ - return list_empty(&fp->aux->ksym.lnode) || - fp->aux->ksym.lnode.prev == LIST_POISON2; -} - void bpf_prog_kallsyms_add(struct bpf_prog *fp) { if (!bpf_prog_kallsyms_candidate(fp) || @@ -833,15 +827,6 @@ struct bpf_prog_pack { #define BPF_PROG_SIZE_TO_NBITS(size) (round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE) -static size_t bpf_prog_pack_size = -1; -static size_t bpf_prog_pack_mask = -1; - -static int bpf_prog_chunk_count(void) -{ - WARN_ON_ONCE(bpf_prog_pack_size == -1); - return bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE; -} - static DEFINE_MUTEX(pack_mutex); static LIST_HEAD(pack_list); @@ -849,55 +834,33 @@ static LIST_HEAD(pack_list); * CONFIG_MMU=n. Use PAGE_SIZE in these cases. */ #ifdef PMD_SIZE -#define BPF_HPAGE_SIZE PMD_SIZE -#define BPF_HPAGE_MASK PMD_MASK +#define BPF_PROG_PACK_SIZE (PMD_SIZE * num_possible_nodes()) #else -#define BPF_HPAGE_SIZE PAGE_SIZE -#define BPF_HPAGE_MASK PAGE_MASK +#define BPF_PROG_PACK_SIZE PAGE_SIZE #endif -static size_t select_bpf_prog_pack_size(void) -{ - size_t size; - void *ptr; - - size = BPF_HPAGE_SIZE * num_online_nodes(); - ptr = module_alloc(size); - - /* Test whether we can get huge pages. If not just use PAGE_SIZE - * packs. - */ - if (!ptr || !is_vm_area_hugepages(ptr)) { - size = PAGE_SIZE; - bpf_prog_pack_mask = PAGE_MASK; - } else { - bpf_prog_pack_mask = BPF_HPAGE_MASK; - } - - vfree(ptr); - return size; -} +#define BPF_PROG_CHUNK_COUNT (BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE) static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns) { struct bpf_prog_pack *pack; - pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(bpf_prog_chunk_count())), + pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)), GFP_KERNEL); if (!pack) return NULL; - pack->ptr = module_alloc(bpf_prog_pack_size); + pack->ptr = module_alloc(BPF_PROG_PACK_SIZE); if (!pack->ptr) { kfree(pack); return NULL; } - bpf_fill_ill_insns(pack->ptr, bpf_prog_pack_size); - bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE); + bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE); + bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE); list_add_tail(&pack->list, &pack_list); set_vm_flush_reset_perms(pack->ptr); - set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE); - set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE); + set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE); + set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE); return pack; } @@ -909,10 +872,7 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn void *ptr = NULL; mutex_lock(&pack_mutex); - if (bpf_prog_pack_size == -1) - bpf_prog_pack_size = select_bpf_prog_pack_size(); - - if (size > bpf_prog_pack_size) { + if (size > BPF_PROG_PACK_SIZE) { size = round_up(size, PAGE_SIZE); ptr = module_alloc(size); if (ptr) { @@ -924,9 +884,9 @@ static void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insn goto out; } list_for_each_entry(pack, &pack_list, list) { - pos = bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0, + pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0, nbits, 0); - if (pos < bpf_prog_chunk_count()) + if (pos < BPF_PROG_CHUNK_COUNT) goto found_free_area; } @@ -950,18 +910,15 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr) struct bpf_prog_pack *pack = NULL, *tmp; unsigned int nbits; unsigned long pos; - void *pack_ptr; mutex_lock(&pack_mutex); - if (hdr->size > bpf_prog_pack_size) { + if (hdr->size > BPF_PROG_PACK_SIZE) { module_memfree(hdr); goto out; } - pack_ptr = (void *)((unsigned long)hdr & bpf_prog_pack_mask); - list_for_each_entry(tmp, &pack_list, list) { - if (tmp->ptr == pack_ptr) { + if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) { pack = tmp; break; } @@ -971,14 +928,14 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr) goto out; nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size); - pos = ((unsigned long)hdr - (unsigned long)pack_ptr) >> BPF_PROG_CHUNK_SHIFT; + pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT; WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size), "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n"); bitmap_clear(pack->bitmap, pos, nbits); - if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0, - bpf_prog_chunk_count(), 0) == 0) { + if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0, + BPF_PROG_CHUNK_COUNT, 0) == 0) { list_del(&pack->list); module_memfree(pack->ptr); kfree(pack); @@ -1155,7 +1112,6 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog, bpf_prog_pack_free(ro_header); return PTR_ERR(ptr); } - prog->aux->use_bpf_prog_pack = true; return 0; } @@ -1179,17 +1135,23 @@ void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header, bpf_jit_uncharge_modmem(size); } +struct bpf_binary_header * +bpf_jit_binary_pack_hdr(const struct bpf_prog *fp) +{ + unsigned long real_start = (unsigned long)fp->bpf_func; + unsigned long addr; + + addr = real_start & BPF_PROG_CHUNK_MASK; + return (void *)addr; +} + static inline struct bpf_binary_header * bpf_jit_binary_hdr(const struct bpf_prog *fp) { unsigned long real_start = (unsigned long)fp->bpf_func; unsigned long addr; - if (fp->aux->use_bpf_prog_pack) - addr = real_start & BPF_PROG_CHUNK_MASK; - else - addr = real_start & PAGE_MASK; - + addr = real_start & PAGE_MASK; return (void *)addr; } @@ -1202,11 +1164,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp) if (fp->jited) { struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp); - if (fp->aux->use_bpf_prog_pack) - bpf_jit_binary_pack_free(hdr, NULL /* rw_buffer */); - else - bpf_jit_binary_free(hdr); - + bpf_jit_binary_free(hdr); WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp)); } diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index c2867068e5bd..1400561efb15 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -845,7 +845,7 @@ static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net, struct bpf_dtab_netdev *dev; dev = bpf_map_kmalloc_node(&dtab->map, sizeof(*dev), - GFP_ATOMIC | __GFP_NOWARN, + GFP_NOWAIT | __GFP_NOWARN, dtab->map.numa_node); if (!dev) return ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 17fb69c0e0dc..da7578426a46 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -61,7 +61,7 @@ * * As regular device interrupt handlers and soft interrupts are forced into * thread context, the existing code which does - * spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); + * spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*(); * just works. * * In theory the BPF locks could be converted to regular spinlocks as well, @@ -978,7 +978,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, goto dec_count; } l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size, - GFP_ATOMIC | __GFP_NOWARN, + GFP_NOWAIT | __GFP_NOWARN, htab->map.numa_node); if (!l_new) { l_new = ERR_PTR(-ENOMEM); @@ -996,7 +996,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, } else { /* alloc_percpu zero-fills */ pptr = bpf_map_alloc_percpu(&htab->map, size, 8, - GFP_ATOMIC | __GFP_NOWARN); + GFP_NOWAIT | __GFP_NOWARN); if (!pptr) { kfree(l_new); l_new = ERR_PTR(-ENOMEM); diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 8654fc97f5fe..49ef0ce040c7 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -165,7 +165,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *key, } new = bpf_map_kmalloc_node(map, struct_size(new, data, map->value_size), - __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN, + __GFP_ZERO | GFP_NOWAIT | __GFP_NOWARN, map->numa_node); if (!new) return -ENOMEM; diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index f0d05a3cc4b9..d789e3b831ad 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -285,7 +285,7 @@ static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie, if (value) size += trie->map.value_size; - node = bpf_map_kmalloc_node(&trie->map, size, GFP_ATOMIC | __GFP_NOWARN, + node = bpf_map_kmalloc_node(&trie->map, size, GFP_NOWAIT | __GFP_NOWARN, trie->map.numa_node); if (!node) return NULL; diff --git a/kernel/bpf/preload/iterators/Makefile b/kernel/bpf/preload/iterators/Makefile index bfe24f8c5a20..6762b1260f2f 100644 --- a/kernel/bpf/preload/iterators/Makefile +++ b/kernel/bpf/preload/iterators/Makefile @@ -9,7 +9,7 @@ LLVM_STRIP ?= llvm-strip TOOLS_PATH := $(abspath ../../../../tools) BPFTOOL_SRC := $(TOOLS_PATH)/bpf/bpftool BPFTOOL_OUTPUT := $(abs_out)/bpftool -DEFAULT_BPFTOOL := $(OUTPUT)/sbin/bpftool +DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)/bootstrap/bpftool BPFTOOL ?= $(DEFAULT_BPFTOOL) LIBBPF_SRC := $(TOOLS_PATH)/lib/bpf @@ -61,9 +61,5 @@ $(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(LIBBPF_OU OUTPUT=$(abspath $(dir $@))/ prefix= \ DESTDIR=$(LIBBPF_DESTDIR) $(abspath $@) install_headers -$(DEFAULT_BPFTOOL): $(BPFOBJ) | $(BPFTOOL_OUTPUT) - $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOL_SRC) \ - OUTPUT=$(BPFTOOL_OUTPUT)/ \ - LIBBPF_OUTPUT=$(LIBBPF_OUTPUT)/ \ - LIBBPF_DESTDIR=$(LIBBPF_DESTDIR)/ \ - prefix= DESTDIR=$(abs_out)/ install-bin +$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOL_SRC) OUTPUT=$(BPFTOOL_OUTPUT)/ bootstrap diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index ab688d85b2c6..83c7136c5788 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -419,35 +419,53 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock) #ifdef CONFIG_MEMCG_KMEM static void bpf_map_save_memcg(struct bpf_map *map) { - map->memcg = get_mem_cgroup_from_mm(current->mm); + /* Currently if a map is created by a process belonging to the root + * memory cgroup, get_obj_cgroup_from_current() will return NULL. + * So we have to check map->objcg for being NULL each time it's + * being used. + */ + map->objcg = get_obj_cgroup_from_current(); } static void bpf_map_release_memcg(struct bpf_map *map) { - mem_cgroup_put(map->memcg); + if (map->objcg) + obj_cgroup_put(map->objcg); +} + +static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map) +{ + if (map->objcg) + return get_mem_cgroup_from_objcg(map->objcg); + + return root_mem_cgroup; } void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags, int node) { - struct mem_cgroup *old_memcg; + struct mem_cgroup *memcg, *old_memcg; void *ptr; - old_memcg = set_active_memcg(map->memcg); + memcg = bpf_map_get_memcg(map); + old_memcg = set_active_memcg(memcg); ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node); set_active_memcg(old_memcg); + mem_cgroup_put(memcg); return ptr; } void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags) { - struct mem_cgroup *old_memcg; + struct mem_cgroup *memcg, *old_memcg; void *ptr; - old_memcg = set_active_memcg(map->memcg); + memcg = bpf_map_get_memcg(map); + old_memcg = set_active_memcg(memcg); ptr = kzalloc(size, flags | __GFP_ACCOUNT); set_active_memcg(old_memcg); + mem_cgroup_put(memcg); return ptr; } @@ -455,12 +473,14 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags) void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align, gfp_t flags) { - struct mem_cgroup *old_memcg; + struct mem_cgroup *memcg, *old_memcg; void __percpu *ptr; - old_memcg = set_active_memcg(map->memcg); + memcg = bpf_map_get_memcg(map); + old_memcg = set_active_memcg(memcg); ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT); set_active_memcg(old_memcg); + mem_cgroup_put(memcg); return ptr; } diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 6cd226584c33..42e387a12694 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -13,6 +13,7 @@ #include <linux/static_call.h> #include <linux/bpf_verifier.h> #include <linux/bpf_lsm.h> +#include <linux/delay.h> /* dummy _ops. The verifier will operate on target program's ops. */ const struct bpf_verifier_ops bpf_extension_verifier_ops = { @@ -29,6 +30,81 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE]; /* serializes access to trampoline_table */ static DEFINE_MUTEX(trampoline_mutex); +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex); + +static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, enum ftrace_ops_cmd cmd) +{ + struct bpf_trampoline *tr = ops->private; + int ret = 0; + + if (cmd == FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF) { + /* This is called inside register_ftrace_direct_multi(), so + * tr->mutex is already locked. + */ + lockdep_assert_held_once(&tr->mutex); + + /* Instead of updating the trampoline here, we propagate + * -EAGAIN to register_ftrace_direct_multi(). Then we can + * retry register_ftrace_direct_multi() after updating the + * trampoline. + */ + if ((tr->flags & BPF_TRAMP_F_CALL_ORIG) && + !(tr->flags & BPF_TRAMP_F_ORIG_STACK)) { + if (WARN_ON_ONCE(tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY)) + return -EBUSY; + + tr->flags |= BPF_TRAMP_F_SHARE_IPMODIFY; + return -EAGAIN; + } + + return 0; + } + + /* The normal locking order is + * tr->mutex => direct_mutex (ftrace.c) => ftrace_lock (ftrace.c) + * + * The following two commands are called from + * + * prepare_direct_functions_for_ipmodify + * cleanup_direct_functions_after_ipmodify + * + * In both cases, direct_mutex is already locked. Use + * mutex_trylock(&tr->mutex) to avoid deadlock in race condition + * (something else is making changes to this same trampoline). + */ + if (!mutex_trylock(&tr->mutex)) { + /* sleep 1 ms to make sure whatever holding tr->mutex makes + * some progress. + */ + msleep(1); + return -EAGAIN; + } + + switch (cmd) { + case FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER: + tr->flags |= BPF_TRAMP_F_SHARE_IPMODIFY; + + if ((tr->flags & BPF_TRAMP_F_CALL_ORIG) && + !(tr->flags & BPF_TRAMP_F_ORIG_STACK)) + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */); + break; + case FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER: + tr->flags &= ~BPF_TRAMP_F_SHARE_IPMODIFY; + + if (tr->flags & BPF_TRAMP_F_ORIG_STACK) + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */); + break; + default: + ret = -EINVAL; + break; + }; + + mutex_unlock(&tr->mutex); + return ret; +} +#endif + bool bpf_prog_has_trampoline(const struct bpf_prog *prog) { enum bpf_attach_type eatype = prog->expected_attach_type; @@ -89,6 +165,16 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key) tr = kzalloc(sizeof(*tr), GFP_KERNEL); if (!tr) goto out; +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS + tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL); + if (!tr->fops) { + kfree(tr); + tr = NULL; + goto out; + } + tr->fops->private = tr; + tr->fops->ops_func = bpf_tramp_ftrace_ops_func; +#endif tr->key = key; INIT_HLIST_NODE(&tr->hlist); @@ -128,7 +214,7 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr) int ret; if (tr->func.ftrace_managed) - ret = unregister_ftrace_direct((long)ip, (long)old_addr); + ret = unregister_ftrace_direct_multi(tr->fops, (long)old_addr); else ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, NULL); @@ -137,15 +223,20 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr) return ret; } -static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr) +static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr, + bool lock_direct_mutex) { void *ip = tr->func.addr; int ret; - if (tr->func.ftrace_managed) - ret = modify_ftrace_direct((long)ip, (long)old_addr, (long)new_addr); - else + if (tr->func.ftrace_managed) { + if (lock_direct_mutex) + ret = modify_ftrace_direct_multi(tr->fops, (long)new_addr); + else + ret = modify_ftrace_direct_multi_nolock(tr->fops, (long)new_addr); + } else { ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, new_addr); + } return ret; } @@ -163,10 +254,12 @@ static int register_fentry(struct bpf_trampoline *tr, void *new_addr) if (bpf_trampoline_module_get(tr)) return -ENOENT; - if (tr->func.ftrace_managed) - ret = register_ftrace_direct((long)ip, (long)new_addr); - else + if (tr->func.ftrace_managed) { + ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 0); + ret = register_ftrace_direct_multi(tr->fops, (long)new_addr); + } else { ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr); + } if (ret) bpf_trampoline_module_put(tr); @@ -332,11 +425,11 @@ out: return ERR_PTR(err); } -static int bpf_trampoline_update(struct bpf_trampoline *tr) +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex) { struct bpf_tramp_image *im; struct bpf_tramp_links *tlinks; - u32 flags = BPF_TRAMP_F_RESTORE_REGS; + u32 orig_flags = tr->flags; bool ip_arg = false; int err, total; @@ -358,15 +451,31 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) goto out; } + /* clear all bits except SHARE_IPMODIFY */ + tr->flags &= BPF_TRAMP_F_SHARE_IPMODIFY; + if (tlinks[BPF_TRAMP_FEXIT].nr_links || - tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links) - flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; + tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links) { + /* NOTE: BPF_TRAMP_F_RESTORE_REGS and BPF_TRAMP_F_SKIP_FRAME + * should not be set together. + */ + tr->flags |= BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; + } else { + tr->flags |= BPF_TRAMP_F_RESTORE_REGS; + } if (ip_arg) - flags |= BPF_TRAMP_F_IP_ARG; + tr->flags |= BPF_TRAMP_F_IP_ARG; + +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +again: + if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) && + (tr->flags & BPF_TRAMP_F_CALL_ORIG)) + tr->flags |= BPF_TRAMP_F_ORIG_STACK; +#endif err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE, - &tr->func.model, flags, tlinks, + &tr->func.model, tr->flags, tlinks, tr->func.addr); if (err < 0) goto out; @@ -375,17 +484,34 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) WARN_ON(!tr->cur_image && tr->selector); if (tr->cur_image) /* progs already running at this address */ - err = modify_fentry(tr, tr->cur_image->image, im->image); + err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex); else /* first time registering */ err = register_fentry(tr, im->image); + +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS + if (err == -EAGAIN) { + /* -EAGAIN from bpf_tramp_ftrace_ops_func. Now + * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the + * trampoline again, and retry register. + */ + /* reset fops->func and fops->trampoline for re-register */ + tr->fops->func = NULL; + tr->fops->trampoline = 0; + goto again; + } +#endif if (err) goto out; + if (tr->cur_image) bpf_tramp_image_put(tr->cur_image); tr->cur_image = im; tr->selector++; out: + /* If any error happens, restore previous flags */ + if (err) + tr->flags = orig_flags; kfree(tlinks); return err; } @@ -451,7 +577,7 @@ static int __bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_tr hlist_add_head(&link->tramp_hlist, &tr->progs_hlist[kind]); tr->progs_cnt[kind]++; - err = bpf_trampoline_update(tr); + err = bpf_trampoline_update(tr, true /* lock_direct_mutex */); if (err) { hlist_del_init(&link->tramp_hlist); tr->progs_cnt[kind]--; @@ -484,7 +610,7 @@ static int __bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_ } hlist_del_init(&link->tramp_hlist); tr->progs_cnt[kind]--; - return bpf_trampoline_update(tr); + return bpf_trampoline_update(tr, true /* lock_direct_mutex */); } /* bpf_trampoline_unlink_prog() should never fail. */ @@ -498,7 +624,7 @@ int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampolin return err; } -#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) +#if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM) static void bpf_shim_tramp_link_release(struct bpf_link *link) { struct bpf_shim_tramp_link *shim_link = @@ -712,6 +838,7 @@ void bpf_trampoline_put(struct bpf_trampoline *tr) * multiple rcu callbacks. */ hlist_del(&tr->hlist); + kfree(tr->fops); kfree(tr); out: mutex_unlock(&trampoline_mutex); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 328cfab3af60..096fdac70165 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5533,17 +5533,6 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type) type == ARG_CONST_SIZE_OR_ZERO; } -static bool arg_type_is_alloc_size(enum bpf_arg_type type) -{ - return type == ARG_CONST_ALLOC_SIZE_OR_ZERO; -} - -static bool arg_type_is_int_ptr(enum bpf_arg_type type) -{ - return type == ARG_PTR_TO_INT || - type == ARG_PTR_TO_LONG; -} - static bool arg_type_is_release(enum bpf_arg_type type) { return type & OBJ_RELEASE; @@ -5929,7 +5918,8 @@ skip_type_check: meta->ref_obj_id = reg->ref_obj_id; } - if (arg_type == ARG_CONST_MAP_PTR) { + switch (base_type(arg_type)) { + case ARG_CONST_MAP_PTR: /* bpf_map_xxx(map_ptr) call: remember that map_ptr */ if (meta->map_ptr) { /* Use map_uid (which is unique id of inner map) to reject: @@ -5954,7 +5944,8 @@ skip_type_check: } meta->map_ptr = reg->map_ptr; meta->map_uid = reg->map_uid; - } else if (arg_type == ARG_PTR_TO_MAP_KEY) { + break; + case ARG_PTR_TO_MAP_KEY: /* bpf_map_xxx(..., map_ptr, ..., key) call: * check that [key, key + map->key_size) are within * stack limits and initialized @@ -5971,7 +5962,8 @@ skip_type_check: err = check_helper_mem_access(env, regno, meta->map_ptr->key_size, false, NULL); - } else if (base_type(arg_type) == ARG_PTR_TO_MAP_VALUE) { + break; + case ARG_PTR_TO_MAP_VALUE: if (type_may_be_null(arg_type) && register_is_null(reg)) return 0; @@ -5987,14 +5979,16 @@ skip_type_check: err = check_helper_mem_access(env, regno, meta->map_ptr->value_size, false, meta); - } else if (arg_type == ARG_PTR_TO_PERCPU_BTF_ID) { + break; + case ARG_PTR_TO_PERCPU_BTF_ID: if (!reg->btf_id) { verbose(env, "Helper has invalid btf_id in R%d\n", regno); return -EACCES; } meta->ret_btf = reg->btf; meta->ret_btf_id = reg->btf_id; - } else if (arg_type == ARG_PTR_TO_SPIN_LOCK) { + break; + case ARG_PTR_TO_SPIN_LOCK: if (meta->func_id == BPF_FUNC_spin_lock) { if (process_spin_lock(env, regno, true)) return -EACCES; @@ -6005,12 +5999,15 @@ skip_type_check: verbose(env, "verifier internal error\n"); return -EFAULT; } - } else if (arg_type == ARG_PTR_TO_TIMER) { + break; + case ARG_PTR_TO_TIMER: if (process_timer_func(env, regno, meta)) return -EACCES; - } else if (arg_type == ARG_PTR_TO_FUNC) { + break; + case ARG_PTR_TO_FUNC: meta->subprogno = reg->subprogno; - } else if (base_type(arg_type) == ARG_PTR_TO_MEM) { + break; + case ARG_PTR_TO_MEM: /* The access to this pointer is only checked when we hit the * next is_mem_size argument below. */ @@ -6020,11 +6017,14 @@ skip_type_check: fn->arg_size[arg], false, meta); } - } else if (arg_type_is_mem_size(arg_type)) { - bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO); - - err = check_mem_size_reg(env, reg, regno, zero_size_allowed, meta); - } else if (arg_type_is_dynptr(arg_type)) { + break; + case ARG_CONST_SIZE: + err = check_mem_size_reg(env, reg, regno, false, meta); + break; + case ARG_CONST_SIZE_OR_ZERO: + err = check_mem_size_reg(env, reg, regno, true, meta); + break; + case ARG_PTR_TO_DYNPTR: if (arg_type & MEM_UNINIT) { if (!is_dynptr_reg_valid_uninit(env, reg)) { verbose(env, "Dynptr has to be an uninitialized dynptr\n"); @@ -6058,21 +6058,28 @@ skip_type_check: err_extra, arg + 1); return -EINVAL; } - } else if (arg_type_is_alloc_size(arg_type)) { + break; + case ARG_CONST_ALLOC_SIZE_OR_ZERO: if (!tnum_is_const(reg->var_off)) { verbose(env, "R%d is not a known constant'\n", regno); return -EACCES; } meta->mem_size = reg->var_off.value; - } else if (arg_type_is_int_ptr(arg_type)) { + break; + case ARG_PTR_TO_INT: + case ARG_PTR_TO_LONG: + { int size = int_ptr_type_to_size(arg_type); err = check_helper_mem_access(env, regno, size, false, meta); if (err) return err; err = check_ptr_alignment(env, reg, 0, size, true); - } else if (arg_type == ARG_PTR_TO_CONST_STR) { + break; + } + case ARG_PTR_TO_CONST_STR: + { struct bpf_map *map = reg->map_ptr; int map_off; u64 map_addr; @@ -6111,9 +6118,12 @@ skip_type_check: verbose(env, "string is not zero-terminated\n"); return -EINVAL; } - } else if (arg_type == ARG_PTR_TO_KPTR) { + break; + } + case ARG_PTR_TO_KPTR: if (process_kptr_func(env, regno, meta)) return -EACCES; + break; } return err; @@ -7160,6 +7170,7 @@ static void update_loop_inline_state(struct bpf_verifier_env *env, u32 subprogno static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn, int *insn_idx_p) { + enum bpf_prog_type prog_type = resolve_prog_type(env->prog); const struct bpf_func_proto *fn = NULL; enum bpf_return_type ret_type; enum bpf_type_flag ret_flag; @@ -7321,7 +7332,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn } break; case BPF_FUNC_set_retval: - if (env->prog->expected_attach_type == BPF_LSM_CGROUP) { + if (prog_type == BPF_PROG_TYPE_LSM && + env->prog->expected_attach_type == BPF_LSM_CGROUP) { if (!env->prog->aux->attach_func_proto->type) { /* Make sure programs that attach to void * hooks don't try to modify return value. @@ -7550,6 +7562,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, int err, insn_idx = *insn_idx_p; const struct btf_param *args; struct btf *desc_btf; + u32 *kfunc_flags; bool acq; /* skip for now, but return error when we find this in fixup_kfunc_call */ @@ -7565,18 +7578,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, func_name = btf_name_by_offset(desc_btf, func->name_off); func_proto = btf_type_by_id(desc_btf, func->type); - if (!btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), - BTF_KFUNC_TYPE_CHECK, func_id)) { + kfunc_flags = btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), func_id); + if (!kfunc_flags) { verbose(env, "calling kernel function %s is not allowed\n", func_name); return -EACCES; } - - acq = btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), - BTF_KFUNC_TYPE_ACQUIRE, func_id); + acq = *kfunc_flags & KF_ACQUIRE; /* Check the arguments */ - err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs); + err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, *kfunc_flags); if (err < 0) return err; /* In case of release function, we get register number of refcounted @@ -7620,8 +7631,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, regs[BPF_REG_0].btf = desc_btf; regs[BPF_REG_0].type = PTR_TO_BTF_ID; regs[BPF_REG_0].btf_id = ptr_type_id; - if (btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), - BTF_KFUNC_TYPE_RET_NULL, func_id)) { + if (*kfunc_flags & KF_RET_NULL) { regs[BPF_REG_0].type |= PTR_MAYBE_NULL; /* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */ regs[BPF_REG_0].id = ++env->id_gen; @@ -12562,6 +12572,7 @@ static bool is_tracing_prog_type(enum bpf_prog_type type) case BPF_PROG_TYPE_TRACEPOINT: case BPF_PROG_TYPE_PERF_EVENT: case BPF_PROG_TYPE_RAW_TRACEPOINT: + case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: return true; default: return false; @@ -13620,6 +13631,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) /* Below members will be freed only at prog->aux */ func[i]->aux->btf = prog->aux->btf; func[i]->aux->func_info = prog->aux->func_info; + func[i]->aux->func_info_cnt = prog->aux->func_info_cnt; func[i]->aux->poke_tab = prog->aux->poke_tab; func[i]->aux->size_poke_tab = prog->aux->size_poke_tab; @@ -13632,9 +13644,6 @@ static int jit_subprogs(struct bpf_verifier_env *env) poke->aux = func[i]->aux; } - /* Use bpf_prog_F_tag to indicate functions in stack traces. - * Long term would need debug info to populate names - */ func[i]->aux->name[0] = 'F'; func[i]->aux->stack_depth = env->subprog_info[i].stack_depth; func[i]->jit_requested = 1; diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index fbdf8d3279ac..79a85834ce9d 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -30,6 +30,7 @@ #include <linux/module.h> #include <linux/kernel.h> #include <linux/bsearch.h> +#include <linux/btf_ids.h> /* * These will be re-linked against their real values @@ -799,6 +800,96 @@ static const struct seq_operations kallsyms_op = { .show = s_show }; +#ifdef CONFIG_BPF_SYSCALL + +struct bpf_iter__ksym { + __bpf_md_ptr(struct bpf_iter_meta *, meta); + __bpf_md_ptr(struct kallsym_iter *, ksym); +}; + +static int ksym_prog_seq_show(struct seq_file *m, bool in_stop) +{ + struct bpf_iter__ksym ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = m; + prog = bpf_iter_get_info(&meta, in_stop); + if (!prog) + return 0; + + ctx.meta = &meta; + ctx.ksym = m ? m->private : NULL; + return bpf_iter_run_prog(prog, &ctx); +} + +static int bpf_iter_ksym_seq_show(struct seq_file *m, void *p) +{ + return ksym_prog_seq_show(m, false); +} + +static void bpf_iter_ksym_seq_stop(struct seq_file *m, void *p) +{ + if (!p) + (void) ksym_prog_seq_show(m, true); + else + s_stop(m, p); +} + +static const struct seq_operations bpf_iter_ksym_ops = { + .start = s_start, + .next = s_next, + .stop = bpf_iter_ksym_seq_stop, + .show = bpf_iter_ksym_seq_show, +}; + +static int bpf_iter_ksym_init(void *priv_data, struct bpf_iter_aux_info *aux) +{ + struct kallsym_iter *iter = priv_data; + + reset_iter(iter, 0); + + /* cache here as in kallsyms_open() case; use current process + * credentials to tell BPF iterators if values should be shown. + */ + iter->show_value = kallsyms_show_value(current_cred()); + + return 0; +} + +DEFINE_BPF_ITER_FUNC(ksym, struct bpf_iter_meta *meta, struct kallsym_iter *ksym) + +static const struct bpf_iter_seq_info ksym_iter_seq_info = { + .seq_ops = &bpf_iter_ksym_ops, + .init_seq_private = bpf_iter_ksym_init, + .fini_seq_private = NULL, + .seq_priv_size = sizeof(struct kallsym_iter), +}; + +static struct bpf_iter_reg ksym_iter_reg_info = { + .target = "ksym", + .feature = BPF_ITER_RESCHED, + .ctx_arg_info_size = 1, + .ctx_arg_info = { + { offsetof(struct bpf_iter__ksym, ksym), + PTR_TO_BTF_ID_OR_NULL }, + }, + .seq_info = &ksym_iter_seq_info, +}; + +BTF_ID_LIST(btf_ksym_iter_id) +BTF_ID(struct, kallsym_iter) + +static int __init bpf_ksym_iter_register(void) +{ + ksym_iter_reg_info.ctx_arg_info[0].btf_id = *btf_ksym_iter_id; + return bpf_iter_reg_target(&ksym_iter_reg_info); +} + +late_initcall(bpf_ksym_iter_register); + +#endif /* CONFIG_BPF_SYSCALL */ + static inline int kallsyms_for_perf(void) { #ifdef CONFIG_PERF_EVENTS diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 601ccf1b2f09..bc921a3f7ea8 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1861,6 +1861,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops, ftrace_hash_rec_update_modify(ops, filter_hash, 1); } +static bool ops_references_ip(struct ftrace_ops *ops, unsigned long ip); + /* * Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK * or no-needed to update, -EBUSY if it detects a conflict of the flag @@ -1869,6 +1871,13 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops, * - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected) * - If the hash is EMPTY_HASH, it hits nothing * - Anything else hits the recs which match the hash entries. + * + * DIRECT ops does not have IPMODIFY flag, but we still need to check it + * against functions with FTRACE_FL_IPMODIFY. If there is any overlap, call + * ops_func(SHARE_IPMODIFY_SELF) to make sure current ops can share with + * IPMODIFY. If ops_func(SHARE_IPMODIFY_SELF) returns non-zero, propagate + * the return value to the caller and eventually to the owner of the DIRECT + * ops. */ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops, struct ftrace_hash *old_hash, @@ -1877,17 +1886,26 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops, struct ftrace_page *pg; struct dyn_ftrace *rec, *end = NULL; int in_old, in_new; + bool is_ipmodify, is_direct; /* Only update if the ops has been registered */ if (!(ops->flags & FTRACE_OPS_FL_ENABLED)) return 0; - if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY)) + is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY; + is_direct = ops->flags & FTRACE_OPS_FL_DIRECT; + + /* neither IPMODIFY nor DIRECT, skip */ + if (!is_ipmodify && !is_direct) + return 0; + + if (WARN_ON_ONCE(is_ipmodify && is_direct)) return 0; /* - * Since the IPMODIFY is a very address sensitive action, we do not - * allow ftrace_ops to set all functions to new hash. + * Since the IPMODIFY and DIRECT are very address sensitive + * actions, we do not allow ftrace_ops to set all functions to new + * hash. */ if (!new_hash || !old_hash) return -EINVAL; @@ -1905,12 +1923,32 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops, continue; if (in_new) { - /* New entries must ensure no others are using it */ - if (rec->flags & FTRACE_FL_IPMODIFY) - goto rollback; - rec->flags |= FTRACE_FL_IPMODIFY; - } else /* Removed entry */ + if (rec->flags & FTRACE_FL_IPMODIFY) { + int ret; + + /* Cannot have two ipmodify on same rec */ + if (is_ipmodify) + goto rollback; + + FTRACE_WARN_ON(rec->flags & FTRACE_FL_DIRECT); + + /* + * Another ops with IPMODIFY is already + * attached. We are now attaching a direct + * ops. Run SHARE_IPMODIFY_SELF, to check + * whether sharing is supported. + */ + if (!ops->ops_func) + return -EBUSY; + ret = ops->ops_func(ops, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_SELF); + if (ret) + return ret; + } else if (is_ipmodify) { + rec->flags |= FTRACE_FL_IPMODIFY; + } + } else if (is_ipmodify) { rec->flags &= ~FTRACE_FL_IPMODIFY; + } } while_for_each_ftrace_rec(); return 0; @@ -2454,8 +2492,7 @@ static void call_direct_funcs(unsigned long ip, unsigned long pip, struct ftrace_ops direct_ops = { .func = call_direct_funcs, - .flags = FTRACE_OPS_FL_IPMODIFY - | FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS + .flags = FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_PERMANENT, /* * By declaring the main trampoline as this trampoline @@ -3072,14 +3109,14 @@ static inline int ops_traces_mod(struct ftrace_ops *ops) } /* - * Check if the current ops references the record. + * Check if the current ops references the given ip. * * If the ops traces all functions, then it was already accounted for. * If the ops does not trace the current record function, skip it. * If the ops ignores the function via notrace filter, skip it. */ -static inline bool -ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec) +static bool +ops_references_ip(struct ftrace_ops *ops, unsigned long ip) { /* If ops isn't enabled, ignore it */ if (!(ops->flags & FTRACE_OPS_FL_ENABLED)) @@ -3091,16 +3128,29 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec) /* The function must be in the filter */ if (!ftrace_hash_empty(ops->func_hash->filter_hash) && - !__ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip)) + !__ftrace_lookup_ip(ops->func_hash->filter_hash, ip)) return false; /* If in notrace hash, we ignore it too */ - if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip)) + if (ftrace_lookup_ip(ops->func_hash->notrace_hash, ip)) return false; return true; } +/* + * Check if the current ops references the record. + * + * If the ops traces all functions, then it was already accounted for. + * If the ops does not trace the current record function, skip it. + * If the ops ignores the function via notrace filter, skip it. + */ +static bool +ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec) +{ + return ops_references_ip(ops, rec->ip); +} + static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs) { bool init_nop = ftrace_need_init_nop(); @@ -5215,6 +5265,8 @@ static struct ftrace_direct_func *ftrace_alloc_direct_func(unsigned long addr) return direct; } +static int register_ftrace_function_nolock(struct ftrace_ops *ops); + /** * register_ftrace_direct - Call a custom trampoline directly * @ip: The address of the nop at the beginning of a function @@ -5286,7 +5338,7 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr) ret = ftrace_set_filter_ip(&direct_ops, ip, 0, 0); if (!ret && !(direct_ops.flags & FTRACE_OPS_FL_ENABLED)) { - ret = register_ftrace_function(&direct_ops); + ret = register_ftrace_function_nolock(&direct_ops); if (ret) ftrace_set_filter_ip(&direct_ops, ip, 1, 0); } @@ -5545,8 +5597,7 @@ int modify_ftrace_direct(unsigned long ip, } EXPORT_SYMBOL_GPL(modify_ftrace_direct); -#define MULTI_FLAGS (FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_DIRECT | \ - FTRACE_OPS_FL_SAVE_REGS) +#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS) static int check_direct_multi(struct ftrace_ops *ops) { @@ -5639,7 +5690,7 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) ops->flags = MULTI_FLAGS; ops->trampoline = FTRACE_REGS_ADDR; - err = register_ftrace_function(ops); + err = register_ftrace_function_nolock(ops); out_remove: if (err) @@ -5691,22 +5742,8 @@ int unregister_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) } EXPORT_SYMBOL_GPL(unregister_ftrace_direct_multi); -/** - * modify_ftrace_direct_multi - Modify an existing direct 'multi' call - * to call something else - * @ops: The address of the struct ftrace_ops object - * @addr: The address of the new trampoline to call at @ops functions - * - * This is used to unregister currently registered direct caller and - * register new one @addr on functions registered in @ops object. - * - * Note there's window between ftrace_shutdown and ftrace_startup calls - * where there will be no callbacks called. - * - * Returns: zero on success. Non zero on error, which includes: - * -EINVAL - The @ops object was not properly registered. - */ -int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) +static int +__modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) { struct ftrace_hash *hash; struct ftrace_func_entry *entry, *iter; @@ -5717,20 +5754,15 @@ int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) int i, size; int err; - if (check_direct_multi(ops)) - return -EINVAL; - if (!(ops->flags & FTRACE_OPS_FL_ENABLED)) - return -EINVAL; - - mutex_lock(&direct_mutex); + lockdep_assert_held_once(&direct_mutex); /* Enable the tmp_ops to have the same functions as the direct ops */ ftrace_ops_init(&tmp_ops); tmp_ops.func_hash = ops->func_hash; - err = register_ftrace_function(&tmp_ops); + err = register_ftrace_function_nolock(&tmp_ops); if (err) - goto out_direct; + return err; /* * Now the ftrace_ops_list_func() is called to do the direct callers. @@ -5754,7 +5786,64 @@ int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) /* Removing the tmp_ops will add the updated direct callers to the functions */ unregister_ftrace_function(&tmp_ops); - out_direct: + return err; +} + +/** + * modify_ftrace_direct_multi_nolock - Modify an existing direct 'multi' call + * to call something else + * @ops: The address of the struct ftrace_ops object + * @addr: The address of the new trampoline to call at @ops functions + * + * This is used to unregister currently registered direct caller and + * register new one @addr on functions registered in @ops object. + * + * Note there's window between ftrace_shutdown and ftrace_startup calls + * where there will be no callbacks called. + * + * Caller should already have direct_mutex locked, so we don't lock + * direct_mutex here. + * + * Returns: zero on success. Non zero on error, which includes: + * -EINVAL - The @ops object was not properly registered. + */ +int modify_ftrace_direct_multi_nolock(struct ftrace_ops *ops, unsigned long addr) +{ + if (check_direct_multi(ops)) + return -EINVAL; + if (!(ops->flags & FTRACE_OPS_FL_ENABLED)) + return -EINVAL; + + return __modify_ftrace_direct_multi(ops, addr); +} +EXPORT_SYMBOL_GPL(modify_ftrace_direct_multi_nolock); + +/** + * modify_ftrace_direct_multi - Modify an existing direct 'multi' call + * to call something else + * @ops: The address of the struct ftrace_ops object + * @addr: The address of the new trampoline to call at @ops functions + * + * This is used to unregister currently registered direct caller and + * register new one @addr on functions registered in @ops object. + * + * Note there's window between ftrace_shutdown and ftrace_startup calls + * where there will be no callbacks called. + * + * Returns: zero on success. Non zero on error, which includes: + * -EINVAL - The @ops object was not properly registered. + */ +int modify_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr) +{ + int err; + + if (check_direct_multi(ops)) + return -EINVAL; + if (!(ops->flags & FTRACE_OPS_FL_ENABLED)) + return -EINVAL; + + mutex_lock(&direct_mutex); + err = __modify_ftrace_direct_multi(ops, addr); mutex_unlock(&direct_mutex); return err; } @@ -7965,6 +8054,143 @@ int ftrace_is_dead(void) return ftrace_disabled; } +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS +/* + * When registering ftrace_ops with IPMODIFY, it is necessary to make sure + * it doesn't conflict with any direct ftrace_ops. If there is existing + * direct ftrace_ops on a kernel function being patched, call + * FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER on it to enable sharing. + * + * @ops: ftrace_ops being registered. + * + * Returns: + * 0 on success; + * Negative on failure. + */ +static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops) +{ + struct ftrace_func_entry *entry; + struct ftrace_hash *hash; + struct ftrace_ops *op; + int size, i, ret; + + lockdep_assert_held_once(&direct_mutex); + + if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY)) + return 0; + + hash = ops->func_hash->filter_hash; + size = 1 << hash->size_bits; + for (i = 0; i < size; i++) { + hlist_for_each_entry(entry, &hash->buckets[i], hlist) { + unsigned long ip = entry->ip; + bool found_op = false; + + mutex_lock(&ftrace_lock); + do_for_each_ftrace_op(op, ftrace_ops_list) { + if (!(op->flags & FTRACE_OPS_FL_DIRECT)) + continue; + if (ops_references_ip(op, ip)) { + found_op = true; + break; + } + } while_for_each_ftrace_op(op); + mutex_unlock(&ftrace_lock); + + if (found_op) { + if (!op->ops_func) + return -EBUSY; + + ret = op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY_PEER); + if (ret) + return ret; + } + } + } + + return 0; +} + +/* + * Similar to prepare_direct_functions_for_ipmodify, clean up after ops + * with IPMODIFY is unregistered. The cleanup is optional for most DIRECT + * ops. + */ +static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops) +{ + struct ftrace_func_entry *entry; + struct ftrace_hash *hash; + struct ftrace_ops *op; + int size, i; + + if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY)) + return; + + mutex_lock(&direct_mutex); + + hash = ops->func_hash->filter_hash; + size = 1 << hash->size_bits; + for (i = 0; i < size; i++) { + hlist_for_each_entry(entry, &hash->buckets[i], hlist) { + unsigned long ip = entry->ip; + bool found_op = false; + + mutex_lock(&ftrace_lock); + do_for_each_ftrace_op(op, ftrace_ops_list) { + if (!(op->flags & FTRACE_OPS_FL_DIRECT)) + continue; + if (ops_references_ip(op, ip)) { + found_op = true; + break; + } + } while_for_each_ftrace_op(op); + mutex_unlock(&ftrace_lock); + + /* The cleanup is optional, ignore any errors */ + if (found_op && op->ops_func) + op->ops_func(op, FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY_PEER); + } + } + mutex_unlock(&direct_mutex); +} + +#define lock_direct_mutex() mutex_lock(&direct_mutex) +#define unlock_direct_mutex() mutex_unlock(&direct_mutex) + +#else /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */ + +static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops) +{ + return 0; +} + +static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops) +{ +} + +#define lock_direct_mutex() do { } while (0) +#define unlock_direct_mutex() do { } while (0) + +#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */ + +/* + * Similar to register_ftrace_function, except we don't lock direct_mutex. + */ +static int register_ftrace_function_nolock(struct ftrace_ops *ops) +{ + int ret; + + ftrace_ops_init(ops); + + mutex_lock(&ftrace_lock); + + ret = ftrace_startup(ops, 0); + + mutex_unlock(&ftrace_lock); + + return ret; +} + /** * register_ftrace_function - register a function for profiling * @ops: ops structure that holds the function for profiling. @@ -7980,14 +8206,15 @@ int register_ftrace_function(struct ftrace_ops *ops) { int ret; - ftrace_ops_init(ops); - - mutex_lock(&ftrace_lock); - - ret = ftrace_startup(ops, 0); + lock_direct_mutex(); + ret = prepare_direct_functions_for_ipmodify(ops); + if (ret < 0) + goto out_unlock; - mutex_unlock(&ftrace_lock); + ret = register_ftrace_function_nolock(ops); +out_unlock: + unlock_direct_mutex(); return ret; } EXPORT_SYMBOL_GPL(register_ftrace_function); @@ -8006,6 +8233,7 @@ int unregister_ftrace_function(struct ftrace_ops *ops) ret = ftrace_shutdown(ops, 0); mutex_unlock(&ftrace_lock); + cleanup_direct_functions_after_ipmodify(ops); return ret; } EXPORT_SYMBOL_GPL(unregister_ftrace_function); diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 2ca96acbc50a..cbc9cd5058cb 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -691,52 +691,35 @@ noinline void bpf_kfunc_call_test_mem_len_fail2(u64 *mem, int len) { } +noinline void bpf_kfunc_call_test_ref(struct prog_test_ref_kfunc *p) +{ +} + __diag_pop(); ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO); -BTF_SET_START(test_sk_check_kfunc_ids) -BTF_ID(func, bpf_kfunc_call_test1) -BTF_ID(func, bpf_kfunc_call_test2) -BTF_ID(func, bpf_kfunc_call_test3) -BTF_ID(func, bpf_kfunc_call_test_acquire) -BTF_ID(func, bpf_kfunc_call_memb_acquire) -BTF_ID(func, bpf_kfunc_call_test_release) -BTF_ID(func, bpf_kfunc_call_memb_release) -BTF_ID(func, bpf_kfunc_call_memb1_release) -BTF_ID(func, bpf_kfunc_call_test_kptr_get) -BTF_ID(func, bpf_kfunc_call_test_pass_ctx) -BTF_ID(func, bpf_kfunc_call_test_pass1) -BTF_ID(func, bpf_kfunc_call_test_pass2) -BTF_ID(func, bpf_kfunc_call_test_fail1) -BTF_ID(func, bpf_kfunc_call_test_fail2) -BTF_ID(func, bpf_kfunc_call_test_fail3) -BTF_ID(func, bpf_kfunc_call_test_mem_len_pass1) -BTF_ID(func, bpf_kfunc_call_test_mem_len_fail1) -BTF_ID(func, bpf_kfunc_call_test_mem_len_fail2) -BTF_SET_END(test_sk_check_kfunc_ids) - -BTF_SET_START(test_sk_acquire_kfunc_ids) -BTF_ID(func, bpf_kfunc_call_test_acquire) -BTF_ID(func, bpf_kfunc_call_memb_acquire) -BTF_ID(func, bpf_kfunc_call_test_kptr_get) -BTF_SET_END(test_sk_acquire_kfunc_ids) - -BTF_SET_START(test_sk_release_kfunc_ids) -BTF_ID(func, bpf_kfunc_call_test_release) -BTF_ID(func, bpf_kfunc_call_memb_release) -BTF_ID(func, bpf_kfunc_call_memb1_release) -BTF_SET_END(test_sk_release_kfunc_ids) - -BTF_SET_START(test_sk_ret_null_kfunc_ids) -BTF_ID(func, bpf_kfunc_call_test_acquire) -BTF_ID(func, bpf_kfunc_call_memb_acquire) -BTF_ID(func, bpf_kfunc_call_test_kptr_get) -BTF_SET_END(test_sk_ret_null_kfunc_ids) - -BTF_SET_START(test_sk_kptr_acquire_kfunc_ids) -BTF_ID(func, bpf_kfunc_call_test_kptr_get) -BTF_SET_END(test_sk_kptr_acquire_kfunc_ids) +BTF_SET8_START(test_sk_check_kfunc_ids) +BTF_ID_FLAGS(func, bpf_kfunc_call_test1) +BTF_ID_FLAGS(func, bpf_kfunc_call_test2) +BTF_ID_FLAGS(func, bpf_kfunc_call_test3) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_acquire, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_kfunc_call_memb_acquire, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_release, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_kfunc_call_memb_release, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_kfunc_call_memb1_release, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_kptr_get, KF_ACQUIRE | KF_RET_NULL | KF_KPTR_GET) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass_ctx) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass1) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_pass2) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail1) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail2) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_fail3) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_pass1) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_fail1) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_mem_len_fail2) +BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS) +BTF_SET8_END(test_sk_check_kfunc_ids) static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size, u32 size, u32 headroom, u32 tailroom) @@ -955,6 +938,9 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb) { struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb; + if (!skb->len) + return -EINVAL; + if (!__skb) return 0; @@ -1617,12 +1603,8 @@ out: } static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &test_sk_check_kfunc_ids, - .acquire_set = &test_sk_acquire_kfunc_ids, - .release_set = &test_sk_release_kfunc_ids, - .ret_null_set = &test_sk_ret_null_kfunc_ids, - .kptr_acquire_set = &test_sk_kptr_acquire_kfunc_ids + .owner = THIS_MODULE, + .set = &test_sk_check_kfunc_ids, }; BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids) diff --git a/net/core/datagram.c b/net/core/datagram.c index 8c702904d960..f3988ef8e9af 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -616,7 +616,7 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, { int frag; - if (msg && msg->sg_from_iter) + if (msg && msg->msg_ubuf && msg->sg_from_iter) return msg->sg_from_iter(sk, skb, from, length); frag = skb_shinfo(skb)->nr_frags; diff --git a/net/core/dev.c b/net/core/dev.c index d588fd0a54ce..716df64fcfa5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4168,6 +4168,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) bool again = false; skb_reset_mac_header(skb); + skb_assert_len(skb); if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED); diff --git a/net/core/filter.c b/net/core/filter.c index a0c61094aeac..57c5e4c4efd2 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -237,7 +237,7 @@ BPF_CALL_2(bpf_skb_load_helper_8_no_cache, const struct sk_buff *, skb, BPF_CALL_4(bpf_skb_load_helper_16, const struct sk_buff *, skb, const void *, data, int, headlen, int, offset) { - u16 tmp, *ptr; + __be16 tmp, *ptr; const int len = sizeof(tmp); if (offset >= 0) { @@ -264,7 +264,7 @@ BPF_CALL_2(bpf_skb_load_helper_16_no_cache, const struct sk_buff *, skb, BPF_CALL_4(bpf_skb_load_helper_32, const struct sk_buff *, skb, const void *, data, int, headlen, int, offset) { - u32 tmp, *ptr; + __be32 tmp, *ptr; const int len = sizeof(tmp); if (likely(offset >= 0)) { diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 266d3b7b7d0b..81627892bdd4 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -462,7 +462,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, if (copied == len) break; - } while (i != msg_rx->sg.end); + } while (!sg_is_last(sge)); if (unlikely(peek)) { msg_rx = sk_psock_next_msg(psock, msg_rx); @@ -472,7 +472,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, } msg_rx->sg.start = i; - if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) { + if (!sge->length && sg_is_last(sge)) { msg_rx = sk_psock_dequeue_msg(psock); kfree_sk_msg(msg_rx); } diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index 7a181631b995..85a9e500c42d 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -197,17 +197,17 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id, } } -BTF_SET_START(bpf_tcp_ca_check_kfunc_ids) -BTF_ID(func, tcp_reno_ssthresh) -BTF_ID(func, tcp_reno_cong_avoid) -BTF_ID(func, tcp_reno_undo_cwnd) -BTF_ID(func, tcp_slow_start) -BTF_ID(func, tcp_cong_avoid_ai) -BTF_SET_END(bpf_tcp_ca_check_kfunc_ids) +BTF_SET8_START(bpf_tcp_ca_check_kfunc_ids) +BTF_ID_FLAGS(func, tcp_reno_ssthresh) +BTF_ID_FLAGS(func, tcp_reno_cong_avoid) +BTF_ID_FLAGS(func, tcp_reno_undo_cwnd) +BTF_ID_FLAGS(func, tcp_slow_start) +BTF_ID_FLAGS(func, tcp_cong_avoid_ai) +BTF_SET8_END(bpf_tcp_ca_check_kfunc_ids) static const struct btf_kfunc_id_set bpf_tcp_ca_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &bpf_tcp_ca_check_kfunc_ids, + .owner = THIS_MODULE, + .set = &bpf_tcp_ca_check_kfunc_ids, }; static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = { diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 075e744bfb48..54eec33c6e1c 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -1154,24 +1154,24 @@ static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = { .set_state = bbr_set_state, }; -BTF_SET_START(tcp_bbr_check_kfunc_ids) +BTF_SET8_START(tcp_bbr_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE -BTF_ID(func, bbr_init) -BTF_ID(func, bbr_main) -BTF_ID(func, bbr_sndbuf_expand) -BTF_ID(func, bbr_undo_cwnd) -BTF_ID(func, bbr_cwnd_event) -BTF_ID(func, bbr_ssthresh) -BTF_ID(func, bbr_min_tso_segs) -BTF_ID(func, bbr_set_state) +BTF_ID_FLAGS(func, bbr_init) +BTF_ID_FLAGS(func, bbr_main) +BTF_ID_FLAGS(func, bbr_sndbuf_expand) +BTF_ID_FLAGS(func, bbr_undo_cwnd) +BTF_ID_FLAGS(func, bbr_cwnd_event) +BTF_ID_FLAGS(func, bbr_ssthresh) +BTF_ID_FLAGS(func, bbr_min_tso_segs) +BTF_ID_FLAGS(func, bbr_set_state) #endif #endif -BTF_SET_END(tcp_bbr_check_kfunc_ids) +BTF_SET8_END(tcp_bbr_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_bbr_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &tcp_bbr_check_kfunc_ids, + .owner = THIS_MODULE, + .set = &tcp_bbr_check_kfunc_ids, }; static int __init bbr_register(void) diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 68178e7280ce..768c10c1f649 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -485,22 +485,22 @@ static struct tcp_congestion_ops cubictcp __read_mostly = { .name = "cubic", }; -BTF_SET_START(tcp_cubic_check_kfunc_ids) +BTF_SET8_START(tcp_cubic_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE -BTF_ID(func, cubictcp_init) -BTF_ID(func, cubictcp_recalc_ssthresh) -BTF_ID(func, cubictcp_cong_avoid) -BTF_ID(func, cubictcp_state) -BTF_ID(func, cubictcp_cwnd_event) -BTF_ID(func, cubictcp_acked) +BTF_ID_FLAGS(func, cubictcp_init) +BTF_ID_FLAGS(func, cubictcp_recalc_ssthresh) +BTF_ID_FLAGS(func, cubictcp_cong_avoid) +BTF_ID_FLAGS(func, cubictcp_state) +BTF_ID_FLAGS(func, cubictcp_cwnd_event) +BTF_ID_FLAGS(func, cubictcp_acked) #endif #endif -BTF_SET_END(tcp_cubic_check_kfunc_ids) +BTF_SET8_END(tcp_cubic_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_cubic_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &tcp_cubic_check_kfunc_ids, + .owner = THIS_MODULE, + .set = &tcp_cubic_check_kfunc_ids, }; static int __init cubictcp_register(void) diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index ab034a4e9324..2a6c0dd665a4 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -239,22 +239,22 @@ static struct tcp_congestion_ops dctcp_reno __read_mostly = { .name = "dctcp-reno", }; -BTF_SET_START(tcp_dctcp_check_kfunc_ids) +BTF_SET8_START(tcp_dctcp_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE -BTF_ID(func, dctcp_init) -BTF_ID(func, dctcp_update_alpha) -BTF_ID(func, dctcp_cwnd_event) -BTF_ID(func, dctcp_ssthresh) -BTF_ID(func, dctcp_cwnd_undo) -BTF_ID(func, dctcp_state) +BTF_ID_FLAGS(func, dctcp_init) +BTF_ID_FLAGS(func, dctcp_update_alpha) +BTF_ID_FLAGS(func, dctcp_cwnd_event) +BTF_ID_FLAGS(func, dctcp_ssthresh) +BTF_ID_FLAGS(func, dctcp_cwnd_undo) +BTF_ID_FLAGS(func, dctcp_state) #endif #endif -BTF_SET_END(tcp_dctcp_check_kfunc_ids) +BTF_SET8_END(tcp_dctcp_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_dctcp_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &tcp_dctcp_check_kfunc_ids, + .owner = THIS_MODULE, + .set = &tcp_dctcp_check_kfunc_ids, }; static int __init dctcp_register(void) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 6ed807b6c647..b624e3d8c5f0 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -7042,9 +7042,9 @@ static const struct ctl_table addrconf_sysctl[] = { .data = &ipv6_devconf.accept_untracked_na, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec, - .extra1 = (void *)SYSCTL_ZERO, - .extra2 = (void *)SYSCTL_ONE, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_TWO, }, { /* sentinel */ diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c index ecf3a553a0dc..b1179f62bd23 100644 --- a/net/ipv6/ping.c +++ b/net/ipv6/ping.c @@ -64,6 +64,8 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (err) return err; + memset(&fl6, 0, sizeof(fl6)); + if (msg->msg_name) { DECLARE_SOCKADDR(struct sockaddr_in6 *, u, msg->msg_name); if (msg->msg_namelen < sizeof(*u)) @@ -72,12 +74,15 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) return -EAFNOSUPPORT; } daddr = &(u->sin6_addr); + if (np->sndflow) + fl6.flowlabel = u->sin6_flowinfo & IPV6_FLOWINFO_MASK; if (__ipv6_addr_needs_scope_id(ipv6_addr_type(daddr))) oif = u->sin6_scope_id; } else { if (sk->sk_state != TCP_ESTABLISHED) return -EDESTADDRREQ; daddr = &sk->sk_v6_daddr; + fl6.flowlabel = np->flow_label; } if (!oif) @@ -101,7 +106,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) ipc6.sockc.tsflags = sk->sk_tsflags; ipc6.sockc.mark = sk->sk_mark; - memset(&fl6, 0, sizeof(fl6)); fl6.flowi6_oif = oif; if (msg->msg_controllen) { diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c index bc4d5cd63a94..1cd87b28c9b0 100644 --- a/net/netfilter/nf_conntrack_bpf.c +++ b/net/netfilter/nf_conntrack_bpf.c @@ -55,57 +55,131 @@ enum { NF_BPF_CT_OPTS_SZ = 12, }; -static struct nf_conn *__bpf_nf_ct_lookup(struct net *net, - struct bpf_sock_tuple *bpf_tuple, - u32 tuple_len, u8 protonum, - s32 netns_id, u8 *dir) +static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple, + u32 tuple_len, u8 protonum, u8 dir, + struct nf_conntrack_tuple *tuple) { - struct nf_conntrack_tuple_hash *hash; - struct nf_conntrack_tuple tuple; - struct nf_conn *ct; + union nf_inet_addr *src = dir ? &tuple->dst.u3 : &tuple->src.u3; + union nf_inet_addr *dst = dir ? &tuple->src.u3 : &tuple->dst.u3; + union nf_conntrack_man_proto *sport = dir ? (void *)&tuple->dst.u + : &tuple->src.u; + union nf_conntrack_man_proto *dport = dir ? &tuple->src.u + : (void *)&tuple->dst.u; if (unlikely(protonum != IPPROTO_TCP && protonum != IPPROTO_UDP)) - return ERR_PTR(-EPROTO); - if (unlikely(netns_id < BPF_F_CURRENT_NETNS)) - return ERR_PTR(-EINVAL); + return -EPROTO; + + memset(tuple, 0, sizeof(*tuple)); - memset(&tuple, 0, sizeof(tuple)); switch (tuple_len) { case sizeof(bpf_tuple->ipv4): - tuple.src.l3num = AF_INET; - tuple.src.u3.ip = bpf_tuple->ipv4.saddr; - tuple.src.u.tcp.port = bpf_tuple->ipv4.sport; - tuple.dst.u3.ip = bpf_tuple->ipv4.daddr; - tuple.dst.u.tcp.port = bpf_tuple->ipv4.dport; + tuple->src.l3num = AF_INET; + src->ip = bpf_tuple->ipv4.saddr; + sport->tcp.port = bpf_tuple->ipv4.sport; + dst->ip = bpf_tuple->ipv4.daddr; + dport->tcp.port = bpf_tuple->ipv4.dport; break; case sizeof(bpf_tuple->ipv6): - tuple.src.l3num = AF_INET6; - memcpy(tuple.src.u3.ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr)); - tuple.src.u.tcp.port = bpf_tuple->ipv6.sport; - memcpy(tuple.dst.u3.ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr)); - tuple.dst.u.tcp.port = bpf_tuple->ipv6.dport; + tuple->src.l3num = AF_INET6; + memcpy(src->ip6, bpf_tuple->ipv6.saddr, sizeof(bpf_tuple->ipv6.saddr)); + sport->tcp.port = bpf_tuple->ipv6.sport; + memcpy(dst->ip6, bpf_tuple->ipv6.daddr, sizeof(bpf_tuple->ipv6.daddr)); + dport->tcp.port = bpf_tuple->ipv6.dport; break; default: - return ERR_PTR(-EAFNOSUPPORT); + return -EAFNOSUPPORT; + } + tuple->dst.protonum = protonum; + tuple->dst.dir = dir; + + return 0; +} + +static struct nf_conn * +__bpf_nf_ct_alloc_entry(struct net *net, struct bpf_sock_tuple *bpf_tuple, + u32 tuple_len, struct bpf_ct_opts *opts, u32 opts_len, + u32 timeout) +{ + struct nf_conntrack_tuple otuple, rtuple; + struct nf_conn *ct; + int err; + + if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] || + opts_len != NF_BPF_CT_OPTS_SZ) + return ERR_PTR(-EINVAL); + + if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) + return ERR_PTR(-EINVAL); + + err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto, + IP_CT_DIR_ORIGINAL, &otuple); + if (err < 0) + return ERR_PTR(err); + + err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto, + IP_CT_DIR_REPLY, &rtuple); + if (err < 0) + return ERR_PTR(err); + + if (opts->netns_id >= 0) { + net = get_net_ns_by_id(net, opts->netns_id); + if (unlikely(!net)) + return ERR_PTR(-ENONET); } - tuple.dst.protonum = protonum; + ct = nf_conntrack_alloc(net, &nf_ct_zone_dflt, &otuple, &rtuple, + GFP_ATOMIC); + if (IS_ERR(ct)) + goto out; + + memset(&ct->proto, 0, sizeof(ct->proto)); + __nf_ct_set_timeout(ct, timeout * HZ); + ct->status |= IPS_CONFIRMED; + +out: + if (opts->netns_id >= 0) + put_net(net); + + return ct; +} + +static struct nf_conn *__bpf_nf_ct_lookup(struct net *net, + struct bpf_sock_tuple *bpf_tuple, + u32 tuple_len, struct bpf_ct_opts *opts, + u32 opts_len) +{ + struct nf_conntrack_tuple_hash *hash; + struct nf_conntrack_tuple tuple; + struct nf_conn *ct; + int err; + + if (!opts || !bpf_tuple || opts->reserved[0] || opts->reserved[1] || + opts_len != NF_BPF_CT_OPTS_SZ) + return ERR_PTR(-EINVAL); + if (unlikely(opts->l4proto != IPPROTO_TCP && opts->l4proto != IPPROTO_UDP)) + return ERR_PTR(-EPROTO); + if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) + return ERR_PTR(-EINVAL); + + err = bpf_nf_ct_tuple_parse(bpf_tuple, tuple_len, opts->l4proto, + IP_CT_DIR_ORIGINAL, &tuple); + if (err < 0) + return ERR_PTR(err); - if (netns_id >= 0) { - net = get_net_ns_by_id(net, netns_id); + if (opts->netns_id >= 0) { + net = get_net_ns_by_id(net, opts->netns_id); if (unlikely(!net)) return ERR_PTR(-ENONET); } hash = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &tuple); - if (netns_id >= 0) + if (opts->netns_id >= 0) put_net(net); if (!hash) return ERR_PTR(-ENOENT); ct = nf_ct_tuplehash_to_ctrack(hash); - if (dir) - *dir = NF_CT_DIRECTION(hash); + opts->dir = NF_CT_DIRECTION(hash); return ct; } @@ -114,6 +188,43 @@ __diag_push(); __diag_ignore_all("-Wmissing-prototypes", "Global functions as their definitions will be in nf_conntrack BTF"); +struct nf_conn___init { + struct nf_conn ct; +}; + +/* bpf_xdp_ct_alloc - Allocate a new CT entry + * + * Parameters: + * @xdp_ctx - Pointer to ctx (xdp_md) in XDP program + * Cannot be NULL + * @bpf_tuple - Pointer to memory representing the tuple to look up + * Cannot be NULL + * @tuple__sz - Length of the tuple structure + * Must be one of sizeof(bpf_tuple->ipv4) or + * sizeof(bpf_tuple->ipv6) + * @opts - Additional options for allocation (documented above) + * Cannot be NULL + * @opts__sz - Length of the bpf_ct_opts structure + * Must be NF_BPF_CT_OPTS_SZ (12) + */ +struct nf_conn___init * +bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple, + u32 tuple__sz, struct bpf_ct_opts *opts, u32 opts__sz) +{ + struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx; + struct nf_conn *nfct; + + nfct = __bpf_nf_ct_alloc_entry(dev_net(ctx->rxq->dev), bpf_tuple, tuple__sz, + opts, opts__sz, 10); + if (IS_ERR(nfct)) { + if (opts) + opts->error = PTR_ERR(nfct); + return NULL; + } + + return (struct nf_conn___init *)nfct; +} + /* bpf_xdp_ct_lookup - Lookup CT entry for the given tuple, and acquire a * reference to it * @@ -138,25 +249,50 @@ bpf_xdp_ct_lookup(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple, struct net *caller_net; struct nf_conn *nfct; - BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ); - - if (!opts) - return NULL; - if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] || - opts__sz != NF_BPF_CT_OPTS_SZ) { - opts->error = -EINVAL; - return NULL; - } caller_net = dev_net(ctx->rxq->dev); - nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts->l4proto, - opts->netns_id, &opts->dir); + nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz); if (IS_ERR(nfct)) { - opts->error = PTR_ERR(nfct); + if (opts) + opts->error = PTR_ERR(nfct); return NULL; } return nfct; } +/* bpf_skb_ct_alloc - Allocate a new CT entry + * + * Parameters: + * @skb_ctx - Pointer to ctx (__sk_buff) in TC program + * Cannot be NULL + * @bpf_tuple - Pointer to memory representing the tuple to look up + * Cannot be NULL + * @tuple__sz - Length of the tuple structure + * Must be one of sizeof(bpf_tuple->ipv4) or + * sizeof(bpf_tuple->ipv6) + * @opts - Additional options for allocation (documented above) + * Cannot be NULL + * @opts__sz - Length of the bpf_ct_opts structure + * Must be NF_BPF_CT_OPTS_SZ (12) + */ +struct nf_conn___init * +bpf_skb_ct_alloc(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple, + u32 tuple__sz, struct bpf_ct_opts *opts, u32 opts__sz) +{ + struct sk_buff *skb = (struct sk_buff *)skb_ctx; + struct nf_conn *nfct; + struct net *net; + + net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk); + nfct = __bpf_nf_ct_alloc_entry(net, bpf_tuple, tuple__sz, opts, opts__sz, 10); + if (IS_ERR(nfct)) { + if (opts) + opts->error = PTR_ERR(nfct); + return NULL; + } + + return (struct nf_conn___init *)nfct; +} + /* bpf_skb_ct_lookup - Lookup CT entry for the given tuple, and acquire a * reference to it * @@ -181,20 +317,31 @@ bpf_skb_ct_lookup(struct __sk_buff *skb_ctx, struct bpf_sock_tuple *bpf_tuple, struct net *caller_net; struct nf_conn *nfct; - BUILD_BUG_ON(sizeof(struct bpf_ct_opts) != NF_BPF_CT_OPTS_SZ); - - if (!opts) - return NULL; - if (!bpf_tuple || opts->reserved[0] || opts->reserved[1] || - opts__sz != NF_BPF_CT_OPTS_SZ) { - opts->error = -EINVAL; - return NULL; - } caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk); - nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts->l4proto, - opts->netns_id, &opts->dir); + nfct = __bpf_nf_ct_lookup(caller_net, bpf_tuple, tuple__sz, opts, opts__sz); if (IS_ERR(nfct)) { - opts->error = PTR_ERR(nfct); + if (opts) + opts->error = PTR_ERR(nfct); + return NULL; + } + return nfct; +} + +/* bpf_ct_insert_entry - Add the provided entry into a CT map + * + * This must be invoked for referenced PTR_TO_BTF_ID. + * + * @nfct - Pointer to referenced nf_conn___init object, obtained + * using bpf_xdp_ct_alloc or bpf_skb_ct_alloc. + */ +struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *nfct_i) +{ + struct nf_conn *nfct = (struct nf_conn *)nfct_i; + int err; + + err = nf_conntrack_hash_check_insert(nfct); + if (err < 0) { + nf_conntrack_free(nfct); return NULL; } return nfct; @@ -217,50 +364,90 @@ void bpf_ct_release(struct nf_conn *nfct) nf_ct_put(nfct); } +/* bpf_ct_set_timeout - Set timeout of allocated nf_conn + * + * Sets the default timeout of newly allocated nf_conn before insertion. + * This helper must be invoked for refcounted pointer to nf_conn___init. + * + * Parameters: + * @nfct - Pointer to referenced nf_conn object, obtained using + * bpf_xdp_ct_alloc or bpf_skb_ct_alloc. + * @timeout - Timeout in msecs. + */ +void bpf_ct_set_timeout(struct nf_conn___init *nfct, u32 timeout) +{ + __nf_ct_set_timeout((struct nf_conn *)nfct, msecs_to_jiffies(timeout)); +} + +/* bpf_ct_change_timeout - Change timeout of inserted nf_conn + * + * Change timeout associated of the inserted or looked up nf_conn. + * This helper must be invoked for refcounted pointer to nf_conn. + * + * Parameters: + * @nfct - Pointer to referenced nf_conn object, obtained using + * bpf_ct_insert_entry, bpf_xdp_ct_lookup, or bpf_skb_ct_lookup. + * @timeout - New timeout in msecs. + */ +int bpf_ct_change_timeout(struct nf_conn *nfct, u32 timeout) +{ + return __nf_ct_change_timeout(nfct, msecs_to_jiffies(timeout)); +} + +/* bpf_ct_set_status - Set status field of allocated nf_conn + * + * Set the status field of the newly allocated nf_conn before insertion. + * This must be invoked for referenced PTR_TO_BTF_ID to nf_conn___init. + * + * Parameters: + * @nfct - Pointer to referenced nf_conn object, obtained using + * bpf_xdp_ct_alloc or bpf_skb_ct_alloc. + * @status - New status value. + */ +int bpf_ct_set_status(const struct nf_conn___init *nfct, u32 status) +{ + return nf_ct_change_status_common((struct nf_conn *)nfct, status); +} + +/* bpf_ct_change_status - Change status of inserted nf_conn + * + * Change the status field of the provided connection tracking entry. + * This must be invoked for referenced PTR_TO_BTF_ID to nf_conn. + * + * Parameters: + * @nfct - Pointer to referenced nf_conn object, obtained using + * bpf_ct_insert_entry, bpf_xdp_ct_lookup or bpf_skb_ct_lookup. + * @status - New status value. + */ +int bpf_ct_change_status(struct nf_conn *nfct, u32 status) +{ + return nf_ct_change_status_common(nfct, status); +} + __diag_pop() -BTF_SET_START(nf_ct_xdp_check_kfunc_ids) -BTF_ID(func, bpf_xdp_ct_lookup) -BTF_ID(func, bpf_ct_release) -BTF_SET_END(nf_ct_xdp_check_kfunc_ids) - -BTF_SET_START(nf_ct_tc_check_kfunc_ids) -BTF_ID(func, bpf_skb_ct_lookup) -BTF_ID(func, bpf_ct_release) -BTF_SET_END(nf_ct_tc_check_kfunc_ids) - -BTF_SET_START(nf_ct_acquire_kfunc_ids) -BTF_ID(func, bpf_xdp_ct_lookup) -BTF_ID(func, bpf_skb_ct_lookup) -BTF_SET_END(nf_ct_acquire_kfunc_ids) - -BTF_SET_START(nf_ct_release_kfunc_ids) -BTF_ID(func, bpf_ct_release) -BTF_SET_END(nf_ct_release_kfunc_ids) - -/* Both sets are identical */ -#define nf_ct_ret_null_kfunc_ids nf_ct_acquire_kfunc_ids - -static const struct btf_kfunc_id_set nf_conntrack_xdp_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &nf_ct_xdp_check_kfunc_ids, - .acquire_set = &nf_ct_acquire_kfunc_ids, - .release_set = &nf_ct_release_kfunc_ids, - .ret_null_set = &nf_ct_ret_null_kfunc_ids, -}; +BTF_SET8_START(nf_ct_kfunc_set) +BTF_ID_FLAGS(func, bpf_xdp_ct_alloc, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_xdp_ct_lookup, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_skb_ct_alloc, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_skb_ct_lookup, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_ct_insert_entry, KF_ACQUIRE | KF_RET_NULL | KF_RELEASE) +BTF_ID_FLAGS(func, bpf_ct_release, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_ct_set_timeout, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_ct_change_timeout, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_ct_set_status, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_ct_change_status, KF_TRUSTED_ARGS) +BTF_SET8_END(nf_ct_kfunc_set) -static const struct btf_kfunc_id_set nf_conntrack_tc_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &nf_ct_tc_check_kfunc_ids, - .acquire_set = &nf_ct_acquire_kfunc_ids, - .release_set = &nf_ct_release_kfunc_ids, - .ret_null_set = &nf_ct_ret_null_kfunc_ids, +static const struct btf_kfunc_id_set nf_conntrack_kfunc_set = { + .owner = THIS_MODULE, + .set = &nf_ct_kfunc_set, }; int register_nf_conntrack_bpf(void) { int ret; - ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &nf_conntrack_xdp_kfunc_set); - return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &nf_conntrack_tc_kfunc_set); + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &nf_conntrack_kfunc_set); + return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &nf_conntrack_kfunc_set); } diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 8c97d062b1ae..71c2f4f95d36 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -2806,3 +2806,65 @@ err_expect: free_percpu(net->ct.stat); return ret; } + +#if (IS_BUILTIN(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) || \ + (IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES) || \ + IS_ENABLED(CONFIG_NF_CT_NETLINK)) + +/* ctnetlink code shared by both ctnetlink and nf_conntrack_bpf */ + +int __nf_ct_change_timeout(struct nf_conn *ct, u64 timeout) +{ + if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status)) + return -EPERM; + + __nf_ct_set_timeout(ct, timeout); + + if (test_bit(IPS_DYING_BIT, &ct->status)) + return -ETIME; + + return 0; +} +EXPORT_SYMBOL_GPL(__nf_ct_change_timeout); + +void __nf_ct_change_status(struct nf_conn *ct, unsigned long on, unsigned long off) +{ + unsigned int bit; + + /* Ignore these unchangable bits */ + on &= ~IPS_UNCHANGEABLE_MASK; + off &= ~IPS_UNCHANGEABLE_MASK; + + for (bit = 0; bit < __IPS_MAX_BIT; bit++) { + if (on & (1 << bit)) + set_bit(bit, &ct->status); + else if (off & (1 << bit)) + clear_bit(bit, &ct->status); + } +} +EXPORT_SYMBOL_GPL(__nf_ct_change_status); + +int nf_ct_change_status_common(struct nf_conn *ct, unsigned int status) +{ + unsigned long d; + + d = ct->status ^ status; + + if (d & (IPS_EXPECTED|IPS_CONFIRMED|IPS_DYING)) + /* unchangeable */ + return -EBUSY; + + if (d & IPS_SEEN_REPLY && !(status & IPS_SEEN_REPLY)) + /* SEEN_REPLY bit can only be set */ + return -EBUSY; + + if (d & IPS_ASSURED && !(status & IPS_ASSURED)) + /* ASSURED bit can only be set */ + return -EBUSY; + + __nf_ct_change_status(ct, status, 0); + return 0; +} +EXPORT_SYMBOL_GPL(nf_ct_change_status_common); + +#endif diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index f8dd4ed8dc60..04169b54f2a2 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -1891,45 +1891,10 @@ ctnetlink_parse_nat_setup(struct nf_conn *ct, } #endif -static void -__ctnetlink_change_status(struct nf_conn *ct, unsigned long on, - unsigned long off) -{ - unsigned int bit; - - /* Ignore these unchangable bits */ - on &= ~IPS_UNCHANGEABLE_MASK; - off &= ~IPS_UNCHANGEABLE_MASK; - - for (bit = 0; bit < __IPS_MAX_BIT; bit++) { - if (on & (1 << bit)) - set_bit(bit, &ct->status); - else if (off & (1 << bit)) - clear_bit(bit, &ct->status); - } -} - static int ctnetlink_change_status(struct nf_conn *ct, const struct nlattr * const cda[]) { - unsigned long d; - unsigned int status = ntohl(nla_get_be32(cda[CTA_STATUS])); - d = ct->status ^ status; - - if (d & (IPS_EXPECTED|IPS_CONFIRMED|IPS_DYING)) - /* unchangeable */ - return -EBUSY; - - if (d & IPS_SEEN_REPLY && !(status & IPS_SEEN_REPLY)) - /* SEEN_REPLY bit can only be set */ - return -EBUSY; - - if (d & IPS_ASSURED && !(status & IPS_ASSURED)) - /* ASSURED bit can only be set */ - return -EBUSY; - - __ctnetlink_change_status(ct, status, 0); - return 0; + return nf_ct_change_status_common(ct, ntohl(nla_get_be32(cda[CTA_STATUS]))); } static int @@ -2024,16 +1989,7 @@ static int ctnetlink_change_helper(struct nf_conn *ct, static int ctnetlink_change_timeout(struct nf_conn *ct, const struct nlattr * const cda[]) { - u64 timeout = (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ; - - if (timeout > INT_MAX) - timeout = INT_MAX; - WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout); - - if (test_bit(IPS_DYING_BIT, &ct->status)) - return -ETIME; - - return 0; + return __nf_ct_change_timeout(ct, (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ); } #if defined(CONFIG_NF_CONNTRACK_MARK) @@ -2293,9 +2249,7 @@ ctnetlink_create_conntrack(struct net *net, goto err1; timeout = (u64)ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ; - if (timeout > INT_MAX) - timeout = INT_MAX; - ct->timeout = (u32)timeout + nfct_time_stamp; + __nf_ct_set_timeout(ct, timeout); rcu_read_lock(); if (cda[CTA_HELP]) { @@ -2837,7 +2791,7 @@ ctnetlink_update_status(struct nf_conn *ct, const struct nlattr * const cda[]) * unchangeable bits but do not error out. Also user programs * are allowed to clear the bits that they are allowed to change. */ - __ctnetlink_change_status(ct, status, ~status); + __nf_ct_change_status(ct, status, ~status); return 0; } diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 859ea02022c0..ed5e6f1df9c7 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1803,6 +1803,7 @@ static long tls_rx_reader_lock(struct sock *sk, struct tls_sw_context_rx *ctx, bool nonblock) { long timeo; + int err; lock_sock(sk); @@ -1818,15 +1819,23 @@ static long tls_rx_reader_lock(struct sock *sk, struct tls_sw_context_rx *ctx, !READ_ONCE(ctx->reader_present), &wait); remove_wait_queue(&ctx->wq, &wait); - if (!timeo) - return -EAGAIN; - if (signal_pending(current)) - return sock_intr_errno(timeo); + if (timeo <= 0) { + err = -EAGAIN; + goto err_unlock; + } + if (signal_pending(current)) { + err = sock_intr_errno(timeo); + goto err_unlock; + } } WRITE_ONCE(ctx->reader_present, 1); return timeo; + +err_unlock: + release_sock(sk); + return err; } static void tls_rx_reader_unlock(struct sock *sk, struct tls_sw_context_rx *ctx) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 09002387987e..5b4ce6ba1bc7 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -639,8 +639,11 @@ static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len if (unlikely(need_wait)) return -EOPNOTSUPP; - if (sk_can_busy_loop(sk)) + if (sk_can_busy_loop(sk)) { + if (xs->zc) + __sk_mark_napi_id_once(sk, xsk_pool_get_napi_id(xs->pool)); sk_busy_loop(sk, 1); /* only support non-blocking sockets */ + } if (xs->zc && xsk_no_wakeup(sk)) return 0; diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 5002a5b9a7da..727da3c5879b 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -282,12 +282,10 @@ $(LIBBPF): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(LIBBPF_OU BPFTOOLDIR := $(TOOLS_PATH)/bpf/bpftool BPFTOOL_OUTPUT := $(abspath $(BPF_SAMPLES_PATH))/bpftool -BPFTOOL := $(BPFTOOL_OUTPUT)/bpftool -$(BPFTOOL): $(LIBBPF) $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) | $(BPFTOOL_OUTPUT) - $(MAKE) -C $(BPFTOOLDIR) srctree=$(BPF_SAMPLES_PATH)/../../ \ - OUTPUT=$(BPFTOOL_OUTPUT)/ \ - LIBBPF_OUTPUT=$(LIBBPF_OUTPUT)/ \ - LIBBPF_DESTDIR=$(LIBBPF_DESTDIR)/ +BPFTOOL := $(BPFTOOL_OUTPUT)/bootstrap/bpftool +$(BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) | $(BPFTOOL_OUTPUT) + $(MAKE) -C $(BPFTOOLDIR) srctree=$(BPF_SAMPLES_PATH)/../../ \ + OUTPUT=$(BPFTOOL_OUTPUT)/ bootstrap $(LIBBPF_OUTPUT) $(BPFTOOL_OUTPUT): $(call msg,MKDIR,$@) diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c index 16dbf49e0f19..88a26f3ce201 100644 --- a/samples/bpf/fds_example.c +++ b/samples/bpf/fds_example.c @@ -17,6 +17,7 @@ #include <bpf/libbpf.h> #include "bpf_insn.h" #include "sock_example.h" +#include "bpf_util.h" #define BPF_F_PIN (1 << 0) #define BPF_F_GET (1 << 1) @@ -52,7 +53,7 @@ static int bpf_prog_create(const char *object) BPF_MOV64_IMM(BPF_REG_0, 1), BPF_EXIT_INSN(), }; - size_t insns_cnt = sizeof(insns) / sizeof(struct bpf_insn); + size_t insns_cnt = ARRAY_SIZE(insns); struct bpf_object *obj; int err; diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c index a88f69504c08..5b66f2401b96 100644 --- a/samples/bpf/sock_example.c +++ b/samples/bpf/sock_example.c @@ -29,6 +29,7 @@ #include <bpf/bpf.h> #include "bpf_insn.h" #include "sock_example.h" +#include "bpf_util.h" char bpf_log_buf[BPF_LOG_BUF_SIZE]; @@ -58,7 +59,7 @@ static int test_sock(void) BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ BPF_EXIT_INSN(), }; - size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn); + size_t insns_cnt = ARRAY_SIZE(prog); LIBBPF_OPTS(bpf_prog_load_opts, opts, .log_buf = bpf_log_buf, .log_size = BPF_LOG_BUF_SIZE, diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c index 6d90874b09c3..68ce69457afe 100644 --- a/samples/bpf/test_cgrp2_attach.c +++ b/samples/bpf/test_cgrp2_attach.c @@ -31,6 +31,7 @@ #include <bpf/bpf.h> #include "bpf_insn.h" +#include "bpf_util.h" enum { MAP_KEY_PACKETS, @@ -70,7 +71,7 @@ static int prog_load(int map_fd, int verdict) BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */ BPF_EXIT_INSN(), }; - size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn); + size_t insns_cnt = ARRAY_SIZE(prog); LIBBPF_OPTS(bpf_prog_load_opts, opts, .log_buf = bpf_log_buf, .log_size = BPF_LOG_BUF_SIZE, diff --git a/samples/bpf/test_lru_dist.c b/samples/bpf/test_lru_dist.c index be98ccb4952f..5efb91763d65 100644 --- a/samples/bpf/test_lru_dist.c +++ b/samples/bpf/test_lru_dist.c @@ -523,7 +523,7 @@ int main(int argc, char **argv) return -1; } - for (f = 0; f < sizeof(map_flags) / sizeof(*map_flags); f++) { + for (f = 0; f < ARRAY_SIZE(map_flags); f++) { test_lru_loss0(BPF_MAP_TYPE_LRU_HASH, map_flags[f]); test_lru_loss1(BPF_MAP_TYPE_LRU_HASH, map_flags[f]); test_parallel_lru_loss(BPF_MAP_TYPE_LRU_HASH, map_flags[f], diff --git a/samples/bpf/test_map_in_map_user.c b/samples/bpf/test_map_in_map_user.c index e8b4cc184ac9..652ec720533d 100644 --- a/samples/bpf/test_map_in_map_user.c +++ b/samples/bpf/test_map_in_map_user.c @@ -12,6 +12,8 @@ #include <bpf/bpf.h> #include <bpf/libbpf.h> +#include "bpf_util.h" + static int map_fd[7]; #define PORT_A (map_fd[0]) @@ -28,7 +30,7 @@ static const char * const test_names[] = { "Hash of Hash", }; -#define NR_TESTS (sizeof(test_names) / sizeof(*test_names)) +#define NR_TESTS ARRAY_SIZE(test_names) static void check_map_id(int inner_map_fd, int map_in_map_fd, uint32_t key) { diff --git a/samples/bpf/tracex5_user.c b/samples/bpf/tracex5_user.c index e910dc265c31..9d7d79f0d47d 100644 --- a/samples/bpf/tracex5_user.c +++ b/samples/bpf/tracex5_user.c @@ -8,6 +8,7 @@ #include <bpf/bpf.h> #include <bpf/libbpf.h> #include "trace_helpers.h" +#include "bpf_util.h" #ifdef __mips__ #define MAX_ENTRIES 6000 /* MIPS n64 syscalls start at 5000 */ @@ -24,7 +25,7 @@ static void install_accept_all_seccomp(void) BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { - .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])), + .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; if (prctl(PR_SET_SECCOMP, 2, &prog)) diff --git a/samples/bpf/xdp_redirect_map.bpf.c b/samples/bpf/xdp_redirect_map.bpf.c index 415bac1758e3..8557c278df77 100644 --- a/samples/bpf/xdp_redirect_map.bpf.c +++ b/samples/bpf/xdp_redirect_map.bpf.c @@ -33,7 +33,7 @@ struct { } tx_port_native SEC(".maps"); /* store egress interface mac address */ -const volatile char tx_mac_addr[ETH_ALEN]; +const volatile __u8 tx_mac_addr[ETH_ALEN]; static __always_inline int xdp_redirect_map(struct xdp_md *ctx, void *redirect_map) { @@ -73,6 +73,7 @@ int xdp_redirect_map_egress(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; + u8 *mac_addr = (u8 *) tx_mac_addr; struct ethhdr *eth = data; u64 nh_off; @@ -80,7 +81,8 @@ int xdp_redirect_map_egress(struct xdp_md *ctx) if (data + nh_off > data_end) return XDP_DROP; - __builtin_memcpy(eth->h_source, (const char *)tx_mac_addr, ETH_ALEN); + barrier_var(mac_addr); /* prevent optimizing out memcpy */ + __builtin_memcpy(eth->h_source, mac_addr, ETH_ALEN); return XDP_PASS; } diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c index b6e4fc849577..c889a1394dc1 100644 --- a/samples/bpf/xdp_redirect_map_user.c +++ b/samples/bpf/xdp_redirect_map_user.c @@ -40,6 +40,8 @@ static const struct option long_options[] = { {} }; +static int verbose = 0; + int main(int argc, char **argv) { struct bpf_devmap_val devmap_val = {}; @@ -79,6 +81,7 @@ int main(int argc, char **argv) break; case 'v': sample_switch_mode(); + verbose = 1; break; case 's': mask |= SAMPLE_REDIRECT_MAP_CNT; @@ -134,6 +137,12 @@ int main(int argc, char **argv) ret = EXIT_FAIL; goto end_destroy; } + if (verbose) + printf("Egress ifindex:%d using src MAC %02x:%02x:%02x:%02x:%02x:%02x\n", + ifindex_out, + skel->rodata->tx_mac_addr[0], skel->rodata->tx_mac_addr[1], + skel->rodata->tx_mac_addr[2], skel->rodata->tx_mac_addr[3], + skel->rodata->tx_mac_addr[4], skel->rodata->tx_mac_addr[5]); } skel->rodata->from_match[0] = ifindex_in; diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index a0ec321469bd..dfb260de17a8 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -333,27 +333,7 @@ class PrinterRST(Printer): .. Copyright (C) All BPF authors and contributors from 2014 to present. .. See git log include/uapi/linux/bpf.h in kernel tree for details. .. -.. %%%LICENSE_START(VERBATIM) -.. Permission is granted to make and distribute verbatim copies of this -.. manual provided the copyright notice and this permission notice are -.. preserved on all copies. -.. -.. Permission is granted to copy and distribute modified versions of this -.. manual under the conditions for verbatim copying, provided that the -.. entire resulting derived work is distributed under the terms of a -.. permission notice identical to this one. -.. -.. Since the Linux kernel and libraries are constantly changing, this -.. manual page may be incorrect or out-of-date. The author(s) assume no -.. responsibility for errors or omissions, or for damages resulting from -.. the use of the information contained herein. The author(s) may not -.. have taken the same level of care in the production of this manual, -.. which is licensed free of charge, as they might when working -.. professionally. -.. -.. Formatted or processed versions of this manual, if unaccompanied by -.. the source, must acknowledge the copyright and authors of this work. -.. %%%LICENSE_END +.. SPDX-License-Identifier: Linux-man-pages-copyleft .. .. Please do not edit this file. It was generated from the documentation .. located in file include/uapi/linux/bpf.h of the Linux kernel sources diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c index 5d26f3c6f918..80cd7843c677 100644 --- a/tools/bpf/resolve_btfids/main.c +++ b/tools/bpf/resolve_btfids/main.c @@ -45,6 +45,19 @@ * .zero 4 * __BTF_ID__func__vfs_fallocate__4: * .zero 4 + * + * set8 - store symbol size into first 4 bytes and sort following + * ID list + * + * __BTF_ID__set8__list: + * .zero 8 + * list: + * __BTF_ID__func__vfs_getattr__3: + * .zero 4 + * .word (1 << 0) | (1 << 2) + * __BTF_ID__func__vfs_fallocate__5: + * .zero 4 + * .word (1 << 3) | (1 << 1) | (1 << 2) */ #define _GNU_SOURCE @@ -72,6 +85,7 @@ #define BTF_TYPEDEF "typedef" #define BTF_FUNC "func" #define BTF_SET "set" +#define BTF_SET8 "set8" #define ADDR_CNT 100 @@ -84,6 +98,7 @@ struct btf_id { }; int addr_cnt; bool is_set; + bool is_set8; Elf64_Addr addr[ADDR_CNT]; }; @@ -231,14 +246,14 @@ static char *get_id(const char *prefix_end) return id; } -static struct btf_id *add_set(struct object *obj, char *name) +static struct btf_id *add_set(struct object *obj, char *name, bool is_set8) { /* * __BTF_ID__set__name * name = ^ * id = ^ */ - char *id = name + sizeof(BTF_SET "__") - 1; + char *id = name + (is_set8 ? sizeof(BTF_SET8 "__") : sizeof(BTF_SET "__")) - 1; int len = strlen(name); if (id >= name + len) { @@ -444,9 +459,21 @@ static int symbols_collect(struct object *obj) } else if (!strncmp(prefix, BTF_FUNC, sizeof(BTF_FUNC) - 1)) { obj->nr_funcs++; id = add_symbol(&obj->funcs, prefix, sizeof(BTF_FUNC) - 1); + /* set8 */ + } else if (!strncmp(prefix, BTF_SET8, sizeof(BTF_SET8) - 1)) { + id = add_set(obj, prefix, true); + /* + * SET8 objects store list's count, which is encoded + * in symbol's size, together with 'cnt' field hence + * that - 1. + */ + if (id) { + id->cnt = sym.st_size / sizeof(uint64_t) - 1; + id->is_set8 = true; + } /* set */ } else if (!strncmp(prefix, BTF_SET, sizeof(BTF_SET) - 1)) { - id = add_set(obj, prefix); + id = add_set(obj, prefix, false); /* * SET objects store list's count, which is encoded * in symbol's size, together with 'cnt' field hence @@ -571,7 +598,8 @@ static int id_patch(struct object *obj, struct btf_id *id) int *ptr = data->d_buf; int i; - if (!id->id && !id->is_set) + /* For set, set8, id->id may be 0 */ + if (!id->id && !id->is_set && !id->is_set8) pr_err("WARN: resolve_btfids: unresolved symbol %s\n", id->name); for (i = 0; i < id->addr_cnt; i++) { @@ -643,13 +671,13 @@ static int sets_patch(struct object *obj) } idx = idx / sizeof(int); - base = &ptr[idx] + 1; + base = &ptr[idx] + (id->is_set8 ? 2 : 1); cnt = ptr[idx]; pr_debug("sorting addr %5lu: cnt %6d [%s]\n", (idx + 1) * sizeof(int), cnt, id->name); - qsort(base, cnt, sizeof(int), cmp_id); + qsort(base, cnt, id->is_set8 ? sizeof(uint64_t) : sizeof(int), cmp_id); next = rb_next(next); } diff --git a/tools/bpf/runqslower/Makefile b/tools/bpf/runqslower/Makefile index da6de16a3dfb..8b3d87b82b7a 100644 --- a/tools/bpf/runqslower/Makefile +++ b/tools/bpf/runqslower/Makefile @@ -4,7 +4,7 @@ include ../../scripts/Makefile.include OUTPUT ?= $(abspath .output)/ BPFTOOL_OUTPUT := $(OUTPUT)bpftool/ -DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bpftool +DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bootstrap/bpftool BPFTOOL ?= $(DEFAULT_BPFTOOL) LIBBPF_SRC := $(abspath ../../lib/bpf) BPFOBJ_OUTPUT := $(OUTPUT)libbpf/ @@ -86,6 +86,5 @@ $(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(BPFOBJ_OU $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(BPFOBJ_OUTPUT) \ DESTDIR=$(BPFOBJ_OUTPUT) prefix= $(abspath $@) install_headers -$(DEFAULT_BPFTOOL): $(BPFOBJ) | $(BPFTOOL_OUTPUT) - $(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT) \ - ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) +$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT) bootstrap diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 3dd13fe738b9..59a217ca2dfd 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2361,7 +2361,8 @@ union bpf_attr { * Pull in non-linear data in case the *skb* is non-linear and not * all of *len* are part of the linear section. Make *len* bytes * from *skb* readable and writable. If a zero value is passed for - * *len*, then the whole length of the *skb* is pulled. + * *len*, then all bytes in the linear part of *skb* will be made + * readable and writable. * * This helper is only needed for reading and writing with direct * packet access. diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h index 11f9096407fc..f4d3e1e2abe2 100644 --- a/tools/lib/bpf/bpf_tracing.h +++ b/tools/lib/bpf/bpf_tracing.h @@ -2,6 +2,8 @@ #ifndef __BPF_TRACING_H__ #define __BPF_TRACING_H__ +#include <bpf/bpf_helpers.h> + /* Scan the ARCH passed in from ARCH env variable (see Makefile) */ #if defined(__TARGET_ARCH_x86) #define bpf_target_x86 @@ -140,7 +142,7 @@ struct pt_regs___s390 { #define __PT_RC_REG gprs[2] #define __PT_SP_REG gprs[15] #define __PT_IP_REG psw.addr -#define PT_REGS_PARM1_SYSCALL(x) ({ _Pragma("GCC error \"use PT_REGS_PARM1_CORE_SYSCALL() instead\""); 0l; }) +#define PT_REGS_PARM1_SYSCALL(x) PT_REGS_PARM1_CORE_SYSCALL(x) #define PT_REGS_PARM1_CORE_SYSCALL(x) BPF_CORE_READ((const struct pt_regs___s390 *)(x), orig_gpr2) #elif defined(bpf_target_arm) @@ -174,7 +176,7 @@ struct pt_regs___arm64 { #define __PT_RC_REG regs[0] #define __PT_SP_REG sp #define __PT_IP_REG pc -#define PT_REGS_PARM1_SYSCALL(x) ({ _Pragma("GCC error \"use PT_REGS_PARM1_CORE_SYSCALL() instead\""); 0l; }) +#define PT_REGS_PARM1_SYSCALL(x) PT_REGS_PARM1_CORE_SYSCALL(x) #define PT_REGS_PARM1_CORE_SYSCALL(x) BPF_CORE_READ((const struct pt_regs___arm64 *)(x), orig_x0) #elif defined(bpf_target_mips) @@ -493,39 +495,62 @@ typeof(name(0)) name(struct pt_regs *ctx) \ } \ static __always_inline typeof(name(0)) ____##name(struct pt_regs *ctx, ##args) +/* If kernel has CONFIG_ARCH_HAS_SYSCALL_WRAPPER, read pt_regs directly */ #define ___bpf_syscall_args0() ctx -#define ___bpf_syscall_args1(x) ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs) -#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs) -#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs) -#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs) -#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs) +#define ___bpf_syscall_args1(x) ___bpf_syscall_args0(), (void *)PT_REGS_PARM1_SYSCALL(regs) +#define ___bpf_syscall_args2(x, args...) ___bpf_syscall_args1(args), (void *)PT_REGS_PARM2_SYSCALL(regs) +#define ___bpf_syscall_args3(x, args...) ___bpf_syscall_args2(args), (void *)PT_REGS_PARM3_SYSCALL(regs) +#define ___bpf_syscall_args4(x, args...) ___bpf_syscall_args3(args), (void *)PT_REGS_PARM4_SYSCALL(regs) +#define ___bpf_syscall_args5(x, args...) ___bpf_syscall_args4(args), (void *)PT_REGS_PARM5_SYSCALL(regs) #define ___bpf_syscall_args(args...) ___bpf_apply(___bpf_syscall_args, ___bpf_narg(args))(args) +/* If kernel doesn't have CONFIG_ARCH_HAS_SYSCALL_WRAPPER, we have to BPF_CORE_READ from pt_regs */ +#define ___bpf_syswrap_args0() ctx +#define ___bpf_syswrap_args1(x) ___bpf_syswrap_args0(), (void *)PT_REGS_PARM1_CORE_SYSCALL(regs) +#define ___bpf_syswrap_args2(x, args...) ___bpf_syswrap_args1(args), (void *)PT_REGS_PARM2_CORE_SYSCALL(regs) +#define ___bpf_syswrap_args3(x, args...) ___bpf_syswrap_args2(args), (void *)PT_REGS_PARM3_CORE_SYSCALL(regs) +#define ___bpf_syswrap_args4(x, args...) ___bpf_syswrap_args3(args), (void *)PT_REGS_PARM4_CORE_SYSCALL(regs) +#define ___bpf_syswrap_args5(x, args...) ___bpf_syswrap_args4(args), (void *)PT_REGS_PARM5_CORE_SYSCALL(regs) +#define ___bpf_syswrap_args(args...) ___bpf_apply(___bpf_syswrap_args, ___bpf_narg(args))(args) + /* - * BPF_KPROBE_SYSCALL is a variant of BPF_KPROBE, which is intended for + * BPF_KSYSCALL is a variant of BPF_KPROBE, which is intended for * tracing syscall functions, like __x64_sys_close. It hides the underlying * platform-specific low-level way of getting syscall input arguments from * struct pt_regs, and provides a familiar typed and named function arguments * syntax and semantics of accessing syscall input parameters. * - * Original struct pt_regs* context is preserved as 'ctx' argument. This might + * Original struct pt_regs * context is preserved as 'ctx' argument. This might * be necessary when using BPF helpers like bpf_perf_event_output(). * - * This macro relies on BPF CO-RE support. + * At the moment BPF_KSYSCALL does not handle all the calling convention + * quirks for mmap(), clone() and compat syscalls transparrently. This may or + * may not change in the future. User needs to take extra measures to handle + * such quirks explicitly, if necessary. + * + * This macro relies on BPF CO-RE support and virtual __kconfig externs. */ -#define BPF_KPROBE_SYSCALL(name, args...) \ +#define BPF_KSYSCALL(name, args...) \ name(struct pt_regs *ctx); \ +extern _Bool LINUX_HAS_SYSCALL_WRAPPER __kconfig; \ static __attribute__((always_inline)) typeof(name(0)) \ ____##name(struct pt_regs *ctx, ##args); \ typeof(name(0)) name(struct pt_regs *ctx) \ { \ - struct pt_regs *regs = PT_REGS_SYSCALL_REGS(ctx); \ + struct pt_regs *regs = LINUX_HAS_SYSCALL_WRAPPER \ + ? (struct pt_regs *)PT_REGS_PARM1(ctx) \ + : ctx; \ _Pragma("GCC diagnostic push") \ _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \ - return ____##name(___bpf_syscall_args(args)); \ + if (LINUX_HAS_SYSCALL_WRAPPER) \ + return ____##name(___bpf_syswrap_args(args)); \ + else \ + return ____##name(___bpf_syscall_args(args)); \ _Pragma("GCC diagnostic pop") \ } \ static __attribute__((always_inline)) typeof(name(0)) \ ____##name(struct pt_regs *ctx, ##args) +#define BPF_KPROBE_SYSCALL BPF_KSYSCALL + #endif diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index 400e84fd0578..627edb5bb6de 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -2045,7 +2045,7 @@ static int btf_dump_get_enum_value(struct btf_dump *d, *value = *(__s64 *)data; return 0; case 4: - *value = is_signed ? *(__s32 *)data : *(__u32 *)data; + *value = is_signed ? (__s64)*(__s32 *)data : *(__u32 *)data; return 0; case 2: *value = is_signed ? *(__s16 *)data : *(__u16 *)data; diff --git a/tools/lib/bpf/gen_loader.c b/tools/lib/bpf/gen_loader.c index 927745b08014..23f5c46708f8 100644 --- a/tools/lib/bpf/gen_loader.c +++ b/tools/lib/bpf/gen_loader.c @@ -533,7 +533,7 @@ void bpf_gen__record_attach_target(struct bpf_gen *gen, const char *attach_name, gen->attach_kind = kind; ret = snprintf(gen->attach_target, sizeof(gen->attach_target), "%s%s", prefix, attach_name); - if (ret == sizeof(gen->attach_target)) + if (ret >= sizeof(gen->attach_target)) gen->error = -ENOSPC; } diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index cb49408eb298..b01fe01b0761 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1694,7 +1694,7 @@ static int set_kcfg_value_tri(struct extern_desc *ext, void *ext_val, switch (ext->kcfg.type) { case KCFG_BOOL: if (value == 'm') { - pr_warn("extern (kcfg) %s=%c should be tristate or char\n", + pr_warn("extern (kcfg) '%s': value '%c' implies tristate or char type\n", ext->name, value); return -EINVAL; } @@ -1715,7 +1715,7 @@ static int set_kcfg_value_tri(struct extern_desc *ext, void *ext_val, case KCFG_INT: case KCFG_CHAR_ARR: default: - pr_warn("extern (kcfg) %s=%c should be bool, tristate, or char\n", + pr_warn("extern (kcfg) '%s': value '%c' implies bool, tristate, or char type\n", ext->name, value); return -EINVAL; } @@ -1729,7 +1729,8 @@ static int set_kcfg_value_str(struct extern_desc *ext, char *ext_val, size_t len; if (ext->kcfg.type != KCFG_CHAR_ARR) { - pr_warn("extern (kcfg) %s=%s should be char array\n", ext->name, value); + pr_warn("extern (kcfg) '%s': value '%s' implies char array type\n", + ext->name, value); return -EINVAL; } @@ -1743,7 +1744,7 @@ static int set_kcfg_value_str(struct extern_desc *ext, char *ext_val, /* strip quotes */ len -= 2; if (len >= ext->kcfg.sz) { - pr_warn("extern (kcfg) '%s': long string config %s of (%zu bytes) truncated to %d bytes\n", + pr_warn("extern (kcfg) '%s': long string '%s' of (%zu bytes) truncated to %d bytes\n", ext->name, value, len, ext->kcfg.sz - 1); len = ext->kcfg.sz - 1; } @@ -1800,13 +1801,20 @@ static bool is_kcfg_value_in_range(const struct extern_desc *ext, __u64 v) static int set_kcfg_value_num(struct extern_desc *ext, void *ext_val, __u64 value) { - if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR) { - pr_warn("extern (kcfg) %s=%llu should be integer\n", + if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR && + ext->kcfg.type != KCFG_BOOL) { + pr_warn("extern (kcfg) '%s': value '%llu' implies integer, char, or boolean type\n", ext->name, (unsigned long long)value); return -EINVAL; } + if (ext->kcfg.type == KCFG_BOOL && value > 1) { + pr_warn("extern (kcfg) '%s': value '%llu' isn't boolean compatible\n", + ext->name, (unsigned long long)value); + return -EINVAL; + + } if (!is_kcfg_value_in_range(ext, value)) { - pr_warn("extern (kcfg) %s=%llu value doesn't fit in %d bytes\n", + pr_warn("extern (kcfg) '%s': value '%llu' doesn't fit in %d bytes\n", ext->name, (unsigned long long)value, ext->kcfg.sz); return -ERANGE; } @@ -1870,16 +1878,19 @@ static int bpf_object__process_kconfig_line(struct bpf_object *obj, /* assume integer */ err = parse_u64(value, &num); if (err) { - pr_warn("extern (kcfg) %s=%s should be integer\n", - ext->name, value); + pr_warn("extern (kcfg) '%s': value '%s' isn't a valid integer\n", ext->name, value); return err; } + if (ext->kcfg.type != KCFG_INT && ext->kcfg.type != KCFG_CHAR) { + pr_warn("extern (kcfg) '%s': value '%s' implies integer type\n", ext->name, value); + return -EINVAL; + } err = set_kcfg_value_num(ext, ext_val, num); break; } if (err) return err; - pr_debug("extern (kcfg) %s=%s\n", ext->name, value); + pr_debug("extern (kcfg) '%s': set to %s\n", ext->name, value); return 0; } @@ -2320,6 +2331,37 @@ int parse_btf_map_def(const char *map_name, struct btf *btf, return 0; } +static size_t adjust_ringbuf_sz(size_t sz) +{ + __u32 page_sz = sysconf(_SC_PAGE_SIZE); + __u32 mul; + + /* if user forgot to set any size, make sure they see error */ + if (sz == 0) + return 0; + /* Kernel expects BPF_MAP_TYPE_RINGBUF's max_entries to be + * a power-of-2 multiple of kernel's page size. If user diligently + * satisified these conditions, pass the size through. + */ + if ((sz % page_sz) == 0 && is_pow_of_2(sz / page_sz)) + return sz; + + /* Otherwise find closest (page_sz * power_of_2) product bigger than + * user-set size to satisfy both user size request and kernel + * requirements and substitute correct max_entries for map creation. + */ + for (mul = 1; mul <= UINT_MAX / page_sz; mul <<= 1) { + if (mul * page_sz > sz) + return mul * page_sz; + } + + /* if it's impossible to satisfy the conditions (i.e., user size is + * very close to UINT_MAX but is not a power-of-2 multiple of + * page_size) then just return original size and let kernel reject it + */ + return sz; +} + static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def) { map->def.type = def->map_type; @@ -2333,6 +2375,10 @@ static void fill_map_from_def(struct bpf_map *map, const struct btf_map_def *def map->btf_key_type_id = def->key_type_id; map->btf_value_type_id = def->value_type_id; + /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */ + if (map->def.type == BPF_MAP_TYPE_RINGBUF) + map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries); + if (def->parts & MAP_DEF_MAP_TYPE) pr_debug("map '%s': found type = %u.\n", map->name, def->map_type); @@ -3687,7 +3733,7 @@ static int bpf_object__collect_externs(struct bpf_object *obj) ext->kcfg.type = find_kcfg_type(obj->btf, t->type, &ext->kcfg.is_signed); if (ext->kcfg.type == KCFG_UNKNOWN) { - pr_warn("extern (kcfg) '%s' type is unsupported\n", ext_name); + pr_warn("extern (kcfg) '%s': type is unsupported\n", ext_name); return -ENOTSUP; } } else if (strcmp(sec_name, KSYMS_SEC) == 0) { @@ -4232,7 +4278,7 @@ int bpf_map__set_autocreate(struct bpf_map *map, bool autocreate) int bpf_map__reuse_fd(struct bpf_map *map, int fd) { struct bpf_map_info info = {}; - __u32 len = sizeof(info); + __u32 len = sizeof(info), name_len; int new_fd, err; char *new_name; @@ -4242,7 +4288,12 @@ int bpf_map__reuse_fd(struct bpf_map *map, int fd) if (err) return libbpf_err(err); - new_name = strdup(info.name); + name_len = strlen(info.name); + if (name_len == BPF_OBJ_NAME_LEN - 1 && strncmp(map->name, info.name, name_len) == 0) + new_name = strdup(map->name); + else + new_name = strdup(info.name); + if (!new_name) return libbpf_err(-errno); @@ -4301,9 +4352,15 @@ struct bpf_map *bpf_map__inner_map(struct bpf_map *map) int bpf_map__set_max_entries(struct bpf_map *map, __u32 max_entries) { - if (map->fd >= 0) + if (map->obj->loaded) return libbpf_err(-EBUSY); + map->def.max_entries = max_entries; + + /* auto-adjust BPF ringbuf map max_entries to be a multiple of page size */ + if (map->def.type == BPF_MAP_TYPE_RINGBUF) + map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries); + return 0; } @@ -4654,6 +4711,8 @@ static int probe_kern_btf_enum64(void) strs, sizeof(strs))); } +static int probe_kern_syscall_wrapper(void); + enum kern_feature_result { FEAT_UNKNOWN = 0, FEAT_SUPPORTED = 1, @@ -4722,6 +4781,9 @@ static struct kern_feature_desc { [FEAT_BTF_ENUM64] = { "BTF_KIND_ENUM64 support", probe_kern_btf_enum64, }, + [FEAT_SYSCALL_WRAPPER] = { + "Kernel using syscall wrapper", probe_kern_syscall_wrapper, + }, }; bool kernel_supports(const struct bpf_object *obj, enum kern_feature_id feat_id) @@ -4854,37 +4916,6 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map) static void bpf_map__destroy(struct bpf_map *map); -static size_t adjust_ringbuf_sz(size_t sz) -{ - __u32 page_sz = sysconf(_SC_PAGE_SIZE); - __u32 mul; - - /* if user forgot to set any size, make sure they see error */ - if (sz == 0) - return 0; - /* Kernel expects BPF_MAP_TYPE_RINGBUF's max_entries to be - * a power-of-2 multiple of kernel's page size. If user diligently - * satisified these conditions, pass the size through. - */ - if ((sz % page_sz) == 0 && is_pow_of_2(sz / page_sz)) - return sz; - - /* Otherwise find closest (page_sz * power_of_2) product bigger than - * user-set size to satisfy both user size request and kernel - * requirements and substitute correct max_entries for map creation. - */ - for (mul = 1; mul <= UINT_MAX / page_sz; mul <<= 1) { - if (mul * page_sz > sz) - return mul * page_sz; - } - - /* if it's impossible to satisfy the conditions (i.e., user size is - * very close to UINT_MAX but is not a power-of-2 multiple of - * page_size) then just return original size and let kernel reject it - */ - return sz; -} - static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, bool is_inner) { LIBBPF_OPTS(bpf_map_create_opts, create_attr); @@ -4923,9 +4954,6 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b } switch (def->type) { - case BPF_MAP_TYPE_RINGBUF: - map->def.max_entries = adjust_ringbuf_sz(map->def.max_entries); - /* fallthrough */ case BPF_MAP_TYPE_PERF_EVENT_ARRAY: case BPF_MAP_TYPE_CGROUP_ARRAY: case BPF_MAP_TYPE_STACK_TRACE: @@ -7282,14 +7310,14 @@ static int kallsyms_cb(unsigned long long sym_addr, char sym_type, return 0; if (ext->is_set && ext->ksym.addr != sym_addr) { - pr_warn("extern (ksym) '%s' resolution is ambiguous: 0x%llx or 0x%llx\n", + pr_warn("extern (ksym) '%s': resolution is ambiguous: 0x%llx or 0x%llx\n", sym_name, ext->ksym.addr, sym_addr); return -EINVAL; } if (!ext->is_set) { ext->is_set = true; ext->ksym.addr = sym_addr; - pr_debug("extern (ksym) %s=0x%llx\n", sym_name, sym_addr); + pr_debug("extern (ksym) '%s': set to 0x%llx\n", sym_name, sym_addr); } return 0; } @@ -7493,28 +7521,52 @@ static int bpf_object__resolve_externs(struct bpf_object *obj, for (i = 0; i < obj->nr_extern; i++) { ext = &obj->externs[i]; - if (ext->type == EXT_KCFG && - strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) { - void *ext_val = kcfg_data + ext->kcfg.data_off; - __u32 kver = get_kernel_version(); + if (ext->type == EXT_KSYM) { + if (ext->ksym.type_id) + need_vmlinux_btf = true; + else + need_kallsyms = true; + continue; + } else if (ext->type == EXT_KCFG) { + void *ext_ptr = kcfg_data + ext->kcfg.data_off; + __u64 value = 0; + + /* Kconfig externs need actual /proc/config.gz */ + if (str_has_pfx(ext->name, "CONFIG_")) { + need_config = true; + continue; + } - if (!kver) { - pr_warn("failed to get kernel version\n"); + /* Virtual kcfg externs are customly handled by libbpf */ + if (strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) { + value = get_kernel_version(); + if (!value) { + pr_warn("extern (kcfg) '%s': failed to get kernel version\n", ext->name); + return -EINVAL; + } + } else if (strcmp(ext->name, "LINUX_HAS_BPF_COOKIE") == 0) { + value = kernel_supports(obj, FEAT_BPF_COOKIE); + } else if (strcmp(ext->name, "LINUX_HAS_SYSCALL_WRAPPER") == 0) { + value = kernel_supports(obj, FEAT_SYSCALL_WRAPPER); + } else if (!str_has_pfx(ext->name, "LINUX_") || !ext->is_weak) { + /* Currently libbpf supports only CONFIG_ and LINUX_ prefixed + * __kconfig externs, where LINUX_ ones are virtual and filled out + * customly by libbpf (their values don't come from Kconfig). + * If LINUX_xxx variable is not recognized by libbpf, but is marked + * __weak, it defaults to zero value, just like for CONFIG_xxx + * externs. + */ + pr_warn("extern (kcfg) '%s': unrecognized virtual extern\n", ext->name); return -EINVAL; } - err = set_kcfg_value_num(ext, ext_val, kver); + + err = set_kcfg_value_num(ext, ext_ptr, value); if (err) return err; - pr_debug("extern (kcfg) %s=0x%x\n", ext->name, kver); - } else if (ext->type == EXT_KCFG && str_has_pfx(ext->name, "CONFIG_")) { - need_config = true; - } else if (ext->type == EXT_KSYM) { - if (ext->ksym.type_id) - need_vmlinux_btf = true; - else - need_kallsyms = true; + pr_debug("extern (kcfg) '%s': set to 0x%llx\n", + ext->name, (long long)value); } else { - pr_warn("unrecognized extern '%s'\n", ext->name); + pr_warn("extern '%s': unrecognized extern kind\n", ext->name); return -EINVAL; } } @@ -7550,10 +7602,10 @@ static int bpf_object__resolve_externs(struct bpf_object *obj, ext = &obj->externs[i]; if (!ext->is_set && !ext->is_weak) { - pr_warn("extern %s (strong) not resolved\n", ext->name); + pr_warn("extern '%s' (strong): not resolved\n", ext->name); return -ESRCH; } else if (!ext->is_set) { - pr_debug("extern %s (weak) not resolved, defaulting to zero\n", + pr_debug("extern '%s' (weak): not resolved, defaulting to zero\n", ext->name); } } @@ -8381,6 +8433,7 @@ int bpf_program__set_log_buf(struct bpf_program *prog, char *log_buf, size_t log static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_uprobe(const struct bpf_program *prog, long cookie, struct bpf_link **link); +static int attach_ksyscall(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_usdt(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_tp(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_raw_tp(const struct bpf_program *prog, long cookie, struct bpf_link **link); @@ -8401,6 +8454,8 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("uretprobe.s+", KPROBE, 0, SEC_SLEEPABLE, attach_uprobe), SEC_DEF("kprobe.multi+", KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi), SEC_DEF("kretprobe.multi+", KPROBE, BPF_TRACE_KPROBE_MULTI, SEC_NONE, attach_kprobe_multi), + SEC_DEF("ksyscall+", KPROBE, 0, SEC_NONE, attach_ksyscall), + SEC_DEF("kretsyscall+", KPROBE, 0, SEC_NONE, attach_ksyscall), SEC_DEF("usdt+", KPROBE, 0, SEC_NONE, attach_usdt), SEC_DEF("tc", SCHED_CLS, 0, SEC_NONE), SEC_DEF("classifier", SCHED_CLS, 0, SEC_NONE), @@ -9757,7 +9812,7 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name, { struct perf_event_attr attr = {}; char errmsg[STRERR_BUFSIZE]; - int type, pfd, err; + int type, pfd; if (ref_ctr_off >= (1ULL << PERF_UPROBE_REF_CTR_OFFSET_BITS)) return -EINVAL; @@ -9793,14 +9848,7 @@ static int perf_event_open_probe(bool uprobe, bool retprobe, const char *name, pid < 0 ? -1 : pid /* pid */, pid == -1 ? 0 : -1 /* cpu */, -1 /* group_fd */, PERF_FLAG_FD_CLOEXEC); - if (pfd < 0) { - err = -errno; - pr_warn("%s perf_event_open() failed: %s\n", - uprobe ? "uprobe" : "kprobe", - libbpf_strerror_r(err, errmsg, sizeof(errmsg))); - return err; - } - return pfd; + return pfd >= 0 ? pfd : -errno; } static int append_to_file(const char *file, const char *fmt, ...) @@ -9823,6 +9871,34 @@ static int append_to_file(const char *file, const char *fmt, ...) return err; } +#define DEBUGFS "/sys/kernel/debug/tracing" +#define TRACEFS "/sys/kernel/tracing" + +static bool use_debugfs(void) +{ + static int has_debugfs = -1; + + if (has_debugfs < 0) + has_debugfs = access(DEBUGFS, F_OK) == 0; + + return has_debugfs == 1; +} + +static const char *tracefs_path(void) +{ + return use_debugfs() ? DEBUGFS : TRACEFS; +} + +static const char *tracefs_kprobe_events(void) +{ + return use_debugfs() ? DEBUGFS"/kprobe_events" : TRACEFS"/kprobe_events"; +} + +static const char *tracefs_uprobe_events(void) +{ + return use_debugfs() ? DEBUGFS"/uprobe_events" : TRACEFS"/uprobe_events"; +} + static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz, const char *kfunc_name, size_t offset) { @@ -9835,9 +9911,7 @@ static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz, static int add_kprobe_event_legacy(const char *probe_name, bool retprobe, const char *kfunc_name, size_t offset) { - const char *file = "/sys/kernel/debug/tracing/kprobe_events"; - - return append_to_file(file, "%c:%s/%s %s+0x%zx", + return append_to_file(tracefs_kprobe_events(), "%c:%s/%s %s+0x%zx", retprobe ? 'r' : 'p', retprobe ? "kretprobes" : "kprobes", probe_name, kfunc_name, offset); @@ -9845,18 +9919,16 @@ static int add_kprobe_event_legacy(const char *probe_name, bool retprobe, static int remove_kprobe_event_legacy(const char *probe_name, bool retprobe) { - const char *file = "/sys/kernel/debug/tracing/kprobe_events"; - - return append_to_file(file, "-:%s/%s", retprobe ? "kretprobes" : "kprobes", probe_name); + return append_to_file(tracefs_kprobe_events(), "-:%s/%s", + retprobe ? "kretprobes" : "kprobes", probe_name); } static int determine_kprobe_perf_type_legacy(const char *probe_name, bool retprobe) { char file[256]; - snprintf(file, sizeof(file), - "/sys/kernel/debug/tracing/events/%s/%s/id", - retprobe ? "kretprobes" : "kprobes", probe_name); + snprintf(file, sizeof(file), "%s/events/%s/%s/id", + tracefs_path(), retprobe ? "kretprobes" : "kprobes", probe_name); return parse_uint_from_file(file, "%d\n"); } @@ -9905,6 +9977,60 @@ err_clean_legacy: return err; } +static const char *arch_specific_syscall_pfx(void) +{ +#if defined(__x86_64__) + return "x64"; +#elif defined(__i386__) + return "ia32"; +#elif defined(__s390x__) + return "s390x"; +#elif defined(__s390__) + return "s390"; +#elif defined(__arm__) + return "arm"; +#elif defined(__aarch64__) + return "arm64"; +#elif defined(__mips__) + return "mips"; +#elif defined(__riscv) + return "riscv"; +#else + return NULL; +#endif +} + +static int probe_kern_syscall_wrapper(void) +{ + char syscall_name[64]; + const char *ksys_pfx; + + ksys_pfx = arch_specific_syscall_pfx(); + if (!ksys_pfx) + return 0; + + snprintf(syscall_name, sizeof(syscall_name), "__%s_sys_bpf", ksys_pfx); + + if (determine_kprobe_perf_type() >= 0) { + int pfd; + + pfd = perf_event_open_probe(false, false, syscall_name, 0, getpid(), 0); + if (pfd >= 0) + close(pfd); + + return pfd >= 0 ? 1 : 0; + } else { /* legacy mode */ + char probe_name[128]; + + gen_kprobe_legacy_event_name(probe_name, sizeof(probe_name), syscall_name, 0); + if (add_kprobe_event_legacy(probe_name, false, syscall_name, 0) < 0) + return 0; + + (void)remove_kprobe_event_legacy(probe_name, false); + return 1; + } +} + struct bpf_link * bpf_program__attach_kprobe_opts(const struct bpf_program *prog, const char *func_name, @@ -9990,6 +10116,29 @@ struct bpf_link *bpf_program__attach_kprobe(const struct bpf_program *prog, return bpf_program__attach_kprobe_opts(prog, func_name, &opts); } +struct bpf_link *bpf_program__attach_ksyscall(const struct bpf_program *prog, + const char *syscall_name, + const struct bpf_ksyscall_opts *opts) +{ + LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts); + char func_name[128]; + + if (!OPTS_VALID(opts, bpf_ksyscall_opts)) + return libbpf_err_ptr(-EINVAL); + + if (kernel_supports(prog->obj, FEAT_SYSCALL_WRAPPER)) { + snprintf(func_name, sizeof(func_name), "__%s_sys_%s", + arch_specific_syscall_pfx(), syscall_name); + } else { + snprintf(func_name, sizeof(func_name), "__se_sys_%s", syscall_name); + } + + kprobe_opts.retprobe = OPTS_GET(opts, retprobe, false); + kprobe_opts.bpf_cookie = OPTS_GET(opts, bpf_cookie, 0); + + return bpf_program__attach_kprobe_opts(prog, func_name, &kprobe_opts); +} + /* Adapted from perf/util/string.c */ static bool glob_match(const char *str, const char *pat) { @@ -10160,6 +10309,27 @@ static int attach_kprobe(const struct bpf_program *prog, long cookie, struct bpf return libbpf_get_error(*link); } +static int attach_ksyscall(const struct bpf_program *prog, long cookie, struct bpf_link **link) +{ + LIBBPF_OPTS(bpf_ksyscall_opts, opts); + const char *syscall_name; + + *link = NULL; + + /* no auto-attach for SEC("ksyscall") and SEC("kretsyscall") */ + if (strcmp(prog->sec_name, "ksyscall") == 0 || strcmp(prog->sec_name, "kretsyscall") == 0) + return 0; + + opts.retprobe = str_has_pfx(prog->sec_name, "kretsyscall/"); + if (opts.retprobe) + syscall_name = prog->sec_name + sizeof("kretsyscall/") - 1; + else + syscall_name = prog->sec_name + sizeof("ksyscall/") - 1; + + *link = bpf_program__attach_ksyscall(prog, syscall_name, &opts); + return *link ? 0 : -errno; +} + static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, struct bpf_link **link) { LIBBPF_OPTS(bpf_kprobe_multi_opts, opts); @@ -10208,9 +10378,7 @@ static void gen_uprobe_legacy_event_name(char *buf, size_t buf_sz, static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe, const char *binary_path, size_t offset) { - const char *file = "/sys/kernel/debug/tracing/uprobe_events"; - - return append_to_file(file, "%c:%s/%s %s:0x%zx", + return append_to_file(tracefs_uprobe_events(), "%c:%s/%s %s:0x%zx", retprobe ? 'r' : 'p', retprobe ? "uretprobes" : "uprobes", probe_name, binary_path, offset); @@ -10218,18 +10386,16 @@ static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe, static inline int remove_uprobe_event_legacy(const char *probe_name, bool retprobe) { - const char *file = "/sys/kernel/debug/tracing/uprobe_events"; - - return append_to_file(file, "-:%s/%s", retprobe ? "uretprobes" : "uprobes", probe_name); + return append_to_file(tracefs_uprobe_events(), "-:%s/%s", + retprobe ? "uretprobes" : "uprobes", probe_name); } static int determine_uprobe_perf_type_legacy(const char *probe_name, bool retprobe) { char file[512]; - snprintf(file, sizeof(file), - "/sys/kernel/debug/tracing/events/%s/%s/id", - retprobe ? "uretprobes" : "uprobes", probe_name); + snprintf(file, sizeof(file), "%s/events/%s/%s/id", + tracefs_path(), retprobe ? "uretprobes" : "uprobes", probe_name); return parse_uint_from_file(file, "%d\n"); } @@ -10545,7 +10711,10 @@ bpf_program__attach_uprobe_opts(const struct bpf_program *prog, pid_t pid, ref_ctr_off = OPTS_GET(opts, ref_ctr_offset, 0); pe_opts.bpf_cookie = OPTS_GET(opts, bpf_cookie, 0); - if (binary_path && !strchr(binary_path, '/')) { + if (!binary_path) + return libbpf_err_ptr(-EINVAL); + + if (!strchr(binary_path, '/')) { err = resolve_full_path(binary_path, full_binary_path, sizeof(full_binary_path)); if (err) { @@ -10559,11 +10728,6 @@ bpf_program__attach_uprobe_opts(const struct bpf_program *prog, pid_t pid, if (func_name) { long sym_off; - if (!binary_path) { - pr_warn("prog '%s': name-based attach requires binary_path\n", - prog->name); - return libbpf_err_ptr(-EINVAL); - } sym_off = elf_find_func_offset(binary_path, func_name); if (sym_off < 0) return libbpf_err_ptr(sym_off); @@ -10711,6 +10875,9 @@ struct bpf_link *bpf_program__attach_usdt(const struct bpf_program *prog, return libbpf_err_ptr(-EINVAL); } + if (!binary_path) + return libbpf_err_ptr(-EINVAL); + if (!strchr(binary_path, '/')) { err = resolve_full_path(binary_path, resolved_path, sizeof(resolved_path)); if (err) { @@ -10776,9 +10943,8 @@ static int determine_tracepoint_id(const char *tp_category, char file[PATH_MAX]; int ret; - ret = snprintf(file, sizeof(file), - "/sys/kernel/debug/tracing/events/%s/%s/id", - tp_category, tp_name); + ret = snprintf(file, sizeof(file), "%s/events/%s/%s/id", + tracefs_path(), tp_category, tp_name); if (ret < 0) return -errno; if (ret >= sizeof(file)) { @@ -11728,6 +11894,22 @@ int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx) return cpu_buf->fd; } +int perf_buffer__buffer(struct perf_buffer *pb, int buf_idx, void **buf, size_t *buf_size) +{ + struct perf_cpu_buf *cpu_buf; + + if (buf_idx >= pb->cpu_cnt) + return libbpf_err(-EINVAL); + + cpu_buf = pb->cpu_bufs[buf_idx]; + if (!cpu_buf) + return libbpf_err(-ENOENT); + + *buf = cpu_buf->base; + *buf_size = pb->mmap_size; + return 0; +} + /* * Consume data from perf ring buffer corresponding to slot *buf_idx* in * PERF_EVENT_ARRAY BPF map without waiting/polling. If there is no data to diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e4d5353f757b..61493c4cddac 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -457,6 +457,52 @@ bpf_program__attach_kprobe_multi_opts(const struct bpf_program *prog, const char *pattern, const struct bpf_kprobe_multi_opts *opts); +struct bpf_ksyscall_opts { + /* size of this struct, for forward/backward compatiblity */ + size_t sz; + /* custom user-provided value fetchable through bpf_get_attach_cookie() */ + __u64 bpf_cookie; + /* attach as return probe? */ + bool retprobe; + size_t :0; +}; +#define bpf_ksyscall_opts__last_field retprobe + +/** + * @brief **bpf_program__attach_ksyscall()** attaches a BPF program + * to kernel syscall handler of a specified syscall. Optionally it's possible + * to request to install retprobe that will be triggered at syscall exit. It's + * also possible to associate BPF cookie (though options). + * + * Libbpf automatically will determine correct full kernel function name, + * which depending on system architecture and kernel version/configuration + * could be of the form __<arch>_sys_<syscall> or __se_sys_<syscall>, and will + * attach specified program using kprobe/kretprobe mechanism. + * + * **bpf_program__attach_ksyscall()** is an API counterpart of declarative + * **SEC("ksyscall/<syscall>")** annotation of BPF programs. + * + * At the moment **SEC("ksyscall")** and **bpf_program__attach_ksyscall()** do + * not handle all the calling convention quirks for mmap(), clone() and compat + * syscalls. It also only attaches to "native" syscall interfaces. If host + * system supports compat syscalls or defines 32-bit syscalls in 64-bit + * kernel, such syscall interfaces won't be attached to by libbpf. + * + * These limitations may or may not change in the future. Therefore it is + * recommended to use SEC("kprobe") for these syscalls or if working with + * compat and 32-bit interfaces is required. + * + * @param prog BPF program to attach + * @param syscall_name Symbolic name of the syscall (e.g., "bpf") + * @param opts Additional options (see **struct bpf_ksyscall_opts**) + * @return Reference to the newly created BPF link; or NULL is returned on + * error, error code is stored in errno + */ +LIBBPF_API struct bpf_link * +bpf_program__attach_ksyscall(const struct bpf_program *prog, + const char *syscall_name, + const struct bpf_ksyscall_opts *opts); + struct bpf_uprobe_opts { /* size of this struct, for forward/backward compatiblity */ size_t sz; @@ -1053,6 +1099,22 @@ LIBBPF_API int perf_buffer__consume(struct perf_buffer *pb); LIBBPF_API int perf_buffer__consume_buffer(struct perf_buffer *pb, size_t buf_idx); LIBBPF_API size_t perf_buffer__buffer_cnt(const struct perf_buffer *pb); LIBBPF_API int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx); +/** + * @brief **perf_buffer__buffer()** returns the per-cpu raw mmap()'ed underlying + * memory region of the ring buffer. + * This ring buffer can be used to implement a custom events consumer. + * The ring buffer starts with the *struct perf_event_mmap_page*, which + * holds the ring buffer managment fields, when accessing the header + * structure it's important to be SMP aware. + * You can refer to *perf_event_read_simple* for a simple example. + * @param pb the perf buffer structure + * @param buf_idx the buffer index to retreive + * @param buf (out) gets the base pointer of the mmap()'ed memory + * @param buf_size (out) gets the size of the mmap()'ed region + * @return 0 on success, negative error code for failure + */ +LIBBPF_API int perf_buffer__buffer(struct perf_buffer *pb, int buf_idx, void **buf, + size_t *buf_size); struct bpf_prog_linfo; struct bpf_prog_info; diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 94b589ecfeaa..0625adb9e888 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -356,10 +356,12 @@ LIBBPF_0.8.0 { LIBBPF_1.0.0 { global: bpf_prog_query_opts; + bpf_program__attach_ksyscall; btf__add_enum64; btf__add_enum64_value; libbpf_bpf_attach_type_str; libbpf_bpf_link_type_str; libbpf_bpf_map_type_str; libbpf_bpf_prog_type_str; + perf_buffer__buffer; }; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 9cd7829cbe41..4135ae0a2bc3 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -108,9 +108,9 @@ static inline bool str_has_sfx(const char *str, const char *sfx) size_t str_len = strlen(str); size_t sfx_len = strlen(sfx); - if (sfx_len <= str_len) - return strcmp(str + str_len - sfx_len, sfx); - return false; + if (sfx_len > str_len) + return false; + return strcmp(str + str_len - sfx_len, sfx) == 0; } /* Symbol versioning is different between static and shared library. @@ -352,6 +352,8 @@ enum kern_feature_id { FEAT_BPF_COOKIE, /* BTF_KIND_ENUM64 support and BTF_KIND_ENUM kflag support */ FEAT_BTF_ENUM64, + /* Kernel uses syscall wrapper (CONFIG_ARCH_HAS_SYSCALL_WRAPPER) */ + FEAT_SYSCALL_WRAPPER, __FEAT_CNT, }; diff --git a/tools/lib/bpf/usdt.bpf.h b/tools/lib/bpf/usdt.bpf.h index 4181fddb3687..4f2adc0bd6ca 100644 --- a/tools/lib/bpf/usdt.bpf.h +++ b/tools/lib/bpf/usdt.bpf.h @@ -6,7 +6,6 @@ #include <linux/errno.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -#include <bpf/bpf_core_read.h> /* Below types and maps are internal implementation details of libbpf's USDT * support and are subjects to change. Also, bpf_usdt_xxx() API helpers should @@ -30,14 +29,6 @@ #ifndef BPF_USDT_MAX_IP_CNT #define BPF_USDT_MAX_IP_CNT (4 * BPF_USDT_MAX_SPEC_CNT) #endif -/* We use BPF CO-RE to detect support for BPF cookie from BPF side. This is - * the only dependency on CO-RE, so if it's undesirable, user can override - * BPF_USDT_HAS_BPF_COOKIE to specify whether to BPF cookie is supported or not. - */ -#ifndef BPF_USDT_HAS_BPF_COOKIE -#define BPF_USDT_HAS_BPF_COOKIE \ - bpf_core_enum_value_exists(enum bpf_func_id___usdt, BPF_FUNC_get_attach_cookie___usdt) -#endif enum __bpf_usdt_arg_type { BPF_USDT_ARG_CONST, @@ -83,15 +74,12 @@ struct { __type(value, __u32); } __bpf_usdt_ip_to_spec_id SEC(".maps") __weak; -/* don't rely on user's BPF code to have latest definition of bpf_func_id */ -enum bpf_func_id___usdt { - BPF_FUNC_get_attach_cookie___usdt = 0xBAD, /* value doesn't matter */ -}; +extern const _Bool LINUX_HAS_BPF_COOKIE __kconfig; static __always_inline int __bpf_usdt_spec_id(struct pt_regs *ctx) { - if (!BPF_USDT_HAS_BPF_COOKIE) { + if (!LINUX_HAS_BPF_COOKIE) { long ip = PT_REGS_IP(ctx); int *spec_id_ptr; diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index e585e1cefc77..792cb15bac40 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -148,13 +148,13 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = { .write = bpf_testmod_test_write, }; -BTF_SET_START(bpf_testmod_check_kfunc_ids) -BTF_ID(func, bpf_testmod_test_mod_kfunc) -BTF_SET_END(bpf_testmod_check_kfunc_ids) +BTF_SET8_START(bpf_testmod_check_kfunc_ids) +BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc) +BTF_SET8_END(bpf_testmod_check_kfunc_ids) static const struct btf_kfunc_id_set bpf_testmod_kfunc_set = { - .owner = THIS_MODULE, - .check_set = &bpf_testmod_check_kfunc_ids, + .owner = THIS_MODULE, + .set = &bpf_testmod_check_kfunc_ids, }; extern int bpf_fentry_test1(int a); diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index 7ff5fa93d056..a33874b081b6 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -27,6 +27,7 @@ #include "bpf_iter_test_kern5.skel.h" #include "bpf_iter_test_kern6.skel.h" #include "bpf_iter_bpf_link.skel.h" +#include "bpf_iter_ksym.skel.h" static int duration; @@ -1120,6 +1121,19 @@ static void test_link_iter(void) bpf_iter_bpf_link__destroy(skel); } +static void test_ksym_iter(void) +{ + struct bpf_iter_ksym *skel; + + skel = bpf_iter_ksym__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_ksym__open_and_load")) + return; + + do_dummy_read(skel->progs.dump_ksym); + + bpf_iter_ksym__destroy(skel); +} + #define CMP_BUFFER_SIZE 1024 static char task_vma_output[CMP_BUFFER_SIZE]; static char proc_maps_output[CMP_BUFFER_SIZE]; @@ -1267,4 +1281,6 @@ void test_bpf_iter(void) test_buf_neg_offset(); if (test__start_subtest("link-iter")) test_link_iter(); + if (test__start_subtest("ksym")) + test_ksym_iter(); } diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c index dd30b1e3a67c..7a74a1579076 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_nf.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_nf.c @@ -2,13 +2,29 @@ #include <test_progs.h> #include <network_helpers.h> #include "test_bpf_nf.skel.h" +#include "test_bpf_nf_fail.skel.h" + +static char log_buf[1024 * 1024]; + +struct { + const char *prog_name; + const char *err_msg; +} test_bpf_nf_fail_tests[] = { + { "alloc_release", "kernel function bpf_ct_release args#0 expected pointer to STRUCT nf_conn but" }, + { "insert_insert", "kernel function bpf_ct_insert_entry args#0 expected pointer to STRUCT nf_conn___init but" }, + { "lookup_insert", "kernel function bpf_ct_insert_entry args#0 expected pointer to STRUCT nf_conn___init but" }, + { "set_timeout_after_insert", "kernel function bpf_ct_set_timeout args#0 expected pointer to STRUCT nf_conn___init but" }, + { "set_status_after_insert", "kernel function bpf_ct_set_status args#0 expected pointer to STRUCT nf_conn___init but" }, + { "change_timeout_after_alloc", "kernel function bpf_ct_change_timeout args#0 expected pointer to STRUCT nf_conn but" }, + { "change_status_after_alloc", "kernel function bpf_ct_change_status args#0 expected pointer to STRUCT nf_conn but" }, +}; enum { TEST_XDP, TEST_TC_BPF, }; -void test_bpf_nf_ct(int mode) +static void test_bpf_nf_ct(int mode) { struct test_bpf_nf *skel; int prog_fd, err; @@ -39,14 +55,60 @@ void test_bpf_nf_ct(int mode) ASSERT_EQ(skel->bss->test_enonet_netns_id, -ENONET, "Test ENONET for bad but valid netns_id"); ASSERT_EQ(skel->bss->test_enoent_lookup, -ENOENT, "Test ENOENT for failed lookup"); ASSERT_EQ(skel->bss->test_eafnosupport, -EAFNOSUPPORT, "Test EAFNOSUPPORT for invalid len__tuple"); + ASSERT_EQ(skel->data->test_alloc_entry, 0, "Test for alloc new entry"); + ASSERT_EQ(skel->data->test_insert_entry, 0, "Test for insert new entry"); + ASSERT_EQ(skel->data->test_succ_lookup, 0, "Test for successful lookup"); + /* allow some tolerance for test_delta_timeout value to avoid races. */ + ASSERT_GT(skel->bss->test_delta_timeout, 8, "Test for min ct timeout update"); + ASSERT_LE(skel->bss->test_delta_timeout, 10, "Test for max ct timeout update"); + /* expected status is IPS_SEEN_REPLY */ + ASSERT_EQ(skel->bss->test_status, 2, "Test for ct status update "); end: test_bpf_nf__destroy(skel); } +static void test_bpf_nf_ct_fail(const char *prog_name, const char *err_msg) +{ + LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_buf = log_buf, + .kernel_log_size = sizeof(log_buf), + .kernel_log_level = 1); + struct test_bpf_nf_fail *skel; + struct bpf_program *prog; + int ret; + + skel = test_bpf_nf_fail__open_opts(&opts); + if (!ASSERT_OK_PTR(skel, "test_bpf_nf_fail__open")) + return; + + prog = bpf_object__find_program_by_name(skel->obj, prog_name); + if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name")) + goto end; + + bpf_program__set_autoload(prog, true); + + ret = test_bpf_nf_fail__load(skel); + if (!ASSERT_ERR(ret, "test_bpf_nf_fail__load must fail")) + goto end; + + if (!ASSERT_OK_PTR(strstr(log_buf, err_msg), "expected error message")) { + fprintf(stderr, "Expected: %s\n", err_msg); + fprintf(stderr, "Verifier: %s\n", log_buf); + } + +end: + test_bpf_nf_fail__destroy(skel); +} + void test_bpf_nf(void) { + int i; if (test__start_subtest("xdp-ct")) test_bpf_nf_ct(TEST_XDP); if (test__start_subtest("tc-bpf-ct")) test_bpf_nf_ct(TEST_TC_BPF); + for (i = 0; i < ARRAY_SIZE(test_bpf_nf_fail_tests); i++) { + if (test__start_subtest(test_bpf_nf_fail_tests[i].prog_name)) + test_bpf_nf_ct_fail(test_bpf_nf_fail_tests[i].prog_name, + test_bpf_nf_fail_tests[i].err_msg); + } } diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c index 941b0100bafa..ef6528b8084c 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf.c +++ b/tools/testing/selftests/bpf/prog_tests/btf.c @@ -5338,7 +5338,7 @@ static void do_test_pprint(int test_num) ret = snprintf(pin_path, sizeof(pin_path), "%s/%s", "/sys/fs/bpf", test->map_name); - if (CHECK(ret == sizeof(pin_path), "pin_path %s/%s is too long", + if (CHECK(ret >= sizeof(pin_path), "pin_path %s/%s is too long", "/sys/fs/bpf", test->map_name)) { err = -1; goto done; diff --git a/tools/testing/selftests/bpf/prog_tests/core_extern.c b/tools/testing/selftests/bpf/prog_tests/core_extern.c index 1931a158510e..63a51e9f3630 100644 --- a/tools/testing/selftests/bpf/prog_tests/core_extern.c +++ b/tools/testing/selftests/bpf/prog_tests/core_extern.c @@ -39,6 +39,7 @@ static struct test_case { "CONFIG_STR=\"abracad\"\n" "CONFIG_MISSING=0", .data = { + .unkn_virt_val = 0, .bpf_syscall = false, .tristate_val = TRI_MODULE, .bool_val = true, @@ -121,7 +122,7 @@ static struct test_case { void test_core_extern(void) { const uint32_t kern_ver = get_kernel_version(); - int err, duration = 0, i, j; + int err, i, j; struct test_core_extern *skel = NULL; uint64_t *got, *exp; int n = sizeof(*skel->data) / sizeof(uint64_t); @@ -136,19 +137,17 @@ void test_core_extern(void) continue; skel = test_core_extern__open_opts(&opts); - if (CHECK(!skel, "skel_open", "skeleton open failed\n")) + if (!ASSERT_OK_PTR(skel, "skel_open")) goto cleanup; err = test_core_extern__load(skel); if (t->fails) { - CHECK(!err, "skel_load", - "shouldn't succeed open/load of skeleton\n"); + ASSERT_ERR(err, "skel_load_should_fail"); goto cleanup; - } else if (CHECK(err, "skel_load", - "failed to open/load skeleton\n")) { + } else if (!ASSERT_OK(err, "skel_load")) { goto cleanup; } err = test_core_extern__attach(skel); - if (CHECK(err, "attach_raw_tp", "failed attach: %d\n", err)) + if (!ASSERT_OK(err, "attach_raw_tp")) goto cleanup; usleep(1); @@ -158,9 +157,7 @@ void test_core_extern(void) got = (uint64_t *)skel->data; exp = (uint64_t *)&t->data; for (j = 0; j < n; j++) { - CHECK(got[j] != exp[j], "check_res", - "result #%d: expected %llx, but got %llx\n", - j, (__u64)exp[j], (__u64)got[j]); + ASSERT_EQ(got[j], exp[j], "result"); } cleanup: test_core_extern__destroy(skel); diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c index 335917df0614..d457a55ff408 100644 --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c @@ -364,6 +364,8 @@ static int get_syms(char ***symsp, size_t *cntp) continue; if (!strncmp(name, "rcu_", 4)) continue; + if (!strcmp(name, "bpf_dispatcher_xdp_func")) + continue; if (!strncmp(name, "__ftrace_invalid_address__", sizeof("__ftrace_invalid_address__") - 1)) continue; diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c b/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c index eb5f7f5aa81a..1455911d9fcb 100644 --- a/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c +++ b/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c @@ -50,6 +50,13 @@ void test_ringbuf_multi(void) if (CHECK(!skel, "skel_open", "skeleton open failed\n")) return; + /* validate ringbuf size adjustment logic */ + ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), page_size, "rb1_size_before"); + ASSERT_OK(bpf_map__set_max_entries(skel->maps.ringbuf1, page_size + 1), "rb1_resize"); + ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), 2 * page_size, "rb1_size_after"); + ASSERT_OK(bpf_map__set_max_entries(skel->maps.ringbuf1, page_size), "rb1_reset"); + ASSERT_EQ(bpf_map__max_entries(skel->maps.ringbuf1), page_size, "rb1_size_final"); + proto_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, NULL, 0, 0, page_size, NULL); if (CHECK(proto_fd < 0, "bpf_map_create", "bpf_map_create failed\n")) goto cleanup; @@ -65,6 +72,10 @@ void test_ringbuf_multi(void) close(proto_fd); proto_fd = -1; + /* make sure we can't resize ringbuf after object load */ + if (!ASSERT_ERR(bpf_map__set_max_entries(skel->maps.ringbuf1, 3 * page_size), "rb1_resize_after_load")) + goto cleanup; + /* only trigger BPF program for current process */ skel->bss->pid = getpid(); diff --git a/tools/testing/selftests/bpf/prog_tests/skeleton.c b/tools/testing/selftests/bpf/prog_tests/skeleton.c index 180afd632f4c..99dac5292b41 100644 --- a/tools/testing/selftests/bpf/prog_tests/skeleton.c +++ b/tools/testing/selftests/bpf/prog_tests/skeleton.c @@ -122,6 +122,8 @@ void test_skeleton(void) ASSERT_EQ(skel->bss->out_mostly_var, 123, "out_mostly_var"); + ASSERT_EQ(bss->huge_arr[ARRAY_SIZE(bss->huge_arr) - 1], 123, "huge_arr"); + elf_bytes = test_skeleton__elf_bytes(&elf_bytes_sz); ASSERT_OK_PTR(elf_bytes, "elf_bytes"); ASSERT_GE(elf_bytes_sz, 0, "elf_bytes_sz"); diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h index 97ec8bc76ae6..e9846606690d 100644 --- a/tools/testing/selftests/bpf/progs/bpf_iter.h +++ b/tools/testing/selftests/bpf/progs/bpf_iter.h @@ -22,6 +22,7 @@ #define BTF_F_NONAME BTF_F_NONAME___not_used #define BTF_F_PTR_RAW BTF_F_PTR_RAW___not_used #define BTF_F_ZERO BTF_F_ZERO___not_used +#define bpf_iter__ksym bpf_iter__ksym___not_used #include "vmlinux.h" #undef bpf_iter_meta #undef bpf_iter__bpf_map @@ -44,6 +45,7 @@ #undef BTF_F_NONAME #undef BTF_F_PTR_RAW #undef BTF_F_ZERO +#undef bpf_iter__ksym struct bpf_iter_meta { struct seq_file *seq; @@ -151,3 +153,8 @@ enum { BTF_F_PTR_RAW = (1ULL << 2), BTF_F_ZERO = (1ULL << 3), }; + +struct bpf_iter__ksym { + struct bpf_iter_meta *meta; + struct kallsym_iter *ksym; +}; diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c b/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c new file mode 100644 index 000000000000..285c008cbf9c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_ksym.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022, Oracle and/or its affiliates. */ +#include "bpf_iter.h" +#include <bpf/bpf_helpers.h> + +char _license[] SEC("license") = "GPL"; + +unsigned long last_sym_value = 0; + +static inline char tolower(char c) +{ + if (c >= 'A' && c <= 'Z') + c += ('a' - 'A'); + return c; +} + +static inline char toupper(char c) +{ + if (c >= 'a' && c <= 'z') + c -= ('a' - 'A'); + return c; +} + +/* Dump symbols with max size; the latter is calculated by caching symbol N value + * and when iterating on symbol N+1, we can print max size of symbol N via + * address of N+1 - address of N. + */ +SEC("iter/ksym") +int dump_ksym(struct bpf_iter__ksym *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct kallsym_iter *iter = ctx->ksym; + __u32 seq_num = ctx->meta->seq_num; + unsigned long value; + char type; + int ret; + + if (!iter) + return 0; + + if (seq_num == 0) { + BPF_SEQ_PRINTF(seq, "ADDR TYPE NAME MODULE_NAME KIND MAX_SIZE\n"); + return 0; + } + if (last_sym_value) + BPF_SEQ_PRINTF(seq, "0x%x\n", iter->value - last_sym_value); + else + BPF_SEQ_PRINTF(seq, "\n"); + + value = iter->show_value ? iter->value : 0; + + last_sym_value = value; + + type = iter->type; + + if (iter->module_name[0]) { + type = iter->exported ? toupper(type) : tolower(type); + BPF_SEQ_PRINTF(seq, "0x%llx %c %s [ %s ] ", + value, type, iter->name, iter->module_name); + } else { + BPF_SEQ_PRINTF(seq, "0x%llx %c %s ", value, type, iter->name); + } + if (!iter->pos_arch_end || iter->pos_arch_end > iter->pos) + BPF_SEQ_PRINTF(seq, "CORE "); + else if (!iter->pos_mod_end || iter->pos_mod_end > iter->pos) + BPF_SEQ_PRINTF(seq, "MOD "); + else if (!iter->pos_ftrace_mod_end || iter->pos_ftrace_mod_end > iter->pos) + BPF_SEQ_PRINTF(seq, "FTRACE_MOD "); + else if (!iter->pos_bpf_end || iter->pos_bpf_end > iter->pos) + BPF_SEQ_PRINTF(seq, "BPF "); + else + BPF_SEQ_PRINTF(seq, "KPROBE "); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/bpf_syscall_macro.c b/tools/testing/selftests/bpf/progs/bpf_syscall_macro.c index 05838ed9b89c..e1e11897e99b 100644 --- a/tools/testing/selftests/bpf/progs/bpf_syscall_macro.c +++ b/tools/testing/selftests/bpf/progs/bpf_syscall_macro.c @@ -64,9 +64,9 @@ int BPF_KPROBE(handle_sys_prctl) return 0; } -SEC("kprobe/" SYS_PREFIX "sys_prctl") -int BPF_KPROBE_SYSCALL(prctl_enter, int option, unsigned long arg2, - unsigned long arg3, unsigned long arg4, unsigned long arg5) +SEC("ksyscall/prctl") +int BPF_KSYSCALL(prctl_enter, int option, unsigned long arg2, + unsigned long arg3, unsigned long arg4, unsigned long arg5) { pid_t pid = bpf_get_current_pid_tgid() >> 32; diff --git a/tools/testing/selftests/bpf/progs/test_attach_probe.c b/tools/testing/selftests/bpf/progs/test_attach_probe.c index f1c88ad368ef..a1e45fec8938 100644 --- a/tools/testing/selftests/bpf/progs/test_attach_probe.c +++ b/tools/testing/selftests/bpf/progs/test_attach_probe.c @@ -1,11 +1,10 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2017 Facebook -#include <linux/ptrace.h> -#include <linux/bpf.h> +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -#include <stdbool.h> +#include <bpf/bpf_core_read.h> #include "bpf_misc.h" int kprobe_res = 0; @@ -31,8 +30,8 @@ int handle_kprobe(struct pt_regs *ctx) return 0; } -SEC("kprobe/" SYS_PREFIX "sys_nanosleep") -int BPF_KPROBE(handle_kprobe_auto) +SEC("ksyscall/nanosleep") +int BPF_KSYSCALL(handle_kprobe_auto, struct __kernel_timespec *req, struct __kernel_timespec *rem) { kprobe2_res = 11; return 0; @@ -56,11 +55,11 @@ int handle_kretprobe(struct pt_regs *ctx) return 0; } -SEC("kretprobe/" SYS_PREFIX "sys_nanosleep") -int BPF_KRETPROBE(handle_kretprobe_auto) +SEC("kretsyscall/nanosleep") +int BPF_KRETPROBE(handle_kretprobe_auto, int ret) { kretprobe2_res = 22; - return 0; + return ret; } SEC("uprobe") diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf.c b/tools/testing/selftests/bpf/progs/test_bpf_nf.c index f00a9731930e..196cd8dfe42a 100644 --- a/tools/testing/selftests/bpf/progs/test_bpf_nf.c +++ b/tools/testing/selftests/bpf/progs/test_bpf_nf.c @@ -8,6 +8,8 @@ #define EINVAL 22 #define ENOENT 2 +extern unsigned long CONFIG_HZ __kconfig; + int test_einval_bpf_tuple = 0; int test_einval_reserved = 0; int test_einval_netns_id = 0; @@ -16,6 +18,11 @@ int test_eproto_l4proto = 0; int test_enonet_netns_id = 0; int test_enoent_lookup = 0; int test_eafnosupport = 0; +int test_alloc_entry = -EINVAL; +int test_insert_entry = -EAFNOSUPPORT; +int test_succ_lookup = -ENOENT; +u32 test_delta_timeout = 0; +u32 test_status = 0; struct nf_conn; @@ -26,31 +33,44 @@ struct bpf_ct_opts___local { u8 reserved[3]; } __attribute__((preserve_access_index)); +struct nf_conn *bpf_xdp_ct_alloc(struct xdp_md *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32) __ksym; struct nf_conn *bpf_xdp_ct_lookup(struct xdp_md *, struct bpf_sock_tuple *, u32, struct bpf_ct_opts___local *, u32) __ksym; +struct nf_conn *bpf_skb_ct_alloc(struct __sk_buff *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32) __ksym; struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *, struct bpf_sock_tuple *, u32, struct bpf_ct_opts___local *, u32) __ksym; +struct nf_conn *bpf_ct_insert_entry(struct nf_conn *) __ksym; void bpf_ct_release(struct nf_conn *) __ksym; +void bpf_ct_set_timeout(struct nf_conn *, u32) __ksym; +int bpf_ct_change_timeout(struct nf_conn *, u32) __ksym; +int bpf_ct_set_status(struct nf_conn *, u32) __ksym; +int bpf_ct_change_status(struct nf_conn *, u32) __ksym; static __always_inline void -nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32, - struct bpf_ct_opts___local *, u32), +nf_ct_test(struct nf_conn *(*lookup_fn)(void *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32), + struct nf_conn *(*alloc_fn)(void *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32), void *ctx) { struct bpf_ct_opts___local opts_def = { .l4proto = IPPROTO_TCP, .netns_id = -1 }; struct bpf_sock_tuple bpf_tuple; struct nf_conn *ct; + int err; __builtin_memset(&bpf_tuple, 0, sizeof(bpf_tuple.ipv4)); - ct = func(ctx, NULL, 0, &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, NULL, 0, &opts_def, sizeof(opts_def)); if (ct) bpf_ct_release(ct); else test_einval_bpf_tuple = opts_def.error; opts_def.reserved[0] = 1; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); opts_def.reserved[0] = 0; opts_def.l4proto = IPPROTO_TCP; if (ct) @@ -59,21 +79,24 @@ nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32, test_einval_reserved = opts_def.error; opts_def.netns_id = -2; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); opts_def.netns_id = -1; if (ct) bpf_ct_release(ct); else test_einval_netns_id = opts_def.error; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def) - 1); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def) - 1); if (ct) bpf_ct_release(ct); else test_einval_len_opts = opts_def.error; opts_def.l4proto = IPPROTO_ICMP; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); opts_def.l4proto = IPPROTO_TCP; if (ct) bpf_ct_release(ct); @@ -81,37 +104,75 @@ nf_ct_test(struct nf_conn *(*func)(void *, struct bpf_sock_tuple *, u32, test_eproto_l4proto = opts_def.error; opts_def.netns_id = 0xf00f; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); opts_def.netns_id = -1; if (ct) bpf_ct_release(ct); else test_enonet_netns_id = opts_def.error; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); if (ct) bpf_ct_release(ct); else test_enoent_lookup = opts_def.error; - ct = func(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4) - 1, &opts_def, sizeof(opts_def)); + ct = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4) - 1, &opts_def, + sizeof(opts_def)); if (ct) bpf_ct_release(ct); else test_eafnosupport = opts_def.error; + + bpf_tuple.ipv4.saddr = bpf_get_prandom_u32(); /* src IP */ + bpf_tuple.ipv4.daddr = bpf_get_prandom_u32(); /* dst IP */ + bpf_tuple.ipv4.sport = bpf_get_prandom_u32(); /* src port */ + bpf_tuple.ipv4.dport = bpf_get_prandom_u32(); /* dst port */ + + ct = alloc_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), &opts_def, + sizeof(opts_def)); + if (ct) { + struct nf_conn *ct_ins; + + bpf_ct_set_timeout(ct, 10000); + bpf_ct_set_status(ct, IPS_CONFIRMED); + + ct_ins = bpf_ct_insert_entry(ct); + if (ct_ins) { + struct nf_conn *ct_lk; + + ct_lk = lookup_fn(ctx, &bpf_tuple, sizeof(bpf_tuple.ipv4), + &opts_def, sizeof(opts_def)); + if (ct_lk) { + /* update ct entry timeout */ + bpf_ct_change_timeout(ct_lk, 10000); + test_delta_timeout = ct_lk->timeout - bpf_jiffies64(); + test_delta_timeout /= CONFIG_HZ; + test_status = IPS_SEEN_REPLY; + bpf_ct_change_status(ct_lk, IPS_SEEN_REPLY); + bpf_ct_release(ct_lk); + test_succ_lookup = 0; + } + bpf_ct_release(ct_ins); + test_insert_entry = 0; + } + test_alloc_entry = 0; + } } SEC("xdp") int nf_xdp_ct_test(struct xdp_md *ctx) { - nf_ct_test((void *)bpf_xdp_ct_lookup, ctx); + nf_ct_test((void *)bpf_xdp_ct_lookup, (void *)bpf_xdp_ct_alloc, ctx); return 0; } SEC("tc") int nf_skb_ct_test(struct __sk_buff *ctx) { - nf_ct_test((void *)bpf_skb_ct_lookup, ctx); + nf_ct_test((void *)bpf_skb_ct_lookup, (void *)bpf_skb_ct_alloc, ctx); return 0; } diff --git a/tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c b/tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c new file mode 100644 index 000000000000..bf79af15c808 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_bpf_nf_fail.c @@ -0,0 +1,134 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> + +struct nf_conn; + +struct bpf_ct_opts___local { + s32 netns_id; + s32 error; + u8 l4proto; + u8 reserved[3]; +} __attribute__((preserve_access_index)); + +struct nf_conn *bpf_skb_ct_alloc(struct __sk_buff *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32) __ksym; +struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *, struct bpf_sock_tuple *, u32, + struct bpf_ct_opts___local *, u32) __ksym; +struct nf_conn *bpf_ct_insert_entry(struct nf_conn *) __ksym; +void bpf_ct_release(struct nf_conn *) __ksym; +void bpf_ct_set_timeout(struct nf_conn *, u32) __ksym; +int bpf_ct_change_timeout(struct nf_conn *, u32) __ksym; +int bpf_ct_set_status(struct nf_conn *, u32) __ksym; +int bpf_ct_change_status(struct nf_conn *, u32) __ksym; + +SEC("?tc") +int alloc_release(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + bpf_ct_release(ct); + return 0; +} + +SEC("?tc") +int insert_insert(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + ct = bpf_ct_insert_entry(ct); + if (!ct) + return 0; + ct = bpf_ct_insert_entry(ct); + return 0; +} + +SEC("?tc") +int lookup_insert(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_lookup(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + bpf_ct_insert_entry(ct); + return 0; +} + +SEC("?tc") +int set_timeout_after_insert(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + ct = bpf_ct_insert_entry(ct); + if (!ct) + return 0; + bpf_ct_set_timeout(ct, 0); + return 0; +} + +SEC("?tc") +int set_status_after_insert(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + ct = bpf_ct_insert_entry(ct); + if (!ct) + return 0; + bpf_ct_set_status(ct, 0); + return 0; +} + +SEC("?tc") +int change_timeout_after_alloc(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + bpf_ct_change_timeout(ct, 0); + return 0; +} + +SEC("?tc") +int change_status_after_alloc(struct __sk_buff *ctx) +{ + struct bpf_ct_opts___local opts = {}; + struct bpf_sock_tuple tup = {}; + struct nf_conn *ct; + + ct = bpf_skb_ct_alloc(ctx, &tup, sizeof(tup.ipv4), &opts, sizeof(opts)); + if (!ct) + return 0; + bpf_ct_change_status(ct, 0); + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_core_extern.c b/tools/testing/selftests/bpf/progs/test_core_extern.c index 3ac3603ad53d..a3c7c1042f35 100644 --- a/tools/testing/selftests/bpf/progs/test_core_extern.c +++ b/tools/testing/selftests/bpf/progs/test_core_extern.c @@ -11,6 +11,7 @@ static int (*bpf_missing_helper)(const void *arg1, int arg2) = (void *) 999; extern int LINUX_KERNEL_VERSION __kconfig; +extern int LINUX_UNKNOWN_VIRTUAL_EXTERN __kconfig __weak; extern bool CONFIG_BPF_SYSCALL __kconfig; /* strong */ extern enum libbpf_tristate CONFIG_TRISTATE __kconfig __weak; extern bool CONFIG_BOOL __kconfig __weak; @@ -22,6 +23,7 @@ extern const char CONFIG_STR[8] __kconfig __weak; extern uint64_t CONFIG_MISSING __kconfig __weak; uint64_t kern_ver = -1; +uint64_t unkn_virt_val = -1; uint64_t bpf_syscall = -1; uint64_t tristate_val = -1; uint64_t bool_val = -1; @@ -38,6 +40,7 @@ int handle_sys_enter(struct pt_regs *ctx) int i; kern_ver = LINUX_KERNEL_VERSION; + unkn_virt_val = LINUX_UNKNOWN_VIRTUAL_EXTERN; bpf_syscall = CONFIG_BPF_SYSCALL; tristate_val = CONFIG_TRISTATE; bool_val = CONFIG_BOOL; diff --git a/tools/testing/selftests/bpf/progs/test_probe_user.c b/tools/testing/selftests/bpf/progs/test_probe_user.c index 702578a5e496..8e1495008e4d 100644 --- a/tools/testing/selftests/bpf/progs/test_probe_user.c +++ b/tools/testing/selftests/bpf/progs/test_probe_user.c @@ -1,35 +1,20 @@ // SPDX-License-Identifier: GPL-2.0 - -#include <linux/ptrace.h> -#include <linux/bpf.h> - -#include <netinet/in.h> - +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> #include "bpf_misc.h" static struct sockaddr_in old; -SEC("kprobe/" SYS_PREFIX "sys_connect") -int BPF_KPROBE(handle_sys_connect) +SEC("ksyscall/connect") +int BPF_KSYSCALL(handle_sys_connect, int fd, struct sockaddr_in *uservaddr, int addrlen) { -#if SYSCALL_WRAPPER == 1 - struct pt_regs *real_regs; -#endif struct sockaddr_in new; - void *ptr; - -#if SYSCALL_WRAPPER == 0 - ptr = (void *)PT_REGS_PARM2(ctx); -#else - real_regs = (struct pt_regs *)PT_REGS_PARM1(ctx); - bpf_probe_read_kernel(&ptr, sizeof(ptr), &PT_REGS_PARM2(real_regs)); -#endif - bpf_probe_read_user(&old, sizeof(old), ptr); + bpf_probe_read_user(&old, sizeof(old), uservaddr); __builtin_memset(&new, 0xab, sizeof(new)); - bpf_probe_write_user(ptr, &new, sizeof(new)); + bpf_probe_write_user(uservaddr, &new, sizeof(new)); return 0; } diff --git a/tools/testing/selftests/bpf/progs/test_skeleton.c b/tools/testing/selftests/bpf/progs/test_skeleton.c index 1b1187d2967b..1a4e93f6d9df 100644 --- a/tools/testing/selftests/bpf/progs/test_skeleton.c +++ b/tools/testing/selftests/bpf/progs/test_skeleton.c @@ -51,6 +51,8 @@ int out_dynarr[4] SEC(".data.dyn") = { 1, 2, 3, 4 }; int read_mostly_var __read_mostly; int out_mostly_var; +char huge_arr[16 * 1024 * 1024]; + SEC("raw_tp/sys_enter") int handler(const void *ctx) { @@ -71,6 +73,8 @@ int handler(const void *ctx) out_mostly_var = read_mostly_var; + huge_arr[sizeof(huge_arr) - 1] = 123; + return 0; } diff --git a/tools/testing/selftests/bpf/progs/test_xdp_noinline.c b/tools/testing/selftests/bpf/progs/test_xdp_noinline.c index 125d872d7981..ba48fcb98ab2 100644 --- a/tools/testing/selftests/bpf/progs/test_xdp_noinline.c +++ b/tools/testing/selftests/bpf/progs/test_xdp_noinline.c @@ -239,7 +239,7 @@ bool parse_udp(void *data, void *data_end, udp = data + off; if (udp + 1 > data_end) - return 0; + return false; if (!is_icmp) { pckt->flow.port16[0] = udp->source; pckt->flow.port16[1] = udp->dest; @@ -247,7 +247,7 @@ bool parse_udp(void *data, void *data_end, pckt->flow.port16[0] = udp->dest; pckt->flow.port16[1] = udp->source; } - return 1; + return true; } static __attribute__ ((noinline)) @@ -261,7 +261,7 @@ bool parse_tcp(void *data, void *data_end, tcp = data + off; if (tcp + 1 > data_end) - return 0; + return false; if (tcp->syn) pckt->flags |= (1 << 1); if (!is_icmp) { @@ -271,7 +271,7 @@ bool parse_tcp(void *data, void *data_end, pckt->flow.port16[0] = tcp->dest; pckt->flow.port16[1] = tcp->source; } - return 1; + return true; } static __attribute__ ((noinline)) @@ -287,7 +287,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval, void *data; if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr))) - return 0; + return false; data = (void *)(long)xdp->data; data_end = (void *)(long)xdp->data_end; new_eth = data; @@ -295,7 +295,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval, old_eth = data + sizeof(struct ipv6hdr); if (new_eth + 1 > data_end || old_eth + 1 > data_end || ip6h + 1 > data_end) - return 0; + return false; memcpy(new_eth->eth_dest, cval->mac, 6); memcpy(new_eth->eth_source, old_eth->eth_dest, 6); new_eth->eth_proto = 56710; @@ -314,7 +314,7 @@ bool encap_v6(struct xdp_md *xdp, struct ctl_value *cval, ip6h->saddr.in6_u.u6_addr32[2] = 3; ip6h->saddr.in6_u.u6_addr32[3] = ip_suffix; memcpy(ip6h->daddr.in6_u.u6_addr32, dst->dstv6, 16); - return 1; + return true; } static __attribute__ ((noinline)) @@ -335,7 +335,7 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval, ip_suffix <<= 15; ip_suffix ^= pckt->flow.src; if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr))) - return 0; + return false; data = (void *)(long)xdp->data; data_end = (void *)(long)xdp->data_end; new_eth = data; @@ -343,7 +343,7 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval, old_eth = data + sizeof(struct iphdr); if (new_eth + 1 > data_end || old_eth + 1 > data_end || iph + 1 > data_end) - return 0; + return false; memcpy(new_eth->eth_dest, cval->mac, 6); memcpy(new_eth->eth_source, old_eth->eth_dest, 6); new_eth->eth_proto = 8; @@ -367,8 +367,8 @@ bool encap_v4(struct xdp_md *xdp, struct ctl_value *cval, csum += *next_iph_u16++; iph->check = ~((csum & 0xffff) + (csum >> 16)); if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct iphdr))) - return 0; - return 1; + return false; + return true; } static __attribute__ ((noinline)) @@ -386,10 +386,10 @@ bool decap_v6(struct xdp_md *xdp, void **data, void **data_end, bool inner_v4) else new_eth->eth_proto = 56710; if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct ipv6hdr))) - return 0; + return false; *data = (void *)(long)xdp->data; *data_end = (void *)(long)xdp->data_end; - return 1; + return true; } static __attribute__ ((noinline)) @@ -404,10 +404,10 @@ bool decap_v4(struct xdp_md *xdp, void **data, void **data_end) memcpy(new_eth->eth_dest, old_eth->eth_dest, 6); new_eth->eth_proto = 8; if (bpf_xdp_adjust_head(xdp, (int)sizeof(struct iphdr))) - return 0; + return false; *data = (void *)(long)xdp->data; *data_end = (void *)(long)xdp->data_end; - return 1; + return true; } static __attribute__ ((noinline)) diff --git a/tools/testing/selftests/bpf/test_xdp_veth.sh b/tools/testing/selftests/bpf/test_xdp_veth.sh index 392d28cc4e58..49936c4c8567 100755 --- a/tools/testing/selftests/bpf/test_xdp_veth.sh +++ b/tools/testing/selftests/bpf/test_xdp_veth.sh @@ -106,9 +106,9 @@ bpftool prog loadall \ bpftool map update pinned $BPF_DIR/maps/tx_port key 0 0 0 0 value 122 0 0 0 bpftool map update pinned $BPF_DIR/maps/tx_port key 1 0 0 0 value 133 0 0 0 bpftool map update pinned $BPF_DIR/maps/tx_port key 2 0 0 0 value 111 0 0 0 -ip link set dev veth1 xdp pinned $BPF_DIR/progs/redirect_map_0 -ip link set dev veth2 xdp pinned $BPF_DIR/progs/redirect_map_1 -ip link set dev veth3 xdp pinned $BPF_DIR/progs/redirect_map_2 +ip link set dev veth1 xdp pinned $BPF_DIR/progs/xdp_redirect_map_0 +ip link set dev veth2 xdp pinned $BPF_DIR/progs/xdp_redirect_map_1 +ip link set dev veth3 xdp pinned $BPF_DIR/progs/xdp_redirect_map_2 ip -n ${NS1} link set dev veth11 xdp obj xdp_dummy.o sec xdp ip -n ${NS2} link set dev veth22 xdp obj xdp_tx.o sec xdp diff --git a/tools/testing/selftests/bpf/verifier/bpf_loop_inline.c b/tools/testing/selftests/bpf/verifier/bpf_loop_inline.c index 2d0023659d88..a535d41dc20d 100644 --- a/tools/testing/selftests/bpf/verifier/bpf_loop_inline.c +++ b/tools/testing/selftests/bpf/verifier/bpf_loop_inline.c @@ -251,6 +251,7 @@ .expected_insns = { PSEUDO_CALL_INSN() }, .unexpected_insns = { HELPER_CALL_INSN() }, .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_TRACEPOINT, .func_info = { { 0, MAIN_TYPE }, { 16, CALLBACK_TYPE } }, .func_info_cnt = 2, BTF_TYPES diff --git a/tools/testing/selftests/bpf/verifier/calls.c b/tools/testing/selftests/bpf/verifier/calls.c index 743ed34c1238..3fb4f69b1962 100644 --- a/tools/testing/selftests/bpf/verifier/calls.c +++ b/tools/testing/selftests/bpf/verifier/calls.c @@ -219,6 +219,59 @@ .errstr = "variable ptr_ access var_off=(0x0; 0x7) disallowed", }, { + "calls: invalid kfunc call: referenced arg needs refcounted PTR_TO_BTF_ID", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 16), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .fixup_kfunc_btf_id = { + { "bpf_kfunc_call_test_acquire", 3 }, + { "bpf_kfunc_call_test_ref", 8 }, + { "bpf_kfunc_call_test_ref", 10 }, + }, + .result_unpriv = REJECT, + .result = REJECT, + .errstr = "R1 must be referenced", +}, +{ + "calls: valid kfunc call: referenced arg needs refcounted PTR_TO_BTF_ID", + .insns = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_ST_MEM(BPF_DW, BPF_REG_1, 0, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_REG(BPF_REG_6, BPF_REG_0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_KFUNC_CALL, 0, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .fixup_kfunc_btf_id = { + { "bpf_kfunc_call_test_acquire", 3 }, + { "bpf_kfunc_call_test_ref", 8 }, + { "bpf_kfunc_call_test_release", 10 }, + }, + .result_unpriv = REJECT, + .result = ACCEPT, +}, +{ "calls: basic sanity", .insns = { BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 2), diff --git a/tools/testing/selftests/net/ipv6_flowlabel.c b/tools/testing/selftests/net/ipv6_flowlabel.c index a7c41375374f..708a9822259d 100644 --- a/tools/testing/selftests/net/ipv6_flowlabel.c +++ b/tools/testing/selftests/net/ipv6_flowlabel.c @@ -9,6 +9,7 @@ #include <errno.h> #include <fcntl.h> #include <limits.h> +#include <linux/icmpv6.h> #include <linux/in6.h> #include <stdbool.h> #include <stdio.h> @@ -29,26 +30,48 @@ #ifndef IPV6_FLOWLABEL_MGR #define IPV6_FLOWLABEL_MGR 32 #endif +#ifndef IPV6_FLOWINFO_SEND +#define IPV6_FLOWINFO_SEND 33 +#endif #define FLOWLABEL_WILDCARD ((uint32_t) -1) static const char cfg_data[] = "a"; static uint32_t cfg_label = 1; +static bool use_ping; +static bool use_flowinfo_send; + +static struct icmp6hdr icmp6 = { + .icmp6_type = ICMPV6_ECHO_REQUEST +}; + +static struct sockaddr_in6 addr = { + .sin6_family = AF_INET6, + .sin6_addr = IN6ADDR_LOOPBACK_INIT, +}; static void do_send(int fd, bool with_flowlabel, uint32_t flowlabel) { char control[CMSG_SPACE(sizeof(flowlabel))] = {0}; struct msghdr msg = {0}; - struct iovec iov = {0}; + struct iovec iov = { + .iov_base = (char *)cfg_data, + .iov_len = sizeof(cfg_data) + }; int ret; - iov.iov_base = (char *)cfg_data; - iov.iov_len = sizeof(cfg_data); + if (use_ping) { + iov.iov_base = &icmp6; + iov.iov_len = sizeof(icmp6); + } msg.msg_iov = &iov; msg.msg_iovlen = 1; - if (with_flowlabel) { + if (use_flowinfo_send) { + msg.msg_name = &addr; + msg.msg_namelen = sizeof(addr); + } else if (with_flowlabel) { struct cmsghdr *cm; cm = (void *)control; @@ -94,6 +117,8 @@ static void do_recv(int fd, bool with_flowlabel, uint32_t expect) ret = recvmsg(fd, &msg, 0); if (ret == -1) error(1, errno, "recv"); + if (use_ping) + goto parse_cmsg; if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) error(1, 0, "recv: truncated"); if (ret != sizeof(cfg_data)) @@ -101,6 +126,7 @@ static void do_recv(int fd, bool with_flowlabel, uint32_t expect) if (memcmp(data, cfg_data, sizeof(data))) error(1, 0, "recv: data mismatch"); +parse_cmsg: cm = CMSG_FIRSTHDR(&msg); if (with_flowlabel) { if (!cm) @@ -114,9 +140,11 @@ static void do_recv(int fd, bool with_flowlabel, uint32_t expect) flowlabel = ntohl(*(uint32_t *)CMSG_DATA(cm)); fprintf(stderr, "recv with label %u\n", flowlabel); - if (expect != FLOWLABEL_WILDCARD && expect != flowlabel) + if (expect != FLOWLABEL_WILDCARD && expect != flowlabel) { fprintf(stderr, "recv: incorrect flowlabel %u != %u\n", flowlabel, expect); + error(1, 0, "recv: flowlabel is wrong"); + } } else { fprintf(stderr, "recv without label\n"); @@ -165,11 +193,17 @@ static void parse_opts(int argc, char **argv) { int c; - while ((c = getopt(argc, argv, "l:")) != -1) { + while ((c = getopt(argc, argv, "l:ps")) != -1) { switch (c) { case 'l': cfg_label = strtoul(optarg, NULL, 0); break; + case 'p': + use_ping = true; + break; + case 's': + use_flowinfo_send = true; + break; default: error(1, 0, "%s: parse error", argv[0]); } @@ -178,27 +212,30 @@ static void parse_opts(int argc, char **argv) int main(int argc, char **argv) { - struct sockaddr_in6 addr = { - .sin6_family = AF_INET6, - .sin6_port = htons(8000), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; const int one = 1; int fdt, fdr; + int prot = 0; + + addr.sin6_port = htons(8000); parse_opts(argc, argv); - fdt = socket(PF_INET6, SOCK_DGRAM, 0); + if (use_ping) { + fprintf(stderr, "attempting to use ping sockets\n"); + prot = IPPROTO_ICMPV6; + } + + fdt = socket(PF_INET6, SOCK_DGRAM, prot); if (fdt == -1) error(1, errno, "socket t"); - fdr = socket(PF_INET6, SOCK_DGRAM, 0); + fdr = use_ping ? fdt : socket(PF_INET6, SOCK_DGRAM, 0); if (fdr == -1) error(1, errno, "socket r"); if (connect(fdt, (void *)&addr, sizeof(addr))) error(1, errno, "connect"); - if (bind(fdr, (void *)&addr, sizeof(addr))) + if (!use_ping && bind(fdr, (void *)&addr, sizeof(addr))) error(1, errno, "bind"); flowlabel_get(fdt, cfg_label, IPV6_FL_S_EXCL, IPV6_FL_F_CREATE); @@ -216,13 +253,21 @@ int main(int argc, char **argv) do_recv(fdr, false, 0); } + if (use_flowinfo_send) { + fprintf(stderr, "using IPV6_FLOWINFO_SEND to send label\n"); + addr.sin6_flowinfo = htonl(cfg_label); + if (setsockopt(fdt, SOL_IPV6, IPV6_FLOWINFO_SEND, &one, + sizeof(one)) == -1) + error(1, errno, "setsockopt flowinfo_send"); + } + fprintf(stderr, "send label\n"); do_send(fdt, true, cfg_label); do_recv(fdr, true, cfg_label); if (close(fdr)) error(1, errno, "close r"); - if (close(fdt)) + if (!use_ping && close(fdt)) error(1, errno, "close t"); return 0; diff --git a/tools/testing/selftests/net/ipv6_flowlabel.sh b/tools/testing/selftests/net/ipv6_flowlabel.sh index d3bc6442704e..cee95e252bee 100755 --- a/tools/testing/selftests/net/ipv6_flowlabel.sh +++ b/tools/testing/selftests/net/ipv6_flowlabel.sh @@ -18,4 +18,20 @@ echo "TEST datapath (with auto-flowlabels)" ./in_netns.sh \ sh -c 'sysctl -q -w net.ipv6.auto_flowlabels=1 && ./ipv6_flowlabel -l 1' +echo "TEST datapath (with ping-sockets)" +./in_netns.sh \ + sh -c 'sysctl -q -w net.ipv6.flowlabel_reflect=4 && \ + sysctl -q -w net.ipv4.ping_group_range="0 2147483647" && \ + ./ipv6_flowlabel -l 1 -p' + +echo "TEST datapath (with flowinfo-send)" +./in_netns.sh \ + sh -c './ipv6_flowlabel -l 1 -s' + +echo "TEST datapath (with ping-sockets flowinfo-send)" +./in_netns.sh \ + sh -c 'sysctl -q -w net.ipv6.flowlabel_reflect=4 && \ + sysctl -q -w net.ipv4.ping_group_range="0 2147483647" && \ + ./ipv6_flowlabel -l 1 -p -s' + echo OK. All tests passed diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c index dc26aae0feb0..4ecbac197c46 100644 --- a/tools/testing/selftests/net/tls.c +++ b/tools/testing/selftests/net/tls.c @@ -1597,6 +1597,38 @@ TEST_F(tls_err, bad_cmsg) EXPECT_EQ(errno, EBADMSG); } +TEST_F(tls_err, timeo) +{ + struct timeval tv = { .tv_usec = 10000, }; + char buf[128]; + int ret; + + if (self->notls) + SKIP(return, "no TLS support"); + + ret = setsockopt(self->cfd2, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + ASSERT_EQ(ret, 0); + + ret = fork(); + ASSERT_GE(ret, 0); + + if (ret) { + usleep(1000); /* Give child a head start */ + + EXPECT_EQ(recv(self->cfd2, buf, sizeof(buf), 0), -1); + EXPECT_EQ(errno, EAGAIN); + + EXPECT_EQ(recv(self->cfd2, buf, sizeof(buf), 0), -1); + EXPECT_EQ(errno, EAGAIN); + + wait(&ret); + } else { + EXPECT_EQ(recv(self->cfd2, buf, sizeof(buf), 0), -1); + EXPECT_EQ(errno, EAGAIN); + exit(0); + } +} + TEST(non_established) { struct tls12_crypto_info_aes_gcm_256 tls12; struct sockaddr_in addr; |