diff options
57 files changed, 3594 insertions, 288 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 10453c627135..cb402c59eca5 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -85,8 +85,33 @@ Q: Can loops be supported in a safe way? A: It's not clear yet. BPF developers are trying to find a way to -support bounded loops where the verifier can guarantee that -the program terminates in less than 4096 instructions. +support bounded loops. + +Q: What are the verifier limits? +-------------------------------- +A: The only limit known to the user space is BPF_MAXINSNS (4096). +It's the maximum number of instructions that the unprivileged bpf +program can have. The verifier has various internal limits. +Like the maximum number of instructions that can be explored during +program analysis. Currently, that limit is set to 1 million. +Which essentially means that the largest program can consist +of 1 million NOP instructions. There is a limit to the maximum number +of subsequent branches, a limit to the number of nested bpf-to-bpf +calls, a limit to the number of the verifier states per instruction, +a limit to the number of maps used by the program. +All these limits can be hit with a sufficiently complex program. +There are also non-numerical limits that can cause the program +to be rejected. The verifier used to recognize only pointer + constant +expressions. Now it can recognize pointer + bounded_register. +bpf_lookup_map_elem(key) had a requirement that 'key' must be +a pointer to the stack. Now, 'key' can be a pointer to map value. +The verifier is steadily getting 'smarter'. The limits are +being removed. The only way to know that the program is going to +be accepted by the verifier is to try to load it. +The bpf development process guarantees that the future kernel +versions will accept all bpf programs that were accepted by +the earlier versions. + Instruction level questions --------------------------- diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 4e77932959cc..d3fe4cac0c90 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -36,6 +36,16 @@ Two sets of Questions and Answers (Q&A) are maintained. bpf_devel_QA +Program types +============= + +.. toctree:: + :maxdepth: 1 + + prog_cgroup_sysctl + prog_flow_dissector + + .. Links: .. _Documentation/networking/filter.txt: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ diff --git a/Documentation/bpf/prog_cgroup_sysctl.rst b/Documentation/bpf/prog_cgroup_sysctl.rst new file mode 100644 index 000000000000..677d6c637cf3 --- /dev/null +++ b/Documentation/bpf/prog_cgroup_sysctl.rst @@ -0,0 +1,125 @@ +.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) + +=========================== +BPF_PROG_TYPE_CGROUP_SYSCTL +=========================== + +This document describes ``BPF_PROG_TYPE_CGROUP_SYSCTL`` program type that +provides cgroup-bpf hook for sysctl. + +The hook has to be attached to a cgroup and will be called every time a +process inside that cgroup tries to read from or write to sysctl knob in proc. + +1. Attach type +************** + +``BPF_CGROUP_SYSCTL`` attach type has to be used to attach +``BPF_PROG_TYPE_CGROUP_SYSCTL`` program to a cgroup. + +2. Context +********** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` provides access to the following context from +BPF program:: + + struct bpf_sysctl { + __u32 write; + __u32 file_pos; + }; + +* ``write`` indicates whether sysctl value is being read (``0``) or written + (``1``). This field is read-only. + +* ``file_pos`` indicates file position sysctl is being accessed at, read + or written. This field is read-write. Writing to the field sets the starting + position in sysctl proc file ``read(2)`` will be reading from or ``write(2)`` + will be writing to. Writing zero to the field can be used e.g. to override + whole sysctl value by ``bpf_sysctl_set_new_value()`` on ``write(2)`` even + when it's called by user space on ``file_pos > 0``. Writing non-zero + value to the field can be used to access part of sysctl value starting from + specified ``file_pos``. Not all sysctl support access with ``file_pos != + 0``, e.g. writes to numeric sysctl entries must always be at file position + ``0``. See also ``kernel.sysctl_writes_strict`` sysctl. + +See `linux/bpf.h`_ for more details on how context field can be accessed. + +3. Return code +************** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` program must return one of the following +return codes: + +* ``0`` means "reject access to sysctl"; +* ``1`` means "proceed with access". + +If program returns ``0`` user space will get ``-1`` from ``read(2)`` or +``write(2)`` and ``errno`` will be set to ``EPERM``. + +4. Helpers +********** + +Since sysctl knob is represented by a name and a value, sysctl specific BPF +helpers focus on providing access to these properties: + +* ``bpf_sysctl_get_name()`` to get sysctl name as it is visible in + ``/proc/sys`` into provided by BPF program buffer; + +* ``bpf_sysctl_get_current_value()`` to get string value currently held by + sysctl into provided by BPF program buffer. This helper is available on both + ``read(2)`` from and ``write(2)`` to sysctl; + +* ``bpf_sysctl_get_new_value()`` to get new string value currently being + written to sysctl before actual write happens. This helper can be used only + on ``ctx->write == 1``; + +* ``bpf_sysctl_set_new_value()`` to override new string value currently being + written to sysctl before actual write happens. Sysctl value will be + overridden starting from the current ``ctx->file_pos``. If the whole value + has to be overridden BPF program can set ``file_pos`` to zero before calling + to the helper. This helper can be used only on ``ctx->write == 1``. New + string value set by the helper is treated and verified by kernel same way as + an equivalent string passed by user space. + +BPF program sees sysctl value same way as user space does in proc filesystem, +i.e. as a string. Since many sysctl values represent an integer or a vector +of integers, the following helpers can be used to get numeric value from the +string: + +* ``bpf_strtol()`` to convert initial part of the string to long integer + similar to user space `strtol(3)`_; +* ``bpf_strtoul()`` to convert initial part of the string to unsigned long + integer similar to user space `strtoul(3)`_; + +See `linux/bpf.h`_ for more details on helpers described here. + +5. Examples +*********** + +See `test_sysctl_prog.c`_ for an example of BPF program in C that access +sysctl name and value, parses string value to get vector of integers and uses +the result to make decision whether to allow or deny access to sysctl. + +6. Notes +******** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` is intended to be used in **trusted** root +environment, for example to monitor sysctl usage or catch unreasonable values +an application, running as root in a separate cgroup, is trying to set. + +Since `task_dfl_cgroup(current)` is called at `sys_read` / `sys_write` time it +may return results different from that at `sys_open` time, i.e. process that +opened sysctl file in proc filesystem may differ from process that is trying +to read from / write to it and two such processes may run in different +cgroups, what means ``BPF_PROG_TYPE_CGROUP_SYSCTL`` should not be used as a +security mechanism to limit sysctl usage. + +As with any cgroup-bpf program additional care should be taken if an +application running as root in a cgroup should not be allowed to +detach/replace BPF program attached by administrator. + +.. Links +.. _linux/bpf.h: ../../include/uapi/linux/bpf.h +.. _strtol(3): http://man7.org/linux/man-pages/man3/strtol.3p.html +.. _strtoul(3): http://man7.org/linux/man-pages/man3/strtoul.3p.html +.. _test_sysctl_prog.c: + ../../tools/testing/selftests/bpf/progs/test_sysctl_prog.c diff --git a/Documentation/networking/bpf_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst index b375ae2ec2c4..ed343abe541e 100644 --- a/Documentation/networking/bpf_flow_dissector.rst +++ b/Documentation/bpf/prog_flow_dissector.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -================== -BPF Flow Dissector -================== +============================ +BPF_PROG_TYPE_FLOW_DISSECTOR +============================ Overview ======== diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 269d6f2661d5..f390fe3cfdfb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -9,7 +9,6 @@ Contents: netdev-FAQ af_xdp batman-adv - bpf_flow_dissector can can_ucan_protocol device_drivers/freescale/dpaa2/index diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c index 390a722e6211..ee657003c1a1 100644 --- a/drivers/media/rc/bpf-lirc.c +++ b/drivers/media/rc/bpf-lirc.c @@ -97,6 +97,12 @@ lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_ktime_get_ns: return &bpf_ktime_get_ns_proto; case BPF_FUNC_tail_call: diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index d65390727541..2d61e5e8c863 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -13,6 +13,7 @@ #include <linux/namei.h> #include <linux/mm.h> #include <linux/module.h> +#include <linux/bpf-cgroup.h> #include "internal.h" static const struct dentry_operations proc_sys_dentry_operations; @@ -569,8 +570,8 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf, struct inode *inode = file_inode(filp); struct ctl_table_header *head = grab_header(inode); struct ctl_table *table = PROC_I(inode)->sysctl_entry; + void *new_buf = NULL; ssize_t error; - size_t res; if (IS_ERR(head)) return PTR_ERR(head); @@ -588,11 +589,27 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf, if (!table->proc_handler) goto out; + error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, &count, + ppos, &new_buf); + if (error) + goto out; + /* careful: calling conventions are nasty here */ - res = count; - error = table->proc_handler(table, write, buf, &res, ppos); + if (new_buf) { + mm_segment_t old_fs; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + error = table->proc_handler(table, write, (void __user *)new_buf, + &count, ppos); + set_fs(old_fs); + kfree(new_buf); + } else { + error = table->proc_handler(table, write, buf, &count, ppos); + } + if (!error) - error = res; + error = count; out: sysctl_head_finish(head); diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index a4c644c1c091..cb3c6b3b89c8 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -17,6 +17,8 @@ struct bpf_map; struct bpf_prog; struct bpf_sock_ops_kern; struct bpf_cgroup_storage; +struct ctl_table; +struct ctl_table_header; #ifdef CONFIG_CGROUP_BPF @@ -109,6 +111,12 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk, int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, short access, enum bpf_attach_type type); +int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head, + struct ctl_table *table, int write, + void __user *buf, size_t *pcount, + loff_t *ppos, void **new_buf, + enum bpf_attach_type type); + static inline enum bpf_cgroup_storage_type cgroup_storage_type( struct bpf_map *map) { @@ -253,6 +261,18 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, \ __ret; \ }) + + +#define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, count, pos, nbuf) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled) \ + __ret = __cgroup_bpf_run_filter_sysctl(head, table, write, \ + buf, count, pos, nbuf, \ + BPF_CGROUP_SYSCTL); \ + __ret; \ +}) + int cgroup_bpf_prog_attach(const union bpf_attr *attr, enum bpf_prog_type ptype, struct bpf_prog *prog); int cgroup_bpf_prog_detach(const union bpf_attr *attr, @@ -321,6 +341,7 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; }) #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; }) #define for_each_cgroup_storage_type(stype) for (; false; ) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e4d4c1771ab0..f15432d90728 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -202,6 +202,8 @@ enum bpf_arg_type { ARG_ANYTHING, /* any (initialized) argument is ok */ ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */ ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */ + ARG_PTR_TO_INT, /* pointer to int */ + ARG_PTR_TO_LONG, /* pointer to long */ }; /* type of values returned from helper functions */ @@ -987,6 +989,8 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto; extern const struct bpf_func_proto bpf_spin_lock_proto; extern const struct bpf_func_proto bpf_spin_unlock_proto; extern const struct bpf_func_proto bpf_get_local_storage_proto; +extern const struct bpf_func_proto bpf_strtol_proto; +extern const struct bpf_func_proto bpf_strtoul_proto; /* Shared helpers among cBPF and eBPF. */ void bpf_user_rnd_init_once(void); diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 08bf2f1fe553..d26991a16894 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -28,6 +28,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint) #endif #ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl) #endif #ifdef CONFIG_BPF_LIRC_MODE2 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index b3ab61fe1932..1305ccbd8fe6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -295,6 +295,11 @@ struct bpf_verifier_env { const struct bpf_line_info *prev_linfo; struct bpf_verifier_log log; struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1]; + struct { + int *insn_state; + int *insn_stack; + int cur_stack; + } cfg; u32 subprog_cnt; /* number of instructions analyzed by the verifier */ u32 insn_processed; diff --git a/include/linux/filter.h b/include/linux/filter.h index 6074aa064b54..fb0edad75971 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -33,6 +33,8 @@ struct bpf_prog_aux; struct xdp_rxq_info; struct xdp_buff; struct sock_reuseport; +struct ctl_table; +struct ctl_table_header; /* ArgX, context and stack frame pointer register positions. Note, * Arg1, Arg2, Arg3, etc are used as argument mappings of function @@ -1177,4 +1179,18 @@ struct bpf_sock_ops_kern { */ }; +struct bpf_sysctl_kern { + struct ctl_table_header *head; + struct ctl_table *table; + void *cur_val; + size_t cur_len; + void *new_val; + size_t new_len; + int new_updated; + int write; + loff_t *ppos; + /* Temporary "register" for indirect stores to ppos. */ + u64 tmp_reg; +}; + #endif /* __LINUX_FILTER_H__ */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index e4ee92089dd6..6f42942a443b 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1042,6 +1042,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags, int node); struct sk_buff *__build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb(void *data, unsigned int frag_size); +struct sk_buff *build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size); /** * alloc_skb - allocate a network buffer diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 2e96d0b4bf65..eaf2d3284248 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -167,6 +167,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LIRC_MODE2, BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR, + BPF_PROG_TYPE_CGROUP_SYSCTL, }; enum bpf_attach_type { @@ -188,6 +189,7 @@ enum bpf_attach_type { BPF_CGROUP_UDP6_SENDMSG, BPF_LIRC_MODE2, BPF_FLOW_DISSECTOR, + BPF_CGROUP_SYSCTL, __MAX_BPF_ATTACH_TYPE }; @@ -1735,12 +1737,19 @@ union bpf_attr { * error if an eBPF program tries to set a callback that is not * supported in the current kernel. * - * The supported callback values that *argval* can combine are: + * *argval* is a flag array which can combine these flags: * * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) * + * Therefore, this function can be used to clear a callback flag by + * setting the appropriate bit to zero. e.g. to disable the RTO + * callback: + * + * **bpf_sock_ops_cb_flags_set(bpf_sock,** + * **bpf_sock->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RTO_CB_FLAG)** + * * Here are some examples of where one could call such eBPF * program: * @@ -2504,6 +2513,122 @@ union bpf_attr { * Return * 0 if iph and th are a valid SYN cookie ACK, or a negative error * otherwise. + * + * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags) + * Description + * Get name of sysctl in /proc/sys/ and copy it into provided by + * program buffer *buf* of size *buf_len*. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * + * If *flags* is zero, full name (e.g. "net/ipv4/tcp_mem") is + * copied. Use **BPF_F_SYSCTL_BASE_NAME** flag to copy base name + * only (e.g. "tcp_mem"). + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get current value of sysctl as it is presented in /proc/sys + * (incl. newline, etc), and copy it as a string into provided + * by program buffer *buf* of size *buf_len*. + * + * The whole value is copied, no matter what file position user + * space issued e.g. sys_read at. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if current value was unavailable, e.g. because + * sysctl is uninitialized and read returns -EIO for it. + * + * int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get new value being written by user space to sysctl (before + * the actual write happens) and copy it as a string into + * provided by program buffer *buf* of size *buf_len*. + * + * User space may write new value at file position > 0. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len) + * Description + * Override new value being written by user space to sysctl with + * value provided by program in buffer *buf* of size *buf_len*. + * + * *buf* should contain a string in same form as provided by user + * space on sysctl write. + * + * User space may write new value at file position > 0. To override + * the whole sysctl value file position should be set to zero. + * Return + * 0 on success. + * + * **-E2BIG** if the *buf_len* is too big. + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to a long integer according to the given base + * and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)) followed by a single optional '-' + * sign. + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtol(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. + * + * int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to an unsigned long integer according to the + * given base and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)). + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtoul(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2606,7 +2731,13 @@ union bpf_attr { FN(skb_ecn_set_ce), \ FN(get_listener_sock), \ FN(skc_lookup_tcp), \ - FN(tcp_check_syncookie), + FN(tcp_check_syncookie), \ + FN(sysctl_get_name), \ + FN(sysctl_get_current_value), \ + FN(sysctl_get_new_value), \ + FN(sysctl_set_new_value), \ + FN(strtol), \ + FN(strtoul), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2668,17 +2799,20 @@ enum bpf_func_id { /* BPF_FUNC_skb_adjust_room flags. */ #define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0) -#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff -#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56 +#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff +#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56 #define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1) #define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2) #define BPF_F_ADJ_ROOM_ENCAP_L4_GRE (1ULL << 3) #define BPF_F_ADJ_ROOM_ENCAP_L4_UDP (1ULL << 4) -#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \ +#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \ BPF_ADJ_ROOM_ENCAP_L2_MASK) \ << BPF_ADJ_ROOM_ENCAP_L2_SHIFT) +/* BPF_FUNC_sysctl_get_name flags. */ +#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0) + /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, @@ -3308,4 +3442,14 @@ struct bpf_line_info { struct bpf_spin_lock { __u32 val; }; + +struct bpf_sysctl { + __u32 write; /* Sysctl is being read (= 0) or written (= 1). + * Allows 1,2,4-byte read, but no write. + */ + __u32 file_pos; /* Sysctl file position to read from, write to. + * Allows 1,2,4-byte read an 4-byte write. + */ +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 4e807973aa80..fcde0f7b2585 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -11,7 +11,10 @@ #include <linux/kernel.h> #include <linux/atomic.h> #include <linux/cgroup.h> +#include <linux/filter.h> #include <linux/slab.h> +#include <linux/sysctl.h> +#include <linux/string.h> #include <linux/bpf.h> #include <linux/bpf-cgroup.h> #include <net/sock.h> @@ -701,7 +704,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission); static const struct bpf_func_proto * -cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_map_lookup_elem: @@ -710,6 +713,12 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_get_current_uid_gid: return &bpf_get_current_uid_gid_proto; case BPF_FUNC_get_local_storage: @@ -725,6 +734,12 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } } +static const struct bpf_func_proto * +cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + return cgroup_base_func_proto(func_id, prog); +} + static bool cgroup_dev_is_valid_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, @@ -762,3 +777,356 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = { .get_func_proto = cgroup_dev_func_proto, .is_valid_access = cgroup_dev_is_valid_access, }; + +/** + * __cgroup_bpf_run_filter_sysctl - Run a program on sysctl + * + * @head: sysctl table header + * @table: sysctl table + * @write: sysctl is being read (= 0) or written (= 1) + * @buf: pointer to buffer passed by user space + * @pcount: value-result argument: value is size of buffer pointed to by @buf, + * result is size of @new_buf if program set new value, initial value + * otherwise + * @ppos: value-result argument: value is position at which read from or write + * to sysctl is happening, result is new position if program overrode it, + * initial value otherwise + * @new_buf: pointer to pointer to new buffer that will be allocated if program + * overrides new value provided by user space on sysctl write + * NOTE: it's caller responsibility to free *new_buf if it was set + * @type: type of program to be executed + * + * Program is run when sysctl is being accessed, either read or written, and + * can allow or deny such access. + * + * This function will return %-EPERM if an attached program is found and + * returned value != 1 during execution. In all other cases 0 is returned. + */ +int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head, + struct ctl_table *table, int write, + void __user *buf, size_t *pcount, + loff_t *ppos, void **new_buf, + enum bpf_attach_type type) +{ + struct bpf_sysctl_kern ctx = { + .head = head, + .table = table, + .write = write, + .ppos = ppos, + .cur_val = NULL, + .cur_len = PAGE_SIZE, + .new_val = NULL, + .new_len = 0, + .new_updated = 0, + }; + struct cgroup *cgrp; + int ret; + + ctx.cur_val = kmalloc_track_caller(ctx.cur_len, GFP_KERNEL); + if (ctx.cur_val) { + mm_segment_t old_fs; + loff_t pos = 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + if (table->proc_handler(table, 0, (void __user *)ctx.cur_val, + &ctx.cur_len, &pos)) { + /* Let BPF program decide how to proceed. */ + ctx.cur_len = 0; + } + set_fs(old_fs); + } else { + /* Let BPF program decide how to proceed. */ + ctx.cur_len = 0; + } + + if (write && buf && *pcount) { + /* BPF program should be able to override new value with a + * buffer bigger than provided by user. + */ + ctx.new_val = kmalloc_track_caller(PAGE_SIZE, GFP_KERNEL); + ctx.new_len = min_t(size_t, PAGE_SIZE, *pcount); + if (!ctx.new_val || + copy_from_user(ctx.new_val, buf, ctx.new_len)) + /* Let BPF program decide how to proceed. */ + ctx.new_len = 0; + } + + rcu_read_lock(); + cgrp = task_dfl_cgroup(current); + ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN); + rcu_read_unlock(); + + kfree(ctx.cur_val); + + if (ret == 1 && ctx.new_updated) { + *new_buf = ctx.new_val; + *pcount = ctx.new_len; + } else { + kfree(ctx.new_val); + } + + return ret == 1 ? 0 : -EPERM; +} +EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl); + +static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp, + size_t *lenp) +{ + ssize_t tmp_ret = 0, ret; + + if (dir->header.parent) { + tmp_ret = sysctl_cpy_dir(dir->header.parent, bufp, lenp); + if (tmp_ret < 0) + return tmp_ret; + } + + ret = strscpy(*bufp, dir->header.ctl_table[0].procname, *lenp); + if (ret < 0) + return ret; + *bufp += ret; + *lenp -= ret; + ret += tmp_ret; + + /* Avoid leading slash. */ + if (!ret) + return ret; + + tmp_ret = strscpy(*bufp, "/", *lenp); + if (tmp_ret < 0) + return tmp_ret; + *bufp += tmp_ret; + *lenp -= tmp_ret; + + return ret + tmp_ret; +} + +BPF_CALL_4(bpf_sysctl_get_name, struct bpf_sysctl_kern *, ctx, char *, buf, + size_t, buf_len, u64, flags) +{ + ssize_t tmp_ret = 0, ret; + + if (!buf) + return -EINVAL; + + if (!(flags & BPF_F_SYSCTL_BASE_NAME)) { + if (!ctx->head) + return -EINVAL; + tmp_ret = sysctl_cpy_dir(ctx->head->parent, &buf, &buf_len); + if (tmp_ret < 0) + return tmp_ret; + } + + ret = strscpy(buf, ctx->table->procname, buf_len); + + return ret < 0 ? ret : tmp_ret + ret; +} + +static const struct bpf_func_proto bpf_sysctl_get_name_proto = { + .func = bpf_sysctl_get_name, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +static int copy_sysctl_value(char *dst, size_t dst_len, char *src, + size_t src_len) +{ + if (!dst) + return -EINVAL; + + if (!dst_len) + return -E2BIG; + + if (!src || !src_len) { + memset(dst, 0, dst_len); + return -EINVAL; + } + + memcpy(dst, src, min(dst_len, src_len)); + + if (dst_len > src_len) { + memset(dst + src_len, '\0', dst_len - src_len); + return src_len; + } + + dst[dst_len - 1] = '\0'; + + return -E2BIG; +} + +BPF_CALL_3(bpf_sysctl_get_current_value, struct bpf_sysctl_kern *, ctx, + char *, buf, size_t, buf_len) +{ + return copy_sysctl_value(buf, buf_len, ctx->cur_val, ctx->cur_len); +} + +static const struct bpf_func_proto bpf_sysctl_get_current_value_proto = { + .func = bpf_sysctl_get_current_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +BPF_CALL_3(bpf_sysctl_get_new_value, struct bpf_sysctl_kern *, ctx, char *, buf, + size_t, buf_len) +{ + if (!ctx->write) { + if (buf && buf_len) + memset(buf, '\0', buf_len); + return -EINVAL; + } + return copy_sysctl_value(buf, buf_len, ctx->new_val, ctx->new_len); +} + +static const struct bpf_func_proto bpf_sysctl_get_new_value_proto = { + .func = bpf_sysctl_get_new_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +BPF_CALL_3(bpf_sysctl_set_new_value, struct bpf_sysctl_kern *, ctx, + const char *, buf, size_t, buf_len) +{ + if (!ctx->write || !ctx->new_val || !ctx->new_len || !buf || !buf_len) + return -EINVAL; + + if (buf_len > PAGE_SIZE - 1) + return -E2BIG; + + memcpy(ctx->new_val, buf, buf_len); + ctx->new_len = buf_len; + ctx->new_updated = 1; + + return 0; +} + +static const struct bpf_func_proto bpf_sysctl_set_new_value_proto = { + .func = bpf_sysctl_set_new_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +static const struct bpf_func_proto * +sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_strtol: + return &bpf_strtol_proto; + case BPF_FUNC_strtoul: + return &bpf_strtoul_proto; + case BPF_FUNC_sysctl_get_name: + return &bpf_sysctl_get_name_proto; + case BPF_FUNC_sysctl_get_current_value: + return &bpf_sysctl_get_current_value_proto; + case BPF_FUNC_sysctl_get_new_value: + return &bpf_sysctl_get_new_value_proto; + case BPF_FUNC_sysctl_set_new_value: + return &bpf_sysctl_set_new_value_proto; + default: + return cgroup_base_func_proto(func_id, prog); + } +} + +static bool sysctl_is_valid_access(int off, int size, enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + const int size_default = sizeof(__u32); + + if (off < 0 || off + size > sizeof(struct bpf_sysctl) || off % size) + return false; + + switch (off) { + case offsetof(struct bpf_sysctl, write): + if (type != BPF_READ) + return false; + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, size_default); + case offsetof(struct bpf_sysctl, file_pos): + if (type == BPF_READ) { + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, size_default); + } else { + return size == size_default; + } + default: + return false; + } +} + +static u32 sysctl_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + + switch (si->off) { + case offsetof(struct bpf_sysctl, write): + *insn++ = BPF_LDX_MEM( + BPF_SIZE(si->code), si->dst_reg, si->src_reg, + bpf_target_off(struct bpf_sysctl_kern, write, + FIELD_SIZEOF(struct bpf_sysctl_kern, + write), + target_size)); + break; + case offsetof(struct bpf_sysctl, file_pos): + /* ppos is a pointer so it should be accessed via indirect + * loads and stores. Also for stores additional temporary + * register is used since neither src_reg nor dst_reg can be + * overridden. + */ + if (type == BPF_WRITE) { + int treg = BPF_REG_9; + + if (si->src_reg == treg || si->dst_reg == treg) + --treg; + if (si->src_reg == treg || si->dst_reg == treg) + --treg; + *insn++ = BPF_STX_MEM( + BPF_DW, si->dst_reg, treg, + offsetof(struct bpf_sysctl_kern, tmp_reg)); + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos), + treg, si->dst_reg, + offsetof(struct bpf_sysctl_kern, ppos)); + *insn++ = BPF_STX_MEM( + BPF_SIZEOF(u32), treg, si->src_reg, 0); + *insn++ = BPF_LDX_MEM( + BPF_DW, treg, si->dst_reg, + offsetof(struct bpf_sysctl_kern, tmp_reg)); + } else { + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sysctl_kern, ppos)); + *insn++ = BPF_LDX_MEM( + BPF_SIZE(si->code), si->dst_reg, si->dst_reg, 0); + } + *target_size = sizeof(u32); + break; + } + + return insn - insn_buf; +} + +const struct bpf_verifier_ops cg_sysctl_verifier_ops = { + .get_func_proto = sysctl_func_proto, + .is_valid_access = sysctl_is_valid_access, + .convert_ctx_access = sysctl_convert_ctx_access, +}; + +const struct bpf_prog_ops cg_sysctl_prog_ops = { +}; diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 3c18260403dd..cf727d77c6c6 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work) } static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, - struct xdp_frame *xdpf) + struct xdp_frame *xdpf, + struct sk_buff *skb) { unsigned int hard_start_headroom; unsigned int frame_size; void *pkt_data_start; - struct sk_buff *skb; /* Part of headroom was reserved to xdpf */ hard_start_headroom = sizeof(struct xdp_frame) + xdpf->headroom; @@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); pkt_data_start = xdpf->data - hard_start_headroom; - skb = build_skb(pkt_data_start, frame_size); - if (!skb) + skb = build_skb_around(skb, pkt_data_start, frame_size); + if (unlikely(!skb)) return NULL; skb_reserve(skb, hard_start_headroom); @@ -240,6 +240,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) } } +#define CPUMAP_BATCH 8 + static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; @@ -252,8 +254,11 @@ static int cpu_map_kthread_run(void *data) * kthread_stop signal until queue is empty. */ while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) { - unsigned int processed = 0, drops = 0, sched = 0; - struct xdp_frame *xdpf; + unsigned int drops = 0, sched = 0; + void *frames[CPUMAP_BATCH]; + void *skbs[CPUMAP_BATCH]; + gfp_t gfp = __GFP_ZERO | GFP_ATOMIC; + int i, n, m; /* Release CPU reschedule checks */ if (__ptr_ring_empty(rcpu->queue)) { @@ -269,18 +274,38 @@ static int cpu_map_kthread_run(void *data) sched = cond_resched(); } - /* Process packets in rcpu->queue */ - local_bh_disable(); /* * The bpf_cpu_map_entry is single consumer, with this * kthread CPU pinned. Lockless access to ptr_ring * consume side valid as no-resize allowed of queue. */ - while ((xdpf = __ptr_ring_consume(rcpu->queue))) { - struct sk_buff *skb; + n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH); + + for (i = 0; i < n; i++) { + void *f = frames[i]; + struct page *page = virt_to_page(f); + + /* Bring struct page memory area to curr CPU. Read by + * build_skb_around via page_is_pfmemalloc(), and when + * freed written by page_frag_free call. + */ + prefetchw(page); + } + + m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs); + if (unlikely(m == 0)) { + for (i = 0; i < n; i++) + skbs[i] = NULL; /* effect: xdp_return_frame */ + drops = n; + } + + local_bh_disable(); + for (i = 0; i < n; i++) { + struct xdp_frame *xdpf = frames[i]; + struct sk_buff *skb = skbs[i]; int ret; - skb = cpu_map_build_skb(rcpu, xdpf); + skb = cpu_map_build_skb(rcpu, xdpf, skb); if (!skb) { xdp_return_frame(xdpf); continue; @@ -290,13 +315,9 @@ static int cpu_map_kthread_run(void *data) ret = netif_receive_skb_core(skb); if (ret == NET_RX_DROP) drops++; - - /* Limit BH-disable period */ - if (++processed == 8) - break; } /* Feedback loop via tracepoint */ - trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched); + trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched); local_bh_enable(); /* resched point, may call do_softirq() */ } diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a411fc17d265..4266ffde07ca 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -18,6 +18,9 @@ #include <linux/sched.h> #include <linux/uidgid.h> #include <linux/filter.h> +#include <linux/ctype.h> + +#include "../../lib/kstrtox.h" /* If kernel subsystem is allowing eBPF programs to call this function, * inside its own verifier_ops->get_func_proto() callback it should return @@ -363,4 +366,132 @@ const struct bpf_func_proto bpf_get_local_storage_proto = { .arg2_type = ARG_ANYTHING, }; #endif + +#define BPF_STRTOX_BASE_MASK 0x1F + +static int __bpf_strtoull(const char *buf, size_t buf_len, u64 flags, + unsigned long long *res, bool *is_negative) +{ + unsigned int base = flags & BPF_STRTOX_BASE_MASK; + const char *cur_buf = buf; + size_t cur_len = buf_len; + unsigned int consumed; + size_t val_len; + char str[64]; + + if (!buf || !buf_len || !res || !is_negative) + return -EINVAL; + + if (base != 0 && base != 8 && base != 10 && base != 16) + return -EINVAL; + + if (flags & ~BPF_STRTOX_BASE_MASK) + return -EINVAL; + + while (cur_buf < buf + buf_len && isspace(*cur_buf)) + ++cur_buf; + + *is_negative = (cur_buf < buf + buf_len && *cur_buf == '-'); + if (*is_negative) + ++cur_buf; + + consumed = cur_buf - buf; + cur_len -= consumed; + if (!cur_len) + return -EINVAL; + + cur_len = min(cur_len, sizeof(str) - 1); + memcpy(str, cur_buf, cur_len); + str[cur_len] = '\0'; + cur_buf = str; + + cur_buf = _parse_integer_fixup_radix(cur_buf, &base); + val_len = _parse_integer(cur_buf, base, res); + + if (val_len & KSTRTOX_OVERFLOW) + return -ERANGE; + + if (val_len == 0) + return -EINVAL; + + cur_buf += val_len; + consumed += cur_buf - str; + + return consumed; +} + +static int __bpf_strtoll(const char *buf, size_t buf_len, u64 flags, + long long *res) +{ + unsigned long long _res; + bool is_negative; + int err; + + err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative); + if (err < 0) + return err; + if (is_negative) { + if ((long long)-_res > 0) + return -ERANGE; + *res = -_res; + } else { + if ((long long)_res < 0) + return -ERANGE; + *res = _res; + } + return err; +} + +BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags, + long *, res) +{ + long long _res; + int err; + + err = __bpf_strtoll(buf, buf_len, flags, &_res); + if (err < 0) + return err; + if (_res != (long)_res) + return -ERANGE; + *res = _res; + return err; +} + +const struct bpf_func_proto bpf_strtol_proto = { + .func = bpf_strtol, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_MEM, + .arg2_type = ARG_CONST_SIZE, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_LONG, +}; + +BPF_CALL_4(bpf_strtoul, const char *, buf, size_t, buf_len, u64, flags, + unsigned long *, res) +{ + unsigned long long _res; + bool is_negative; + int err; + + err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative); + if (err < 0) + return err; + if (is_negative) + return -EINVAL; + if (_res != (unsigned long)_res) + return -ERANGE; + *res = _res; + return err; +} + +const struct bpf_func_proto bpf_strtoul_proto = { + .func = bpf_strtoul, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_MEM, + .arg2_type = ARG_CONST_SIZE, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_LONG, +}; #endif diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index d995eedfdd16..92c9b8a32b50 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1888,6 +1888,9 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_FLOW_DISSECTOR: ptype = BPF_PROG_TYPE_FLOW_DISSECTOR; break; + case BPF_CGROUP_SYSCTL: + ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; + break; default: return -EINVAL; } @@ -1966,6 +1969,9 @@ static int bpf_prog_detach(const union bpf_attr *attr) return lirc_prog_detach(attr); case BPF_FLOW_DISSECTOR: return skb_flow_dissector_bpf_prog_detach(attr); + case BPF_CGROUP_SYSCTL: + ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; + break; default: return -EINVAL; } @@ -1999,6 +2005,7 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_SOCK_OPS: case BPF_CGROUP_DEVICE: + case BPF_CGROUP_SYSCTL: break; case BPF_LIRC_MODE2: return lirc_prog_query(attr, uattr); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f25b7c9c20ba..423f242a5efb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1177,30 +1177,32 @@ static int check_reg_arg(struct bpf_verifier_env *env, u32 regno, { struct bpf_verifier_state *vstate = env->cur_state; struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *regs = state->regs; + struct bpf_reg_state *reg, *regs = state->regs; if (regno >= MAX_BPF_REG) { verbose(env, "R%d is invalid\n", regno); return -EINVAL; } + reg = ®s[regno]; if (t == SRC_OP) { /* check whether register used as source operand can be read */ - if (regs[regno].type == NOT_INIT) { + if (reg->type == NOT_INIT) { verbose(env, "R%d !read_ok\n", regno); return -EACCES; } /* We don't need to worry about FP liveness because it's read-only */ - if (regno != BPF_REG_FP) - return mark_reg_read(env, ®s[regno], - regs[regno].parent); + if (regno == BPF_REG_FP) + return 0; + + return mark_reg_read(env, reg, reg->parent); } else { /* check whether register used as dest operand can be written to */ if (regno == BPF_REG_FP) { verbose(env, "frame pointer is read only\n"); return -EACCES; } - regs[regno].live |= REG_LIVE_WRITTEN; + reg->live |= REG_LIVE_WRITTEN; if (t == DST_OP) mark_reg_unknown(env, regs, regno); } @@ -2462,6 +2464,22 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type) type == ARG_CONST_SIZE_OR_ZERO; } +static bool arg_type_is_int_ptr(enum bpf_arg_type type) +{ + return type == ARG_PTR_TO_INT || + type == ARG_PTR_TO_LONG; +} + +static int int_ptr_type_to_size(enum bpf_arg_type type) +{ + if (type == ARG_PTR_TO_INT) + return sizeof(u32); + else if (type == ARG_PTR_TO_LONG) + return sizeof(u64); + + return -EINVAL; +} + static int check_func_arg(struct bpf_verifier_env *env, u32 regno, enum bpf_arg_type arg_type, struct bpf_call_arg_meta *meta) @@ -2554,6 +2572,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, type != expected_type) goto err_type; meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM; + } else if (arg_type_is_int_ptr(arg_type)) { + expected_type = PTR_TO_STACK; + if (!type_is_pkt_pointer(type) && + type != PTR_TO_MAP_VALUE && + type != expected_type) + goto err_type; } else { verbose(env, "unsupported arg_type %d\n", arg_type); return -EFAULT; @@ -2635,6 +2659,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, err = check_helper_mem_access(env, regno - 1, reg->umax_value, zero_size_allowed, meta); + } else if (arg_type_is_int_ptr(arg_type)) { + int size = int_ptr_type_to_size(arg_type); + + err = check_helper_mem_access(env, regno, size, false, meta); + if (err) + return err; + err = check_ptr_alignment(env, reg, 0, size, true); } return err; @@ -5267,6 +5298,7 @@ static int check_return_code(struct bpf_verifier_env *env) case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: case BPF_PROG_TYPE_SOCK_OPS: case BPF_PROG_TYPE_CGROUP_DEVICE: + case BPF_PROG_TYPE_CGROUP_SYSCTL: break; default: return 0; @@ -5337,10 +5369,6 @@ enum { #define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L) -static int *insn_stack; /* stack of insns to process */ -static int cur_stack; /* current stack index */ -static int *insn_state; - /* t, w, e - match pseudo-code above: * t - index of current instruction * w - next instruction @@ -5348,6 +5376,9 @@ static int *insn_state; */ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) { + int *insn_stack = env->cfg.insn_stack; + int *insn_state = env->cfg.insn_state; + if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH)) return 0; @@ -5368,9 +5399,9 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) /* tree-edge */ insn_state[t] = DISCOVERED | e; insn_state[w] = DISCOVERED; - if (cur_stack >= env->prog->len) + if (env->cfg.cur_stack >= env->prog->len) return -E2BIG; - insn_stack[cur_stack++] = w; + insn_stack[env->cfg.cur_stack++] = w; return 1; } else if ((insn_state[w] & 0xF0) == DISCOVERED) { verbose_linfo(env, t, "%d: ", t); @@ -5394,14 +5425,15 @@ static int check_cfg(struct bpf_verifier_env *env) { struct bpf_insn *insns = env->prog->insnsi; int insn_cnt = env->prog->len; + int *insn_stack, *insn_state; int ret = 0; int i, t; - insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); + insn_state = env->cfg.insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); if (!insn_state) return -ENOMEM; - insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); + insn_stack = env->cfg.insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); if (!insn_stack) { kvfree(insn_state); return -ENOMEM; @@ -5409,12 +5441,12 @@ static int check_cfg(struct bpf_verifier_env *env) insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */ insn_stack[0] = 0; /* 0 is the first instruction */ - cur_stack = 1; + env->cfg.cur_stack = 1; peek_stack: - if (cur_stack == 0) + if (env->cfg.cur_stack == 0) goto check_state; - t = insn_stack[cur_stack - 1]; + t = insn_stack[env->cfg.cur_stack - 1]; if (BPF_CLASS(insns[t].code) == BPF_JMP || BPF_CLASS(insns[t].code) == BPF_JMP32) { @@ -5483,7 +5515,7 @@ peek_stack: mark_explored: insn_state[t] = EXPLORED; - if (cur_stack-- <= 0) { + if (env->cfg.cur_stack-- <= 0) { verbose(env, "pop stack internal bug\n"); ret = -EFAULT; goto err_free; @@ -5503,6 +5535,7 @@ check_state: err_free: kvfree(insn_state); kvfree(insn_stack); + env->cfg.insn_state = env->cfg.insn_stack = NULL; return ret; } @@ -6191,6 +6224,22 @@ static bool states_equal(struct bpf_verifier_env *env, return true; } +static int propagate_liveness_reg(struct bpf_verifier_env *env, + struct bpf_reg_state *reg, + struct bpf_reg_state *parent_reg) +{ + int err; + + if (parent_reg->live & REG_LIVE_READ || !(reg->live & REG_LIVE_READ)) + return 0; + + err = mark_reg_read(env, reg, parent_reg); + if (err) + return err; + + return 0; +} + /* A write screens off any subsequent reads; but write marks come from the * straight-line code between a state and its parent. When we arrive at an * equivalent state (jump target or such) we didn't arrive by the straight-line @@ -6202,8 +6251,9 @@ static int propagate_liveness(struct bpf_verifier_env *env, const struct bpf_verifier_state *vstate, struct bpf_verifier_state *vparent) { - int i, frame, err = 0; + struct bpf_reg_state *state_reg, *parent_reg; struct bpf_func_state *state, *parent; + int i, frame, err = 0; if (vparent->curframe != vstate->curframe) { WARN(1, "propagate_live: parent frame %d current frame %d\n", @@ -6213,30 +6263,27 @@ static int propagate_liveness(struct bpf_verifier_env *env, /* Propagate read liveness of registers... */ BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG); for (frame = 0; frame <= vstate->curframe; frame++) { + parent = vparent->frame[frame]; + state = vstate->frame[frame]; + parent_reg = parent->regs; + state_reg = state->regs; /* We don't need to worry about FP liveness, it's read-only */ for (i = frame < vstate->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++) { - if (vparent->frame[frame]->regs[i].live & REG_LIVE_READ) - continue; - if (vstate->frame[frame]->regs[i].live & REG_LIVE_READ) { - err = mark_reg_read(env, &vstate->frame[frame]->regs[i], - &vparent->frame[frame]->regs[i]); - if (err) - return err; - } + err = propagate_liveness_reg(env, &state_reg[i], + &parent_reg[i]); + if (err) + return err; } - } - /* ... and stack slots */ - for (frame = 0; frame <= vstate->curframe; frame++) { - state = vstate->frame[frame]; - parent = vparent->frame[frame]; + /* Propagate stack slots. */ for (i = 0; i < state->allocated_stack / BPF_REG_SIZE && i < parent->allocated_stack / BPF_REG_SIZE; i++) { - if (parent->stack[i].spilled_ptr.live & REG_LIVE_READ) - continue; - if (state->stack[i].spilled_ptr.live & REG_LIVE_READ) - mark_reg_read(env, &state->stack[i].spilled_ptr, - &parent->stack[i].spilled_ptr); + parent_reg = &parent->stack[i].spilled_ptr; + state_reg = &state->stack[i].spilled_ptr; + err = propagate_liveness_reg(env, state_reg, + parent_reg); + if (err) + return err; } } return err; @@ -7601,9 +7648,8 @@ static int jit_subprogs(struct bpf_verifier_env *env) insn->src_reg != BPF_PSEUDO_CALL) continue; subprog = insn->off; - insn->imm = (u64 (*)(u64, u64, u64, u64, u64)) - func[subprog]->bpf_func - - __bpf_call_base; + insn->imm = BPF_CAST_CALL(func[subprog]->bpf_func) - + __bpf_call_base; } /* we use the aux data to keep a list of the start addresses @@ -8086,9 +8132,11 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, env->insn_aux_data[i].orig_idx = i; env->prog = *prog; env->ops = bpf_verifier_ops[env->prog->type]; + is_priv = capable(CAP_SYS_ADMIN); /* grab the mutex to protect few globals used by verifier */ - mutex_lock(&bpf_verifier_lock); + if (!is_priv) + mutex_lock(&bpf_verifier_lock); if (attr->log_level || attr->log_buf || attr->log_size) { /* user requested verbose verifier output @@ -8111,7 +8159,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (attr->prog_flags & BPF_F_ANY_ALIGNMENT) env->strict_alignment = false; - is_priv = capable(CAP_SYS_ADMIN); env->allow_ptr_leaks = is_priv; ret = replace_map_fd_with_map_ptr(env); @@ -8224,7 +8271,8 @@ err_release_maps: release_maps(env); *prog = env->prog; err_unlock: - mutex_unlock(&bpf_verifier_lock); + if (!is_priv) + mutex_unlock(&bpf_verifier_lock); vfree(env->insn_aux_data); err_free_env: kfree(env); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d64c00afceb5..91800be0c8eb 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -569,6 +569,12 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_probe_read: return &bpf_probe_read_proto; case BPF_FUNC_ktime_get_ns: diff --git a/net/core/filter.c b/net/core/filter.c index 1644a16afcec..fa8fb0548217 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3069,6 +3069,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, { int ret; + if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO) + return -EINVAL; + if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) { /* udp gso_size delineates datagrams, only allow if fixed */ if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) || @@ -4434,8 +4437,7 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock, if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk)) return -EINVAL; - if (val) - tcp_sk(sk)->bpf_sock_ops_cb_flags = val; + tcp_sk(sk)->bpf_sock_ops_cb_flags = val; return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS); } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a083e188374f..e89be6282693 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -258,6 +258,33 @@ nodata: } EXPORT_SYMBOL(__alloc_skb); +/* Caller must provide SKB that is memset cleared */ +static struct sk_buff *__build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size) +{ + struct skb_shared_info *shinfo; + unsigned int size = frag_size ? : ksize(data); + + size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + /* Assumes caller memset cleared SKB */ + skb->truesize = SKB_TRUESIZE(size); + refcount_set(&skb->users, 1); + skb->head = data; + skb->data = data; + skb_reset_tail_pointer(skb); + skb->end = skb->tail + size; + skb->mac_header = (typeof(skb->mac_header))~0U; + skb->transport_header = (typeof(skb->transport_header))~0U; + + /* make sure we initialize shinfo sequentially */ + shinfo = skb_shinfo(skb); + memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); + atomic_set(&shinfo->dataref, 1); + + return skb; +} + /** * __build_skb - build a network buffer * @data: data buffer provided by caller @@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb); */ struct sk_buff *__build_skb(void *data, unsigned int frag_size) { - struct skb_shared_info *shinfo; struct sk_buff *skb; - unsigned int size = frag_size ? : ksize(data); skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); - if (!skb) + if (unlikely(!skb)) return NULL; - size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - memset(skb, 0, offsetof(struct sk_buff, tail)); - skb->truesize = SKB_TRUESIZE(size); - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); - - return skb; + return __build_skb_around(skb, data, frag_size); } /* build_skb() is wrapper over __build_skb(), that specifically @@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size) } EXPORT_SYMBOL(build_skb); +/** + * build_skb_around - build a network buffer around provided skb + * @skb: sk_buff provide by caller, must be memset cleared + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + */ +struct sk_buff *build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size) +{ + if (unlikely(!skb)) + return NULL; + + skb = __build_skb_around(skb, data, frag_size); + + if (skb && frag_size) { + skb->head_frag = 1; + if (page_is_pfmemalloc(virt_to_head_page(data))) + skb->pfmemalloc = 1; + } + return skb; +} +EXPORT_SYMBOL(build_skb_around); + #define NAPI_SKB_CACHE_SIZE 64 struct napi_alloc_cache { diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index 610c0bdc0c2b..88b9ae24658d 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -43,6 +43,48 @@ struct xsk_queue { u64 invalid_descs; }; +/* The structure of the shared state of the rings are the same as the + * ring buffer in kernel/events/ring_buffer.c. For the Rx and completion + * ring, the kernel is the producer and user space is the consumer. For + * the Tx and fill rings, the kernel is the consumer and user space is + * the producer. + * + * producer consumer + * + * if (LOAD ->consumer) { LOAD ->producer + * (A) smp_rmb() (C) + * STORE $data LOAD $data + * smp_wmb() (B) smp_mb() (D) + * STORE ->producer STORE ->consumer + * } + * + * (A) pairs with (D), and (B) pairs with (C). + * + * Starting with (B), it protects the data from being written after + * the producer pointer. If this barrier was missing, the consumer + * could observe the producer pointer being set and thus load the data + * before the producer has written the new data. The consumer would in + * this case load the old data. + * + * (C) protects the consumer from speculatively loading the data before + * the producer pointer actually has been read. If we do not have this + * barrier, some architectures could load old data as speculative loads + * are not discarded as the CPU does not know there is a dependency + * between ->producer and data. + * + * (A) is a control dependency that separates the load of ->consumer + * from the stores of $data. In case ->consumer indicates there is no + * room in the buffer to store $data we do not. So no barrier is needed. + * + * (D) protects the load of the data to be observed to happen after the + * store of the consumer pointer. If we did not have this memory + * barrier, the producer could observe the consumer pointer being set + * and overwrite the data with a new value before the consumer got the + * chance to read the old value. The consumer would thus miss reading + * the old entry and very likely read the new entry twice, once right + * now and again after circling through the ring. + */ + /* Common functions operating for both RXTX and umem queues */ static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q) @@ -106,6 +148,7 @@ static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr) static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr) { if (q->cons_tail == q->cons_head) { + smp_mb(); /* D, matches A */ WRITE_ONCE(q->ring->consumer, q->cons_tail); q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE); @@ -128,10 +171,11 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr) if (xskq_nb_free(q, q->prod_tail, 1) == 0) return -ENOSPC; + /* A, matches D */ ring->desc[q->prod_tail++ & q->ring_mask] = addr; /* Order producer and data */ - smp_wmb(); + smp_wmb(); /* B, matches C */ WRITE_ONCE(q->ring->producer, q->prod_tail); return 0; @@ -144,6 +188,7 @@ static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr) if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0) return -ENOSPC; + /* A, matches D */ ring->desc[q->prod_head++ & q->ring_mask] = addr; return 0; } @@ -152,7 +197,7 @@ static inline void xskq_produce_flush_addr_n(struct xsk_queue *q, u32 nb_entries) { /* Order producer and data */ - smp_wmb(); + smp_wmb(); /* B, matches C */ q->prod_tail += nb_entries; WRITE_ONCE(q->ring->producer, q->prod_tail); @@ -163,6 +208,7 @@ static inline int xskq_reserve_addr(struct xsk_queue *q) if (xskq_nb_free(q, q->prod_head, 1) == 0) return -ENOSPC; + /* A, matches D */ q->prod_head++; return 0; } @@ -204,11 +250,12 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q, struct xdp_desc *desc) { if (q->cons_tail == q->cons_head) { + smp_mb(); /* D, matches A */ WRITE_ONCE(q->ring->consumer, q->cons_tail); q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE); /* Order consumer and data */ - smp_rmb(); + smp_rmb(); /* C, matches B */ } return xskq_validate_desc(q, desc); @@ -228,6 +275,7 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q, if (xskq_nb_free(q, q->prod_head, 1) == 0) return -ENOSPC; + /* A, matches D */ idx = (q->prod_head++) & q->ring_mask; ring->desc[idx].addr = addr; ring->desc[idx].len = len; @@ -238,7 +286,7 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q, static inline void xskq_produce_flush_desc(struct xsk_queue *q) { /* Order producer and data */ - smp_wmb(); + smp_wmb(); /* B, matches C */ q->prod_tail = q->prod_head, WRITE_ONCE(q->ring->producer, q->prod_tail); diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index dd2b31ccca6a..6a148d0d51bf 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -99,7 +99,7 @@ gen_btf() pahole_ver=$(${PAHOLE} --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/') if [ "${pahole_ver}" -lt "113" ]; then info "BTF" "${1}: pahole version $(${PAHOLE} --version) is too old, need at least v1.13" - exit 0 + return 0 fi info "BTF" ${1} diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst index 9bb9ace54ba8..89b6b10e2183 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst @@ -29,7 +29,7 @@ CGROUP COMMANDS | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } | *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** | | **bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** | -| **sendmsg4** | **sendmsg6** } +| **sendmsg4** | **sendmsg6** | **sysctl** } | *ATTACH_FLAGS* := { **multi** | **override** } DESCRIPTION @@ -85,7 +85,8 @@ DESCRIPTION **sendmsg4** call to sendto(2), sendmsg(2), sendmmsg(2) for an unconnected udp4 socket (since 4.18); **sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an - unconnected udp6 socket (since 4.18). + unconnected udp6 socket (since 4.18); + **sysctl** sysctl access (since 5.2). **bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG* Detach *PROG* from the cgroup *CGROUP* and attach type @@ -99,7 +100,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool-feature.rst b/tools/bpf/bpftool/Documentation/bpftool-feature.rst index 82de03dd8f52..10177343b7ce 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-feature.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-feature.rst @@ -63,7 +63,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index 5c984ffc9f01..55ecf80ca03e 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -135,7 +135,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst index 779dab3650ee..6b692b428373 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-net.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst @@ -55,7 +55,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool-perf.rst b/tools/bpf/bpftool/Documentation/bpftool-perf.rst index bca5590a80d0..86154740fabb 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-perf.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-perf.rst @@ -43,7 +43,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst index 9386bd6e0396..2f183ffd8351 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst @@ -25,7 +25,7 @@ PROG COMMANDS | **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual** | **linum**}] | **bpftool** **prog dump jited** *PROG* [{**file** *FILE* | **opcodes** | **linum**}] | **bpftool** **prog pin** *PROG* *FILE* -| **bpftool** **prog { load | loadall }** *OBJ* *PATH* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*] +| **bpftool** **prog { load | loadall }** *OBJ* *PATH* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*] [**pinmaps** *MAP_DIR*] | **bpftool** **prog attach** *PROG* *ATTACH_TYPE* [*MAP*] | **bpftool** **prog detach** *PROG* *ATTACH_TYPE* [*MAP*] | **bpftool** **prog tracelog** @@ -39,7 +39,8 @@ PROG COMMANDS | **cgroup/sock** | **cgroup/dev** | **lwt_in** | **lwt_out** | **lwt_xmit** | | **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** | | **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** | -| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** +| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** | +| **cgroup/sysctl** | } | *ATTACH_TYPE* := { | **msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector** @@ -56,6 +57,14 @@ DESCRIPTION Output will start with program ID followed by program type and zero or more named attributes (depending on kernel version). + Since Linux 5.1 the kernel can collect statistics on BPF + programs (such as the total time spent running the program, + and the number of times it was run). If available, bpftool + shows such statistics. However, the kernel does not collect + them by defaults, as it slightly impacts performance on each + program run. Activation or deactivation of the feature is + performed via the **kernel.bpf_stats_enabled** sysctl knob. + **bpftool prog dump xlated** *PROG* [{ **file** *FILE* | **opcodes** | **visual** | **linum** }] Dump eBPF instructions of the program from the kernel. By default, eBPF will be disassembled and printed to standard @@ -144,7 +153,7 @@ OPTIONS -h, --help Print short generic help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/Documentation/bpftool.rst b/tools/bpf/bpftool/Documentation/bpftool.rst index 4f2188845dd8..deb2ca911ddf 100644 --- a/tools/bpf/bpftool/Documentation/bpftool.rst +++ b/tools/bpf/bpftool/Documentation/bpftool.rst @@ -49,7 +49,7 @@ OPTIONS -h, --help Print short help message (similar to **bpftool help**). - -v, --version + -V, --version Print version number (similar to **bpftool version**). -j, --json diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index b803827d01e8..9f3ffe1e26ab 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -370,7 +370,8 @@ _bpftool() lirc_mode2 cgroup/bind4 cgroup/bind6 \ cgroup/connect4 cgroup/connect6 \ cgroup/sendmsg4 cgroup/sendmsg6 \ - cgroup/post_bind4 cgroup/post_bind6" -- \ + cgroup/post_bind4 cgroup/post_bind6 \ + cgroup/sysctl" -- \ "$cur" ) ) return 0 ;; @@ -619,7 +620,7 @@ _bpftool() attach|detach) local ATTACH_TYPES='ingress egress sock_create sock_ops \ device bind4 bind6 post_bind4 post_bind6 connect4 \ - connect6 sendmsg4 sendmsg6' + connect6 sendmsg4 sendmsg6 sysctl' local ATTACH_FLAGS='multi override' local PROG_TYPE='id pinned tag' case $prev in @@ -629,7 +630,7 @@ _bpftool() ;; ingress|egress|sock_create|sock_ops|device|bind4|bind6|\ post_bind4|post_bind6|connect4|connect6|sendmsg4|\ - sendmsg6) + sendmsg6|sysctl) COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ "$cur" ) ) return 0 diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c index 4b5c8da2a7c0..7e22f115c8c1 100644 --- a/tools/bpf/bpftool/cgroup.c +++ b/tools/bpf/bpftool/cgroup.c @@ -25,7 +25,7 @@ " ATTACH_TYPE := { ingress | egress | sock_create |\n" \ " sock_ops | device | bind4 | bind6 |\n" \ " post_bind4 | post_bind6 | connect4 |\n" \ - " connect6 | sendmsg4 | sendmsg6 }" + " connect6 | sendmsg4 | sendmsg6 | sysctl }" static const char * const attach_type_strings[] = { [BPF_CGROUP_INET_INGRESS] = "ingress", @@ -41,6 +41,7 @@ static const char * const attach_type_strings[] = { [BPF_CGROUP_INET6_POST_BIND] = "post_bind6", [BPF_CGROUP_UDP4_SENDMSG] = "sendmsg4", [BPF_CGROUP_UDP6_SENDMSG] = "sendmsg6", + [BPF_CGROUP_SYSCTL] = "sysctl", [__MAX_BPF_ATTACH_TYPE] = NULL, }; @@ -248,6 +249,13 @@ static int do_show_tree_fn(const char *fpath, const struct stat *sb, for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) show_attached_bpf_progs(cgroup_fd, type, ftw->level); + if (errno == EINVAL) + /* Last attach type does not support query. + * Do not report an error for this, especially because batch + * mode would stop processing commands. + */ + errno = 0; + if (json_output) { jsonw_end_array(json_wtr); jsonw_end_object(json_wtr); diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h index d7dd84d3c660..1ccc46169a19 100644 --- a/tools/bpf/bpftool/main.h +++ b/tools/bpf/bpftool/main.h @@ -73,6 +73,7 @@ static const char * const prog_type_name[] = { [BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2", [BPF_PROG_TYPE_SK_REUSEPORT] = "sk_reuseport", [BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector", + [BPF_PROG_TYPE_CGROUP_SYSCTL] = "cgroup_sysctl", }; extern const char * const map_type_name[]; diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index e96903078991..10b6c9d3e525 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -261,20 +261,20 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key, } static void print_entry_error(struct bpf_map_info *info, unsigned char *key, - const char *value) + const char *error_msg) { - int value_size = strlen(value); + int msg_size = strlen(error_msg); bool single_line, break_names; - break_names = info->key_size > 16 || value_size > 16; - single_line = info->key_size + value_size <= 24 && !break_names; + break_names = info->key_size > 16 || msg_size > 16; + single_line = info->key_size + msg_size <= 24 && !break_names; printf("key:%c", break_names ? '\n' : ' '); fprint_hex(stdout, key, info->key_size, " "); printf(single_line ? " " : "\n"); - printf("value:%c%s", break_names ? '\n' : ' ', value); + printf("value:%c%s", break_names ? '\n' : ' ', error_msg); printf("\n"); } @@ -298,11 +298,7 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key, if (info->value_size) { printf("value:%c", break_names ? '\n' : ' '); - if (value) - fprint_hex(stdout, value, info->value_size, - " "); - else - printf("<no entry>"); + fprint_hex(stdout, value, info->value_size, " "); } printf("\n"); @@ -321,11 +317,8 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key, for (i = 0; i < n; i++) { printf("value (CPU %02d):%c", i, info->value_size > 16 ? '\n' : ' '); - if (value) - fprint_hex(stdout, value + i * step, - info->value_size, " "); - else - printf("<no entry>"); + fprint_hex(stdout, value + i * step, + info->value_size, " "); printf("\n"); } } @@ -538,6 +531,9 @@ static int show_map_close_json(int fd, struct bpf_map_info *info) } close(fd); + if (info->btf_id) + jsonw_int_field(json_wtr, "btf_id", info->btf_id); + if (!hash_empty(map_table.table)) { struct pinned_obj *obj; @@ -604,15 +600,19 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info) } close(fd); - printf("\n"); if (!hash_empty(map_table.table)) { struct pinned_obj *obj; hash_for_each_possible(map_table.table, obj, hash, info->id) { if (obj->id == info->id) - printf("\tpinned %s\n", obj->path); + printf("\n\tpinned %s", obj->path); } } + + if (info->btf_id) + printf("\n\tbtf_id %d", info->btf_id); + + printf("\n"); return 0; } @@ -722,11 +722,16 @@ static int dump_map_elem(int fd, void *key, void *value, jsonw_string_field(json_wtr, "error", strerror(lookup_errno)); jsonw_end_object(json_wtr); } else { + const char *msg = NULL; + if (errno == ENOENT) - print_entry_plain(map_info, key, NULL); - else - print_entry_error(map_info, key, - strerror(lookup_errno)); + msg = "<no entry>"; + else if (lookup_errno == ENOSPC && + map_info->type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) + msg = "<cannot read>"; + + print_entry_error(map_info, key, + msg ? : strerror(lookup_errno)); } return 0; @@ -780,6 +785,10 @@ static int do_dump(int argc, char **argv) } } + if (info.type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY && + info.value_size != 8) + p_info("Warning: cannot read values from %s map with value_size != 8", + map_type_name[info.type]); while (true) { err = bpf_map_get_next_key(fd, prev_key, key); if (err) { diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 81067803189e..fc495b27f0fc 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -323,7 +323,7 @@ static void print_prog_plain(struct bpf_prog_info *info, int fd) } if (info->btf_id) - printf("\n\tbtf_id %d\n", info->btf_id); + printf("\n\tbtf_id %d", info->btf_id); printf("\n"); } @@ -1060,7 +1060,7 @@ static int do_help(int argc, char **argv) " tracepoint | raw_tracepoint | xdp | perf_event | cgroup/skb |\n" " cgroup/sock | cgroup/dev | lwt_in | lwt_out | lwt_xmit |\n" " lwt_seg6local | sockops | sk_skb | sk_msg | lirc_mode2 |\n" - " sk_reuseport | flow_dissector |\n" + " sk_reuseport | flow_dissector | cgroup/sysctl |\n" " cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n" " cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n" " cgroup/sendmsg4 | cgroup/sendmsg6 }\n" diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 2e96d0b4bf65..704bb69514a2 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -167,6 +167,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LIRC_MODE2, BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR, + BPF_PROG_TYPE_CGROUP_SYSCTL, }; enum bpf_attach_type { @@ -188,6 +189,7 @@ enum bpf_attach_type { BPF_CGROUP_UDP6_SENDMSG, BPF_LIRC_MODE2, BPF_FLOW_DISSECTOR, + BPF_CGROUP_SYSCTL, __MAX_BPF_ATTACH_TYPE }; @@ -2504,6 +2506,122 @@ union bpf_attr { * Return * 0 if iph and th are a valid SYN cookie ACK, or a negative error * otherwise. + * + * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags) + * Description + * Get name of sysctl in /proc/sys/ and copy it into provided by + * program buffer *buf* of size *buf_len*. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * + * If *flags* is zero, full name (e.g. "net/ipv4/tcp_mem") is + * copied. Use **BPF_F_SYSCTL_BASE_NAME** flag to copy base name + * only (e.g. "tcp_mem"). + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get current value of sysctl as it is presented in /proc/sys + * (incl. newline, etc), and copy it as a string into provided + * by program buffer *buf* of size *buf_len*. + * + * The whole value is copied, no matter what file position user + * space issued e.g. sys_read at. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if current value was unavailable, e.g. because + * sysctl is uninitialized and read returns -EIO for it. + * + * int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get new value being written by user space to sysctl (before + * the actual write happens) and copy it as a string into + * provided by program buffer *buf* of size *buf_len*. + * + * User space may write new value at file position > 0. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len) + * Description + * Override new value being written by user space to sysctl with + * value provided by program in buffer *buf* of size *buf_len*. + * + * *buf* should contain a string in same form as provided by user + * space on sysctl write. + * + * User space may write new value at file position > 0. To override + * the whole sysctl value file position should be set to zero. + * Return + * 0 on success. + * + * **-E2BIG** if the *buf_len* is too big. + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to a long integer according to the given base + * and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)) followed by a single optional '-' + * sign. + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtol(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. + * + * int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to an unsigned long integer according to the + * given base and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)). + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtoul(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2606,7 +2724,13 @@ union bpf_attr { FN(skb_ecn_set_ce), \ FN(get_listener_sock), \ FN(skc_lookup_tcp), \ - FN(tcp_check_syncookie), + FN(tcp_check_syncookie), \ + FN(sysctl_get_name), \ + FN(sysctl_get_current_value), \ + FN(sysctl_get_new_value), \ + FN(sysctl_set_new_value), \ + FN(strtol), \ + FN(strtoul), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2668,17 +2792,20 @@ enum bpf_func_id { /* BPF_FUNC_skb_adjust_room flags. */ #define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0) -#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff -#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56 +#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff +#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56 #define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1) #define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2) #define BPF_F_ADJ_ROOM_ENCAP_L4_GRE (1ULL << 3) #define BPF_F_ADJ_ROOM_ENCAP_L4_UDP (1ULL << 4) -#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \ +#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \ BPF_ADJ_ROOM_ENCAP_L2_MASK) \ << BPF_ADJ_ROOM_ENCAP_L2_SHIFT) +/* BPF_FUNC_sysctl_get_name flags. */ +#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0) + /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, @@ -3308,4 +3435,14 @@ struct bpf_line_info { struct bpf_spin_lock { __u32 val; }; + +struct bpf_sysctl { + __u32 write; /* Sysctl is being read (= 0) or written (= 1). + * Allows 1,2,4-byte read, but no write. + */ + __u32 file_pos; /* Sysctl file position to read from, write to. + * Allows 1,2,4-byte read an 4-byte write. + */ +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index bc30783d1403..ff2506fe6bd2 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -92,7 +92,7 @@ struct bpf_load_program_attr { #define MAPS_RELAX_COMPAT 0x01 /* Recommend log buffer size */ -#define BPF_LOG_BUF_SIZE (16 * 1024 * 1024) /* verifier maximum in kernels <= 5.1 */ +#define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */ LIBBPF_API int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, char *log_buf, size_t log_buf_sz); diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 701e7e28ada3..75eaf10b9e1a 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -1354,8 +1354,16 @@ static struct btf_dedup *btf_dedup_new(struct btf *btf, struct btf_ext *btf_ext, } /* special BTF "void" type is made canonical immediately */ d->map[0] = 0; - for (i = 1; i <= btf->nr_types; i++) - d->map[i] = BTF_UNPROCESSED_ID; + for (i = 1; i <= btf->nr_types; i++) { + struct btf_type *t = d->btf->types[i]; + __u16 kind = BTF_INFO_KIND(t->info); + + /* VAR and DATASEC are never deduped and are self-canonical */ + if (kind == BTF_KIND_VAR || kind == BTF_KIND_DATASEC) + d->map[i] = i; + else + d->map[i] = BTF_UNPROCESSED_ID; + } d->hypot_map = malloc(sizeof(__u32) * (1 + btf->nr_types)); if (!d->hypot_map) { @@ -1946,6 +1954,8 @@ static int btf_dedup_prim_type(struct btf_dedup *d, __u32 type_id) case BTF_KIND_UNION: case BTF_KIND_FUNC: case BTF_KIND_FUNC_PROTO: + case BTF_KIND_VAR: + case BTF_KIND_DATASEC: return 0; case BTF_KIND_INT: @@ -2699,6 +2709,7 @@ static int btf_dedup_remap_type(struct btf_dedup *d, __u32 type_id) case BTF_KIND_PTR: case BTF_KIND_TYPEDEF: case BTF_KIND_FUNC: + case BTF_KIND_VAR: r = btf_dedup_remap_type_id(d, t->type); if (r < 0) return r; @@ -2753,6 +2764,20 @@ static int btf_dedup_remap_type(struct btf_dedup *d, __u32 type_id) break; } + case BTF_KIND_DATASEC: { + struct btf_var_secinfo *var = (struct btf_var_secinfo *)(t + 1); + __u16 vlen = BTF_INFO_VLEN(t->info); + + for (i = 0; i < vlen; i++) { + r = btf_dedup_remap_type_id(d, var->type); + if (r < 0) + return r; + var->type = r; + var++; + } + break; + } + default: return -EINVAL; } diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 67484cf32b2e..d817bf20f3d6 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -817,7 +817,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, struct bpf_map *map, memcpy(*data_buff, data->d_buf, data->d_size); } - pr_debug("map %ld is \"%s\"\n", map - obj->maps, map->name); + pr_debug("map %td is \"%s\"\n", map - obj->maps, map->name); return 0; } @@ -2064,6 +2064,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type) case BPF_PROG_TYPE_TRACEPOINT: case BPF_PROG_TYPE_RAW_TRACEPOINT: case BPF_PROG_TYPE_PERF_EVENT: + case BPF_PROG_TYPE_CGROUP_SYSCTL: return false; case BPF_PROG_TYPE_KPROBE: default: @@ -3004,6 +3005,8 @@ static const struct { BPF_CGROUP_UDP4_SENDMSG), BPF_EAPROG_SEC("cgroup/sendmsg6", BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP6_SENDMSG), + BPF_EAPROG_SEC("cgroup/sysctl", BPF_PROG_TYPE_CGROUP_SYSCTL, + BPF_CGROUP_SYSCTL), }; #undef BPF_PROG_SEC_IMPL diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 8c3a1c04dcb2..0f25541632e3 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -97,6 +97,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, case BPF_PROG_TYPE_LIRC_MODE2: case BPF_PROG_TYPE_SK_REUSEPORT: case BPF_PROG_TYPE_FLOW_DISSECTOR: + case BPF_PROG_TYPE_CGROUP_SYSCTL: default: break; } diff --git a/tools/lib/bpf/libbpf_util.h b/tools/lib/bpf/libbpf_util.h index 81ecda0cb9c9..da94c4cb2e4d 100644 --- a/tools/lib/bpf/libbpf_util.h +++ b/tools/lib/bpf/libbpf_util.h @@ -23,6 +23,36 @@ do { \ #define pr_info(fmt, ...) __pr(LIBBPF_INFO, fmt, ##__VA_ARGS__) #define pr_debug(fmt, ...) __pr(LIBBPF_DEBUG, fmt, ##__VA_ARGS__) +/* Use these barrier functions instead of smp_[rw]mb() when they are + * used in a libbpf header file. That way they can be built into the + * application that uses libbpf. + */ +#if defined(__i386__) || defined(__x86_64__) +# define libbpf_smp_rmb() asm volatile("" : : : "memory") +# define libbpf_smp_wmb() asm volatile("" : : : "memory") +# define libbpf_smp_mb() \ + asm volatile("lock; addl $0,-4(%%rsp)" : : : "memory", "cc") +/* Hinders stores to be observed before older loads. */ +# define libbpf_smp_rwmb() asm volatile("" : : : "memory") +#elif defined(__aarch64__) +# define libbpf_smp_rmb() asm volatile("dmb ishld" : : : "memory") +# define libbpf_smp_wmb() asm volatile("dmb ishst" : : : "memory") +# define libbpf_smp_mb() asm volatile("dmb ish" : : : "memory") +# define libbpf_smp_rwmb() libbpf_smp_mb() +#elif defined(__arm__) +/* These are only valid for armv7 and above */ +# define libbpf_smp_rmb() asm volatile("dmb ish" : : : "memory") +# define libbpf_smp_wmb() asm volatile("dmb ishst" : : : "memory") +# define libbpf_smp_mb() asm volatile("dmb ish" : : : "memory") +# define libbpf_smp_rwmb() libbpf_smp_mb() +#else +/* Architecture missing native barrier functions. */ +# define libbpf_smp_rmb() __sync_synchronize() +# define libbpf_smp_wmb() __sync_synchronize() +# define libbpf_smp_mb() __sync_synchronize() +# define libbpf_smp_rwmb() __sync_synchronize() +#endif + #ifdef __cplusplus } /* extern "C" */ #endif diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h index a497f00e2962..82ea71a0f3ec 100644 --- a/tools/lib/bpf/xsk.h +++ b/tools/lib/bpf/xsk.h @@ -16,6 +16,7 @@ #include <linux/if_xdp.h> #include "libbpf.h" +#include "libbpf_util.h" #ifdef __cplusplus extern "C" { @@ -36,6 +37,10 @@ struct name { \ DEFINE_XSK_RING(xsk_ring_prod); DEFINE_XSK_RING(xsk_ring_cons); +/* For a detailed explanation on the memory barriers associated with the + * ring, please take a look at net/xdp/xsk_queue.h. + */ + struct xsk_umem; struct xsk_socket; @@ -105,7 +110,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb) static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, size_t nb, __u32 *idx) { - if (unlikely(xsk_prod_nb_free(prod, nb) < nb)) + if (xsk_prod_nb_free(prod, nb) < nb) return 0; *idx = prod->cached_prod; @@ -116,10 +121,10 @@ static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod, static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb) { - /* Make sure everything has been written to the ring before signalling - * this to the kernel. + /* Make sure everything has been written to the ring before indicating + * this to the kernel by writing the producer pointer. */ - smp_wmb(); + libbpf_smp_wmb(); *prod->producer += nb; } @@ -129,11 +134,11 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, { size_t entries = xsk_cons_nb_avail(cons, nb); - if (likely(entries > 0)) { + if (entries > 0) { /* Make sure we do not speculatively read the data before * we have received the packet buffers from the ring. */ - smp_rmb(); + libbpf_smp_rmb(); *idx = cons->cached_cons; cons->cached_cons += entries; @@ -144,6 +149,11 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons, static inline void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb) { + /* Make sure data has been read before indicating we are done + * with the entries by updating the consumer pointer. + */ + libbpf_smp_rwmb(); + *cons->consumer += nb; } diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 078283d073b0..f9d83ba7843e 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \ test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \ - test_netcnt test_tcpnotify_user test_sock_fields + test_netcnt test_tcpnotify_user test_sock_fields test_sysctl BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c))) TEST_GEN_FILES = $(BPF_OBJ_FILES) @@ -93,6 +93,7 @@ $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c $(OUTPUT)/test_netcnt: cgroup_helpers.c $(OUTPUT)/test_sock_fields: cgroup_helpers.c +$(OUTPUT)/test_sysctl: cgroup_helpers.c .PHONY: force diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h index e85d62cb53d0..59e221586cf7 100644 --- a/tools/testing/selftests/bpf/bpf_helpers.h +++ b/tools/testing/selftests/bpf/bpf_helpers.h @@ -192,6 +192,25 @@ static int (*bpf_skb_ecn_set_ce)(void *ctx) = static int (*bpf_tcp_check_syncookie)(struct bpf_sock *sk, void *ip, int ip_len, void *tcp, int tcp_len) = (void *) BPF_FUNC_tcp_check_syncookie; +static int (*bpf_sysctl_get_name)(void *ctx, char *buf, + unsigned long long buf_len, + unsigned long long flags) = + (void *) BPF_FUNC_sysctl_get_name; +static int (*bpf_sysctl_get_current_value)(void *ctx, char *buf, + unsigned long long buf_len) = + (void *) BPF_FUNC_sysctl_get_current_value; +static int (*bpf_sysctl_get_new_value)(void *ctx, char *buf, + unsigned long long buf_len) = + (void *) BPF_FUNC_sysctl_get_new_value; +static int (*bpf_sysctl_set_new_value)(void *ctx, const char *buf, + unsigned long long buf_len) = + (void *) BPF_FUNC_sysctl_set_new_value; +static int (*bpf_strtol)(const char *buf, unsigned long long buf_len, + unsigned long long flags, long *res) = + (void *) BPF_FUNC_strtol; +static int (*bpf_strtoul)(const char *buf, unsigned long long buf_len, + unsigned long long flags, unsigned long *res) = + (void *) BPF_FUNC_strtoul; /* llvm builtin functions that eBPF C program may use to * emit BPF_LD_ABS and BPF_LD_IND instructions diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c index fc818bc1d729..126319f9a97c 100644 --- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c +++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c @@ -2,7 +2,7 @@ #include <test_progs.h> #define CHECK_FLOW_KEYS(desc, got, expected) \ - CHECK(memcmp(&got, &expected, sizeof(got)) != 0, \ + CHECK_ATTR(memcmp(&got, &expected, sizeof(got)) != 0, \ desc, \ "nhoff=%u/%u " \ "thoff=%u/%u " \ @@ -10,6 +10,7 @@ "is_frag=%u/%u " \ "is_first_frag=%u/%u " \ "is_encap=%u/%u " \ + "ip_proto=0x%x/0x%x " \ "n_proto=0x%x/0x%x " \ "sport=%u/%u " \ "dport=%u/%u\n", \ @@ -19,53 +20,32 @@ got.is_frag, expected.is_frag, \ got.is_first_frag, expected.is_first_frag, \ got.is_encap, expected.is_encap, \ + got.ip_proto, expected.ip_proto, \ got.n_proto, expected.n_proto, \ got.sport, expected.sport, \ got.dport, expected.dport) -static struct bpf_flow_keys pkt_v4_flow_keys = { - .nhoff = 0, - .thoff = sizeof(struct iphdr), - .addr_proto = ETH_P_IP, - .ip_proto = IPPROTO_TCP, - .n_proto = __bpf_constant_htons(ETH_P_IP), -}; - -static struct bpf_flow_keys pkt_v6_flow_keys = { - .nhoff = 0, - .thoff = sizeof(struct ipv6hdr), - .addr_proto = ETH_P_IPV6, - .ip_proto = IPPROTO_TCP, - .n_proto = __bpf_constant_htons(ETH_P_IPV6), -}; - -#define VLAN_HLEN 4 +struct ipv4_pkt { + struct ethhdr eth; + struct iphdr iph; + struct tcphdr tcp; +} __packed; -static struct { +struct svlan_ipv4_pkt { struct ethhdr eth; __u16 vlan_tci; __u16 vlan_proto; struct iphdr iph; struct tcphdr tcp; -} __packed pkt_vlan_v4 = { - .eth.h_proto = __bpf_constant_htons(ETH_P_8021Q), - .vlan_proto = __bpf_constant_htons(ETH_P_IP), - .iph.ihl = 5, - .iph.protocol = IPPROTO_TCP, - .iph.tot_len = __bpf_constant_htons(MAGIC_BYTES), - .tcp.urg_ptr = 123, - .tcp.doff = 5, -}; +} __packed; -static struct bpf_flow_keys pkt_vlan_v4_flow_keys = { - .nhoff = VLAN_HLEN, - .thoff = VLAN_HLEN + sizeof(struct iphdr), - .addr_proto = ETH_P_IP, - .ip_proto = IPPROTO_TCP, - .n_proto = __bpf_constant_htons(ETH_P_IP), -}; +struct ipv6_pkt { + struct ethhdr eth; + struct ipv6hdr iph; + struct tcphdr tcp; +} __packed; -static struct { +struct dvlan_ipv6_pkt { struct ethhdr eth; __u16 vlan_tci; __u16 vlan_proto; @@ -73,31 +53,97 @@ static struct { __u16 vlan_proto2; struct ipv6hdr iph; struct tcphdr tcp; -} __packed pkt_vlan_v6 = { - .eth.h_proto = __bpf_constant_htons(ETH_P_8021AD), - .vlan_proto = __bpf_constant_htons(ETH_P_8021Q), - .vlan_proto2 = __bpf_constant_htons(ETH_P_IPV6), - .iph.nexthdr = IPPROTO_TCP, - .iph.payload_len = __bpf_constant_htons(MAGIC_BYTES), - .tcp.urg_ptr = 123, - .tcp.doff = 5, +} __packed; + +struct test { + const char *name; + union { + struct ipv4_pkt ipv4; + struct svlan_ipv4_pkt svlan_ipv4; + struct ipv6_pkt ipv6; + struct dvlan_ipv6_pkt dvlan_ipv6; + } pkt; + struct bpf_flow_keys keys; }; -static struct bpf_flow_keys pkt_vlan_v6_flow_keys = { - .nhoff = VLAN_HLEN * 2, - .thoff = VLAN_HLEN * 2 + sizeof(struct ipv6hdr), - .addr_proto = ETH_P_IPV6, - .ip_proto = IPPROTO_TCP, - .n_proto = __bpf_constant_htons(ETH_P_IPV6), +#define VLAN_HLEN 4 + +struct test tests[] = { + { + .name = "ipv4", + .pkt.ipv4 = { + .eth.h_proto = __bpf_constant_htons(ETH_P_IP), + .iph.ihl = 5, + .iph.protocol = IPPROTO_TCP, + .iph.tot_len = __bpf_constant_htons(MAGIC_BYTES), + .tcp.doff = 5, + }, + .keys = { + .nhoff = 0, + .thoff = sizeof(struct iphdr), + .addr_proto = ETH_P_IP, + .ip_proto = IPPROTO_TCP, + .n_proto = __bpf_constant_htons(ETH_P_IP), + }, + }, + { + .name = "ipv6", + .pkt.ipv6 = { + .eth.h_proto = __bpf_constant_htons(ETH_P_IPV6), + .iph.nexthdr = IPPROTO_TCP, + .iph.payload_len = __bpf_constant_htons(MAGIC_BYTES), + .tcp.doff = 5, + }, + .keys = { + .nhoff = 0, + .thoff = sizeof(struct ipv6hdr), + .addr_proto = ETH_P_IPV6, + .ip_proto = IPPROTO_TCP, + .n_proto = __bpf_constant_htons(ETH_P_IPV6), + }, + }, + { + .name = "802.1q-ipv4", + .pkt.svlan_ipv4 = { + .eth.h_proto = __bpf_constant_htons(ETH_P_8021Q), + .vlan_proto = __bpf_constant_htons(ETH_P_IP), + .iph.ihl = 5, + .iph.protocol = IPPROTO_TCP, + .iph.tot_len = __bpf_constant_htons(MAGIC_BYTES), + .tcp.doff = 5, + }, + .keys = { + .nhoff = VLAN_HLEN, + .thoff = VLAN_HLEN + sizeof(struct iphdr), + .addr_proto = ETH_P_IP, + .ip_proto = IPPROTO_TCP, + .n_proto = __bpf_constant_htons(ETH_P_IP), + }, + }, + { + .name = "802.1ad-ipv6", + .pkt.dvlan_ipv6 = { + .eth.h_proto = __bpf_constant_htons(ETH_P_8021AD), + .vlan_proto = __bpf_constant_htons(ETH_P_8021Q), + .vlan_proto2 = __bpf_constant_htons(ETH_P_IPV6), + .iph.nexthdr = IPPROTO_TCP, + .iph.payload_len = __bpf_constant_htons(MAGIC_BYTES), + .tcp.doff = 5, + }, + .keys = { + .nhoff = VLAN_HLEN * 2, + .thoff = VLAN_HLEN * 2 + sizeof(struct ipv6hdr), + .addr_proto = ETH_P_IPV6, + .ip_proto = IPPROTO_TCP, + .n_proto = __bpf_constant_htons(ETH_P_IPV6), + }, + }, }; void test_flow_dissector(void) { - struct bpf_flow_keys flow_keys; struct bpf_object *obj; - __u32 duration, retval; - int err, prog_fd; - __u32 size; + int i, err, prog_fd; err = bpf_flow_load(&obj, "./bpf_flow.o", "flow_dissector", "jmp_table", &prog_fd); @@ -106,35 +152,24 @@ void test_flow_dissector(void) return; } - err = bpf_prog_test_run(prog_fd, 10, &pkt_v4, sizeof(pkt_v4), - &flow_keys, &size, &retval, &duration); - CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv4", - "err %d errno %d retval %d duration %d size %u/%lu\n", - err, errno, retval, duration, size, sizeof(flow_keys)); - CHECK_FLOW_KEYS("ipv4_flow_keys", flow_keys, pkt_v4_flow_keys); - - err = bpf_prog_test_run(prog_fd, 10, &pkt_v6, sizeof(pkt_v6), - &flow_keys, &size, &retval, &duration); - CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv6", - "err %d errno %d retval %d duration %d size %u/%lu\n", - err, errno, retval, duration, size, sizeof(flow_keys)); - CHECK_FLOW_KEYS("ipv6_flow_keys", flow_keys, pkt_v6_flow_keys); + for (i = 0; i < ARRAY_SIZE(tests); i++) { + struct bpf_flow_keys flow_keys; + struct bpf_prog_test_run_attr tattr = { + .prog_fd = prog_fd, + .data_in = &tests[i].pkt, + .data_size_in = sizeof(tests[i].pkt), + .data_out = &flow_keys, + }; - err = bpf_prog_test_run(prog_fd, 10, &pkt_vlan_v4, sizeof(pkt_vlan_v4), - &flow_keys, &size, &retval, &duration); - CHECK(size != sizeof(flow_keys) || err || retval != 1, "vlan_ipv4", - "err %d errno %d retval %d duration %d size %u/%lu\n", - err, errno, retval, duration, size, sizeof(flow_keys)); - CHECK_FLOW_KEYS("vlan_ipv4_flow_keys", flow_keys, - pkt_vlan_v4_flow_keys); - - err = bpf_prog_test_run(prog_fd, 10, &pkt_vlan_v6, sizeof(pkt_vlan_v6), - &flow_keys, &size, &retval, &duration); - CHECK(size != sizeof(flow_keys) || err || retval != 1, "vlan_ipv6", - "err %d errno %d retval %d duration %d size %u/%lu\n", - err, errno, retval, duration, size, sizeof(flow_keys)); - CHECK_FLOW_KEYS("vlan_ipv6_flow_keys", flow_keys, - pkt_vlan_v6_flow_keys); + err = bpf_prog_test_run_xattr(&tattr); + CHECK_ATTR(tattr.data_size_out != sizeof(flow_keys) || + err || tattr.retval != 1, + tests[i].name, + "err %d errno %d retval %d duration %d size %u/%lu\n", + err, errno, tattr.retval, tattr.duration, + tattr.data_size_out, sizeof(flow_keys)); + CHECK_FLOW_KEYS(tests[i].name, flow_keys, tests[i].keys); + } bpf_object__close(obj); } diff --git a/tools/testing/selftests/bpf/progs/test_sysctl_prog.c b/tools/testing/selftests/bpf/progs/test_sysctl_prog.c new file mode 100644 index 000000000000..a295cad805d7 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_sysctl_prog.c @@ -0,0 +1,70 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2019 Facebook + +#include <stdint.h> +#include <string.h> + +#include <linux/stddef.h> +#include <linux/bpf.h> + +#include "bpf_helpers.h" +#include "bpf_util.h" + +/* Max supported length of a string with unsigned long in base 10 (pow2 - 1). */ +#define MAX_ULONG_STR_LEN 0xF + +/* Max supported length of sysctl value string (pow2). */ +#define MAX_VALUE_STR_LEN 0x40 + +static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx) +{ + char tcp_mem_name[] = "net/ipv4/tcp_mem"; + unsigned char i; + char name[64]; + int ret; + + memset(name, 0, sizeof(name)); + ret = bpf_sysctl_get_name(ctx, name, sizeof(name), 0); + if (ret < 0 || ret != sizeof(tcp_mem_name) - 1) + return 0; + +#pragma clang loop unroll(full) + for (i = 0; i < sizeof(tcp_mem_name); ++i) + if (name[i] != tcp_mem_name[i]) + return 0; + + return 1; +} + +SEC("cgroup/sysctl") +int sysctl_tcp_mem(struct bpf_sysctl *ctx) +{ + unsigned long tcp_mem[3] = {0, 0, 0}; + char value[MAX_VALUE_STR_LEN]; + unsigned char i, off = 0; + int ret; + + if (ctx->write) + return 0; + + if (!is_tcp_mem(ctx)) + return 0; + + ret = bpf_sysctl_get_current_value(ctx, value, MAX_VALUE_STR_LEN); + if (ret < 0 || ret >= MAX_VALUE_STR_LEN) + return 0; + +#pragma clang loop unroll(full) + for (i = 0; i < ARRAY_SIZE(tcp_mem); ++i) { + ret = bpf_strtoul(value + off, MAX_ULONG_STR_LEN, 0, + tcp_mem + i); + if (ret <= 0 || ret > MAX_ULONG_STR_LEN) + return 0; + off += ret & MAX_ULONG_STR_LEN; + } + + + return tcp_mem[0] < tcp_mem[1] && tcp_mem[1] < tcp_mem[2]; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c index bcb00d737e95..ab56a6a72b7a 100644 --- a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c +++ b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c @@ -157,7 +157,7 @@ static __always_inline int encap_ipv4(struct __sk_buff *skb, __u8 encap_proto, bpf_ntohs(h_outer.ip.tot_len)); h_outer.ip.protocol = encap_proto; - set_ipv4_csum(&h_outer.ip); + set_ipv4_csum((void *)&h_outer.ip); /* store new outer network header */ if (bpf_skb_store_bytes(skb, ETH_HLEN, &h_outer, olen, diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c index 74f73b33a7b0..c7c3240e0dd4 100644 --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include <stddef.h> #include <string.h> +#include <netinet/in.h> #include <linux/bpf.h> #include <linux/if_ether.h> #include <linux/if_packet.h> @@ -9,7 +10,6 @@ #include <linux/types.h> #include <linux/socket.h> #include <linux/tcp.h> -#include <netinet/in.h> #include "bpf_helpers.h" #include "bpf_endian.h" #include "test_tcpbpf.h" diff --git a/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c index edbca203ce2d..ec6db6e64c41 100644 --- a/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c +++ b/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include <stddef.h> #include <string.h> +#include <netinet/in.h> #include <linux/bpf.h> #include <linux/if_ether.h> #include <linux/if_packet.h> @@ -9,7 +10,6 @@ #include <linux/types.h> #include <linux/socket.h> #include <linux/tcp.h> -#include <netinet/in.h> #include "bpf_helpers.h" #include "bpf_endian.h" #include "test_tcpnotify.h" diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c index 44cd3378d216..f8eb7987b794 100644 --- a/tools/testing/selftests/bpf/test_btf.c +++ b/tools/testing/selftests/bpf/test_btf.c @@ -6642,6 +6642,51 @@ const struct btf_dedup_test dedup_tests[] = { .dont_resolve_fwds = false, }, }, +{ + .descr = "dedup: datasec and vars pass-through", + .input = { + .raw_types = { + /* int */ + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */ + /* static int t */ + BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [2] */ + /* .bss section */ /* [3] */ + BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4), + BTF_VAR_SECINFO_ENC(2, 0, 4), + /* int, referenced from [5] */ + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [4] */ + /* another static int t */ + BTF_VAR_ENC(NAME_NTH(2), 4, 0), /* [5] */ + /* another .bss section */ /* [6] */ + BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4), + BTF_VAR_SECINFO_ENC(5, 0, 4), + BTF_END_RAW, + }, + BTF_STR_SEC("\0.bss\0t"), + }, + .expect = { + .raw_types = { + /* int */ + BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */ + /* static int t */ + BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [2] */ + /* .bss section */ /* [3] */ + BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4), + BTF_VAR_SECINFO_ENC(2, 0, 4), + /* another static int t */ + BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [4] */ + /* another .bss section */ /* [5] */ + BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4), + BTF_VAR_SECINFO_ENC(4, 0, 4), + BTF_END_RAW, + }, + BTF_STR_SEC("\0.bss\0t"), + }, + .opts = { + .dont_resolve_fwds = false, + .dedup_table_size = 1 + }, +}, }; @@ -6671,6 +6716,10 @@ static int btf_type_size(const struct btf_type *t) return base_size + vlen * sizeof(struct btf_member); case BTF_KIND_FUNC_PROTO: return base_size + vlen * sizeof(struct btf_param); + case BTF_KIND_VAR: + return base_size + sizeof(struct btf_var); + case BTF_KIND_DATASEC: + return base_size + vlen * sizeof(struct btf_var_secinfo); default: fprintf(stderr, "Unsupported BTF_KIND:%u\n", kind); return -EINVAL; diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh index d4d3391cc13a..acf7a74f97cd 100755 --- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh +++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh @@ -129,6 +129,24 @@ setup() ip link set veth7 netns ${NS2} ip link set veth8 netns ${NS3} + if [ ! -z "${VRF}" ] ; then + ip -netns ${NS1} link add red type vrf table 1001 + ip -netns ${NS1} link set red up + ip -netns ${NS1} route add table 1001 unreachable default metric 8192 + ip -netns ${NS1} -6 route add table 1001 unreachable default metric 8192 + ip -netns ${NS1} link set veth1 vrf red + ip -netns ${NS1} link set veth5 vrf red + + ip -netns ${NS2} link add red type vrf table 1001 + ip -netns ${NS2} link set red up + ip -netns ${NS2} route add table 1001 unreachable default metric 8192 + ip -netns ${NS2} -6 route add table 1001 unreachable default metric 8192 + ip -netns ${NS2} link set veth2 vrf red + ip -netns ${NS2} link set veth3 vrf red + ip -netns ${NS2} link set veth6 vrf red + ip -netns ${NS2} link set veth7 vrf red + fi + # configure addesses: the top route (1-2-3-4) ip -netns ${NS1} addr add ${IPv4_1}/24 dev veth1 ip -netns ${NS2} addr add ${IPv4_2}/24 dev veth2 @@ -163,29 +181,29 @@ setup() # NS1 # top route - ip -netns ${NS1} route add ${IPv4_2}/32 dev veth1 - ip -netns ${NS1} route add default dev veth1 via ${IPv4_2} # go top by default - ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1 - ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2} # go top by default + ip -netns ${NS1} route add ${IPv4_2}/32 dev veth1 ${VRF} + ip -netns ${NS1} route add default dev veth1 via ${IPv4_2} ${VRF} # go top by default + ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1 ${VRF} + ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2} ${VRF} # go top by default # bottom route - ip -netns ${NS1} route add ${IPv4_6}/32 dev veth5 - ip -netns ${NS1} route add ${IPv4_7}/32 dev veth5 via ${IPv4_6} - ip -netns ${NS1} route add ${IPv4_8}/32 dev veth5 via ${IPv4_6} - ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5 - ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6} - ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6} + ip -netns ${NS1} route add ${IPv4_6}/32 dev veth5 ${VRF} + ip -netns ${NS1} route add ${IPv4_7}/32 dev veth5 via ${IPv4_6} ${VRF} + ip -netns ${NS1} route add ${IPv4_8}/32 dev veth5 via ${IPv4_6} ${VRF} + ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5 ${VRF} + ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6} ${VRF} + ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6} ${VRF} # NS2 # top route - ip -netns ${NS2} route add ${IPv4_1}/32 dev veth2 - ip -netns ${NS2} route add ${IPv4_4}/32 dev veth3 - ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2 - ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3 + ip -netns ${NS2} route add ${IPv4_1}/32 dev veth2 ${VRF} + ip -netns ${NS2} route add ${IPv4_4}/32 dev veth3 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3 ${VRF} # bottom route - ip -netns ${NS2} route add ${IPv4_5}/32 dev veth6 - ip -netns ${NS2} route add ${IPv4_8}/32 dev veth7 - ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6 - ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7 + ip -netns ${NS2} route add ${IPv4_5}/32 dev veth6 ${VRF} + ip -netns ${NS2} route add ${IPv4_8}/32 dev veth7 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7 ${VRF} # NS3 # top route @@ -207,16 +225,16 @@ setup() ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255 ip -netns ${NS3} link set gre_dev up ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev - ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6} - ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8} + ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6} ${VRF} + ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8} ${VRF} # configure IPv6 GRE device in NS3, and a route to it via the "bottom" route ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255 ip -netns ${NS3} link set gre6_dev up ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev - ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6} - ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8} + ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6} ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8} ${VRF} # rp_filter gets confused by what these tests are doing, so disable it ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0 @@ -244,18 +262,18 @@ trap cleanup EXIT remove_routes_to_gredev() { - ip -netns ${NS1} route del ${IPv4_GRE} dev veth5 - ip -netns ${NS2} route del ${IPv4_GRE} dev veth7 - ip -netns ${NS1} -6 route del ${IPv6_GRE}/128 dev veth5 - ip -netns ${NS2} -6 route del ${IPv6_GRE}/128 dev veth7 + ip -netns ${NS1} route del ${IPv4_GRE} dev veth5 ${VRF} + ip -netns ${NS2} route del ${IPv4_GRE} dev veth7 ${VRF} + ip -netns ${NS1} -6 route del ${IPv6_GRE}/128 dev veth5 ${VRF} + ip -netns ${NS2} -6 route del ${IPv6_GRE}/128 dev veth7 ${VRF} } add_unreachable_routes_to_gredev() { - ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32 - ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32 - ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128 - ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128 + ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32 ${VRF} + ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32 ${VRF} + ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128 ${VRF} + ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128 ${VRF} } test_ping() @@ -265,10 +283,10 @@ test_ping() local RET=0 if [ "${PROTO}" == "IPv4" ] ; then - ip netns exec ${NS1} ping -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null + ip netns exec ${NS1} ping -c 1 -W 1 -I veth1 ${IPv4_DST} 2>&1 > /dev/null RET=$? elif [ "${PROTO}" == "IPv6" ] ; then - ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null + ip netns exec ${NS1} ping6 -c 1 -W 6 -I veth1 ${IPv6_DST} 2>&1 > /dev/null RET=$? else echo " test_ping: unknown PROTO: ${PROTO}" @@ -328,7 +346,7 @@ test_gso() test_egress() { local readonly ENCAP=$1 - echo "starting egress ${ENCAP} encap test" + echo "starting egress ${ENCAP} encap test ${VRF}" setup # by default, pings work @@ -336,26 +354,35 @@ test_egress() test_ping IPv6 0 # remove NS2->DST routes, ping fails - ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 - ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 + ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 ${VRF} + ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 ${VRF} test_ping IPv4 1 test_ping IPv6 1 # install replacement routes (LWT/eBPF), pings succeed if [ "${ENCAP}" == "IPv4" ] ; then - ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1 - ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1 + ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj \ + test_lwt_ip_encap.o sec encap_gre dev veth1 ${VRF} + ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj \ + test_lwt_ip_encap.o sec encap_gre dev veth1 ${VRF} elif [ "${ENCAP}" == "IPv6" ] ; then - ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1 - ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1 + ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj \ + test_lwt_ip_encap.o sec encap_gre6 dev veth1 ${VRF} + ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj \ + test_lwt_ip_encap.o sec encap_gre6 dev veth1 ${VRF} else echo " unknown encap ${ENCAP}" TEST_STATUS=1 fi test_ping IPv4 0 test_ping IPv6 0 - test_gso IPv4 - test_gso IPv6 + + # skip GSO tests with VRF: VRF routing needs properly assigned + # source IP/device, which is easy to do with ping and hard with dd/nc. + if [ -z "${VRF}" ] ; then + test_gso IPv4 + test_gso IPv6 + fi # a negative test: remove routes to GRE devices: ping fails remove_routes_to_gredev @@ -374,7 +401,7 @@ test_egress() test_ingress() { local readonly ENCAP=$1 - echo "starting ingress ${ENCAP} encap test" + echo "starting ingress ${ENCAP} encap test ${VRF}" setup # need to wait a bit for IPv6 to autoconf, otherwise @@ -385,18 +412,22 @@ test_ingress() test_ping IPv6 0 # remove NS2->DST routes, pings fail - ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 - ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 + ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 ${VRF} + ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 ${VRF} test_ping IPv4 1 test_ping IPv6 1 # install replacement routes (LWT/eBPF), pings succeed if [ "${ENCAP}" == "IPv4" ] ; then - ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2 - ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2 + ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj \ + test_lwt_ip_encap.o sec encap_gre dev veth2 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj \ + test_lwt_ip_encap.o sec encap_gre dev veth2 ${VRF} elif [ "${ENCAP}" == "IPv6" ] ; then - ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2 - ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2 + ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj \ + test_lwt_ip_encap.o sec encap_gre6 dev veth2 ${VRF} + ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj \ + test_lwt_ip_encap.o sec encap_gre6 dev veth2 ${VRF} else echo "FAIL: unknown encap ${ENCAP}" TEST_STATUS=1 @@ -418,6 +449,13 @@ test_ingress() process_test_results } +VRF="" +test_egress IPv4 +test_egress IPv6 +test_ingress IPv4 +test_ingress IPv6 + +VRF="vrf red" test_egress IPv4 test_egress IPv6 test_ingress IPv4 diff --git a/tools/testing/selftests/bpf/test_section_names.c b/tools/testing/selftests/bpf/test_section_names.c index 7c4f41572b1c..bebd4fbca1f4 100644 --- a/tools/testing/selftests/bpf/test_section_names.c +++ b/tools/testing/selftests/bpf/test_section_names.c @@ -119,6 +119,11 @@ static struct sec_name_test tests[] = { {0, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP6_SENDMSG}, {0, BPF_CGROUP_UDP6_SENDMSG}, }, + { + "cgroup/sysctl", + {0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL}, + {0, BPF_CGROUP_SYSCTL}, + }, }; static int test_prog_type_by_name(const struct sec_name_test *test) diff --git a/tools/testing/selftests/bpf/test_sysctl.c b/tools/testing/selftests/bpf/test_sysctl.c new file mode 100644 index 000000000000..a3bebd7c68dd --- /dev/null +++ b/tools/testing/selftests/bpf/test_sysctl.c @@ -0,0 +1,1567 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2019 Facebook + +#include <fcntl.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +#include <linux/filter.h> + +#include <bpf/bpf.h> +#include <bpf/libbpf.h> + +#include "bpf_rlimit.h" +#include "bpf_util.h" +#include "cgroup_helpers.h" + +#define CG_PATH "/foo" +#define MAX_INSNS 512 +#define FIXUP_SYSCTL_VALUE 0 + +char bpf_log_buf[BPF_LOG_BUF_SIZE]; + +struct sysctl_test { + const char *descr; + size_t fixup_value_insn; + struct bpf_insn insns[MAX_INSNS]; + const char *prog_file; + enum bpf_attach_type attach_type; + const char *sysctl; + int open_flags; + const char *newval; + const char *oldval; + enum { + LOAD_REJECT, + ATTACH_REJECT, + OP_EPERM, + SUCCESS, + } result; +}; + +static struct sysctl_test tests[] = { + { + .descr = "sysctl wrong attach_type", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = 0, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = ATTACH_REJECT, + }, + { + .descr = "sysctl:read allow all", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl:read deny all", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = OP_EPERM, + }, + { + .descr = "ctx:write sysctl:read read ok", + .insns = { + /* If (write) */ + BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1, + offsetof(struct bpf_sysctl, write)), + BPF_JMP_IMM(BPF_JNE, BPF_REG_7, 1, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "ctx:write sysctl:write read ok", + .insns = { + /* If (write) */ + BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_1, + offsetof(struct bpf_sysctl, write)), + BPF_JMP_IMM(BPF_JNE, BPF_REG_7, 1, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/domainname", + .open_flags = O_WRONLY, + .newval = "(none)", /* same as default, should fail anyway */ + .result = OP_EPERM, + }, + { + .descr = "ctx:write sysctl:read write reject", + .insns = { + /* write = X */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, + offsetof(struct bpf_sysctl, write)), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = LOAD_REJECT, + }, + { + .descr = "ctx:file_pos sysctl:read read ok", + .insns = { + /* If (file_pos == X) */ + BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1, + offsetof(struct bpf_sysctl, file_pos)), + BPF_JMP_IMM(BPF_JNE, BPF_REG_7, 0, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "ctx:file_pos sysctl:read read ok narrow", + .insns = { + /* If (file_pos == X) */ + BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_1, + offsetof(struct bpf_sysctl, file_pos)), + BPF_JMP_IMM(BPF_JNE, BPF_REG_7, 0, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "ctx:file_pos sysctl:read write ok", + .insns = { + /* file_pos = X */ + BPF_MOV64_IMM(BPF_REG_0, 2), + BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, + offsetof(struct bpf_sysctl, file_pos)), + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .oldval = "nux\n", + .result = SUCCESS, + }, + { + .descr = "sysctl_get_name sysctl_value:base ok", + .insns = { + /* sysctl_get_name arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_name arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* sysctl_get_name arg4 (flags) */ + BPF_MOV64_IMM(BPF_REG_4, BPF_F_SYSCTL_BASE_NAME), + + /* sysctl_get_name(ctx, buf, buf_len, flags) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_name), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, sizeof("tcp_mem") - 1, 6), + /* buf == "tcp_mem\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x006d656d5f706374ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_name sysctl_value:base E2BIG truncated", + .insns = { + /* sysctl_get_name arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_name arg3 (buf_len) too small */ + BPF_MOV64_IMM(BPF_REG_3, 7), + + /* sysctl_get_name arg4 (flags) */ + BPF_MOV64_IMM(BPF_REG_4, BPF_F_SYSCTL_BASE_NAME), + + /* sysctl_get_name(ctx, buf, buf_len, flags) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_name), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -E2BIG, 6), + + /* buf[0:7] == "tcp_me\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x00656d5f706374ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_name sysctl:full ok", + .insns = { + /* sysctl_get_name arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -24), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 16), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_name arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 17), + + /* sysctl_get_name arg4 (flags) */ + BPF_MOV64_IMM(BPF_REG_4, 0), + + /* sysctl_get_name(ctx, buf, buf_len, flags) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_name), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 16, 14), + + /* buf[0:8] == "net/ipv4" && */ + BPF_LD_IMM64(BPF_REG_8, 0x347670692f74656eULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 10), + + /* buf[8:16] == "/tcp_mem" && */ + BPF_LD_IMM64(BPF_REG_8, 0x6d656d5f7063742fULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 8), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 6), + + /* buf[16:24] == "\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x0ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 16), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_name sysctl:full E2BIG truncated", + .insns = { + /* sysctl_get_name arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -16), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 8), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_name arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 16), + + /* sysctl_get_name arg4 (flags) */ + BPF_MOV64_IMM(BPF_REG_4, 0), + + /* sysctl_get_name(ctx, buf, buf_len, flags) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_name), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -E2BIG, 10), + + /* buf[0:8] == "net/ipv4" && */ + BPF_LD_IMM64(BPF_REG_8, 0x347670692f74656eULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 6), + + /* buf[8:16] == "/tcp_me\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x00656d5f7063742fULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 8), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_name sysctl:full E2BIG truncated small", + .insns = { + /* sysctl_get_name arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_name arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 7), + + /* sysctl_get_name arg4 (flags) */ + BPF_MOV64_IMM(BPF_REG_4, 0), + + /* sysctl_get_name(ctx, buf, buf_len, flags) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_name), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -E2BIG, 6), + + /* buf[0:8] == "net/ip\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x000070692f74656eULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_current_value sysctl:read ok, gt", + .insns = { + /* sysctl_get_current_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_current_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* sysctl_get_current_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_current_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 6, 6), + + /* buf[0:6] == "Linux\n\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x000a78756e694cULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_current_value sysctl:read ok, eq", + .insns = { + /* sysctl_get_current_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_B, BPF_REG_7, BPF_REG_0, 7), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_current_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 7), + + /* sysctl_get_current_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_current_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 6, 6), + + /* buf[0:6] == "Linux\n\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x000a78756e694cULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_current_value sysctl:read E2BIG truncated", + .insns = { + /* sysctl_get_current_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_H, BPF_REG_7, BPF_REG_0, 6), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_current_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 6), + + /* sysctl_get_current_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_current_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -E2BIG, 6), + + /* buf[0:6] == "Linux\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x000078756e694cULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "kernel/ostype", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_current_value sysctl:read EINVAL", + .insns = { + /* sysctl_get_current_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_current_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* sysctl_get_current_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_current_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 4), + + /* buf[0:8] is NUL-filled) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 0, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv6/conf/lo/stable_secret", /* -EIO */ + .open_flags = O_RDONLY, + .result = OP_EPERM, + }, + { + .descr = "sysctl_get_current_value sysctl:write ok", + .fixup_value_insn = 6, + .insns = { + /* sysctl_get_current_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_current_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* sysctl_get_current_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_current_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 4, 6), + + /* buf[0:4] == expected) */ + BPF_LD_IMM64(BPF_REG_8, FIXUP_SYSCTL_VALUE), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_WRONLY, + .newval = "600", /* same as default, should fail anyway */ + .result = OP_EPERM, + }, + { + .descr = "sysctl_get_new_value sysctl:read EINVAL", + .insns = { + /* sysctl_get_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* sysctl_get_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_new_value), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_get_new_value sysctl:write ok", + .insns = { + /* sysctl_get_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 4), + + /* sysctl_get_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_new_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 3, 4), + + /* buf[0:4] == "606\0") */ + BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 0x00363036, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_WRONLY, + .newval = "606", + .result = OP_EPERM, + }, + { + .descr = "sysctl_get_new_value sysctl:write ok long", + .insns = { + /* sysctl_get_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -24), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 24), + + /* sysctl_get_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_new_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 23, 14), + + /* buf[0:8] == "3000000 " && */ + BPF_LD_IMM64(BPF_REG_8, 0x2030303030303033ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 10), + + /* buf[8:16] == "4000000 " && */ + BPF_LD_IMM64(BPF_REG_8, 0x2030303030303034ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 8), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 6), + + /* buf[16:24] == "6000000\0") */ + BPF_LD_IMM64(BPF_REG_8, 0x0030303030303036ULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 16), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_WRONLY, + .newval = "3000000 4000000 6000000", + .result = OP_EPERM, + }, + { + .descr = "sysctl_get_new_value sysctl:write E2BIG", + .insns = { + /* sysctl_get_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_B, BPF_REG_7, BPF_REG_0, 3), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_get_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 3), + + /* sysctl_get_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_get_new_value), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -E2BIG, 4), + + /* buf[0:3] == "60\0") */ + BPF_LDX_MEM(BPF_W, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 0x003036, 2), + + /* return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_A(1), + + /* else return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_WRONLY, + .newval = "606", + .result = OP_EPERM, + }, + { + .descr = "sysctl_set_new_value sysctl:read EINVAL", + .insns = { + /* sysctl_set_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_set_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 3), + + /* sysctl_set_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_set_new_value), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + .descr = "sysctl_set_new_value sysctl:write ok", + .fixup_value_insn = 2, + .insns = { + /* sysctl_set_new_value arg2 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, FIXUP_SYSCTL_VALUE), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + + /* sysctl_set_new_value arg3 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_3, 3), + + /* sysctl_set_new_value(ctx, buf, buf_len) */ + BPF_EMIT_CALL(BPF_FUNC_sysctl_set_new_value), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_WRONLY, + .newval = "606", + .result = SUCCESS, + }, + { + "bpf_strtoul one number string", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 3, 4), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 600, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtoul multi number string", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + /* "600 602\0" */ + BPF_LD_IMM64(BPF_REG_0, 0x0032303620303036ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 8), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 3, 18), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 600, 16), + + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 8), + BPF_ALU64_REG(BPF_SUB, BPF_REG_2, BPF_REG_0), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -16), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 4, 4), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 602, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtoul buf_len = 0, reject", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 0), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = LOAD_REJECT, + }, + { + "bpf_strtoul supported base, ok", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00373730), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 8), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 3, 4), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 63, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtoul unsupported base, EINVAL", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 3), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtoul buf with spaces only, EINVAL", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x090a0c0d), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtoul negative number, EINVAL", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00362d0a), /* " -6\0" */ + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -EINVAL, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtol negative number, ok", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00362d0a), /* " -6\0" */ + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 10), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtol), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 3, 4), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, -6, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtol hex number, ok", + .insns = { + /* arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x65667830), /* "0xfe" */ + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtol), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 4, 4), + /* res == expected) */ + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_IMM(BPF_JNE, BPF_REG_9, 254, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtol max long", + .insns = { + /* arg1 (buf) 9223372036854775807 */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -24), + BPF_LD_IMM64(BPF_REG_0, 0x3032373333323239ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_LD_IMM64(BPF_REG_0, 0x3537373435383633ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 8), + BPF_LD_IMM64(BPF_REG_0, 0x0000000000373038ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 16), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 19), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtol), + + /* if (ret == expected && */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 19, 6), + /* res == expected) */ + BPF_LD_IMM64(BPF_REG_8, 0x7fffffffffffffffULL), + BPF_LDX_MEM(BPF_DW, BPF_REG_9, BPF_REG_7, 0), + BPF_JMP_REG(BPF_JNE, BPF_REG_8, BPF_REG_9, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "bpf_strtol overflow, ERANGE", + .insns = { + /* arg1 (buf) 9223372036854775808 */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -24), + BPF_LD_IMM64(BPF_REG_0, 0x3032373333323239ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_LD_IMM64(BPF_REG_0, 0x3537373435383633ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 8), + BPF_LD_IMM64(BPF_REG_0, 0x0000000000383038ULL), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 16), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 19), + + /* arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + BPF_EMIT_CALL(BPF_FUNC_strtol), + + /* if (ret == expected) */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, -ERANGE, 2), + + /* return ALLOW; */ + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_A(1), + + /* else return DENY; */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, + { + "C prog: deny all writes", + .prog_file = "./test_sysctl_prog.o", + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_WRONLY, + .newval = "123 456 789", + .result = OP_EPERM, + }, + { + "C prog: deny access by name", + .prog_file = "./test_sysctl_prog.o", + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/route/mtu_expires", + .open_flags = O_RDONLY, + .result = OP_EPERM, + }, + { + "C prog: read tcp_mem", + .prog_file = "./test_sysctl_prog.o", + .attach_type = BPF_CGROUP_SYSCTL, + .sysctl = "net/ipv4/tcp_mem", + .open_flags = O_RDONLY, + .result = SUCCESS, + }, +}; + +static size_t probe_prog_length(const struct bpf_insn *fp) +{ + size_t len; + + for (len = MAX_INSNS - 1; len > 0; --len) + if (fp[len].code != 0 || fp[len].imm != 0) + break; + return len + 1; +} + +static int fixup_sysctl_value(const char *buf, size_t buf_len, + struct bpf_insn *prog, size_t insn_num) +{ + uint32_t value_num = 0; + uint8_t c, i; + + if (buf_len > sizeof(value_num)) { + log_err("Value is too big (%zd) to use in fixup", buf_len); + return -1; + } + + for (i = 0; i < buf_len; ++i) { + c = buf[i]; + value_num |= (c << i * 8); + } + + prog[insn_num].imm = value_num; + + return 0; +} + +static int load_sysctl_prog_insns(struct sysctl_test *test, + const char *sysctl_path) +{ + struct bpf_insn *prog = test->insns; + struct bpf_load_program_attr attr; + int ret; + + memset(&attr, 0, sizeof(struct bpf_load_program_attr)); + attr.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL; + attr.insns = prog; + attr.insns_cnt = probe_prog_length(attr.insns); + attr.license = "GPL"; + + if (test->fixup_value_insn) { + char buf[128]; + ssize_t len; + int fd; + + fd = open(sysctl_path, O_RDONLY | O_CLOEXEC); + if (fd < 0) { + log_err("open(%s) failed", sysctl_path); + return -1; + } + len = read(fd, buf, sizeof(buf)); + if (len == -1) { + log_err("read(%s) failed", sysctl_path); + close(fd); + return -1; + } + close(fd); + if (fixup_sysctl_value(buf, len, prog, test->fixup_value_insn)) + return -1; + } + + ret = bpf_load_program_xattr(&attr, bpf_log_buf, BPF_LOG_BUF_SIZE); + if (ret < 0 && test->result != LOAD_REJECT) { + log_err(">>> Loading program error.\n" + ">>> Verifier output:\n%s\n-------\n", bpf_log_buf); + } + + return ret; +} + +static int load_sysctl_prog_file(struct sysctl_test *test) +{ + struct bpf_prog_load_attr attr; + struct bpf_object *obj; + int prog_fd; + + memset(&attr, 0, sizeof(struct bpf_prog_load_attr)); + attr.file = test->prog_file; + attr.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL; + + if (bpf_prog_load_xattr(&attr, &obj, &prog_fd)) { + if (test->result != LOAD_REJECT) + log_err(">>> Loading program (%s) error.\n", + test->prog_file); + return -1; + } + + return prog_fd; +} + +static int load_sysctl_prog(struct sysctl_test *test, const char *sysctl_path) +{ + return test->prog_file + ? load_sysctl_prog_file(test) + : load_sysctl_prog_insns(test, sysctl_path); +} + +static int access_sysctl(const char *sysctl_path, + const struct sysctl_test *test) +{ + int err = 0; + int fd; + + fd = open(sysctl_path, test->open_flags | O_CLOEXEC); + if (fd < 0) + return fd; + + if (test->open_flags == O_RDONLY) { + char buf[128]; + + if (read(fd, buf, sizeof(buf)) == -1) + goto err; + if (test->oldval && + strncmp(buf, test->oldval, strlen(test->oldval))) { + log_err("Read value %s != %s", buf, test->oldval); + goto err; + } + } else if (test->open_flags == O_WRONLY) { + if (!test->newval) { + log_err("New value for sysctl is not set"); + goto err; + } + if (write(fd, test->newval, strlen(test->newval)) == -1) + goto err; + } else { + log_err("Unexpected sysctl access: neither read nor write"); + goto err; + } + + goto out; +err: + err = -1; +out: + close(fd); + return err; +} + +static int run_test_case(int cgfd, struct sysctl_test *test) +{ + enum bpf_attach_type atype = test->attach_type; + char sysctl_path[128]; + int progfd = -1; + int err = 0; + + printf("Test case: %s .. ", test->descr); + + snprintf(sysctl_path, sizeof(sysctl_path), "/proc/sys/%s", + test->sysctl); + + progfd = load_sysctl_prog(test, sysctl_path); + if (progfd < 0) { + if (test->result == LOAD_REJECT) + goto out; + else + goto err; + } + + if (bpf_prog_attach(progfd, cgfd, atype, BPF_F_ALLOW_OVERRIDE) == -1) { + if (test->result == ATTACH_REJECT) + goto out; + else + goto err; + } + + if (access_sysctl(sysctl_path, test) == -1) { + if (test->result == OP_EPERM && errno == EPERM) + goto out; + else + goto err; + } + + if (test->result != SUCCESS) { + log_err("Unexpected failure"); + goto err; + } + + goto out; +err: + err = -1; +out: + /* Detaching w/o checking return code: best effort attempt. */ + if (progfd != -1) + bpf_prog_detach(cgfd, atype); + close(progfd); + printf("[%s]\n", err ? "FAIL" : "PASS"); + return err; +} + +static int run_tests(int cgfd) +{ + int passes = 0; + int fails = 0; + int i; + + for (i = 0; i < ARRAY_SIZE(tests); ++i) { + if (run_test_case(cgfd, &tests[i])) + ++fails; + else + ++passes; + } + printf("Summary: %d PASSED, %d FAILED\n", passes, fails); + return fails ? -1 : 0; +} + +int main(int argc, char **argv) +{ + int cgfd = -1; + int err = 0; + + if (setup_cgroup_environment()) + goto err; + + cgfd = create_and_get_cgroup(CG_PATH); + if (cgfd < 0) + goto err; + + if (join_cgroup(CG_PATH)) + goto err; + + if (run_tests(cgfd)) + goto err; + + goto out; +err: + err = -1; +out: + close(cgfd); + cleanup_cgroup_environment(); + return err; +} diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index e2ebcaddbe78..ed9e894afef3 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -52,7 +52,7 @@ #define MAX_INSNS BPF_MAXINSNS #define MAX_TEST_INSNS 1000000 #define MAX_FIXUPS 8 -#define MAX_NR_MAPS 16 +#define MAX_NR_MAPS 17 #define MAX_TEST_RUNS 8 #define POINTER_VALUE 0xcafe4all #define TEST_DATA_LEN 64 @@ -208,6 +208,76 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self) self->retval = (uint32_t)res; } +/* test the sequence of 1k jumps */ +static void bpf_fill_scale1(struct bpf_test *self) +{ + struct bpf_insn *insn = self->fill_insns; + int i = 0, k = 0; + + insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1); + /* test to check that the sequence of 1024 jumps is acceptable */ + while (k++ < 1024) { + insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_prandom_u32); + insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2); + insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10); + insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, + -8 * (k % 64 + 1)); + } + /* every jump adds 1024 steps to insn_processed, so to stay exactly + * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT + */ + while (i < MAX_TEST_INSNS - 1025) + insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42); + insn[i] = BPF_EXIT_INSN(); + self->prog_len = i + 1; + self->retval = 42; +} + +/* test the sequence of 1k jumps in inner most function (function depth 8)*/ +static void bpf_fill_scale2(struct bpf_test *self) +{ + struct bpf_insn *insn = self->fill_insns; + int i = 0, k = 0; + +#define FUNC_NEST 7 + for (k = 0; k < FUNC_NEST; k++) { + insn[i++] = BPF_CALL_REL(1); + insn[i++] = BPF_EXIT_INSN(); + } + insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1); + /* test to check that the sequence of 1024 jumps is acceptable */ + while (k++ < 1024) { + insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_prandom_u32); + insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2); + insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10); + insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, + -8 * (k % (64 - 4 * FUNC_NEST) + 1)); + } + /* every jump adds 1024 steps to insn_processed, so to stay exactly + * within 1m limit add MAX_TEST_INSNS - 1025 MOVs and 1 EXIT + */ + while (i < MAX_TEST_INSNS - 1025) + insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42); + insn[i] = BPF_EXIT_INSN(); + self->prog_len = i + 1; + self->retval = 42; +} + +static void bpf_fill_scale(struct bpf_test *self) +{ + switch (self->retval) { + case 1: + return bpf_fill_scale1(self); + case 2: + return bpf_fill_scale2(self); + default: + self->prog_len = 0; + break; + } +} + /* BPF_SK_LOOKUP contains 13 instructions, if you need to fix up maps */ #define BPF_SK_LOOKUP(func) \ /* struct bpf_sock_tuple tuple = {} */ \ diff --git a/tools/testing/selftests/bpf/verifier/int_ptr.c b/tools/testing/selftests/bpf/verifier/int_ptr.c new file mode 100644 index 000000000000..ca3b4729df66 --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/int_ptr.c @@ -0,0 +1,160 @@ +{ + "ARG_PTR_TO_LONG uninitialized", + .insns = { + /* bpf_strtoul arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* bpf_strtoul arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* bpf_strtoul arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* bpf_strtoul arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + /* bpf_strtoul() */ + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, + .errstr = "invalid indirect read from stack off -16+0 size 8", +}, +{ + "ARG_PTR_TO_LONG half-uninitialized", + .insns = { + /* bpf_strtoul arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* bpf_strtoul arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* bpf_strtoul arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* bpf_strtoul arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + /* bpf_strtoul() */ + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, + .errstr = "invalid indirect read from stack off -16+4 size 8", +}, +{ + "ARG_PTR_TO_LONG misaligned", + .insns = { + /* bpf_strtoul arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* bpf_strtoul arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* bpf_strtoul arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* bpf_strtoul arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -12), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 4), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + /* bpf_strtoul() */ + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, + .errstr = "misaligned stack access off (0x0; 0x0)+-20+0 size 8", +}, +{ + "ARG_PTR_TO_LONG size < sizeof(long)", + .insns = { + /* bpf_strtoul arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -16), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* bpf_strtoul arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* bpf_strtoul arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* bpf_strtoul arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 12), + BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + /* bpf_strtoul() */ + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, + .errstr = "invalid stack type R4 off=-4 access_size=8", +}, +{ + "ARG_PTR_TO_LONG initialized", + .insns = { + /* bpf_strtoul arg1 (buf) */ + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_MOV64_IMM(BPF_REG_0, 0x00303036), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + + BPF_MOV64_REG(BPF_REG_1, BPF_REG_7), + + /* bpf_strtoul arg2 (buf_len) */ + BPF_MOV64_IMM(BPF_REG_2, 4), + + /* bpf_strtoul arg3 (flags) */ + BPF_MOV64_IMM(BPF_REG_3, 0), + + /* bpf_strtoul arg4 (res) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_7), + + /* bpf_strtoul() */ + BPF_EMIT_CALL(BPF_FUNC_strtoul), + + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL, +}, diff --git a/tools/testing/selftests/bpf/verifier/scale.c b/tools/testing/selftests/bpf/verifier/scale.c new file mode 100644 index 000000000000..7f868d4802e0 --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/scale.c @@ -0,0 +1,18 @@ +{ + "scale: scale test 1", + .insns = { }, + .data = { }, + .fill_helper = bpf_fill_scale, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .result = ACCEPT, + .retval = 1, +}, +{ + "scale: scale test 2", + .insns = { }, + .data = { }, + .fill_helper = bpf_fill_scale, + .prog_type = BPF_PROG_TYPE_SCHED_CLS, + .result = ACCEPT, + .retval = 2, +}, |