kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2023-08-03	bpf, xdp: Add tracepoint to xdp attaching failure	Leon Hwang	2	-1/+21
	When error happens in dev_xdp_attach(), it should have a way to tell users the error message like the netlink approach. To avoid breaking uapi, adding a tracepoint in bpf_xdp_link_attach() is an appropriate way to notify users the error message. Hence, bpf libraries are able to retrieve the error message by this tracepoint, and then report the error message to users. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230801142621.7925-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-03	selftests/bpf: fix static assert compilation issue for test_cls_*.c	Alan Maguire	1	-0/+9
	commit bdeeed3498c7 ("libbpf: fix offsetof() and container_of() to work with CO-RE") ...was backported to stable trees such as 5.15. The problem is that with older LLVM/clang (14/15) - which is often used for older kernels - we see compilation failures in BPF selftests now: In file included from progs/test_cls_redirect_subprogs.c:2: progs/test_cls_redirect.c:90:2: error: static assertion expression is not an integral constant expression sizeof(flow_ports_t) != ^~~~~~~~~~~~~~~~~~~~~~~ progs/test_cls_redirect.c:91:3: note: cast that performs the conversions of a reinterpret_cast is not allowed in a constant expression offsetofend(struct bpf_sock_tuple, ipv4.dport) - ^ progs/test_cls_redirect.c:32:3: note: expanded from macro 'offsetofend' (offsetof(TYPE, MEMBER) + sizeof((((TYPE )0)->MEMBER))) ^ tools/testing/selftests/bpf/tools/include/bpf/bpf_helpers.h:86:33: note: expanded from macro 'offsetof' ^ In file included from progs/test_cls_redirect_subprogs.c:2: progs/test_cls_redirect.c:95:2: error: static assertion expression is not an integral constant expression sizeof(flow_ports_t) != ^~~~~~~~~~~~~~~~~~~~~~~ progs/test_cls_redirect.c:96:3: note: cast that performs the conversions of a reinterpret_cast is not allowed in a constant expression offsetofend(struct bpf_sock_tuple, ipv6.dport) - ^ progs/test_cls_redirect.c:32:3: note: expanded from macro 'offsetofend' (offsetof(TYPE, MEMBER) + sizeof((((TYPE )0)->MEMBER))) ^ tools/testing/selftests/bpf/tools/include/bpf/bpf_helpers.h:86:33: note: expanded from macro 'offsetof' ^ 2 errors generated. make: *** [Makefile:594: tools/testing/selftests/bpf/test_cls_redirect_subprogs.bpf.o] Error 1 The problem is the new offsetof() does not play nice with static asserts. Given that the context is a static assert (and CO-RE relocation is not needed at compile time), offsetof() usage can be replaced by restoring the original offsetof() definition as __builtin_offsetof(). Fixes: bdeeed3498c7 ("libbpf: fix offsetof() and container_of() to work with CO-RE") Reported-by: Colm Harrington <colm.harrington@oracle.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Yipeng Zou <zouyipeng@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230802073906.3197480-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-03	bpf: fix bpf_probe_read_kernel prototype mismatch	Arnd Bergmann	3	-20/+15
	bpf_probe_read_kernel() has a __weak definition in core.c and another definition with an incompatible prototype in kernel/trace/bpf_trace.c, when CONFIG_BPF_EVENTS is enabled. Since the two are incompatible, there cannot be a shared declaration in a header file, but the lack of a prototype causes a W=1 warning: kernel/bpf/core.c:1638:12: error: no previous prototype for 'bpf_probe_read_kernel' [-Werror=missing-prototypes] On 32-bit architectures, the local prototype u64 __weak bpf_probe_read_kernel(void dst, u32 size, const void unsafe_ptr) passes arguments in other registers as the one in bpf_trace.c BPF_CALL_3(bpf_probe_read_kernel, void , dst, u32, size, const void , unsafe_ptr) which uses 64-bit arguments in pairs of registers. As both versions of the function are fairly simple and only really differ in one line, just move them into a header file as an inline function that does not add any overhead for the bpf_trace.c callers and actually avoids a function call for the other one. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ac25cb0f-b804-1649-3afb-1dc6138c2716@iogearbox.net/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230801111449.185301-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-03	riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework	Pu Lehui	1	-71/+82
	Commit 6724a76cff85 ("riscv: ftrace: Reduce the detour code size to half") optimizes the detour code size of kernel functions to half with T0 register and the upcoming DYNAMIC_FTRACE_WITH_DIRECT_CALLS of riscv is based on this optimization, we need to adapt riscv bpf trampoline based on this. One thing to do is to reduce detour code size of bpf programs, and the second is to deal with the return address after the execution of bpf trampoline. Meanwhile, we need to construct the frame of parent function, otherwise we will miss one layer when unwinding. The related tests have passed. Signed-off-by: Pu Lehui <pulehui@huawei.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230721100627.2630326-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-02	libbpf: fix typos in Makefile	Randy Dunlap	1	-2/+2
	Capitalize ABI (acronym) and fix spelling of "destination". Fixes: 706819495921 ("libbpf: Improve usability of libbpf Makefile") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: bpf@vger.kernel.org Cc: Xin Liu <liuxin350@huawei.com> Link: https://lore.kernel.org/r/20230722065236.17010-1-rdunlap@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-01	tracing: bpf: use struct trace_entry in struct syscall_tp_t	Yauheni Kaliuta	1	-4/+8
	bpf tracepoint program uses struct trace_event_raw_sys_enter as argument where trace_entry is the first field. Use the same instead of unsigned long long since if it's amended (for example by RT patch) it accesses data with wrong offset. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230801075222.7717-1-ykaliuta@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-01	Merge branch 'Remove unused fields in cpumap & devmap'	Martin KaFai Lau	2	-5/+0
	Hou Tao says: ==================== Patchset "Simplify xdp_do_redirect_map()/xdp_do_flush_map() and XDP maps" [0] changed per-map flush list to global per-cpu flush list for cpumap, devmap and xskmap, but it forgot to remove these unused fields from cpumap and devmap. So just remove these unused fields. Comments and suggestions are always welcome. [0]: https://lore.kernel.org/bpf/20191219061006.21980-1-bjorn.topel@gmail.com ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-01	bpf, devmap: Remove unused dtab field from bpf_dtab_netdev	Hou Tao	1	-2/+0
	Commit 96360004b862 ("xdp: Make devmap flush_list common for all map instances") removes the use of bpf_dtab_netdev::dtab in bq_enqueue(), so just remove dtab from bpf_dtab_netdev. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230728014942.892272-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-01	bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry	Hou Tao	1	-3/+0
	Since commit cdfafe98cabe ("xdp: Make cpumap flush_list common for all map instances"), cmap is no longer used, so just remove it. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230728014942.892272-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-01	netfilter: bpf: Only define get_proto_defrag_hook() if necessary	Daniel Xu	1	-0/+2
	Before, we were getting this warning: net/netfilter/nf_bpf_link.c:32:1: warning: 'get_proto_defrag_hook' defined but not used [-Wunused-function] Guard the definition with CONFIG_NF_DEFRAG_IPV[4\|6]. Fixes: 91721c2d02d3 ("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307291213.fZ0zDmoG-lkp@intel.com/ Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/b128b6489f0066db32c4772ae4aaee1480495929.1690840454.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-01	bpf: Fix an array-index-out-of-bounds issue in disasm.c	Yonghong Song	1	-1/+2
	syzbot reported an array-index-out-of-bounds when printing out bpf insns. Further investigation shows the insn is illegal but is printed out due to log level 1 or 2 before actual insn verification in do_check(). This particular illegal insn is a MOVSX insn with offset value 2. The legal offset value for MOVSX should be 8, 16 and 32. The disasm sign-extension-size array index is calculated as (insn->off / 8) - 1 and offset value 2 gives an out-of-bound index -1. Tighten the checking for MOVSX insn in disasm.c to avoid array-index-out-of-bounds issue. Reported-by: syzbot+3758842a6c01012aa73b@syzkaller.appspotmail.com Fixes: f835bb622299 ("bpf: Add kernel/bpftool asm support for new instructions") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230731204534.1975311-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31	net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn	Lorenz Bauer	2	-4/+0
	There are already INDIRECT_CALLABLE_DECLARE in the hashtable headers, no need to declare them again. Fixes: 0f495f761722 ("net: remove duplicate reuseport_lookup functions") Suggested-by: Martin Lau <martin.lau@linux.dev> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-30	docs/bpf: Fix malformed documentation	Yonghong Song	1	-21/+24
	Two issues are fixed: 1. Malformed table due to newly-introduced BPF_MOVSX 2. Missing reference link for ``Sign-extension load operations`` Fixes: 245d4c40c09b ("docs/bpf: Add documentation for new instructions") Cc: bpf@ietf.org Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307291840.Cqhj7uox-lkp@intel.com/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230730004251.381307-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	Merge branch 'support-defragmenting-ipv-4-6-packets-in-bpf'	Alexei Starovoitov	14	-26/+718
	Daniel Xu says: ==================== Support defragmenting IPv(4\|6) packets in BPF === Context === In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which are very nice: 1. Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration. 2. Enforce policy on first fragment and drop all subsequent fragments. This does not really work b/c some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses. So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3: Middleboxes [...] should process IP fragments in a manner that is consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes must maintain state in order to achieve this goal. === BPF related bits === Policy has traditionally been enforced from XDP/TC hooks. Both hooks run before kernel reassembly facilities. However, with the new BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing netfilter reassembly infra. The basic idea is we bump a refcnt on the netfilter defrag module and then run the bpf prog after the defrag module runs. This allows bpf progs to transparently see full, reassembled packets. The nice thing about this is that progs don't have to carry around logic to detect fragments. === Changelog === Changes from v5: * Fix defrag disable codepaths Changes from v4: * Refactor module handling code to not sleep in rcu_read_lock() * Also unify the v4 and v6 hook structs so they can share codepaths * Fixed some checkpatch.pl formatting warnings Changes from v3: * Correctly initialize `addrlen` stack var for recvmsg() Changes from v2: * module_put() if ->enable() fails * Fix CI build errors Changes from v1: * Drop bpf_program__attach_netfilter() patches * static -> static const where appropriate * Fix callback assignment order during registration * Only request_module() if callbacks are missing * Fix retval when modprobe fails in userspace * Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6) * Simplify priority checking code * Add warning if module doesn't assign callbacks in the future * Take refcnt on module while defrag link is active [0]: https://datatracker.ietf.org/doc/html/rfc8900 ==================== Link: https://lore.kernel.org/r/cover.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	bpf: selftests: Add defrag selftests	Daniel Xu	5	-2/+536
	These selftests tests 2 major scenarios: the BPF based defragmentation can successfully be done and that packet pointers are invalidated after calls to the kfunc. The logic is similar for both ipv4 and ipv6. In the first scenario, we create a UDP client and UDP echo server. The the server side is fairly straightforward: we attach the prog and simply echo back the message. The on the client side, we send fragmented packets to and expect the reassembled message back from the server. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/33e40fdfddf43be93f2cb259303f132f46750953.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	bpf: selftests: Support custom type and proto for client sockets	Daniel Xu	2	-6/+17
	Extend connect_to_fd_opts() to take optional type and protocol parameters for the client socket. These parameters are useful when opening a raw socket to send IP fragments. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/9067db539efdfd608aa86a2b143c521337c111fc.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	bpf: selftests: Support not connecting client socket	Daniel Xu	2	-2/+4
	For connectionless protocols or raw sockets we do not want to actually connect() to the server. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/525c13d66dac2d640a1db922546842c051c6f2e6.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link	Daniel Xu	3	-15/+118
	This commit adds support for enabling IP defrag using pre-existing netfilter defrag support. Basically all the flag does is bump a refcnt while the link the active. Checks are also added to ensure the prog requesting defrag support is run _after_ netfilter defrag hooks. We also take care to avoid any issues w.r.t. module unloading -- while defrag is active on a link, the module is prevented from unloading. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	netfilter: defrag: Add glue hooks for enabling/disabling defrag	Daniel Xu	4	-1/+43
	We want to be able to enable/disable IP packet defrag from core bpf/netfilter code. In other words, execute code from core that could possibly be built as a module. To help avoid symbol resolution errors, use glue hooks that the modules will register callbacks with during module init. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/f6a8824052441b72afe5285acedbd634bd3384c1.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-29	docs/bpf: Improve documentation for cpu=v4 instructions	Yonghong Song	1	-22/+32
	Improve documentation for cpu=v4 instructions based on David's suggestions. Cc: bpf@ietf.org Suggested-by: David Vernet <void@manifault.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728225105.919595-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Non-atomically allocate freelist during prefill	YiFei Zhu	1	-4/+8
	In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *): Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed. where the errno is ENOMEM. After some troubleshooting and enabling the warnings, we saw: [ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left [ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7 [ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023 [ 91.304721] Call Trace: [ 91.304724] <TASK> [ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88 [ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18 [ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870 [ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20 [ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0 [ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0 [ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650 [ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0 [ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0 [ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30 [ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80 [ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd This makes sense, because in atomic context, percpu allocation would not create new chunks; it would only create in non-atomic contexts. And if during prefill all precpu chunks are full, -ENOMEM would happen immediately upon next unit_alloc. Prefill phase does not actually run in atomic context, so we can use this fact to allocate non-atomically with GFP_KERNEL instead of GFP_NOWAIT. This avoids the immediate -ENOMEM. GFP_NOWAIT has to be used in unit_alloc when bpf program runs in atomic context. Even if bpf program runs in non-atomic context, in most cases, rcu read lock is enabled for the program so GFP_NOWAIT is still needed. This is often also the case for BPF_MAP_UPDATE_ELEM syscalls. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Enable test test_progs-cpuv4 for gcc build kernel	Yonghong Song	1	-2/+3
	Currently, test_progs-cpuv4 is generated with clang build kernel when bpf cpu=v4 is supported by the clang compiler. Let us enable test_progs-cpuv4 for gcc build kernel as well. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728055745.2285202-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Fix compilation warning with -Wparentheses	Yonghong Song	2	-4/+4
	The kernel test robot reported compilation warnings when -Wparentheses is added to KBUILD_CFLAGS with gcc compiler. The following is the error message: .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’: .../bpf-next/kernel/bpf/verifier.c:5901:14: error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses] if (s64_max >= 0 == s64_min >= 0) { ~~~~~~~~^~~~ .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’: .../bpf-next/kernel/bpf/verifier.c:5965:14: error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses] if (s32_min >= 0 == s32_max >= 0) { ~~~~~~~~^~~~ To fix the issue, add proper parentheses for the above '>=' condition to silence the warning/error. I tried a few clang compilers like clang16 and clang18 and they do not emit such warnings with -Wparentheses. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	Merge branch 'bpf-support-new-insns-from-cpu-v4'	Alexei Starovoitov	21	-170/+2251
	Yonghong Song says: ==================== bpf: Support new insns from cpu v4 In previous discussion ([1]), it is agreed that we should introduce cpu version 4 (llvm flag -mcpu=v4) which contains some instructions which can simplify code, make code easier to understand, fix the existing problem, or simply for feature completeness. More specifically, the following new insns are proposed: . sign extended load . sign extended mov . bswap . signed div/mod . ja with 32-bit offset This patch set added kernel support for insns proposed in [1] except BPF_ST which already has full kernel support. Beside the above proposed insns, LLVM will generate BPF_ST insn as well under -mcpu=v4. The llvm patch ([2]) has been merged into llvm-project 'main' branch. The patchset implements interpreter, jit and verifier support for these new insns. For this patch set, I tested cpu v2/v3/v4 and the selftests are all passed. I also tested selftests introduced in this patch set with additional changes beside normal jit testing (bpf_jit_enable = 1 and bpf_jit_harden = 0) - bpf_jit_enable = 0 - bpf_jit_enable = 1 and bpf_jit_harden = 1 and both testing passed. [1] https://lore.kernel.org/bpf/4bfe98be-5333-1c7e-2f6d-42486c8ec039@meta.com/ [2] https://reviews.llvm.org/D144829 Changelogs: v4 -> v5: . for v4, patch 8/17 missed in mailing list and patchwork, so resend. . rebase on top of master v3 -> v4: . some minor asm syntax adjustment based on llvm change. . add clang version and target arch guard for new tests so they can still compile with old llvm compilers. . some changes to the bpf doc. v2 -> v3: . add missed disasm change from v2. . handle signed load of ctx fields properly. . fix some interpreter sdiv/smod error when bpf_jit_enable = 0. . fix some verifier range bounding errors. . add more C tests. RFCv1 -> v2: . add more verifier supports for signed extend load and mov insns. . rename some insn names to be more consistent with intel practice. . add cpuv4 test runner for test progs. . add more unit and C tests. . add documentation. ==================== Link: https://lore.kernel.org/r/20230728011143.3710005-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	docs/bpf: Add documentation for new instructions	Yonghong Song	2	-41/+79
	Add documentation in instruction-set.rst for new instruction encoding and their corresponding operations. Also removed the question related to 'no BPF_SDIV' in bpf_design_QA.rst since we have BPF_SDIV insn now. Cc: bpf@ietf.org Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011342.3724411-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Test ldsx with more complex cases	Yonghong Song	3	-1/+265
	The following ldsx cases are tested: - signed readonly map value - read/write map value - probed memory - not-narrowed ctx field access - narrowed ctx field access. Without previous proper verifier/git handling, the test will fail. If cpuv4 is not supported either by compiler or by jit, the test will be skipped. # ./test_progs -t ldsx_insn #113/1 ldsx_insn/map_val and probed_memory:SKIP #113/2 ldsx_insn/ctx_member_sign_ext:SKIP #113/3 ldsx_insn/ctx_member_narrow_sign_ext:SKIP #113 ldsx_insn:SKIP Summary: 1/0 PASSED, 3 SKIPPED, 0 FAILED Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011336.3723434-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add unit tests for new gotol insn	Yonghong Song	2	-0/+46
	Add unit tests for gotol insn. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011329.3721881-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add unit tests for new sdiv/smod insns	Yonghong Song	2	-0/+783
	Add unit tests for sdiv/smod insns. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011321.3720500-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add unit tests for new bswap insns	Yonghong Song	2	-0/+61
	Add unit tests for bswap insns. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011314.3720109-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add unit tests for new sign-extension mov insns	Yonghong Song	2	-0/+215
	Add unit tests for movsx insns. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011309.3719295-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add unit tests for new sign-extension load insns	Yonghong Song	2	-0/+133
	Add unit tests for new ldsx insns. The test includes sign-extension with a single value or with a value range. If cpuv4 is not supported due to (1) older compiler, e.g., less than clang version 18, or (2) test runner test_progs and test_progs-no_alu32 which tests cpu v2 and v3, or (3) non-x86_64 arch not supporting new insns in jit yet, a dummy program is added with below output: #318/1 verifier_ldsx/cpuv4 is not supported by compiler or jit, use a dummy test:OK #318 verifier_ldsx:OK to indicate the test passed with a dummy test instead of actually testing cpuv4. I am using a dummy prog to avoid changing the verifier testing infrastructure. Once clang 18 is widely available and other architectures support cpuv4, at least for CI run, the dummy program can be removed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011304.3719139-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Add a cpuv4 test runner for cpu=v4 testing	Yonghong Song	2	-4/+26
	Similar to no-alu32 runner, if clang compiler supports -mcpu=v4, a cpuv4 runner is created to test bpf programs compiled with -mcpu=v4. The following are some num-of-insn statistics for each newer instructions based on existing selftests, excluding subsequent cpuv4 insn specific tests. insn pattern # of instructions reg = (s8)reg 4 reg = (s16)reg 4 reg = (s32)reg 144 reg = (s8 )(reg + off) 13 reg = (s16 )(reg + off) 14 reg = (s32 )(reg + off) 15215 reg = bswap16 reg 142 reg = bswap32 reg 38 reg = bswap64 reg 14 reg s/= reg 0 reg s%= reg 0 gotol <offset> 58 Note that in llvm -mcpu=v4 implementation, the compiler is a little bit conservative about generating 'gotol' insn (32-bit branch offset) as it didn't precise count the number of insns (e.g., some insns are debug insns, etc.). Compared to old 'goto' insn, newer 'gotol' insn should have comparable verification states to 'goto' insn. With current patch set, all selftests passed with -mcpu=v4 when running test_progs-cpuv4 binary. The -mcpu=v3 and -mcpu=v2 run are also successful. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011250.3718252-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	selftests/bpf: Fix a test_verifier failure	Yonghong Song	1	-3/+3
	The following test_verifier subtest failed due to new encoding for BSWAP. $ ./test_verifier ... #99/u invalid 64-bit BPF_END FAIL Unexpected success to load! verification time 215 usec stack depth 0 processed 3 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 #99/p invalid 64-bit BPF_END FAIL Unexpected success to load! verification time 198 usec stack depth 0 processed 3 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 Tighten the test so it still reports a failure. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011244.3717464-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Add kernel/bpftool asm support for new instructions	Yonghong Song	1	-6/+51
	Add asm support for new instructions so kernel verifier and bpftool xlated insn dumps can have proper asm syntax for new instructions. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Support new 32bit offset jmp instruction	Yonghong Song	3	-23/+56
	Add interpreter/jit/verifier support for 32bit offset jmp instruction. If a conditional jmp instruction needs more than 16bit offset, it can be simulated with a conditional jmp + a 32bit jmp insn. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Fix jit blinding with new sdiv/smov insns	Yonghong Song	2	-6/+12
	Handle new insns properly in bpf_jit_blind_insn() function. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011225.3715812-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Support new signed div/mod instructions.	Yonghong Song	3	-26/+117
	Add interpreter/jit support for new signed div/mod insns. The new signed div/mod instructions are encoded with unsigned div/mod instructions plus insn->off == 1. Also add basic verifier support to ensure new insns get accepted. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Support new unconditional bswap instruction	Yonghong Song	3	-2/+20
	The existing 'be' and 'le' insns will do conditional bswap depends on host endianness. This patch implements unconditional bswap insns. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Handle sign-extenstin ctx member accesses	Yonghong Song	1	-0/+6
	Currently, if user accesses a ctx member with signed types, the compiler will generate an unsigned load followed by necessary left and right shifts. With the introduction of sign-extension load, compiler may just emit a ldsx insn instead. Let us do a final movsx sign extension to the final unsigned ctx load result to satisfy original sign extension requirement. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Support new sign-extension mov insns	Yonghong Song	3	-31/+197
	Add interpreter/jit support for new sign-extension mov insns. The original 'MOV' insn is extended to support reg-to-reg signed version for both ALU and ALU64 operations. For ALU mode, the insn->off value of 8 or 16 indicates sign-extension from 8- or 16-bit value to 32-bit value. For ALU64 mode, the insn->off value of 8/16/32 indicates sign-extension from 8-, 16- or 32-bit value to 64-bit value. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf: Support new sign-extension load insns	Yonghong Song	6	-27/+181
	Add interpreter/jit support for new sign-extension load insns which adds a new mode (BPF_MEMSX). Also add verifier support to recognize these insns and to do proper verification with new insns. In verifier, besides to deduce proper bounds for the dst_reg, probed memory access is also properly handled. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28	bpf, docs: fix BPF_NEG entry in instruction-set.rst	Jose E. Marchesi	1	-1/+1
	This patch fixes the documentation of the BPF_NEG instruction to denote that it does not use the source register operand. Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com> Acked-by: Dave Thaler <dthaler@microsoft.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230726092543.6362-1-jose.marchesi@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-26	bpf: work around -Wuninitialized warning	Arnd Bergmann	1	-6/+6
	Splitting these out into separate helper functions means that we actually pass an uninitialized variable into another function call if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT is disabled: kernel/bpf/memalloc.c: In function 'add_obj_to_free_list': kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized] 200 \| dec_active(c, flags); Avoid this by passing the flags by reference, so they either get initialized and dereferenced through a pointer, or the pointer never gets accessed at all. Fixes: 18e027b1c7c6d ("bpf: Factor out inc/dec of active flag into helpers.") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-26	selftests/xsk: Fix spelling mistake "querrying" -> "querying"	Colin Ian King	1	-1/+1
	There is a spelling mistake in an error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/r/20230720104815.123146-1-colin.i.king@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-26	Merge branch 'Add SO_REUSEPORT support for TC bpf_sk_assign'	Martin KaFai Lau	13	-178/+661
	Lorenz Bauer says: ==================== We want to replace iptables TPROXY with a BPF program at TC ingress. To make this work in all cases we need to assign a SO_REUSEPORT socket to an skb, which is currently prohibited. This series adds support for such sockets to bpf_sk_assing. I did some refactoring to cut down on the amount of duplicate code. The key to this is to use INDIRECT_CALL in the reuseport helpers. To show that this approach is not just beneficial to TC sk_assign I removed duplicate code for bpf_sk_lookup as well. Joint work with Daniel Borkmann. Signed-off-by: Lorenz Bauer <lmb@isovalent.com> --- Changes in v6: - Reject unhashed UDP sockets in bpf_sk_assign to avoid ref leak - Link to v5: https://lore.kernel.org/r/20230613-so-reuseport-v5-0-f6686a0dbce0@isovalent.com Changes in v5: - Drop reuse_sk == sk check in inet[6]_steal_stock (Kuniyuki) - Link to v4: https://lore.kernel.org/r/20230613-so-reuseport-v4-0-4ece76708bba@isovalent.com Changes in v4: - WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki) - Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki) - Shuffle documentation patch around (Kuniyuki) - Update commit message to explain why IPv6 needs EXPORT_SYMBOL - Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent.com Changes in v3: - Fix warning re udp_ehashfn and udp6_ehashfn (Simon) - Return higher scoring connected UDP reuseport sockets (Kuniyuki) - Fix ipv6 module builds - Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent.com Changes in v2: - Correct commit abbrev length (Kuniyuki) - Reduce duplication (Kuniyuki) - Add checks on sk_state (Martin) - Split exporting inet[6]_lookup_reuseport into separate patch (Eric) --- Daniel Borkmann (1): selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-26	selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper	Daniel Borkmann	3	-0/+344
	We use two programs to check that the new reuseport logic is executed appropriately. The first is a TC clsact program which bpf_sk_assigns the skb to a UDP or TCP socket created by user space. Since the test communicates via lo we see both directions of packets in the eBPF. Traffic ingressing to the reuseport socket is identified by looking at the destination port. For TCP, we additionally need to make sure that we only assign the initial SYN packets towards our listening socket. The network stack then creates a request socket which transitions to ESTABLISHED after the 3WHS. The second is a reuseport program which shares the fact that it has been executed with user space. This tells us that the delayed lookup mechanism is working. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Cc: Joe Stringer <joe@cilium.io> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-8-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25	bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign	Lorenz Bauer	8	-21/+115
	Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT sockets. This means we can't use the helper to steer traffic to Envoy, which configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing TPROXY from our setup. The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead, one of the reuseport sockets is selected by hash. This could cause dispatch to the "wrong" socket: sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup helpers unfortunately. In the tc context, L2 headers are at the start of the skb, while SK_REUSEPORT expects L3 headers instead. Instead, we execute the SK_REUSEPORT program when the assigned socket is pulled out of the skb, further up the stack. This creates some trickiness with regards to refcounting as bpf_sk_assign will put both refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU freed. We can infer that the sk_assigned socket is RCU freed if the reuseport lookup succeeds, but convincing yourself of this fact isn't straight forward. Therefore we defensively check refcounting on the sk_assign sock even though it's probably not required in practice. Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign") Fixes: cf7fbe660f2d ("bpf: Add socket assign support") Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Joe Stringer <joe@cilium.io> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/ Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25	net: remove duplicate sk_lookup helpers	Lorenz Bauer	6	-84/+55
	Now that inet[6]_lookup_reuseport are parameterised on the ehashfn we can remove two sk_lookup helpers. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-6-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25	net: document inet[6]_lookup_reuseport sk_state requirements	Lorenz Bauer	2	-0/+30
	The current implementation was extracted from inet[6]_lhash2_lookup in commit 80b373f74f9e ("inet: Extract helper for selecting socket from reuseport group") and commit 5df6531292b5 ("inet6: Extract helper for selecting socket from reuseport group"). In the original context, sk is always in TCP_LISTEN state and so did not have a separate check. Add documentation that specifies which sk_state are valid to pass to the function. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-5-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25	net: remove duplicate reuseport_lookup functions	Lorenz Bauer	6	-63/+72
	There are currently four copies of reuseport_lookup: one each for (TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of those functions as well. This is already the case for sk_lookup helpers (inet,inet6,udp4,udp6)_lookup_run_bpf. There are two differences between the reuseport_lookup helpers: 1. They call different hash functions depending on protocol 2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED Move the check for sk_state into the caller and use the INDIRECT_CALL infrastructure to cut down the helpers to one per IP version. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>